When StyleKit launched, it had around 30 styles. Today it has over 130. Growing a style library more than fourfold sounds like it should be straightforward -- just add more styles, right? In practice, growth exposed problems we did not anticipate at 30: naming collisions, token inconsistencies, quality variance, discovery friction, and the growing difficulty of ensuring that every style actually works well with AI coding tools. This post shares what we learned and the systems we built to keep the library useful as it grew.

The Quality Problem

At 30 styles, quality was easy to maintain because every style was hand-crafted. At 80 styles, we started noticing inconsistencies. Some styles had detailed shadow tokens with small, medium, and large variants. Others had a single shadow value. Some defined hover transitions with specific easing curves. Others just said "ease-in-out." The result was that AI-generated output from well-defined styles looked polished, while output from loosely-defined styles looked generic.

We solved this with a quality scoring system. Every style is evaluated across six dimensions: token completeness (are all six token categories defined?), do/don't list depth (at least 5 items each), recipe coverage (does it have component recipes for card, button, and section?), interaction specificity (are hover, active, and focus states defined with exact values?), visual distinctiveness (does it look meaningfully different from similar styles?), and description quality (is the style philosophy explained clearly enough for an AI to understand the intent?).

Each dimension is graded A through D. A style needs at least a B average to ship. When we introduced the scorer in v0.10.0, we retroactively evaluated all existing styles and found that about 40% were below B. We spent a full release cycle upgrading every style to meet the threshold. The improvement in AI output quality was immediate -- prompts generated from B-grade styles produced noticeably more on-target results than those from C-grade styles.
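As a rough illustration of the "B average" rule, here is a minimal sketch in TypeScript. The six dimensions, the A-to-D scale, and the B threshold come from the description above; the identifier names, the 4/3/2/1 point mapping, and equal weighting of the dimensions are assumptions made for the example, not the actual scorer.

```typescript
// Hypothetical names; only the dimensions, the A-D scale, and the
// "at least a B average" rule are taken from the post.
type Grade = "A" | "B" | "C" | "D";

const DIMENSIONS = [
  "tokenCompleteness",
  "doDontDepth",
  "recipeCoverage",
  "interactionSpecificity",
  "visualDistinctiveness",
  "descriptionQuality",
] as const;

type Dimension = (typeof DIMENSIONS)[number];
type Scorecard = Record<Dimension, Grade>;

// Assumed mapping from letter grades to points.
const GRADE_POINTS: Record<Grade, number> = { A: 4, B: 3, C: 2, D: 1 };

// Unweighted mean of the six dimension grades (equal weighting is assumed).
function averagePoints(card: Scorecard): number {
  let total = 0;
  for (const dimension of DIMENSIONS) {
    total += GRADE_POINTS[card[dimension]];
  }
  return total / DIMENSIONS.length;
}

// A style ships only if its average is at least a B.
function canShip(card: Scorecard): boolean {
  return averagePoints(card) >= GRADE_POINTS.B;
}

const polished: Scorecard = {
  tokenCompleteness: "A",
  doDontDepth: "A",
  recipeCoverage: "B",
  interactionSpecificity: "B",
  visualDistinctiveness: "B",
  descriptionQuality: "C",
};

console.log(canShip(polished)); // true: 19 points over 6 dimensions is above 3
```

Under this mapping a single C does not block a style as long as stronger dimensions pull the average back up to B; a stricter scorer could additionally require a minimum grade per dimension.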

The Discovery Problem

At 30 styles, a simple grid works. Scroll through, find something you like, done. At 130 styles, a grid becomes overwhelming. Users would scroll through dozens of cards, develop decision fatigue, and either pick whatever caught their eye first or give up and use a style they already knew.

We addressed this in three ways. First, scenario-based discovery: instead of browsing by name, users can describe what they are building (SaaS landing page, portfolio, e-commerce store) and get a filtered set of styles that match their use case. This mapping is maintained in a scenario database that links page types to compatible style characteristics.
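To make the idea of a scenario mapping concrete, here is a small sketch. Everything in it is hypothetical: the scenario keys, the characteristic tags, and the tags attached to the two example styles are invented for illustration and do not reflect the real scenario database.

```typescript
// Hypothetical shape for a style as seen by the discovery layer.
interface StyleSummary {
  name: string;
  characteristics: string[];
}

// Assumed mapping from page type to compatible style characteristics.
const SCENARIO_CHARACTERISTICS: Record<string, string[]> = {
  "saas-landing": ["clean", "trustworthy"],
  "portfolio": ["expressive", "bold"],
  "e-commerce": ["clean", "high-contrast"],
};

// Return the styles that share at least one characteristic with the scenario.
function stylesForScenario(scenario: string, styles: StyleSummary[]): StyleSummary[] {
  const wanted: string[] | undefined = SCENARIO_CHARACTERISTICS[scenario];
  if (wanted === undefined) {
    return []; // unknown scenario: nothing to recommend
  }
  return styles.filter((style) =>
    wanted.some((c) => style.characteristics.indexOf(c) !== -1)
  );
}

// Tags below are made up for the example.
const catalog: StyleSummary[] = [
  { name: "Neo-Brutalist", characteristics: ["bold", "high-contrast"] },
  { name: "Glassmorphism", characteristics: ["clean", "translucent"] },
];

console.log(stylesForScenario("portfolio", catalog).map((s) => s.name)); // ["Neo-Brutalist"]
```

Matching on "any shared characteristic" keeps the filter permissive; a real mapping might rank results by how many characteristics overlap instead.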

Second, style combination recommendations: some users want to mix styles -- one for the main app and another for marketing pages, or a primary style with accent elements from a complementary style. The recommendation system analyzes token compatibility to suggest pairings that work well together without clashing.

Third, component patterns gallery: instead of starting from the style level, users can start from a component (pricing table, hero section, feature grid) and see how it renders in different styles. This makes the decision concrete -- you see exactly what your pricing page will look like in Neo-Brutalist vs Glassmorphism, rather than trying to imagine it from a style card preview.

The Consistency Problem

Every style in StyleKit follows the same token structure: border, shadow, typography, spacing, colors, and interaction. This structure is enforced at the type level -- TypeScript will not compile a style definition that is missing a required token category. But structural consistency is not the same as semantic consistency.
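A sketch of what type-level enforcement of the six categories might look like follows. The category names are the ones listed above; the interface names, the string-to-string shape of each category, and all example values are assumptions for illustration only.

```typescript
// Assumed shape: each category is a bag of named string values.
type TokenGroup = Record<string, string>;

// All six categories are required properties, so omitting one is a type error.
interface StyleTokens {
  border: TokenGroup;
  shadow: TokenGroup;
  typography: TokenGroup;
  spacing: TokenGroup;
  colors: TokenGroup;
  interaction: TokenGroup;
}

interface StyleDefinition {
  name: string;
  tokens: StyleTokens;
}

// Illustrative values only.
const exampleStyle: StyleDefinition = {
  name: "example",
  tokens: {
    border: { width: "2px" },
    shadow: { medium: "0 4px 8px rgba(0, 0, 0, 0.15)" },
    typography: { headingLarge: "3rem" },
    spacing: { section: "4rem" },
    colors: { primary: "#1a1a1a" },
    interaction: { transitionFast: "200ms ease-out" },
  },
};

// Deleting any category above (for example `interaction`) makes this
// assignment fail to compile with a "property is missing" error.
console.log(Object.keys(exampleStyle.tokens).length); // 6
```

Note what this does and does not guarantee: the compiler confirms that a `shadow` category exists, but it has no opinion on whether the values inside it are sensible, which is exactly the gap between structural and semantic consistency.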

For example, we discovered that different style authors interpreted "medium shadow" differently. In some styles, the medium shadow was a subtle 4px blur. In others, it was a prominent 16px spread. When two styles use the same token name for shadows of wildly different intensity, AI tools that interpolate between styles produce jarring results.

We solved this by establishing reference ranges for each token. A "medium shadow" should produce between 6px and 12px of visual depth. A "large" heading should be between 2.5rem and 4rem. A "fast" transition should be between 150ms and 250ms. These ranges are not enforced in code -- they are documented in the style addition checklist and verified during review. Authors can push the boundaries when a style demands it (Neo-Brutalist shadows are deliberately outside the normal range), but they need to justify it.
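Although the ranges live in a checklist rather than in code, a reviewer could use a small helper like the hypothetical one below to flag values worth a second look. The three ranges are the ones quoted above; the token keys, the idea of passing values as plain numbers in the range's unit, and the justification map are assumptions for this sketch.

```typescript
interface ReferenceRange {
  min: number;
  max: number;
  unit: "px" | "rem" | "ms";
}

// Ranges quoted in the post; the token keys are hypothetical.
const REFERENCE_RANGES: Record<string, ReferenceRange> = {
  "shadow.medium": { min: 6, max: 12, unit: "px" },
  "typography.headingLarge": { min: 2.5, max: 4, unit: "rem" },
  "interaction.transitionFast": { min: 150, max: 250, unit: "ms" },
};

interface RangeFinding {
  token: string;
  value: number;
  range: ReferenceRange;
  justified: boolean; // true when the author supplied a reason
}

// Report values outside their reference range. Nothing is rejected: out-of-range
// values are allowed, they just need a justification for the reviewer to read.
function findOutOfRange(
  values: Record<string, number>,
  justifications: Record<string, string> = {}
): RangeFinding[] {
  const findings: RangeFinding[] = [];
  for (const token of Object.keys(values)) {
    const range: ReferenceRange | undefined = REFERENCE_RANGES[token];
    if (range === undefined) {
      continue; // no reference range documented for this token
    }
    const value = values[token];
    if (value < range.min || value > range.max) {
      findings.push({ token, value, range, justified: token in justifications });
    }
  }
  return findings;
}

const findings = findOutOfRange(
  { "shadow.medium": 16, "interaction.transitionFast": 200 },
  { "shadow.medium": "Deliberately heavy shadow for this style" }
);
console.log(findings.map((f) => f.token)); // ["shadow.medium"]
```

Returning findings instead of throwing mirrors the review process described above: the helper surfaces exceptions, and a human decides whether the justification holds.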

The Internationalization Challenge

StyleKit serves users in English and Chinese, and each style has localized metadata: name, description, do/don't lists, and AI rules. At 30 styles, maintaining two language versions was manageable. At 130, it became a significant burden. A single update to a style's English do-list meant a corresponding Chinese update, and these frequently fell out of sync.

The solution was to add dedicated English fields (aiRulesEn, doListEn, dontListEn, keywordsEn) as optional overrides alongside the Chinese defaults. The prompt export system checks the user's locale and selects the appropriate version automatically. If an English field is not defined, the export falls back to the Chinese version. This approach lets us translate styles progressively without blocking any feature on incomplete translations.
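The fallback logic can be sketched in a few lines. The four `...En` field names are taken from the text above; the names of the Chinese default fields (`aiRules`, `doList`, `dontList`, `keywords`), the `Locale` type, and the `localize` function are assumptions for illustration, not the actual export code.

```typescript
type Locale = "en" | "zh";

interface StyleMetadata {
  // Chinese defaults (field names assumed).
  aiRules: string;
  doList: string[];
  dontList: string[];
  keywords: string[];
  // Optional English overrides (field names from the post).
  aiRulesEn?: string;
  doListEn?: string[];
  dontListEn?: string[];
  keywordsEn?: string[];
}

interface LocalizedMetadata {
  aiRules: string;
  doList: string[];
  dontList: string[];
  keywords: string[];
}

// Pick the English override per field when the locale is "en" and the override
// exists; otherwise fall back to the Chinese default for that field.
function localize(meta: StyleMetadata, locale: Locale): LocalizedMetadata {
  const en = locale === "en";
  return {
    aiRules: (en ? meta.aiRulesEn : undefined) ?? meta.aiRules,
    doList: (en ? meta.doListEn : undefined) ?? meta.doList,
    dontList: (en ? meta.dontListEn : undefined) ?? meta.dontList,
    keywords: (en ? meta.keywordsEn : undefined) ?? meta.keywords,
  };
}

// A partially translated style: only the do-list has an English version.
const partiallyTranslated: StyleMetadata = {
  aiRules: "使用粗边框",
  doList: ["使用高对比度"],
  dontList: ["避免渐变"],
  keywords: ["粗野主义"],
  doListEn: ["Use high contrast"],
};

const english = localize(partiallyTranslated, "en");
console.log(english.doList);   // ["Use high contrast"]
console.log(english.dontList); // ["避免渐变"] (Chinese fallback)
```

Resolving the fallback per field rather than per style is what allows partial translations to ship: a style with only its do-list translated still exports, mixing languages until the remaining fields are filled in.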

What We Would Do Differently

If we were starting over, three things would change. First, we would introduce the quality scorer from day one rather than retrofitting it at 80 styles. Early permissiveness creates technical debt that compounds. Second, we would standardize token value ranges before the first external contribution rather than after discovering inconsistencies. Third, we would build the scenario discovery system earlier -- it turned out to be more important for usability than any individual style improvement.

The meta-lesson is that a design system library is not just a collection of definitions. It is a product with its own UX problems: search, discovery, quality assurance, and consistency. The styles are the content; the systems around them determine whether that content is actually usable. At 130+ styles, the systems matter more than any individual style.

You can explore the full library at stylekit.top and see the quality, discovery, and combination features in action.

Scaling a Design System to 130+ Styles: What We Learned
