How to scale a system

Scaling is the discipline of removing bottlenecks one at a time so a system serves more load without falling over. This guide explains the core patterns from first principles, then shows the loop for applying them in Skeema.

Vertical vs horizontal scaling

Vertical scaling (scale up): Give one machine more CPU, RAM, or faster disk. Simple, but the machine has a hardware ceiling and remains a single point of failure.
Horizontal scaling (scale out): Add more machines and spread load across them. It offers far more headroom and built-in redundancy, but it works cleanly only when instances are interchangeable, which in practice means stateless services.

Statelessness is the prerequisite

A service is stateless when any instance can handle any request — no session or data is pinned to a particular box. Push state into a shared store (database, Redis) and you can add or remove instances freely. Almost every scaling pattern below assumes this.
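A minimal sketch of the idea, under stated assumptions: `InMemorySessionStore` is a hypothetical stand-in for a shared store such as Redis or a database, and `AppInstance` and `handle_request` are illustrative names. Neither instance keeps anything in memory between requests, so either one can serve the same session.

```python
from dataclasses import dataclass, field


@dataclass
class InMemorySessionStore:
    """Stand-in for a shared store (Redis, a database). Hypothetical interface."""

    _data: dict = field(default_factory=dict)

    def get(self, session_id: str) -> dict:
        return self._data.get(session_id, {})

    def put(self, session_id: str, session: dict) -> None:
        self._data[session_id] = session


class AppInstance:
    """One stateless instance: it holds no per-user data between requests."""

    def __init__(self, name: str, store: InMemorySessionStore) -> None:
        self.name = name
        self.store = store

    def handle_request(self, session_id: str) -> dict:
        # Load state from the shared store, update it, and write it back.
        session = dict(self.store.get(session_id))
        session["visits"] = session.get("visits", 0) + 1
        session["last_served_by"] = self.name
        self.store.put(session_id, session)
        return session


shared = InMemorySessionStore()
app_1 = AppInstance("app-1", shared)
app_2 = AppInstance("app-2", shared)

app_1.handle_request("user-123")
second = app_2.handle_request("user-123")  # a different instance, same session
print(second)
```

The read-modify-write here is not atomic. Against a real shared store you would likely want an atomic increment or a transaction, which this sketch leaves out.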

The toolbox, in the order you usually reach for it

1. Load balancing

Put a load balancer in front of multiple instances of a service. It distributes requests and stops sending traffic to unhealthy instances. This is the foundation of horizontal scaling and removes the gateway/app tier as a single point of failure.
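To make the two jobs concrete (distribute requests, skip unhealthy instances), here is a small round-robin sketch. The class and method names are assumptions for illustration. In practice you would use a managed or off-the-shelf load balancer that runs its own health checks, rather than writing one.

```python
from itertools import count


class RoundRobinBalancer:
    """Illustrative round-robin picker that skips backends marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._counter = count()

    def mark_unhealthy(self, backend):
        self.healthy.discard(backend)

    def mark_healthy(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        if not self.healthy:
            raise RuntimeError("no healthy backends available")
        # Walk the ring from the next position until a healthy backend turns up.
        # At least one is healthy, so this finishes within one full pass.
        while True:
            candidate = self.backends[next(self._counter) % len(self.backends)]
            if candidate in self.healthy:
                return candidate


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
first_round = [balancer.pick() for _ in range(3)]

balancer.mark_unhealthy("app-2")  # e.g. a failed health check
after_failure = [balancer.pick() for _ in range(4)]
print(first_round, after_failure)
```

Round-robin is only one policy. Least-connections or weighted schemes follow the same shape, with a different rule inside `pick`.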

2. Caching

A cache (e.g. Redis) stores the results of expensive work so repeat reads are served from memory, typically in about a millisecond or less, instead of hitting the database. Caching is the highest-leverage fix for read-heavy systems because it removes load from the hardest-to-scale component: the database.

• Cache-aside: the app checks the cache, and on a miss reads the DB and populates the cache.
• TTL / invalidation: entries expire or are cleared on write to avoid serving stale data.
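The two bullets above can be sketched together. `TTLCache` is a tiny in-process stand-in for a cache such as Redis, and `db`, `load_user`, and `rename_user` are hypothetical names, so treat this as an illustration of the pattern rather than a production cache.

```python
import time


class TTLCache:
    """In-process stand-in for a cache such as Redis (illustrative only)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # expired: behave like a miss
            return None
        return value

    def set(self, key, value):
        self._entries[key] = (self.clock() + self.ttl, value)

    def delete(self, key):
        self._entries.pop(key, None)


db = {"user:1": {"name": "Ada"}}  # hypothetical stand-in for the database
stats = {"db_reads": 0}
cache = TTLCache(ttl_seconds=60)


def load_user(key):
    """Cache-aside read: check the cache, fall back to the DB, then populate."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    stats["db_reads"] += 1
    value = db[key]
    cache.set(key, value)
    return value


def rename_user(key, name):
    """Write path: update the DB, then invalidate the cached entry."""
    db[key] = {"name": name}
    cache.delete(key)


load_user("user:1")
load_user("user:1")  # served from the cache, no second DB read
reads_after_two_loads = stats["db_reads"]
rename_user("user:1", "Grace")
fresh = load_user("user:1")  # miss after invalidation, re-read from the DB
```

Invalidating on write keeps reads fresh, while the TTL bounds how long a stale entry can survive if an invalidation is ever missed.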

3. Database scaling

The database is usually the last and hardest thing to scale. In rough order:

1. Indexing: the cheapest win. An index turns a full-table scan into a lookup. Index your foreign keys and the columns your queries filter on.
2. Read replicas: copies that serve reads, taking pressure off the primary (which still handles writes). Great when reads ≫ writes.
3. Partitioning: split one big table by a key (e.g. date) so queries touch less data.
4. Sharding: split data across multiple databases by a shard key (e.g. user_id). Powerful but complex, since cross-shard queries are hard. Reach for it last.
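Two items from this list are easy to show in a few lines. The sketch below uses SQLite from the Python standard library to show a query plan changing from a scan to an index search, then a hash-based function that maps a user_id to a shard. The `orders` table, the index name, and `shard_for` are assumptions made for the example.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (user_id, total) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

QUERY = "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE user_id = 42"


def plan(connection):
    # The human-readable description is the fourth column of each plan row.
    return " | ".join(row[3] for row in connection.execute(QUERY))


plan_before = plan(conn)  # a SCAN of the whole table
conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")
plan_after = plan(conn)  # a SEARCH using idx_orders_user_id
print(plan_before)
print(plan_after)


def shard_for(user_id: int, shard_count: int) -> int:
    """Map a shard key to a shard number with a stable hash (illustrative)."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


print(shard_for(42, 4))
```

The exact plan wording varies between SQLite versions. Plain modulo routing is the simplest scheme, but changing `shard_count` remaps most keys, which is one reason resharding is painful and sharding is best left for last.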

4. Asynchronous processing

Move work that doesn’t need to finish before responding — sending email, generating thumbnails, AI scoring — onto a queue or event bus (Kafka, SQS). The request returns immediately; workers process the backlog. This decouples services, smooths traffic spikes, and shortens the synchronous path that determines user-facing latency.
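As a rough in-process illustration, the sketch below uses Python's `queue.Queue` and a worker thread in place of a real broker such as Kafka or SQS. `handle_signup` and `send_welcome_email` are hypothetical names. The handler only enqueues the job and returns, and the worker drains the backlog separately.

```python
import queue
import threading

jobs = queue.Queue()
sent = []


def send_welcome_email(address):
    # Stand-in for slow work (SMTP call, thumbnail generation, scoring).
    sent.append(address)


def worker():
    while True:
        address = jobs.get()
        try:
            if address is None:  # sentinel: stop the worker
                return
            send_welcome_email(address)
        finally:
            jobs.task_done()


def handle_signup(address):
    """Request handler: enqueue the slow work and respond immediately."""
    jobs.put(address)
    return {"status": "accepted"}


thread = threading.Thread(target=worker, daemon=True)
thread.start()

response = handle_signup("ada@example.com")
handle_signup("grace@example.com")

jobs.join()  # wait for the backlog to drain (only needed in this demo)
jobs.put(None)
thread.join()
print(response, sent)
```

An in-process queue loses its jobs if the process dies. A durable queue or event bus adds persistence and retries, and lets workers scale independently of the request tier.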

5. CDN & edge

A CDN caches static assets (and increasingly API responses) at edge locations near users, cutting latency and reducing egress from your origin. It’s often one of the cheapest ways to make a global audience feel fast.
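What a CDN may cache is largely driven by the `Cache-Control` header your origin sends. The sketch below picks a header per path. The fingerprinted-filename convention, the `/api/` prefix, and the specific lifetimes are assumptions for illustration, not recommendations for any particular CDN.

```python
import re

# Assumed convention: fingerprinted assets look like "app.3f9a1c2b.js",
# so their content never changes under the same URL.
FINGERPRINTED = re.compile(r"\.[0-9a-f]{8,}\.(js|css|png|jpg|svg|woff2)$")


def cache_control_for(path: str) -> str:
    """Choose a Cache-Control value for a response served from the origin."""
    if FINGERPRINTED.search(path):
        # Safe to cache for a long time: a new build gets a new URL.
        return "public, max-age=31536000, immutable"
    if path.startswith("/api/"):
        # Assumed to be per-user data: keep it out of shared caches.
        return "private, no-store"
    # Everything else (e.g. HTML): cache briefly so deploys show up quickly.
    return "public, max-age=60"


for example in ("/static/app.3f9a1c2b.js", "/api/me", "/index.html"):
    print(example, "->", cache_control_for(example))
```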

Quick reference: symptom → fix

Symptom	Likely fix
Gateway/app tier saturated	Load balancer + horizontal scaling
Database CPU high on reads	Cache + read replicas
Slow joins / queries	Add indexes (especially on FKs)
One huge table	Partition, then consider sharding
Slow user response from background work	Move it to an async queue
Slow static assets / high egress	Put a CDN in front

The scaling loop in Skeema

1. Simulate: run a load simulation at your target user tier. Read the score and the named bottleneck.
2. Diagnose: the bottleneck and its root cause tell you which pattern above applies.
3. Apply the fix: accept Skeema’s suggested improvement (it adds the load balancer, replica, or queue and wires the edges) or edit by hand.
4. Re-simulate: confirm the bottleneck moved and the score improved. Repeat until the design holds at your target scale.

Don’t scale prematurely

Every pattern adds complexity and cost. Add caching, replicas, and sharding when the simulation (or real metrics) shows you need them — not on day one. A single well-built service plus a managed database carries most products surprisingly far.

Key takeaways

✓ Horizontal scaling beats vertical for growth and reliability, but requires stateless services.
✓ Reach for load balancing, then caching, then database scaling, then async, then a CDN.
✓ Read replicas and indexes handle most database pressure; shard only when you must.
✓ In Skeema: simulate → find the bottleneck → apply the fix → re-simulate, and repeat.
