How to scale a system
Scaling is the discipline of removing bottlenecks one at a time so a system serves more load without falling over. This guide explains the core patterns from first principles, then shows the loop for applying them in Skeema.
Vertical vs horizontal scaling
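- Vertical scaling (scale up): give one machine more CPU, RAM, or faster disk. Simple, but there’s a hardware ceiling and the machine remains a single point of failure.
- Horizontal scaling (scale out): add more machines and spread load across them. Capacity grows nearly without bound and redundancy comes built in, but it works cleanly only when your services are stateless, meaning any instance can handle any request because session and other shared state lives in an external store.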
The toolbox, in the order you usually reach for it
1. Load balancing
Put a load balancer in front of multiple instances of a service. It distributes requests and stops sending traffic to instances that fail health checks. This is the foundation of horizontal scaling: no single gateway or app instance is a single point of failure any longer, provided the load balancer itself runs redundantly.
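The distribution and health-check behaviour can be sketched in a few lines. This is a minimal illustration, not how any particular load balancer is implemented; the class name `RoundRobinBalancer`, its methods, and the instance names `app-1` to `app-3` are all made up for the example.

```python
import itertools


class RoundRobinBalancer:
    """Rotate requests across instances, skipping any marked unhealthy."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.healthy = set(self.instances)
        self._cycle = itertools.cycle(self.instances)

    def mark_unhealthy(self, instance):
        # A real balancer would call this when a health check fails.
        self.healthy.discard(instance)

    def mark_healthy(self, instance):
        self.healthy.add(instance)

    def next_instance(self):
        # Look at each instance at most once per call so we never spin forever.
        for _ in range(len(self.instances)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy instances available")


lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
lb.mark_unhealthy("app-2")
picks = [lb.next_instance() for _ in range(4)]
print(picks)  # app-2 never appears while it is marked unhealthy
```

Round-robin is only one possible policy; least-connections or weighted schemes follow the same shape, with a different rule inside `next_instance`.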
2. Caching
A cache (e.g. Redis) stores the results of expensive work so repeat reads come from memory, typically in well under a millisecond plus network time, instead of hitting the database. Caching is usually the highest-leverage fix for read-heavy systems because it removes load from the hardest-to-scale component: the database.
- Cache-aside: the app checks the cache, and on a miss reads the DB and populates the cache.
- TTL / invalidation: entries expire or are cleared on write to avoid serving stale data.
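Both ideas fit in one small sketch. The `TTLCache` class below is an in-memory stand-in for a cache such as Redis, and `db`, `get_user`, and `update_user` are hypothetical names chosen for illustration; a real deployment would talk to a cache server and a database instead.

```python
import time


class TTLCache:
    """Tiny in-memory stand-in for a cache such as Redis."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired entries count as a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)

    def delete(self, key):
        self._store.pop(key, None)


db = {"user:1": {"name": "Ada"}}  # hypothetical database
stats = {"db_reads": 0}
cache = TTLCache(ttl_seconds=60)


def get_user(key):
    """Cache-aside read: check the cache, fall back to the DB, then populate."""
    value = cache.get(key)
    if value is None:
        stats["db_reads"] += 1
        value = db.get(key)
        if value is not None:
            cache.set(key, value)
    return value


def update_user(key, value):
    """Write to the DB, then invalidate so the next read is not stale."""
    db[key] = value
    cache.delete(key)


first = get_user("user:1")   # miss: reads the DB and fills the cache
second = get_user("user:1")  # hit: no DB read
update_user("user:1", {"name": "Grace"})
third = get_user("user:1")   # miss again because the write cleared the entry
print(stats["db_reads"], third)
```

Deleting on write is one invalidation choice; overwriting the cached value on write is another, at the cost of more care around concurrent writers.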
3. Database scaling
The database is usually the last and hardest thing to scale. In rough order:
- 1. Read replicas: copies that serve reads, taking pressure off the primary (which still handles writes). Great when reads ≫ writes.
- 2. Indexing: often the cheapest win, so check it before adding hardware. An index turns a full-table scan into a lookup. Index your foreign keys and the columns your queries filter on.
- 3. Partitioning: split one big table by a key (e.g. date) so queries touch less data.
- 4. Sharding: split data across multiple databases by a shard key (e.g. user_id). Powerful but complex, because cross-shard queries are hard. Reach for it last.
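The effect of an index is easy to see with SQLite's `EXPLAIN QUERY PLAN`, which ships with Python's standard library. The sketch below uses a made-up `orders` table and index name; other databases expose the same information through their own `EXPLAIN` variants, and the exact plan wording differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (user_id, total) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)


def query_plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable description.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql)
    return " ".join(row[-1] for row in rows)


sql = "SELECT total FROM orders WHERE user_id = 42"

plan_before = query_plan(sql)  # no index on user_id yet: a full-table scan
conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")
plan_after = query_plan(sql)   # the planner can now search via the index

print("before:", plan_before)
print("after: ", plan_after)
```

The query text is unchanged between the two runs; only the plan changes, which is why indexing is usually tried before replicas, partitioning, or sharding.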
4. Asynchronous processing
Move work that doesn’t need to finish before responding — sending email, generating thumbnails, AI scoring — onto a queue or event bus (Kafka, SQS). The request returns immediately; workers process the backlog. This decouples services, smooths traffic spikes, and shortens the synchronous path that determines user-facing latency.
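A minimal in-process sketch of the pattern, using Python's standard `queue` and `threading` modules as a stand-in for a real broker such as Kafka or SQS. The names `handle_signup` and `send_email` are hypothetical, and a production queue would also need persistence and retries, which are omitted here.

```python
import queue
import threading

jobs = queue.Queue()
sent = []  # records completed background work


def send_email(address):
    # Placeholder for slow work (SMTP call, thumbnail render, scoring).
    sent.append(address)


def worker():
    while True:
        address = jobs.get()
        if address is None:  # sentinel value: shut the worker down
            jobs.task_done()
            break
        try:
            send_email(address)
        finally:
            jobs.task_done()


def handle_signup(address):
    # The request path only enqueues; it does not wait for the email.
    jobs.put(address)
    return {"status": "accepted"}


thread = threading.Thread(target=worker, daemon=True)
thread.start()

responses = [handle_signup(f"user{i}@example.com") for i in range(3)]

jobs.join()     # demo only: wait for the backlog to drain
jobs.put(None)  # stop the worker
thread.join()
print(responses, sent)
```

Because the handler returns as soon as the job is enqueued, a burst of signups lengthens the backlog rather than the response time.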
5. CDN & edge
A CDN caches static assets (and increasingly API responses) at edge locations near users, cutting latency and slashing egress from your origin. It’s often the cheapest way to make a global audience feel fast.
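CDNs generally decide what to cache, and for how long, from the `Cache-Control` header the origin sends. The sketch below shows one plausible policy; the path patterns, the 60-second default, and the assumption that build output embeds a content hash in file names (e.g. `app.3f9a1c2b.js`) are illustrative choices, not rules from any specific CDN.

```python
import re

# Assumption: fingerprinted assets carry a hex content hash before the extension.
FINGERPRINTED = re.compile(r"\.[0-9a-f]{8,}\.(js|css|png|jpg|svg|woff2)$")


def cache_headers(path):
    """Pick a Cache-Control header for a response based on its path."""
    if FINGERPRINTED.search(path):
        # The file name changes whenever the content does, so cache for a year.
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    if path.startswith("/api/"):
        # Keep per-user or fast-changing responses out of shared caches.
        return {"Cache-Control": "no-store"}
    # Everything else (e.g. HTML pages): cache briefly so deploys show up quickly.
    return {"Cache-Control": "public, max-age=60"}


for path in ["/static/app.3f9a1c2b.js", "/api/users/1", "/index.html"]:
    print(path, cache_headers(path))
```

If some API responses are safe to share between users, they can get a short `max-age` of their own instead of `no-store`.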
Quick reference: symptom → fix
| Symptom | Likely fix |
|---|---|
| Gateway/app tier saturated | Load balancer + horizontal scaling |
| Database CPU high on reads | Cache + read replicas |
| Slow joins / queries | Add indexes (especially on FKs) |
| One huge table | Partition, then consider sharding |
| Slow user response from background work | Move it to an async queue |
| Slow static assets / high egress | Put a CDN in front |
The scaling loop in Skeema
- 1. Simulate: run a load simulation at your target user tier. Read the score and the named bottleneck.
- 2. Diagnose: the bottleneck and its root cause tell you which pattern above applies.
- 3. Apply the fix: accept Skeema’s suggested improvement (it adds the load balancer, replica, or queue and wires the edges) or edit by hand.
- 4. Re-simulate: confirm the bottleneck moved and the score improved. Repeat until the design holds at your target scale.
Key takeaways
- ✓ Horizontal scaling beats vertical for growth and reliability, but requires stateless services.
- ✓ Reach for load balancing, then caching, then database scaling, then async, then a CDN.
- ✓ Read replicas and indexes handle most database pressure; shard only when you must.
- ✓ In Skeema: simulate → find the bottleneck → apply the fix → re-simulate, and repeat.