How do you scale individual services and find performance bottlenecks?

Question

Accepted Answer

A key microservices benefit is **scaling each service independently** to match its own load, instead of scaling the whole app. Finding bottlenecks is then a matter of measuring per-service and per-hop.

## Scaling techniques

- **Horizontal scaling** — add stateless instances behind a load balancer.
- **Autoscaling** — scale on CPU, memory, queue depth, or custom metrics.
- **Caching** — cut repeated work and downstream load.
- **Async + queues** — absorb spikes; decouple slow work.
- **Data scaling** — read replicas, sharding, per-service stores.

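```yaml
# Kubernetes HPA (autoscaling/v2): scale the orders Deployment on CPU
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: {name: orders}
spec:
  scaleTargetRef: {apiVersion: apps/v1, kind: Deployment, name: orders}
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource: {name: cpu, target: {type: Utilization, averageUtilization: 70}}  # add pods when avg CPU > 70% of requested CPU
```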
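The Kubernetes documentation describes the autoscaler's core calculation as `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`. Here is a minimal sketch of that arithmetic using the bounds from the config above. The function name is hypothetical, and the sketch ignores the tolerance band, stabilization windows, and pod-readiness handling that the real controller applies.

```python
import math

def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float = 70.0,
                     min_replicas: int = 3, max_replicas: int = 20) -> int:
    """Approximate the HPA scaling rule, clamped to the replica bounds."""
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))

print(desired_replicas(4, 95.0))    # 4 * 95 / 70 = 5.43, rounded up to 6
print(desired_replicas(4, 30.0))    # 1.71 rounds up to 2, then clamped to the minimum of 3
print(desired_replicas(10, 200.0))  # 28.57 rounds up to 29, then clamped to the maximum of 20
```

Because the rule is proportional, a service running at twice its target asks for roughly twice the pods in one step, so `maxReplicas` is what protects downstream dependencies from a sudden surge of new instances.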

## Finding bottlenecks

```text
1. Metrics: which service has high latency / saturation?
   (RED: rate, errors, duration; USE: utilization, saturation, errors)
2. Traces: which SPAN in the request is slow?
3. Drill in: DB query? lock? N+1 calls? GC pause?
```
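Step 1 can be sketched as a small aggregation over request records. This is an illustrative example only: the record shape `(service, duration_ms, is_error)`, the function name, and the sample numbers are assumptions, and in practice these figures usually come from a metrics system rather than hand-rolled code.

```python
import math
from collections import defaultdict

def red_summary(requests, window_seconds):
    """Compute rate, error ratio, and p95 duration per service.

    requests: iterable of (service, duration_ms, is_error) tuples (hypothetical shape).
    """
    by_service = defaultdict(list)
    for service, duration_ms, is_error in requests:
        by_service[service].append((duration_ms, is_error))

    summary = {}
    for service, rows in by_service.items():
        durations = sorted(d for d, _ in rows)
        # Nearest-rank p95: smallest sample with at least 95% of samples at or below it.
        rank = math.ceil(0.95 * len(durations))
        summary[service] = {
            "rate_rps": len(rows) / window_seconds,
            "error_ratio": sum(1 for _, e in rows if e) / len(rows),
            "p95_ms": durations[rank - 1],
        }
    return summary

# Made-up sample data covering a 2-second window.
sample = [
    ("orders", 20, False), ("orders", 25, False), ("orders", 30, False), ("orders", 22, False),
    ("payments", 100, False), ("payments", 900, False), ("payments", 120, False), ("payments", 950, True),
]
summary = red_summary(sample, window_seconds=2)
slowest = max(summary, key=lambda s: summary[s]["p95_ms"])
print(slowest, summary[slowest])
```

Ranking by a tail percentile rather than the mean is a deliberate choice here: a service with a few very slow requests can look healthy on average while still dominating user-visible latency.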

```text
Gateway ──┤ Orders ──┤ Payments ████████████ ← 80% of latency here
                       Inventory ─┤
```
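The trace sketch above can be turned into numbers by computing each span's self time, meaning its duration minus the time spent in its direct children. The following is a rough sketch under stated assumptions: the `Span` shape and the timings are hypothetical, and sibling calls are assumed to run sequentially without overlapping.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    start_ms: float
    end_ms: float

def self_times(spans):
    """Return {service: self time in ms}, assuming sibling spans do not overlap."""
    child_total = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0.0) + (s.end_ms - s.start_ms)
    result = {}
    for s in spans:
        own = (s.end_ms - s.start_ms) - child_total.get(s.span_id, 0.0)
        result[s.service] = result.get(s.service, 0.0) + own
    return result

# Hypothetical trace shaped like the diagram: gateway -> orders -> (payments, inventory).
trace = [
    Span("a", None, "gateway", 0, 1000),
    Span("b", "a", "orders", 50, 980),
    Span("c", "b", "payments", 100, 900),
    Span("d", "b", "inventory", 900, 960),
]
times = self_times(trace)
total = sum(times.values())
for service, ms in sorted(times.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{service:10s} {ms:6.0f} ms  {ms / total:.0%}")
```

Self time matters because the raw durations of `gateway` and `orders` include everything beneath them; without subtracting children, the outermost span always looks like the slowest one.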

## Common bottlenecks

```text
⚠️ Chatty synchronous calls (fan-out per request)
⚠️ Shared/overloaded database
⚠️ Missing or cold cache
⚠️ Unbounded retries amplifying load
```
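For the last item, one common mitigation is to cap the number of attempts and spread retries out with exponential backoff and jitter. A minimal sketch follows; the function name, defaults, and the `flaky` example are hypothetical, and real code would normally retry only on errors known to be transient.

```python
import random
import time

def call_with_retries(operation, max_attempts=3, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call operation(); retry on exception, up to max_attempts attempts in total."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:  # sketch only: narrow this to retryable errors in real code
            if attempt == max_attempts:
                raise  # budget exhausted: surface the failure instead of retrying forever
            # Full jitter: wait a random time up to the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))

# Usage: a dependency that fails twice, then succeeds.
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("temporarily unavailable")
    return "ok"

print(call_with_retries(flaky, base_delay=0.01), "after", len(attempts), "attempts")
```

The hard attempt limit bounds the worst-case load multiplier on a struggling dependency, and the jitter keeps many clients from retrying in lockstep.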

## Pitfall

Scaling a service whose bottleneck is a **shared database** just moves more load onto the DB — scale the actual constraint, not the symptom.
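A toy capacity model makes the pitfall concrete. All numbers and names below are made up for illustration, and the model assumes throughput is simply the minimum of what the service tier and the database can each sustain.

```python
def system_throughput(instances, rps_per_instance, db_queries_per_request, db_max_qps):
    """Requests/second the system can sustain: the smaller of service and DB capacity."""
    service_capacity = instances * rps_per_instance
    db_capacity = db_max_qps / db_queries_per_request
    return min(service_capacity, db_capacity)

# Hypothetical figures: 100 rps per instance, 5 queries per request, DB tops out at 2000 qps.
for n in (2, 4, 8, 16):
    print(n, "instances ->", system_throughput(n, 100, 5, 2000), "rps")
```

In this model, throughput stops growing at 4 instances because the database caps it at 400 requests per second. Past that point, extra instances add cost and open more connections against the database without serving more traffic.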

## Why it matters

Independent scaling lets you spend capacity precisely where the load is, which is usually cheaper than scaling an entire monolith to relieve one hot path.

But scaling blindly wastes money and can worsen things; measuring per-service metrics and per-hop traces is what tells you the real constraint to fix.