How do you detect problems before users complain?

Question

Accepted Answer

The goal is to catch degradation while there is still slack to absorb it, before it turns into a page or a user complaint. That means watching **leading indicators**, defining **SLOs with error budgets**, and actively probing the system rather than waiting for it to fail.

## SLOs and error budgets

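An **SLO** turns reliability into a number (e.g. 99.9% of requests succeed). The remaining 0.1% is your **error budget**. **Burn rate** is how fast you are spending that budget relative to the pace that would use it up exactly at the end of the SLO window: a burn rate of 1 lasts the whole window, while a burn rate of 15 empties a 30-day budget in 2 days. Alerting on burn rate catches overspending long before you've actually breached the SLO and users notice.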

```text
SLO 99.9% → 0.1% error budget/month (equivalent to ~43 min of full downtime in 30 days)
burn rate rising fast → you'll exhaust it in 2 days → alert NOW, while it's fixable
```
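The arithmetic behind that alert can be sketched in a few lines. This is a minimal illustration, not a specific monitoring tool's API: the function name `burn_rate`, the `BurnRate` type, and the 30-day window default are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class BurnRate:
    rate: float                 # 1.0 means spending exactly on budget
    hours_to_exhaustion: float  # time for a full budget to run out at this rate


def burn_rate(errors: int, total: int, slo: float = 0.999,
              window_days: float = 30.0) -> BurnRate:
    """Compare the observed error ratio to the error budget (hypothetical helper)."""
    if total <= 0:
        raise ValueError("need at least one request to compute a ratio")
    budget = 1.0 - slo           # e.g. 0.001 for a 99.9% SLO
    observed = errors / total    # error ratio in the measurement window
    rate = observed / budget
    # A full budget lasts window_days at rate 1.0, proportionally less above that.
    hours = float("inf") if rate == 0 else window_days * 24 / rate
    return BurnRate(rate=rate, hours_to_exhaustion=hours)


# 1.5% of recent requests failing against a 0.1% budget
result = burn_rate(errors=15, total=1000)
print(f"burn rate {result.rate:.1f}x, budget gone in {result.hours_to_exhaustion:.0f}h")
```

With 15 errors in 1000 requests the burn rate comes out at about 15, which is the "exhaust it in 2 days" case above. The threshold at which to page is a policy choice and is left out of the sketch.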

## Active probing, not just passive metrics

```text
SYNTHETIC MONITORING  scripted checks hit critical paths on a schedule
                      (login, checkout) → catches breakage even at 3am with zero real traffic
HEALTH CHECKS         /healthz endpoints + dependency checks → load balancer
                      pulls bad instances before users hit them
RUM (real-user mon.)  measure latency/errors from actual browsers/devices →
                      catches issues only some users/regions see
```
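As a rough sketch of the health-check row, the logic behind a `/healthz` endpoint can be reduced to "run each dependency check, report 200 only if all pass". The function name `health_status` and the check names are hypothetical, and wiring it into a web framework is left out.

```python
from typing import Callable, Dict, Tuple


def health_status(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check and return (HTTP status, per-check result)."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            # A check that raises counts as a failure, not a crash of the endpoint.
            ok = False
        results[name] = "ok" if ok else "fail"
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results


# Hypothetical checks; real ones would ping the database, cache, and so on.
status, detail = health_status({
    "database": lambda: True,
    "cache": lambda: False,
})
print(status, detail)
```

Whether a failing shared dependency should mark the instance unhealthy is a judgment call: if every instance depends on the same database, failing them all at once can remove the whole pool.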

Synthetic monitoring is powerful because it doesn't wait for a user — it continuously exercises the system, so a broken checkout is found at 3am, not when the morning rush complains.
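A single synthetic check might look like the sketch below, which only uses the standard library. The URL, the expected text, the latency limit, and the names `probe`, `http_fetch`, and `ProbeResult` are all assumptions for illustration. A real setup would run this from a scheduler and send failures to an alerting system.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class ProbeResult:
    ok: bool
    status: int       # 0 when no HTTP response was received
    latency_s: float
    reason: str


def http_fetch(url: str, timeout_s: float) -> Tuple[int, str]:
    """Fetch a URL and return (status code, body text)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # Non-2xx responses arrive as exceptions; report them as a status code.
        return err.code, ""


def probe(url: str, expect_text: str,
          fetch: Callable[[str, float], Tuple[int, str]] = http_fetch,
          timeout_s: float = 5.0, max_latency_s: float = 2.0) -> ProbeResult:
    """Exercise one critical path and judge status, content, and latency."""
    start = time.monotonic()
    try:
        status, body = fetch(url, timeout_s)
    except Exception as exc:  # DNS failure, refused connection, timeout, ...
        return ProbeResult(False, 0, time.monotonic() - start, f"request failed: {exc}")
    latency = time.monotonic() - start
    if status != 200:
        return ProbeResult(False, status, latency, f"unexpected status {status}")
    if expect_text not in body:
        return ProbeResult(False, status, latency, "expected content missing")
    if latency > max_latency_s:
        return ProbeResult(False, status, latency, "too slow")
    return ProbeResult(True, status, latency, "ok")


# Demo with a fake fetcher so the example runs without network access.
fake_checkout = lambda url, timeout_s: (200, "<h1>Order summary</h1>")
print(probe("https://shop.example/checkout", "Order summary", fetch=fake_checkout))
```

Checking the response body and latency, not only the status code, matters here: a checkout page that returns 200 with an error banner is still broken from the user's point of view.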

## Leading indicators and trends

The earliest signs show up in resource metrics before they show up as user-facing errors. Alert on the **trend**, not just a static threshold.

```text
LEADING INDICATORS   saturation (CPU/mem climbing), queue depth growing,
                     connection-pool nearing limit, latency CREEPING up
ANOMALY DETECTION    flag deviation from the normal baseline / seasonality
TREND ALERTS         "disk will fill in 4h at this rate" → act before it's full
```

A slowly rising p99 or a growing queue is a warning shot: by acting on the creep, you prevent the outage that the creep was heading toward.
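One simple way to turn a creep into a "time until limit" estimate is a least-squares line through recent samples. The sketch below assumes roughly linear growth, which real metrics often violate, and the name `hours_until_limit` and the sample values are made up for the example.

```python
from typing import Optional, Sequence, Tuple


def hours_until_limit(samples: Sequence[Tuple[float, float]],
                      limit: float) -> Optional[float]:
    """Estimate hours from the last sample until a metric reaches `limit`.

    `samples` holds (time_in_hours, value) pairs. Returns None when the
    fitted trend is flat or falling, so there is nothing to forecast.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    # Ordinary least-squares slope and intercept.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    time_at_limit = (limit - intercept) / slope
    return max(0.0, time_at_limit - samples[-1][0])


# Disk usage in percent, sampled hourly: 70, 75, 80
remaining = hours_until_limit([(0, 70.0), (1, 75.0), (2, 80.0)], limit=100.0)
print(f"disk full in about {remaining:.1f}h")  # disk full in about 4.0h
```

An alert rule would then fire when the estimate drops below the time needed to react, for example the 4 hours in the table above. The same function works for queue depth or connection-pool usage against their respective limits.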

## Why it matters

Reactive monitoring means users are your alerting system: by the time they complain, the incident is already live and eating into your error budget. Proactive detection (SLO burn rate, synthetics, health checks, RUM, leading indicators, trend/anomaly alerts) buys lead time, so you fix a saturating queue or a creeping latency before it becomes a 2am page and an angry customer. That lead time is the difference between a quiet fix and an outage.