The goal is to catch issues before the page — to find degradation while there's still slack to absorb it. That means watching leading indicators, defining SLOs with error budgets, and actively probing the system rather than waiting for it to fail.
An SLO turns reliability into a number (e.g. 99.9% of requests succeed). The remaining 0.1% is your error budget. Tracking burn rate lets you alert when you're spending the budget too fast — long before you've actually breached the SLO and users notice.
SLO 99.9% → 0.1% error budget/month (~43 min of downtime)
burn rate rising fast → you'll exhaust it in 2 days → alert NOW, while it's fixable
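The arithmetic above can be worked through in a few lines. This is a minimal sketch, assuming a 30-day SLO window and a burn rate defined as the observed error ratio divided by the error budget; the names `Slo`, `burn_rate`, and `days_to_exhaustion` are illustrative, not taken from any particular monitoring tool.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass(frozen=True)
class Slo:
    target: float          # e.g. 0.999 for "99.9% of requests succeed"
    window_days: int = 30  # assumption: a 30-day rolling window

    @property
    def error_budget(self) -> float:
        # Fraction of requests (or time) that is allowed to fail.
        return 1.0 - self.target

    def budget_minutes(self) -> float:
        # Error budget expressed as minutes of full downtime per window.
        return self.error_budget * self.window_days * MINUTES_PER_DAY


def burn_rate(slo: Slo, observed_error_ratio: float) -> float:
    # 1.0 means the budget lasts exactly one window; 15.0 means 15x too fast.
    return observed_error_ratio / slo.error_budget


def days_to_exhaustion(slo: Slo, observed_error_ratio: float,
                       budget_remaining: float = 1.0) -> float:
    # Days until the remaining budget is gone if the current rate holds.
    rate = burn_rate(slo, observed_error_ratio)
    if rate <= 0:
        return float("inf")
    return slo.window_days * budget_remaining / rate


if __name__ == "__main__":
    slo = Slo(target=0.999)
    print(f"budget: {slo.budget_minutes():.1f} min per {slo.window_days} days")
    print(f"burn rate at 1.5% errors: {burn_rate(slo, 0.015):.1f}x")
    print(f"days to exhaustion: {days_to_exhaustion(slo, 0.015):.1f}")
```

With a 99.9% target, a sustained 1.5% error ratio is a burn rate of about 15, which would exhaust a full budget in about 2 days. Alerting on the burn rate rather than on the breach itself is what provides the lead time.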
SYNTHETIC MONITORING   scripted checks hit critical paths (login, checkout) on a
                       schedule → catch failures even at 3am with zero real traffic
HEALTH CHECKS          /healthz endpoints + dependency checks → load balancer
                       pulls bad instances before users hit them
RUM (real-user mon.)   measure latency/errors from actual browsers/devices →
                       catches issues only some users/regions see
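The HEALTH CHECKS row can be sketched as a function that a /healthz handler might call. This is an illustrative sketch only: the `healthz` function, the dependency names, and the choice of 200/503 status codes are assumptions, and the handler is deliberately not tied to any web framework.

```python
from typing import Callable, Dict, Tuple


def healthz(dependency_checks: Dict[str, Callable[[], bool]]
            ) -> Tuple[int, Dict[str, str]]:
    """Run each dependency check and summarise the result.

    Returns an HTTP-style status code (200 healthy, 503 unhealthy) plus a
    per-dependency report. A check that raises is treated as failing.
    """
    report: Dict[str, str] = {}
    for name, check in dependency_checks.items():
        try:
            healthy = bool(check())
        except Exception:
            healthy = False
        report[name] = "ok" if healthy else "failing"

    status = 200 if all(state == "ok" for state in report.values()) else 503
    return status, report


if __name__ == "__main__":
    # Hypothetical checks; real ones would ping the database, cache, etc.
    status, report = healthz({
        "database": lambda: True,
        "cache": lambda: False,
    })
    print(status, report)
```

A load balancer that polls this endpoint and sees a 503 can stop routing traffic to the instance before users reach it.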
Synthetic monitoring is powerful because it doesn't wait for a user — it continuously exercises the system, so a broken checkout is found at 3am, not when the morning rush complains.
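A synthetic check is essentially a scripted user journey with a pass/fail verdict and a latency limit. The sketch below assumes each critical path is wrapped in a plain callable that raises on failure; `run_probe`, `run_suite`, and the 2-second latency limit are hypothetical names and values chosen for illustration. A scheduler such as cron would run the suite at a fixed interval.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ProbeResult:
    name: str
    ok: bool
    latency_s: float
    detail: str = ""


def run_probe(name: str, step: Callable[[], None],
              max_latency_s: float) -> ProbeResult:
    """Execute one scripted path; fail on an exception or on slowness."""
    start = time.monotonic()
    try:
        step()
    except Exception as exc:
        return ProbeResult(name, False, time.monotonic() - start, repr(exc))
    latency = time.monotonic() - start
    if latency > max_latency_s:
        return ProbeResult(name, False, latency, "exceeded latency limit")
    return ProbeResult(name, True, latency)


def run_suite(probes: Dict[str, Callable[[], None]],
              max_latency_s: float = 2.0) -> List[ProbeResult]:
    """Run every probe once; the caller alerts on any result with ok=False."""
    return [run_probe(name, step, max_latency_s)
            for name, step in probes.items()]


if __name__ == "__main__":
    def login() -> None:
        pass  # placeholder: would perform a real login request

    def checkout() -> None:
        raise RuntimeError("payment gateway returned 500")  # simulated break

    for result in run_suite({"login": login, "checkout": checkout}):
        print(result)
```

Because the probe generates its own traffic, the simulated checkout failure is reported on the next scheduled run, whether or not any real user is online.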
The earliest signs show up in resource metrics before they show up as user-facing errors. Alert on the trend, not just on a static threshold.
LEADING INDICATORS   saturation (CPU/mem climbing), queue depth growing,
                     connection pool nearing its limit, latency CREEPING up
ANOMALY DETECTION    flag deviation from the normal baseline / seasonality
TREND ALERTS         "disk will fill in 4h at this rate" → act before it's full
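A trend alert of the "disk will fill in 4h" kind can be approximated by fitting a straight line to recent usage samples and extrapolating to capacity. This is a minimal sketch assuming roughly linear growth; `hours_until_full` is a hypothetical helper, and the sample values are made up for illustration.

```python
from typing import Optional, Sequence, Tuple


def hours_until_full(samples: Sequence[Tuple[float, float]],
                     capacity: float) -> Optional[float]:
    """Estimate hours from the last sample until usage reaches capacity.

    samples: (time_in_hours, used) pairs, oldest first.
    Returns None when there is too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None

    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None

    # Ordinary least-squares slope and intercept of used vs. time.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t

    time_full = (capacity - intercept) / slope
    return max(0.0, time_full - samples[-1][0])


if __name__ == "__main__":
    # Disk usage in GB sampled hourly, growing 5 GB/h on a 100 GB volume.
    usage = [(0, 60), (1, 65), (2, 70), (3, 75), (4, 80)]
    remaining = hours_until_full(usage, capacity=100)
    if remaining is not None and remaining <= 6:
        print(f"disk will fill in {remaining:.1f}h at this rate")
```

A static "disk above 90%" threshold would still be silent at 80% usage here, while the extrapolation already shows only 4 hours of headroom.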
A slowly rising p99 or a growing queue is a warning shot: by acting on the creep, you prevent the outage that the creep was heading toward.
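One simple way to notice that creep is to compare the latest value against a baseline and flag large deviations. The sketch below uses a z-score with a threshold of 3 standard deviations; both the method and the threshold are assumptions made for illustration, and `is_anomalous` is a hypothetical helper. To account for seasonality, the baseline would be drawn from comparable periods, for example the same hour on previous days.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(baseline: Sequence[float], current: float,
                 threshold: float = 3.0) -> bool:
    """Flag `current` if it deviates from the baseline by more than
    `threshold` sample standard deviations."""
    if len(baseline) < 2:
        return False  # not enough history to judge
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return current != mu
    return abs(current - mu) / sigma > threshold


if __name__ == "__main__":
    # p99 latency in ms at the same hour over the previous five days.
    p99_baseline = [100, 102, 98, 101, 99]
    for value in (103, 120):
        print(value, is_anomalous(p99_baseline, value))
```

Here 103 ms sits within normal variation, while 120 ms is flagged even though it may still be well below any fixed latency limit.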
Reactive monitoring means users are your alerting system: by the time they complain, the incident is already live and your error budget is already burning. Proactive detection (SLO burn rate, synthetics, health checks, RUM, leading indicators, trend/anomaly alerts) buys lead time, so you can fix a saturating queue or a creeping latency before it becomes a 2am page and an angry customer. That lead time is the difference between a quiet fix and an outage.