How would you design monitoring for an application from scratch?

Question

Accepted Answer

Start **top-down from what users feel**, not bottom-up from infrastructure. The most reliable host fleet is worthless if requests are failing, so begin with user-facing **SLIs** — **latency**, **error rate**, **availability** — then add the four golden signals, then infra metrics last.

## The layering, from user inward

```text
1. USER-FACING SLIs   → what the user experiences (latency, errors, availability)
2. GOLDEN SIGNALS     → latency, traffic, errors, saturation per service
3. INFRA METRICS      → CPU, memory, disk, network (causes, not symptoms)
```

If you only watch CPU and disk (bottom-up), you can be fully green while users get 500s. Watching SLIs first (top-down) means you alert on **symptoms users actually feel**, then drill down into golden signals and infra to find the cause.

## The pipeline: instrument → collect → dashboard → alert

```text
INSTRUMENT  app emits metrics/logs/traces (e.g. request_duration_seconds histogram)
   ↓
COLLECT     a collector scrapes or ingests them into a TSDB (e.g. Prometheus server, Datadog Agent)
   ↓
DASHBOARD   visualize SLIs + golden signals (Grafana) for humans to read
   ↓
ALERT       fire on SLO violations / burn rate, routed to on-call
```
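
To make the INSTRUMENT step less abstract, here is a minimal sketch of what a latency histogram such as `request_duration_seconds` records under the hood. It uses only the standard library; `LatencyHistogram`, its bucket bounds, and `handle_request` are hypothetical names chosen for illustration, and a real service would normally use its metrics client library rather than hand-rolling this.

```python
import bisect
import math
import time
from contextlib import contextmanager


class LatencyHistogram:
    """Hypothetical stand-in for a client-library histogram metric."""

    def __init__(self, bounds=(0.05, 0.1, 0.3, 1.0)):
        # Upper bounds in seconds (assumed values); +Inf catches everything else.
        self.bounds = sorted(bounds) + [math.inf]
        self.counts = [0] * len(self.bounds)  # per-bucket, not cumulative
        self.total_seconds = 0.0

    def observe(self, seconds):
        # "le" semantics: a value lands in the first bucket whose bound is >= value.
        idx = bisect.bisect_left(self.bounds, seconds)
        self.counts[idx] += 1
        self.total_seconds += seconds

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, as a scraper would see them."""
        running, out = 0, []
        for bound, count in zip(self.bounds, self.counts):
            running += count
            out.append((bound, running))
        return out

    @contextmanager
    def time(self):
        """Time the wrapped block and record its duration, even if it raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.observe(time.perf_counter() - start)


request_duration_seconds = LatencyHistogram()


def handle_request():
    # Hypothetical handler: the instrumentation wraps the real work.
    with request_duration_seconds.time():
        time.sleep(0.01)
```

The collector only ever sees the cumulative bucket counts, which is why the choice of bucket bounds limits how precisely latency can be reported later.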

## A concrete starting point

```promql
# Availability SLI: fraction of requests that do not return a 5xx (4xx counts as success here)
sum(rate(http_requests_total{status!~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Latency SLI: p99 request latency
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```
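
The p99 above is an estimate, not an exact value: `histogram_quantile` interpolates linearly inside the bucket that contains the target rank. The sketch below mirrors that behavior for classic cumulative buckets, as I understand it; `estimate_quantile` and the sample bucket counts are hypothetical, and edge-case handling is simplified.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with the last entry being the +Inf bucket. Hypothetical helper.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan  # not enough data to estimate anything

    rank = q * buckets[-1][1]  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Assumed sample data: 1000 requests, 985 of them at or under 0.3 s.
sample = [(0.1, 900), (0.3, 985), (1.0, 998), (math.inf, 1000)]
p99 = estimate_quantile(0.99, sample)  # lands between 0.3 s and 1.0 s
```

With these sample counts the estimate falls in the 0.3 s to 1.0 s bucket, so the reported p99 can be no more precise than that bucket is wide. Placing a bucket boundary at the latency threshold you care about keeps the comparison against that threshold exact.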

Define an **SLO** on each SLI (e.g. 99.9% availability, p99 < 300ms), dashboard them, and alert when the error budget is burning fast enough to put the SLO at risk, not on every blip.
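
One common way to make "at risk" measurable is the error budget burn rate: how fast failures are consuming the budget the SLO allows. The following is a sketch under stated assumptions, not a prescribed implementation. `Slo` and `should_page` are hypothetical names, the 30-day window is assumed, and 14.4 is the commonly cited fast-burn threshold (2% of a 30-day budget consumed in one hour).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    target: float          # e.g. 0.999 for 99.9% availability
    window_days: int = 30  # assumed rolling SLO window

    @property
    def error_budget(self) -> float:
        """Fraction of requests allowed to fail over the window."""
        return 1.0 - self.target

    def budget_minutes(self) -> float:
        """The error budget expressed as minutes of total outage."""
        return self.window_days * 24 * 60 * self.error_budget

    def burn_rate(self, observed_error_ratio: float) -> float:
        """1.0 means the budget is spent exactly at the end of the window."""
        return observed_error_ratio / self.error_budget


def should_page(slo, long_window_errors, short_window_errors, threshold=14.4):
    """Page only if both windows burn fast (hypothetical multi-window rule).

    The long window (e.g. 1h) shows the problem is significant; the short
    window (e.g. 5m) shows it is still happening, so recovered blips stay quiet.
    """
    return (slo.burn_rate(long_window_errors) >= threshold
            and slo.burn_rate(short_window_errors) >= threshold)


slo = Slo(target=0.999)
page = should_page(slo, long_window_errors=0.02, short_window_errors=0.03)
```

For a 99.9% target over 30 days the budget is roughly 43 minutes of full outage, and a sustained 2% error ratio burns it about 20 times faster than the SLO allows, so the example above would page. A brief spike that has already recovered fails the short-window check and stays silent.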

## Why it matters

Monitoring built bottom-up tells you a disk is 80% full but not that customers can't check out. Starting from user-facing SLIs ties every dashboard and alert back to real user impact, keeps noise low, and gives a clear drill-down path (symptom → golden signal → infra cause) when something breaks.