What is an observability strategy for logs, metrics, and traces at scale?

Question

Accepted Answer

Observability rests on **three pillars**: **logs**, **metrics**, and **traces**. The goal is to answer "what's wrong and why" for a system too large to inspect by hand. At scale, the strategy comes down to correlating the three signals, sampling them, and keeping telemetry cost under control.

## The three pillars

| Pillar | Answers | Tooling |
|---|---|---|
| Metrics | Is something wrong? (rates, latency) | Prometheus, Grafana |
| Traces | Where in the flow? | OpenTelemetry, Jaeger |
| Logs | What exactly happened? | ELK, Loki |

```text
Metrics alert ─▶ trace pinpoints the slow service ─▶ logs explain the cause
   (broad)              (path)                          (detail)
```

## Make them correlate

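The trace/correlation ID must appear on every span and in every log line, so you can pivot between them. Metrics are the exception: a per-request ID as a metric label creates unbounded label cardinality, so link metrics to traces through exemplars (a sample trace ID attached to a data point) instead.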

```text
log line:  level=error trace_id=abc123 service=payments msg="gateway timeout"
                       ^^^^^^^^^^^^^^^ same id appears on the trace's spans + in metric exemplars
```
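As a rough illustration of the log side of this, the sketch below uses only Python's standard `logging`, `json`, and `contextvars` modules to stamp every log line with the active trace ID. The names `current_trace_id`, `JsonTraceFormatter`, and `build_logger` are made up for this example; in a real service the ID would normally come from your tracing library's context rather than being set by hand.

```python
import contextvars
import json
import logging
import sys

# Hypothetical holder for the trace ID of the request being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class JsonTraceFormatter(logging.Formatter):
    """Render each record as one JSON object that carries the active trace ID."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        return json.dumps({
            "level": record.levelname.lower(),
            "trace_id": current_trace_id.get(),
            "service": self.service,
            "msg": record.getMessage(),
        })


def build_logger(service, stream):
    """Return a logger that writes one JSON line per record to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonTraceFormatter(service))
    logger = logging.getLogger(service)
    logger.handlers = [handler]   # replace any handlers from earlier calls
    logger.setLevel(logging.INFO)
    logger.propagate = False      # avoid duplicate output via the root logger
    return logger


if __name__ == "__main__":
    log = build_logger("payments", sys.stdout)
    current_trace_id.set("abc123")  # normally set by request middleware
    log.error("gateway timeout")
```

Because the ID lives in a context variable, code deep in the call stack can log without having the ID passed to it explicitly.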

## At-scale concerns

```text
✓ Standardize: OpenTelemetry across all services
✓ Use structured (JSON) logs — queryable, not grep-only
✓ Sample traces (e.g. keep all errors + 1% of successful requests) to control cost
✓ Define SLOs and alert on symptoms (latency/error rate), not noise
✓ RED method for service dashboards (Rate, Errors, Duration); USE method for resources (Utilization, Saturation, Errors)
```
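The "keep all errors + 1% of success" rule could be sketched roughly as follows. The decision depends on how the trace ended, so it has to be made after the trace completes (tail-based sampling), typically in a collector rather than in each service. The function name `keep_trace` and the choice to hash the trace ID are assumptions for this example, not part of any particular tool; hashing makes the decision deterministic, so every component that sees the same ID reaches the same verdict.

```python
import hashlib


def keep_trace(trace_id, is_error, success_rate=0.01):
    """Decide whether a finished trace should be stored.

    Errored traces are always kept. Successful traces are kept when the
    hash of their ID falls in the lowest `success_rate` share of the
    64-bit hash space, which approximates keeping that fraction of them.
    """
    if is_error:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < success_rate * 2**64


if __name__ == "__main__":
    kept = sum(keep_trace(f"trace-{i}", is_error=False) for i in range(100_000))
    print(f"kept {kept} of 100000 successful traces")
```

The kept count will hover around 1,000 but will not be exactly 1,000, since the sampling is probabilistic over the ID space.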

## Pitfall

Keeping every log line and every trace at 100% is unaffordable at this scale and buries the signal. Sample traces, structure logs, and alert on SLOs instead.
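One common way to alert on an SLO rather than on raw noise is the error-budget burn rate: the observed error ratio divided by the error ratio the SLO allows. The sketch below is a minimal version of that arithmetic. The function names, the 99.9% target, and the paging threshold of 14.4 (which corresponds to spending 2% of a 30-day budget in one hour) are assumptions chosen for illustration; pick values that match your own SLOs.

```python
def burn_rate(errors, total, slo_target):
    """Return how fast the error budget is being spent.

    1.0 means errors arrive exactly at the rate the SLO allows;
    2.0 means the budget would run out in half the SLO window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total == 0:
        return 0.0  # no traffic, nothing burned
    error_budget = 1.0 - slo_target       # allowed error ratio
    return (errors / total) / error_budget


def should_page(errors, total, slo_target=0.999, threshold=14.4):
    """Page only when the budget is burning much faster than planned."""
    return burn_rate(errors, total, slo_target) >= threshold


if __name__ == "__main__":
    # 200 failures out of 10,000 requests in the window: 2% error ratio
    # against a 0.1% budget, so the budget burns about 20x too fast.
    print(should_page(errors=200, total=10_000))
```

In practice the `errors` and `total` counts would come from the same rate and error metrics that feed the dashboards, evaluated over a fixed window.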

## Why it matters

With hundreds of services, you can't SSH in and look — observability is the only way to understand production behavior.

The winning strategy is correlated, sampled, and SLO-driven: it surfaces real problems quickly without bankrupting you on telemetry storage or burying on-call in noise.
