What are metrics, logs, and traces, and when do you reach for each?

Question

Accepted Answer

These are the **three pillars of observability**. They answer different questions: metrics tell you **that** something is wrong, logs tell you **what** happened, and traces tell you **where** in a distributed flow the time or error went.

## The three pillars

```text
METRICS  aggregate numbers over time (counters, gauges, histograms)
         → cheap, low cardinality, great for trends & ALERTING
         → e.g. error rate = 2%, p99 latency = 800ms

LOGS     discrete, timestamped events with detail (often structured JSON)
         → rich context for DEBUGGING a specific request
         → e.g. {"level":"error","user":123,"msg":"payment declined"}

TRACES   the path of one request across services, with timing per span
         → shows latency BREAKDOWN and where a call fails
         → e.g. checkout 800ms = api 50ms + db 700ms + email 50ms
```
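To make the distinction concrete, here is a minimal sketch of one request handler emitting all three signals. It uses only the Python standard library; `request_count`, `log_lines`, `spans`, and `handle_checkout` are hypothetical in-memory stand-ins for a real metrics backend, log sink, and tracer, not any particular library's API.

```python
import json
import time
from collections import Counter

# Hypothetical in-memory stand-ins for a metrics backend, a log sink and a tracer.
request_count = Counter()   # METRIC: counter keyed by (route, status)
log_lines = []              # LOGS: one JSON document per event
spans = []                  # TRACES: one timed record per operation


def handle_checkout(trace_id, user_id, card_ok):
    """Handle one request and emit all three signals for it."""
    start = time.perf_counter()
    status = "ok" if card_ok else "error"

    # Metric: an aggregate only, with no per-request detail (low cardinality).
    request_count[("checkout", status)] += 1

    # Log: a discrete event carrying the detail needed to debug this request.
    if not card_ok:
        log_lines.append(json.dumps({
            "level": "error", "trace_id": trace_id,
            "user": user_id, "msg": "payment declined",
        }))

    # Trace: a span with timing, tied to the request by its trace_id.
    spans.append({
        "trace_id": trace_id, "name": "checkout",
        "duration_ms": (time.perf_counter() - start) * 1000,
    })
    return status


handle_checkout("t1", 456, card_ok=True)
handle_checkout("t2", 123, card_ok=False)

total = sum(request_count.values())
error_rate = request_count[("checkout", "error")] / total
print(f"error rate = {error_rate:.0%}")   # metric view: one number for all requests
print(log_lines[0])                       # log view: detail for the failed request
print(spans[1]["trace_id"])               # trace view: timing for that same request
```

Note that the metric never sees `user_id` or `trace_id`, while the log and the span share a `trace_id`; that shared ID is what lets you jump from a slow trace to the matching log lines.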

## When to reach for each — one incident

```text
1. METRIC alerts: "checkout p99 latency jumped to 2s"   → you know THAT something is wrong
2. TRACE a slow request: 1.8s of 2s is spent in the inventory service
                                                        → you know WHERE it is
3. LOGS of the inventory service at that time: "slow query: missing index"
                                                        → you know WHAT happened
```

Metrics narrow you to a symptom and a time window; traces localize it to a service or call; logs usually reveal the exact cause. Going straight to logs, without a metric to bound the time window or a trace to name the service, means searching every service's output blind.
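The "localize" step can be sketched as a small calculation over the spans of one slow trace: subtract each span's children from its duration and see where the time actually went. This is an illustrative sketch, not a real tracing API; the `Span` type, the span names, and the log records are hypothetical, the numbers mirror the incident above, and it assumes unique span names and children that run one after another.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    name: str
    parent: Optional[str]   # name of the parent span, None for the root
    duration_ms: float


def self_times(spans):
    """Return each span's own time: its duration minus its direct children's."""
    own = {s.name: s.duration_ms for s in spans}
    for s in spans:
        if s.parent is not None:
            own[s.parent] -= s.duration_ms
    return own


# Step 2: one slow checkout trace, 1.8s of the 2s spent in inventory.
trace = [
    Span("checkout", None, 2000),
    Span("inventory", "checkout", 1800),
    Span("payment", "checkout", 150),
]
own = self_times(trace)
slowest = max(own, key=own.get)

# Step 3: only now read logs, restricted to the service the trace pointed at.
logs = [
    {"service": "payment", "msg": "charge accepted"},
    {"service": "inventory", "msg": "slow query: missing index"},
]
suspect_logs = [entry["msg"] for entry in logs if entry["service"] == slowest]

print(slowest, own[slowest])
print(suspect_logs)
```

Using self-time rather than total duration matters: the root `checkout` span is always the longest, but almost all of its time belongs to its children.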

## Cost and cardinality

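Metrics are aggregated, so their cost scales with the number of distinct label combinations (their cardinality) rather than with traffic. Kept low-cardinality, they stay cheap even at scale, which makes them ideal for always-on dashboards and alerts. Logs and traces are per-event, so their cost grows with request volume; they are usually **sampled** (or, for logs, filtered by level) and queried on demand during investigation.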
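Two bits of arithmetic sit behind this trade-off, sketched below under stated assumptions. `series_count` gives an upper bound on the time series one metric can produce (the product of distinct values per label), and `keep_trace` is one common way to sample traces deterministically by hashing the trace ID. Both function names, the label sets, and the 10% rate are hypothetical choices for illustration.

```python
import hashlib
from math import prod


def series_count(label_cardinalities):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


def keep_trace(trace_id, sample_rate):
    """Deterministic sampling: map the trace ID to a uniform integer and keep a fixed share."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")   # uniform integer in [0, 2**64)
    return bucket < sample_rate * 2**64


# Bounded labels keep a metric cheap; a per-user label multiplies the series count.
low = series_count({"route": 20, "status": 5, "region": 4})
high = series_count({"route": 20, "status": 5, "user_id": 100_000})
print(low, high)

# Roughly 10% of traces are kept, and the same ID always gets the same decision.
kept = sum(keep_trace(f"trace-{i}", 0.10) for i in range(10_000))
print(kept)
```

Hashing the ID instead of calling a random number generator means every service that sees the same trace ID reaches the same keep-or-drop decision, so sampled traces stay complete across services.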

## Why it matters

Using the wrong pillar wastes time: you can't alert effectively on raw logs (too noisy, too costly), and you can't debug a specific failed request from an aggregate metric. Knowing that metrics detect, traces localize, and logs explain gives you a fast, repeatable path from "something's wrong" to root cause.