What is the observability strategy for logs, metrics, and traces at scale?

Question

Accepted Answer

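Observability rests on **three pillars**, **logs**, **metrics**, and **traces**, and its goal is to answer "what went wrong, and why" for a system too large to inspect by hand. At scale, the strategy comes down to correlation, sampling, and cost.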

## The three pillars

| Pillar | Question it answers | Tools |
|---|---|---|
| Metrics | Is there a problem? (rate, latency) | Prometheus, Grafana |
| Traces | Where in the request flow is the problem? | OpenTelemetry, Jaeger |
| Logs | What exactly happened? | ELK, Loki |

```text
Metrics alert ─▶ trace pinpoints the slow service ─▶ logs explain the cause
   (broad)              (path)                          (detail)
```

## Correlate them

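A trace/correlation ID must run through log lines and spans, and be attached to metrics as exemplars rather than as a label (a per-request ID used as a label value would explode metric cardinality), so that you can pivot freely between the three signals.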

```text
log line:  level=error trace_id=abc123 service=payments msg="gateway timeout"
                       ^^^^^^^^^^^^^^^ same id appears in the trace + metric exemplars
```
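As a minimal sketch of how a log line like the one above could be produced, the following uses only Python's standard library. The `current_trace_id` context variable, the `JsonFormatter` class, and `build_logger` are hypothetical names introduced for illustration; in a real service the ID would normally come from the tracing SDK's active span rather than being set by hand.

```python
import contextvars
import json
import logging

# Hypothetical holder for the current request's trace ID (an assumption:
# a real service would read it from the tracing library's active span).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that carries the trace ID."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname.lower(),
            "trace_id": current_trace_id.get(),
            "service": self.service,
            "msg": record.getMessage(),
        })


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes structured JSON lines to stderr."""
    logger = logging.getLogger(service)
    logger.handlers.clear()
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("payments")
    current_trace_id.set("abc123")  # normally set once per incoming request
    log.error("gateway timeout")
```

Because the ID lives in a context variable, every log call made while handling the same request picks it up without the caller passing it explicitly.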

## What to watch at scale

```text
✓ Standardize: OpenTelemetry across all services
✓ Use structured (JSON) logs — queryable, not grep-only
✓ Sample traces (e.g. keep all errors + 1% of success) to control cost
✓ Define SLOs and alert on symptoms (latency/error rate), not noise
✓ RED method for service dashboards (Rate, Errors, Duration); USE for resources (Utilization, Saturation, Errors)
```
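The "keep all errors + 1% of success" rule can be sketched as a small decision function. This is an illustrative assumption, not any particular collector's implementation: `keep_trace` is a hypothetical name, and the decision is made after the trace has finished (tail-based), since the error status is only known then. Hashing the trace ID keeps the choice deterministic, so every component that sees the same ID reaches the same verdict.

```python
import hashlib


def keep_trace(trace_id: str, has_error: bool, success_rate: float = 0.01) -> bool:
    """Keep every errored trace and a deterministic fraction of the rest."""
    if has_error:
        return True
    # Map the trace ID to a stable number in [0, 1) and compare it to the rate.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < success_rate


if __name__ == "__main__":
    print(keep_trace("abc123", has_error=True))   # errored traces are always kept
    print(keep_trace("abc123", has_error=False))  # kept only if it falls in the 1% bucket
```

Raising `success_rate` for a low-traffic service, or lowering it for a very chatty one, is the main knob for trading cost against visibility.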

## Pitfall

Recording everything at 100% is unaffordable and drowns out the useful signal. Instead, sample, use structured data, and alert on SLOs.
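One common way to alert on an SLO is to watch how fast the error budget is burning. The sketch below is a hedged illustration: `burn_rate` and `should_page` are hypothetical helpers, the 99.9% target in the usage example is made up, and the 14.4 threshold is an assumption borrowed from widely used multi-window burn-rate guidance (roughly 2% of a 30-day budget consumed in one hour). Tune all of these to your own SLO.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target)."""
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


def should_page(short_window: tuple, long_window: tuple,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only when both a short and a long window burn faster than threshold.

    Each window is an (errors, total) pair. Requiring both windows filters
    out brief spikes (noise) while still reacting quickly to real incidents.
    """
    short = burn_rate(*short_window, slo_target)
    long_ = burn_rate(*long_window, slo_target)
    return short > threshold and long_ > threshold


if __name__ == "__main__":
    # Hypothetical counts: 5-minute window and 1-hour window, 99.9% SLO.
    print(should_page((20, 1000), (150, 10000), slo_target=0.999))
```

A burn rate of 1 means the budget would be spent exactly at the end of the SLO period; higher values mean it runs out proportionally sooner.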

## Why it matters

With hundreds of services, you cannot SSH into each one to look around. Observability is the only way to understand how production behaves.

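The winning strategy is correlated, sampled, and SLO-driven: it surfaces real problems quickly without bankrupting you on telemetry storage or burying the on-call engineer in noise.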
