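According to Google's SRE book, the four golden signals are latency, traffic, errors, and saturation. If you can measure only four metrics of a user-facing system, measure these: together they catch the large majority of problems.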
The four signals
LATENCY     how long a request takes
            → split SUCCESSFUL vs FAILED latency (a fast 500 isn't "fast")
            → track percentiles (p50/p95/p99), not averages
TRAFFIC     how much demand the system is under
            → requests/sec, transactions/sec, concurrent sessions
ERRORS      rate of failing requests
            → explicit (HTTP 500) and implicit (wrong content, too slow)
SATURATION  how "full" the system is: its most constrained resource
            → CPU, memory, I/O, queue depth; a leading indicator of trouble
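As a rough illustration of the list above, the sketch below summarizes one time window of request records into the four signals. The `Request` record, the `golden_signals` function, and the `utilization` mapping are hypothetical names chosen for this example, not part of any monitoring library. It also counts only explicit errors (HTTP 5xx); implicit errors such as wrong content would need application-level checks that are not shown here.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one handled request."""
    timestamp: float    # seconds since the start of the window
    duration_ms: float  # how long the request took
    status: int         # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile; returns None for an empty list."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_seconds, utilization):
    """Summarize one time window into the four golden signals.

    `utilization` maps a resource name to a 0.0-1.0 fraction,
    for example {"cpu": 0.7, "memory": 0.85}.
    """
    # Latency is split by outcome so that fast failures
    # do not make the service look healthy.
    ok = [r.duration_ms for r in requests if r.status < 500]
    failed = [r.duration_ms for r in requests if r.status >= 500]
    total = len(requests)

    # Saturation is reported as the most constrained resource.
    bottleneck = max(utilization, key=utilization.get) if utilization else None

    return {
        "latency_ok_ms": {p: percentile(ok, p) for p in (50, 95, 99)},
        "latency_failed_ms": {p: percentile(failed, p) for p in (50, 95, 99)},
        "traffic_rps": total / window_seconds,
        "error_rate": len(failed) / total if total else 0.0,
        "saturation": (bottleneck, utilization[bottleneck]) if bottleneck else None,
    }
```

With nine successful requests (eight at 100 ms, one at 900 ms) and one fast 500, the median successful latency stays at 100 ms while p95 and p99 show the 900 ms outlier, which is the reason to track percentiles rather than averages.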
Why these four signals cover most problems
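Latency and errors are what users experience directly. Traffic supplies the context: a latency spike during a 10x traffic surge means something different from the same spike at baseline load. Saturation is the leading indicator: it rises before latency and errors appear, warning you before users are hurt.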
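The reasoning above can be sketched as a small triage heuristic that reads latency in the context of traffic and treats saturation as an early warning. The function name `interpret`, the snapshot keys, and both thresholds (`spike_factor`, `saturation_warn`) are assumptions made for illustration. Real alerting rules would be tuned per service.

```python
def interpret(baseline, current, spike_factor=2.0, saturation_warn=0.8):
    """Return labels describing one snapshot relative to a baseline.

    Both arguments are dicts with the keys "p99_ms", "rps" and
    "saturation" (a 0.0-1.0 fraction). Thresholds are illustrative only.
    """
    findings = []
    latency_up = current["p99_ms"] >= spike_factor * baseline["p99_ms"]
    traffic_up = current["rps"] >= spike_factor * baseline["rps"]
    saturated = current["saturation"] >= saturation_warn

    if latency_up and traffic_up:
        # The same spike reads differently when demand has also jumped.
        findings.append("latency-spike-under-traffic-surge")
    elif latency_up:
        # Demand is normal, so the cause is probably not load.
        findings.append("latency-spike-at-baseline-traffic")

    if saturated and not latency_up:
        # Saturation has risen before users feel anything.
        findings.append("saturation-early-warning")

    return findings
```

Calling it with a baseline of 200 ms p99 at 100 requests/sec, a snapshot at 600 ms and 1000 requests/sec is labelled a spike under a surge, while 600 ms at 100 requests/sec is labelled a spike at baseline traffic, which points away from load as the cause.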
