The core rule: alert on symptoms, not causes, and page only on what is actionable and urgent. A noisy alert that fires every night gets muted or ignored, so the real risk is not a missing alert but a desensitized on-call engineer who sleeps through the one that matters.
Symptoms over causes
Alert on user-visible symptoms (error rate, latency, availability), not on internal causes like "CPU > 80%". High CPU may be harmless; what matters is whether users are affected.
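
As a rough illustration of the symptom-first idea, here is a minimal Python sketch of a paging decision. The `Snapshot` shape, the `should_page` function, and both thresholds are hypothetical, chosen only for this example; real values would depend on the service and its objectives. CPU is recorded but deliberately never consulted when deciding to page.

```python
from dataclasses import dataclass

# Hypothetical thresholds, picked only for illustration.
MAX_ERROR_RATE = 0.01        # page above 1% failed requests
MAX_P99_LATENCY_MS = 500.0   # page above 500 ms p99 latency


@dataclass
class Snapshot:
    """One measurement window for a service (assumed shape)."""
    requests: int
    errors: int
    p99_latency_ms: float
    cpu_percent: float  # kept for diagnosis, never used to page


def should_page(s: Snapshot) -> bool:
    """Return True only when a user-visible symptom crosses its threshold."""
    if s.requests == 0:
        # No traffic in this window, so no error rate can be computed.
        # A real rule might treat missing traffic as its own symptom.
        return False
    error_rate = s.errors / s.requests
    return error_rate > MAX_ERROR_RATE or s.p99_latency_ms > MAX_P99_LATENCY_MS


# High CPU, but users are fine: no page.
busy_but_healthy = Snapshot(requests=10_000, errors=20,
                            p99_latency_ms=180.0, cpu_percent=95.0)

# Low CPU, but 4% of requests fail: page.
idle_but_failing = Snapshot(requests=10_000, errors=400,
                            p99_latency_ms=220.0, cpu_percent=30.0)

print(should_page(busy_but_healthy))  # False
print(should_page(idle_but_failing))  # True
```

The two sample snapshots show why the cause-based rule misleads in both directions: "CPU > 80%" would have paged for the healthy service and stayed silent for the failing one.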
