During an incident, your job is to restore service calmly and coordinate the response, not to be the hero who fixes it alone. Clear roles, calm communication, and a bias toward mitigating first separate a smooth response from chaos.
How to run an incident
1. ASSIGN roles — incident commander (coordinates), responders (fix),
comms (updates stakeholders). One person can't do all three.
2. MITIGATE first — stop the bleeding (roll back, feature-flag off)
before chasing root cause.
3. COMMUNICATE on a cadence — even "still investigating" every 15-30 min.
4. STAY CALM — the team mirrors your energy. Blame comes later, or never.
5. After: BLAMELESS post-mortem — fix the system, not the person.
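The update cadence in step 3 is easy to lose track of under pressure, so some teams back it with a timer. The following is a minimal sketch of that idea, not a prescribed tool: the names `UPDATE_INTERVAL`, `next_update_due`, and `update_overdue` are hypothetical, and the 15-minute default is simply the lower end of the range given above.

```python
from datetime import datetime, timedelta

# Assumed default: the lower end of the 15-30 minute range from the checklist.
UPDATE_INTERVAL = timedelta(minutes=15)


def next_update_due(last_update: datetime,
                    interval: timedelta = UPDATE_INTERVAL) -> datetime:
    """Return the time by which the next stakeholder update should go out."""
    return last_update + interval


def update_overdue(last_update: datetime, now: datetime,
                   interval: timedelta = UPDATE_INTERVAL) -> bool:
    """True once the cadence has lapsed, even if the update is only
    'still investigating'."""
    return now >= next_update_due(last_update, interval)
```

Whoever holds the comms role could check `update_overdue` against the timestamp of the last posted update. The point is that the update goes out on schedule whether or not there is news.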
Mitigate before you diagnose
The instinct to find the root cause is strong. Resist it. If a rollback restores service, do that first, then investigate calmly with the pressure off. Users care about being unblocked, not about your diagnosis.
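To make "feature-flag off" concrete, here is a small sketch of a kill switch. It assumes an in-memory flag store for illustration only. `FeatureFlags`, `checkout`, and the flag name `new_pricing` are hypothetical, and a real system would more likely read flags from a flag service or config store.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FeatureFlags:
    """Hypothetical in-memory flag store, used here only for illustration."""
    enabled: Dict[str, bool] = field(default_factory=dict)

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so a missing entry fails safe.
        return self.enabled.get(name, False)

    def kill(self, name: str) -> None:
        # Mitigation step: switch the suspect feature off without a deploy.
        self.enabled[name] = False


def checkout(flags: FeatureFlags) -> str:
    # "new_pricing" is an illustrative flag name, not taken from the text.
    if flags.is_enabled("new_pricing"):
        return "new pricing path"
    return "stable pricing path"


flags = FeatureFlags({"new_pricing": True})
before = checkout(flags)      # the suspect code path is live
flags.kill("new_pricing")     # mitigate first
after = checkout(flags)       # users are back on the stable path
```

Flipping the flag changes behavior right away and leaves the suspect code in place, so the diagnosis can happen later without users waiting on it.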
