The first priority is to restore service, then find the cause: mitigation comes before diagnosis. I'd declare an incident, assign clear roles, and drive toward the fastest safe recovery, communicating the whole way.
Silence breeds panic. I send updates on a steady cadence even when there's no news:
[14:05] Investigating — checkout is down, ~40% of users affected. Next update 14:20.
[14:20] Identified: bad deploy. Rolling back now. ETA 10 min.
[14:35] Service restored. Monitoring. Postmortem to follow.
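The updates above follow a fixed shape: a timestamp, a status, a short detail, and a promise of when the next update will land. As a rough sketch of how that cadence could be kept consistent, the helper below formats one update line. The function name `format_update`, its parameters, and the 15-minute default are assumptions for illustration, not part of any real incident tool.

```python
from datetime import datetime, timedelta


def format_update(now, status, detail, cadence_minutes=15, final=False):
    """Format one incident status line (hypothetical helper).

    now             -- datetime the update is sent
    status          -- short state label, e.g. "Investigating"
    detail          -- one-sentence description of what is known
    cadence_minutes -- assumed gap until the next update
    final           -- True for the closing update (no next-update promise)
    """
    line = f"[{now.strftime('%H:%M')}] {status}: {detail}"
    if not final:
        # Commit to the next update time even if there may be no news by then.
        next_at = now + timedelta(minutes=cadence_minutes)
        line += f" Next update {next_at.strftime('%H:%M')}."
    return line


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 14, 5)  # arbitrary example date
    print(format_update(start, "Investigating", "checkout is down."))
    print(format_update(start + timedelta(minutes=30), "Resolved",
                        "service restored, monitoring.", final=True))
```

Stating the next update time in every non-final message is the point of the design: readers know when to look again, so a quiet period does not read as silence.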
Outages are inevitable; how you run them defines team trust and customer confidence. Calm, role-based coordination plus blameless follow-up turns a bad day into a stronger system — and signals to your engineers that it's safe to move fast because failure is handled as a process, not a witch hunt.