How do you lead during a production incident?

Question

Accepted Answer

During an incident your job is to **restore service calmly and coordinate the response**, not to be the hero who fixes it alone. Clear roles, calm communication, and a bias to mitigate first separate a smooth response from chaos.

## How to run an incident

```text
1. ASSIGN roles — incident commander (coordinates), responders (fix),
   comms (updates stakeholders). One person can't do all three.
2. MITIGATE first — stop the bleeding (roll back, feature-flag off)
   before chasing root cause.
3. COMMUNICATE on a cadence — even "still investigating" every 15-30 min.
4. STAY CALM — the team mirrors your energy. Blame comes later, or never.
5. After: BLAMELESS post-mortem — fix the system, not the person.
```
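
Step 3 is the easiest one to drop when everyone is heads-down. As a minimal sketch, assuming you track the time of the last stakeholder update yourself, a check like this can drive a reminder. The names `update_due` and `cadence` are hypothetical, and the 15-minute default is just the low end of the range above.

```python
from datetime import datetime, timedelta


def update_due(
    last_update: datetime,
    now: datetime,
    cadence: timedelta = timedelta(minutes=15),  # assumed default, tune per incident
) -> bool:
    """Return True when at least one full cadence interval has passed since the last update."""
    return now - last_update >= cadence
```

When it returns `True`, post something, even if the message is only "still investigating".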

## Mitigate before you diagnose

The instinct to find the root cause is strong. Resist it. If a rollback restores service, do that *first*, then investigate calmly with the pressure off. Users care about being unblocked, not about your diagnosis.

## A concrete example

A deploy breaks checkout. Don't debug the new code live under pressure. Roll back immediately, confirm checkout works, post an update, *then* investigate the bad deploy in calm conditions.
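
The ordering in that example can be sketched in code. This is an illustration under stated assumptions, not a real tool: `mitigate_first`, `roll_back`, `checkout_healthy`, and `post_update` are hypothetical names, and the three callables stand in for whatever your deploy system, health check, and status channel actually provide.

```python
from typing import Callable


def mitigate_first(
    roll_back: Callable[[], None],          # hypothetical: reverts the bad deploy
    checkout_healthy: Callable[[], bool],   # hypothetical: True if checkout works
    post_update: Callable[[str], None],     # hypothetical: sends a status message
) -> bool:
    """Roll back, verify, and announce. Root-cause work happens only afterwards."""
    roll_back()  # stop the bleeding before any debugging
    if not checkout_healthy():
        # The rollback was not enough, so say so and keep mitigating.
        post_update("Rollback did not restore checkout; still mitigating.")
        return False
    post_update("Checkout restored by rollback; investigating the bad deploy.")
    return True
```

Note what is missing: there is no debugging step in the function at all. Investigation starts only after it returns `True`.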

## A pitfall

A blameful post-mortem teaches people to hide problems. Keep it blameless and focus on the systemic gaps (no canary, no alert) that let it happen.

## Why it matters

Incidents are high-stress and high-visibility, so how you lead shapes both the outcome and the team's trust in you.

Calm coordination and blameless learning turn an outage into a system that's harder to break next time.