Your production website just went down. How do you handle it?

Accepted Answer

The first priority is to **restore service, then find the cause**: mitigation comes before diagnosis. I'd declare an incident, assign clear roles, and drive toward the fastest safe recovery, communicating the whole way.

## Stabilize first

- **Declare an incident** and open a single channel (war room / Slack thread) so coordination isn't scattered.
- **Assign roles**: an **Incident Commander** to make decisions, a **comms owner** to update stakeholders, and **responders** to investigate. As Tech Lead I often take IC so engineers can focus on the problem.
- **Reach for the fastest reversible fix.** If a deploy correlates with the outage, **roll back** first and ask questions later: restoring service for users beats being right about the cause.
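
One way to make "a deploy correlates with the outage" concrete is to check whether any deploy finished shortly before the errors began. Below is a minimal sketch of that check. The `Deploy` record, its field names, and the 30-minute lookback window are illustrative assumptions, not taken from any particular deploy tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Deploy:
    # Hypothetical record shape; adapt to whatever your deploy log provides.
    service: str
    version: str
    finished_at: datetime


def rollback_candidate(
    deploys: List[Deploy],
    outage_start: datetime,
    window: timedelta = timedelta(minutes=30),  # assumed lookback, tune per system
) -> Optional[Deploy]:
    """Return the latest deploy that finished within `window` before the outage, or None."""
    suspects = [
        d for d in deploys
        if timedelta(0) <= outage_start - d.finished_at <= window
    ]
    # The most recent suspect is the cheapest thing to revert first.
    return max(suspects, key=lambda d: d.finished_at, default=None)
```

If this returns a deploy, that is the rollback to try first. If it returns `None`, the outage probably has another trigger, and the signals listed under diagnosis are the next place to look.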

## Diagnose in parallel

- Check the **obvious signals**: dashboards, error rates, recent deploys, infra changes, traffic spikes, expired certs.
- Form a hypothesis, test the cheapest one first, and **avoid changing five things at once** — you won't know what worked.
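
Some of the obvious signals can be checked mechanically rather than by eye. As one example, here is a minimal sketch of the expired-cert check using Python's standard `ssl` module. The helper name `days_until_expiry` is made up for illustration, and the input is assumed to be the `notAfter` string in the format returned by `SSLSocket.getpeercert()`.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days remaining before a certificate's notAfter time; negative means expired.

    `not_after` is a string such as "Jan  5 09:34:43 2018 GMT".
    `now` is seconds since the epoch and defaults to the current time.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400  # 86400 seconds per day
```

A negative result rules the hypothesis in, and a comfortably positive one rules it out in seconds, which is what "test the cheapest one first" looks like in practice.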

## Communicate continuously

Silence breeds panic. I send updates on a steady cadence even when there's no news:

```
[14:05] Investigating — checkout is down, ~40% of users affected. Next update 14:20.
[14:20] Identified: bad deploy. Rolling back now. ETA 10 min.
[14:35] Service restored. Monitoring. Postmortem to follow.
```
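
A steady cadence is easier to keep under pressure if every update comes from a template that always commits to the next update time. Here is a minimal sketch of such a template. The function name, the message layout, and the 15-minute default cadence are illustrative assumptions.

```python
from datetime import datetime, timedelta


def status_update(
    now: datetime,
    state: str,
    detail: str,
    cadence: timedelta = timedelta(minutes=15),  # assumed default interval
) -> str:
    """Format a stakeholder update that always states when the next one is due."""
    next_at = now + cadence
    return f"[{now:%H:%M}] {state}: {detail} Next update {next_at:%H:%M}."
```

For example, `status_update(datetime(2024, 1, 1, 14, 5), "Investigating", "checkout is down.")` produces `[14:05] Investigating: checkout is down. Next update 14:20.` The date here is arbitrary.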

## After recovery

- Confirm full recovery against the same signals that showed the outage (error rates, dashboards), not just "it looks better."
- Run a **blameless postmortem** within a few days: timeline, root cause, what made it slow to detect/fix, and **concrete action items with owners**.
- The output is systemic improvement (better alerts, guardrails, rollback automation) — not a name to blame.
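
To ground the "slow to detect/fix" discussion in numbers, the postmortem timeline can be reduced to a few durations. The following is a minimal sketch, in which the function name, the dictionary keys, and the three timestamps it expects are illustrative assumptions about what your timeline records.

```python
from datetime import datetime, timedelta
from typing import Dict


def incident_durations(
    started: datetime,    # when user impact began (hypothetical timeline field)
    detected: datetime,   # when the team became aware
    mitigated: datetime,  # when service was restored
) -> Dict[str, timedelta]:
    """Split an incident timeline into detection, mitigation, and total impact."""
    if not started <= detected <= mitigated:
        raise ValueError("expected started <= detected <= mitigated")
    return {
        "time_to_detect": detected - started,
        "time_to_mitigate": mitigated - detected,
        "total_impact": mitigated - started,
    }
```

A long `time_to_detect` points toward better alerts, while a long `time_to_mitigate` points toward guardrails and rollback automation.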

## Pitfalls

- **Debugging before mitigating** while users suffer.
- **No single decision-maker**, so five people guess in parallel.
- **Going dark** on stakeholders.
- **Blaming individuals**, which kills the honesty future incidents depend on.

## Why it matters

Outages are inevitable; how you run them defines team trust and customer confidence. Calm, role-based coordination plus blameless follow-up turns a bad day into a stronger system — and signals to your engineers that it's safe to move fast because failure is handled as a process, not a witch hunt.