At scale, failures are inevitable: servers crash, networks fail, and dependencies become unavailable. Designing for failure means building systems that tolerate and recover from failures gracefully rather than assuming everything works. Reliability then comes from how the system behaves when a component fails, not from the hope that none ever will.
Design for failure (the mindset)
ASSUME things WILL fail → at scale, failures are NORMAL, not exceptional:
→ servers crash, networks partition, disks fail, dependencies go down, traffic spikes
→ design systems to EXPECT and HANDLE failures gracefully (rather than assuming everything works)
→ "everything fails all the time" → build resilience in.
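
To make the mindset concrete, here is a minimal Python sketch of one way to "expect and handle" a failing dependency: retry a few times with exponential backoff and jitter, then degrade to a fallback instead of propagating the error. The names call_with_fallback, operation, fallback, max_attempts, base_delay and sleep are hypothetical, chosen for illustration, and catching every Exception is a simplification; real code would usually retry only on errors known to be transient.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(
    operation: Callable[[], T],
    fallback: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `operation` up to `max_attempts` times, then degrade to `fallback`.

    All names and defaults here are illustrative assumptions, not a standard API.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Failure is treated as a normal outcome, not a surprise.
            # (Simplification: production code would catch only transient errors.)
            if attempt < max_attempts - 1:
                # Exponential backoff with jitter, so many clients retrying
                # at once do not hit the recovering dependency in lockstep.
                sleep(random.uniform(0, base_delay * (2 ** attempt)))
    # Every attempt failed: serve a degraded answer rather than crash.
    return fallback()


if __name__ == "__main__":
    def always_down() -> str:
        raise ConnectionError("dependency unavailable")

    # The caller still gets an answer even though the dependency is down.
    print(call_with_fallback(always_down, lambda: "cached data", base_delay=0.01))
```

The sleep parameter is injectable only so the backoff can be skipped in tests. The fallback is what makes the failure graceful: the caller receives a degraded but usable result (for example cached or default data) instead of an exception.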
