スケール時には、障害は避けられない — サーバーがクラッシュし、ネットワークが故障し、依存関係が利用できなくなります。障害を想定した設計とは、すべてが動作することを前提とするのではなく、障害に耐え、障害から適切に回復するシステムを構築することです。これは信頼できるシステムに不可欠です。
障害を想定した設計(マインドセット)
ASSUME things WILL fail → at scale, failures are NORMAL, not exceptional:
→ servers crash, networks partition, disks fail, dependencies go down, traffic spikes
→ design systems to EXPECT and HANDLE failures gracefully (not assume everything works)
→ "everything fails all the time" → build resilience in.
レジリエンス技法
✓ REDUNDANCY → multiple instances, no single point of failure (failover to healthy ones)
✓ RETRIES (with backoff) → retry transient failures (with exponential backoff + jitter to
avoid overwhelming a recovering service)
✓ TIMEOUTS → don't wait forever for a failing dependency (fail fast)
✓ CIRCUIT BREAKERS → stop calling a failing service temporarily (prevent cascading failures;
give it time to recover) → fail fast and fall back
✓ GRACEFUL DEGRADATION → reduced functionality vs total failure (e.g. show cached/partial
data if a service is down)
✓ FALLBACKS → a default/alternative when something fails
✓ BULKHEADS / isolation → contain failures (one part failing doesn't sink everything)
