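High availability (HA) means designing a system to keep operating when failures occur, through redundancy, multi-AZ deployment, automatic recovery, and the elimination of single points of failure. It is a fundamental goal for production systems and a key area of AWS architecture.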
Core HA principles
✓ ELIMINATE SINGLE POINTS OF FAILURE — no single component whose failure takes down
the system → redundancy everywhere (multiple instances, AZs, etc.)
✓ Deploy across MULTIPLE AVAILABILITY ZONES to survive the failure of an AZ (one or more data centers)
✓ AUTOMATIC RECOVERY — detect failures and recover/replace automatically (no manual fix)
✓ DECOUPLE components — failures isolated; one component's failure doesn't cascade
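The effect of redundancy can be made concrete with a little arithmetic. The sketch below assumes that replicas fail independently (real failures are often correlated, which is one reason to spread replicas across AZs) and that the availability figures are purely illustrative, not AWS SLA numbers. The function names are hypothetical helpers introduced for this example.

```python
def parallel_availability(component_availability: float, replicas: int) -> float:
    """Availability of `replicas` redundant copies when any single copy is enough.

    Assumes independent failures: the group is down only if every copy is down.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("need at least one replica")
    return 1.0 - (1.0 - component_availability) ** replicas


def serial_availability(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


# Illustrative numbers only: one instance at 99% versus two redundant instances.
single = parallel_availability(0.99, 1)      # ≈ 0.99
redundant = parallel_availability(0.99, 2)   # ≈ 0.9999

# A chain of two non-redundant 99% tiers is weaker than either tier alone.
chain = serial_availability(0.99, 0.99)      # ≈ 0.9801
```

Under these assumptions, adding a second replica cuts expected downtime by roughly a factor of one hundred, while every non-redundant component placed in series lowers the overall figure. That is the quantitative case for removing single points of failure.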
HA techniques on AWS
COMPUTE → Auto Scaling Group across multiple AZs + Load Balancer
→ instances spread across AZs; LB health checks route around failures; ASG replaces
failed instances → survives instance AND AZ failures
DATABASE → RDS Multi-AZ (synchronous standby in another AZ, auto-failover);
read replicas (asynchronous; mainly for read scaling, can be promoted manually); DynamoDB (multi-AZ by default)
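STORAGE → S3 (durable by design; most storage classes replicate across multiple AZs); EBS snapshots (a volume lives in one AZ; a snapshot can be restored in another)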
DNS → Route 53 failover routing + health checks (route to healthy/backup endpoints)
DECOUPLING → SQS queues (buffer; consumers can fail/retry without losing work)
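The decoupling idea in the last item can be illustrated with a small in-memory simulation. This is a minimal sketch, not the SQS API: `ToyQueue`, `run_consumer`, and `flaky_handler` are hypothetical names, and the visibility timeout is simulated by an explicit method call rather than a real clock. It shows only the core behavior: a received message is hidden rather than removed, so if the consumer fails before deleting it, the message reappears and is retried.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Message:
    body: str
    receive_count: int = 0


class ToyQueue:
    """In-memory stand-in for a queue with receive/delete semantics."""

    def __init__(self) -> None:
        self._visible = deque()   # messages available to consumers
        self._in_flight = {}      # handle -> message received but not yet deleted
        self._next_handle = 0

    def send(self, body: str) -> None:
        self._visible.append(Message(body))

    def receive(self):
        """Return (handle, message), or None if nothing is visible."""
        if not self._visible:
            return None
        msg = self._visible.popleft()
        msg.receive_count += 1
        handle = self._next_handle
        self._next_handle += 1
        self._in_flight[handle] = msg
        return handle, msg

    def delete(self, handle: int) -> None:
        """Acknowledge successful processing; the message is gone for good."""
        self._in_flight.pop(handle)

    def expire_visibility(self) -> None:
        """Simulate the timeout: unacknowledged messages become visible again."""
        for handle in list(self._in_flight):
            self._visible.append(self._in_flight.pop(handle))


def run_consumer(queue: ToyQueue, handler) -> list:
    """Process messages until the queue is drained; failed ones are retried."""
    processed = []
    while True:
        item = queue.receive()
        if item is None:
            queue.expire_visibility()  # let failed messages come back
            item = queue.receive()
            if item is None:
                break
        handle, msg = item
        try:
            handler(msg)
        except RuntimeError:
            continue  # simulated crash: no delete, so the message will reappear
        queue.delete(handle)
        processed.append(msg.body)
    return processed


def flaky_handler(msg: Message) -> None:
    # Fails the first time it sees "order-2", succeeds on the retry.
    if msg.body == "order-2" and msg.receive_count == 1:
        raise RuntimeError("consumer crashed mid-processing")


queue = ToyQueue()
for body in ("order-1", "order-2", "order-3"):
    queue.send(body)
processed = run_consumer(queue, flaky_handler)
```

All three messages end up processed even though the consumer failed once. The failed message is handled after the others, so ordering is not preserved in this sketch, and a message can be delivered more than once, so consumers in this pattern should be idempotent.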
