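Availability (the system is up and accessible) and reliability (the system works correctly) are key non-functional requirements. Achieving them involves redundancy, fault tolerance, eliminating single points of failure, and handling failures gracefully.
Availability vs Reliability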
AVAILABILITY → the system is UP and responsive (accessible when needed):
→ measured as uptime % ("nines": 99.9% = ~8.7h/year down; 99.99% = ~52min/year)
RELIABILITY → the system works CORRECTLY (does what it should, without failures/errors):
→ related but distinct (a system can be up but returning wrong results — available but
unreliable)
→ both matter: users need the system available AND working correctly.
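
The "nines" figures above follow from simple arithmetic, sketched below. The function name `downtime_hours_per_year` and the 365-day year are assumptions made for this illustration, not part of any standard definition.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; leap years are ignored in this sketch


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget (in hours) implied by an uptime percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * HOURS_PER_YEAR


if __name__ == "__main__":
    for nines in (99.9, 99.99):
        hours = downtime_hours_per_year(nines)
        print(f"{nines}% uptime allows about {hours:.2f} h ({hours * 60:.1f} min) of downtime per year")
```

Three nines work out to roughly 8.76 hours per year and four nines to roughly 52.6 minutes, which matches the rounded figures quoted above.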
Achieving High Availability
✓ REDUNDANCY → multiple instances/copies → no single point of failure (if one fails,
others serve) — the core principle
✓ Spread across AVAILABILITY ZONES / regions → survive data center/region failures
✓ FAILOVER → automatically switch to backups when something fails
✓ LOAD BALANCING + health checks → route around failed instances
✓ Database replication; eliminate SINGLE POINTS OF FAILURE everywhere
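
As a rough illustration of how redundancy, health checks, and failover fit together, here is a minimal in-memory sketch of a round-robin load balancer that skips unhealthy instances. The `Instance` and `LoadBalancer` classes are hypothetical, and the `healthy` flag stands in for whatever a real health probe would report; a production balancer would also need timeouts, retries, and concurrency control.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical backend instance; `healthy` would be set by a real health check."""
    name: str
    healthy: bool = True


class LoadBalancer:
    """Round-robin selection over redundant instances, routing around failed ones."""

    def __init__(self, instances):
        self._instances = list(instances)
        if not self._instances:
            raise ValueError("at least one instance is required")
        self._next = 0  # index where the next round-robin scan starts

    def pick(self) -> Instance:
        """Return the next healthy instance, or raise if every instance is down."""
        total = len(self._instances)
        for offset in range(total):
            index = (self._next + offset) % total
            candidate = self._instances[index]
            if candidate.healthy:
                self._next = (index + 1) % total
                return candidate
        raise RuntimeError("no healthy instances available")


if __name__ == "__main__":
    pool = [Instance("a"), Instance("b"), Instance("c")]
    balancer = LoadBalancer(pool)
    pool[1].healthy = False  # simulate a failed health check on instance "b"
    print([balancer.pick().name for _ in range(4)])
```

With one instance marked unhealthy, traffic continues to flow to the remaining two. Only when every instance is down does `pick` fail, which is the situation that spreading instances across zones or regions is meant to make unlikely.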
