What are availability and reliability in system design?

Question

What are availability and reliability in system design?

Accepted Answer

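**Availability** (the system is up and accessible) and **reliability** (the system works correctly) are key non-functional requirements. Achieving them involves redundancy, fault tolerance, eliminating single points of failure, and handling failures gracefully.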

## Availability vs reliability

```text
AVAILABILITY → the system is UP and responsive (accessible when needed):
  → measured as uptime % ("nines": 99.9% = ~8.8h/year down; 99.99% = ~53min/year)
RELIABILITY → the system works CORRECTLY (does what it should, without failures/errors):
  → related but distinct (a system can be up but returning wrong results — available but
    unreliable)
→ both matter: users need the system available AND working correctly.
```
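The "nines" figures follow directly from the uptime percentage. As a minimal sketch (the function name `downtime_minutes_per_year` is made up for illustration, and a 365-day year is assumed), the yearly downtime budget can be computed like this:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a non-leap year


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime percentage."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    return (1.0 - availability_pct / 100.0) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Print the budget for a few common availability targets.
    for pct in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% uptime allows {minutes / 60:.2f} h ({minutes:.1f} min) of downtime per year")
```

Each additional nine cuts the allowed downtime by a factor of ten, which is why every extra nine tends to cost noticeably more to achieve.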

## Achieving high availability

```text
✓ REDUNDANCY → multiple instances/copies → no single point of failure (if one fails,
  others serve) — the core principle
✓ Spread across AVAILABILITY ZONES / regions → survive data center/region failures
✓ FAILOVER → automatically switch to backups when something fails
✓ LOAD BALANCING + health checks → route around failed instances
✓ Database replication; eliminate SINGLE POINTS OF FAILURE everywhere
```
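To make "load balancing + health checks" concrete, here is a rough sketch of a round-robin balancer that routes around failed instances. The `Instance` and `RoundRobinBalancer` names are hypothetical, and the `healthy` flag stands in for whatever a real health check (for example a periodic HTTP probe) would report:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    name: str
    healthy: bool = True  # in a real system, updated by a periodic health check


class RoundRobinBalancer:
    """Rotate through instances, skipping any that are marked unhealthy."""

    def __init__(self, instances: List[Instance]) -> None:
        if not instances:
            raise ValueError("at least one instance is required")
        self._instances = list(instances)
        self._next = 0  # index where the next search starts

    def pick(self) -> Instance:
        n = len(self._instances)
        # Look at each instance at most once, starting from the rotation point.
        for offset in range(n):
            candidate = self._instances[(self._next + offset) % n]
            if candidate.healthy:
                self._next = (self._next + offset + 1) % n
                return candidate
        raise RuntimeError("no healthy instances available")


if __name__ == "__main__":
    a, b, c = Instance("a"), Instance("b"), Instance("c")
    balancer = RoundRobinBalancer([a, b, c])
    b.healthy = False  # simulate a failed health check
    print([balancer.pick().name for _ in range(4)])
```

With `b` marked unhealthy, traffic alternates between `a` and `c`. Note that the balancer itself becomes a single point of failure unless it is also run redundantly.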

## Building reliable systems

```text
✓ Design for FAILURE → assume things WILL fail; handle it gracefully (failures are normal
  at scale)
✓ FAULT TOLERANCE → continue working despite component failures (retries, fallbacks,
  circuit breakers, graceful degradation)
✓ MONITORING → detect issues; backups/recovery for data; test failure scenarios
✓ Avoid CASCADING failures (one failure triggering others) → isolation, timeouts
```
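One common way to combine fallbacks, graceful degradation, and protection against cascading failures is a circuit breaker. The following is a simplified sketch, not a production implementation: the class name, the thresholds, and the injectable `clock` parameter are all assumptions chosen for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Fail fast to a fallback after repeated failures, then retry after a timeout."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: skip the failing dependency entirely and degrade gracefully.
                return fallback()
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            return fallback()
        # Success: close the circuit and reset the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get the fallback immediately instead of waiting on a dependency that is already struggling, which helps keep one failure from spreading to its callers.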

## Why this matters

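Understanding availability and reliability is foundational because they are **key non-functional requirements of production systems**, so designing for them is essential system design knowledge. **Availability** (the system is up and accessible, in terms of uptime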