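Distributed systems (multiple computers cooperating over a network) bring major challenges that do not exist on a single machine: unreliable networks, partial failures, consistency, coordination, and more. Understanding these challenges is essential for designing systems at scale.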
Why distributed systems are hard
Multiple machines communicating over a NETWORK introduce fundamental challenges:
→ the NETWORK is unreliable (latency, packet loss, partitions) and not instant
→ PARTIAL FAILURES → some parts fail while others work (vs all-or-nothing on one machine)
→ no shared memory/clock → coordination is hard
→ assumptions such as "the network is reliable" and "latency is zero" are among the well-known FALLACIES of distributed computing; real distributed systems break them.
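To make the unreliable-network point concrete, here is a minimal sketch of retrying a remote call with exponential backoff. Everything in it is an assumption made for illustration: `TransientNetworkError`, `call_with_retry`, and `make_flaky_operation` are hypothetical names, and the "network" is simulated in-process rather than taken from a real library.

```python
import random
import time


class TransientNetworkError(Exception):
    """Hypothetical error standing in for a timeout, dropped packet, or partition."""


def call_with_retry(operation, max_attempts=4, base_delay=0.01, sleep=time.sleep):
    """Call `operation`, retrying on TransientNetworkError with exponential backoff.

    Returns the operation's result, or re-raises the last error once
    `max_attempts` attempts have failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientNetworkError:
            if attempt == max_attempts:
                raise
            # Double the delay on each attempt and add jitter so that many
            # clients do not all retry at the same instant.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay))


def make_flaky_operation(failures_before_success):
    """Build a fake remote call that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientNetworkError("simulated timeout")
        return f"ok after {state['calls']} calls"

    return operation
```

A caller might write `call_with_retry(make_flaky_operation(2))`, which succeeds on the third attempt. Note that a retry can re-send a request the remote side has already processed (only the reply was lost), so retried operations should be safe to repeat.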
Key challenges
NETWORK → unreliable, variable latency, partitions (can't assume messages arrive/are fast)
PARTIAL FAILURE → handle some components failing while others keep working (detect, retry, recover); from the outside, a slow node looks the same as a dead one
CONSISTENCY → keeping data consistent across nodes (CAP trade-offs; eventual consistency)
COORDINATION → consensus, distributed agreement, leader election (hard; e.g. Raft/Paxos)
ORDERING/TIME → no global clock; ordering events across nodes is hard
CONCURRENCY → many things happening at once; race conditions across nodes
IDEMPOTENCY → handle duplicate messages/retries safely
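The idempotency item can be sketched as a receiver that remembers which request IDs it has already applied. This is a minimal in-memory illustration under stated assumptions, not a production design: `AccountService`, `deposit`, and the `request_id` values are hypothetical, and a real service would need durable, bounded storage for the seen IDs.

```python
class AccountService:
    """Toy in-memory service that applies each deposit at most once."""

    def __init__(self):
        self.balance = 0
        # Maps request_id -> the result returned when it was first applied.
        self._seen = {}

    def deposit(self, request_id, amount):
        # A duplicate (client retry or message redelivery) returns the stored
        # result instead of applying the deposit a second time.
        if request_id in self._seen:
            return self._seen[request_id]
        self.balance += amount
        self._seen[request_id] = self.balance
        return self.balance


service = AccountService()
service.deposit("req-1", 100)
service.deposit("req-1", 100)  # duplicate delivery of the same request: ignored
service.deposit("req-2", 50)
print(service.balance)  # 150
```

The design choice here is that the client generates the request ID once and reuses it on every retry, so the receiver can tell a retry apart from a genuinely new request.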
