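Running Kafka in production requires attention to cluster sizing, replication, monitoring, security, and maintenance: Kafka is a powerful but operationally complex distributed system. Understanding these operational considerations (or using a managed Kafka service) is essential for running it reliably.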
Cluster setup and reliability
✓ Adequate BROKERS → size the cluster for throughput, storage, and replication needs
✓ REPLICATION → replication factor ≥ 3, with min.insync.replicas (commonly 2) and producer acks=all, so acknowledged writes survive a broker failure
✓ Spread across racks/AZs → survive failures (rack awareness)
✓ PARTITIONS → plan partition counts for parallelism and growth (hard to reduce later)
✓ Capacity planning → throughput, retention/storage, growth
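
The sizing items above can be turned into rough arithmetic. The following is a minimal sketch, not a sizing tool: the function name `plan_cluster`, the 70% disk headroom, and all input figures are illustrative assumptions, and real numbers should come from measured traffic.

```python
import math


def plan_cluster(ingest_mb_per_s, retention_hours, replication_factor,
                 min_insync_replicas, disk_per_broker_gb,
                 max_disk_utilization=0.7):
    """Rough storage and broker estimate; every input is an assumption."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError(
            "min.insync.replicas must be between 1 and the replication factor")

    # Data retained for a single copy, in GB (1 GB = 1024 MB here).
    retained_gb = ingest_mb_per_s * 3600 * retention_hours / 1024
    # Every replica stores a full copy of the retained data.
    replicated_gb = retained_gb * replication_factor
    # Keep headroom so that disks do not run full (hypothetical 70% ceiling).
    usable_per_broker_gb = disk_per_broker_gb * max_disk_utilization
    brokers_for_storage = math.ceil(replicated_gb / usable_per_broker_gb)

    return {
        "replicated_storage_gb": round(replicated_gb, 1),
        # Replicas of one partition must sit on distinct brokers.
        "min_brokers": max(brokers_for_storage, replication_factor),
        # With acks=all, writes keep succeeding while this many replicas are down.
        "replica_failures_tolerated_for_writes":
            replication_factor - min_insync_replicas,
    }


# Example inputs (made up): 50 MB/s ingest, 3 days retention, 4 TB per broker.
plan = plan_cluster(ingest_mb_per_s=50, retention_hours=72,
                    replication_factor=3, min_insync_replicas=2,
                    disk_per_broker_gb=4000)
print(plan)
```

This estimate covers storage only. Throughput per broker, partition counts, and growth still need their own margin on top of it.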
Monitoring and maintenance
✓ MONITOR → consumer lag, broker health, under-replicated partitions, throughput, disk usage (the most important signals to watch)
✓ Alerting on problems (lag, under-replication, broker down, disk full)
✓ Manage DISK → balance retention settings against available storage; a full disk can take a broker offline
✓ UPGRADES, rebalancing partitions, scaling brokers; backup/disaster recovery
✓ Cluster metadata management → KRaft (or ZooKeeper on older deployments)
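
The monitoring and alerting items above can be sketched as a simple rule check. This is an illustration under assumptions: the `snapshot` dictionary layout, the function names, and the thresholds are hypothetical, and in practice the values would come from broker metrics and consumer group offsets rather than hard-coded data.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Total lag: sum over partitions of (log end offset - committed offset)."""
    # Assumption: a partition with no committed offset counts from offset 0.
    return sum(max(end - committed_offsets.get(partition, 0), 0)
               for partition, end in log_end_offsets.items())


def evaluate_alerts(snapshot, max_lag=10_000, max_disk_fraction=0.8):
    """Return a list of alert messages for one metrics snapshot."""
    alerts = []

    lag = consumer_lag(snapshot["log_end_offsets"],
                       snapshot["committed_offsets"])
    if lag > max_lag:
        alerts.append(f"consumer lag {lag} exceeds {max_lag}")

    urp = snapshot["under_replicated_partitions"]
    if urp > 0:
        alerts.append(f"{urp} under-replicated partition(s)")

    # Brokers that should be in the cluster but are not reporting as live.
    for broker in sorted(snapshot["expected_brokers"] - snapshot["live_brokers"]):
        alerts.append(f"broker {broker} is down")

    for broker, used in sorted(snapshot["disk_used_fraction"].items()):
        if used >= max_disk_fraction:
            alerts.append(f"broker {broker} disk at {used:.0%}")

    return alerts


# Made-up snapshot for illustration only.
snapshot = {
    "log_end_offsets": {0: 1500, 1: 900},
    "committed_offsets": {0: 1000, 1: 900},
    "under_replicated_partitions": 2,
    "expected_brokers": {1, 2, 3},
    "live_brokers": {1, 3},
    "disk_used_fraction": {1: 0.55, 3: 0.91},
}
alerts = evaluate_alerts(snapshot, max_lag=100)
for message in alerts:
    print(message)
```

Alerting on disk usage well before 100% leaves time to adjust retention or add capacity before a broker fails.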
