Operating Kafka in production requires attention to cluster sizing, replication, monitoring, security, and maintenance: Kafka is a powerful but operationally complex distributed system. Running it reliably means either understanding these operational considerations or using a managed Kafka service.
Cluster setup and reliability
✓ Adequate BROKERS → size the cluster for throughput, storage, and replication needs
✓ REPLICATION → replication factor ≥ 3, with min.insync.replicas=2 and producer acks=all, so acknowledged writes survive a single broker failure (durability)
✓ Spread across racks/AZs → set broker.rack so each partition's replicas land in different racks/AZs and survive a rack or AZ failure (rack awareness)
✓ PARTITIONS → plan partition counts for parallelism and growth (a topic's partition count can be increased but not decreased, and increasing it changes which partition a key maps to)
✓ Capacity planning → throughput, retention/storage, growth
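The sizing and replication items in the checklist above can be turned into rough arithmetic. The following is a minimal sketch, not a sizing tool: the Workload class, the function names, the 40% headroom, and every figure in the usage example are illustrative assumptions, and real broker limits should come from load tests on your own hardware.

```python
import math
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical workload description; all figures are illustrative."""
    write_mb_per_sec: float       # average producer ingress across all topics
    retention_hours: float        # how long data is retained
    replication_factor: int = 3   # copies of each partition
    headroom: float = 0.4         # assumed fraction of disk kept free for growth


def total_storage_gb(w: Workload) -> float:
    """Cluster-wide disk: ingress x retention x replicas, scaled up for headroom."""
    raw_mb = w.write_mb_per_sec * 3600 * w.retention_hours * w.replication_factor
    return raw_mb / 1024 / (1 - w.headroom)


def brokers_needed(w: Workload, disk_gb_per_broker: float,
                   write_mb_per_broker: float) -> int:
    """Broker count that satisfies storage, write throughput, and replication."""
    by_storage = math.ceil(total_storage_gb(w) / disk_gb_per_broker)
    # Every produced byte is written replication_factor times across the cluster.
    by_throughput = math.ceil(
        w.write_mb_per_sec * w.replication_factor / write_mb_per_broker
    )
    # A cluster needs at least replication_factor brokers to place all replicas.
    return max(by_storage, by_throughput, w.replication_factor)


def tolerated_replica_failures(replication_factor: int,
                               min_insync_replicas: int) -> int:
    """Replicas of a partition that may be down while acks=all writes still succeed."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("min.insync.replicas must be between 1 and the replication factor")
    return replication_factor - min_insync_replicas


if __name__ == "__main__":
    # Assumed example: 50 MB/s ingress, 7 days of retention.
    w = Workload(write_mb_per_sec=50, retention_hours=168)
    print(f"storage needed: {total_storage_gb(w):.0f} GB")
    print("brokers:", brokers_needed(w, disk_gb_per_broker=8000, write_mb_per_broker=100))
    print("tolerated replica failures:", tolerated_replica_failures(3, 2))
```

Taking the maximum of the three constraints reflects that a cluster is usually bound by one of them (often storage when retention is long), and the last function shows why replication factor 3 with min.insync.replicas=2 is a common pairing: one replica can be lost without blocking acks=all producers.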
