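Handling errors and failed messages is essential for a reliable Kafka consumer: when processing a message fails, you must decide what to do (retry, skip, or route to a dead letter queue). Proper error handling prevents data loss and stalled consumers.
The problem: processing failures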
When a consumer fails to PROCESS a message (bad data, downstream failure, bug):
→ BLOCKING retry forever → the consumer gets STUCK on a "poison" message (can't progress)
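→ skipping silently → data LOSS (the message is dropped with no record of the failure)
→ crashing → the consumer restarts, resumes from the last committed offset, hits the same message, and may get stuck again (crash loop)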
→ need a deliberate error-handling strategy.
Error-handling strategies
✓ RETRY (with limits) → retry transient failures (with backoff); but LIMIT retries (don't
retry forever on a permanent failure)
✓ DEAD LETTER QUEUE (DLQ) → after retries fail, send the message to a separate DLQ TOPIC →
the consumer moves on (not stuck); the DLQ is inspected/reprocessed later
→ the standard way to handle messages that can't be processed (avoids blocking + loss)
✓ Distinguish TRANSIENT (e.g. downstream failure → retry) vs PERMANENT (e.g. bad data → DLQ/skip) failures
✓ IDEMPOTENT processing → safe retries/reprocessing (at-least-once delivery means DUPLICATES can occur)
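
The strategies above can be combined in one small wrapper around the message handler. The following is a minimal sketch, not a definitive implementation. It deliberately uses no Kafka client library: `ErrorHandlingProcessor`, `TransientError`, `PermanentError`, `handler`, and `send_to_dlq` are all hypothetical names introduced for illustration, and `send_to_dlq` stands in for whatever code publishes to the DLQ topic. The in-memory `seen` set is a simplification of idempotency; a real consumer would more likely need a durable store or naturally idempotent writes.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Set, Tuple


class TransientError(Exception):
    """Failure that may succeed on retry (e.g. a downstream timeout)."""


class PermanentError(Exception):
    """Failure that will never succeed on retry (e.g. malformed data)."""


@dataclass
class ErrorHandlingProcessor:
    handler: Callable[[bytes], None]            # business logic for one message
    send_to_dlq: Callable[[bytes, str], None]   # hypothetical: publishes to the DLQ topic
    max_retries: int = 3                        # retries AFTER the first attempt
    base_delay: float = 0.1                     # seconds; doubled on each retry
    sleep: Callable[[float], None] = time.sleep # injectable so tests do not wait
    seen: Set[Tuple[str, int, int]] = field(default_factory=set)

    def process(self, topic: str, partition: int, offset: int, value: bytes) -> str:
        """Return 'ok', 'duplicate', or 'dlq'.

        Exceptions other than TransientError/PermanentError propagate to the caller.
        """
        key = (topic, partition, offset)
        if key in self.seen:
            # Redelivered message (at-least-once): skip it instead of redoing the work.
            return "duplicate"

        reason = "unknown"
        for attempt in range(self.max_retries + 1):
            try:
                self.handler(value)
                self.seen.add(key)
                return "ok"
            except PermanentError as exc:
                # Retrying cannot help: go straight to the DLQ.
                reason = f"permanent: {exc}"
                break
            except TransientError as exc:
                reason = f"retries exhausted: {exc}"
                if attempt < self.max_retries:
                    # Exponential backoff before the next attempt.
                    self.sleep(self.base_delay * 2 ** attempt)

        # Limited retries failed (or the failure was permanent): park the
        # message in the DLQ so the consumer can move on without losing it.
        self.send_to_dlq(value, reason)
        self.seen.add(key)
        return "dlq"
```

In a real consumer loop, `process` would be called once per polled record, and the offset would be committed after it returns, whatever the result, since every outcome leaves the message either handled or parked in the DLQ.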
