Kafka Summit 2023 - Reliable Message Processing Patterns for Kafka

Failures are inevitable in distributed systems. We often come across unreliable networks, botched-up downstream systems, and rogue message payloads, forcing our applications to detect and handle failures as gracefully as possible.

After accepting a message, Kafka stores it durably, allowing consumers to process it at their own pace. From that point on, the consumer is responsible for processing the message reliably and handling failures efficiently.

This talk discusses several error-handling patterns you can implement in Kafka consumer applications. We will explore different approaches to handling transient and non-transient errors and highlight the use of dead letter topics in Kafka for message reprocessing. Finally, we will walk through the code of a Spring Kafka application to showcase blocking and non-blocking message retry scenarios.

Dunith Dhanushka

May 17, 2023

Transcript

  1. © 2023 REDPANDA DATA A little about me…

    Dunith Dhanushka, Senior Developer Advocate, Redpanda Data
    • Event streaming, real-time analytics, and stream processing enthusiast
    • Frequent blogger, speaker, and educator
    @dunithd linkedin.com/in/dunithd
  2. Agenda

    1. Use case
    2. Transient and non-transient errors - overview
    3. Dead letter topics
    4. Handling transient and non-transient errors
    5. Q & A
  3. What could possibly happen here?

    Possible outcomes
    The happy path
    • The order will be processed as expected.
    • Sunny day scenario.
    Otherwise?
    • Processing will fail.
  4. “Anything that can go wrong will go wrong, and at the worst possible time.”

    Murphy’s law
  5. Possible causes for consumer failures

    Why would order processing fail? Two types of errors:
    1. Transient errors: unpredicted and short-lived errors in software/hardware/network components.
    2. Non-transient errors: errors that persist over time and cannot be easily resolved through automatic recovery or failover mechanisms.
  6. Transient errors

    Short-lived errors that are recoverable
    Temporary errors that occur in computer systems or networks, typically caused by:
    • Temporary disruptions in network connectivity
    • Hardware failures
    • Software glitches or other similar factors
    They are recoverable.
  7. Non-transient errors

    Not recoverable
    Non-transient errors are deterministic: the message fails every time it is consumed, no matter how many times it is reprocessed. Reprocessing produces the same result, so retrying indefinitely causes an infinite loop that wastes precious computational resources.
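
A consumer can only pick the right strategy if it can tell the two error types apart. The following is a minimal sketch of such a classifier in plain Java. The class name `ErrorClassifier` and the choice of which exceptions count as transient are assumptions made for illustration; they do not come from the talk.

```java
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.util.concurrent.TimeoutException;

/** Hypothetical classifier; the exception-to-category mapping is an assumption. */
final class ErrorClassifier {

    enum ErrorKind { TRANSIENT, NON_TRANSIENT }

    private ErrorClassifier() { }

    /** Treats timeouts and connection failures as transient, everything else as non-transient. */
    static ErrorKind classify(Throwable error) {
        // Walk the cause chain so that wrapped network errors are still recognized.
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof SocketTimeoutException
                    || t instanceof ConnectException
                    || t instanceof TimeoutException) {
                return ErrorKind.TRANSIENT;
            }
        }
        // Deterministic failures (bad payload, failed validation, NPE) would fail again on retry.
        return ErrorKind.NON_TRANSIENT;
    }
}
```

Defaulting unknown exceptions to non-transient keeps the consumer from retrying forever; a team that prefers the opposite default would flip the final return value.
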
  8. Dead Letter Queue (DLQ)

    A place where you can route failed messages for reprocessing
  9. DLQ in the context of Kafka

    There are no native DLQs in Kafka!
    • You can appoint a regular Kafka topic as the dead letter topic (DLT).
    • Typically, one DLT per source topic.
    • Usually, the DLT name follows the pattern: <source_topic_name>-dlt
  10. Malformed message payloads

    • Errors in deserializing string/binary encoded messages at the consumer, e.g., XML, JSON, Avro, Protobuf.
    • These are usually caught early in the processing pipeline by deserializers.
    • Errors are logged and the message is dropped.
  11. We should route the malformed messages to the DLT!

    They can be corrected and reprocessed later…
  12. Routing malformed messages to the DLT

    How Spring Kafka uses the ErrorHandlingDeserializer to catch deserialization errors
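
The slide's diagram is not part of the transcript, so the general idea is sketched here in plain Java rather than with Spring's classes: a wrapper calls the real deserializer, catches its exception, and hands the failure back as data so the caller can forward the raw bytes to the DLT. `SafeDeserializer` and `Result` are hypothetical names, and this is not Spring Kafka's implementation of `ErrorHandlingDeserializer`.

```java
import java.util.function.Function;

/** Plain-Java illustration only; not Spring Kafka's ErrorHandlingDeserializer. */
final class SafeDeserializer<T> {

    /** Either a decoded value, or the failure together with the raw bytes that caused it. */
    static final class Result<V> {
        final V value;
        final Exception failure;
        final byte[] rawPayload;

        Result(V value, Exception failure, byte[] rawPayload) {
            this.value = value;
            this.failure = failure;
            this.rawPayload = rawPayload;
        }

        boolean failed() {
            return failure != null;
        }
    }

    private final Function<byte[], T> delegate;

    SafeDeserializer(Function<byte[], T> delegate) {
        this.delegate = delegate;
    }

    Result<T> deserialize(byte[] payload) {
        try {
            return new Result<>(delegate.apply(payload), null, payload);
        } catch (RuntimeException e) {
            // Capture the failure instead of throwing, so the consumer loop keeps going
            // and the caller can decide to send the raw bytes to the DLT.
            return new Result<>(null, e, payload);
        }
    }
}
```

Keeping the raw payload next to the exception matters because a malformed message cannot be re-serialized from a typed object; the original bytes are the only thing left to route.
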
  13. Case 1: The message fails rule validation

    Although the deserialization succeeds
    For example:
    • Missing fields in the payload, e.g., the customerId is missing in the order.
    • Validation failures, e.g., the amount is negative.
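
The two examples above can be expressed as a small validation step that runs after deserialization. This is a sketch under assumptions: the `Order` shape and the `OrderValidator` name are hypothetical, and only the `customerId` and amount checks come from the slide.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

final class OrderValidator {

    /** Hypothetical order payload; only customerId and amount are mentioned in the slide. */
    static final class Order {
        final String customerId;
        final BigDecimal amount;

        Order(String customerId, BigDecimal amount) {
            this.customerId = customerId;
            this.amount = amount;
        }
    }

    private OrderValidator() { }

    /** Returns every rule violation found; an empty list means the order is valid. */
    static List<String> validate(Order order) {
        List<String> violations = new ArrayList<>();
        if (order.customerId == null || order.customerId.isBlank()) {
            violations.add("customerId is missing");
        }
        if (order.amount == null) {
            violations.add("amount is missing");
        } else if (order.amount.signum() < 0) {
            violations.add("amount is negative");
        }
        return violations;
    }
}
```

Collecting all violations instead of stopping at the first one gives whoever inspects the DLT the full picture in a single pass.
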
  14. Case 2: Consumer encounters an error

    The fault in the consumer’s processing logic
    Although the message is perfect, it might trigger an error in the consumer’s processing logic, causing processing to fail. This time, the error is with the consumer. For example:
    • The consumer throws a NullPointerException (NPE).
    • Other RuntimeExceptions
  15. We should route them to the DLT as well.

    They can be corrected and reprocessed later…
  16. Routing them to the DLT

    Log the exception and continue. Let Spring route the message to the DLT.
    In Spring Kafka, you can use the DeadLetterPublishingRecoverer class to route failed messages to the DLT. It can be configured with a KafkaTemplate.
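
To make the "log and continue" flow concrete without depending on Spring, here is an in-memory sketch of what a DLT router does: derive the DLT name from the source topic, attach failure context, and store the message. `DeadLetterRouter` and the header names are hypothetical; in a real application the `topics` map would be replaced by a Kafka producer, or by Spring's `DeadLetterPublishingRecoverer`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for a DLT publisher; no Kafka client is involved. */
final class DeadLetterRouter {

    /** A failed message plus headers describing why and where it failed. */
    static final class DeadLetter {
        final String payload;
        final Map<String, String> headers;

        DeadLetter(String payload, Map<String, String> headers) {
            this.payload = payload;
            this.headers = headers;
        }
    }

    // Topic name -> messages routed there; a real application would publish to Kafka instead.
    final Map<String, List<DeadLetter>> topics = new LinkedHashMap<>();

    /** Follows the <source_topic_name>-dlt naming pattern. */
    static String dltNameFor(String sourceTopic) {
        return sourceTopic + "-dlt";
    }

    /** Attaches failure context and stores the message under the DLT name, which is returned. */
    String route(String sourceTopic, long offset, String payload, Exception cause) {
        Map<String, String> headers = new LinkedHashMap<>();
        // Header names are hypothetical; use whatever your team standardizes on.
        headers.put("x-original-topic", sourceTopic);
        headers.put("x-original-offset", Long.toString(offset));
        headers.put("x-exception-class", cause.getClass().getName());
        headers.put("x-exception-message", String.valueOf(cause.getMessage()));

        String dlt = dltNameFor(sourceTopic);
        topics.computeIfAbsent(dlt, name -> new ArrayList<>()).add(new DeadLetter(payload, headers));
        return dlt;
    }
}
```

A consumer would call `route(...)` from its catch block and then carry on with the next record, so one bad message does not stall the partition.
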
  17. How to reprocess messages in the DLT?

    Some best practices
    • Manual recovery with human intervention.
    • Add more context before sending a message to the DLT.
    • The producer team should own malformed messages and fix them, e.g., the producer might be using an older schema version.
    • Notify the producer about the failure.
  18. Consumer should retry several times

    Transient errors are recoverable at the consumer’s end
    • The recommended way to handle a transient error is to retry multiple times, with fixed or incremental intervals in between (backoff).
    • If all retry attempts fail, you can redirect the message to the DLT and move on.
    • Retrying can be implemented synchronously or asynchronously on the consumer side.
  19. Case 1: Simple blocking retries

    Suspend the consumer thread and reprocess the failed message without calling Consumer.poll() during the retries.
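
A blocking retry boils down to a loop on the consumer thread. The sketch below uses a fixed backoff and a recoverer callback that stands in for "send to the DLT"; `BlockingRetry` and its parameters are hypothetical names, and Spring Kafka's own error handlers are not used here.

```java
import java.util.function.Consumer;

final class BlockingRetry {

    private BlockingRetry() { }

    /**
     * Runs the handler up to maxAttempts times on the calling thread, sleeping
     * backoffMillis between attempts. Returns the attempt number that succeeded,
     * or -1 after handing the message to the recoverer.
     */
    static <T> int process(T message, Consumer<T> handler, Consumer<T> recoverer,
                           int maxAttempts, long backoffMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handler.accept(message);
                return attempt;
            } catch (RuntimeException e) {
                if (attempt < maxAttempts) {
                    // The thread is parked here, so no other record is processed meanwhile.
                    sleep(backoffMillis);
                }
            }
        }
        recoverer.accept(message); // for example, publish to the DLT and move on
        return -1;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new IllegalStateException("interrupted during backoff", ie);
        }
    }
}
```

With three attempts and a one-second backoff, a single failing message can hold the thread for two seconds of sleep plus processing time. If the total exceeds the consumer's `max.poll.interval.ms`, the group coordinator may consider the consumer dead and trigger a rebalance.
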
  20. Drawbacks

    • Main consumer thread is blocked.
    • Not ideal for high-throughput message processing scenarios.
    • Waste of computational resources.
  21. Case 2: Non-blocking retry with a single retry topic and fixed backoff
  22. Case 3: Non-blocking retry with multiple retry topics and an exponential backoff

    Inspired by the Netflix blog on the same topic.
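
With multiple retry topics, the main consumer never waits: each failure forwards the message to the next retry topic, and each retry topic carries a longer delay. The sketch below only computes that routing plan. The `-retry-<n>` suffix, the `RetryTopicPlan` class, and its parameters are assumptions for illustration; Spring Kafka provides this pattern through its `@RetryableTopic` support, which is not used here.

```java
final class RetryTopicPlan {

    /** Where to publish a failed message next, and how long to wait before consuming it. */
    static final class Hop {
        final String topic;
        final long delayMillis;

        Hop(String topic, long delayMillis) {
            this.topic = topic;
            this.delayMillis = delayMillis;
        }
    }

    private final String sourceTopic;
    private final int maxRetries;
    private final long initialDelayMillis;
    private final double multiplier;

    RetryTopicPlan(String sourceTopic, int maxRetries, long initialDelayMillis, double multiplier) {
        this.sourceTopic = sourceTopic;
        this.maxRetries = maxRetries;
        this.initialDelayMillis = initialDelayMillis;
        this.multiplier = multiplier;
    }

    /**
     * failedAttempts is the number of attempts that have already failed
     * (1 right after the first failure on the main topic).
     */
    Hop next(int failedAttempts) {
        if (failedAttempts < 1) {
            throw new IllegalArgumentException("failedAttempts must be >= 1");
        }
        if (failedAttempts > maxRetries) {
            // Retries are exhausted: park the message in the DLT.
            return new Hop(sourceTopic + "-dlt", 0L);
        }
        int retryIndex = failedAttempts - 1;
        // Exponential backoff: initialDelay * multiplier^retryIndex.
        long delay = Math.round(initialDelayMillis * Math.pow(multiplier, retryIndex));
        return new Hop(sourceTopic + "-retry-" + retryIndex, delay);
    }
}
```

For `new RetryTopicPlan("orders", 3, 1000L, 2.0)`, the hops are `orders-retry-0` after 1 s, `orders-retry-1` after 2 s, `orders-retry-2` after 4 s, and finally `orders-dlt`.
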
  23. Takeaways

    • Consumer failure scenarios can be broadly categorized into transient and non-transient errors.
    • Malformed payloads, business rule validation failures, and consumer errors are possible causes of non-transient errors.
    • Consumers should detect non-transient errors as early as possible and move the affected messages to the DLT for manual reprocessing.
    • Consumers should implement retry strategies to handle transient errors.
    • Prefer asynchronous retries when the message throughput is high.
    • If all retry attempts fail, the message can be moved to the DLT.
  24. Keep learning

    Redpanda University https://university.redpanda.com
    Redpanda Docs https://docs.redpanda.com/
    Redpanda Blogs https://redpanda.com/blog
    Redpanda Code https://github.com/redpanda-data
  25. Thanks for joining! Let’s keep in touch

    @redpandadata redpanda-data redpanda-data [email protected]