Slide 1

Slide 1 text

CONFIDENTIAL: FOR INTERNAL USE ONLY © 2023 REDPANDA DATA

Slide 2

Slide 2 text

A little about me…

Dunith Dhanushka
Senior Developer Advocate, Redpanda Data
● Event streaming, real-time analytics, and stream processing enthusiast
● Frequent blogger, speaker, and educator
@dunithd
linkedin.com/in/dunithd

Slide 3

Slide 3 text

Agenda
1. Use case
2. Transient and non-transient errors - overview
3. Dead letter topics
4. Handling transient and non-transient errors
5. Q & A

Slide 4

Slide 4 text

The problem
How do you avoid losing an expensive message?

Slide 5

Slide 5 text

Use case
Processing an expensive message
E-commerce order processing…

Slide 6

Slide 6 text

What could possibly happen here?
Possible outcomes

The happy path
● The order will be processed as expected.
● Sunny day scenario.

Otherwise?
● Processing will fail.

Slide 7

Slide 7 text

“Anything that can go wrong will go wrong, and at the worst possible time.”
Murphy’s law

Slide 8

Slide 8 text

Possible causes for consumer failures
Why would order processing fail?

Two types of errors:
1. Transient errors
   Unpredicted and short-lived errors in software, hardware, or network components.
2. Non-transient errors
   Errors that persist over time and cannot be easily resolved through automatic recovery or failover mechanisms.

Slide 9

Slide 9 text

Transient errors
Short-lived errors that are recoverable

Temporary errors that occur in computer systems or networks, typically caused by:
● Temporary disruptions in network connectivity
● Hardware failures
● Software glitches or other similar factors

They are recoverable.

Slide 10

Slide 10 text

Non-transient errors
Not recoverable

Non-transient errors are deterministic: the message fails every time it is consumed, no matter how many times it is reprocessed. Reprocessing produces the same result, so naive retrying becomes an infinite loop that wastes precious computational resources.
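
The distinction between the two error types can be sketched as a small classifier. This is a minimal illustration, not code from the talk: the class name ErrorClassifier and the choice to treat I/O and timeout exceptions as transient are assumptions, and a real consumer would tune the mapping to its own dependencies.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: the name and the exception mapping are illustrative assumptions.
class ErrorClassifier {

    enum ErrorKind { TRANSIENT, NON_TRANSIENT }

    // Treats I/O and timeout problems as short-lived; everything else is
    // assumed to fail again on every retry.
    static ErrorKind classify(Throwable error) {
        if (error instanceof IOException || error instanceof TimeoutException) {
            return ErrorKind.TRANSIENT;
        }
        return ErrorKind.NON_TRANSIENT;
    }

    public static void main(String[] args) {
        System.out.println(classify(new SocketTimeoutException("broker unreachable")));
        System.out.println(classify(new IllegalArgumentException("amount is negative")));
    }
}
```

A consumer could use such a check to decide whether a failure is worth retrying or should go straight to a dead letter topic.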

Slide 11

Slide 11 text

Businesses don’t want to lose messages!
Under any circumstances…

Slide 12

Slide 12 text

Handling consumer failures

Slide 13

Slide 13 text


Slide 14

Slide 14 text

Dead Letter Queue (DLQ)
A place where you can route failed messages for reprocessing

Slide 15

Slide 15 text

Dead Letter Queue pattern - overview

Slide 16

Slide 16 text

DLQ in the context of Kafka
There are no native DLQs in Kafka!
● You can appoint a regular Kafka topic as the dead letter topic (DLT).
● Typically, there is one DLT per source topic.
● The DLT name usually ends with the suffix -dlt.
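
The naming convention can be captured in a tiny helper. This is a sketch under an assumption: the slide only shows the -dlt suffix, so deriving the DLT name by appending it to the source topic name is a guess at the intended pattern, and DltNaming is a hypothetical class.

```java
// Hypothetical helper; assumes the DLT name is "<source topic>" + "-dlt".
class DltNaming {

    static final String DLT_SUFFIX = "-dlt";

    // Derives the dead letter topic name for a source topic.
    static String dltFor(String sourceTopic) {
        if (sourceTopic == null || sourceTopic.trim().isEmpty()) {
            throw new IllegalArgumentException("source topic must not be empty");
        }
        return sourceTopic + DLT_SUFFIX;
    }

    // True if the topic name looks like a dead letter topic.
    static boolean isDlt(String topic) {
        return topic != null && topic.endsWith(DLT_SUFFIX);
    }

    public static void main(String[] args) {
        System.out.println(dltFor("orders"));
    }
}
```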

Slide 17

Slide 17 text

Handling non-transient errors

Slide 18

Slide 18 text

General pattern
For handling non-transient errors

Slide 19

Slide 19 text

Spring Kafka consumer with Kafka/Redpanda

Slide 20

Slide 20 text

Code samples
Where to find the code shown in the talk?
https://github.com/redpanda-data-blog/2022-dead-letter-topics

Slide 21

Slide 21 text

Handling malformed payloads
Dealing with rogue messages

Slide 22

Slide 22 text

Malformed message payloads
● Errors in deserializing string- or binary-encoded messages at the consumer, e.g. XML, JSON, Avro, Protobuf.
● They are usually caught early in the processing pipeline by deserializers.
● Typically, the error is logged and the message is dropped.

Slide 23

Slide 23 text

Deserialization with Spring Kafka consumers

Slide 24

Slide 24 text

We should route the malformed messages to the DLT!
They can be corrected and reprocessed later…

Slide 25

Slide 25 text

Routing malformed messages to the DLT
How does Spring Kafka use the ErrorHandlingDeserializer to catch deserialization errors?
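
The general idea behind an error-handling deserializer is to wrap a delegate and turn exceptions into data that can be routed onward. The sketch below shows that idea in plain Java. It is not Spring Kafka's implementation or API: SafeDeserializer and Result are hypothetical names, and the delegate is modelled as a simple function.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical wrapper illustrating the delegate-and-capture idea only.
class SafeDeserializer<T> {

    // Holds either the decoded value or the error plus the raw bytes,
    // so the failed payload can still be forwarded to a DLT.
    static final class Result<V> {
        final V value;
        final byte[] rawBytes;
        final RuntimeException error;

        Result(V value, byte[] rawBytes, RuntimeException error) {
            this.value = value;
            this.rawBytes = rawBytes;
            this.error = error;
        }

        boolean failed() {
            return error != null;
        }
    }

    private final Function<byte[], T> delegate;

    SafeDeserializer(Function<byte[], T> delegate) {
        this.delegate = delegate;
    }

    // Never throws for a malformed payload; the failure is captured instead.
    Result<T> deserialize(byte[] data) {
        try {
            return new Result<>(delegate.apply(data), data, null);
        } catch (RuntimeException e) {
            return new Result<>(null, data, e);
        }
    }

    public static void main(String[] args) {
        SafeDeserializer<Integer> ints = new SafeDeserializer<>(
                bytes -> Integer.parseInt(new String(bytes, StandardCharsets.UTF_8)));
        System.out.println(ints.deserialize("oops".getBytes(StandardCharsets.UTF_8)).failed());
    }
}
```

Because the failure is captured rather than thrown, the consumer loop keeps running and the raw bytes remain available for the DLT.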

Slide 26

Slide 26 text

Routing malformed messages to the DLT
Spring Kafka configurations

Slide 27

Slide 27 text

Handling validation/consumer errors
Dealing with business rule violations and consumer failures.

Slide 28

Slide 28 text

Case 1
The message fails the rule validation
Although the deserialization succeeds

For example:
● Missing fields in the payload, e.g. the customerId is missing in the order.
● Validation failures, e.g. the amount is negative.
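
The two examples above can be sketched as a validation step that runs after deserialization. The Order shape and the OrderValidator class are assumptions made for illustration; only the customerId and amount fields come from the slide.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

// Hypothetical validator; the Order type is an assumed, simplified payload.
class OrderValidator {

    static final class Order {
        final String customerId;
        final BigDecimal amount;

        Order(String customerId, BigDecimal amount) {
            this.customerId = customerId;
            this.amount = amount;
        }
    }

    // Returns every rule violation found; an empty list means the order is valid.
    static List<String> validate(Order order) {
        List<String> violations = new ArrayList<>();
        if (order.customerId == null || order.customerId.trim().isEmpty()) {
            violations.add("customerId is missing");
        }
        if (order.amount == null) {
            violations.add("amount is missing");
        } else if (order.amount.signum() < 0) {
            violations.add("amount is negative");
        }
        return violations;
    }

    public static void main(String[] args) {
        System.out.println(validate(new Order(null, new BigDecimal("-5.00"))));
    }
}
```

Since these violations are deterministic, retrying the same message would not help; a non-empty result is a signal to route the message to the DLT.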

Slide 29

Slide 29 text

Case 2
Consumer encounters an error
The fault is in the consumer’s processing logic

Although the message is valid, it might trigger an error in the consumer’s processing logic, causing processing to fail. This time, the error is with the consumer.

For example:
● The consumer throws a NullPointerException (NPE).
● Other RuntimeExceptions

Slide 30

Slide 30 text

We should route them to the DLT as well.
They can be corrected and reprocessed later…

Slide 31

Slide 31 text

Routing them to the DLT
Log the exception and continue. Let Spring route the message to the DLT.

In Spring Kafka, you can use the DeadLetterPublishingRecoverer class to route failed messages to the DLT. It can be configured with a KafkaTemplate.
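
The "log, route, and continue" behaviour can be illustrated with an in-memory stand-in. This is not the DeadLetterPublishingRecoverer API: DltRouter, DeadLetter, and the header names are hypothetical, and the DLT is modelled as a plain list rather than a Kafka topic.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical in-memory model of routing failed messages to a DLT.
class DltRouter {

    static final class DeadLetter {
        final String payload;
        final Map<String, String> headers;

        DeadLetter(String payload, Map<String, String> headers) {
            this.payload = payload;
            this.headers = headers;
        }
    }

    // Stands in for the dead letter topic.
    final List<DeadLetter> dlt = new ArrayList<>();

    // Runs the handler; on failure, records the message with context and moves on.
    void process(String sourceTopic, String payload, Consumer<String> handler) {
        try {
            handler.accept(payload);
        } catch (RuntimeException e) {
            Map<String, String> headers = new LinkedHashMap<>();
            headers.put("original-topic", sourceTopic);          // assumed header name
            headers.put("exception-class", e.getClass().getName());
            headers.put("exception-message", String.valueOf(e.getMessage()));
            dlt.add(new DeadLetter(payload, headers));
        }
    }

    public static void main(String[] args) {
        DltRouter router = new DltRouter();
        router.process("orders", "{\"id\":1}", p -> { throw new IllegalStateException("boom"); });
        System.out.println(router.dlt.get(0).headers);
    }
}
```

Attaching the original topic and the exception details gives whoever reprocesses the message enough context to diagnose the failure.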

Slide 32

Slide 32 text

How to reprocess messages in the DLT?
Some best practices
● Manual recovery with human intervention.
● Add more context before sending a message to the DLT.
● The producer team should own malformed messages and fix them, e.g. the producer might be using an older schema version.
● Notify the producer about the failure.

Slide 33

Slide 33 text

Handling transient errors

Slide 34

Slide 34 text

Consumer should retry several times
Transient errors are recoverable at the consumer’s end
● The recommended way to handle a transient error is to retry multiple times, with fixed or incremental intervals in between (backoff intervals).
● If all retry attempts fail, you can redirect the message to the DLT and move on.
● Retrying can be implemented synchronously or asynchronously on the consumer side.
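
The difference between fixed and incremental intervals is easiest to see as a computed schedule. The sketch below is illustrative only: BackoffSchedule is a hypothetical class, and the exponential variant with a cap is one common way to implement incremental intervals, not necessarily the one used in the talk.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that computes the wait before each retry attempt.
class BackoffSchedule {

    // Same delay before every retry.
    static List<Long> fixed(long intervalMs, int retries) {
        List<Long> delays = new ArrayList<>();
        for (int i = 0; i < retries; i++) {
            delays.add(intervalMs);
        }
        return delays;
    }

    // Delay grows by the multiplier on each retry and is capped at maxMs.
    static List<Long> exponential(long initialMs, double multiplier, long maxMs, int retries) {
        List<Long> delays = new ArrayList<>();
        double next = initialMs;
        for (int i = 0; i < retries; i++) {
            delays.add(Math.min((long) next, maxMs));
            next *= multiplier;
        }
        return delays;
    }

    public static void main(String[] args) {
        System.out.println(fixed(1000, 3));
        System.out.println(exponential(1000, 2.0, 10000, 5));
    }
}
```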

Slide 35

Slide 35 text

Blocking retries
Consumer thread is blocked until the retry completes

Slide 36

Slide 36 text

Case 1
Simple blocking retries

Suspend the consumer thread and reprocess the failed message, without calling Consumer.poll() during the retries.
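
A blocking retry boils down to a loop that sleeps on the calling thread between attempts. The following is a minimal sketch in plain Java, assuming a fixed delay; BlockingRetry is a hypothetical helper and not part of Spring Kafka or the Kafka client.

```java
import java.util.function.Supplier;

// Hypothetical helper showing a simple blocking retry with a fixed delay.
class BlockingRetry {

    // Calls the task up to maxAttempts times, sleeping on the current thread
    // between failures. Rethrows the last failure once attempts are exhausted.
    static <T> T call(Supplier<T> task, int maxAttempts, long fixedDelayMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    sleep(fixedDelayMs); // the thread does nothing else meanwhile
                }
            }
        }
        throw last;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during retry backoff", e);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String result = call(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("service unavailable");
            }
            return "processed";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

When the final exception propagates, the caller can hand the message to the DLT and move on.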

Slide 37

Slide 37 text

Drawbacks
● The main consumer thread is blocked.
● Not ideal for high-throughput message processing scenarios.
● Wastes computational resources.

Slide 38

Slide 38 text

Non-blocking retries with backoff
Consumer thread continues

Slide 39

Slide 39 text

Retry topics

Slide 40

Slide 40 text

Case 2
Non-blocking retry with a single retry topic and fixed backoff
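
The flow of a single retry topic can be modelled in memory to show why the main consumer never waits. This is a simplified simulation under stated assumptions: SingleRetryTopic is hypothetical, topics are plain collections, time is passed in as a number, and maxAttempts counts the first attempt on the main topic.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical in-memory model of one retry topic with a fixed backoff.
class SingleRetryTopic {

    static final class RetryRecord {
        final String payload;
        final int attemptsMade;
        final long dueAtMs;

        RetryRecord(String payload, int attemptsMade, long dueAtMs) {
            this.payload = payload;
            this.attemptsMade = attemptsMade;
            this.dueAtMs = dueAtMs;
        }
    }

    final Deque<RetryRecord> retryTopic = new ArrayDeque<>();
    final List<String> dlt = new ArrayList<>();
    private final long backoffMs;
    private final int maxAttempts;

    SingleRetryTopic(long backoffMs, int maxAttempts) {
        this.backoffMs = backoffMs;
        this.maxAttempts = maxAttempts;
    }

    // Main consumer: a failure is parked on the retry topic and processing continues.
    void consumeMain(String payload, Predicate<String> handler, long nowMs) {
        if (!handler.test(payload)) {
            retryTopic.add(new RetryRecord(payload, 1, nowMs + backoffMs));
        }
    }

    // Retry consumer: handles only records whose backoff has elapsed.
    void consumeRetries(Predicate<String> handler, long nowMs) {
        while (!retryTopic.isEmpty() && retryTopic.peek().dueAtMs <= nowMs) {
            RetryRecord record = retryTopic.poll();
            if (handler.test(record.payload)) {
                continue; // recovered
            }
            int attemptsMade = record.attemptsMade + 1;
            if (attemptsMade >= maxAttempts) {
                dlt.add(record.payload); // retries exhausted
            } else {
                retryTopic.add(new RetryRecord(record.payload, attemptsMade, nowMs + backoffMs));
            }
        }
    }

    public static void main(String[] args) {
        SingleRetryTopic topics = new SingleRetryTopic(1000, 3);
        topics.consumeMain("order-1", p -> false, 0);
        topics.consumeRetries(p -> false, 1000);
        topics.consumeRetries(p -> false, 2000);
        System.out.println(topics.dlt);
    }
}
```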

Slide 41

Slide 41 text

Spring Kafka configuration

Slide 42

Slide 42 text

Case 3
Non-blocking retry with multiple retry topics and an exponential backoff
Inspired by the Netflix blog post on the same topic.
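
With multiple retry topics, each topic carries its own delay, and a message hops from one to the next until it lands in the DLT. The sketch below only plans that chain. The topic naming (a -retry-N suffix followed by -dlt) and the RetryTopicChain class are assumptions for illustration, not names taken from the talk.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical planner for a chain of retry topics with exponential backoff.
class RetryTopicChain {

    static final class Hop {
        final String topic;
        final long delayMs;

        Hop(String topic, long delayMs) {
            this.topic = topic;
            this.delayMs = delayMs;
        }

        @Override
        public String toString() {
            return topic + " (" + delayMs + " ms)";
        }
    }

    // Builds the ordered list of retry topics, each with a longer delay,
    // ending with the dead letter topic (no delay).
    static List<Hop> plan(String sourceTopic, int retryTopics, long initialDelayMs, double multiplier) {
        List<Hop> hops = new ArrayList<>();
        double delay = initialDelayMs;
        for (int i = 0; i < retryTopics; i++) {
            hops.add(new Hop(sourceTopic + "-retry-" + i, (long) delay)); // assumed naming
            delay *= multiplier;
        }
        hops.add(new Hop(sourceTopic + "-dlt", 0L));
        return hops;
    }

    public static void main(String[] args) {
        for (Hop hop : plan("orders", 3, 1000, 2.0)) {
            System.out.println(hop);
        }
    }
}
```

Separating delays by topic keeps a slow, long-delayed message from holding up messages that only need a short wait.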

Slide 43

Slide 43 text


Slide 44

Slide 44 text

Spring Kafka configuration

Slide 45

Slide 45 text

Summary
Things you can take home…

Slide 46

Slide 46 text

Takeaways
● Consumer failure scenarios can be broadly categorized into transient and non-transient errors.
● Malformed payloads, business rule validation failures, and consumer errors are possible causes for non-transient errors.
● Consumers should detect non-transient errors as early as possible and move them to the DLT for manual reprocessing.
● Consumers should implement retry strategies to handle transient errors.
● Prefer using asynchronous retrying when the message throughput is high.
● If all retry attempts fail, the message can be moved to the DLT.

Slide 47

Slide 47 text

Questions?

Slide 48

Slide 48 text

Keep learning
● Redpanda University: https://university.redpanda.com
● Redpanda Docs: https://docs.redpanda.com/
● Redpanda Blogs: https://redpanda.com/blog
● Redpanda Code: https://github.com/redpanda-data

Slide 49

Slide 49 text

Thanks for joining!
Let’s keep in touch
@redpandadata
redpanda-data
redpanda-data