Slide 1

Slide 2

2 WE ARE
Follow us on LinkedIn:

Slide 3

3 INGREDIENTS OF A WEBSHOP
• Connecting Multiple Domains
• Complex Integration Problem
• 13 development teams
• 28 Microservices and growing
• > 50 Kafka Topics / Cluster

Slide 4

4 ORCHESTRATION VS CHOREOGRAPHY
Choreography is decoupled but can make debugging and control flow difficult to follow.
Orchestration is more observable, debuggable and centralized, but results in a single point of failure.
Source: https://www.milanjovanovic.tech/blog/orchestration-vs-choreography

Slide 5

5 ARCHITECTURE
• Connecting Multiple Domains
• Complex Integration Problem
• 13 development teams
• 28 Microservices and growing
• > 50 Kafka Topics / Cluster

Slide 6

6 8 TIPS ON HOW TO MAKE YOUR JOURNEY WITH KAFKA AS UNPLEASANT AS POSSIBLE
Tested in the field

Slide 7

KAFKA IS LIKE ANY OTHER MESSAGING SYSTEM – HOW HARD CAN IT BE? 1

Slide 8

8 IS KAFKA LIKE ANY OTHER MESSAGING SYSTEM?
• Kafka is very powerful but a complex beast to handle compared to AMQP, PubSub, JMS and co.
• Developers have to deal with more low-level details like consumer groups, managing offsets, partitions, and error handling.
• Migration from other messaging systems takes time and careful planning.
• Design mistakes are expensive.
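To make the extra low-level detail concrete, here is a minimal sketch of a plain Java consumer that has to deal with polling, group membership and manual offset commits itself; the broker address, group id and topic name are placeholder values, not configuration from the talk.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class MinimalConsumer {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");              // placeholder
        p.put("group.id", "demo-group");                           // consumer group to join
        p.put("enable.auto.commit", "false");                      // offsets are our problem now
        p.put("key.deserializer", StringDeserializer.class.getName());
        p.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(p)) {
            consumer.subscribe(List.of("demo-topic"));              // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // process the record; failures here need an explicit error-handling strategy
                }
                consumer.commitSync();                              // commit only after the unit of work
            }
        }
    }
}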

Slide 9

9 WHAT CAN WE DO ABOUT IT
SET UP A GOLDEN PATH (STANDARDS)
• Provide application blueprints
• Local playgrounds
• Manage schemas centrally
• Define sensible defaults
LEARNING
• Workshops for developers
• Share pitfalls and your learnings with the team
• Competence center
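A blueprint can be as simple as a shared factory with agreed defaults. The following sketch assumes a plain Java producer; the chosen values (idempotence, acks=all, zstd compression) are illustrative assumptions, not the actual golden path from the talk.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public final class ProducerBlueprint {
    // Shared "golden path" defaults every team starts from.
    public static Properties defaults(String bootstrapServers) {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        p.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");   // avoid duplicates on retry
        p.put(ProducerConfig.ACKS_CONFIG, "all");                  // wait for all in-sync replicas
        p.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "zstd");     // illustrative choice
        return p;
    }

    public static KafkaProducer<String, String> create(String bootstrapServers) {
        return new KafkaProducer<>(defaults(bootstrapServers));
    }
}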

Slide 10

BUILD A DISTRIBUTED MONOLITH 2
Building an event-based system but still maintaining strong dependencies between services

Slide 11

11 HOW TO BUILD A DISTRIBUTED MONOLITH
• Messages are not self-contained
• Consumers have strong dependencies on the producers and other services
• Tricky error handling (especially with stateless services)
• Race conditions are very likely, or strict versioning of entities is required
• Internal DoS attacks

Slide 12

12 WHAT CAN WE DO
• Self-contained messages
• To not affect all consumers:
  • Add additional topics
  • Introduce additional event types on the same topic
• Request-Reply pattern
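For illustration, the difference between a reference-only message and a self-contained one might look like the sketch below; all field names are hypothetical.

import java.math.BigDecimal;
import java.time.Instant;

// Not self-contained: every consumer must call the producing service back to learn anything.
record OrderUpdated(String orderId) { }

// Self-contained: consumers can act on the message alone (fields are illustrative).
record OrderUpdatedV2(String orderId, String status, String customerId,
                      BigDecimal totalAmount, Instant occurredAt) { }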

Slide 13

EVERYTHING IN KAFKA IS AN EVENT 3
Current IT trends and literature guide us into thinking that using Kafka always equals building an event-based system.

Slide 14

14 ARE THERE ONLY EVENTS IN KAFKA?
A pure “event-based” system comes with preconditions and design constraints.
Example: rendering receipt documents (invoices)
The side effects:
• If I create invoices only on event states, I have to recreate the event state to retrigger an invoice
• This may have side effects on other things and is not clean (because it was an artificial state change)
• And the invoice will usually be created again unless I take extra measures
• But it is very likely that I want to recreate invoices at any time, not just for specific events

Slide 15

15 WHAT WE CAN DO ABOUT IT
From EIP (Enterprise Integration Patterns) there are four message types to consider:
• Event Message (1:n)
• Document Message (1:1 or 1:n)
• Command Message (1:1)
• Request-Reply Messages (1:1)
And combinations thereof

Slide 16

16 HOW WE SOLVED IT
• Make a deliberate choice about which kind of message pattern you want to use (e.g. events, documents, commands)
• Make the choice transparent, e.g. in the topic and record name
• Follow a consistent pattern for constructing the schema
• Use a consistent schema upgrade policy
  • FORWARD(_TRANSITIVE) is a good fit for “event” and “document” topics; the producer owns the schema
  • BACKWARD(_TRANSITIVE) is a good fit for “command” topics; the consumer owns the schema
Make the message intent clear
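One way to make the chosen upgrade policy explicit is to pin the compatibility level per subject in the Schema Registry. The sketch below calls the registry's REST config endpoint from Java; the registry URL and the subject name are placeholders.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CompatibilityPolicy {
    public static void main(String[] args) throws Exception {
        String registry = "http://schema-registry:8081";           // placeholder URL
        String subject  = "order-events-value";                    // hypothetical subject
        // Event/document topics: the producer owns the schema -> FORWARD_TRANSITIVE.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(registry + "/config/" + subject))
                .header("Content-Type", "application/vnd.schemaregistry.v1+json")
                .PUT(HttpRequest.BodyPublishers.ofString("{\"compatibility\": \"FORWARD_TRANSITIVE\"}"))
                .build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}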

Slide 17

THINK ABOUT KAFKA ERROR HANDLING ONLY IF (WHEN) YOU HAVE ERRORS IN PRODUCTION 4

Slide 18

18 IGNORE ERROR HANDLING
• We tend to optimize for the happy path
• We assume that if something goes wrong we can fix it manually, like in the good old days with a DBMS
• When error handling comes into play, things get more complicated

Slide 19

19 WHAT WE CAN DO ABOUT IT
• Decide on end-to-end delivery semantics early on (at-least-once, exactly-once, etc.)
• Distinguish transient errors from permanent errors
• Choose an error handling pattern:
  • Is it time critical?
  • Is it acceptable to lose messages?
  • Is ordering important?
https://www.confluent.io/blog/error-handling-patterns-in-kafka/

Slide 20

20 HOW WE SOLVED IT
• A transient error (e.g. a broken upstream dependency): pause the container listener and resume once the underlying cause has been resolved
• A permanent error (e.g. message validation fails): Dead Letter Queue
• If order has to be preserved and messages must not be lost: transactional inbox / outbox for at-least-once delivery with idempotency keys (idempotent receiver) and a dead letter queue
• Keep it as simple as possible
• Guidelines, not strict rules
• Observability is key
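The pause/resume and dead-letter wording maps naturally onto Spring for Apache Kafka; the following is a rough sketch under that assumption, treating validation failures as permanent errors (IllegalArgumentException stands in for whatever exception your validation throws). Pausing and resuming the listener container for broken upstream dependencies would be layered on top, e.g. via the KafkaListenerEndpointRegistry.

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

@Configuration
class ErrorHandlingConfig {

    @Bean
    DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> template) {
        // Permanent errors end up on the dead letter topic once retries are exhausted.
        var recoverer = new DeadLetterPublishingRecoverer(template);
        // Transient errors: retry 3 times with a 1 second pause before giving up.
        var handler = new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 3));
        // Treat validation failures as permanent: no retries, straight to the dead letter topic.
        handler.addNotRetryableExceptions(IllegalArgumentException.class);
        return handler;
    }
}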

Slide 21

ONE SCHEMA TO RULE THEM ALL 5
• You don’t need a schema
• Just assume a schema will never change
• Do not define a schema upgrade policy
• Do not assign clear responsibilities for topics and schemas

Slide 22

22 IS THERE ALWAYS ONLY ONE SCHEMA FOR A TOPIC?
• Requirements are likely to change
• Introducing schemas later is tricky
• The schema registry is an add-on convention, so is it really needed?

Slide 23

23 WHAT WE CAN DO ABOUT IT
• Clear ownership of topics / schemas
• All affected parties are visible
• Define a strategy for compatible updates
• Define a strategy for breaking changes
• Set up a playbook
• Have multiple stages to test it
• Document failures

Slide 24

24 HOW WE SOLVED IT
• Any (new) topic is assigned to a product team
• This team can create the topic and propose a schema (upgrade)
• All affected parties are invited to contribute and have to approve the MR for schema changes
• Schema is rolled out to the schema registry and respective Java client libraries are deployed
• Run the playbook, depending on the update strategy

Slide 25

25 HOW WE SOLVED IT

Slide 26

ROLL A DICE TO DECIDE ON THE NUMBER OF PARTITIONS 6
• Better be on the safe side, more partitions are always better
• Ignore the limit of 4000 partitions we can have per broker
• Always create topics years prior to usage to reserve capacity

Slide 27

27 DOES THE NUMBER OF PARTITIONS MATTER?
• More partitions mean more overhead in compute (and memory) on the brokers
• Can significantly impact costs
• Will not automatically increase availability
• Your Kafka colleagues will love you

Slide 28

28 WHAT CAN WE DO ABOUT IT
OPERATIONS
• Define and enforce policies (e.g. Policy as Code)
• Provide guidelines
• Provide upscaling of partitions
DEVELOPMENT TEAMS
• Max consumers available in consumer groups
• Consider data volume
• Time criticality
• Type of workload
• Number of brokers available
Uphold friendship with your Kafka Operations Team

Slide 29

29 HOW WE SOLVED IT
• Use a low partition count on test environments
• Apply the standard setting of three partitions for a productive setup and 3 in-sync replicas (equal to the number of brokers)
Unless:
• High data volume or dynamic load is expected
• Compute-intense workloads
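As an illustration of that standard setup, creating a topic with three partitions and a replication factor of three through the Admin API could look like this; the bootstrap address and topic name are placeholders.

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            // Standard production setting from the talk: 3 partitions, 3 replicas.
            NewTopic topic = new NewTopic("checkout-orders", 3, (short) 3);      // hypothetical name
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}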

Slide 30

30 HOW WE SOLVED IT
Deliberately choose a single partition to do leader election in a distributed system: the one consumer assigned to the partition is the elected leader.
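A rough sketch of the idea, with placeholder names: the topic is created with exactly one partition, and whichever consumer instance in the group gets that partition assigned acts as the leader until a rebalance takes it away.

import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class LeaderElection {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");          // placeholder
        p.put("group.id", "scheduler");                         // hypothetical group
        p.put("key.deserializer", StringDeserializer.class.getName());
        p.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(p)) {
            // "leader-election" is a hypothetical topic created with exactly one partition.
            consumer.subscribe(List.of("leader-election"), new ConsumerRebalanceListener() {
                @Override public void onPartitionsAssigned(Collection<TopicPartition> parts) {
                    if (!parts.isEmpty()) { /* we own the single partition -> we are the leader */ }
                }
                @Override public void onPartitionsRevoked(Collection<TopicPartition> parts) {
                    /* leadership lost -> stop leader-only work */
                }
            });
            while (true) {
                consumer.poll(Duration.ofSeconds(1)); // keep the membership (and leadership) alive
            }
        }
    }
}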

Slide 31

NEVER TOUCH DEFAULT SETTINGS 7

Slide 32

32 DEFAULT SETTINGS
• Leaving everything at default settings, we usually get an idempotent producer -> exactly-once semantics
• Periodic auto-commit of offsets on the consumer side (e.g. every 5 seconds), independently of the actual unit of work -> something like "maybe once or more"

Slide 33

33 WHAT WE CAN DO ABOUT IT
• Define a set of reasonable defaults for your use case
• Consider different default settings when using different languages and frameworks
Some hints
Consumer:
• enable.auto.commit
• max.poll.records
• max.poll.interval.ms
• session.timeout.ms
Schema Registry:
• auto.register.schemas
• use.latest.version
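As a sketch, the consumer and Schema Registry hints above translate into explicit settings like the following; the values are illustrative, not recommendations from the talk.

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public final class ConsumerDefaults {
    // Illustrative values only; tune them for your own workload.
    public static Properties consumerOverrides() {
        Properties p = new Properties();
        p.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");    // commit per unit of work instead
        p.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");        // smaller batches per poll
        p.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000"); // max time to process one batch
        p.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "45000");    // how quickly dead clients are evicted
        return p;
    }

    // Confluent serializer/deserializer settings for the Schema Registry.
    public static Properties serdeOverrides() {
        Properties p = new Properties();
        p.put("auto.register.schemas", "false"); // schemas are rolled out deliberately, not by apps
        p.put("use.latest.version", "true");     // produce with the latest registered schema version
        return p;
    }
}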

Slide 34

I DESIGN MY KAFKA MESSAGES SO THAT I CAN LOG THEM NICELY 8

Slide 35

35 LOGGING STRATEGY
• We log all produced events
• We log all consumed events
• We make sure the payload fits in the log message so it does not get truncated
(Diagram: HTTP request/response flow between SVC 1-4)

Slide 36

36 8) BUT,
• It is expensive
• It is just replicating the same thing
• We litter the logs and blur the important info

Slide 37

37 WHAT WE CAN DO ABOUT IT
• Use one id to correlate all logs
• Use a tracing system to introduce sub-spans
• Consistently for the scope of one business transaction
(Diagram: correlation id x3423452d propagated via HTTP headers and a Kafka header across SVC 1-4)
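A minimal sketch of carrying such a correlation id in a Kafka record header; the header name and the id are made up for the example.

import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;

public final class Correlation {
    static final String HEADER = "x-request-id"; // hypothetical header name

    // Producer side: attach the id of the current business transaction.
    static <K, V> ProducerRecord<K, V> withRequestId(ProducerRecord<K, V> record, String requestId) {
        record.headers().add(HEADER, requestId.getBytes(StandardCharsets.UTF_8));
        return record;
    }

    // Consumer side: read it back and put it into the logging context (e.g. MDC).
    static String requestId(ConsumerRecord<?, ?> record) {
        Header h = record.headers().lastHeader(HEADER);
        return h == null ? null : new String(h.value(), StandardCharsets.UTF_8);
    }
}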

Slide 38

38 HOW WE SOLVED IT
• Built observability on:
  • Monitoring
  • Logging
  • Traces
• Request-id returns all logs across all systems for a user session
• Trace-id for a request
• Span-id enables timings between specific use cases
(Diagram: Request-id 123 and trace ids propagated across services, the Kafka header, the Trace API and monitoring)

Slide 39

39 8) LOGS
Logging system and Kafka UI (AKHQ)

Slide 40

40 8) TRACES

Slide 41

41 MONITORING
Monitoring consumer group offsets for each topic (including alerts)
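For illustration, consumer group lag per partition can be derived from the Admin API as sketched below (group id and bootstrap address are placeholders); in practice this information comes from the monitoring stack and its alerts rather than ad-hoc code.

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLag {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            // Committed offsets of the group, per partition.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("checkout-service")           // hypothetical group
                         .partitionsToOffsetAndMetadata().get();
            // Latest offsets of the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(latestSpec).all().get();
            committed.forEach((tp, offset) ->
                    System.out.println(tp + " lag=" + (latest.get(tp).offset() - offset.offset())));
        }
    }
}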

Slide 42

43 WHAT WE HAD TO LEAVE OUT
A LOT MORE THINGS WE HAVE NOT COVERED
• SubjectNamingStrategies
• Implement Idempotency
• Subject References
• GDPR Concerns
• DeletionPolicy
• Certificate Handling
• Flink Integration