Key Insights from Using Kafka in Large-Scale Projects

This presentation distills critical insights and lessons from using Kafka in a large-scale project involving 100 developers. We delve into common misunderstandings and pitfalls that teams encounter when integrating Kafka into their workflows, highlighting practical examples and solutions. The focus is on the most important learnings garnered from our experience, offering attendees a roadmap for navigating the complexities of using Kafka effectively in big projects.

Posedio

May 15, 2024

Transcript

  1. INGREDIENTS OF A WEBSHOP
     • Connecting Multiple Domains
     • Complex Integration Problem
     • 13 development teams
     • 28 Microservices and growing
     • > 50 Kafka Topics / Cluster
  2. ORCHESTRATION VS CHOREOGRAPHY
     • Choreography is decoupled but can make debugging and control flow difficult to follow
     • Orchestration is more observable, debuggable and centralized, but results in a single point of failure
     Source: https://www.milanjovanovic.tech/blog/orchestration-vs-choreography
  3. ARCHITECTURE
     • Connecting Multiple Domains
     • Complex Integration Problem
     • 13 development teams
     • 28 Microservices and growing
     • > 50 Kafka Topics / Cluster
  4. 8 TIPS ON HOW TO MAKE YOUR JOURNEY WITH KAFKA AS UNPLEASANT AS POSSIBLE
     Tested in the field.
  5. IS KAFKA LIKE ANY OTHER MESSAGING SYSTEM?
     • Kafka is very powerful but a complex beast to handle compared to AMQP, PubSub, JMS and co.
     • Developers have to deal with more low-level details like consumer groups, managing offsets, partitions, and error handling
     • Migration from other messaging systems takes time and careful planning
     • Design mistakes are expensive
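To illustrate the low-level details the slide alludes to, here is a minimal sketch of a plain Java consumer loop; the topic name and group id are invented for illustration.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Consumer groups, offsets and partitions are visible concerns here,
        // unlike in most AMQP/JMS-style APIs.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-service");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders")); // hypothetical topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Partition and offset are part of the programming model.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```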
  6. WHAT CAN WE DO ABOUT IT
     SETUP GOLDEN PATH (STANDARDS)
     • Provide application blueprints
     • Local playgrounds
     • Manage schema centrally
     • Define sensible defaults
     LEARNING
     • Workshops for developers
     • Share pitfalls and your learnings with the team
     • Competence center
  7. BUILD A DISTRIBUTED MONOLITH (Tip 2)
     Building an event-based system but still maintaining strong dependencies between services.
  8. HOW TO BUILD A DISTRIBUTED MONOLITH
     • Messages are not self-contained
     • Consumers have strong dependencies on the producers and other services
     • Tricky error handling (especially with stateless services)
     • Race conditions are very likely, or strict versioning of entities is required
     • Internal DoS attacks
  9. WHAT CAN WE DO
     • Self-contained messages (see the sketch after this list)
     • To not affect all consumers:
       • Add additional topics
       • Introduce additional event types on the same topic
     • Request-Reply Pattern
     [Diagram: services SVC 1, SVC 2, SVC 4]
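As a sketch of what "self-contained" means in practice, compare these two hypothetical Java event payloads; the class and field names are invented for illustration, not taken from the deck.

```java
// NOT self-contained: consumers must call back into the producer
// (or other services) to learn what actually changed.
record OrderUpdatedEvent(String orderId) {}

// Self-contained: the event carries everything a consumer needs,
// so no synchronous dependency on the producer remains.
record OrderUpdatedEventV2(
        String orderId,
        String customerId,
        String status,        // e.g. "PAID", "SHIPPED"
        long totalCents,
        java.time.Instant occurredAt) {}
```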
  10. EVERYTHING IN KAFKA IS AN EVENT (Tip 3)
      Current IT trends and literature guide us into thinking that using Kafka always equals building an event-based system.
  11. ARE THERE ONLY EVENTS IN KAFKA?
      A pure "event-based" system comes with preconditions and design constraints.
      Example: rendering receipt documents (invoices). The side effects:
      • If I create invoices only on event states, I have to recreate the event state to retrigger an invoice
      • This may have side effects on other things and is not clean (because it was an artificial state change)
      • And the invoice will usually always be created unless I take extra measures
      • But it is very likely I want to recreate invoices at any time, not just for specific events
  12. WHAT WE CAN DO ABOUT IT
      From EIP (Enterprise Integration Patterns) there are four message types to consider:
      • Event Message (1:n)
      • Document Message (1:1 or 1:n)
      • Command Message (1:1)
      • Request-Reply Messages (1:1)
      And combinations thereof.
  13. HOW WE SOLVED IT
      Make the message intent clear:
      • Make a deliberate choice about which kind of message pattern you want to use (e.g. events, documents, commands)
      • Make the choice transparent, e.g. in the topic and record name
      • Follow a consistent pattern for constructing the schema
      • Use a consistent schema upgrade policy:
        • FORWARD(_TRANSITIVE) is a good fit for "event" and "document" topics; the producer owns the schema
        • BACKWARD(_TRANSITIVE) is a good fit for "command" topics; the consumer owns the schema
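A minimal sketch of configuring those compatibility policies with the Confluent Schema Registry Java client; the registry URL and subject names are assumptions for illustration.

```java
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;

public class CompatibilitySetup {
    public static void main(String[] args) throws Exception {
        SchemaRegistryClient client =
                new CachedSchemaRegistryClient("http://localhost:8081", 100);

        // Producer-owned event topic: consumers reading with an older schema
        // must still be able to read data written with any newer schema.
        client.updateCompatibility("orders.order-placed-value", "FORWARD_TRANSITIVE");

        // Consumer-owned command topic: the consumer's (newer) schema must
        // still be able to read data written by older producers.
        client.updateCompatibility("billing.create-invoice-value", "BACKWARD_TRANSITIVE");
    }
}
```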
  14. IGNORE ERROR HANDLING (Tip 4)
      • We tend to optimize for the happy path
      • We assume that if something goes wrong we can fix it manually, like in the good old days with a DBMS
      • When error handling comes into play, things get more complicated
  15. WHAT WE CAN DO ABOUT IT
      • Decide on end-to-end delivery semantics early on (at-least-once, exactly-once, etc.)
      • Distinguish transient errors from permanent errors
      • Choose an error handling pattern:
        • Is it time critical?
        • Is it acceptable to lose messages?
        • Is ordering important?
      Source: https://www.confluent.io/blog/error-handling-patterns-in-kafka/
  16. HOW WE SOLVED IT
      • A transient error (e.g. a broken upstream dependency): pause the container listener and resume once the underlying cause has been resolved
      • A permanent error (e.g. message validation fails): Dead Letter Queue
      • If order has to be preserved and messages must not be lost: Transactional Inbox / Outbox for at-least-once delivery with idempotency keys (Idempotent Receiver) and a dead letter queue
      • Keep it as simple as possible
      • Guidelines, not strict rules
      • Observability is key
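A minimal sketch of the retry-then-dead-letter handling above, assuming Spring Kafka (the "container listener" wording suggests it, though the deck does not name the framework); topic suffix, retry counts and exception choice are illustrative.

```java
import org.apache.kafka.common.TopicPartition;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

@Configuration
public class KafkaErrorHandlingConfig {

    @Bean
    public DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> template) {
        // Permanent failures end up on "<topic>.DLT" after retries are exhausted.
        DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(
                template,
                (record, ex) -> new TopicPartition(record.topic() + ".DLT", record.partition()));

        // Retry transient errors a few times with a fixed back-off before recovering.
        DefaultErrorHandler handler = new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 3));

        // Validation failures are permanent: skip retries, go straight to the DLT.
        handler.addNotRetryableExceptions(IllegalArgumentException.class);
        return handler;
    }
}
```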
  17. ONE SCHEMA TO RULE THEM ALL (Tip 5)
      • You don't need a schema
      • Just assume a schema will never change
      • Do not define a schema upgrade policy
      • Do not assign clear responsibilities for topics and schemas
  18. IS THERE ALWAYS ONLY ONE SCHEMA FOR A TOPIC?
      • Requirements are likely to change
      • Introducing schemas later is tricky
      • Schema registry is an add-on convention, so is it really needed?
  19. WHAT WE CAN DO ABOUT IT
      • Clear ownership of topics / schemas
      • All affected parties are visible
      • Define a strategy for compatible updates
      • Define a strategy for breaking changes
      • Set up a playbook
      • Have multiple stages to test it
      • Document failures
  20. HOW WE SOLVED IT
      • Any (new) topic is assigned to a product team
      • This team can create the topic and propose a schema (upgrade)
      • All affected parties are invited to contribute and have to approve the MR for schema changes
      • The schema is rolled out to the schema registry and the respective Java client libraries are deployed
      • Run the playbook, depending on the update strategy
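As an illustrative sketch of the roll-out step, registering a new schema version with the Confluent Schema Registry Java client; the subject and the Avro schema itself are invented, not the project's actual schema.

```java
import io.confluent.kafka.schemaregistry.avro.AvroSchema;
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;

public class SchemaRollout {
    public static void main(String[] args) throws Exception {
        SchemaRegistryClient client =
                new CachedSchemaRegistryClient("http://localhost:8081", 100);

        // Hypothetical Avro schema for an order-placed event.
        String avro = """
                {"type":"record","name":"OrderPlaced","namespace":"com.example.orders",
                 "fields":[{"name":"orderId","type":"string"},
                           {"name":"totalCents","type":"long"}]}""";

        // The registry rejects the new version if it violates the subject's
        // configured compatibility policy (e.g. FORWARD_TRANSITIVE).
        int id = client.register("orders.order-placed-value", new AvroSchema(avro));
        System.out.println("registered schema id " + id);
    }
}
```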
  21. ROLL A DICE TO DECIDE ON THE NUMBER OF PARTITIONS (Tip 6)
      • Better be on the safe side, more partitions are always better
      • Ignore the limit of 4000 partitions we can have per broker
      • Always create topics years prior to usage to reserve capacity
  22. DOES THE NUMBER OF PARTITIONS MATTER?
      • More partitions mean more overhead in compute (and memory) on the brokers
      • Can significantly impact costs
      • Will not automatically increase availability
      • Your Kafka colleagues will love you
  23. WHAT CAN WE DO ABOUT IT
      OPERATIONS
      • Define and enforce policies (e.g. Policy as Code)
      • Provide guidelines
      • Provide upscaling of partitions
      DEVELOPMENT TEAMS
      • Max consumers available in consumer groups
      • Consider data volume
      • Time criticality
      • Type of workload
      • Number of brokers available
      Uphold friendship with your Kafka Operations Team.
  24. HOW WE SOLVED IT
      • Use a low partition count on test environments
      • Apply a standard setting of three partitions and three in-sync replicas (equal to the number of brokers) for a production setup, unless:
        • High data volume or dynamic load is expected
        • Compute-intense workloads
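A minimal sketch of that standard setup with the plain Kafka AdminClient; the topic name and the min.insync.replicas value are assumptions, not taken from the deck.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Standard production default: 3 partitions, replication factor 3.
            NewTopic topic = new NewTopic("orders.order-placed", 3, (short) 3)
                    // Assumed setting: tolerate one broker outage while still
                    // requiring two in-sync replicas for acknowledged writes.
                    .configs(Map.of("min.insync.replicas", "2"));

            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```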
  25. HOW WE SOLVED IT
      Deliberately choose a single partition to do leader election in a distributed system: the one consumer assigned to the partition is the elected leader.
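A sketch of this trick with the plain Java consumer, assuming a single-partition topic named "leader-election" (the name is invented): whichever instance is assigned the only partition is the leader.

```java
import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class LeaderElection {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // All candidate instances share one group; the single partition can
        // only ever be assigned to one of them at a time.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "leader-election-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        AtomicBoolean leader = new AtomicBoolean(false);
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("leader-election"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    leader.set(!partitions.isEmpty()); // we now hold the only partition
                }

                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    leader.set(false); // leadership moves on at the next rebalance
                }
            });
            while (true) {
                consumer.poll(Duration.ofSeconds(1)); // keeps the assignment alive
                if (leader.get()) {
                    // ... perform leader-only work here ...
                }
            }
        }
    }
}
```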
  26. DEFAULT SETTINGS (Tip 7)
      • Leaving everything with default settings, we usually get an idempotent producer -> exactly-once semantics
      • Periodic auto-commit of offsets on the consumer side (e.g. every 5 seconds), independent of the actual unit of work -> something like "maybe once, or more"
  27. WHAT WE CAN DO ABOUT IT
      • Define a set of reasonable defaults for your use case
      • Consider that default settings differ across languages and frameworks
      Some hints:
      Consumer
      • enable.auto.commit
      • max.poll.records
      • max.poll.interval.ms
      • session.timeout.ms
      Schema Registry
      • auto.register.schemas
      • use.latest.version
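A sketch of overriding the consumer-side settings named above so that offsets are committed only after the unit of work completes; the concrete values and topic name are illustrative, not recommendations from the deck.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ExplicitDefaultsConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "invoice-service");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // Tie offset commits to the unit of work instead of a 5-second timer.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        // Bound the batch so one poll cannot exceed max.poll.interval.ms.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000");
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "45000");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("billing.create-invoice"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // the actual unit of work
                }
                consumer.commitSync(); // commit only after processing succeeded
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.println("processing offset " + record.offset());
    }
}
```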
  28. LOGGING STRATEGY (Tip 8)
      • We log all produced events
      • We log all consumed events
      • We make sure the payload fits in the log message so it does not get truncated
      [Diagram: services SVC 1–SVC 4 exchanging HTTP POST /do/some/thing requests and responses]
  29. BUT,
      • It is expensive
      • It is just replicating the same thing
      • We litter the logs and blur the important info
  30. WHAT WE CAN DO ABOUT IT
      • Use one id to correlate all logs
      • Use a tracing system to introduce sub-spans
      • Apply it consistently for the scope of one business transaction
      [Diagram: services SVC 1–SVC 4 propagating correlation id x3423452d through HTTP and Kafka headers]
  31. HOW WE SOLVED IT
      • Built observability on:
        • Monitoring
        • Logging
        • Traces
      • Request-id returns all logs across all systems for a user session
      • Trace-id for a request
      • Span-id enables timings between specific use cases
      [Diagram: services establish context with Request-id: 123 and TraceIds 567/891, 567/999, 567/555, propagated via sampled Trace API calls and Kafka headers; Kafka metrics are collected into monitoring]
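A sketch of carrying such a correlation id in a Kafka record header with the plain Java client; the header name, topic and payload are illustrative, not the deck's actual convention.

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CorrelatedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("orders.order-placed", "order-1", "{...}");
            // Propagate the request id of the originating business transaction,
            // so consumers can log it and continue the trace.
            record.headers().add("request-id",
                    "123".getBytes(StandardCharsets.UTF_8));
            producer.send(record);
        }
    }
}
```

On the consuming side, `record.headers().lastHeader("request-id")` reads the id back so it can be placed on the logging context before any log line is written.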
  32. WHAT WE HAD TO LEAVE OUT
      A lot more things we have not covered:
      • SubjectNamingStrategies
      • Implement Idempotency
      • Subject References
      • GDPR Concerns
      • DeletionPolicy
      • Certificate Handling
      • Flink Integration