Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Kafka, the hard parts
Search
Chris Keathley
January 10, 2019
Programming
2
1.5k
Kafka, the hard parts
This talk tries to summarize a lot of the lessons I've learned building systems on kafka.
Chris Keathley
January 10, 2019
Tweet
Share
More Decks by Chris Keathley
See All by Chris Keathley
Solid code isn't flexible
keathley
4
1k
Building Adaptive Systems
keathley
38
2.4k
Contracts for building reliable systems
keathley
5
780
Building Resilient Elixir Systems
keathley
6
2.1k
Consistent, Distributed Elixir
keathley
5
1.5k
Telling stories with data visualization
keathley
0
570
Easing into continuous deployment
keathley
1
330
Leveling up your git skills
keathley
0
700
Generative Testing in Elixir
keathley
0
470
Other Decks in Programming
See All in Programming
Rubyでつくるパケットキャプチャツール
ydah
0
170
20年もののレガシープロダクトに 0からPHPStanを入れるまで / phpcon2024
hirobe1999
0
1k
カンファレンス動画鑑賞会のススメ / Osaka.swift #1
hironytic
0
170
functionalなアプローチで動的要素を排除する
ryopeko
1
190
CQRS+ES の力を使って効果を感じる / Feel the effects of using the power of CQRS+ES
seike460
PRO
0
240
ATDDで素早く安定した デリバリを実現しよう!
tonnsama
1
1.8k
VisionProで部屋の明るさを反映させるシェーダーを作った話
segur
0
100
asdf-ecspresso作って 友達が増えた話 / Fujiwara Tech Conference 2025
koluku
0
1.4k
PHPで学ぶプログラミングの教訓 / Lessons in Programming Learned through PHP
nrslib
4
1.1k
週次リリースを実現するための グローバルアプリ開発
tera_ny
1
1.2k
Fibonacci Function Gallery - Part 2
philipschwarz
PRO
0
210
DevFest - Serverless 101 with Google Cloud Functions
tunmise
0
140
Featured
See All Featured
Practical Tips for Bootstrapping Information Extraction Pipelines
honnibal
PRO
10
860
Build your cross-platform service in a week with App Engine
jlugia
229
18k
A designer walks into a library…
pauljervisheath
205
24k
How STYLIGHT went responsive
nonsquared
96
5.3k
Why Our Code Smells
bkeepers
PRO
335
57k
Learning to Love Humans: Emotional Interface Design
aarron
274
40k
Facilitating Awesome Meetings
lara
51
6.2k
Rails Girls Zürich Keynote
gr2m
94
13k
How to train your dragon (web standard)
notwaldorf
89
5.8k
Designing for humans not robots
tammielis
250
25k
Building Better People: How to give real-time feedback that sticks.
wjessup
366
19k
Easily Structure & Communicate Ideas using Wireframe
afnizarnur
192
16k
Transcript
Kafka The Hard Parts Chris Keathley / @ChrisKeathley / keathley.io
Kafka is great
Kafka is just a log
https://flic.kr/p/9aXr88
https://flic.kr/p/9aXr88 Kafka
Kafka https://flic.kr/p/9aXr88 (metaphor)
Log aggregation Analytics and activity tracking Queuing ETL Messaging Stream
Processing Kafka Uses
Event Sourcing
Log aggregation Analytics and activity tracking Queuing ETL Messaging Stream
Processing Kafka Uses
https://flic.kr/p/hrrbVx
https://flic.kr/p/hrrbVx (still a metaphor) Kafka
Large consequences for failure
Joke about mr. glass
Joke about mr. glass
Iteration Is Hard
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
Topic
Topic
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Written to the File system
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Messages are ordered
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Consumer
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Consumer
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Consumer
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Consumer
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Consumer
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Consumer Consumer
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Consumer Consumer
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Consumer Consumer
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Consumer Consumer
Topic
Topic Topic Topic Topic Broker
Broker Broker Broker
None
Replication Leader
Clients Java Client librdkafka
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
Order is important User Events
Order is important User Events
Order is important Follow
Order is important Follow
Order is important Follow Message
Order is important Follow Message Unfollow
Order is important Follow Message Unfollow Causal
Order is important Follow Message Unfollow Consumer
Order is important Follow Message Unfollow Consumer
Order is important Follow Message Unfollow Consumer
Order is important Follow Message Unfollow
Order is important Follow Message Unfollow Consumer
Order is important Follow Message Unfollow Consumer
Order is important Follow Message Unfollow Consumer
Group records based on order
Partitioner to_int(hash(key)) % partitions
Partitioner to_int(hash(user_id)) % partitions
Follow Message Unfollow Grouping Consumers
Follow Message Unfollow Causal Grouping Consumers
Follow Message Unfollow Grouping Consumers Follow Processor Message Processor
Follow Message Unfollow Grouping Consumers Follow Processor Message Processor
Follow Message Unfollow Grouping Consumers Follow Processor Message Processor
Follow Message Unfollow Grouping Consumers User event processor
Follow Message Unfollow Grouping Consumers User event processor
Follow Message Unfollow Grouping Consumers User event processor
User Events Create pipelines User event processor Messages
User Events Create pipelines User event processor Messages Consumes
User Events Create pipelines User event processor Messages Consumes Produces
"Commander: Better Distributed Applications through CQRS and Event Sourcing" by
Bobby Calderwood https://youtu.be/B1-gS0oEtYc
The less dependence you can have between consumers the better
Random partitioning is best if you can avoid ordering
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
Errors have the potential to wreck your day
Consumer Errors
Consumer Errors
Consumer Errors
Consumer Errors
Consumer Errors Blocking the head of the line
Consumer What should we do? Errors
Non-Blocking vs. Blocking
Non-Blocking vs. Blocking
Non-Blocking Errors Consumer 42 1337 “Robert’);drop table students;—”
Non-Blocking Errors Consumer 42 1337 “Robert’);drop table students;—” What do
we do?
Non-Blocking Errors Consumer
Non-Blocking Errors Consumer
Non-Blocking Errors Consumer Error Topic
Non-Blocking Errors Consumer
Non-Blocking Errors Consumer
Non-Blocking Errors Consumer
Non-Blocking vs. Blocking
Non-Blocking vs. Blocking
Blocking Errors Database Consumer
Blocking Errors Database Consumer Process messages Store Information
Blocking Errors Database Consumer
Blocking Errors Database Consumer
Blocking Errors Database Consumer What do we do?
Blocking Errors Database Consumer Retry
Blocking Errors Database Consumer Send alerts
Skip non-blocking errors & Retry blocking errors
Design errors out of existence
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
Delivery Guarantees
Computer A Communication is hard Computer B What time is
it?
Computer A Communication is hard Computer B
Computer A Communication is hard Computer B Did you get
it?
Computer A Communication is hard Computer B How about now?
Computer A Communication is hard Computer B Now?
0 <= 1 <= n Delivery At least once At
most once Impossible-ish
Consumers should *ALWAYS* assume “At Least Once”
The Joys of Functional Programming
None
You
You Functional Programming
Immutability and Idempotence
Immutability: An immutable object is an object whose state cannot
be modified after it is created.
Idempotence: …the property of certain operations in mathematics and computer
science whereby they can be applied multiple times without changing the result beyond the initial application.
Idempotence: Execute the same operation more than once but only
see the effect once.
Idempotent Operations
Counting comments comment comment comment increment 1
Counting comments comment comment comment increment 1
Counting comments comment comment comment increment 2
Counting comments comment comment comment increment 2
Counting comments comment comment comment increment 3
Counting comments comment comment comment increment 3 Some Error
Counting comments comment comment comment increment 3
Counting comments comment comment comment increment 3
Counting comments comment comment comment increment 4
Counting comments comment comment comment increment 4
Counting comments comment comment comment increment 5
Counting comments comment comment comment increment 5
Counting comments comment comment comment increment 6
Kafka Record { data: {}, type: “comment.created”, }
Kafka Record { data: {}, type: “comment.created”, msg_id: UUIDv4 }
Kafka Record { data: {}, type: “comment.created”, msg_id: UUIDv4 }
Used for managing idempotence
Counting comments comment comment comment increment 1
Counting comments comment comment comment Set.add(id) id: 1 id: 2
id: 3 (1)
Counting comments comment comment comment Set.add(id) id: 1 id: 2
id: 3 (1)
Counting comments comment comment comment Set.add(id) id: 1 id: 2
id: 3 (1, 2)
Counting comments comment comment comment Set.add(id) id: 1 id: 2
id: 3 (1, 2)
Counting comments comment comment comment Set.add(id) id: 1 id: 2
id: 3 (1, 2, 3)
Counting comments comment comment comment Set.add(id) id: 1 id: 2
id: 3 (1, 2, 3) Some Error
Counting comments comment comment comment id: 1 id: 2 id:
3 Set.add(id) (1, 2, 3)
Counting comments comment comment comment id: 1 id: 2 id:
3 Set.add(id) (1, 2, 3)
Counting comments comment comment comment id: 1 id: 2 id:
3 Set.add(id) (1, 2, 3)
Counting comments (1, 2, 3)
Counting comments cardinality(1, 2, 3)
Counting comments cardinality(1, 2, 3) => 3
Idempotent Side-Effects
smtp send_email Sending Emails email id: 1 email id: 2
email id: 3
smtp send_email Sending Emails email id: 1 email id: 2
email id: 3 What do we do if this fails?
smtp send_email Sending Emails email id: 1 email id: 2
email id: 3 Send at most once
smtp send_email Sending Emails email id: 1
Cache send_email Sending Emails email id: 1 smtp
Cache send_email Sending Emails email id: 1 smtp id?(1)
Cache send_email Sending Emails email id: 1 smtp id?(1) If
id exists then skip it
Cache send_email Sending Emails email id: 1 smtp
Cache send_email Sending Emails email id: 1 smtp add(1)
Cache send_email Sending Emails email id: 1 smtp
Cache send_email Sending Emails email id: 1 smtp
Cache send_email Sending Emails email id: 1 smtp
send_email Sending Emails email id: 1
send_email Sending Emails email id: 1 If we see this
message again move it to an audit topic
send_email Sending Emails If we see this message again move
it to an audit topic email id: 1
send_email Sending Emails
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
User Events Teams User event processor Messages Notifications Notifications Notification
Sender
User Events Teams User event processor Messages Notifications Notifications Notification
Sender Teams
Data is the language of the system
{ msg_id: "8700635f-1802-417e-89e7-595ad3600104", type: "comment.created", data: { user_id: 1234, msg:
"This is a super fun conference!" } } Data payloads
{ msg_id: String, type: String, data: { user_id: Integer, msg:
String } } Data payloads
{ msg_id: String, type: String, data: { user_id: Integer, msg:
String } } Data payloads None of this tells you anything useful about your data
{ msg_id: String, type: String, data: { user_id: Integer, msg:
String } } Data payloads What do we do when these things change?
{ msg_id: String, type: String, data: { user_id: String, msg:
String } } Data payloads What do we do when these things change?
{ msg_id: String, type: String, data: { user_id: String, msg:
String } } Data payloads Lets just use versions!
{ msg_id: String, type: String, data: { user_id: String, msg:
String } } Data payloads Lets just use versions! (spoiler: this isn’t great)
{ msg_id: String, type: String, data: { user_id: String, msg:
String } } Data payloads
{ msg_id: String, type: String, data: { user_id: String, msg:
String }, meta: { version: 2 } } Data payloads
Data Versions Consumer v1 v1 v1 v1 v2
Data Versions Consumer v1 v1 v1 v1 v2 This consumer
needs to understand both versions
Data Versions Consumer v1 v1 v1 v1 v2 This team
needs to know to make these changes
Versioning is broken
(sem)Versioning is broken
Change Growth Breakage
Change Growth Breakage Never do this
Growing schemas should be the default
{ msg_id: String, type: String, data: { user_id: String, msg:
String } } Data payloads
{ msg_id: String, type: String, data: { user_id: Integer, msg:
String } } Data payloads What are these?
Dependent Types
{ msg_id: String, type: String, data: { user_id: Integer, msg:
String } } Data payloads What are these?
Norm
{ msg_id: String, type: String, data: { user_id: String, msg:
String } } Data payloads
UUID = string? & re_matches?(/^[0-9A-F]{8}-[0-9A-F] {4}-4[0-9A-F]{3}-[89AB][0-9A-F]{3}-[0-9A-F]{12}$/i) ) CommentCreated = schema{
req :msg_id, UUID req :type, lit(“comment.created”) req :data, schema { req :user_id, integer? | UUID req :msg, string? } } Data payloads
json = {type: “comment.created”, msg: “Hello world”} Norm.decode(CommentEvent, json) =>
{:ok, data} Norm.decode(CommentEvent, {}) => {:error, errors} Norm.explain(CommentEvent, {}) => "In :msg_id, val: {} fails spec: required In :type, val: {} fails spec: required In :data, val: {} fails spec: required" Data payloads
Norm is built for extensibility
CommentEvent = schema{ req :type, lit(“comment.created”) req :msg, string? }
json = { type: “comment.created”, msg: “Hello world”, data: { msg: “Hello world” } } Norm.decode(CommentEvent, json) => {:ok, data} Norm is extensible
CommentEvent = schema{ req :type, lit(“comment.created”) req :msg, string? }
json = { type: “comment.created”, msg: “Hello world”, data: { msg: “Hello world” } } Norm.decode(CommentEvent, json) => {:ok, data} Norm is extensible This will still get passed through
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
Property Based Testing
Property based testing Database Consumer
Property based testing Database Consumer id: 1 id: 2 id:
3 id: 1
Property based testing Database Consumer id: 1 id: 2 id:
3 id: 1 Information should end up here
Property based testing Database Consumer id: 1 id: 2 id:
3 id: 1 Some combination of these messages causes a failure
Property based testing Database id: 1 id: 1 Consumer
Property based testing Database id: 1 id: 1 Looks like
we aren’t handling duplicates correctly Consumer
Property based testing Database id: 1 id: 1 Consumer
Property based testing Database Consumer id: 1 id: 1 Deterministically
fail this connection
Chaos Engineering
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
Monitoring vs. Observability
Monitoring: Figuring out that there’s a problem
Observability: Determining what the problem is.
Goal: Detect lagging or blocked consumers
Wisen
Wisen User Events User Consumer
metadata topic Wisen User Events Checkpoints its position in the
log to an offset topic User Consumer
Wisen metadata topic Wisen User Consumer User Events
Wisen metadata topic Wisen User Consumer User Events Compares farthest
offset from checkpoints over a time-window
Wisen user_consumer_errors Wisen User Consumer User Events
Wisen user_consumer_errors Wisen User Consumer User Events
Wisen user_consumer_errors Wisen User Consumer User Events Alert if we
see a rise in errors
Other useful metrics: Median and Tail latencies Internal buffers DB/Cache/RPC
latencies
OpenTracing
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Monitoring Capacity Planning #hottakes
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Monitoring Capacity Planning #hottakes
This has to be done up-front
Calculating partions messages in the system = arrival rate *
mean time in system
Calculating partions Desired throughput / measured throughput on one partition
=> partitions needed
Calculating partions partitions < 100 x brokers x replication factor
source: https://www.confluent.io/blog/how-choose-number-topics-partitions-kafka-cluster
Increasing partitions is tricky if you rely on ordering
to_int(hash(user_id)) % partitions
to_int(hash(user_id)) % partitions Existing data is not reshuffled if partitions
are increased
Data is not forever.
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Monitoring Capacity Planning #hottakes
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Monitoring Capacity Planning #hottakes
CQRS & Event Sourcing
Don’t rush to democratize your data
Embrace data and design
Go forth and build awesome stuff!
Thanks Chris Keathley / @ChrisKeathley / keathley.io