Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Kafka, the hard parts
Search
Sponsored
·
Ship Features Fearlessly
Turn features on and off without deploys. Used by thousands of Ruby developers.
→
Chris Keathley
January 10, 2019
Programming
3
1.9k
Kafka, the hard parts
This talk tries to summarize a lot of the lessons I've learned building systems on kafka.
Chris Keathley
January 10, 2019
Tweet
Share
More Decks by Chris Keathley
See All by Chris Keathley
Solid code isn't flexible
keathley
5
1.1k
Building Adaptive Systems
keathley
44
2.9k
Contracts for building reliable systems
keathley
6
1.1k
Building Resilient Elixir Systems
keathley
7
2.5k
Consistent, Distributed Elixir
keathley
6
1.6k
Telling stories with data visualization
keathley
1
680
Easing into continuous deployment
keathley
2
420
Leveling up your git skills
keathley
0
820
Generative Testing in Elixir
keathley
0
580
Other Decks in Programming
See All in Programming
Basic Architectures
denyspoltorak
0
680
副作用をどこに置くか問題:オブジェクト指向で整理する設計判断ツリー
koxya
1
610
今から始めるClaude Code超入門
448jp
8
9k
16年目のピクシブ百科事典を支える最新の技術基盤 / The Modern Tech Stack Powering Pixiv Encyclopedia in its 16th Year
ahuglajbclajep
5
1k
FOSDEM 2026: STUNMESH-go: Building P2P WireGuard Mesh Without Self-Hosted Infrastructure
tjjh89017
0
170
AIによるイベントストーミング図からのコード生成 / AI-powered code generation from Event Storming diagrams
nrslib
2
1.9k
Lambda のコードストレージ容量に気をつけましょう
tattwan718
0
140
AgentCoreとHuman in the Loop
har1101
5
240
CSC307 Lecture 05
javiergs
PRO
0
500
Amazon Bedrockを活用したRAGの品質管理パイプライン構築
tosuri13
5
780
生成AIを使ったコードレビューで定性的に品質カバー
chiilog
1
270
それ、本当に安全? ファイルアップロードで見落としがちなセキュリティリスクと対策
penpeen
7
3.9k
Featured
See All Featured
How GitHub (no longer) Works
holman
316
140k
Leading Effective Engineering Teams in the AI Era
addyosmani
9
1.6k
Deep Space Network (abreviated)
tonyrice
0
64
How People are Using Generative and Agentic AI to Supercharge Their Products, Projects, Services and Value Streams Today
helenjbeal
1
130
Designing Powerful Visuals for Engaging Learning
tmiket
0
240
Design and Strategy: How to Deal with People Who Don’t "Get" Design
morganepeng
133
19k
Context Engineering - Making Every Token Count
addyosmani
9
660
What does AI have to do with Human Rights?
axbom
PRO
0
2k
New Earth Scene 8
popppiees
1
1.5k
Product Roadmaps are Hard
iamctodd
PRO
55
12k
Why Mistakes Are the Best Teachers: Turning Failure into a Pathway for Growth
auna
0
54
Dominate Local Search Results - an insider guide to GBP, reviews, and Local SEO
greggifford
PRO
0
78
Transcript
Kafka The Hard Parts Chris Keathley / @ChrisKeathley / keathley.io
Kafka is great
Kafka is just a log
https://flic.kr/p/9aXr88
https://flic.kr/p/9aXr88 Kafka
Kafka https://flic.kr/p/9aXr88 (metaphor)
Log aggregation Analytics and activity tracking Queuing ETL Messaging Stream
Processing Kafka Uses
Event Sourcing
Log aggregation Analytics and activity tracking Queuing ETL Messaging Stream
Processing Kafka Uses
https://flic.kr/p/hrrbVx
https://flic.kr/p/hrrbVx (still a metaphor) Kafka
Large consequences for failure
Joke about mr. glass
Joke about mr. glass
Iteration Is Hard
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
Topic
Topic
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Written to the File system
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Messages are ordered
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Consumer
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Consumer
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Consumer
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Consumer
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Consumer
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Consumer Consumer
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Consumer Consumer
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Consumer Consumer
Partition 1 Partition 2 Partition 3 Partition 4 Partition 5
Consumer Consumer
Topic
Topic Topic Topic Topic Broker
Broker Broker Broker
None
Replication Leader
Clients Java Client librdkafka
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
Order is important User Events
Order is important User Events
Order is important Follow
Order is important Follow
Order is important Follow Message
Order is important Follow Message Unfollow
Order is important Follow Message Unfollow Causal
Order is important Follow Message Unfollow Consumer
Order is important Follow Message Unfollow Consumer
Order is important Follow Message Unfollow Consumer
Order is important Follow Message Unfollow
Order is important Follow Message Unfollow Consumer
Order is important Follow Message Unfollow Consumer
Order is important Follow Message Unfollow Consumer
Group records based on order
Partitioner to_int(hash(key)) % partitions
Partitioner to_int(hash(user_id)) % partitions
Follow Message Unfollow Grouping Consumers
Follow Message Unfollow Causal Grouping Consumers
Follow Message Unfollow Grouping Consumers Follow Processor Message Processor
Follow Message Unfollow Grouping Consumers Follow Processor Message Processor
Follow Message Unfollow Grouping Consumers Follow Processor Message Processor
Follow Message Unfollow Grouping Consumers User event processor
Follow Message Unfollow Grouping Consumers User event processor
Follow Message Unfollow Grouping Consumers User event processor
User Events Create pipelines User event processor Messages
User Events Create pipelines User event processor Messages Consumes
User Events Create pipelines User event processor Messages Consumes Produces
"Commander: Better Distributed Applications through CQRS and Event Sourcing" by
Bobby Calderwood https://youtu.be/B1-gS0oEtYc
The less dependence you can have between consumers the better
Random partitioning is best if you can avoid ordering
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
Errors have the potential to wreck your day
Consumer Errors
Consumer Errors
Consumer Errors
Consumer Errors
Consumer Errors Blocking the head of the line
Consumer What should we do? Errors
Non-Blocking vs. Blocking
Non-Blocking vs. Blocking
Non-Blocking Errors Consumer 42 1337 “Robert’);drop table students;—”
Non-Blocking Errors Consumer 42 1337 “Robert’);drop table students;—” What do
we do?
Non-Blocking Errors Consumer
Non-Blocking Errors Consumer
Non-Blocking Errors Consumer Error Topic
Non-Blocking Errors Consumer
Non-Blocking Errors Consumer
Non-Blocking Errors Consumer
Non-Blocking vs. Blocking
Non-Blocking vs. Blocking
Blocking Errors Database Consumer
Blocking Errors Database Consumer Process messages Store Information
Blocking Errors Database Consumer
Blocking Errors Database Consumer
Blocking Errors Database Consumer What do we do?
Blocking Errors Database Consumer Retry
Blocking Errors Database Consumer Send alerts
Skip non-blocking errors & Retry blocking errors
Design errors out of existence
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
Delivery Guarantees
Computer A Communication is hard Computer B What time is
it?
Computer A Communication is hard Computer B
Computer A Communication is hard Computer B Did you get
it?
Computer A Communication is hard Computer B How about now?
Computer A Communication is hard Computer B Now?
0 <= 1 <= n Delivery At least once At
most once Impossible-ish
Consumers should *ALWAYS* assume “At Least Once”
The Joys of Functional Programming
None
You
You Functional Programming
Immutability and Idempotence
Immutability: An immutable object is an object whose state cannot
be modified after it is created.
Idempotence: …the property of certain operations in mathematics and computer
science whereby they can be applied multiple times without changing the result beyond the initial application.
Idempotence: Execute the same operation more than once but only
see the effect once.
Idempotent Operations
Counting comments comment comment comment increment 1
Counting comments comment comment comment increment 1
Counting comments comment comment comment increment 2
Counting comments comment comment comment increment 2
Counting comments comment comment comment increment 3
Counting comments comment comment comment increment 3 Some Error
Counting comments comment comment comment increment 3
Counting comments comment comment comment increment 3
Counting comments comment comment comment increment 4
Counting comments comment comment comment increment 4
Counting comments comment comment comment increment 5
Counting comments comment comment comment increment 5
Counting comments comment comment comment increment 6
Kafka Record { data: {}, type: “comment.created”, }
Kafka Record { data: {}, type: “comment.created”, msg_id: UUIDv4 }
Kafka Record { data: {}, type: “comment.created”, msg_id: UUIDv4 }
Used for managing idempotence
Counting comments comment comment comment increment 1
Counting comments comment comment comment Set.add(id) id: 1 id: 2
id: 3 (1)
Counting comments comment comment comment Set.add(id) id: 1 id: 2
id: 3 (1)
Counting comments comment comment comment Set.add(id) id: 1 id: 2
id: 3 (1, 2)
Counting comments comment comment comment Set.add(id) id: 1 id: 2
id: 3 (1, 2)
Counting comments comment comment comment Set.add(id) id: 1 id: 2
id: 3 (1, 2, 3)
Counting comments comment comment comment Set.add(id) id: 1 id: 2
id: 3 (1, 2, 3) Some Error
Counting comments comment comment comment id: 1 id: 2 id:
3 Set.add(id) (1, 2, 3)
Counting comments comment comment comment id: 1 id: 2 id:
3 Set.add(id) (1, 2, 3)
Counting comments comment comment comment id: 1 id: 2 id:
3 Set.add(id) (1, 2, 3)
Counting comments (1, 2, 3)
Counting comments cardinality(1, 2, 3)
Counting comments cardinality(1, 2, 3) => 3
Idempotent Side-Effects
smtp send_email Sending Emails email id: 1 email id: 2
email id: 3
smtp send_email Sending Emails email id: 1 email id: 2
email id: 3 What do we do if this fails?
smtp send_email Sending Emails email id: 1 email id: 2
email id: 3 Send at most once
smtp send_email Sending Emails email id: 1
Cache send_email Sending Emails email id: 1 smtp
Cache send_email Sending Emails email id: 1 smtp id?(1)
Cache send_email Sending Emails email id: 1 smtp id?(1) If
id exists then skip it
Cache send_email Sending Emails email id: 1 smtp
Cache send_email Sending Emails email id: 1 smtp add(1)
Cache send_email Sending Emails email id: 1 smtp
Cache send_email Sending Emails email id: 1 smtp
Cache send_email Sending Emails email id: 1 smtp
send_email Sending Emails email id: 1
send_email Sending Emails email id: 1 If we see this
message again move it to an audit topic
send_email Sending Emails If we see this message again move
it to an audit topic email id: 1
send_email Sending Emails
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
User Events Teams User event processor Messages Notifications Notifications Notification
Sender
User Events Teams User event processor Messages Notifications Notifications Notification
Sender Teams
Data is the language of the system
{ msg_id: "8700635f-1802-417e-89e7-595ad3600104", type: "comment.created", data: { user_id: 1234, msg:
"This is a super fun conference!" } } Data payloads
{ msg_id: String, type: String, data: { user_id: Integer, msg:
String } } Data payloads
{ msg_id: String, type: String, data: { user_id: Integer, msg:
String } } Data payloads None of this tells you anything useful about your data
{ msg_id: String, type: String, data: { user_id: Integer, msg:
String } } Data payloads What do we do when these things change?
{ msg_id: String, type: String, data: { user_id: String, msg:
String } } Data payloads What do we do when these things change?
{ msg_id: String, type: String, data: { user_id: String, msg:
String } } Data payloads Lets just use versions!
{ msg_id: String, type: String, data: { user_id: String, msg:
String } } Data payloads Lets just use versions! (spoiler: this isn’t great)
{ msg_id: String, type: String, data: { user_id: String, msg:
String } } Data payloads
{ msg_id: String, type: String, data: { user_id: String, msg:
String }, meta: { version: 2 } } Data payloads
Data Versions Consumer v1 v1 v1 v1 v2
Data Versions Consumer v1 v1 v1 v1 v2 This consumer
needs to understand both versions
Data Versions Consumer v1 v1 v1 v1 v2 This team
needs to know to make these changes
Versioning is broken
(sem)Versioning is broken
Change Growth Breakage
Change Growth Breakage Never do this
Growing schemas should be the default
{ msg_id: String, type: String, data: { user_id: String, msg:
String } } Data payloads
{ msg_id: String, type: String, data: { user_id: Integer, msg:
String } } Data payloads What are these?
Dependent Types
{ msg_id: String, type: String, data: { user_id: Integer, msg:
String } } Data payloads What are these?
Norm
{ msg_id: String, type: String, data: { user_id: String, msg:
String } } Data payloads
UUID = string? & re_matches?(/^[0-9A-F]{8}-[0-9A-F] {4}-4[0-9A-F]{3}-[89AB][0-9A-F]{3}-[0-9A-F]{12}$/i) ) CommentCreated = schema{
req :msg_id, UUID req :type, lit(“comment.created”) req :data, schema { req :user_id, integer? | UUID req :msg, string? } } Data payloads
json = {type: “comment.created”, msg: “Hello world”} Norm.decode(CommentEvent, json) =>
{:ok, data} Norm.decode(CommentEvent, {}) => {:error, errors} Norm.explain(CommentEvent, {}) => "In :msg_id, val: {} fails spec: required In :type, val: {} fails spec: required In :data, val: {} fails spec: required" Data payloads
Norm is built for extensibility
CommentEvent = schema{ req :type, lit(“comment.created”) req :msg, string? }
json = { type: “comment.created”, msg: “Hello world”, data: { msg: “Hello world” } } Norm.decode(CommentEvent, json) => {:ok, data} Norm is extensible
CommentEvent = schema{ req :type, lit(“comment.created”) req :msg, string? }
json = { type: “comment.created”, msg: “Hello world”, data: { msg: “Hello world” } } Norm.decode(CommentEvent, json) => {:ok, data} Norm is extensible This will still get passed through
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
Property Based Testing
Property based testing Database Consumer
Property based testing Database Consumer id: 1 id: 2 id:
3 id: 1
Property based testing Database Consumer id: 1 id: 2 id:
3 id: 1 Information should end up here
Property based testing Database Consumer id: 1 id: 2 id:
3 id: 1 Some combination of these messages causes a failure
Property based testing Database id: 1 id: 1 Consumer
Property based testing Database id: 1 id: 1 Looks like
we aren’t handling duplicates correctly Consumer
Property based testing Database id: 1 id: 1 Consumer
Property based testing Database Consumer id: 1 id: 1 Deterministically
fail this connection
Chaos Engineering
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Finding Errors Monitoring Capacity Planning #hottakes
Monitoring vs. Observability
Monitoring: Figuring out that there’s a problem
Observability: Determining what the problem is.
Goal: Detect lagging or blocked consumers
Wisen
Wisen User Events User Consumer
metadata topic Wisen User Events Checkpoints its position in the
log to an offset topic User Consumer
Wisen metadata topic Wisen User Consumer User Events
Wisen metadata topic Wisen User Consumer User Events Compares farthest
offset from checkpoints over a time-window
Wisen user_consumer_errors Wisen User Consumer User Events
Wisen user_consumer_errors Wisen User Consumer User Events
Wisen user_consumer_errors Wisen User Consumer User Events Alert if we
see a rise in errors
Other useful metrics: Median and Tail latencies Internal buffers DB/Cache/RPC
latencies
OpenTracing
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Monitoring Capacity Planning #hottakes
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Monitoring Capacity Planning #hottakes
This has to be done up-front
Calculating partions messages in the system = arrival rate *
mean time in system
Calculating partions Desired throughput / measured throughput on one partition
=> partitions needed
Calculating partions partitions < 100 x brokers x replication factor
source: https://www.confluent.io/blog/how-choose-number-topics-partitions-kafka-cluster
Increasing partitions is tricky if you rely on ordering
to_int(hash(user_id)) % partitions
to_int(hash(user_id)) % partitions Existing data is not reshuffled if partitions
are increased
Data is not forever.
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Monitoring Capacity Planning #hottakes
Lets talk about… Kafka Terminology Maintaining Order Errors Distributed Systems
and the joys of functional programming Data Validation Monitoring Capacity Planning #hottakes
CQRS & Event Sourcing
Don’t rush to democratize your data
Embrace data and design
Go forth and build awesome stuff!
Thanks Chris Keathley / @ChrisKeathley / keathley.io