Adopting Kafka for the #1 job site* in the world
Yusuke Miyazaki
Staff Site Reliability Engineer, Indeed
* ComScore, Total Visits, September 2021
Slide 2
Who am I?
● Yusuke Miyazaki (宮﨑 勇輔) / @ymyzk
● SRE at Indeed since 2018
○ Embedded in the Job Search Backend team
○ SRE lead
○ Python Guild
● Indeed is a bronze sponsor of
SRE NEXT 2022
Slide 3
Agenda
● Background of the problem
● Problem we solved
● Why and how we adopted Kafka
● Conclusion
Slide 4
Slide 5
Indeed provides job search in 60+ countries
We have data centers in multiple countries across multiple continents to serve traffic from users.
https://jp.indeed.com/worldwide
Slide 6
How Does Job Search Work?
We need to distribute job data between DCs quickly and reliably:
aggregate and enrich job data → distribute data to all DCs → store and provide job data → job search apps and websites
Slide 7
Old Way: Record Log
Indeed had been using a “record log” to store and distribute streams of jobs.
● An “append-only” data structure
● A record log is a set of segment files
● Each segment file contains multiple job records
● We can “rsync” files from the producer to consumers to distribute jobs within a DC and between DCs
See the 2013 blog post and the implementation on GitHub for more details.
0101.seg
0102.seg
0103.seg
● Job A in JP
● Job B in US
● Job C in IE
⋮
⋮
https://engineering.indeedblog.com/blog/2013/10/serving-over-1-billion-documents-per-day-with-docstore-v2/
https://github.com/indeedeng/lsmtree/tree/master/recordlog
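The record log idea can be illustrated with a minimal sketch: an append-only log split into numbered segment files, each holding multiple job records, rolled over once a segment grows past a size threshold. This is an illustrative toy, not Indeed's actual implementation (see the linked lsmtree repository for that):

```python
import json
import os

class RecordLog:
    """Minimal append-only record log: records are appended to numbered
    segment files, and a new segment is started once the current one
    exceeds a size threshold (illustrative, not Indeed's implementation)."""

    def __init__(self, directory, max_segment_bytes=1024):
        self.directory = directory
        self.max_segment_bytes = max_segment_bytes
        os.makedirs(directory, exist_ok=True)
        self.segment_id = 0

    def _segment_path(self, segment_id):
        return os.path.join(self.directory, f"{segment_id:04d}.seg")

    def append(self, record):
        path = self._segment_path(self.segment_id)
        # Roll to a new segment once the current file grows too large.
        if os.path.exists(path) and os.path.getsize(path) >= self.max_segment_bytes:
            self.segment_id += 1
            path = self._segment_path(self.segment_id)
        with open(path, "a") as f:
            f.write(json.dumps(record) + "\n")  # append-only: never rewrite

    def read_all(self):
        """Replay every record in order by reading segments in filename order."""
        records = []
        for name in sorted(os.listdir(self.directory)):
            if name.endswith(".seg"):
                with open(os.path.join(self.directory, name)) as f:
                    records.extend(json.loads(line) for line in f)
        return records
```

Because the segments are plain files, a consumer that has rsynced the directory can replay the whole stream with `read_all()` — which is also why every consumer needs enough disk to hold the full stream.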
Slide 8
Record log had been working well
for many years
Slide 9
As the product grows,
we started to see different issues
Slide 10
Reliability
Challenges
● Difficulty “containerizing” apps that depend on a record log stored on persistent storage
● Extra storage capacity needed on each consumer to store more data
● Difficulty failing over the producer without complex human intervention
● Replication instability with “rsync” caused by cross-DC network instability and limited bandwidth
Slide 11
Product
Requirements
The product side was also looking into scaling the product further
● Process more job updates
○ Process a larger number of jobs
○ Update jobs more frequently
● Store more metadata for each job
We need to handle more data in the record log
Slide 12
Difficulties
in Meeting
Requirements
It was getting difficult to store and distribute more data using the record log
● Each consumer needs TBs of storage to store a stream of jobs
● Replication between DCs becomes more unstable and causes more alerts
Slide 13
As a cross-functional team,
we decided to replace the record log
to solve both reliability and product
challenges.
Slide 14
New Way:
Apache Kafka
Slide 15
“
Apache Kafka is an open-source distributed event
streaming platform used by thousands of companies for
high-performance data pipelines, streaming analytics,
data integration, and mission-critical applications.
Apache Kafka
https://kafka.apache.org/
Slide 16
Producers and consumers are fully decoupled and agnostic of each other
Kafka at a Glance
Producers Brokers Consumers
Topic 1
Topic 2
Topic 3
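The decoupling pictured above can be sketched in pure Python: the broker keeps an ordered, append-only log per topic, producers only append, and each consumer tracks its own offset, so producers and consumers never interact directly. This is a toy model of Kafka's log abstraction, not the real protocol:

```python
from collections import defaultdict

class Broker:
    """Toy broker: one ordered, append-only log per topic
    (a simplified model of Kafka's log abstraction)."""

    def __init__(self):
        self.topics = defaultdict(list)

    def produce(self, topic, message):
        self.topics[topic].append(message)

    def fetch(self, topic, offset):
        # Return all messages at or after `offset`; consumers poll by
        # offset, so the broker never tracks or pushes to them.
        return self.topics[topic][offset:]

class Consumer:
    """Each consumer owns its offset, so independent consumers can read
    the same topic at different positions (e.g. one replaying from 0)."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.offset = 0

    def poll(self):
        messages = self.broker.fetch(self.topic, self.offset)
        self.offset += len(messages)
        return messages
```

Two consumers attached to the same topic each receive the full stream at their own pace — the property that lets multiple downstream apps (and replays) share one topic without coordinating with the producer.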
Slide 17
Why We Used Kafka
Kafka met our requirements well compared to other data stores such as RDBMSs, message queues, Pub/Sub services, etc.
● Streaming data
● Flexible consumption from topic
○ Run multiple consumers for
different purposes
○ Replay data
● Configurable retention
● Scalability
● Delivery semantics and ordering
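Retention in Kafka is configurable per topic. As a sketch, a topic could be set to keep messages for a week or up to a size cap per partition, whichever is hit first (the values below are illustrative, not Indeed's settings):

```properties
# Illustrative topic-level settings (not Indeed's actual values):
# keep messages for 7 days, capping each partition at ~1 TiB
retention.ms=604800000
retention.bytes=1099511627776
cleanup.policy=delete
```

This is what makes consumers able to replay recent data without each of them storing the full history on local disk.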
Slide 18
How We Migrated to Kafka
● Migrated to Kafka gradually
○ Make each migration step simpler
○ Minimize downtime
○ Control risk when something goes wrong
● Verified the migration carefully by writing a few tools to make sure we would not lose jobs
Slide 19
Architecture using Record Log
rsync
between DCs
Consumers
Record
Log (RL)
Producer
RL
RL
RL
Slide 20
Populate Kafka using Record Log and Verify
rsync
between DCs
Consumers
Record
Log (RL)
Kafka
Producer
Producer to
populate Kafka
Consumer to verify
ordering of messages in
Kafka and performance
RL
RL
RL
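The verification consumer's job can be sketched as a stream comparison: read the same messages from the record log and from Kafka, and check that every job arrived with its per-key ordering intact (Kafka only guarantees order within a partition, so jobs keyed consistently keep their relative order per key, not globally). The actual tools Indeed wrote are not public; this is a simplified sketch of the kind of check involved:

```python
from collections import defaultdict

def per_key_sequences(stream):
    """Group a stream of (key, value) messages into per-key ordered lists."""
    sequences = defaultdict(list)
    for key, value in stream:
        sequences[key].append(value)
    return sequences

def verify(record_log_stream, kafka_stream):
    """True iff every job from the record log reached Kafka and the
    updates for each job key appear in the same order in both streams."""
    return per_key_sequences(record_log_stream) == per_key_sequences(kafka_stream)
```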
Slide 21
Replicate Kafka Topics between DCs
rsync
between DCs
Consumers
Record
Log (RL)
Producer
MirrorMaker to replicate
topics between DCs
RL
RL
RL
Kafka Kafka
Each DC has its own
Kafka brokers
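A MirrorMaker 2 setup for replicating topics from one DC's brokers to another's might look like the fragment below. Cluster aliases, hostnames, and the topic pattern are hypothetical placeholders, not Indeed's configuration:

```properties
# Illustrative MirrorMaker 2 config; cluster aliases and topic names are hypothetical
clusters = dc1, dc2
dc1.bootstrap.servers = kafka.dc1.example.com:9092
dc2.bootstrap.servers = kafka.dc2.example.com:9092

# Replicate job topics one way, from dc1 to dc2
dc1->dc2.enabled = true
dc1->dc2.topics = jobs.*
```

Replication then runs over Kafka's own protocol rather than rsync, which is what removes the cross-DC file-transfer instability described earlier.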
Slide 22
Verify Kafka topic again
rsync
between DCs
Consumers
Record
Log (RL)
Producer
Compare
Record Log and Kafka
RL
RL
RL
Kafka Kafka
Slide 23
Migrate Consumers One by One
rsync
between DCs
Record
Log (RL)
Kafka
Producer
RL
RL
Kafka
Consumers
Migrate
one by one
Slide 24
After Migrating All Consumers
Record
Log (RL)
Kafka
Producer
Kafka
Consumers
Slide 25
Decommission Record Log
Kafka
Producer
Kafka
Consumers
Slide 26
After
the migration
to Kafka
● Freed up the persistent storage each consumer used for the record log (hundreds of TB in total)
● Consumers are “containerized” and migrated to Kubernetes
● The producer can fail over with minimal intervention
Slide 27
After
the migration
to Kafka
(cont’d)
● Better compression using zstd
provided by Kafka instead of Snappy
(20–50% reduction)
● Saved network bandwidth for
replication between DCs
(~65% reduction)
● Limited performance degradation compared to record log (~20% slowdown on one consumer)
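The compression gain comes from a producer-side setting; switching from Snappy to zstd is a one-line configuration change (zstd support requires Kafka 2.1+ on brokers and consumers):

```properties
# Producer configuration: compress record batches with zstd instead of snappy
# (zstd requires Kafka 2.1+ on brokers and consumers)
compression.type=zstd
```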
Slide 28
Conclusion
● Identified and solved the problem
from both reliability and product
perspectives
● Adopted Kafka to store and distribute
job updates for Indeed job search
● Kafka can be a powerful solution when the use case is well-suited to it
● Working on the project as one cross-functional team, not just SRE, was a key to success
Slide 29
We’re hiring!!
● We have many interesting reliability
and product problems to solve
● Open positions in SRE, SWE, TDM,
PdM, etc.
● https://jp.indeed.jobs/
● Feel free to reach out to me for
questions!
Thank you for listening!