Adopting Kafka for the #1 job site in the world

Adopting Kafka for the #1 job site* in the world
Yusuke Miyazaki, Staff Site Reliability Engineer, Indeed
* ComScore, Total Visits, September 2021

Who am I?
• 宮﨑勇輔 (Yusuke Miyazaki) / @ymyzk
• SRE at Indeed since 2018
  ◦ Embedded into the Job Search Backend team
  ◦ SRE lead
  ◦ Python Guild
• Indeed is a bronze sponsor of SRE NEXT 2022

Agenda
• Background of the problem
• Problem we solved
• Why and how we adopted Kafka
• Conclusion

Indeed provides job search in 60+ countries
We have data centers in multiple countries on multiple continents to serve traffic from users.
https://jp.indeed.com/worldwide

How Job Search Works
We need to distribute job data between DCs quickly and reliably.
[Diagram labels: Aggregate; Enrich job data; Distribute data to all DCs; Store and provide job data; Job search apps and websites]

Old Way: Record Log
Indeed had been using “record log” to store and distribute streams of jobs.
• “Append-only” data structure
• Record log is a set of segment files
• Each segment file contains multiple job records
• We can “rsync” files from the producer to consumers to distribute jobs within a DC and between DCs
[Diagram: segment files 0101.seg, 0102.seg, 0103.seg, ⋮ with example records: Job A in JP, Job B in US, Job C in IE, ⋮]
Please see the 2013 blog post and the implementation on GitHub for more details.
https://engineering.indeedblog.com/blog/2013/10/serving-over-1-billion-documents-per-day-with-docstore-v2/
https://github.com/indeedeng/lsmtree/tree/master/recordlog
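The idea of an append-only log made of segment files can be sketched in a few lines. This is a toy model only, not Indeed's record log implementation (see the GitHub link above for that): the class name `ToyRecordLog`, the one-record-per-line file format, and the `records_per_segment` parameter are all assumptions made for illustration.

```python
import os
import tempfile


class ToyRecordLog:
    """Append-only log split into numbered segment files (illustrative only)."""

    def __init__(self, directory, records_per_segment=2):
        self.directory = directory
        self.records_per_segment = records_per_segment
        self.count = 0  # total number of records appended so far

    def _segment_path(self, index):
        # Zero-padded names keep lexical order equal to numeric order.
        return os.path.join(self.directory, f"{index:04d}.seg")

    def append(self, record):
        """Append one single-line record to the current segment file."""
        if "\n" in record:
            raise ValueError("records must be single-line in this sketch")
        segment = self.count // self.records_per_segment
        with open(self._segment_path(segment), "a", encoding="utf-8") as f:
            f.write(record + "\n")
        self.count += 1

    def segments(self):
        """Return segment file names in order."""
        return sorted(n for n in os.listdir(self.directory) if n.endswith(".seg"))

    def read_all(self):
        """Yield records in append order by reading segments in name order."""
        for name in self.segments():
            with open(os.path.join(self.directory, name), encoding="utf-8") as f:
                for line in f:
                    yield line.rstrip("\n")


def demo():
    with tempfile.TemporaryDirectory() as d:
        log = ToyRecordLog(d, records_per_segment=2)
        for job in ["Job A in JP", "Job B in US", "Job C in IE"]:
            log.append(job)
        return log.segments(), list(log.read_all())
```

In this sketch a full segment file is never written to again, which is the property that makes copying whole files (for example with rsync) a workable way to replicate the log.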

Record log had been working well for many years

As the product grew, we started to see various issues

Reliability Challenges
• Difficult to “containerize” apps that depend on record log stored in persistent storage
• Extra storage capacity needed on each consumer to store more data
• Difficult to fail over the producer without complex human intervention
• Replication instability with “rsync” caused by cross-DC network instability and bandwidth constraints

Product Requirements
The product side was also looking into scaling the product further
• Process more job updates
  ◦ Process a larger number of jobs
  ◦ Update jobs more frequently
• Store more metadata for each job
We need to handle more data in record log

Difficulties in Meeting Requirements
It was getting difficult to store and distribute more data using record log
• Each consumer needs TBs of storage to store a stream of jobs
• Replication between DCs becomes more unstable and causes more alerts

As a cross-functional team, we decided to replace the record
log to solve both reliability and product challenges.

New Way: Apache Kafka

“Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.”
Apache Kafka https://kafka.apache.org/

Kafka at a Glance
Producers and consumers are fully decoupled and agnostic of each other
[Diagram labels: Producers; Brokers (Topic 1, Topic 2, Topic 3); Consumers]

Why did we use Kafka?
Kafka met our requirements well compared to other data stores such as RDBMS, message queue, Pub/Sub, etc.
• Streaming data
• Flexible consumption from topic
  ◦ Run multiple consumers for different purposes
  ◦ Replay data
• Configurable retention
• Scalability
• Delivery semantics and ordering
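Flexible consumption and replay both follow from one property: each consumer tracks its own read position (offset) in an ordered, append-only log, so reading never removes data. The sketch below is a toy in-memory model of that idea, not Kafka client code; `ToyTopic` and `ToyConsumer` are hypothetical names, and the model ignores partitions, retention, and networking.

```python
class ToyTopic:
    """In-memory stand-in for one partition: an ordered, append-only list."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Store a message and return the offset it was written at."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=100):
        """Return up to max_messages starting at offset; nothing is deleted."""
        return self._messages[offset:offset + max_messages]


class ToyConsumer:
    """Each consumer keeps its own offset, independent of other consumers."""

    def __init__(self, topic, start_offset=0):
        self.topic = topic
        self.offset = start_offset

    def poll(self, max_messages=100):
        batch = self.topic.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        """Move the read position, e.g. back to 0 to replay everything."""
        self.offset = offset


topic = ToyTopic()
for job in ["Job A in JP", "Job B in US", "Job C in IE"]:
    topic.append(job)

indexer = ToyConsumer(topic)    # one consumer for one purpose
analytics = ToyConsumer(topic)  # another consumer, unaffected by the first

all_jobs = indexer.poll()                    # reads all three messages
first_only = analytics.poll(max_messages=1)  # reads at its own pace
indexer.seek(0)                              # replay from the beginning
replayed = indexer.poll(max_messages=2)
```

Because reads are non-destructive, adding a new consumer for a new purpose does not require any change on the producer side.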

How did we migrate to Kafka?
• Migrated to Kafka gradually
  ◦ Make each migration simpler
  ◦ Minimize downtime
  ◦ Control risk when something goes wrong
• Verified the migration carefully by writing a few tools to make sure we would not lose jobs

Architecture using Record Log
[Diagram labels: Producer; Record Log (RL); rsync between DCs; Consumers (RL × 3)]

Populate Kafka using Record Log and Verify
• Producer to populate Kafka
• Consumer to verify ordering of messages in Kafka and performance
[Diagram labels: Producer; Record Log (RL); rsync between DCs; Consumers (RL × 3); Kafka]

Replicate Kafka Topics between DCs
• MirrorMaker to replicate topics between DCs
• Each DC has its own Kafka brokers
[Diagram labels: Producer; Record Log (RL); rsync between DCs; Consumers (RL × 3); Kafka × 2]

Verify Kafka topic again
• Compare Record Log and Kafka
[Diagram labels: Producer; Record Log (RL); rsync between DCs; Consumers (RL × 3); Kafka × 2]
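The talk does not describe the verification tools themselves, so the following is only a sketch of what such a comparison could look like. It assumes each record can be reduced to a `(job_id, version)` pair and that ordering needs to hold per job; `compare_streams` and the report keys are hypothetical names.

```python
from collections import Counter, defaultdict


def per_job_versions(stream):
    """Group a stream of (job_id, version) pairs into per-job version lists."""
    versions = defaultdict(list)
    for job_id, version in stream:
        versions[job_id].append(version)
    return versions


def compare_streams(record_log, kafka):
    """Report records that are missing, unexpected, or reordered in Kafka."""
    expected = per_job_versions(record_log)
    actual = per_job_versions(kafka)
    report = {"missing": [], "unexpected": [], "reordered": []}
    for job_id in sorted(expected.keys() | actual.keys()):
        want = expected.get(job_id, [])
        got = actual.get(job_id, [])
        if want == got:
            continue  # same updates in the same order
        if sorted(want) == sorted(got):
            report["reordered"].append(job_id)  # same updates, different order
            continue
        want_count, got_count = Counter(want), Counter(got)
        report["missing"] += [
            (job_id, v) for v in sorted((want_count - got_count).elements())
        ]
        report["unexpected"] += [
            (job_id, v) for v in sorted((got_count - want_count).elements())
        ]
    return report


record_log = [("A", 1), ("B", 1), ("A", 2), ("C", 1)]
kafka_ok = [("A", 1), ("A", 2), ("B", 1), ("C", 1)]  # interleaving differs only
kafka_bad = [("A", 2), ("A", 1), ("B", 1)]           # A reordered, C lost

ok_report = compare_streams(record_log, kafka_ok)
bad_report = compare_streams(record_log, kafka_bad)
```

Checking order per job rather than globally is a design choice in this sketch: it tolerates different interleaving between jobs while still catching an older update overwriting a newer one.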

Migrate Consumers One by One
[Diagram labels: Producer; Record Log (RL); rsync between DCs; RL × 2; Kafka × 2; Consumers]

After Migrating All Consumers
[Diagram labels: Producer; Record Log (RL); Kafka × 2; Consumers]

Decommission Record Log
[Diagram labels: Producer; Kafka × 2; Consumers]

After the migration to Kafka
• Freed up the persistent storage that each consumer used to store record log (hundreds of TB in total)
• Consumers are “containerized” and migrated to Kubernetes
• Producer can fail over with minimal intervention

After the migration to Kafka (cont’d)
• Better compression using zstd provided by Kafka instead of Snappy (20–50% reduction)
• Saved network bandwidth for replication between DCs (~65% reduction)
• Limited performance degradation compared to record log (~20% slowdown on one consumer)
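For reference, producer-side compression in Kafka is selected with the standard `compression.type` setting, and zstd is available from Kafka 2.1 onward. The sketch below shows that setting as a plain Python dictionary, plus a small helper for how a reduction percentage is computed. The broker address is a placeholder and the byte counts are made up for illustration; neither comes from the talk.

```python
# Standard Kafka producer configuration keys; the values are placeholders.
producer_config = {
    "bootstrap.servers": "kafka.example.internal:9092",  # hypothetical address
    "compression.type": "zstd",  # codec applied by the producer to batches
}


def reduction_percent(before_bytes, after_bytes):
    """Size reduction of after_bytes relative to before_bytes, in percent."""
    return 100.0 * (before_bytes - after_bytes) / before_bytes


# Illustrative numbers only: 100 bytes before, 65 bytes after.
example = reduction_percent(100, 65)
```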

Conclusion
• Identified and solved the problem from both reliability and product perspectives
• Adopted Kafka to store and distribute job updates for Indeed job search
• Kafka can be a powerful solution if the use case is well suited
• Working on the project as one cross-functional team, not just SRE, was a key to success

We’re hiring!!
• We have many interesting reliability and product problems to solve
• Open positions in SRE, SWE, TDM, PdM, etc.
• https://jp.indeed.jobs/
• Feel free to reach out to me for questions!
Thank you for listening!
