Slide 1

Adopting Kafka for the #1 job site* in the world
Yusuke Miyazaki, Staff Site Reliability Engineer, Indeed
* ComScore, Total Visits, September 2021

Slide 2

Who am I?
● 宮﨑 勇輔 (Yusuke Miyazaki) / @ymyzk
● SRE at Indeed since 2018
  ○ Embedded into the Job Search Backend team
  ○ SRE lead
  ○ Python Guild
● Indeed is a bronze sponsor of SRE NEXT 2022

Slide 3

Agenda
● Background of the problem
● Problem we solved
● Why and how we adopted Kafka
● Conclusion

Slide 4

No content

Slide 5

Indeed provides job search in 60+ countries. We have data centers in multiple countries across multiple continents to serve traffic from users.
https://jp.indeed.com/worldwide

Slide 6

How Job Search Works
We need to distribute job data between DCs quickly and reliably.
Pipeline: aggregate and enrich job data → distribute data to all DCs → store and provide job data → job search apps and websites

Slide 7

Old Way: Record Log
Indeed had been using a "record log" to store and distribute streams of jobs.
● An "append-only" data structure
● A record log is a set of segment files (e.g. 0101.seg, 0102.seg, 0103.seg)
● Each segment file contains data for multiple jobs (e.g. Job A in JP, Job B in US, Job C in IE)
● We can "rsync" segment files from the producer to consumers to distribute jobs within a DC and between DCs
Please see the 2013 blog post and the implementation on GitHub for more details.
https://engineering.indeedblog.com/blog/2013/10/serving-over-1-billion-documents-per-day-with-docstore-v2/
https://github.com/indeedeng/lsmtree/tree/master/recordlog
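
To make the structure concrete, here is a minimal sketch of such an append-only segment writer. It is hypothetical Python (the segment size, naming, and record framing are illustrative assumptions); Indeed's actual implementation is the Java recordlog linked above.

```python
import os
import struct

# Illustrative constant: roll to a new segment file after ~64 MiB.
SEGMENT_SIZE = 64 * 1024 * 1024

class RecordLogWriter:
    """Appends length-prefixed records to numbered segment files."""

    def __init__(self, directory: str):
        self.directory = directory
        self.segment_id = 0
        self.segment = None
        self._roll()

    def _roll(self):
        # Closed segments are immutable, so consumers can rsync them
        # safely while the producer keeps appending to the newest one.
        if self.segment:
            self.segment.close()
        self.segment_id += 1
        path = os.path.join(self.directory, f"{self.segment_id:04d}.seg")
        self.segment = open(path, "ab")

    def append(self, record: bytes):
        if self.segment.tell() >= SEGMENT_SIZE:
            self._roll()
        # Append-only: a 4-byte big-endian length, then the record bytes.
        self.segment.write(struct.pack(">I", len(record)))
        self.segment.write(record)
        self.segment.flush()
```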

Slide 8

Record log had been working well for many years

Slide 9

As the product grew, we started to see new issues

Slide 10

Reliability Challenges
● Hard to "containerize" apps that depend on a record log kept in persistent storage
● Extra storage capacity needed on each consumer to store more data
● Hard to fail over the producer without complex human intervention
● Replication instability over "rsync", caused by cross-DC network instability and limited bandwidth

Slide 11

Product Requirements
The product side was also looking into scaling the product further:
● Process more job updates
  ○ Process a larger number of jobs
  ○ Update jobs more frequently
● Store more metadata for each job
We need to handle more data in the record log.

Slide 12

Difficulties in Meeting Requirements
It was getting difficult to store and distribute more data using the record log:
● Each consumer needs TBs of storage to hold the stream of jobs
● Replication between DCs becomes more unstable and causes more alerts

Slide 13

As a cross-functional team, we decided to replace the record log to solve both reliability and product challenges.

Slide 14

New Way: Apache Kafka

Slide 15

Apache Kafka
"Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications."
https://kafka.apache.org/

Slide 16

Kafka at a Glance
Producers → Brokers (Topic 1, Topic 2, Topic 3) → Consumers
Producers and consumers are fully decoupled and agnostic of each other.
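
As a rough illustration of that decoupling, a producer and a consumer only share a topic name and the broker addresses. A minimal sketch with the kafka-python client; the broker address, topic, and group names are placeholders, not values from the talk:

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer side: publish a record to a topic; it knows nothing about consumers.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("jobs", key=b"job-a", value=b"Job A in JP")
producer.flush()

# Consumer side: subscribe to the same topic; it knows nothing about producers,
# and other consumer groups can read the same data independently.
consumer = KafkaConsumer(
    "jobs",
    bootstrap_servers="localhost:9092",
    group_id="job-search-backend",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.offset, message.key, message.value)
```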

Slide 17

Why We Used Kafka
Kafka met our requirements well compared to other data stores such as an RDBMS, message queues, Pub/Sub systems, etc.
● Streaming data
● Flexible consumption from a topic (see the sketch below)
  ○ Run multiple consumers for different purposes
  ○ Replay data
● Configurable retention
● Scalability
● Delivery semantics and ordering
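
For example, "replay data" can be as simple as rewinding a consumer to the beginning of a partition, and retention is a per-topic setting (retention.ms). A kafka-python sketch with placeholder names:

```python
from kafka import KafkaConsumer, TopicPartition

# Rewind a dedicated consumer to the start of a partition and reprocess
# everything still within the topic's configured retention.
consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id="reindexer",        # a second, independent consumer group
    consumer_timeout_ms=10_000,  # stop iterating once caught up / idle
)
tp = TopicPartition("jobs", 0)
consumer.assign([tp])
consumer.seek_to_beginning(tp)
for message in consumer:
    print(message.offset, message.value)  # replace with real reprocessing
```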

Slide 18

How We Migrated to Kafka
● Migrated to Kafka gradually
  ○ Makes each migration step simpler
  ○ Minimizes downtime
  ○ Limits the risk when something goes wrong
● Verified the migration carefully by writing a few tools to make sure we would not lose jobs

Slide 19

Architecture Using Record Log
The Producer appends to the Record Log (RL); segment files are rsynced between DCs to an RL replica on each consumer.

Slide 20

Populate Kafka Using Record Log and Verify
The existing Producer → RL → rsync path stays in place. A new producer populates Kafka from the record log, and a new consumer verifies the ordering of messages in Kafka and its performance.
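
A sketch of what the backfill step could look like, reusing the length-prefixed segment format from the earlier record-log sketch. Indeed's actual tools are not public; all names here are illustrative.

```python
import glob
import struct

from kafka import KafkaProducer

def read_segment(path):
    """Yield length-prefixed records from one segment file."""
    with open(path, "rb") as f:
        while header := f.read(4):
            (length,) = struct.unpack(">I", header)
            yield f.read(length)

producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")

# Reading segments in filename order preserves the record log's order,
# which the verification consumer can then check against Kafka offsets.
for path in sorted(glob.glob("recordlog/*.seg")):
    for record in read_segment(path):
        producer.send("jobs", value=record)
producer.flush()
```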

Slide 21

Replicate Kafka Topics Between DCs
Each DC has its own Kafka brokers, and MirrorMaker replicates topics between DCs. The record-log and rsync path still runs in parallel.
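
MirrorMaker 2 ships with Kafka and is driven by a properties file; a minimal one-way replication sketch could look like this (the cluster aliases, hosts, and topic pattern are illustrative assumptions, not Indeed's configuration):

```properties
# mm2.properties: replicate topics from dc1 to dc2 (illustrative values)
clusters = dc1, dc2
dc1.bootstrap.servers = kafka.dc1.example.com:9092
dc2.bootstrap.servers = kafka.dc2.example.com:9092

# Enable one replication flow; by default replicated topics are
# renamed to "dc1.<topic>" on the target cluster.
dc1->dc2.enabled = true
dc1->dc2.topics = .*
```

This would be started with bin/connect-mirror-maker.sh mm2.properties; the talk does not say which MirrorMaker version or settings Indeed used.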

Slide 22

Verify the Kafka Topic Again
With both pipelines running, compare the Record Log and the Kafka topic to confirm the streams match.
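
A sketch of such a comparison tool, reusing the segment format from the earlier sketches (placeholder names throughout; the real verification tools are not public):

```python
import glob
import struct
from itertools import zip_longest

from kafka import KafkaConsumer, TopicPartition

def record_log_stream(directory):
    """Yield records from all segments in filename (= log) order."""
    for path in sorted(glob.glob(f"{directory}/*.seg")):
        with open(path, "rb") as f:
            while header := f.read(4):
                (length,) = struct.unpack(">I", header)
                yield f.read(length)

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    consumer_timeout_ms=10_000,  # stop once the topic is exhausted
)
tp = TopicPartition("jobs", 0)
consumer.assign([tp])
consumer.seek_to_beginning(tp)

# Compare the two streams record by record; zip_longest also catches
# one stream being shorter than the other (i.e. lost or extra records).
mismatches = 0
kafka_stream = (msg.value for msg in consumer)
pairs = zip_longest(record_log_stream("recordlog"), kafka_stream)
for i, (expected, actual) in enumerate(pairs):
    if expected != actual:
        mismatches += 1
        print(f"mismatch at record {i}")
print(f"done, {mismatches} mismatches")
```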

Slide 23

Migrate Consumers One by One
Switch consumers from the record log to Kafka one at a time; the rsync path keeps serving the consumers that have not migrated yet.
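
One way to structure this (a hypothetical sketch; the flag and names are my assumptions, not the talk's) is to put the data source behind a per-service switch, so each consumer can be flipped to Kafka, and rolled back, independently:

```python
import os

from kafka import KafkaConsumer

def record_log_stream():
    """Stand-in for the existing record-log reader (see the earlier sketch)."""
    yield from ()

def job_stream():
    # Each consumer service flips this switch independently, so a bad
    # migration affects one service and can be rolled back quickly.
    if os.environ.get("JOB_SOURCE", "recordlog") == "kafka":
        consumer = KafkaConsumer(
            "jobs",
            bootstrap_servers="localhost:9092",
            group_id="docstore",  # placeholder: one group per service
        )
        for message in consumer:
            yield message.value
    else:
        yield from record_log_stream()

for job in job_stream():
    pass  # the downstream processing stays unchanged during the migration
```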

Slide 24

After Migrating All Consumers
The Producer still writes to both the Record Log and Kafka, but all consumers now read from Kafka.

Slide 25

Decommission Record Log
The record log is removed: the Producer writes to Kafka, topics are replicated to the Kafka brokers in each DC, and all consumers read from Kafka.

Slide 26

After the Migration to Kafka
● Freed up the persistent storage each consumer used for the record log (hundreds of TB in total)
● Consumers were "containerized" and migrated to Kubernetes
● The producer can fail over with minimal intervention

Slide 27

After the Migration to Kafka (cont'd)
● Better compression using zstd, supported by Kafka, instead of Snappy (20–50% reduction)
● Saved network bandwidth for replication between DCs (~65% reduction)
● Limited performance degradation compared to the record log (~20% slowdown on one consumer)
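
Switching the codec is a producer-side setting (zstd requires Kafka 2.1+); a minimal kafka-python sketch with a placeholder broker address:

```python
from kafka import KafkaProducer

# zstd-compressed batches replace the Snappy-compressed record log; the
# 20-50% reduction above is Indeed's observed result, not a general claim.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    compression_type="zstd",
)
```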

Slide 28

Conclusion
● Identified and solved the problem from both reliability and product perspectives
● Adopted Kafka to store and distribute job updates for Indeed job search
● Kafka can be a powerful solution when the use case suits it well
● Working on the project as one cross-functional team, not just SRE, was a key to success

Slide 29

We're hiring!!
● We have many interesting reliability and product problems to solve
● Open positions in SRE, SWE, TDM, PdM, etc.
● https://jp.indeed.jobs/
● Feel free to reach out to me with questions!
Thank you for listening!