Adopting Kafka for the #1 job site in the world

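Slides from a talk given at SRE NEXT 2022. https://sre-next.dev/2022/schedule#jp38

This talk presents a case study of adopting Apache Kafka in the job search backend at Indeed, the world's No. 1 job search engine. Kafka is an open-source distributed event streaming platform used by many companies around the world. Kafka serves a variety of purposes at Indeed; this talk covers how we incrementally migrated a legacy system, used for stream processing of job data and for replication between data centers, to a new Kafka-based system.
The talk covers the background (the reliability and product challenges of the legacy system), the technical details of how we applied Kafka and carried out and verified the migration, and how an SRE embedded in a cross-functional product team that includes software engineers and product managers started the project and ran it together with the team.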


Yusuke Miyazaki

May 14, 2022



  1. Adopting Kafka for the #1 job site* in the world

    Yusuke Miyazaki, Staff Site Reliability Engineer, Indeed
    * ComScore, Total Visits, September 2021
  2. Who am I?

    • 宮﨑 勇輔 (Yusuke Miyazaki) / @ymyzk
    • SRE at Indeed since 2018
      ◦ Embedded into Job Search Backend team
      ◦ SRE lead
      ◦ Python Guild
    • Indeed is a bronze sponsor of SRE NEXT 2022
  3. Agenda

    • Background of the problem
    • Problem we solved
    • Why and how we adopted Kafka
    • Conclusion
  4. None
  5. Indeed provides job search in 60+ countries

    We have data centers in multiple countries on multiple continents to serve traffic from users.
    https://jp.indeed.com/worldwide
  6. How Does Job Search Work?

    We need to distribute job data between DCs quickly and reliably.
    Diagram labels: Aggregate; Enrich job data; Distribute data to all DCs; Store and provide job data; Job search apps and websites
  7. Old Way: Record Log

    Indeed had been using the “record log” to store and distribute streams of jobs.
    • “Append-only” data structure
    • A record log is a set of segment files
    • Each segment file contains multiple job records
    • We can “rsync” files from the producer to consumers to distribute jobs within a DC and between DCs
    Please see the 2013 blog post and the implementation on GitHub for more details.
    Diagram: segment files 0101.seg, 0102.seg, 0103.seg, ⋮ containing records such as Job A in JP, Job B in US, Job C in IE, ⋮
    https://engineering.indeedblog.com/blog/2013/10/serving-over-1-billion-documents-per-day-with-docstore-v2/
    https://github.com/indeedeng/lsmtree/tree/master/recordlog
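To make the "append-only set of segment files" idea concrete, here is a minimal Python sketch. It is not Indeed's record log format (see the GitHub link above for the real implementation): the class name `SegmentedLog`, the length-prefixed record layout, and the tiny `records_per_segment` value are all assumptions chosen for illustration.

```python
import os
import struct
import tempfile


class SegmentedLog:
    """Toy append-only log split into segment files (hypothetical format)."""

    def __init__(self, directory, records_per_segment=2):
        self.directory = directory
        self.records_per_segment = records_per_segment
        self.count = 0  # total number of records appended so far

    def _segment_path(self, index):
        # Zero-padded names keep lexicographic order equal to numeric order.
        return os.path.join(self.directory, f"{index:04d}.seg")

    def append(self, payload: bytes) -> int:
        """Append one record to the current segment and return its position."""
        segment = self.count // self.records_per_segment
        with open(self._segment_path(segment), "ab") as f:
            f.write(struct.pack(">I", len(payload)))  # 4-byte length prefix
            f.write(payload)
        position = self.count
        self.count += 1
        return position

    def read_all(self):
        """Yield every record in append order by scanning segments by name."""
        for name in sorted(os.listdir(self.directory)):
            if not name.endswith(".seg"):
                continue
            with open(os.path.join(self.directory, name), "rb") as f:
                while True:
                    header = f.read(4)
                    if not header:
                        break
                    (length,) = struct.unpack(">I", header)
                    yield f.read(length)


def demo():
    """Write three jobs and return (segment file names, records read back)."""
    with tempfile.TemporaryDirectory() as directory:
        log = SegmentedLog(directory)
        for job in [b"Job A in JP", b"Job B in US", b"Job C in IE"]:
            log.append(job)
        return sorted(os.listdir(directory)), list(log.read_all())
```

Because existing segments are never rewritten, a consumer only needs to copy files it does not yet have plus the tail of the newest one, which is why a file-level tool such as rsync can serve as the distribution mechanism.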
  8. Record log had been working well for many years

  9. As the product grew, we started to see various issues

  10. Reliability Challenges

    • Difficulty “containerizing” apps that depend on a record log stored in persistent storage
    • Extra storage capacity needed on each consumer to store more data
    • Difficulty failing over the producer without complex human intervention
    • Replication instability with “rsync” caused by cross-DC network instability and bandwidth limits
  11. Product Requirements

    The product side was also looking into scaling the product further:
    • Process more job updates
      ◦ Process a larger number of jobs
      ◦ Update jobs more frequently
    • Store more metadata for each job
    We need to handle more data in the record log.
  12. Difficulties in Meeting Requirements

    It was getting difficult to store and distribute more data using the record log:
    • Each consumer needs TBs of storage to store a stream of jobs
    • Replication between DCs becomes more unstable and causes more alerts
  13. As a cross-functional team, we decided to replace the record log to solve both reliability and product challenges.
  14. New Way: Apache Kafka

  15. “Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.”

    Source: Apache Kafka, https://kafka.apache.org/
  16. Kafka at a Glance

    Producers and consumers are fully decoupled and agnostic of each other.
    Diagram labels: Producers; Brokers (Topic 1, Topic 2, Topic 3); Consumers
  17. Why did we use Kafka?

    Kafka met our requirements well compared to other data stores such as RDBMSs, message queues, Pub/Sub, etc.
    • Streaming data
    • Flexible consumption from topics
      ◦ Run multiple consumers for different purposes
      ◦ Replay data
    • Configurable retention
    • Scalability
    • Delivery semantics and ordering
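The consumption properties listed on this slide (independent consumers, replay, retention) can be illustrated with a small in-memory model. This is a teaching sketch only, not the Kafka client API: `Topic`, `Consumer`, and the count-based `retention` parameter are hypothetical names, and real Kafka retention is configured by time or size rather than message count.

```python
class Topic:
    """In-memory stand-in for a single-partition topic (not the Kafka API)."""

    def __init__(self, retention=None):
        self.retention = retention  # max messages kept; None keeps everything
        self.base_offset = 0        # offset of the oldest retained message
        self.messages = []

    def produce(self, value):
        """Append a message, apply retention, and return the message's offset."""
        self.messages.append(value)
        if self.retention is not None and len(self.messages) > self.retention:
            drop = len(self.messages) - self.retention
            del self.messages[:drop]
            self.base_offset += drop
        return self.base_offset + len(self.messages) - 1

    def read(self, offset):
        """Return (next_offset, values) for all messages at or after offset."""
        start = max(offset, self.base_offset)
        values = self.messages[start - self.base_offset:]
        return start + len(values), values


class Consumer:
    """Each consumer tracks its own offset, so consumers never affect each other."""

    def __init__(self, topic):
        self.topic = topic
        self.offset = 0

    def poll(self):
        self.offset, values = self.topic.read(self.offset)
        return values

    def seek(self, offset):
        # Moving the offset backwards replays already-consumed messages.
        self.offset = offset
```

The point of the model is that reading does not remove data: the topic keeps messages until retention drops them, and each consumer's position is just a number it owns. That is what separates this style of consumption from a classic queue, where a message is gone once it has been delivered.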
  18. How did we migrate to Kafka?

    • Migrated to Kafka gradually
      ◦ Make each migration simpler
      ◦ Minimize downtime
      ◦ Control risk when something goes wrong
    • Verified the migration carefully by writing a few tools to make sure we would not lose jobs
  19. Architecture using Record Log

    Diagram labels: Producer; Record Log (RL); rsync between DCs; RL, RL, RL; Consumers
  20. Populate Kafka using Record Log and Verify

    Diagram labels: Producer; Record Log (RL); rsync between DCs; RL, RL, RL; Consumers; Kafka
    • Producer to populate Kafka
    • Consumer to verify ordering of messages in Kafka and performance
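The slides do not show how the verifying consumer checked ordering. One possible approach, sketched below in Python, assumes each message carries a sequence number that increases with its position in the record log; that field and the function name `find_ordering_violations` are assumptions made for this example, not details from the talk.

```python
def find_ordering_violations(messages):
    """Check that sequence numbers strictly increase within each partition.

    messages: iterable of (partition, sequence) pairs in the order consumed.
    Returns a list of (partition, previous_sequence, sequence) tuples for
    every message whose sequence did not advance past the highest seen so far.
    """
    highest_seen = {}
    violations = []
    for partition, sequence in messages:
        previous = highest_seen.get(partition)
        if previous is not None and sequence <= previous:
            violations.append((partition, previous, sequence))
        else:
            highest_seen[partition] = sequence
    return violations


consumed = [(0, 1), (1, 1), (0, 2), (1, 3), (1, 2), (0, 3)]
print(find_ordering_violations(consumed))
```

The check is done per partition because Kafka guarantees ordering only within a partition, not across a whole topic.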
  21. Replicate Kafka Topics between DCs

    Diagram labels: Producer; Record Log (RL); rsync between DCs; RL, RL, RL; Consumers; Kafka, Kafka
    • MirrorMaker to replicate topics between DCs
    • Each DC has its own Kafka brokers
  22. Verify Kafka topic again

    Diagram labels: Producer; Record Log (RL); rsync between DCs; RL, RL, RL; Consumers; Kafka, Kafka
    • Compare Record Log and Kafka
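The comparison tool itself is not described in the slides. As a rough sketch of what a "compare record log and Kafka" check might do, the Python below reduces each stream to the latest payload digest per job key and reports the differences. The key/payload record shape, the latest-wins rule, and the names `latest_digests` and `compare_streams` are assumptions for illustration.

```python
import hashlib


def latest_digests(records):
    """Map each key to the SHA-256 digest of its most recent payload."""
    digests = {}
    for key, payload in records:
        digests[key] = hashlib.sha256(payload).hexdigest()
    return digests


def compare_streams(source_records, target_records):
    """Report keys that are missing, unexpected, or different in the target."""
    source = latest_digests(source_records)
    target = latest_digests(target_records)
    shared = source.keys() & target.keys()
    return {
        "missing": sorted(source.keys() - target.keys()),
        "unexpected": sorted(target.keys() - source.keys()),
        "mismatched": sorted(k for k in shared if source[k] != target[k]),
    }


record_log = [("job-a", b"v1"), ("job-b", b"v1"), ("job-a", b"v2"), ("job-c", b"v1")]
kafka_topic = [("job-a", b"v2"), ("job-b", b"v0"), ("job-d", b"v1")]
print(compare_streams(record_log, kafka_topic))
```

A non-empty "missing" list is the case the migration most needs to rule out, since it would mean a job present in the record log never reached Kafka.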
  23. Migrate Consumers One by One

    Diagram labels: Producer; Record Log (RL); rsync between DCs; RL, RL; Kafka, Kafka; Consumers
    • Migrate one by one
  24. After Migrating All Consumers

    Diagram labels: Producer; Record Log (RL); Kafka, Kafka

  25. Decommission Record Log

    Diagram labels: Producer; Kafka, Kafka; Consumers

  26. After the migration to Kafka

    • Freed up the persistent storage on each consumer that was used to store the record log (hundreds of TB in total)
    • Consumers are “containerized” and migrated to Kubernetes
    • Producer can fail over with minimal intervention
  27. After the migration to Kafka (cont’d)

    • Better compression using zstd provided by Kafka instead of Snappy (20–50% reduction)
    • Saved network bandwidth for replication between DCs (~65% reduction)
    • Limited performance degradation compared to the record log (~20% slowdown on one consumer)
  28. Conclusion

    • Identified and solved the problem from both reliability and product perspectives
    • Adopted Kafka to store and distribute job updates for Indeed job search
    • Kafka can be a powerful solution when the use case is well suited to it
    • Working on the project as one cross-functional team, not just SRE, was a key to success
  29. We’re hiring!!

    • We have many interesting reliability and product problems to solve
    • Open positions in SRE, SWE, TDM, PdM, etc.
    • https://jp.indeed.jobs/
    • Feel free to reach out to me with questions!
    Thank you for listening!