Adopting Kafka for the #1 job site in the world

Slides presented at SRE NEXT 2022. https://sre-next.dev/2022/schedule#jp38 https://www.youtube.com/watch?v=KOEQ7mqSuW8

This talk presents a case study of adopting Apache Kafka in the backend of job search at Indeed, the world's #1 job search engine. Kafka is an open-source platform for distributed event streaming and is used by many companies around the world. Indeed uses Kafka for a wide range of purposes as well; this talk focuses on how we gradually migrated a legacy system, which we had been using for stream processing of job data and for replication between data centers, to a new Kafka-based system.
It covers the background, including the reliability and product challenges the legacy system faced; the technical side of how we applied Kafka and migrated and verified the system; and how an SRE embedded in a cross-functional product team, including software engineers and product managers, initiated the project and carried it out together with the team.

Yusuke Miyazaki

May 14, 2022

Transcript

  1. Adopting Kafka
    for the #1 job site*
    in the world
    Yusuke Miyazaki
    Staff Site Reliability Engineer, Indeed
    * ComScore, Total Visits, September 2021

  2. Who am I?
    ● Yusuke Miyazaki (宮﨑 勇輔) / @ymyzk
    ● SRE at Indeed since 2018
    ○ Embedded in the Job Search Backend team
    ○ SRE lead
    ○ Python Guild
    ● Indeed is a bronze sponsor of SRE NEXT 2022

  3. Agenda
    ● Background of the problem
    ● Problem we solved
    ● Why and how we adopted Kafka
    ● Conclusion

  4. (Image-only slide)

  5. Indeed provides job
    search in 60+ countries
    We have data centers in multiple countries across multiple continents to serve traffic from users.
    https://jp.indeed.com/worldwide

  6. How Does Job Search Work?
    We need to distribute job data between DCs quickly and reliably.
    (Diagram: aggregate and enrich job data → distribute data to all DCs → store and provide job data → job search apps and websites.)

  7. Old Way: Record Log
    Indeed had been using a “record log” to store and distribute streams of jobs.
    ● “Append-only” data structure
    ● A record log is a set of segment files
    ● Each segment file contains multiple job records
    ● We can “rsync” files from the producer to consumers to distribute jobs within a DC and between DCs
    Please see a blog post from 2013 and the implementation on GitHub for more details.
    (Diagram: segment files 0101.seg, 0102.seg, 0103.seg; each segment holds multiple jobs, e.g. Job A in JP, Job B in US, Job C in IE.)
    https://engineering.indeedblog.com/blog/2013/10/serving-over-1-billion-documents-per-day-with-docstore-v2/
    https://github.com/indeedeng/lsmtree/tree/master/recordlog
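    To make the structure concrete, here is a minimal sketch in Python of an append-only, length-prefixed segment format. This is an illustration only, not Indeed's actual format; see the lsmtree repository linked above for the real implementation.

    import struct

    def append_record(segment, payload: bytes) -> None:
        # Append-only: write a 4-byte length header followed by the payload.
        segment.write(struct.pack(">I", len(payload)))
        segment.write(payload)

    def read_records(path: str):
        # Replay a segment file from the beginning, one record at a time.
        with open(path, "rb") as f:
            while True:
                header = f.read(4)
                if len(header) < 4:
                    return  # end of segment
                (size,) = struct.unpack(">I", header)
                yield f.read(size)

    # Usage sketch:
    # with open("0101.seg", "ab") as seg:
    #     append_record(seg, b"job data")

    Because segments are immutable once written, plain rsync of the files is enough to replicate the stream to consumers.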

  8. Record log had been working well
    for many years

  9. As the product grew,
    we started to see new issues

  10. Reliability Challenges
    ● Hard to “containerize” apps that depend on a record log stored on persistent storage
    ● Extra storage capacity is needed on each consumer to store more data
    ● Hard to fail over the producer without complex human intervention
    ● Replication with “rsync” is unstable due to cross-DC network instability and limited bandwidth

  11. Product Requirements
    The product side was also looking into scaling the product further
    ● Process more job updates
    ○ Process a larger number of jobs
    ○ Update jobs more frequently
    ● Store more metadata for each job
    We needed to handle more data in the record log

  12. Difficulties in Meeting Requirements
    It was getting difficult to store and distribute more data using the record log
    ● Each consumer needs TBs of storage to store a stream of jobs
    ● Replication between DCs becomes more unstable and causes more alerts

  13. As a cross-functional team,
    we decided to replace the record log
    to solve both reliability and product
    challenges.

  14. New Way:
    Apache Kafka

  15. Apache Kafka
    “Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.”
    https://kafka.apache.org/

  16. Kafka at a Glance
    Producers and consumers are fully decoupled and agnostic of each other
    (Diagram: producers write to topics (Topic 1, Topic 2, Topic 3) hosted on the brokers; consumers read from those topics.)
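    As a small illustration of that decoupling, here is a minimal producer and consumer using the confluent-kafka Python client. The broker address, topic name, and group id are placeholders, not values from the talk.

    from confluent_kafka import Consumer, Producer

    # The producer only knows the brokers and the topic name; it knows
    # nothing about how many consumers exist or where they run.
    producer = Producer({"bootstrap.servers": "localhost:9092"})
    producer.produce("jobs", key=b"job-a", value=b"job payload")
    producer.flush()

    # A consumer subscribes independently, identified only by its group id.
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "example-consumer",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["jobs"])
    msg = consumer.poll(10.0)
    if msg is not None and msg.error() is None:
        print(msg.key(), msg.value())
    consumer.close()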

  17. Why We Used Kafka
    Kafka met our requirements well compared to other data stores such as RDBMSs, message queues, Pub/Sub systems, etc.
    ● Streaming data
    ● Flexible consumption from a topic (sketched after this list)
    ○ Run multiple consumers for different purposes
    ○ Replay data
    ● Configurable retention
    ● Scalability
    ● Delivery semantics and ordering
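    A sketch of what “flexible consumption” and “configurable retention” look like in practice with the confluent-kafka client. The topic name, partition count, replication factor, and retention value are illustrative assumptions, not Indeed's settings.

    from confluent_kafka import Consumer
    from confluent_kafka.admin import AdminClient, NewTopic

    # Retention is a per-topic setting; e.g. keep data for 7 days.
    admin = AdminClient({"bootstrap.servers": "localhost:9092"})
    admin.create_topics([NewTopic(
        "jobs", num_partitions=12, replication_factor=3,
        config={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},
    )])

    # Each group.id tracks its own offsets, so independent consumers can
    # read the same topic for different purposes. A brand-new group with
    # auto.offset.reset=earliest effectively replays all retained data.
    replayer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "replay-for-backfill",
        "auto.offset.reset": "earliest",
    })
    replayer.subscribe(["jobs"])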

  18. How We Migrated to Kafka
    ● Migrated to Kafka gradually
    ○ Make each migration step simpler
    ○ Minimize downtime
    ○ Control the risk when something goes wrong
    ● Verified the migration carefully by writing a few tools to make sure we would not lose jobs

  19. Architecture using Record Log
    (Diagram: the producer writes to a record log (RL); rsync replicates segment files between DCs; each consumer keeps its own RL copy.)

  20. Populate Kafka using Record Log and Verify
    (Diagram: the existing pipeline (producer → record log (RL) → rsync between DCs → consumer RL copies) keeps running. A new producer populates Kafka from the record log, and a dedicated consumer verifies the ordering of messages in Kafka and its performance.)
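    A minimal sketch of what the populating producer might look like, reusing the read_records helper from the record-log sketch above. extract_job_id is a hypothetical parser for the job record format, and keying by job id (so updates to the same job stay ordered within one partition) is an assumption about the design, not a confirmed detail from the talk.

    from confluent_kafka import Producer

    producer = Producer({
        "bootstrap.servers": "localhost:9092",
        "enable.idempotence": True,  # avoid duplicates on retries during backfill
    })

    for payload in read_records("0101.seg"):   # helper from the earlier sketch
        job_id = extract_job_id(payload)       # hypothetical parser for job records
        # Messages with the same key land in the same partition, which
        # preserves per-job ordering for downstream verification.
        producer.produce("jobs", key=job_id, value=payload)

    producer.flush()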

  21. Replicate Kafka Topics between DCs
    (Diagram: the RL pipeline with rsync between DCs keeps running. Each DC has its own Kafka brokers, and MirrorMaker replicates topics between DCs.)
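    The slide does not say which MirrorMaker version or configuration Indeed used. Purely as an illustration, a MirrorMaker 2 setup replicating topics from one DC to another can be sketched with a properties file like this; the cluster names, addresses, and topic pattern are placeholders.

    # mm2.properties -- run with: bin/connect-mirror-maker.sh mm2.properties
    clusters = dc1, dc2
    dc1.bootstrap.servers = kafka.dc1.example.com:9092
    dc2.bootstrap.servers = kafka.dc2.example.com:9092

    # Replicate topics matching the pattern from dc1 to dc2.
    dc1->dc2.enabled = true
    dc1->dc2.topics = jobs.*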

  22. Verify Kafka topic again
    (Diagram: with the RL pipeline still in place and Kafka replicated between DCs, we compare the contents of the record log and the Kafka topic.)
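    A hedged sketch of one way such a comparison tool could work: snapshot the latest payload per job from the Kafka topic and diff it against the record log. The end-of-topic detection and the key scheme here are simplifications, not the actual tool from the talk.

    from confluent_kafka import Consumer

    def kafka_latest_by_key(topic: str) -> dict:
        # Read the topic from the beginning, keeping the latest payload per key.
        c = Consumer({
            "bootstrap.servers": "localhost:9092",
            "group.id": "rl-vs-kafka-verifier",
            "auto.offset.reset": "earliest",
            "enable.auto.commit": False,
        })
        c.subscribe([topic])
        latest = {}
        while True:
            msg = c.poll(5.0)
            if msg is None:
                break  # simplification: treat a quiet poll as end of topic
            if msg.error() is None:
                latest[msg.key()] = msg.value()
        c.close()
        return latest

    def diff(record_log: dict, topic_data: dict):
        missing = [k for k in record_log if k not in topic_data]
        changed = [k for k, v in record_log.items()
                   if k in topic_data and topic_data[k] != v]
        return missing, changed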

  23. Migrate Consumers One by One
    (Diagram: consumers are switched from the record log to Kafka one by one; the RL pipeline with rsync between DCs keeps serving the consumers that have not migrated yet.)
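    One way a per-consumer cutover can avoid gaps (a sketch under assumptions, not the method described in the talk) is to start the Kafka consumer at the offsets corresponding to the wall-clock time where its record-log predecessor stopped, so the two sources overlap rather than leave a hole. The timestamp, topic, group id, and partition count below are placeholders.

    from confluent_kafka import Consumer, TopicPartition

    CUTOVER_MS = 1652486400000  # example: time (ms) when the RL consumer stopped

    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "migrated-consumer-1",
    })
    # Look up, per partition, the first offset with timestamp >= CUTOVER_MS
    # (the timestamp is passed in the TopicPartition offset field) ...
    parts = [TopicPartition("jobs", p, CUTOVER_MS) for p in range(12)]
    offsets = consumer.offsets_for_times(parts, timeout=10.0)
    # ... and start consuming from exactly those offsets.
    consumer.assign(offsets)

    The overlap implies some duplicate deliveries, so this sketch assumes consumers handle job updates idempotently.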

  24. After Migrating All Consumers
    (Diagram: the producer still writes to both the record log and Kafka, but all consumers now read from Kafka.)

  25. Decommission Record Log
    (Diagram: the record log and rsync are gone; the producer writes only to Kafka, and consumers read from Kafka replicated between DCs.)

  26. After the migration to Kafka
    ● Freed up the persistent storage each consumer used for the record log (hundreds of TB in total)
    ● Consumers were “containerized” and migrated to Kubernetes
    ● The producer can fail over with minimal intervention

  27. After the migration to Kafka (cont’d)
    ● Better compression using zstd provided by Kafka instead of Snappy (20–50% reduction); see the config sketch below
    ● Saved network bandwidth for replication between DCs (~65% reduction)
    ● Limited performance degradation compared to the record log (~20% slowdown on one consumer)
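    The zstd gain is a producer-side (or topic-level) compression setting. A minimal sketch with the confluent-kafka client; the broker address and batching value are illustrative assumptions.

    from confluent_kafka import Producer

    producer = Producer({
        "bootstrap.servers": "localhost:9092",
        "compression.type": "zstd",  # per the slide: replaced Snappy
        "linger.ms": 50,             # illustrative: larger batches compress better
    })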

  28. Conclusion
    ● Identified and solved the problem from both reliability and product perspectives
    ● Adopted Kafka to store and distribute job updates for Indeed job search
    ● Kafka can be a powerful solution when the use case is well suited
    ● Working on the project as one cross-functional team, not just SRE, was a key to success

  29. We’re hiring!!
    ● We have many interesting reliability and product problems to solve
    ● Open positions in SRE, SWE, TDM, PdM, etc.
    ● https://jp.indeed.jobs/
    ● Feel free to reach out to me with questions!
    Thank you for listening!
