Observability infrastructure behind the trillion-messages scale Kafka platform

Kafka is mature software, but at LY scale we sometimes hit issues that no one has encountered before.
Observability is the core of our platform for detecting such issues.
We present an excerpt of the observability infrastructure of our Kafka platform.

Transcript

1. Self introduction
   • Haruki Okada (X/GitHub: @ocadaruma)
   • Technical lead of the IMF team at LY Corporation
   • The team is responsible for providing the company-wide Kafka platform
2. Apache Kafka at LY Corporation
   • Kafka is widely adopted by many services for various purposes
3. Apache Kafka at LY Corporation: Scale
   • Peak message throughput: 31 million messages / sec
   • Daily incoming messages: 1+ trillion messages
   • Daily in/out bytes: 2.6 petabytes
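For context, a back-of-envelope check (a minimal sketch using only the figures on this slide) shows how the peak compares to the daily average:

```python
# Back-of-envelope math from the figures above; nothing beyond the slide is assumed.
DAILY_MESSAGES = 1.0e12   # 1+ trillion messages / day
PEAK_RATE = 31e6          # 31 million messages / sec
SECONDS_PER_DAY = 86_400

average_rate = DAILY_MESSAGES / SECONDS_PER_DAY
print(f"average rate: {average_rate / 1e6:.1f}M msgs/sec")   # ~11.6M msgs/sec
print(f"peak / average: {PEAK_RATE / average_rate:.1f}x")    # ~2.7x
```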
4. Operating Kafka at scale
   • Kafka is mature, battle-tested software
   • However, at this scale, we sometimes hit issues that no one has encountered before
5. Operating Kafka at scale
   • KAFKA-13403: Kafka crash due to a race condition in the log deletion logic
   • KAFKA-15046: Performance issue under a specific condition due to the fsync syscall
   • KIP-764 (Configurable backlog size for creating Acceptor): Performance issue on a sudden burst of connection opens due to a Linux kernel bug in SYN cookies
6. Observability is our core
   • Identifying the root cause of these kinds of issues requires maximum observability at every layer of the system stack:
   • Kafka
   • JVM
   • Linux kernel
   • RAID/HDD
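As one concrete example of the kernel/disk layer, here is a minimal sketch that samples /proc/diskstats to estimate per-I/O latency on a log-dir device. The device name "sda" is an assumption; the field positions follow the kernel's documented diskstats layout.

```python
import time

def read_diskstats(device):
    """Return (reads, ms_reading, writes, ms_writing) for `device`
    from /proc/diskstats (documented field layout)."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return (int(fields[3]), int(fields[6]),
                        int(fields[7]), int(fields[10]))
    raise ValueError(f"device {device} not found")

DEVICE = "sda"  # assumption: substitute the RAID/HDD device backing Kafka's log dirs

r0, rt0, w0, wt0 = read_diskstats(DEVICE)
time.sleep(10)
r1, rt1, w1, wt1 = read_diskstats(DEVICE)

reads, writes = r1 - r0, w1 - w0
print(f"avg read latency:  {(rt1 - rt0) / reads:.2f} ms" if reads else "no reads")
print(f"avg write latency: {(wt1 - wt0) / writes:.2f} ms" if writes else "no writes")
```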
7. ClickHouse use case in our platform
   • ClickHouse is used to store and analyze all Kafka API request logs
     • API calls for inter-broker replication from 200+ brokers
     • Produce/Consume requests from clients on 25k+ servers
     • etc.
   • Scale
     • Inserted rows: 7 million rows / sec
     • Total rows: 4.1 trillion rows
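As a minimal sketch of what such a request-log pipeline could look like: the table name, column set, and host are hypothetical (the talk does not show the real schema), and the clickhouse-driver Python client is assumed.

```python
from datetime import datetime
from clickhouse_driver import Client  # assumes the clickhouse-driver package

client = Client("localhost")  # hypothetical host; production would write through shards

# Hypothetical schema for Kafka API request logs.
client.execute("""
    CREATE TABLE IF NOT EXISTS kafka_request_log (
        ts           DateTime64(3),
        broker       LowCardinality(String),
        api          LowCardinality(String),  -- e.g. Produce, Fetch, ListOffsets
        client_id    String,
        topic        String,
        partition_id Int32,
        latency_ms   Float64
    )
    ENGINE = MergeTree
    PARTITION BY toDate(ts)
    ORDER BY (topic, partition_id, ts)
""")

# At 7M rows/sec, inserts must be batched; a single row is shown only for shape.
client.execute(
    "INSERT INTO kafka_request_log VALUES",
    [(datetime.now(), "broker-1", "Produce", "client-a", "some-topic", 0, 1.2)],
)
```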
8. Why ClickHouse?
   • Supports SQL
   • Efficient storage usage
     • Great compression ratio thanks to the columnar architecture
   • High performance
     • Can serve this scale of traffic with only 24 servers (2 replicas * 12 shards)
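The compression ratio is easy to verify on a live cluster: ClickHouse's system.columns system table exposes compressed vs. uncompressed bytes per column (the table filter below refers to the hypothetical schema sketched earlier).

```python
from clickhouse_driver import Client

client = Client("localhost")

# Per-column on-disk vs. raw sizes; the ratio shows what columnar compression buys.
rows = client.execute("""
    SELECT
        name,
        formatReadableSize(data_compressed_bytes)   AS on_disk,
        formatReadableSize(data_uncompressed_bytes) AS raw,
        round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
    FROM system.columns
    WHERE table = 'kafka_request_log' AND data_compressed_bytes > 0
    ORDER BY data_compressed_bytes DESC
""")
for name, on_disk, raw, ratio in rows:
    print(f"{name}: {on_disk} on disk, {raw} raw, {ratio}x")
```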
9. Case study: Identifying a Kafka bug
   • Phenomenon: Produce requests suddenly started failing on a specific partition
10. Why replication stopped
   • The Kafka leader somehow generated non-monotonic message offsets
11. Hypothesis: Race condition
   • By reading the code, we found a potential race condition that could cause non-monotonic message offsets
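This is not Kafka's actual code, but a toy illustration of the bug class: two threads read-then-update a shared end offset without synchronization, so assigned offsets can repeat instead of increasing monotonically.

```python
import threading

class ToyLog:
    """Deliberately unsynchronized offset assignment (illustrative only)."""
    def __init__(self):
        self.end_offset = 0
        self.assigned = []

    def append_batch(self, n):
        offset = self.end_offset          # read shared state...
        for _ in range(1000):             # ...widen the race window...
            pass
        self.end_offset = offset + n      # ...write it back non-atomically
        self.assigned.append(offset)

log = ToyLog()
threads = [threading.Thread(target=lambda: [log.append_batch(1) for _ in range(10_000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With the race, some offsets are assigned twice, so the sequence is not
# strictly increasing; with a lock around append_batch, this prints True.
pairs = zip(log.assigned, log.assigned[1:])
print("monotonic:", all(later > earlier for earlier, later in pairs))
```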
12. How can we prove it?: API request log
   • In fact, there was a concurrent ListOffsets API call when the Produce requests started failing!
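The kind of query that surfaces this, sketched against the hypothetical kafka_request_log schema above; the topic, partition, and time window are placeholders.

```python
from clickhouse_driver import Client

client = Client("localhost")

# Every API call that touched the affected partition around the failure,
# ordered by time, to inspect how Produce and ListOffsets interleaved.
rows = client.execute(
    """
    SELECT ts, api, client_id, broker
    FROM kafka_request_log
    WHERE topic = %(topic)s
      AND partition_id = %(partition)s
      AND ts BETWEEN %(from)s AND %(to)s
    ORDER BY ts
    """,
    {"topic": "affected-topic", "partition": 3,
     "from": "2024-01-01 00:00:00", "to": "2024-01-01 00:05:00"},
)
for ts, api, client_id, broker in rows:
    print(ts, api, client_id, broker)
```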
13. Conclusion
   • ClickHouse plays an important role in our observability stack
   • ClickHouse can store and query a massive number of rows efficiently
   • Kafka API request logs on ClickHouse help identify complex issues