Observability infrastructure behind the trillion-messages scale Kafka platform

Kafka is mature software, but at LY scale we sometimes hit issues that no one has encountered before.
Observability is the core of our platform for detecting such issues.
We present an excerpt of the observability infrastructure of our Kafka platform.

Transcript

1. Self introduction
   • Haruki Okada (X/GitHub: @ocadaruma)
   • Technical lead of the IMF team at LY Corporation
   • The team is responsible for providing the company-wide Kafka platform
2. Apache Kafka at LY Corporation
   • Kafka is widely adopted by many services for various purposes
3. Apache Kafka at LY Corporation: Scale
   • Peak message throughput: 31 million messages / sec
   • Daily incoming messages: 1+ trillion messages
   • Daily in/out bytes: 2.6 petabytes
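For context, a back-of-envelope check (a minimal sketch using only the figures on this slide) shows how the peak compares to the daily average:

```python
# Back-of-envelope math from the figures above; nothing beyond the slide is assumed.
DAILY_MESSAGES = 1.0e12   # 1+ trillion messages / day
PEAK_RATE = 31e6          # 31 million messages / sec
SECONDS_PER_DAY = 86_400

average_rate = DAILY_MESSAGES / SECONDS_PER_DAY
print(f"average rate: {average_rate / 1e6:.1f}M msgs/sec")   # ~11.6M msgs/sec
print(f"peak / average: {PEAK_RATE / average_rate:.1f}x")    # ~2.7x
```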
4. Operating Kafka at scale
   • Kafka is mature, battle-tested software
   • However, at this scale, we sometimes hit issues that no one has encountered before
5. Operating Kafka at scale
   • KAFKA-13403: Kafka crash due to a race condition in the log deletion logic
   • KAFKA-15046: Performance issue under a specific condition due to the fsync syscall
   • KIP-764 (Configurable backlog size for creating Acceptor): Performance issue on a sudden burst of connection opens due to a Linux kernel bug in SYN cookies
6. Observability is our core
   • Identifying the root cause of these kinds of issues requires maximum observability at every layer of the system stack:
   • Kafka
   • JVM
   • Linux kernel
   • RAID/HDD
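As one concrete example of the kernel/disk layer, here is a minimal sketch that samples /proc/diskstats to estimate per-I/O latency on a log-dir device. The device name "sda" is an assumption; the field positions follow the kernel's documented diskstats layout.

```python
import time

def read_diskstats(device):
    """Return (reads, ms_reading, writes, ms_writing) for `device`
    from /proc/diskstats (documented field layout)."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return (int(fields[3]), int(fields[6]),
                        int(fields[7]), int(fields[10]))
    raise ValueError(f"device {device} not found")

DEVICE = "sda"  # assumption: substitute the RAID/HDD device backing Kafka's log dirs

r0, rt0, w0, wt0 = read_diskstats(DEVICE)
time.sleep(10)
r1, rt1, w1, wt1 = read_diskstats(DEVICE)

reads, writes = r1 - r0, w1 - w0
print(f"avg read latency:  {(rt1 - rt0) / reads:.2f} ms" if reads else "no reads")
print(f"avg write latency: {(wt1 - wt0) / writes:.2f} ms" if writes else "no writes")
```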
7. ClickHouse use case in our platform
   • ClickHouse is used to store and analyze all Kafka API request logs
     • API calls for inter-broker replication from 200+ brokers
     • Produce/Consume requests from clients on 25k+ servers
     • etc.
   • Scale
     • Inserted rows: 7 million rows / sec
     • Total rows: 4.1 trillion rows
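As a minimal sketch of what such a request-log pipeline could look like: the table name, column set, and host are hypothetical (the talk does not show the real schema), and the clickhouse-driver Python client is assumed.

```python
from datetime import datetime
from clickhouse_driver import Client  # assumes the clickhouse-driver package

client = Client("localhost")  # hypothetical host; production would write through shards

# Hypothetical schema for Kafka API request logs.
client.execute("""
    CREATE TABLE IF NOT EXISTS kafka_request_log (
        ts           DateTime64(3),
        broker       LowCardinality(String),
        api          LowCardinality(String),  -- e.g. Produce, Fetch, ListOffsets
        client_id    String,
        topic        String,
        partition_id Int32,
        latency_ms   Float64
    )
    ENGINE = MergeTree
    PARTITION BY toDate(ts)
    ORDER BY (topic, partition_id, ts)
""")

# At 7M rows/sec, inserts must be batched; a single row is shown only for shape.
client.execute(
    "INSERT INTO kafka_request_log VALUES",
    [(datetime.now(), "broker-1", "Produce", "client-a", "some-topic", 0, 1.2)],
)
```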
8. Why ClickHouse?
   • Supports SQL
   • Efficient storage usage
     • Great compression ratio thanks to the columnar architecture
   • High performance
     • Can serve this scale of traffic with only 24 servers (2 replicas * 12 shards)
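The compression ratio is easy to verify on a live cluster: ClickHouse's system.columns system table exposes compressed vs. uncompressed bytes per column (the table filter below refers to the hypothetical schema sketched earlier).

```python
from clickhouse_driver import Client

client = Client("localhost")

# Per-column on-disk vs. raw sizes; the ratio shows what columnar compression buys.
rows = client.execute("""
    SELECT
        name,
        formatReadableSize(data_compressed_bytes)   AS on_disk,
        formatReadableSize(data_uncompressed_bytes) AS raw,
        round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
    FROM system.columns
    WHERE table = 'kafka_request_log' AND data_compressed_bytes > 0
    ORDER BY data_compressed_bytes DESC
""")
for name, on_disk, raw, ratio in rows:
    print(f"{name}: {on_disk} on disk, {raw} raw, {ratio}x")
```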
9. Case study: Identifying a Kafka bug
   • Phenomenon: Produce requests suddenly started failing on a specific partition
10. Why replication stopped
   • The Kafka leader somehow generated non-monotonic message offsets
11. Hypothesis: Race condition
   • By reading the code, we found a potential race condition that could cause non-monotonic message offsets
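This is not Kafka's actual code, but a toy illustration of the bug class: two threads read-then-update a shared end offset without synchronization, so assigned offsets can repeat instead of increasing monotonically.

```python
import threading

class ToyLog:
    """Deliberately unsynchronized offset assignment (illustrative only)."""
    def __init__(self):
        self.end_offset = 0
        self.assigned = []

    def append_batch(self, n):
        offset = self.end_offset          # read shared state...
        for _ in range(1000):             # ...widen the race window...
            pass
        self.end_offset = offset + n      # ...write it back non-atomically
        self.assigned.append(offset)

log = ToyLog()
threads = [threading.Thread(target=lambda: [log.append_batch(1) for _ in range(10_000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With the race, some offsets are assigned twice, so the sequence is not
# strictly increasing; with a lock around append_batch, this prints True.
pairs = zip(log.assigned, log.assigned[1:])
print("monotonic:", all(later > earlier for earlier, later in pairs))
```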
12. How can we prove it?: API request log
   • In fact, there was a concurrent ListOffsets API call when the Produce requests started failing!
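The kind of query that surfaces this, sketched against the hypothetical kafka_request_log schema above; the topic, partition, and time window are placeholders.

```python
from clickhouse_driver import Client

client = Client("localhost")

# Every API call that touched the affected partition around the failure,
# ordered by time, to inspect how Produce and ListOffsets interleaved.
rows = client.execute(
    """
    SELECT ts, api, client_id, broker
    FROM kafka_request_log
    WHERE topic = %(topic)s
      AND partition_id = %(partition)s
      AND ts BETWEEN %(from)s AND %(to)s
    ORDER BY ts
    """,
    {"topic": "affected-topic", "partition": 3,
     "from": "2024-01-01 00:00:00", "to": "2024-01-01 00:05:00"},
)
for ts, api, client_id, broker in rows:
    print(ts, api, client_id, broker)
```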
13. Conclusion
   • ClickHouse plays an important role in our observability stack
   • ClickHouse can store and query a massive number of rows efficiently
   • Kafka API request logs on ClickHouse help identify complex issues