Apache Pinot Case Study - Kafka Summit 2020

Neha Pawar
August 25, 2020

These are the slides for the talk "Apache Pinot Case Study - Building distributed analytics systems using Apache Kafka", from Kafka Summit 2020.

We built Apache Pinot - a real-time distributed OLAP datastore - for low-latency analytics at scale. It is heavily used at companies such as LinkedIn, Uber, and Slack, where Kafka serves as the backbone for capturing vast amounts of data. Pinot ingests millions of events per second from Kafka, builds indexes in real time, and serves 100K+ queries per second while meeting latency SLAs of milliseconds to sub-seconds.
In the first implementation, we used the Kafka Consumer Groups feature to manage offsets and checkpoints across multiple Kafka consumers. To achieve fault tolerance and scalability while maintaining the SLA under a high query workload, we ran multiple consumer groups for the same topic. But this model posed other challenges: since Kafka maintains offsets per consumer group, achieving data consistency across multiple consumer groups was not possible. A failure of a single node in a consumer group also made the entire consumer group unavailable for query processing, and restarting the failed node required a lot of manual operations to ensure data was consumed exactly once. This resulted in management overhead and inefficient hardware utilization.
Taking inspiration from the Kafka consumer group implementation, we redesigned real-time consumption in Pinot to maintain consistent offsets across multiple consumer groups. This allowed us to guarantee consistent data across all replicas, and enabled us to copy data from another consumer group when adding nodes, recovering from node failures, or increasing replication.
In this talk, we will deep dive into the journey of Pinot's real-time ingestion design. We will walk through the new Partition Level Consumers design and see how it is resilient to failures in both Kafka brokers and Pinot components. We will discuss how multiple consumer groups can synchronize checkpoints periodically and maintain consistency, and describe how we achieve this while maintaining strict freshness SLAs and withstanding high ingestion throughput.


Transcript

  1. @apachepinot | @KishoreBytes Apache Pinot @ Other Companies
     2.7k GitHub Stars, 500+ Slack Users, 20+ Companies. Community has tripled in the last two quarters. Join our growing community on the Apache Pinot Slack Channel: https://communityinviter.com/apps/apache-pinot/apache-pinot
  2. @apachepinot | @KishoreBytes Multiple Use Cases: One Platform
     User Facing Applications, Business Facing Metrics, Anomaly Detection, Time Series, all ingesting from Kafka.
     (Scale figures from the slide: 70+, 10k, 100k, 120k Queries/sec, 1M+ Events/sec)
  3. @apachepinot | @KishoreBytes Challenges of User-facing Real-time Analytics
     • Velocity of ingestion, high dimensionality
     • 1000s of QPS, milliseconds latency, seconds freshness
     • Must be highly available, scalable, and cost effective
  4. @apachepinot | @KishoreBytes Pinot Architecture
     • Servers - consuming, indexing, serving
     • Brokers - scatter-gather of queries across the servers
  5. @apachepinot | @KishoreBytes Pinot Realtime Ingestion Basics
     • Kafka Consumer runs on the Pinot Server
     • Periodically create a "Pinot segment"
     • Persist it to the deep store
     • In-memory data is queryable
     • Continue consumption
     (Diagram: Server 1 with a Kafka Consumer, writing to the Deep Store)
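
A minimal sketch of that loop, assuming hypothetical RealtimeSegment (in-memory, queryable index) and DeepStore interfaces; only the org.apache.kafka.clients.consumer calls are the real Kafka API, everything else is illustrative rather than Pinot's actual classes.

import java.time.Duration;
import java.util.List;
import java.util.function.Supplier;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class RealtimeIngestionSketch {

    // Hypothetical stand-ins for Pinot's real components.
    interface RealtimeSegment {            // in-memory, immediately queryable index
        void index(String row);
        int numRows();
    }
    interface DeepStore {                  // e.g. a segment store such as HDFS or S3
        void upload(RealtimeSegment segment);
    }

    static void consumeLoop(KafkaConsumer<String, String> consumer,
                            Supplier<RealtimeSegment> segmentFactory,
                            DeepStore deepStore,
                            int rowThreshold) {
        consumer.subscribe(List.of("events"));
        RealtimeSegment segment = segmentFactory.get();
        while (true) {
            // Kafka consumer runs on the Pinot server itself
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                segment.index(record.value());   // in-memory data, queryable right away
            }
            if (segment.numRows() >= rowThreshold) {
                deepStore.upload(segment);       // seal and persist the "Pinot segment"
                segment = segmentFactory.get();  // continue consumption into a fresh segment
            }
        }
    }
}
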
  6. @apachepinot | @KishoreBytes Kafka Consumer Group based design
     • Each consumer consumes from 1 or more partitions
     (Diagram: 3 partitions, one Consumer Group with a Kafka Consumer on Server 1 and on Server 2)
  7. @apachepinot | @KishoreBytes Kafka Consumer Group based design
     • Each consumer consumes from 1 or more partitions
     • Periodic checkpointing
     (Diagram timeline: Server 1 starts consuming partitions 0 and 2, creates seg 1 and seg 2, checkpoints at offsets 350 and 400)
  8. @apachepinot | @KishoreBytes Kafka Consumer Group based design
     • Fault-tolerant consumption
     • Relied on the Kafka Rebalancer for:
       ◦ Initial partition assignment
       ◦ Rebalancing partitions on node/partition changes
     (Diagram: same timeline as above, with the Kafka Rebalancer coordinating the Consumer Group)
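
Roughly, this first design maps onto the plain Kafka consumer-group API: subscribe() lets the Kafka rebalancer hand out partitions, and checkpointing is a periodic offset commit. The topic name, group id, and timings below are illustrative, not Pinot's actual code.

import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ConsumerGroupIngestion {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "pinot-table-REALTIME");   // illustrative group id
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");        // checkpoint explicitly
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // The Kafka rebalancer decides which partitions this server gets,
            // both initially and whenever nodes or partitions change.
            consumer.subscribe(List.of("events"), new ConsumerRebalanceListener() {
                @Override public void onPartitionsAssigned(Collection<TopicPartition> parts) {
                    System.out.println("Assigned: " + parts);
                }
                @Override public void onPartitionsRevoked(Collection<TopicPartition> parts) {
                    System.out.println("Revoked: " + parts);   // e.g. a partition moving to a new server
                }
            });

            long lastCheckpoint = System.currentTimeMillis();
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(100))) {
                    // index the record into the in-memory segment (omitted)
                }
                // Periodic checkpointing: the offsets live inside Kafka, per consumer group.
                if (System.currentTimeMillis() - lastCheckpoint > 60_000) {
                    consumer.commitSync();
                    lastCheckpoint = System.currentTimeMillis();
                }
            }
        }
    }
}
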
  9. @apachepinot | @KishoreBytes Challenges with Capacity Expansion
     • Server 3 is added to the consumer group
     (Diagram: 3 partitions; Server 1 started consuming partitions 0 and 2 with checkpoints at 350 and 400 (seg1, seg2); the Kafka Rebalancer coordinates Servers 1-3)
  10. @apachepinot | @KishoreBytes Challenges with Capacity Expansion
     • Add Server 3: the rebalancer moves partition 2 to Server 3
     • Server 3 begins consumption from the last checkpoint, offset 400
  11. @apachepinot | @KishoreBytes Challenges with Capacity Expansion
     • Partition 2 moves to Server 3, which begins consumption from checkpoint 400
     • Server 1 had already consumed (and is serving) partition 2 data beyond that checkpoint
     • Result: duplicate data across Server 1 and Server 3 for partition 2!
  12. @apachepinot | @KishoreBytes Multiple Consumer Groups
     • Tried multiple consumer groups to solve the issue, but...
     • No control over which partitions are assigned to a consumer
     • No control over checkpointing
     (Diagram: 3 partitions, 2 replicas: Consumer Group 1 and Consumer Group 2)
  13. @apachepinot | @KishoreBytes Multiple Consumer Groups
     • Segment disparity across the groups
     • Storage inefficient
     (Diagram: Consumer Group 1 and Consumer Group 2 writing to the Deep Store; 3 partitions, 2 replicas)
  14. @apachepinot | @KishoreBytes Operational Complexity
     • Node failure in a consumer group
     • Cannot use the good nodes of Consumer Group 1 and look only for the missing data in Consumer Group 2
     (Diagram: queries served across Consumer Group 1 and Consumer Group 2; 3 partitions, 2 replicas)
  15. @apachepinot | @KishoreBytes Operational Complexity
     • Must disable the consumer group for node failures or capacity changes
  16. @apachepinot | @KishoreBytes Scalability Limitation
     • Scalability limited by the number of partitions (the added Server 4 sits idle)
     • Cost inefficient
     (Diagram: Consumer Group 1 and Consumer Group 2; 3 partitions, 2 replicas)
  17. @apachepinot | @KishoreBytes Single node in a Consumer Group - the only deployment model that worked
     • Eliminates incorrect results
     • Reduced operational complexity
     • But: limited by the capacity of 1 node
     • Storage overhead
     • Scalability limitation
     (Diagram: Consumer Group 1 = Server 1, Consumer Group 2 = Server 2; 3 partitions, 2 replicas)
  18. @apachepinot | @KishoreBytes Issues with the Kafka Consumer Group based solution
     Issue:                       Incorrect Results | Operational Complexity | Storage Overhead | Limited Scalability | Expensive
     Multi-node Consumer Group:           Y         |           Y            |        Y         |          Y          |     Y
     Single-node Consumer Group:          -         |           -            |        Y         |          Y          |     Y
  19. @apachepinot | @KishoreBytes Problem 1: Lack of control with the Kafka Rebalancer. Solution: Take control of partition assignment.
  20. @apachepinot | @KishoreBytes Partition Level Consumption
     • Single coordinator (the Pinot Controller) across all replicas
     • Creates the cluster state - a mapping from partition to servers, segment state, and offsets
     Cluster State (3 partitions, 2 replicas, Pinot Servers S1-S3):
       Partition 0: S1, S2 - CONSUMING, start offset 20
       Partition 1: S3, S1 - CONSUMING, start offset 20
       Partition 2: S2, S3 - CONSUMING, start offset 20
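
The cluster-state table above can be modeled roughly as below; the class and field names are illustrative, not the actual Pinot metadata schema.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Rough model of the per-partition state the controller maintains.
public class ClusterStateSketch {

    enum SegmentState { CONSUMING, ONLINE }

    // One row of the cluster-state table: which servers host a partition's
    // current segment, what state it is in, and its offset range.
    record PartitionSegment(int partition,
                            List<String> servers,
                            SegmentState state,
                            long startOffset,
                            Long endOffset) {}       // endOffset is null while CONSUMING

    private final Map<Integer, PartitionSegment> state = new ConcurrentHashMap<>();

    void put(PartitionSegment segment) {
        state.put(segment.partition(), segment);
    }

    public static void main(String[] args) {
        ClusterStateSketch cs = new ClusterStateSketch();
        // The table from the slide: 3 partitions, 2 replicas, all consuming from offset 20.
        cs.put(new PartitionSegment(0, List.of("S1", "S2"), SegmentState.CONSUMING, 20, null));
        cs.put(new PartitionSegment(1, List.of("S3", "S1"), SegmentState.CONSUMING, 20, null));
        cs.put(new PartitionSegment(2, List.of("S2", "S3"), SegmentState.CONSUMING, 20, null));
    }
}
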
  21. @apachepinot | @KishoreBytes Partition Level Consumption
     • All actions are determined by the cluster state
     • The cluster state tells servers which partitions to consume
     (Cluster State as above: partitions 0-2, all CONSUMING from start offset 20)
  22. @apachepinot | @KishoreBytes Partition Level Consumption
     • Periodically, consuming segments try to commit their segment by reporting the end offset to the controller (offsets 80, 110, 110 in the diagram)
     • Thresholds for commit are configurable - time based, rows based, size based
     (Cluster State as above: partitions 0-2 CONSUMING from start offset 20)
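
A sketch of that threshold check, assuming a segment commits when any of a time, row-count, or size limit is crossed; the limit values and names here are made up for illustration, not Pinot's actual configuration keys.

import java.time.Duration;
import java.time.Instant;

// Illustrative thresholds deciding when a consuming segment should
// try to commit (i.e. report its end offset to the controller).
public class CommitThresholds {
    private final Duration maxAge;      // time based
    private final long maxRows;         // rows based
    private final long maxSizeBytes;    // size based

    public CommitThresholds(Duration maxAge, long maxRows, long maxSizeBytes) {
        this.maxAge = maxAge;
        this.maxRows = maxRows;
        this.maxSizeBytes = maxSizeBytes;
    }

    public boolean shouldCommit(Instant segmentStartTime, long rowsIndexed, long estimatedSizeBytes) {
        return Duration.between(segmentStartTime, Instant.now()).compareTo(maxAge) >= 0
                || rowsIndexed >= maxRows
                || estimatedSizeBytes >= maxSizeBytes;
    }

    public static void main(String[] args) {
        CommitThresholds t = new CommitThresholds(Duration.ofHours(6), 5_000_000, 500L * 1024 * 1024);
        System.out.println(t.shouldCommit(Instant.now().minus(Duration.ofHours(7)), 100_000, 10_000_000));
        // -> true: the time threshold has been crossed
    }
}
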
  23. @apachepinot | @KishoreBytes Partition Level Consumption
     • Replicas send a commit request; the controller picks 1 winner
     • Controller updates the cluster state: partition 0 (S1, S2) moves from CONSUMING to ONLINE
     (Partitions 1 and 2 are still CONSUMING from start offset 20; diagram offsets: 80, 110, 110)
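
A sketch of the controller side of that commit step, assuming a simple first-committer-wins rule; the real winner-selection logic in Pinot is more involved, and the names below are illustrative.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Controller-side handling of segment commit requests for one table.
public class CommitCoordinatorSketch {

    record CommittedSegment(String winnerServer, long endOffset) {}

    // partition -> committed segment metadata, once a winner is chosen
    private final Map<Integer, CommittedSegment> committed = new ConcurrentHashMap<>();

    // Called when a server reports "I have consumed partition p up to endOffset
    // and would like to commit". Returns true only for the single winner.
    public boolean tryCommit(int partition, String server, long endOffset) {
        // putIfAbsent makes the first committer the winner; everyone else is told no.
        return committed.putIfAbsent(partition, new CommittedSegment(server, endOffset)) == null;
    }

    public static void main(String[] args) {
        CommitCoordinatorSketch controller = new CommitCoordinatorSketch();
        System.out.println(controller.tryCommit(0, "S1", 110));   // true  -> S1 wins, builds the segment
        System.out.println(controller.tryCommit(0, "S2", 110));   // false -> S2 waits for the winner
    }
}
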
  24. @apachepinot | @KishoreBytes Partition Level Consumption
     • The winner builds the segment
     • Only 1 server persists the segment to the deep store
     • Only 1 copy is stored
     (Cluster state: partition 0 ONLINE on S1, S2; partitions 1 and 2 still CONSUMING)
  25. @apachepinot | @KishoreBytes Partition Level Consumption
     • All other replicas either:
       ◦ Download the segment from the deep store
       ◦ Or build their own segment, if their data is equivalent
     • Segment equivalence
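
A sketch of that decision on each non-winning replica, under the assumption that "equivalent data" means the replica consumed exactly up to the committed end offset; the helper names are illustrative.

// What a non-winning replica does once the controller marks a segment ONLINE.
public class ReplicaCatchUpSketch {

    interface DeepStore { byte[] download(String segmentName); }

    // If this replica consumed exactly the same offset range as the committed
    // segment, its data is equivalent and it can build its own copy locally;
    // otherwise it discards its in-memory data and downloads the winner's segment.
    static void onSegmentCommitted(String segmentName,
                                   long committedEndOffset,
                                   long myConsumedEndOffset,
                                   DeepStore deepStore) {
        if (myConsumedEndOffset == committedEndOffset) {
            buildLocalSegment(segmentName);                 // segment equivalence: no download needed
        } else {
            byte[] segment = deepStore.download(segmentName);
            loadSegment(segmentName, segment);              // replace in-memory data with the winner's copy
        }
    }

    static void buildLocalSegment(String segmentName) { /* omitted */ }
    static void loadSegment(String segmentName, byte[] data) { /* omitted */ }
}
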
  26. @apachepinot | @KishoreBytes Partition Level Consumption
     • A new segment state is created for partition 0: S1, S2 CONSUMING from offset 110
     • It starts where the previous segment left off (previous segment: ONLINE, offsets 20 to 110)
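
Continuing the controller sketch, the roll-over to the next consuming segment might look like this; the record and field names are again illustrative.

import java.util.List;

// After partition 0's segment is committed as ONLINE with end offset 110,
// the controller creates the next CONSUMING segment for the same replicas,
// starting exactly where the previous one left off.
public class SegmentRollOverSketch {

    record SegmentMeta(int partition, List<String> servers,
                       String state, long startOffset, Long endOffset) {}

    static SegmentMeta nextConsumingSegment(SegmentMeta committed) {
        return new SegmentMeta(committed.partition(),
                               committed.servers(),
                               "CONSUMING",
                               committed.endOffset(),   // new start offset = previous end offset
                               null);                   // end offset unknown until the next commit
    }

    public static void main(String[] args) {
        SegmentMeta committed = new SegmentMeta(0, List.of("S1", "S2"), "ONLINE", 20, 110L);
        System.out.println(nextConsumingSegment(committed));
        // -> SegmentMeta[partition=0, servers=[S1, S2], state=CONSUMING, startOffset=110, endOffset=null]
    }
}
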
  27. @apachepinot | @KishoreBytes Partition Level Consumption
     • The same happens for every partition; each partition is independent of the others
     Cluster State:
       Partition 0: S1, S2 - ONLINE, offsets 20 to 110; new segment CONSUMING from 110
       Partition 1: S3, S1 - ONLINE, offsets 20 to 120; new segment CONSUMING from 120
       Partition 2: S2, S3 - ONLINE, offsets 20 to 100; new segment CONSUMING from 100
  28. @apachepinot | @KishoreBytes Capacity Expansion
     • Cluster state table is updated to include the new server (S4)
     • Consuming segment - restart consumption using the offset in the cluster state
     • Completed Pinot segment - download from the deep store
     • Easy to handle changes in replication/partitions
     • No duplicates!
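
A sketch of what a newly added server (S4) does with each segment the updated cluster state assigns to it; the types here are illustrative rather than Pinot's real interfaces.

import java.util.List;

// What a newly added server does with each segment assigned to it
// by the updated cluster state: no rebalancer involved, no duplicates.
public class NewServerCatchUpSketch {

    enum State { CONSUMING, ONLINE }

    record AssignedSegment(String name, State state, long startOffset) {}

    interface DeepStore { byte[] download(String segmentName); }
    interface StreamConsumer { void startConsuming(String segmentName, long fromOffset); }

    static void bootstrap(List<AssignedSegment> assignments, DeepStore deepStore, StreamConsumer consumer) {
        for (AssignedSegment seg : assignments) {
            switch (seg.state()) {
                // Completed Pinot segment: fetch the single authoritative copy.
                case ONLINE -> deepStore.download(seg.name());
                // Consuming segment: restart consumption from the offset in the cluster state,
                // so the new replica sees exactly the same rows as the existing ones.
                case CONSUMING -> consumer.startConsuming(seg.name(), seg.startOffset());
            }
        }
    }
}
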
  29. @apachepinot | @KishoreBytes Node Failures
     • At least 1 replica is still alive
     • No complex operations needed
     (Diagram: Controller with Servers S1-S4; 3 partitions, 2 replicas)
  30. @apachepinot | @KishoreBytes Scalability
     • Easily add nodes (S5, S6)
     • Segment equivalence = smart segment assignment + smart query routing, e.g. separating Completed Servers from Consuming Servers
  31. @apachepinot | @KishoreBytes Summary
     Issue:                       Incorrect Results | Operational Complexity | Storage Overhead | Limited Scalability | Expensive
     Multi-node Consumer Group:           Y         |           Y            |        Y         |          Y          |     Y
     Single-node Consumer Group:          -         |           -            |        Y         |          Y          |     Y
     Partition Level Consumers:           -         |           -            |        -         |          -          |     -