Apache Pinot Case Study - Kafka Summit 2020

Neha Pawar
August 25, 2020

These are the slides for the talk "Apache Pinot Case Study - Building distributed analytics systems using Apache Kafka", from Kafka Summit 2020.

We built Apache Pinot - a real-time distributed OLAP datastore - for low-latency analytics at scale. It is heavily used at companies such as LinkedIn, Uber, and Slack, where Kafka serves as the backbone for capturing vast amounts of data. Pinot ingests millions of events per second from Kafka, builds indexes in real time, and serves 100K+ queries per second while meeting latency SLAs ranging from milliseconds to sub-second.
In the first implementation, we used the Kafka Consumer Groups feature to manage offsets and checkpoints across multiple Kafka consumers. However, to achieve fault tolerance and scalability, we had to run multiple consumer groups for the same topic; this was our initial strategy for maintaining the SLA under high query workload. But this model posed other challenges: since Kafka maintains offsets per consumer group, achieving data consistency across multiple consumer groups was not possible. Also, the failure of a single node in a consumer group made the entire consumer group unavailable for query processing, and restarting the failed node required a lot of manual operations to ensure data was consumed exactly once. This resulted in management overhead and inefficient hardware utilization.
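
In Kafka-client terms, that first design was essentially the stock consumer-group pattern, as in this minimal Java sketch (topic and group names are illustrative); Kafka owns both the partition assignment and the per-group checkpoints:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    // Approach 1 in miniature: the group coordinator assigns partitions, and
    // committed offsets are stored per consumer group, inside Kafka.
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "pinot-consumers-1");   // checkpoint scope = this group
    props.put("enable.auto.commit", "false");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props);
    consumer.subscribe(List.of("events"));        // assignment left to the rebalancer
    while (true) {
        consumer.poll(Duration.ofMillis(100)).forEach(r -> { /* index into a segment */ });
        consumer.commitSync();                    // periodic checkpoint, visible only to this group
    }

A second replica means a second group.id, and nothing ties the two groups' checkpoints together - which is exactly the consistency gap described above.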
Taking inspiration from the Kafka consumer group implementation, we redesigned real-time consumption in Pinot to maintain consistent offsets across multiple consumer groups. This allowed us to guarantee consistent data across all replicas, and enabled us to copy data from another consumer group during node additions, node failures, or replication increases.
In this talk, we will deep dive into the journey of Pinot's real-time ingestion design. We will talk about the new Partition Level Consumers design and learn how it is resilient to failures in both Kafka brokers and Pinot components. We will discuss how multiple consumer groups can synchronize checkpoints periodically and maintain consistency, and describe how we achieve all of this while maintaining strict freshness SLAs and withstanding high ingestion throughput.

Transcript

  1. @apachepinot | @KishoreBytes
    Apache Pinot Case Study
    Building distributed analytics systems
    using Apache Kafka

  2. @apachepinot | @KishoreBytes

  3. @apachepinot | @KishoreBytes
    Pinot @LinkedIn

  4. @apachepinot | @KishoreBytes
    Pinot @ LinkedIn - User Facing Analytics
    70+ products, 120k+ queries/sec, ms - 1s latency

  6. @apachepinot | @KishoreBytes
    Pinot @ LinkedIn - Business Metrics Analytics
    10k+ metrics, 50k+ dimensions

  7. @apachepinot | @KishoreBytes
    Pinot @ LinkedIn - ThirdEye: Anomaly detection and root cause analysis
    50+ teams, 100K time series

  8. @apachepinot | @KishoreBytes
    Apache Pinot @ Other Companies
    2.7k GitHub stars, 500+ Slack users, 20+ companies
    Community has tripled in the last two quarters
    Join our growing community on the Apache Pinot Slack Channel
    https://communityinviter.com/apps/apache-pinot/apache-pinot

  9. @apachepinot | @KishoreBytes
    Multiple Use Cases: One Platform
    [Diagram: Kafka feeding one Pinot platform - 70+ user facing applications, 10k business facing metrics, 100k anomaly detection time series]
    1M+ events/sec ingested, 120k queries/sec served

  10. @apachepinot | @KishoreBytes
    Challenges of user-facing real-time analytics
    ● Velocity of ingestion
    ● High dimensionality
    ● 1000s of QPS
    ● Milliseconds latency
    ● Seconds freshness
    ● Highly available
    ● Scalable
    ● Cost effective

  11. @apachepinot | @KishoreBytes
    Pinot Real-time Ingestion
    Deep Dive

  12. @apachepinot | @KishoreBytes
    Pinot Architecture
    [Diagram: queries hit Brokers, which scatter-gather across Servers]
    ● Servers - consuming, indexing, serving
    ● Brokers - scatter-gather

  13. @apachepinot | @KishoreBytes
    Pinot Realtime Ingestion Basics
    [Diagram: Kafka consumer on Server 1, persisting segments to deep store]
    ● Kafka consumer on Pinot server
    ● Periodically create a "Pinot segment"
    ● Persist to deep store
    ● In-memory data is queryable
    ● Continue consumption
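
    A minimal sketch of this loop using the Kafka Java client; SegmentBuilder
    and persistToDeepStore are invented stand-ins, not Pinot APIs:

        import java.time.Duration;
        import java.util.ArrayList;
        import java.util.List;
        import java.util.Properties;
        import org.apache.kafka.clients.consumer.ConsumerRecord;
        import org.apache.kafka.clients.consumer.KafkaConsumer;

        public class IngestionSketch {
            // Stand-in for Pinot's in-memory, queryable segment with real-time indexes.
            static class SegmentBuilder {
                final List<byte[]> rows = new ArrayList<>();
                void index(byte[] row) { rows.add(row); }
                boolean shouldFlush() { return rows.size() >= 100_000; }
            }

            static void persistToDeepStore(SegmentBuilder segment) { /* upload sealed segment */ }

            public static void main(String[] args) {
                Properties props = new Properties();
                props.put("bootstrap.servers", "localhost:9092");
                props.put("group.id", "pinot-server-1");
                props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
                props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
                try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                    consumer.subscribe(List.of("events"));
                    SegmentBuilder segment = new SegmentBuilder();
                    while (true) {
                        for (ConsumerRecord<byte[], byte[]> r : consumer.poll(Duration.ofMillis(100))) {
                            segment.index(r.value());        // in-memory data, immediately queryable
                        }
                        if (segment.shouldFlush()) {
                            persistToDeepStore(segment);     // seal and persist the "Pinot segment"
                            segment = new SegmentBuilder();  // continue consumption
                        }
                    }
                }
            }
        }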

  14. @apachepinot | @KishoreBytes
    Kafka Consumer Groups
    Approach 1

  15. @apachepinot | @KishoreBytes
    Kafka Consumer Group based design
    [Diagram: 3 partitions; one consumer group with Kafka consumers on Server 1 and Server 2]
    ● Each consumer consumes from 1 or more partitions

  16. @apachepinot | @KishoreBytes
    Kafka Consumer Group based design
    [Diagram: Server 1 starts consuming partitions 0 and 2; over time it creates segments seg1 and seg2, checkpointing at offsets 350 and 400]
    ● Each consumer consumes from 1 or more partitions
    ● Periodic checkpointing

  17. @apachepinot | @KishoreBytes
    Kafka Consumer Group based design
    [Diagram: same setup, with the Kafka Rebalancer coordinating the group]
    ● Fault tolerant consumption
    ● Relied on Kafka Rebalancer for
      ○ Initial partition assignment
      ○ Rebalancing partitions for node/partition changes
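
    In the Kafka Java client, those rebalancer decisions only surface through
    callbacks; the application cannot steer them. A minimal sketch, given a
    KafkaConsumer<byte[], byte[]> consumer (topic name illustrative):

        import java.util.Collection;
        import java.util.List;
        import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
        import org.apache.kafka.common.TopicPartition;

        consumer.subscribe(List.of("events"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                // checkpoint/flush state for partitions being taken away
            }
            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // start consuming whatever the rebalancer handed us
            }
        });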

  18. @apachepinot | @KishoreBytes
    Challenges with Capacity Expansion
    [Diagram: Server 3 is added to the consumer group; Server 1 had been consuming partitions 0 and 2, with segments seg1/seg2 and checkpoints at 350 and 400]

  19. @apachepinot | @KishoreBytes
    Challenges with Capacity Expansion
    [Diagram: after the rebalance, partition 2 moves to Server 3, which begins consumption from the last checkpoint, offset 400]

  20. @apachepinot | @KishoreBytes
    Challenges with Capacity Expansion
    [Diagram: same as previous slide]
    Duplicate Data across Server 1 and Server 3 for Partition 2!
    (Server 1's in-memory rows past the last checkpoint are not discarded, so both servers now hold events from offset 400 onward.)

  21. @apachepinot | @KishoreBytes
    Multiple Consumer Groups
    [Diagram: 3 partitions, 2 replicas; Consumer Group 1 and Consumer Group 2 each consume the whole topic]
    ● Tried multiple consumer groups to solve the issue, but...
    ● No control over partitions assigned to a consumer
    ● No control over checkpointing

  22. @apachepinot | @KishoreBytes
    Multiple Consumer Groups
    [Diagram: both groups persist their own, differently-bounded segments to deep store]
    ● Segment disparity
    ● Storage inefficient

  23. @apachepinot | @KishoreBytes
    Operational Complexity
    [Diagram: queries scatter-gathered across Consumer Group 1 and Consumer Group 2]
    ● Node failure in a consumer group
    ● Cannot use the good nodes of Consumer Group 1 and fetch only the missing data from Consumer Group 2

  24. @apachepinot | @KishoreBytes
    Operational Complexity
    [Diagram: 3 partitions, 2 replicas]
    ● Disable the whole consumer group for node failures/capacity changes

  25. @apachepinot | @KishoreBytes
    Scalability limitation
    [Diagram: Server 4 joins a consumer group but sits idle - only 3 partitions to go around]
    ● Scalability limited by #partitions
    ● Cost inefficient

  26. @apachepinot | @KishoreBytes
    Single node in a Consumer Group
    The only deployment model that worked
    [Diagram: 3 partitions, 2 replicas; Server 1 alone in Consumer Group 1, Server 2 alone in Consumer Group 2]
    ● Eliminates incorrect results
    ● Reduced operational complexity
    ● Limited by capacity of 1 node
    ● Storage overhead
    ● Scalability limitation

  27. @apachepinot | @KishoreBytes
    Issues with the Kafka Consumer Group based solution

                                    Incorrect  Operational  Storage   Limited      Expensive
                                    results    complexity   overhead  scalability
    Multi-node consumer group       Y          Y            Y         Y            Y
    Single-node consumer group      -          -            Y         Y            Y

  28. @apachepinot | @KishoreBytes
    What were the problems?

  29. @apachepinot | @KishoreBytes
    Problem 1: Lack of control with the Kafka Rebalancer
    Solution: Take control of partition assignment

  30. @apachepinot | @KishoreBytes
    Problem 2: Segment disparity due to the checkpointing mechanism
    Solution: Take control of checkpointing
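
    In Kafka-client terms, both solutions amount to bypassing the group
    coordinator: assign partitions explicitly and seek to offsets you track
    yourself. A minimal sketch, given a KafkaConsumer<byte[], byte[]> consumer
    created without a group.id (the offsets here would come from Pinot's own
    cluster state, introduced on the next slides):

        import java.util.List;
        import org.apache.kafka.common.TopicPartition;

        TopicPartition partition2 = new TopicPartition("events", 2);
        consumer.assign(List.of(partition2));   // Problem 1: we pick the partitions, no rebalancer
        consumer.seek(partition2, 400L);        // Problem 2: we pick the start offset
        // poll() as usual; record end offsets in our own cluster state
        // instead of committing them back to Kafka.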

  31. @apachepinot | @KishoreBytes
    Partition Level Consumption
    Approach 2

  32. @apachepinot | @KishoreBytes
    Partition Level Consumption
    [Diagram: Pinot Controller coordinating Pinot Servers S1, S2, S3; 3 partitions, 2 replicas]

    Cluster State
    Partition   Servers   State                   Start offset   End offset
    0           S1, S2    CONSUMING, CONSUMING    20             -
    1           S3, S1    CONSUMING, CONSUMING    20             -
    2           S2, S3    CONSUMING, CONSUMING    20             -

    ● Single coordinator across all replicas
    ● Creates cluster state - mapping from partition to servers, segment state, offsets

  33. @apachepinot | @KishoreBytes
    Partition Level Consumption
    [Diagram and cluster state as on the previous slide]
    ● All actions determined by cluster state
    ● Cluster state tells servers which partitions to consume
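
    One way to picture that state is a record per partition, as in this
    illustrative sketch (the shape is invented; Pinot actually keeps this
    metadata in ZooKeeper via Apache Helix):

        import java.util.List;
        import java.util.Map;

        enum SegmentState { CONSUMING, ONLINE }

        // endOffset is null while the segment is still consuming.
        record SegmentEntry(List<String> servers, SegmentState state,
                            long startOffset, Long endOffset) {}

        // partition id -> current segment entry
        Map<Integer, SegmentEntry> clusterState = Map.of(
            0, new SegmentEntry(List.of("S1", "S2"), SegmentState.CONSUMING, 20L, null),
            1, new SegmentEntry(List.of("S3", "S1"), SegmentState.CONSUMING, 20L, null),
            2, new SegmentEntry(List.of("S2", "S3"), SegmentState.CONSUMING, 20L, null));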

  34. @apachepinot | @KishoreBytes
    Partition Level Consumption
    [Diagram: cluster state as before; the consuming replicas have meanwhile reached offsets 80 and 110]
    ● Periodically, consuming segments try to commit by reporting their end offset to the controller
    ● Thresholds for commit are configurable - time based, rows based, size based
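
    For example, these thresholds live in the table's stream configs; the key
    names below follow recent Apache Pinot documentation, so treat them as
    approximate and check your Pinot version:

        import java.util.Map;

        // Segment-commit thresholds, expressed as Pinot streamConfigs entries.
        Map<String, String> streamConfigs = Map.of(
            "realtime.segment.flush.threshold.time", "6h",             // time based
            "realtime.segment.flush.threshold.rows", "5000000",        // rows based
            "realtime.segment.flush.threshold.segment.size", "200M");  // size based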

  35. @apachepinot | @KishoreBytes
    Partition Level Consumption
    [Diagram: partition 0's replicas send Commit requests with their end offsets (80, 110); partition 0's entry flips to ONLINE with end offset 110]
    ● Controller picks 1 winner
    ● Updates cluster state
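
    A sketch of that controller-side step, with invented names (the real
    mechanism is Pinot's segment completion protocol):

        import java.util.List;

        record CommitRequest(String server, int partition, long endOffset) {}

        // Pick one winning commit; every replica must converge on its end offset.
        static CommitRequest pickWinner(List<CommitRequest> reports) {
            // One reasonable policy: the first replica to report wins. The winning
            // end offset is then written into the cluster state for this segment.
            return reports.get(0);
        }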

  36. @apachepinot | @KishoreBytes
    Partition Level Consumption
    [Diagram: the winning replica persists partition 0's segment (end offset 110) to deep store]
    ● Winner builds the segment
    ● Only 1 server persists the segment to deep store
    ● Only 1 copy stored

  37. @apachepinot | @KishoreBytes
    Partition Level Consumption
    [Diagram: the other replica of partition 0 catches up from deep store or from its own data]
    ● All other replicas
      ○ Download from deep store
      ○ Or build their own segment if their data is equivalent
    ● Segment equivalence
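
    Segment equivalence falls out of the shared cluster state: every replica
    consumed the same offset range, so a locally built segment is
    interchangeable with the committed copy. A sketch of the replica-side
    decision, names invented:

        // Runs on each non-winning replica once a segment commit is announced.
        void onSegmentCommitted(long committedEndOffset, long myCurrentOffset) {
            if (myCurrentOffset == committedEndOffset) {
                buildLocalSegment();        // same rows consumed => equivalent segment
            } else {
                discardInMemoryRows();
                downloadFromDeepStore();    // catch up from the committed copy
            }
            startNextConsumingSegmentAt(committedEndOffset);
        }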

  38. @apachepinot | @KishoreBytes
    Partition Level Consumption
    [Diagram: partition 0's first segment is now ONLINE on S1, S2 with offsets 20 - 110; a new CONSUMING segment for partition 0 starts on S1, S2 at offset 110]
    ● New segment state created
    ● Starts where the previous segment left off

  39. @apachepinot | @KishoreBytes
    Partition Level Consumption

    Cluster State
    Partition   Servers   Completed segment (ONLINE)   Next segment (CONSUMING)
    0           S1, S2    offsets 20 - 110             from 110
    1           S3, S1    offsets 20 - 120             from 120
    2           S2, S3    offsets 20 - 100             from 100

    ● Same for every partition
    ● Each partition independent of the others

  40. @apachepinot | @KishoreBytes
    Capacity expansion
    [Diagram: Server S4 added; 3 partitions, 2 replicas]
    ● Consuming segment - restart consumption using the offset in cluster state
    ● Pinot segment - download from deep store
    ● Easy to handle changes in replication/partitions
    ● No duplicates!
    ● Cluster state table updated
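
    What a newly added server does per assigned segment, reusing the
    SegmentEntry shape from the cluster-state sketch (names still invented):

        // Bootstrap driven entirely by cluster state - no rebalance, no duplicates.
        void bootstrap(int partition, SegmentEntry entry) {
            if (entry.state() == SegmentState.ONLINE) {
                downloadFromDeepStore(partition);                   // completed segment
            } else {
                consumeFromOffset(partition, entry.startOffset());  // resume at recorded offset
            }
        }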

  41. @apachepinot | @KishoreBytes
    Node failures
    [Diagram: S4 stands in for a failed server; 3 partitions, 2 replicas]
    ● At least 1 replica still alive
    ● No complex operations

  42. @apachepinot | @KishoreBytes
    Scalability
    [Diagram: consuming servers (S1 - S3) separated from completed servers (S4 - S6)]
    ● Easily add nodes
    ● Segment equivalence = smart segment assignment + smart query routing
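
    Because a committed segment is identical on every replica, completed
    segments can be moved onto dedicated servers and the broker can route by
    segment state. An illustrative sketch, again reusing the SegmentEntry
    shape, with invented helper names:

        // Broker-side routing: sealed segments to "completed" servers,
        // fresh in-memory rows to the consuming servers.
        List<String> routeQuery(SegmentEntry entry) {
            return entry.state() == SegmentState.ONLINE
                ? completedServersFor(entry)
                : consumingServersFor(entry);
        }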

  43. @apachepinot | @KishoreBytes
    Summary

                                    Incorrect  Operational  Storage   Limited      Expensive
                                    results    complexity   overhead  scalability
    Multi-node consumer group       Y          Y            Y         Y            Y
    Single-node consumer group      -          -            Y         Y            Y
    Partition level consumers       -          -            -         -            -

  44. @apachepinot | @KishoreBytes
    Q&A
    pinot.apache.org
    @apachepinot
