Minions to The Rescue—Tackling Complex Operations in Apache Pinot (Haitao Zhang & Xiaobing Li, StarTree) | RTA Summit 2023

Apache Pinot is a real-time distributed OLAP datastore that powers a variety of analytics use cases, which usually require executing high-throughput queries with low latency. To ensure data completeness, result correctness, and system performance, Pinot needs to run background operational tasks such as data compaction, GDPR data purging, and reindexing after schema evolution. However, these operations can be computationally intensive and can easily degrade query performance if they run on the same component that executes queries.

Pinot leverages Minion, a Pinot-native component built on Apache Helix’s task framework, to execute those computationally intensive operational tasks, offloading work from the query execution component so that query performance is not sacrificed. The Minion component is designed to be easily extensible and pluggable: in addition to addressing the above issue, Minion is also used to build common data ingestion and backfilling pipelines, saving operators the time of building customized, ad hoc ones.

In this talk, we will dive deep into the Minion component and demonstrate how we leverage it in some typical operational tasks. We will also discuss the challenges we faced while operating Minion at scale and how we greatly reduced operational overhead by improving observability and introducing auto-scaling mechanisms.

To summarize, on one hand, Minion takes on most of the operational burden in Pinot, helping real-time analytics run smoothly; on the other hand, Minion gives operators the flexibility to perform complex operations that were previously hard (or even impossible), providing a more delightful analytics product experience.

StarTree
May 23, 2023

Transcript

  1. Minions to The Rescue - Tackling
    Complex Operations in Apache
    Pinot
    Haitao Zhang, Software Engineer
    Xiaobing Li, Software Engineer

  2. Table of Contents
    01 OPERATIONAL CHALLENGES
    Complex, tedious, costly, error prone…
    02 MINION TO THE RESCUE
    Automatic, hands free, code free…
    03 MINION TASK FRAMEWORK
    Pinot native, easy to operate, easy to extend
    04 FUTURE WORK
    On both framework and functions

  3. Pinot Guiding Design Principles
    ● Low query latency at high QPS
    ● Highly available: no single point of failure
    ● Horizontally scalable: scale by adding new nodes
    ● Easily extensible via plugins

  4. Operations Behind The Scene
    As in other OLAP databases, several major types of operations run behind the scenes to support performant queries

  5. Operations Behind The Scene
    Operation | Purpose | Examples
    data ingestion | data queryability & completeness | Various data sources: streaming system, data lake, blob store, …; various data formats: avro, csv, json, parquet, …

  6. Operations Behind The Scene
    Operation | Purpose | Examples
    data update | data correctness & privacy | GDPR data purging; record update

  7. Operations Behind The Scene
    Operation | Purpose | Examples
    data reformat | query performance | data compaction; segment resizing; data re-partitioning

  8. Perform Above Operations in General
    ● Data Ingestion: Pull vs Push
    ● Data Update: In-Place vs Re-Ingestion
    ● Data Reformat: In-Place vs Re-Ingestion

  9. Support Those Operations w/o Minion
    Operation | Realtime Table | Offline Table
    data ingestion | servers pull data | build dedicated pipeline/workflow using spark, hadoop, or standalone ingestion framework

  10. Support Those Operations w/o Minion
    Operation | Realtime Table | Offline Table
    data ingestion | servers pull data | build dedicated pipeline/workflow using spark, hadoop, or standalone ingestion framework
    data update | servers perform data upsert & dedup while ingesting data; for other updates, need data re-ingestion + data replacement | data re-ingestion + data replacement

  11. Support Those Operations w/o Minion
    Operation | Realtime Table | Offline Table
    data ingestion | servers pull data | build dedicated pipeline/workflow using spark, hadoop, or standalone ingestion framework
    data update | servers perform data upsert & dedup while ingesting data; for other updates, need data re-ingestion + data replacement | data re-ingestion + data replacement
    data reformat | data re-ingestion + data replacement | data re-ingestion + data replacement

  12. Operation Challenges w/o Minion
    ● Operations are fragmented and complex
    ○ In addition to Pinot, need to set up and maintain other systems
    ● Some operations are manual, tedious, time consuming and error prone
    ○ e.g., ad-hoc data ingestion
    ● Some operations are hard to achieve
    ○ e.g., repartitioning data, segment resizing

  13. Need Better Solutions

  14. Key Considerations of Ideal Solutions
    ● Ease of operations
    ○ No (or minimal) extra external systems or pipelines
    ● Unification of control
    ○ Makes it easier to guarantee atomicity, zero downtime, etc.
    ● Separation of concerns
    ○ Existing Pinot components should not be affected

  15. Minions to the Rescue

  16. Minions to the Rescue
    ● Pinot-native solution for Complex & Generic Operations
    ○ Elastic
    ○ Pluggable

  17. Use Case: File Ingestion Task
    ● A StarTree extension to ingest files from file
    systems (S3, GCS, ADLS, Local Fs, etc.) to
    Pinot
    ● Key Features
    ○ Exactly-once
    ○ With preprocessing like
    transform/merge/partition/sort etc.
    ● Replaced our earlier solutions based on some
    open source batch systems.

  18. Use Case: File Ingestion Task Cont.
    Work Mode | Scenarios
    Bootstrap | Input files do not change
    Sync | Existing files are updated regularly, and the segments should be kept in sync with the files
    Incremental Ingestion | New files can be added during ingestion, and the tasks should detect and ingest them automatically

  19. Use Case: Segment Refresh Task
    ● A StarTree extension to refresh table segments
    ● Key features
    ○ Automatically detect table config changes
    ○ Refresh the old segments atomically
    ● Make the impossible possible
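
One way to picture the atomic refresh is with a lineage record that tracks which old segments are being replaced by which new ones (the talk later names segment lineage as the mechanism for this task). The sketch below is a simplified in-memory model, not Pinot's implementation: the function names, the dictionary layout, and the state strings are all assumptions made for illustration.

```python
def start_replacement(lineage, entry_id, old_segments, new_segments):
    # Record the intent before any new segment becomes visible to queries.
    lineage[entry_id] = {
        "from": list(old_segments),
        "to": list(new_segments),
        "state": "IN_PROGRESS",
    }


def end_replacement(lineage, entry_id):
    # A single state flip is the atomic switch from old to new segments.
    lineage[entry_id]["state"] = "COMPLETED"


def queryable_segments(all_segments, lineage):
    hidden = set()
    for entry in lineage.values():
        # Until the swap completes, hide the new segments; afterwards, hide the old ones.
        hidden.update(entry["from"] if entry["state"] == "COMPLETED" else entry["to"])
    return [s for s in all_segments if s not in hidden]


all_segments = ["old_1", "old_2", "new_1"]
lineage = {}
start_replacement(lineage, "e1", ["old_1", "old_2"], ["new_1"])
before = queryable_segments(all_segments, lineage)
end_replacement(lineage, "e1")
after = queryable_segments(all_segments, lineage)
```

In this model a task that fails between the two calls leaves the entry in progress, so queries keep seeing only the old segments and the replacement can be retried or cleaned up.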

  20. Use Case: Segment Refresh Task Cont.
    Supported Operation | Explanation
    Time partitioning | Re-partition segments to be time partitioned
    Value partitioning | Re-partition the segments per the partitioning config
    Merge/Split | Merge small segments or split large segments (with rollup support) to ensure segments are properly sized
    Other table config changes | Change time column; change sorted column; change column data type; change column encoding
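
The Merge/Split row can be illustrated with a small planning routine. This is only a sketch of the sizing idea, assuming segments are described by a name and a row count and that a single hypothetical `target_rows` threshold defines "properly sized"; the actual task's sizing rules are not described in the slides.

```python
def plan_resize(segments, target_rows):
    """Return (action, segment names) steps for a list of (name, row_count) pairs."""
    plan, batch, batch_rows = [], [], 0

    def flush():
        # Only a group of two or more small segments is worth merging.
        if len(batch) > 1:
            plan.append(("merge", list(batch)))
        batch.clear()

    for name, rows in segments:
        if rows > target_rows:
            plan.append(("split", [name]))  # oversized segment
            continue
        if batch_rows + rows > target_rows:
            flush()  # current group is full, start a new one
            batch_rows = 0
        batch.append(name)
        batch_rows += rows
    flush()
    return plan


plan = plan_resize(
    [("s1", 20), ("s2", 30), ("s3", 250), ("s4", 60), ("s5", 40)],
    target_rows=100,
)
```

Here `s3` is split because it exceeds the target, while `s1`/`s2` and `s4`/`s5` are grouped into merges that stay at or below 100 rows.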

  21. Minion Task Framework
    ● Built on Helix Task Framework
    ○ Distributed task execution engine
    ○ Part of Helix cluster mgmt framework
    ● Flexible task definition and scheduling
    ○ Workflow (DAG or FIFO)
    ○ Jobs
    ○ Tasks (most basic exec unit)

  22. Minion Task Framework
    ● But simplified a lot for Pinot
    ○ Two interfaces to extend
    ● No complex job deps
    ○ Just a batch of tasks to run in parallel
    ○ Let Helix schedule and watch them
    ● Failure handling is critical
    ○ We don’t use the built-in retry mechanism

  23. To add new task type
    ● Extend PinotTaskGenerator
    ○ Runs inside Pinot controller process
    ○ The output is a List<PinotTaskConfig>
    ○ One task instance per PinotTaskConfig
    ● e.g., in FileIngestionTask:
    ○ List files in an input folder
    ○ Identify failed tasks and retry their input
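
The generator's role for file ingestion can be sketched as follows. This is a Python model of the idea only, not the Java `PinotTaskGenerator` interface: `TaskConfig`, `generate_tasks`, and the `max_tasks` cap are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskConfig:
    """Hypothetical stand-in for one task config: one task instance per config."""
    task_type: str
    input_file: str


def generate_tasks(listed_files, done_files, failed_files, max_tasks=3):
    # Inputs of previously failed tasks are retried first.
    retries = [f for f in sorted(failed_files) if f in listed_files]
    # Then files that have never been attempted.
    fresh = [
        f for f in sorted(listed_files)
        if f not in done_files and f not in failed_files
    ]
    pending = retries + fresh
    return [TaskConfig("FileIngestionTask", f) for f in pending[:max_tasks]]


tasks = generate_tasks(
    listed_files={"a.csv", "b.csv", "c.csv", "d.csv"},
    done_files={"a.csv"},
    failed_files={"c.csv"},
)
```

The already ingested `a.csv` is skipped, the failed `c.csv` is scheduled again, and the remaining files follow in name order.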

  24. To add new task type
    ● Extend PinotTaskExecutor
    ○ Most basic execution unit
    ○ Run on Pinot minion workers
    ○ Each task instance runs in a single thread
    ● e.g., in FileIngestionTask:
    ○ Fetch files as set in PinotTaskConfig
    ○ Generate segments and upload
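
The executor side can be sketched in the same spirit. The code below is an illustrative model rather than the Java `PinotTaskExecutor` interface: the config keys, the in-memory `file_store` and `segment_store` dictionaries, and the segment naming are assumptions, and a "segment" here is just a list of parsed rows.

```python
import csv
import io


def execute_task(task_config, file_store, segment_store):
    """Run one task instance in the calling thread: fetch, build, upload."""
    path = task_config["inputFile"]
    raw = file_store[path]                         # fetch the file named in the config
    rows = list(csv.DictReader(io.StringIO(raw)))  # build the segment contents
    stem = path.rsplit(".", 1)[0]
    segment_name = task_config["tableName"] + "_" + stem
    segment_store[segment_name] = rows             # upload the segment
    return segment_name


file_store = {"day1.csv": "id,clicks\n1,10\n2,5\n"}
segment_store = {}
name = execute_task(
    {"tableName": "events", "inputFile": "day1.csv"},
    file_store,
    segment_store,
)
```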

  25. Failure handling
    ● Failure handling logic depends on what the task must guarantee
    ○ Helix’s built-in retry mechanism is disabled
    ○ because it lacks the context needed to handle failures properly
    ● e.g., FileIngestionTask requires exactly-once ingestion
    ○ Used custom checkpoints to handle task failure
    ● e.g., SegmentRefreshTask requires atomic replacement
    ○ Used segment lineage to handle task failure
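
The checkpoint idea for exactly-once ingestion can be sketched as below. This is a minimal model under stated assumptions, not StarTree's implementation: the checkpoint is a plain set, segments are keyed by their input file so a retry overwrites rather than duplicates, and `FlakyBuilder` is a hypothetical helper that simulates one failure.

```python
def ingest_files(files, checkpoint, segments, build_segment):
    for path in files:
        if path in checkpoint:
            continue  # already committed by an earlier run
        # Keying the segment by its input makes a retry an overwrite, not a duplicate.
        segments["seg_" + path] = build_segment(path)
        checkpoint.add(path)  # commit only after the segment is stored


class FlakyBuilder:
    """Hypothetical segment builder that fails once on a chosen file."""

    def __init__(self, fail_once_on):
        self.fail_once_on = fail_once_on
        self.calls = []

    def __call__(self, path):
        self.calls.append(path)
        if path == self.fail_once_on:
            self.fail_once_on = None
            raise RuntimeError("simulated failure on " + path)
        return "rows from " + path


files = ["f1", "f2", "f3"]
checkpoint, segments = set(), {}
builder = FlakyBuilder(fail_once_on="f2")

try:
    ingest_files(files, checkpoint, segments, builder)
except RuntimeError:
    pass  # the task failed part-way; the next run resumes from the checkpoint

ingest_files(files, checkpoint, segments, builder)
```

The second run skips `f1`, retries `f2`, and continues with `f3`, so every file ends up in exactly one segment even though `f2` was attempted twice.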

  26. Minion mgmt UI and APIs
    ● View completed/failed/running task status
    ● Trigger or stop tasks etc.
    ● New task types can reuse those directly

  27. More observability
    ● Goes beyond the task status available from the UI or API
    ● Detailed metrics help enable ops automation
    ○ Docs to them

  28. Auto scaling
    ● Minion tasks tend to run periodically or only occasionally
    ○ No need to keep minion workers running all the time
    ● In StarTree cloud, we have implemented auto-scaling
    ○ Given min/max workers to provision
    ○ StarTree cloud adds or removes workers based on pending workload
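
A scaling decision of this kind can be sketched as a clamp between the configured bounds. The slides do not describe the actual policy, so the function below is an assumption: `tasks_per_worker` is a hypothetical capacity figure, and the pending workload is reduced to a simple task count.

```python
import math


def desired_workers(pending_tasks, tasks_per_worker, min_workers, max_workers):
    # Workers needed to drain the pending queue at the assumed per-worker capacity.
    needed = math.ceil(pending_tasks / tasks_per_worker)
    # Never go below the minimum or above the maximum configured for provisioning.
    return max(min_workers, min(max_workers, needed))


idle = desired_workers(0, tasks_per_worker=4, min_workers=0, max_workers=10)
busy = desired_workers(9, tasks_per_worker=4, min_workers=0, max_workers=10)
capped = desired_workers(100, tasks_per_worker=4, min_workers=1, max_workers=10)
```

With a minimum of zero, the worker pool can shrink to nothing while no tasks are pending.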

  29. Future work: better Minion framework
    ● More on auto scaling
    ○ Leverage spot instances for more cost saving
    ○ Scale up instance types based on worker resource usage or task types

  30. Future work: better Minion framework
    ● More on auto scaling
    ● Resource isolation among tables
    ○ When tasks are generated and executed
    ○ When tasks are queued up in Helix

  31. Future work: better Minion framework
    ● More on auto scaling
    ● Resource isolation among tables
    ● DAG based scheduling
    ○ For flexible inter-task scheduling

  32. Future work: new Minion tasks
    ● Many existing Minion tasks today:
    ○ Data ingestion: files/objects, databases, data lakes
    ○ Segment mgmt: merge/rollup/refresh/purge
    ● Can use Minion to implement materialized views
    ● Use Minion task to trigger external services

  33. Thank You!
    dev.startree.ai

  34. Backup slides start here

  35. Operation Challenges
    ● build dedicated pipeline/workflow using other systems
    ○ More systems to maintain
    ○ Time consuming
    ○ Error prone
    ● use realtime table (instead of offline table) to ingest batch data and support
    upsert and dedup
    ○ Need to maintain an extra streaming service
    ○ Need to transform batch data and send it to the streaming service
    ○ Extra resources needed on servers
    ○ Time consuming
    ○ Error prone
    Need to update this page

  36. How to Support Those Operations without Minion
    Operation | Realtime Table | Offline Table
    data ingestion | Native support: server ingests data from stream services; data becomes queryable immediately | Need to build dedicated pipeline/workflow
    data deletion | Need to build dedicated pipeline/workflow | Need to build dedicated pipeline/workflow
    data update: upsert & dedup | Native support: server performs upsert & dedup while ingesting data. A key differentiator of Pinot | Need to build dedicated pipeline/workflow
    data reformat | Need to build dedicated pipeline/workflow | Need to build dedicated pipeline/workflow

  37. Operations Behind the Scene
    As in other OLAP databases, there are two major types of operations to
    support queries
    ● Data ingestion
    ○ Various data sources: Kafka, blob store, data lake, SQL database …
    ○ Various data formats: avro, csv, json, parquet …
    ● Data update
    ○ data compaction
    ○ GDPR data purging
    ○ segment resizing
    ○ data re-partitioning
    ○ ...

  38. Operation Challenges for Realtime Tables
    ● What works well
    ○ Realtime data ingestion on server
    ■ data becomes queryable immediately
    ○ Pinot has built-in support for upsert and dedup
    ■ a key differentiator of Pinot
    ● Challenges
    ○ Need to build dedicated pipeline/workflow to support data update
    ■ time consuming and error prone

  39. Operation Challenges for Offline Tables
    ● Need to build dedicated pipeline/workflow to support both batch data
    ingestion and data update
    ○ time consuming and error prone
    ● Can use realtime tables instead to solve data ingestion, upsert and dedup
    issues
    ○ Extra cost to set up streaming service and send data to streams

  40. Use Case: Segment Merge Rollup
    A built-in minion task allowing users to
    ● Merge small segments into larger ones
    ● Rollup values if needed
    "tableName": "myTable_OFFLINE",
    "tableType": "OFFLINE",
    ...
    ...
    "task": {
      "taskTypeConfigsMap": {
        "MergeRollupTask": {
          "1day.mergeType": "concat",
          "1day.bucketTimePeriod": "1d",
          "1day.bufferTimePeriod": "1d"
        }
      }
    }
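
To make the two time settings concrete, the sketch below models one simplified reading of them: segments are grouped into buckets of `bucketTimePeriod`, and only data older than `bufferTimePeriod` is eligible for merging. It illustrates that reading only and is not the built-in task's logic; `plan_merges` and the `(name, start_ms, end_ms)` tuples are hypothetical.

```python
from collections import defaultdict

DAY_MS = 24 * 60 * 60 * 1000


def plan_merges(segments, now_ms, bucket_ms=DAY_MS, buffer_ms=DAY_MS):
    """Group (name, start_ms, end_ms) segments into per-bucket merge candidates."""
    buckets = defaultdict(list)
    for name, start_ms, end_ms in segments:
        if end_ms > now_ms - buffer_ms:
            continue  # still inside the buffer window, leave untouched
        buckets[start_ms // bucket_ms].append(name)
    # A bucket holding a single segment has nothing to concatenate.
    return {b: names for b, names in sorted(buckets.items()) if len(names) > 1}


now_ms = 10 * DAY_MS
merges = plan_merges(
    [
        ("a", 7 * DAY_MS + 1000, 7 * DAY_MS + 2000),
        ("b", 7 * DAY_MS + 5000, 7 * DAY_MS + 6000),
        ("c", 8 * DAY_MS + 1000, 8 * DAY_MS + 2000),
        ("d", 9 * DAY_MS + 1000, 9 * DAY_MS + 2000),
    ],
    now_ms,
)
```

Segments `a` and `b` share day bucket 7 and are merged, `c` is alone in its bucket, and `d` is too recent to be considered.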
