Minions to The Rescue—Tackling Complex Operations in Apache Pinot (Haitao Zhang & Xiaobing Li, StarTree) | RTA Summit 2023

Apache Pinot is a real-time distributed OLAP datastore that powers a variety of analytics use cases, which usually require executing high-throughput queries with low latency. To ensure data completeness, result correctness, and system performance, Pinot needs to execute background operational tasks such as data compaction, GDPR data purging, and reindexing after schema evolution. However, these operations can be computationally intensive and can easily degrade query performance if they run on the same component that executes queries.

Pinot leverages Minion, a Pinot-native component built upon Apache Helix’s task framework, to execute those computationally intensive operational tasks, offloading work from the query execution component so that query performance is not sacrificed. The Minion component is designed to be easily extensible and pluggable. In addition to addressing the above issue, Minion is also used to build common data ingestion and backfilling pipelines, saving operators the time of building customized, ad hoc ones.

In this talk, we will take a deep dive into the Minion component and demonstrate how we leverage it in some typical operational tasks. We will also discuss the challenges faced while operating Minion at scale and how we greatly reduced the operational overhead by improving observability and introducing auto-scaling mechanisms.

To summarize: on the one hand, Minion takes on most of the operational burden in Pinot, helping real-time analytics run smoothly; on the other hand, Minion gives operators the flexibility to perform complex operations that were previously hard (or even impossible) to perform, providing a more delightful analytics product experience.

StarTree

May 23, 2023

Transcript

  1. Minions to The Rescue - Tackling Complex Operations in Apache Pinot

    Haitao Zhang, Software Engineer
    Xiaobing Li, Software Engineer
  2. Table of Contents

    01 OPERATIONAL CHALLENGES: Complex, tedious, costly, error prone…
    02 MINION TO THE RESCUE: Automatic, hands free, code free…
    03 MINION TASK FRAMEWORK: Pinot native, easy to operate, easy to extend
    04 FUTURE WORK: On both framework and functions
  3. Pinot Guiding Design Principles

    • Low query latency at high QPS
    • Highly available: no single point of failure
    • Horizontally scalable: scale by adding new nodes
    • Easily extensible via plugins
  4. Operations Behind The Scene

    Similar to other OLAP databases, there are several major types of operations to support performant queries
  5. Operations Behind The Scene

    Operation: data ingestion
    Purpose: data queryability & completeness
    Examples:
    • Various data sources: streaming system, data lake, blob store, …
    • Various data formats: avro, csv, json, parquet …
  6. Operations Behind The Scene

    Operation: data update
    Purpose: data correctness & privacy
    Examples:
    • GDPR data purging
    • Record update
  7. Operations Behind The Scene

    Operation: data reformat
    Purpose: query performance
    Examples:
    • data compaction
    • segment resizing
    • data re-partitioning
  8. Perform Above Operations in General

    Data Ingestion: Pull vs Push
    Data Update: In-Place vs Re-Ingestion
    Data Reformat: In-Place vs Re-Ingestion
  9. Support Those Operations w/o Minion

    data ingestion
    • Realtime Table: servers pull data
    • Offline Table: build dedicated pipeline/workflow using spark, hadoop, or standalone ingestion framework
  10. Support Those Operations w/o Minion

    data ingestion
    • Realtime Table: servers pull data
    • Offline Table: build dedicated pipeline/workflow using spark, hadoop, or standalone ingestion framework
    data update
    • Realtime Table: servers perform data upsert & dedup while ingesting data; for other updates, need data re-ingestion + data replacement
    • Offline Table: data re-ingestion + data replacement
  11. Support Those Operations w/o Minion

    data ingestion
    • Realtime Table: servers pull data
    • Offline Table: build dedicated pipeline/workflow using spark, hadoop, or standalone ingestion framework
    data update
    • Realtime Table: servers perform data upsert & dedup while ingesting data; for other updates, need data re-ingestion + data replacement
    • Offline Table: data re-ingestion + data replacement
    data reformat
    • data re-ingestion + data replacement
  12. Operation Challenges w/o Minion

    • Operations are fragmented and complex
      ◦ In addition to Pinot, need to set up and maintain other systems
    • Some operations are manual, tedious, time consuming and error prone
      ◦ e.g., ad-hoc data ingestion
    • Some operations are hard to achieve
      ◦ e.g., repartitioning data, segment resizing
  13. Key Considerations of Ideal Solutions

    • Ease of operations
      ◦ No/minimize extra/external systems/pipelines
    • Unification of control
      ◦ It’s easier to guarantee atomicity, no downtime etc.
    • Separation of concerns
      ◦ Existing Pinot components should not be affected
  14. Minions to the Rescue

    • Pinot-native solution for Complex & Generic Operations
      ◦ Elastic
      ◦ Pluggable
  15. Use Case: File Ingestion Task

    • A StarTree extension to ingest files from file systems (S3, GCS, ADLS, Local Fs, etc.) to Pinot
    • Key Features
      ◦ Exactly-once
      ◦ With preprocessing like transform/merge/partition/sort etc.
    • Replaced our earlier solutions based on some open source batch systems.
  16. Use Case: File Ingestion Task Cont.

    Work modes and their scenarios:
    • Bootstrap: input files do not change
    • Sync: existing files are updated regularly and want to keep the segments and files in sync
    • Incremental Ingestion: new files can be added during ingestion, and want the tasks to detect and ingest new files automatically
  17. Use Case: Segment Refresh Task

    • A StarTree extension to refresh table segments
    • Key features
      ◦ Automatically detect table config change
      ◦ Refresh the old segments atomically
    • Make impossible possible
  18. Use Case: Segment Refresh Task Cont.

    Supported operations and their explanations:
    • Time partitioning: re-partition segments to be time partitioned
    • Value partitioning: re-partition the segments per the partitioning config
    • Merge/Split: merge small segments or split large segments (with rollup support) to ensure segments are properly sized
    • Other table config changes: change time column, change sorted column, change column data type, change column encoding
  19. Minion Task Framework

    • Built on Helix Task Framework
      ◦ Distributed task execution engine
      ◦ Part of Helix cluster mgmt framework
    • Flexible task definition and scheduling
      ◦ Workflow (DAG or FIFO)
      ◦ Jobs
      ◦ Tasks (most basic exec unit)
  20. Minion Task Framework

    • But simplified a lot for Pinot
      ◦ Two interfaces to extend
    • No complex job deps
      ◦ Just a batch of tasks to run in parallel
      ◦ Let Helix schedule and watch them
    • Failure handling is critical
      ◦ We don’t use the in-built retry mechanism
  21. To add new task type

    • Extend PinotTaskGenerator
      ◦ Runs inside Pinot controller process
      ◦ The output is a List<PinotTaskConfig>
      ◦ One task instance per PinotTaskConfig
    • e.g., in FileIngestionTask:
      ◦ List files in an input folder
      ◦ Identify failed tasks and retry their input
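
The generator role described on this slide can be sketched roughly as follows. This is a minimal illustration, not Pinot's actual API: `TaskConfigSketch`, `TaskGeneratorSketch`, the `inputFile` config key, and the `alreadyIngested` set are hypothetical stand-ins for `PinotTaskConfig`, `PinotTaskGenerator`, and whatever bookkeeping the real FileIngestionTask uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for PinotTaskConfig: a task type plus key/value settings.
class TaskConfigSketch {
    final String taskType;
    final Map<String, String> configs;

    TaskConfigSketch(String taskType, Map<String, String> configs) {
        this.taskType = taskType;
        this.configs = configs;
    }
}

// Hypothetical stand-in for the generator interface that runs on the controller.
interface TaskGeneratorSketch {
    String getTaskType();

    // Each returned config becomes one task instance.
    List<TaskConfigSketch> generateTasks(List<String> inputFiles, Set<String> alreadyIngested);
}

class FileIngestionGeneratorSketch implements TaskGeneratorSketch {
    @Override
    public String getTaskType() {
        return "FileIngestionTask";
    }

    @Override
    public List<TaskConfigSketch> generateTasks(List<String> inputFiles, Set<String> alreadyIngested) {
        List<TaskConfigSketch> tasks = new ArrayList<>();
        for (String file : inputFiles) {
            // Files whose earlier task failed never reached alreadyIngested,
            // so they are scheduled again on this round.
            if (alreadyIngested.contains(file)) {
                continue;
            }
            Map<String, String> configs = new HashMap<>();
            configs.put("inputFile", file); // hypothetical config key
            tasks.add(new TaskConfigSketch(getTaskType(), configs));
        }
        return tasks;
    }
}
```

Calling `generateTasks` with three listed files, one of which is already ingested, yields two task configs, one per remaining file.
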
  22. To add new task type

    • Extend PinotTaskExecutor
      ◦ Most basic execution unit
      ◦ Run on Pinot minion workers
      ◦ Each task instance runs in single thread
    • e.g., in FileIngestionTask:
      ◦ Fetch files as set in PinotTaskConfig
      ◦ Generate segments and upload
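
The executor side might look like the sketch below. It is an illustration under stated assumptions rather than Pinot's real `PinotTaskExecutor`: the interface, the `inputFile` key, and the in-memory maps standing in for the file system and the segment store are all hypothetical, and "generating a segment" is reduced to counting rows.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the executor interface that runs on a minion worker.
interface TaskExecutorSketch {
    // Executes one task instance in the calling thread and returns the segment name.
    String executeTask(Map<String, String> taskConfigs);
}

class FileIngestionExecutorSketch implements TaskExecutorSketch {
    // In-memory stand-ins: file name -> contents, and the list of uploaded segments.
    final Map<String, String> fileSystem = new HashMap<>();
    final List<String> uploadedSegments = new ArrayList<>();

    @Override
    public String executeTask(Map<String, String> taskConfigs) {
        // Step 1: fetch the file named in the task config.
        String file = taskConfigs.get("inputFile"); // hypothetical config key
        String contents = fileSystem.get(file);
        if (contents == null) {
            throw new IllegalStateException("Input file not found: " + file);
        }
        // Step 2: "generate" a segment; here that only means counting the rows.
        int rows = contents.split("\n").length;
        String segmentName = file + "_segment_" + rows + "rows";
        // Step 3: upload the segment.
        uploadedSegments.add(segmentName);
        return segmentName;
    }
}
```

Because each task instance runs in a single thread, the sketch keeps no shared mutable state between tasks other than the stand-in stores.
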
  23. Failure handling

    • Failure handling logic depends on what to achieve
      ◦ Helix in-built retry mechanism was disabled
      ◦ As it doesn’t know all the context to handle failure properly
    • e.g., FileIngestionTask requires exactly once ingestion
      ◦ Used custom checkpoints to handle task failure
    • e.g., SegmentRefreshTask requires atomic replacement
      ◦ Used segment lineage to handle task failure
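
One way to picture checkpoint-based failure handling is the sketch below. The slides do not describe how the custom checkpoints are stored or committed, so everything here is an assumption: the class, the in-memory checkpoint set, and in particular the premise that the segment upload and the checkpoint write are committed together.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: the checkpoint store and segment store are in-memory stand-ins.
class CheckpointedIngestionSketch {
    final Set<String> checkpointedFiles = new HashSet<>();
    final List<String> uploadedSegments = new ArrayList<>();

    /**
     * Ingests one file unless a checkpoint says it is already done.
     * failBeforeCommit simulates a worker crash before anything is committed.
     * Returns true if this call ingested the file.
     */
    boolean ingest(String file, boolean failBeforeCommit) {
        if (checkpointedFiles.contains(file)) {
            return false; // an earlier attempt already committed this file
        }
        if (failBeforeCommit) {
            throw new IllegalStateException("simulated task failure for " + file);
        }
        // Assumption: upload and checkpoint are committed as one step, so a
        // retry can never observe one without the other.
        uploadedSegments.add("segment_for_" + file);
        checkpointedFiles.add(file);
        return true;
    }
}
```

A failed attempt leaves no segment and no checkpoint, so the retry ingests the file once; any further retry is a no-op. A blind framework-level retry has no access to this context, which is the reason the slide gives for disabling it.
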
  24. Minion mgmt UI and APIs

    • View task complete/failure/running status
    • Trigger or stop tasks etc.
    • New task types can reuse those directly
  25. More observability

    • Other than task status from UI or API
    • Detailed metrics to help enable ops automation
      ◦ Docs to them
  26. Auto scaling

    • Minion tasks tend to run regularly or occasionally
      ◦ No need to keep minion workers around
    • In StarTree cloud, we have implemented auto-scaling
      ◦ Given min/max workers to provision
      ◦ StarTree cloud decides to add/rm workers based on pending workload
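
A scaling decision of this kind can be sketched as a clamped ceiling division. The slides do not give StarTree's actual policy; the `tasksPerWorker` capacity parameter and the class name are hypothetical.

```java
// Hypothetical sketch of a pending-workload scaling rule; not StarTree's actual policy.
class MinionAutoScalerSketch {
    /**
     * Returns how many workers to provision for the pending tasks,
     * clamped to the configured [minWorkers, maxWorkers] range.
     */
    static int desiredWorkers(int pendingTasks, int tasksPerWorker, int minWorkers, int maxWorkers) {
        if (tasksPerWorker <= 0 || minWorkers < 0 || maxWorkers < minWorkers) {
            throw new IllegalArgumentException("invalid scaling bounds");
        }
        int pending = Math.max(0, pendingTasks);
        // Ceiling division: enough workers to cover every pending task.
        int needed = (pending + tasksPerWorker - 1) / tasksPerWorker;
        return Math.max(minWorkers, Math.min(maxWorkers, needed));
    }
}
```

With `minWorkers` set to 0, an idle cluster scales down to no workers at all, which matches the slide's point that minion workers need not be kept around between runs.
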
  27. Future work: better Minion framework

    • More on auto scaling
      ◦ Leverage spot instances for more cost saving
      ◦ Scale up instance types based on worker resource usage or task types
  28. Future work: better Minion framework

    • More on auto scaling
    • Resource isolation among tables
      ◦ When tasks are generated and executed
      ◦ When tasks are queued up in Helix
  29. Future work: better Minion framework

    • More on auto scaling
    • Resource isolation among tables
    • DAG based scheduling
      ◦ For flexible inter-task scheduling
  30. Future work: new Minion tasks

    • Many existing Minion tasks today:
      ◦ Data ingestion: files/objects, databases, data lakes
      ◦ Segment mgmt: merge/rollup/refresh/purge
    • Can use minion to implement materialized view
    • Use Minion task to trigger external services
  31. Operation Challenges

    • build dedicated pipeline/workflow using other systems
      ◦ More systems to maintain
      ◦ Time consuming
      ◦ Error prone
    • use realtime table (instead of offline table) to ingest batch data and support upsert and dedup
      ◦ Need to maintain an extra streaming service
      ◦ Need to transform batch data and send it to the streaming service
      ◦ Extra resources needed on servers
      ◦ Time consuming
      ◦ Error prone
  32. How to Support Those Operations without Minion

    data ingestion
    • Realtime Table: native support; server ingests data from stream services, and data become queryable immediately
    • Offline Table: need to build dedicated pipeline/workflow
    data deletion
    • Realtime Table: need to build dedicated pipeline/workflow
    • Offline Table: need to build dedicated pipeline/workflow
    data update: upsert & dedup
    • Realtime Table: native support; server performs upsert & dedup while ingesting data. A key differentiator of Pinot
    • Offline Table: need to build dedicated pipeline/workflow
    data reformat
    • Realtime Table: need to build dedicated pipeline/workflow
    • Offline Table: need to build dedicated pipeline/workflow
  33. Operations Behind the Scene

    Similar to other OLAP databases, there are two major types of operations to support query
    • Data ingestion
      ◦ Various data sources: Kafka, blob store, data lake, SQL database …
      ◦ Various data format: avro, csv, json, parquet …
    • Data update
      ◦ data compaction
      ◦ GDPR data purging
      ◦ segment resizing
      ◦ data re-partitioning
      ◦ ...
  34. Operation Challenges for Realtime Tables

    • What works well
      ◦ Realtime data ingestion on server
        ▪ data becomes queryable immediately
      ◦ Pinot has built-in support for upsert and dedup
        ▪ a key differentiator of Pinot
    • Challenges
      ◦ Need to build dedicated pipeline/workflow to support data update
        ▪ time consuming and error prone
  35. Operation Challenges for Offline Tables

    • Need to build dedicated pipeline/workflow to support both batch data ingestion and data update
      ◦ time consuming and error prone
    • Can use realtime tables instead to solve data ingestion, upsert and dedup issue
      ◦ Extra cost to set up streaming service and send data to streams
  36. Use Case: Segment Merge Rollup

    A built-in minion task allowing users to
    • Merge small segments into larger ones
    • Rollup values if needed

    "tableName": "myTable_OFFLINE",
    "tableType": "OFFLINE",
    ... ...
    "task": {
      "taskTypeConfigsMap": {
        "MergeRollupTask": {
          "1day.mergeType": "concat",
          "1day.bucketTimePeriod": "1d",
          "1day.bufferTimePeriod": "1d"
        }
      }
    }
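
To make the two time settings in this config more concrete, here is a rough sketch of how a bucket period and a buffer period could select segments for merging. It reflects one reading of `bucketTimePeriod` (segments are grouped into fixed time windows) and `bufferTimePeriod` (windows that ended too recently are left alone); the class, method, and parameter names are hypothetical, and the real MergeRollupTask applies further conditions that are omitted here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of bucket/buffer selection; not the MergeRollupTask implementation.
class MergeBucketSketch {
    /**
     * Groups segments by the time bucket containing their start time and keeps
     * only buckets that ended at least bufferMs before nowMs.
     * Returns bucket start time -> names of segments to merge together.
     */
    static Map<Long, List<String>> mergeCandidates(
            Map<String, Long> segmentStartMs, long bucketMs, long bufferMs, long nowMs) {
        Map<Long, List<String>> buckets = new TreeMap<>();
        for (Map.Entry<String, Long> entry : segmentStartMs.entrySet()) {
            long bucketStart = Math.floorDiv(entry.getValue(), bucketMs) * bucketMs;
            long bucketEnd = bucketStart + bucketMs;
            if (bucketEnd > nowMs - bufferMs) {
                continue; // bucket is still inside the buffer window, so skip it
            }
            buckets.computeIfAbsent(bucketStart, k -> new ArrayList<>()).add(entry.getKey());
        }
        return buckets;
    }
}
```

With a one-day bucket and a one-day buffer, as in the config above, segments from today and from the bucket that ended less than a day ago stay untouched, while older days each form one merge group.
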