Minions to The Rescue—Tackling Complex Operations in Apache Pinot (Haitao Zhang & Xiaobing Li, StarTree) | RTA Summit 2023

Minions to The Rescue - Tackling Complex Operations in Apache Pinot
Haitao Zhang, Software Engineer
Xiaobing Li, Software Engineer

Table of Contents
01 OPERATIONAL CHALLENGES: Complex, tedious, costly, error prone…
02 MINION TO THE RESCUE: Automatic, hands free, code free…
03 MINION TASK FRAMEWORK: Pinot native, easy to operate, easy to extend
04 FUTURE WORK: On both framework and functions

Pinot Guiding Design Principles
• Low query latency at high QPS
• Highly available: no single point of failure
• Horizontally scalable: scale by adding new nodes
• Easily extensible via plugins

Operations Behind The Scene
Similar to other OLAP databases, there are several major types of operations to support performant queries:
• Data ingestion
  ◦ Purpose: data queryability & completeness
  ◦ Examples: various data sources (streaming system, data lake, blob store, …); various data formats (avro, csv, json, parquet, …)
• Data update
  ◦ Purpose: data correctness & privacy
  ◦ Examples: GDPR data purging, record update
• Data reformat
  ◦ Purpose: query performance
  ◦ Examples: data compaction, segment resizing, data re-partitioning

Perform Above Operations in General
• Data ingestion: pull vs. push
• Data update: in-place vs. re-ingestion
• Data reformat: in-place vs. re-ingestion

Support Those Operations w/o Minion
• Data ingestion
  ◦ Realtime table: servers pull data
  ◦ Offline table: build dedicated pipeline/workflow using spark, hadoop, or standalone ingestion framework
• Data update
  ◦ Realtime table: servers perform data upsert & dedup while ingesting data; for other updates, need data re-ingestion + data replacement
  ◦ Offline table: data re-ingestion + data replacement
• Data reformat
  ◦ Realtime and offline tables: data re-ingestion + data replacement

Operation Challenges w/o Minion
• Operations are fragmented and complex
  ◦ In addition to Pinot, need to set up and maintain other systems
• Some operations are manual, tedious, time consuming and error prone
  ◦ e.g., ad-hoc data ingestion
• Some operations are hard to achieve
  ◦ e.g., repartitioning data, segment resizing

Need Better Solutions

Key Considerations of Ideal Solutions
• Ease of operations
  ◦ No/minimize extra/external systems/pipelines
• Unification of control
  ◦ It’s easier to guarantee atomicity, no downtime etc.
• Separation of concerns
  ◦ Existing Pinot components should not be affected

Minions to the Rescue

Minions to the Rescue
• Pinot-native solution for Complex & Generic Operations
  ◦ Elastic
  ◦ Pluggable

Use Case: File Ingestion Task
• A StarTree extension to ingest files from file systems (S3, GCS, ADLS, local FS, etc.) to Pinot
• Key features
  ◦ Exactly-once
  ◦ With preprocessing like transform/merge/partition/sort etc.
• Replaced our earlier solutions based on some open source batch systems

Use Case: File Ingestion Task Cont.
Work modes and their scenarios:
• Bootstrap: input files do not change
• Sync: existing files are updated regularly, and the segments and files should be kept in sync
• Incremental ingestion: new files can be added during ingestion, and the tasks should detect and ingest new files automatically

Use Case: Segment Refresh Task
• A StarTree extension to refresh table segments
• Key features
  ◦ Automatically detect table config change
  ◦ Refresh the old segments atomically
• Makes the impossible possible

Use Case: Segment Refresh Task Cont.
Supported operations:
• Time partitioning: re-partition segments to be time partitioned
• Value partitioning: re-partition the segments per the partitioning config
• Merge/Split: merge small segments or split large segments (with rollup support) to ensure segments are properly sized
• Other table config changes: change time column, change sorted column, change column data type, change column encoding
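The Merge/Split row above is about keeping segments properly sized. As a rough illustration only, the sketch below greedily groups consecutive small segments until a target size would be exceeded. The function name `plan_merges`, the size unit, and the greedy strategy are all assumptions made for this example; the deck does not describe how the Segment Refresh Task actually plans merges.

```python
from typing import List, Tuple


def plan_merges(segments: List[Tuple[str, int]], target_size: int) -> List[List[str]]:
    """Group consecutive segments so each group stays at or under target_size.

    `segments` is a list of (segment_name, size) pairs in time order.
    A segment larger than target_size ends up alone in its own group
    (splitting it is out of scope for this sketch).
    """
    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in segments:
        # Close the current group if adding this segment would overshoot.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

Each returned group would become one merge task producing one larger segment.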

Minion Task Framework
• Built on Helix Task Framework
  ◦ Distributed task execution engine
  ◦ Part of Helix cluster mgmt framework
• Flexible task definition and scheduling
  ◦ Workflow (DAG or FIFO)
  ◦ Jobs
  ◦ Tasks (most basic exec unit)

Minion Task Framework
• But simplified a lot for Pinot
  ◦ Two interfaces to extend
• No complex job deps
  ◦ Just a batch of tasks to run in parallel
  ◦ Let Helix schedule and watch them
• Failure handling is critical
  ◦ We don’t use the built-in retry mechanism

To add new task type
• Extend PinotTaskGenerator
  ◦ Runs inside Pinot controller process
  ◦ The output is a List<PinotTaskConfig>
  ◦ One task instance per PinotTaskConfig
• e.g. in FileIngestionTask:
  ◦ List files in an input folder
  ◦ Identify failed tasks and retry their input
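The generator step for a file ingestion task can be sketched as follows. This is a language-neutral illustration in Python, not Pinot's Java interface: `TaskConfig` is a hypothetical stand-in for `PinotTaskConfig`, and the `inputFile` key and the `succeeded` / `in_progress` bookkeeping are assumptions for the example. The idea it shows is that a file whose earlier task failed is in neither set, so the next generation round emits a task for it again.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class TaskConfig:
    """Hypothetical stand-in for PinotTaskConfig: a task type plus string configs."""
    task_type: str
    configs: Dict[str, str] = field(default_factory=dict)


def generate_tasks(input_files: Iterable[str],
                   succeeded: Set[str],
                   in_progress: Set[str]) -> List[TaskConfig]:
    """Return one TaskConfig per input file that still needs ingestion.

    Files already ingested or currently being processed are skipped.
    Files from failed tasks appear in neither set, so they are retried.
    """
    tasks: List[TaskConfig] = []
    for path in sorted(input_files):
        if path in succeeded or path in in_progress:
            continue
        tasks.append(TaskConfig("FileIngestionTask", {"inputFile": path}))
    return tasks
```

For example, with files `a.csv`, `b.csv`, `c.csv`, where `a.csv` succeeded and `c.csv` is still running, only a task for `b.csv` is generated.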

To add new task type
• Extend PinotTaskExecutor
  ◦ Most basic execution unit
  ◦ Run on Pinot minion workers
  ◦ Each task instance runs in single thread
• e.g. in FileIngestionTask:
  ◦ Fetch files as set in PinotTaskConfig
  ◦ Generate segments and upload
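The executor side can be pictured with the sketch below. It is only a model of the execution pattern: in Pinot, Helix schedules tasks across minion workers, whereas here a local thread pool stands in for that, and `execute_task`, the `taskId` and `rows` keys, and the returned segment dictionary are hypothetical. What it illustrates is that each task runs start to finish on one thread while the batch as a whole runs in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def execute_task(config: Dict) -> Dict:
    """Hypothetical executor body for one task instance.

    `config["rows"]` stands in for the contents of the fetched input file;
    the returned dict stands in for a generated and uploaded segment.
    """
    rows = config["rows"]
    return {"name": f"segment_{config['taskId']}", "numDocs": len(rows)}


def run_batch(configs: List[Dict], num_workers: int = 2) -> List[Dict]:
    """Run a batch of independent tasks in parallel, one thread per task."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() returns results in the same order as the input configs.
        return list(pool.map(execute_task, configs))
```

Because the tasks have no dependencies on each other, no ordering logic is needed beyond handing the batch to the scheduler.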

Failure handling
• Failure handling logic depends on what to achieve
  ◦ Helix built-in retry mechanism was disabled
  ◦ As it doesn’t know all the context to handle failure properly
• e.g. FileIngestionTask requires exactly-once ingestion
  ◦ Used custom checkpoints to handle task failure
• e.g. SegmentRefreshTask requires atomic replacement
  ◦ Used segment lineage to handle task failure
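To make the atomic-replacement idea concrete, here is a simplified model of how a lineage record can hide half-finished work from queries. The `LineageEntry` class, its field names, and the three state strings are assumptions for this sketch and are not taken from the deck. The point is that the switch from old to new segments is a single state flip, and a failed task leaves the old segments visible.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class LineageEntry:
    """Simplified, hypothetical lineage record for one replacement."""
    segments_from: List[str]    # segments being replaced
    segments_to: List[str]      # replacement segments
    state: str = "IN_PROGRESS"  # or "COMPLETED" / "REVERTED"


def queryable_segments(all_segments: Set[str], lineage: List[LineageEntry]) -> Set[str]:
    """Return the segments a query should see given the lineage entries."""
    visible = set(all_segments)
    for entry in lineage:
        if entry.state == "COMPLETED":
            # Replacement finished: hide the old segments.
            visible -= set(entry.segments_from)
        else:
            # In progress or reverted: hide the new segments.
            visible -= set(entry.segments_to)
    return visible
```

If the task fails midway, the entry can be marked reverted and the partially uploaded new segments never become visible.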

Minion mgmt UI and APIs
• View task complete/failure/running status
• Trigger or stop tasks etc.
• New task types can reuse those directly

More observability
• Other than task status from UI or API
• Detailed metrics to help enable ops automation
  ◦ Docs to them

Auto scaling
• Minion tasks tend to run regularly or occasionally
  ◦ No need to keep minion workers around
• In StarTree cloud, we have implemented auto-scaling
  ◦ Given min/max workers to provision
  ◦ StarTree cloud decides to add/rm workers based on pending workload
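A scaling decision of this kind can be sketched as a clamp on the worker count implied by the pending workload. The function name, the `tasks_per_worker` parameter, and the ceiling-division policy are assumptions for illustration; the deck does not say how StarTree cloud computes the number of workers.

```python
import math


def desired_workers(pending_tasks: int, tasks_per_worker: int,
                    min_workers: int, max_workers: int) -> int:
    """Pick a worker count for the pending workload, clamped to [min, max]."""
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    needed = math.ceil(pending_tasks / tasks_per_worker)
    return max(min_workers, min(max_workers, needed))
```

With `min_workers=0`, an idle cluster scales down to no minion workers at all, which matches the observation that workers need not be kept around between runs.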

Future work: better Minion framework
• More on auto scaling
  ◦ Leverage spot instances for more cost saving
  ◦ Scale up instance types based on worker resource usage or task types
• Resource isolation among tables
  ◦ When tasks are generated and executed
  ◦ When tasks are queued up in Helix
• DAG based scheduling
  ◦ For flexible inter-task scheduling

Future work: new Minion tasks
• Many existing Minion tasks today:
  ◦ Data ingestion: files/objects, databases, data lakes
  ◦ Segment mgmt: merge/rollup/refresh/purge
• Can use Minion to implement materialized view
• Use Minion task to trigger external services

Thank You! dev.startree.ai

Back up slides starting from here

Operation Challenges
• Build dedicated pipeline/workflow using other systems
  ◦ More systems to maintain
  ◦ Time consuming
  ◦ Error prone
• Use realtime table (instead of offline table) to ingest batch data and support upsert and dedup
  ◦ Need to maintain an extra streaming service
  ◦ Need to transform batch data and send it to the streaming service
  ◦ Extra resources needed on servers
  ◦ Time consuming
  ◦ Error prone
Need to update this page

How to Support Those Operations without Minion
• Data ingestion
  ◦ Realtime table: native support; server ingests data from stream services, and data becomes queryable immediately
  ◦ Offline table: need to build dedicated pipeline/workflow
• Data deletion
  ◦ Realtime table: need to build dedicated pipeline/workflow
  ◦ Offline table: need to build dedicated pipeline/workflow
• Data update (upsert & dedup)
  ◦ Realtime table: native support; server performs upsert & dedup while ingesting data. A key differentiator of Pinot
  ◦ Offline table: need to build dedicated pipeline/workflow
• Data reformat
  ◦ Realtime table: need to build dedicated pipeline/workflow
  ◦ Offline table: need to build dedicated pipeline/workflow

Operations Behind the Scene
Similar to other OLAP databases, there are two major types of operations to support queries
• Data ingestion
  ◦ Various data sources: Kafka, blob store, data lake, SQL database …
  ◦ Various data formats: avro, csv, json, parquet …
• Data update
  ◦ data compaction
  ◦ GDPR data purging
  ◦ segment resizing
  ◦ data re-partitioning
  ◦ ...

Operation Challenges for Realtime Tables
• What works well
  ◦ Realtime data ingestion on server
    ▪ data becomes queryable immediately
  ◦ Pinot has built-in support for upsert and dedup
    ▪ a key differentiator of Pinot
• Challenges
  ◦ Need to build dedicated pipeline/workflow to support data update
    ▪ time consuming and error prone

Operation Challenges for Offline Tables
• Need to build dedicated pipeline/workflow to support both batch data ingestion and data update
  ◦ time consuming and error prone
• Can use realtime tables instead to solve data ingestion, upsert and dedup issues
  ◦ Extra cost to set up streaming service and send data to streams

Use Case: Segment Merge Rollup
A built-in minion task allowing users to
• Merge small segments into larger ones
• Rollup values if needed

  "tableName": "myTable_OFFLINE",
  "tableType": "OFFLINE",
  ... ...
  "task": {
    "taskTypeConfigsMap": {
      "MergeRollupTask": {
        "1day.mergeType": "concat",
        "1day.bucketTimePeriod": "1d",
        "1day.bufferTimePeriod": "1d"
      }
    }
  }
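To illustrate what "rollup values" means in general, the sketch below collapses rows that share a dimension value and fall into the same one-day time bucket, summing their metric. The column names (`ts`, `country`, `clicks`), the sum aggregation, and the one-day rounding are assumptions chosen for the example; this is not a description of how `MergeRollupTask` interprets the config keys shown above.

```python
from collections import defaultdict
from typing import Dict, List

DAY_MS = 24 * 60 * 60 * 1000  # one day in milliseconds


def rollup(rows: List[Dict], bucket_ms: int = DAY_MS) -> List[Dict]:
    """Sum `clicks` over rows with the same country and time bucket."""
    totals = defaultdict(int)
    for row in rows:
        # Round the timestamp down to the start of its bucket.
        bucket_start = row["ts"] // bucket_ms * bucket_ms
        totals[(bucket_start, row["country"])] += row["clicks"]
    return [
        {"ts": ts, "country": country, "clicks": clicks}
        for (ts, country), clicks in sorted(totals.items())
    ]
```

A plain concatenating merge, by contrast, would keep every input row and only reduce the number of segments.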
