Slide 1

Minions to the Rescue - Tackling Complex Operations in Apache Pinot
Haitao Zhang, Software Engineer
Xiaobing Li, Software Engineer

Slide 2

Table of Contents
01 OPERATIONAL CHALLENGES - Complex, tedious, costly, error-prone…
02 MINIONS TO THE RESCUE - Automatic, hands-free, code-free…
03 MINION TASK FRAMEWORK - Pinot-native, easy to operate, easy to extend
04 FUTURE WORK - On both framework and functions

Slide 3

Pinot Guiding Design Principles
● Low query latency at high QPS
● Highly available: no single point of failure
● Horizontally scalable: scale by adding new nodes
● Easily extensible via plugins

Slide 4

Operations Behind the Scenes
Similar to other OLAP databases, Pinot relies on several major types of operations to keep queries performant.

Slide 5

Operations Behind the Scenes
Operation: data ingestion
Purpose: data queryability & completeness
Examples:
● various data sources: streaming system, data lake, blob store, …
● various data formats: Avro, CSV, JSON, Parquet, …

Slide 6

Operations Behind the Scenes
Operation: data update
Purpose: data correctness & privacy
Examples:
● GDPR data purging
● record update

Slide 7

Operations Behind the Scenes
Operation: data reformat
Purpose: query performance
Examples:
● data compaction
● segment resizing
● data re-partitioning

Slide 8

Performing the Above Operations in General
● Data ingestion: pull vs. push
● Data update: in-place vs. re-ingestion
● Data reformat: in-place vs. re-ingestion

Slide 9

Support Those Operations w/o Minion
Operation: data ingestion
● Realtime table: servers pull data
● Offline table: build a dedicated pipeline/workflow using Spark, Hadoop, or the standalone ingestion framework

Slide 10

Support Those Operations w/o Minion
Operation: data ingestion
● Realtime table: servers pull data
● Offline table: build a dedicated pipeline/workflow using Spark, Hadoop, or the standalone ingestion framework
Operation: data update
● Realtime table: servers perform upsert & dedup while ingesting data; other updates need data re-ingestion + data replacement
● Offline table: data re-ingestion + data replacement

Slide 11

Support Those Operations w/o Minion
Operation: data ingestion
● Realtime table: servers pull data
● Offline table: build a dedicated pipeline/workflow using Spark, Hadoop, or the standalone ingestion framework
Operation: data update
● Realtime table: servers perform upsert & dedup while ingesting data; other updates need data re-ingestion + data replacement
● Offline table: data re-ingestion + data replacement
Operation: data reformat
● Realtime & offline tables: data re-ingestion + data replacement

Slide 12

Operational Challenges w/o Minion
● Operations are fragmented and complex
○ In addition to Pinot, you need to set up and maintain other systems
● Some operations are manual, tedious, time-consuming, and error-prone
○ e.g., ad-hoc data ingestion
● Some operations are hard to achieve
○ e.g., repartitioning data, segment resizing

Slide 13

Need Better Solutions

Slide 14

Key Considerations for an Ideal Solution
● Ease of operations
○ No (or minimal) extra/external systems and pipelines
● Unification of control
○ Makes it easier to guarantee atomicity, zero downtime, etc.
● Separation of concerns
○ Existing Pinot components should not be affected

Slide 15

Minions to the Rescue

Slide 16

Minions to the Rescue
● Pinot-native solution for complex & generic operations
○ Elastic
○ Pluggable

Slide 17

Use Case: File Ingestion Task
● A StarTree extension to ingest files from file systems (S3, GCS, ADLS, local FS, etc.) into Pinot
● Key features
○ Exactly-once ingestion
○ Preprocessing such as transform/merge/partition/sort
● Replaced our earlier solutions built on open-source batch systems

Slide 18

Use Case: File Ingestion Task (cont.)
Work mode: Bootstrap
● Scenario: input files do not change
Work mode: Sync
● Scenario: existing files are updated regularly, and you want to keep the segments and files in sync
Work mode: Incremental Ingestion
● Scenario: new files can be added during ingestion, and you want the tasks to detect and ingest them automatically

Slide 19

Use Case: Segment Refresh Task
● A StarTree extension to refresh table segments
● Key features
○ Automatically detects table config changes
○ Refreshes the old segments atomically
● Makes the impossible possible

Slide 20

Use Case: Segment Refresh Task (cont.)
Supported operations:
● Time partitioning: re-partition segments so they are time-partitioned
● Value partitioning: re-partition segments per the partitioning config
● Merge/split: merge small segments or split large segments (with rollup support) to keep segments properly sized
● Other table config changes: change time column, sorted column, column data type, or column encoding

Slide 21

Minion Task Framework
● Built on the Helix Task Framework
○ A distributed task execution engine
○ Part of the Helix cluster management framework
● Flexible task definition and scheduling
○ Workflows (DAG or FIFO)
○ Jobs
○ Tasks (the most basic execution unit)
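
To make the hierarchy concrete, below is a minimal sketch using the generic Helix Task Framework API (not Pinot's simplified wrapper); the workflow, job, and task-type names are placeholders.

    import java.util.Collections;
    import org.apache.helix.HelixManager;
    import org.apache.helix.task.JobConfig;
    import org.apache.helix.task.TaskConfig;
    import org.apache.helix.task.TaskDriver;
    import org.apache.helix.task.Workflow;

    // Sketch of the Helix hierarchy: a workflow holds jobs,
    // and each job fans out into tasks, the basic execution units.
    public class HelixTaskSketch {
      public static void submit(HelixManager manager) {
        TaskDriver driver = new TaskDriver(manager);

        // A task: the most basic execution unit.
        TaskConfig task = new TaskConfig("MyTaskType", Collections.emptyMap());

        // A job: a batch of tasks that Helix schedules in parallel.
        JobConfig.Builder job = new JobConfig.Builder()
            .setCommand("MyTaskType")
            .addTaskConfigs(Collections.singletonList(task));

        // A workflow: jobs can be chained as a DAG or queued FIFO.
        Workflow workflow = new Workflow.Builder("my-workflow")
            .addJob("my-job", job)
            .build();
        driver.start(workflow);
      }
    }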

Slide 22

Minion Task Framework
● But simplified a lot for Pinot
○ Only two interfaces to extend
● No complex job dependencies
○ Just a batch of tasks run in parallel
○ Helix schedules and watches them
● Failure handling is critical
○ We don't use the built-in retry mechanism

Slide 23

To Add a New Task Type
● Extend PinotTaskGenerator
○ Runs inside the Pinot controller process
○ The output is a List<PinotTaskConfig>
○ One task instance per PinotTaskConfig
● e.g., in FileIngestionTask:
○ List files in an input folder
○ Identify failed tasks and retry their input
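
A minimal sketch of the generator side is below, written against the open-source interfaces (exact class names and packages vary a bit across Pinot versions); the task type "MyTask" and its config keys are invented for illustration.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import org.apache.pinot.controller.helix.core.minion.generator.BaseTaskGenerator;
    import org.apache.pinot.core.minion.PinotTaskConfig;
    import org.apache.pinot.spi.config.table.TableConfig;

    // Sketch of a custom generator: runs on the controller and emits
    // one PinotTaskConfig per unit of work for minions to pick up.
    public class MyTaskGenerator extends BaseTaskGenerator {

      @Override
      public String getTaskType() {
        return "MyTask"; // illustrative task type name
      }

      @Override
      public List<PinotTaskConfig> generateTasks(List<TableConfig> tableConfigs) {
        List<PinotTaskConfig> tasks = new ArrayList<>();
        for (TableConfig tableConfig : tableConfigs) {
          // e.g. list files in the input folder, skip ones already done,
          // re-enqueue the inputs of failed tasks, then emit task configs.
          tasks.add(new PinotTaskConfig("MyTask", Map.of(
              "tableName", tableConfig.getTableName(),
              "input", "file-001"))); // illustrative config keys
        }
        return tasks;
      }
    }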

Slide 24

To Add a New Task Type
● Extend PinotTaskExecutor
○ The most basic execution unit
○ Runs on Pinot minion workers
○ Each task instance runs in a single thread
● e.g., in FileIngestionTask:
○ Fetch the files set in PinotTaskConfig
○ Generate segments and upload them
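
A matching sketch of the executor side, with the same caveats: the interface shape (executeTask plus cancel) follows open-source Pinot, while the "input" config key is illustrative.

    import org.apache.pinot.core.minion.PinotTaskConfig;
    import org.apache.pinot.minion.executor.PinotTaskExecutor;

    // Sketch of a custom executor: one instance runs one task,
    // single-threaded, on a minion worker.
    public class MyTaskExecutor implements PinotTaskExecutor {
      private volatile boolean _cancelled = false;

      @Override
      public Object executeTask(PinotTaskConfig taskConfig) throws Exception {
        // "input" is an illustrative key set by the matching generator.
        String input = taskConfig.getConfigs().get("input");
        if (_cancelled) {
          throw new InterruptedException("Task for " + input + " cancelled");
        }
        // e.g. fetch the input file, build a segment, upload it (omitted).
        return null; // or a task-specific result object
      }

      @Override
      public void cancel() {
        _cancelled = true;
      }
    }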

Slide 25

Failure Handling
● Failure-handling logic depends on what the task needs to achieve
○ Helix's built-in retry mechanism is disabled
○ It doesn't have the context to handle failures properly
● e.g., FileIngestionTask requires exactly-once ingestion
○ Uses custom checkpoints to handle task failures
● e.g., SegmentRefreshTask requires atomic replacement
○ Uses segment lineage to handle task failures
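
To make the checkpoint idea concrete, an illustrative pattern follows; CheckpointStore is a hypothetical helper invented for this sketch, not a Pinot or StarTree API.

    import java.util.List;

    // Illustrative exactly-once pattern: skip inputs that already
    // committed, and checkpoint only after a durable upload.
    public class CheckpointedIngestion {

      // Hypothetical persistence helper, not a Pinot/StarTree API.
      interface CheckpointStore {
        boolean isDone(String inputFile);
        void markDone(String inputFile);
      }

      void run(List<String> inputFiles, CheckpointStore store) {
        for (String file : inputFiles) {
          if (store.isDone(file)) {
            continue; // a retried task skips already-committed work
          }
          buildAndUploadSegment(file); // idempotent (deterministic segment name)
          store.markDone(file);        // checkpoint after the upload succeeds
        }
      }

      private void buildAndUploadSegment(String file) {
        // generate the segment and push it to the controller (omitted)
      }
    }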

Slide 26

Minion Management UI and APIs
● View completed/failed/running task status
● Trigger or stop tasks, etc.
● New task types can reuse these directly
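
For example, these operations can be driven over HTTP against the controller's task endpoints; the paths below follow the open-source controller API, with the host and table name as placeholders.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Drive minion tasks through the controller's REST API.
    public class MinionTaskApiDemo {
      public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String controller = "http://localhost:9000"; // placeholder host

        // Trigger task generation for one task type on one table.
        HttpRequest schedule = HttpRequest.newBuilder()
            .uri(URI.create(controller
                + "/tasks/schedule?taskType=MergeRollupTask&tableName=myTable_OFFLINE"))
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();
        System.out.println(
            client.send(schedule, HttpResponse.BodyHandlers.ofString()).body());

        // List the states of all tasks of that type.
        HttpRequest states = HttpRequest.newBuilder()
            .uri(URI.create(controller + "/tasks/MergeRollupTask/taskstates"))
            .GET()
            .build();
        System.out.println(
            client.send(states, HttpResponse.BodyHandlers.ofString()).body());
      }
    }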

Slide 27

More Observability
● Beyond task status from the UI or API
● Detailed metrics to help enable ops automation
○ Docs available for them

Slide 28

Auto-scaling
● Minion tasks tend to run periodically or occasionally
○ No need to keep minion workers around all the time
● In StarTree Cloud, we have implemented auto-scaling
○ Given the min/max workers to provision
○ StarTree Cloud decides to add/remove workers based on the pending workload

Slide 29

Future Work: Better Minion Framework
● More on auto-scaling
○ Leverage spot instances for more cost savings
○ Scale up instance types based on worker resource usage or task types

Slide 30

Future Work: Better Minion Framework
● More on auto-scaling
● Resource isolation among tables
○ When tasks are generated and executed
○ When tasks are queued up in Helix

Slide 31

Future Work: Better Minion Framework
● More on auto-scaling
● Resource isolation among tables
● DAG-based scheduling
○ For flexible inter-task scheduling

Slide 32

Future Work: New Minion Tasks
● Many Minion tasks exist today:
○ Data ingestion: files/objects, databases, data lakes
○ Segment management: merge/rollup/refresh/purge
● Use Minion to implement materialized views
● Use Minion tasks to trigger external services

Slide 33

Thank You!
dev.startree.ai

Slide 34

Backup slides start here

Slide 35

Operational Challenges
● Build dedicated pipelines/workflows using other systems
○ More systems to maintain
○ Time-consuming
○ Error-prone
● Use a realtime table (instead of an offline table) to ingest batch data and support upsert and dedup
○ Need to maintain an extra streaming service
○ Need to transform batch data and send it to the streaming service
○ Extra resources needed on servers
○ Time-consuming
○ Error-prone

Slide 36

How to Support Those Operations without Minion
Operation: data ingestion
● Realtime table: native support; the server ingests data from stream services and data becomes queryable immediately
● Offline table: need to build a dedicated pipeline/workflow
Operation: data deletion
● Realtime table: need to build a dedicated pipeline/workflow
● Offline table: need to build a dedicated pipeline/workflow
Operation: data update (upsert & dedup)
● Realtime table: native support; the server performs upsert & dedup while ingesting data (a key differentiator of Pinot)
● Offline table: need to build a dedicated pipeline/workflow
Operation: data reformat
● Realtime table: need to build a dedicated pipeline/workflow
● Offline table: need to build a dedicated pipeline/workflow

Slide 37

Operations Behind the Scenes
Similar to other OLAP databases, there are two major types of operations to support queries:
● Data ingestion
○ Various data sources: Kafka, blob store, data lake, SQL database, …
○ Various data formats: Avro, CSV, JSON, Parquet, …
● Data update
○ Data compaction
○ GDPR data purging
○ Segment resizing
○ Data re-partitioning
○ …

Slide 38

Operational Challenges for Realtime Tables
● What works well
○ Realtime data ingestion on the server
■ Data becomes queryable immediately
○ Pinot has built-in support for upsert and dedup
■ A key differentiator of Pinot
● Challenges
○ Need to build a dedicated pipeline/workflow to support data updates
■ Time-consuming and error-prone

Slide 39

Operational Challenges for Offline Tables
● Need to build a dedicated pipeline/workflow to support both batch data ingestion and data updates
○ Time-consuming and error-prone
● Realtime tables can be used instead to solve the data ingestion, upsert, and dedup issues
○ Extra cost to set up a streaming service and send data to streams

Slide 40

Use Case: Segment Merge Rollup
A built-in minion task allowing users to
● Merge small segments into larger ones
● Roll up values if needed

    "tableName": "myTable_OFFLINE",
    "tableType": "OFFLINE",
    ...
    ...
    "task": {
      "taskTypeConfigsMap": {
        "MergeRollupTask": {
          "1day.mergeType": "concat",
          "1day.bucketTimePeriod": "1d",
          "1day.bufferTimePeriod": "1d"
        }
      }
    }
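
Once this block is in the table config, the controller's periodic task scheduler picks it up and generates MergeRollupTask instances for eligible segments. Per the open-source docs, the "1day" prefix names a merge level; multiple levels (e.g. "1hour" then "1day") can be configured to merge at increasing time granularity.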