Slide 1

Slide 1 text

Get Your Data into Apache Pinot Faster with StarTree’s Data Manager
Tim Santos & Seunghyun Lee, Software Engineers

Slide 2

Slide 2 text

ABOUT US

Slide 3

Slide 3 text

Overview
01 Current Pinot Table Creation Process: Why the traditional way of creating your table can be difficult
02 Introducing Data Manager: How Data Manager was built and how it can benefit you
03 Demo: Watch Data Manager in action
04 Challenges Faced & Future Improvements: What we learned and what we want to improve

Slide 4

Slide 4 text

Apache Pinot ● OLAP Datastore ● Columnar, indexed storage ● Low latency analytics ● Distributed - highly available, reliable, scalable ● SQL interface ● Lambda architecture

Slide 5

Slide 5 text

Traditional Pinot Table Creation Process
1. Define your schema: column names and types
2. Define your table configuration: configure your ingestion, indexing, etc.
3. Use REST APIs or Pinot Admin scripts to create your table and start data ingestion

Slide 6

Slide 6 text

Define your schema
● Dimensions and metrics
● Data types
● Time format
Imagine having to manually define a schema for hundreds of dimensions…

{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [
    { "name": "studentID", "dataType": "INT" },
    { "name": "name", "dataType": "STRING" },
    { "name": "gender", "dataType": "STRING" }
  ],
  "metricFieldSpecs": [
    { "name": "score", "dataType": "FLOAT" }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "timestampInEpoch",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}

Slide 7

Slide 7 text

Define your table config
So many things to configure!
● Indexing strategy
● Ingestion
● Tenants
● Routing
● And much more…
The table config and schema provide maximum flexibility. But…

{
  "tableName": "transcript",
  "segmentsConfig": {
    "timeColumnName": "timestampInEpoch",
    "timeType": "MILLISECONDS",
    "replication": "1",
    "schemaName": "transcript"
  },
  "tableIndexConfig": {
    "invertedIndexColumns": [ "name" ],
    "loadMode": "MMAP"
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "tableType": "OFFLINE",
  "metadata": {}
}

Slide 8

Slide 8 text

What could go wrong? ● Invalid schema ● Wrong credentials ● Incorrect column type or format ● Syntax errors ● Incompatible configuration

Slide 9

Slide 9 text

Troubleshooting Cycle
● Workable for experienced Pinot users, but intimidating for a first-time user
● Need to go through multiple iterations of the troubleshooting cycle
● Can take 3 to 5 days to get it right
The cycle: Define schema → Define table config → Start ingestion → Discover error → back to the start

Slide 10

Slide 10 text

Introducing Data Manager

Slide 11

Slide 11 text

What is needed to ingest data?
Connection: How to access the data source?
● Source-related access credentials
Data Reader: How to read the source data?
● Data format
● Source schema
Field Mapping: How to map the source data to the destination Pinot table?
● Adding derived columns, dropping columns
● Transformation
● Time column handling

Slide 12

Slide 12 text

What is Data Manager?
Data Manager takes a Connection, a Data Reader, and a Field Mapping and produces the Pinot Table & Schema plus the Ingestion Workflow. It provides:
● Intuitive UI/UX
● Validation
● Data Modeling
● Index & Advanced Config
● Ingestion Customization
● Metrics and Diagnostics

Slide 13

Slide 13 text

Intuitive UI/UX ● Complex Pinot concepts are abstracted away ● Predefined user flow that can guide people with no prior knowledge of Pinot

Slide 14

Slide 14 text

Intuitive UI/UX
● Recently revamped user experience
○ Similar look and feel to other StarTree products
○ Separation of connection and table creation
○ Enhanced StarTree index creation process

Slide 15

Slide 15 text

Validation ● User input is validated at each step of the dataset creation process ● Reduces chances of syntax and semantic errors

Slide 16

Slide 16 text

Validation ● Data preview gives the user confidence that the correct data will be ingested

Slide 17

Slide 17 text

Data Modeling: Schema inference

Student Name   Average Score   Date
Bob            98.63           2022-10-22'T'12:30:00
Stacy          99.85           2022-10-22'T'12:30:00
Frank          97.49           2022-10-22'T'12:30:00

Slide 18

Slide 18 text

Data Modeling: Schema inference

Student Name   Average Score   Date
Bob            98.63           2022-10-22'T'12:30:00
Stacy          99.85           2022-10-22'T'12:30:00
Frank          97.49           2022-10-22'T'12:30:00

Slide 19

Slide 19 text

Data Modeling: Schema inference

Student Name   Average Score   Date
Bob            98.63           2022-10-22'T'12:30:00
Stacy          99.85           2022-10-22'T'12:30:00
Frank          97.49           2022-10-22'T'12:30:00
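To make the inference step concrete, here is a hedged sketch of the kind of Pinot schema that could be inferred from the sample above; the schema name, sanitized column names, and exact data types are illustrative assumptions rather than Data Manager's actual output.

{
  "schemaName": "studentScores",
  "dimensionFieldSpecs": [
    { "name": "studentName", "dataType": "STRING" }
  ],
  "metricFieldSpecs": [
    { "name": "averageScore", "dataType": "FLOAT" }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "date",
      "dataType": "STRING",
      "format": "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd'T'HH:mm:ss",
      "granularity": "1:SECONDS"
    }
  ]
}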

Slide 20

Slide 20 text

Index & Encoding Config

{
  "tableName": "testData_OFFLINE",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "deletedSegmentsRetentionPeriod": "7d",
    "segmentPushType": "APPEND",
    "replication": "1",
    "minimizeDataMovement": false,
    "schemaName": "testData",
    "timeColumnName": "timestamp",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "180"
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "tableIndexConfig": {
    "aggregateMetrics": false,
    "autoGeneratedInvertedIndex": false,
    "loadMode": "MMAP",
    "invertedIndexColumns": [ "firstName", "gender", "lastName", ... ],
    "varLengthDictionaryColumns": [ "firstName", "gender", "lastName", ... ],
    ...
  },
  ...
}

Slide 21

Slide 21 text

Advanced Config (StarTree Index)

{
  ...
  "tableIndexConfig": {
    "aggregateMetrics": false,
    "autoGeneratedInvertedIndex": false,
    "loadMode": "MMAP",
    "invertedIndexColumns": [ "firstName", "lastName", "gender", ... ],
    ...
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": [ "firstName", "lastName", "gender" ],
        "functionColumnPairs": [ "COUNT__*" ],
        "skipStarNodeCreationForDimensions": [],
        "maxLeafRecords": 10000
      }
    ]
  },
  ...
}

Slide 22

Slide 22 text

Ingestion Customization
● Batch: ingest files from your cloud storage (S3)
● Streaming: stream data into a Pinot real-time table (Kafka/Confluent, Kinesis)
● Data Warehouse: data ingestion through JDBC (BigQuery, Snowflake)
● Write API: coming soon!
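For context on what the streaming option configures under the hood, here is a minimal sketch of the streamConfigs section that a Kafka-backed real-time table uses in open-source Pinot; the topic name, broker list, and flush threshold are placeholder assumptions, and Data Manager fills in these details for you.

"tableIndexConfig": {
  "streamConfigs": {
    "streamType": "kafka",
    "stream.kafka.topic.name": "transcript-topic",
    "stream.kafka.broker.list": "localhost:9092",
    "stream.kafka.consumer.type": "lowlevel",
    "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
    "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
    "realtime.segment.flush.threshold.rows": "100000"
  }
}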

Slide 23

Slide 23 text

Adding a New Connector
● Implement the DataSourceExplorer interface
○ Test connection
○ Explore source (e.g. Kafka topics, S3 directories)
○ Extract source data / schema
● The rest of the code path is similar across connectors
○ Pinot schema inference
○ Index recommendation and Pinot config

Slide 24

Slide 24 text

What happens after Data Manager
Components: S3, BigQuery / Snowflake, Pinot Minion, Pinot Controller, Pinot Server, Segment Store
1. Create the schema and table config.
2. The task generator schedules the job based on the ingestion task config.
3. The Minion task executor picks up the ingestion task, ingests data from the source, and generates Pinot segments.
4. The Pinot segments are copied to the segment store and registered with the table.
5. The controller notifies servers that the new segment is available.
6. The servers download the Pinot segments and serve the data for queries.
Troubleshooting a Pinot cluster can be complex!
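For reference, the "ingestion task config" in step 2 looks roughly like the following in open-source Pinot, where a minion task is scheduled against an S3 input directory; the bucket path, file pattern, region, and schedule are illustrative assumptions, and the managed ingestion task StarTree runs may use a different task type and options.

"ingestionConfig": {
  "batchIngestionConfig": {
    "segmentIngestionType": "APPEND",
    "segmentIngestionFrequency": "DAILY",
    "batchConfigMaps": [
      {
        "inputDirURI": "s3://example-bucket/transcripts/",
        "includeFileNamePattern": "glob:**/*.csv",
        "inputFormat": "csv",
        "input.fs.className": "org.apache.pinot.plugin.filesystem.S3PinotFS",
        "input.fs.prop.region": "us-west-2"
      }
    ]
  }
},
"task": {
  "taskTypeConfigsMap": {
    "SegmentGenerationAndPushTask": {
      "schedule": "0 */10 * * * ?"
    }
  }
}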

Slide 25

Slide 25 text

Metrics and Diagnostics ● Quickly see data ingestion progress ● Surface any data ingestion related errors

Slide 26

Slide 26 text

Metrics and Diagnostics ● Quickly see data ingestion progress ● Surface any data ingestion related errors

Slide 27

Slide 27 text

Demo Time!

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

Challenges we faced

Date Time Value          Format
2023/04/26               yyyy/MM/dd
2023-04-26               yyyy-MM-dd
2023-04-26 12:00AM       yyyy-MM-dd hh:mmaa
2023-04-26T00:00:00Z     yyyy-MM-dd'T'HH:mm:ssZ
1682467200000            Epoch milliseconds
1682467200               Epoch seconds

These all represent the same day!
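In Pinot, these variants end up as different dateTimeFieldSpec format strings, which is part of why getting the time column right by hand is error-prone. A sketch with assumed column names, covering a few of the formats above:

"dateTimeFieldSpecs": [
  { "name": "dateSlash",    "dataType": "STRING", "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy/MM/dd", "granularity": "1:DAYS" },
  { "name": "dateDash",     "dataType": "STRING", "format": "1:DAYS:SIMPLE_DATE_FORMAT:yyyy-MM-dd", "granularity": "1:DAYS" },
  { "name": "epochMillis",  "dataType": "LONG",   "format": "1:MILLISECONDS:EPOCH",                 "granularity": "1:MILLISECONDS" },
  { "name": "epochSeconds", "dataType": "LONG",   "format": "1:SECONDS:EPOCH",                      "granularity": "1:SECONDS" }
]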

Slide 30

Slide 30 text

Challenges we faced

studentID,firstName,lastName,gender,address,courses,extracurricular,scores,timestamp
205,Natalie,Jones,Female,22 Baker St,History,pingpong,3.8,1571900400000
211,John,Doe,Male,22 Baker St,History,music,3.8,1572678000000

studentID|firstName|lastName|gender|address|courses|extracurricular|scores|timestamp
205|Natalie|Jones|Female|22 Baker St|History|pingpong|3.8|1571900400000
211|John|Doe|Male|22 Baker St|History|music|3.8|1572678000000

studentID;firstName;lastName;gender;address;courses;extracurricular;scores;timestamp
205;Natalie;Jones;Female;22 Baker St;History;pingpong;3.8;1571900400000
211;John;Doe;Male;22 Baker St;History;music;3.8;1572678000000

205;Natalie;Jones;Female;22 Baker St;History;pingpong;3.8;1571900400000
211;John;Doe;Male;22 Baker St;History;music;3.8;1572678000000

We want to get your data into Pinot faster than ever!

Slide 31

Slide 31 text

Challenges we faced
● Support for different authentication methods
● Examples
○ S3: access key & secret key, S3 bucket policy, IAM Role…
○ Kafka: SSL/TLS, SASL (PLAINTEXT, SCRAM, OAUTH…)
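As one concrete example of what this support means underneath, S3 access-key credentials can be passed to Pinot's S3 filesystem plugin as properties on the batch config in open-source Pinot; the property names below are the S3PinotFS keys with the input filesystem prefix, while the bucket, region, and credential values are placeholders (IAM-role and Kafka SASL/SSL setups need entirely different properties).

"batchConfigMaps": [
  {
    "inputDirURI": "s3://example-bucket/data/",
    "input.fs.className": "org.apache.pinot.plugin.filesystem.S3PinotFS",
    "input.fs.prop.region": "us-west-2",
    "input.fs.prop.accessKey": "<AWS_ACCESS_KEY_ID>",
    "input.fs.prop.secretKey": "<AWS_SECRET_ACCESS_KEY>"
  }
]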

Slide 32

Slide 32 text

Lessons we learned ● Working with dirty data is expected ● Building a robust ingestion pipeline is challenging! ● Scalability issues from source data ● Data Manager should set the optimal Pinot configs to ensure ingestion happens smoothly ● Continuous improvements and quick feedback/development cycle

Slide 33

Slide 33 text

Future improvements
● Major cloud object stores and streaming sources
○ Google Cloud Storage, Google Pub/Sub, Azure Blob Storage, Azure Event Hubs, Apache Pulsar, etc.
● Data lake management systems
○ Delta Lake, Iceberg, Apache Hudi
● CDC support via Debezium
○ PostgreSQL, MySQL
● Write API

Slide 34

Slide 34 text

Future improvements
● Transformation function enhancements
○ Registered function support
○ Code completion
● Null handling
● Error message improvement
● Data flattening
● Audit logs / RBAC

Slide 35

Slide 35 text

Thank you! Visit https://startree.ai/ to learn how you can try out Data Manager!