Get Your Data into Apache Pinot Faster with StarTree’s Data Manager (Tim Santos & Seunghyun Lee, StarTree) | RTA Summit 2023

Get Your Data into Apache Pinot Faster with StarTree’s Data Manager
Tim Santos & Seunghyun Lee, Software Engineers

ABOUT US

Overview
01 Current Pinot Table Creation Process: Why the traditional way of creating your table can be difficult
02 Introducing Data Manager: How Data Manager was built and how it can benefit you
03 Demo: Watch Data Manager in action
04 Challenges Faced & Future Improvements: What we learned and what we want to improve

Apache Pinot
• OLAP datastore
• Columnar, indexed storage
• Low latency analytics
• Distributed: highly available, reliable, scalable
• SQL interface
• Lambda architecture

Traditional Pinot Table Creation Process
1. Define your schema: column names and types
2. Define your table configuration: configure your ingestion, indexing, etc.
3. Use REST APIs or Pinot Admin scripts to create your table and start data ingestion

Define your schema
• Dimensions and metrics
• Data types
• Time format
Imagine having to manually define a schema for hundreds of dimensions…

{
  "schemaName": "transcript",
  "dimensionFieldSpecs": [
    { "name": "studentID", "dataType": "INT" },
    { "name": "name", "dataType": "STRING" },
    { "name": "gender", "dataType": "STRING" }
  ],
  "metricFieldSpecs": [
    { "name": "score", "dataType": "FLOAT" }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "timestampInEpoch",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
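One way to picture the tedium of writing this by hand is to generate the same JSON from a short list of columns. The following is a minimal sketch, not part of the talk: the `build_schema` helper is hypothetical, and it assumes a single epoch-milliseconds time column like the one in the slide.

```python
import json

def build_schema(name, dimensions, metrics, time_column):
    """Assemble a Pinot-style schema dict from (column, dataType) pairs.

    Hypothetical helper: it hard-codes an epoch-milliseconds time column,
    matching the example schema on the slide.
    """
    return {
        "schemaName": name,
        "dimensionFieldSpecs": [{"name": n, "dataType": t} for n, t in dimensions],
        "metricFieldSpecs": [{"name": n, "dataType": t} for n, t in metrics],
        "dateTimeFieldSpecs": [{
            "name": time_column,
            "dataType": "LONG",
            "format": "1:MILLISECONDS:EPOCH",
            "granularity": "1:MILLISECONDS",
        }],
    }

schema = build_schema(
    "transcript",
    dimensions=[("studentID", "INT"), ("name", "STRING"), ("gender", "STRING")],
    metrics=[("score", "FLOAT")],
    time_column="timestampInEpoch",
)
print(json.dumps(schema, indent=2))
```

Even with a helper like this, someone still has to decide every column's role and type, which is the part that grows painful with hundreds of dimensions.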

Define your table config
So many things to configure!
• Indexing strategy
• Ingestion
• Tenants
• Routing
• And much more…
Table config and schema provide the maximum flexibility. But…

{
  "tableName": "transcript",
  "segmentsConfig": {
    "timeColumnName": "timestampInEpoch",
    "timeType": "MILLISECONDS",
    "replication": "1",
    "schemaName": "transcript"
  },
  "tableIndexConfig": {
    "invertedIndexColumns": [ "name" ],
    "loadMode": "MMAP"
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "tableType": "OFFLINE",
  "metadata": {}
}
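With a schema and a table config in hand, the REST step of the traditional process can be sketched as two POST requests to the Pinot controller, `/schemas` first and then `/tables`. This is an illustrative sketch only: the controller address `http://localhost:9000` is an assumption, the payloads are abbreviated, and the requests are built but deliberately not sent.

```python
import json
import urllib.request

CONTROLLER = "http://localhost:9000"  # assumption: a locally running Pinot controller

def build_request(path, payload):
    """Build (but do not send) a JSON POST request to the controller."""
    return urllib.request.Request(
        url=CONTROLLER + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Abbreviated payloads; in practice these are the full JSON documents shown above.
schema = {"schemaName": "transcript"}
table_config = {"tableName": "transcript", "tableType": "OFFLINE"}

schema_request = build_request("/schemas", schema)
table_request = build_request("/tables", table_config)

# To submit for real, pass each request to urllib.request.urlopen(...),
# schema first, then the table config.
for req in (schema_request, table_request):
    print(req.get_method(), req.full_url)
```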

What could go wrong?
• Invalid schema
• Wrong credentials
• Incorrect column type or format
• Syntax errors
• Incompatible configuration

Troubleshooting Cycle
• Works for experienced Pinot users but is intimidating for a first-time user
• Need to go through multiple iterations of the troubleshooting cycle
• Can take 3 to 5 days to get it right
Cycle: Define schema → Define table config → Start ingestion → Discover error → repeat

Introducing Data Manager

What is needed to ingest data?
Connection: How to access the data source?
• Source related access credentials
Data Reader: How to read the source data?
• Data format
• Source schema
Field Mapping: How to map the source data to the destination Pinot table?
• Adding derived columns, dropping columns
• Transformation
• Time column handling
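The field mapping step can be illustrated with a small function that turns one source record into one destination row. This is a sketch under stated assumptions rather than how Data Manager implements mapping: the source field names (`id`, `first_name`, `last_name`, `taken_at`, `internal_note`) are hypothetical, and the destination columns follow the `transcript` schema shown earlier.

```python
from datetime import datetime, timezone

def map_record(source):
    """Map one source record to a destination row (field names are hypothetical)."""
    row = {
        "studentID": int(source["id"]),                             # type conversion
        "name": source["first_name"] + " " + source["last_name"],   # derived column
        "score": float(source["score"]),
    }
    # Time column handling: ISO-8601 UTC string -> epoch milliseconds.
    parsed = datetime.strptime(source["taken_at"], "%Y-%m-%dT%H:%M:%SZ")
    row["timestampInEpoch"] = int(parsed.replace(tzinfo=timezone.utc).timestamp() * 1000)
    # "internal_note" is dropped simply by not copying it into the row.
    return row

record = {
    "id": "7",
    "first_name": "Stacy",
    "last_name": "Kim",
    "score": "99.85",
    "taken_at": "2023-04-26T00:00:00Z",
    "internal_note": "not needed in Pinot",
}
row = map_record(record)
print(row)
```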

What is Data Manager?
Inputs: Connection, Data Reader, Field Mapping
Outputs: Pinot Table & Schema, Ingestion Workflow
Capabilities:
• Intuitive UI/UX
• Validation
• Data Modeling
• Index & Advanced Config
• Ingestion Customization
• Metrics and Diagnostics

Intuitive UI/UX
• Complex Pinot concepts are abstracted away
• Predefined user flow that can guide people with no prior knowledge of Pinot

Intuitive UI/UX
• Recently revamped user experience
  ◦ Similar look to other StarTree products
  ◦ Separation of connection and table creation
  ◦ Enhanced StarTree index creation process

Validation
• User input is validated at each step of the dataset creation process
• Reduces chances of syntax and semantic errors
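To make "semantic errors" concrete, here is a minimal sketch of the kind of cross-check that can catch a mismatch between a schema and a table config before ingestion starts. The checks and the `validate` function are illustrative assumptions, not Data Manager's actual validation logic.

```python
def validate(schema, table_config):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    # Collect every column name declared anywhere in the schema.
    columns = {
        spec["name"]
        for key in ("dimensionFieldSpecs", "metricFieldSpecs", "dateTimeFieldSpecs")
        for spec in schema.get(key, [])
    }
    time_column = table_config.get("segmentsConfig", {}).get("timeColumnName")
    if time_column and time_column not in columns:
        problems.append(f"time column '{time_column}' is not in the schema")
    indexed = table_config.get("tableIndexConfig", {}).get("invertedIndexColumns", [])
    for column in indexed:
        if column not in columns:
            problems.append(f"inverted index column '{column}' is not in the schema")
    if table_config.get("tableType") not in ("OFFLINE", "REALTIME"):
        problems.append("tableType must be OFFLINE or REALTIME")
    return problems

schema = {
    "dimensionFieldSpecs": [{"name": "name", "dataType": "STRING"}],
    "dateTimeFieldSpecs": [{"name": "timestampInEpoch", "dataType": "LONG"}],
}
table_config = {
    "tableType": "OFFLINE",
    "segmentsConfig": {"timeColumnName": "timestamp"},          # typo: not in schema
    "tableIndexConfig": {"invertedIndexColumns": ["name", "gender"]},
}
problems = validate(schema, table_config)
for problem in problems:
    print(problem)
```

Both mistakes in this example are valid JSON, so a syntax check alone would not find them; they only surface when the two documents are compared.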

Validation
• Data preview gives the user confidence that the correct data will be ingested

Data Modeling
Schema inference
Student Name    Average Score    Date
Bob             98.63            2022-10-22'T'12:30:00
Stacy           99.85            2022-10-22'T'12:30:00
Frank           97.49            2022-10-22'T'12:30:00
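Schema inference over a sample like this can be sketched as trying progressively looser parsers on each column. This is a simplified illustration, not Data Manager's inference algorithm: the type names returned are Pinot-style labels, the date strings are written without the quotes around `T`, and only one timestamp pattern is recognized.

```python
from datetime import datetime

def infer_type(values):
    """Guess a Pinot-style data type from a column of sample strings."""
    def all_parse(parser):
        # True only if every sample value is accepted by the parser.
        try:
            for value in values:
                parser(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "LONG"
    if all_parse(float):
        return "DOUBLE"
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%dT%H:%M:%S")):
        return "TIMESTAMP"
    return "STRING"

rows = [
    {"Student Name": "Bob", "Average Score": "98.63", "Date": "2022-10-22T12:30:00"},
    {"Student Name": "Stacy", "Average Score": "99.85", "Date": "2022-10-22T12:30:00"},
    {"Student Name": "Frank", "Average Score": "97.49", "Date": "2022-10-22T12:30:00"},
]
inferred = {column: infer_type([row[column] for row in rows]) for column in rows[0]}
print(inferred)
```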

Index & Encoding Config

{
  "tableName": "testData_OFFLINE",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "deletedSegmentsRetentionPeriod": "7d",
    "segmentPushType": "APPEND",
    "replication": "1",
    "minimizeDataMovement": false,
    "schemaName": "testData",
    "timeColumnName": "timestamp",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "180"
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "tableIndexConfig": {
    "aggregateMetrics": false,
    "autoGeneratedInvertedIndex": false,
    "loadMode": "MMAP",
    "invertedIndexColumns": [ "firstName", "gender", "lastName", ... ],
    "varLengthDictionaryColumns": [ "firstName", "gender", "lastName", ... ]
    ...
  },
  ...
}

Advanced Config (StarTree Index)

{
  ...
  "tableIndexConfig": {
    "aggregateMetrics": false,
    "autoGeneratedInvertedIndex": false,
    "loadMode": "MMAP",
    "invertedIndexColumns": [ "firstName", "lastName", "gender", ... ],
    ...
    "starTreeIndexConfigs": [
      {
        "dimensionsSplitOrder": [ "firstName", "lastName", "gender" ],
        "functionColumnPairs": [ "COUNT__*" ],
        "skipStarNodeCreationForDimensions": [],
        "maxLeafRecords": 10000
      }
    ]
  },
  ...
}

Ingestion Customization
• Batch: Ingest files from your cloud storage (S3)
• Streaming: Stream data into a Pinot realtime table (Kafka/Confluent, Kinesis)
• Data Warehouse: Data ingestion through JDBC (BigQuery, Snowflake)
• Write API: Coming soon!

Adding a New Connector
• Implement the DataSourceExplorer interface
  ◦ Test connection
  ◦ Explore source (e.g. Kafka topics, S3 directories)
  ◦ Extract source data / schema
• Rest of code path similar across connectors
  ◦ Pinot schema inference
  ◦ Index recommendation and Pinot config
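The three responsibilities listed for a connector can be sketched as an abstract interface with one toy implementation. The talk does not show the real DataSourceExplorer signature, so everything below is an assumption: the method names (`test_connection`, `explore`, `sample_records`), their parameters, and the in-memory connector are all hypothetical.

```python
from abc import ABC, abstractmethod

class DataSourceExplorer(ABC):
    """Hypothetical shape of a connector interface; method names are assumptions."""

    @abstractmethod
    def test_connection(self) -> bool:
        """Check that the source is reachable with the given credentials."""

    @abstractmethod
    def explore(self) -> list:
        """List what can be ingested (e.g. Kafka topics, S3 directories)."""

    @abstractmethod
    def sample_records(self, source: str, limit: int = 10) -> list:
        """Extract a few records so the schema can be inferred downstream."""

class InMemoryExplorer(DataSourceExplorer):
    """Toy connector backed by a dict, standing in for a real source."""

    def __init__(self, sources):
        self._sources = sources

    def test_connection(self):
        return True

    def explore(self):
        return sorted(self._sources)

    def sample_records(self, source, limit=10):
        return self._sources[source][:limit]

explorer = InMemoryExplorer({
    "orders": [{"id": 1}, {"id": 2}, {"id": 3}],
    "clicks": [{"id": 9}],
})
print(explorer.test_connection(), explorer.explore())
print(explorer.sample_records("orders", limit=2))
```

Because schema inference and config generation only consume the sampled records, a new source needs to supply just these source-specific pieces.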

What happens after Data Manager
Components: S3, BigQuery / Snowflake, Pinot Minion, Pinot Controller, Pinot Server, Segment Store
1. Create the schema and table config.
2. The task generator schedules the job based on the ingestion task config.
3. The Minion task executor picks up the ingestion task, ingests data from the source, and generates Pinot segments.
4. Pinot segments get copied to the segment store and registered to the table.
5. The controller notifies servers that the new segment is available.
6. Servers download the Pinot segments and serve the data for queries.
Troubleshooting a Pinot cluster can be complex!

Metrics and Diagnostics
• Quickly see data ingestion progress
• Surface any data ingestion related errors

Demo Time!

Challenges we faced
Date Time Value          Format
2023/04/26               yyyy/MM/dd
2023-04-26               yyyy-MM-dd
2023-04-26 12:00AM       yyyy-MM-dd hh:mmaa
2023-04-26T00:00:00Z     yyyy-MM-dd'T'HH:mm:ssZ
1682467200000            Epoch milliseconds
1682467200               Epoch seconds
These all represent the same day!
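The claim in the table can be checked by parsing each value with its own format and comparing the resulting dates. This is a small sketch using Python's `strptime` directives as stand-ins for the Java-style patterns on the slide; it assumes the string values are in UTC and that the locale's AM/PM markers are the English ones.

```python
from datetime import datetime, timezone

# (value, format) pairs from the slide; string formats are Python equivalents
# of the Java-style patterns, and the epoch entries use hypothetical labels.
SAMPLES = [
    ("2023/04/26", "%Y/%m/%d"),
    ("2023-04-26", "%Y-%m-%d"),
    ("2023-04-26 12:00AM", "%Y-%m-%d %I:%M%p"),
    ("2023-04-26T00:00:00Z", "%Y-%m-%dT%H:%M:%SZ"),
    ("1682467200000", "epoch_ms"),
    ("1682467200", "epoch_s"),
]

def to_utc_date(value, fmt):
    """Parse one value with its format and return the calendar date in UTC."""
    if fmt == "epoch_ms":
        return datetime.fromtimestamp(int(value) / 1000, tz=timezone.utc).date()
    if fmt == "epoch_s":
        return datetime.fromtimestamp(int(value), tz=timezone.utc).date()
    # String formats carry no offset here, so they are treated as UTC.
    return datetime.strptime(value, fmt).date()

dates = {to_utc_date(value, fmt) for value, fmt in SAMPLES}
print(dates)
```

The set collapses to a single date, but only because each value was paired with the right format; guessing that pairing automatically is where the difficulty lies.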

Challenges we faced
• Support for different authentication methods
• Examples
  ◦ S3: access key & secret key, S3 bucket policy, IAM role…
  ◦ Kafka: SSL/TLS, SASL (PLAIN, SCRAM, OAUTHBEARER…)

Lessons we learned
• Working with dirty data is expected
• Building a robust ingestion pipeline is challenging!
• Scalability issues from source data
• Data Manager should set the optimal Pinot configs to ensure ingestion happens smoothly
• Continuous improvements and quick feedback/development cycle

Future improvements
• Major cloud object stores and streaming sources
  ◦ Google Cloud Storage, Google Pub/Sub, Azure Blob Storage, Azure Event Hubs, Apache Pulsar, etc.
• Data lake management systems
  ◦ Delta Lake, Iceberg, Apache Hudi
• CDC support via Debezium
  ◦ PostgreSQL, MySQL
• Write API

Future improvements
• Transformation function enhancements
  ◦ Registered function support
  ◦ Code completion
• Null handling
• Error message improvement
• Data flattening
• Audit logs / RBAC

Thank you!
Visit https://startree.ai/ to learn how you can try out Data Manager!
