Get Your Data into Apache Pinot Faster with StarTree’s Data Manager (Tim Santos & Seunghyun Lee, StarTree) | RTA Summit 2023

Apache Pinot supports data ingestion from a wide variety of sources, including streaming systems (e.g., Kafka, Pulsar, Kinesis), batch storage (e.g., S3, GCS, HDFS), and data warehouses (e.g., Snowflake, BigQuery). Configuring ingestion properties for all of these sources within the Pinot table config can be tedious. Users also have to specify additional settings such as data partitioning, column indexes, retention, and quotas, which makes onboarding Pinot tables cumbersome for beginners.

StarTree’s Data Manager is a no-code, self-service tool that helps users of all experience levels get started with Pinot quickly. It was recently revamped to provide a step-by-step experience for connecting to your data source and starting ingestion. Data Manager also uses data sampling and data preview to help ensure that your data model and Pinot indexes are configured correctly for ingestion. With Data Manager, StarTree users can start querying data in Pinot faster than ever.

StarTree

May 23, 2023

Transcript

  1. Get Your Data into Apache Pinot Faster with StarTree’s Data Manager
    Tim Santos & Seunghyun Lee, Software Engineers
  2. Overview
    01 Current Pinot Table Creation Process: Why the traditional way of creating your table can be difficult
    02 Introducing Data Manager: How Data Manager was built and how it can benefit you
    03 Demo: Watch Data Manager in action
    04 Challenges Faced & Future Improvements: What we learned and what we want to improve
  3. Apache Pinot
    • OLAP datastore
    • Columnar, indexed storage
    • Low-latency analytics
    • Distributed: highly available, reliable, scalable
    • SQL interface
    • Lambda architecture
  4. Traditional Pinot Table Creation Process
    • Define your schema: column names and types
    • Define your table configuration: configure your ingestion, indexing, etc.
    • Use REST APIs or Pinot Admin scripts to create your table and start data ingestion
  5. Define your schema
    • Dimensions and Metrics
    • Data types
    • Time format
    Imagine having to manually define a schema for hundreds of dimensions…
    {
      "schemaName": "transcript",
      "dimensionFieldSpecs": [
        { "name": "studentID", "dataType": "INT" },
        { "name": "name", "dataType": "STRING" },
        { "name": "gender", "dataType": "STRING" }
      ],
      "metricFieldSpecs": [
        { "name": "score", "dataType": "FLOAT" }
      ],
      "dateTimeFieldSpecs": [
        {
          "name": "timestampInEpoch",
          "dataType": "LONG",
          "format": "1:MILLISECONDS:EPOCH",
          "granularity": "1:MILLISECONDS"
        }
      ]
    }
  6. Define your table config
    So many things to configure!
    • Indexing strategy
    • Ingestion
    • Tenants
    • Routing
    • And much more…
    Table config and schema provide maximum flexibility. But…
    {
      "tableName": "transcript",
      "segmentsConfig": {
        "timeColumnName": "timestampInEpoch",
        "timeType": "MILLISECONDS",
        "replication": "1",
        "schemaName": "transcript"
      },
      "tableIndexConfig": {
        "invertedIndexColumns": [ "name" ],
        "loadMode": "MMAP"
      },
      "tenants": {
        "broker": "DefaultTenant",
        "server": "DefaultTenant"
      },
      "tableType": "OFFLINE",
      "metadata": {}
    }
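The last step on slide 4, creating the table through the REST API, can be sketched as follows. This is a minimal illustration rather than the speakers' tooling: it assumes a Pinot controller reachable at `http://localhost:9000` (an assumed address) and uses the controller's `POST /schemas` and `POST /tables` endpoints with the schema and table config from slides 5 and 6. The helper name `build_request` is hypothetical. The requests are only built here, not sent.

```python
import json
import urllib.request

# Assumption: a Pinot controller is listening on this address.
CONTROLLER = "http://localhost:9000"

# Schema from slide 5.
schema = {
    "schemaName": "transcript",
    "dimensionFieldSpecs": [
        {"name": "studentID", "dataType": "INT"},
        {"name": "name", "dataType": "STRING"},
        {"name": "gender", "dataType": "STRING"},
    ],
    "metricFieldSpecs": [{"name": "score", "dataType": "FLOAT"}],
    "dateTimeFieldSpecs": [
        {
            "name": "timestampInEpoch",
            "dataType": "LONG",
            "format": "1:MILLISECONDS:EPOCH",
            "granularity": "1:MILLISECONDS",
        }
    ],
}

# Table config from slide 6.
table_config = {
    "tableName": "transcript",
    "segmentsConfig": {
        "timeColumnName": "timestampInEpoch",
        "timeType": "MILLISECONDS",
        "replication": "1",
        "schemaName": "transcript",
    },
    "tableIndexConfig": {"invertedIndexColumns": ["name"], "loadMode": "MMAP"},
    "tenants": {"broker": "DefaultTenant", "server": "DefaultTenant"},
    "tableType": "OFFLINE",
    "metadata": {},
}


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request to the controller."""
    return urllib.request.Request(
        url=f"{CONTROLLER}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The schema has to exist before a table that refers to it is created.
schema_request = build_request("/schemas", schema)
table_request = build_request("/tables", table_config)

# Against a running cluster the requests would be sent like this:
# urllib.request.urlopen(schema_request)
# urllib.request.urlopen(table_request)
```

Any mistake in either JSON document only surfaces when the controller rejects the request or when ingestion fails later, which is the troubleshooting cycle the following slides describe.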
  7. What could go wrong?
    • Invalid schema
    • Wrong credentials
    • Incorrect column type or format
    • Syntax errors
    • Incompatible configuration
  8. Troubleshooting Cycle
    • Works for experienced Pinot users but is intimidating for a first-time user
    • Need to go through multiple iterations of the troubleshooting cycle
    • Can take 3 to 5 days to get it right
    Define schema → Define table config → Start ingestion → Discover error
  9. What is needed to ingest data?
    Connection: How to access the data source?
      • Source-related access credentials
    Data Reader: How to read the source data?
      • Data format
      • Source schema
    Field Mapping: How to map the source data to the destination Pinot table?
      • Adding derived columns, dropping columns
      • Transformation
      • Time column handling
  10. What is Data Manager?
    Connection, Data Reader, Field Mapping → Data Manager → Pinot Table & Schema, Ingestion Workflow
    • Intuitive UI/UX
    • Validation
    • Data Modeling
    • Index & Advanced Config
    • Ingestion Customization
    • Metrics and Diagnostics
  11. Intuitive UI/UX
    • Complex Pinot concepts are abstracted away
    • Predefined user flow that can guide people with no prior knowledge of Pinot
  12. Intuitive UI/UX
    • Recently revamped user experience
      ◦ Similar look to other StarTree products
      ◦ Separation of connection and table creation
      ◦ Enhanced star-tree index creation process
  13. Validation
    • User input is validated at each step of the dataset creation process
    • Reduces chances of syntax and semantic errors
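To make the idea of catching semantic errors before ingestion concrete, here is a minimal sketch of cross-checking a table config against its schema. The specific checks and the function names (`schema_columns`, `validate`) are illustrative assumptions and are not taken from Data Manager itself.

```python
ALL_GROUPS = ("dimensionFieldSpecs", "metricFieldSpecs", "dateTimeFieldSpecs")


def schema_columns(schema, groups=ALL_GROUPS):
    """Collect the column names declared in the given schema field groups."""
    return {spec["name"] for group in groups for spec in schema.get(group, [])}


def validate(schema, table_config):
    """Return a list of problems; an empty list means none were found."""
    problems = []

    if table_config.get("tableType") not in ("OFFLINE", "REALTIME"):
        problems.append("tableType must be OFFLINE or REALTIME")

    # The time column should be declared as a date-time field in the schema.
    time_column = table_config.get("segmentsConfig", {}).get("timeColumnName")
    if time_column and time_column not in schema_columns(schema, ("dateTimeFieldSpecs",)):
        problems.append(f"time column '{time_column}' is not a dateTimeFieldSpec")

    # Every indexed column should exist in the schema.
    known = schema_columns(schema)
    index_config = table_config.get("tableIndexConfig", {})
    for column in index_config.get("invertedIndexColumns", []):
        if column not in known:
            problems.append(f"inverted index column '{column}' is not in the schema")

    return problems


# Abbreviated versions of the schema and table config from slides 5 and 6.
schema = {
    "schemaName": "transcript",
    "dimensionFieldSpecs": [{"name": "name", "dataType": "STRING"}],
    "metricFieldSpecs": [{"name": "score", "dataType": "FLOAT"}],
    "dateTimeFieldSpecs": [{"name": "timestampInEpoch", "dataType": "LONG"}],
}
good_config = {
    "tableName": "transcript",
    "tableType": "OFFLINE",
    "segmentsConfig": {"timeColumnName": "timestampInEpoch"},
    "tableIndexConfig": {"invertedIndexColumns": ["name"]},
}
# Same config with a misspelled index column and a wrong time column.
bad_config = {
    "tableName": "transcript",
    "tableType": "OFFLINE",
    "segmentsConfig": {"timeColumnName": "timestamp"},
    "tableIndexConfig": {"invertedIndexColumns": ["nmae"]},
}

good_problems = validate(schema, good_config)
bad_problems = validate(schema, bad_config)
```

Running checks like these at each step turns a failed ingestion job into an immediate message next to the offending input.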
  14. Data Modeling: Schema inference
    Student Name | Average Score | Date
    Bob          | 98.63         | 2022-10-22’T’12:30:00
    Stacy        | 99.85         | 2022-10-22’T’12:30:00
    Frank        | 97.49         | 2022-10-22’T’12:30:00
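Schema inference from sampled rows, as on the slide above, could look roughly like the sketch below. The heuristic is an assumption made for illustration (integers and strings become dimensions, decimals become metrics, ISO-8601 timestamps become date-time fields) and is not Data Manager's actual algorithm. The camel-case column names are hypothetical, and the sample dates are written as plain ISO-8601 (`2022-10-22T12:30:00`). The `format` and `granularity` of a detected time column are deliberately left for a later step.

```python
from datetime import datetime


def infer_column(values):
    """Return (schema field group, Pinot data type) for one sampled column."""

    def parses(cast):
        # True if every sampled value can be converted with `cast`.
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if parses(int):
        fits_int32 = all(-2**31 <= int(v) < 2**31 for v in values)
        return "dimensionFieldSpecs", ("INT" if fits_int32 else "LONG")
    if parses(float):
        return "metricFieldSpecs", "DOUBLE"
    if parses(datetime.fromisoformat):
        # Kept as STRING here; the date-time format still has to be chosen.
        return "dateTimeFieldSpecs", "STRING"
    return "dimensionFieldSpecs", "STRING"


def infer_schema(name, header, rows):
    """Build a partial Pinot schema from a header and sampled rows."""
    schema = {
        "schemaName": name,
        "dimensionFieldSpecs": [],
        "metricFieldSpecs": [],
        "dateTimeFieldSpecs": [],
    }
    for index, column in enumerate(header):
        group, data_type = infer_column([row[index] for row in rows])
        schema[group].append({"name": column, "dataType": data_type})
    return schema


header = ["studentName", "averageScore", "date"]
rows = [
    ["Bob", "98.63", "2022-10-22T12:30:00"],
    ["Stacy", "99.85", "2022-10-22T12:30:00"],
    ["Frank", "97.49", "2022-10-22T12:30:00"],
]
schema = infer_schema("students", header, rows)
```

A real implementation would sample many more rows and let the user override each guess, since a column of numeric IDs and a column of numeric measurements look identical to a type-based heuristic.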
  17. Index & Encoding Config
    {
      "tableName": "testData_OFFLINE",
      "tableType": "OFFLINE",
      "segmentsConfig": {
        "deletedSegmentsRetentionPeriod": "7d",
        "segmentPushType": "APPEND",
        "replication": "1",
        "minimizeDataMovement": false,
        "schemaName": "testData",
        "timeColumnName": "timestamp",
        "retentionTimeUnit": "DAYS",
        "retentionTimeValue": "180"
      },
      "tenants": {
        "broker": "DefaultTenant",
        "server": "DefaultTenant"
      },
      "tableIndexConfig": {
        "aggregateMetrics": false,
        "autoGeneratedInvertedIndex": false,
        "loadMode": "MMAP",
        "invertedIndexColumns": [ "firstName", "gender", "lastName", ... ],
        "varLengthDictionaryColumns": [ "firstName", "gender", "lastName", ... ]
        ...
      },
      ...
    }
  18. Advanced Config (Star-tree Index)
    {
      ...
      "tableIndexConfig": {
        "aggregateMetrics": false,
        "autoGeneratedInvertedIndex": false,
        "loadMode": "MMAP",
        "invertedIndexColumns": [ "firstName", "lastName", "gender", ... ],
        ...
        "starTreeIndexConfigs": [
          {
            "dimensionsSplitOrder": [ "firstName", "lastName", "gender" ],
            "functionColumnPairs": [ "COUNT__*" ],
            "skipStarNodeCreationForDimensions": [],
            "maxLeafRecords": 10000
          }
        ]
      },
      ...
    }
  19. Ingestion Customization
    • Batch: Ingest files from your cloud storage (S3)
    • Streaming: Stream data into a Pinot realtime table (Kafka/Confluent, Kinesis)
    • Data Warehouse: Data ingestion through JDBC (BigQuery, Snowflake)
    • Write API: Coming soon!
  20. Adding a New Connector
    • Implement the DataSourceExplorer interface
      ◦ Test connection
      ◦ Explore source (e.g. Kafka topics, S3 directories)
      ◦ Extract source data / schema
    • Rest of code path similar across connectors
      ◦ Pinot schema inference
      ◦ Index recommendation and Pinot config
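The connector contract described on the slide above might be shaped like the following sketch. Only the interface name `DataSourceExplorer` and its three responsibilities come from the slide. The method names, signatures, and the in-memory connector are hypothetical, written here in Python purely for illustration, and say nothing about how StarTree's actual interface is defined.

```python
import csv
import io
from abc import ABC, abstractmethod
from itertools import islice


class DataSourceExplorer(ABC):
    """Hypothetical shape of the per-connector contract."""

    @abstractmethod
    def test_connection(self) -> bool:
        """Check that the source is reachable with the given credentials."""

    @abstractmethod
    def explore(self) -> list:
        """List ingestible entities (e.g. Kafka topics, S3 directories)."""

    @abstractmethod
    def sample(self, entity: str, limit: int = 10) -> list:
        """Return up to `limit` sample records from one entity."""


class InMemoryCsvExplorer(DataSourceExplorer):
    """Toy connector over CSV text held in a dict, standing in for a real source."""

    def __init__(self, files: dict):
        self.files = files

    def test_connection(self) -> bool:
        return True  # nothing to connect to in this toy example

    def explore(self) -> list:
        return sorted(self.files)

    def sample(self, entity: str, limit: int = 10) -> list:
        reader = csv.DictReader(io.StringIO(self.files[entity]))
        return list(islice(reader, limit))


explorer = InMemoryCsvExplorer(
    {"students.csv": "studentID,firstName\n205,Natalie\n211,John\n"}
)
entities = explorer.explore()
first_record = explorer.sample("students.csv", limit=1)
```

Because every connector returns sample records in the same shape, the later steps (schema inference, index recommendation, config generation) can stay connector-agnostic, which is the point of the second bullet on the slide.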
  21. What happens after Data Manager
    Components: S3, BigQuery / Snowflake, Pinot Minion, Pinot Controller, Pinot Server, Segment Store
    1. Create schema and table config.
    2. Task generator schedules the job based on the ingestion task config.
    3. Minion task executor picks up the ingestion task, ingests data from the source, and generates Pinot segments.
    4. Pinot segments get copied to the segment store and registered to the table.
    5. Controller notifies servers that the new segment is available.
    6. Servers download the Pinot segments and serve the data for queries.
    Troubleshooting a Pinot cluster can be complex!
  22. Challenges we faced
    Date Time Value       | Format
    2023/04/26            | yyyy/MM/dd
    2023-04-26            | yyyy-MM-dd
    2023-04-26 12:00AM    | yyyy-MM-dd hh:mmaa
    2023-04-26T00:00:00Z  | yyyy-MM-dd'T'HH:mm:ssZ
    1682467200000         | Epoch milliseconds
    1682467200            | Epoch seconds
    These all represent the same day!
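One way to handle the variety of time formats listed above is to try a list of candidate patterns against each sampled value. The sketch below is an illustrative assumption, not Data Manager's detection logic: the Java-style patterns from the slide are paired with their Python `strptime` equivalents, all-digit values of at least 10^11 are treated as epoch milliseconds, and every value is assumed to be in UTC. Note that `%p` (AM/PM) parsing depends on the process locale.

```python
from datetime import datetime, timezone

# (label from the slide, equivalent Python strptime pattern)
CANDIDATE_FORMATS = [
    ("yyyy/MM/dd", "%Y/%m/%d"),
    ("yyyy-MM-dd", "%Y-%m-%d"),
    ("yyyy-MM-dd hh:mmaa", "%Y-%m-%d %I:%M%p"),
    ("yyyy-MM-dd'T'HH:mm:ssZ", "%Y-%m-%dT%H:%M:%SZ"),
]

# Heuristic threshold: 10**11 seconds is far in the future, so anything
# at or above it is assumed to be milliseconds.
EPOCH_MILLIS_THRESHOLD = 10**11


def detect_time_format(value: str):
    """Return (format label, UTC datetime) for a sampled time value."""
    if value.isdigit():
        number = int(value)
        if number >= EPOCH_MILLIS_THRESHOLD:
            return "Epoch milliseconds", datetime.fromtimestamp(number / 1000, tz=timezone.utc)
        return "Epoch seconds", datetime.fromtimestamp(number, tz=timezone.utc)

    for label, pattern in CANDIDATE_FORMATS:
        try:
            parsed = datetime.strptime(value, pattern)
        except ValueError:
            continue  # pattern does not match, try the next candidate
        return label, parsed.replace(tzinfo=timezone.utc)  # assumption: UTC

    raise ValueError(f"unrecognized time value: {value!r}")


samples = [
    "2023/04/26",
    "2023-04-26",
    "2023-04-26 12:00AM",
    "2023-04-26T00:00:00Z",
    "1682467200000",
    "1682467200",
]
detected = [detect_time_format(sample) for sample in samples]
labels = [label for label, _ in detected]
days = {moment.date().isoformat() for _, moment in detected}
```

With the six sample values from the slide, `days` collapses to a single date, matching the slide's remark that they all represent the same day.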
  23. Challenges we faced
    studentID,firstName,lastName,gender,address,courses,extracurricular,scores,timestamp
    205,Natalie,Jones,Female,22 Baker St,History,pingpong,3.8,1571900400000
    211,John,Doe,Male,22 Baker St,History,music,3.8,1572678000000

    studentID|firstName|lastName|gender|address|courses|extracurricular|scores|timestamp
    205|Natalie|Jones|Female|22 Baker St|History|pingpong|3.8|1571900400000
    211|John|Doe|Male|22 Baker St|History|music|3.8|1572678000000

    studentID;firstName;lastName;gender;address;courses;extracurricular;scores;timestamp
    205;Natalie;Jones;Female;22 Baker St;History;pingpong;3.8;1571900400000
    211;John;Doe;Male;22 Baker St;History;music;3.8;1572678000000

    205;Natalie;Jones;Female;22 Baker St;History;pingpong;3.8;1571900400000
    211;John;Doe;Male;22 Baker St;History;music;3.8;1572678000000

    We want to get your data into Pinot faster than ever!
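The delimiter and header variations shown above can be detected from a small sample. The following sketch uses a simple counting heuristic chosen for illustration (it is not how Data Manager is known to work): a candidate delimiter must occur the same non-zero number of times on every line, and the first row is treated as a header when it contains no numeric cells but the second row does. The function names are hypothetical, and quoted fields containing the delimiter would defeat the counting step.

```python
import csv
import io


def detect_delimiter(sample: str, candidates: str = ",|;\t") -> str:
    """Pick the candidate that splits every line into the same number of fields."""
    lines = [line for line in sample.splitlines() if line]
    best, best_count = None, 0
    for candidate in candidates:
        counts = {line.count(candidate) for line in lines}
        if len(counts) == 1:  # identical count on every line
            count = counts.pop()
            if count > best_count:
                best, best_count = candidate, count
    if best is None:
        raise ValueError("no consistent delimiter found")
    return best


def is_number(text: str) -> bool:
    try:
        float(text)
        return True
    except ValueError:
        return False


def read_sample(sample: str):
    """Return (delimiter, has_header, rows) for a block of delimited text."""
    delimiter = detect_delimiter(sample)
    rows = list(csv.reader(io.StringIO(sample), delimiter=delimiter))
    has_header = (
        len(rows) > 1
        and not any(is_number(cell) for cell in rows[0])
        and any(is_number(cell) for cell in rows[1])
    )
    return delimiter, has_header, rows


pipe_sample = (
    "studentID|firstName|lastName|gender|address|courses|extracurricular|scores|timestamp\n"
    "205|Natalie|Jones|Female|22 Baker St|History|pingpong|3.8|1571900400000\n"
    "211|John|Doe|Male|22 Baker St|History|music|3.8|1572678000000\n"
)
headerless_sample = (
    "205;Natalie;Jones;Female;22 Baker St;History;pingpong;3.8;1571900400000\n"
    "211;John;Doe;Male;22 Baker St;History;music;3.8;1572678000000\n"
)

pipe_result = read_sample(pipe_sample)
headerless_result = read_sample(headerless_sample)
```

When no header is found, a tool would still need to generate or ask for column names before it can propose a schema.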
  24. Challenges we faced
    • Support for different authentication methods
    • Examples
      ◦ S3: access key & secret key, S3 bucket policy, IAM role…
      ◦ Kafka: SSL/TLS, SASL (PLAIN, SCRAM, OAUTHBEARER…)
  25. Lessons we learned
    • Working with dirty data is expected
    • Building a robust ingestion pipeline is challenging!
    • Scalability issues from source data
    • Data Manager should set the optimal Pinot configs to ensure ingestion happens smoothly
    • Continuous improvements and quick feedback/development cycle
  26. Future improvements
    • Major cloud object stores and streaming sources
      ◦ Google Cloud Storage, Google Pub/Sub, Azure Blob Storage, Azure Event Hubs, Apache Pulsar, etc.
    • Data Lake Management Systems
      ◦ Delta Lake, Iceberg, Apache Hudi
    • CDC support via Debezium
      ◦ PostgreSQL, MySQL
    • Write API
  27. Future improvements
    • Transformation function enhancements
      ◦ Registered function support
      ◦ Code completion
    • Null handling
    • Error message improvement
    • Data flattening
    • Audit logs / RBAC