Slide 1

Gently Down the Stream with Apache Pinot
Navina Ramesh, Software Engineer, StarTree

Slide 2

Why Realtime Analytics Matters

Slide 3

Why Realtime Analytics Matters
● Analytics helps make better decisions by providing insights into data.
[Diagram: Events → Insights → Actions]

Slide 4

Why Realtime Analytics Matters
● Realtime analytics reduces the time to go from data to insights and actions.
[Chart: value of data vs. time, with realtime analytics acting while the value is highest]

Slide 5

Examples

Slide 6

Examples of Realtime Analytics
● Total users: 700 million
● QPS: 10,000+
● Latency SLA: < 100 ms at p99
● Freshness: seconds

Slide 7

Examples of Realtime Analytics
● Use cases: missed orders, inaccurate orders, downtime, top-selling items, menu item feedback
● Total users: 500,000+
● QPS: 100s
● Latency SLA: < 100 ms at p99
● Freshness: seconds to minutes

Slide 8

Characteristics of a Realtime Analytics Platform
● Query latency: sub-second to seconds
● Data freshness: ~seconds
● QPS: on the order of thousands
[Diagram: 1000s of QPS, sub-second latency, seconds freshness, highly available, scalable, cost effective, realtime ingestion, velocity, high dimensionality]

Slide 9

From Stream to Pinot

Slide 10

From Stream to Pinot
● Data Semantics
● Data Partitioning / Multiple Data Sources
● Data Transformation / Derived Data
● Complex Types / Nested Fields
● Data Granularity

Slide 11

Realtime Ingestion in Apache Pinot

Slide 12

Realtime Ingestion in Apache Pinot
● Partition-level, replicated consumption model
● Periodic segment creation based on size/time thresholds
● Data is ready for query as soon as it is ingested
A sample configuration sketch follows this list.
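
As an illustration, this consumption and segment-flush behavior is configured per table via streamConfigs. A minimal sketch for a Kafka topic (the topic, broker, and threshold values here are assumptions for this example, not from the talk):

"streamConfigs": {
  "streamType": "kafka",
  "stream.kafka.topic.name": "ads",
  "stream.kafka.broker.list": "localhost:9092",
  "stream.kafka.consumer.type": "lowlevel",
  "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
  "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
  "realtime.segment.flush.threshold.rows": "1000000",
  "realtime.segment.flush.threshold.time": "6h"
}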

Slide 13

High Level Architecture
[Diagram: a Pinot table is composed of data segments; each segment stores columnar data plus indexes]

Slide 14

Realtime Ingestion
[Diagram: replicated and partitioned consumption. Server-1, Server-2, and Server-3 consume segments seg0_0, seg1_0, and seg2_0; the Controller and Zookeeper track the segment-to-server assignment; brokers query the consuming segments.]
Seg0_0 -> Server-1, Server-2
Seg1_0 -> Server-2, Server-3
Seg2_0 -> Server-3, Server-1

Slide 15

Realtime Ingestion
[Diagram: when a size/time threshold is reached, each consuming segment (seg0_0, seg1_0, seg2_0) is sealed, becoming immutable, and persisted to the deep store.]

Slide 16

Realtime Ingestion
[Diagram: consumption continues with new consuming segments (seg0_1, seg1_1, seg2_1) while the sealed segments keep serving queries; servers consume, index, and serve simultaneously.]
Seg0_0 -> Server-1, Server-2    Seg0_1 -> Server-1, Server-2
Seg1_0 -> Server-2, Server-3    Seg1_1 -> Server-2, Server-3
Seg2_0 -> Server-3, Server-1    Seg2_1 -> Server-3, Server-1

Slide 17

A Peek into the Realtime Engine

Slide 18

Apache Pinot Realtime Engine
● Pluggable design
● Components:
○ Stream Consumer
○ Decoder, Transform Pipeline, Indexer
○ Upsert Metadata Manager
○ Segment Completion Handler
● These components allow Pinot to:
○ Fetch data from the source
○ Parse data and apply transforms and aggregations
○ Apply transforms based on data semantics
○ Build and store segments

Slide 19

Realtime Engine Component - Stream Consumer
Fetches data from a source.
[Diagram: Stream Consumer → Decode → Transform → Index, alongside the Segment Completion Handler, Upsert Manager, Controller, and Zookeeper]

Slide 20

Realtime Engine Component - Decoder
Deserializes data into a Pinot row. Available decoders include:
● CSV
● JSON
● Avro
● Protobuf
● …
● Debezium

Slide 21

Realtime Engine Component - Transform Pipeline
Applies transformation functions: filtering, flattening, aggregation.

Slide 22

Realtime Engine Component - Indexer
Indexes data in memory; the data is ready to query.

Slide 23

Realtime Engine Component - Upsert Manager
Maintains metadata for upsert semantics.

Slide 24

Realtime Engine Components
[Diagram: the Decoder, Transform Pipeline, and Indexer together form the Message Processor, fed by the Stream Consumer; the Segment Completion Handler, Upsert Manager, Controller, and Zookeeper sit alongside]

Slide 25

From Stream to Pinot
● Data Semantics
● Data Partitioning / Multiple Data Sources
● Data Transformation / Derived Data
● Complex Types / Nested Fields
● Data Granularity

Slide 26

Data Transformation & Derived Data

Slide 27

Transform Functions in Apache Pinot
● Large library of built-in transform functions
● Define custom transforms:
○ Using UDFs
○ Using Groovy
○ Using the @ScalarFunction annotation on Java transform methods (see the sketch after this list)
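
A custom Java transform might look like the following minimal sketch (the class name, method name, and registration-by-method-name behavior are illustrative assumptions; check Pinot's docs for the exact contract):

import org.apache.pinot.spi.annotations.ScalarFunction;

public class MyTransformFunctions {

  // Once this class is on Pinot's classpath, the method is registered
  // as a transform function under its method name and can then be used
  // in queries and in ingestion transformConfigs.
  @ScalarFunction
  public static String mirrorString(String input) {
    return new StringBuilder(input).reverse().toString();
  }
}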

Slide 28

Transform Functions in Apache Pinot
Number of clicks per hour on a given day:

SELECT SUM(clicks) FROM ads
WHERE toEpochDays(timestamp) = '2023-03-30 00:00:00'
GROUP BY toEpochHours(timestamp)

ad_id  views  clicks  timestamp
1      7      4       2023-03-30 00:00:00
2      10     2       2023-03-30 00:00:00
3      1      2       2023-03-30 00:00:01
1      4      1       2023-03-30 00:00:01
4      2      1       2023-03-30 00:00:01

Slide 29

Transform Functions in Apache Pinot
Number of clicks per hour on a given day:

SELECT SUM(clicks) FROM ads
WHERE toEpochDays(timestamp) = '2023-03-30 00:00:00'
GROUP BY toEpochHours(timestamp)

Transform functions used: toEpochDays, toEpochHours, SUM

Slide 30

Transform Functions in Apache Pinot
Average conversion rate since a given day:

SELECT AVG(clicks/views) FROM ads
WHERE toEpochDays(timestamp) >= '2023-03-01 00:00:00'

Slide 31

Transform Functions in Apache Pinot
Average conversion rate since a given day:

SELECT AVG(clicks/views) FROM ads
WHERE toEpochDays(timestamp) >= '2023-03-01 00:00:00'

Transform functions used: toEpochDays, AVG

Slide 32

Transform Functions in Apache Pinot
Transform functions are convenient, but they come at a cost in query performance:
● Extra calculations are needed per record
● Indexes cannot be applied to transformation results
Use Ingestion Transforms in Pinot!

Slide 33

Ingestion Transforms in Apache Pinot
● Pre-materialize the transformation result as a dedicated Pinot column
● Indexes can be applied to the pre-materialized column!

Table Config:
"ingestionConfig": {
  "transformConfigs": [{
    "columnName": "daysTs",
    "transformFunction": "toEpochDays(timestamp)"
  }, {
    "columnName": "hoursTs",
    "transformFunction": "toEpochHours(timestamp)"
  }]
  ...

Slide 34

Ingestion Transforms in Apache Pinot
Number of clicks per hour on a given day:

Before (query-time transform):
SELECT SUM(clicks) FROM ads
WHERE toEpochDays(timestamp) = '2023-03-30 00:00:00'
GROUP BY toEpochHours(timestamp)

After (pre-materialized columns):
SELECT SUM(clicks) FROM ads
WHERE daysTs = '2023-03-30 00:00:00'
GROUP BY hoursTs

Slide 35

Ingestion Transforms in Apache Pinot
Average conversion rate since a given day:

Before (query-time transform):
SELECT AVG(clicks/views) FROM ads
WHERE toEpochDays(timestamp) >= '2023-03-01 00:00:00'

After (pre-materialized column):
SELECT AVG(clicks/views) FROM ads
WHERE daysTs >= '2023-03-01 00:00:00'

Table Config:
"ingestionConfig": {
  "transformConfigs": [{
    "columnName": "daysTs",
    "transformFunction": "toEpochDays(timestamp)"
  }, {
    "columnName": "hoursTs",
    "transformFunction": "toEpochHours(timestamp)"
  }]
  ...


Slide 37

Derived Columns in Apache Pinot
● A column whose value is calculated from other columns
● Generated on the fly during ingestion
● Backfill existing segments by reloading the segments on the server

Slide 38

Derived Columns in Apache Pinot
Average conversion rate since a given day:

Before:
SELECT AVG(clicks/views) FROM ads
WHERE daysTs >= '2023-03-30 00:00:00'

After (derived column):
SELECT AVG(conversionRate) FROM ads
WHERE daysTs >= '2023-03-30 00:00:00'

Table Config:
"ingestionConfig": {
  "transformConfigs": [{
    "columnName": "daysTs",
    "transformFunction": "toEpochDays(timestamp)"
  }, {
    "columnName": "hoursTs",
    "transformFunction": "toEpochHours(timestamp)"
  }, {
    "columnName": "conversionRate",
    "transformFunction": "clicks/views"
  }]
  ...

Slide 39

From Stream to Pinot
● Data Semantics
● Data Partitioning / Multiple Data Sources
● Data Transformation / Derived Data
● Complex Types / Nested Fields
● Data Granularity


Slide 41

Complex Types & Nested Fields

Slide 42

Complex Types / Nested Fields
{
  "id": "17244207357",
  "type": "PullRequestEvent",
  "repo": {
    "id": 326604100,
    "name": "jthegedus/svelte-adapter-firebase"
  },
  "payload": {
    "number": 120,
    "pull_request": {
      "id": 694073641,
      "user": {
        "login": "github-actions[bot]",
        "id": 41898282,
        "type": "Bot"
      },
      "body": ":robot: I have created a release ...",
      "created_at": "2021-07-21T05:59:14Z",
      "labels": [
        { "id": 3184575526, "name": "autorelease: pending" },
        { "id": 2412457281, "name": "autorelease: testing" }
      ]
    }
  }
}
"repo", "payload", and "payload.pull_request" are nested fields; "labels" is an array.


Slide 44

Complex Types / Nested Fields
● Store top-level fields as JSON strings and extract them at query time

SELECT id,
       type,
       json_extract_scalar("repo", '$.name', 'STRING', 'null') AS repo_name,
       json_extract_scalar("payload", '$.number', 'INT', 'null') AS payload_number,
       json_extract_scalar(
         json_extract_scalar("payload", '$.pull_request', 'STRING', 'null'),
         '$.labels', 'STRING', 'null') AS payload_pullrequest_labels,
       json_extract_scalar("payload", '$.pull_request.created_at', 'STRING', 'null') AS payload_created_at
FROM PullRequestEvent

Slide 45

Complex Types / Nested Fields
● Storing top-level fields as JSON strings and extracting at query time (as in the query above) is not optimized for queries:
○ Transform functions are applied at query time
○ For every record, a JSON object is constructed from the stored JSON string
● Cannot index on nested JSON fields
● Storage overhead: the entire JSON string is stored

Slide 46

Complex Types / Nested Fields
● Use ingestion transforms!
○ Specify complexTypeConfig with a delimiter to flatten the maps recursively

Schema:
"dimensionFieldSpecs": [
  { "name": "id", "dataType": "INT" },
  { "name": "type", "dataType": "STRING" },
  { "name": "repo.name", "dataType": "STRING" },
  { "name": "payload.pull_request.id", "dataType": "INT" },
  { "name": "payload.pull_request.labels", "dataType": "STRING" }
],
"dateTimeFieldSpecs": [
  { "name": "payload.pull_request.created_at",
    "dataType": "STRING",
    "format": "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd'T'HH:mm:ss'Z'" }
]

Table Config:
"ingestionConfig": {
  "complexTypeConfig": {
    "delimiter": "."
  },
  ...
}

Slide 47

Complex Types / Nested Fields
● Use ingestion transforms!
○ With the complexTypeConfig above, the nested record is flattened into:

id: 17244207357
type: PullRequestEvent
repo.name: jthegedus/svelte-adapter-firebase
payload.pull_request.number: 120
payload.pull_request.labels: [{ "id": 3184575526, "name": "autorelease: pending" }, { "id": 2412457281, "name": "autorelease: testing" }]
payload.pull_request.created_at: 2021-07-21T05:59:14Z


Slide 49

Complex Types / Nested Fields
● Use ingestion transforms!
○ Specify fieldsToUnnest within complexTypeConfig to unnest the array

Table Config:
"ingestionConfig": {
  "complexTypeConfig": {
    "delimiter": ".",
    "fieldsToUnnest": ["payload.pull_request.labels"]
  },
  ...
}

Schema:
"dimensionFieldSpecs": [
  { "name": "id", "dataType": "INT" },
  { "name": "type", "dataType": "STRING" },
  { "name": "repo.name", "dataType": "STRING" },
  { "name": "payload.pull_request.id", "dataType": "INT" },
  { "name": "payload.pull_request.labels.name", "dataType": "JSON" }
],
"dateTimeFieldSpecs": [
  { "name": "payload.pull_request.created_at",
    "dataType": "STRING",
    "format": "1:SECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd'T'HH:mm:ss'Z'" }
]

Unnested rows (one per label):
id           type              repo.name                          payload.pull_request.number  payload.pull_request.labels.name  payload.pull_request.created_at
17244207357  PullRequestEvent  jthegedus/svelte-adapter-firebase  120                          autorelease: pending              2021-07-21T05:59:14Z
17244207357  PullRequestEvent  jthegedus/svelte-adapter-firebase  120                          autorelease: testing              2021-07-21T05:59:14Z
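
Once flattened and unnested as above, queries become plain column filters; for example, a sketch against the example rows above:

SELECT "repo.name", COUNT(*)
FROM PullRequestEvent
WHERE "payload.pull_request.labels.name" = 'autorelease: pending'
GROUP BY "repo.name"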

Slide 50

From Stream to Pinot
● Data Semantics
● Data Partitioning / Multiple Data Sources
● Data Transformation / Derived Data
● Complex Types / Nested Fields
● Data Granularity


Slide 52

Data Granularity

Slide 53

Dealing with Data Granularity
Raw data at millisecond granularity:

customer_id  product_id  price  timestamp
205          car         1500   1571900400000
206          truck       2200   1571900400000
207          car         1300   1571900400000
208          truck       700    1572418800000
209          car         1100   1572505200000
210          car         2100   1572505200000

The same data at daily granularity:

product_id  price  timestamp
car         2800   10-24-2019 00:00:00
truck       2200   10-24-2019 00:00:00
truck       700    10-30-2019 00:00:00
car         3200   10-31-2019 00:00:00

Slide 54

Dealing with Data Granularity
Daily sales value per product (computed and aggregated at query time):

SELECT COUNT(*), SUM(price) FROM sales
GROUP BY product_id, toEpochDays(timestamp)

customer_id  product_id  price  timestamp
205          car         1500   1571900400000
206          truck       2200   1571900400000
207          car         1300   1571900400000
208          truck       700    1572418800000
209          car         1100   1572505200000
210          car         2100   1572505200000
211          truck       800    1572678000000
212          car         800    1572678000000
213          car         1900   1572678000000
214          car         1000   1572678000000

Slide 55

Ingestion Aggregation
● Aggregate data at ingestion time

Table Config:
{
  "transformConfigs": [{
    "columnName": "daysSinceEpoch",
    "transformFunction": "toEpochDays(timestamp)"
  }],
  "aggregationConfigs": [{
    "columnName": "total_sales",
    "aggregationFunction": "SUM(price)"
  }, {
    "columnName": "sales_count",
    "aggregationFunction": "COUNT(*)"
  }]
},
"tableIndexConfig": {
  "noDictionaryColumns": ["sales_count", "total_sales"]
}

Resulting table:
product_id  sales_count  total_sales  day
car         2            2800         10-24-2019 00:00:00
truck       1            2200         10-24-2019 00:00:00
truck       1            700          10-30-2019 00:00:00
car         2            3200         10-31-2019 00:00:00
truck       1            800          11-02-2019 00:00:00
car         3            3700         11-02-2019 00:00:00

Daily sales value per product:
SELECT product_id, day, SUM(sales_count), SUM(total_sales)
FROM sales
GROUP BY product_id, day

Slide 56

Ingestion Aggregation
● Use ingestion-time aggregation when only aggregated data is required, not the entire raw data
● Allowed aggregation functions: SUM, MIN, MAX, COUNT, DISTINCTCOUNTHLL
Caveats:
● All metric columns must have aggregation functions defined

Slide 57

From Stream to Pinot
● Data Semantics
● Data Partitioning / Multiple Data Sources
● Data Transformation / Derived Data
● Complex Types / Nested Fields
● Data Granularity


Slide 59

Data Semantics

Slide 60

Data Semantics
● Change Data Capture (CDC) events: READ, CREATE, INSERT, UPDATE, DELETE
● Multiple events for the same record key
[Diagram: MySQL holds mutable records; Debezium turns changes into an immutable stream of records]

Slide 61

Data Semantics
● Change Data Capture (CDC) events: READ, CREATE, INSERT, UPDATE, DELETE
● Multiple events for the same record key
[Diagram: a stream carrying multiple changes to the same record, ingested as immutable records]

Slide 62

Dealing with Data Semantics
Goal: update a record by primary key.

Debezium event (abridged):
[
  { "name": "test.movies.movies.Key",
    "payload": { "id": "{\"$oid\": \"1002\"}" } },
  { "payload": {
      "before": null,
      "after": "{\"_id\": {\"$oid\": \"1002\"},\"title\": \"Black Panther\",\"rating\": \"PG-13\",\"imdb\": 7.3}",
      "source": { "connector": "mysql", "name": "test", "ts_ms": 1670357925000, "snapshot": "false", "db": "movies", ... },
      "op": "c",
      "ts_ms": 1670357926132,
      "transaction": null } }
]

Resulting row in Pinot:
movie_id  movie_title    rating  imdb_score  timestamp
1002      Black Panther  PG-13   7.3         2022-12-06 20:18:46

Slide 63

Dealing with Data Semantics
CREATE record in MySQL → Debezium event:

[
  { "schema": { "type": "struct", "fields": [ ... ], "optional": false, "name": "test.movies.movies.Key" },
    "payload": { "id": "{\"$oid\": \"1002\"}" } },
  { "schema": { "type": "struct", "fields": [ ... ], "optional": false, "name": "test.movies.movies.Envelope" },
    "payload": {
      "before": null,
      "after": "{\"_id\": {\"$oid\": \"1002\"},\"title\": \"Black Panther\",\"rating\": \"PG-13\",\"imdb\": 7.3}",
      "source": { "connector": "mysql", "name": "test", "ts_ms": 1670357925000, "snapshot": "false", "db": "movies", ... },
      "op": "c",
      "ts_ms": 1670357926132,
      "transaction": null } }
]

Row in MySQL:
ID    Name           Rating  IMDB
1002  Black Panther  PG-13   7.3

Slide 64

Dealing with Data Semantics
UPDATE in MySQL → Debezium event:

[
  { "schema": { "type": "struct", "fields": [ ... ], "optional": false, "name": "test.movies.movies.Key" },
    "payload": { "id": "{\"$oid\": \"1002\"}" } },
  { "schema": { "type": "struct", "fields": [ ... ], "optional": false, "name": "test.movies.movies.Envelope" },
    "payload": {
      "before": "{\"_id\": {\"$oid\": \"1002\"},\"title\": \"Black Panther\",\"rating\": \"PG-13\",\"imdb\": 7.3}",
      "after": "{\"_id\": {\"$oid\": \"1002\"},\"title\": \"Black Panther\",\"rating\": \"PG-13\",\"imdb\": 8.5}",
      "source": { "connector": "mysql", "name": "test", "ts_ms": 16703579961000, "snapshot": "false", "db": "movies", ... },
      "op": "u",
      "ts_ms": 1670357986189,
      "transaction": null } }
]

Row in MySQL:
ID    Name           Rating  IMDB
1002  Black Panther  PG-13   8.5

Slide 65

Dealing with Data Semantics - CDC
[Diagram: a CREATE event flows through Stream Consumer → DebeziumDecoder → Transform → Index; the Upsert Manager's metadata map (primary key → doc id) is still empty, and no record is stored in the segment yet]

Slide 66

Dealing with Data Semantics - CDC
[Diagram: after the CREATE event is processed, the record (1002, Black Panther, PG-13, 7.3) is stored in the segment and the upsert metadata maps primary key 1002 → doc id 0; an UPDATE event now arrives]

Slide 67

Dealing with Data Semantics - CDC
[Diagram: after the UPDATE event is processed, the segment holds both versions of the record, (1002, Black Panther, PG-13, 7.3) and (1002, Black Panther, PG-13, 8.5), and the upsert metadata now maps primary key 1002 → doc id 1, so queries see only the latest version]
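
Upsert behavior like this is declared in the schema and table config. A minimal sketch (the mode and key names follow Pinot's upsert configuration; the column name is illustrative):

Schema:
"primaryKeyColumns": ["movie_id"]

Table Config:
"upsertConfig": {
  "mode": "FULL"
}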

Slide 68

From Stream to Pinot
● Data Semantics
● Data Partitioning / Multiple Data Sources
● Data Transformation / Derived Data
● Complex Types / Nested Fields
● Data Granularity


Slide 70

Is Stream Processing Still Needed?

Slide 71

Yes! When is Stream Processing Needed?
● Multiple data sources
○ Complex processing of data from multiple sources, such as:
■ Stream-table joins
■ Stream-stream joins
● Data that requires re-partitioning (see the sketch after this list)
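
As one illustration of the re-partitioning case, a small Kafka Streams job can re-key a topic before Pinot consumes it. This is a sketch under assumptions: the topic names, the key choice, and the use of Kafka Streams are not from the talk.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class RepartitionJob {
  public static void main(String[] args) {
    StreamsBuilder builder = new StreamsBuilder();

    // Read the raw events, re-key them by the desired partition key
    // (here the record value itself, as a placeholder), and write to
    // a new topic; Kafka partitions by key on produce, giving Pinot
    // a partition-aligned topic to ingest.
    KStream<String, String> events = builder.stream("raw-events");
    events
        .selectKey((key, value) -> value)
        .to("events-partitioned-by-key");

    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "repartition-job");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

    new KafkaStreams(builder.build(), props).start();
  }
}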

Slide 72

Recap

Slide 73

Realtime Ingestion in Pinot
● Pluggable architecture
● Ingestion-time data processing:
○ Ingestion transforms
○ Derived columns
○ Complex type handling (filtering / flattening)
○ Ingestion aggregation
○ Upsert semantics
● Use ingestion features to get data into an optimal form for your use case

Slide 74

Looking Ahead

Slide 75

Enhancements to Pinot Realtime Engine
● Support record deletion in upsert tables (in progress - GH#10452)
● Segment compaction in upsert tables (in progress - GH#6912)
● Support for data enrichment
● Support ingesting from multiple sources

Slide 76

Thank You!
dev.startree.ai

Slide 77

Stream Consumer in Apache Pinot
● The streaming data system should:
○ Support or allow a partitioned view of the data for consumption
○ Support or allow deriving a notion of an "offset" or a stream "marker"
○ Support some mechanism to "seek" within a stream
○ Provide message ordering at the partition level
A sketch of these capabilities as an interface follows this list.
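
The interface below is a hypothetical sketch to picture these requirements as a consumer contract; it is not Pinot's actual SPI.

// Hypothetical sketch of the capabilities a stream must offer for
// Pinot's partition-level consumption; not Pinot's real interfaces.
public interface PartitionedStreamConsumer<M, O extends Comparable<O>> {

  // A partitioned view: each consumer instance is bound to one partition.
  int partitionId();

  // A notion of an "offset" or stream marker, used for checkpointing.
  O currentOffset();

  // A way to seek to an arbitrary marker, e.g. when a replica catches up.
  void seek(O offset);

  // Ordered delivery within the partition: messages arrive in offset order.
  java.util.List<M> fetch(O startOffset, long timeoutMillis);
}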

Slide 78

Stream Decoders in Apache Pinot
● Stream message decoders accept the input from a StreamConsumer and convert it into Pinot's data model, represented as a GenericRow
● Custom-build your decoder by implementing a simple interface:

public interface StreamMessageDecoder<T> {
  GenericRow decode(T payload, GenericRow destination);
}
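
For example, a bare-bones CSV decoder against this simplified interface might look like the following sketch. Pinot's full decoder interface also includes an init method and additional decode overloads; the fixed column list and string-only values here are assumptions for illustration.

import org.apache.pinot.spi.data.readers.GenericRow;

// Minimal sketch of a custom decoder for comma-separated byte payloads.
public class SimpleCsvMessageDecoder implements StreamMessageDecoder<byte[]> {

  // Assumed fixed column order for the incoming CSV records.
  private static final String[] COLUMNS = {"ad_id", "views", "clicks", "timestamp"};

  @Override
  public GenericRow decode(byte[] payload, GenericRow destination) {
    // Split the payload and copy each value into the Pinot row.
    String[] values = new String(payload, java.nio.charset.StandardCharsets.UTF_8).split(",");
    for (int i = 0; i < COLUMNS.length && i < values.length; i++) {
      destination.putValue(COLUMNS[i], values[i].trim());
    }
    return destination;
  }
}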