
Backfill Upsert Table Via Flink/Apache Pinot Connector (Yupeng Fu, Uber) | RTA Summit 2023


It has been a challenge to bootstrap or backfill an upsert table (e.g., for corrections) with long retention in Pinot, given that an upsert table must be a real-time table. However, in most organizations, streams (e.g., Kafka) have a limited retention period.

To address this challenge, we developed a Flink/Pinot connector that generates upsert segments directly from batch data sources (e.g., Hive), solving the backfill problem for historical data without any dependency on Kafka.

StarTree

May 23, 2023

Transcript

  1. About Me • Yupeng Fu (yupeng9@github) • Principal Engineer @ Uber Inc. • Real-time Data Infrastructure • Apache Pinot Committer
  2. Agenda • RTA@Uber • Upsert overview • Backfill and its challenges • Flink/Pinot connector • Future work
  3. Real-time Analytics in the Uber business: 1. real-time and actionable insights, 2. time-sensitive decisions, 3. user engagement growth. Example use cases: Restaurant Performance View, Demand/Supply Management, Fast Access to Fresh Data at Scale, Freight Carrier Score Card
  4. EVA: Uber's consolidated RTA platform • Tier 0 platform, 99.99% uptime • Self-service onboarding (via uWorc) • Built on top of Apache Pinot • SQL API (via Presto / Neutrino) • Seconds of data freshness • <100 ms @P99 query latency • https://www.uber.com/blog/operating-apache-pinot/
  5. Why upsert in Pinot? • Data can be updated or corrected • Pinot should deliver an accurate and up-to-date real-time view • No easy workaround in SQL:

     SELECT current_status, count(*)
     FROM uberEatsOrders
     WHERE regionid = 1366
       AND MinutesSinceEpoch BETWEEN 25432140 AND 25433580
     GROUP BY current_status
     TOP 10000
  6. Case Study: Uber Ads Platform built in <12 months • Aug 2020: started building the Uber Ads Platform • Jan 2021: engagement with data infra on exactly-once processing • Feb 2021: test launch • Apr 2021: first exactly-once use case enabled by Data Infra • Challenge 1: revenue cut taken by the third-party software • Challenge 2: previous platform limited by stale, discrepant data • Built on cutting-edge real-time technologies (Apache Kafka™/Flink™/Pinot™) • https://www.uber.com/blog/real-time-exactly-once-ad-event-processing/
  7. Apache Pinot's Real-time Data Flow (diagram: Apache Pinot Controller, Apache Pinot Broker, Zookeeper, and Pinot Servers S1–S4; initial segment assignment Seg1 → S1, Seg2 → S2, Seg3 → S3, Seg4 → S4; after replication Seg1 → S1, S4; Seg2 → S2, S3; Seg3 → S3, S1; Seg4 → S4, S2; example query: select count(*) from X where country = us) • segments are immutable • segments are distributed • segments are replicated
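The replicated segment assignment shown on this slide can be sketched as a plain map. This is purely illustrative (class and method names are mine, not Pinot's); in a real cluster this mapping lives in Zookeeper and is managed by the Pinot controller.

```java
import java.util.List;
import java.util.Map;

// Illustrative only: the segment-to-servers replica map from the slide,
// after replication. Each segment is stored on two servers.
public class SegmentAssignment {
    public static Map<String, List<String>> replicaMap() {
        return Map.of(
            "Seg1", List.of("S1", "S4"),
            "Seg2", List.of("S2", "S3"),
            "Seg3", List.of("S3", "S1"),
            "Seg4", List.of("S4", "S2"));
    }
}
```

With two replicas per segment, the broker can still reach a complete set of segments for a query like `select count(*) from X where country = us` even if any single server is down.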
  8. Upsert via local coordination • The key challenge is tracking the records with the same PK • Reduce the problem to a local coordination problem by ◦ leveraging the partition-by-key feature in Kafka ◦ distributing segments of the same primary key to the same server
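The partition-by-key idea behind this slide can be sketched in a few lines. This is a hedged illustration, not Pinot's or Kafka's actual code (Kafka's default partitioner hashes keys with murmur2 rather than `String.hashCode()`):

```java
// Illustrative sketch: a deterministic key-to-partition mapping. Every
// record with the same primary key maps to the same partition, and Pinot
// pins each partition's segments to the same server -- which is what
// makes upsert coordination a purely local problem.
public class KeyPartitioner {
    public static int partitionFor(String primaryKey, int numPartitions) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(primaryKey.hashCode(), numPartitions);
    }
}
```

Because the mapping is deterministic, the upsert metadata for a given key only ever needs to be consulted on one server.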
  9. Pinot Backfill via Apache Flink™ Kappa+

     DataStream<Map<String, Object>> stream = getBackfillInput("hiveSrc");
     // Watermarking handled internally
     stream
         .partitionCustom(
             new PinotPartitioner(jobConfiguration.getParallelism()),
             new PrimaryKeySelector("record_uuid"))
         .addSink(getOutputs().get("rta_ads_metrics_test"))
         .name("pinot-sink");
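The `PrimaryKeySelector` and `PinotPartitioner` in this snippet belong to Uber's internal connector; minimal stand-ins might look like the following pure-JDK sketch (in the real job they would implement Flink's `KeySelector` and `Partitioner` interfaces, and the field name `record_uuid` comes from the slide):

```java
import java.util.Map;

// Hedged stand-ins for the connector helpers used on this slide.
public class BackfillPartitioning {

    // Pulls the primary key (e.g. "record_uuid") out of each record map.
    public static class PrimaryKeySelector {
        private final String keyField;
        public PrimaryKeySelector(String keyField) { this.keyField = keyField; }
        public String getKey(Map<String, Object> record) {
            return String.valueOf(record.get(keyField));
        }
    }

    // Routes each key to one of `parallelism` sink subtasks, mirroring
    // Kafka's partition-by-key so the generated upsert segments keep the
    // same key-to-partition layout as the real-time table they backfill.
    public static class PinotPartitioner {
        private final int parallelism;
        public PinotPartitioner(int parallelism) { this.parallelism = parallelism; }
        public int partition(String key) {
            return Math.floorMod(key.hashCode(), parallelism);
        }
    }
}
```

Matching the real-time table's partition layout is the crux: it lets the batch-generated segments slot into the same server-local upsert coordination described on the previous slide.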
  10. Q&A • Apache®, Apache Kafka®, Apache Flink®, Apache Pinot®, Apache Hadoop®, Presto® and their logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.