
Backfill Upsert Table Via Flink/Apache Pinot Connector (Yupeng Fu, Uber) | RTA Summit 2023


It has been a challenge to bootstrap or backfill an upsert table (e.g., for corrections) with long retention in Pinot, given that an upsert table must be a real-time table. However, in most organizations, streams (e.g., Kafka) have a limited retention period.

To address this challenge, we developed a Flink/Pinot connector that generates upsert segments directly from batch data sources (e.g., Hive), solving the backfill problem for historical data without any dependency on Kafka.

StarTree

May 23, 2023

Transcript

  1. About Me • Yupeng Fu (yupeng9@github) • Principal Engineer @ Uber Inc. • Real-time Data Infrastructure • Apache Pinot Committer
  2. Agenda • RTA@Uber • Upsert overview • Backfill and its challenges • Flink/Pinot connector • Future work
  3. Real-time Analytics in the Uber business: 1. real-time and actionable insights, 2. time-sensitive decisions, 3. user engagement growth. Example use cases: Restaurant Performance View, Demand/Supply Management, Fast Access to Fresh Data at Scale, Freight Carrier Score Card
  4. EVA: Uber's consolidated RTA platform • Tier 0 platform, 99.99% uptime • Self-service onboarding (via uWorc) • Built on top of Apache Pinot • SQL API (via Presto / Neutrino) • Seconds of data freshness • <100 ms @P99 query latency • https://www.uber.com/blog/operating-apache-pinot/
  5. Why upsert in Pinot? • Data can be updated or corrected • Pinot should deliver an accurate and up-to-date real-time view • No easy workaround in SQL:

     SELECT current_status, count(*)
     FROM uberEatsOrders
     WHERE regionid = 1366
       AND MinutesSinceEpoch BETWEEN 25432140 AND 25433580
     GROUP BY current_status
     TOP 10000
  6. Case Study: Uber Ads Platform built in <12 months • Aug 2020: started building the Uber Ads Platform • Jan 2021: engagement with data infra on exactly-once processing • Feb 2021: test launch • Apr 2021: first exactly-once use case enabled by Data Infra • Challenge 1: revenue cut taken by the third-party software • Challenge 2: previous platform limited by stale, discrepant data • Built on cutting-edge real-time technologies (Apache Kafka™/Flink™/Pinot™) • https://www.uber.com/blog/real-time-exactly-once-ad-event-processing/
  7. Apache Pinot's Real-time Data Flow (diagram: Apache Pinot Controller, Apache Pinot Broker, Zookeeper, and Pinot Servers S1–S4; initial segment assignment Seg1 → S1, Seg2 → S2, Seg3 → S3, Seg4 → S4; after replication Seg1 → S1, S4; Seg2 → S2, S3; Seg3 → S3, S1; Seg4 → S4, S2; example query: select count(*) from X where country = us) • segments are immutable • segments are distributed • segments are replicated
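The replicated segment assignment shown on this slide can be sketched as a plain map. This is purely illustrative (class and method names are mine, not Pinot's); in a real cluster this mapping lives in Zookeeper and is managed by the Pinot controller.

```java
import java.util.List;
import java.util.Map;

// Illustrative only: the segment-to-servers replica map from the slide,
// after replication. Each segment is stored on two servers.
public class SegmentAssignment {
    public static Map<String, List<String>> replicaMap() {
        return Map.of(
            "Seg1", List.of("S1", "S4"),
            "Seg2", List.of("S2", "S3"),
            "Seg3", List.of("S3", "S1"),
            "Seg4", List.of("S4", "S2"));
    }
}
```

With two replicas per segment, the broker can still reach a complete set of segments for a query like `select count(*) from X where country = us` even if any single server is down.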
  8. Upsert via local coordination • The key challenge is tracking the records with the same PK • Reduce the problem to a local coordination problem by ◦ leveraging the partition-by-key feature in Kafka ◦ distributing segments of the same primary key to the same server
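The partition-by-key idea behind this slide can be sketched in a few lines. This is a hedged illustration, not Pinot's or Kafka's actual code (Kafka's default partitioner hashes keys with murmur2 rather than `String.hashCode()`):

```java
// Illustrative sketch: a deterministic key-to-partition mapping. Every
// record with the same primary key maps to the same partition, and Pinot
// pins each partition's segments to the same server -- which is what
// makes upsert coordination a purely local problem.
public class KeyPartitioner {
    public static int partitionFor(String primaryKey, int numPartitions) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(primaryKey.hashCode(), numPartitions);
    }
}
```

Because the mapping is deterministic, the upsert metadata for a given key only ever needs to be consulted on one server.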
  9. Pinot Backfill via Apache Flink™ Kappa+

     DataStream<Map<String, Object>> stream = getBackfillInput("hiveSrc");
     // Watermarking handled internally
     stream
         .partitionCustom(
             new PinotPartitioner(jobConfiguration.getParallelism()),
             new PrimaryKeySelector("record_uuid"))
         .addSink(getOutputs().get("rta_ads_metrics_test"))
         .name("pinot-sink");
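The `PrimaryKeySelector` and `PinotPartitioner` in this snippet belong to Uber's internal connector; minimal stand-ins might look like the following pure-JDK sketch (in the real job they would implement Flink's `KeySelector` and `Partitioner` interfaces, and the field name `record_uuid` comes from the slide):

```java
import java.util.Map;

// Hedged stand-ins for the connector helpers used on this slide.
public class BackfillPartitioning {

    // Pulls the primary key (e.g. "record_uuid") out of each record map.
    public static class PrimaryKeySelector {
        private final String keyField;
        public PrimaryKeySelector(String keyField) { this.keyField = keyField; }
        public String getKey(Map<String, Object> record) {
            return String.valueOf(record.get(keyField));
        }
    }

    // Routes each key to one of `parallelism` sink subtasks, mirroring
    // Kafka's partition-by-key so the generated upsert segments keep the
    // same key-to-partition layout as the real-time table they backfill.
    public static class PinotPartitioner {
        private final int parallelism;
        public PinotPartitioner(int parallelism) { this.parallelism = parallelism; }
        public int partition(String key) {
            return Math.floorMod(key.hashCode(), parallelism);
        }
    }
}
```

Matching the real-time table's partition layout is the crux: it lets the batch-generated segments slot into the same server-local upsert coordination described on the previous slide.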
  10. Q&A • Apache®, Apache Kafka®, Apache Flink®, Apache Pinot®, Apache Hadoop®, Presto® and their logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.