Using Flink & Pinot for Complex User-Facing Analytics (Ken Krugler, Scale Unlimited) | RTA Summit 2023

Using Flink & Pinot for Complex Analytics
Each according to their abilities…

20 Seconds on Me
❖ Ken Krugler, Scale Unlimited
❖ I used to run a boutique big data consulting company
❖ Now I’m semi-retired, only doing fun stuff
❖ Active in open source, ASF member, Tika committer
❖ Started working with big data in 2006

TL;DR
❖ Pinot supports minions & ingest-time transformations
❖ Flink supports scalable, complex transformations
❖ Working together they are a better solution for …
  ❖ Offloading work from the Pinot cluster
  ❖ Doing complex transformations
  ❖ Optimizing Pinot star tree indexes

The Backstory
❖ Our client (Adbeat) provides advertising analytics
❖ Based on collecting many billions of display ads
❖ They have a dashboard and a report service

This Is What We Need
[Workflow diagram. Stages shown: Collection Source, Join image info, Filter by language, Find ad brands, Join landing pages by original URL, Normalize destination URLs, Join landing pages by normalized URL, Identify Campaigns, Advertiser keyphrases, Publisher keyphrases. Operators shown: Map, Flat Map, Keyed Reduce, Co-Keyed-Process, KeyedProcess.]
❖ Daily updates to main table
❖ Monthly NLP for 2 tables
❖ Many joins of data streams
❖ Complex text analysis
❖ ML for semantic similarity

Full Workflow
[Workflow diagram. Three Collection Source inputs feed the same stages as the previous diagram (Join image info, Filter by language, Find ad brands, Join landing pages by original URL, Normalize destination URLs, Join landing pages by normalized URL, Identify Campaigns, Advertiser keyphrases, Publisher keyphrases). Additional stages shown: Write keyphrases Pinot input files, top N grams, autocompletion results, Write autocompletion Pinot input files.]

This is What We Get (from Pinot)
❖ Pinot supports tasks via minions
  ❖ Can schedule/retry/parallelize tasks
  ❖ Tasks can do data ingestion, segment management
  ❖ Runs on separate “minion nodes”
❖ Data can be transformed on ingest via table config
  ❖ Rename columns, flatten data, etc.
  ❖ But only real-time ingestion

Baseline Requirements
❖ Support text-based filtering in 16 languages
❖ Fast sorted aggregations across billions of records
❖ Daily updates - millions of new records, plus refresh
❖ Monthly analysis - billions of records
❖ Minimize 24x7 server requirements

DIY Text Search
❖ Do text analysis (normalized tokens) in Flink workflow
❖ Store as multi-valued string column
❖ Do same analysis at query time for where clause
❖ Create shingles (aka n-grams) for approx. phrase query
  ❖ “Fast text search!” => “fast”, “fast text”, “text”, “text search”, “search”
❖ No positional data means some false positives
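The shingle example on this slide can be sketched in a few lines. This is a minimal illustration, not the workflow's actual analyzer: the `analyze`, `shingles`, and `matches` names are hypothetical, the regex tokenizer stands in for real per-language analysis, and requiring every query shingle to be present is only one assumed way to build the where clause.

```python
import re

def analyze(text):
    """Lowercase and split into word tokens (stand-in for real text analysis)."""
    return re.findall(r"\w+", text.lower())

def shingles(tokens, max_size=2):
    """Return every run of 1..max_size adjacent tokens, in token order."""
    terms = []
    for i in range(len(tokens)):
        for size in range(1, max_size + 1):
            if i + size <= len(tokens):
                terms.append(" ".join(tokens[i:i + size]))
    return terms

def matches(doc_text, query_text):
    """Assumed matching rule: every query shingle appears in the document's terms."""
    doc_terms = set(shingles(analyze(doc_text)))
    return all(term in doc_terms for term in shingles(analyze(query_text)))

slide_terms = shingles(analyze("Fast text search!"))
# slide_terms == ["fast", "fast text", "text", "text search", "search"]

# Without positions, a document can contain every shingle of a phrase
# without containing the phrase itself: a false positive.
false_positive = matches("Text search is fast text", "fast text search")
```

Here `false_positive` is true even though the phrase "fast text search" never occurs in the document, which is the trade-off the last bullet describes.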

Unintended Benefits
❖ Lowers time to build segments
  ❖ We’re doing text analysis in transient, scalable Flink workflow
❖ “Fuzzy” matching for phrases is often a win
  ❖ “search text faster” also matches “faster text search”

Problematic Query
❖ Filtering: where country = 'us'
❖ Grouping: group by adHash
❖ Aggregating: sum(adSpend)
❖ Sorting: order by sum(adSpend) desc
❖ Paging: limit 1000

The Curse of High Cardinality
❖ Star trees are a win if pre-aggregation generates far fewer entries
❖ 25B records => 6B unique ads, so maybe 4x win
❖ Still ≈ 10 seconds for big aggregation query
❖ Need to figure out how to sum fewer entries

Using The Power Curve
❖ The “80/20” rule - actually the 70/2 rule for us
❖ 70% of ad spend is by top 6K ads, out of 370K ads
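To make the power-curve observation concrete, the sketch below finds the smallest set of top-spending ads that covers a target share of total spend. It is an exact, sort-based illustration on made-up numbers, not the approach used in the talk; `top_ads_for_share` and the sample data are hypothetical.

```python
def top_ads_for_share(spend_by_ad, target_share):
    """Return (ad ids, covered share) for the fewest top ads reaching target_share."""
    total = sum(spend_by_ad.values())
    if total == 0:
        return [], 0.0
    # Rank ads from highest to lowest spend.
    ranked = sorted(spend_by_ad.items(), key=lambda kv: kv[1], reverse=True)
    top, running = [], 0.0
    for ad, spend in ranked:
        if running >= target_share * total:
            break
        top.append(ad)
        running += spend
    return top, running / total

# Made-up spend figures with a steep power curve (total = 1000).
spend_by_ad = {
    "a1": 500, "a2": 250, "a3": 100, "a4": 50, "a5": 40,
    "a6": 30, "a7": 15, "a8": 10, "a9": 3, "a10": 2,
}
top, share = top_ads_for_share(spend_by_ad, 0.70)
```

With this sample data, two of the ten ads already cover the 70% target, which mirrors the shape of the slide's 6K-out-of-370K figure at a much smaller scale.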

Add a “Top Ads” Flag
❖ KLLFloatsSketch, from Apache DataSketches project
❖ Flags ads that have majority of spend (approx)
❖ Use that flag in star tree definition as dimension
❖ Now ≈ 300ms for the same query
❖ Unsupported filter dimensions means falling back to regular query without where topAds = 'true'
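The fallback rule in the last bullet could look roughly like this on the query-building side. This is a sketch under assumptions: the table name `ads`, the `STAR_TREE_DIMENSIONS` set, and the `build_query` helper are all hypothetical, and values are interpolated without escaping purely to keep the example short.

```python
# Hypothetical set of dimensions included in the star tree definition.
STAR_TREE_DIMENSIONS = {"country", "topAds"}

def build_query(filters, table="ads", limit=1000):
    """Build the slide's aggregation query, adding the topAds filter only
    when every filter column is a star tree dimension."""
    where = [f"{col} = '{val}'" for col, val in filters.items()]
    if all(col in STAR_TREE_DIMENSIONS for col in filters):
        # The star tree can serve this query, so restrict it to flagged ads.
        where.append("topAds = 'true'")
    # Otherwise fall back to the regular query without the topAds filter.
    where_clause = f" WHERE {' AND '.join(where)}" if where else ""
    return (
        f"SELECT adHash, sum(adSpend) FROM {table}{where_clause} "
        f"GROUP BY adHash ORDER BY sum(adSpend) DESC LIMIT {limit}"
    )

fast_query = build_query({"country": "us"})
fallback_query = build_query({"country": "us", "language": "en"})
```

In this sketch `language` is not a star tree dimension, so `fallback_query` omits the flag and runs as a regular aggregation.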

Data Processing Flow
[Diagram: S3, HDFS, Flink Workflow]
Process all days in current month with Flink

Data Processing Flow
[Diagram: S3, HDFS, Flink Workflow, HDFS Text Segments]
Write compressed CSV segment files to HDFS

[Diagram: MapReduce SegGen, HDFS]
Use Pinot MapReduce job to build segments

[Diagram: MapReduce SegGen, HDFS, Pinot Cluster]
Trigger metadata push to Pinot cluster

Doing A Lot With Flink

Auto-completion with Pinot
❖ Meta-goal is to reduce infrastructure cost/complexity
❖ One output from monthly analysis is auto-completion
❖ Table with keyword, weight, ngrams (multi-value)

Keyword          Weight  Ngrams
mortgage         0.92    mor,mort,mortg
mortgage broker  1.23    mor,mort,mortg,bro,brok,broke
mortality        1.43    mor,mort,morta
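The ngrams column in the table above is consistent with taking 3- to 5-character prefixes of each word in the keyword. The sketch below reproduces the three rows under that assumption; the prefix lengths are inferred from the table, the `prefix_ngrams` and `suggest` helpers are hypothetical, and ranking by descending weight is an assumed choice.

```python
def prefix_ngrams(keyword, min_len=3, max_len=5):
    """Return the min_len..max_len character prefixes of each word, in order."""
    grams = []
    for word in keyword.split():
        for n in range(min_len, min(max_len, len(word)) + 1):
            gram = word[:n]
            if gram not in grams:
                grams.append(gram)
    return grams

# Rows from the slide's table: (keyword, weight), with ngrams derived.
rows = [
    {"keyword": kw, "weight": w, "ngrams": prefix_ngrams(kw)}
    for kw, w in [("mortgage", 0.92), ("mortgage broker", 1.23), ("mortality", 1.43)]
]

def suggest(rows, typed, limit=10):
    """Return keywords whose ngrams contain the typed prefix, highest weight first."""
    hits = [r for r in rows if typed.lower() in r["ngrams"]]
    hits.sort(key=lambda r: r["weight"], reverse=True)
    return [r["keyword"] for r in hits[:limit]]

mort_suggestions = suggest(rows, "mort")
bro_suggestions = suggest(rows, "bro")
```

The lookup amounts to an equality filter on a multi-valued column plus a sort on weight, which is the kind of query the same cluster already handles, in line with the slide's goal of reducing infrastructure.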

Baseline Requirements
❖ Support text-based filtering in 16 languages
❖ Fast sorted aggregations across billions of records
❖ Daily updates - millions of new records, plus refresh
❖ Monthly analysis - billions of records
❖ Minimize Pinot cluster server requirements

Off-loading Work from Pinot
❖ Do heavy-lifting in scalable Flink workflow
❖ Build complete segments (all indices) with MapReduce
❖ Optimize star tree performance with “top” flag
❖ Batch means transient/variable size Flink clusters
❖ Use HDFS for segment deep storage
❖ Metadata push of segments

TL;DR
❖ Use built-in Pinot support for
  ❖ ingest-time transformations (where possible)
  ❖ minions (data ingestion, segment munging)
❖ Use Flink for …
  ❖ Doing complex transformations & optimizations
  ❖ Especially for regular batch processing (transient)
  ❖ Offloading work from the Pinot cluster

Thanks! Any Questions?

