Slide 1

Slide 1 text

Using Flink & Pinot for Complex Analytics
Each according to their abilities…

Slide 2

Slide 2 text

20 Seconds on Me
❖ Ken Krugler, Scale Unlimited
❖ I used to run a boutique big data consulting company
❖ Now I’m semi-retired, only doing fun stuff
❖ Active in open source, ASF member, Tika committer
❖ Started working with big data in 2006

Slide 3

Slide 3 text

TL;DR
❖ Pinot supports minions & ingest-time transformations
❖ Flink supports scalable, complex transformations
❖ Working together they are a better solution for …
  ❖ Offloading work from the Pinot cluster
  ❖ Doing complex transformations
  ❖ Optimizing Pinot star tree indexes

Slide 4

Slide 4 text

The Backstory
❖ Our client (Adbeat) provides advertising analytics
❖ Based on collecting many billions of display ads
❖ They have a dashboard and a report service

Slide 5

Slide 5 text

This Is What We Need
❖ Daily updates to main table
❖ Monthly NLP for 2 tables
❖ Many joins of data streams
❖ Complex text analysis
❖ ML for semantic similarity

[Diagram: Flink workflow built from Map, Flat Map, Keyed Reduce, Co-Keyed-Process and KeyedProcess operators. Stages: Collection Source, Join image info, Filter by language, Find ad brands, Join landing pages by original URL, Normalize destination URLs, Join landing pages by normalized URL, Identify Campaigns, Advertiser keyphrases, Publisher keyphrases]

Slide 6

Slide 6 text

Full Workflow

[Diagram: the complete Flink workflow, built from Map, Flat Map, Keyed Reduce, Co-Keyed-Process and KeyedProcess operators. Stages: three Collection Sources, Join image info, Filter by language, Find ad brands, Join landing pages by original URL, Normalize destination URLs, Join landing pages by normalized URL, Identify Campaigns, Advertiser keyphrases, Publisher keyphrases, Write keyphrases Pinot input files, top N grams, autocompletion results, Write autocompletion Pinot input files]

Slide 7

Slide 7 text

This is What We Get (from Pinot)
❖ Pinot supports tasks via minions
  ❖ Can schedule/retry/parallelize tasks
  ❖ Tasks can do data ingestion, segment management
  ❖ Runs on separate “minion nodes”
❖ Data can be transformed on ingest via table config
  ❖ Rename columns, flatten data, etc.
  ❖ But only real-time ingestion

Slide 8

Slide 8 text

Baseline Requirements
❖ Support text-based filtering in 16 languages
❖ Fast sorted aggregations across billions of records
❖ Daily updates - millions of new records, plus refresh
❖ Monthly analysis - billions of records
❖ Minimize 24x7 server requirements

Slide 9

Slide 9 text

DIY Text Search
❖ Do text analysis (normalized tokens) in Flink workflow
❖ Store as multi-valued string column
❖ Do same analysis at query time for where clause
❖ Create shingles (aka n-grams) for approx. phrase query
  ❖ “Fast text search!” => “fast”, “fast text”, “text”, “text search”, “search”
❖ No positional data means some false positives
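The shingle step above can be sketched in a few lines of Python. This is only an illustration, not the workflow's actual analyzer: the regex tokenizer is a stand-in for real per-language analysis, and the names `normalize`, `shingles`, and `max_size` are hypothetical.

```python
import re


def normalize(text):
    """Lowercase and split on non-word characters.

    Assumption: a simple stand-in for the real text analysis,
    which has to handle 16 languages.
    """
    return re.findall(r"\w+", text.lower())


def shingles(text, max_size=2):
    """Return word n-grams of size 1..max_size, in token order."""
    tokens = normalize(text)
    out = []
    for i in range(len(tokens)):
        for size in range(1, max_size + 1):
            if i + size <= len(tokens):
                out.append(" ".join(tokens[i:i + size]))
    return out


print(shingles("Fast text search!"))
# ['fast', 'fast text', 'text', 'text search', 'search']
```

The resulting list is what would be stored in the multi-valued string column, and the same function would be applied to the user's query text before building the where clause.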

Slide 10

Slide 10 text

Unintended Benefits
❖ Lowers time to build segments
  ❖ We’re doing text analysis in transient, scalable Flink workflow
❖ “Fuzzy” matching for phrases is often a win
  ❖ “search text faster” also matches “faster text search”

Slide 11

Slide 11 text

Baseline Requirements
❖ Support text-based filtering in 16 languages
❖ Fast sorted aggregations across billions of records
❖ Daily updates - millions of new records, plus refresh
❖ Monthly analysis - billions of records
❖ Minimize 24x7 server requirements

Slide 12

Slide 12 text

Problematic Query
❖ Filtering: where country = 'us'
❖ Grouping: group by adHash
❖ Aggregating: sum(adSpend)
❖ Sorting: order by sum(adSpend) desc
❖ Paging: limit 1000

Slide 13

Slide 13 text

The Curse of High Cardinality
❖ Star trees are a win if pre-aggregation generates far fewer entries
❖ 25B records => 6B unique ads, so maybe 4x win
❖ Still ≈ 10 seconds for big aggregation query
❖ Need to figure out how to sum fewer entries

Slide 14

Slide 14 text

Using The Power Curve
❖ The “80/20” rule - actually the 70/2 rule for us
❖ 70% of ad spend is by top 6K ads, out of 370K ads
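To make the power-curve measurement concrete, here is a small sketch that finds the highest-spending ads that together cover a target share of total spend. It uses an exact sort over toy data, which is an assumption made for readability and would not be how billions of records are handled; the function name `top_ads_covering` and the sample values are hypothetical.

```python
def top_ads_covering(spend_by_ad, share=0.70):
    """Return the highest-spend ads whose combined spend reaches `share` of the total.

    Exact computation over an in-memory dict; fine for illustration only.
    """
    total = sum(spend_by_ad.values())
    running = 0.0
    top = []
    ranked = sorted(spend_by_ad.items(), key=lambda kv: kv[1], reverse=True)
    for ad, spend in ranked:
        if running >= share * total:
            break
        top.append(ad)
        running += spend
    return top


# Toy data: two ads out of six carry 70% of the spend.
spend = {"a": 500.0, "b": 200.0, "c": 100.0, "d": 100.0, "e": 50.0, "f": 50.0}
top = top_ads_covering(spend)
print(top, f"{len(top) / len(spend):.0%} of ads")
```

With the figures on the slide, 6K out of 370K ads is roughly 1.6% of ads, which is where the "70/2" label comes from.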

Slide 15

Slide 15 text

Add a “Top Ads” Flag
❖ KLLFloatsSketch, from Apache DataSketches project
❖ Flags ads that have majority of spend (approx)
❖ Use that flag in star tree definition as dimension
❖ Now ≈ 300ms for the same query
❖ Unsupported filter dimensions means falling back to regular query without where topAds = 'true'
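One way to read the fallback rule in the last bullet is sketched below: add the `topAds = 'true'` filter only when every filter column is a star tree dimension, and otherwise issue the regular query. This is a guess at the logic, not the actual service code. The table name `ads`, the dimension set, and the helper `build_query` are hypothetical; only `country`, `adHash`, `adSpend`, and `topAds` come from the slides.

```python
# Assumption: the columns configured as star tree dimensions.
STAR_TREE_DIMENSIONS = {"country", "topAds", "adHash"}


def quote(value):
    """Quote a string literal for SQL by doubling single quotes (sketch only)."""
    return "'" + str(value).replace("'", "''") + "'"


def build_query(filters, table="ads", limit=1000):
    """Build the sorted aggregation query, using the topAds flag when possible."""
    clauses = [f"{col} = {quote(val)}" for col, val in filters.items()]
    # Only add the flag when all filter columns are star tree dimensions;
    # otherwise fall back to the regular (slower) query.
    if set(filters) <= STAR_TREE_DIMENSIONS:
        clauses.append("topAds = 'true'")
    where = " WHERE " + " AND ".join(clauses) if clauses else ""
    return (
        f"SELECT adHash, SUM(adSpend) FROM {table}{where} "
        f"GROUP BY adHash ORDER BY SUM(adSpend) DESC LIMIT {limit}"
    )


print(build_query({"country": "us"}))
print(build_query({"country": "us", "language": "en"}))
```

The first call includes the flag and can be answered from the star tree; the second filters on a column outside the assumed dimension set, so the flag is left out.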

Slide 16

Slide 16 text

Baseline Requirements
❖ Support text-based filtering in 16 languages
❖ Fast sorted aggregations across billions of records
❖ Daily updates - millions of new records, plus refresh
❖ Monthly analysis - billions of records
❖ Minimize 24x7 server requirements

Slide 17

Slide 17 text

Data Processing Flow
[Diagram: S3, HDFS, Flink Workflow]
Process all days in current month with Flink

Slide 18

Slide 18 text

Data Processing Flow
[Diagram: S3, HDFS, Flink Workflow, HDFS Text Segments]
Write compressed CSV segment files to HDFS
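As a rough illustration of what one compressed CSV input file might contain, the sketch below serializes a few rows to gzip-compressed CSV bytes. Several things here are assumptions rather than details from the talk: gzip as the compression format, the `terms` column name, the sample row, and `;` as the delimiter for multi-valued fields (which should match whatever the segment generation job's CSV reader is configured to use). Writing the bytes to HDFS is left out.

```python
import csv
import gzip
import io

# Assumption: multi-valued fields are joined with ';' inside a single CSV cell.
MULTI_VALUE_DELIMITER = ";"
COLUMNS = ["adHash", "country", "adSpend", "terms"]


def to_gzip_csv(rows):
    """Serialize rows (dicts) to gzip-compressed CSV bytes with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(COLUMNS)
    for row in rows:
        writer.writerow([
            row["adHash"],
            row["country"],
            row["adSpend"],
            MULTI_VALUE_DELIMITER.join(row["terms"]),
        ])
    return gzip.compress(buffer.getvalue().encode("utf-8"))


rows = [{"adHash": "abc123", "country": "us", "adSpend": 12.5,
         "terms": ["fast", "fast text", "text"]}]
data = to_gzip_csv(rows)
print(gzip.decompress(data).decode("utf-8"))
```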

Slide 19

Slide 19 text

Data Processing Flow
[Diagram: S3, HDFS, Flink Workflow, HDFS Text Segments, MapReduce SegGen, HDFS]
Use Pinot MapReduce job to build segments

Slide 20

Slide 20 text

Data Processing Flow
[Diagram: S3, HDFS, Flink Workflow, HDFS Text Segments, MapReduce SegGen, HDFS, Pinot Cluster]
Trigger metadata push to Pinot cluster

Slide 21

Slide 21 text

Doing A Lot With Flink

Slide 22

Slide 22 text

Baseline Requirements
❖ Support text-based filtering in 16 languages
❖ Fast sorted aggregations across billions of records
❖ Daily updates - millions of new records, plus refresh
❖ Monthly analysis - billions of records
❖ Minimize 24x7 server requirements

Slide 23

Slide 23 text

Auto-completion with Pinot
❖ Meta-goal is to reduce infrastructure cost/complexity
❖ One output from monthly analysis is auto-completion
❖ Table with keyword, weight, ngrams (multi-value)

Keyword           Weight   Ngrams
mortgage          0.92     mor,mort,mortg
mortgage broker   1.23     mor,mort,mortg,bro,brok,broke
mortality         1.43     mor,mort,morta
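The Ngrams column in the table above looks like per-word prefixes of 3 to 5 characters. A minimal sketch that reproduces those values and does a lookup follows. The length bounds are inferred from the three sample rows, and the ranking by descending weight, the truncation of long input to five characters, and the names `prefix_ngrams` and `suggest` are assumptions, not details given in the talk.

```python
def prefix_ngrams(keyword, min_len=3, max_len=5):
    """Return the prefixes of each word, from min_len to max_len characters."""
    grams = []
    for word in keyword.lower().split():
        for n in range(min_len, min(max_len, len(word)) + 1):
            grams.append(word[:n])
    return grams


# The three sample rows from the slide: (keyword, weight).
KEYWORDS = [("mortgage", 0.92), ("mortgage broker", 1.23), ("mortality", 1.43)]
TABLE = [(kw, weight, prefix_ngrams(kw)) for kw, weight in KEYWORDS]


def suggest(table, typed, max_len=5):
    """Return keywords whose ngrams contain the typed prefix, heaviest first."""
    # Assumption: input longer than the longest stored prefix is truncated.
    prefix = typed.lower().strip()[:max_len]
    hits = [(weight, kw) for kw, weight, grams in table if prefix in grams]
    return [kw for weight, kw in sorted(hits, reverse=True)]


print(prefix_ngrams("mortgage broker"))
print(suggest(TABLE, "mort"))
```

In Pinot the lookup would be an equality filter on the multi-valued ngrams column with an order by on weight, rather than the in-memory scan used here.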

Slide 24

Slide 24 text

Baseline Requirements
❖ Support text-based filtering in 16 languages
❖ Fast sorted aggregations across billions of records
❖ Daily updates - millions of new records, plus refresh
❖ Monthly analysis - billions of records
❖ Minimize Pinot cluster server requirements

Slide 25

Slide 25 text

Off-loading Work from Pinot
❖ Do heavy-lifting in scalable Flink workflow
❖ Build complete segments (all indices) with MapReduce
❖ Optimize star tree performance with “top” flag
❖ Batch means transient/variable size Flink clusters
❖ Use HDFS for segment deep storage
❖ Metadata push of segments

Slide 26

Slide 26 text

TL;DR
❖ Use built-in Pinot support for
  ❖ ingest-time transformations (where possible)
  ❖ minions (data ingestion, segment munging)
❖ Use Flink for …
  ❖ Doing complex transformations & optimizations
  ❖ Especially for regular batch processing (transient)
  ❖ Offloading work from the Pinot cluster

Slide 27

Slide 27 text

Thanks! Any Questions?