Using Flink & Pinot for Complex User-Facing Analytics (Ken Krugler, Scale Unlimited) | RTA Summit 2023

Find out how to effectively do heavy lifting (text analysis, NLP, ML) using Flink streaming workflows as a front end for Pinot.

StarTree

May 23, 2023

Transcript

  1. 20 Seconds on Me
    ❖ Ken Krugler, Scale Unlimited
    ❖ I used to run a boutique big data consulting company
    ❖ Now I’m semi-retired, only doing fun stuff
    ❖ Active in open source, ASF member, Tika committer
    ❖ Started working with big data in 2006
  2. TL;DR
    ❖ Pinot supports minions & ingest-time transformations
    ❖ Flink supports scalable, complex transformations
    ❖ Working together they are a better solution for …
      ❖ Offloading work from the Pinot cluster
      ❖ Doing complex transformations
      ❖ Optimizing Pinot star tree indexes
  3. The Backstory
    ❖ Our client (Adbeat) provides advertising analytics
    ❖ Based on collecting many billions of display ads
    ❖ They have a dashboard and a report service
  4. This Is What We Need
    [Workflow diagram. Stage labels: Collection Source, Join image info, Filter by language, Find ad brands, Join landing pages by original URL, Normalize destination URLs, Join landing pages by normalized URL, Identify Campaigns, Advertiser keyphrases, Publisher keyphrases. Operators shown: Map, Flat Map, Keyed Reduce, KeyedProcess, Co-Keyed-Process.]
    ❖ Daily updates to main table
    ❖ Monthly NLP for 2 tables
    ❖ Many joins of data streams
    ❖ Complex text analysis
    ❖ ML for semantic similarity
  5. Full Workflow
    [Workflow diagram. Stage labels: Collection Source (three sources), Join image info, Filter by language, Find ad brands, Join landing pages by original URL, Normalize destination URLs, Join landing pages by normalized URL, Identify Campaigns, Advertiser keyphrases, Publisher keyphrases, Write keyphrases Pinot input files, top N grams, autocompletion results, Write autocompletion Pinot input files. Operators shown: Map, Flat Map, Keyed Reduce, KeyedProcess, Co-Keyed-Process.]
  6. This is What We Get (from Pinot)
    ❖ Pinot supports tasks via minions
      ❖ Can schedule/retry/parallelize tasks
      ❖ Tasks can do data ingestion, segment management
      ❖ Runs on separate “minion nodes”
    ❖ Data can be transformed on ingest via table config
      ❖ Rename columns, flatten data, etc.
      ❖ But only real-time ingestion
  7. Baseline Requirements
    ❖ Support text-based filtering in 16 languages
    ❖ Fast sorted aggregations across billions of records
    ❖ Daily updates - millions of new records, plus refresh
    ❖ Monthly analysis - billions of records
    ❖ Minimize 24x7 server requirements
  8. DIY Text Search
    ❖ Do text analysis (normalized tokens) in Flink workflow
    ❖ Store as multi-valued string column
    ❖ Do same analysis at query time for where clause
    ❖ Create shingles (aka n-grams) for approx. phrase query
      ❖ “Fast text search!” => “fast”, “fast text”, “text”, “text search”, “search”
      ❖ No positional data means some false positives
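The shingle example on this slide can be reproduced with a few lines of code. This is a minimal sketch, not the workflow's actual analyzer: the regex tokenizer and the function names `analyze` and `shingles` are assumptions made for illustration, and a real multi-language analyzer would also handle stemming, stop words, and language-specific tokenization.

```python
import re


def analyze(text):
    """Lowercase the text and split it into word tokens (stand-in analyzer)."""
    return re.findall(r"\w+", text.lower())


def shingles(tokens, max_size=2):
    """Return every run of 1..max_size adjacent tokens, joined by spaces."""
    terms = []
    for start in range(len(tokens)):
        for size in range(1, max_size + 1):
            if start + size <= len(tokens):
                terms.append(" ".join(tokens[start:start + size]))
    return terms


# These terms would be stored in the multi-valued string column.
print(shingles(analyze("Fast text search!")))
# ['fast', 'fast text', 'text', 'text search', 'search']
```

Running the same `analyze` and `shingles` steps on the user's query text yields terms that can be compared directly against the stored column values.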
  9. Unintended Benefits
    ❖ Lowers time to build segments
      ❖ We’re doing text analysis in transient, scalable Flink workflow
    ❖ “Fuzzy” matching for phrases is often a win
      ❖ “search text faster” also matches “faster text search”
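To see why matching without positional data is "fuzzy", it may help to model a stored multi-valued column as a set of terms. The sketch below is an assumption about how such matching could behave, not the talk's implementation: `terms_for` and `matches_all` are hypothetical helpers, and the choice to require every query term is an illustrative one.

```python
import re


def terms_for(text, max_size=2):
    """Build the set of 1..max_size word shingles for a piece of text."""
    tokens = re.findall(r"\w+", text.lower())
    return {
        " ".join(tokens[i:i + n])
        for i in range(len(tokens))
        for n in range(1, max_size + 1)
        if i + n <= len(tokens)
    }


def matches_all(doc_terms, query_terms):
    """True when every query term appears among the document's terms."""
    return set(query_terms) <= doc_terms


# Word-level query: the same words in a different order still match.
doc = terms_for("faster text search")
word_query = re.findall(r"\w+", "search text faster")
print(matches_all(doc, word_query))  # True

# Two-word shingles approximate a phrase, but without positions a document
# can contain both shingles without containing the full phrase.
other = terms_for("fast text editors with text search")
print(matches_all(other, ["fast text", "text search"]))  # True (false positive)
```

The second case is the false positive mentioned for shingle-based phrase queries: both bigrams are present, yet "fast text search" never occurs as a phrase.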
  10. Baseline Requirements
    ❖ Support text-based filtering in 16 languages
    ❖ Fast sorted aggregations across billions of records
    ❖ Daily updates - millions of new records, plus refresh
    ❖ Monthly analysis - billions of records
    ❖ Minimize 24x7 server requirements
  11. Problematic Query
    ❖ Filtering: where country = 'us'
    ❖ Grouping: group by adHash
    ❖ Aggregating: sum(adSpend)
    ❖ Sorting: order by sum(adSpend) desc
    ❖ Paging: limit 1000
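Put together, the five clauses form a single query. The sketch below assembles it as a SQL string and mirrors it with a small in-memory equivalent; the table name `ads` and the sample rows are hypothetical, and the Python function only illustrates the shape of the work, not how Pinot executes it.

```python
from collections import defaultdict

# The slide's clauses combined; "ads" is a placeholder table name.
SQL = (
    "SELECT adHash, SUM(adSpend) FROM ads "
    "WHERE country = 'us' "
    "GROUP BY adHash "
    "ORDER BY SUM(adSpend) DESC "
    "LIMIT 1000"
)


def top_ads(rows, country, limit):
    """Filter, group, sum, sort, and page, in the same order as the query."""
    totals = defaultdict(float)
    for row in rows:
        if row["country"] == country:
            totals[row["adHash"]] += row["adSpend"]
    # Every group must be summed before the sort can pick the top entries.
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


rows = [
    {"country": "us", "adHash": "a1", "adSpend": 5.0},
    {"country": "us", "adHash": "a2", "adSpend": 7.5},
    {"country": "de", "adHash": "a1", "adSpend": 9.0},
    {"country": "us", "adHash": "a1", "adSpend": 4.0},
]
print(top_ads(rows, "us", 1))  # [('a1', 9.0)]
```

The limit applies only after sorting on the aggregate, so it does not reduce the number of groups that have to be summed.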
  12. The Curse of High Cardinality
    ❖ Star trees are a win if pre-aggregation generates far fewer entries
    ❖ 25B records => 6B unique ads, so maybe 4x win
    ❖ Still ≈ 10 seconds for big aggregation query
    ❖ Need to figure out how to sum fewer entries
  13. Using The Power Curve
    ❖ The “80/20” rule - actually the 70/2 rule for us
    ❖ 70% of ad spend is by top 6K ads, out of 370K ads
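The "70/2" figure follows from the numbers on the slide: 6K out of 370K ads is a little under 2% of all ads. The sketch below checks that arithmetic and adds a small helper for finding how many top ads cover a target share of spend; the helper name `ads_needed_for_share` and the sample spends are made up for illustration.

```python
# Figures from the slide: the top 6K ads out of 370K carry 70% of spend.
top_ads = 6_000
all_ads = 370_000
share_of_ads = top_ads / all_ads
print(f"{share_of_ads:.1%} of ads")  # 1.6% of ads, rounded up to "2" on the slide


def ads_needed_for_share(spends, target_share):
    """Count how many of the largest spends are needed to reach target_share."""
    total = sum(spends)
    running = 0.0
    for count, spend in enumerate(sorted(spends, reverse=True), start=1):
        running += spend
        if running >= target_share * total:
            return count
    return len(spends)


# Hypothetical, heavily skewed spend values.
print(ads_needed_for_share([60, 15, 10, 10, 5], 0.7))  # 2
```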
  14. Add a “Top Ads” Flag
    ❖ KLLFloatsSketch, from Apache Datasketches project
    ❖ Flags ads that have majority of spend (approx)
    ❖ Use that flag in star tree definition as dimension
    ❖ Now ≈ 300ms for the same query
    ❖ Unsupported filter dimensions means falling back to regular query without where topAds = 'true'
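The flag and the fallback rule can be sketched as follows. Note the substitutions: instead of the approximate KLL sketch, this version sorts exact per-ad totals, which is only practical for small data; `STAR_TREE_DIMENSIONS`, the `ads` table, the `language` column, and both function names are hypothetical; and the fallback logic reflects one reading of the last bullet rather than the speaker's code.

```python
def flag_top_ads(spend_by_ad, target_share=0.7):
    """Flag the highest-spend ads until they cover target_share of total spend.

    Exact stand-in for the approximate, sketch-based flagging on the slide.
    """
    total = sum(spend_by_ad.values())
    flags = {}
    running = 0.0
    ranked = sorted(spend_by_ad.items(), key=lambda kv: kv[1], reverse=True)
    for ad, spend in ranked:
        # An ad is flagged while the ads ranked above it fall short of the target.
        flags[ad] = running < target_share * total
        running += spend
    return flags


# Assumed set of filter dimensions covered by the star tree (hypothetical).
STAR_TREE_DIMENSIONS = {"topAds", "country"}


def build_query(filters):
    """Add the topAds filter only when all filters are star-tree dimensions."""
    clauses = [f"{column} = '{value}'" for column, value in filters.items()]
    if set(filters) <= STAR_TREE_DIMENSIONS:
        clauses.append("topAds = 'true'")
    return (
        "SELECT adHash, SUM(adSpend) FROM ads WHERE "
        + " AND ".join(clauses)
        + " GROUP BY adHash ORDER BY SUM(adSpend) DESC LIMIT 1000"
    )


print(flag_top_ads({"a": 60, "b": 15, "c": 10, "d": 10, "e": 5}))
print(build_query({"country": "us"}))   # uses the topAds flag
print(build_query({"language": "en"}))  # falls back to the regular query
```

Values are interpolated into the SQL string without escaping, which is acceptable for a sketch but not for user-supplied input.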
  15. Baseline Requirements
    ❖ Support text-based filtering in 16 languages
    ❖ Fast sorted aggregations across billions of records
    ❖ Daily updates - millions of new records, plus refresh
    ❖ Monthly analysis - billions of records
    ❖ Minimize 24x7 server requirements
  16. Data Processing Flow
    [Diagram labels, in order: S3, HDFS, Flink Workflow, HDFS Text Segments]
    Write compressed CSV segment files to HDFS
  17. Data Processing Flow
    [Diagram labels, in order: S3, HDFS, Flink Workflow, HDFS Text Segments, MapReduce SegGen, HDFS]
    Use Pinot MapReduce job to build segments
  18. Data Processing Flow
    [Diagram labels, in order: S3, HDFS, Flink Workflow, HDFS Text Segments, MapReduce SegGen, HDFS, Pinot Cluster]
    Trigger metadata push to Pinot cluster
  19. Baseline Requirements
    ❖ Support text-based filtering in 16 languages
    ❖ Fast sorted aggregations across billions of records
    ❖ Daily updates - millions of new records, plus refresh
    ❖ Monthly analysis - billions of records
    ❖ Minimize 24x7 server requirements
  20. Auto-completion with Pinot
    ❖ Meta-goal is to reduce infrastructure cost/complexity
    ❖ One output from monthly analysis is auto-completion
    ❖ Table with keyword, weight, ngrams (multi-value)

    Keyword          Weight  Ngrams
    mortgage         0.92    mor,mort,mortg
    mortgage broker  1.23    mor,mort,mortg,bro,brok,broke
    mortality        1.43    mor,mort,morta
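The ngrams column in the table is consistent with taking the 3- to 5-character prefixes of each word in the keyword. The sketch below reproduces the three rows under that assumption and shows a lookup ranked by weight; the length bounds are inferred from the table, and `prefix_ngrams` and `suggest` are hypothetical names, not part of Pinot or the original workflow.

```python
def prefix_ngrams(keyword, min_len=3, max_len=5):
    """Return the min_len..max_len character prefixes of each word."""
    grams = []
    for word in keyword.lower().split():
        for length in range(min_len, min(max_len, len(word)) + 1):
            grams.append(word[:length])
    return grams


# Keyword and weight values copied from the slide's table.
ROWS = [
    (keyword, weight, prefix_ngrams(keyword))
    for keyword, weight in [
        ("mortgage", 0.92),
        ("mortgage broker", 1.23),
        ("mortality", 1.43),
    ]
]


def suggest(typed, rows, limit=10):
    """Return keywords whose ngrams contain the typed text, heaviest first."""
    hits = [(kw, weight) for kw, weight, grams in rows if typed.lower() in grams]
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return [kw for kw, _ in hits[:limit]]


print(prefix_ngrams("mortgage broker"))
# ['mor', 'mort', 'mortg', 'bro', 'brok', 'broke']
print(suggest("mort", ROWS))
# ['mortality', 'mortgage broker', 'mortgage']
```

With these bounds, typed text shorter than 3 or longer than 5 characters finds no match, so a caller would need to handle those cases separately.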
  21. Baseline Requirements
    ❖ Support text-based filtering in 16 languages
    ❖ Fast sorted aggregations across billions of records
    ❖ Daily updates - millions of new records, plus refresh
    ❖ Monthly analysis - billions of records
    ❖ Minimize Pinot cluster server requirements
  22. Off-loading Work from Pinot
    ❖ Do heavy-lifting in scalable Flink workflow
    ❖ Build complete segments (all indices) with MapReduce
    ❖ Optimize star tree performance with “top” flag
    ❖ Batch means transient/variable size Flink clusters
    ❖ Use HDFS for segment deep storage
    ❖ Metadata push of segments
  23. TL;DR
    ❖ Use built-in Pinot support for
      ❖ ingest-time transformations (where possible)
      ❖ minions (data ingestion, segment munging)
    ❖ Use Flink for …
      ❖ Doing complex transformations & optimizations
      ❖ Especially for regular batch processing (transient)
      ❖ Offloading work from the Pinot cluster