❖ I used to run a boutique big data consulting company
❖ Now I’m semi-retired, only doing fun stuff
❖ Active in open source, ASF member, Tika committer
❖ Started working with big data in 2006
Flink supports scalable, complex transformations
❖ Working together they are a better solution for …
❖ Offloading work from the Pinot cluster
❖ Doing complex transformations
❖ Optimizing Pinot star tree indexes
tasks via minions
❖ Can schedule/retry/parallelize tasks
❖ Tasks can do data ingestion, segment management
❖ Runs on separate “minion nodes”
❖ Data can be transformed on ingest via table config
❖ Rename columns, flatten data, etc.
❖ But only real-time ingestion
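To make the table-config transforms above concrete, here is a minimal sketch that builds the `ingestionConfig` fragment of a Pinot table config as a Python dict and prints it as JSON. The column names (`ad_text`, `adText`, `payload.items`) are hypothetical. The sketch also assumes that a bare source-field name is accepted as the transform expression for a rename, so check the keys and expressions against the Pinot version in use.

```python
import json


def build_ingestion_config():
    """Return a hypothetical ingestionConfig fragment for a Pinot table config."""
    return {
        # Rename: populate column "adText" from source field "ad_text".
        # Assumption: a bare field name is a valid transform expression.
        "transformConfigs": [
            {"columnName": "adText", "transformFunction": "ad_text"},
        ],
        # Flatten: unnest the (hypothetical) nested array "payload.items".
        "complexTypeConfig": {
            "fieldsToUnnest": ["payload.items"],
            "delimiter": ".",
        },
    }


if __name__ == "__main__":
    print(json.dumps({"ingestionConfig": build_ingestion_config()}, indent=2))
```

Anything that does not fit this declarative form is the kind of transformation the deck moves into Flink.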
❖ Fast sorted aggregations across billions of records
❖ Daily updates - millions of new records, plus refresh
❖ Monthly analysis - billions of records
❖ Minimize 24x7 server requirements
Flink workflow
❖ Store as multi-valued string column
❖ Do same analysis at query time for where clause
❖ Create shingles (aka n-grams) for approx. phrase query
❖ “Fast text search!” => “fast”, “fast text”, “text”, “text search”, “search”
❖ No positional data means some false positives
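The shingling step can be sketched as follows. This is a minimal illustration, not the actual Flink code. The tokenizer (lowercase, split on non-alphanumerics) and the maximum shingle size of 2 are assumptions chosen to reproduce the example on the slide.

```python
import re


def shingles(text, max_size=2):
    """Return word n-grams of size 1..max_size, in position order."""
    # Assumed analysis: lowercase, keep runs of letters and digits as tokens.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    terms = []
    for i in range(len(tokens)):
        for size in range(1, max_size + 1):
            if i + size <= len(tokens):
                terms.append(" ".join(tokens[i:i + size]))
    return terms


if __name__ == "__main__":
    # Terms that would be stored in the multi-valued string column.
    print(shingles("Fast text search!"))
```

The same function would be applied to the user's query text, so that the stored terms and the terms in the where clause come from identical analysis.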
We’re doing text analysis in transient, scalable Flink workflow
❖ “Fuzzy” matching for phrases is often a win
❖ “search text faster” also matches “faster text search”
win if pre-aggregation generates far fewer entries
❖ 25B records => 6B unique ads, so maybe 4x win
❖ Still ≈ 10 seconds for big aggregation query
❖ Need to figure out how to sum fewer entries
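The arithmetic behind the "maybe 4x" estimate, together with a toy version of pre-aggregation, can be sketched as below. The record layout `(ad_id, spend)` and both function names are hypothetical. The real pre-aggregation happens inside Pinot's star tree index, not in user code.

```python
from collections import defaultdict


def pre_aggregate(records):
    """Collapse (ad_id, spend) records into one summed entry per ad."""
    totals = defaultdict(float)
    for ad_id, spend in records:
        totals[ad_id] += spend
    return dict(totals)


def reduction_factor(raw_count, unique_count):
    """How many times fewer entries must be summed after pre-aggregation."""
    return raw_count / unique_count


if __name__ == "__main__":
    sample = [("ad1", 10.0), ("ad2", 5.0), ("ad1", 2.5), ("ad3", 1.0), ("ad2", 0.5)]
    print(pre_aggregate(sample))
    # 25B raw records collapsing to 6B unique ads.
    print(round(reduction_factor(25e9, 6e9), 2))
```

A factor of roughly 4 still leaves billions of entries to sum per query, which is why the query remains slow at this stage.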
project
❖ Flags ads that have majority of spend (approx)
❖ Use that flag in star tree definition as dimension
❖ Now ≈ 300ms for the same query
❖ Unsupported filter dimensions mean falling back to a regular query, without where topAds = 'true'
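One plausible reading of the "top" flag is sketched below: mark the highest-spend ads that together account for a majority of total spend. This is an assumption about the approach, not the actual implementation, which the slide describes only as approximate. The function name, the input layout, and the 50% threshold are all hypothetical.

```python
def flag_top_ads(spend_by_ad, share=0.5):
    """Return {ad_id: bool}; True for the highest-spend ads that together
    reach at least `share` of total spend."""
    total = sum(spend_by_ad.values())
    flags = {ad_id: False for ad_id in spend_by_ad}
    running = 0.0
    # Walk ads from highest to lowest spend, stopping once the share is covered.
    for ad_id, spend in sorted(spend_by_ad.items(), key=lambda kv: kv[1], reverse=True):
        if running >= share * total:
            break
        flags[ad_id] = True
        running += spend
    return flags


if __name__ == "__main__":
    print(flag_top_ads({"a": 60.0, "b": 25.0, "c": 10.0, "d": 5.0}))
```

Because the flag is a low-cardinality dimension in the star tree, a query filtered with `topAds = 'true'` sums entries for only the small set of flagged ads.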
❖ One output from monthly analysis is auto-completion
❖ Table with keyword, weight, ngrams (multi-value)

Keyword          Weight  Ngrams
mortgage         0.92    mor,mort,mortg
mortgage broker  1.23    mor,mort,mortg,bro,brok,broke
mortality        1.43    mor,mort,morta
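The ngrams column in the table above is consistent with per-word prefixes of 3 to 5 characters. The sketch below generates them and runs a toy lookup. The length bounds are inferred from the three sample rows and may not reflect the real limits. Ranking by descending weight is likewise an assumption.

```python
def prefix_ngrams(keyword, min_len=3, max_len=5):
    """Return the prefixes (min_len..max_len chars) of each word in keyword."""
    ngrams = []
    for word in keyword.lower().split():
        for n in range(min_len, min(max_len, len(word)) + 1):
            ngrams.append(word[:n])
    return ngrams


def suggest(rows, typed):
    """Return keywords whose ngrams contain `typed`, highest weight first.

    rows is a list of (keyword, weight) pairs. The ordering is an assumption.
    """
    matches = [(kw, w) for kw, w in rows if typed.lower() in prefix_ngrams(kw)]
    return [kw for kw, _ in sorted(matches, key=lambda kw_w: kw_w[1], reverse=True)]


if __name__ == "__main__":
    rows = [("mortgage", 0.92), ("mortgage broker", 1.23), ("mortality", 1.43)]
    print(prefix_ngrams("mortgage broker"))
    print(suggest(rows, "mor"))
```

In Pinot the lookup would be a filter on the multi-valued ngrams column, so no prefix scan over the keyword column is needed at query time.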
❖ Fast sorted aggregations across billions of records
❖ Daily updates - millions of new records, plus refresh
❖ Monthly analysis - billions of records
❖ Minimize Pinot cluster server requirements
workflow
❖ Build complete segments (all indices) with MapReduce
❖ Optimize star tree performance with “top” flag
❖ Batch means transient/variable size Flink clusters
❖ Use HDFS for segment deep storage
❖ Metadata push of segments
transformations (where possible)
❖ minions (data ingestion, segment munging)
❖ Use Flink for …
❖ Doing complex transformations & optimizations
❖ Especially for regular batch processing (transient)
❖ Offloading work from the Pinot cluster