Approximate Algorithms and Sketches in Druid

Summarization, Approximation, Sampling: Improving Query Times on Big Data
OR
Approximate Algorithms and Sketches In Druid
Lee Rhodes, Eric Tschetter. Much debt owed to Kevin Lang.
Hadoop Summit, 2015; Druid Meetup, Aug 2015

2 Yahoo Analyzes Big Data
•  Tens of billions of user events / day
•  Complex behaviors and interactions of User Groups

3 Thus, Much Of Our Analysis Is Keyed From Unique Identifiers … That Appear Many Times
•  B-Cookies
•  Device-IDs
•  YUIDs / SIDs
•  Session IDs
•  Advertiser IDs
•  Etc.
[Diagram label: Big Data]

4 Arithmetic Operations possible with additive metrics … (e.g., # views, # clicks …)
•  Merging, Partitioning: $\sum_i C_i = \sum_j C_j$
•  Dimensional Summing: $C = \sum_{\Delta i, \Delta j, \Delta k} C_{i,j,k}$
•  Summing over Time: $\sum_t C_t$ over $t_1, t_2, t_3$
•  Differencing: $\sum C_1 - \sum C_2$

5 … Are Not Possible with Metrics that have Duplicates (e.g., # B-Cookies, # Session-IDs, …)
•  The same four operations: Merging, Partitioning; Dimensional Summing; Summing over Time; Differencing
•  Called unique, or “non-additive” metrics.
•  Difficult computational challenge at large scale
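A small illustration of the difference between additive and unique metrics may help here. The partition contents and user IDs below are made up for the example; they do not come from the slides.

```python
# Hypothetical per-partition data: a click count and the set of visitor IDs.
part_a = {"clicks": 3, "visitors": {"u1", "u2", "u3"}}
part_b = {"clicks": 4, "visitors": {"u2", "u3", "u4", "u5"}}

# Additive metric: partition totals can simply be summed.
total_clicks = part_a["clicks"] + part_b["clicks"]

# Unique metric: summing per-partition unique counts double-counts u2 and u3.
naive_uniques = len(part_a["visitors"]) + len(part_b["visitors"])

# The correct answer needs the identifiers themselves, not just the counts.
true_uniques = len(part_a["visitors"] | part_b["visitors"])

print(total_clicks, naive_uniques, true_uniques)  # prints: 7 7 5
```

Keeping the full ID sets is what makes exact unique counting expensive at scale.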

6 Proposition… If An Approximate Answer Is Acceptable … There Is Likely A Much More Efficient Solution!
But How Approximate? How Efficient?

7 Welcome to the New Science of Approximate Computing For Big Data *
•  Sampling (Oldest: CS ~1986, Stats: 1700’s)
•  Histograms
•  Wavelets
•  Sketches (Newest: CS ~2008, Theory ~1984)
* Graham Cormode, et al. “Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches”, 2011

8 “Sketch” Applies to a Broad Range of Algorithms
•  Small, fast, single pass, and approximate
•  With mathematically proven error bounds*
* The mathematical proofs are complex!
“Stochastic Streaming Algorithms”, “Approximate Query Processing”
Theor. Math / Comp. Sci: ‘70s: D. Knuth; ‘85: P. Flajolet; Explosion of Papers Since ‘90s-’00s: A. Broder, R. Kumar, E. Cohen
[Diagram labels: Big Data Stream, Transform, White Noise, Data Structure, Estimator, Result +/- ε]

9 The Concert Line: How Long?
•  Long City Block: 1/5 Mile = 1056 Ft, d = 100%; Curb @ 0 ft
•  You @ 2 ft From Curb: d = 2 / 1056 = 0.19%
•  Couples Closer, “Dreamers” Farther Apart
•  1st Estimate: 1/d = 1 person / (2 ft / 1056 ft) = 528 People (Sample contains 1 value)

10 The Concert Line: How Long?
•  Long City Block: 1/5 Mile = 1056 Ft, d = 100%; Curb @ 0 ft
•  You @ 2 ft From Curb: d = 2 / 1056 = 0.19%
•  11th in line @ 30 ft (10 Sidewalk Cracks = 30 feet): d = 30 / 1056 = 2.84%
•  Couples Closer, “Dreamers” Farther Apart
•  1st Estimate: 1/d = 1 person / (2 ft / 1056 ft) = 528 People (Sample contains 1 value)
•  Better Estimate: 11/d = 11 people / (30 ft / 1056 ft) = 387 People (Sample contains 11 values)
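The arithmetic of the two concert-line estimates can be reproduced in a few lines. The function name `estimate_line_length` is only an illustrative choice; the numbers are the ones on the slide.

```python
BLOCK_FT = 1056.0  # one long city block, 1/5 mile


def estimate_line_length(people_seen: int, distance_ft: float) -> float:
    """Scale the people counted in a short stretch up to the whole block."""
    d = distance_ft / BLOCK_FT  # fraction of the block that was observed
    return people_seen / d


first_estimate = estimate_line_length(1, 2.0)     # 1 person within 2 ft
better_estimate = estimate_line_length(11, 30.0)  # 11 people within 30 ft

print(round(first_estimate), round(better_estimate))  # prints: 528 387
```

The second estimate rests on 11 observations instead of 1, so local spacing quirks (couples, "dreamers") average out to a larger degree.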

11 k Minimum Value (KMV) Sketch for Counting Uniques
•  Maintain ordered list of hash values
[Diagram labels: Big Data, Uniform Random Hash → (0,1), Ordered List from 0.0 to 1.0]

12 k Minimum Value (KMV) Sketch for Counting Uniques
•  Maintain ordered list of hash values
•  1st Estimate = 1/d but NOISY!
Data as “White Noise”, ordered hash values: 0.008, 0.145, 0.195, 0.386, 0.437, 0.610, 0.758, 0.822, 0.873, 0.922
d = 0.191: Est = 1 / 0.191 ≈ 5
d = 0.008: Est = 1 / 0.008 ≈ 125

13 k Minimum Value (KMV) Sketch for Counting Uniques
•  Maintain ordered list of hash values
•  Choose k Min Values
•  Better Est = (k-1) / V(kth) = 2 / 0.195 = 10.26 (here k = 3)
Data as “White Noise”, ordered hash values: 0.008, 0.145, 0.195, 0.386, 0.437, 0.610, 0.758, 0.822, 0.873, 0.922
KMV: V(kth) = 0.195
[Diagram labels: Big Data, Uniform Random Hash → (0,1), Ordered List from 0.0 to 1.0]
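The estimate on this slide can be checked directly from the ten hash values shown. This is only a worked calculation; k = 3 is inferred from the slide's "2 / 0.195".

```python
import heapq

# The ten hash values shown on the slide ("data as white noise").
hashes = [0.922, 0.873, 0.822, 0.758, 0.610, 0.437, 0.386, 0.195, 0.145, 0.008]

k = 3
k_min = heapq.nsmallest(k, hashes)  # the k minimum values, in ascending order
v_kth = k_min[-1]                   # V(kth): the largest of the k minimums

first_estimate = 1 / min(hashes)    # noisy estimate from a single value
better_estimate = (k - 1) / v_kth   # KMV estimate from k values

print(round(first_estimate), round(better_estimate, 2))  # prints: 125 10.26
```

With ten distinct inputs, the single-value estimate is off by more than a factor of ten, while the k = 3 estimate lands close to the true count.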

14 k Minimum Value (KMV) Sketch for Counting Uniques
•  Maintain ordered list of hash values
•  Choose Min k Values
•  Better Est = (k-1)/V(kth)
•  Reject hash ≥ V(kth)
•  Reject duplicates
KMV retained values: 0.008, 0.145, 0.195; V(kth) = 0.195

15 k Minimum Value (KMV) Sketch for Counting Uniques
•  Maintain ordered list of hash values
•  Choose Min k Values
•  Better Est = (k-1)/V(kth)
•  Reject hash ≥ kth min
•  Reject duplicates
•  Otherwise insert in order, toss the top value, track V(kth).
KMV retained values: 0.008, 0.100, 0.145; V(kth) = 0.145
Unbiased Est(n): $\hat{n} = (k-1) / V(k\text{th})$
Relative Standard Error: $\mathrm{RSE} = \sigma_{\hat{n}} / n < 1 / \sqrt{k-2}$
Note that RSE is independent of n!
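The update rules listed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' library code: the class name `KMVSketch`, the helper `hash01`, and the use of SHA-256 as the uniform hash are all assumptions made for the example.

```python
import bisect
import hashlib


def hash01(item: str) -> float:
    """Map an item to a pseudo-uniform value in [0, 1). SHA-256 is an assumed choice."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2.0**64


class KMVSketch:
    """Keeps the k smallest distinct hash values in ascending order."""

    def __init__(self, k: int) -> None:
        if k < 3:
            raise ValueError("k must be at least 3")
        self.k = k
        self.values = []  # ascending list of retained hash values

    def update(self, item: str) -> None:
        h = hash01(item)
        if len(self.values) == self.k and h >= self.values[-1]:
            return  # reject hash >= V(kth)
        pos = bisect.bisect_left(self.values, h)
        if pos < len(self.values) and self.values[pos] == h:
            return  # reject duplicates
        self.values.insert(pos, h)  # insert in order
        if len(self.values) > self.k:
            self.values.pop()  # toss the top value

    def estimate(self) -> float:
        if len(self.values) < self.k:
            return float(len(self.values))  # fewer than k uniques: count is exact
        return (self.k - 1) / self.values[-1]  # (k-1) / V(kth)


sketch = KMVSketch(k=1024)
for i in range(50_000):
    sketch.update(f"user-{i}")
    sketch.update(f"user-{i}")  # the duplicate does not change the sketch
print(round(sketch.estimate()))
```

With k = 1024 the bound 1/sqrt(k-2) is roughly 3%, so the printed estimate should typically fall within a few percent of 50,000. A sorted list keeps the code short; a production sketch would use a more efficient structure.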

16 Relative Error Distribution, k = 4K Sketch, Synthesized Data
[Plot curves: 5th, 25th, 75th and 95th %ile; Mean, Median; Act RSE; Th. RSE]

17 Yep, It Scales!

18 Relative Error Distribution, k = 64K Sketch, Real Data
Sketch: Words = 64K; Std Err: <0.4%; Max size = 512KB
Input: (many dimensions) 5.2TB, ~20B events; 5.2M Total Rows = # dim comb; 4M Sketches for Viewing BC
RESULTS: Avg Size ~2.5KB / sketch; 25,604 (0.64%) had any error at all
[Plots: Error Feather Duster Plot; Cum Error Distribution (Measured, Theoretical Upper Bound, Zero Error)]

19 Sketch Update Speed, 64K Sketch, 64-bit Long Inputs
[Plot of Updates/sec; annotated values: 45M, 142M, 20M]

20 Sketch Merge Time / Query
14.5M Sk / Sec / Proc

21 θ Sketches Enable Full Set Operations
Example Queries Easily Answered using Sketches:
•  How many users visited both Sports and Finance within the last day? $S(\mathrm{Sports}_t) \cap S(\mathrm{Finance}_t)$
•  How many non-US users visited Yahoo.com in January that also visited in February? $(S(\mathrm{Jan}) \cap S(\mathrm{Feb})) \setminus S(\mathrm{US})$
Set Expressions On Streams, e.g. $(A \cup B) \cap (C \cup D) \setminus E$
[Diagram labels: Big Data Stream, Batch And Real-time, Sketch Mart (x, y, z), Set Operations, Result]

22 HyperLogLog (HLL) Sketches
§  2008 Theoretical Paper by Philippe Flajolet
§  Smallest Space Sketch for a Given Counting Accuracy
§  Outstanding for Simple Counting and Simple Merging
§  Unsuitable for Set Intersection or Difference
›  Generally relies on “Include / Exclude” approach => Huge Error
§  All source and target sketches in Unions must be the same size, k.
§  Typically slower to update and merge than Theta Sketches

23 Example θ Sketch System Flow Architecture
The Power of “Additive” Intermediates!
[Diagram labels: Data, Sketch, Σ (merge), SetOp (∩, \), Result; Hadoop Grid (Pig); Query Engine (Druid, Java)]

24 θ Sketch Advantages for Analysis of Uniques
§  Scalable
›  User specifies Upper Bound Size / Accuracy Trade-off. Example: 16K Sketch: 2 RSE <= 1.56% @ 95% confidence, max size ~131KB = 8*k bytes
›  Sketch Upper Bound Size is Independent of Input Stream Size
§  “Additive”, Order Insensitive, Duplicate Insensitive, and Parallelizable
›  Merge across arbitrary dimensions or time.
›  Enables ad-hoc queries of unique counts from intermediate sketch data
›  Enables simplified processing of late data
›  Simplifies processing pipeline architectures
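The size / accuracy trade-off quoted above can be tabulated with a short calculation. It assumes the θ-sketch bound RSE < 1/sqrt(k-1) stated elsewhere in this deck and 8 bytes per retained hash value; the function names are illustrative.

```python
import math


def theta_rse_bound(k: int) -> float:
    """Upper bound on the relative standard error, assuming RSE < 1/sqrt(k-1)."""
    return 1.0 / math.sqrt(k - 1)


def max_size_bytes(k: int) -> int:
    """Upper bound on sketch size, assuming 8 bytes per retained hash value."""
    return 8 * k


for k in (4096, 16384, 65536):
    two_rse_pct = 200.0 * theta_rse_bound(k)  # 2 RSE, roughly 95% confidence
    print(f"k={k}: 2 RSE <= {two_rse_pct:.2f}%, max size = {max_size_bytes(k)} bytes")
```

For k = 16384 this gives 2 RSE of about 1.56% and 131,072 bytes, matching the example on the slide. For k = 65536 the RSE is just under 0.4% at 524,288 bytes (512 KiB).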

25 θ Sketch Advantages for Analysis of Uniques
§  Fast
›  Obtain results from a single pass of identity-grain data => Real-Time
›  Updates in 10s of nSec, millions of Sketch Merges per Sec.
§  Analysis of Set Expressions (Union, Intersection, Difference)
›  Accuracy superior to traditional Include / Exclude approach (and HLL).
›  The result of a Set Operation is Another Sketch
›  Can accommodate source and target sketches of different size, k, and different θ-Sketch type.
§  Well Defined (Mathematically Proven) Error Properties
›  The width of the error distribution is a trade-off with sketch size
›  The theoretical error bounds are real-time queryable

26 Tuple Sketches for User Behavior Modeling & Analysis
Benefits
•  Ideal for Analysis of Sets of Users
•  Enables Accurate Estimation and Set Ops
•  High Performance for both Real-time and Batch
•  Standardized Library Will Promote Data Sharing Across Platforms
Under Development
•  Frequency Cap Modeling
•  Stratified Sampling
[Diagram labels: Real-Time Data Stream; Tuple Sketch: θ, b1 Object, b2 Object, b3 Object; Extendable For User Behavior Analysis]

27 Off-Heap Sketches & Swim Lanes
•  Off-Java Heap “Swim Lanes” 1 to N, Thread / CPU
•  Managed via “Memory” Package
[Diagram labels: JVM, Sketch, Data; 4GB, 10GB, 10GB]

Thank You! Questions?

29 Theta (θ) Sketch Framework
•  Enables many sketch algorithms
•  Enables sketches as results
•  Simplifies implementation of set expressions
•  Create variable θ = 1.0
•  Configure k
•  Define: Theta Choosing Function (TCF)
•  Example: θ = last tossed.
•  k can be stochastic
•  Estimate & RSE become very simple
Theta is the probability that the next value will modify the sketch
Unbiased Estimate: $\hat{n} = |S| / \theta$, where $|S|$ = cache (Set) cardinality
Relative Standard Error: $\mathrm{RSE} < 1 / \sqrt{k-1}$
[Diagram values: 0.195, 0.145, 0.100, 0.008, with the k retained values marked below θ]
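A rough sketch of how the framework makes set expressions simple, written as a batch (not streaming) toy. All names here (`ThetaSketch`, `build_sketch`, `union`, `intersect`, `a_not_b`, `hash01`) are hypothetical, SHA-256 is an assumed hash, and θ is set to the (k+1)-th smallest hash as a batch stand-in for the "last tossed" example above.

```python
import hashlib
from dataclasses import dataclass


def hash01(item: str) -> float:
    """Map an item to a pseudo-uniform value in [0, 1). SHA-256 is an assumed choice."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2.0**64


@dataclass(frozen=True)
class ThetaSketch:
    theta: float        # threshold: every retained hash is below theta
    hashes: frozenset   # all distinct hashes of the input that fall below theta

    def estimate(self) -> float:
        return len(self.hashes) / self.theta  # n_hat = |S| / theta


def build_sketch(items, k: int) -> ThetaSketch:
    hs = sorted({hash01(x) for x in items})
    if len(hs) <= k:
        return ThetaSketch(1.0, frozenset(hs))  # exact mode: theta stays 1.0
    return ThetaSketch(hs[k], frozenset(hs[:k]))  # theta = (k+1)-th smallest hash


def union(a: ThetaSketch, b: ThetaSketch) -> ThetaSketch:
    theta = min(a.theta, b.theta)
    return ThetaSketch(theta, frozenset(h for h in a.hashes | b.hashes if h < theta))


def intersect(a: ThetaSketch, b: ThetaSketch) -> ThetaSketch:
    theta = min(a.theta, b.theta)
    return ThetaSketch(theta, frozenset(h for h in a.hashes & b.hashes if h < theta))


def a_not_b(a: ThetaSketch, b: ThetaSketch) -> ThetaSketch:
    theta = min(a.theta, b.theta)
    return ThetaSketch(theta, frozenset(h for h in a.hashes - b.hashes if h < theta))


# Users 0..59,999 visited A; users 40,000..99,999 visited B (synthetic data).
sketch_a = build_sketch((f"user-{i}" for i in range(60_000)), k=4096)
sketch_b = build_sketch((f"user-{i}" for i in range(40_000, 100_000)), k=4096)

print(round(union(sketch_a, sketch_b).estimate()))      # true value: 100,000
print(round(intersect(sketch_a, sketch_b).estimate()))  # true value: 20,000
print(round(a_not_b(sketch_a, sketch_b).estimate()))    # true value: 40,000
```

Each operation returns another `ThetaSketch`, so results can be fed into further set expressions. This toy union does not trim its result back to k entries, which a real implementation would normally do.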

30 Set Intersection and Difference Error
•  Set Theoretic Include / Exclude Equation is based on cardinalities: $|A \cap B| = |A| + |B| - |A \cup B|$
•  Unfortunately, this results in large relative errors when the result is small compared to the largest set:
$|A \cap B| \pm \varepsilon_r = (|A| \pm \varepsilon_A) + (|B| \pm \varepsilon_B) - (|A \cup B| \pm \varepsilon_{A \cup B})$
Because the error components ADD!
The smaller the Relative Result the worse the Relative Error

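31 $\Delta = \cap, \setminus$ = Intersection or Difference
$F = |A \cup B| / |A \,\Delta\, B|$ = Inverse “Broder Rule”
$RE = \mathrm{Measured} / \mathrm{Truth} - 1$
$RE_{\theta}$ = Relative error for θ Sketch Intersection $\approx \sqrt{F} \cdot 1/\sqrt{k}$
$RE_{IE}$ = Relative error for HLL using IE $\approx F \cdot 1/\sqrt{k}$
$RE_{IE} / RE_{\theta} = F / \sqrt{F} = \sqrt{F}$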
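A worked example makes the gap concrete. It assumes the relations on this slide read RE_θ ≈ sqrt(F)/sqrt(k) and RE_IE ≈ F/sqrt(k) (the radicals are a reconstruction); the set sizes and function names are made up for illustration.

```python
import math


def overlap_factor(union_size: float, result_size: float) -> float:
    """F = |A union B| / |A delta B|, where delta is intersection or difference."""
    return union_size / result_size


def re_theta(f: float, k: int) -> float:
    """Approximate relative error of a theta-sketch set operation (assumed form)."""
    return math.sqrt(f) / math.sqrt(k)


def re_include_exclude(f: float, k: int) -> float:
    """Approximate relative error of Include / Exclude on counts (assumed form)."""
    return f / math.sqrt(k)


k = 16384
f = overlap_factor(union_size=1_000_000, result_size=10_000)  # F = 100

theta_err = re_theta(f, k)          # 10 / 128
ie_err = re_include_exclude(f, k)   # 100 / 128

print(f"F = {f:.0f}")
print(f"theta sketch: {theta_err:.1%}, include/exclude: {ie_err:.1%}")
print(f"ratio = {ie_err / theta_err:.1f} = sqrt(F)")
```

Under these assumptions, an intersection that is 1% of the union carries roughly an 8% relative error with a 16K θ sketch and roughly 78% with Include / Exclude at the same k.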
