Slide 1

Slide 1 text

Quantiles over Data Streams: An Experimental Study • Lu Wang, HKUST • Joint work with Ge Luo, Ke Yi, Graham Cormode

Slide 2

Slide 2 text

http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html

Slide 3

Slide 3 text

Huge amount of Data generated every minute

Slide 4

Slide 4 text

Quantiles over Data Streams

Slide 5

Slide 5 text

Data Streams • Example stream: 3 14 15 9 26 5 35 89 79 … • An infinite sequence of elements • Limited memory • Queries can be made at any time, about the elements seen so far

Slide 6

Slide 6 text

Data Streams • Example stream: 3 14 15 9 26 5 35 89 79 • Answer queries in one scan over a large database

Slide 7

Slide 7 text

Characterizing a Distribution • Mean & Standard deviation – Assuming a normal distribution • Not accurate enough http://www.statsdirect.co.uk/help/distributions/normal_distribution.htm

Slide 8

Slide 8 text

Characterizing a Distribution • Example http://www.math.wisc.edu/~angenent/221.2008f/theNaturalBlog.html

Slide 9

Slide 9 text

Characterizing a Distribution • Equi-Width Histogram – 5 buckets • Not good for skewed data

Slide 10

Slide 10 text

Characterizing a Distribution • Equi-Depth Histogram – 5 buckets • Adaptive • a.k.a. Height Balanced Histogram

Slide 11

Slide 11 text

Quantiles over Data Streams

Slide 12

Slide 12 text

The Quantile Problem • The ϕ-quantile is the value at which the CDF reaches ϕ (plot: CDF from 0 to 1, with ϕ and the ϕ-quantile marked)

Slide 13

Slide 13 text

Quantiles = Equi-Depth Histogram • 5 buckets 0.2-quantile 0.4-quantile 0.6-quantile 0.8-quantile
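The link between quantiles and histograms can be illustrated with a small sketch. It assumes the k/B-quantile is the element of rank ⌈kN/B⌉ in sorted order (one common convention, not necessarily the one used in the talk), and the function names and the sample data are hypothetical.

```python
def equi_depth_boundaries(data, buckets):
    """Boundaries of an equi-depth histogram: the k/buckets-quantiles."""
    s = sorted(data)
    n = len(s)
    # -(-a // b) is integer ceiling division, so no floating point is involved.
    return [s[-(-k * n // buckets) - 1] for k in range(1, buckets)]


def equi_width_counts(data, buckets):
    """Element counts of an equi-width histogram over [min, max].

    Assumes the data contains at least two distinct values.
    """
    lo, hi = min(data), max(data)
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for x in data:
        idx = min(int((x - lo) / width), buckets - 1)  # max falls in last bucket
        counts[idx] += 1
    return counts


skewed = [1, 1, 2, 2, 3, 3, 4, 5, 50, 100]
print(equi_depth_boundaries(skewed, 5))  # [1, 2, 3, 5]
print(equi_width_counts(skewed, 5))      # [8, 0, 1, 0, 1]
```

On this skewed sample the equi-width histogram puts 8 of 10 elements into one bucket, while the equi-depth boundaries adapt to where the data actually lies.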

Slide 14

Slide 14 text

Kolmogorov-Smirnov Divergence • Median • Equi-Depth Histogram • Network Health Monitoring • Wireless Sensor Network • Distribution Estimation • R • Excel • Sawzall • Log Data Analysis • Data Collection

Slide 15

Slide 15 text

Offline Quantile Algorithm • The selection algorithm • M. Blum et al. J. Comput. System Sci. 7 (1973) • For offline data • Time: O(N) • Space: O(N)
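A minimal sketch of linear-time selection in the median-of-medians style of Blum et al. follows. It is written for readability rather than speed (it copies sublists instead of partitioning in place), and the function names are hypothetical.

```python
def _median_of_five(chunk):
    """Median of a short list (at most five elements)."""
    chunk = sorted(chunk)
    return chunk[len(chunk) // 2]


def select(items, k):
    """Return the k-th smallest element of items (k is 0-based)."""
    items = list(items)
    while True:
        if len(items) <= 5:
            return sorted(items)[k]
        # The pivot is the median of the medians of groups of five.
        medians = [_median_of_five(items[i:i + 5])
                   for i in range(0, len(items), 5)]
        pivot = select(medians, len(medians) // 2)
        lows = [x for x in items if x < pivot]
        highs = [x for x in items if x > pivot]
        n_equal = len(items) - len(lows) - len(highs)
        if k < len(lows):
            items = lows
        elif k < len(lows) + n_equal:
            return pivot
        else:
            k -= len(lows) + n_equal
            items = highs


stream = [3, 14, 15, 9, 26, 5, 35, 89, 79]
print(select(stream, len(stream) // 2))  # 15, the median
```

The whole input has to be available in memory, which is why this approach does not carry over to streams.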

Slide 17

Slide 17 text

For data streams, linear space is needed to find an exact quantile

Slide 18

Slide 18 text

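Approx. Version of Quantiles • Accept any element whose position on the CDF lies between ϕ–ε and ϕ+ε (plot: CDF from 0 to 1, with ϕ–ε, ϕ and ϕ+ε marked)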

Slide 19

Slide 19 text

Rank Queries • Given x, find rank(x) = # of elements that are smaller than x • Approx. version: find r’ such that |r’ – rank(x)| < εN • Equivalent to the quantile problem – Binary search on the domain
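The reduction from quantile queries to rank queries can be sketched as a binary search over an integer domain [0, U). The helper name `quantile_from_rank`, the exact rank oracle built with `bisect`, and the convention that the ϕ-quantile is the element with ⌊ϕN⌋ smaller elements are all assumptions made for illustration; with an approximate rank oracle the same search returns an approximate quantile.

```python
from bisect import bisect_left


def quantile_from_rank(rank, phi, n, universe):
    """Find the phi-quantile using only rank queries.

    rank(x) must return the number of stored elements smaller than x and be
    non-decreasing in x. Returns the largest x in [0, universe) such that
    rank(x) <= floor(phi * n). Assumes 0 <= phi < 1.
    """
    target = int(phi * n)
    lo, hi = 0, universe - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if rank(mid) <= target:
            lo = mid          # mid is still at or below the quantile
        else:
            hi = mid - 1      # mid is past the quantile
    return lo


data = sorted([3, 14, 15, 9, 26, 5, 35])


def exact_rank(x):
    return bisect_left(data, x)


print(quantile_from_rank(exact_rank, 0.5, len(data), 64))  # 14
```

The search makes about log2(U) rank queries, so a structure that answers rank queries also answers quantile queries with only a logarithmic overhead.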

Slide 20

Slide 20 text

Cash Register Algorithms

Slide 21

Slide 21 text

Cash Register Model • Every element from the data stream denotes an insertion
New element | Data set
3 | 3
14 | 3, 14
15 | 3, 14, 15
9 | 3, 9, 14, 15
26 | 3, 9, 14, 15, 26
5 | 3, 5, 9, 14, 15, 26
35 | 3, 5, 9, 14, 15, 26, 35

Slide 22

Slide 22 text

Algorithm | Space | Note
MP80 | O((1/ε) log²(εN)) | Munro & Paterson, Theoretical CS 1980
MRL98 | O((1/ε) log²(εN)) | Manku et al., SIGMOD 1998
GK01 | O((1/ε) log(εN)) | Greenwald & Khanna, SIGMOD 2001
QDigest | O((1/ε) log U) | Shrivastava et al., SenSys 2004
Random Sampling | O((1/ε²) log(1/ε)) | Classical
MRL99 | O((1/ε) log²(1/ε)) | Manku et al., SIGMOD 1999
Agarwal12 | O((1/ε) log^1.5(1/ε)) | Agarwal et al., PODS 2012
Random | O((1/ε) log^1.5(1/ε)) | New

Slide 24

Slide 24 text

Random (New), space O((1/ε) log^1.5(1/ε)) • Achieves the best bounds • Simple and practical • Very fast

Slide 25

Slide 25 text

GK • Incoming: 3 • Nodes as (value, rank): (3, 0) • N=1

Slide 26

Slide 26 text

GK • Incoming: 14 • Nodes as (value, rank): (3, 0), (14, 1) • N=2

Slide 27

Slide 27 text

GK • Incoming: 50 • Nodes as (value, rank): (3, 0), (14, 1), (50, 2) • We know the exact rank of any element • Space: O(#nodes) • N=3

Slide 28

Slide 28 text

GK • Incoming: 9 • Nodes as (value, rank): (3, 0), (9, 1), (14, 1), (50, 2) • Insertion in the middle: affects other nodes’ ranks • N=4

Slide 29

Slide 29 text

GK • Nodes as (value, rank): (3, 0), (9, 1), (14, 2), (50, 3) • Deletion: introduces errors • N=4

Slide 30

Slide 30 text

GK • Incoming: 13 • Nodes as (value, rank): (3, 0), (9, 1), (50, 3) • Deletion: introduces errors – rank(13) = 2 or 3? • N=5

Slide 31

Slide 31 text

GK • Incoming: 14 • Nodes as (value, rank bounds): (3, 0), (9, 1), (13, 2-3), (50, 3) • N=5

Slide 32

Slide 32 text

GK • Incoming: 14 • Nodes as (value, rank bounds): (3, 0), (9, 1), (13, 2-3), (14, 3-4), (50, 4) • Insertion: does not introduce errors – The error stays within the interval • N=6

Slide 33

Slide 33 text

GK • Nodes as (value, rank bounds): (3, 0), (9, 1), (13, 2-3), (14, 3-4), (50, 5) • One comes, one goes – New elements are inserted immediately – Try to delete one node after each insertion • N=6

Slide 34

Slide 34 text

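GK • Nodes as (value, rank bounds): (3, 0), (9, 1), (13, 2-3), (14, 3-4), (50, 5) • Removal cost: the uncertainty a deletion would cause (costs shown on the slide: 1 2 2 2 2) • Delete the node with the smallest cost, if it is smaller than the error threshold • N=6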

Slide 35

Slide 35 text

GK • Nodes as (value, rank bounds): (3, 0), (9, 1), (13, 2-3), (14, 3-4), (50, 5) • Query: rank(20) – Return rank(20)=4 – The error is at most 1

Slide 36

Slide 36 text

GK • The practical version • Space: unknown – Small in practice – Bound still open • Update time (per element): O(log Space) – Optimization: updating the ranks of the following nodes may be avoided
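The insert-then-try-to-delete policy walked through on the GK slides can be sketched as follows. This is a simplified illustration, not the authors' code: it uses the (g, Δ) tuple representation from Greenwald & Khanna's paper, ranks are 1-based, the removal cost g[i] + g[i+1] + Δ[i+1] and the threshold 2εN are assumptions taken from that paper's invariant, and the class and method names are hypothetical. It scans all nodes to find the cheapest one, so it does not reach the update time quoted above.

```python
import bisect


class GKSketch:
    """Simplified Greenwald-Khanna-style summary (illustrative only)."""

    def __init__(self, eps):
        self.eps = eps
        self.n = 0
        self.vals = []    # stored values, sorted
        self.g = []       # g[i] = rmin(i) - rmin(i-1)
        self.delta = []   # delta[i] = rmax(i) - rmin(i)

    def insert(self, v):
        i = bisect.bisect_right(self.vals, v)
        if i == 0 or i == len(self.vals):
            d = 0                                   # new min or max: exact rank
        else:
            d = self.g[i] + self.delta[i] - 1       # inherit successor's bounds
        self.vals.insert(i, v)
        self.g.insert(i, 1)
        self.delta.insert(i, d)
        self.n += 1
        self._try_delete_one()

    def _try_delete_one(self):
        """Delete the cheapest interior node if its cost is within 2*eps*n."""
        best, best_cost = None, None
        for i in range(1, len(self.vals) - 1):      # keep min and max
            cost = self.g[i] + self.g[i + 1] + self.delta[i + 1]
            if best_cost is None or cost < best_cost:
                best, best_cost = i, cost
        if best is not None and best_cost <= 2 * self.eps * self.n:
            self.g[best + 1] += self.g[best]        # successor absorbs the node
            del self.vals[best], self.g[best], self.delta[best]

    def value_at_rank(self, r):
        """Return a stored value whose rank bounds are closest to r."""
        rmin, best, best_err = 0, None, None
        for v, g, d in zip(self.vals, self.g, self.delta):
            rmin += g
            err = max(r - rmin, rmin + d - r)
            if best_err is None or err < best_err:
                best, best_err = v, err
        return best
```

A query for the ϕ-quantile would call `value_at_rank(ceil(phi * n))`; under the assumed invariant the returned element's true rank is within εN of the target.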

Slide 37

Slide 37 text

GK (Theoretical ver.) • Remove nodes regularly – Complicated • Space: O((1/ε) log(εN)) • No implementation so far

Slide 38

Slide 38 text

Random • New • Best randomized algorithm • A hybrid of MRL99 & Agarwal12 – Achieves the best bounds – Practical – Simple to implement – Very fast

Slide 40

Slide 40 text

Random • (Logically) divide the stream into fixed-size buffers

Slide 41

Slide 41 text

Random • Merge {1,2,5,7} and {3,4,6,8} – Merged and sorted: 1,2,3,4,5,6,7,8 – Random halve: keep 1,3,5,7 or 2,4,6,8, each with probability 1/2 • rank(6) = 5 in the merged buffer; after halving, 2 × rank(6) = 4 or 6
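The merge example above can be reproduced in a few lines. The function name `merge_and_halve` is hypothetical, and rank(x) is taken to be the number of elements smaller than x, as on the Rank Queries slide.

```python
import random
from bisect import bisect_left
from heapq import merge


def merge_and_halve(a, b, rng=random):
    """Merge two sorted buffers, then keep every other element.

    The starting offset (0 or 1) is chosen with probability 1/2 each, so the
    survivors are either the odd-position or the even-position elements.
    """
    merged = list(merge(a, b))
    offset = rng.randrange(2)
    return merged[offset::2]


halved = merge_and_halve([1, 2, 5, 7], [3, 4, 6, 8])
estimate = 2 * bisect_left(halved, 6)   # each survivor stands for 2 elements
print(halved, estimate)                 # estimate is 4 or 6; the true rank is 5
```

The two outcomes are equally likely, so the estimate equals the true rank of 5 in expectation, which is why halving adds error but no bias.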

Slide 42

Slide 42 text

Random • Fixed-size buffers • Whenever there are two buffers at the same level: merge them into one buffer at the next level up

Slide 43

Slide 43 text

Random • Fixed-size buffers • Merge, sort & random halve

Slide 44

Slide 44 text

Random • Fixed-size buffers • Merge, sort & random halve • Orange: buffers we need • Blue: buffers we don’t need

Slide 49

Slide 49 text

Random • Fixed-size buffers • Merge, sort & random halve • At any time, there is at most one buffer at any level

Slide 50

Slide 50 text

Random • Query rank(x) • Combine the ranks of x in all buffers we have, weighted x1, x2, x4, … by buffer level
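Putting the buffer slides together, a stripped-down version of the scheme might look like the sketch below. It omits the input-sampling optimization and places no cap on the number of levels, so it illustrates the mechanics rather than the stated space bounds; the class name `RandomSketch` and its parameters are hypothetical.

```python
import bisect
import random


class RandomSketch:
    """Buffers of fixed size; level-l elements each stand for 2**l inputs."""

    def __init__(self, buf_size, seed=None):
        self.buf_size = buf_size
        self.rng = random.Random(seed)
        self.current = []    # partially filled level-0 buffer (unsorted)
        self.levels = {}     # level -> sorted full buffer, at most one each

    def insert(self, v):
        self.current.append(v)
        if len(self.current) < self.buf_size:
            return
        buf, level = sorted(self.current), 0
        self.current = []
        # Two buffers at the same level: merge, sort, random halve, move up.
        while level in self.levels:
            merged = sorted(self.levels.pop(level) + buf)
            buf = merged[self.rng.randrange(2)::2]
            level += 1
        self.levels[level] = buf

    def rank(self, x):
        """Estimated number of inserted elements smaller than x."""
        est = sum(1 for v in self.current if v < x)
        for level, buf in self.levels.items():
            est += (1 << level) * bisect.bisect_left(buf, x)
        return est


sk = RandomSketch(buf_size=256, seed=7)
values = list(range(20000))
random.Random(1).shuffle(values)
for v in values:
    sk.insert(v)
print(sk.rank(10000), sorted(sk.levels))  # estimate near 10000; occupied levels
```

Each merge preserves the total weight, so the estimates always sum to N over the whole domain; the randomness only affects how that weight is spread across ranks.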

Slide 51

Slide 51 text

Random • Setting the buffer size to (1/ε)√(log(1/ε)) – Space: O((1/ε)√(log(1/ε)) log(εN)) – Update time per element: O(log(1/ε)) • Optimization: sampling the input – Space: O((1/ε) log^1.5(1/ε)) – Update time per element: O(log(1/ε)), decreasing to O(1) as N → ∞

Slide 52

Slide 52 text

Experimental Results:

Slide 53

Slide 53 text

Experimental Setup: Inputs • 2 real sets – MPCAT-OBS: 87 million observations from 1802-2012, Minor Planet Center – Neuse River Basin Terrain: 100 million elevation points • 12 synthetic sets – Sizes: 10^7 to 10^10 – Universe size U: 2^16 to 2^32 – Distributions: Uniform, Normal (different deviations) – Order: Sorted, Shuffled • ε: 0.1 to 10^-6

Slide 54

Slide 54 text

Experimental Setup: Measurements • Space • Time • Maximum Error & Average Error – Of a number of quantile queries – Average of 100 runs (for randomized algorithms) • A comprehensive comparison for the first time

Slide 55

Slide 55 text

Average Error Randomized Alg

Slide 56

Slide 56 text

Space (MB) Average Error Randomized Alg GK

Slide 57

Slide 57 text

Update Time (us) Average Error

Slide 58

Slide 58 text

Update Time (us) Space (MB)

Slide 59

Slide 59 text

Turnstile Algorithms

Slide 60

Slide 60 text

Turnstile Model • Every element from the data stream denotes an insertion or a deletion
New element | Data set
Ins 3 | 3
Ins 14 | 3, 14
Ins 15 | 3, 14, 15
Del 3 | 14, 15
Ins 26 | 14, 15, 26
Del 14 | 15, 26
Ins 53 | 15, 26, 53

Slide 61

Slide 61 text

Algorithm | Space* | Note
GM07 | O((1/ε²) log⁵ U) | Ganguly & Majumder, ESCAPE 2007
Random Subset Sum | O((1/ε²) log² U) | Gilbert et al., VLDB 2002
DCM | O((1/ε) log² U) | Cormode & Muthukrishnan, J. of Algorithms 2005
DCS | O((1/ε) log^1.5 U) | New
* Small log factors are ignored.

Slide 62

Slide 62 text

DCM • Dyadic Decomposition of the domain [0, U-1] • Each node stores the # of elements in its interval, e.g. between U/2 and U-1

Slide 63

Slide 63 text

DCM • Dyadic Decomposition • rank(x): sum the counts of the dyadic intervals that together cover the range from 0 up to x

Slide 64

Slide 64 text

DCM • Dyadic Decomposition • Space: O(U) • Update Time: O(log U)
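Before any sketching is applied, the dyadic decomposition with exact counters can be written down directly. The class below is a hypothetical illustration that assumes U is a power of two (U = 2^log_u) and defines rank(x) as the number of elements smaller than x.

```python
class DyadicCounts:
    """Exact counters for every dyadic interval of [0, 2**log_u)."""

    def __init__(self, log_u):
        self.log_u = log_u
        # counts[h][j] = number of elements in [j * 2**h, (j + 1) * 2**h)
        self.counts = [[0] * (1 << (log_u - h)) for h in range(log_u + 1)]

    def update(self, x, delta):
        """Insert (delta=+1) or delete (delta=-1) x: one counter per layer."""
        for h in range(self.log_u + 1):
            self.counts[h][x >> h] += delta

    def rank(self, x):
        """Sum at most one dyadic interval per layer to cover [0, x)."""
        total = 0
        for h in range(self.log_u + 1):
            if (x >> h) & 1:
                total += self.counts[h][(x >> h) - 1]
        return total


d = DyadicCounts(log_u=6)   # U = 64
for op, x in [("Ins", 3), ("Ins", 14), ("Ins", 15), ("Del", 3),
              ("Ins", 26), ("Del", 14), ("Ins", 53)]:
    d.update(x, +1 if op == "Ins" else -1)
print(d.rank(20), d.rank(54))  # 1 3
```

The counters total about 2U entries, which is the O(U) space on the slide; deletions are handled for free because every counter is simply decremented.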

Slide 65

Slide 65 text

DCM • Consider each layer as a frequency vector, and store it in a Count-Min sketch [Cormode & Muthukrishnan 2005] (one CM sketch per layer) • Space: O(log U × SketchSize) • Update Time: O(log U × SketchUpdateTime)
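Replacing each layer's exact counters with a Count-Min sketch gives the following rough sketch of the idea. The hash family ((a·key + b) mod p) mod width, the class names, and the width and depth parameters are illustrative assumptions rather than the settings used in the experiments.

```python
import random


class CountMin:
    """Count-Min sketch: depth rows of width counters, one hash per row."""
    PRIME = (1 << 61) - 1

    def __init__(self, width, depth, rng):
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]
        self.hashes = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(depth)]

    def _cells(self, key):
        for row, (a, b) in zip(self.rows, self.hashes):
            yield row, ((a * key + b) % self.PRIME) % self.width

    def update(self, key, delta):
        for row, j in self._cells(key):
            row[j] += delta

    def estimate(self, key):
        return min(row[j] for row, j in self._cells(key))


class DCM:
    """One Count-Min sketch per dyadic layer of the domain [0, 2**log_u)."""

    def __init__(self, log_u, width, depth, seed=0):
        rng = random.Random(seed)
        # layers[h] summarizes the counts of intervals of length 2**h
        self.layers = [CountMin(width, depth, rng) for _ in range(log_u)]

    def update(self, x, delta):
        for h, cm in enumerate(self.layers):
            cm.update(x >> h, delta)

    def rank(self, x):
        """Estimated number of elements smaller than x, for 0 <= x < 2**log_u."""
        est = 0
        for h, cm in enumerate(self.layers):
            if (x >> h) & 1:
                est += cm.estimate((x >> h) - 1)
        return est
```

As long as no element is deleted more often than it was inserted, every Count-Min estimate is at least the true count, so this rank estimate never undershoots; the overshoot accumulates over the layers that contribute to a query.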

Slide 66

Slide 66 text

DCM • Query: sum of log U sketch estimates • A Count-Min sketch has error ε – log U sketches -> total error ε log U

Slide 67

Slide 67 text

DCS • Count-Min sketch -> Count sketch [Charikar 2002] • A Count sketch has error ε, but it is unbiased – log U sketches -> total error ε√(log U)

Slide 68

Slide 68 text

DCS/DCM • Space – DCM: O((1/ε) log² U) – DCS: O((1/ε) log^1.5 U) • Optimization – The top log(1/ε) layers can be removed – Fewer layers -> faster update/query

Slide 69

Slide 69 text

Average Error

Slide 70

Slide 70 text

Space (MB) Average Error

Slide 71

Slide 71 text

Update Time (us) Average Error

Slide 72

Slide 72 text

Conclusions • Cash Register – Random • Probabilistic guarantee • Fast: cache friendly – GK • Deterministic guarantee on error • Simple • Turnstile – DCS • The best

Slide 73

Slide 73 text

More Results & Code http://quantiles.github.com/

Slide 74

Slide 74 text

Thanks!

Slide 75

Slide 75 text

Slides https://speakerdeck.com/coolwanglu/quantiles-over-data-streams-an-experimental-study