Quantiles over Data Streams: An Experimental Study

Slide 1

Slide 1 text

Quantiles over Data Streams: An Experimental Study
Lu Wang, HKUST
Joint work with Ge Luo, Ke Yi, Graham Cormode

Slide 2

Slide 2 text

http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html

Slide 3

Slide 3 text

A huge amount of data is generated every minute

Slide 4

Slide 4 text

Quantiles over Data Streams

Slide 5

Slide 5 text

Data Streams
3 14 15 9 26 5 35 89 79 … ? ? ? ?
• An infinite sequence of elements
• Limited memory
• Can make queries at any time
• About elements seen so far

Slide 6

Slide 6 text

Data Streams
3 14 15 9 26 5 35 89 79 ?
• Answer queries in one scan over a large database

Slide 7

Slide 7 text

Characterizing a Distribution
• Mean & Standard deviation
– Assuming a normal distribution
• Not accurate enough
http://www.statsdirect.co.uk/help/distributions/normal_distribution.htm

Slide 8

Slide 8 text

Characterizing a Distribution
• Example
http://www.math.wisc.edu/~angenent/221.2008f/theNaturalBlog.html

Slide 9

Slide 9 text

Characterizing a Distribution
• Equi-Width Histogram
– 5 buckets
• Not good for skewed data

Slide 10

Slide 10 text

Characterizing a Distribution
• Equi-Depth Histogram
– 5 buckets
• Adaptive
• a.k.a. Height Balanced Histogram
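The contrast between the two histograms can be illustrated with a minimal Python sketch. It is not from the slides: the function names and the sample data are assumptions chosen for illustration. On skewed data, equal-width buckets leave most buckets empty, while equal-depth buckets adapt their boundaries to the data.

```python
def equi_width_edges(values, buckets):
    """Boundaries that split [min, max] into equally wide buckets."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / buckets
    return [lo + i * width for i in range(buckets)] + [hi]


def equi_depth_edges(values, buckets):
    """Boundaries chosen so each bucket holds about len(values)/buckets items."""
    ordered = sorted(values)
    n = len(ordered)
    inner = [ordered[(i * n) // buckets] for i in range(1, buckets)]
    return [ordered[0]] + inner + [ordered[-1]]


def bucket_counts(values, edges):
    """Count items per bucket; the last bucket also includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for b in range(len(counts)):
            last = b == len(counts) - 1
            if edges[b] <= v < edges[b + 1] or (last and v == edges[-1]):
                counts[b] += 1
                break
    return counts


# Hypothetical skewed data set: nine small values and one outlier.
skewed = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
width_counts = bucket_counts(skewed, equi_width_edges(skewed, 5))
depth_counts = bucket_counts(skewed, equi_depth_edges(skewed, 5))
print(width_counts)  # [9, 0, 0, 0, 1]
print(depth_counts)  # [2, 2, 2, 2, 2]
```

The equi-depth boundaries require the data in sorted order, which is what makes them a quantile problem on a stream.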

Slide 11

Slide 11 text

Quantiles over Data Streams

Slide 12

Slide 12 text

The Quantile Problem
[Figure: CDF from 0 to 1, with the ϕ-quantile marked at height ϕ]

Slide 13

Slide 13 text

Quantiles = Equi-Depth Histogram
• 5 buckets
– Boundaries: 0.2-quantile, 0.4-quantile, 0.6-quantile, 0.8-quantile
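As a small illustration of this equivalence, the sketch below reads the four bucket boundaries off a sorted copy of the data. The convention that the ϕ-quantile is the element at sorted position ⌊ϕN⌋ is an assumption made here (other conventions differ by one position), and `exact_quantile` is a hypothetical helper name.

```python
def exact_quantile(values, phi):
    """Return the element at sorted position floor(phi * N), 0-based."""
    ordered = sorted(values)
    index = min(int(phi * len(ordered)), len(ordered) - 1)
    return ordered[index]


# Ten elements; the first nine are the stream shown on the earlier slides.
data = [3, 14, 15, 9, 26, 5, 35, 89, 79, 50]

# The 0.2-, 0.4-, 0.6- and 0.8-quantiles are the boundaries of 5 buckets.
boundaries = [exact_quantile(data, k / 5) for k in range(1, 5)]
print(boundaries)  # [9, 15, 35, 79]
```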

Slide 14

Slide 14 text

Kolmogorov-Smirnov Divergence • Median • Equi-Depth Histogram • Network Health Monitoring • Wireless Sensor Network • Distribution Estimation • R • Excel • Sawzall • Log Data Analysis • Data Collection

Slide 15

Slide 15 text

Offline Quantile Algorithm
• The selection algorithm
– M. Blum et al. J. Comput. System Sci. 7 (1973)
• For offline data
• Time: O(N)
• Space: O(N)
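For reference, a compact Python sketch of linear-time selection with median-of-medians pivots follows. It is an illustrative rendering of the textbook technique, not code from the study; the function name `select` and the test data are assumptions. It needs the whole input in memory, which is why it does not carry over to streams.

```python
def select(items, k):
    """Return the element of 0-based rank k, using median-of-medians pivots."""
    items = list(items)
    while True:
        if len(items) <= 5:
            return sorted(items)[k]
        # Median of each group of 5, then the median of those medians.
        groups = [sorted(items[i:i + 5]) for i in range(0, len(items), 5)]
        medians = [g[len(g) // 2] for g in groups]
        pivot = select(medians, len(medians) // 2)
        smaller = [x for x in items if x < pivot]
        equal = [x for x in items if x == pivot]
        if k < len(smaller):
            items = smaller                      # answer lies below the pivot
        elif k < len(smaller) + len(equal):
            return pivot                         # answer is the pivot itself
        else:
            k -= len(smaller) + len(equal)       # answer lies above the pivot
            items = [x for x in items if x > pivot]


# A fixed permutation of 0..100, so the element of rank k is k itself.
data = [(i * 37) % 101 for i in range(101)]
median = select(data, len(data) // 2)
print(median)  # 50
```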

Slide 17

Slide 17 text

For data streams, linear space is needed to find an exact quantile

Slide 18

Slide 18 text

Approx. Version of Quantiles
[Figure: CDF from 0 to 1, with the band from ϕ–ε to ϕ+ε marked around ϕ]

Slide 19

Slide 19 text

Rank Queries
• Given x, find rank(x) = # of elements that are smaller than x
• Approx. version: find r’ such that |r’ – rank(x)| < εN
• Equivalent to the quantile problem
– Binary search on the domain
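The reduction from quantiles to rank queries can be sketched as a binary search over an integer domain. In the sketch below, the rank oracle is exact and built from a sorted list purely for illustration; in a streaming setting it would be replaced by a summary's approximate rank. The names `make_rank` and `quantile_by_rank`, and the domain [0, 100], are assumptions.

```python
import bisect


def make_rank(values):
    """Build rank(x) = number of elements smaller than x (exact stand-in)."""
    ordered = sorted(values)
    return lambda x: bisect.bisect_left(ordered, x)


def quantile_by_rank(rank, phi, n, lo, hi):
    """Largest integer x in [lo, hi] with rank(x) <= phi * n.

    Relies only on rank being non-decreasing in x.
    """
    target = phi * n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if rank(mid) <= target:
            lo = mid          # mid is still at or below the quantile
        else:
            hi = mid - 1      # mid is past the quantile
    return lo


data = [3, 14, 15, 9, 26, 5, 35, 89, 79]
rank = make_rank(data)
median = quantile_by_rank(rank, 0.5, len(data), 0, 100)
print(median)  # 15
```

Each quantile query costs a number of rank queries logarithmic in the domain size.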

Slide 20

Slide 20 text

Cash Register Algorithms

Slide 21

Slide 21 text

Cash Register Model
• Every element from the data stream denotes an insertion
New element | Data set
3 | 3
14 | 3, 14
15 | 3, 14, 15
9 | 3, 9, 14, 15
26 | 3, 9, 14, 15, 26
5 | 3, 5, 9, 14, 15, 26
35 | 3, 5, 9, 14, 15, 26, 35

Slide 22

Slide 22 text

Algorithm | Space | Note
MP80 | (1/ε) log²(εN) | Munro & Paterson, Theoretical CS 1980
MRL98 | (1/ε) log²(εN) | Manku et al., SIGMOD 1998
GK01 | (1/ε) log(εN) | Greenwald & Khanna, SIGMOD 2001
QDigest | (1/ε) log u | Shrivastava et al., SenSys 2004
Random Sampling | (1/ε²) log(1/ε) | Classical
MRL99 | (1/ε) log²(1/ε) | Manku et al., SIGMOD 1999
Agarwal12 | (1/ε) log^1.5(1/ε) | Agarwal et al., PODS 2012
Random | (1/ε) log^1.5(1/ε) | New
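The classical random sampling entry is the simplest of these to sketch. The Python below keeps a uniform sample with reservoir sampling and scales the sample rank up to the stream length. The sample size of 1000, the seed, and the function names are assumptions for illustration and are not derived from the bound in the table.

```python
import random


def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of k elements from a stream."""
    sample = []
    for n, x in enumerate(stream, start=1):
        if n <= k:
            sample.append(x)
        else:
            j = rng.randrange(n)      # uniform in [0, n)
            if j < k:
                sample[j] = x         # replace a random slot with prob. k/n
    return sample


def estimate_rank(sample, x, n):
    """Scale the rank of x within the sample up to stream length n."""
    smaller = sum(1 for s in sample if s < x)
    return smaller * n / len(sample)


rng = random.Random(0)
n = 20000
stream = [rng.random() for _ in range(n)]
sample = reservoir_sample(stream, 1000, rng)
approx = estimate_rank(sample, 0.5, n)
exact = sum(1 for v in stream if v < 0.5)
print(approx, exact)
```

The estimate is only probabilistically close to the exact rank, which is why the sampling bound carries a 1/ε² factor.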

Slide 25

Slide 25 text

GK
Incoming: 3
Node: value & rank
Nodes: (3, 0)
N=1

Slide 26

Slide 26 text

GK
Incoming: 14
Nodes: (3, 0), (14, 1)
N=2

Slide 27

Slide 27 text

GK
Incoming: 50
Nodes: (3, 0), (14, 1), (50, 2)
• We know the exact rank of any element
• Space: O(#nodes)
N=3

Slide 28

Slide 28 text

GK
Incoming: 9
Nodes: (3, 0), (9, 1), (14, 1), (50, 2)
• Insertion in the middle: affects other nodes’ ranks
N=4

Slide 29

Slide 29 text

GK
Nodes: (3, 0), (9, 1), (14, 2), (50, 3)
• Deletion: introducing errors
N=4

Slide 30

Slide 30 text

GK
Incoming: 13
Nodes: (3, 0), (9, 1), (50, 3)
• Deletion: introducing errors
• rank(13) = 2 or 3?
N=5

Slide 31

Slide 31 text

GK
Incoming: 14
Node: value & rank bounds
Nodes: (3, 0), (9, 1), (13, 2-3), (50, 3)
N=5

Slide 32

Slide 32 text

GK
Incoming: 14
Nodes: (3, 0), (9, 1), (13, 2-3), (14, 3-4), (50, 4)
• Insertion: does not introduce errors
• The error stays within the interval
N=6

Slide 33

Slide 33 text

GK
Nodes: (3, 0), (9, 1), (13, 2-3), (14, 3-4), (50, 5)
One comes, one goes
• New elements are inserted immediately
• Try to delete one after each insertion
N=6

Slide 34

Slide 34 text

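GK
Nodes: (3, 0), (9, 1), (13, 2-3), (14, 3-4), (50, 5)
• Removal cost: the uncertainty caused by removing a node
• Removal costs shown: 1, 2, 2, 2, 2
• Delete the node with the smallest cost, if it is smaller than the allowed error
N=6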

Slide 35

Slide 35 text

GK
Nodes: (3, 0), (9, 1), (13, 2-3), (14, 3-4), (50, 5)
Query: rank(20)
• Return rank(20) = 4
• The error is at most 1

Slide 36

Slide 36 text

GK
• The practical version
• Space: Unknown
– Small in practice
– Still open
• Update time (per element): O(log(Space))
– Optimization: updating the ranks of the following nodes may be avoided
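The "one comes, one goes" behaviour walked through on the GK slides can be sketched in Python as follows. This is an illustrative reading, not the implementation evaluated in the study. It stores the standard Greenwald-Khanna tuples (value, g, Δ) instead of explicit rank bounds, uses 1-based ranks, and assumes the usual GK deletion condition g_i + g_{i+1} + Δ_{i+1} ≤ 2εN as the removal threshold. The class and method names are hypothetical, and the linear scans stand in for the search tree a real implementation would need to reach logarithmic update time.

```python
import math


class GKSummary:
    """Minimal GK-style summary: insert immediately, then try one deletion."""

    def __init__(self, eps):
        self.eps = eps
        self.n = 0
        self.tuples = []  # [value, g, delta], kept sorted by value

    def insert(self, v):
        t = self.tuples
        i = 0
        while i < len(t) and t[i][0] <= v:   # linear scan, for simplicity
            i += 1
        if i == 0 or i == len(t):
            delta = 0                        # new min or max: rank is exact
        else:
            delta = t[i][1] + t[i][2] - 1    # inherit successor's uncertainty
        t.insert(i, [v, 1, delta])
        self.n += 1
        self._delete_cheapest()

    def _delete_cheapest(self):
        t = self.tuples
        budget = 2 * self.eps * self.n       # assumed threshold: 2 * eps * N
        best, best_cost = None, None
        for i in range(1, len(t) - 1):       # never remove the min or max
            cost = t[i][1] + t[i + 1][1] + t[i + 1][2]
            if best_cost is None or cost < best_cost:
                best, best_cost = i, cost
        if best is not None and best_cost <= budget:
            t[best + 1][1] += t[best][1]     # successor absorbs the count
            del t[best]

    def query(self, phi):
        """Return a stored value whose rank bounds best cover rank phi * N."""
        target = max(1, math.ceil(phi * self.n))
        rmin = 0
        best_v, best_err = None, None
        for v, g, delta in self.tuples:
            rmin += g
            err = max(target - rmin, rmin + delta - target)
            if best_err is None or err < best_err:
                best_v, best_err = v, err
        return best_v


gk = GKSummary(eps=0.01)
for i in range(1000):
    gk.insert((i * 37) % 1000)   # a fixed permutation of 0..999
median = gk.query(0.5)
print(median, len(gk.tuples))
```

In this test stream the value v has 1-based rank v + 1, so the returned median can be checked directly against the εN error bound.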

Slide 37

Slide 37 text

GK (Theoretical ver.)
• Remove nodes regularly
– Complicated
• Space: O((1/ε) log(εN))
• No implementation so far

Slide 38

Slide 38 text

Random
• New
• Best randomized algorithm
• A hybrid of MRL99 & Agarwal12
– Achieves the best bounds
– Practical
– Simple to implement
– Very fast

Slide 40

Slide 40 text

Random
• (Logically) divide the stream into fixed-size buffers
[Figure: the stream split into consecutive fixed-size buffers]

Slide 41

Slide 41 text

Random
• Merge {1,2,5,7} and {3,4,6,8}
– Merged: 1,2,3,4,5,6,7,8, where rank(6) = 5
– Keep 1,3,5,7 or 2,4,6,8, each with probability 1/2
– In the kept buffer, 2·rank(6) = 4 or 6
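The merge step on this slide can be reproduced with a few lines of Python. The helper names are assumptions; the `offset` argument makes the coin flip explicit, so both outcomes can be inspected.

```python
import random


def merge_and_halve(a, b, offset):
    """Merge two sorted buffers and keep every other element.

    offset 0 keeps positions 1, 3, 5, ...; offset 1 keeps 2, 4, 6, ...
    """
    merged = sorted(a + b)
    return merged[offset::2]


def weighted_rank(buffer, weight, x):
    """Each kept element stands for `weight` original elements."""
    return weight * sum(1 for v in buffer if v < x)


odd = merge_and_halve([1, 2, 5, 7], [3, 4, 6, 8], 0)    # [1, 3, 5, 7]
even = merge_and_halve([1, 2, 5, 7], [3, 4, 6, 8], 1)   # [2, 4, 6, 8]
print(weighted_rank(odd, 2, 6), weighted_rank(even, 2, 6))  # 6 4

# In the algorithm the offset is chosen by a fair coin flip.
kept = merge_and_halve([1, 2, 5, 7], [3, 4, 6, 8], random.randrange(2))
```

The two outcomes average to the exact rank 5, so the estimate is unbiased even though each single merge can be off by one weight unit.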

Slide 42

Slide 42 text

Random
• Whenever there are two buffers at the same level: merge them into one buffer at one level higher
[Figure: fixed-size buffers over the stream, with a merge of two same-level buffers]
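Putting the buffering and merging rules together gives a structure that behaves like a binary counter. The Python sketch below covers only what these slides show (fixed-size buffers, merging two buffers of the same level, and weighting a level-l element by 2^l); any further ingredients of the full Random algorithm are not modelled. The class name, the buffer size of 16, and the seed are assumptions.

```python
import random


class RandomSummary:
    """Fixed-size buffers that are merged pairwise, level by level."""

    def __init__(self, buffer_size, seed=0):
        self.size = buffer_size
        self.rng = random.Random(seed)
        self.current = []    # level-0 buffer still being filled
        self.levels = {}     # level -> one full, sorted buffer

    def insert(self, x):
        self.current.append(x)
        if len(self.current) == self.size:
            self._carry(sorted(self.current), 0)
            self.current = []

    def _carry(self, buf, level):
        # Two buffers at the same level merge into one at the next level.
        while level in self.levels:
            other = self.levels.pop(level)
            merged = sorted(buf + other)
            buf = merged[self.rng.randrange(2)::2]   # keep odd or even positions
            level += 1
        self.levels[level] = buf

    def rank(self, x):
        """Estimate the number of elements smaller than x."""
        est = sum(1 for v in self.current if v < x)
        for level, buf in self.levels.items():
            est += (2 ** level) * sum(1 for v in buf if v < x)
        return est


summary = RandomSummary(buffer_size=16, seed=0)
for i in range(1024):
    summary.insert((i * 37) % 1024)   # a fixed permutation of 0..1023
estimate = summary.rank(512)          # the exact rank is 512
print(estimate, sorted(summary.levels))
```

After 1024 insertions with buffers of 16 elements, the 64 full buffers have collapsed into a single level-6 buffer, so the summary holds 16 elements of weight 64 each.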