Quantiles over Data Streams: An Experimental Study

Lu Wang
June 27, 2013

Transcript

  1. Quantiles over Data Streams: An Experimental Study

    Lu Wang [email protected], HKUST
    Joint work with Ge Luo, Ke Yi, Graham Cormode
  2. Data Streams

    Stream: 3 14 15 9 26 5 35 89 79 … ? ? ? ?
    • An infinite sequence of elements
    • Limited memory
    • Queries can be made at any time
    • Queries are about the elements seen so far
  3. Data Streams

    Stream: 3 14 15 9 26 5 35 89 79 ?
    • Answer queries in one scan over a large database
  4. Characterizing a Distribution

    • Mean & standard deviation
      – Assuming a normal distribution
    • Not accurate enough
    http://www.statsdirect.co.uk/help/distributions/normal_distribution.htm
  5. Characterizing a Distribution

    • Equi-Depth Histogram
      – 5 buckets
    • Adaptive
    • a.k.a. Height Balanced Histogram
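
An equi-depth histogram places its bucket boundaries at evenly spaced quantiles, so every bucket holds roughly the same number of elements. Below is a minimal offline sketch in Python, assuming the data fits in memory. The function name `equi_depth_boundaries` and the sample values are illustrative and do not come from the talk.

```python
def equi_depth_boundaries(values, buckets=5):
    """Return buckets + 1 boundaries so each bucket holds about n / buckets elements."""
    data = sorted(values)
    n = len(data)
    # Boundary i is the element at position i * n / buckets in sorted order.
    cuts = [data[min(n - 1, (i * n) // buckets)] for i in range(buckets)]
    cuts.append(data[-1])  # the last boundary is the maximum
    return cuts


# Illustrative values: the stream from the early slides plus one extra element.
sample = [3, 14, 15, 9, 26, 5, 35, 89, 79, 2]
print(equi_depth_boundaries(sample, buckets=5))  # [2, 5, 14, 26, 79, 89]
```

Each boundary is simply a quantile query, which is why a quantile summary is enough to build this histogram over a stream.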
  6. Kolmogorov-Smirnov divergence, median, equi-depth histogram

    Network health monitoring, wireless sensor networks, distribution estimation, R, Excel, Sawzall, log data analysis, data collection
  7. Offline Quantile Algorithm

    • The selection algorithm
      – M. Blum et al. J. Comput. System Sci. 7 (1973)
    • For offline data
    • Time: O(N)
    • Space: O(N)
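
For reference, the linear-time selection idea can be sketched as follows. This is a simplified median-of-medians implementation in Python written for illustration. It is not code from the talk, and the function name `select` is an assumption.

```python
def select(items, k):
    """Return the k-th smallest element of items (0-based) in worst-case linear time."""
    if len(items) <= 5:
        return sorted(items)[k]
    # Take the median of each group of five, then recurse to find their median.
    medians = []
    for i in range(0, len(items), 5):
        group = sorted(items[i:i + 5])
        medians.append(group[len(group) // 2])
    pivot = select(medians, len(medians) // 2)
    # Three-way partition around the pivot.
    lower = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    upper = [x for x in items if x > pivot]
    if k < len(lower):
        return select(lower, k)
    if k < len(lower) + len(equal):
        return pivot
    return select(upper, k - len(lower) - len(equal))


data = [3, 14, 15, 9, 26, 5, 35, 89, 79]
print(select(data, len(data) // 2))  # median of the nine values: 15
```

The whole input must be held in memory, which is what rules this approach out for streams.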
  9. Rank Queries

    • Given x, find rank(x) = number of elements that are smaller than x
    • Approximate version: find r' such that |r' − rank(x)| < εN
    • Equivalent to the quantile problem
      – Binary search on the domain
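
The equivalence can be illustrated with a short Python sketch: given any rank oracle, a quantile is found by binary search over the value domain. The sketch assumes an integer domain [lo, hi]. The exact `rank` function stands in for a summary's approximate answer, and all names are illustrative.

```python
def rank(data, x):
    """Exact rank of x: the number of elements smaller than x."""
    return sum(1 for v in data if v < x)


def quantile_via_rank(rank_fn, phi, n, lo, hi):
    """Return the largest integer x in [lo, hi] with rank_fn(x) <= phi * n."""
    target = phi * n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if rank_fn(mid) <= target:  # rank is non-decreasing in x
            lo = mid
        else:
            hi = mid - 1
    return lo


data = [3, 14, 15, 9, 26, 5, 35, 89, 79]
median = quantile_via_rank(lambda x: rank(data, x), 0.5, len(data), 0, 100)
print(median)  # 15
```

The search issues O(log u) rank queries for a domain of size u, so an approximate rank structure yields approximate quantiles with the same error.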
  10. Cash Register Model

    • Every element from the data stream denotes an insertion

    New element    Data set
    3              3
    14             3, 14
    15             3, 14, 15
    9              3, 9, 14, 15
    26             3, 9, 14, 15, 26
    5              3, 5, 9, 14, 15, 26
    35             3, 5, 9, 14, 15, 26, 35
  11. Algorithm: Space (Note)

    • MP80: O((1/ε) log^2(εN)) (Munro & Paterson, Theoretical CS 1980)
    • MRL98: O((1/ε) log^2(εN)) (Manku et al., SIGMOD 1998)
    • GK01: O((1/ε) log(εN)) (Greenwald & Khanna, SIGMOD 2001)
    • QDigest: O((1/ε) log u) (Shrivastava et al., SenSys 2004)
    • Random Sampling: O((1/ε^2) log(1/ε)) (Classical)
    • MRL99: O((1/ε) log^2(1/ε)) (Manku et al., SIGMOD 1999)
    • Agarwal12: O((1/ε) log^1.5(1/ε)) (Agarwal et al., PODS 2012)
    • Random: O((1/ε) log^1.5(1/ε)) (New)
  13. Algorithm: Space (Note), same table as slide 11

    Random (New):
    • Achieves the best bounds
    • Simple and practical
    • Very fast
  14. GK

    Incoming: 50
    Nodes (value: rank): 3: 0, 14: 1, 50: 2
    • We know the exact rank of any element
    • Space: O(#nodes)
    N=3
  15. GK

    Incoming: 9
    Nodes (value: rank): 3: 0, 9: 1, 14: 1, 50: 2
    • Insertion in the middle: affects other nodes' ranks
    N=4
  16. GK

    Incoming: 14
    Nodes (value: rank bounds): 3: 0, 9: 1, 13: 2-3, 50: 3
    • Node: value & rank bounds
    N=5
  17. GK

    Incoming: 14
    Nodes (value: rank bounds): 3: 0, 9: 1, 13: 2-3, 14: 3-4, 50: 4
    • Insertion does not introduce errors
    • The error stays within the interval
    N=6
  18. GK

    Nodes (value: rank bounds): 3: 0, 9: 1, 13: 2-3, 14: 3-4, 50: 5
    • One comes, one goes
    • New elements are inserted immediately
    • Try to delete one node after each insertion
    N=6
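  19. GK

    Nodes (value: rank bounds): 3: 0, 9: 1, 13: 2-3, 14: 3-4, 50: 5
    Removal costs shown: 1 2 2 2 2
    • Removal cost: the uncertainty caused by removing the node
    • Delete the node with the smallest cost, if that cost is smaller than 2εN
    N=6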
  20. GK

    Query: rank(20)
    Nodes (value: rank bounds): 3: 0, 9: 1, 13: 2-3, 14: 3-4, 50: 5
    • Return rank(20)=4
    • The error is at most 1
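
The insert-then-try-to-delete loop walked through on the GK slides can be sketched as below. This is a simplified Python illustration and not the authors' implementation. It assumes the usual GK tuple form (value, g, Δ), distinct input values, and a removal threshold of 2εN. The class name `GKSummary` and its methods are hypothetical.

```python
import bisect


class GKSummary:
    """Simplified GK-style summary: insert every element, then try one removal."""

    def __init__(self, eps):
        self.eps = eps
        self.n = 0
        self.tuples = []  # [value, g, delta], kept sorted by value

    def insert(self, v):
        values = [t[0] for t in self.tuples]  # linear scan, kept simple on purpose
        i = bisect.bisect_left(values, v)
        if i == 0 or i == len(self.tuples):
            delta = 0  # a new minimum or maximum has an exact rank
        else:
            succ = self.tuples[i]
            delta = succ[1] + succ[2] - 1  # inherit the successor's uncertainty
        self.tuples.insert(i, [v, 1, delta])
        self.n += 1
        self._try_remove_one()

    def _try_remove_one(self):
        # Cost of removing tuple i: uncertainty its successor would then carry.
        best, best_cost = None, None
        for i in range(1, len(self.tuples) - 1):  # never remove min or max
            nxt = self.tuples[i + 1]
            cost = self.tuples[i][1] + nxt[1] + nxt[2]
            if best_cost is None or cost < best_cost:
                best, best_cost = i, cost
        if best is not None and best_cost < 2 * self.eps * self.n:
            self.tuples[best + 1][1] += self.tuples[best][1]
            del self.tuples[best]

    def rank(self, x):
        """Estimate the number of inserted elements smaller than x."""
        rmin = 0
        for v, g, delta in self.tuples:
            if v >= x:
                # The true rank lies in [rmin, rmin + g + delta - 1]; return the midpoint.
                return rmin + (g + delta - 1) / 2
            rmin += g
        return self.n


gk = GKSummary(eps=0.05)
for i in range(1000):
    gk.insert((i * 37) % 1000)  # a fixed permutation of 0..999
print(len(gk.tuples), gk.rank(500))
```

The linear scans keep the sketch short. A practical version would use a search tree or a skip list to reach the logarithmic update time mentioned in the talk.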
  21. GK

    • The practical version
    • Space: unknown
      – Small in practice
      – Still open
    • Update time (per element): O(log(Space))
      – Optimization: following rank updates may be avoided
  22. GK (Theoretical ver.)

    • Remove nodes regularly
      – Complicated
    • Space: O((1/ε) log(εN))
    • No implementation so far
  23. Random

    • New
    • Best randomized algorithm
    • A hybrid of MRL99 & Agarwal12
      – Achieves the best bounds
      – Practical
      – Simple to implement
      – Very fast
  24. Random

    Stream: …
    • Fixed size buffer
    • Whenever there are two buffers at the same level: merge them into a buffer one level higher
  25. Random

    Stream:
    • Fixed size buffer
    • Merge, sort & random halve
    • Orange: buffers we need
    • Blue: buffers we don't need
  26. Random

    Stream:
    • Fixed size buffer
    • At any time, there is at most one buffer at any level
    • Merge, sort & random halve
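
The buffer scheme above can be sketched in a few lines of Python. This is an illustration of the merge, sort and random halve step only. It leaves out the input-sampling optimization, and the class name `RandomSummary`, the `buffer_size` parameter and the rank estimator are assumptions rather than the authors' code.

```python
import random


class RandomSummary:
    """Fixed-size buffers; two buffers at one level merge into one a level higher."""

    def __init__(self, buffer_size, seed=None):
        self.s = buffer_size
        self.rng = random.Random(seed)
        self.current = []  # level-0 buffer still being filled
        self.levels = {}   # level -> the single full, sorted buffer at that level

    def insert(self, v):
        self.current.append(v)
        if len(self.current) < self.s:
            return
        buf, level = sorted(self.current), 0
        self.current = []
        # Carry upward while another buffer already occupies this level.
        while level in self.levels:
            merged = sorted(self.levels.pop(level) + buf)
            offset = self.rng.randrange(2)  # keep odd or even positions at random
            buf = merged[offset::2]         # random halve: 2s elements down to s
            level += 1
        self.levels[level] = buf

    def rank(self, x):
        """Estimate the number of inserted elements smaller than x."""
        est = sum(1 for v in self.current if v < x)
        for level, buf in self.levels.items():
            # Each element of a level-l buffer represents 2**l stream elements.
            est += (2 ** level) * sum(1 for v in buf if v < x)
        return est


summary = RandomSummary(buffer_size=8, seed=1)
for i in range(100):
    summary.insert((i * 37) % 100)
print(sorted(summary.levels), summary.rank(50))
```

Because the halving keeps the odd or the even positions with equal probability, each merge leaves the rank estimate unbiased, and the occupied levels mirror the binary representation of the number of filled buffers.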
  27. Random

    Setting the buffer size to (1/ε) √(log(1/ε)):
    • Space: O((1/ε) √(log(1/ε)) log(εN))
    • Update time per element: O(log(1/ε))
    Optimization: sampling the input
    • Space: O((1/ε) log^1.5(1/ε))
    • Update time per element: O(log(1/ε))
      – Decreasing to O(1) as N → ∞
  28. Experimental Setup: Inputs

    • 2 real sets
      – MPCAT-OBS: 87 million observations from 1802-2012, Minor Planet Center
      – Neuse River Basin Terrain: 100 million elevation points
    • 12 synthetic sets
      – Sizes: 10^7 − 10^10
      – u: 2^16 − 2^32
      – Distributions: Uniform, Normal (different deviations)
      – Order: Sorted, Shuffled
    • ε: 0.1 − 10^−6
  29. Experimental Setup: Measurements

    • Space
    • Time
    • Maximum Error & Average Error
      – Of a number of quantile queries
      – Average of 100 runs (for randomized algorithms)
    • A comprehensive comparison for the first time
  30. Turnstile Model

    • Every element from the data stream denotes an insertion or a deletion

    New element    Data set
    Ins 3          3
    Ins 14         3, 14
    Ins 15         3, 14, 15
    Del 3          14, 15
    Ins 26         14, 15, 26
    Del 14         15, 26
    Ins 53         15, 26, 53
  31. Algorithm: Space* (Note)

    • GM07: O((1/ε^2) log^5 u) (Ganguly & Majumder, ESCAPE 2007)
    • Random Subset Sum: O((1/ε^2) log^2 u) (Gilbert et al., VLDB 2002)
    • DCM: O((1/ε) log^2 u) (Cormode & Muthukrishnan, J. of Algorithms 2005)
    • DCS: O((1/ε) log^1.5 u) (New)
    * Small log factors are ignored.
  32. DCM

    • Consider each layer as a frequency vector, and store it in a Count Min sketch [Cormode & Muthukrishnan 2005]
    • Space: O(log u × SketchSize)
    • Update time: O(log u × SketchUpdateTime)
    Layers: CM CM CM CM
  33. DCM

    • Query: sum of log u sketches
    • Count Min sketch has error εN
      – log u sketches -> total error εN log u
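
A dyadic structure of this kind can be sketched in Python as follows. The code is illustrative only: it assumes an integer universe [0, 2^log_u), keeps one small Count Min sketch per dyadic layer, and uses a simple linear hash. The class names, parameters and hash choice are assumptions, not the authors' implementation.

```python
import random


class CountMin:
    """Minimal Count Min sketch over non-negative integer keys."""

    P = 2_147_483_647  # a Mersenne prime, larger than any key hashed here

    def __init__(self, width, depth, rng):
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]
        self.hashes = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(depth)]

    def _cells(self, key):
        for row, (a, b) in zip(self.rows, self.hashes):
            yield row, ((a * key + b) % self.P) % self.width

    def update(self, key, delta):
        for row, c in self._cells(key):
            row[c] += delta

    def estimate(self, key):
        return min(row[c] for row, c in self._cells(key))


class DyadicCountMin:
    """One Count Min sketch per dyadic layer; layer j counts intervals of size 2**j."""

    def __init__(self, log_u, width, depth, seed=0):
        rng = random.Random(seed)
        self.layers = [CountMin(width, depth, rng) for _ in range(log_u)]

    def update(self, x, delta):
        # delta = +1 for an insertion, -1 for a deletion.
        for j, sketch in enumerate(self.layers):
            sketch.update(x >> j, delta)

    def rank(self, x):
        """Estimate the number of stored elements smaller than x."""
        est = 0
        for j, sketch in enumerate(self.layers):
            if (x >> j) & 1:
                # [0, x) contains the dyadic interval with index (x >> j) - 1 at layer j.
                est += sketch.estimate((x >> j) - 1)
        return est


dcm = DyadicCountMin(log_u=6, width=32, depth=4)
for op, value in [(1, 3), (1, 14), (1, 15), (-1, 3), (1, 26), (-1, 14), (1, 53)]:
    dcm.update(value, op)
print(dcm.rank(27))  # the data set is now 15, 26, 53, so the true rank is 2
```

A rank query touches at most one interval per layer. With only insertions and deletions of present elements, each Count Min estimate can overshoot but never undershoots, so the per-layer errors add up.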
  34. DCS

    • Count Min sketch -> Count sketch [Charikar 2002]
    • Count sketch has error εN, but it is unbiased
      – log u sketches -> total error εN √(log u)
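
The difference between the two space bounds can be traced to how the per-layer errors combine. The following back-of-the-envelope derivation is not taken from the slides. It assumes that every layer's sketch has error parameter ε and size O(1/ε), that there are log u layers, and it writes ε_total for the target rank error. Small log factors are ignored.

```latex
\begin{align*}
\text{DCM (errors add linearly):}\quad
  & \varepsilon N \log u \le \varepsilon_{\text{total}} N
    \;\Rightarrow\; \varepsilon = \frac{\varepsilon_{\text{total}}}{\log u}, \\
  & \text{space} = \log u \cdot O\!\left(\frac{1}{\varepsilon}\right)
    = O\!\left(\frac{1}{\varepsilon_{\text{total}}} \log^{2} u\right). \\
\text{DCS (unbiased errors partly cancel):}\quad
  & \varepsilon N \sqrt{\log u} \le \varepsilon_{\text{total}} N
    \;\Rightarrow\; \varepsilon = \frac{\varepsilon_{\text{total}}}{\sqrt{\log u}}, \\
  & \text{space} = \log u \cdot O\!\left(\frac{1}{\varepsilon}\right)
    = O\!\left(\frac{1}{\varepsilon_{\text{total}}} \log^{1.5} u\right).
\end{align*}
```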
  35. DCS/DCM

    • Space
      – DCM: O((1/ε) log^2 u)
      – DCS: O((1/ε) log^1.5 u)
    • Optimization
      – The top log(1/ε) layers can be removed
      – Smaller u -> fewer layers -> faster update/query
  36. Conclusions

    • Cash Register
      – Random
        • Probabilistic guarantee
        • Fast: cache friendly
      – GK
        • Deterministic guarantee on error
        • Simple
    • Turnstile
      – DCS
        • The best