[Figure: the stream 3 14 15 9 26 5 35 89 79 …, with queries (?) issued at arbitrary points]
Data Streams
• An infinite sequence of elements
• Limited memory
• Can make queries at any time
• About elements seen so far
Slide 6
[Figure: the sequence 3 14 15 9 26 5 35 89 79, with one query (?) answered after a single scan]
Data Streams
• Answer queries in one scan over a large database
Slide 7
Characterizing a Distribution
• Mean & Standard deviation
– Assuming a normal distribution
• Not accurate enough
http://www.statsdirect.co.uk/help/distributions/normal_distribution.htm
Slide 8
Characterizing a Distribution
• Example
http://www.math.wisc.edu/~angenent/221.2008f/theNaturalBlog.html
Slide 9
Characterizing a Distribution
• Equi-Width Histogram
– 5 buckets
• Not good for skewed data
Slide 10
Characterizing a Distribution
• Equi-Depth Histogram
– 5 buckets
• Adaptive
• a.k.a. Height Balanced Histogram
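The difference between the two histograms can be illustrated with a small sketch. The helper names and the exponential test data below are assumptions made for illustration; they do not come from the slides. On skewed data the equi-width buckets end up very unevenly filled, while the equi-depth boundaries (the 1/5, 2/5, ... quantiles) adapt so that every bucket holds about N/5 items.

```python
import random
from bisect import bisect_right

def equi_width_edges(data, buckets):
    """Boundaries that cut the value range [min, max] into equal-width buckets."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / buckets
    return [lo + i * width for i in range(buckets)] + [hi]

def equi_depth_edges(data, buckets):
    """Boundaries at the i/buckets quantiles, so each bucket holds about N/buckets items."""
    s = sorted(data)
    n = len(s)
    return [s[min(n - 1, (i * n) // buckets)] for i in range(buckets + 1)]

def bucket_counts(data, edges):
    """Count items per bucket; the last bucket also includes its right boundary."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        b = min(bisect_right(edges, x) - 1, len(counts) - 1)
        counts[b] += 1
    return counts

random.seed(0)
skewed = [random.expovariate(1.0) for _ in range(10_000)]  # hypothetical skewed input
width_counts = bucket_counts(skewed, equi_width_edges(skewed, 5))
depth_counts = bucket_counts(skewed, equi_depth_edges(skewed, 5))
print("equi-width:", width_counts)
print("equi-depth:", depth_counts)
```

Computing the equi-depth boundaries here requires sorting the whole input, which is exactly what a streaming quantile summary is meant to avoid.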
Kolmogorov-Smirnov Divergence
Median
Equi-Depth Histogram
Network Health Monitoring
Wireless Sensor Network
Distribution Estimation
R
Excel
Sawzall
Log Data Analysis
Data Collection
Slide 15
Offline Quantile Algorithm
• The selection algorithm
• M. Blum et al. J. Comput. System Sci. 7 (1973)
• For offline data
• Time: O(N)
• Space: O(N)
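As a rough illustration of offline selection, here is a randomized quickselect. Note that this is a swapped-in, simpler technique: it runs in expected linear time, whereas the algorithm of Blum et al. is the deterministic median-of-medians method with worst-case linear time. Both need the whole data set in memory. The function name and the 0-based rank convention are assumptions of this sketch.

```python
import random

def select(items, k):
    """Return the element with exactly k smaller-or-earlier elements (0-based rank k)."""
    items = list(items)
    while True:
        pivot = random.choice(items)
        smaller = [x for x in items if x < pivot]
        equal = [x for x in items if x == pivot]
        if k < len(smaller):
            items = smaller                      # answer lies left of the pivot
        elif k < len(smaller) + len(equal):
            return pivot                         # the pivot itself has rank k
        else:
            k -= len(smaller) + len(equal)       # skip everything up to the pivot
            items = [x for x in items if x > pivot]

data = [3, 14, 15, 9, 26, 5, 35, 89, 79]
median = select(data, len(data) // 2)
print(median)
```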
Slide 17
For data streams,
linear space is needed to find an exact quantile
Slide 18
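Approx. Version of Quantiles
[Figure: CDF from 0 to 1; any element whose CDF value lies between ϕ–ε and ϕ+ε is an acceptable answer for the ϕ-quantile]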
Slide 19
Rank Queries
Given x, find
rank(x) = # of elements that are smaller than x
Approx. version: find r’ such that
|r’ – rank(x)| < εN
Equivalent to the quantile problem
• Binary search on the domain
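The reduction from quantiles to rank queries can be sketched as follows, assuming an integer domain [lo, hi] and a rank function that is non-decreasing in x. The function names are hypothetical. An exact rank oracle is used here for clarity; plugging in an approximate rank oracle would give an approximate quantile instead.

```python
from bisect import bisect_left

def quantile_from_rank(rank, phi, n, lo, hi):
    """Largest integer x in [lo, hi] with rank(x) <= floor(phi * n).

    Assumes rank(lo) <= floor(phi * n), e.g. lo is at most the minimum element.
    """
    target = int(phi * n)
    while lo < hi:
        mid = (lo + hi + 1) // 2      # round up so the loop always makes progress
        if rank(mid) <= target:
            lo = mid                  # mid is still a valid answer, look higher
        else:
            hi = mid - 1              # too many elements below mid
    return lo

data = sorted([3, 14, 15, 9, 26, 5, 35, 89, 79])

def exact_rank(x):
    """Number of elements smaller than x (exact oracle for the demo)."""
    return bisect_left(data, x)

median = quantile_from_rank(exact_rank, 0.5, len(data), 0, 100)
print(median)
```

Each quantile query costs O(log |domain|) rank queries.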
Slide 20
Cash Register Algorithms
Slide 21
Cash Register Model
• Every element from the data stream denotes an insertion
New element   Data set
3             3
14            3, 14
15            3, 14, 15
9             3, 9, 14, 15
26            3, 9, 14, 15, 26
5             3, 5, 9, 14, 15, 26
35            3, 5, 9, 14, 15, 26, 35
Slide 22
Algorithm         Space                    Note
MP80              (1/ε) log^2(εN)          Munro & Paterson, Theoretical CS 1980
MRL98             (1/ε) log^2(εN)          Manku et al., SIGMOD 1998
GK01              (1/ε) log(εN)            Greenwald & Khanna, SIGMOD 2001
QDigest           (1/ε) log U              Shrivastava et al., SenSys 2004
Random Sampling   (1/ε^2) log(1/ε)         Classical
MRL99             (1/ε) log^2(1/ε)         Manku et al., SIGMOD 1999
Agarwal12         (1/ε) log^1.5(1/ε)       Agarwal et al., PODS 2012
Random            (1/ε) log^1.5(1/ε)       New
Slide 24
Random (new)
• Achieves the best bounds
• Simple and practical
• Very fast
Slide 25
GK
Incoming: 3
Nodes (value: rank): 3: 0
N=1
Slide 26
GK
Incoming: 14
Nodes (value: rank): 3: 0, 14: 1
N=2
Slide 27
GK
Incoming: 50
Nodes (value: rank): 3: 0, 14: 1, 50: 2
We know the exact rank of any element
Space: O(#nodes)
N=3
Slide 28
GK
Incoming: 9
Nodes as drawn (value: rank): 3: 0, 9: 1, 14: 1, 50: 2
Insertion in the middle: affects the ranks of the following nodes (14 and 50 each move up by one)
N=4
GK
Incoming: 14
Nodes as drawn (value: rank): 3: 0, 9: 1, 13: 2-3, 14: 3-4, 50: 4
Insertion: does not introduce errors
The existing error stays within the rank intervals
N=6
Slide 33
GK
Nodes (value: rank): 3: 0, 9: 1, 13: 2-3, 14: 3-4, 50: 5
One comes, one goes
• New elements are inserted immediately
• Try to delete one node after each insertion
N=6
Slide 34
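GK
Nodes (value: rank): 3: 0, 9: 1, 13: 2-3, 14: 3-4, 50: 5
Removal cost: the uncertainty a removal would cause
Removal costs shown in the figure: 1, 2, 2, 2, 2
Delete the node with the smallest cost, if it is smaller than [threshold missing in the source]
N=6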
Slide 35
GK
Nodes (value: rank): 3: 0, 9: 1, 13: 2-3, 14: 3-4, 50: 5
Query: rank(20)
Return rank(20)=4
The error is at most 1
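The insert, delete, and query steps walked through on the GK slides can be sketched as below. This is a simplified illustration, not the authors' implementation, and it rests on several assumptions: nodes are stored in the standard GK form (value, g, Δ) rather than as explicit rank intervals; the deletion threshold is taken to be the usual GK capacity 2εN, since the threshold on the slide is not legible; and ranks are tracked internally as 1-based positions, while rank(x) returns the estimated number of smaller elements. The linear scan for the cheapest node costs O(#nodes) per update, so the sketch does not achieve the update time of a tuned implementation.

```python
import random
from bisect import bisect_left, bisect_right

class GKSketch:
    """Simplified Greenwald-Khanna style summary (illustrative sketch)."""

    def __init__(self, eps):
        self.eps = eps
        self.n = 0
        self.values = []  # node values, kept sorted
        self.g = []       # g[i]: gap between the minimum ranks of node i-1 and node i
        self.delta = []   # delta[i]: width of node i's rank interval

    def insert(self, x):
        i = bisect_right(self.values, x)
        if i == 0 or i == len(self.values):
            d = 0  # a new minimum or maximum has an exactly known rank
        else:
            d = self.g[i] + self.delta[i] - 1  # inherit the successor's uncertainty
        self.values.insert(i, x)
        self.g.insert(i, 1)
        self.delta.insert(i, d)
        self.n += 1
        self._try_delete_one()

    def _try_delete_one(self):
        """Delete the cheapest interior node if its cost fits the assumed 2*eps*n budget."""
        best, best_cost = None, None
        for i in range(1, len(self.values) - 1):
            cost = self.g[i] + self.g[i + 1] + self.delta[i + 1]
            if best_cost is None or cost < best_cost:
                best, best_cost = i, cost
        if best is not None and best_cost <= 2 * self.eps * self.n:
            self.g[best + 1] += self.g[best]  # the successor absorbs the deleted node
            del self.values[best], self.g[best], self.delta[best]

    def rank(self, x):
        """Estimated number of inserted elements smaller than x."""
        i = bisect_left(self.values, x)  # nodes 0..i-1 are smaller than x
        if i == 0:
            return 0
        if i == len(self.values):
            return self.n
        low = sum(self.g[:i])                       # at least this many are smaller
        high = low + self.g[i] + self.delta[i] - 1  # at most this many are smaller
        return (low + high) / 2

rng = random.Random(0)
stream = list(range(5000))
rng.shuffle(stream)
sk = GKSketch(eps=0.01)
for v in stream:
    sk.insert(v)
print(len(sk.values), sk.rank(2500))
```

Because the minimum and maximum nodes are never deleted, queries outside the observed range are answered exactly.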
Slide 36
GK
• The practical version
• Space: Unknown
– Small in practice
– Still open
• Update Time (per element): O(log(Space))
– Optimization: updating the ranks of the following nodes may be avoided
Slide 37
GK (Theoretical ver.)
• Remove nodes regularly
– Complicated
• Space: O((1/ε) log(εN))
• No implementation so far
Slide 38
Random
• New
• Best randomized algorithm
• A hybrid of MRL99 & Agarwal12
– Achieves the best bounds
– Practical
– Simple to implement
– Very fast
Slide 40
Random
• (Logically) divide the stream into fixed-size buffers
Slide 41
Random
• Merge {1,2,5,7} and {3,4,6,8}
Merged and sorted: 1,2,3,4,5,6,7,8
With probability 1/2 keep: 1,3,5,7
With probability 1/2 keep: 2,4,6,8
Before halving: rank(6) = 5
After halving: 2·rank(6) = 4 or 6
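The merge step on this slide can be reproduced with a few lines. The function names are hypothetical. The halving keeps either the odd or the even positions of the merged buffer, each with probability 1/2, so doubling the rank in the kept half gives 4 or 6 for x = 6, which averages to the true rank 5.

```python
import random

def merge_and_halve(a, b, rng=random):
    """Merge two equal-size buffers, sort, then keep every other element."""
    merged = sorted(a + b)
    offset = rng.randrange(2)   # 0 keeps positions 1,3,5,...; 1 keeps positions 2,4,6,...
    return merged[offset::2]

def rank_in(buffer, x):
    """Number of elements in the buffer that are smaller than x."""
    return sum(1 for v in buffer if v < x)

kept = merge_and_halve([1, 2, 5, 7], [3, 4, 6, 8])
estimate = 2 * rank_in(kept, 6)   # each kept element now stands for two originals
print(kept, estimate)
```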
Slide 42
Random
• Fixed-size buffers
• Whenever there are two buffers at the same level: merge them into one buffer at the next level up
Slide 43
Random
[Figure: the stream split into fixed-size buffers; each merge step is merge, sort & random halve]
Slide 44
Random
[Figure: the same merge animation; orange marks buffers we need, blue marks buffers we don’t need]
Slide 49
Random
• Merge, sort & random halve
• At any time, there is at most one buffer at any level
Slide 50
Random
• Query rank(x): combine the ranks of x in all buffers we have
• Buffer weights shown in the figure: x1, x2, x4 (one per level)
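Putting the buffer, merge, and query steps together gives roughly the following structure. This is a minimal sketch under stated assumptions, not the authors' code: the class and attribute names are invented, a buffer at level i is weighted by 2^i when ranks are combined, and the input-sampling optimization is left out, so the number of levels grows with the stream length.

```python
import random
from bisect import bisect_left

class RandomSketch:
    """Buffers of a fixed size, merged pairwise with random halving (illustrative)."""

    def __init__(self, buffer_size, seed=None):
        self.s = buffer_size
        self.rng = random.Random(seed)
        self.current = []   # level-0 buffer that is still being filled
        self.levels = {}    # level -> full sorted buffer (at most one per level)

    def insert(self, x):
        self.current.append(x)
        if len(self.current) < self.s:
            return
        buf, level = sorted(self.current), 0
        self.current = []
        # Carry upward like binary addition: two buffers at a level become one above.
        while level in self.levels:
            merged = sorted(buf + self.levels.pop(level))
            buf = merged[self.rng.randrange(2)::2]   # random halve
            level += 1
        self.levels[level] = buf

    def rank(self, x):
        """Estimated number of inserted elements smaller than x."""
        est = sum(1 for v in self.current if v < x)
        for level, buf in self.levels.items():
            est += (2 ** level) * bisect_left(buf, x)
        return est

rng = random.Random(0)
stream = list(range(10_000))
rng.shuffle(stream)
rs = RandomSketch(buffer_size=64, seed=1)
for v in stream:
    rs.insert(v)
stored = len(rs.current) + sum(len(b) for b in rs.levels.values())
print(stored, rs.rank(5000))
```

The guarantee is probabilistic: a single query can be off by more than the typical error, although that happens with small probability.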
Slide 51
Random
Setting buffer size = (1/ε) √log(1/ε)
• Space: O((1/ε) √log(1/ε) log(εN))
• Update Time per element: O(log(1/ε))
Optimization: Sampling the input
• Space: O((1/ε) log^1.5(1/ε))
• Update Time per element: O(log(1/ε))
– Decreasing to O(1) as N → ∞
Slide 52
Experimental Results:
Slide 53
Experimental Setup: Inputs
• 2 real sets
– MPCAT-OBS: 87 million observations from 1802-2012,
Minor Planet Center
– Neuse River Basin Terrain: 100 million elevation points
• 12 synthetic sets
– Sizes: 10^7 to 10^10
– Universe size U: 2^16 to 2^32
– Distributions: Uniform, Normal (different deviations)
– Order: Sorted, Shuffled
• ε: 0.1 to 10^-6
Slide 54
Experimental Setup: Measurements
• Space
• Time
• Maximum Error & Average Error
– Of a number of quantile queries
– Average of 100 runs (for randomized algorithms)
• A comprehensive comparison for the first time
Slide 55
[Plot: Average Error; series: Randomized Alg]
Slide 56
[Plot: Space (MB) vs. Average Error; series: Randomized Alg, GK]
Slide 57
[Plot: Update Time (µs) vs. Average Error]
Slide 58
[Plot: Update Time (µs) vs. Space (MB)]
Slide 59
Turnstile Algorithms
Slide 60
Turnstile Model
• Every element from the data stream denotes an insertion or a deletion
New element   Data set
Ins 3         3
Ins 14        3, 14
Ins 15        3, 14, 15
Del 3         14, 15
Ins 26        14, 15, 26
Del 14        15, 26
Ins 53        15, 26, 53
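For reference, the update sequence in the table can be replayed with an exact, linear-space baseline. The class below is a hypothetical illustration that keeps one counter per distinct value; the turnstile sketches in the following slides aim to answer the same rank queries in far less space.

```python
from collections import Counter

class ExactTurnstile:
    """Exact baseline for the turnstile model: one counter per distinct value."""

    def __init__(self):
        self.counts = Counter()

    def update(self, op, x):
        """Apply 'Ins' or 'Del' for the value x."""
        self.counts[x] += 1 if op == "Ins" else -1
        if self.counts[x] == 0:
            del self.counts[x]

    def data_set(self):
        return sorted(self.counts.elements())

    def rank(self, x):
        """Number of current elements smaller than x."""
        return sum(c for v, c in self.counts.items() if v < x)

updates = [("Ins", 3), ("Ins", 14), ("Ins", 15), ("Del", 3),
           ("Ins", 26), ("Del", 14), ("Ins", 53)]
ts = ExactTurnstile()
for op, x in updates:
    ts.update(op, x)
print(ts.data_set(), ts.rank(30))
```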
Slide 61
Algorithm           Space*               Note
GM07                (1/ε^2) log^5 U      Ganguly & Majumder, ESCAPE 2007
Random Subset Sum   (1/ε^2) log^2 U      Gilbert et al., VLDB 2002
DCM                 (1/ε) log^2 U        Cormode & Muthukrishnan, J. of Algorithms 2005
DCS                 (1/ε) log^1.5 U      New
* Small log factors are ignored.
Slide 62
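DCM
Dyadic Decomposition
[Figure: binary tree of dyadic intervals over the universe 0 to U-1, split at U/2; each node stores the # of elements in its interval, e.g. between U/2 and U-1]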
DCM
Consider each layer as a frequency vector, and
store it in a Count Min sketch [Cormode & Muthukrishnan 2005]
Space: O(log U × SketchSize)
Update Time: O(log U × SketchUpdateTime)
[Figure: one CM sketch per layer]
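A compact sketch of the layered structure, assuming U = 2^log_u and integer elements in [0, U-1]. All class names, the hash family, and the width and depth values are illustrative assumptions rather than the configuration used in the talk. Layer j groups elements into dyadic intervals of length 2^j, and rank(x) sums one Count-Min estimate per set bit of x, since [0, x) splits into at most log U dyadic intervals.

```python
import random

class CountMin:
    """Count-Min sketch with linear hash functions modulo a Mersenne prime."""
    P = (1 << 61) - 1

    def __init__(self, width, depth, rng):
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]
        self.hashes = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(depth)]

    def _cols(self, key):
        return [((a * key + b) % self.P) % self.width for a, b in self.hashes]

    def update(self, key, delta):
        for row, col in zip(self.rows, self._cols(key)):
            row[col] += delta

    def estimate(self, key):
        # Never underestimates as long as all true counts are non-negative.
        return min(row[col] for row, col in zip(self.rows, self._cols(key)))

class DyadicCountMin:
    """One Count-Min sketch per dyadic layer of the universe [0, 2**log_u)."""

    def __init__(self, log_u, width, depth, seed=0):
        rng = random.Random(seed)
        self.layers = [CountMin(width, depth, rng) for _ in range(log_u)]

    def update(self, x, delta):
        """delta = +1 for an insertion, -1 for a deletion."""
        for j, cm in enumerate(self.layers):
            cm.update(x >> j, delta)

    def rank(self, x):
        """Estimated number of current elements smaller than x, for 0 <= x < U."""
        total = 0
        for j, cm in enumerate(self.layers):
            if (x >> j) & 1:
                total += cm.estimate((x >> j) - 1)  # left sibling interval at layer j
        return total

rng = random.Random(1)
items = [rng.randrange(1 << 16) for _ in range(3000)]
dcm = DyadicCountMin(log_u=16, width=512, depth=4)
for v in items:
    dcm.update(v, +1)
for v in items[:1000]:
    dcm.update(v, -1)          # turnstile: delete the first 1000 again
remaining = items[1000:]
print(dcm.rank(30000), sum(v < 30000 for v in remaining))
```

Swapping the Count-Min class for a Count sketch with the same interface would give the DCS variant; that substitution is not shown here.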
Slide 66
DCM
• Query: sum of log U sketches
• Count Min sketch has error ε
– log U sketches -> total error ε log U
Slide 67
DCS
• Count Min sketch -> Count sketch [Charikar 2002]
• Count sketch has error ε, but it is unbiased
– log U sketches -> total error ε √(log U)
Slide 68
DCS/DCM
• Space
– DCM: (1/ε) log^2 U
– DCS: (1/ε) log^1.5 U
• Optimization
– The top log(1/ε) layers can be removed
– Smaller -> fewer layers -> faster update/query
Slide 69
[Plot: Average Error]
Slide 70
[Plot: Space (MB) vs. Average Error]
Slide 71
[Plot: Update Time (µs) vs. Average Error]
Slide 72
Conclusions
• Cash Register
– Random
• Probabilistic guarantee
• Fast: cache friendly
– GK
• Deterministic guarantee on error
• Simple
• Turnstile
– DCS
• The best