1
SKETCHING DATA STRUCTURES
Velocity London
3
Kiran Bhattaram
@kiranb
4
Timeline
Motivation
Solving problems
Systems in Production!
Where to go from here
5
Background
1. motivation, history, system models
6
Algorithm Efficiency Axes
time
space
implementation complexity
error probability
7
Classic algorithms
time
space
error probability = 0
8
Probabilistic Algorithms
time
space
error probability
9
If you can tolerate error…
how many IP addresses have we seen?
4 x 10^9 IPv4 addresses => 0.5 GiB to store
vs. 1.5kB with a 2% error
about 358,000x smaller
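The arithmetic behind these numbers can be checked directly. This is a minimal sketch, assuming the exact approach is a bitmap with one bit per possible IPv4 address and that "1.5kB" means 1,500 bytes (both are assumptions; the slide doesn't say).

```python
# One bit per possible IPv4 address (assumed exact-counting scheme).
ADDRESS_SPACE = 2 ** 32            # roughly 4 x 10^9 addresses

bitmap_bytes = ADDRESS_SPACE // 8  # 8 bits per byte
bitmap_gib = bitmap_bytes / 2 ** 30

sketch_bytes = 1500                # "1.5kB", assumed to mean 1,500 bytes
ratio = bitmap_bytes / sketch_bytes

print(f"{bitmap_gib} GiB exact vs. {sketch_bytes} bytes sketched")
print(f"about {ratio:,.0f}x smaller")
```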
10
What are sketches?
probabilistic algorithms
summarize stream of data
streaming data/online queries
11
how they work
stream of data: [ . . . ]
P(n) -> hash! -> uniform distribution
data structure
estimator
guess +/- ε
12
Estimators & Observables
✦ Order statistics:
ex: smallest value seen so far
[10, 11, 10, 01]
✦ Bit-pattern:
ex: longest run of contiguous 0s
10001010
✦ Presence:
ex: is the bit set?
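The three observables can each be written in a line or two. This is an illustrative sketch using the example values from the slide; the function names are hypothetical, not from the talk.

```python
def smallest_seen(values):
    """Order statistic: the smallest value observed so far."""
    return min(values)

def longest_zero_run(bits):
    """Bit-pattern: length of the longest run of contiguous 0s."""
    return max(len(run) for run in bits.split("1"))

def is_bit_set(bitmap, index):
    """Presence: is the bit at `index` (counted from the right) set?"""
    return (bitmap >> index) & 1 == 1

print(smallest_seen([0b10, 0b11, 0b10, 0b01]))  # 1
print(longest_zero_run("10001010"))             # 3
print(is_bit_set(0b0100, 2))                    # True
```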
15
Editor: Features
1. Feed of short stories without duplicates
2. Working vocabulary size (# of unique words)
3. Word length statistics
16
Editor: Analytics Requirements
Fast: want real-time statistics
Okay to be good ~enough
Cheap to run: no data analytics team!
17
Bloom Filters
2. set membership
18
The Problem
is this element in this set?
[ ]
Google Chrome: “is this URL known to be malicious?”
Databases/LSM trees: “is this data on disk?”
Story Feed: “have I read this short story?”
19
Hash Set
hash to a bitmap; test for presence
[ ]
21
Hash Set — Insertion
hash to a bitmap; test for presence
[ ] array of size m
hash(element) mod m picks the bit to set
22
Hash Set — Testing
hash to a bitmap; test for presence
23
Hash Set — Collisions
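Insertion, testing, and collisions for the single-hash bitmap can be sketched as below. The class name `BitmapSet`, the use of SHA-256 as the hash, and the tiny array size are all assumptions made for illustration.

```python
import hashlib

M = 16  # array of size m (tiny on purpose, so collisions are easy to see)

def index_of(item, m=M):
    """hash(item) mod m, using SHA-256 as a stand-in hash function."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % m

class BitmapSet:
    def __init__(self, m=M):
        self.m = m
        self.bits = [0] * m

    def insert(self, item):
        self.bits[index_of(item, self.m)] = 1

    def might_contain(self, item):
        # True for anything that hashes to a set bit, inserted or not.
        return self.bits[index_of(item, self.m)] == 1

seen = BitmapSet()
for i in range(100):
    seen.insert(f"story-{i}")   # hypothetical story IDs

# 100 items went into at most 16 bits, so many of them collided.
print(sum(seen.bits), "bits set for 100 inserted items")
```

Once every bit is set, `might_contain` answers True for any input at all, which is why a single hash into a small bitmap is not enough on its own.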
24
The system now
25
Scaling the system
x100
26
Intuition 1: don’t store the entire object!
false positives!
$P(\text{bit} = 0) = \left(1 - \frac{1}{m}\right)^{n}$
m: bits in the array
n: number of elements inserted
27
Intuition 2: Multiple Hashing!
run through k independent hash functions
h1(x), h2(x), h3(x)
Bloom, Burton H. (1970), "Space/Time Trade-offs in Hash Coding with Allowable Errors"
Bloom Filter!
28
Bloom Filter — Testing!
hash to a bitmap; test for presence
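Putting the two intuitions together, a Bloom filter can be sketched in a few lines. This is a minimal illustration, not a production implementation: the k hash functions are simulated by salting SHA-256 with the function number (an assumption), and the story ID is made up.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m = m                 # number of bits in the array
        self.k = k                 # number of hash functions
        self.bits = bytearray(m)   # one byte per bit, for readability

    def _indexes(self, item):
        # Derive k hash functions by salting SHA-256 with the function number.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def might_contain(self, item):
        # All k bits set: "probably seen". Any bit clear: "definitely not".
        return all(self.bits[idx] for idx in self._indexes(item))

read_stories = BloomFilter(m=1024, k=3)
read_stories.add("the-lottery")
print(read_stories.might_contain("the-lottery"))  # True: no false negatives
```

A story that was never added will almost always test False here, but the filter can only promise "definitely not" or "probably yes".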
31
Bloom Filter — Error Rates!
false positives!
[plot: false positive probability vs. number of hash functions (k), with a minimum at the optimal k]
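The shape of that curve can be reproduced with the usual textbook approximation, which treats the bits as independent: the false positive rate is (1 - (1 - 1/m)^(kn))^k and the minimum sits near k = (m/n) ln 2. The filter size and item count below are arbitrary example values, not numbers from the talk.

```python
import math

def false_positive_rate(m, n, k):
    """Approximate false positive rate for m bits, n items, k hashes."""
    p_bit_zero = (1 - 1 / m) ** (k * n)   # a given bit is still 0
    return (1 - p_bit_zero) ** k          # all k probed bits are 1

def optimal_k(m, n):
    """Number of hash functions that minimizes the rate (before rounding)."""
    return (m / n) * math.log(2)

m, n = 8000, 1000   # example values: 8 bits per stored item
for k in (1, 2, 4, 6, 8, 12):
    print(k, round(false_positive_rate(m, n, k), 4))
print("optimal k is about", round(optimal_k(m, n), 2))
```

Too few hash functions leave the filter under-used; too many fill the bitmap with 1s, so the rate falls and then rises again.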
32
Bloom Filters: a summary
No false negatives
Smaller memory footprint (store 4-8 bits vs. entire obj)
Small (and tunable!) false positive rate
Can’t retrieve or delete items
33
how they work: Bloom Filters
stream of data: [ . . . ]
P(n) -> hash! -> uniform distribution
data structure: bitmap
estimator: presence
guess +/- ε
34
Story Feed: feed architecture
35
Merging Bloom Filters
Bitwise OR the two bitmaps = the merged filter
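A merge is a one-liner once both filters are plain bit lists. The sketch below assumes the two filters were built with the same size m and the same hash functions, which the merge needs in order to be meaningful; the variable names are illustrative.

```python
def merge(bits_a, bits_b):
    """Union of two Bloom filters that share m and their hash functions."""
    if len(bits_a) != len(bits_b):
        raise ValueError("filters must have the same size m")
    return [a | b for a, b in zip(bits_a, bits_b)]

server_1 = [1, 0, 0, 1, 0, 0, 0, 0]
server_2 = [0, 0, 1, 1, 0, 0, 1, 0]
print(merge(server_1, server_2))  # [1, 0, 1, 1, 0, 0, 1, 0]
```

The merged filter answers membership queries as if every item had been inserted into a single filter.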
36
An extension: Counting Bloom Filters
allows for deletions
Fan, Li et al. (2000), "Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol"
[figure: an array of counters, all starting at 0; each insert increments k counters, each delete decrements them]
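Replacing each bit with a small counter is all it takes to support deletes. This is a rough sketch under the same assumptions as a plain Bloom filter (salted SHA-256 standing in for k independent hashes); the class and method names are hypothetical.

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counts = [0] * m   # counters instead of bits

    def _indexes(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for idx in self._indexes(item):
            self.counts[idx] += 1

    def remove(self, item):
        # Only safe for items that were actually added; removing anything
        # else would decrement counters that belong to other items.
        if not self.might_contain(item):
            raise KeyError(item)
        for idx in self._indexes(item):
            self.counts[idx] -= 1

    def might_contain(self, item):
        return all(self.counts[idx] > 0 for idx in self._indexes(item))

cbf = CountingBloomFilter(m=1024, k=3)
cbf.add("story-a")
cbf.add("story-b")
cbf.remove("story-a")
print(cbf.might_contain("story-b"))  # True: story-b's counters are untouched
```

The price is memory: each slot now needs several bits rather than one.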
37
An extension: Count-Min Sketch
keep a count of the frequency of items seen
min() estimator
[figure: three rows of counters, one per hash function h1, h2, h3]
Cormode, Graham (2009). "Count-min sketch"
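The rows-of-counters layout and the min() estimator can be sketched as follows. The width, depth, and salted SHA-256 hashing are assumptions chosen for illustration, not parameters from the talk.

```python
import hashlib

class CountMinSketch:
    def __init__(self, width, depth):
        self.width = width   # counters per row
        self.depth = depth   # number of rows, one hash function per row
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.rows[r][self._index(r, item)] += count

    def estimate(self, item):
        # Collisions only ever inflate a counter, so the smallest counter
        # is the least-contaminated one; the estimate never undercounts.
        return min(self.rows[r][self._index(r, item)]
                   for r in range(self.depth))

words = CountMinSketch(width=256, depth=3)
for word in "the cat sat on the mat with the hat".split():
    words.add(word)
print(words.estimate("the"))  # at least 3
```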
40
Editor: Text Analytics
denote read stories (Bloom filters!)
count unique words used/unique users
41
The Problem: Cardinality
number of unique values in a collection
[ ]
advertising: number of “uniques”
traffic modeling: # of unique IP addresses
natural language processing: number of unique words
42
Measuring Cardinality
size = N (number of unique values)
(ex) IPv4: 2^32 bits = 0.5 GiB
43
The system now
44
The Paper
Flajolet, Fusy, et al. 2007
45
Flipping coins!
46
Flipping coins!
seeing a rare combination
=> I’ve seen a lot of trials!
47
Bit patterns!
0101
1010
0010
0001
1100
1011
0101 1011 1010
run of 3 leading 0s (0001)
=> likely seen 2^3 = 8 numbers!
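The bit-pattern guess can be written down directly: find the longest run of leading 0s and raise 2 to that power. This sketch uses the first six 4-bit values from the slide as input and treats them as already-hashed values; the function names are illustrative.

```python
def leading_zeros(bits):
    """Number of 0s before the first 1 in a bit string."""
    return len(bits) - len(bits.lstrip("0"))

def rough_cardinality(hashes):
    """Single-observation estimate: 2^(longest run of leading 0s)."""
    longest_run = max(leading_zeros(h) for h in hashes)
    return 2 ** longest_run

seen = ["0101", "1010", "0010", "0001", "1100", "1011"]
print(rough_cardinality(seen))  # 8, from the run of three 0s in "0001"
```

One unlucky hash can throw this estimate off by a large factor, which is what the accuracy techniques that follow are meant to address.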
48
Making a uniform distribution
Hashing!
(ex: murmurhash)
01
10
11
49
But the cardinality estimate could be so wrong!
Techniques for increasing accuracy
x 3 trials: ~8 friends, ~4 friends, ~4 friends
average = ~5.33 friends
50
The Algorithm
an array of m=8 one-byte registers
000 001 010 011 100 101 110 111
hashed value: 010 00010
Bucket: the first log2(8) = 3 bits (010)
count leading 0s in the rest (00010): 3
52
The Algorithm
1 2 3 5 2 3 1 2
000 001 010 011 100 101 110 111
take the harmonic mean of 2^(register value) over all of these!
estimate = m * harmonic mean = 8 * 3.93 = 31.5 (I used 28 values)
Plus corrections for small and large values!
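The bucketing step and the harmonic-mean estimate can be sketched together. This follows the slide's simplified arithmetic: the published algorithm also applies a bias-correction constant and stores the position of the first 1 bit rather than the count of leading 0s, and both are left out here. The 32-bit hash taken from SHA-256 and the function names are assumptions.

```python
import hashlib

BUCKET_BITS = 3   # log2(8): the first 3 bits pick one of m=8 registers
HASH_BITS = 32

def add(registers, item):
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    h = int.from_bytes(digest[:4], "big")        # 32-bit hash
    rest_bits = HASH_BITS - BUCKET_BITS
    bucket = h >> rest_bits                      # first 3 bits
    rest = h & ((1 << rest_bits) - 1)            # remaining 29 bits
    zeros = rest_bits - rest.bit_length()        # leading 0s in the rest
    registers[bucket] = max(registers[bucket], zeros)

def estimate(registers):
    m = len(registers)
    harmonic_mean = m / sum(2.0 ** -r for r in registers)
    return m * harmonic_mean

# The register values from the slide:
print(round(estimate([1, 2, 3, 5, 2, 3, 1, 2]), 1))  # 31.5
```

The harmonic mean is used because a single register with an unusually long run of 0s would dominate an ordinary average.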
78
A brief list of other sketches
• Skip Lists
• frequency: count-min sketch, heavy hitters, etc
• membership: Bloom filters, Cuckoo hashing
• cardinality: hyperloglog
• geometric data: coresets, locality-sensitive hashing
79
tl;dr — error is a tradeoff in algorithms
approximations are often Good Enough
and a hell of a lot cheaper
80
Thanks!
@kiranb kiranbot.com
81
Appendix!
82
Database Semi-Joins
city author
1 Kiran
2
or peer-to-peer networks!
83
Small Value Corrections
1 3 2
Estimate = m * ln(m / # of un-initialized registers)
= ~3.75 values
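The small-value formula can be checked numerically. The sketch below assumes m = 8 registers with three of them set (the values 1, 3, 2 shown on the slide) and five still at 0; the register positions are made up, and the slide's ~3.75 matches this only under that assumption.

```python
import math

def small_range_estimate(registers):
    """m * ln(m / number of registers still at 0)."""
    m = len(registers)
    empty = registers.count(0)
    if empty == 0:
        raise ValueError("no un-initialized registers; use the main estimate")
    return m * math.log(m / empty)

registers = [1, 0, 3, 0, 0, 2, 0, 0]   # hypothetical layout, 5 of 8 empty
print(round(small_range_estimate(registers), 2))  # 3.76
```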
84
Large Value Corrections
as the number of unique values approaches 2^(2^m), you start seeing hash collisions!
=> use a 64 bit hash & more bits in the registers!