Velocity London 2017 - A tour of sketching data structures

1 S K E T C H I N G
D ATA S T R U C T U R E S Velocity London

2 S K E T C H I N G
D ATA S T R U C T U R E S Velocity London

3 Kiran Bhattaram @kiranb

4 Timeline W h e r e t o g
o f r o m h e r e S o l v i n g p r o b l e m s M o t i v a t i o n S y s t e m s i n P r o d u c t i o n !

5 Background 1 motivation, history, system models

6 Algorithm Efficiency Axes time space

6 Algorithm Efficiency Axes time space error probability

6 Algorithm Efficiency Axes time space implementation complexity error probability

7 Classic algorithms time space error probability = 0

8 Probabilistic Algorithms time space error probability

9 If you can tolerate error… 4 x 109 =>
0.5 GiB to store IPv4 addresses how many IP addresses have we seen?

0.5 GiB to store IPv4 addresses vs. 1.5kB with a 2% error how many IP addresses have we seen?

0.5 GiB to store IPv4 addresses vs. 1.5kB with a 2% error how many IP addresses have we seen? x 358,000

10 What are sketches?

10 What are sketches? probabilistic algorithms

10 What are sketches? probabilistic algorithms summarize stream of data

10 What are sketches? probabilistic algorithms summarize stream of data
streaming data/online queries

11 how they work

11 how they work [ ] stream of data .
. .

11 how they work P(n) hash! uniform distribution [ ]
stream of data . . .

11 how they work P(n) hash! uniform distribution [ ]
stream of data . . . data structure

11 how they work P(n) hash! uniform distribution estimator [
] stream of data . . . data structure

11 how they work P(n) hash! uniform distribution estimator guess
+/- ε [ ] stream of data . . . data structure

12 Estimators & Observables ✦ Order statistics: [10, 11, 10,
01] ex: smallest value seen so far ✦ Bit-pattern: ex: longest run of contiguous 0s 10001010 ✦ Presence: ex: is the bit set?

13 But also! Horizontal Scalability! :( :) :) :) :)
:) :) :)

14 A Case Study: Story Reader

15 Editor: Features

15 Editor: Features 1. Feed of short stories without duplicates

2. Working vocabulary size (# of unique words)

2. Working vocabulary size (# of unique words) 3. Word length statistics

16 Editor: Analytics Requirements Fast: want real-time statistics Okay to
be good ~enough Cheap to run: no data analytics team!

17 Bloom Filters 2 set membership

18 The Problem is this element in this set? [
]

18 The Problem Google Chrome: ”is this URL known to
be malicious?" is this element in this set? [ ]

be malicious?" is this element in this set? [ ] Databases/LSM trees: “is this data on disk?”

be malicious?" is this element in this set? [ ] Databases/LSM trees: “is this data on disk?” Story Feed: “have I read this short story?”

19 Hash Set hash to a bitmap; test for presence
[ ]

20 Hash Functions

20 Hash Functions 34248a9bfcbd589d 9b5fccb6a0ac6963 2fc01ec765ec0cb3 dcc559126de20b30 1. Deterministic

20 Hash Functions 34248a9bfcbd589d 9b5fccb6a0ac6963 2fc01ec765ec0cb3 dcc559126de20b30 1. Deterministic 2.
Uniform P(n)

21 Hash Set — Insertion hash to a bitmap; test
for presence [ ] array of size m

21 Hash Set — Insertion hash to a bitmap; test
for presence [ ] array of size m hash ( ) mod m

22 Hash Set — Testing hash to a bitmap; test
for presence

23 Hash Set — Collisions

24 The system now

25 Scaling the system x100

26 Intuition 1: don’t store the entire object! false positives!
m bits in the array P(bit = 0)

( )n m bits in the array P(bit = 0)

( )n number of elements inserted m bits in the array P(bit = 0)

( )n 1 - number of elements inserted m bits in the array P(bit = 0)

27 Intuition 2 — Multiply Hashing! run through k independent
hash functions Bloom, Burton H. (1970), "Space/Time Trade-offs in Hash Coding with Allowable Errors"

hash functions h1(x) h2(x) h3(x) Bloom, Burton H. (1970), "Space/Time Trade-offs in Hash Coding with Allowable Errors"

hash functions Bloom, Burton H. (1970), "Space/Time Trade-offs in Hash Coding with Allowable Errors"

27 run through k independent hash functions Bloom, Burton H.
(1970), "Space/Time Trade-offs in Hash Coding with Allowable Errors" Bloom Filter!

28 Bloom Filter — Testing! hash to a bitmap; test
for presence

29 Bloom Filter — Testing! false positives!

30 Bloom Filter — Error Rates! false positives! oooh!

31 Bloom Filter — Error Rates! false positives! number of
hash functions (k) false positive possibility optimal k!

32 Bloom Filters: a summary No false negatives Smaller memory
footprint  (store 4-8 bits vs. entire obj) Small (and tunable!) false positive rate Can’t retrieve or delete items

33 how they work: Bloom Filters [ ] stream of
data . . .

data . . . P(n) hash! uniform distribution

data . . . P(n) hash! uniform distribution data structure: bitmap

data . . . P(n) hash! uniform distribution data structure: bitmap estimator: presence

data . . . P(n) hash! uniform distribution data structure: bitmap estimator: presence guess +/- ε

34 Story Feed: feed architecture

35 Merging Bloom Filters Bitwise OR =

36 An extension: Counting Bloom Filters allows for deletions Fan,
Li et al. (2000), "Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol" 0 0 0 0 0 0 0 0 0

36 An extension: Counting Bloom Filters allows for deletions 1
1 1 Fan, Li et al. (2000), "Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol" 0 0 0 0 0 0

1 1 1 1 1 Fan, Li et al. (2000), "Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol" 0 0 0

2 1 1 1 1 1 Fan, Li et al. (2000), "Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol" 0 0

1 1 1 Fan, Li et al. (2000), "Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol" 1 0 0 0 0

2 1 2 1 37 An extension: Count-Min Sketch keep
a count of the frequency of items seen min() estimator 2 3 1 2 4 h1 h2 h3 Cormode, Graham (2009). "Count-min sketch"

38 Bloom Filters: A Summary

38 Bloom Filters: A Summary • Hash Sets —> Bloom
Filters

Filters • bits & multiple hashing!

Filters • bits & multiple hashing! • Extensions: Counting

Filters • bits & multiple hashing! • Extensions: Counting • Extensions: Count-min sketch

39 Hyper Log Log 3 counting uniques

40 Editor: Text Analytics denote read stories (Bloom filters!) count
unique words used/unique users

41 The Problem: Cardinality number of unique values in a
collection [ ]

41 The Problem: Cardinality number of unique values in a
collection [ ] advertising: number of “uniques”

41 The Problem: Cardinality trafﬁc modeling: # of unique IP
addresses number of unique values in a collection [ ] advertising: number of “uniques”

41 The Problem: Cardinality trafﬁc modeling: # of unique IP
addresses number of unique values in a collection [ ] advertising: number of “uniques” natural language processing: number of unique words

42 Measuring Cardinality size = N (number of unique values)

(ex) IPv4: 232 bits

(ex) IPv4: 232 bits 0.5 GiB

43 The system now

44 The Paper Flajolet, Fusy, et al. 2007

45 Flipping coins!

46 Flipping coins! seeing a rare combination => I’ve seen
a lot of trials!

47 Bit patterns! 0101 1010 0010 0001 1100 1011 0101
1011 1010 run of 3 0s => likely seen 8 numbers!

48 Making a uniform distribution Hashing! (ex: murmurhash) 01 10
11

49 But the cardinality estimate could be so wrong! Techniques
for increasing accuracy ~8 friends ~4 friends ~4 friends = ~5.33 friends x 3 trials

50 The Algorithm 000 001 010 011 100 101 110
111 a register of m=8 bytes

50 The Algorithm 000 001 010 011 100 101 110
111 a register of m=8 bytes 010 00010

50 The Algorithm Bucket the ﬁrst log2 8 bits 000
001 010 011 100 101 110 111 a register of m=8 bytes 010 00010

50 The Algorithm Bucket the ﬁrst log2 8 bits 000
001 010 011 100 101 110 111 count leading 0s a register of m=8 bytes 010 00010

50 The Algorithm 3 Bucket the ﬁrst log2 8 bits
000 001 010 011 100 101 110 111 count leading 0s a register of m=8 bytes 010 00010

51 The Algorithm 110 01111 111 00100 111 00111 110
01010 011 00000 100 00100 101 00011 101 01010 010 00011 000 01001 001 00111 001 01111 1 2 3 5 2 3 1 2 000 001 010 011 100 101 110 111 … …

52 The Algorithm 1 2 3 5 2 3 1
2 000 001 010 011 100 101 110 111 take the harmonic mean of all of these!

2 000 001 010 011 100 101 110 111 take the harmonic mean of all of these! = 8 * 3.93 = 31.5

2 000 001 010 011 100 101 110 111 take the harmonic mean of all of these! = 8 * 3.93 = 31.5 (I used 28 values)

2 000 001 010 011 100 101 110 111 take the harmonic mean of all of these! = 8 * 3.93 = 31.5 (I used 28 values) Plus corrections for small and large values!

53 Merging Hyper Log Logs 1 2 3 5 2
1 4 8 max() for each register 2 2 4 8 =

54 Hyper Log Log — Error Rates! over and under
estimating Cardinality Space required (m) Error 109 1.5kB ~2% (vs. 0.5GiB!)

55 how they work: Hyper Log Logs [ ] stream
of data . . .

of data . . . P(n) hash! uniform distribution

of data . . . P(n) hash! uniform distribution data structure 1 2 3 5 2 3 1 2 000 001 010 011 100 101 110 111

of data . . . P(n) hash! uniform distribution estimator (run of 0s) data structure 1 2 3 5 2 3 1 2 000 001 010 011 100 101 110 111

of data . . . P(n) hash! uniform distribution estimator (run of 0s) guess +/- ε data structure 1 2 3 5 2 3 1 2 000 001 010 011 100 101 110 111

56 Estimating your working vocabulary

57 HyperLogLog: A Summary

57 HyperLogLog: A Summary • uniform distributions!

57 HyperLogLog: A Summary • uniform distributions! • log! log!
space!

57 HyperLogLog: A Summary • uniform distributions! • log! log!
space! • commutativity!

58 t-digests 4 estimating quantiles

59 Editor: Text Analytics denote read stories (Bloom filters!) count
unique words used estimate percentiles for word length

60 The Algorithm - CDFs! Value Frequency

60 The Algorithm - CDFs! Value Frequency Value Probability

60 The Algorithm - CDFs! Value Frequency Value Probability 0.5

0.95

0.95 0.99

61 The Algorithm - compression q-digests

61 The Algorithm - compression q-digests Value Probability

61 The Algorithm - compression q-digests Value Probability 0.5

61 The Algorithm - compression q-digests Value Probability 0.5 0.95

61 The Algorithm - compression q-digests Value Probability 0.5 0.95
0.99

62 t digest — Error Rates! non-constant error rates

63 how they work - t digests

63 how they work - t digests [ ] stream
of data . . .

of data . . . data structure

of data . . . estimator data structure

of data . . . estimator guess +/- ε data structure

64 Estimating your word length quantiles

65 in production 5 Veneur, Summingbird, Sawzall

66 Veneur

67 Analytics Overall API Latency (p90, p95, p99)

68 previously, on Stripe monitoring => no global statistics!

69 previously, on Stripe monitoring => UDP packet drops :(
:(

70 previously, on Stripe monitoring aggregation! :)

71 Sketches! HyperLogLogs (sets) t-digests (quantiles/histograms)

72 Algebird

73 Sketches HyperLogLogs (sets) Bloom Filters Count-min sketch MinHasher and
more!

74 Google Sawzall

75 Sketches! HyperLogLogs (sets) Munro & Peterson quantile estimation (quantiles)

76 Where to go from here 6 evaluating sketches, further
resources

77 Evaluating Sketches • performance • error rate • distortion
• tuning

78 A brief list of other sketches • Skip Lists
• frequency: count-min sketch, heavy hitters, etc • membership: Bloom filters, Cuckoo hashing • cardinality: hyperloglog • geometric data: coresets, locality-sensitive hashing

79 tl;dr — error is a tradeoff in algorithms approximations
are often Good Enough and a hell of a lot cheaper

80 Thanks! @ k i r a n b kiranbot.com

81 Appendix!

82 Database Semi-Joins city author 1 Kiran 2 or peer-to-peer
networks!

83 Small Value Corrections 1 3 2

83 Small Value Corrections 1 3 2 Estimate = m*log(m/#
of un-init registers) = ~ 3.75 values

84 Large Value Corrections

84 Large Value Corrections as the number of unique values
approaches 2^(2^m), you start seeing hash collisions!

84 Large Value Corrections as the number of unique values
approaches 2^(2^m), you start seeing hash collisions! => use a 64 bit hash & more bits in the registers!

85 History query cost estimation

86 Sphere: dealing with fraudsters :(

87 Sphere: building a fraud shield :)

Velocity London 2017 - A tour of sketching data...

Velocity London 2017 - A tour of sketching data structures

More Decks by Kiran Bhattaram

Other Decks in Technology

Featured

Transcript