Velocity London 2017 - A tour of sketching data structures

Slide 1

Slide 1 text

1 SKETCHING DATA STRUCTURES Velocity London

Slide 2

Slide 2 text

2 SKETCHING DATA STRUCTURES Velocity London

Slide 3

Slide 3 text

3 Kiran Bhattaram @kiranb

Slide 4

Slide 4 text

4 Timeline: Where to go from here / Solving problems / Motivation / Systems in Production!

Slide 9

Slide 9 text

5 Part 1. Background: motivation, history, system models

Slide 10

Slide 10 text

6 Algorithm Efficiency Axes: time, space

Slide 11

Slide 11 text

6 Algorithm Efficiency Axes: time, space, error probability

Slide 12

Slide 12 text

6 Algorithm Efficiency Axes: time, space, implementation complexity, error probability

Slide 13

Slide 13 text

7 Classic algorithms: time, space; error probability = 0

Slide 14

Slide 14 text

8 Probabilistic Algorithms: time, space, error probability

Slide 15

Slide 15 text

9 If you can tolerate error… How many IP addresses have we seen? 4 x 10^9 IPv4 addresses => 0.5 GiB to store

Slide 16

Slide 16 text

9 If you can tolerate error… How many IP addresses have we seen? 4 x 10^9 IPv4 addresses => 0.5 GiB to store vs. 1.5 kB with a 2% error

Slide 17

Slide 17 text

9 If you can tolerate error… How many IP addresses have we seen? 4 x 10^9 IPv4 addresses => 0.5 GiB to store vs. 1.5 kB with a 2% error: a factor of about 358,000
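The arithmetic behind these figures can be checked with a few lines of Python. This is a back-of-the-envelope sketch, not something from the slides: it assumes the 0.5 GiB figure corresponds to a bitmap with one bit per possible IPv4 address, and that 1.5 kB means 1,500 bytes.

```python
# Back-of-the-envelope check of the slide's numbers.
IPV4_ADDRESSES = 2 ** 32             # about 4 x 10^9 possible addresses

# Assumption: exact counting uses one bit per possible address.
bitmap_bytes = IPV4_ADDRESSES // 8
bitmap_gib = bitmap_bytes / 2 ** 30

# Assumption: the slide's 1.5 kB is 1,500 bytes.
sketch_bytes = 1500
ratio = bitmap_bytes / sketch_bytes

print(f"exact bitmap: {bitmap_gib} GiB")
print(f"space saving: about {round(ratio, -3):,.0f}x")
```

Under those assumptions the ratio rounds to the 358,000 shown on the slide.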

Slide 18

Slide 18 text

10 What are sketches?

Slide 19

Slide 19 text

10 What are sketches? probabilistic algorithms

Slide 20

Slide 20 text

10 What are sketches? probabilistic algorithms summarize stream of data

Slide 21

Slide 21 text

10 What are sketches? probabilistic algorithms summarize stream of data streaming data/online queries

Slide 22

Slide 22 text

11 how they work

Slide 23

Slide 23 text

11 how they work [ ] stream of data . . .

Slide 24

Slide 24 text

11 how they work: [ ] stream of data . . . -> hash! (uniform distribution, P(n))

Slide 25

Slide 25 text

11 how they work: [ ] stream of data . . . -> hash! (uniform distribution, P(n)) -> data structure

Slide 26

Slide 26 text

11 how they work: [ ] stream of data . . . -> hash! (uniform distribution, P(n)) -> data structure -> estimator

Slide 27

Slide 27 text

11 how they work: [ ] stream of data . . . -> hash! (uniform distribution, P(n)) -> data structure -> estimator -> guess +/- ε

Slide 28

Slide 28 text

12 Estimators & Observables
✦ Order statistics: [10, 11, 10, 01] ex: smallest value seen so far
✦ Bit-pattern: 10001010 ex: longest run of contiguous 0s
✦ Presence: ex: is the bit set?
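The three observables can be tracked in one pass over a stream. The sketch below is illustrative only: `hash32`, `longest_zero_run`, and `observe` are hypothetical helper names, and the 32-bit hash width and 64-bucket bitmap are arbitrary choices, not values from the talk.

```python
import hashlib

def hash32(item: str) -> int:
    """Hypothetical helper: a 32-bit hash from the first 4 bytes of SHA-256."""
    return int.from_bytes(hashlib.sha256(item.encode()).digest()[:4], "big")

def longest_zero_run(value: int, width: int = 32) -> int:
    """Longest run of contiguous 0 bits in the fixed-width binary form."""
    bits = format(value, f"0{width}b")
    return max(len(run) for run in bits.split("1"))

def observe(stream):
    """Track one observable of each kind over the hashed stream."""
    smallest = None   # order statistic: smallest hash seen so far
    longest_run = 0   # bit pattern: longest run of 0s in any hash
    bitmap = 0        # presence: one bit per bucket (64 buckets, arbitrary)
    for item in stream:
        h = hash32(item)
        smallest = h if smallest is None else min(smallest, h)
        longest_run = max(longest_run, longest_zero_run(h))
        bitmap |= 1 << (h % 64)
    return smallest, longest_run, bitmap

# The slide's bit-pattern example: 10001010 has a longest run of three 0s.
print(longest_zero_run(0b10001010, width=8))
```

Each observable depends only on the set of hashed values, so seeing the same item twice changes nothing. That is what makes these quantities usable for questions about distinct elements.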

Slide 29

Slide 29 text

13 But also! Horizontal Scalability! :( :) :) :) :) :) :) :)
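One common reading of this slide is that sketches can be built independently on separate machines and then merged. The slide itself does not spell this out, so treat the example as an assumption. It uses a simple presence bitmap; `bucket`, `build_bitmap`, and the bucket count are hypothetical.

```python
import hashlib

BUCKETS = 1024  # illustrative size, not from the talk

def bucket(item: str) -> int:
    """Map an item to one of BUCKETS positions via SHA-256."""
    digest = hashlib.sha256(item.encode()).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS

def build_bitmap(items) -> int:
    """Set one bit per item; the integer acts as the bitmap."""
    bitmap = 0
    for item in items:
        bitmap |= 1 << bucket(item)
    return bitmap

# Two workers each see part of the stream.
shard_a = ["alpha", "beta", "gamma"]
shard_b = ["gamma", "delta"]

# Merging is a bitwise OR, and matches a single pass over everything.
merged = build_bitmap(shard_a) | build_bitmap(shard_b)
single = build_bitmap(shard_a + shard_b)
print(merged == single)
```

Because OR is associative and commutative, the shards can be merged in any order.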

Slide 30

Slide 30 text

14 A Case Study: Story Reader

Slide 31

Slide 31 text

15 Editor: Features

Slide 32

Slide 32 text

15 Editor: Features 1. Feed of short stories without duplicates

Slide 33

Slide 33 text

15 Editor: Features
1. Feed of short stories without duplicates
2. Working vocabulary size (# of unique words)

Slide 34

Slide 34 text

15 Editor: Features
1. Feed of short stories without duplicates
2. Working vocabulary size (# of unique words)
3. Word length statistics

Slide 35

Slide 35 text

16 Editor: Analytics Requirements
✦ Fast: want real-time statistics
✦ Okay to be good ~enough
✦ Cheap to run: no data analytics team!

Slide 36

Slide 36 text

17 Part 2. Bloom Filters: set membership

Slide 37

Slide 37 text

18 The Problem: is this element in this set? [ ]

Slide 38

Slide 38 text

18 The Problem: is this element in this set? [ ] Google Chrome: “is this URL known to be malicious?”

Slide 39

Slide 39 text

18 The Problem: is this element in this set? [ ] Google Chrome: “is this URL known to be malicious?” Databases/LSM trees: “is this data on disk?”

Slide 40

Slide 40 text

18 The Problem: is this element in this set? [ ] Google Chrome: “is this URL known to be malicious?” Databases/LSM trees: “is this data on disk?” Story Feed: “have I read this short story?”

Slide 41

Slide 41 text

19 Hash Set hash to a bitmap; test for presence [ ]
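A minimal sketch of "hash to a bitmap; test for presence" might look like the following. The class name `BitmapSet`, the bitmap size, and the use of SHA-256 are assumptions made for illustration; the talk does not specify an implementation.

```python
import hashlib

class BitmapSet:
    """Approximate set: each item hashes to a single bit in a bitmap."""

    def __init__(self, num_bits: int = 1 << 16):
        self.num_bits = num_bits
        self.bits = bytearray((num_bits + 7) // 8)

    def _index(self, item: str) -> int:
        # Hash the item and reduce it to a bit position.
        digest = hashlib.sha256(item.encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        i = self._index(item)
        self.bits[i // 8] |= 1 << (i % 8)

    def __contains__(self, item: str) -> bool:
        i = self._index(item)
        return bool(self.bits[i // 8] & (1 << (i % 8)))

seen = BitmapSet()
seen.add("the-gift-of-the-magi")
print("the-gift-of-the-magi" in seen)  # an added item is always found
```

An item that was added is always reported as present. Two different items can hash to the same bit, though, so an item that was never added may also be reported as present. The structure trades that chance of a false positive for a fixed, small memory footprint.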

Slide 42

Slide 42 text

20 Hash Functions

Slide 43

Slide 43 text

20 Hash Functions 34248a9bfcbd589d 9b5fccb6a0ac6963 2fc01ec765ec0cb3 dcc559126de20b30 1. Deterministic

Slide 44

Slide 44 text

20 Hash Functions 34248a9bfcbd589d 9b5fccb6a0ac6963 2fc01ec765ec0cb3 dcc559126de20b30 1. Deterministic 2. Uniform P(n)
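Both properties can be observed directly. The sketch below uses SHA-256 truncated to 16 hex characters (64 bits) so the output resembles the values on the slide; the hash function actually used in the talk is not stated, and `hash64_hex` and `bucket_counts` are hypothetical names.

```python
import hashlib
from collections import Counter

def hash64_hex(item: str) -> str:
    """First 64 bits of SHA-256, as 16 hex characters."""
    return hashlib.sha256(item.encode()).hexdigest()[:16]

def bucket_counts(items, buckets: int = 8) -> Counter:
    """Count how many hashed items land in each bucket."""
    return Counter(int(hash64_hex(item), 16) % buckets for item in items)

# 1. Deterministic: the same input always gives the same hash.
print(hash64_hex("story") == hash64_hex("story"))

# 2. Uniform: 10,000 inputs spread roughly evenly over 8 buckets.
counts = bucket_counts(f"story-{i}" for i in range(10_000))
for b in sorted(counts):
    print(b, counts[b])
```

With a well-mixed hash, each bucket should hold roughly 1,250 of the 10,000 items. The estimators in a sketch rely on exactly this uniformity.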

Slide 45