Count-Min Sketch, Bloom Filter, TopK: Efficient probabilistic data structures

Slide 1

Slide 1 text

Count-Min Sketch, Bloom Filter, Top K: Efficient probabilistic data structures
Raphael De Lio

Slide 2

Slide 2 text

Who is on Bluesky?

Slide 3

Slide 3 text

Timeline
• 1 September: Bluesky gets 1 million new users

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

One of many things BSKY doesn’t have… didn’t

Slide 9

Slide 9 text

Blooming Trending Topics on Bluesky

Slide 10

Slide 10 text

How many posts per minute on Bluesky?

Slide 11

Slide 11 text

Approximately 2200

Slide 12

Slide 12 text

Approximately 95 million per month

Slide 13

Slide 13 text

How many unique words are mentioned on Bluesky per minute?

Slide 14

Slide 14 text

Approximately 22,000

Slide 15

Slide 15 text

How much memory do I need to analyze this amount of data?

Slide 16

Slide 16 text

• ~2MB per minute
• ~120MB per hour
• ~2.8GB per day
• ~87GB per month
• ~1TB per year
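The progression above is straightforward multiplication. The sketch below reproduces it under two assumptions that are not stated on the slide: a steady 2 MB of new data every minute and a 30-day month. The function name `projected_growth_mb` is made up for this illustration.

```python
# Assumption: a steady ~2 MB of new data per minute (the slide's starting figure)
# and a 30-day month; both are simplifications for illustration.
MB_PER_MINUTE = 2


def projected_growth_mb(mb_per_minute):
    """Return the accumulated size in MB for each period, with no eviction."""
    per_hour = mb_per_minute * 60
    per_day = per_hour * 24
    per_month = per_day * 30
    per_year = per_month * 12
    return {
        "minute": mb_per_minute,
        "hour": per_hour,
        "day": per_day,
        "month": per_month,
        "year": per_year,
    }


growth = projected_growth_mb(MB_PER_MINUTE)
for period, megabytes in growth.items():
    print(f"per {period}: ~{megabytes:,} MB")
```

The results land close to the rounded figures on the slide; the point is that a structure that stores every member keeps growing linearly with time.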

Slide 17

Slide 17 text

Bluesky has 35 million users today

Slide 18

Slide 18 text

What if I had a data structure with fixed size?

Slide 19

Slide 19 text

Probabilistic Data Structures

Slide 20

Slide 20 text

Deterministic vs Probabilistic
Feature | Deterministic | Probabilistic
Accuracy | Always exact | May have false positives/negatives
Memory Consumption | Dynamic | Fixed
Examples | Sets, Lists, Maps, Stacks, Trees | Bloom Filters, Count-Min Sketch, HyperLogLog, TopK, T-Digest

Slide 21

Slide 21 text

A Data Structure Server
• String
• List
• Set
• Sorted Set
• Vector Set
• Hash
• Stream
• Geo
• Bitmaps
• Bitfield
• JSON
• TimeSeries
• Bloom Filter
• Count-Min Sketch
• TopK
• Cuckoo Filter
• T-Digest
• HyperLogLog

Slide 22

Slide 22 text

…

Slide 23

Slide 23 text

Building our own Trending Topics
1. Counting words
2. Deduplicating messages
3. Detecting Spikes

Slide 24

Slide 24 text

Building our own Trending Topics
1. Counting words: Count-Min Sketch
2. Deduplicating messages: Bloom Filter
3. Tracking Spikes: TopK

Slide 25

Slide 25 text

#1 Counting individual terms

Slide 26

Slide 26 text

Sorted Set
• Deterministic
• Stores individual members with a score
• Dynamic Memory Growth

Slide 27

Slide 27 text

Count-Min Sketch
• Probabilistic
• Used to estimate the frequency of elements in a data stream
• Somewhat similar to a Sorted Set
• Fixed Memory
• Trade-off: might give inaccurate estimates (counts can be overestimated)

Slide 28

Slide 28 text

How it works…

Slide 29

Slide 29 text

Count-Min Sketch
• Internally it’s a fixed-size grid (sketch) of width w and depth d
• Each of the d rows belongs to one hash function. The w columns of a row are the counters for that hash function
Example with w = 5 and d = 3:
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0

Slide 30

Slide 30 text

Count-Min Sketch: Incrementing
CMS.INCRBY terms redis 1
Hash1(“redis”) % 5 = 2
Hash2(“redis”) % 5 = 4
Hash3(“redis”) % 5 = 1
Sketch after the increment:
0 0 1 0 0
0 0 0 0 1
0 1 0 0 0

Slide 31

Slide 31 text

Count-Min Sketch: Incrementing
CMS.INCRBY terms pets 1
Hash1(“pets”) % 5 = 0
Hash2(“pets”) % 5 = 3
Hash3(“pets”) % 5 = 1
Sketch after the increment:
1 0 1 0 0
0 0 0 1 1
0 2 0 0 0

Slide 32

Slide 32 text

Count-Min Sketch: Incrementing
CMS.INCRBY terms cats 1
Hash1(“cats”) % 5 = 3
Hash2(“cats”) % 5 = 4
Hash3(“cats”) % 5 = 0
Sketch after the increment:
1 0 1 1 0
0 0 0 1 2
1 2 0 0 0

Slide 33

Slide 33 text

Count-Min Sketch: Incrementing
CMS.INCRBY terms dogs 1
Hash1(“dogs”) % 5 = 2
Hash2(“dogs”) % 5 = 1
Hash3(“dogs”) % 5 = 3
Sketch after the increment:
1 0 2 1 0
0 1 0 1 2
1 2 0 1 0

Slide 34

Slide 34 text

Count-Min Sketch: Querying
CMS.QUERY terms dogs
Hash1(“dogs”) % 5 = 2
Hash2(“dogs”) % 5 = 1
Hash3(“dogs”) % 5 = 3
Counters read from the three rows: 2, 1, 1
Estimate: the minimum of the three, 1

Slide 35

Slide 35 text

Count-Min Sketch: Querying
CMS.QUERY terms redis
Hash1(“redis”) % 5 = 2
Hash2(“redis”) % 5 = 4
Hash3(“redis”) % 5 = 1
Counters read from the three rows: 2, 2, 2
Estimate: 2 (an overestimate: “redis” was incremented once, but other terms collided with it in every row)
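The increment and query steps from the slides can be replayed in a few lines. This is a minimal sketch, not the implementation behind `CMS.INCRBY` or `CMS.QUERY`: the class `CountMinSketch`, its `positions` hook and the SHA-256 based default hashing are all assumptions made for illustration. The `SLIDE_POSITIONS` table hard-codes the hash results shown on the slides so the outcome is reproducible.

```python
import hashlib


class CountMinSketch:
    """Fixed-size grid of counters: `depth` rows (one per hash function), `width` columns."""

    def __init__(self, width, depth, positions=None):
        self.width = width
        self.depth = depth
        self.grid = [[0] * width for _ in range(depth)]
        # `positions` is a hypothetical hook: it maps an item to one column per row.
        self._positions = positions or self._hashed_positions

    def _hashed_positions(self, item):
        # Assumption: derive one hash per row by salting SHA-256 with the row index.
        columns = []
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            columns.append(int.from_bytes(digest[:8], "big") % self.width)
        return columns

    def incr_by(self, item, amount=1):
        # Add `amount` to one counter in every row.
        for row, col in enumerate(self._positions(item)):
            self.grid[row][col] += amount

    def query(self, item):
        # The smallest counter is the one least inflated by collisions.
        return min(self.grid[row][col] for row, col in enumerate(self._positions(item)))


# Hash results copied from the slides (column index per row, width 5).
SLIDE_POSITIONS = {
    "redis": [2, 4, 1],
    "pets": [0, 3, 1],
    "cats": [3, 4, 0],
    "dogs": [2, 1, 3],
}

sketch = CountMinSketch(width=5, depth=3, positions=SLIDE_POSITIONS.__getitem__)
for term in ["redis", "pets", "cats", "dogs"]:
    sketch.incr_by(term, 1)

print(sketch.query("dogs"))   # 1, the true count
print(sketch.query("redis"))  # 2, although "redis" was added only once
```

Taking the minimum across rows is what keeps the error one-sided: with non-negative increments a counter can only be pushed up by collisions, never down.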

Slide 36

Slide 36 text

Count-Min Sketch: Probability
• The width determines the error rate.
• The depth determines the confidence in this error rate.
For a sketch of width 5 and depth 3:
• Error rate: 40%
• Confidence in this error rate: 99.87%
• 99.87% of the time, the counter will be within 40% of the true value
For a sketch of width 2000 and depth 10:
• Error rate: 0.1%
• Confidence in this error rate: 99.99%
• 99.99% of the time, the counter will be within 0.1% of the true value
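Both error rates quoted above are consistent with a simple relation, error rate = 2 / width. That relation is inferred here from the two examples on the slide and should be read as an assumption, not as a documented formula; the helper name `error_rate` is made up. The link between depth and confidence is not reproduced.

```python
def error_rate(width):
    """Error rate implied by a sketch width, assuming error = 2 / width."""
    return 2 / width


for width in (5, 2000):
    print(f"width={width}: error rate {error_rate(width):.1%}")
```

Under this assumption, halving the error rate means doubling the width, which is why the second sketch needs 2000 columns to reach 0.1%.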

Slide 37

Slide 37 text

Demo time!

Slide 38

Slide 38 text

Other use cases
• Logistics System
• Usage Analytics

Slide 39

Slide 39 text

#2 Deduplicating Messages

Slide 40

Slide 40 text

Set
• Deterministic
• Stores individual members
• Dynamic Memory Growth

Slide 41

Slide 41 text

Bloom Filter
• Probabilistic
• Used to test whether an element is possibly in a set
• Somewhat similar to a Set
• Fixed Memory
• Trade-off: might give false positives

Slide 42

Slide 42 text

How it works…

Slide 43

Slide 43 text

Bloom Filter
• Internally it’s a 1D array of fixed size.
• It also works with multiple hash functions.
0 0 0 0 0 0 0 0

Slide 44

Slide 44 text

Bloom Filter: Adding member
BF.ADD stop-words “I’m”
Hash1(“I’m”) % 8 = 2
Hash2(“I’m”) % 8 = 4
Hash3(“I’m”) % 8 = 1
Array after the add:
0 1 1 0 1 0 0 0

Slide 45

Slide 45 text

Bloom Filter: Adding member
BF.ADD stop-words “lol”
Hash1(“lol”) % 8 = 0
Hash2(“lol”) % 8 = 3
Hash3(“lol”) % 8 = 1
Array after the add:
1 1 1 1 1 0 0 0

Slide 46

Slide 46 text

Bloom Filter: Checking if exists
BF.EXISTS stop-words “I’m”
Hash1(“I’m”) % 8 = 2
Hash2(“I’m”) % 8 = 4
Hash3(“I’m”) % 8 = 1
Array: 1 1 1 1 1 0 0 0
All three bits are set, so “I’m” is possibly in the set

Slide 47

Slide 47 text

Bloom Filter: Checking if exists
BF.EXISTS stop-words “devoxx”
Hash1(“devoxx”) % 8 = 5
Hash2(“devoxx”) % 8 = 4
Hash3(“devoxx”) % 8 = 1
Array: 1 1 1 1 1 0 0 0
Bit 5 is not set, so “devoxx” is definitely not in the set

Slide 48

Slide 48 text

Bloom Filter: Checking if exists
BF.EXISTS stop-words “Brazil”
Hash1(“Brazil”) % 8 = 3
Hash2(“Brazil”) % 8 = 4
Hash3(“Brazil”) % 8 = 1
Array: 1 1 1 1 1 0 0 0
All three bits are set although “Brazil” was never added: a false positive
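The add and exists steps from the slides can be replayed as follows. This is a minimal sketch rather than the code behind `BF.ADD` and `BF.EXISTS`: the class `BloomFilter`, the `positions` hook and the SHA-256 based default hashing are assumptions for illustration, and `SLIDE_POSITIONS` hard-codes the hash results shown on the slides.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array; each item sets or checks `num_hashes` positions."""

    def __init__(self, size, num_hashes, positions=None):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size
        # `positions` is a hypothetical hook: it maps an item to its bit indexes.
        self._positions = positions or self._hashed_positions

    def _hashed_positions(self, item):
        # Assumption: derive several hashes by salting SHA-256 with an index.
        indexes = []
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            indexes.append(int.from_bytes(digest[:8], "big") % self.size)
        return indexes

    def add(self, item):
        for index in self._positions(item):
            self.bits[index] = 1

    def might_contain(self, item):
        # False means "definitely not added"; True only means "possibly added".
        return all(self.bits[index] == 1 for index in self._positions(item))


# Hash results copied from the slides (array of 8 bits, 3 hash functions).
SLIDE_POSITIONS = {
    "I'm": [2, 4, 1],
    "lol": [0, 3, 1],
    "devoxx": [5, 4, 1],
    "Brazil": [3, 4, 1],
}

stop_words = BloomFilter(size=8, num_hashes=3, positions=SLIDE_POSITIONS.__getitem__)
stop_words.add("I'm")
stop_words.add("lol")

print(stop_words.might_contain("I'm"))     # True, it was added
print(stop_words.might_contain("devoxx"))  # False, bit 5 was never set
print(stop_words.might_contain("Brazil"))  # True, a false positive
```

Because bits are only ever set and never cleared, a negative answer is always reliable; only positive answers need a second look.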

Slide 49

Slide 49 text

Demo time!

Slide 50

Slide 50 text

Other use cases
• Username Availability Checker
• Early Fraud or Spam Detection

Slide 51

Slide 51 text

#3 Tracking Spikes

Slide 52

Slide 52 text

Sorted Set
• Deterministic
• Stores individual members with a score
• Dynamic Memory Growth

Slide 53

Slide 53 text

TopK
• Probabilistic
• Used to track the most frequent elements in a data stream
• Somewhat similar to a Sorted Set & Count-Min Sketch
• Fixed Memory
• Trade-off: might give imprecise results

Slide 54

Slide 54 text

Top K: How it works
• It’s a 1D array of K slots
• It also works with multiple hash functions
• It receives a decay
Slots: 0 1 2 3 4
decay: 0.9

Slide 55

Slide 55 text

Top K: Incrementing counter
TOPK.INCRBY spiking-now “redis” 1
Hash(“redis”) % 5 = 1
Slots: 0 1 2 3 4
Slot 1: redis: 1
decay: 0.9

Slide 56

Slide 56 text

Top K: Incrementing counter
TOPK.INCRBY spiking-now “devoxx” 1
Hash(“devoxx”) % 5 = 2
Slots: 0 1 2 3 4
Slot 1: redis: 1
Slot 2: devoxx: 1
decay: 0.9

Slide 57

Slide 57 text

Top K: Incrementing counter
TOPK.INCRBY spiking-now “devoxx” 1
Hash(“devoxx”) % 5 = 2
Slots: 0 1 2 3 4
Slot 1: redis: 1
Slot 2: devoxx: 1 → devoxx: 2
decay: 0.9

Slide 58

Slide 58 text

Top K: Incrementing counter
TOPK.INCRBY spiking-now “java” 1
Hash(“java”) % 5 = 2
Slots: 0 1 2 3 4
Slot 1: redis: 1
Slot 2: devoxx: 2 → 2 * 0.9 = 1.8 → devoxx: 1
Incoming: java: 1
decay: 0.9

Slide 59

Slide 59 text

Top K: Incrementing counter
TOPK.INCRBY spiking-now “java” 1
Hash(“java”) % 5 = 2
Slots: 0 1 2 3 4
Slot 1: redis: 1
Slot 2: devoxx: 1 → 1 * 0.9 = 0.9 → java: 1
decay: 0.9
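The walk-through above can be replayed with a small model. This is one reading of the slides, not the algorithm behind `TOPK.INCRBY`: it assumes a single hash function, that a decayed count is rounded down (1.8 becomes 1), and that a newcomer takes over a slot once the occupant's count reaches 0. The class `DecayingTopK` and the `SLIDE_SLOTS` table are names invented for this illustration; real implementations may apply the decay differently, for example probabilistically.

```python
import math


class DecayingTopK:
    """K slots; a colliding item wears down the occupant's count via `decay`."""

    def __init__(self, k, decay, slot_for):
        self.decay = decay
        self.slots = [None] * k      # each slot is None or [item, count]
        self._slot_for = slot_for    # hypothetical hook: item -> slot index

    def incr(self, item):
        index = self._slot_for(item)
        slot = self.slots[index]
        if slot is None:
            # Empty slot: the item claims it.
            self.slots[index] = [item, 1]
        elif slot[0] == item:
            # Same item: plain increment.
            slot[1] += 1
        else:
            # Collision: decay the occupant (assumption: round down).
            slot[1] = math.floor(slot[1] * self.decay)
            if slot[1] == 0:
                # Occupant has faded out: the newcomer takes the slot.
                self.slots[index] = [item, 1]

    def items(self):
        occupied = [tuple(slot) for slot in self.slots if slot is not None]
        return sorted(occupied, key=lambda pair: pair[1], reverse=True)


# Hash results copied from the slides (5 slots).
SLIDE_SLOTS = {"redis": 1, "devoxx": 2, "java": 2}

spiking_now = DecayingTopK(k=5, decay=0.9, slot_for=SLIDE_SLOTS.__getitem__)
for term in ["redis", "devoxx", "devoxx", "java", "java"]:
    spiking_now.incr(term)

print(spiking_now.slots)
```

After the five increments, slot 1 still holds "redis" and slot 2 has passed from "devoxx" to "java", matching the last slide of the sequence.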

Slide 60

Slide 60 text

Detecting Spikes
• Calculate the average of every term over the past three minutes
• Compare it with the current minute
• “devoxx”:
• current min = 455 times
• -1 min = 400 times
• -2 min = 350 times
• -3 min = 300 times
• avg of the past three minutes: 350 times
• (455 - 350) / 350 = 0.3
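The comparison described above is a relative change against a short moving average. The sketch below computes it from the per-minute counts listed for “devoxx”; the function name `spike_score` and the handling of an all-zero baseline are assumptions added for illustration.

```python
def spike_score(current, previous_counts):
    """Relative change of `current` versus the mean of `previous_counts`."""
    baseline = sum(previous_counts) / len(previous_counts)
    if baseline == 0:
        # Assumption: a term with no history counts as an unbounded spike.
        return float("inf") if current > 0 else 0.0
    return (current - baseline) / baseline


# Counts for "devoxx": current minute, then the three minutes before it.
score = spike_score(455, [400, 350, 300])
print(f"{score:.2f}")
```

A score above some threshold would flag the term as spiking; the threshold itself is a tuning choice that the slides do not specify.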

Slide 61

Slide 61 text

Demo time!

Slide 62

Slide 62 text

Concluding

Slide 63

Slide 63 text

Counting Individual Terms
Sorted Set: ~2MB per minute, ~120MB per hour, ~2.8GB per day, ~87GB per month, ~1TB per year
Count-Min Sketch: 156KB per minute, 9.3MB per hour, 223MB per day, 6.7GB per month, 80GB per year

Slide 64

Slide 64 text

Deduplicating
Set: ~2.6MB per minute, ~156MB per hour, ~3.7GB per day, ~111GB per month, ~1.3TB per year
Bloom Filter: 135KB per hour, 3.2MB per day, 97MB per month, 1.2GB per year

Slide 65

Slide 65 text

Tracking Spikes
Sorted Set: ~2.6MB per minute, ~156MB per hour, ~3.7GB per day, ~111GB per month, ~1.3TB per year
TopK: 284KB per 5 minutes, 3.4MB per hour, 82MB per day, 2.4GB per month, 28.8GB per year