Slide 1

Slide 1 text

Count-Min Sketch: An efficient probabilistic data structure
Raphael De Lio

Slide 2

Slide 2 text

Who is on Bluesky?

Slide 3

Slide 3 text

How many unique words are mentioned on Bluesky per minute?

Slide 4

Slide 4 text

Approximately 22,000

Slide 5

Slide 5 text

How much memory does it take to count 22,000 terms with a Sorted Set?

Slide 6

Slide 6 text

2MB

Slide 7

Slide 7 text

What if we wanted to keep historical data?

Slide 8

Slide 8 text

~2MB per minute
~120MB per hour
~2.8GB per day
~87GB per month
~1TB per year
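The growth figures follow from scaling the per-minute cost. A minimal sketch of that arithmetic, assuming decimal units (1 GB = 1000 MB), a 30-day month, and a 365-day year; none of these are stated on the slide, so the results only land near its rounded values:

```python
MB_PER_MINUTE = 2  # Sorted Set cost per minute, from the slide

per_hour_mb = MB_PER_MINUTE * 60
per_day_mb = per_hour_mb * 24
per_month_mb = per_day_mb * 30   # assumption: 30-day month
per_year_mb = per_day_mb * 365   # assumption: 365-day year

print(f"per hour:  {per_hour_mb} MB")
print(f"per day:   {per_day_mb / 1000:.2f} GB")
print(f"per month: {per_month_mb / 1000:.1f} GB")
print(f"per year:  {per_year_mb / 1_000_000:.2f} TB")
```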

Slide 9

Slide 9 text

Bluesky has 35 million users today

Slide 10

Slide 10 text

What if I had a data structure with fixed size?

Slide 11

Slide 11 text

Count-Min Sketch
• It’s a probabilistic data structure
• Used to estimate the frequency of elements in a data stream
• Operates with space efficiency, using a fixed amount of memory regardless of data scale
• It operates in constant time: O(1)
• It’s included in Redis alongside other probabilistic data structures

Slide 12

Slide 12 text

How it works…

Slide 13

Slide 13 text

Count-Min Sketch
• Internally it’s a grid (sketch) of w (width) and d (depth)
• Each of the d rows corresponds to one hash function. The w columns of a row are the counter array for that hash function
[Figure: a 3 x 5 grid of counters, all set to 0, labeled "fixed size"]
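The grid can be pictured as a list of counter rows. A minimal sketch, assuming plain Python lists stand in for the counter arrays; `make_sketch` is a hypothetical helper name, not a Redis API:

```python
def make_sketch(width: int, depth: int) -> list[list[int]]:
    """Return `depth` rows of `width` counters, all starting at zero."""
    return [[0] * width for _ in range(depth)]


# Same dimensions as the grid on the slides: 3 hash functions, 5 counters each.
sketch = make_sketch(width=5, depth=3)
for row in sketch:
    print(row)
```

The memory footprint depends only on `width` and `depth`, not on how many distinct terms are later counted.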

Slide 14

Slide 14 text

Count-Min Sketch: Incrementing
CMS.INCRBY terms redis 1
• Hash1(“redis”) % 5 = 2
• Hash2(“redis”) % 5 = 4
• Hash3(“redis”) % 5 = 1
[Figure: the 3 x 5 counter grid after this increment]
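The increment step can be sketched in a few lines. This is an illustration, not how Redis implements it: the per-row hash is assumed to be SHA-256 salted with the row number, and `column` and `increment` are hypothetical helper names, so the column indexes will differ from those on the slide.

```python
import hashlib


def column(item: str, row: int, width: int) -> int:
    """Pick the column for `item` in `row`; the row number salts the hash."""
    digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % width


def increment(sketch: list[list[int]], item: str, amount: int = 1) -> None:
    """Add `amount` to exactly one counter in every row."""
    for row, counters in enumerate(sketch):
        counters[column(item, row, len(counters))] += amount


sketch = [[0] * 5 for _ in range(3)]  # depth 3, width 5, as on the slides
increment(sketch, "redis")
increment(sketch, "redis")
print(sketch)  # each row now holds a single counter equal to 2
```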

Slide 15

Slide 15 text

Count-Min Sketch: Incrementing
CMS.INCRBY terms pets 1
• Hash1(“pets”) % 5 = 0
• Hash2(“pets”) % 5 = 3
• Hash3(“pets”) % 5 = 1
[Figure: the 3 x 5 counter grid after this increment]

Slide 16

Slide 16 text

Count-Min Sketch: Incrementing
CMS.INCRBY terms cats 1
• Hash1(“cats”) % 5 = 3
• Hash2(“cats”) % 5 = 4
• Hash3(“cats”) % 5 = 0
[Figure: the 3 x 5 counter grid after this increment]

Slide 17

Slide 17 text

Count-Min Sketch: Incrementing
CMS.INCRBY terms dogs 1
• Hash1(“dogs”) % 5 = 2
• Hash2(“dogs”) % 5 = 1
• Hash3(“dogs”) % 5 = 3
[Figure: the 3 x 5 counter grid after this increment]
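The four increments can be replayed to see where counters collide. A minimal sketch that hard-codes the column indexes quoted on the slides instead of computing real hashes; `SLIDE_COLUMNS` is a hypothetical name introduced for the illustration:

```python
# Column indexes copied from the slides: (Hash1 % 5, Hash2 % 5, Hash3 % 5).
SLIDE_COLUMNS = {
    "redis": (2, 4, 1),
    "pets": (0, 3, 1),
    "cats": (3, 4, 0),
    "dogs": (2, 1, 3),
}

sketch = [[0] * 5 for _ in range(3)]
for term in ("redis", "pets", "cats", "dogs"):
    for row, col in enumerate(SLIDE_COLUMNS[term]):
        sketch[row][col] += 1  # CMS.INCRBY terms <term> 1

for row in sketch:
    print(row)
# [1, 0, 2, 1, 0]
# [0, 1, 0, 1, 2]
# [1, 2, 0, 1, 0]
```

Every counter holding a 2 is shared by two different terms, which is where overcounting comes from.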

Slide 18

Slide 18 text

Count-Min Sketch: Querying
CMS.QUERY terms dogs
• Hash1(“dogs”) % 5 = 2
• Hash2(“dogs”) % 5 = 1
• Hash3(“dogs”) % 5 = 3
[Figure: the 3 x 5 counter grid after the four increments]

Slide 19

Slide 19 text

Count-Min Sketch: Querying
CMS.QUERY terms redis
• Hash1(“redis”) % 5 = 2
• Hash2(“redis”) % 5 = 4
• Hash3(“redis”) % 5 = 1
Result: 2
[Figure: the 3 x 5 counter grid after the four increments]
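A query reads one counter per row and returns the smallest of them. A minimal sketch, assuming the grid state that results from incrementing redis, pets, cats and dogs once each with the column indexes quoted on the slides; `SLIDE_COLUMNS` and `query` are hypothetical names:

```python
# Column indexes copied from the slides: (Hash1 % 5, Hash2 % 5, Hash3 % 5).
SLIDE_COLUMNS = {
    "redis": (2, 4, 1),
    "pets": (0, 3, 1),
    "cats": (3, 4, 0),
    "dogs": (2, 1, 3),
}

# Grid after the four increments, one row per hash function.
sketch = [
    [1, 0, 2, 1, 0],
    [0, 1, 0, 1, 2],
    [1, 2, 0, 1, 0],
]


def query(term: str) -> int:
    """Return the minimum of the term's counters, one counter per row."""
    return min(sketch[row][col] for row, col in enumerate(SLIDE_COLUMNS[term]))


print(query("dogs"))   # 1
print(query("redis"))  # 2
```

"redis" was incremented once, yet the estimate is 2 because each of its three counters is shared with another term. Taking the minimum limits this effect: a Count-Min Sketch can overestimate a count but never underestimates it.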

Slide 20

Slide 20 text

Count-Min Sketch: Probability
• The width determines the error rate.
• The depth determines the confidence in this error rate.
For a Sketch of 5/3 (width/depth):
• Error rate: 40%
• Confidence in this error rate: 99.87%
99.87% of the time, the counter will be within 40% of the true value
For a Sketch of 2000/10 (width/depth):
• Error rate: 0.1%
• Confidence in this error rate: 99.99%
99.99% of the time, the counter will be within 0.1% of the true value
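The link between width and error rate can be checked numerically. The relation `error = 2 / width` used here is an assumption: it is not stated on the slide, but it reproduces both quoted error rates. The depth-to-confidence relation is left out because the slide does not give it.

```python
def error_rate(width: int) -> float:
    """Assumed relation between sketch width and error rate: 2 / width."""
    return 2 / width


for width in (5, 2000):
    print(width, f"{error_rate(width):.1%}")
# 5 40.0%
# 2000 0.1%
```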

Slide 22

Slide 22 text

Demo time!

Slide 23

Slide 23 text

Concluding
              Sorted Set    Count-Min Sketch
Per minute    ~2MB          156KB
Per hour      ~120MB        9.3MB
Per day       ~2.8GB        223MB
Per month     ~87GB         6.7GB
Per year      ~1TB          80GB
Count-Min Sketch: ~12X smaller
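The "~12X smaller" figure can be checked row by row from the quoted sizes. A minimal sketch, assuming decimal units (1 MB = 1000 KB, and so on) and treating the rounded slide values as exact:

```python
# (period, Sorted Set size, Count-Min Sketch size), both in the same unit per row.
ROWS = [
    ("minute", 2000, 156),   # KB
    ("hour", 120, 9.3),      # MB
    ("day", 2800, 223),      # MB
    ("month", 87, 6.7),      # GB
    ("year", 1000, 80),      # GB
]

ratios = {period: sorted_set / cms for period, sorted_set, cms in ROWS}
for period, ratio in ratios.items():
    print(f"per {period}: {ratio:.1f}x smaller")
```

Every row comes out between 12x and 13x, consistent with the rounded claim.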

Slide 24

Slide 24 text

Thursday from 09:00 - 09:50 Exec Centre

Slide 25

Slide 25 text

RAPHAEL DE LIO
DEVELOPER ADVOCATE