Count-Min Sketch: An efficient probabilistic data structure

Raphael De Lio

Who is in Bluesky?

How many unique words are mentioned in Bluesky per minute?

Approximately 22,000

How much memory does it take to count 22,000 terms with a Sorted Set?

What if we wanted to keep historical data?

• ~2MB per minute
• ~120MB per hour
• ~2.8GB per day
• ~87GB per month
• ~1TB per year
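The growth figures above can be reproduced with a few multiplications. This is a rough sketch that assumes decimal units, a 30-day month and a 365-day year; the slide does not state its rounding, so the results only land close to its figures (about 86.4GB per month and about 1.05TB per year). The function name `extrapolate` is made up for this illustration.

```python
MB_PER_MINUTE = 2  # Sorted Set figure quoted on the slide


def extrapolate(mb_per_minute: float) -> dict[str, float]:
    """Scale a per-minute memory cost up to longer retention windows (in MB)."""
    per_hour = mb_per_minute * 60
    per_day = per_hour * 24
    per_month = per_day * 30   # assumption: 30-day month
    per_year = per_day * 365   # assumption: 365-day year
    return {
        "hour": per_hour,
        "day": per_day,
        "month": per_month,
        "year": per_year,
    }


usage = extrapolate(MB_PER_MINUTE)
for period, megabytes in usage.items():
    print(f"per {period}: {megabytes:,} MB")
```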

Bluesky has 35 million users today

What if I had a data structure with fixed size?

Count-Min Sketch
• It’s a probabilistic data structure
• Used to estimate the frequency of elements in a data stream
• Operates with space-efficiency, using a fixed amount of memory regardless of data scale
• It operates in constant time: O(1)
• It’s included in Redis alongside other Probabilistic Data Structures

How it works…

Count-Min Sketch
• Internally it’s a grid (sketch) of w (width) and d (depth)
• The rows (d) represent the number of hash functions. The columns (w) represent the counter array for each of the hashing functions
(Grid figure: 3 rows of 5 counters, all 0, labeled "fixed size".)
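The grid described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Redis implements it: the class name, the method names and the SHA-256-based hash family (salted with the row index) are all choices made for this example.

```python
import hashlib


class CountMinSketch:
    """A minimal Count-Min Sketch: `depth` rows of `width` counters each."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        # Fixed-size grid: one row of counters per hash function.
        self.rows = [[0] * width for _ in range(depth)]

    def _column(self, row: int, item: str) -> int:
        # Hypothetical hash family: SHA-256 salted with the row index.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def increment(self, item: str, amount: int = 1) -> None:
        # Bump exactly one counter in every row.
        for row in range(self.depth):
            self.rows[row][self._column(row, item)] += amount

    def query(self, item: str) -> int:
        # Collisions can only inflate counters, so the smallest counter
        # is the tightest estimate available.
        return min(
            self.rows[row][self._column(row, item)] for row in range(self.depth)
        )


sketch = CountMinSketch(width=2000, depth=10)
for word in ["redis", "pets", "cats", "dogs", "redis"]:
    sketch.increment(word)
print(sketch.query("redis"))  # never below the true count of 2
```

Memory stays at width × depth counters no matter how many distinct words are added, which is the fixed-size property the slides rely on.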

Count-Min Sketch: Incrementing
CMS.INCRBY terms redis 1
• Hash1(“redis”) % 5 = 2
• Hash2(“redis”) % 5 = 4
• Hash3(“redis”) % 5 = 1
(Grid figure: in each row, the counter at the hashed column goes from 0 to 1.)

Count-Min Sketch: Incrementing
CMS.INCRBY terms pets 1
• Hash1(“pets”) % 5 = 0
• Hash2(“pets”) % 5 = 3
• Hash3(“pets”) % 5 = 1
(Grid figure: row 3, column 1 now holds 2, because “pets” and “redis” collide there.)

Count-Min Sketch: Incrementing
CMS.INCRBY terms cats 1
• Hash1(“cats”) % 5 = 3
• Hash2(“cats”) % 5 = 4
• Hash3(“cats”) % 5 = 0
(Grid figure: row 2, column 4 now holds 2, because “cats” and “redis” collide there.)

Count-Min Sketch: Incrementing
CMS.INCRBY terms dogs 1
• Hash1(“dogs”) % 5 = 2
• Hash2(“dogs”) % 5 = 1
• Hash3(“dogs”) % 5 = 3
(Grid figure: row 1, column 2 now holds 2, because “dogs” and “redis” collide there.)

Count-Min Sketch: Querying
CMS.QUERY terms dogs
• Hash1(“dogs”) % 5 = 2
• Hash2(“dogs”) % 5 = 1
• Hash3(“dogs”) % 5 = 3
(Grid figure: the three counters read are 2, 1 and 1. The query returns the minimum, 1.)

Count-Min Sketch: Querying
CMS.QUERY terms redis
• Hash1(“redis”) % 5 = 2
• Hash2(“redis”) % 5 = 4
• Hash3(“redis”) % 5 = 1
(Grid figure: all three counters read are 2, so the query returns 2, although “redis” was incremented only once.)
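The increment and query walkthrough can be replayed in Python by hard-coding the column indexes shown on the slides. The helpers `incrby` and `query` are local stand-ins named after the Redis commands for readability; they are not the Redis client API.

```python
# Column indexes copied from the slides:
# item -> (Hash1 % 5, Hash2 % 5, Hash3 % 5)
SLIDE_HASHES = {
    "redis": (2, 4, 1),
    "pets": (0, 3, 1),
    "cats": (3, 4, 0),
    "dogs": (2, 1, 3),
}

WIDTH, DEPTH = 5, 3
grid = [[0] * WIDTH for _ in range(DEPTH)]


def incrby(item: str, amount: int = 1) -> None:
    # One counter per row is incremented, at the column given by that row's hash.
    for row, col in enumerate(SLIDE_HASHES[item]):
        grid[row][col] += amount


def query(item: str) -> int:
    # The estimate is the minimum of the item's counters across all rows.
    return min(grid[row][col] for row, col in enumerate(SLIDE_HASHES[item]))


for term in ["redis", "pets", "cats", "dogs"]:
    incrby(term)

for row in grid:
    print(row)
print("dogs:", query("dogs"))    # 1, matches the true count
print("redis:", query("redis"))  # 2, overestimates the true count of 1
```

Every counter belonging to “redis” was also hit by another term, so even the minimum is inflated. “dogs” still has two untouched counters, so its estimate is exact.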

Count-Min Sketch: Probability
• The width determines the error rate.
• The depth determines the confidence in this error rate.

For a Sketch of 5/3:
• Error rate: 40%
• Confidence in this error rate: 99.87%
99.87% of the time, the counter will be within 40% of the true value

For a Sketch of 2000/10:
• Error rate: 0.1%
• Confidence in this error rate: 99.99%
99.99% of the time, the counter will be within 0.1% of the true value
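To make the error rate concrete, it helps to turn it into an absolute number of counts. The sketch below assumes the usual Count-Min reading, in which the error rate is relative to the total of all increments in the sketch rather than to a single item's count, and in which estimates only ever err upward. The function name `max_overestimate` is made up for this example, and the error rates are the ones quoted on the slide.

```python
def max_overestimate(error_rate: float, total_increments: int) -> float:
    """Upper bound on how far an estimate may exceed the true count.

    Assumption: the bound is error_rate * (sum of all increments), and it
    holds with the confidence associated with the sketch's depth.
    """
    return error_rate * total_increments


# 5/3 sketch from the walkthrough: 4 increments in total.
print(max_overestimate(0.40, 4))

# 2000/10 sketch holding one minute of roughly 22,000 term mentions.
print(max_overestimate(0.001, 22_000))
```

For the 5/3 walkthrough this gives a bound of 1.6 counts, and the “redis” estimate of 2 against a true count of 1 sits inside it. For the 2000/10 sketch the bound is about 22 counts out of 22,000.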

Demo time!

Concluding

Sorted Set
• ~2MB per minute
• ~120MB per hour
• ~2.8GB per day
• ~87GB per month
• ~1TB per year

Count-Min Sketch
• 156KB per minute
• 9.3MB per hour
• 223MB per day
• 6.7GB per month
• 80GB per year

~12X smaller

Thursday from 09:00 - 09:50 Exec Centre

RAPHAEL DE LIO, DEVELOPER ADVOCATE
