Count-Min Sketch: An efficient probabilistic data structure

A Count-Min Sketch is a data structure that estimates how often something appears in a large dataset while using very little memory. It relies on a table and hash functions to map items to specific spots in the table. Adding an item increases the values in those spots, and checking an item’s count returns the smallest value from them. While not exact due to possible collisions, it’s efficient and great for approximate counts when precision isn’t critical.
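
The mechanism described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the talk or in Redis: the class name `CountMinSketch`, its methods `add` and `estimate`, and the choice of salted BLAKE2b as the per-row hash function are all assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Minimal Count-Min Sketch: `depth` rows of `width` counters (illustrative only)."""

    def __init__(self, width: int, depth: int) -> None:
        if width <= 0 or depth <= 0:
            raise ValueError("width and depth must be positive")
        self.width = width
        self.depth = depth
        # Fixed-size table; memory does not grow with the number of items.
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # Assumption: one hash function per row, derived by salting BLAKE2b
        # with the row number. Any family of independent hashes would do.
        salt = row.to_bytes(8, "little")
        digest = hashlib.blake2b(
            item.encode("utf-8"), digest_size=8, salt=salt
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item: str, count: int = 1) -> None:
        # Adding an item increases one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # The smallest of the item's counters is the one least inflated
        # by collisions, so it is returned as the estimate.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


sketch = CountMinSketch(width=2000, depth=10)
for word in ["redis", "pets", "cats", "dogs", "redis"]:
    sketch.add(word)

print(sketch.estimate("redis"))  # never less than 2, the true count
```

Because counters are only ever incremented, collisions can push an estimate up but never down, which is why the minimum across rows is used.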

In this talk, we’ll explore:
• What this data structure is
• How it works internally
• How I used it to build an efficient version of Trending Topics for Bluesky

By the end of this session, you’ll have a clear understanding of Count-Min Sketches, why they’re valuable for handling large-scale data efficiently, and how you can apply them to solve real-world problems.

Raphael De Lio

December 27, 2024

Transcript

  1. 2MB

  2. Count-Min Sketch • It’s a probabilistic data structure • Used to estimate the frequency of elements in a data stream • Operates with space-efficiency, using a fixed amount of memory regardless of data scale • It operates in constant time: O(1) • It’s included in Redis alongside other Probabilistic Data Structures
  3. Count-Min Sketch • Internally it’s a grid (sketch) of w (width) and d (depth) • Each of the d rows corresponds to one hash function. The w columns of a row are the counter array for that hash function • [Figure: 3×5 grid of counters, all set to 0, labeled “fixed size”]
  4. Count-Min Sketch: Incrementing • CMS.INCRBY terms redis 1 • Hash1(“redis”) % 5 = 2 • Hash2(“redis”) % 5 = 4 • Hash3(“redis”) % 5 = 1 • [Figure: 3×5 counter grid; the three counters selected by the hashes now hold 1]
  5. Count-Min Sketch: Incrementing • CMS.INCRBY terms pets 1 • Hash1(“pets”) % 5 = 0 • Hash2(“pets”) % 5 = 3 • Hash3(“pets”) % 5 = 1 • [Figure: 3×5 counter grid after the increment; the row 3 counter shared with “redis” now holds 2]
  6. Count-Min Sketch: Incrementing • CMS.INCRBY terms cats 1 • Hash1(“cats”) % 5 = 3 • Hash2(“cats”) % 5 = 4 • Hash3(“cats”) % 5 = 0 • [Figure: 3×5 counter grid after the increment; the row 2 counter shared with “redis” now holds 2]
  7. Count-Min Sketch: Incrementing • CMS.INCRBY terms dogs 1 • Hash1(“dogs”) % 5 = 2 • Hash2(“dogs”) % 5 = 1 • Hash3(“dogs”) % 5 = 3 • [Figure: 3×5 counter grid after the increment; the row 1 counter shared with “redis” now holds 2]
  8. Count-Min Sketch: Querying • CMS.QUERY terms dogs • Hash1(“dogs”) % 5 = 2 • Hash2(“dogs”) % 5 = 1 • Hash3(“dogs”) % 5 = 3 • [Figure: 3×5 counter grid; the query reads the three counters selected by the hashes]
  9. Count-Min Sketch: Querying • CMS.QUERY terms redis • Hash1(“redis”) % 5 = 2 • Hash2(“redis”) % 5 = 4 • Hash3(“redis”) % 5 = 1 • Result: 2 • [Figure: 3×5 counter grid; the query reads the three counters selected by the hashes]
  10. Count-Min Sketch: Probability • The width determines the error rate. • The depth determines the confidence in this error rate. • For a sketch of 5/3 (width/depth): error rate 40%, confidence in this error rate 99.87%. 99.87% of the time, the counter will overestimate the true value by at most 40% of the total count • For a sketch of 2000/10: error rate 0.1%, confidence in this error rate 99.99%. 99.99% of the time, the counter will overestimate the true value by at most 0.1% of the total count
  12. Concluding • Sorted Set: ~2MB per minute, ~120MB per hour, ~2.8GB per day, ~87GB per month, ~1TB per year • Count-Min Sketch: 156KB per minute, 9.3MB per hour, 223MB per day, 6.7GB per month, 80GB per year (~12X smaller)
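
The incrementing and querying slides can be replayed in a few lines of Python. This is only an illustrative sketch: the column indexes are copied from the slides rather than computed by real hash functions, and the names `SLIDE_INDEXES`, `incrby`, and `query` are made up for this example (they mimic, but are not, the Redis commands).

```python
# Column index in each of the three rows, copied from the slides
# (HashN(term) % 5). No real hashing happens in this example.
SLIDE_INDEXES = {
    "redis": (2, 4, 1),
    "pets": (0, 3, 1),
    "cats": (3, 4, 0),
    "dogs": (2, 1, 3),
}

WIDTH, DEPTH = 5, 3
grid = [[0] * WIDTH for _ in range(DEPTH)]


def incrby(term: str, amount: int = 1) -> None:
    """Increment one counter per row, at the column given for that row."""
    for row, col in enumerate(SLIDE_INDEXES[term]):
        grid[row][col] += amount


def query(term: str) -> int:
    """Return the smallest of the term's counters."""
    return min(grid[row][col] for row, col in enumerate(SLIDE_INDEXES[term]))


# Same order as the slides: each term is incremented once.
for term in ("redis", "pets", "cats", "dogs"):
    incrby(term)

for row in grid:
    print(row)
# [1, 0, 2, 1, 0]
# [0, 1, 0, 1, 2]
# [1, 2, 0, 1, 0]

print("dogs:", query("dogs"))    # dogs: 1
print("redis:", query("redis"))  # redis: 2
```

Each term was added once, yet “redis” is reported as 2: every one of its three counters is shared with another term (“dogs” in row 1, “cats” in row 2, “pets” in row 3). With only five columns such collisions are likely, which is consistent with the 40% error rate quoted for a 5/3 sketch.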