
Probabilistic Data Structures

Bloom Filters, Count Min Sketches, LogLog, SuperLogLog, HyperLogLog, HyperLogLog++, BigData, Analytics

Ben Darfler

October 09, 2014


Transcript

  1. Standard Data Structures
    • List • Set • Map • Queue • Priority Queue • Stack • Deque • Sorted Set • Sorted Map • Multiset • Multimap
  2. Hashing - Definition
    “A hash function is any function that can be used to map digital data of arbitrary size to digital data of fixed size, with slight differences in input data producing very big differences in output data.” - 62 United States Attorneys’ Bulletin 44-82
  3. Membership - Use Cases
    • NoSQL Databases
      ◦ Does this data segment contain this key?
    • Chrome
      ◦ Is this URL malicious?
    • Squid Caching Proxy
      ◦ Does a cached version exist?
  4. Bloom Filter - What is it?
    • Can answer the membership question
    • False positives are possible
    • False negatives are not possible
  5. Bloom Filter - Implementation
    • Setup
      ◦ Set of k different hash functions
      ◦ Bit array of size m with all bits initialized to 0
    • Insertion
      ◦ Hash the input with all k hash functions
      ◦ Set each corresponding bit in the array to 1
    • Query
      ◦ Hash the input with all k hash functions
      ◦ All corresponding bits are set to 1 ? “Yes” : “No”
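The setup, insertion, and query steps above can be sketched in Python. This is a minimal illustration rather than the deck's own implementation: the class name `BloomFilter`, its method names, and the trick of deriving k hash functions by salting SHA-256 with an index are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: k salted hashes over an m-bit array."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m  # bit array of size m, all bits start at 0

    def _positions(self, item):
        # Assumption: simulate k different hash functions by salting
        # SHA-256 with the function index i.
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    def add(self, item):
        # Insertion: set every corresponding bit to 1.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # Query: "Yes" only if all corresponding bits are 1.
        # A "Yes" can be a false positive; a "No" is always correct.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1024, k=4)
bf.add("This is my input")
bf.add("This is another input")
print(bf.might_contain("This is my input"))
print(bf.might_contain("I have not seen this"))
```

Inserted items always answer "Yes". An unseen item usually answers "No", but only with a probability that depends on m, k, and the number of items inserted.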
  6. Bloom Filter - Insertion
    ‘This is my input’
    • Hash1 = 4 • Hash2 = 7 • Hash3 = 8 • Hash4 = 14
    0 0 0 1 0 0 1 1 0 0 0 0 0 1 0 0
  7. Bloom Filter - Insertion
    ‘This is another input’
    • Hash1 = 5 • Hash2 = 7 • Hash3 = 9 • Hash4 = 11
    0 0 0 1 1 0 1 1 1 0 1 0 0 1 0 0
  8. Bloom Filter - Query
    ‘This is my input’ => Yes
    • Hash1 = 4 • Hash2 = 7 • Hash3 = 8 • Hash4 = 14
    0 0 0 1 1 0 1 1 1 0 1 0 0 1 0 0
  9. Bloom Filter - Query
    ‘I have not seen this’ => No
    • Hash1 = 2 • Hash2 = 5 • Hash3 = 10 • Hash4 = 12
    0 0 0 1 1 0 1 1 1 0 1 0 0 1 0 0
  10. Bloom Filter - Collision
    ‘Is this in the set?’ => Yes (oops)
    • Hash1 = 5 • Hash2 = 7 • Hash3 = 8 • Hash4 = 11
    0 0 0 1 1 0 1 1 1 0 1 0 0 1 0 0
  11. Heavy Hitters - Use Cases
    • Top-k of anything
      ◦ Users who X the most
      ◦ Users who score the most
      ◦ Leaderboards
    • DDoS detection / Rate Limiting
      ◦ Track the abusive IP addresses
  12. Count Min Sketch - What is it?
    • Can answer the heavy hitters question
    • Answer comes with
      ◦ estimation error (count is never too low, but may be too high by some value)
      ◦ probability (the error bound holds x% of the time)
  13. Count Min Sketch - Implementation
    • Setup
      ◦ Set of k different hash functions
      ◦ Matrix of size m x k with all values initialized to 0
    • Insertion
      ◦ Hash the input with all k hash functions
      ◦ Increment each corresponding value by 1
    • Query
      ◦ Hash the input with all k hash functions
      ◦ Return the minimum of all the corresponding values
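A minimal Python sketch of the steps above follows. The class name `CountMinSketch`, its method names, and the salted SHA-256 hashing are assumptions for illustration; the matrix is stored as k rows of m counters, one row per hash function.

```python
import hashlib


class CountMinSketch:
    """Minimal Count-Min Sketch: k rows of m counters, one row per hash."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.rows = [[0] * m for _ in range(k)]  # all counters start at 0

    def _columns(self, item):
        # Assumption: simulate k different hash functions by salting
        # SHA-256 with the row index i.
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    def add(self, item, count=1):
        # Insertion: increment one counter in every row.
        for row, col in zip(self.rows, self._columns(item)):
            row[col] += count

    def estimate(self, item):
        # Query: the minimum is the counter least inflated by collisions.
        return min(row[col] for row, col in zip(self.rows, self._columns(item)))


cms = CountMinSketch(m=1024, k=4)
cms.add("This is my input", 5)
cms.add("This is another input", 2)
print(cms.estimate("This is my input"))
```

Because counters are only ever incremented, the estimate can exceed the true count when items collide, but it can never fall below it.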
  14. Count Min Sketch - Insertion
    ‘This is my input’ (5 times)
    • Hash1 = 4 • Hash2 = 7 • Hash3 = 8 • Hash4 = 14
    0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0
  15. Count Min Sketch - Insertion
    ‘This is another input’ (2 times)
    • Hash1 = 5 • Hash2 = 7 • Hash3 = 9 • Hash4 = 11
    0 0 0 5 2 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 7 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 5 2 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 2 0 0 5 0 0
  16. Count Min Sketch - Query
    ‘This is my input’ => 5
    • Hash1 = 4 • Hash2 = 7 • Hash3 = 8 • Hash4 = 14
    0 0 0 5 2 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 7 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 5 2 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 2 0 0 5 0 0
  17. Cardinality - Use Cases
    • Count Distinct
      ◦ Number of users who X
      ◦ Number of users who logged in
      ◦ Daily active users
  18. Linear Counting - What is it?
    • Can answer the cardinality question
    • Answer comes with
      ◦ standard error (percentage error rate)
  19. Linear Counting - Implementation
    • Setup
      ◦ A hash function
      ◦ Bit array of size m with all bits initialized to 0
    • Insertion
      ◦ Hash the input with the hash function
      ◦ Set the corresponding bit in the array to 1
    • Query
      ◦ Count the number of 1s in the array
      ◦ This undercounts once hashes collide, so the usual estimate is -m * ln(z / m), where z is the number of 0s
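A minimal Python sketch of Linear Counting follows. The function name `linear_count` and the use of SHA-256 as the hash are assumptions for illustration. The query step applies the standard collision correction, -m * ln(z / m) with z the number of 0 bits, rather than returning the raw count of 1s.

```python
import hashlib
import math


def linear_count(items, m):
    """Estimate the number of distinct items using an m-bit array."""
    bits = [0] * m  # bit array of size m, all bits start at 0
    for item in items:
        # Assumption: SHA-256 truncated to 64 bits stands in for "a hash function".
        hashed = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        bits[hashed % m] = 1
    zeros = bits.count(0)
    if zeros == 0:
        raise ValueError("bit array is saturated; use a larger m")
    # Correct for distinct items that landed on the same bit.
    return -m * math.log(zeros / m)


# 5000 events from 500 distinct (hypothetical) users.
events = [f"user-{i % 500}" for i in range(5000)]
print(round(linear_count(events, m=4096)))
```

Duplicates set a bit that is already 1, so they do not change the estimate.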
  20. Linear Counting - Insertion
    ‘This is my input’
    • Hash = 4
    0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
  21. Linear Counting - Insertion
    ‘This is another input’
    • Hash = 9
    0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0
  22. LogLog Counting - What is it?
    • Can answer the cardinality question
    • Much more space-efficient than Linear Counting
    • Answer comes with
      ◦ standard error (percentage error rate)
  23. LogLog Counting - Intuition
    • Hashing the input creates a (pseudo) random, evenly distributed output
    • In such a random, evenly distributed series, you would expect to see a value with at least k leading zeros once in every 2^k elements
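The 2^k claim can be checked by brute force. The sketch below enumerates every 16-bit value, which stands in for a perfectly even hash output, and counts how many have at least k leading zeros. The helper names and the 16-bit width are assumptions chosen to keep the enumeration small.

```python
WIDTH = 16  # assumed hash width, small enough to enumerate every value


def leading_zeros(value, width):
    """Number of leading zero bits in `value` viewed as a `width`-bit number."""
    return width - value.bit_length()


def count_with_leading_zeros(k, width=WIDTH):
    """How many `width`-bit values have at least k leading zeros."""
    return sum(1 for v in range(2 ** width) if leading_zeros(v, width) >= k)


for k in range(5):
    hits = count_with_leading_zeros(k)
    # One value in every 2^k qualifies.
    print(k, hits, 2 ** WIDTH // hits)
```

Read in reverse, this is the estimator's core idea: if the longest run of leading zeros seen so far is k, roughly 2^k distinct values have probably been hashed.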
  24. LogLog Counting - Implementation
    • Setup
      ◦ def hash(input)
      ◦ m = 2^k
      ◦ buckets = [0] * m
    • Insertion
      ◦ hashed = hash(input)
      ◦ index = get_bits(hashed, 0, k)
      ◦ observation = get_bits(hashed, k, hashed.size)
      ◦ rank = num_leading_zeros(observation)
      ◦ buckets[index] = max(buckets[index], rank)
    • Query
      ◦ 2^(sum(buckets) / m) * m * 0.79402
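The pseudocode above can be turned into a runnable Python sketch. The names `split_hash`, `hash32`, and `LogLog` are assumptions for illustration, as is the choice of a 32-bit hash taken from SHA-256; the bucket update and the query formula follow the slide.

```python
import hashlib

HASH_BITS = 32  # assumed hash width for this sketch


def hash32(item):
    # Assumption: the first 4 bytes of SHA-256 serve as a 32-bit hash.
    return int.from_bytes(hashlib.sha256(item.encode()).digest()[:4], "big")


def split_hash(hashed, hash_bits, k):
    """Split a hash into (bucket index, rank of the remaining bits)."""
    obs_bits = hash_bits - k
    index = hashed >> obs_bits                    # first k bits
    observation = hashed & ((1 << obs_bits) - 1)  # remaining bits
    rank = obs_bits - observation.bit_length()    # number of leading zeros
    return index, rank


class LogLog:
    def __init__(self, k):
        self.k = k
        self.m = 2 ** k
        self.buckets = [0] * self.m

    def add(self, item):
        index, rank = split_hash(hash32(item), HASH_BITS, self.k)
        self.buckets[index] = max(self.buckets[index], rank)

    def estimate(self):
        # Same formula as the Query step on the slide.
        return 2 ** (sum(self.buckets) / self.m) * self.m * 0.79402


counter = LogLog(k=10)
for i in range(50000):
    counter.add(f"user-{i}")  # hypothetical distinct inputs
print(round(counter.estimate()))
```

Each bucket stores only a small rank, so 1024 buckets are enough to estimate tens of thousands of distinct values to within a few percent.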
  25. LogLog Counting - Insertion
    ‘This is my input’
    hashed = 0b10000000001010101010000111101
    index = 0b1000 (i.e. 8)
    observation = 0b0000001010101010000111101
    rank = 6
    buckets[8] = max(buckets[8], 6)
  26. LogLog Counting - Variants
    • SuperLogLog
      ◦ Outliers skew results, so toss out the largest 30% of bucket values
    • HyperLogLog
      ◦ Use harmonic mean instead of geometric mean
      ◦ Special case logic for low cardinality
    • HyperLogLog++
      ◦ Google’s version of HLL
      ◦ Use 64-bit hash function to count larger cardinalities
      ◦ Different special case logic for low cardinality
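The motivation for these variants can be seen with a toy comparison of the two means. The bucket values below are hypothetical, and the snippet only compares the averaging step; it is not a full SuperLogLog or HyperLogLog estimator.

```python
from statistics import geometric_mean, harmonic_mean

# Hypothetical per-bucket estimates (2^rank) for 16 buckets.
typical = [2 ** 5] * 16
# Same buckets, but one saw an unusually long run of leading zeros.
skewed = [2 ** 5] * 15 + [2 ** 20]

print("geometric:", geometric_mean(typical), "->", geometric_mean(skewed))
print("harmonic: ", harmonic_mean(typical), "->", harmonic_mean(skewed))
```

With a single outlier bucket, the geometric mean (the one LogLog uses) almost doubles, from 32 to about 61, while the harmonic mean moves only to about 34. SuperLogLog attacks the same problem by discarding the largest bucket values before averaging.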