HyperLogLog in 15 minutes

Paul Mucur
November 28, 2018

A brief explanation of the HyperLogLog algorithm for estimating the cardinality of large sets, given at the Drover Ruby Meetup on Wednesday, 28th November 2018.

Transcript

  1. HyperLogLog in 15 minutes @mudge

  2. “Cardinality of a set”

  3. animals = Set.new
    => #<Set: {}>
    animals << "dog"
    => #<Set: {"dog"}>
    animals << "dog"
    => #<Set: {"dog"}>
    animals << "cat"
    => #<Set: {"dog", "cat"}>
    animals.size
    => 2
  4. What do we need to count the number of unique elements exactly?
  5. None
  6. None
  7. Flipping a coin

  8. None
  9. None
  10. 1

  11. 1

  12. 1

  13. 1

  14. 1

  15. 1 2

  16. 2

  17. 2

  18. 2 5

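  19. $P(0) = ?$

  20. $P(0) = \frac{1}{2}$

  21. $P(0) = \frac{1}{2}$, $P(1) = ?$

  22. $P(0) = \frac{1}{2}$, $P(1) = \frac{1}{4}$

  23. $P(0) = \frac{1}{2}$, $P(1) = \frac{1}{4}$, $P(2) = \frac{1}{8}$

  24. $P(0) = \frac{1}{2^1} = \frac{1}{2}$, $P(1) = \frac{1}{2^2} = \frac{1}{4}$, $P(2) = \frac{1}{2^3} = \frac{1}{8}$, ..., $P(n) = \frac{1}{2^{n+1}}$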
  25. If our highest score is 5 then we can guess $2^6$ runs. $P(n) = \frac{1}{2^{n+1}}$
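
The coin-flipping guess on slide 25 can be tried out with a small simulation. This is only a sketch: it assumes a "score" is the number of heads before the first tails, which matches the probabilities on the slides, and the names `run_score` and `guess_runs` are made up for illustration.

```ruby
# Score of one run: how many heads come up before the first tails.
# A score of n then has probability 1 / 2^(n + 1), as on the slides.
def run_score(rng)
  score = 0
  score += 1 while rng.rand(2) == 1 # 1 = heads, 0 = tails
  score
end

# Guess how many runs were played from the highest score alone.
def guess_runs(highest_score)
  2**(highest_score + 1)
end

rng = Random.new(42) # fixed seed so the sketch is repeatable
scores = Array.new(64) { run_score(rng) }

puts "highest score: #{scores.max}"
puts "guessed runs:  #{guess_runs(scores.max)} (actual: #{scores.size})"
puts guess_runs(5) # a highest score of 5 gives 2^6 = 64
```

A single highest score is a noisy guess, so the guessed and actual number of runs can easily differ by a factor of two or more.
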
  26. What’s this got to do with estimating the cardinality of a set?
  27. "dog"

  28. "dog"

  29. 1 "dog"

  30. 1 "cat"

  31. 1 "cat"

  32. 1 "cat" 3

  33. 1 "dog"

  34. 1 "dog" 0 1 1 0 0 1

  35. 1 "cat" 3

  36. 1 "cat" 3 0 0 0 1 1 0
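
Slides 34 and 36 pair "dog" with the bits 0 1 1 0 0 1 and a score of 1, and "cat" with 0 0 0 1 1 0 and a score of 3. One way to read this is that each element is hashed to a string of bits and its score is the number of leading zeros, just like heads before the first tails. The sketch below follows that reading. The helper `hash_bits` and the choice of MD5 are assumptions, since the slides do not name a hash function.

```ruby
require "digest/md5"

# Score of a bit string: the number of zeros before the first one.
def score(bits)
  bits[/\A0*/].size
end

# Hypothetical helper: hash an element to a string of 32 bits.
# MD5 is an assumption made for this sketch only.
def hash_bits(element)
  Digest::MD5.hexdigest(element).to_i(16).to_s(2).rjust(128, "0")[0, 32]
end

puts score("011001") # the bits shown next to "dog": score 1
puts score("000110") # the bits shown next to "cat": score 3

# Keep only the highest score seen, then guess as with the coin flips.
highest = %w[dog cat dog].map { |animal| score(hash_bits(animal)) }.max
puts "crude guess: #{2**(highest + 1)} distinct elements"
```

Adding "dog" twice hashes to the same bits both times, so duplicates cannot raise the highest score. The guess from a single highest score is very rough.
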

  37. $E := \alpha_m m^2 Z$

  38. > PFADD tweets 1 2 3 4 5 6
    (integer) 1
    > PFCOUNT tweets
    (integer) 6
  39. None
  40. ~fin~ @mudge https://mudge.name

  41. 0110101101010

  42. 1 0110101101010

  43. 1 0110101101010 0100001010100

  44. 1 3 0110101101010 0100001010100

  45. 1 3 0110101101010 0100001010100 0011101101010

  46. 1 3 1 0110101101010 0100001010100 0011101101010

  47. 1 3 1 0110101101010 0100001010100 0011101101010 0110011010101

  48. 1 3 1 0110101101010 0100001010100 0011101101010 0110011010101 2
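
The four bit strings on slides 41 to 48 are shown with the values 1, 3, 1 and 2. Those values come out if the first 4 bits of each string pick one of 16 registers and the value is the 1-based position of the first 1 in the remaining bits. That split is an interpretation of the transcript, not something it states, and the names below are made up for illustration.

```ruby
INDEX_BITS = 4 # assumption: 2^4 = 16 registers

# Split a bit string into [register index, rank], where the rank is the
# 1-based position of the first "1" after the index bits.
def index_and_rank(bits)
  index = bits[0, INDEX_BITS].to_i(2)
  rest = bits[INDEX_BITS..-1]
  rank = (rest.index("1") || rest.size) + 1
  [index, rank]
end

registers = Array.new(2**INDEX_BITS, 0)

%w[0110101101010 0100001010100 0011101101010 0110011010101].each do |bits|
  index, rank = index_and_rank(bits)
  registers[index] = [registers[index], rank].max # keep the highest rank seen
  puts "#{bits} -> register #{index}, rank #{rank}"
end
```

Under this reading the first and last strings land in the same register, which ends up holding the larger of their two ranks.
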

  49. $E := \alpha_m m^2 Z$

  50. $E := \alpha_m \times m \times mZ$

  51. $E := \alpha_m \times m \times mZ$

  52. $E := \alpha_m \times m \times mZ$

  53. $E := \alpha_m \times m \times mZ$

  54. $\alpha_{16} = 0.673$; $\alpha_{32} = 0.697$; $\alpha_m = 0.7213/(1 + 1.1079/m)$ for $m \geq 128$
  55. Arithmetic mean: $\frac{x_1 + x_2 + \dots + x_n}{n}$

  56. Harmonic mean: $\frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \dots + \frac{1}{x_n}}$
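
The two means on slides 55 and 56 can be compared directly. The numbers below are made up and include one large outlier to show how differently the two means react to it.

```ruby
# Sum of the values divided by how many there are.
def arithmetic_mean(values)
  values.sum.to_f / values.size
end

# Number of values divided by the sum of their reciprocals.
def harmonic_mean(values)
  values.size / values.sum { |x| 1.0 / x }
end

values = [2, 2, 2, 64] # made-up numbers with one outlier

puts arithmetic_mean(values) # 17.5
puts harmonic_mean(values)   # roughly 2.64
```

The single outlier drags the arithmetic mean up to 17.5 while the harmonic mean stays close to 2.
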
  57. $mZ := \frac{m}{2^{-M[1]} + 2^{-M[2]} + \dots + 2^{-M[m]}}$
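
Putting the formulas from the slides together gives a small estimator. This is a minimal sketch rather than a production implementation: the class name `TinyHyperLogLog`, the use of MD5, the 32-bit hash width and the default of 1024 registers are all assumptions. It computes only the raw estimate $E$ from the slides.

```ruby
require "digest/md5"

# A minimal HyperLogLog sketch with m = 2^index_bits registers.
class TinyHyperLogLog
  HASH_BITS = 32 # assumption: use the first 32 bits of an MD5 digest

  def initialize(index_bits = 10)
    @index_bits = index_bits
    @m = 2**index_bits
    @registers = Array.new(@m, 0) # M[1] .. M[m] on the slides
  end

  def add(element)
    hash = Digest::MD5.hexdigest(element.to_s)[0, 8].to_i(16)
    value_bits = HASH_BITS - @index_bits
    index = hash >> value_bits               # leading bits pick a register
    rest = hash & ((1 << value_bits) - 1)    # remaining bits give the rank
    rank = value_bits - rest.bit_length + 1  # 1-based position of first 1 bit
    @registers[index] = [@registers[index], rank].max
    self
  end

  # E := alpha_m * m * mZ
  def estimate
    alpha_m * @m * m_z
  end

  private

  # Constants as listed on slide 54; other sizes are not covered there.
  def alpha_m
    return 0.673 if @m == 16
    return 0.697 if @m == 32
    raise ArgumentError, "no constant for m = #{@m}" if @m < 128

    0.7213 / (1 + 1.1079 / @m)
  end

  # mZ: the harmonic mean of 2^M[j] over all registers.
  def m_z
    @m / @registers.sum { |r| 2.0**(-r) }
  end
end

hll = TinyHyperLogLog.new
50_000.times { |i| hll.add("user-#{i}") }
50_000.times { |i| hll.add("user-#{i}") } # duplicates leave the registers unchanged
puts hll.estimate.round
```

With 1024 registers the estimate should usually land within a few percent of 50,000. The raw estimate is unreliable for sets that are small compared to the number of registers. The original HyperLogLog paper adds range corrections for that case, which this sketch leaves out.
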