Ben Linsay on HyperLogLog

pwl
May 14, 2018

This extended abstract describes and analyses a near-optimal probabilistic algorithm, HyperLogLog, dedicated to estimating the number of distinct elements (the cardinality) of very large data ensembles. Using an auxiliary memory of m units (typically, "short bytes"), HyperLogLog performs a single pass over the data and produces an estimate of the cardinality such that the relative accuracy (the standard error) is typically about 1.04/√m. This improves on the best previously known cardinality estimator, LogLog, whose accuracy can be matched by consuming only 64% of the original memory. For instance, the new algorithm makes it possible to estimate cardinalities well beyond 10^9 with a typical accuracy of 2% while using a memory of only 1.5 kilobytes. The algorithm parallelizes optimally and adapts to the sliding window model.

Transcript

  1. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm, Flajolet et al. (2007). ben linsay, pwl nyc, may 2018
  2. It’s personal

  3. - Why Bother?
    - Intuition
    - The Algorithm
    - Consequences
    - HLL IRL
  4. Why Bother?

  5. SET = SET CARDINALITY = SIZE

  6. “...the cardinality of a [set] can be exactly determined with a storage complexity essentially proportional to its number of elements”
  7. - Set doesn’t fit in RAM
    - Set doesn’t fit on disk
    - Read-once data (streams)
  8. > require 'set'
    > r = Random.new
    > stream = (1..10**9).lazy.map{|_| r.rand(3) }
    > Set.new(stream).length
    => 3
  9. > r = Random.new
    > stream = (1..10**9).lazy.map{|_| r.rand(30000) }
    > Set.new(stream).length
    => 30000
  10. > r = Random.new
    > stream = (1..10**9).lazy.map{|x| x }
    > Set.new(stream).length
    => 10**9
  11. Network traffic

  12. Advertising

  13. struct HLL{ … }
    add(HLL, element) -> HLL
    cardinality(HLL) -> Number
  14. “...estimate cardinalities well beyond 10^9 with a typical accuracy of 2% while using a memory of only 1.5 kilobytes.”
  15. For a set with cardinality N:
    - O(m) memory
    - std. error ~= 1.04/√m
    - O(1) time to add an element
  16. A Disclaimer

  17. We’re skipping the proofs

  18. “...techniques that are now standard in analysis of algorithms, like poissonization, Mellin transforms, and saddle-point depoissonization.”
  20. Intuition

  21. Flip a coin, forever

  22. Flip a coin 32 times, forever

  23. pr(1…) = 1/2
    pr(01…) = 1/4
    pr(001…) = 1/8
    pr(0001…) = 1/16
    …
    pr(0 * n) = 1/2^n
  24. if the longest run of tails is x then we’ve probably done at least 2^x experiments
  25. Bit-Pattern Observables

  26. ρ(x): “the position of the leftmost 1-bit in binary string x”
  27. ρ(0b00101010) = 3 ρ(0b00001010) = 5 ρ(0b00000001) = 8

  28. if x = max(ρ) then we’ve probably done at least 2^x experiments
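
The bit-pattern observable above can be sketched in a few lines of Ruby. This is only an illustration: the `width` parameter, the `crude_estimate` helper, and the convention that an all-zero string counts as `width + 1` are assumptions, not something the slides spell out.

```ruby
# rho(x): 1-based position of the leftmost 1-bit in a fixed-width binary string.
# `width` defaults to 8 to match the 8-bit examples on the slide.
def rho(x, width = 8)
  raise ArgumentError, "x must fit in #{width} bits" if x < 0 || x >= (1 << width)
  return width + 1 if x.zero? # assumed convention for an all-zero string

  width - x.bit_length + 1
end

# Crude single-experiment estimate: 2 ** (largest rho seen so far).
def crude_estimate(values, width = 8)
  2**values.map { |v| rho(v, width) }.max
end

p rho(0b00101010) # => 3
p rho(0b00001010) # => 5
p rho(0b00000001) # => 8
```

A single experiment like this is very noisy, since one unlucky value with a long run of zeros dominates the maximum. That noise is what the parallel experiments on the following slides are meant to average out.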
  29. “perform [m] experiments in parallel… their arithmetic mean has standard deviation σ/√m”
  30. Do parallel experiments to get bad estimators. Combine bad estimators into a good estimator.
  31. The Algorithm

  32. add(HLL, element) -> HLL

  33. “very_cool_input_data”

  34. h(“very_cool_input_data”) ~ rand()

  35. Uniform Hash Functions

  36. ρ(h_1(x)) ρ(h_2(x)) ρ(h_3(x)) ... ρ(h_m(x))
  37. “...[emulate] the effect of m experiments with a single hash function… ...divide the input stream h(M) into m substreams”
  38. h(“A”) M_1 M_2 ... M_m

  39. h(“B”) M_1 M_2 ... M_m

  40. h(“A”) M_1 M_2 ... M_m

  41. > hashed_val = hash(“foobar”)
    => 0b0110101000101010
    > i = 0b01101
    > rho_x = ρ(0b01000101010)
    > M[i] = max(M[i], rho_x)
  42. struct HLL = [M_1, M_2, …, M_m]
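
The add step above (hash, use the leading bits as a register index, keep the maximum ρ of the remaining bits) might look roughly like this in Ruby. The helper names `index_and_rho` and `hash32`, the use of MD5 as a stand-in hash, and the default of b = 5 index bits are all assumptions made for illustration; the slides do not name a hash function.

```ruby
require "digest/md5"

# Position of the leftmost 1-bit in a `width`-bit value (all zeros => width + 1).
def rho(w, width)
  w.zero? ? width + 1 : width - w.bit_length + 1
end

# Split a hash into [register index, rho of the remaining bits].
# The top `b` bits choose the register, as in the slide's `i = 0b01101`.
def index_and_rho(h, hash_bits, b)
  rest_bits = hash_bits - b
  i = h >> rest_bits
  w = h & ((1 << rest_bits) - 1)
  [i, rho(w, rest_bits)]
end

# Assumed stand-in hash: the first 4 bytes of MD5 read as a big-endian uint32.
def hash32(element)
  Digest::MD5.digest(element.to_s).unpack1("N")
end

# add(HLL, element) -> HLL, with the HLL held as a plain Array of 2**b registers.
def add(registers, element, b = 5)
  i, r = index_and_rho(hash32(element), 32, b)
  registers[i] = [registers[i], r].max
  registers
end

# The 16-bit example from the slide: index 0b01101 = 13, rho(0b01000101010) = 2.
p index_and_rho(0b0110101000101010, 16, 5) # => [13, 2]
```

Adding the same element twice leaves the registers unchanged, which is why duplicates in the stream do not affect the estimate.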
  43. cardinality(HLL) -> Number

  44. HLL = [1, 7, 8, …,3]

  45. estimate = mean([2^1, 2^7, 2^8, …,2^3])

  46. “...our algorithm differs from standard LOGLOG by its evaluation function: its is based on harmonic means, while [LOGLOG] uses what amounts to a geometric mean.”
  47. Pythagorean means

  50. ((x_1^{-1} + x_2^{-1} + … + x_n^{-1}) / n)^{-1}
  51. E = m * a(m) * m / (2^{-M_1} + … + 2^{-M_m})
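
As a rough sketch, the raw estimate above is a(m) times m times the harmonic mean of the per-register estimates 2^M_j. The constants in `alpha` are the ones given in the Flajolet et al. paper; the function names are made up for this example.

```ruby
# Harmonic mean: n divided by the sum of reciprocals.
def harmonic_mean(xs)
  xs.length / xs.sum { |x| 1.0 / x }
end

# Bias-correction constant a(m): tabulated in the paper for small m,
# closed-form approximation for m >= 128.
def alpha(m)
  case m
  when 16 then 0.673
  when 32 then 0.697
  when 64 then 0.709
  else 0.7213 / (1 + 1.079 / m)
  end
end

# E = m * a(m) * m / (2^-M_1 + ... + 2^-M_m), without range corrections.
def raw_estimate(registers)
  m = registers.length
  alpha(m) * m * harmonic_mean(registers.map { |r| 2.0**r })
end

# One large register barely moves the harmonic mean but dominates the arithmetic one.
substream_estimates = [2**1, 2**7, 2**8, 2**3]
puts harmonic_mean(substream_estimates)
puts substream_estimates.sum / substream_estimates.length.to_f
```

The last two lines show why the choice of mean matters: for the register values from the earlier slide, the arithmetic mean is pulled far up by the two large registers, while the harmonic mean stays close to the small ones.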

  52. whew

  53. add(HLL, element) -> HLL
    - Hash the input
    - Partition into substreams
    - Keep max(ρ(x)) per substream
  54. cardinality(HLL) -> number
    - Take the harmonic mean of the substream estimates and correct for bias
  55. For a set with cardinality N:
    - O(m) memory
    - O(1) time to add an element
    - std. error ~= 1.04/√m
  56. std. error ~= 1.04/√m as n → ∞

  57. Small range correction

  58. Large range correction
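
The slides only name the two corrections. A minimal sketch, assuming 32-bit hashes and the thresholds from the practical algorithm in the Flajolet et al. paper (5m/2 for the small range, 2^32/30 for the large range), could look like this; `corrected_estimate` is a hypothetical name.

```ruby
# Apply the small- and large-range corrections to a raw estimate.
# `registers` is the Array of register values; empty registers hold 0.
def corrected_estimate(raw, registers)
  m = registers.length
  two32 = 2.0**32
  if raw <= 2.5 * m
    # Small range: fall back to linear counting while empty registers remain.
    zeros = registers.count(0)
    zeros.zero? ? raw : m * Math.log(m.to_f / zeros)
  elsif raw <= two32 / 30
    # Intermediate range: the raw estimate is used as is.
    raw
  else
    # Large range: compensate for hash collisions in a 32-bit hash space.
    -two32 * Math.log(1 - raw / two32)
  end
end
```

The small-range branch matters because the 1.04/√m error bound only holds as n → ∞; for cardinalities close to or below m, most registers are still empty and the raw estimate is biased.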

  60. Consequences

  61. LOGLOG

  62. 0b0110101000101010 max(ρ(w)) = 32 - b

  63. store an int <= 32

  64. log_2(32) = 5

  65. each experiment uses at most log_2(log_2(2^32)) bits
  66. space complexity is O(log_2(log_2(N)))

  67. Actual Size

  68. total_size = m x register_size
    std_err = 1.04 / √m
    2^11 * (5 bits) = 1280 bytes
    1.04 / √(2^11) =~ 0.0230
  69. m     size     std error  p99 error
      2^10  640 B    0.0325     0.0975
      2^11  1.25 KB  0.0230     0.0690
      2^12  2.5 KB   0.0163     0.0488
      2^13  5 KB     0.0115     0.0345
      2^14  10 KB    0.0081     0.0244
      2^15  20 KB    0.0057     0.0172
      2^16  40 KB    0.0041     0.0122
  70. m     size     max uint32  max hll
      2^10  640 B    160         2^32 - 1
      2^11  1.25 KB  320         2^32 - 1
      2^12  2.5 KB   640         2^32 - 1
      2^13  5 KB     1280        2^32 - 1
      2^14  10 KB    2560        2^32 - 1
      2^15  20 KB    5120        2^32 - 1
      2^16  40 KB    10240       2^32 - 1
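
The size and error columns in the tables above follow directly from total_size = m x register_size and std_err = 1.04 / √m. The sketch below recomputes them, assuming 5-bit registers as on the earlier slide. The p99 column in the table appears to be three standard errors, and the "max uint32" column appears to be the number of 4-byte integers that fit in the same space; both readings are assumptions, and the helper names are hypothetical.

```ruby
# Bytes needed for m registers of `bits_per_register` bits each.
def register_bytes(m, bits_per_register = 5)
  m * bits_per_register / 8
end

# Asymptotic standard error of the estimate for m registers.
def std_error(m)
  1.04 / Math.sqrt(m)
end

(10..16).each do |exp|
  m = 2**exp
  bytes = register_bytes(m)
  puts format("2^%d  %6d bytes  std err %.4f  3x std err %.4f  uint32s %d",
              exp, bytes, std_error(m), 3 * std_error(m), bytes / 4)
end
```

Doubling m doubles the memory but only shrinks the error by a factor of √2, so each extra digit of accuracy costs roughly 100 times the space.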
  71. One important footnote

  72. “Given an arbitrary partitioning of the original file into subfiles, it suffices to collect register values and apply componentwise a max operation.”
  73. You can union two HLLs

  74. HLL(A) ∪ HLL(B) = HLL(A ∪ B)

  75. max(a, max(b, c)) = max(max(a, b), c)

  76. max(M_{a,1}, M_{b,1}) = M_{union,1}

  77. struct HLL{ … }
    add(HLL, element) -> HLL
    cardinality(HLL) -> Number
    union(HLL, HLL) -> HLL
  78. 2018-05-14 00:00:00 HLL{...}
    2018-05-14 01:00:00 HLL{...}
    2018-05-14 02:00:00 HLL{...}
    ...
    2018-05-14 23:00:00 HLL{...}
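
The union operation and the hourly sketches above combine naturally: because union is a register-wise max, a day's sketch can be built from its hourly sketches without touching the raw data again. A minimal illustration, with tiny made-up register arrays standing in for real HLLs:

```ruby
# union(HLL, HLL) -> HLL: register-wise max of two equally sized sketches.
def union(a, b)
  raise ArgumentError, "register counts differ" unless a.length == b.length

  a.zip(b).map { |x, y| [x, y].max }
end

# Hypothetical hourly sketches with m = 4 registers each.
hourly = [
  [1, 7, 0, 3],
  [2, 4, 5, 3],
  [0, 8, 1, 1]
]

# Roll the hours up into one sketch for the whole period.
daily = hourly.reduce { |acc, h| union(acc, h) }
p daily # => [2, 8, 5, 3]
```

Since max is associative and commutative, the hours can be merged in any order or grouping, which is also what makes the algorithm easy to parallelize.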
  79. HLL IRL

  80. HyperLogLog in Practice Heule, Nunkesser, and Hall (2013)

  81. 64-bit Hash Functions

  82. Estimating Small Cardinalities

  83. Sparse Representation

  84. Notes and Further Reading

  85. at me: @blinsay
    implementations:
    https://github.com/aggregateknowledge/hll-storage-spec
    https://github.com/twitter/algebird
    https://github.com/apache/lucene-solr
    a good series of blog posts:
    https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/