
# Ben Linsay on HyperLogLog

This extended abstract describes and analyses a near-optimal probabilistic algorithm, HyperLogLog, dedicated to estimating the number of distinct elements (the cardinality) of very large data ensembles. Using an auxiliary memory of m units (typically, "short bytes"), HyperLogLog performs a single pass over the data and produces an estimate of the cardinality such that the relative accuracy (the standard error) is typically about 1.04/√m. This improves on the best previously known cardinality estimator, LogLog, whose accuracy can be matched by consuming only 64% of the original memory. For instance, the new algorithm makes it possible to estimate cardinalities well beyond 10^9 with a typical accuracy of 2% while using a memory of only 1.5 kilobytes. The algorithm parallelizes optimally and adapts to the sliding window model.

May 14, 2018

## Transcript

1. ### HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm, Flajolet et al. (2007). Ben Linsay, PWL NYC, May 2018

- HLL IRL

6. ### “...the cardinality of a [set] can be exactly determined with a storage complexity essentially proportional to its number of elements”
7. ### - Set doesn’t fit in RAM - Set doesn’t fit on disk - Read-once data (streams)
8. ### > r = Random.new > stream = (1..10**9).lazy.map{|_| r.rand(3) } > Set.new(stream).length => 3
9. ### > r = Random.new > stream = (1..10**9).lazy.map{|_| r.rand(30000) } > Set.new(stream).length => 30000
10. ### > r = Random.new > stream = (1..10**9).lazy.map{|x| x } > Set.new(stream).length => 10**9

14. ### “...estimate cardinalities well beyond 10^9 with a typical accuracy of 2% while using a memory of only 1.5 kilobytes.”
15. ### For a set with cardinality N: - O(m) memory - std. error ~= 1.04/√m - O(1) time to add an element

18. ### “...techniques that are now standard in analysis of algorithms, like poissonization, Mellin transforms, and saddle-point depoissonization.”

23. ### pr(1…) = 1/2, pr(01…) = 1/4, pr(001…) = 1/8, pr(0001…) = 1/16, …, pr(0 * n) = 1/2^n
24. ### if the longest run of tails is x then we’ve probably done at least 2^x experiments
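The coin-flip intuition on these two slides can be checked by brute force. The sketch below is a minimal illustration, not code from the talk: `rho` and `WIDTH` are names chosen here, and ρ is assumed to mean the 1-based position of the first 1-bit in a fixed-width bit string.

```ruby
# Position of the first 1-bit, counting from the left and starting at 1.
# An all-zero value yields width + 1, because 0.bit_length == 0.
def rho(value, width)
  width - value.bit_length + 1
end

WIDTH = 16 # illustrative bit width, not taken from the talk

# Enumerate every WIDTH-bit value once; this stands in for a uniform hash.
counts = Hash.new(0)
(0...(1 << WIDTH)).each { |v| counts[rho(v, WIDTH)] += 1 }

# Fraction of values whose first 1-bit sits at position k, for k = 1..4.
fractions = (1..4).map { |k| counts[k].fdiv(1 << WIDTH) }
puts fractions.inspect # [0.5, 0.25, 0.125, 0.0625]

# Seeing rho == x is a 1-in-2**x event, so observing it hints at about 2**x draws.
x = 10
puts "rho of #{x} hints at about #{2**x} draws"
```

Enumerating all values instead of sampling keeps the fractions exact, which makes the halving pattern easy to see.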


28. ### if x = max(ρ) then we’ve probably done at least 2^x experiments
29. ### “perform [m] experiments in parallel… their arithmetic mean has standard deviation σ/√m”
30. ### Do parallel experiments to get bad estimators. Combine bad estimators into a good estimator.

37. ### “...[emulate] the effect of m experiments with a single hash function… ...divide the input stream h(M) into m substreams”

41. ### > hashed_val = hash("foobar") => 0b0110101000101010 > i = 01101 > rho_x = ρ(01000101010) > M[i] = max(M[i], rho_x)
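The register update on this slide can be traced with the slide's own 16-bit value. This is a sketch under stated assumptions: the constant names and the `rho` and `add` helpers are chosen here for illustration, and the hashed value is passed in directly rather than computed, since the talk does not name a hash function.

```ruby
INDEX_BITS = 5                      # the slide uses the first 5 bits as the register index
HASH_BITS  = 16                     # width of the slide's example hash value
REST_BITS  = HASH_BITS - INDEX_BITS # bits left over for rho

# 1-based position of the first 1-bit in a width-bit string.
def rho(value, width)
  width - value.bit_length + 1
end

# Route one hashed value to its register and keep the largest rho seen there.
def add(registers, hashed_val)
  i    = hashed_val >> REST_BITS              # leading bits pick the substream
  rest = hashed_val & ((1 << REST_BITS) - 1)  # remaining bits feed rho
  registers[i] = [registers[i], rho(rest, REST_BITS)].max
  registers
end

registers = Array.new(1 << INDEX_BITS, 0)
add(registers, 0b0110101000101010)
puts registers[0b01101] # register 13 holds rho(01000101010), which is 2
```

Because each register only ever keeps a maximum, adding the same value twice leaves the registers unchanged.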


46. ### “...our algorithm differs from standard LOGLOG by its evaluation function: it is based on harmonic means, while [LOGLOG] uses what amounts to a geometric mean.”



53. ### add(HLL, element) -> HLL - Hash the input - Partition into substreams - Keep max(ρ(x)) per substream
54. ### cardinality(HLL) -> number - Take the harmonic mean of the substream estimates and correct for bias
55. ### For a set with cardinality N: - O(m) memory - O(1) time to add an element - std. error ~= 1.04/√m
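Putting `add` and `cardinality` together, a compact version might look like the following. It is a sketch, not the speaker's code: the class layout, the default of 11 index bits, and the use of the first 64 bits of an MD5 digest as the hash are all assumptions made here. The bias constant and the small-range correction follow the formulas in the Flajolet et al. paper; the paper's large-range correction for 32-bit hashes is left out because a 64-bit hash is assumed.

```ruby
require 'digest'

# A minimal HyperLogLog sketch. Names and defaults are illustrative.
class HLL
  attr_reader :registers

  def initialize(index_bits = 11)
    @m = 1 << index_bits           # number of registers (substreams)
    @rest_bits = 64 - index_bits   # hash bits left over for rho
    @registers = Array.new(@m, 0)
  end

  # O(1): hash, pick a substream from the leading bits, keep max rho.
  def add(element)
    h = Digest::MD5.digest(element.to_s).unpack1('Q>') # first 64 bits (assumption)
    i = h >> @rest_bits
    rest = h & ((1 << @rest_bits) - 1)
    rho = @rest_bits - rest.bit_length + 1
    @registers[i] = rho if rho > @registers[i]
    self
  end

  # Harmonic mean of the per-register estimates, scaled and bias-corrected.
  def cardinality
    alpha = 0.7213 / (1 + 1.079 / @m) # paper's constant, valid for m >= 128
    raw = alpha * @m * @m / @registers.sum { |r| 2.0**(-r) }
    zeros = @registers.count(0)
    # Small-range correction: fall back to linear counting on empty registers.
    return @m * Math.log(@m.to_f / zeros) if raw <= 2.5 * @m && zeros > 0

    raw
  end
end

hll = HLL.new
100_000.times { |n| hll.add("user-#{n}") }
puts hll.cardinality.round # should land within a few percent of 100_000
```

With 2^11 registers the standard error is about 2.3%, so an estimate a couple of thousand away from the true count is normal.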



68. ### total_size = m x register_size; std_err = 1.04 / √m. For m = 2^11: 2^11 * (5 bits) = 1280 bytes; 1.04 / √(2^11) =~ 0.0230
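69. ### Size and error by m

| m | size | std error | p99 error |
|---|---|---|---|
| 2^10 | 640 B | 0.0325 | 0.0975 |
| 2^11 | 1.25 KB | 0.0230 | 0.0690 |
| 2^12 | 2.5 KB | 0.0163 | 0.0488 |
| 2^13 | 5 KB | 0.0115 | 0.0345 |
| 2^14 | 10 KB | 0.0081 | 0.0244 |
| 2^15 | 20 KB | 0.0057 | 0.0172 |
| 2^16 | 40 KB | 0.0041 | 0.0122 |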
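70. ### Capacity by m

| m | size | max uint32 | max hll |
|---|---|---|---|
| 2^10 | 640 B | 160 | 2^32 - 1 |
| 2^11 | 1.25 KB | 320 | 2^32 - 1 |
| 2^12 | 2.5 KB | 640 | 2^32 - 1 |
| 2^13 | 5 KB | 1280 | 2^32 - 1 |
| 2^14 | 10 KB | 2560 | 2^32 - 1 |
| 2^15 | 20 KB | 5120 | 2^32 - 1 |
| 2^16 | 40 KB | 10240 | 2^32 - 1 |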
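The sizing figures can be recomputed from the two formulas on the sizing slide. The sketch below assumes 5-bit registers as on that slide; `sizing_row` and its field names are made up here. Two readings are assumptions rather than statements from the talk: that the "p99 error" column is roughly three times the standard error, and that "max uint32" is the number of 32-bit integers that would fit in the same space.

```ruby
REGISTER_BITS = 5 # register width used on the sizing slide

# One row of sizing figures for m = 2**exp registers.
def sizing_row(exp)
  m = 1 << exp
  bytes = m * REGISTER_BITS / 8
  std_err = 1.04 / Math.sqrt(m)
  {
    m: m,
    bytes: bytes,
    std_err: std_err,
    three_sigma: 3 * std_err, # assumed reading of the "p99 error" column
    uint32s: bytes / 4        # 32-bit integers that fit in the same space
  }
end

(10..16).each do |exp|
  row = sizing_row(exp)
  puts format('2^%-2d %6d B  std_err=%.4f  3x=%.4f  uint32s=%d',
              exp, row[:bytes], row[:std_err], row[:three_sigma], row[:uint32s])
end
```

Small differences in the last digit against the slide can occur where the slide appears to multiply an already rounded standard error.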

72. ### “Given an arbitrary partitioning of the original file into subfiles, it suffices to collect register values and apply componentwise a max operation.”
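The componentwise max in this quote is short enough to write out. This is a sketch: the `union` name matches the interface used in the talk, but the body and the register values are illustrative.

```ruby
# Merge two register arrays of equal length by taking the larger value per slot.
def union(a, b)
  raise ArgumentError, 'register counts differ' unless a.length == b.length

  a.zip(b).map { |x, y| [x, y].max }
end

subfile_a = [0, 3, 1, 2] # hypothetical registers from one subfile
subfile_b = [2, 1, 1, 5] # hypothetical registers from another
puts union(subfile_a, subfile_b).inspect # [2, 3, 1, 5]
```

Since max is commutative, associative, and idempotent, subfiles can be merged in any order and merging a sketch with itself changes nothing.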

77. ### struct HLL{ … } add(HLL, element) -> HLL cardinality(HLL) -> Number union(HLL, HLL) -> HLL
78. ### 2018-05-14 00:00:00 HLL{...} 2018-05-14 01:00:00 HLL{...} 2018-05-14 02:00:00 HLL{...} ... 2018-05-14 23:00:00 HLL{...}
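One way to read the hourly layout: store one sketch per hour, then union them to answer questions about any longer window. The example below is a sketch with made-up visitor IDs; the helper names, the 64-register size, and the use of the first 32 bits of an MD5 digest are assumptions, not details from the talk.

```ruby
require 'digest'

INDEX_BITS = 6               # hypothetical: 2**6 = 64 registers
REST_BITS  = 32 - INDEX_BITS # remaining bits of a 32-bit hash

def empty_sketch
  Array.new(1 << INDEX_BITS, 0)
end

# Hash the element, pick a register from the leading bits, keep max rho.
def add(registers, element)
  h = Digest::MD5.digest(element.to_s).unpack1('N') # first 32 bits of the digest
  i = h >> REST_BITS
  rest = h & ((1 << REST_BITS) - 1)
  registers[i] = [registers[i], REST_BITS - rest.bit_length + 1].max
  registers
end

def union(a, b)
  a.zip(b).map { |x, y| [x, y].max }
end

def sketch_of(elements)
  elements.each_with_object(empty_sketch) { |e, regs| add(regs, e) }
end

# Made-up visitor IDs per hour.
hourly_visitors = {
  '2018-05-14 00:00:00' => %w[alice bob],
  '2018-05-14 01:00:00' => %w[bob carol],
  '2018-05-14 02:00:00' => %w[alice dave erin]
}

hourly = hourly_visitors.transform_values { |ids| sketch_of(ids) }
rolled_up = hourly.values.reduce { |a, b| union(a, b) }

# The rolled-up sketch equals one built from all of the events at once.
puts rolled_up == sketch_of(hourly_visitors.values.flatten) # true
```

Repeat visitors such as `bob` are absorbed by the max, which is why the per-hour sketches can be combined without double counting.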