
Ben Linsay on HyperLogLog

pwl
May 14, 2018


This extended abstract describes and analyses a near-optimal probabilistic algorithm, HyperLogLog, dedicated to estimating the number of distinct elements (the cardinality) of very large data ensembles. Using an auxiliary memory of m units (typically, "short bytes"), HyperLogLog performs a single pass over the data and produces an estimate of the cardinality such that the relative accuracy (the standard error) is typically about 1.04/√m. This improves on the best previously known cardinality estimator, LogLog, whose accuracy can be matched by consuming only 64% of the original memory. For instance, the new algorithm makes it possible to estimate cardinalities well beyond 10^9 with a typical accuracy of 2% while using a memory of only 1.5 kilobytes. The algorithm parallelizes optimally and adapts to the sliding window model.



Transcript

  1. “...the cardinality of a [set] can be exactly determined with a storage complexity essentially proportional to its number of elements”
  2. - Set doesn’t fit in RAM
     - Set doesn’t fit on disk
     - Read-once data (streams)
  3. "...estimate cardinalities well beyond 109 with a typical accuracy of

    2% while using a memory of only 1.5 kilobytes.”
  4. For a set with cardinality N: O(m) memory, std. error ≈ 1.04/√m, O(1) time to add an element
  5. “...techniques that are now standard in analysis of algorithms, like poissonization, Mellin transforms, and saddle-point depoissonization.”
  6. pr(1…) = 1/2, pr(01…) = 1/4, pr(001…) = 1/8, pr(0001…) = 1/16, … pr(0 × n …) = 1/2^n
  7. if the longest run of tails is x then we’ve probably done at least 2^x experiments
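
Slides 6 and 7 are the coin-flip intuition behind HyperLogLog: the leading zero bits of a hashed value behave like a run of tails, and a prefix of n tails has probability 1/2^n, so the longest run seen tells you roughly how many values you've seen. A minimal sketch of that intuition in Python (the helper name `rho` and the 32-bit width are illustrative assumptions, not from the deck):

```python
import random

def rho(bits: str) -> int:
    # 1-indexed position of the leftmost "1"; len(bits) + 1 if all zeros.
    for i, b in enumerate(bits, start=1):
        if b == "1":
            return i
    return len(bits) + 1

# After roughly 2^x random values, the longest run of leading zeros
# ("tails") observed is typically close to x.
x = 12
longest = max(
    rho("".join(random.choice("01") for _ in range(32))) - 1
    for _ in range(2**x)
)
print(f"2^{x} experiments -> longest run of tails: {longest} (expect about {x})")
```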
  8. “...[emulate] the effect of m experiments with a single hash function… ...divide the input stream h(M) into m substreams”
  9. > hashed_val = hash(“foobar”) => 0b0110101000101010
     > i = 01101
     > rho_x = ρ(01000101010)
     > M[i] = max(M[i], rho_x)
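
Slide 9's REPL example, rendered as runnable Python. The choice of hash (md5 here), the 32-bit width, and p = 5 index bits are assumptions made to mirror the slide's toy values; the deck doesn't name a specific hash function:

```python
import hashlib

p = 5            # first p bits pick the substream, so m = 2**p registers
m = 2**p
M = [0] * m      # one register per substream

def rho(bits: str) -> int:
    # 1-indexed position of the leftmost "1" (as in the earlier sketch).
    return bits.find("1") + 1 if "1" in bits else len(bits) + 1

def add(M, element: str) -> None:
    # Hash to a fixed-width bit string (md5 is an arbitrary stand-in).
    digest = hashlib.md5(element.encode()).digest()
    bits = format(int.from_bytes(digest[:4], "big"), "032b")
    i = int(bits[:p], 2)               # substream index from the prefix
    M[i] = max(M[i], rho(bits[p:]))    # keep the max rank per substream

add(M, "foobar")
```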
  10. “...our algorithm differs from standard LOGLOG by its evaluation function: it is based on harmonic means, while [LOGLOG] uses what amounts to a geometric mean.”
  11. add(HLL, element) -> HLL
      - Hash the input
      - Partition into substreams
      - Keep max(ρ(x)) per substream
  12. cardinality(HLL) -> number
      - Take the harmonic mean of the substream estimates and correct for bias
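
A hedged sketch of slide 12's evaluation function, using the paper's bias-correction constant α_m ≈ 0.7213/(1 + 1.079/m) (valid for m ≥ 128); the paper's small- and large-range corrections are omitted for brevity:

```python
def cardinality(M) -> float:
    m = len(M)
    alpha = 0.7213 / (1 + 1.079 / m)   # bias correction, per the paper
    # Harmonic mean of the per-substream estimates 2**M[j]:
    harmonic = m / sum(2.0 ** -r for r in M)
    return alpha * m * harmonic        # E = alpha_m * m^2 / sum(2^-M[j])
```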
  13. For a set with cardinality N: O(m) memory, O(1) time to add an element, std. error ≈ 1.04/√m
  14. total_size = m × register_size; std_err = 1.04 / √m
      2^11 × (5 bits) = 1280 bytes; 1.04 / √(2^11) ≈ 0.0230
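
Slide 14's sizing arithmetic as a quick check (the 5-bit register width is enough to hold ranks from a 32-bit hash):

```python
import math

p = 11
m = 2**p                            # number of registers
register_bits = 5                   # enough for ranks of a 32-bit hash
total_size = m * register_bits / 8  # bytes
std_err = 1.04 / math.sqrt(m)
print(f"m = 2^{p}: {total_size:.0f} bytes, std err ~= {std_err:.4f}")
# -> m = 2^11: 1280 bytes, std err ~= 0.0230
```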
  15. m     size   std error  p99 error
      2^10  640b   0.0325     0.0975
      2^11  1.25k  0.0230     0.0690
      2^12  2.5k   0.0163     0.0488
      2^13  5k     0.0115     0.0345
      2^14  10k    0.0081     0.0244
      2^15  20k    0.0057     0.0172
      2^16  40k    0.0041     0.0122
  16. m     size   max uint32  max hll
      2^10  640b   160         2^32 - 1
      2^11  1.25k  320         2^32 - 1
      2^12  2.5k   640         2^32 - 1
      2^13  5k     1280        2^32 - 1
      2^14  10k    2560        2^32 - 1
      2^15  20k    5120        2^32 - 1
      2^16  40k    10240       2^32 - 1
  17. “Given an arbitrary partitioning of the original file into subfiles, it suffices to collect register values and apply componentwise a max operation.”
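
The merge property quoted in slide 17 means two HLLs built with the same m and hash function combine losslessly; a one-line sketch:

```python
def merge(M_a: list, M_b: list) -> list:
    # Componentwise max of registers; the result is exactly the sketch a
    # single pass over both inputs would have produced.
    assert len(M_a) == len(M_b), "sketches must use the same m"
    return [max(a, b) for a, b in zip(M_a, M_b)]
```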
  18. at me: @blinsay
      implementations:
      https://github.com/aggregateknowledge/hll-storage-spec
      https://github.com/twitter/algebird
      https://github.com/apache/lucene-solr
      a good series of blog posts:
      https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/