
Ben Linsay on HyperLogLog

pwl
May 14, 2018


This extended abstract describes and analyses a near-optimal probabilistic algorithm, HyperLogLog, dedicated to estimating the number of distinct elements (the cardinality) of very large data ensembles. Using an auxiliary memory of m units (typically, "short bytes"), HyperLogLog performs a single pass over the data and produces an estimate of the cardinality such that the relative accuracy (the standard error) is typically about 1.04/√m. This improves on the best previously known cardinality estimator, LogLog, whose accuracy can be matched by consuming only 64% of the original memory. For instance, the new algorithm makes it possible to estimate cardinalities well beyond 10^9 with a typical accuracy of 2% while using a memory of only 1.5 kilobytes. The algorithm parallelizes optimally and adapts to the sliding window model.



Transcript

  1. “...the cardinality of a [set] can be exactly determined with a storage complexity essentially proportional to its number of elements”
  2. - Set doesn’t fit in RAM
     - Set doesn’t fit on disk
     - Read-once data (streams)
  3. "...estimate cardinalities well beyond 109 with a typical accuracy of

    2% while using a memory of only 1.5 kilobytes.”
  4. For a set with cardinality N: O(m) memory, std. error ≈ 1.04/√m, O(1) time to add an element
  5. “...techniques that are now standard in analysis of algorithms, like poissonization, Mellin transforms, and saddle-point depoissonization.”
  6. pr(1…) = 1/2, pr(01…) = 1/4, pr(001…) = 1/8, pr(0001…) = 1/16, … pr(0 × n …) = 1/2^n
  7. if the longest run of tails is x then we’ve probably done at least 2^x experiments
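
Slides 6 and 7 are the coin-flip intuition behind HyperLogLog: the leading zero bits of a hashed value behave like a run of tails, and a prefix of n tails has probability 1/2^n, so the longest run seen tells you roughly how many values you've seen. A minimal sketch of that intuition in Python (the helper name `rho` and the 32-bit width are illustrative assumptions, not from the deck):

```python
import random

def rho(bits: str) -> int:
    # 1-indexed position of the leftmost "1"; len(bits) + 1 if all zeros.
    for i, b in enumerate(bits, start=1):
        if b == "1":
            return i
    return len(bits) + 1

# After roughly 2^x random values, the longest run of leading zeros
# ("tails") observed is typically close to x.
x = 12
longest = max(
    rho("".join(random.choice("01") for _ in range(32))) - 1
    for _ in range(2**x)
)
print(f"2^{x} experiments -> longest run of tails: {longest} (expect about {x})")
```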
  8. “...[emulate] the effect of m experiments with a single hash function… ...divide the input stream h(M) into m substreams”
  9. > hashed_val = hash(“foobar”) => 0b0110101000101010
     > i = 01101
     > rho_x = ρ(01000101010)
     > M[i] = max(M[i], rho_x)
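
Slide 9's REPL example, rendered as runnable Python. The choice of hash (md5 here), the 32-bit width, and p = 5 index bits are assumptions made to mirror the slide's toy values; the deck doesn't name a specific hash function:

```python
import hashlib

p = 5            # first p bits pick the substream, so m = 2**p registers
m = 2**p
M = [0] * m      # one register per substream

def rho(bits: str) -> int:
    # 1-indexed position of the leftmost "1" (as in the earlier sketch).
    return bits.find("1") + 1 if "1" in bits else len(bits) + 1

def add(M, element: str) -> None:
    # Hash to a fixed-width bit string (md5 is an arbitrary stand-in).
    digest = hashlib.md5(element.encode()).digest()
    bits = format(int.from_bytes(digest[:4], "big"), "032b")
    i = int(bits[:p], 2)               # substream index from the prefix
    M[i] = max(M[i], rho(bits[p:]))    # keep the max rank per substream

add(M, "foobar")
```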
  10. “...our algorithm differs from standard LOGLOG by its evaluation function: it is based on harmonic means, while [LOGLOG] uses what amounts to a geometric mean.”
  11. add(HLL, element) -> HLL
      - Hash the input
      - Partition into substreams
      - Keep max(ρ(x)) per substream
  12. cardinality(HLL) -> number
      - Take the harmonic mean of the substream estimates and correct for bias
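
A hedged sketch of slide 12's evaluation function, using the paper's bias-correction constant α_m ≈ 0.7213/(1 + 1.079/m) (valid for m ≥ 128); the paper's small- and large-range corrections are omitted for brevity:

```python
def cardinality(M) -> float:
    m = len(M)
    alpha = 0.7213 / (1 + 1.079 / m)   # bias correction, per the paper
    # Harmonic mean of the per-substream estimates 2**M[j]:
    harmonic = m / sum(2.0 ** -r for r in M)
    return alpha * m * harmonic        # E = alpha_m * m^2 / sum(2^-M[j])
```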
  13. For a set with cardinality N: O(m) memory, O(1) time to add an element, std. error ≈ 1.04/√m
  14. total_size = m × register_size; std_err = 1.04 / √m
      2^11 × (5 bits) = 1280 bytes; 1.04 / √(2^11) ≈ 0.0230
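
Slide 14's sizing arithmetic as a quick check (the 5-bit register width is enough to hold ranks from a 32-bit hash):

```python
import math

p = 11
m = 2**p                            # number of registers
register_bits = 5                   # enough for ranks of a 32-bit hash
total_size = m * register_bits / 8  # bytes
std_err = 1.04 / math.sqrt(m)
print(f"m = 2^{p}: {total_size:.0f} bytes, std err ~= {std_err:.4f}")
# -> m = 2^11: 1280 bytes, std err ~= 0.0230
```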
  15. m     size   std error  p99 error
      2^10  640b   0.0325     0.0975
      2^11  1.25k  0.0230     0.0690
      2^12  2.5k   0.0163     0.0488
      2^13  5k     0.0115     0.0345
      2^14  10k    0.0081     0.0244
      2^15  20k    0.0057     0.0172
      2^16  40k    0.0041     0.0122
  16. m     size   max uint32  max hll
      2^10  640b   160         2^32 - 1
      2^11  1.25k  320         2^32 - 1
      2^12  2.5k   640         2^32 - 1
      2^13  5k     1280        2^32 - 1
      2^14  10k    2560        2^32 - 1
      2^15  20k    5120        2^32 - 1
      2^16  40k    10240       2^32 - 1
  17. “Given an arbitrary partitioning of the original file into subfiles, it suffices to collect register values and apply componentwise a max operation.”
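
The merge property quoted in slide 17 means two HLLs built with the same m and hash function combine losslessly; a one-line sketch:

```python
def merge(M_a: list, M_b: list) -> list:
    # Componentwise max of registers; the result is exactly the sketch a
    # single pass over both inputs would have produced.
    assert len(M_a) == len(M_b), "sketches must use the same m"
    return [max(a, b) for a, b in zip(M_a, M_b)]
```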
  18. at me: @blinsay
      implementations:
      https://github.com/aggregateknowledge/hll-storage-spec
      https://github.com/twitter/algebird
      https://github.com/apache/lucene-solr
      a good series of blog posts:
      https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/