Breaking Down Our Billion User Reach with HyperLogLog

Measuring unique users in a billion-user network is hard: exact counting is space-consuming and not easily distributable.

In this talk I will describe HyperLogLog, a probabilistic cardinality estimation algorithm and data structure, and how we used it to provide breakdowns of our billion-user reach.

avramson

June 16, 2019

Transcript

  1. !2

  2. Impressions Table

    UUID | Country | Publisher | Platform | ...
    ad4f038f-e675-4265-850b-d7a94519126f | US | CNN | mobile | ...
    3f712b92-89b6-4e12-8cff-563df2a2e524 | JP | MSN | desktop | ...
    ad4f038f-e675-4265-850b-d7a94519126f | US | CNN | mobile | ...
    ad4f038f-e675-4265-850b-d7a94519126f | US | MSN | mobile | ...
    fe9e5096-4019-11e9-b210-d663bd873d93 | US | CNN | desktop | ...
    ...
  3. Count Is Easy, Count Distinct Is Hard

    Three groups of values drawn from Neptune, Mars, Saturn, Venus:
    Group 1: Count = 5, Unique = 3
    Group 2: Count = 5, Unique = 2
    Group 3: Count = 4, Unique = 3
    Total: Count = 5 + 5 + 4, Unique = ?
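
The point of this slide can be sketched in a few lines of Python. The assignment of planet names to the three groups below is hypothetical (the slide layout did not survive extraction); it is chosen only so that the per-group counts and unique counts match the numbers on the slide.

```python
# Hypothetical grouping that reproduces the slide's per-group numbers:
# counts 5, 5, 4 and unique counts 3, 2, 3.
partitions = {
    "group_1": ["Neptune", "Mars", "Neptune", "Saturn", "Neptune"],
    "group_2": ["Neptune", "Neptune", "Saturn", "Saturn", "Neptune"],
    "group_3": ["Neptune", "Saturn", "Saturn", "Venus"],
}

# Plain counts are additive: per-group counts can simply be summed.
total_count = sum(len(values) for values in partitions.values())

# Unique counts are not additive: summing them double-counts values
# that appear in more than one group.
sum_of_uniques = sum(len(set(values)) for values in partitions.values())

# An exact answer needs the union of the underlying sets, i.e. every
# group has to ship its full set of values, not just a number.
true_unique = len(set().union(*partitions.values()))

print(total_count, sum_of_uniques, true_unique)  # 14 8 4
```

Shipping full sets is what becomes expensive at a billion users, which is the gap a mergeable sketch is meant to fill.
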
  4. Sliding Window

    Day 1 | Day 2 | Day 3 | Day 4 | Day 5
    Unique elements over 3 days
  5. !8

  6. Pr(height < 190)¹ ≈ 97%

    Pr(height < 190)¹⁰ ≈ 74%
    Pr(height < 190)¹⁰⁰ ≈ 5%
    Pr(height < 190)²⁰⁰ ≈ 0.2%
    Pr(height < 190)⁵⁰⁰ ≈ 0.00002%
  7. Pr(height < 177)¹ ≈ 50%

    Pr(height < 177)⁵ ≈ 3%
    Pr(height < 177)¹⁰ ≈ 0.1%
    Pr(height < 177)⁵⁰ ≈ 0.00000000000009%
    Pr(height < 177)¹⁰⁰ ≈ 0.000000000000000000000000
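
One way to read the two probability slides: if a single randomly chosen person is shorter than a threshold with probability p, then, assuming the people are independent, all n people in a group are shorter with probability pⁿ. The sketch below recomputes the slide values under that assumption; the function name is illustrative.

```python
def prob_all_below(p_single: float, n: int) -> float:
    """Probability that n independent samples all fall below the threshold."""
    return p_single ** n


# Threshold 190 (p is about 0.97) and threshold 177 (p is about 0.5), as on the slides.
for p, group_sizes in ((0.97, (1, 10, 100, 200, 500)), (0.5, (1, 5, 10, 50))):
    for n in group_sizes:
        print(f"p={p}, n={n}: {100 * prob_all_below(p, n):.14f}%")
```

The takeaway is the reverse inference: observing a very unlikely value (a very tall person, or a long run of leading zeros) suggests that many samples were drawn.
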
  8. Bit-pattern observable

    00000100 11110100 00000100 01001101 11011101 01000101 11111010 10001001
    01111111 11100010 11100111 11000110 00110101 10111001 01000010 10011111
    00011000 00101101 10100011 00111011 10011110 00100000 01111010 10100100
    01011110 11100100 01101011 11000100 10100101 00101100 10001100 10001101
  9. Use a uniformly distributed hash function h to transform our data domain into a stream of bits:

    • h : D → {0, 1}^∞
    • A = B ⇒ h(A) = h(B)
    • Independently for each bit xᵢ in h(A): Pr(xᵢ=0) = Pr(xᵢ=1) = 50%
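
A minimal sketch of such a hash function, with two assumptions that are not from the talk: SHA-256 stands in for the uniformly distributed h, and the infinite bit stream is truncated to a fixed width (64 bits here). The name `hash_bits` is illustrative.

```python
import hashlib


def hash_bits(value: str, width: int = 64) -> str:
    """Map a value to a fixed-width string of '0'/'1' characters.

    Assumption: SHA-256 is used as a stand-in for a uniformly distributed
    hash; `width` must be a multiple of 8 and at most 256.
    """
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    as_int = int.from_bytes(digest[: width // 8], "big")
    return format(as_int, f"0{width}b")


# Equal inputs always produce equal bit strings (A = B implies h(A) = h(B)).
print(hash_bits("Neptune"))
print(hash_bits("Neptune") == hash_bits("Neptune"))  # True
```
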
  10. Let ρ(x₁x₂x₃…) be the position of the leftmost 1 in a stream of bits x₁x₂x₃…:

    ρ(1…) = 1
    ρ(01…) = 2
    ρ(001…) = 3
    ρ(0001…) = 4
    ...
    Example: ρ(00101100...) = 3 = index of first 1
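
The position of the leftmost 1 (written ρ here) is simple to compute on a finite bit string. The sketch below assumes bits are given as a string of '0' and '1' characters; returning the length plus one for an all-zero string is a convention chosen here for finite inputs, not something stated on the slide.

```python
def rho(bits: str) -> int:
    """1-based position of the leftmost '1' in a bit string."""
    position = bits.find("1")
    if position == -1:
        # All zeros: by convention, one past the end of the finite string.
        return len(bits) + 1
    return position + 1


print(rho("1"), rho("01"), rho("001"), rho("0001"))  # 1 2 3 4
print(rho("00101100"))  # 3
```
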
  11. Let n be the number of unique elements:

    ρ(1…) = 1
    ρ(01…) = 2
    ρ(001…) = 3
    ρ(0001…) = 4
    ...
    We expect n/2ᵏ elements to have ρ = k
    We can use 2^max(ρ(x)) to estimate n
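
The n/2ᵏ expectation can be checked empirically. The sketch below is an illustration under assumptions not taken from the talk: SHA-256 truncated to 64 bits stands in for the hash function, and the element names `user-0`, `user-1`, … are made up.

```python
import hashlib
from collections import Counter


def rho_of_hash(value: str, width: int = 64) -> int:
    """Position of the leftmost 1 bit in a width-bit hash of value."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    x = int.from_bytes(digest[: width // 8], "big")
    # Leading zeros = width - bit_length, so the first 1 is one past them.
    return width - x.bit_length() + 1


n = 20_000
rhos = [rho_of_hash(f"user-{i}") for i in range(n)]
counts = Counter(rhos)

# Observed number of elements with rho = k next to the expected n / 2^k.
for k in range(1, 6):
    print(k, counts[k], n / 2 ** k)

# The single-maximum estimate from the slide: always a power of two,
# and typically only within an order of magnitude of n.
crude_estimate = 2 ** max(rhos)
print(crude_estimate)
```

A single maximum is a noisy estimator, which is presumably why the slides that follow split the stream into buckets and combine many such observations.
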
  12. h(“Neptune”) = 0100011010

    Bits: 0 1 0 | 0 0 1 1 0 1 0
    bucket_index = 2 (from the first b = 3 bits, 010)
    ρ(0011010) = 3 = index of first 1
    Using b = 3 bits we get m = 2ᵇ = 8 buckets
  13. M[0..m-1]: Array of m values (m = 2ᵇ), initialized to 0’s

    ρ(x₁x₂x₃…): Index of the leftmost 1 in a stream of bits x₁x₂x₃…
    observe(value):
        x₁x₂x₃… := h(value)
        bucket_index := value from the first b bits: x₁x₂…x_b
        M[bucket_index] := max( M[bucket_index] , ρ(x_{b+1}x_{b+2}…) )
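
The observe step can be sketched in Python as follows. To stay close to the slides, this version takes an already-hashed bit string instead of calling a real hash function, and it uses the illustrative 10-bit hash values the deck gives for “Neptune”, “Mars” and “Venus”; the helper names are otherwise assumptions.

```python
def rho(bits: str) -> int:
    """1-based position of the leftmost '1'; len + 1 if there is none."""
    position = bits.find("1")
    return len(bits) + 1 if position == -1 else position + 1


def observe(M: list, hashed_bits: str, b: int) -> None:
    """Update the bucket array M in place with one hashed value."""
    bucket_index = int(hashed_bits[:b], 2)  # first b bits pick the bucket
    M[bucket_index] = max(M[bucket_index], rho(hashed_bits[b:]))


b = 3
M = [0] * (2 ** b)  # m = 2^b = 8 buckets, all zero

# Illustrative hash values from the slides, not real hash outputs.
for hashed in ("0100011010", "1100001011", "1100110111"):
    observe(M, hashed, b)

print(M)  # [0, 0, 3, 0, 0, 0, 4, 0]
```

The third value lands in bucket 6 with ρ = 2, which is smaller than the 4 already stored there, so the array does not change.
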
  14. M[]: 0 0 0 0 0 0 0 0

    h(“Neptune”) = 010 0011010
  15. M[]: 0 0 3 0 0 0 0 0

    h(“Neptune”) = 010 0011010
  16. M[]: 0 0 3 0 0 0 0 0

    h(“Mars”) = 110 0001011
  17. M[]: 0 0 3 0 0 0 4 0

    h(“Mars”) = 110 0001011
  18. M[]: 0 0 3 0 0 0 4 0

    h(“Venus”) = 110 0110111
  19. M[]: 0 0 3 0 0 0 4 0

    h(“Venus”) = 110 0110111
  20. M[]: 3 4 3 3 7 3 9 3

    m estimates of log₂(n/m)
    Bits used: m·log₂(log₂(n/m))
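
To make the space formula concrete, the sketch below evaluates it for one billion unique elements and 1024 buckets. Both numbers are chosen here for illustration, and rounding each bucket up to a whole number of bits is an assumption; the slide's formula is an order-of-magnitude statement.

```python
import math


def bits_per_bucket(n: int, m: int) -> int:
    """Whole bits needed to store a value of about log2(n/m) in one bucket."""
    return math.ceil(math.log2(math.log2(n / m)))


n, m = 1_000_000_000, 1024
per_bucket = bits_per_bucket(n, m)  # log2(log2(n/m)) is about 4.3, so 5 bits
total_bits = m * per_bucket

print(per_bucket, total_bits, total_bits // 8)  # 5 5120 640
```

Under these assumptions the whole sketch fits in a few hundred bytes, compared with storing a billion identifiers for an exact count.
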
  21. M[0..m-1]: Array of m values

    logLogEstimate(M): Standard error ≈ 1.3/√m
    $n = \alpha_m \cdot m \cdot 2^{\frac{1}{m}\sum_i M[i]}$
    The constant $\alpha_m$ is the bias correction factor for m
  22. superLogLogEstimate(M): Standard error ≈ 1.05/√m

    Remove highest 30% of values from M to get M₀[], m₀ = 0.7·m
    $n = \tilde{\alpha}_m \cdot m_0 \cdot 2^{\frac{1}{m_0}\sum_i M_0[i]}$
  23. hyperLogLogEstimate(M): Standard error ≈ 1.04/√m

    $n = \alpha_m \cdot m \cdot \dfrac{m}{\sum_i 2^{-M[i]}}$
    Small range correction if n < 5m/2:
        v = number of untouched buckets in M
        if v = 0 return the normal HLL estimate n, otherwise return m·log(m/v)

    bits | buckets | σ
    10 | 1024 | 3.25%
    11 | 2048 | 2.30%
    12 | 4096 | 1.63%
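
A minimal end-to-end sketch of the HyperLogLog estimate with the small range correction described on this slide. Several things here are assumptions rather than content from the talk: SHA-256 truncated to 64 bits stands in for the hash function, the bias constants are the ones published in the HyperLogLog paper (the closed form is given there for m ≥ 128), the large range correction is omitted, and the element names are made up.

```python
import hashlib
import math


def alpha(m: int) -> float:
    """Bias correction constant, using the values from the HyperLogLog paper."""
    if m == 16:
        return 0.673
    if m == 32:
        return 0.697
    if m == 64:
        return 0.709
    return 0.7213 / (1 + 1.079 / m)  # stated in the paper for m >= 128


def observe(M: list, value: str, b: int, width: int = 64) -> None:
    """Hash value, pick a bucket from the first b bits, keep the max rank."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    x = int.from_bytes(digest[: width // 8], "big")
    bucket_index = x >> (width - b)
    rest = x & ((1 << (width - b)) - 1)
    rank = (width - b) - rest.bit_length() + 1  # leftmost 1 in remaining bits
    M[bucket_index] = max(M[bucket_index], rank)


def hyperloglog_estimate(M: list) -> float:
    m = len(M)
    n = alpha(m) * m * m / sum(2.0 ** -r for r in M)
    if n < 5 * m / 2:  # small range correction
        v = M.count(0)  # untouched buckets
        if v != 0:
            return m * math.log(m / v)
    return n


b = 10
M = [0] * (1 << b)  # 1024 buckets, expected standard error about 3.25%
true_n = 50_000
for i in range(true_n):
    observe(M, f"user-{i}", b)

estimate = hyperloglog_estimate(M)
print(round(estimate), f"{abs(estimate - true_n) / true_n:.2%} off")
```

Observing the same value again never changes the array, so duplicates do not affect the estimate.
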
  24. M₁, M₂: Arrays of m values

    merge(M₁, M₂):
        for i=0 to i=m-1:
            Mᵣ[i] := max( M₁[i] , M₂[i] )
  25. M₁: 4 2 5 3 5 4 4 5

    M₂: 3 4 3 3 7 3 9 3
    M₁∪M₂: 4 4 5 3 7 4 9 5
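
The merge step is an element-wise maximum, which can be sketched in Python and checked against the example arrays from this slide. The length check is an addition made here, on the assumption that only sketches with the same number of buckets are mergeable.

```python
def merge(M1: list, M2: list) -> list:
    """Union of two sketches: keep the larger value in every bucket."""
    if len(M1) != len(M2):
        raise ValueError("sketches must have the same number of buckets")
    return [max(a, b) for a, b in zip(M1, M2)]


M1 = [4, 2, 5, 3, 5, 4, 4, 5]
M2 = [3, 4, 3, 3, 7, 3, 9, 3]

merged = merge(M1, M2)
print(merged)  # [4, 4, 5, 3, 7, 4, 9, 5]
```

Because max is commutative, associative and idempotent, sketches can be merged in any order and any number of times, which is what makes per-day or per-publisher sketches combinable into a total.
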
  26. ADD jar /usr/lib/hive/lib/sketches-hive-0.11.0-with-shaded-core.jar;

    CREATE TEMPORARY FUNCTION data2sketch AS 'com.yahoo.sketches.hive.hll.DataToSketchUDAF';

    INSERT INTO TABLE sketch_daily_publisher_impressions
    SELECT
      publisher_id as publisher_id,      -- int
      CURRENT_DATE as stats_date,        -- string
      data2sketch(uuid) as uuids_sketch  -- binary
    FROM clicks
    WHERE impression_date = CURRENT_DATE
    GROUP BY publisher_id;
  27. CREATE TEMPORARY FUNCTION unionSketches as 'com.yahoo.sketches.hive.hll.UnionSketchUDAF';

    SELECT
      stats_date,
      round(estimate(unionSketches(uuids_sketch)))
    FROM sketch_daily_publisher_impressions
    WHERE stats_date > date_sub(CURRENT_DATE, 7)
    GROUP BY stats_date;
  28. Thank You, Any Questions?

    Further reading:
    HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm (2007) by Philippe Flajolet, Éric Fusy, Olivier Gandouet, et al.
    DataSketches.GitHub.io
    Photos by Patrick Fore on Unsplash