Breaking Down Our Billion User Reach with HyperLogLog

Daniel Avramson

!3 Impressions Table

UUID | Country | Publisher | Platform | ...
ad4f038f-e675-4265-850b-d7a94519126f | US | CNN | mobile | ...
3f712b92-89b6-4e12-8cff-563df2a2e524 | JP | MSN | desktop | ...
ad4f038f-e675-4265-850b-d7a94519126f | US | CNN | mobile | ...
ad4f038f-e675-4265-850b-d7a94519126f | US | MSN | mobile | ...
fe9e5096-4019-11e9-b210-d663bd873d93 | US | CNN | desktop | ...
...

!4 ONE DOES NOT SIMPLY COUNT DISTINCT ALL THE DATA

!5 Count Is Easy, Count Distinct Is Hard

Neptune, Mars, Neptune, Neptune, Saturn: Count = 5, Unique = 3
Neptune, Neptune, Saturn, Neptune, Saturn: Count = 5, Unique = 2
Neptune, Saturn, Saturn, Venus: Count = 4, Unique = 3

Total: Count = 5 + 5 + 4, Unique = ?
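A minimal Python sketch of this slide's point, using the three groups as listed: the per-group counts can simply be summed, while the per-group unique counts cannot. The variable names are illustrative and not taken from the talk.

```python
# The three groups of values from the slide, in order.
groups = [
    ["Neptune", "Mars", "Neptune", "Neptune", "Saturn"],
    ["Neptune", "Neptune", "Saturn", "Neptune", "Saturn"],
    ["Neptune", "Saturn", "Saturn", "Venus"],
]

counts = [len(g) for g in groups]          # per-group row counts
uniques = [len(set(g)) for g in groups]    # per-group distinct counts

total_count = sum(counts)                  # counts are additive
naive_unique = sum(uniques)                # not a valid total: shared values are double-counted
exact_unique = len(set().union(*groups))   # exact answer needs all distinct values in one place

print(total_count, naive_unique, exact_unique)
```

The exact total requires merging the underlying sets, which is what becomes expensive at the scale of a full impressions table.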

!6 Sliding Window
Day 1, Day 2, Day 3, Day 4, Day 5
Unique elements over 3 days
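One way to picture the sliding-window problem is with exact sets, sketched below in Python. The daily data and the function name `sliding_unique` are hypothetical; the point is only that an exact answer needs every day's full set of values to be kept around.

```python
from collections import deque

def sliding_unique(daily_sets, window=3):
    """Yield the exact number of unique elements in each `window`-day span."""
    recent = deque(maxlen=window)  # holds the last `window` daily sets
    for day in daily_sets:
        recent.append(day)
        if len(recent) == window:
            yield len(set().union(*recent))

# Hypothetical data for Day 1 .. Day 5.
days = [{"a", "b"}, {"b", "c"}, {"c"}, {"c"}, {"a", "d"}]
result = list(sliding_unique(days))
print(result)
```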

HyperLogLog (2007) !7

!9
Pr(height < 190)¹ ≈ 97%
Pr(height < 190)¹⁰ ≈ 74%
Pr(height < 190)¹⁰⁰ ≈ 5%
Pr(height < 190)²⁰⁰ ≈ 0.2%
Pr(height < 190)⁵⁰⁰ ≈ 0.00002%

!10
Pr(height < 177)¹ ≈ 50%
Pr(height < 177)⁵ ≈ 3%
Pr(height < 177)¹⁰ ≈ 0.1%
Pr(height < 177)⁵⁰ ≈ 0.00000000000009%
Pr(height < 177)¹⁰⁰ ≈ 0.000000000000000000000000
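The figures on these two slides follow from raising a single-person probability to the power of the group size, assuming the heights are independent. A small Python check, using the 97% and 50% values from the slides (the function name is illustrative):

```python
def prob_all_below(p_single: float, n: int) -> float:
    """Probability that all n independent draws fall below the threshold."""
    return p_single ** n

# Pr(height < 190) is about 97% for one person, per the slide.
tall = {n: prob_all_below(0.97, n) for n in (1, 10, 100, 200, 500)}
# Pr(height < 177) is about 50% for one person, per the slide.
median = {n: prob_all_below(0.50, n) for n in (1, 5, 10, 50, 100)}

for n, p in tall.items():
    print(f"190 cm, n={n}: {p:.6%}")
for n, p in median.items():
    print(f"177 cm, n={n}: {p:.3e}")
```

The larger the group, the less likely it is that nobody exceeds the threshold, so observing an extreme value hints at how large the group is.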

!11 Bit-pattern observable
00000100 11110100 00000100 01001101 11011101 01000101 11111010 10001001
01111111 11100010 11100111 11000110 00110101 10111001 01000010 10011111
00011000 00101101 10100011 00111011 10011110 00100000 01111010 10100100
01011110 11100100 01101011 11000100 10100101 00101100 10001100 10001101

!12 Use a uniformly distributed hash function h to transform our data domain into a stream of bits:
• h : D → {0, 1}^∞
• A = B ⇒ h(A) = h(B)
• Independently for each bit xᵢ in h(A): Pr(xᵢ=0) = Pr(xᵢ=1) = 50%
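As an illustration of these properties, the Python sketch below turns a value into a finite prefix of such a bit stream. Using SHA-256 and keeping 64 bits are assumptions made for the example; the talk does not say which hash function is used.

```python
import hashlib

def hash_bits(value: str, nbits: int = 64) -> str:
    """Return the first `nbits` bits of SHA-256(value) as a string of 0s and 1s.

    SHA-256 stands in for the uniformly distributed hash h (an assumption).
    """
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    as_int = int.from_bytes(digest, "big")
    full = format(as_int, "0256b")  # zero-padded 256-bit binary string
    return full[:nbits]

# Equal inputs always give equal bit streams: A = B implies h(A) = h(B).
print(hash_bits("Neptune"))
print(hash_bits("Neptune") == hash_bits("Neptune"))
```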

!13 Let ρ(x₁x₂x₃…) be the position of the leftmost 1 in a stream of bits x₁x₂x₃…:
ρ(1…) = 1
ρ(01…) = 2
ρ(001…) = 3
ρ(0001…) = 4
...
Example: ρ(00101100...) = 3 = index of first 1
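The leftmost-1 position can be written as a few lines of Python operating on a bit string. The function name `rho` and the handling of an all-zero input are choices made for this sketch, not something the slide specifies.

```python
def rho(bits: str) -> int:
    """1-based position of the leftmost '1' in a string of 0s and 1s."""
    index = bits.find("1")
    if index == -1:
        # No 1 in this finite prefix. Returning len + 1 is a convention
        # chosen for this sketch only.
        return len(bits) + 1
    return index + 1

print(rho("00101100"))  # the example from the slide
```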

!14 Let n be the number of unique elements:
ρ(1…) = 1
ρ(01…) = 2
ρ(001…) = 3
ρ(0001…) = 4
...
We expect n/2ᵏ elements to have ρ = k.
We can use 2^max(ρ(x)) to estimate n.
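A rough way to see both claims is to hash many distinct values and tabulate the leftmost-1 positions. The sketch below assumes SHA-256 truncated to 64 bits as the hash and uses made-up `user-<i>` strings as the unique elements.

```python
import hashlib
from collections import Counter

NBITS = 64  # assumption: work with the first 64 bits of the hash

def rho_of(value: str) -> int:
    """Leftmost-1 position (1-based) in the first 64 bits of SHA-256(value)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    x = int.from_bytes(digest[:8], "big")
    # Leading zeros = NBITS - bit_length, so the leftmost 1 is one past them.
    return NBITS - x.bit_length() + 1

n = 10_000
ranks = [rho_of(f"user-{i}") for i in range(n)]  # hypothetical unique elements
histogram = Counter(ranks)

for k in range(1, 6):
    print(f"rho = {k}: observed {histogram[k]}, expected about {n / 2 ** k:.0f}")

estimate = 2 ** max(ranks)  # single-observable estimate of n
print("estimate from the maximum:", estimate)
```

The estimate from one maximum is always a power of two and can be far off, which is the motivation for averaging over many buckets.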

!15 2³⁷

!16 Stochastic Averaging

!17 h(“Neptune”) = 0100011010
Bits: 0 1 0 | 0 0 1 1 0 1 0
bucket_index = 2 (the first b = 3 bits, 010)
ρ(0011010) = 3 = index of first 1
Using b = 3 bits we get m = 2ᵇ = 8 buckets

!18 M[0..m-1]: Array of m values (m = 2ᵇ), initialized to 0’s
ρ(x₁x₂x₃…): Index of the leftmost 1 in a stream of bits x₁x₂x₃…

observe(value):
    x₁x₂x₃… := h(value)
    bucket_index := value from the first b bits: x₁x₂…x_b
    M[bucket_index] := max( M[bucket_index] , ρ(x_{b+1}x_{b+2}…) )
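The observe step might look as follows in Python. This is a sketch under stated assumptions: SHA-256 truncated to 64 bits plays the role of h, b = 3 matches the earlier slide, and the bit manipulation is one possible way to split the hash.

```python
import hashlib

B = 3            # index bits, as on the slide
M_SIZE = 2 ** B  # m = 8 buckets
HASH_BITS = 64   # assumption: use the first 64 bits of SHA-256 as h

def h(value: str) -> int:
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def observe(M: list, value: str) -> None:
    """Update the bucket array M with one observed value."""
    x = h(value)
    rest_bits = HASH_BITS - B
    bucket_index = x >> rest_bits             # value of the first b bits
    rest = x & ((1 << rest_bits) - 1)         # the remaining bits
    rank = rest_bits - rest.bit_length() + 1  # position of the leftmost 1
    M[bucket_index] = max(M[bucket_index], rank)

M = [0] * M_SIZE
for planet in ["Neptune", "Mars", "Neptune", "Venus", "Saturn"]:
    observe(M, planet)
print(M)
```

Observing a value a second time never changes M, which is why duplicates do not affect the estimate.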

M[]: !19 0 0 0 0 0 0 0 0
h(“Neptune”) = 010 0011010

M[]: !20 0 0 3 0 0 0 0 0
h(“Neptune”) = 010 0011010

M[]: !21 0 0 3 0 0 0 0 0
h(“Mars”) = 110 0001011

M[]: !22 0 0 3 0 0 0 4 0
h(“Mars”) = 110 0001011

M[]: !23 0 0 3 0 0 0 4 0
h(“Venus”) = 110 0110111

M[]: !24 0 0 3 0 0 0 4 0
h(“Venus”) = 110 0110111
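The three updates above can be replayed in Python using the hash bit strings shown on the slides. The helper names are illustrative; the bit strings and b = 3 come from the deck.

```python
def rho(bits: str) -> int:
    """1-based position of the leftmost '1' (len + 1 if there is none)."""
    index = bits.find("1")
    return len(bits) + 1 if index == -1 else index + 1

def observe_bits(M: list, bits: str, b: int) -> None:
    """Update M from an already-hashed value given as a bit string."""
    bucket_index = int(bits[:b], 2)  # first b bits select the bucket
    M[bucket_index] = max(M[bucket_index], rho(bits[b:]))

# Hash values exactly as shown on the slides.
slide_hashes = [
    ("Neptune", "0100011010"),
    ("Mars", "1100001011"),
    ("Venus", "1100110111"),
]

M = [0] * 8
history = []
for name, bits in slide_hashes:
    observe_bits(M, bits, b=3)
    history.append(list(M))
    print(name, M)
```

Venus lands in the same bucket as Mars but has its leftmost 1 at position 2, so the stored maximum of 4 is kept.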

!25 M[]: 3 4 3 3 7 3 9 3
m estimates of log₂(n/m)
Bits used: m·log₂(log₂(n/m))
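To get a feel for the memory formula, the Python snippet below plugs in example numbers. Both inputs are assumptions chosen for illustration: n is one billion, echoing the deck title, and m = 4096 is one of the bucket counts listed later in the deck. Rounding up to whole bits per bucket is also a choice made here.

```python
import math

n = 1_000_000_000  # assumption: about a billion unique elements
m = 4096           # assumption: 2^12 buckets

per_bucket_value = math.log2(n / m)  # what each bucket estimates
bits_per_bucket = math.ceil(math.log2(per_bucket_value))
total_bits = m * bits_per_bucket

print(f"log2(n/m) = {per_bucket_value:.1f}")
print(f"bits per bucket = {bits_per_bucket}")
print(f"total = {total_bits} bits = {total_bits // 8} bytes")
```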

LogLog SuperLogLog HyperLogLog !26

!27 M[0..m-1]: Array of m values

logLogEstimate(M): Standard error ≈ 1.3/√m
    n = αₘ · m · 2^(Σᵢ M[i] / m)
The constant αₘ is the bias correction factor for m
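A sketch of the LogLog estimate in Python, applied to the example bucket array from the earlier slide. The value 0.39701 is used here as the asymptotic LogLog bias constant; that number is not on the slide, so treat it as an assumption to check against the LogLog paper, and note that with only m = 8 buckets the result is very rough.

```python
ALPHA_LOGLOG = 0.39701  # assumption: asymptotic bias correction constant

def loglog_estimate(M: list) -> float:
    """LogLog estimate: alpha * m * 2^(arithmetic mean of the buckets)."""
    m = len(M)
    return ALPHA_LOGLOG * m * 2 ** (sum(M) / m)

M = [3, 4, 3, 3, 7, 3, 9, 3]  # bucket values from the earlier slide
estimate = loglog_estimate(M)
print(round(estimate, 1))
```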

!28 superLogLogEstimate(M): Standard error ≈ 1.05/√m
Remove highest 30% of values from M to get M₀[], m₀ = 0.7·m
    n = α̃ₘ · m₀ · 2^(Σᵢ M₀[i] / m₀)

!29 hyperLogLogEstimate(M): Standard error ≈ 1.04/√m
    n = αₘ · m · m / Σᵢ 2^(−M[i])
Small range correction if n < 5m/2:
    v = number of untouched buckets in M
    if v = 0 return the normal HLL estimation n, otherwise return m·log(m/v)

bits | buckets | σ
10 | 1024 | 3.25%
11 | 2048 | 2.30%
12 | 4096 | 1.63%
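The HyperLogLog estimate with its small range correction could be sketched in Python as below. The bias constant 0.7213 / (1 + 1.079/m) is the approximation commonly quoted for m ≥ 128; it does not appear on the slide, so it is an assumption here, as are the function names.

```python
import math

def alpha(m: int) -> float:
    # Assumption: approximation commonly quoted for m >= 128.
    return 0.7213 / (1 + 1.079 / m)

def hyperloglog_estimate(M: list) -> float:
    """Harmonic-mean estimate with the small range correction."""
    m = len(M)
    raw = alpha(m) * m * m / sum(2.0 ** -r for r in M)
    if raw < 5 * m / 2:
        v = M.count(0)  # number of untouched buckets
        if v != 0:
            return m * math.log(m / v)  # natural logarithm
    return raw

def standard_error(m: int) -> float:
    return 1.04 / math.sqrt(m)

empty = hyperloglog_estimate([0] * 1024)
half = hyperloglog_estimate([0] * 512 + [1] * 512)
full = hyperloglog_estimate([1] * 1024)
print(empty, round(half, 2), round(full, 2))

for bits in (10, 11, 12):
    print(bits, 2 ** bits, f"{standard_error(2 ** bits):.2%}")
```

With every bucket untouched the correction returns 0, and once no bucket is empty the plain harmonic-mean estimate is used.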

!30 M₁, M₂: Arrays of m values

merge(M₁, M₂):
    for i = 0 to i = m-1:
        Mᵣ[i] := max( M₁[i] , M₂[i] )

!31
M₁:    4 2 5 3 5 4 4 5
M₂:    3 4 3 3 7 3 9 3
M₁∪M₂: 4 4 5 3 7 4 9 5
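The merge is an element-wise maximum, which a short Python sketch can confirm against the numbers on this slide. The function name is illustrative.

```python
def merge(M1: list, M2: list) -> list:
    """Merge two bucket arrays of equal size by taking the per-bucket maximum."""
    if len(M1) != len(M2):
        raise ValueError("bucket arrays must have the same number of buckets")
    return [max(a, b) for a, b in zip(M1, M2)]

M1 = [4, 2, 5, 3, 5, 4, 4, 5]
M2 = [3, 4, 3, 3, 7, 3, 9, 3]
merged = merge(M1, M2)
print(merged)
```

Because max is commutative, associative, and idempotent, bucket arrays can be merged in any order and any number of times, which is what makes per-day or per-publisher arrays combinable later.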

DataSketches.GitHub.io !32

!33
CREATE TEMPORARY TABLE sketch_daily_publisher_impressions (
  publisher_id int,
  stats_date string,
  uuids_sketch binary
);

!34
ADD jar /usr/lib/hive/lib/sketches-hive-0.11.0-with-shaded-core.jar;
CREATE TEMPORARY FUNCTION data2sketch AS 'com.yahoo.sketches.hive.hll.DataToSketchUDAF';

INSERT INTO TABLE sketch_daily_publisher_impressions
SELECT
  publisher_id as publisher_id,      -- int
  CURRENT_DATE as stats_date,        -- string
  data2sketch(uuid) as uuids_sketch  -- binary
FROM clicks
WHERE impression_date = CURRENT_DATE
GROUP BY publisher_id;

!35
CREATE TEMPORARY FUNCTION estimate AS 'com.yahoo.sketches.hive.hll.SketchToEstimateUDF';

SELECT
  publisher_id,
  round(estimate(uuids_sketch))
FROM sketch_daily_publisher_impressions
WHERE stats_date = CURRENT_DATE;

!36
CREATE TEMPORARY FUNCTION unionSketches as 'com.yahoo.sketches.hive.hll.UnionSketchUDAF';

SELECT
  stats_date,
  round(estimate(unionSketches(uuids_sketch)))
FROM sketch_daily_publisher_impressions
WHERE stats_date > date_sub(CURRENT_DATE, 7)
GROUP BY stats_date;

!37 Thank You, Any Questions?

Further reading:
HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm (2007) by Philippe Flajolet, Éric Fusy, Olivier Gandouet, et al.
DataSketches.GitHub.io

Photos by Patrick Fore on Unsplash
