PWLSF#13=> Armon Dadgar on Bloom Filters and HyperLogLog

PWL Mini

Matt Adereth on "The Mode Tree: A Tool for Visualization of Nonparametric Density Features" (http://adereth.github.io/oneoff/Mode%20Trees.pdf). From Matt: "We often look at summaries of univariate data using basic descriptive statistics like mean and standard deviation and visualizations like histograms and box plots. The Mode Tree is a powerful alternative visualization that reveals important details about our distributions that none of the standard approaches can show.

I particularly like this paper because it was really the by-product of some interesting algorithmic work in Computational Statistics. A lot of the techniques in this area are pretty math heavy and inaccessible, so I appreciated that they dedicated a paper to making a visualization that helps explain the inner workings. As a bonus, the visualization stands on its own as a useful tool."

Slides for Matt's talk are available here: http://adereth.github.io/oneoff/pwl-draft/scratch.html

Matt's Bio
Matt Adereth (@adereth) builds tools and infrastructure for quantitative analysis at Two Sigma. He previously worked at Microsoft on Visio, focusing on ways to connect data to shapes.

Main Talk
Armon Dadgar from HashiCorp takes us on a deep dive of Bloom filters and HyperLogLog.

Bloom filter papers:
* http://www.cs.upc.edu/~diaz/p422-bloom.pdf
* http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf
* http://www.eecs.harvard.edu/%7Ekirsch/pubs/bbbf/esa06.pdf

Armon will mostly focus on the first paper, which introduces the technique. The other two are optimizations that he'll discuss briefly, but there's no need to dwell on them.

HyperLogLog papers:
* http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.142.9475
* http://research.google.com/pubs/pub40671.html

Again, the first one introduces the technique, and the second one covers some optimizations from Google. He won’t spend much time on the second one either. All of these papers are implemented as part of:
https://github.com/armon/bloomd
https://github.com/armon/hlld

These links may give some context to the work and show the real-world applicability.

From Armon on why he loves these papers: "Bloom Filters and HyperLogLog are incredible for their simplicity and the seemingly impossible behavior they have. Both of them are part of the family of “sketching” data structures, which allow for a controllable error that results in a tremendous savings of memory compared to the equivalent exact implementations.

More than just being elegant, these two data structures enable some very cool real-world use cases. In the online advertising company I previously worked for, we used them to track thousands of metrics across billions of data points comfortably in memory on a single machine."

Armon's Bio
Armon (@armon) has a passion for distributed systems and their application to real world problems. He is currently the CTO of HashiCorp, where he brings distributed systems into the world of DevOps tooling. He has worked on Terraform, Consul, and Serf at HashiCorp, and maintains the Statsite and Bloomd OSS projects as well.

Video for both talks is available here: http://youtu.be/T3Bt9Tn6P5c


Papers_We_Love

March 26, 2015

Transcript

  1. Bloom Filters and HyperLogLogs or Cutting Corners for Fun and

    Profit
  2. Armon Dadgar @armon

  3. None
  4. None
  5. Ad Network • Duplicate impressions • Metrics: Active Users (Uniques)

    • Daily (DAU), Weekly (WAU), Monthly (MAU) • Measured by Application, Country, Version, etc • Combinatorial Explosion
  6. Problem Definition • Stream of {User, Application, Country, Version} •

    Compute combinatorial uniques • Generate {User, Ad} • Avoid duplication
  7. Solutions?

  8. Abstract Solution: Sets • Set S • Add(S, Item) •

    Contains(S, Item) => {true, false} • Size(S) => Cardinality of S • Fancy: Union, Intersect, Diff, Subset, Remove
  9. Using Sets • For Impression De-Duplication • Before Impression: Contains(SeenAds,

    {User, Ad}) • After Ad Impression, Add(SeenAds, {User, Ad}) • For each of Interval {Daily, Weekly, Monthly}, Country and Application: • Add(IntervalCountryAppSet, User) • Cardinality = Size(IntervalCountryAppSet) • Set explosion: (Intervals x Countries x Applications) = 1M+
  10. Set Implementation • Hash table implementation • Hash input to

    a bucket (“foo” => Hash(“foo”) % Buckets = 5), store the key for comparison • Add, Contains, Size all O(1)!
  11. Memory Cost?

  12. Memory Cost • Bucket is 8 bytes, Entry is

    64 bytes • Fill is 75% • Space(Size) = (Size/Fill) * Bucket + Size * Entry • Space(10M) = 746MB • If we have 1M sets, we need 746TB of RAM!
  13. Reducing Memory Cost • Hashing may collide, Item is kept

    for comparison • What if we allow error?
  14. Skip storing the key! • Before: Hash(“foo”) % Buckets

    = 5, store the key “foo” • After: Hash(“foo”) % Buckets = 5, just set the bit to 1
  15. Memory Cost • Bucket is 1 bit •

    Fill is 75% • Space(Size) = (Size/Fill) * Bucket • Space(10M) = 1.6MB • 1M sets == 1.6TB of RAM vs 746TB • Possible with a few large machines!
  16. What’s the catch? • Add(S, “foo”) • Contains(S,

    “bar”) => true when Hash(“bar”) % Buckets = 5 collides with “foo” • Whoops! (False Positive)
  17. Quantifying the Error • Distribution of Hash(Item) should be uniform

    • Probability of hashing to filled bucket: N/M • N = 2, M =8, Collision Rate = 25% • Fundamental tradeoff: Space vs Collision Rate • Can it be reduced?
  18. Bloom Filters! “Space/Time Trade-offs in Hash Coding with Allowable Errors”

    - Bloom
  19. Bloom Filters • Hash to K locations instead of

    1 (e.g. Hash1(“foo”) % Buckets = 5, Hash2(“foo”) % Buckets = 3) • Add(S, Item) => set the bit at Hashk(Item) % M for each k • Contains(S, Item) => true iff the bit at Hashk(Item) % M is 1 for all k, false otherwise
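The Add/Contains behavior on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Armon's bloomd implementation; the bucket count M, hash count K, and the seeded-SHA-256 hashing are choices made just for the example:

```python
import hashlib

class BloomFilter:
    def __init__(self, m=64, k=3):
        self.m = m           # M: number of bit buckets
        self.k = k           # K: number of hash functions
        self.bits = [0] * m

    def _indexes(self, item):
        # Simulate K independent hash functions by seeding SHA-256
        # with the function number.
        for i in range(self.k):
            digest = hashlib.sha256(b"%d:%s" % (i, item.encode())).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def contains(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[idx] == 1 for idx in self._indexes(item))

bf = BloomFilter()
bf.add("foo")
print(bf.contains("foo"))  # True
```

Note that a miss is definitive: a Bloom filter can only err by answering true for an item that was never added.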
  20. Bloom Filters • Probability of collision (e.g. “bar” with

    Hash1(“bar”) % Buckets = 5, Hash2(“bar”) % Buckets = 0) • P = (1 - (1 - 1/M)^n)^k • N=2, M=8, K=2, P = .054
  21. Tuning Parameters • N: Number of Items • M: Number

    of buckets (Storage Size) • K: Number of hash functions (Speed vs Errors) • P: False Positive Rate • Minimize FP Rate: K = (M/N) * ln(2) • Equivalent: M = -N * ln(P) / ln(2)^2
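These two formulas are all that is needed to size a filter. A quick sketch; the usage line plugs in the talk's 10M-item scale with a hypothetical 1% target rate:

```python
import math

def bloom_params(n, p):
    """Given expected items N and target false-positive rate P,
    return bucket count M and hash count K (slide 21's formulas)."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = round((m / n) * math.log(2))
    return m, k

m, k = bloom_params(10_000_000, 0.01)
print(m, k)  # about 95.9 million bits (~12 MB) and 7 hash functions
```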
  22. More K, More Cycles • K hashes used to reduce

    false positive rate • Extra CPU time spent hashing • Need unique hash functions or unique seeds
  23. Math to the Rescue! • “Less Hashing, Same Performance: Building

    a Better Bloom Filter” by Adam Kirsch and Michael Mitzenmacher • gi(x) = h1(x) + i*h2(x) mod m • Only 2 hash functions are required
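The Kirsch–Mitzenmacher trick can be sketched directly: derive all K indexes from just two base hashes. Splitting a single SHA-256 digest into h1 and h2 is an assumption for illustration; any two good hash functions would do:

```python
import hashlib

def indexes(item, k, m):
    # g_i(x) = h1(x) + i * h2(x) mod m, for i in 0..k-1
    digest = hashlib.sha256(item.encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big")
    return [(h1 + i * h2) % m for i in range(k)]

print(indexes("foo", 4, 64))  # 4 bucket indexes from a single digest
```

Only one digest is computed per item, however large K gets, which is exactly the CPU saving the paper is after.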
  24. Filter Construction • Given P and N, Compute M and

    K • P depends on problem domain • How to select N? • Too small, filter is exhausted • Too large, very sparse
  25. Hash Table Resizing • Resize Table when Fill% Reached •

    Amortized O(1) • Requires keys to re-hash • Hash(X) % m != Hash(X) % c*m
  26. Scalable Bloom Filters Almeida, Baquero, Preguiça

  27. Scalable Bloom Filters • Pick target P, initial N

    • When exhausted create new BF • Layer each filter
  28. Scalable Bloom Filters • Add in “largest” (newest) filter

    • Contains checks all layers • Each layer hashes independently: Hash1(“foo”) % Buckets = 5, Hash2(“foo”) % Buckets = 3 in Layer 1 vs Hash1(“foo”) % Buckets = 14, Hash2(“foo”) % Buckets = 4 in Layer 2
  29. New Tuning Parameters • S: Scale Factor • Too Small

    => Many Layers => Wasted CPU Time • Too Large => Storage Requirement => Wasted Memory • S=2 or 4
  30. New Tuning Parameters • R: Probability Reduction • Each layer

    compounds FP rate, R is a compensation factor • Too Small => Each new layer must be much larger • Too Large => Initial layer must be much larger • R = [0.8, 0.9]
  31. SBF in Practice • 1M Filters @ 1.6MB Per =

    1.6TB • Large variation in cardinality • {Monthly, USA, Angry Birds} >>> {Daily, Mexico, Flashlight Magic} • Initial N=32K, Layers = 128K, 512K, 2M, 8M • Initial layer is 256x smaller • In practice, 11GB of RAM used!
  32. Counting Bloom Filters • Support Delete(S, Item) • Saturating

    Counter • 2 bit => Count to 3 • 3 bit => Count to 7 • Nuance around overflow and underflow
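The saturating-counter behavior can be sketched like this (the counter width and function names are illustrative; real counting filters pack the counters into bit arrays):

```python
COUNTER_MAX = 3  # a 2-bit counter saturates at 3

def increment(counters, idx):
    # Saturate rather than overflow.
    if counters[idx] < COUNTER_MAX:
        counters[idx] += 1

def decrement(counters, idx):
    # A saturated counter can no longer be decremented safely (its true
    # count is unknown) -- the overflow/underflow nuance the slide mentions.
    if 0 < counters[idx] < COUNTER_MAX:
        counters[idx] -= 1

counters = [0] * 8
increment(counters, 5)
increment(counters, 5)
decrement(counters, 5)
print(counters[5])  # 1
```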
  33. TL;DR: Bloom Filters • Sets with allowable error • Trade

    False Positives for Space Utilization • Rich field of research • github.com/armon/bloomd
  34. HyperLogLog Flajolet, Fusy, Gandouet, et al.

  35. Revisiting the Problem • Cardinality of {Interval, Country, App} •

    1M+ combinations • Add() and Size() used, Contains() is not! • Sacrifice the ability to check for a single item in exchange for cardinality?
  36. Hash Functions • Hash(x) => Bit stream {0,1,0,1…..} • “Good”

    hash function distributes values uniformly • Value of each bit should be independent
  37. Name that Bit! • Given Hash(X) => {0,1,0,1,0,…} • Odds

    that the first bit is zero: 1/2 • Odds that the Nth bit is zero: 1/2 • Odds of 4 consecutive zeros? {0,0,0,0,1,…} • (1/2)^4 • Odds of N consecutive zeros: (1/2)^N • This requires that bit values are independent
  38. Guessing Game • N Consecutive Zero Bits: (1/2)^N • How

    many samples required to reach N? • If N = 32, Odds = 1/4294967296, Expected 4294967296 Samples
  39. Reducing Space • N=32 = {1, 0, 0, 0, 0,

    0} = 6 bits! • With 6 bits, we can count into 2^64 items • Hence Log(Log(2^64)) = 6!
  40. Bad Apples • What if Hash(“foo”) = {0, 0, 0,

    0, 0, …, 1} • N = 32, Estimated Samples = 2^32 = 4294967296 • Extreme sensitivity to outliers
  41. Multiple Samples • Single register is sensitive to outliers •

    Multiple registers?
  42. Multiple Registers • Create M registers (power of 2) •

    Partition the Hash Value • First log(M) bits are the register index • Remainder is used to count zeros • e.g. Hash(“foo”) = {0, 1, 0, 0, 1, 0, 0 …}
  43. HyperLogLog: Add() • Add(H, Item): • Hash(Item) => {Index, NumZeros}

    • Registers[Index] = Max(Registers[Index], NumZeros)
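Add() as described is only a few lines. A sketch assuming a SHA-256 hash truncated to 64 bits and M=64 registers (both illustrative choices):

```python
import hashlib

M = 64                   # number of registers (a power of 2)
P = M.bit_length() - 1   # index bits: log2(M) = 6
registers = [0] * M

def rho(w, width):
    # Position of the first 1-bit in a width-bit value (leading zeros + 1).
    return width - w.bit_length() + 1

def add(item):
    h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
    index = h >> (64 - P)              # first P bits pick the register
    rest = h & ((1 << (64 - P)) - 1)   # remaining bits are counted
    registers[index] = max(registers[index], rho(rest, 64 - P))

add("foo")
```

Because each register only keeps a Max(), adding the same item twice is a no-op, which is what makes the structure duplicate-insensitive.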
  44. HyperLogLog: Size() • Given M registers • We must

    estimate a single value • We require a reduce function • Min/Max? • Average? Median?
  45. Outliers • Sensitivity to outliers • Original LogLog proposes Geometric

    Mean • Nth root of (X1 * X2 * … * Xn) • HyperLogLog proposes Harmonic Mean • N * (∑ Xi^-1)^-1
  46. None
  47. HyperLogLog: Size() • Size(H): • HM = HarmonicMean(H.Registers) • Estimate

    = A * M * HM • Where does A come from?
  48. Hashing Expectations • Hash(X) => {0, 1, 0, …} •

    Probability of N Zeros: (1/2)^N • What is the distribution of that probability? • Poisson Analysis
  49. N=8, Expect 3 Zeros

  50. Extra zeros with some probability

  51. Adjusting for Bias • Math gets very hairy… • As

    M (Registers) increases, Bias is reduced • A = 0.7213/(1 + 1.079/m) • Under-count mitigated by Max() functions • Over-count mitigated by Bias correction
  52. DivideByZeroException • Harmonic Mean: N * (∑ Xi^-1)^-1 • If

    X == 0, 0^-1 is not defined • Register value of 0 indicates very small sample size
  53. Small-Range Correction • V = Number of 0 value registers

    • Estimate = m * log(m/V)
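Putting the last few slides together, Size() is a harmonic mean, the bias constant A, and the small-range fallback. A sketch; the 2.5 * m switch-over threshold is the small-range regime from the HyperLogLog paper:

```python
import math

def size(registers):
    m = len(registers)
    a = 0.7213 / (1 + 1.079 / m)   # bias constant A from slide 51
    # Harmonic-mean estimate: A * m^2 / sum(2^-register).
    estimate = a * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if estimate <= 2.5 * m and zeros > 0:
        # Small-range correction (linear counting): m * ln(m / V).
        estimate = m * math.log(m / zeros)
    return estimate
```

Working with 2^-register keeps every term finite, sidestepping the 0^-1 problem from slide 52 while still down-weighting outlier registers.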
  54. Large-Range Correction • B = register width • As N

    -> 2^(2^B) we expect more hash collisions (counters are nearly saturated) • Likely an underestimation, apply more correction • Mitigated by moving to 64bit Hash and 6bit registers
  55. All this for What? • With bias correction error is

    bounded by E = 1.04 / Sqrt(M) • M = 512 (5bit), E = .04596, 320 bytes • M = 4096 (6bit), E = .01625 , 3072 bytes • This is outrageous. But can we do better?
  56. HyperLogLog++ “HyperLogLog in Practice: Algorithmic Engineering of a State of

    The Art Cardinality Estimation Algorithm” - Heule, Nunkesser, Hall
  57. HyperLogLog++ • 64bit Hash • Bias Adjustments • Sparse Representation

  58. HLL++: 64bit Hash • Removes need for large-range correction •

    Counts can go well into trillions • Increases register size from 5 to 6 bits
  59. Bias Adjustment • Error is bounded, but we can do

    better • Empirical analysis with thousands of data sets • Record RawEstimate, compute bias adjustment based on N • Apply empirical bias correction based on table lookups
  60. Sparse Representation • Many registers are zero for small sets

    • Sparse list vs array • Variable length encoding • Difference Encoding • Lots of extra complexity, trades time for memory • Only in sparse case
  61. HLL in Practice • Cardinality of {Interval, Country, App} ~

    1M • Hash Tables = 750TB • SBF = 11GB • HyperLogLog = 4GB (4K page aligned) • Faster! Single hash only
  62. TL;DR: HyperLogLog • Cardinality estimators with tunable error • Trade

    Errors for Space Utilization • github.com/armon/hlld
  63. Broader Field • “Sketching” Data Structures • Trade exactness for

    time / space • Interesting for large-scale or streaming processing • Muthu Muthukrishnan (http://www.cs.rutgers.edu/~muthu/)
  64. Thanks! Q/A