Distributed COUNT(DISTINCT) with Hyperloglog on PostgreSQL | PGConf EU 2017 | Burak Yucesoy

Citus Data
October 26, 2017

Running SELECT COUNT(DISTINCT) on your database is all too common. In applications, it’s typical to have some analytics dashboard highlighting the number of unique items such as unique users or unique visits. While traditional SELECT COUNT(DISTINCT) queries work well in single-machine setups, the problem is difficult to solve in distributed systems. With this type of query, you cannot just push the query down to the workers and add up the results, because the same records will most likely appear on different workers.

In this talk, we will focus on the HyperLogLog (HLL) algorithm and its PostgreSQL extension postgresql-hll. HLL can provide approximate answers to COUNT(DISTINCT) queries within mathematically provable error bounds. It is not only fast and memory-efficient but also has very interesting properties that especially shine in distributed environments. During the talk, we’ll first look at the internals of HLL to understand why the algorithm solves the distinct count problem in a scalable way, then see how it can be applied in a distributed fashion. Finally, we will look at some examples of HLL usage.

Transcript

  1. Burak Yucesoy | Citus Data | PGConf EU What is

    COUNT(DISTINCT)? • Number of unique elements (cardinality) in the given data • Useful to find things like… ◦ Number of unique users who visited your web page ◦ Number of unique products in your inventory
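For reference, a minimal example of the kind of query the talk is about; the logins table and username column are taken from the later slides:

    SELECT COUNT(DISTINCT username) AS unique_users
    FROM logins;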
  2. Burak Yucesoy | Citus Data | PGConf EU What is

    distributed COUNT(DISTINCT)? Worker Node 1 logins_001 Coordinator Worker Node 2 logins_002 Worker Node 3 logins_003
  3. Burak Yucesoy | Citus Data | PGConf EU Why do

    we need distributed COUNT(DISTINCT)? • Your data is too big to fit in the memory of a single machine • The naive approach for COUNT(DISTINCT) needs too much memory
  4. Burak Yucesoy | Citus Data | PGConf EU Why is

    distributed COUNT(DISTINCT) difficult? Worker Node 1 logins_001 Coordinator SELECT COUNT(*) FROM logins; Worker Node 2 logins_002 Worker Node 3 logins_003 600 100 200 300 SELECT COUNT(*) FROM ...;
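A plain COUNT(*) distributes cleanly: each worker counts its own shard and the coordinator simply adds the partial counts (100 + 200 + 300 = 600 on the slide). A rough sketch, using the shard names from the slide:

    -- On each worker, against its local shard:
    SELECT COUNT(*) FROM logins_001;  -- 100
    SELECT COUNT(*) FROM logins_002;  -- 200
    SELECT COUNT(*) FROM logins_003;  -- 300
    -- On the coordinator: 100 + 200 + 300 = 600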
  5. Burak Yucesoy | Citus Data | PGConf EU Why is

    distributed COUNT(DISTINCT) difficult? Worker Node 1 logins_001 Coordinator SELECT COUNT(DISTINCT username) FROM logins; Worker Node 2 logins_002 Worker Node 3 logins_003 SELECT COUNT(DISTINCT username) FROM ...;
  6. Burak Yucesoy | Citus Data | PGConf EU Why is

    distributed COUNT(DISTINCT) difficult? Worker Node 1 logins_001 username | date ----------+----------- Alice | 2017-01-02 Bob | 2017-01-03 Charlie | 2017-01-05 Eve | 2017-01-07 Worker Node 3 logins_003 username | date ----------+----------- Frank | 2017-03-23 Eve | 2017-03-29 Charlie | 2017-03-02 Charlie | 2017-03-03 Worker Node 2 logins_002 username | date ----------+----------- Bob | 2017-02-11 Bob | 2017-02-13 Dave | 2017-02-17 Alice | 2017-02-19
  7. Burak Yucesoy | Citus Data | PGConf EU Why is

    distributed COUNT(DISTINCT) difficult? Worker Node 1 logins_001 username | date ----------+----------- Alice | 2017-01-02 Bob | 2017-01-03 Charlie | 2017-01-05 Eve | 2017-01-07 Worker Node 3 logins_003 username | date ----------+----------- Dave | 2017-03-23 Eve | 2017-03-29 Charlie | 2017-03-02 Charlie | 2017-03-03 Worker Node 2 logins_002 username | date ----------+----------- Bob | 2017-02-11 Bob | 2017-02-13 Dave | 2017-02-17 Alice | 2017-02-19
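With the sample data on the last slide, the per-shard distinct counts are 4, 3 and 3. Summing them gives 10, but the true number of distinct users is only 5 (Alice, Bob, Charlie, Dave, Eve), because the same user logs in on more than one shard. A sketch of the naive, wrong plan versus the correct answer:

    -- Naive plan: push COUNT(DISTINCT) to each shard and add the results.
    SELECT COUNT(DISTINCT username) FROM logins_001;  -- 4
    SELECT COUNT(DISTINCT username) FROM logins_002;  -- 3
    SELECT COUNT(DISTINCT username) FROM logins_003;  -- 3
    -- 4 + 3 + 3 = 10, which double counts users that appear on several shards;
    -- the correct answer over the combined data is 5.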
  8. Burak Yucesoy | Citus Data | PGConf EU Some Possible

    Approaches • Pull all distinct data to one node and count there. (Doesn’t scale) • Repartition data on the fly. (Scales, but it’s very slow) • Use HyperLogLog. (Scales and is fast)
  9. Burak Yucesoy | Citus Data | PGConf EU HyperLogLog (HLL) HLL

    is: • An approximation algorithm • Estimates the cardinality of the given data • Has mathematically proven error bounds
  10. Burak Yucesoy | Citus Data | PGConf EU Is it

    OK to approximate? It depends…
  11. Burak Yucesoy | Citus Data | PGConf EU HLL •

    Very fast • Low memory footprint • Can work with streaming data • Can merge estimations of two separate datasets efficiently
  12. Burak Yucesoy | Citus Data | PGConf EU How does

    HLL work? Steps: 1. Hash all elements a. Ensures uniform data distribution b. Can treat all data types the same 2. Observe rare bit patterns 3. Stochastic averaging
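Step 1 maps to the hash functions shipped with postgresql-hll (covered later in the talk); a small sketch, with arbitrary example values:

    -- Hash values of different data types into the same 64-bit hash space.
    SELECT hll_hash_text('Alice');
    SELECT hll_hash_bigint(42);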
  13. Burak Yucesoy | Citus Data | PGConf EU How does

    HLL work? - Observing rare bit patterns hash Alice 645403841 binary 0010...001 Number of leading zeros: 2 Maximum number of leading zeros: 2
  14. Burak Yucesoy | Citus Data | PGConf EU How does

    HLL work? - Observing rare bit patterns hash Bob 1492309842 binary 0101...010 Number of leading zeros: 1 Maximum number of leading zeros: 2
  15. Burak Yucesoy | Citus Data | PGConf EU How does

    HLL work? - Observing rare bit patterns ... Maximum number of leading zeros: 7 Cardinality Estimation: 2^7
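The intuition behind that estimate: a run of k leading zeros in a uniformly random hash occurs with probability about 2^-k, so an observed maximum of k leading zeros suggests roughly 2^k distinct hashes. With the maximum of 7 from the slide:

$$ E \approx 2^{\max_j(\text{leading zeros})} = 2^{7} = 128 $$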
  16. Burak Yucesoy | Citus Data | PGConf EU How does

    HLL work? Stochastic Averaging Measuring the same thing repeatedly and taking the average.
  17. Burak Yucesoy | Citus Data | PGConf EU How does

    HLL work? Stochastic Averaging Data Partition 1 Partition 3 Partition 2 7 5 12 228.968... Estimation 2^7 2^5 2^12
  18. Burak Yucesoy | Citus Data | PGConf EU How does

    HLL work? Stochastic Averaging 01000101...010 First m bits to decide partition number Remaining bits to count leading zeros
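Combining the two ideas: each partition j keeps its own maximum M_j, and the partial estimates 2^{M_j} are averaged with a harmonic mean. Ignoring HyperLogLog's bias-correction constant alpha_m, the raw estimate for the maxima 7, 5, 12 from the earlier slide is exactly the 228.968... shown there:

$$ E_{\text{raw}} = \frac{m^2}{\sum_{j=1}^{m} 2^{-M_j}} = \frac{3^2}{2^{-7} + 2^{-5} + 2^{-12}} \approx 228.968 $$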
  19. Burak Yucesoy | Citus Data | PGConf EU Error rate

    of HLL is damn good • Typical error rate: 1.04 / sqrt(number of partitions) • Memory need is number of partitions * log2(log2(max. value in hash space)) bits • Can estimate cardinalities well beyond 10^9 with a 1% error rate while using only 6 kilobytes of memory • Memory vs. accuracy tradeoff
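One concrete setting consistent with those numbers (assuming a 64-bit hash space, so each partition needs log2(log2(2^64)) = 6 bits):

$$ m = 2^{13} = 8192, \qquad \text{error} \approx \frac{1.04}{\sqrt{8192}} \approx 1.1\%, \qquad \text{memory} = 8192 \times 6\ \text{bits} = 6\ \text{KB} $$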
  20. Burak Yucesoy | Citus Data | PGConf EU Why does

    HLL work? It turns out that the combination of lots of bad estimations is a good estimation
  21. Burak Yucesoy | Citus Data | PGConf EU Some interesting

    examples Alice Alice Alice … … … Alice Partition 1 Partition 3 Partition 2 0 2 0 1.103... Harmonic Mean 2^0 2^2 2^0 hash Alice 645403841 binary 00100110...001 ... ... ...
  22. Burak Yucesoy | Citus Data | PGConf EU Some interesting

    examples Charlie Partition 1 Partition 8 Partition 2 29 0 0 1.142... Harmonic Mean 2^29 2^0 2^0 hash Charlie 0 binary 00000000...000 ... ... ...
  23. Burak Yucesoy | Citus Data | PGConf EU postgresql-hll •

    https://github.com/aggregateknowledge/postgresql-hll • https://github.com/citusdata/postgresql-hll • Companies using postgresql-hll for their dashboard • Neustar • Cloudflare
  24. Burak Yucesoy | Citus Data | PGConf EU postgresql-hll uses

    a data structure, also called hll, to keep the maximum number of leading zeros of each partition. • Use hll_hash_bigint to hash elements. ◦ There are other functions for other common data types. • Use hll_add_agg to aggregate hashed elements into the hll data structure. • Use hll_cardinality to materialize the hll data structure into an actual distinct count. postgresql-hll in single node
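Putting those functions together, a minimal single-node sketch; hll_hash_text is the text counterpart of hll_hash_bigint, and the logins table and username column are taken from the earlier slides:

    -- One-off setup after installing the extension.
    CREATE EXTENSION hll;

    -- Estimate the number of distinct users in a single pass over the raw table.
    SELECT hll_cardinality(hll_add_agg(hll_hash_text(username)))
    FROM logins;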
  25. Burak Yucesoy | Citus Data | PGConf EU How to

    merge COUNT(DISTINCT) with HLL Shard 1 Shard 1 Partition 1 Shard 1 Partition 3 Shard 1 Partition 2 7 5 12 HLL(7, 5, 12) Intermediate Result
  26. Burak Yucesoy | Citus Data | PGConf EU How to

    merge COUNT(DISTINCT) with HLL Shard 2 Shard 2 Partition 1 Shard 2 Partition 3 Shard 2 Partition 2 11 7 8 HLL(11, 7, 8) Intermediate Result
  27. Burak Yucesoy | Citus Data | PGConf EU How to

    merge COUNT(DISTINCT) with HLL 11 7 12 1053.255 2^11 2^7 2^12 HLL(11, 7, 8) HLL(7, 5, 12) HLL(11, 7, 12) hll_union_agg
  28. Burak Yucesoy | Citus Data | PGConf EU How to

    merge COUNT(DISTINCT) with HLL Shard 1 + Shard 2 Shard 1 Partition 1(7) + Shard 2 Partition 1(11) 11 7 12 1053.255 Estimation Shard 1 Partition 2(5) + Shard 2 Partition 2(7) Shard 1 Partition 3(12) + Shard 2 Partition 3(8)
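The merge step is exactly what hll_union_agg does: it takes the element-wise maximum of the per-partition values, e.g. HLL(7, 5, 12) merged with HLL(11, 7, 8) gives HLL(11, 7, 12). A sketch, assuming an illustrative shard_hlls table that holds one pre-aggregated hll value per shard:

    -- Each row of shard_hlls could be produced per shard by something like:
    --   SELECT hll_add_agg(hll_hash_text(username)) FROM logins_001;
    SELECT hll_cardinality(hll_union_agg(shard_hll)) AS unique_users
    FROM shard_hlls;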
  29. Burak Yucesoy | Citus Data | PGConf EU 1. Separate

    data into shards. postgresql-hll in distributed environment logins_001 logins_002 logins_003
  30. Burak Yucesoy | Citus Data | PGConf EU 2. Put

    shards into separate nodes. postgresql-hll in distributed environment Worker Node 1 Coordinator Worker Node 2 Worker Node 3 logins_001 logins_002 logins_003
  31. Burak Yucesoy | Citus Data | PGConf EU 3. For

    each shard, calculate hll (but do not materialize). postgresql-hll in distributed environment Shard 1 Shard 1 Partition 1 Shard 1 Partition 3 Shard 1 Partition 2 7 5 12 HLL(7, 5, 12) Intermediate Result
  32. Burak Yucesoy | Citus Data | PGConf EU 4. Pull

    intermediate results to a single node. postgresql-hll in distributed environment Worker Node 1 logins_001 Coordinator Worker Node 2 logins_002 Worker Node 3 logins_003 HLL(6, 4, 11) HLL(10, 6, 7) HLL(7, 12, 5)
  33. Burak Yucesoy | Citus Data | PGConf EU 5. Merge

    separate hll data structures and materialize them postgresql-hll in distributed environment 11 13 12 10532.571... 2^11 2^13 2^12 HLL(11, 7, 8) HLL(7, 5, 12) HLL(11, 13, 12) HLL(8, 13, 6)
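The five steps above translate into roughly the following queries; the shard name and the intermediate_results relation are illustrative:

    -- Step 3, on each worker: build one hll per shard (not materialized yet).
    SELECT hll_add_agg(hll_hash_text(username)) FROM logins_001;

    -- Steps 4 and 5, on the coordinator: union the intermediate hlls and materialize.
    SELECT hll_cardinality(hll_union_agg(shard_hll))
    FROM intermediate_results;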
  34. Burak Yucesoy | Citus Data | PGConf EU Or use

    Citus :) postgresql-hll in distributed environment
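If Citus handles the distribution, the same pipeline can be triggered from a single query. The citus.count_distinct_error_rate setting below reflects my understanding of how this is enabled; treat it as an assumption and check the Citus documentation for your version:

    -- Assumed Citus setting that turns COUNT(DISTINCT) into an hll-based plan.
    SET citus.count_distinct_error_rate TO 0.005;
    SELECT COUNT(DISTINCT username) FROM logins;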
  35. Burak Yucesoy | Citus Data | PGConf EU Burak Yucesoy

    [email protected] @byucesoy Thank You citusdata.com | @citusdata