What is HyperLogLog and Why You Will Love It | PostgreSQL Conference EU 2018 | Burak Yucesoy

Burak Yücesoy | Citus Data | PGConf EU 2018 |
October 2018 What is HyperLogLog and Why You Will Love It Burak Yücesoy

October 2018 • Number of unique elements (cardinality) in given data • Useful to find things like… • Number of unique users visited your web page • Number of unique products in your inventory What is COUNT(DISTINCT)? 2

October 2018 What is COUNT(DISTINCT)? 3 logins username | date ----------+----------- Alice | 2018-10-02 Bob | 2018-10-03 Alice | 2018-10-05 Eve | 2018-10-07 Bob | 2018-10-07 Bob | 2018-10-08 • Number of logins: 6 • Number of unique users who log in: 3

October 2018 • Slow • High memory footprint • Cannot work with appended/streaming data Problems with Traditional COUNT(DISTINCT) 4

October 2018 HyperLogLog(HLL) is faster alternative to COUNT(DISTINCT) with low memory footprint; • Approximation algorithm • Estimates cardinality (i.e. COUNT(DISTINCT) ) of given data • Mathematically provable error bounds • It can estimate cardinalities well beyond 109 with 1% error rate using only 6 KB of memory There is better way! 5

October 2018 It depends... Is it OK to approximate? 6

October 2018 Is it OK to approximate? 7 • Count # of unique felonies associated to a person; Not OK • Count # of unique visits to my web page; OK

October 2018 HLL • Very fast • Low memory footprint • Can work with streaming data • Can merge estimations of two separate datasets efficiently 8

October 2018 How does HLL work? Steps; 1. Hash all elements a. Ensures uniform data distribution b. Can treat all data types same 2. Observing rare bit patterns 3. Stochastic averaging 9

October 2018 How does HLL work? - Observing rare bit patterns hash Alice 645403841 binary 0010...001 Number of leading zeros: 2 Maximum number of leading zeros: 2 10

October 2018 How does HLL work? - Observing rare bit patterns hash Bob 1492309842 binary 0101...010 Number of leading zeros: 1 Maximum number of leading zeros: 2 11

October 2018 How does HLL work? - Observing rare bit patterns ... Maximum number of leading zeros: 7 Cardinality Estimation: 27 12

October 2018 How does HLL work? Stochastic Averaging Measuring same thing repeatedly and taking average. 13

October 2018 14

October 2018 15

October 2018 How does HLL work? Stochastic Averaging Data Partition 1 Partition 3 Partition 2 7 5 12 228.968... Estimation 27 25 212 16

October 2018 How does HLL work? Stochastic Averaging 01000101...010 First m bits to decide partition number Remaining bits to count leading zeros 17

October 2018 Error rate of HLL • Typical Error Rate: 1.04 / sqrt(number of partitions) • Memory need is number of partitions * log(log(max. value in hash space)) bit • Can estimate cardinalities well beyond 109 with 1% error rate while using a memory of only 6 kilobytes • Memory vs accuracy tradeoff 18

October 2018 Why does HLL work? It turns out, combination of lots of bad observation is a good observation 19

October 2018 Some interesting examples Alice Alice Alice … … … Alice Partition 1 Partition 8 Partition 2 0 2 0 1.103... Harmonic Mean 20 22 20 hash Alice 645403841 binary 00100110...001 ... ... ... 20

October 2018 Some interesting examples Charlie Partition 1 Partition 8 Partition 2 29 0 0 1.142... Harmonic Mean 229 20 20 hash Charlie 0 binary 00000000...000 ... ... ... 21

October 2018 HLL in PostgreSQL • https://github.com/citusdata/postgresql-hll 22

October 2018 postgresql-hll uses a data structure, also called hll to keep maximum number of leading zeros of each partition. • Use hll_hash_bigint to hash elements. • There are some other functions for other common data types. • Use hll_add_agg to aggregate hashed elements into hll data structure. • Use hll_cardinality to materialize hll data structure to actual distinct count. HLL in PostgreSQL 23

October 2018 Real Time Dashboard with HyperLogLog 24

October 2018 Precomputed aggregates for period of time and set of dimensions; What is Rollup? 25

October 2018 What is Rollup? CREATE TABLE rollup_events_5min ( customer_id bigint, event_type varchar, country varchar, browser varchar, event_count bigint, device_distinct_count bigint, session_distinct_count bigint, minute timestamp ); CREATE TABLE events ( id bigint, customer_id bigint, event_type varchar, country varchar, browser varchar, device_id bigint, session_id bigint, timestamp timestamp ); 26

October 2018 30 INSERT INTO rollup_events_5min SELECT customer_id, event_type, country, browser, COUNT(*) AS event_count, COUNT (DISTINCT device_id) AS device_distinct_count, COUNT (DISTINCT session_id) AS session_distinct_count, date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP 'epoch' AS minute FROM events WHERE timestamp >= $1 AND timestamp <=$2 GROUP BY customer_id, event_type, country, browser, minute What is Rollup? 30

October 2018 • Fast & indexed lookups of aggregates • Avoid expensive repeated computations • Rollups are compact (uses less space) and can be kept over longer periods • Rollups can be further aggregated Benefit of Rollup Tables 34

October 2018 What if I want to get aggregation result for 1 hour period? SELECT customer_id, event_type, country, browser, SUM (event_count) AS event_count, SUM (device_distinct_count) AS device_distinct_count, SUM (session_distinct_count) AS session_distinct_count, date_trunc('minutes', (minute - TIMESTAMP 'epoch') / 12) * 12 + TIMESTAMP 'epoch' AS hour FROM rollup_events_5min GROUP BY customer_id, event_type, country, browser, minute 35

October 2018 Rollup Table with HLL CREATE TABLE rollup_events_5min ( customer_id bigint, event_type varchar, country varchar, browser varchar, event_count bigint, device_distinct_count hll, session_distinct_count hll, minute timestamp ); CREATE TABLE rollup_events_5min ( customer_id bigint, event_type varchar, country varchar, browser varchar, event_count bigint, device_distinct_count bigint, session_distinct_count bigint, minute timestamp ); 38

October 2018 39 INSERT INTO rollup_events_5min SELECT customer_id, event_type, country, browser, COUNT(*) AS event_count, hll_add_agg(hll_hash_bigint(device_id)) AS device_distinct_count, hll_add_agg(hll_hash_bigint(session_id)) AS session_distinct_count, date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP 'epoch' AS minute FROM events WHERE timestamp >= $1 AND timestamp <=$2 GROUP BY customer_id, event_type, country, browser, minute Rollup Table with HLL 39

October 2018 What if I want to get aggregation result for 1 hour period? SELECT customer_id, event_type, country, browser, SUM (event_count) AS event_count, hll_union_agg (device_distinct_count) AS device_distinct_count, hll_union_agg (session_distinct_count) AS session_distinct_count, date_trunc('minutes', (minute - TIMESTAMP 'epoch') / 12) * 12 + TIMESTAMP 'epoch' AS hour FROM rollup_events_5min GROUP BY customer_id, event_type, country, browser, minute 40

October 2018 Interval 1 Interval 1 Partition 1 Interval 1 Partition 3 Interval 1 Partition 2 7 5 12 HLL(7, 5, 12) Intermediate Result How to Merge COUNT(DISTINCT) with HLL 41

October 2018 Interval 2 Interval 2 Partition 1 Interval 2 Partition 3 Interval 2 Partition 2 11 7 8 HLL(11, 7, 8) Intermediate Result How to Merge COUNT(DISTINCT) with HLL 42

October 2018 11 7 12 1053.255 Estimation 211 27 212 HLL(11, 7, 8) HLL(7, 5, 12) HLL(11, 7, 12) hll_union_agg How to Merge COUNT(DISTINCT) with HLL 43

October 2018 Interval 1 + Interval 2 Interval 1 Partition 1(7) + Interval 2 Partition 1(11) 11 7 12 1053.255 Estimation Interval 1 Partition 2(5) + Interval 2 Partition 2(7) Interval 1 Partition 3(12) + Interval 2 Partition 4(8) How to Merge COUNT(DISTINCT) with HLL 44

October 2018 • What if ... • Without hll, you would have to maintain 2n - 1 rollup tables to cover all combinations in n columns (multiply this with number of time intervals). 45 Rollup Table with HLL 45

October 2018 What Happens in Distributed Scenario? 46

October 2018 1. Separate data into shards. events_001 events_002 events_003 postgresql-hll in distributed environment 47

October 2018 2. Put shards into separate nodes. Worker Node 1 Coordinator Worker Node 2 Worker Node 3 events_001 events_002 events_003 postgresql-hll in distributed environment 48

October 2018 3. For each shard, calculate hll (but do not materialize). postgresql-hll in distributed environment Shard 1 Shard 1 Partition 1 Shard 1 Partition 3 Shard 1 Partition 2 7 5 12 HLL(7, 5, 12) Intermediate Result 49

October 2018 4. Pull intermediate results to a single node. Worker Node 1 events_001 Coordinator Worker Node 2 events_002 Worker Node 3 events_003 HLL(6, 4, 11) HLL(10, 6, 7) HLL(7, 12, 5) postgresql-hll in distributed environment 50

October 2018 5. Merge separate hll data structures and materialize them 11 13 12 10532.571... 211 213 212 HLL(11, 7, 8) HLL(7, 5, 12) HLL(11, 13, 12) HLL(8, 13, 6) postgresql-hll in distributed environment 51

October 2018 Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 [email protected] Thanks & Questions @byucesoy Burak Yücesoy www.citusdata.com @citusdata

What is HyperLogLog and Why You Will Love It | ...

What is HyperLogLog and Why You Will Love It | PostgreSQL Conference EU 2018 | Burak Yucesoy

More Decks by Citus Data

Other Decks in Technology

Featured

Transcript