Upgrade to Pro — share decks privately, control downloads, hide ads and more …

What is HyperLogLog and Why You Will Love It | ...

Citus Data
October 25, 2018

What is HyperLogLog and Why You Will Love It | PostgreSQL Conference EU 2018 | Burak Yucesoy

In applications, it’s typical to have some analytics dashboard highlighting the number of unique items such as unique users or unique visits. While traditional COUNT(DISTINCT) queries work well most of the time for such use cases, it has some drawbacks while working on large data sets which result in large memory requirements and/or slow execution time. It is also not easy to use traditional COUNT(DISTINCT) queries in a distributed environment.

In this talk, we will focus on the HyperLogLog (HLL) algorithm and its PostgreSQL extension postgresql-hll. HLL can provide approximate answers to COUNT(DISTINCT) queries within mathematically provable error bounds. It is not only fast and memory-efficient but also has very interesting properties which especially shine in a distributed environment. In this talk, Burak Y. from Citus Data (also maintainer of postgresql-hll extension) will talk about internals of HLL and how it estimates cardinality; Jarred from IronNet Cybersecurity will showcase HLL in PostgreSQL with real world examples and use cases from his experience of running HLL in production. We promise that at the end of this session, you will fall in love with this fun little data structure as the newest tool in your data science and analytics tool belt.

Citus Data

October 25, 2018
Tweet

More Decks by Citus Data

Other Decks in Technology

Transcript

  1. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 What is HyperLogLog and Why You Will Love It Burak Yücesoy
  2. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 • Number of unique elements (cardinality) in given data • Useful to find things like… • Number of unique users visited your web page • Number of unique products in your inventory What is COUNT(DISTINCT)? 2
  3. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 What is COUNT(DISTINCT)? 3 logins username | date ----------+----------- Alice | 2018-10-02 Bob | 2018-10-03 Alice | 2018-10-05 Eve | 2018-10-07 Bob | 2018-10-07 Bob | 2018-10-08 • Number of logins: 6 • Number of unique users who log in: 3
  4. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 • Slow • High memory footprint • Cannot work with appended/streaming data Problems with Traditional COUNT(DISTINCT) 4
  5. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 HyperLogLog(HLL) is faster alternative to COUNT(DISTINCT) with low memory footprint; • Approximation algorithm • Estimates cardinality (i.e. COUNT(DISTINCT) ) of given data • Mathematically provable error bounds • It can estimate cardinalities well beyond 109 with 1% error rate using only 6 KB of memory There is better way! 5
  6. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 It depends... Is it OK to approximate? 6
  7. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 Is it OK to approximate? 7 • Count # of unique felonies associated to a person; Not OK • Count # of unique visits to my web page; OK
  8. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 HLL • Very fast • Low memory footprint • Can work with streaming data • Can merge estimations of two separate datasets efficiently 8
  9. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 How does HLL work? Steps; 1. Hash all elements a. Ensures uniform data distribution b. Can treat all data types same 2. Observing rare bit patterns 3. Stochastic averaging 9
  10. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 How does HLL work? - Observing rare bit patterns hash Alice 645403841 binary 0010...001 Number of leading zeros: 2 Maximum number of leading zeros: 2 10
  11. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 How does HLL work? - Observing rare bit patterns hash Bob 1492309842 binary 0101...010 Number of leading zeros: 1 Maximum number of leading zeros: 2 11
  12. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 How does HLL work? - Observing rare bit patterns ... Maximum number of leading zeros: 7 Cardinality Estimation: 27 12
  13. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 How does HLL work? Stochastic Averaging Measuring same thing repeatedly and taking average. 13
  14. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 How does HLL work? Stochastic Averaging Data Partition 1 Partition 3 Partition 2 7 5 12 228.968... Estimation 27 25 212 16
  15. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 How does HLL work? Stochastic Averaging 01000101...010 First m bits to decide partition number Remaining bits to count leading zeros 17
  16. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 Error rate of HLL • Typical Error Rate: 1.04 / sqrt(number of partitions) • Memory need is number of partitions * log(log(max. value in hash space)) bit • Can estimate cardinalities well beyond 109 with 1% error rate while using a memory of only 6 kilobytes • Memory vs accuracy tradeoff 18
  17. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 Why does HLL work? It turns out, combination of lots of bad observation is a good observation 19
  18. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 Some interesting examples Alice Alice Alice … … … Alice Partition 1 Partition 8 Partition 2 0 2 0 1.103... Harmonic Mean 20 22 20 hash Alice 645403841 binary 00100110...001 ... ... ... 20
  19. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 Some interesting examples Charlie Partition 1 Partition 8 Partition 2 29 0 0 1.142... Harmonic Mean 229 20 20 hash Charlie 0 binary 00000000...000 ... ... ... 21
  20. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 HLL in PostgreSQL • https://github.com/citusdata/postgresql-hll 22
  21. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 postgresql-hll uses a data structure, also called hll to keep maximum number of leading zeros of each partition. • Use hll_hash_bigint to hash elements. • There are some other functions for other common data types. • Use hll_add_agg to aggregate hashed elements into hll data structure. • Use hll_cardinality to materialize hll data structure to actual distinct count. HLL in PostgreSQL 23
  22. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 Real Time Dashboard with HyperLogLog 24
  23. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 Precomputed aggregates for period of time and set of dimensions; What is Rollup? 25
  24. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 What is Rollup? CREATE TABLE rollup_events_5min ( customer_id bigint, event_type varchar, country varchar, browser varchar, event_count bigint, device_distinct_count bigint, session_distinct_count bigint, minute timestamp ); CREATE TABLE events ( id bigint, customer_id bigint, event_type varchar, country varchar, browser varchar, device_id bigint, session_id bigint, timestamp timestamp ); 26
  25. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 What is Rollup? CREATE TABLE rollup_events_5min ( customer_id bigint, event_type varchar, country varchar, browser varchar, event_count bigint, device_distinct_count bigint, session_distinct_count bigint, minute timestamp ); CREATE TABLE events ( id bigint, customer_id bigint, event_type varchar, country varchar, browser varchar, device_id bigint, session_id bigint, timestamp timestamp ); 27
  26. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 What is Rollup? CREATE TABLE rollup_events_5min ( customer_id bigint, event_type varchar, country varchar, browser varchar, event_count bigint, device_distinct_count bigint, session_distinct_count bigint, minute timestamp ); CREATE TABLE events ( id bigint, customer_id bigint, event_type varchar, country varchar, browser varchar, device_id bigint, session_id bigint, timestamp timestamp ); 28
  27. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 What is Rollup? CREATE TABLE rollup_events_5min ( customer_id bigint, event_type varchar, country varchar, browser varchar, event_count bigint, device_distinct_count bigint, session_distinct_count bigint, minute timestamp ); CREATE TABLE events ( id bigint, customer_id bigint, event_type varchar, country varchar, browser varchar, device_id bigint, session_id bigint, timestamp timestamp ); 29
  28. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 30 INSERT INTO rollup_events_5min SELECT customer_id, event_type, country, browser, COUNT(*) AS event_count, COUNT (DISTINCT device_id) AS device_distinct_count, COUNT (DISTINCT session_id) AS session_distinct_count, date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP 'epoch' AS minute FROM events WHERE timestamp >= $1 AND timestamp <=$2 GROUP BY customer_id, event_type, country, browser, minute What is Rollup? 30
  29. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 31 INSERT INTO rollup_events_5min SELECT customer_id, event_type, country, browser, COUNT(*) AS event_count, COUNT (DISTINCT device_id) AS device_distinct_count, COUNT (DISTINCT session_id) AS session_distinct_count, date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP 'epoch' AS minute FROM events WHERE timestamp >= $1 AND timestamp <=$2 GROUP BY customer_id, event_type, country, browser, minute What is Rollup? 31
  30. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 32 INSERT INTO rollup_events_5min SELECT customer_id, event_type, country, browser, COUNT(*) AS event_count, COUNT (DISTINCT device_id) AS device_distinct_count, COUNT (DISTINCT session_id) AS session_distinct_count, date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP 'epoch' AS minute FROM events WHERE timestamp >= $1 AND timestamp <=$2 GROUP BY customer_id, event_type, country, browser, minute What is Rollup? 32
  31. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 33 INSERT INTO rollup_events_5min SELECT customer_id, event_type, country, browser, COUNT(*) AS event_count, COUNT (DISTINCT device_id) AS device_distinct_count, COUNT (DISTINCT session_id) AS session_distinct_count, date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP 'epoch' AS minute FROM events WHERE timestamp >= $1 AND timestamp <=$2 GROUP BY customer_id, event_type, country, browser, minute What is Rollup? 33
  32. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 • Fast & indexed lookups of aggregates • Avoid expensive repeated computations • Rollups are compact (uses less space) and can be kept over longer periods • Rollups can be further aggregated Benefit of Rollup Tables 34
  33. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 What if I want to get aggregation result for 1 hour period? SELECT customer_id, event_type, country, browser, SUM (event_count) AS event_count, SUM (device_distinct_count) AS device_distinct_count, SUM (session_distinct_count) AS session_distinct_count, date_trunc('minutes', (minute - TIMESTAMP 'epoch') / 12) * 12 + TIMESTAMP 'epoch' AS hour FROM rollup_events_5min GROUP BY customer_id, event_type, country, browser, minute 35
  34. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 What if I want to get aggregation result for 1 hour period? SELECT customer_id, event_type, country, browser, SUM (event_count) AS event_count, SUM (device_distinct_count) AS device_distinct_count, SUM (session_distinct_count) AS session_distinct_count, date_trunc('minutes', (minute - TIMESTAMP 'epoch') / 12) * 12 + TIMESTAMP 'epoch' AS hour FROM rollup_events_5min GROUP BY customer_id, event_type, country, browser, minute 36
  35. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 What if I want to get aggregation result for 1 hour period? SELECT customer_id, event_type, country, browser, SUM (event_count) AS event_count, SUM (device_distinct_count) AS device_distinct_count, SUM (session_distinct_count) AS session_distinct_count, date_trunc('minutes', (minute - TIMESTAMP 'epoch') / 12) * 12 + TIMESTAMP 'epoch' AS hour FROM rollup_events_5min GROUP BY customer_id, event_type, country, browser, minute 37
  36. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 Rollup Table with HLL CREATE TABLE rollup_events_5min ( customer_id bigint, event_type varchar, country varchar, browser varchar, event_count bigint, device_distinct_count hll, session_distinct_count hll, minute timestamp ); CREATE TABLE rollup_events_5min ( customer_id bigint, event_type varchar, country varchar, browser varchar, event_count bigint, device_distinct_count bigint, session_distinct_count bigint, minute timestamp ); 38
  37. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 39 INSERT INTO rollup_events_5min SELECT customer_id, event_type, country, browser, COUNT(*) AS event_count, hll_add_agg(hll_hash_bigint(device_id)) AS device_distinct_count, hll_add_agg(hll_hash_bigint(session_id)) AS session_distinct_count, date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP 'epoch' AS minute FROM events WHERE timestamp >= $1 AND timestamp <=$2 GROUP BY customer_id, event_type, country, browser, minute Rollup Table with HLL 39
  38. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 What if I want to get aggregation result for 1 hour period? SELECT customer_id, event_type, country, browser, SUM (event_count) AS event_count, hll_union_agg (device_distinct_count) AS device_distinct_count, hll_union_agg (session_distinct_count) AS session_distinct_count, date_trunc('minutes', (minute - TIMESTAMP 'epoch') / 12) * 12 + TIMESTAMP 'epoch' AS hour FROM rollup_events_5min GROUP BY customer_id, event_type, country, browser, minute 40
  39. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 Interval 1 Interval 1 Partition 1 Interval 1 Partition 3 Interval 1 Partition 2 7 5 12 HLL(7, 5, 12) Intermediate Result How to Merge COUNT(DISTINCT) with HLL 41
  40. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 Interval 2 Interval 2 Partition 1 Interval 2 Partition 3 Interval 2 Partition 2 11 7 8 HLL(11, 7, 8) Intermediate Result How to Merge COUNT(DISTINCT) with HLL 42
  41. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 11 7 12 1053.255 Estimation 211 27 212 HLL(11, 7, 8) HLL(7, 5, 12) HLL(11, 7, 12) hll_union_agg How to Merge COUNT(DISTINCT) with HLL 43
  42. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 Interval 1 + Interval 2 Interval 1 Partition 1(7) + Interval 2 Partition 1(11) 11 7 12 1053.255 Estimation Interval 1 Partition 2(5) + Interval 2 Partition 2(7) Interval 1 Partition 3(12) + Interval 2 Partition 4(8) How to Merge COUNT(DISTINCT) with HLL 44
  43. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 • What if ... • Without hll, you would have to maintain 2n - 1 rollup tables to cover all combinations in n columns (multiply this with number of time intervals). 45 Rollup Table with HLL 45
  44. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 What Happens in Distributed Scenario? 46
  45. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 1. Separate data into shards. events_001 events_002 events_003 postgresql-hll in distributed environment 47
  46. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 2. Put shards into separate nodes. Worker Node 1 Coordinator Worker Node 2 Worker Node 3 events_001 events_002 events_003 postgresql-hll in distributed environment 48
  47. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 3. For each shard, calculate hll (but do not materialize). postgresql-hll in distributed environment Shard 1 Shard 1 Partition 1 Shard 1 Partition 3 Shard 1 Partition 2 7 5 12 HLL(7, 5, 12) Intermediate Result 49
  48. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 4. Pull intermediate results to a single node. Worker Node 1 events_001 Coordinator Worker Node 2 events_002 Worker Node 3 events_003 HLL(6, 4, 11) HLL(10, 6, 7) HLL(7, 12, 5) postgresql-hll in distributed environment 50
  49. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 5. Merge separate hll data structures and materialize them 11 13 12 10532.571... 211 213 212 HLL(11, 7, 8) HLL(7, 5, 12) HLL(11, 13, 12) HLL(8, 13, 6) postgresql-hll in distributed environment 51
  50. Burak Yücesoy | Citus Data | PGConf EU 2018 |

    October 2018 Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018 [email protected] Thanks & Questions @byucesoy Burak Yücesoy www.citusdata.com @citusdata