Upgrade to Pro — share decks privately, control downloads, hide ads and more …

What is HyperLogLog and Why You Will Love It | PostgreSQL Conference EU 2018 | Burak Yucesoy

Citus Data
October 25, 2018

What is HyperLogLog and Why You Will Love It | PostgreSQL Conference EU 2018 | Burak Yucesoy

In applications, it’s typical to have some analytics dashboard highlighting the number of unique items such as unique users or unique visits. While traditional COUNT(DISTINCT) queries work well most of the time for such use cases, it has some drawbacks while working on large data sets which result in large memory requirements and/or slow execution time. It is also not easy to use traditional COUNT(DISTINCT) queries in a distributed environment.

In this talk, we will focus on the HyperLogLog (HLL) algorithm and its PostgreSQL extension postgresql-hll. HLL can provide approximate answers to COUNT(DISTINCT) queries within mathematically provable error bounds. It is not only fast and memory-efficient but also has very interesting properties which especially shine in a distributed environment. In this talk, Burak Y. from Citus Data (also maintainer of postgresql-hll extension) will talk about internals of HLL and how it estimates cardinality; Jarred from IronNet Cybersecurity will showcase HLL in PostgreSQL with real world examples and use cases from his experience of running HLL in production. We promise that at the end of this session, you will fall in love with this fun little data structure as the newest tool in your data science and analytics tool belt.

Citus Data

October 25, 2018
Tweet

More Decks by Citus Data

Other Decks in Technology

Transcript

  1. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    What is HyperLogLog and
    Why You Will Love It
    Burak Yücesoy

    View Slide

  2. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    • Number of unique elements (cardinality) in given data
    • Useful to find things like…
    • Number of unique users visited your web page
    • Number of unique products in your inventory
    What is COUNT(DISTINCT)?
    2

    View Slide

  3. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    What is COUNT(DISTINCT)?
    3
    logins
    username | date
    ----------+-----------
    Alice | 2018-10-02
    Bob | 2018-10-03
    Alice | 2018-10-05
    Eve | 2018-10-07
    Bob | 2018-10-07
    Bob | 2018-10-08
    • Number of logins: 6
    • Number of unique users who log in: 3

    View Slide

  4. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    • Slow
    • High memory footprint
    • Cannot work with appended/streaming data
    Problems with Traditional COUNT(DISTINCT)
    4

    View Slide

  5. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    HyperLogLog(HLL) is faster alternative to COUNT(DISTINCT) with low
    memory footprint;
    • Approximation algorithm
    • Estimates cardinality (i.e. COUNT(DISTINCT) ) of given data
    • Mathematically provable error bounds
    • It can estimate cardinalities well beyond 109 with 1% error rate using only 6 KB of memory
    There is better way!
    5

    View Slide

  6. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    It depends...
    Is it OK to approximate?
    6

    View Slide

  7. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    Is it OK to approximate?
    7
    • Count # of unique felonies associated to a person; Not OK
    • Count # of unique visits to my web page; OK

    View Slide

  8. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    HLL
    • Very fast
    • Low memory footprint
    • Can work with streaming data
    • Can merge estimations of two separate datasets efficiently
    8

    View Slide

  9. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    How does HLL work?
    Steps;
    1. Hash all elements
    a. Ensures uniform data distribution
    b. Can treat all data types same
    2. Observing rare bit patterns
    3. Stochastic averaging
    9

    View Slide

  10. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    How does HLL work? - Observing rare bit patterns
    hash
    Alice 645403841 binary
    0010...001
    Number of leading zeros: 2
    Maximum number of leading zeros: 2
    10

    View Slide

  11. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    How does HLL work? - Observing rare bit patterns
    hash
    Bob 1492309842 binary
    0101...010
    Number of leading zeros: 1
    Maximum number of leading zeros: 2
    11

    View Slide

  12. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    How does HLL work? - Observing rare bit patterns
    ...
    Maximum number of leading zeros: 7
    Cardinality Estimation: 27
    12

    View Slide

  13. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    How does HLL work? Stochastic Averaging
    Measuring same thing repeatedly and taking average.
    13

    View Slide

  14. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    14

    View Slide

  15. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    15

    View Slide

  16. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    How does HLL work? Stochastic Averaging
    Data
    Partition 1
    Partition 3
    Partition 2
    7
    5
    12
    228.968...
    Estimation
    27
    25
    212
    16

    View Slide

  17. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    How does HLL work? Stochastic Averaging
    01000101...010
    First m bits to decide
    partition number
    Remaining bits to
    count leading zeros
    17

    View Slide

  18. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    Error rate of HLL
    • Typical Error Rate: 1.04 / sqrt(number of partitions)
    • Memory need is number of partitions * log(log(max. value in hash space)) bit
    • Can estimate cardinalities well beyond 109 with 1% error rate while using a
    memory of only 6 kilobytes
    • Memory vs accuracy tradeoff
    18

    View Slide

  19. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    Why does HLL work?
    It turns out, combination of lots of bad observation is a
    good observation
    19

    View Slide

  20. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    Some interesting examples
    Alice
    Alice
    Alice



    Alice
    Partition 1
    Partition 8
    Partition 2
    0
    2
    0
    1.103...
    Harmonic
    Mean
    20
    22
    20
    hash
    Alice 645403841 binary
    00100110...001
    ... ... ...
    20

    View Slide

  21. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    Some interesting examples
    Charlie
    Partition 1
    Partition 8
    Partition 2
    29
    0
    0
    1.142...
    Harmonic
    Mean
    229
    20
    20
    hash
    Charlie 0 binary
    00000000...000
    ... ... ...
    21

    View Slide

  22. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    HLL in PostgreSQL
    ● https://github.com/citusdata/postgresql-hll
    22

    View Slide

  23. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    postgresql-hll uses a data structure, also called hll to keep maximum number of
    leading zeros of each partition.
    • Use hll_hash_bigint to hash elements.
    • There are some other functions for other common data types.
    • Use hll_add_agg to aggregate hashed elements into hll data structure.
    • Use hll_cardinality to materialize hll data structure to actual distinct count.
    HLL in PostgreSQL
    23

    View Slide

  24. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    Real Time Dashboard with
    HyperLogLog
    24

    View Slide

  25. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    Precomputed aggregates for period of time and set of dimensions;
    What is Rollup?
    25

    View Slide

  26. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    What is Rollup?
    CREATE TABLE rollup_events_5min (
    customer_id bigint,
    event_type varchar,
    country varchar,
    browser varchar,
    event_count bigint,
    device_distinct_count bigint,
    session_distinct_count bigint,
    minute timestamp
    );
    CREATE TABLE events (
    id bigint,
    customer_id bigint,
    event_type varchar,
    country varchar,
    browser varchar,
    device_id bigint,
    session_id bigint,
    timestamp timestamp
    );
    26

    View Slide

  27. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    What is Rollup?
    CREATE TABLE rollup_events_5min (
    customer_id bigint,
    event_type varchar,
    country varchar,
    browser varchar,
    event_count bigint,
    device_distinct_count bigint,
    session_distinct_count bigint,
    minute timestamp
    );
    CREATE TABLE events (
    id bigint,
    customer_id bigint,
    event_type varchar,
    country varchar,
    browser varchar,
    device_id bigint,
    session_id bigint,
    timestamp timestamp
    );
    27

    View Slide

  28. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    What is Rollup?
    CREATE TABLE rollup_events_5min (
    customer_id bigint,
    event_type varchar,
    country varchar,
    browser varchar,
    event_count bigint,
    device_distinct_count bigint,
    session_distinct_count bigint,
    minute timestamp
    );
    CREATE TABLE events (
    id bigint,
    customer_id bigint,
    event_type varchar,
    country varchar,
    browser varchar,
    device_id bigint,
    session_id bigint,
    timestamp timestamp
    );
    28

    View Slide

  29. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    What is Rollup?
    CREATE TABLE rollup_events_5min (
    customer_id bigint,
    event_type varchar,
    country varchar,
    browser varchar,
    event_count bigint,
    device_distinct_count bigint,
    session_distinct_count bigint,
    minute timestamp
    );
    CREATE TABLE events (
    id bigint,
    customer_id bigint,
    event_type varchar,
    country varchar,
    browser varchar,
    device_id bigint,
    session_id bigint,
    timestamp timestamp
    );
    29

    View Slide

  30. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    30
    INSERT INTO rollup_events_5min
    SELECT
    customer_id,
    event_type,
    country,
    browser,
    COUNT(*) AS event_count,
    COUNT (DISTINCT device_id) AS device_distinct_count,
    COUNT (DISTINCT session_id) AS session_distinct_count,
    date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP
    'epoch' AS minute
    FROM events
    WHERE timestamp >= $1 AND timestamp <=$2
    GROUP BY customer_id, event_type, country, browser, minute
    What is Rollup?
    30

    View Slide

  31. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    31
    INSERT INTO rollup_events_5min
    SELECT
    customer_id,
    event_type,
    country,
    browser,
    COUNT(*) AS event_count,
    COUNT (DISTINCT device_id) AS device_distinct_count,
    COUNT (DISTINCT session_id) AS session_distinct_count,
    date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP
    'epoch' AS minute
    FROM events
    WHERE timestamp >= $1 AND timestamp <=$2
    GROUP BY customer_id, event_type, country, browser, minute
    What is Rollup?
    31

    View Slide

  32. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    32
    INSERT INTO rollup_events_5min
    SELECT
    customer_id,
    event_type,
    country,
    browser,
    COUNT(*) AS event_count,
    COUNT (DISTINCT device_id) AS device_distinct_count,
    COUNT (DISTINCT session_id) AS session_distinct_count,
    date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP
    'epoch' AS minute
    FROM events
    WHERE timestamp >= $1 AND timestamp <=$2
    GROUP BY customer_id, event_type, country, browser, minute
    What is Rollup?
    32

    View Slide

  33. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    33
    INSERT INTO rollup_events_5min
    SELECT
    customer_id,
    event_type,
    country,
    browser,
    COUNT(*) AS event_count,
    COUNT (DISTINCT device_id) AS device_distinct_count,
    COUNT (DISTINCT session_id) AS session_distinct_count,
    date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP
    'epoch' AS minute
    FROM events
    WHERE timestamp >= $1 AND timestamp <=$2
    GROUP BY customer_id, event_type, country, browser, minute
    What is Rollup?
    33

    View Slide

  34. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    • Fast & indexed lookups of aggregates
    • Avoid expensive repeated computations
    • Rollups are compact (uses less space) and can be kept over longer periods
    • Rollups can be further aggregated
    Benefit of Rollup Tables
    34

    View Slide

  35. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    What if I want to get aggregation result for 1 hour period?
    SELECT
    customer_id,
    event_type,
    country,
    browser,
    SUM (event_count) AS event_count,
    SUM (device_distinct_count) AS device_distinct_count,
    SUM (session_distinct_count) AS session_distinct_count,
    date_trunc('minutes', (minute - TIMESTAMP 'epoch') / 12) * 12 + TIMESTAMP 'epoch' AS
    hour
    FROM rollup_events_5min
    GROUP BY customer_id, event_type, country, browser, minute
    35

    View Slide

  36. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    What if I want to get aggregation result for 1 hour period?
    SELECT
    customer_id,
    event_type,
    country,
    browser,
    SUM (event_count) AS event_count,
    SUM (device_distinct_count) AS device_distinct_count,
    SUM (session_distinct_count) AS session_distinct_count,
    date_trunc('minutes', (minute - TIMESTAMP 'epoch') / 12) * 12 + TIMESTAMP 'epoch' AS
    hour
    FROM rollup_events_5min
    GROUP BY customer_id, event_type, country, browser, minute
    36

    View Slide

  37. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    What if I want to get aggregation result for 1 hour period?
    SELECT
    customer_id,
    event_type,
    country,
    browser,
    SUM (event_count) AS event_count,
    SUM (device_distinct_count) AS device_distinct_count,
    SUM (session_distinct_count) AS session_distinct_count,
    date_trunc('minutes', (minute - TIMESTAMP 'epoch') / 12) * 12 + TIMESTAMP 'epoch' AS
    hour
    FROM rollup_events_5min
    GROUP BY customer_id, event_type, country, browser, minute
    37

    View Slide

  38. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    Rollup Table with HLL
    CREATE TABLE rollup_events_5min (
    customer_id bigint,
    event_type varchar,
    country varchar,
    browser varchar,
    event_count bigint,
    device_distinct_count hll,
    session_distinct_count hll,
    minute timestamp
    );
    CREATE TABLE rollup_events_5min (
    customer_id bigint,
    event_type varchar,
    country varchar,
    browser varchar,
    event_count bigint,
    device_distinct_count bigint,
    session_distinct_count bigint,
    minute timestamp
    );
    38

    View Slide

  39. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    39
    INSERT INTO rollup_events_5min
    SELECT
    customer_id,
    event_type,
    country,
    browser,
    COUNT(*) AS event_count,
    hll_add_agg(hll_hash_bigint(device_id)) AS device_distinct_count,
    hll_add_agg(hll_hash_bigint(session_id)) AS session_distinct_count,
    date_trunc('seconds', (timestamp - TIMESTAMP 'epoch') / 300) * 300 + TIMESTAMP
    'epoch' AS minute
    FROM events
    WHERE timestamp >= $1 AND timestamp <=$2
    GROUP BY customer_id, event_type, country, browser, minute
    Rollup Table with HLL
    39

    View Slide

  40. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    What if I want to get aggregation result for 1 hour period?
    SELECT
    customer_id,
    event_type,
    country,
    browser,
    SUM (event_count) AS event_count,
    hll_union_agg (device_distinct_count) AS device_distinct_count,
    hll_union_agg (session_distinct_count) AS session_distinct_count,
    date_trunc('minutes', (minute - TIMESTAMP 'epoch') / 12) * 12 + TIMESTAMP 'epoch' AS
    hour
    FROM rollup_events_5min
    GROUP BY customer_id, event_type, country, browser, minute
    40

    View Slide

  41. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    Interval 1
    Interval 1
    Partition 1
    Interval 1
    Partition 3
    Interval 1
    Partition 2
    7
    5
    12
    HLL(7, 5, 12)
    Intermediate
    Result
    How to Merge COUNT(DISTINCT) with HLL
    41

    View Slide

  42. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    Interval 2
    Interval 2
    Partition 1
    Interval 2
    Partition 3
    Interval 2
    Partition 2
    11
    7
    8
    HLL(11, 7, 8)
    Intermediate
    Result
    How to Merge COUNT(DISTINCT) with HLL
    42

    View Slide

  43. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    11
    7
    12
    1053.255
    Estimation
    211
    27
    212
    HLL(11, 7, 8)
    HLL(7, 5, 12)
    HLL(11, 7, 12)
    hll_union_agg
    How to Merge COUNT(DISTINCT) with HLL
    43

    View Slide

  44. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    Interval 1
    +
    Interval 2
    Interval 1
    Partition 1(7)
    +
    Interval 2
    Partition 1(11)
    11
    7
    12
    1053.255
    Estimation
    Interval 1
    Partition 2(5)
    +
    Interval 2
    Partition 2(7)
    Interval 1
    Partition 3(12)
    +
    Interval 2
    Partition 4(8)
    How to Merge COUNT(DISTINCT) with HLL
    44

    View Slide

  45. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    • What if ...
    • Without hll, you would have to maintain 2n - 1 rollup tables to cover all
    combinations in n columns (multiply this with number of time intervals).
    45
    Rollup Table with HLL
    45

    View Slide

  46. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    What Happens in Distributed
    Scenario?
    46

    View Slide

  47. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    1. Separate data into shards.
    events_001 events_002 events_003
    postgresql-hll in distributed environment
    47

    View Slide

  48. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    2. Put shards into separate nodes.
    Worker
    Node 1
    Coordinator
    Worker
    Node 2
    Worker
    Node 3
    events_001 events_002 events_003
    postgresql-hll in distributed environment
    48

    View Slide

  49. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    3. For each shard, calculate hll (but do not materialize).
    postgresql-hll in distributed environment
    Shard 1
    Shard 1
    Partition 1
    Shard 1
    Partition 3
    Shard 1
    Partition 2
    7
    5
    12
    HLL(7, 5, 12)
    Intermediate
    Result
    49

    View Slide

  50. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    4. Pull intermediate results to a single node.
    Worker
    Node 1
    events_001
    Coordinator
    Worker
    Node 2
    events_002
    Worker
    Node 3
    events_003
    HLL(6, 4, 11) HLL(10, 6, 7) HLL(7, 12, 5)
    postgresql-hll in distributed environment
    50

    View Slide

  51. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    5. Merge separate hll data structures and materialize them
    11
    13
    12
    10532.571...
    211
    213
    212
    HLL(11, 7, 8)
    HLL(7, 5, 12)
    HLL(11, 13, 12)
    HLL(8, 13, 6)
    postgresql-hll in distributed environment
    51

    View Slide

  52. Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    Burak Yücesoy | Citus Data | PGConf EU 2018 | October 2018
    [email protected]
    Thanks
    &
    Questions
    @byucesoy
    Burak Yücesoy
    www.citusdata.com @citusdata

    View Slide