Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Optimizing Distinct Count in PostgreSQL with HLL | SF Bay Area PostgreSQL Meetup | Sai Srirampur

Citus Data
October 16, 2018

Optimizing Distinct Count in PostgreSQL with HLL | SF Bay Area PostgreSQL Meetup | Sai Srirampur

Sai Srirampur, a solution architect for PostgreSQL and Citus databases, will present "Optimizing distinct counts on PostgreSQL with HLL".

In this talk, we will focus on HyperLogLog (HLL) algorithm and its PostgreSQL extension postgresql-hll. HLL can provide approximate answers to COUNT(DISTINCT) queries in mathematically provable error bounds. HLL is not only fast and memory-efficient but also has very interesting properties which especially shine in a distributed environment. During the talk, first, we’ll look at the internals of the HLL. Then, we will look to understand why HLL algorithm is useful to get efficient pre-aggregations and distinct counts in scalable way. Finally, we will look at how HLL can be used in a distributed Postgres database cluster with Citus.

Citus Data

October 16, 2018
Tweet

More Decks by Citus Data

Other Decks in Technology

Transcript

  1. Sai Srirampur at The SF Bay Area Postgres Meetup
    16 October 2018
    Optimizing Distinct Count in
    PostgreSQL with HLL
    @srirampur @citusdata @BayAreaPostgres

    View full-size slide

  2. •Sai Srirampur a.k.a Sai
    •Early Engineer at Citus Data
    •Joined so people won’t have to
    worry about scaling their
    Postgres database
    •Love Ping-pong
    •@saisrirampur, @citusdata,
    @BayAreaPostgres

    View full-size slide

  3. Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  4. Happy to be at SF Bay Area PostgreSQL
    Meetup to talk about HyperLogLog
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  5. Optimizing Distinct Count in PostgreSQL w/HLL
    • What is HyperLogLog (HLL)?
    • postgresql-hll
    • Rollups for COUNT(DISTINCT)
    • Distributed COUNT(DISTINCT)
    • Q&A
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  6. What is hll?
    6

    View full-size slide

  7. What is COUNT(DISTINCT)?
    ● Number of unique elements (cardinality) in given data
    ● Useful to find things like…
    ○ # unique users visited your web page
    ○ # unique products in your inventory
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
    apple
    apple
    apple
    orange 3
    orange
    banana
    banana

    View full-size slide

  8. Problems with the traditional way to do
    COUNT(DISTINCT) in Postgres
    ● Slow
    ● High memory footprint
    ● Cannot work with streaming data
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  9. There is a better way!!!
    HyperLogLog(HLL) in PostgreSQL
    ● Approximation algorithm (“sketch” algorithm)
    ● Estimates COUNT(DISTINCT)/cardinality of given data
    ● Mathematically proven error bounds
    ● Link to Paper
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  10. Is it OK to
    approximate?
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  11. Is it OK to approximate?
    It depends…
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  12. Examples of when OK to approximate?
    • Count the felonies associated to a person - NOT OK
    • Unique sessions on my website - OK
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  13. 4 reasons I HLL — vs. COUNT(DISTINCT)
    ● Very fast
    ● Low memory footprint
    ● Can work with streaming data
    ● Can merge estimations of two separate datasets
    efficiently
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  14. How does HLL work? The Steps
    1. Hash all elements
    a. Ensures uniform data distribution
    b. Can treat all data types same
    2. Observing rare-bit patterns
    3. Stochastic averaging
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  15. How does HLL work? - Observing rare bit
    patterns (1)
    Alice --------------> 645403841 --------------> 0010...001
    Number of leading zeros: 2
    Maximum number of leading zeros: 2
    hash binary
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  16. How does HLL work? - Observing rare bit
    patterns (2)
    Bob --------------> 1492309842 --------------> 0101...010
    Number of leading zeros: 1
    Maximum number of leading zeros: 2
    hash binary
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  17. How does HLL work? - Observing rare bit
    patterns
    Maximum number of leading zeros: 7
    Cardinality Estimation: 27
    ...
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  18. Cardinality Estimation - Probabilistic
    If I see 2n elements I can probabilistically say that at least 1 element has n leading zeros.
    # leading zeros probability cardinality
    1 1/21 2
    2 1/22 22
    3 1/23 23
    ……...
    n 1/2n 2n
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  19. Problem….
    >>> SUPPOSE DATASET WITH ONLY 1 ELEMENT
    Charlie --------------> 0 --------------> 00000...0000
    Number of leading zeros: 32
    Maximum number of leading zeros: 32
    Cardinality Estimation: 232
    hash binary
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  20. Solution: Stochastic Averaging
    Measuring same thing repeatedly & taking the
    average.
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  21. Burak Yucesoy | Citus Data | PGConf EU

    View full-size slide

  22. Burak Yucesoy | Citus Data | PGConf EU

    View full-size slide

  23. Solution: Stochastic Averaging
    • Can be done via a multiple hash functions.
    • Hard to generate & maintain.
    • Trick is:
    Partition your data!
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  24. How does HLL work? Stochastic Averaging
    Data
    Partition 1
    Partition 3
    Partition 2
    7
    228.968...
    Estimation
    27
    25
    212
    5
    12
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  25. How does HLL work? Stochastic Averaging
    01000101...010
    First m bits to decide
    partition number
    Remaining bits to
    count leading zeros
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  26. Error rate of HLL
    ● Typical Error Rate: 1.04 / sqrt(number of partitions)
    ● Memory need is
    # partitions * log(log(max. value in hash space))
    ● Memory vs accuracy tradeoff
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  27. Error rate of HLL...
    Estimate cardinalities well beyond 109
    with 1% error rate
    while using a memory of only 6 kilobytes
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  28. Why does HLL work?
    It turns out, combination of lots of bad
    estimation is a good estimation
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  29. Some interesting examples...
    Alice
    Alice
    Alice



    Alice
    Partition 1
    Partition 8
    Partition 2
    0
    2
    0
    1.103...
    Harmonic
    Mean
    20
    22
    20
    ... ... ...
    Alice --------------> 645403841 --------------> 00100110...001
    hash binary
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  30. Some interesting examples...
    Charlie
    Partition 1
    Partition 8
    Partition 2
    29
    0
    0
    1.142...
    Harmonic
    Mean
    229
    20
    20
    ... ... ...
    Charlie --------------> 0 --------------> 00000000...000
    hash binary
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  31. postgresql-hll
    31

    View full-size slide

  32. postgresql-hll
    ● Implementation of the hll algorithm for postgres.
    ● Extension to perform COUNT(DISTINCT) approximations in
    Postgres
    ● Introduces hll data type.
    “1280 bytes hll can estimate the count of tens of billions of distinct
    values with only a few percent error.”
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  33. hll_hash_(element) returns hll_hash
    hll_add(hash) returns hll
    hll_union(hll) returns hll
    hll_add_agg(hash) returns hll
    hll_union_agg(hll) returns hll
    hll_cardinality(hll) returns integer
    postgresql-hll: functions & aggregations
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  34. About the postgresql-hll extension to Postgres
    • https://github.com/citusdata/postgresql-hll
    (created by engineers at Neustar, now maintained by Citus Data)
    ● Companies using postgresql-hll for dashboards
    ● Algolia
    ● Microsoft for Windows telemetry
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  35. Rollups for count
    distincts
    35

    View full-size slide

  36. Pre-computed aggregates for a period and set of (group by) dimensions.
    Can be further filtered and aggregated to generate charts.
    What is a rollup?
    36
    Period Customer Country Site Hit Count
    SELECT…
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  37. • Fast (indexed) lookups of aggregates
    • Avoid expensive repeated computation
    • Rollups can be further aggregated
    • Rollups are compact, can be kept over longer periods
    What are the benefits of rollups?
    37 Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  38. Append new data to a raw events table:
    COPY events FROM ...
    Periodically aggregate events into rollup table:
    INSERT INTO rollup SELECT … FROM events … GROUP BY …
    Application queries the rollup table:
    SELECT … FROM rollup WHERE customer_id = 1238 …
    Postgres recipe for real-time analytics
    38 Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  39. Using COUNT(DISTINCT) to do incremental rollups
    CREATE TABLE hourly_rollup (
    customer_id bigint not null,
    period timestamptz not null,
    unique_ips bigint not null,
    PRIMARY KEY (customer_id, period)
    );
    SELECT sum(unique_ips) from hourly_rollup;
    Incremental aggregation on COUNT(DISTINCT)
    39 Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  40. Use hll_add_agg & hll_union to do incremental rollups.
    CREATE TABLE hourly_rollup (
    customer_id bigint not null,
    period timestamptz not null,
    unique_ips hll not null,
    PRIMARY KEY (customer_id, period)
    );
    INSERT INTO hourly_rollup
    SELECT customer_id, date_trunc('hour', created_at), hll_add_agg(ip)
    FROM page_views
    WHERE event_time BETWEEN start_time AND end_time
    GROUP BY 1, 2 ON CONFLICT (customer_id, period)
    DO UPDATE SET unique_ips = hll_union(unique_ips, EXCLUDED.unique_ips);
    Incremental aggregation using HLL
    40 Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  41. Use hll_union_agg to merge HLL objects.
    Use hll_cardinality to extract distinct count.
    -- HLL
    SELECT period::date, hll_cardinality(hll_union_agg(unique_ips)) AS uniques
    FROM hourly_rollup
    WHERE customer_id = 1283 AND period >= now() - interval '1 week'
    GROUP BY 1 ORDER BY 1;
    period │ unqiues
    ────────────┼─────────
    2018-08-29 │ 14712
    2018-08-30 │ 33280

    (7 rows)
    Dashboard queries with HLL
    41 Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  42. Define a table for storing raw events:
    CREATE TABLE events(
    event_id bigserial,
    event_time timestamptz default now(),
    customer_id bigint,
    event_type text,
    country text,
    browser text,
    device_id bigint,
    session_id bigint,
    details jsonb
    );
    Raw data table
    42 Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  43. Rollup table
    “hll datatype to store intermediate results.”
    CREATE TABLE rollup_events_1hr (
    customer_id bigint,
    event_type varchar,
    country varchar,
    browser varchar,
    hour timestamp,
    event_count bigint,
    device_distinct_count hll,
    session_distinct_count hll
    );

    View full-size slide

  44. Rollup table
    “hll datatype to store intermediate results.”
    CREATE TABLE rollup_events_1hr (
    customer_id bigint,
    event_type varchar,
    country varchar,
    browser varchar,
    hour timestamp,
    event_count bigint,
    device_distinct_count hll,
    session_distinct_count hll
    );

    View full-size slide

  45. Rollup table
    “hll datatype to store intermediate results.”
    CREATE TABLE rollup_events_1hr (
    customer_id bigint,
    event_type varchar,
    country varchar,
    browser varchar,
    hour timestamp,
    event_count bigint,
    device_distinct_count hll,
    session_distinct_count hll
    );

    View full-size slide

  46. Rollup query
    INSERT INTO rollup_events_1hr
    SELECT customer_id,
    Event_type, country, browser,
    date_trunc('hour', event_time) as hour,
    count(*) as event_count,
    hll_add_agg(hll_hash_bigint(device_id)) as device_distinct_count,
    hll_add_agg(hll_hash_bigint(session_id)) as session_distinct_count
    FROM events
    WHERE event_time BETWEEN now() AND now() - ‘1hr’
    GROUP BY customer_id,event_type,country,browser,hour
    ON CONFLICT (customer_id,event_type,country,browser,hour)
    DO UPDATE
    SET event_count=rollup_events_1hr.event_count+excluded.event_count,
    device_distinct_count = hll_union(rollup_events_1hr.device_distinct_count,excluded.device_distinct_count),
    session_distinct_count= hll_union(rollup_events_1hr.session_distinct_count,excluded.session_distinct_count);

    View full-size slide

  47. Distributed Count
    Distinct
    47

    View full-size slide

  48. What is distributed COUNT(DISTINCT)?
    48 Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
    Worker Node
    1
    logins_001
    Coordinator
    Worker Node
    2
    logins_002
    Worker Node
    3
    logins_003

    View full-size slide

  49. • Your data is too big to fit in memory of single machine
    • Naive approach for COUNT(DISTINCT) requires too
    much memory
    Why do we need distributed COUNT(DISTINCT)?
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  50. Why is distributed COUNT(DISTINCT) so difficult?
    Worker
    Node 1
    logins_001
    Coordinator
    SELECT COUNT(DISTINCT username) FROM logins;
    Worker
    Node 2
    logins_002
    Worker
    Node 3
    logins_003
    SELECT COUNT(DISTINCT username) FROM
    ...;
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  51. Why is distributed COUNT(DISTINCT) so difficult?
    Worker Node 1
    logins_001
    username | date
    ----------+-----------
    Alice | 2017-01-02
    Bob | 2017-01-03
    Charlie | 2017-01-05
    Eve | 2017-01-07
    Worker Node 3
    logins_003
    username | date
    ----------+-----------
    Dave | 2017-03-23
    Eve | 2017-03-29
    Charlie | 2017-03-02
    Charlie | 2017-03-03
    Worker Node 2
    logins_002
    username | date
    ----------+-----------
    Bob | 2017-02-11
    Bob | 2017-02-13
    Dave | 2017-02-17
    Alice | 2017-02-19
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  52. ● Pull all distinct data to one node & count there.
    (Doesn’t scale)
    ● Repartition data on the fly. (Scales but it’s very slow)
    ● Use HyperLogLog. (Scales & fast)
    Some possible approaches
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  53. COUNT (DISTINCT) Aggregate in Citus
    53
    Coordinator
    events
    SELECT count(distinct username) from events;
    SELECT hll_add_agg(hll_hash(username))
    result from events_001
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
    SELECT
    hll_cardinality(hll_union_agg(result)) from
    intermediate_results;

    SELECT hll_add_agg(hll_hash(username)) result
    from events_002

    View full-size slide

  54. INSERT … SELECT can be parallelised across shards.
    Computing rollup tables in Citus
    54
    Coordinator
    events
    INSERT INTO rollup
    SELECT … FROM events
    GROUP BY …
    rollup
    INSERT INTO
    rollup_102182
    SELECT … FROM
    events_102010
    GROUP BY …
    INSERT INTO rollup_102180
    SELECT … FROM
    events_102008
    GROUP BY …
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  55. SELECT on rollup for a particular customer (from the
    dashboard) can be routed to the appropriate shard.
    Querying rollups in Citus
    55
    Coordinator
    events SELECT … FROM rollup
    WHERE tenant_id = 12834 …

    rollup
    SELECT … FROM
    events_102180
    WHERE tenant_id =
    1283 …

    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

    View full-size slide

  56. Thank you!!
    @saisrirampur | @BayAreaPostgres | @citusdata

    View full-size slide