
Optimizing Distinct Count in PostgreSQL with HLL

Citus Data
October 16, 2018

Optimizing Distinct Count in PostgreSQL with HLL | SF Bay Area PostgreSQL Meetup | Sai Srirampur

Sai Srirampur, a solution architect for PostgreSQL and Citus databases, will present "Optimizing distinct counts on PostgreSQL with HLL".

In this talk, we will focus on the HyperLogLog (HLL) algorithm and its PostgreSQL extension, postgresql-hll. HLL can provide approximate answers to COUNT(DISTINCT) queries within mathematically provable error bounds. HLL is not only fast and memory-efficient, but also has interesting properties that especially shine in a distributed environment. During the talk, we'll first look at the internals of HLL. Then we will look at why the HLL algorithm is useful for efficient pre-aggregations and distinct counts at scale. Finally, we will look at how HLL can be used in a distributed Postgres database cluster with Citus.


Transcript

  1. Sai Srirampur at The SF Bay Area Postgres Meetup 16

    October 2018 Optimizing Distinct Count in PostgreSQL with HLL @srirampur @citusdata @BayAreaPostgres
  2. •Sai Srirampur a.k.a Sai •Early Engineer at Citus Data •Joined

    so people won’t have to worry about scaling their Postgres database •Love Ping-pong •@saisrirampur, @citusdata, @BayAreaPostgres
  3. Happy to be at SF Bay Area PostgreSQL Meetup to

    talk about HyperLogLog Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  4. Optimizing Distinct Count in PostgreSQL w/HLL • What is HyperLogLog

    (HLL)? • postgresql-hll • Rollups for COUNT(DISTINCT) • Distributed COUNT(DISTINCT) • Q&A Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  5. What is COUNT(DISTINCT)? • Number of unique elements (cardinality) in

    given data • Useful to find things like… ◦ # unique users visited your web page ◦ # unique products in your inventory Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup apple apple apple orange 3 orange banana banana
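    As a quick aside, the fruit example above maps to a plain SQL query (illustrative, not from the deck):

    -- The apple/orange/banana example as an exact COUNT(DISTINCT).
    SELECT count(DISTINCT fruit) AS unique_fruits
    FROM (VALUES ('apple'), ('apple'), ('apple'),
                 ('orange'), ('orange'),
                 ('banana'), ('banana')) AS basket(fruit);
    -- unique_fruits = 3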
  6. Problems with the traditional way to do COUNT(DISTINCT) in Postgres

    • Slow • High memory footprint • Cannot work with streaming data Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  7. There is a better way!!! HyperLogLog(HLL) in PostgreSQL • Approximation

    algorithm (“sketch” algorithm) • Estimates COUNT(DISTINCT)/cardinality of given data • Mathematically proven error bounds • Link to Paper Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  8. Is it OK to approximate? It depends… Sai Srirampur |

    Citus Data | SF Bay Area Postgres Meetup
  9. Examples of when OK to approximate? • Count the felonies

    associated to a person - NOT OK • Unique sessions on my website - OK Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  10. 4 reasons I ❤ HLL vs. COUNT(DISTINCT) • Very fast

    • Low memory footprint • Can work with streaming data • Can merge estimations of two separate datasets efficiently Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  11. How does HLL work? The Steps 1. Hash all elements

    a. Ensures uniform data distribution b. Can treat all data types the same 2. Observing rare bit patterns 3. Stochastic averaging Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  12. How does HLL work? - Observing rare bit patterns (1)

    Alice --------------> 645403841 --------------> 0010...001 Number of leading zeros: 2 Maximum number of leading zeros: 2 hash binary Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  13. How does HLL work? - Observing rare bit patterns (2)

    Bob --------------> 1492309842 --------------> 0101...010 Number of leading zeros: 1 Maximum number of leading zeros: 2 hash binary Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
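    As an aside, the leading-zero count for a hash value like Alice's can be reproduced with ordinary Postgres casts (illustrative, not from the deck):

    -- Illustrative only: view a 32-bit value in binary and count its leading zeros,
    -- the quantity HLL tracks per element.
    SELECT 645403841::bit(32)::text AS binary_form,
           length((regexp_match(645403841::bit(32)::text, '^0*'))[1]) AS leading_zeros;
    -- binary_form = 00100110011110000001010011000001, leading_zeros = 2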
  14. How does HLL work? - Observing rare bit patterns Maximum

    number of leading zeros: 7 Cardinality Estimation: 2^7 ... Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  15. Cardinality Estimation - Probabilistic: If I see 2^n elements, I

    can probabilistically say that at least 1 element has n leading zeros.
    # leading zeros | probability | cardinality
    1               | 1/2^1       | 2^1
    2               | 1/2^2       | 2^2
    3               | 1/2^3       | 2^3
    ...             | ...         | ...
    n               | 1/2^n       | 2^n
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
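    The same table can be generated in SQL for experimentation (illustrative, not from the deck):

    -- P(a hash value starts with n leading zeros) = 1/2^n,
    -- so observing a maximum of n leading zeros suggests roughly 2^n distinct elements.
    SELECT n AS leading_zeros,
           1.0 / 2 ^ n AS probability,
           2 ^ n       AS cardinality_estimate
    FROM generate_series(1, 5) AS n;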
  16. Problem…. >>> SUPPOSE DATASET WITH ONLY 1 ELEMENT Charlie -------------->

    0 --------------> 00000...0000 Number of leading zeros: 32 Maximum number of leading zeros: 32 Cardinality Estimation: 2^32 hash binary Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  17. Solution: Stochastic Averaging Measuring the same thing repeatedly & taking the

    average. Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  18. Solution: Stochastic Averaging • Can be done via multiple

    hash functions. • Hard to generate & maintain. • Trick is: Partition your data! Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  19. How does HLL work? Stochastic Averaging

    The data is split into partitions, and each partition tracks its own maximum number of leading zeros: 7, 5, and 12, giving per-partition estimations of 2^7, 2^5, and 2^12. Combined estimation: 228.968...
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
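    The combined number appears to be the harmonic-mean style estimate m^2 / sum(2^-M_j) over the m = 3 partition maxima, without HyperLogLog's bias-correction constant; that reading checks out arithmetically (illustrative, not from the deck):

    -- Assumed combine step: 3 partitions with maxima 7, 5 and 12.
    SELECT 3 ^ 2 / (2 ^ (-7) + 2 ^ (-5) + 2 ^ (-12)) AS combined_estimate;
    -- ≈ 228.968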
  20. How does HLL work? Stochastic Averaging 01000101...010 First m bits

    to decide partition number Remaining bits to count leading zeros Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
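    Illustrative only (not from the deck), assuming the first 3 bits pick one of 8 partitions as in the examples that follow:

    -- Take Alice's hash, peel off the partition bits, count leading zeros in the rest.
    WITH h AS (SELECT 645403841::bit(32)::text AS bits)
    SELECT substring(bits FROM 1 FOR 3)                             AS partition_bits,
           length((regexp_match(substring(bits FROM 4), '^0*'))[1]) AS leading_zeros
    FROM h;
    -- partition_bits = 001, leading_zeros = 2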
  21. Error rate of HLL • Typical Error Rate: 1.04 /

    sqrt(number of partitions) • Memory need is # partitions * log(log(max. value in hash space)) • Memory vs accuracy tradeoff Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
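    For intuition (an aside, not a slide), the 1.04 / sqrt(number of partitions) rule works out roughly like this:

    -- More partitions (registers) → lower error, at the cost of memory.
    SELECT m AS partitions,
           round((1.04 / sqrt(m))::numeric, 4) AS typical_error_rate
    FROM unnest(ARRAY[16, 256, 2048, 16384]) AS m;
    -- e.g. 2048 partitions → ~2.3% error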
  22. Error rate of HLL... Estimate cardinalities well beyond 10^9 with

    1% error rate while using a memory of only 6 kilobytes Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  23. Why does HLL work? "It turns out, a combination of lots

    of bad estimations is a good estimation." Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  24. Some interesting examples... A dataset containing only "Alice", repeated:

    Alice --hash--> 645403841 --binary--> 00100110...001
    Partition 1: 2^0, Partition 2: 2^2, ..., Partition 8: 2^0 (maximum leading zeros: 0, 2, ..., 0)
    Harmonic Mean: 1.103...
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  25. Some interesting examples... A dataset containing only "Charlie":

    Charlie --hash--> 0 --binary--> 00000000...000
    Partition 1: 2^29, Partition 2: 2^0, ..., Partition 8: 2^0 (maximum leading zeros: 29, 0, ..., 0)
    Harmonic Mean: 1.142...
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  26. postgresql-hll • Implementation of the HLL algorithm for Postgres •

    Extension to perform COUNT(DISTINCT) approximations in Postgres • Introduces the hll data type. "A 1280-byte hll can estimate the count of tens of billions of distinct values with only a few percent error." Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
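    Not part of the deck, but a minimal way to try the extension (assuming postgresql-hll is installed on the server):

    CREATE EXTENSION IF NOT EXISTS hll;

    -- The fruit example from earlier, answered with an HLL sketch instead of COUNT(DISTINCT).
    SELECT hll_cardinality(hll_add_agg(hll_hash_text(fruit))) AS approx_unique_fruits
    FROM (VALUES ('apple'), ('apple'), ('orange'), ('banana')) AS basket(fruit);
    -- ≈ 3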
  27. postgresql-hll: functions & aggregations

    hll_hash_<data_type>(element) returns hll_hashval
    hll_add(hll, hll_hashval) returns hll
    hll_union(hll, hll) returns hll
    hll_add_agg(hll_hashval) returns hll
    hll_union_agg(hll) returns hll
    hll_cardinality(hll) returns double precision
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
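    Element-level use of these looks roughly like this (a sketch; it assumes the current postgresql-hll signatures, where hll_add takes an existing hll plus a hashed value and hll_union merges two hll values):

    -- Build two sketches one element at a time, merge them, and read the cardinality.
    SELECT hll_cardinality(
             hll_union(
               hll_add(hll_empty(), hll_hash_text('Alice')),
               hll_add(hll_empty(), hll_hash_text('Bob'))));
    -- ≈ 2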
  28. About the postgresql-hll extension to Postgres • https://github.com/citusdata/postgresql-hll (created by

    engineers at Neustar, now maintained by Citus Data) • Companies using postgresql-hll for dashboards • Algolia • Microsoft for Windows telemetry Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  29. What is a rollup? 36 Pre-computed aggregates for a period and set of (group by)

    dimensions. Can be further filtered and aggregated to generate charts. (Example rollup columns: Period, Customer, Country, Site, Hit Count.) Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  30. • Fast (indexed) lookups of aggregates • Avoid expensive repeated

    computation • Rollups can be further aggregated • Rollups are compact, can be kept over longer periods What are the benefits of rollups? 37 Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  31. Postgres recipe for real-time analytics 38

    Append new data to a raw events table: COPY events FROM ...
    Periodically aggregate events into a rollup table: INSERT INTO rollup SELECT … FROM events … GROUP BY …
    The application queries the rollup table: SELECT … FROM rollup WHERE customer_id = 1238 …
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  32. Incremental aggregation on COUNT(DISTINCT) 39 Using COUNT(DISTINCT) to do incremental rollups:

    CREATE TABLE hourly_rollup (
      customer_id bigint not null,
      period timestamptz not null,
      unique_ips bigint not null,
      PRIMARY KEY (customer_id, period)
    );

    SELECT sum(unique_ips) FROM hourly_rollup;
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
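    One caveat worth making explicit (an editorial note, not a slide): with a plain bigint counter, summing per-hour values over-counts any IP that appears in more than one hour; the hll-based rollup on the next slide avoids this because sketches merge without double counting.

    -- With the bigint rollup above, an IP active in two different hours is counted twice:
    SELECT sum(unique_ips) FROM hourly_rollup WHERE period::date = '2018-08-29';
    -- With the hll rollup on the next slide, the equivalent query de-duplicates across hours:
    -- SELECT hll_cardinality(hll_union_agg(unique_ips)) FROM hourly_rollup WHERE period::date = '2018-08-29';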
  33. Incremental aggregation using HLL 40 Use hll_add_agg & hll_union to do incremental rollups:

    CREATE TABLE hourly_rollup (
      customer_id bigint not null,
      period timestamptz not null,
      unique_ips hll not null,
      PRIMARY KEY (customer_id, period)
    );

    INSERT INTO hourly_rollup
    SELECT customer_id, date_trunc('hour', created_at), hll_add_agg(ip)
    FROM page_views
    WHERE event_time BETWEEN start_time AND end_time
    GROUP BY 1, 2
    ON CONFLICT (customer_id, period) DO UPDATE
    SET unique_ips = hll_union(unique_ips, EXCLUDED.unique_ips);
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  34. Dashboard queries with HLL 41 Use hll_union_agg to merge HLL objects. Use hll_cardinality to extract the

    distinct count.
    -- HLL
    SELECT period::date, hll_cardinality(hll_union_agg(unique_ips)) AS uniques
    FROM hourly_rollup
    WHERE customer_id = 1283
      AND period >= now() - interval '1 week'
    GROUP BY 1 ORDER BY 1;

      period    │ uniques
    ────────────┼─────────
     2018-08-29 │   14712
     2018-08-30 │   33280
     …
    (7 rows)
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  35. Raw data table 42 Define a table for storing raw events:

    CREATE TABLE events (
      event_id bigserial,
      event_time timestamptz default now(),
      customer_id bigint,
      event_type text,
      country text,
      browser text,
      device_id bigint,
      session_id bigint,
      details jsonb
    );
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  36. Rollup table ("hll datatype to store intermediate results"):

    CREATE TABLE rollup_events_1hr (
      customer_id bigint,
      event_type varchar,
      country varchar,
      browser varchar,
      hour timestamp,
      event_count bigint,
      device_distinct_count hll,
      session_distinct_count hll
    );
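    Editorial note (not a slide): for the ON CONFLICT clause in the rollup query on the next slide to work, the rollup table also needs a unique index on the grouping columns, for example:

    -- Assumed companion index; the deck doesn't show it explicitly.
    CREATE UNIQUE INDEX rollup_events_1hr_unique_idx
      ON rollup_events_1hr (customer_id, event_type, country, browser, hour);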
  39. Rollup query

    INSERT INTO rollup_events_1hr
    SELECT customer_id,
           event_type,
           country,
           browser,
           date_trunc('hour', event_time) AS hour,
           count(*) AS event_count,
           hll_add_agg(hll_hash_bigint(device_id)) AS device_distinct_count,
           hll_add_agg(hll_hash_bigint(session_id)) AS session_distinct_count
    FROM events
    WHERE event_time BETWEEN now() - interval '1 hour' AND now()
    GROUP BY customer_id, event_type, country, browser, hour
    ON CONFLICT (customer_id, event_type, country, browser, hour) DO UPDATE
    SET event_count = rollup_events_1hr.event_count + excluded.event_count,
        device_distinct_count = hll_union(rollup_events_1hr.device_distinct_count, excluded.device_distinct_count),
        session_distinct_count = hll_union(rollup_events_1hr.session_distinct_count, excluded.session_distinct_count);
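    Reading the rollup back out mirrors the earlier dashboard query; a sketch with an assumed customer_id:

    -- Unique devices per hour for one customer, merged from the hll sketches.
    SELECT hour,
           hll_cardinality(hll_union_agg(device_distinct_count)) AS unique_devices
    FROM rollup_events_1hr
    WHERE customer_id = 1283
    GROUP BY hour
    ORDER BY hour;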
  40. What is distributed COUNT(DISTINCT)? 48 Sai Srirampur | Citus Data

    | SF Bay Area Postgres Meetup Worker Node 1 logins_001 Coordinator Worker Node 2 logins_002 Worker Node 3 logins_003
  41. • Your data is too big to fit in memory

    of a single machine • Naive approach for COUNT(DISTINCT) requires too much memory Why do we need distributed COUNT(DISTINCT)? Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  42. Why is distributed COUNT(DISTINCT) so difficult? Worker Node 1 logins_001

    Coordinator SELECT COUNT(DISTINCT username) FROM logins; Worker Node 2 logins_002 Worker Node 3 logins_003 SELECT COUNT(DISTINCT username) FROM ...; Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  43. Why is distributed COUNT(DISTINCT) so difficult?

    Worker Node 1 (logins_001):
    username | date
    ---------+-----------
    Alice    | 2017-01-02
    Bob      | 2017-01-03
    Charlie  | 2017-01-05
    Eve      | 2017-01-07

    Worker Node 2 (logins_002):
    username | date
    ---------+-----------
    Bob      | 2017-02-11
    Bob      | 2017-02-13
    Dave     | 2017-02-17
    Alice    | 2017-02-19

    Worker Node 3 (logins_003):
    username | date
    ---------+-----------
    Dave     | 2017-03-23
    Eve      | 2017-03-29
    Charlie  | 2017-03-02
    Charlie  | 2017-03-03

    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  44. • Pull all distinct data to one node & count

    there. (Doesn’t scale) • Repartition data on the fly. (Scales but it’s very slow) • Use HyperLogLog. (Scales & fast) Some possible approaches Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  45. COUNT(DISTINCT) Aggregate in Citus 53

    Coordinator (events): SELECT count(distinct username) from events;
    Worker shards: SELECT hll_add_agg(hll_hash(username)) result from events_001; SELECT hll_add_agg(hll_hash(username)) result from events_002; …
    Coordinator: SELECT hll_cardinality(hll_union_agg(result)) from intermediate_results;
    Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
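    In practice, Citus is told to use HLL for distributed count distinct via the citus.count_distinct_error_rate setting (this assumes postgresql-hll is installed on the coordinator and all worker nodes):

    -- 0 disables HLL-based approximation; smaller values trade memory for accuracy.
    SET citus.count_distinct_error_rate TO 0.005;
    SELECT count(DISTINCT username) FROM events;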
  46. INSERT … SELECT can be parallelised across shards. Computing rollup

    tables in Citus 54 Coordinator events INSERT INTO rollup SELECT … FROM events GROUP BY … rollup INSERT INTO rollup_102182 SELECT … FROM events_102010 GROUP BY … INSERT INTO rollup_102180 SELECT … FROM events_102008 GROUP BY … Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup
  47. SELECT on rollup for a particular customer (from the dashboard)

    can be routed to the appropriate shard. Querying rollups in Citus 55 Coordinator events SELECT … FROM rollup WHERE tenant_id = 12834 … … rollup SELECT … FROM events_102180 WHERE tenant_id = 1283 … … Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup