Optimizing Distinct Count in PostgreSQL with HLL | SF Bay Area PostgreSQL Meetup | Sai Srirampur

Sai Srirampur at The SF Bay Area Postgres Meetup 16
October 2018 Optimizing Distinct Count in PostgreSQL with HLL @srirampur @citusdata @BayAreaPostgres

•Sai Srirampur a.k.a Sai •Early Engineer at Citus Data •Joined
so people won’t have to worry about scaling their Postgres database •Love Ping-pong •@saisrirampur, @citusdata, @BayAreaPostgres

Sai Srirampur | Citus Data | SF Bay Area Postgres
Meetup

Happy to be at SF Bay Area PostgreSQL Meetup to
talk about HyperLogLog Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

Optimizing Distinct Count in PostgreSQL w/HLL • What is HyperLogLog
(HLL)? • postgresql-hll • Rollups for COUNT(DISTINCT) • Distributed COUNT(DISTINCT) • Q&A Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

What is hll? 6

What is COUNT(DISTINCT)? • Number of unique elements (cardinality) in
given data • Useful to find things like… ◦ # unique users visited your web page ◦ # unique products in your inventory Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup apple apple apple orange 3 orange banana banana

Problems with the traditional way to do COUNT(DISTINCT) in Postgres
• Slow • High memory footprint • Cannot work with streaming data Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

There is a better way!!! HyperLogLog(HLL) in PostgreSQL • Approximation
algorithm (“sketch” algorithm) • Estimates COUNT(DISTINCT)/cardinality of given data • Mathematically proven error bounds • Link to Paper Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

Is it OK to approximate? Sai Srirampur | Citus Data
| SF Bay Area Postgres Meetup

Is it OK to approximate? It depends… Sai Srirampur |
Citus Data | SF Bay Area Postgres Meetup

Examples of when OK to approximate? • Count the felonies
associated to a person - NOT OK • Unique sessions on my website - OK Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

4 reasons I HLL — vs. COUNT(DISTINCT) • Very fast
• Low memory footprint • Can work with streaming data • Can merge estimations of two separate datasets efficiently Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

How does HLL work? The Steps 1. Hash all elements
a. Ensures uniform data distribution b. Can treat all data types same 2. Observing rare-bit patterns 3. Stochastic averaging Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

How does HLL work? - Observing rare bit patterns (1)
Alice --------------> 645403841 --------------> 0010...001 Number of leading zeros: 2 Maximum number of leading zeros: 2 hash binary Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

How does HLL work? - Observing rare bit patterns (2)
Bob --------------> 1492309842 --------------> 0101...010 Number of leading zeros: 1 Maximum number of leading zeros: 2 hash binary Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

How does HLL work? - Observing rare bit patterns Maximum
number of leading zeros: 7 Cardinality Estimation: 27 ... Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

Cardinality Estimation - Probabilistic If I see 2n elements I
can probabilistically say that at least 1 element has n leading zeros. # leading zeros probability cardinality 1 1/21 2 2 1/22 22 3 1/23 23 ……... n 1/2n 2n Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

Problem…. >>> SUPPOSE DATASET WITH ONLY 1 ELEMENT Charlie -------------->
0 --------------> 00000...0000 Number of leading zeros: 32 Maximum number of leading zeros: 32 Cardinality Estimation: 232 hash binary Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

Solution: Stochastic Averaging Measuring same thing repeatedly & taking the
average. Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

Burak Yucesoy | Citus Data | PGConf EU

Solution: Stochastic Averaging • Can be done via a multiple
hash functions. • Hard to generate & maintain. • Trick is: Partition your data! Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

How does HLL work? Stochastic Averaging Data Partition 1 Partition
3 Partition 2 7 228.968... Estimation 27 25 212 5 12 Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

How does HLL work? Stochastic Averaging 01000101...010 First m bits
to decide partition number Remaining bits to count leading zeros Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

Error rate of HLL • Typical Error Rate: 1.04 /
sqrt(number of partitions) • Memory need is # partitions * log(log(max. value in hash space)) • Memory vs accuracy tradeoff Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

Error rate of HLL... Estimate cardinalities well beyond 109 with
1% error rate while using a memory of only 6 kilobytes Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

Why does HLL work? It turns out, combination of lots
of bad estimation is a good estimation Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup “

Some interesting examples... Alice Alice Alice … … … Alice
Partition 1 Partition 8 Partition 2 0 2 0 1.103... Harmonic Mean 20 22 20 ... ... ... Alice --------------> 645403841 --------------> 00100110...001 hash binary Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

Some interesting examples... Charlie Partition 1 Partition 8 Partition 2
29 0 0 1.142... Harmonic Mean 229 20 20 ... ... ... Charlie --------------> 0 --------------> 00000000...000 hash binary Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

postgresql-hll 31

postgresql-hll • Implementation of the hll algorithm for postgres. •
Extension to perform COUNT(DISTINCT) approximations in Postgres • Introduces hll data type. “1280 bytes hll can estimate the count of tens of billions of distinct values with only a few percent error.” Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

hll_hash_<data_type>(element) returns hll_hash hll_add(hash) returns hll hll_union(hll) returns hll hll_add_agg(hash)
returns hll hll_union_agg(hll) returns hll hll_cardinality(hll) returns integer postgresql-hll: functions & aggregations Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

About the postgresql-hll extension to Postgres • https://github.com/citusdata/postgresql-hll (created by
engineers at Neustar, now maintained by Citus Data) • Companies using postgresql-hll for dashboards • Algolia • Microsoft for Windows telemetry Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

Rollups for count distincts 35

Pre-computed aggregates for a period and set of (group by)
dimensions. Can be further filtered and aggregated to generate charts. What is a rollup? 36 Period Customer Country Site Hit Count SELECT… Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

• Fast (indexed) lookups of aggregates • Avoid expensive repeated
computation • Rollups can be further aggregated • Rollups are compact, can be kept over longer periods What are the benefits of rollups? 37 Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

Append new data to a raw events table: COPY events
FROM ... Periodically aggregate events into rollup table: INSERT INTO rollup SELECT … FROM events … GROUP BY … Application queries the rollup table: SELECT … FROM rollup WHERE customer_id = 1238 … Postgres recipe for real-time analytics 38 Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

Using COUNT(DISTINCT) to do incremental rollups CREATE TABLE hourly_rollup (
customer_id bigint not null, period timestamptz not null, unique_ips bigint not null, PRIMARY KEY (customer_id, period) ); SELECT sum(unique_ips) from hourly_rollup; Incremental aggregation on COUNT(DISTINCT) 39 Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

Use hll_add_agg & hll_union to do incremental rollups. CREATE TABLE
hourly_rollup ( customer_id bigint not null, period timestamptz not null, unique_ips hll not null, PRIMARY KEY (customer_id, period) ); INSERT INTO hourly_rollup SELECT customer_id, date_trunc('hour', created_at), hll_add_agg(ip) FROM page_views WHERE event_time BETWEEN start_time AND end_time GROUP BY 1, 2 ON CONFLICT (customer_id, period) DO UPDATE SET unique_ips = hll_union(unique_ips, EXCLUDED.unique_ips); Incremental aggregation using HLL 40 Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

Use hll_union_agg to merge HLL objects. Use hll_cardinality to extract
distinct count. -- HLL SELECT period::date, hll_cardinality(hll_union_agg(unique_ips)) AS uniques FROM hourly_rollup WHERE customer_id = 1283 AND period >= now() - interval '1 week' GROUP BY 1 ORDER BY 1; period │ unqiues ────────────┼───────── 2018-08-29 │ 14712 2018-08-30 │ 33280 … (7 rows) Dashboard queries with HLL 41 Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

Define a table for storing raw events: CREATE TABLE events(
event_id bigserial, event_time timestamptz default now(), customer_id bigint, event_type text, country text, browser text, device_id bigint, session_id bigint, details jsonb ); Raw data table 42 Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

Rollup table “hll datatype to store intermediate results.” CREATE TABLE
rollup_events_1hr ( customer_id bigint, event_type varchar, country varchar, browser varchar, hour timestamp, event_count bigint, device_distinct_count hll, session_distinct_count hll );

Rollup query INSERT INTO rollup_events_1hr SELECT customer_id, Event_type, country, browser,
date_trunc('hour', event_time) as hour, count(*) as event_count, hll_add_agg(hll_hash_bigint(device_id)) as device_distinct_count, hll_add_agg(hll_hash_bigint(session_id)) as session_distinct_count FROM events WHERE event_time BETWEEN now() AND now() - ‘1hr’ GROUP BY customer_id,event_type,country,browser,hour ON CONFLICT (customer_id,event_type,country,browser,hour) DO UPDATE SET event_count=rollup_events_1hr.event_count+excluded.event_count, device_distinct_count = hll_union(rollup_events_1hr.device_distinct_count,excluded.device_distinct_count), session_distinct_count= hll_union(rollup_events_1hr.session_distinct_count,excluded.session_distinct_count);

Distributed Count Distinct 47

What is distributed COUNT(DISTINCT)? 48 Sai Srirampur | Citus Data
| SF Bay Area Postgres Meetup Worker Node 1 logins_001 Coordinator Worker Node 2 logins_002 Worker Node 3 logins_003

• Your data is too big to fit in memory
of single machine • Naive approach for COUNT(DISTINCT) requires too much memory Why do we need distributed COUNT(DISTINCT)? Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

Why is distributed COUNT(DISTINCT) so difficult? Worker Node 1 logins_001
Coordinator SELECT COUNT(DISTINCT username) FROM logins; Worker Node 2 logins_002 Worker Node 3 logins_003 SELECT COUNT(DISTINCT username) FROM ...; Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

• Pull all distinct data to one node & count
there. (Doesn’t scale) • Repartition data on the fly. (Scales but it’s very slow) • Use HyperLogLog. (Scales & fast) Some possible approaches Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

COUNT (DISTINCT) Aggregate in Citus 53 Coordinator events SELECT count(distinct
username) from events; SELECT hll_add_agg(hll_hash(username)) result from events_001 Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup SELECT hll_cardinality(hll_union_agg(result)) from intermediate_results; … SELECT hll_add_agg(hll_hash(username)) result from events_002

INSERT … SELECT can be parallelised across shards. Computing rollup
tables in Citus 54 Coordinator events INSERT INTO rollup SELECT … FROM events GROUP BY … rollup INSERT INTO rollup_102182 SELECT … FROM events_102010 GROUP BY … INSERT INTO rollup_102180 SELECT … FROM events_102008 GROUP BY … Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

SELECT on rollup for a particular customer (from the dashboard)
can be routed to the appropriate shard. Querying rollups in Citus 55 Coordinator events SELECT … FROM rollup WHERE tenant_id = 12834 … … rollup SELECT … FROM events_102180 WHERE tenant_id = 1283 … … Sai Srirampur | Citus Data | SF Bay Area Postgres Meetup

Demo 56

Thank you!! @saisrirampur | @BayAreaPostgres | @citusdata

Optimizing Distinct Count in PostgreSQL with HL...

Optimizing Distinct Count in PostgreSQL with HLL | SF Bay Area PostgreSQL Meetup | Sai Srirampur

More Decks by Citus Data

Other Decks in Technology

Featured

Transcript