Real-time Analytics on PostgreSQL at Any Scale | Postgres User Group NL | Marco Slot

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November
2018 Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018 Real-time Analytics on PostgreSQL at any Scale Marco Slot <[email protected]>

2018 You offer a product or service (e.g. SaaS, IoT platform, network telemetry, …) that generates large volumes of time series data. How to build an analytical dashboard for your customers that: • Supports a large number of concurrent users • Reflects new data within minutes • Has subsecond response times • Supports advanced analytics What is real-time analytics? 2 (Heap Analytics)

2018 Pipeline of Collect - Aggregate - Query: Real-time analytics architecture 3 Event source Event source Event source Event source Storage (Database) Aggregate Rollups (Database) Dashboard (App) Collect Queries

2018 Pipeline of Collect - Aggregate - Query: Real-time analytics architecture 4 Event source Event source Event source Event source Storage (Database) Aggregate Rollups (Database) Dashboard (App) Collect Queries Postgres/Citus

2018 Define a table for storing raw events: CREATE TABLE events ( event_id bigserial, event_time timestamptz default now(), customer_id bigint, event_type text, … event_details jsonb ); Raw data table 5

2018 COPY is by far the fastest way of loading data. COPY events (customer_id, event_time, … ) FROM STDIN; A few parallel COPY streams can load hundreds of thousands of events per second! Load data 6

2018 To achieve fast data loading: • Use COPY • Don’t use indexes To achieve fast reading of new events for aggregation: • Use an index Fast data loading 7

2018 To achieve fast data loading: • Use COPY • Don’t use large indexes To achieve fast reading of new events for aggregation: • Use an index Block-range index is suitable for ordered columns: CREATE INDEX event_time_idx ON events USING BRIN (event_time); Fast data loading 8

2018 Pre-computed aggregates for a period and set of (group by) dimensions. Can be further filtered and aggregated to generate charts. What is a rollup? 9 Period Customer Country Site Hit Count SELECT…

2018 Append new data to a raw events table (avoid indexes!): COPY events FROM ... Periodically aggregate events into rollup table (index away!): INSERT INTO rollup SELECT … FROM events … GROUP BY … Application queries the rollup table: SELECT … FROM rollup WHERE customer_id = 1238 … Postgres recipe for real-time analytics 10

2018 Keep your data sorted into buckets Partitioning 11

2018 Partitioning keeps indexes small by dividing tables into partitions: Benefits: • Avoid fragmentation • Smaller indexes • Partition pruning for queries that filter by partition column • Drop old data quickly, without bloat/fragmentation Partitioning 12 COPY COPY

2018 Defining a partitioned table: CREATE TABLE events (...) PARTITION BY (event_time); Setting up hourly partitioning with pg_partman: SELECT partman.create_parent('public.events', 'event_time', 'native', 'hourly'); https://www.citusdata.com/blog/2018/01/24/citus-and-pg-partman-creating-a-sca lable-time-series-database-on-PostgreSQL/ CREATE EXTENSION pg_partman 13

2018 If you’re using partitioning, pg_partman can drop old partitions: UPDATE partman.part_config SET retention_keep_table = false, retention = '1 month' WHERE parent_table = 'public.events'; Periodically run maintenance: SELECT partman.run_maintenance(); Expiring old data in a partitioned table 14

2018 Run pg_partman maintenance every hour using pg_cron: SELECT cron.schedule('3 * * * *', $$ SELECT partman.run_maintenance() $$); https://github.com/citusdata/pg_cron Periodic partitioning maintenance 15

2018 High vs. Low Cardinality Designing Rollup Tables 16

2018 Define rollup tables containing aggregates: CREATE TABLE rollup_by_period_and_dimensions ( <period> <dimensions> <aggregates> primary key (<dimensions>,<period>) ); Primary key index covers many queries, can also add additional indices: CREATE INDEX usc_idx ON rollup (customer_id, site_id); Rollup table 17

2018 Two rollups are smaller than one: A*B + A*C < A*B*C But… up to 2x more aggregation work. Choosing granularity and dimensions 18 Time Customer Country Aggregates Time Customer Site Aggregates Time Customer Country Site Aggregates ~100 rows per period/customer ~20 rows per period/customer ~20*100=2000 rows per period/customer

2018 Find balance between query performance and table management. 1. Identify dimensions, metrics (aggregates) 2. Try rollup with all dimensions: 3. Test compression/performance (goal is >5x smaller) 4. If too slow / too big, split rollup table based on query patterns 5. Go to 3 Usually ends up with 5-10 rollup tables Guidelines for designing rollups 19

2018 Append-only vs. Incremental Running Aggregations 20

2018 Use INSERT INTO rollup SELECT … FROM events … to populate rollup table. Append-only aggregation (insert): Supports all aggregates, including exact distinct, percentiles Harder to handle late data Incremental aggregation (upsert): Supports late data Cannot handle all aggregates (though can approximate using HLL, TopN) Append-only vs. Incremental 21

2018 Aggregate events for a particular time period and append them to the rollup table, once all the data for the period is available. INSERT INTO rollup SELECT period, dimensions, aggregates FROM events WHERE event_time::date = '2018-09-04' GROUP BY period, dimensions; Should keep track of which periods have been aggregated. Append-only Aggregation 22

2018 Aggregate new events and upsert into rollup table. INSERT INTO rollup SELECT period, dimensions, aggregates FROM events WHERE event_id BETWEEN s AND e GROUP BY period, dimensions ON CONFLICT (dimensions, period) DO UPDATE SET aggregates = aggregates + EXCLUDED.aggregates; Need to be able to incrementally build aggregates. Need to keep track of which events have been aggregated. Incremental Aggregation 23

2018 Incremental aggregation Technique for incremental aggregation using a sequence number shown on the Citus Data blog. Incrementally approximate distinct count: HyperLogLog extension Incrementally approximate top N: TopN extension 24

2018 CREATE EXTENSION Citus Scaling out your analytics pipeline 25

2018 Citus is an open source extension to Postgres (9.6, 10, 11) for transparently distributing tables across many Postgres servers. CREATE EXTENSION citus 26 Coordinator create_distributed_table('events', 'customer_id'); events

2018 Multi-tenancy Tenant ID provides a natural sharding dimension for many applications. Citus automatically co-locates event and rollup data for the same SELECT create_distributed_table('events', 'tenant_id'); SELECT create_distributed_table('rollup', 'tenant_id'); Aggregations can be done locally, without network traffic: INSERT INTO rollup SELECT tenant_id, … FROM events … Dashboard queries are always for a particular tenant: SELECT … FROM rollup WHERE tenant_id = 1238 … 27

2018 COPY asynchronously scatters rows to different shards Data loading in Citus 28 Coordinator COPY events

2018 INSERT … SELECT can be parallelised across shards. Aggregation in Citus 29 Coordinator events create_distributed_table('rollup', 'customer_id'); INSERT INTO rollup SELECT … FROM events GROUP BY customer_id, … rollup INSERT INTO rollup_102182 SELECT … FROM events_102010 GROUP BY … INSERT INTO rollup_102180 SELECT … FROM events_102008 GROUP BY …

2018 SELECT on rollup for a particular customer (from the dashboard) can be routed to the appropriate shard. Querying rollups in Citus 30 Coordinator events SELECT … FROM rollup WHERE customer_id = 12834 … … rollup SELECT … FROM events_102180 WHERE customer_id = 1283 … …

2018 You should use: • COPY to load raw data into a table • BRIN index to find new events during aggregation • Partitioning with pg_partman to expire old data • Rollup tables built from raw event data • Append-only aggregation if you need exact percentile/distinct count • Incremental aggregation if you can have late data • HLL to incrementally approximate distinct count • TopN to incrementally approximate heavy hitters • Citus to scale out Summary 31

2018 [email protected] Q&A 32

Real-time Analytics on PostgreSQL at Any Scale ...

Real-time Analytics on PostgreSQL at Any Scale | Postgres User Group NL | Marco Slot

Citus Data

More Decks by Citus Data

Other Decks in Technology

Featured

Transcript

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November