Slide 1

Real-time Analytics on PostgreSQL at any Scale

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018

Slide 2

What is real-time analytics?

You offer a product or service (e.g. SaaS, IoT platform, network telemetry, …) that generates large volumes of time series data. How do you build an analytical dashboard for your customers that:
• Supports a large number of concurrent users
• Reflects new data within minutes
• Has subsecond response times
• Supports advanced analytics

(Heap Analytics)

Slide 3

Real-time analytics architecture

Pipeline of Collect - Aggregate - Query:

(Diagram: event sources → Collect → Storage (Database) → Aggregate → Rollups (Database) → Queries → Dashboard (App))

Slide 4

Real-time analytics architecture

Pipeline of Collect - Aggregate - Query:

(Diagram: the same pipeline, with Postgres/Citus providing the Storage, Aggregate, and Rollups stages)

Slide 5

Raw data table

Define a table for storing raw events:

CREATE TABLE events (
  event_id bigserial,
  event_time timestamptz default now(),
  customer_id bigint,
  event_type text,
  …
  event_details jsonb
);

Slide 6

Load data

COPY is by far the fastest way of loading data.

COPY events (customer_id, event_time, …) FROM STDIN;

A few parallel COPY streams can load hundreds of thousands of events per second!
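For illustration, a COPY from a CSV file using the columns of the events table defined above (the file path and format are assumptions, not from the talk):

```sql
-- Unlisted columns (event_id, event_time) fall back to their defaults
COPY events (customer_id, event_type, event_details)
FROM '/data/events.csv' WITH (FORMAT csv);
```

Running several such COPY commands in parallel sessions over different files gives the multiple load streams mentioned above.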

Slide 7

Fast data loading

To achieve fast data loading:
• Use COPY
• Don’t use indexes

To achieve fast reading of new events for aggregation:
• Use an index

Slide 8

Fast data loading

To achieve fast data loading:
• Use COPY
• Don’t use large indexes

To achieve fast reading of new events for aggregation:
• Use an index

A block-range index (BRIN) is suitable for ordered columns:

CREATE INDEX event_time_idx ON events USING BRIN (event_time);
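The aggregation step can then scan only recently written blocks. A sketch of a query the BRIN index serves well (the 5-minute window and grouping columns are illustrative):

```sql
-- BRIN lets Postgres skip every block whose event_time range
-- falls entirely outside the filter
SELECT customer_id, event_type, count(*)
FROM events
WHERE event_time >= now() - interval '5 minutes'
GROUP BY customer_id, event_type;
```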

Slide 9

What is a rollup?

Pre-computed aggregates for a period and set of (group by) dimensions. Can be further filtered and aggregated to generate charts.

(Example rollup columns: Period | Customer | Country | Site | Hit Count, queried via SELECT …)

Slide 10

Postgres recipe for real-time analytics

Append new data to a raw events table (avoid indexes!):

COPY events FROM ...

Periodically aggregate events into a rollup table (index away!):

INSERT INTO rollup SELECT … FROM events … GROUP BY …

Application queries the rollup table:

SELECT … FROM rollup WHERE customer_id = 1238 …

Slide 11

Partitioning

Keep your data sorted into buckets.

Slide 12

Partitioning

Partitioning keeps indexes small by dividing tables into partitions.

Benefits:
• Avoid fragmentation
• Smaller indexes
• Partition pruning for queries that filter by the partition column
• Drop old data quickly, without bloat/fragmentation

Slide 13

CREATE EXTENSION pg_partman

Defining a partitioned table:

CREATE TABLE events (...) PARTITION BY RANGE (event_time);

Setting up hourly partitioning with pg_partman:

SELECT partman.create_parent('public.events', 'event_time', 'native', 'hourly');

https://www.citusdata.com/blog/2018/01/24/citus-and-pg-partman-creating-a-scalable-time-series-database-on-PostgreSQL/

Slide 14

Expiring old data in a partitioned table

If you’re using partitioning, pg_partman can drop old partitions:

UPDATE partman.part_config
SET retention_keep_table = false, retention = '1 month'
WHERE parent_table = 'public.events';

Periodically run maintenance:

SELECT partman.run_maintenance();

Slide 15

Periodic partitioning maintenance

Run pg_partman maintenance every hour using pg_cron:

SELECT cron.schedule('3 * * * *', $$SELECT partman.run_maintenance()$$);

https://github.com/citusdata/pg_cron

Slide 16

Designing Rollup Tables

High vs. Low Cardinality

Slide 17

Rollup table

Define rollup tables containing aggregates, keyed by period and dimensions:

CREATE TABLE rollup (
  …  -- period, dimension, and aggregate columns
  primary key (period, dimensions)
);

The primary key index covers many queries; you can also add additional indexes:

CREATE INDEX usc_idx ON rollup (customer_id, site_id);

Slide 18

Choosing granularity and dimensions

Two rollups are smaller than one: A*B + A*C < A*B*C. But… up to 2x more aggregation work.

(Time, Customer, Country) → ~100 rows per period/customer
(Time, Customer, Site) → ~20 rows per period/customer
(Time, Customer, Country, Site) → ~20*100 = 2000 rows per period/customer

Slide 19

Guidelines for designing rollups

Find a balance between query performance and table management.

1. Identify dimensions and metrics (aggregates)
2. Try a rollup with all dimensions
3. Test compression/performance (goal is >5x smaller)
4. If too slow / too big, split the rollup table based on query patterns
5. Go to 3

You usually end up with 5-10 rollup tables.

Slide 20

Running Aggregations

Append-only vs. Incremental

Slide 21

Append-only vs. Incremental

Use INSERT INTO rollup SELECT … FROM events … to populate the rollup table.

Append-only aggregation (insert):
• Supports all aggregates, including exact distinct, percentiles
• Harder to handle late data

Incremental aggregation (upsert):
• Supports late data
• Cannot handle all aggregates (though can approximate using HLL, TopN)

Slide 22

Append-only Aggregation

Aggregate events for a particular time period and append them to the rollup table, once all the data for the period is available.

INSERT INTO rollup
SELECT period, dimensions, aggregates
FROM events
WHERE event_time::date = '2018-09-04'
GROUP BY period, dimensions;

Should keep track of which periods have been aggregated.
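One way to keep track is a small bookkeeping table, written in the same transaction as the rollup insert; the rollup_periods name below is an illustration, not from the talk:

```sql
CREATE TABLE rollup_periods (
  period date primary key,
  aggregated_at timestamptz not null default now()
);

-- Aggregate a completed day and record it atomically:
BEGIN;
INSERT INTO rollup
SELECT period, dimensions, aggregates
FROM events
WHERE event_time::date = '2018-09-04'
GROUP BY period, dimensions;
INSERT INTO rollup_periods (period) VALUES ('2018-09-04');
COMMIT;
```

Because both inserts commit together, a period is either fully aggregated and recorded, or neither.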

Slide 23

Incremental Aggregation

Aggregate new events and upsert into the rollup table.

INSERT INTO rollup
SELECT period, dimensions, aggregates
FROM events
WHERE event_id BETWEEN s AND e
GROUP BY period, dimensions
ON CONFLICT (dimensions, period) DO UPDATE
SET aggregates = aggregates + EXCLUDED.aggregates;

Need to be able to incrementally build aggregates.
Need to keep track of which events have been aggregated.
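A minimal way to track which events have been aggregated is to record the highest event_id processed so far; the rollup_progress table below is a hypothetical sketch (the Citus Data blog describes a more robust sequence-number-based variant):

```sql
CREATE TABLE rollup_progress (
  rollup_name text primary key,
  last_aggregated_id bigint not null default 0
);

-- Determine the window [s, e] of new events:
SELECT last_aggregated_id + 1 AS s,
       (SELECT max(event_id) FROM events) AS e
FROM rollup_progress
WHERE rollup_name = 'rollup';

-- Run the upsert above with these bounds, then advance the marker
-- in the same transaction:
UPDATE rollup_progress
SET last_aggregated_id = <e>
WHERE rollup_name = 'rollup';
```

Caveat: with concurrent writers, event_ids can become visible out of commit order, so a production version needs the additional locking described in the blog post.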

Slide 24

Incremental aggregation

A technique for incremental aggregation using a sequence number is shown on the Citus Data blog.

Incrementally approximate distinct count: HyperLogLog extension
Incrementally approximate top N: TopN extension
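For example, the hll extension stores a mergeable sketch instead of an exact count, which makes distinct counts upsertable; the visitor_id column and the (period, customer_id) key below are illustrative assumptions:

```sql
CREATE EXTENSION hll;

-- Store an HLL sketch per rollup row instead of an exact distinct count
ALTER TABLE rollup ADD COLUMN distinct_visitors hll;

-- Incremental upsert: merge new sketches into existing ones with ||
INSERT INTO rollup (period, customer_id, distinct_visitors)
SELECT date_trunc('hour', event_time), customer_id,
       hll_add_agg(hll_hash_bigint(visitor_id))
FROM events
GROUP BY 1, 2
ON CONFLICT (period, customer_id) DO UPDATE
SET distinct_visitors = rollup.distinct_visitors || EXCLUDED.distinct_visitors;

-- Approximate distinct count across any set of periods:
SELECT # hll_union_agg(distinct_visitors)
FROM rollup
WHERE customer_id = 1238;
```

Because HLL sketches can be unioned, the same rollup answers distinct-count queries over arbitrary time ranges, which an exact count column cannot do.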

Slide 25

Scaling out your analytics pipeline

CREATE EXTENSION citus

Slide 26

CREATE EXTENSION citus

Citus is an open source extension to Postgres (9.6, 10, 11) for transparently distributing tables across many Postgres servers.

SELECT create_distributed_table('events', 'customer_id');

(Diagram: coordinator node distributing the events table across worker shards)

Slide 27

Multi-tenancy

Tenant ID provides a natural sharding dimension for many applications. Citus automatically co-locates event and rollup data for the same tenant:

SELECT create_distributed_table('events', 'tenant_id');
SELECT create_distributed_table('rollup', 'tenant_id');

Aggregations can be done locally, without network traffic:

INSERT INTO rollup SELECT tenant_id, … FROM events …

Dashboard queries are always for a particular tenant:

SELECT … FROM rollup WHERE tenant_id = 1238 …

Slide 28

Data loading in Citus

COPY asynchronously scatters rows to different shards.

(Diagram: COPY events on the coordinator fanning out to shards on the workers)

Slide 29

Aggregation in Citus

INSERT … SELECT can be parallelised across shards.

SELECT create_distributed_table('rollup', 'customer_id');

INSERT INTO rollup SELECT … FROM events GROUP BY customer_id, …

On the workers, this runs per pair of co-located shards, e.g.:

INSERT INTO rollup_102180 SELECT … FROM events_102008 GROUP BY …
INSERT INTO rollup_102182 SELECT … FROM events_102010 GROUP BY …

Slide 30

Querying rollups in Citus

A SELECT on the rollup table for a particular customer (from the dashboard) can be routed to the appropriate shard:

SELECT … FROM rollup WHERE customer_id = 12834 …

becomes, on the worker holding that shard:

SELECT … FROM rollup_102180 WHERE customer_id = 12834 …

Slide 31

Summary

You should use:
• COPY to load raw data into a table
• BRIN index to find new events during aggregation
• Partitioning with pg_partman to expire old data
• Rollup tables built from raw event data
• Append-only aggregation if you need exact percentile/distinct count
• Incremental aggregation if you can have late data
• HLL to incrementally approximate distinct count
• TopN to incrementally approximate heavy hitters
• Citus to scale out

Slide 32

Q&A

[email protected]