Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Real-time Analytics on PostgreSQL at Any Scale | Postgres User Group NL | Marco Slot

Citus Data
November 29, 2018

Real-time Analytics on PostgreSQL at Any Scale | Postgres User Group NL | Marco Slot

Building a dashboard that provides real-time insights into a large data stream is a challenging problem. The database needs to support high ingest rates, handle low latency (subsecond) analytical queries from many concurrent users, and reflect new data as soon as possible, while keeping data over long periods. This talk will discuss how you can build a scalable real-time analytics pipeline using PostgreSQL with extensions such as HLL and pg_partman, and how you can scale out across many servers using Citus.

Citus Data

November 29, 2018
Tweet

More Decks by Citus Data

Other Decks in Technology

Transcript

  1. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    Real-time Analytics on
    PostgreSQL at any Scale
    Marco Slot

    View Slide

  2. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    You offer a product or service (e.g. SaaS, IoT platform, network telemetry, …)
    that generates large volumes of time series data.
    How to build an analytical dashboard for your customers that:
    • Supports a large number of concurrent users
    • Reflects new data within minutes
    • Has subsecond response times
    • Supports advanced analytics
    What is real-time analytics?
    2
    (Heap Analytics)

    View Slide

  3. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    Pipeline of Collect - Aggregate - Query:
    Real-time analytics architecture
    3
    Event source
    Event source
    Event source
    Event source
    Storage
    (Database) Aggregate
    Rollups
    (Database)
    Dashboard
    (App)
    Collect
    Queries

    View Slide

  4. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    Pipeline of Collect - Aggregate - Query:
    Real-time analytics architecture
    4
    Event source
    Event source
    Event source
    Event source
    Storage
    (Database) Aggregate
    Rollups
    (Database)
    Dashboard
    (App)
    Collect
    Queries
    Postgres/Citus

    View Slide

  5. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    Define a table for storing raw events:
    CREATE TABLE events (
    event_id bigserial,
    event_time timestamptz default now(),
    customer_id bigint,
    event_type text,

    event_details jsonb
    );
    Raw data table
    5

    View Slide

  6. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    COPY is by far the fastest way of loading data.
    COPY events (customer_id, event_time, … ) FROM STDIN;
    A few parallel COPY streams can load hundreds of thousands of events per
    second!
    Load data
    6

    View Slide

  7. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    To achieve fast data loading:
    • Use COPY
    • Don’t use indexes
    To achieve fast reading of new events for aggregation:
    • Use an index
    Fast data loading
    7

    View Slide

  8. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    To achieve fast data loading:
    • Use COPY
    • Don’t use large indexes
    To achieve fast reading of new events for aggregation:
    • Use an index
    Block-range index is suitable for ordered columns:
    CREATE INDEX event_time_idx ON events USING BRIN (event_time);
    Fast data loading
    8

    View Slide

  9. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    Pre-computed aggregates for a period and set of (group by) dimensions.
    Can be further filtered and aggregated to generate charts.
    What is a rollup?
    9
    Period Customer Country Site Hit Count
    SELECT…

    View Slide

  10. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    Append new data to a raw events table (avoid indexes!):
    COPY events FROM ...
    Periodically aggregate events into rollup table (index away!):
    INSERT INTO rollup SELECT … FROM events … GROUP BY …
    Application queries the rollup table:
    SELECT … FROM rollup WHERE customer_id = 1238 …
    Postgres recipe for real-time analytics
    10

    View Slide

  11. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    Keep your data sorted into buckets
    Partitioning
    11

    View Slide

  12. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    Partitioning keeps indexes small by dividing tables into partitions:
    Benefits:
    • Avoid fragmentation
    • Smaller indexes
    • Partition pruning for queries that filter by partition column
    • Drop old data quickly, without bloat/fragmentation
    Partitioning
    12
    COPY COPY

    View Slide

  13. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    Defining a partitioned table:
    CREATE TABLE events (...) PARTITION BY (event_time);
    Setting up hourly partitioning with pg_partman:
    SELECT partman.create_parent('public.events', 'event_time',
    'native', 'hourly');
    https://www.citusdata.com/blog/2018/01/24/citus-and-pg-partman-creating-a-sca
    lable-time-series-database-on-PostgreSQL/
    CREATE EXTENSION pg_partman
    13

    View Slide

  14. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    If you’re using partitioning, pg_partman can drop old partitions:
    UPDATE partman.part_config
    SET retention_keep_table = false, retention = '1 month'
    WHERE parent_table = 'public.events';
    Periodically run maintenance:
    SELECT partman.run_maintenance();
    Expiring old data in a partitioned table
    14

    View Slide

  15. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    Run pg_partman maintenance every hour using pg_cron:
    SELECT cron.schedule('3 * * * *', $$
    SELECT partman.run_maintenance()
    $$);
    https://github.com/citusdata/pg_cron
    Periodic partitioning maintenance
    15

    View Slide

  16. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    High vs. Low Cardinality
    Designing Rollup Tables
    16

    View Slide

  17. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    Define rollup tables containing aggregates:
    CREATE TABLE rollup_by_period_and_dimensions (



    primary key (,)
    );
    Primary key index covers many queries, can also add additional indices:
    CREATE INDEX usc_idx ON rollup (customer_id, site_id);
    Rollup table
    17

    View Slide

  18. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    Two rollups are smaller than one: A*B + A*C < A*B*C
    But… up to 2x more aggregation work.
    Choosing granularity and dimensions
    18
    Time Customer Country Aggregates
    Time Customer Site Aggregates
    Time Customer Country Site Aggregates
    ~100 rows per period/customer
    ~20 rows per period/customer
    ~20*100=2000 rows per period/customer

    View Slide

  19. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    Find balance between query performance and table management.
    1. Identify dimensions, metrics (aggregates)
    2. Try rollup with all dimensions:
    3. Test compression/performance (goal is >5x smaller)
    4. If too slow / too big, split rollup table based on query patterns
    5. Go to 3
    Usually ends up with 5-10 rollup tables
    Guidelines for designing rollups
    19

    View Slide

  20. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    Append-only vs. Incremental
    Running Aggregations
    20

    View Slide

  21. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    Use INSERT INTO rollup SELECT … FROM events … to populate rollup table.
    Append-only aggregation (insert):
    Supports all aggregates, including exact distinct, percentiles
    Harder to handle late data
    Incremental aggregation (upsert):
    Supports late data
    Cannot handle all aggregates (though can approximate using HLL, TopN)
    Append-only vs. Incremental
    21

    View Slide

  22. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    Aggregate events for a particular time period and append them to the rollup
    table, once all the data for the period is available.
    INSERT INTO rollup
    SELECT period, dimensions, aggregates
    FROM events
    WHERE event_time::date = '2018-09-04'
    GROUP BY period, dimensions;
    Should keep track of which periods have been aggregated.
    Append-only Aggregation
    22

    View Slide

  23. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    Aggregate new events and upsert into rollup table.
    INSERT INTO rollup
    SELECT period, dimensions, aggregates
    FROM events
    WHERE event_id BETWEEN s AND e
    GROUP BY period, dimensions
    ON CONFLICT (dimensions, period) DO UPDATE
    SET aggregates = aggregates + EXCLUDED.aggregates;
    Need to be able to incrementally build aggregates.
    Need to keep track of which events have been aggregated.
    Incremental Aggregation
    23

    View Slide

  24. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    Incremental aggregation
    Technique for incremental aggregation using a sequence number shown on
    the Citus Data blog.
    Incrementally approximate distinct count:
    HyperLogLog extension
    Incrementally approximate top N:
    TopN extension
    24

    View Slide

  25. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    CREATE EXTENSION Citus
    Scaling out your analytics pipeline
    25

    View Slide

  26. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    Citus is an open source extension to Postgres (9.6, 10, 11) for transparently
    distributing tables across many Postgres servers.
    CREATE EXTENSION citus
    26
    Coordinator
    create_distributed_table('events', 'customer_id');
    events

    View Slide

  27. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    Multi-tenancy
    Tenant ID provides a natural sharding dimension for many applications.
    Citus automatically co-locates event and rollup data for the same
    SELECT create_distributed_table('events', 'tenant_id');
    SELECT create_distributed_table('rollup', 'tenant_id');
    Aggregations can be done locally, without network traffic:
    INSERT INTO rollup SELECT tenant_id, … FROM events …
    Dashboard queries are always for a particular tenant:
    SELECT … FROM rollup WHERE tenant_id = 1238 …
    27

    View Slide

  28. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    COPY asynchronously scatters rows to different shards
    Data loading in Citus
    28
    Coordinator
    COPY
    events

    View Slide

  29. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    INSERT … SELECT can be parallelised across shards.
    Aggregation in Citus
    29
    Coordinator
    events
    create_distributed_table('rollup', 'customer_id');
    INSERT INTO rollup
    SELECT … FROM events
    GROUP BY customer_id, …
    rollup
    INSERT INTO rollup_102182
    SELECT … FROM events_102010
    GROUP BY …
    INSERT INTO rollup_102180
    SELECT … FROM events_102008
    GROUP BY …

    View Slide

  30. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    SELECT on rollup for a particular customer (from the dashboard) can be
    routed to the appropriate shard.
    Querying rollups in Citus
    30
    Coordinator
    events SELECT … FROM rollup
    WHERE customer_id = 12834 …

    rollup
    SELECT … FROM events_102180
    WHERE customer_id = 1283 …

    View Slide

  31. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    You should use:
    • COPY to load raw data into a table
    • BRIN index to find new events during aggregation
    • Partitioning with pg_partman to expire old data
    • Rollup tables built from raw event data
    • Append-only aggregation if you need exact percentile/distinct count
    • Incremental aggregation if you can have late data
    • HLL to incrementally approximate distinct count
    • TopN to incrementally approximate heavy hitters
    • Citus to scale out
    Summary
    31

    View Slide

  32. Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
    [email protected]
    Q&A
    32

    View Slide