Upgrade to Pro — share decks privately, control downloads, hide ads and more …

A Distributed Geospatial Time Series Database | pgDay Paris 2018 | Marco Slot

A Distributed Geospatial Time Series Database | pgDay Paris 2018 | Marco Slot

Citus Data

March 15, 2018
Tweet

More Decks by Citus Data

Other Decks in Technology

Transcript

  1. A Distributed Geospatial
    Time series Database
    Postgres Extension Synergy

    View full-size slide

  2. Postgres has many third-party extensions:
    - PostGIS: Geospatial data
    - pg_partman: Auto-partitioning
    - pg_cron: Periodic jobs
    - Citus: Sharding
    - cstore_fdw: Columnar storage
    - mysql_fdw, oracle_fdw, tds_fdw: Query other database
    - …
    Postgres extensions
    2 Marco Slot | Citus Data | pgDay Paris 2018

    View full-size slide

  3. PostgreSQL 10 + PostGIS + pg_partman + Citus =
    ...
    3 Marco Slot | Citus Data | pgDay Paris 2018

    View full-size slide

  4. Postgres extensions
    4

    View full-size slide

  5. PostgreSQL 10 + PostGIS + pg_partman + Citus =
    A pretty good database for real-time analytics on
    large volumes of geospatial sensor data...
    5 Marco Slot | Citus Data | pgDay Paris 2018

    View full-size slide

  6. Geospatial queries using PostGIS
    6
    CREATE TABLE trips (

    pickup geometry(Point,4326),
    dropoff geometry(Point,4326)
    );
    Query next
    Neighbourhoods
    WHERE ST_Within(pickup,
    …)
    Marco Slot | Citus Data | pgDay Paris 2018

    View full-size slide

  7. Auto-partitioning using pg_partman
    7
    Disk
    Drop old data really fast
    Smaller indexes,
    faster insertion
    Faster SELECTs on recent data
    SELECT create_parent('trips', …)
    Optimise your database for time series data
    Marco Slot | Citus Data | pgDay Paris 2018

    View full-size slide

  8. Sharding using Citus
    8
    Nodes
    Distribute and parallelise all the things
    Always enough memory,
    CPU, storage, I/O throughput
    COPY with
    ingest parallelism
    Parallel rollups
    using INSERT...SELECT
    SELECT create_distributed_table('trips', …)
    Parallel SELECT
    Marco Slot | Citus Data | pgDay Paris 2018

    View full-size slide

  9. Shard by ID, Partition by time
    9
    Partitioning
    Disk x =
    Nodes
    Sharding
    Distributed time series database
    Marco Slot | Citus Data | pgDay Paris 2018

    View full-size slide

  10. Taxi data from:
    https://github.com/toddwschneider/nyc-taxi-data/
    Blog post:
    “Analyzing 1.1 Billion NYC Taxi and Uber Trips”
    Citus Cloud formation with 4 nodes, each 60GB of memory and 4 cores.
    A distributed geospatial time series database
    10 Marco Slot | Citus Data | pgDay Paris 2018

    View full-size slide

  11. 11
    PostGIS
    pg_partman
    Citus

    View full-size slide

  12. 12
    Data loading tool does a lot of
    pre-processing using PostGIS functions

    View full-size slide

  13. 13
    Distributed SELECT scans 30 million rows/sec

    View full-size slide