
PostgreSQL at any scale | Warsaw PostgreSQL User Group | Marco Slot

Citus Data
August 05, 2020


Citus is an open source extension for PostgreSQL that turns it into a distributed database. Citus can shard or replicate tables across a cluster of PostgreSQL servers and transparently routes or parallelizes queries across the cluster, allowing you to horizontally scale your database without losing any of the powerful PostgreSQL functionality. Citus is especially suitable for scaling Software-as-a-Service (SaaS) applications and applications that require real-time analytics on large data volumes. In this talk, I'll describe the internals of Citus and show you how you can use it to power data-intensive applications. I'll also give a live demo using the Hyperscale (Citus) service on Azure.

Marco Slot is a Principal Software Engineer on the Citus team at Microsoft and is the lead engineer on the Citus extension. He has been working on PostgreSQL extensions including Citus, pg_cron, and pg_auto_failover since 2014 when he joined Citus Data. Prior to Citus Data, Marco did a PhD in cooperative self-driving cars at Trinity College Dublin and helped build CloudFront and Route 53 at Amazon Web Services.



Transcript


  2. https://github.com/citusdata/citus
    https://azure.microsoft.com/en-us/services/postgresql/



  5. [Diagram: the application connects to a coordinator node, which fans out to worker nodes W1, W2, W3 … Wn.]
    A Citus cluster consists of multiple PostgreSQL servers with the Citus extension.

  6. SELECT create_distributed_table(
    'table_name',
    'distribution_column');


  7. APPLICATION:
    CREATE TABLE campaigns (…);
    SELECT create_distributed_table('campaigns', 'company_id');
    The coordinator node records the table in its metadata and creates the shards on the worker nodes W1 … Wn:
    CREATE TABLE campaigns_101; … CREATE TABLE campaigns_106;
    How Citus distributes tables across the database cluster

  8. Distributed tables
    create_distributed_table('companies', 'company_id')
    Data is hash-partitioned across worker nodes based on distribution column values.
    Limitations:
    • Unique constraints must include the distribution column
    • Foreign key/join (performance) limitations (more on that later)
    • Missing features: triggers, table inheritance (but you can use them...)
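For instance, the unique-constraint limitation above means a primary key on a distributed table must contain the distribution column. A minimal sketch, with table and column names that are illustrative rather than from the talk:

```sql
CREATE TABLE companies (
    company_id text NOT NULL,
    name       text,
    PRIMARY KEY (company_id)             -- is the distribution column
);

CREATE TABLE campaigns (
    company_id  text   NOT NULL,
    campaign_id bigint NOT NULL,
    spend       numeric,
    -- A key on campaign_id alone would be rejected for a distributed
    -- table; the key must include company_id, the distribution column.
    PRIMARY KEY (company_id, campaign_id)
);

SELECT create_distributed_table('companies', 'company_id');
SELECT create_distributed_table('campaigns', 'company_id');
```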

  9. APPLICATION:
    INSERT INTO campaigns (company_id, …)
    VALUES ('Microsoft', …);
    The coordinator hashes the distribution column value, hashtext('Microsoft') = 76743714, and looks up which shard's hash range contains it in its metadata:
    [0,134217727] → campaigns_102
    [134217728,268435455] → campaigns_103
    It then routes the statement to that shard on a worker node:
    INSERT INTO campaigns_102 VALUES …
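The hash ranges shown above live in the coordinator's metadata, and you can inspect them yourself. A sketch to run on the coordinator, assuming a distributed table named campaigns; the Citus catalog table pg_dist_shard holds one row per shard:

```sql
-- One row per shard: which table it belongs to and which
-- hash-value range it covers.
SELECT logicalrelid AS table_name,
       shardid,
       shardminvalue,
       shardmaxvalue
FROM pg_dist_shard
WHERE logicalrelid = 'campaigns'::regclass
ORDER BY shardminvalue::int;
```

The min/max values are stored as text; the cast to int works for hash-distributed tables, whose ranges partition the 32-bit hash space.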

  10. APPLICATION:
    COPY campaigns FROM STDIN WITH (format 'csv');
    Microsoft,2938,13,3.7,…
    Acme,386,13,12.8,…
    The coordinator routes each row by its distribution column value and streams the rows to the shards in parallel:
    COPY campaigns_101 FROM STDIN WITH (format 'binary')
    COPY campaigns_102 FROM STDIN WITH (format 'binary')
    COPY campaigns_103 FROM STDIN WITH (format 'binary')
    How Citus performs bulk loading across the database cluster
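From the client's point of view this is an ordinary COPY against the coordinator. A sketch using psql's \copy; the file name is illustrative:

```sql
-- From psql, connected to the coordinator. campaigns.csv is a
-- hypothetical local file with one CSV row per campaign; Citus
-- splits the stream across shards transparently.
\copy campaigns FROM 'campaigns.csv' WITH (format 'csv')
```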

  11. Citus COPY performance
    [Chart: rows loaded using COPY per second (higher is better) for a multi-threaded (10 threads) COPY with a primary key, comparing Citus (160 cores), Citus (64 cores), and PostgreSQL (64 cores); the axis runs from 0 to 1,600,000 rows/sec.]

  12. APPLICATION:
    SELECT company_id, sum(spend) AS campaign_spend
    FROM campaigns
    GROUP BY company_id;
    The coordinator rewrites the query into fragments that run in parallel on the shards, and merges their results:
    SELECT company_id, sum(spend) FROM campaigns_101 …
    SELECT company_id, sum(spend) FROM campaigns_102 …
    SELECT company_id, sum(spend) FROM campaigns_103 …
    How Citus distributes queries across the database cluster

  13. Citus parallel SELECT performance
    [Chart: query time in seconds (lower is better) for a simple analytical query on 100GB of data, comparing Citus (160 cores), Citus (64 cores), and PostgreSQL (64 cores); the axis runs from 0 to 400 seconds.]

  14. How Citus handles transactions in a multi-node cluster
    APPLICATION:
    BEGIN;
    UPDATE campaigns
    SET start_date = '2020-03-17'
    WHERE company_id = 'Microsoft';

    COMMIT;
    The coordinator routes the transaction to the worker node that holds the matching shard:
    BEGIN; UPDATE campaigns_102 SET …;

    COMMIT;

  15. Citus HammerDB TPC-C performance
    HammerDB TPC-C (1000 warehouses, 400 vusers), New Order Transactions Per Minute (NOPM), higher is better:
    Citus (256 cores): 606k
    Citus (160 cores): 434k
    Citus (64 cores): 207k
    PostgreSQL (64 cores): 254k

  16. Distributed tables can be co-located
    create_distributed_table('campaigns', 'company_id',
    colocate_with := 'companies')
    Distributed tables can be co-located with other distributed tables: the same hash range is always on the same worker node.
    Co-location enables:
    • full query push-down when filtering all tables by the distribution column
    • foreign keys between distributed tables that include the distribution column
    • efficient joins between distributed tables on the distribution column
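With campaigns co-located with companies, a join on the distribution column can be pushed down to the workers in its entirety. A sketch, with column names that are illustrative:

```sql
-- Both tables are distributed by company_id and co-located, so the
-- matching shards sit on the same worker. The join and aggregation
-- run shard-by-shard, with no data moving between nodes.
SELECT c.name, sum(a.spend) AS total_spend
FROM companies c
JOIN campaigns a ON a.company_id = c.company_id
GROUP BY c.name;
```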

  17. Reference tables
    create_reference_table('categories')
    Reference tables are replicated to all worker nodes.
    This enables:
    • foreign keys from distributed (or reference) tables on any column
    • efficient joins with distributed (or reference) tables on any column
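Because every worker holds a full copy of a reference table, a distributed table can reference it from any column, not just the distribution column. A sketch; the category_id column is an illustrative assumption:

```sql
CREATE TABLE categories (
    category_id int PRIMARY KEY,
    name        text
);
SELECT create_reference_table('categories');

-- The foreign key need not involve the distribution column of
-- campaigns, since each worker has the whole categories table locally.
ALTER TABLE campaigns
    ADD COLUMN category_id int REFERENCES categories (category_id);
```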

  18. Co-located joins


  19. Router queries


  20. Router queries power multi-tenant applications


  21. Benefits of Citus for multi-tenant applications


  22. Citus helps ASB onboard customers 20x faster
    “After migrating to Citus, we can onboard Vonto customers 20x faster, in 2 minutes vs. the 40+ minutes it used to take. And with the launch of Hyperscale (Citus) on Azure Database for PostgreSQL, we are excited to see what we can build next on Azure.”
    100 GB+ data
    Multi-tenant SaaS application
    Milliseconds latency

  23. Parallel operations power real-time analytics
    https://www.citusdata.com/blog/2018/06/14/scalable-incremental-data-aggregation/


  24. INSERT..SELECT transforms the data inside the
    database in parallel
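A typical pattern from the linked blog post is to aggregate raw events into a rollup table with INSERT..SELECT, which Citus executes in parallel across the workers when source and target are co-located. A sketch under assumed table names, not the blog post's exact incremental bookkeeping:

```sql
-- Hypothetical rollup: aggregate raw page-view events per minute.
-- The aggregation runs in parallel on the workers; ON CONFLICT lets
-- repeated runs add to existing rollup rows.
INSERT INTO page_views_rollup (site_id, minute, view_count)
SELECT site_id,
       date_trunc('minute', created_at) AS minute,
       count(*) AS view_count
FROM page_views_raw
WHERE created_at >= now() - interval '5 minutes'
GROUP BY site_id, minute
ON CONFLICT (site_id, minute)
DO UPDATE SET view_count = page_views_rollup.view_count
                           + EXCLUDED.view_count;
```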

  25. Benefits of Citus for real-time analytics
    Operation / achievable performance:
    Bulk loading: 1M rows/sec
    SELECT from raw data: 10-100M rows/sec
    INSERT..SELECT into rollup: 1-100M rows/sec
    SELECT on rollup table: 1k-100k queries/sec

  26. Microsoft Windows relies on Citus for mission-critical decisions
    “Ship/no-ship decisions for Microsoft Windows are made using Hyperscale (Citus), where our team runs on-the-fly analytics on billions of JSON events with sub-second responses. Distributed PostgreSQL is a game changer.”
    1 PB+ data
    6M+ queries per day; 75% of queries completing in < 0.2 secs


  29. https://github.com/citusdata/citus
    https://docs.citusdata.com/
