
Citus: Distributed PostgreSQL as an Extension | Vaccination Database Talks Carnegie Mellon | Marco Slot

Citus Data
February 22, 2021


One of the defining characteristics of PostgreSQL is its extensibility, which enables developers to add new database functionality without forking from the original project.

Citus is an open source PostgreSQL extension that transforms PostgreSQL into a distributed database. The goal of Citus is to make the versatile set of data processing capabilities in PostgreSQL available at any scale. Citus can scale transactional workloads by routing transactions across nodes, and analytical workloads by parallelizing operations across all cores in the cluster. Citus also has distributed data placement features such as co-location and reference tables to efficiently scale relational database features such as foreign keys and joins.

In this talk, you’ll learn how Citus uses the PostgreSQL extension APIs, such as the planner and executor hooks, in order to implement: a sharding layer, a distributed query planner, an adaptive query executor, and distributed transactions—all in a way that is transparent to the application without changing a line of Postgres code. We will also discuss the design and trade-offs of each component and what we learned along the way.



Transcript

  1. Citus
    Marco Slot
    [email protected]


  2. What is Citus?
    Citus is an extension (plug-in) to PostgreSQL which adds:
    • Distributed tables
    • Reference tables
    Simplicity and flexibility of using PostgreSQL, at scale.
    Multi-purpose:
    • Scale transactional workloads through routing / delegation
    • Scale analytical workloads through parallelism and columnar storage
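    A minimal sketch of these features in SQL (table definitions are illustrative; USING columnar assumes the columnar access method that ships with Citus 10):

        CREATE EXTENSION citus;

        -- A distributed table, sharded across the cluster by company_id:
        CREATE TABLE campaigns (company_id bigint, campaign_id bigint, name text);
        SELECT create_distributed_table('campaigns', 'company_id');

        -- A reference table, replicated in full to every node:
        CREATE TABLE ad_types (type_id int PRIMARY KEY, label text);
        SELECT create_reference_table('ad_types');

        -- Columnar storage for analytical data:
        CREATE TABLE events (event_time timestamptz, payload jsonb) USING columnar;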


  3. Why be an extension / not a fork?
    PostgreSQL is a core project and a vast ecosystem
    https://github.com/EfficiencyGeek/postgresql-ecosystem


  4. Why Citus?
    1) PostgreSQL is limited to a single server.
    Capacity / execution time issues:
    • Working set does not fit in memory
    • Reaching limits of network-attached storage (IOPS) / CPU
    • Analytical query takes too long
    • Data transformations are single-threaded (e.g. INSERT..SELECT)
    • Autovacuum cannot keep up with transactional workload
    • …


  5. Why Citus?
    2) Systems that are not PostgreSQL may be lacking one of:
    • Joins
    • Functions
    • Constraints
    • Indexes: B-tree, GIN, BRIN, & GiST
    • Partial indexes
    • Other extensions
    • PostGIS
    • Rich datatypes
    • JSONB
    • Window functions
    • CTEs
    • Atomic update / delete
    • Partitioning
    • Interactive transactions
    • Open source
    • …


  6. Citus workload patterns
    [Venn diagram: Citus workloads sit at the intersection of workloads that benefit from PostgreSQL and workloads that benefit from scaling out.]


  7. Citus workload patterns
    • Multi-tenant: software-as-a-service applications
    • Real-time analytics: customer-facing dashboards
    • High-performance CRUD: microservices
    • Data warehouse: analytical reports


  8. Citus workload requirements
    Capability            | Multi-tenant | Real-time Analytics | High perf. CRUD | Data warehouse
    ----------------------+--------------+---------------------+-----------------+---------------
    Distributed tables    | Yes          | Yes                 | Yes             | Yes
    Reference tables      | Yes          | Yes                 | Yes             | Yes
    Co-location           | Yes          | Yes                 | Yes             | Yes
    Parallel query        |              | Yes                 |                 | Yes
    Parallel DML          |              | Yes                 |                 |
    Query routing         | Yes          | Yes                 | Yes             |
    Fast full table scans |              |                     |                 | Yes
    Fast bulk loading     |              | Yes                 |                 | Yes
    Connection scaling    |              |                     | Yes             |
    Foreign keys          | Yes          |                     |                 |


  9. How to transform PostgreSQL into a distributed database as an extension?


  10. [image-only slide]

  11. Extension points in PostgreSQL:
    • planner, executor, transactions
    • background workers
    • foreign data wrappers


  12. What is an extension?
    Extensions consist of:
    1. SQL objects (tables, functions, types, …)
    2. Shared library

    citus.sql:
        CREATE TABLE pg_dist_node (…);
        CREATE TABLE pg_dist_partition (…);

        CREATE FUNCTION citus_add_node(…)
        RETURNS void LANGUAGE c
        AS '$libdir/citus', $function$citus_add_node$function$;

        CREATE FUNCTION create_distributed_table(…)
        RETURNS void LANGUAGE c
        AS '$libdir/citus', $function$create_distributed_table$function$;

    citus.c:
        #include "postgres.h"

        Datum citus_add_node(…)
        {
            …
        }

        Datum create_distributed_table(…)
        {
            …
        }


  13. A Citus cluster consists of multiple PostgreSQL servers with the Citus extension.
    [Diagram: coordinator node plus worker nodes W1, W2, W3, …, Wn]
    -- On the coordinator and on every worker:
    CREATE EXTENSION citus;
    -- On the coordinator, one call per worker:
    SELECT citus_add_node(…);
    SELECT citus_add_node(…);
    SELECT citus_add_node(…);
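    Filled in with concrete values, the setup might look like this (hostnames and port are illustrative):

        -- On the coordinator, register each worker by host and port:
        SELECT citus_add_node('worker-1', 5432);
        SELECT citus_add_node('worker-2', 5432);

        -- The node metadata table from slide 12 now lists the workers:
        SELECT nodename, nodeport FROM pg_dist_node;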


  14. Worker nodes can also act as coordinator.
    [Diagram: the application connects through a load balancer to any node (coordinator or workers W1 … Wn); administration connects to the coordinator.]


  15. Typical production deployment
    [Diagram: application → coordinator node → worker nodes W1 … Wn; administration connects to the coordinator; WAL is archived to blob storage.]


  16. How Citus distributes tables across the database cluster
    CREATE TABLE campaigns (…);
    SELECT create_distributed_table(
        'campaigns', 'company_id');
    [Diagram: the application talks to the coordinator, which stores the metadata; each worker W1 … Wn runs CREATE TABLE for its shards (campaigns_101 … campaigns_106).]
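    The resulting shards can be inspected through the metadata tables (a sketch; shard IDs and counts depend on the cluster):

        -- One row per shard, with the hash range it covers:
        SELECT * FROM pg_dist_shard WHERE logicalrelid = 'campaigns'::regclass;

        -- One row per shard placement, i.e. which node group holds which shard:
        SELECT * FROM pg_dist_placement;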


  17. Citus Table Types
    Distributed tables with co-location:
    SELECT create_distributed_table(
        'campaigns', 'company_id');
    SELECT create_distributed_table(
        'ads', 'company_id');
    Reference table:
    SELECT create_reference_table(
        'ad_types');
    [Diagram: co-located shards campaigns_4 / ads_4 on worker 1 and campaigns_8 / ads_8 on worker 2, linked by foreign keys, each alongside a replicated copy of ad_types; the coordinator tracks ad_types, campaigns, sessions, and ads.]
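    Co-location is what makes joins and foreign keys on the distribution column local to each worker. A sketch (column definitions are illustrative; note the foreign key must include the distribution column):

        CREATE TABLE campaigns (company_id bigint, campaign_id bigint,
                                PRIMARY KEY (company_id, campaign_id));
        CREATE TABLE ads (company_id bigint, ad_id bigint, campaign_id bigint,
                          PRIMARY KEY (company_id, ad_id));

        SELECT create_distributed_table('campaigns', 'company_id');
        SELECT create_distributed_table('ads', 'company_id',
                                        colocate_with => 'campaigns');

        -- Valid because matching company_id values live in co-located shards:
        ALTER TABLE ads ADD FOREIGN KEY (company_id, campaign_id)
          REFERENCES campaigns (company_id, campaign_id);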


  18. How Citus distributes queries across the database cluster
    SELECT campaign_id,
           avg(spend) AS avg_campaign_spend
    FROM campaigns
    GROUP BY campaign_id;
    [The coordinator consults its metadata and sends a rewritten query to every shard on W1 … Wn; avg() is decomposed into sum() and count() so the partial results can be merged on the coordinator:]
    SELECT campaign_id, sum(spend), count(spend) … FROM campaigns_101 …
    SELECT campaign_id, sum(spend), count(spend) … FROM campaigns_102 …
    SELECT campaign_id, sum(spend), count(spend) … FROM campaigns_103 …


  19. Query planner hook
    PostgreSQL defines function pointers that are called at critical points.

    postgres.c:
        planner_hook_type planner_hook = NULL;

        PlannedStmt *
        planner(Query *parse, …)
        {
            PlannedStmt *result;

            if (planner_hook)
                result = (*planner_hook) (parse, …);
            else
                result = standard_planner(parse, …);

            return result;
        }

    citus.c:
        #include "postgres.h"

        void _PG_init(void)
        {
            planner_hook = distributed_planner;
        }

        PlannedStmt *
        distributed_planner(Query *parse, …)
        {
            …
        }


  20. Executor hook: CustomScan
    Citus integrates into the PostgreSQL planner via a CustomScan node.
    SELECT ad_id, count(*) FROM clicks GROUP BY ad_id;
    [Plan: HashAggregate over a CustomScan; the CustomScan invokes the Adaptive Executor, which sends a per-shard query to each worker (W1, W2):]
    SELECT ad_id, count(*)
    FROM clicks_102
    GROUP BY ad_id;
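    What this looks like to the user (a hedged sketch; the exact EXPLAIN output and task counts vary by Citus version and shard count):

        EXPLAIN SELECT ad_id, count(*) FROM clicks GROUP BY ad_id;
        -- HashAggregate
        --   ->  Custom Scan (Citus Adaptive)
        --         Task Count: 32
        --         Tasks Shown: One of 32
        --         ->  Task
        --               Node: host=worker-1 port=5432 dbname=postgres
        --               ->  HashAggregate
        --                     ->  Seq Scan on clicks_102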


  21. Citus query planners
    Queries go through several planning layers:
    • Fast path (CRUD operations)
    • Router (single-tenant operations)


  22. Citus query planners
    • Query pushdown (real-time analytics)
    • Recursive planning
    • Logical planner with join order optimization (data warehouse)
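    To illustrate the first two layers (tables from earlier slides; values are illustrative):

        -- Fast path: a single-key CRUD statement, planned with minimal overhead:
        UPDATE campaigns SET start_date = '2018-03-17'
        WHERE company_id = 4 AND campaign_id = 2;

        -- Router: an arbitrarily complex query that is scoped to one tenant,
        -- so it can be sent as-is to the worker holding that tenant's shard:
        SELECT campaign_id, count(*)
        FROM campaigns JOIN ads USING (company_id, campaign_id)
        WHERE company_id = 4
        GROUP BY campaign_id;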


  23. Query pushdown planning
    Complex queries can be embarrassingly parallel
    SELECT campaign_id, sum(clicks.count) FROM
      ads LEFT JOIN LATERAL (
        SELECT campaign_id, count(*) FROM clicks JOIN sites USING (site_id)
        WHERE sites.language = 'nl'
          AND clicks.click_time >= now() - interval '1 hour'
          AND ads.campaign_id = clicks.campaign_id AND clicks.ad_id = ads.ad_id
        GROUP BY campaign_id
      ) clicks ON (true)
      JOIN campaigns USING (campaign_id)
    WHERE campaigns.customer_id = 4
    GROUP BY 1 ORDER BY 2 DESC LIMIT 10;


  24. Query pushdown planning
    Complex queries can be embarrassingly parallel.
    [Plan: Limit over Sort over a CustomScan (Adaptive Executor); the whole query is pushed down to each co-located shard group on W1, W2:]
    SELECT campaign_id, ad_id, sum(clicks.count) FROM
      ads_303 LEFT JOIN LATERAL (
        SELECT campaign_id, count(*)
        FROM clicks_402 JOIN sites_1 USING (site_id)
        WHERE sites.language = 'nl'
          AND clicks.click_time >= now() - interval '1 hour'
          AND ads.campaign_id = clicks.campaign_id
          AND clicks.ad_id = ads.ad_id
        GROUP BY campaign_id
      ) clicks ON (true)
      JOIN campaigns_102 USING (campaign_id)
    WHERE campaigns.customer_id = 4
    GROUP BY 1, 2 ORDER BY 2 LIMIT 10;


  25. Recursive planning
    Plan non-pushdownable subqueries separately.
    SELECT ads.* FROM ads JOIN (
      SELECT ad_id, click_count FROM click_counts ORDER BY 2 DESC LIMIT 10
    ) top_ads USING (ad_id);
    Subplan query:
    SELECT ad_id, click_count
    FROM click_counts
    ORDER BY 2 DESC LIMIT 10
    Main query:
    SELECT ads.* FROM ads JOIN
      read_intermediate_result('top_ads')
      USING (ad_id);


  26. Subplan execution (broadcast)
    [The Adaptive Executor first broadcasts the subplan result to the workers (W1, W2), then runs the rewritten main query against it:]
    COPY "top_ads" FROM STDIN WITH (format 'result')
    SELECT ads.* FROM ads_102 JOIN
      read_intermediate_result('top_ads')
      USING (ad_id);


  27. How Citus handles transactions in a multi-node cluster
    BEGIN;
    UPDATE campaigns
    SET start_date = '2018-03-17'
    WHERE company_id = 'Pat Co';
    COMMIT;
    [The coordinator looks up the target shard in its metadata and routes the entire transaction to the worker that holds it:]
    BEGIN; UPDATE campaigns_2012 SET …; COMMIT;
    Citus hooks into the transaction via callbacks: pre-commit, post-commit, and abort.


  28. How Citus distributes transactions in a multi-node cluster
    BEGIN;
    UPDATE campaigns
    SET started = true
    WHERE campaign_id = 2;
    UPDATE ads
    SET finished = true
    WHERE campaign_id = 1;
    COMMIT;
    [The transaction writes to shards on two workers, so the pre-commit callback upgrades it to two-phase commit; each worker receives:]
    BEGIN …
    assign_distributed_transaction_id …
    UPDATE campaigns_102 …   (campaigns_203 on the other worker)
    PREPARE TRANSACTION …
    COMMIT PREPARED …
    callbacks: pre-commit, post-commit, abort


  29. 2PC recovery
    [Each worker ran:]
    BEGIN …
    assign_distributed_transaction_id …
    UPDATE campaigns_102 …   (campaigns_203 on the other worker)
    PREPARE TRANSACTION 'citus_0_2431';
    COMMIT PREPARED …;
    The coordinator records the prepared transaction names in its metadata before finalizing the commit:
    worker | prepared xact
    W1     | citus_0_2431
    W2     | citus_0_2431
    To recover after a failure, it compares that metadata against what each worker reports:
    SELECT gid FROM pg_prepared_xacts
    WHERE gid LIKE 'citus_%d_%'
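    A sketch of the recovery flow in SQL (pg_dist_transaction is the coordinator's bookkeeping table; recover_prepared_transactions() is the Citus UDF that resolves leftovers):

        -- What the coordinator recorded before issuing COMMIT PREPARED:
        SELECT * FROM pg_dist_transaction;

        -- What a worker still holds prepared:
        SELECT gid FROM pg_prepared_xacts WHERE gid LIKE 'citus_%';

        -- Commit prepared transactions that were recorded as committed,
        -- and roll back the rest:
        SELECT recover_prepared_transactions();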


  30. Deadlock detection
    [Two distributed transactions update campaigns_102 and campaigns_203 on different workers in opposite orders, each blocking the other.]
    The coordinator periodically collects wait edges from every worker:
    SELECT * FROM local_wait_edges();
    and detects cycles in the combined lock graph.
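    A sketch of a deadlock that no single node can see on its own (two sessions; row placement and values are illustrative):

        -- Session A:
        BEGIN;
        UPDATE campaigns SET started = true WHERE campaign_id = 1;  -- row on W1

        -- Session B:
        BEGIN;
        UPDATE campaigns SET started = true WHERE campaign_id = 2;  -- row on W2

        -- Session A now blocks behind B on W2:
        UPDATE campaigns SET started = true WHERE campaign_id = 2;

        -- Session B now blocks behind A on W1, closing a cycle that spans
        -- two nodes; each worker sees only one wait edge, so only the
        -- coordinator's global graph reveals the deadlock:
        UPDATE campaigns SET started = true WHERE campaign_id = 1;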


  31. How Citus parallelizes bulk loading
    Bulk loading using COPY, intercepted via the ProcessUtility_hook:
    -- Data loading
    COPY clicks FROM STDIN WITH CSV
    [The coordinator routes each incoming row to the right shard on W1 … Wn, writing to many shards in parallel.]
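    From the client's perspective nothing changes relative to single-node PostgreSQL (file path illustrative):

        \copy clicks FROM '/tmp/clicks.csv' WITH CSV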


  32. Creating rollup tables with INSERT..SELECT
    Distributed, transactional, parallel data transformation:
    INSERT INTO click_counts
    SELECT campaign_id, date_trunc('hour', click_time), count(*)
    FROM clicks
    WHERE ingest_time BETWEEN '2021-02-22 00:00' AND '2021-02-22 00:05'
    GROUP BY 1, 2
    ON CONFLICT (campaign_id, hour)
    DO UPDATE SET count = click_counts.count + EXCLUDED.count;
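    For the ON CONFLICT clause to work, click_counts needs a unique constraint on (campaign_id, hour), and for the INSERT..SELECT to run shard-to-shard in parallel it should be co-located with the source table. A sketch of the target table (columns illustrative, assuming clicks is distributed by campaign_id):

        CREATE TABLE click_counts (
            campaign_id bigint,
            hour timestamptz,
            count bigint,
            PRIMARY KEY (campaign_id, hour)
        );
        SELECT create_distributed_table('click_counts', 'campaign_id',
                                        colocate_with => 'clicks');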


  33. Demo


  34. Lessons learned
    • PostgreSQL is extensible enough to build a comprehensive distributed database system.
    • Distribution choices are essential to scale, but difficult for users.
    • Help customers adopt your database; get paid when they scale out.
    • If you’re going for compatibility, look at what ORMs are doing.
    • Solve adoption blockers.
    • Be prescriptive about the problem/workload you’re solving.


  35. Open challenges
    • Hybrid local-distributed databases
    • Picking good distribution columns automatically
    • Creating good rollups automatically
    • Distributed geospatial (PostGIS) join optimization
    • Nested distributed transactions (e.g. in triggers)
    • Arbitrary foreign keys
    • Linear connection scaling


  36. Questions?
    [email protected]
    Citus GitHub
    github.com/citusdata/citus
    Citus Slack
    slack.citusdata.com
    Citus Docs
    docs.citusdata.com
    Azure - Hyperscale (Citus) Docs
    docs.microsoft.com/azure/postgresql/hyperscale/
    PostgreSQL hooks documentation
    github.com/AmatanHead/psql-hooks
