
Citus: Distributed PostgreSQL as an Extension | Vaccination Database Talks Carnegie Mellon | Marco Slot

Citus Data
February 22, 2021

One of the defining characteristics of PostgreSQL is its extensibility, which enables developers to add new database functionality without forking from the original project.

Citus is an open source PostgreSQL extension that transforms PostgreSQL into a distributed database. The goal of Citus is to make the versatile data processing capabilities of PostgreSQL available at any scale. Citus scales transactional workloads by routing transactions across nodes, and analytical workloads by parallelizing operations across all cores in the cluster. Citus also offers distributed data placement features, such as co-location and reference tables, that keep relational capabilities such as foreign keys and joins efficient at scale.

In this talk, you’ll learn how Citus uses the PostgreSQL extension APIs, such as the planner and executor hooks, to implement a sharding layer, a distributed query planner, an adaptive query executor, and distributed transactions, all transparently to the application and without changing a line of Postgres code. We will also discuss the design and trade-offs of each component and what we learned along the way.

Transcript

  1. What is Citus? Citus is an extension (plug-in) to PostgreSQL which adds:
     • Distributed tables
     • Reference tables
     Simplicity and flexibility of using PostgreSQL, at scale. Multi-purpose:
     • Scale transactional workloads through routing / delegation
     • Scale analytical workloads through parallelism and columnar storage
  2. Why be an extension / not a fork? PostgreSQL is a core project and a vast
     ecosystem: https://github.com/EfficiencyGeek/postgresql-ecosystem
  3. Why Citus? 1) PostgreSQL is limited to a single server. Capacity / execution
     time issues:
     • Working set does not fit in memory
     • Reaching limits of network-attached storage (IOPS) / CPU
     • Analytical query takes too long
     • Data transformations are single-threaded (e.g. INSERT..SELECT)
     • Autovacuum cannot keep up with transactional workload
     • …
  4. Why Citus? 2) Systems that are not PostgreSQL may be lacking one of:
     • Joins
     • Functions
     • Constraints
     • Indexes: B-tree, GIN, BRIN & GiST
     • Partial indexes
     • Other extensions (e.g. PostGIS)
     • Rich datatypes (e.g. JSONB)
     • Window functions
     • CTEs
     • Atomic update / delete
     • Partitioning
     • Interactive transactions
     • Open source
     • …
  5. Citus workload requirements:

     Capability             Multi-tenant   Real-time analytics   High-perf. CRUD   Data warehouse
     Distributed tables     Yes            Yes                   Yes               Yes
     Reference tables       Yes            Yes                   Yes               Yes
     Co-location            Yes            Yes                   Yes               Yes
     Parallel query                        Yes                                     Yes
     Parallel DML                          Yes
     Query routing          Yes            Yes                   Yes
     Fast full table scans                                                         Yes
     Fast bulk loading                     Yes                                     Yes
     Connection scaling                                          Yes
     Foreign keys           Yes
     …
  6. What is an extension? Extensions consist of:
     1. SQL objects (tables, functions, types, …)
     2. Shared library

     citus.sql:

         CREATE TABLE pg_dist_node (…);
         CREATE TABLE pg_dist_partition (…);

         CREATE FUNCTION citus_add_node(…) RETURNS void
         LANGUAGE c AS '$libdir/citus', $function$citus_add_node$function$;

         CREATE FUNCTION create_distributed_table(…) RETURNS void
         LANGUAGE c AS '$libdir/citus', $function$create_distributed_table$function$;

     citus.c:

         #include "postgres.h"

         Datum citus_add_node(…) { … }
         Datum create_distributed_table(…) { … }
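     Installing the extension is then a single statement, which runs citus.sql and
     records the objects in the system catalogs (a minimal sketch; querying
     pg_extension is just one way to check the result):

         -- Citus also needs shared_preload_libraries = 'citus' in postgresql.conf,
         -- so that its _PG_init() runs at server start and can install its hooks.
         CREATE EXTENSION citus;

         -- the extension and its version are now visible in the catalog:
         SELECT extname, extversion FROM pg_extension WHERE extname = 'citus';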
  7. A Citus cluster consists of multiple PostgreSQL servers with the Citus
     extension: a coordinator node and worker nodes W1, W2, W3, … Wn. Every node
     runs CREATE EXTENSION citus; the workers are then registered on the coordinator:

         CREATE EXTENSION citus;
         SELECT citus_add_node(…);
         SELECT citus_add_node(…);
         SELECT citus_add_node(…);
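     Filled in with hypothetical hosts (the slide elides the arguments; the host
     names and port below are assumptions):

         -- on the coordinator, after CREATE EXTENSION citus on every node:
         SELECT citus_add_node('worker-1.example.com', 5432);
         SELECT citus_add_node('worker-2.example.com', 5432);
         SELECT citus_add_node('worker-3.example.com', 5432);

         -- the registered nodes are recorded in the Citus metadata:
         SELECT nodename, nodeport FROM pg_dist_node;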
  8. Worker nodes can also act as coordinator. [Diagram: the application connects
     through a load balancer to any of the nodes W1 … Wn; administration connects
     to the coordinator node.]
  9. Typical production deployment. [Diagram: the application connects to the
     coordinator node, which fans out to worker nodes W1 … Wn; administration
     connects to the coordinator; blob storage holds the WAL archive.]
  10. How Citus distributes tables across the database cluster. The application runs:

         CREATE TABLE campaigns (…);
         SELECT create_distributed_table('campaigns', 'company_id');

     The coordinator node stores the metadata and creates the shards on the worker
     nodes: CREATE TABLE campaigns_101 … CREATE TABLE campaigns_106, spread across
     W1 … Wn.
  11. Citus Table Types. Distributed tables with co-location:

         SELECT create_distributed_table('campaigns', 'company_id');
         SELECT create_distributed_table('ads', 'company_id');

     Reference table:

         SELECT create_reference_table('ad_types');

     [Diagram: the coordinator holds campaigns, ads, sessions, and ad_types; each
     worker holds co-located shard pairs such as campaigns_4 with ads_4 and
     campaigns_8 with ads_8, plus a full copy of ad_types, with foreign keys
     between them.]
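     A runnable sketch of this layout (the column definitions are assumptions; only
     the table names and the distribution column come from the slide):

         CREATE TABLE ad_types  (ad_type_id bigint PRIMARY KEY, name text);
         CREATE TABLE campaigns (
             company_id  bigint,
             campaign_id bigint,
             ad_type_id  bigint,
             PRIMARY KEY (company_id, campaign_id)
         );
         CREATE TABLE ads (
             company_id  bigint,
             ad_id       bigint,
             campaign_id bigint,
             PRIMARY KEY (company_id, ad_id)
         );

         -- replicate the small lookup table to every node:
         SELECT create_reference_table('ad_types');

         -- shard both tables by company_id; the shared distribution column makes
         -- their shards co-located on the same workers:
         SELECT create_distributed_table('campaigns', 'company_id');
         SELECT create_distributed_table('ads', 'company_id');

         -- foreign keys work between co-located distributed tables (they must
         -- include the distribution column) and from distributed tables to
         -- reference tables:
         ALTER TABLE ads ADD FOREIGN KEY (company_id, campaign_id)
             REFERENCES campaigns (company_id, campaign_id);
         ALTER TABLE campaigns ADD FOREIGN KEY (ad_type_id)
             REFERENCES ad_types (ad_type_id);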
  12. How Citus distributes queries across the database cluster. The application runs:

         SELECT campaign_id, avg(spend) AS avg_campaign_spend
         FROM campaigns
         GROUP BY campaign_id;

     Using its metadata, the coordinator sends a fragment query to each shard on the
     workers:

         SELECT campaign_id, sum(spend), count(spend) … FROM campaigns_101 …
         SELECT campaign_id, sum(spend), count(spend) … FROM campaigns_102 …
         SELECT campaign_id, sum(spend), count(spend) … FROM campaigns_103 …

     and combines the partial sums and counts into the final averages.
  13. Query planner hook. PostgreSQL defines function pointers that are called at
     critical points. In postgres.c:

         planner_hook_type planner_hook = NULL;

         PlannedStmt *
         planner(Query *parse, …)
         {
             PlannedStmt *result;

             if (planner_hook)
                 result = (*planner_hook) (parse, …);
             else
                 result = standard_planner(parse, …);

             return result;
         }

     In citus.c, the extension installs its own planner when the library is loaded:

         #include "postgres.h"

         void _PG_init(void)
         {
             …
             planner_hook = distributed_planner;
             …
         }

         PlannedStmt *
         distributed_planner(Query *parse, …)
         {
             …
         }
  14. Executor hook: CustomScan. Citus integrates into the PostgreSQL planner via a
     CustomScan node. For the query

         SELECT ad_id, count(*) FROM clicks GROUP BY ad_id;

     the plan is a HashAggregate on top of a CustomScan; the CustomScan invokes the
     Adaptive Executor, which sends a fragment query to each worker (W1, W2):

         SELECT ad_id, count(*) FROM clicks_102 GROUP BY ad_id;
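     The CustomScan shows up in ordinary EXPLAIN output (a sketch; the plan text is
     abbreviated here, and the task count depends on the shard count and Citus
     version):

         EXPLAIN SELECT ad_id, count(*) FROM clicks GROUP BY ad_id;
         --  HashAggregate
         --    Group Key: remote_scan.ad_id
         --    ->  Custom Scan (Citus Adaptive)
         --          Task Count: 32
         --          Tasks Shown: One of 32
         --          …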
  15. Citus query planners. Queries go through several planning layers:
     • Fast path (CRUD operations)
     • Router (single-tenant operations)
     followed by the pushdown and recursive planners described on the next slides.
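     For example, a single-tenant query is router-planned (a sketch, reusing the
     campaigns table distributed by company_id from earlier; the tenant value is
     hypothetical):

         -- every row with this company_id lives in one shard, so the entire query,
         -- including joins with co-located tables, is sent to that one worker:
         SELECT * FROM campaigns WHERE company_id = 4;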
  16. Query pushdown planning. Complex queries can be embarrassingly parallel:

         SELECT campaign_id, sum(clicks.count)
         FROM ads
         LEFT JOIN LATERAL (
             SELECT campaign_id, count(*)
             FROM clicks
             JOIN sites USING (site_id)
             WHERE sites.language = 'nl'
               AND clicks.click_time >= now() - interval '1 hour'
               AND ads.campaign_id = clicks.campaign_id
               AND clicks.ad_id = ads.ad_id
             GROUP BY campaign_id
         ) clicks ON (true)
         JOIN campaigns USING (campaign_id)
         WHERE campaigns.customer_id = 4
         GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
  17. Query pushdown planning. Complex queries can be embarrassingly parallel: the
     plan is Limit, Sort, then a CustomScan, and the Adaptive Executor sends a
     fragment like this to each worker (W1, W2):

         SELECT campaign_id, ad_id, sum(clicks.count)
         FROM ads_303
         LEFT JOIN LATERAL (
             SELECT campaign_id, count(*)
             FROM clicks_402
             JOIN sites_1 USING (site_id)
             WHERE sites.language = 'nl'
               AND clicks.click_time >= now() - interval '1 hour'
               AND ads.campaign_id = clicks.campaign_id
               AND clicks.ad_id = ads.ad_id
         ) clicks ON (true)
         JOIN campaigns_102 USING (campaign_id)
         WHERE campaigns.customer_id = 4
         GROUP BY 1, 2 ORDER BY 2 LIMIT 10;
  18. Recursive planning. Plan non-pushdownable subqueries separately. The original
     query:

         SELECT ads.* FROM ads
         JOIN (
             SELECT ad_id, click_count FROM click_counts
             ORDER BY 2 DESC LIMIT 10
         ) top_ads USING (ad_id);

     Subplan query (executed first):

         SELECT ad_id, click_count FROM click_counts
         ORDER BY 2 DESC LIMIT 10

     Main query (reads the subplan's result):

         SELECT ads.* FROM ads
         JOIN read_intermediate_result('top_ads') USING (ad_id);
  19. Subplan execution (broadcast). The Adaptive Executor first broadcasts the
     subplan's result to the workers (W1, W2):

         COPY "top_ads" FROM STDIN WITH (format 'result')

     and then runs the main query's fragments against it:

         SELECT ads.* FROM ads_102
         JOIN read_intermediate_result('top_ads') USING (ad_id);
  20. How Citus handles transactions in a multi-node cluster. The application runs:

         BEGIN;
         UPDATE campaigns
         SET start_date = '2018-03-17'
         WHERE company_id = 'Pat Co';
         COMMIT;

     The coordinator routes the whole transaction to the worker that holds the
     tenant's shard:

         BEGIN; UPDATE campaigns_2012 SET …; COMMIT;

     using transaction callbacks: pre-commit, post-commit, abort.
  21. How Citus distributes transactions in a multi-node cluster. The application runs:

         BEGIN;
         UPDATE campaigns SET started = true WHERE campaign_id = 2;
         UPDATE ads SET finished = true WHERE campaign_id = 1;
         COMMIT;

     The updates land on different workers, so the coordinator (driven by the
     pre-commit, post-commit, and abort callbacks) sends to each worker:

         -- worker 1:
         BEGIN … assign_distributed_transaction_id …
         UPDATE campaigns_102 …
         PREPARE TRANSACTION … COMMIT PREPARED …

         -- worker 2:
         BEGIN … assign_distributed_transaction_id …
         UPDATE campaigns_203 …
         PREPARE TRANSACTION … COMMIT PREPARED …
  22. 2PC recovery. Each worker prepares its part of the transaction under a
     globally unique name:

         -- on W1:
         BEGIN … assign_distributed_transaction_id …
         UPDATE campaigns_102 …
         PREPARE TRANSACTION citus_0_2431;
         COMMIT PREPARED …;

         -- on W2: the same sequence for campaigns_203

     The coordinator records these names in its metadata:

         worker   prepared xact
         W1       citus_0_2431
         W2       citus_0_2431

     After a failure, it compares that metadata against what the workers report:

         SELECT gid FROM pg_prepared_xacts WHERE gid LIKE 'citus_%d_%'
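     Recovery can also be invoked by hand (a sketch; recover_prepared_transactions()
     is the UDF name Citus has used for this, so treat it as an assumption for your
     version):

         -- on a worker: prepared-but-undecided transactions survive restarts
         SELECT gid FROM pg_prepared_xacts;

         -- on the coordinator: compare against the pg_dist_transaction metadata,
         -- then COMMIT PREPARED or ROLLBACK PREPARED the leftovers
         SELECT recover_prepared_transactions();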
  23. Deadlock detection. Two transactions may block each other on different workers:

         -- worker 1: BEGIN … assign_distributed_transaction_id … UPDATE campaigns_102 …
         -- worker 2: BEGIN … assign_distributed_transaction_id … UPDATE campaigns_203 …

     The coordinator gathers wait edges from every node and detects cycles in the
     combined lock graph:

         SELECT * FROM local_wait_edges();
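     A distributed deadlock that only this check can break might look like this
     (hypothetical sessions and tenant values; neither worker sees a complete cycle
     locally):

         -- session A:
         BEGIN;
         UPDATE campaigns SET started = true WHERE company_id = 1;  -- locks a row on W1
         -- session B:
         BEGIN;
         UPDATE campaigns SET started = true WHERE company_id = 2;  -- locks a row on W2
         -- session A:
         UPDATE campaigns SET started = true WHERE company_id = 2;  -- waits for B on W2
         -- session B:
         UPDATE campaigns SET started = true WHERE company_id = 1;  -- waits for A on W1

     The wait edges A waits-for B (on W2) and B waits-for A (on W1) only form a
     cycle in the combined graph, which the coordinator detects and breaks by
     aborting one of the transactions.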
  24. How Citus parallelizes bulk loading. The application bulk-loads with:

         COPY clicks FROM STDIN WITH CSV

     Citus intercepts the statement through the ProcessUtility_hook; the coordinator
     uses its metadata to route each incoming row to the worker holding its shard,
     writing to all shards in parallel.
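     From psql this is just an ordinary client-side copy (a minimal sketch; the file
     name is hypothetical):

         -- \copy streams the file through COPY clicks FROM STDIN on the
         -- coordinator, which Citus fans out to the shards in parallel:
         \copy clicks FROM 'clicks.csv' WITH CSV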
  25. Creating rollup tables with INSERT..SELECT. Distributed, transactional,
     parallel data transformation:

         INSERT INTO click_counts
         SELECT campaign_id, date_trunc('hour', click_time), count(*)
         FROM clicks
         WHERE ingest_time BETWEEN '2021-02-22 00:00' AND '2021-02-22 00:05'
         GROUP BY 1, 2
         ON CONFLICT (campaign_id, hour)
         DO UPDATE SET count = click_counts.count + EXCLUDED.count;
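     The INSERT implies a shape for the rollup target; a sketch of a matching
     definition (the column types are assumptions):

         CREATE TABLE click_counts (
             campaign_id bigint,
             hour        timestamptz,
             count       bigint,
             PRIMARY KEY (campaign_id, hour)   -- backs the ON CONFLICT clause
         );

         -- assuming clicks is also distributed by campaign_id, the shards are
         -- co-located and the INSERT..SELECT runs shard-by-shard in parallel:
         SELECT create_distributed_table('click_counts', 'campaign_id');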
  26. Lessons learned:
     • PostgreSQL is extensible enough to build a comprehensive distributed database system.
     • Distribution choices are essential to scale, but difficult for users.
     • Help customers adopt your database; get paid when they scale out.
     • If you’re going for compatibility, look at what ORMs are doing.
     • Solve adoption blockers.
     • Be prescriptive about the problem/workload you’re solving.
  27. Open challenges:
     • Hybrid local-distributed databases
     • Picking good distribution columns automatically
     • Creating good rollups automatically
     • Distributed geospatial (PostGIS) join optimization
     • Nested distributed transactions (e.g. in triggers)
     • Arbitrary foreign keys
     • Linear connection scaling
  28. Questions? [email protected]
     • Citus GitHub: github.com/citusdata/citus
     • Citus Slack: slack.citusdata.com
     • Citus Docs: docs.citusdata.com
     • Azure Hyperscale (Citus) Docs: docs.microsoft.com/azure/postgresql/hyperscale/
     • PostgreSQL hooks documentation: github.com/AmatanHead/psql-hooks