Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Beginner's Guide to Partitioning vs. Sharding i...

Beginner's Guide to Partitioning vs. Sharding in Postgres | Claire Giordano | PGConf NYC 2024

Slides from a conference presentation at PGConf NYC 2024. Abstract: Partitioning has come a long way in Postgres since the Postgres 10 days, as has sharding via the Citus extension. If you need to scale your Postgres, your friends may recommend you look into partitioning and/or sharding. But what’s the difference between these two approaches?

This beginner’s guide is for those who want to start with the basics and advance from there. You’ll learn what partitioning and sharding are in Postgres; how they’re different; as well as why (and when) these capabilities can help improve query performance. You’ll learn why some people call partitioning invaluable and others call sharding a lifesaver.

But there are gotchas and they’re not for every use case. As David Rowley says, “Understanding why partitioning helps is key to making the decision about whether to partition, what to partition on, and the type and number of partitions.” Similarly, understanding why sharding helps is key to deciding whether sharding is right for you. The concept of data locality will be touched on as well. And for those who prefer to learn by example and not just theory, yes, there will be examples.

Claire Giordano

September 30, 2024
Tweet

More Decks by Claire Giordano

Other Decks in Technology

Transcript

  1. 1 CLAIRE GIORDANO @clairegiordano • @PosetteConf • @AzureDBPostgres • @citusdata

    Beginner's Guide to Partitioning vs. Sharding in Postgres
  2. So I wrote a thing for PGSQL Phriday “Understanding Partitioning

    and Sharding in Postgres and Citus” aka.ms/blog-partitioning-sharding-claire
  3. 6 Quite a lot of Postgres work @ Microsoft PostgreSQL

    core Citus Open Source PG Ecosystem Azure Database for PostgreSQL Azure CosmosDB for PostgreSQL MANAGED DATABASE SERVICES OPEN SOURCE WORK fully-managed database service for Postgres service powered by Citus extension to Postgres NEW CAPABILITIES in FLEXIBLE SERVER NEW CAPABILITIES Contribute to PG open source (and review patches on many other people's work!) Citus open-source extension to Postgres gives you Postgres at any scale. (Think: distributed Postgres.) Postgres extension & tooling our PG team at Microsoft maintains or contributed to in last ~8 months aka.ms/blog-pg-at-microsoft “Flexible Server” PG Community Contribute to growth & knowledge of the PostgreSQL open-source developer & user communities.
  4. 7 What is sharding? Partitioning + sharding together How partitioning

    & sharding are different Why partitioning helps query performance When partitioning helps query performance Why sharding helps query performance When sharding helps query performance This Beginner Guide has _8_ chapters What is Postgres partitioning?
  5. 8 What is sharding? Partitioning + sharding together How partitioning

    & sharding are different Why partitioning helps query performance When partitioning helps query performance Why sharding helps query performance When sharding helps query performance Part 1 What is Postgres partitioning?
  6. What is Postgres table partitioning What if we could split

    this tree trunk into many smaller logs, & only work with the ones we need?
  7. • Splits large tables into many smaller tables (partitions) •

    Built-in, native feature in Postgres • Needs to be declared up-front when you first CREATE TABLE
  8. So many improvements to Postgres partitioning in last 5-6 years

    Shout-out to @Brandur for this blog post... aka.ms/blog-partitioning-2022-brandur
  9. 13 What are the benefits of Postgres Partitioning? Can improve

    query performance Improves performance of bulk deletes /if drop partitions Improves autovacuum performance /in so many cases!! Enables you to optimize storage costs (cheap vs. expensive)
  10. In Postgres, trying to remove old rows from a large,

    hot table is flitting with disaster. A long running query must iterate through and mark each one as dead, and even then nothing is reclaimed until an equally expensive vacuum runs through and frees space, and only when it's allowed to after rows are no longer visible to any other query in the system, “ — Brandur whether they're making use of the large table or not. Each row removal land in the WAL, resulting in significant amplification. But with partitions, deletion becomes a simple DROP TABLE. It executes instantly, and with negligible costs (partitioning has other benefits too). The trade-off is maintenance." From Brandur.org Title of post: “Partitioning in Postgres, 2022 Edition”
  11. 15 Declarative Postgres partitioning has 3 partitioning methods PARTITION BY

    RANGE PARTITION BY LIST PARTITION BY HASH Time-series & IOT Countries, company divisions,… When partitioning by range & by list do not work... when NOT an obvious partitioning key
  12. 16 Tracking monthly concert revenue for Taylor Swift Step 1:

    Decide on partition method Step 2: Decide on partition key Step 3: Create a partitioned table Step 4: Create partitions
  13. 17 Let’s FIRST create a partitioned table CREATE TABLE concert_revenue

    ( city_id int not null, sale_date date not null, merch_sales int, ticket_sales int ) PARTITION BY RANGE (sale_date);
  14. 18 Let’s create a partition for June 2024 CREATE TABLE

    concert_revenue_cy2024m06 PARTITION OF concert_revenue FOR VALUES FROM ('2024-06-01') TO ('2024-07-01'); TIP: lower bound is INCLUSIVE TIP: upper bound (2024-07-01) is NOT inclusive
  15. 19 Let’s create 3 partitions for Jun / Jul /

    Aug 2024 CREATE TABLE concert_revenue_cy2024m06 PARTITION OF concert_revenue FOR VALUES FROM ('2024-06-01') TO ('2024-07-01'); CREATE TABLE concert_revenue_cy2024m07 PARTITION OF concert_revenue FOR VALUES FROM ('2024-07-01') TO ('2024-08-01'); CREATE TABLE concert_revenue_cy2024m08 PARTITION OF concert_revenue FOR VALUES FROM ('2024-08-01') TO ('2024-09-01');
  16. 22 What is sharding? Partitioning + sharding together How partitioning

    & sharding are different Why partitioning helps query performance When partitioning helps query performance Why sharding helps query performance When sharding helps query performance Part 2 What is Postgres partitioning?
  17. What is sharding? 24 Splitting Postgres tables into smaller tables

    (“shards”) Distributing shards horizontally across multiple nodes AND
  18. Alternative / Equivalent terms for “Sharding” in Postgres 25 Horizontal

    Sharding Scaling out Horizontally Distributing Postgres
  19. Back to previous metaphor If you have one big ginormous

    tree trunk & you want to split it into smaller pieces & distribute it across multiple parts of forest
  20. 28 3 common ways to shard Postgres 2 1 Partitions

    + postgres_fdw Manual sharding sometimes called “sharding at application layer” Create hash-partitioned Postgres table in which every partition is a foreign table using postgres_fdw 3 Citus Extension to Postgres Open source, & also a managed service
  21. Bulk of sharding work done by... 2 1 3 DBA

    Citus Dev App Dev Partitions + postgres_fdw Manual sharding Citus
  22. 31 Diagram of sharding Postgres with Citus Postgres Single Node

    Distributed Citus cluster with shards (smaller tables) on each node in cluster
  23. 33 Benefit of sharding Postgres with Citus: >>>> Get more

    Performance & Scale than you can eek out on a single node Can improve query performance @ scale //due to all the memory, CPU, & disk from multi-node cluster Enables your application to grow & scale // not just now, but in future Improves performance of autovacuum //since it runs in parallel across all the nodes in the cluster
  24. 34 How to distribute tables with Citus for row-based sharding

    SELECT create_distributed_table( 'table_name', 'sharding_key'); N.B. “sharding key” is often called a “distribution column”
  25. 35 New to Citus 12.0, a 2nd way to shard:

    schema-based sharding SELECT citus_schema_distribute( 'name');
  26. 36 What is sharding? Partitioning + sharding together How partitioning

    & sharding are different Why partitioning helps query performance When partitioning helps query performance Why sharding helps query performance When sharding helps query performance Part 3 What is Postgres partitioning?
  27. 38 Can you Partition & Shard together? Single PG node

    Partitions on a single Postgres node Sharded Postgres partitions on a distributed 4-node Citus cluster Node 1 Node 2 Node 3 Node 4
  28. 39 These two Citus UDFs simplify partition management create_time_partitions(table_name regclass,

    partition_interval interval, end_at timestamp with time zone, start_from timestamp with time zone DEFAULT now()) #1—Create as many partitions as necessary for given time range
  29. 40 These 2 Citus UDFs simplify partition management drop_old_time_partitions(table_name regclass,

    older_than timestamp with time zone) #2—Drop all partitions older than given timestamp
  30. 41 What is sharding? Partitioning + sharding together How partitioning

    & sharding are different Why partitioning helps query performance When partitioning helps query performance Why sharding helps query performance When sharding helps query performance Part 4 What is Postgres partitioning?
  31. Postgres native Partitioning Sharding with Citus extension Partitioning + Sharding

    Combination 2 1 3 0 Aspects or attributes of these technologies
  32. Attribute Partitioning Citus Sharding Partition + Shard Combo Built into

    Postgres Extension to Postgres Single node Multi-node Drop old data quickly Parallel, distributed SQL/DDL/DML Partition/Shard Pruning Parallel autovacuum Better index cache hit ratios Automatic maintenance Time series apps (e.g. IOT) Multi-tenant SaaS apps
  33. Attribute Partitioning Citus Sharding Partition + Shard Combo Built into

    Postgres Extension to Postgres Single node Multi-node Drop old data quickly Parallel, distributed SQL/DDL/DML Partition/Shard Pruning Parallel autovacuum Better index cache hit ratios Automatic maintenance Time series apps (e.g. IOT) Multi-tenant SaaS apps
  34. Attribute Partitioning Citus Sharding Partition + Shard Combo Built into

    Postgres Extension to Postgres Single node Multi-node Drop old data quickly Parallel, distributed SQL/DDL/DML Partition/Shard Pruning Parallel autovacuum Better index cache hit ratios Automatic maintenance Time series apps (e.g. IOT) Multi-tenant SaaS apps
  35. Attribute Partitioning Citus Sharding Partition + Shard Combo Built into

    Postgres Extension to Postgres Single node Multi-node Drop old data quickly Parallel, distributed SQL/DDL/DML Partition/Shard Pruning Parallel autovacuum Better index cache hit ratios Automatic maintenance Time series apps (e.g. IOT) Multi-tenant SaaS apps
  36. Attribute Partitioning Citus Sharding Partition + Shard Combo Built into

    Postgres Extension to Postgres Single node Multi-node Drop old data quickly Parallel, distributed SQL/DDL/DML Partition/Shard Pruning Parallel autovacuum Better index cache hit ratios Automatic maintenance Time series apps (e.g. IOT) Multi-tenant SaaS apps
  37. Attribute Partitioning Citus Sharding Partition + Shard Combo Built into

    Postgres Extension to Postgres Single node Multi-node Drop old data quickly Parallel, distributed SQL/DDL/DML Partition/Shard Pruning Parallel autovacuum Better index cache hit ratios Automatic maintenance Time series apps (e.g. IOT) Multi-tenant SaaS apps
  38. Attribute Partitioning Citus Sharding Partition + Shard Combo Built into

    Postgres Extension to Postgres Single node Multi-node Drop old data quickly Parallel, distributed SQL/DDL/DML Partition/Shard Pruning Parallel autovacuum Better index cache hit ratios Automatic maintenance Time series apps (e.g. IOT) Multi-tenant SaaS apps
  39. Attribute Partitioning Citus Sharding Partition + Shard Combo Built into

    Postgres Extension to Postgres Single node Multi-node Drop old data quickly Parallel, distributed SQL/DDL/DML Partition/Shard Pruning Parallel autovacuum Better index cache hit ratios Automatic maintenance Time series apps (e.g. IOT) Multi-tenant SaaS apps
  40. Attribute Partitioning Citus Sharding Partition + Shard Combo Built into

    Postgres Extension to Postgres Single node Multi-node Drop old data quickly Parallel, distributed SQL/DDL/DML Partition/Shard Pruning Parallel autovacuum Better index cache hit ratios Automatic maintenance Time series apps (e.g. IOT) Multi-tenant SaaS apps
  41. Attribute Partitioning Citus Sharding Partition + Shard Combo Built into

    Postgres Extension to Postgres Single node Multi-node Drop old data quickly Parallel, distributed SQL/DDL/DML Partition/Shard Pruning Parallel autovacuum Better index cache hit ratios Automatic maintenance Time series apps (e.g. IOT) Multi-tenant SaaS apps
  42. Attribute Partitioning Citus Sharding Partition + Shard Combo Built into

    Postgres Extension to Postgres Single node Multi-node Drop old data quickly Parallel, distributed SQL/DDL/DML Partition/Shard Pruning Parallel autovacuum Better index cache hit ratios Automatic maintenance Time series apps (e.g. IOT) Multi-tenant SaaS apps
  43. Attribute Partitioning Citus Sharding Partition + Shard Combo Built into

    Postgres Extension to Postgres Single node Multi-node Drop old data quickly Parallel, distributed SQL/DDL/DML Partition/Shard Pruning Parallel autovacuum Better index cache hit ratios Automatic maintenance Time series apps (e.g. IOT) Multi-tenant SaaS apps
  44. Attribute Partitioning Citus Sharding Partition + Shard Combo Built into

    Postgres Extension to Postgres Single node Multi-node Drop old data quickly Parallel, distributed SQL/DDL/DML Partition/Shard Pruning Parallel autovacuum Better index cache hit ratios Automatic maintenance Time series apps (e.g. IOT) Multi-tenant SaaS apps
  45. 57 What is sharding? Partitioning + sharding together How partitioning

    & sharding are different Why partitioning helps query performance When partitioning helps query performance Why sharding helps query performance When sharding helps query performance Part 5 What is Postgres partitioning?
  46. 61 What is sharding? Partitioning + sharding together How partitioning

    & sharding are different Why partitioning helps query performance When partitioning helps query performance Why sharding helps query performance When sharding helps query performance Part 6 What is Postgres partitioning?
  47. When partitioning helps improve query performance Just like when you

    bake a cake, need essential ingredients to get the right result
  48. When partitioning helps improve query performance Only when you use

    a WHERE clause that prunes (excludes) partitions from the query
  49. 64 Remember that Taylor Swift concert example... Tracking monthly concert

    revenues... What if we wanted to know what the August revenue was from ticket sales + merch sales?
  50. 65 This EXPLAIN query is missing WHERE clause EXPLAIN (costs

    off) SELECT count(*) FROM concert_revenue;
  51. 66 So it has to scan all 3 of the

    partitions  EXPLAIN (costs off) SELECT count(*) FROM concert_revenue; QUERY PLAN --------------------------------------------------------------------- Aggregate -> Append -> Seq Scan on concert_revenue_cy2024m06 concert_revenue_1 -> Seq Scan on concert_revenue_cy2024m07 concert_revenue_2 -> Seq Scan on concert_revenue_cy2024m08 concert_revenue_3 (5 rows)
  52. 67 And doesn’t benefit from partition pruning  EXPLAIN (costs

    off) SELECT count(*) FROM concert_revenue; QUERY PLAN --------------------------------------------------------------------- Aggregate -> Append -> Seq Scan on concert_revenue_cy2024m06 concert_revenue_1 -> Seq Scan on concert_revenue_cy2024m07 concert_revenue_2 -> Seq Scan on concert_revenue_cy2024m08 concert_revenue_3 (5 rows)
  53. 68 This EXPLAIN query includes a WHERE clause EXPLAIN (costs

    off) SELECT count(*) FROM concert_revenue WHERE sale_date BETWEEN '2024-08-01' AND '2024-08-31';
  54. 69 And so it only needs to scan the Aug

    partition EXPLAIN (costs off) SELECT count(*) FROM concert_revenue WHERE sale_date BETWEEN '2024-08-01' AND '2024-08-31'; QUERY PLAN --------------------------------------------------------------------- ---------------------- Aggregate -> Seq Scan on concert_revenue_cy2024m08 concert_revenue Filter: ((sale_date >= '2024-08-01'::date) AND (sale_date <= '2024-08-31'::date)) (3 rows)
  55. 70 And so it only needs to scan the Aug

    partition EXPLAIN (costs off) SELECT count(*) FROM concert_revenue WHERE sale_date BETWEEN '2024-08-01' AND '2024-08-31'; QUERY PLAN --------------------------------------------------------------------- ---------------------- Aggregate -> Seq Scan on concert_revenue_cy2024m08 concert_revenue Filter: ((sale_date >= '2024-08-01'::date) AND (sale_date <= '2024-08-31'::date)) (3 rows)
  56. 71 What is sharding? Partitioning + sharding together How partitioning

    & sharding are different Why partitioning helps query performance When partitioning helps query performance Why sharding helps query performance When sharding helps query performance Part 7 What is Postgres partitioning?
  57. Why sharding can help query performance Let’s say this huge

    Redwood Tree is like a big Postgres table... More CPU, Memory, Disk
  58. Why sharding can help query performance Let’s say this huge

    Redwood Tree is like a big Postgres table... More CPU, Memory, Disk
  59. Why sharding can help query performance Let’s say this huge

    Redwood Tree is like a big Postgres table... Shard Pruning Logic
  60. Why sharding can help query performance Let’s say this huge

    Redwood Tree is like a big Postgres table... Shard Pruning Logic Co-Located Tables More CPU, Memory, Disk
  61. 77 What is sharding? Partitioning + sharding together How partitioning

    & sharding are different Why partitioning helps query performance When partitioning helps query performance Why sharding helps query performance When sharding helps query performance Part 8 What is Postgres partitioning?
  62. Common example: “Cache hit ratio” has been going down When

    sharding can help improve query performance When your application needs more cpu, memory, or disk than is possible on a single Postgres node CPU Memory Disk
  63. When sharding can help improve query performance CPU Memory Disk

    CPU Memory Disk CPU Memory Disk CPU Memory Disk CPU Memory Disk CPU Memory Disk CPU Memory Disk + + + + + CPU Memory Disk CPU Memory Disk CPU Memory Disk + + When your application benefits from more cpu, memory, or disk...
  64. 81 Takeaways about PG Partitioning & Sharding 1. Partitioning &

    Sharding can be “Invaluable” and “Lifesavers” 3. Planning required 2. You don’t have to pick between Partitioning & Sharding 4. Keys Matter
  65. 82 Takeaways about PG Partitioning & Sharding 1. Partitioning &

    Sharding can be “Invaluable” and “Lifesavers” 3. Planning required 2. You don’t have to pick between Partitioning & Sharding 4. Keys Matter
  66. 83 Takeaways about PG Partitioning & Sharding 1. Partitioning &

    Sharding can be “Invaluable” and “Lifesavers” 3. Planning required 2. You don’t have to pick between Partitioning & Sharding 4. Keys Matter
  67. 84 Takeaways about PG Partitioning & Sharding 1. Partitioning &

    Sharding can be “Invaluable” and “Lifesavers” 3. Planning required 2. You don’t have to pick between Partitioning & Sharding 4. Keys Matter
  68. People to thank for inspiration &or reviews Daniel Gustafsson David

    Rowley Isaac Alves Jelte Fennema-Nio Marco Slot Melanie Plageman Robert Treat Ryan Booz Thomas Munro Umur Cubukcu Charles Feddersen
  69. Have you heard of this podcast for developers who love

    Postgres?  Monthly  episodes  A-ma-zing guests  Recorded LIVE TalkingPostgres.com Hosted by Claire Giordano 19+
  70. with topics like....  Becoming expert in PostgreSQL  From

    dev to Postgres specialist  Journey into Performance benchmarking  My Journey into Postgres monitoring  How I got started as a developer  Why people care about PostGIS TalkingPostgres.com Hosted by Claire Giordano
  71. 92