Slide 1

Partitioning Billions Of Rows Without Downtime
Andrew Atkinson
#PartPG2023 | bit.ly/PartPG2023

Slide 2


Slide 3

Audience Polling
#1: Who has used table partitioning in production in any capacity?
#2: Who has converted a table from unpartitioned to partitioned?
#3: Who has done an online conversion with pgslice or similar?

Slide 4

Assumptions
The audience has some familiarity with what PostgreSQL table partitioning is.
Background reading:
- Crunchy Playground: Partitioning
- Large Table Partitioning with PostgreSQL and Django
- How to use table partitioning to scale PostgreSQL
- A slice of life: table partitioning in PostgreSQL databases

Slide 5

Learner Outcomes
- PostgreSQL table partitioning: types, history, and a case study review
- Online table conversion and swap, with practical details
- Lessons from our challenges and mistakes
- How partitioned tables help with cost management and data lifecycle management

Slide 6

App Background and Context
- HR tech app: SaaS, B2B, monolithic (~100K LOC, >90% test coverage)
- Running Ruby on Rails 6.1
- Using the public_activity Ruby gem (3K GitHub stars)
- High-growth activities table (not partitioned)

Slide 7

- Event log for applicants in their hiring process
- Events are visible to hiring managers and recruiters
- A lot of data: denormalized YAML. There can be around 40 activities per applicant, and there are millions of applicants
- Errors in this area of the app are highly visible to users

Slide 8


Slide 9

Source: public_activity

Slide 10

PostgreSQL 13.x
- ~10 production databases, each with an activities table (sizes from 50 GB to 1.5 TB)
- Mix of regular RDS PostgreSQL and Aurora
- Sum of rows across the ~10 DBs: ~3 billion rows and ~2 TB of data
- PK column is id, a bigint using a sequence

Slide 11

Problems
- The activities table was very large, so modifications like adding indexes were painful
- Range queries can time out
- Despite 8 years of data going back to 2015, users interact mostly with recent data, so most data is never accessed

Slide 12

- We'd like to stop paying ever-increasing costs to store data that's not accessed
- We'd like to reduce overall storage costs, including consumption, backups, and transfers
- We'd like to modify the table easily

Slide 13

Considering Table Partitioning
- What's the minimum table size to benefit from partitioning?
- Postgres.fm recommends >= 100 GB
- PG docs guideline: "table size exceeds server memory size" (1.25 TB > ~384 GB)

Slide 14

Isn't Partitioning Too Much Work?
- The old method is a lot of work: inheritance-based, a.k.a. "trigger-based" partitioning
- Requires functions, triggers, and CHECK constraints to INSERT data into the correct spot
- Uses table inheritance, via the INHERITS keyword
- Considered a "legacy method" (see the sketch below)
Sources: #1
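
A minimal sketch of the legacy inheritance-based approach, following the pattern in the PostgreSQL docs (table and trigger names are illustrative):

-- Parent table holds no rows itself
CREATE TABLE activities (id bigint, created_at timestamp, data text);

-- Each child table INHERITS and enforces its range with a CHECK constraint
CREATE TABLE activities_2023_01 (
  CHECK (created_at >= '2023-01-01' AND created_at < '2023-02-01')
) INHERITS (activities);

-- A trigger function routes INSERTs to the correct child table
CREATE OR REPLACE FUNCTION activities_insert_trigger() RETURNS trigger AS $$
BEGIN
  IF NEW.created_at >= '2023-01-01' AND NEW.created_at < '2023-02-01' THEN
    INSERT INTO activities_2023_01 VALUES (NEW.*);
  ELSE
    RAISE EXCEPTION 'No partition for created_at %', NEW.created_at;
  END IF;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER activities_insert BEFORE INSERT ON activities
  FOR EACH ROW EXECUTE FUNCTION activities_insert_trigger();

Declarative partitioning (coming up) replaces all of this with built-in syntax.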

Slide 15

How To Get Started?
Engineering's job is to build a reliable and scalable SaaS platform.
1. Form hypotheses about acceptable changes for customers
2. Test and validate the hypotheses

Slide 16

Hypothesis: 2 years of data is more than enough
- Added a feature flag to LIMIT queries, to test the hypothesis
- Ran that as a scream test for a few weeks
- After the weeks passed without concerns being raised, we moved forward with partitioning

Slide 17

Partition Strategy
- Use the RANGE type
- Partition on the created_at timestamp column
- Monthly partitions (see the SQL sketch below)
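
A minimal sketch of this strategy as declarative DDL (column list simplified; note that a primary key on a partitioned table must include the partition key):

-- Parent table, partitioned by month on created_at
CREATE TABLE activities (
  id bigserial,
  created_at timestamp NOT NULL,
  PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

-- One child table per month
CREATE TABLE activities_202304 PARTITION OF activities
  FOR VALUES FROM ('2023-04-01') TO ('2023-05-01');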

Slide 18

Declarative Partitioning
- Launched in version 10 (2017)
- LIST and HASH (11) types exist; we used RANGE
- Slices data from one big table into parent and child tables, with less to do
- No triggers, functions, or CHECK constraints needed

Slide 19

Declarative Partitioning Improvements
- 10: Introduction
- 11: Added DEFAULT partition, HASH type
- 11: Support for PK and FK constraints, indexes, triggers
- 11: INSERT routed to the correct partition, UPDATE moves rows between partitions
- 12: FKs can reference partitioned tables
Sources: #1, #2

Slide 20

Declarative Partitioning Improvements
- 12: ATTACH PARTITION without blocking concurrent queries
- 13: Row-level BEFORE triggers on partitioned tables
- 13: Logical replication of partitioned tables
- 14: DETACH PARTITION supports CONCURRENTLY
- 14: More operations use partition pruning
- 15: Vacuum partitions in parallel
Source: Percona

Slide 21

Partitioning Options
- pg_partman (1.5k GitHub stars), paired with pg_cron (1.5k)
- pg_party (400)
- No libraries or extensions, roll your own: John Nunemaker blog, Brandur blog
- pgslice (405)

Slide 22

Selected: pgslice
- CLI program written in Ruby, the language used by the team; open source
- Small and minimal library
- Non-destructive, row-copying approach
- Iterative rollout, with the ability to step forward and backward
- Leaves the unmodified original table behind as retired

Slide 23

pgslice Quick Tour
- prep: creates the _intermediate table
- add_partitions: adds past and future partitions
- fill: batched row copies (we used the --where option to limit to 2 years)
- analyze: performs SQL ANALYZE on all partitions
- swap: swaps the table names
(see the command sketch below)
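
The end-to-end flow looks roughly like this; the partition counts and the --where date here are illustrative, not our exact values:

export PGSLICE_URL=postgres://user:password@host:5432/db_name
pgslice prep activities created_at month   # creates activities_intermediate
pgslice add_partitions activities --intermediate --past 24 --future 3
pgslice fill activities --where "created_at >= '2021-04-01'" --batch-size 10000 --sleep 1
pgslice analyze activities
pgslice swap activities                    # activities -> activities_retired, activities_intermediate -> activities
pgslice fill activities --swapped          # copy rows that arrived during the conversion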

Slide 24

pgslice Implementation
- Added the gem to the Gemfile
- Created a binstub with bundle binstubs pgslice, i.e. bin/pgslice
- Added a helper Ruby class (PgsliceHelper) to help with app integration
- Invoked the helper with rails runner

Slide 25

Downsides of pgslice and Row Copying
- Intended for append-only tables, i.e. no UPDATE or DELETE statements during the copy
- Copying rows requires additional storage space
- The online migration adds load to the server, so spare capacity is needed
- Requires a lot of up-front development and testing

Slide 26

Managing Partitions in Prod and Dev
- We wanted developers to use a partitioned table to discover bugs or performance problems, but not have to manage partitions
- In prod, we wanted 2 years of partitions
- Dev doesn't need partitions at all

Slide 27

Selected: DEFAULT Partition (11+)
- Keeps the table partitioned for dev
- Avoids partition maintenance in dev
- A nice middle ground (see the sketch below)
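
A minimal sketch of the DEFAULT partition, which catches rows that match no range partition (table name illustrative):

-- Any row outside the defined monthly ranges lands here
CREATE TABLE activities_default PARTITION OF activities DEFAULT;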

Slide 28

Partition Management
- In prod, a cron job creates future partitions (monitor this!)
- In dev, the DEFAULT partition info is dumped to db/structure.sql
- A cron job could also DETACH partitions to cycle out old ones (see the sketch below)
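
A sketch of cycling out an old month; the partition name is illustrative, and DETACH ... CONCURRENTLY requires PostgreSQL 14+ (on 13, use a plain DETACH PARTITION):

-- Detach the oldest monthly partition, then archive and drop it
ALTER TABLE activities DETACH PARTITION activities_202104 CONCURRENTLY;
DROP TABLE activities_202104;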

Slide 29

Source: GitHub Example

Slide 30

Source: GitHub Example

Slide 31

Implementation Notes
- Added retire/unretire of the DEFAULT partition
- Added customization of the schema
- Configured the where clause to fill only 2 years
- Added dump (pg_dump) and DROP support for future partition management
Source: andyatkinson/rideshare/pgslice

Slide 32

Rails and CI Integration
- No Rails integration
- No migrations for the _intermediate table or DEFAULT partition
- No automatic partitions or cron guidance (we used whenever)
- No test or CI guidance

Slide 33

test:
  ...
  - image: cimg/ruby:3.2.0
  - image: cimg/postgres:13.6
  environment:
    PGSLICE_URL: "postgres://root@localhost:5432/rideshare_test"
  steps:
    ...
    - run:
        name: Database setup
        command: bundle exec rails db:schema:load --trace
    - run:
        name: Partition conversion
        command: sh bin/partition_conversion.sh
    - run:
        name: run tests
        command: bin/rails test

Slide 34

Find Bugs Using the Test Suite
- We should have performed the partition conversion in the CI database from the beginning
- Code compatibility was not as "seamless" as expected with the partitioned table
- Surprises included UPDATEs of records and YAML-serialized data in a column

Slide 35


Slide 36

Code Compatibility Problems
- PRIMARY KEY changes
- activerecord-import did not work
- Rails needed to know how to find the id
- Serialized content in a text column required careful escaping

Slide 37

Code Compatibility Fixes
- Explicit primary key configuration: self.primary_key = :id (dual support)
- Replaced activerecord-import with .insert_all() for multi-row inserts
- Set the unique_by: [] option for the .insert_all() ON CONFLICT clause, adding a UNIQUE constraint/index (see the sketch below)
- Added created_at to INSERT statements
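
On a partitioned table, a UNIQUE constraint or index must include the partition key column. A sketch with an illustrative index name (note: CREATE INDEX CONCURRENTLY is not supported on a partitioned parent):

-- id alone can no longer be UNIQUE; the partition key created_at must be included
CREATE UNIQUE INDEX index_activities_on_id_and_created_at
  ON activities (id, created_at);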

Slide 38

Fixing Test Code
- Tests write activities records as the base data. Unfortunately, they hard-coded historical dates, so INSERTs failed due to a missing partition
- Updated dozens of tests to not use hard-coded dates, e.g. used Time.current

Slide 39

Ready To Go: General Steps
1. The intermediate table is ready to fill
2. Add env vars like PGSLICE_URL
3. Run the steps one by one at an off-peak time, coordinating with the team

Slide 40


Slide 41

Run pgslice fill in a tmux session, detach, and monitor. The batched copy runs statements like:

INSERT INTO "activities_intermediate"
SELECT * FROM "activities"
WHERE id > 0 AND id < 10000

Options: --batch-size, --where, --sleep

Slide 42

pgslice analyze
pgslice swap

-- swap renames the tables in a transaction, roughly:
BEGIN;
ALTER TABLE "activities" RENAME TO "activities_retired";
ALTER TABLE "activities_intermediate" RENAME TO "activities";
COMMIT;

Slide 43

Customer Accommodation
- Out of hundreds of customers, one wanted older data
- The solution was to add more partitions and fill from further back for that customer
- Added a customer identifier column to use with pgslice and --where "customer_id = 123"
- Added a partial index and populated the column

Slide 44

ALTER TABLE activities ADD COLUMN account_id INTEGER;
ALTER TABLE activities_intermediate ADD COLUMN account_id INTEGER;

CREATE SCHEMA temp;
CREATE TABLE temp.customer (
  id SERIAL PRIMARY KEY,
  applicant_id BIGINT UNIQUE, -- JOIN on activities.owner_id
  account_id INTEGER);

-- fill temp.customer
INSERT INTO temp.customer (applicant_id, account_id)
SELECT id AS applicant_id, account_id
FROM <source_table> -- placeholders for the redacted table and account
WHERE account_id = <account_id> AND id > 0 AND id < 10000;

Slide 45

pgslice: Multiple Fill Passes
- pgslice only does INSERT statements
- It stops when it finds duplicates, so it can't make multiple passes
- The solution was to TRUNCATE some recent partitions, then fill again from there (see the sketch below)
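
A sketch of that reset, with illustrative partition names:

-- Clear the most recent monthly partitions, then run pgslice fill again
TRUNCATE TABLE activities_202303, activities_202304;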

Slide 46

Beware of the Schema Cache
- Following pgslice swap, the running app had a stale Active Record schema cache
- The solution was to restart the app pods, but also to catch the exception and force a reset of the schema cache while the rolling restart completed

Slide 47

Speeding Up Row Copies
- DROP all or most indexes from the intermediate table
- Set the table as unlogged: ALTER TABLE activities_intermediate SET UNLOGGED;
- When copying is done, add the indexes back manually (CREATE INDEX, matching the source table's columns)
- Set the table back to LOGGED

Slide 48

Analyzing Results
With only 2 years of data, the replacement tables were much smaller:
- 215 GB to 39 GB: an 82% decrease
- 163 GB to 45 GB: a 72% decrease
- 1500 GB to 500 GB: a 67% decrease
Query: relation_size_extended.sql

Slide 49

AWS PostgreSQL Costs and Savings
- RDS PostgreSQL and Aurora costs are different
- Storage and IOPS are combined for RDS
- General Purpose SSD: $0.115 per GB-month; backups: $0.095 per GB-month; also snapshot exports and transfers
- S3: $0.025 per GB-month
- 1 TB comparison: RDS $115 vs. S3 $25, 78% less

Slide 50

AWS RDS Cost Reduction
- You pay for what's provisioned, not for I/O used
- Storage auto-scaling (2019): you keep paying
- Even with a space reduction, you still pay for what's provisioned
- Not "easy" to pay less: you'd need a new instance, and less space comes with fewer IOPS
Source: CloudZero

Slide 51

Aurora Cost Reduction
- "You can reduce storage charges by dropping tables, indexes, databases, and so on that you no longer need." (10 GB increments)
- Set up partitioning, plus regularly archive old partitions
- $0.10 per GB-month; 1 TB for 1 month = $100
- Perhaps most useful as a "cost control mechanism"
Source: #1

Slide 52


Slide 53

Cleaning Up
- Remove transitional code
- Remove feature flagging
- Document the steps for team members
- Add monitoring for partition management
- What about the _retired table?

Slide 54

Unlocked Capabilities
- The retired table can now be archived to low-cost storage like S3
- Old partitions can be automatically detached and archived
- By regularly archiving old partitions, overall growth slows and may flatten
- Steps like DETACH CONCURRENTLY, pg_dump, and DROP TABLE can all be automated

Slide 55

# --format=c is the "custom" format, easy to pg_restore
pg_dump \
  --format=c \
  --table activities_retired \
  --file activities_retired.dump \
  $PGSLICE_URL

psql -c "DROP TABLE activities_retired" $PGSLICE_URL

Docs: pg_dump

Slide 56

Archive to S3

SELECT aws_commons.create_s3_uri(
  '',          -- bucket
  '',          -- file path
  'us-west-2'  -- region
) AS s3_uri_1 \gset

Sources: AWS Exports
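
The export itself can then use the aws_s3 extension; a sketch assuming that extension is installed and the URI above was captured with \gset:

-- Export the retired table's rows to the S3 location
SELECT * FROM aws_s3.query_export_to_s3(
  'SELECT * FROM activities_retired',
  :'s3_uri_1'
);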

Slide 57

Summary
- Partitioned a 2-billion-row table using pgslice, without planned downtime
- With tables smaller by up to 80%, modifications, queries, and maintenance may be faster
- Regularly exporting unneeded data is now possible
- Helped improve our data lifecycle management practices

Slide 58

SQL Goodie Bag To Go
- Find the tables: 10 largest tables
- Find append-onlys: number of DMLs (append only)
- Index restore assistance: list CREATE INDEX statements
- Partition row counts: list partition row counts
- Partition size estimates: estimated sizes for partitions following a pattern
(one example below)
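
As a taste, a "10 largest tables" query is typically a variation of this standard catalog query (not necessarily the exact file from the goodie bag):

-- Top 10 relations by total size (heap + indexes + TOAST)
SELECT c.relname AS table_name,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname NOT IN ('pg_catalog', 'information_schema')
  AND c.relkind = 'r'
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 10;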

Slide 59

Special Recognition!
- PG Day Chicago organizers for accepting this talk!
- Team members who supported this project
- Andrew, the creator of pgslice, and a reviewer!
- Reviewers: Haki, Ryan, Elizabeth

Slide 60

Thanks!
andyatkinson.com/contact
#PartPG2023 | bit.ly/PartPG2023