Why Zero-Downtime Migrations? ● Global service with users around the world ○ Users using the service 27/4 from different time zones ○ We want our service to always be available
Why Zero-Downtime Migrations? ● Global service with users around the world ○ Users using the service 27/4 from different time zones ○ We want our service to always be available ● Normal migration process locks table ○ We have a lot of user data (~300 GB on disk, biggest table has 500M rows) ○ Normal migration locks the whole table and could take service down
Why Zero-Downtime Migrations? ● Global service with users around the world ○ Users using the service 27/4 from different time zones ○ We want our service to always be available ● Normal migration process locks table ○ We have a lot of user data (~300 GB on disk, biggest table has 500M rows) ○ Normal migration locks the whole table and could take service down ● We want to be able to do continuous deployment ○ No “code freeze”, every merges to master goes to production right away ○ Normal migration will cause deployment disruption throughout the day
Percona Online Schema Change ● Also known as: pt-osc ● Part of Percona Toolkit (http://www.percona.com/software/) ○ Collection of advanced command-line tools used by Percona support staff — README.md
Percona Online Schema Change ● Also known as: pt-osc ● Part of Percona Toolkit (http://www.percona.com/software/) ○ Collection of advanced command-line tools used by Percona support staff — README.md ● Provide command to perform online schema change ○ Create new temporary table with updated schema ○ Copy data over to the new table ○ Drop old table, and rename temporary table to old table’s name (atomic)
Departure ● Hooks into Active Record's connection adapter syntax ○ No need to write custom migration! ● Construct arguments for pt-online-schema-change ○ Then runs pt-online-schema-change to perform the migration
One Morning in Slack ● A deploy to staging was going on for 3 hours ● Deployment was stuck at migration step ○ We tried to add a new column to a table that has about 100 million rows
Percona Online Schema Change ● Also known as: pt-osc ● Part of Percona Toolkit (http://www.percona.com/software/) ○ Collection of advanced command-line tools used by Percona support staff — README.md ● Provide command to perform online schema change ○ Create new temporary table with updated schema ○ Copy data over to the new table ○ Drop old table, and rename temporary table to old table’s name (atomic)
One Morning in Slack ● A deploy to staging was going on for 3 hours ● Deployment was stuck at migration step ○ We tried to add a new column to a table that has about 100M rows ○ It takes a long time to copy over data of 100M rows to the new table
One Morning in Slack ● A deploy to staging was going on for 3 hours ● Deployment was stuck at migration step ○ We tried to add a new column to a table that has about 100M rows ○ It takes a long time to copy over data of 100M rows to the new table ● Deployment canceled, SRE team took over the database migration ○ SRE team runs pt-online-schema-change directly on the server
InnoDB Online DDL Operations ● Available from MySQL 5.6+ ○ We are using Amazon RDS for MySQL 5.6+ ● Various operations can be performed online ○ CREATE INDEX, DROP INDEX ○ ADD COLUMN, DROP COLUMN* ○ Rename column, reorder column* ○ SET DEFAULT, DROP DEFAULT, change column nullability ○ Set AUTO_INCREMENT* ○ Add/remove foreign key ○ RENAME TABLE*
InnoDB Online DDL Operations ● Available from MySQL 5.6+ ○ We are using Amazon RDS for MySQL 5.6+ ● Various operations can be performed online ○ CREATE INDEX, DROP INDEX, ADD COLUMN, DROP COLUMN, … ● Some DDL queries need to be modified to run online
InnoDB Online DDL Operations ALTER TABLE tbl_name ALTER COLUMN col SET DEFAULT literal; ALTER TABLE tbl_name ALTER COLUMN col SET DEFAULT literal, ALGORITHM=INPLACE, LOCK=NONE;
Column Operations ● ADD COLUMN, DROP COLUMN, rename column, change default value ● Can’t run directly as Rails’ generated SQL locks table ● Open new issue with SRE team, asking to run migration DDL with ALGORITHM=INPLACE, LOCK=NONE; ○ Then, perform INSERT into schema_migrations with migration version directly
Online DDL Operations are faster, but ... ● We now require SRE’s help to perform migration tasks ● Too many steps to perform by human ○ Open PR, get it approved ○ Open ticket with SRE with online SQL command to run ○ SRE has to run command in MySQL console, has to manually insert migration version ○ Merges PR
Online DDL Operations are faster, but ... ● We now require SRE’s help to perform migration tasks ● Too many steps required to perform by human ○ Open PR, get it approved ○ Open ticket with SRE with online SQL command to run ○ SRE has to run command in MySQL console, has to manually insert migration version ○ Merges PR ○ Mistake can happen if not done properly
Improvements ● Create Departure-like gem to hook into Active Record’s Migration ○ Generate SQL that tells MySQL to perform non-locking DDL operations ○ Migration runs automatically during deployment just like normal migrations ○ No need to ask for SRE help
Improvements ● Create Departure-like gem to hook into Active Record’s Migration ○ Generate SQL that tells MySQL to perform non-locking DDL operations ○ Migration runs automatically during deployment just like normal migrations ○ No need to ask for SRE help ● Separate migration deploy from code deployment ○ Prevent blocking code deployment to production ○ Deploy database migration first on a separate deploy process ■ Mark that migration as finished, so migration in code PR will be no-op ○ Needs to write code such that it still works before and after database migration ■ Multiple steps process for renaming column/table.
Alternatives ● Use external tool to manage database schema instead ○ Ridgepole (https://github.com/winebarrel/ridgepole) ○ Store database schema in a separate database ○ Database migrations can be apply independently from code changes