Schema Evolutions at Gilt Groupe

Schema Evolutions how we do it at Gilt Groupe PGDay
NYC 2013 March 22, 2013 Michael Bryzek [email protected] @mbryzek http://tech.gilt.com

Today's Talk •  Context / Scaling as teams grow • 
Schema Evolution Manager (open source) •  Supporting QA and Development environments https://github.com/gilt/schema-evolution-manager

It always starts small One app One database One schema
(public) One user (superuser) Couple developers

Schema Evolution In App "Great - don't need to worry
about this"

release app + schema at same time > cap production
release ...upgrade db... ...deploy app... Easy (but not simple!)

What user's see

Luckily we have Rollback! up: alter table users drop
column email_address; down: alter table users add column email_address text;

Oops... That table had DATA in production. Good thing we
have reliable backups.

Meanwhile...

We then really try code review

Works great at first

If we're lucky, business grows. So does Database. Aren't we
lucky to have this problem? Fake chart showing # of tables in 1 database over time

Schema change commits dwarfed in volume

What happened to app performance?

Some stats from gilt •  Primary DB > 1000 tables
• pg_dump of schema is 6MB by itself • Largest tables > 1B rows historical data, 100s of million rows active data • 100+ developers need data stores

How can we fix this?

IDEA: Lots of Small Applications A core strategy at gilt
(LOSA) •  "micro services" •  primarily solving the problem of how to scale a technical organization •  requires significant automation •  creates new problems, but we think they are easier to solve

Treat Schema as 1st Class Code •  Lots of Small
Databases •  Dedicated git repositories •  Code review by the right people •  Small incremental changes applied frequently •  Real releases - tags and artifacts

So you want to add a table? •  One application
/ major feature •  One database / application •  One git repository / database Requires automation and good tools to manage

Schema Evolution Manager (SEM) •  Manages schema evolutions as text
files •  100% command line based •  Creates releases as simple .tar.gz files •  Minimal dependencies https://github.com/gilt/schema-evolution-manager Open sourced today!

Typical workflow 1. New git repo (e.g. schema-users) 2. Write a sql
script w/ incremental change 3. Create a release 4. Test the release 5. Deploy to production 6. Iterate 100% independent of application lifecycle

Create a migration = new.sql

add script

git status

Apply to local db [snip]

git commit

Create a release

Release to QA/Stage 1. scp dist/sample-0.0.2.tar.gz <server>:~/ 2. ssh <server> 3. tar xfvz
sample-0.0.2.tar.gz 4. cd sample-0.0.2 5. sem-apply ... 6. Optional "--dry_run" flag

Schema Evolution Manager (SEM) •  If useful, we'd love your
feedback! https://github.com/gilt/schema-evolution-manager

Multi step schema changes Only real challenge is managing multi
step deploys across schema / application for changes

Renaming a column Release 1: Deploy schema to add new
column Release 2: Deploy app to use old + new Release 3: Deploy schema to drop old column Release 4: Deploy app to use only new More complicated, but each step is simpler

Real benefits •  Clear team ownership of individual databases • 
Much higher fidelity code reviews - standards and conventions easily enforced across databases and teams •  Each individual database is smaller and thus easier to manage (horizontal scale) •  ~100% of application deploys have simple rollback •  Schema changes rarely affect production application response time anymore.

Overall Workflow 1.  Create your script 2.  sem-add new.sql 3. 
git commit -m "New script to blah" 4.  git push origin HEAD:refs/for/master 5.  automated tests verify the change 6.  people code review 7.  author merges 8.  sem-dist 9.  deploy to QA 10.  deploy to Production

https://code.google.com/p/gerrit/

Some stats from Gilt 50 changes / month avg to
largest DB at gilt 19 active production postgresql DBs today # of changes to initial db decreasing with LOSD

Supporting QA / Development Envs •  Make available two types
of databases for engineering: o  Sanitized (pre production, more data) o  Reference (unit tests, basic development) •  Reference created hourly •  Sanitized created daily For large databases, a complex process w/ lots of moving parts. Foreign keys are hard. For small databases, simple configuration.

Creating a Snapshot DB •  White list of tables to
extract data •  Optional standard SQL queries to extract data if partial data or PII •  Extract to CSV files using COPY •  One CSV file / table •  Configuration inside each schema's git repository •  Automated tests in jenkins •  Validated artifacts published

Developer CLI rake postgresql:create:single_database
[env,logical_name,type] env: development, test, stage, etc. logical_name: gilt, ondemandinventory, etc. type: reference or sanitized Creates a new db on localhost with latest published data.

Developer CLI rake postgresql:create_queued_dbs Background process that will fetch
latest published database data and create local psql databases. When ready, user can: rake postgresql:flip_queued_dbs

http://tech.gilt.com Thank you!

Schema Evolutions at Gilt Groupe

Schema Evolutions at Gilt Groupe

More Decks by Michael Bryzek

Other Decks in Programming

Featured

Transcript