Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Schema Evolutions at Gilt Groupe

Schema Evolutions at Gilt Groupe

Although it took us several tries and a few years, we finally figured out how to manage schema evolution really well, providing:

Simple API for developers
Code review
Simple setup for each database
Deploy via tarballs
Integration into development and test environments (including subsets of safe data!)

This talk will cover how we actually manage schema for over a dozen unique postgreSQL databases in production at Gilt Groupe ranging from databases containing one table and several KB of data to the largest containing ~1000 tables and nearing a TB of data. We'll highlight what we've learned from previous attempts and some general thoughts on where we are headed from here.


Michael Bryzek

March 22, 2013


  1. Schema Evolutions how we do it at Gilt Groupe PGDay

    NYC 2013 March 22, 2013 Michael Bryzek michael@gilt.com @mbryzek http://tech.gilt.com
  2. Today's Talk •  Context / Scaling as teams grow • 

    Schema Evolution Manager (open source) •  Supporting QA and Development environments https://github.com/gilt/schema-evolution-manager
  3. It always starts small One app One database One schema

    (public) One user (superuser) Couple developers
  4. Schema Evolution In App "Great - don't need to worry

    about this"
  5. release app + schema at same time > cap production

    release ...upgrade db... ...deploy app... Easy (but not simple!)
  6. What user's see

  7. Luckily we have Rollback! up:   alter  table  users  drop

     column  email_address;   down: alter  table  users  add  column  email_address  text;  
  8. Oops... That table had DATA in production. Good thing we

    have reliable backups.
  9. Meanwhile...

  10. We then really try code review

  11. Works great at first

  12. If we're lucky, business grows. So does Database. Aren't we

    lucky to have this problem? Fake chart showing # of tables in 1 database over time
  13. Schema change commits dwarfed in volume

  14. What happened to app performance?

  15. None
  16. Some stats from gilt •  Primary DB > 1000 tables

    • pg_dump of schema is 6MB by itself • Largest tables > 1B rows historical data, 100s of million rows active data • 100+ developers need data stores
  17. How can we fix this?

  18. IDEA: Lots of Small Applications A core strategy at gilt

    (LOSA) •  "micro services" •  primarily solving the problem of how to scale a technical organization •  requires significant automation •  creates new problems, but we think they are easier to solve
  19. Treat Schema as 1st Class Code •  Lots of Small

    Databases •  Dedicated git repositories •  Code review by the right people •  Small incremental changes applied frequently •  Real releases - tags and artifacts
  20. So you want to add a table? •  One application

    / major feature •  One database / application •  One git repository / database Requires automation and good tools to manage
  21. Schema Evolution Manager (SEM) •  Manages schema evolutions as text

    files •  100% command line based •  Creates releases as simple .tar.gz files •  Minimal dependencies https://github.com/gilt/schema-evolution-manager Open sourced today!
  22. Typical workflow 1. New git repo (e.g. schema-users) 2. Write a sql

    script w/ incremental change 3. Create a release 4. Test the release 5. Deploy to production 6. Iterate 100% independent of application lifecycle
  23. None
  24. Create a migration = new.sql

  25. add script

  26. git status

  27. Apply to local db [snip]

  28. git commit

  29. Create a release

  30. Release to QA/Stage 1. scp dist/sample-0.0.2.tar.gz <server>:~/ 2. ssh <server> 3. tar xfvz

    sample-0.0.2.tar.gz 4. cd sample-0.0.2 5. sem-apply ... 6. Optional "--dry_run" flag
  31. Schema Evolution Manager (SEM) •  If useful, we'd love your

    feedback! https://github.com/gilt/schema-evolution-manager
  32. Multi step schema changes Only real challenge is managing multi

    step deploys across schema / application for changes
  33. Renaming a column Release 1: Deploy schema to add new

    column Release 2: Deploy app to use old + new Release 3: Deploy schema to drop old column Release 4: Deploy app to use only new More complicated, but each step is simpler
  34. Real benefits •  Clear team ownership of individual databases • 

    Much higher fidelity code reviews - standards and conventions easily enforced across databases and teams •  Each individual database is smaller and thus easier to manage (horizontal scale) •  ~100% of application deploys have simple rollback •  Schema changes rarely affect production application response time anymore.
  35. Overall Workflow 1.  Create your script 2.  sem-add new.sql 3. 

    git commit -m "New script to blah" 4.  git push origin HEAD:refs/for/master 5.  automated tests verify the change 6.  people code review 7.  author merges 8.  sem-dist 9.  deploy to QA 10.  deploy to Production
  36. https://code.google.com/p/gerrit/

  37. Some stats from Gilt 50 changes / month avg to

    largest DB at gilt 19 active production postgresql DBs today # of changes to initial db decreasing with LOSD
  38. Supporting QA / Development Envs •  Make available two types

    of databases for engineering: o  Sanitized (pre production, more data) o  Reference (unit tests, basic development) •  Reference created hourly •  Sanitized created daily For large databases, a complex process w/ lots of moving parts. Foreign keys are hard. For small databases, simple configuration.
  39. Creating a Snapshot DB •  White list of tables to

    extract data •  Optional standard SQL queries to extract data if partial data or PII •  Extract to CSV files using COPY •  One CSV file / table •  Configuration inside each schema's git repository •  Automated tests in jenkins •  Validated artifacts published
  40. Developer CLI rake  postgresql:create:single_database            

       [env,logical_name,type]   env: development, test, stage, etc. logical_name: gilt, ondemandinventory, etc. type: reference or sanitized Creates a new db on localhost with latest published data.
  41. Developer CLI rake  postgresql:create_queued_dbs   Background process that will fetch

    latest published database data and create local psql databases. When ready, user can: rake  postgresql:flip_queued_dbs  
  42. http://tech.gilt.com Thank you!