Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Honey, I Shrunk the Database (for Test and Deve...

Avatar for dbness dbness
October 07, 2011

Honey, I Shrunk the Database (for Test and Developer Environments)

Maybe you were one of the lucky customers to receive Mint.com's 6am "oops" emails from staging servers in October 2010, or perhaps you work with data regulated by HIPAA - either way, you know it's not cool to have copies of production data anywhere that's not production. We'll cover strategies to protect and appropriately slice your data for use in testing and development environments.

Some topics covered include:

- Establishing database environment requirements
- Using pg_dump and pg_restore utilities for subsets of data
- Horizontally slicing application data to preserve relations
- Sanitizing data for non-production environments
- Resetting your database internals once you've imported a variety of data

Avatar for dbness

dbness

October 07, 2011
Tweet

More Decks by dbness

Other Decks in Technology

Transcript

  1.  Accuracy   You don’t truly know how your app will

    behave in production unless you use real data.   Production data is the ultimate in accuracy.
  2.  Requirements   Freshness – Daily, On command for non-developers  

    Shrinkage – Slices, Mutations  Resources   Source – extra disk space, RAM, and CPUs   Destination – limited, often entirely un-optimized   Development -- constrained DBA resources
  3.  Vertical Slice   Difficult to obtain a valid, useful subset

    of data.   Example: Include some entire tables, exclude others
  4.  Vertical Slice   Difficult to obtain a valid, useful subset

    of data.   Example: Include some entire tables, exclude others  Horizontal Slice   Difficult to write and maintain.   Example: SQL or application code to determine subset of data
  5.   Flexibility at Source (Production)   pg_dump   Include data

    only [-a --data-only]   Include table schema only [-s --schema-only]   Select tables [-t table1 table2 --table table1 table2]   Select schemas [-n schema --schema=schema]   Exclude schemas [-N schema --exclude-schema=schema]
  6.   Flexibility at Destination (Staging, Development)   pg_restore   Include

    data only [-a --data-only]   Select indexes [-i index --index=index]   Tune processing [-j number-of-jobs --jobs=number-of-jobs]   Select schemas [-n schema --schema=schema]   Select triggers[-T trigger --trigger=trigger]   Exclude privileges [-x --no-privileges --no-acl]
  7.  External Data Protection   HIPAA Regulations   PCI Compliance  

    API Terms of Use  Internal Data Protection   Protecting your users’ personal data   Protecting your users from accidents, e.g. staging emails   Your Terms of Service
  8.  Composite Slice including Vertical Slice – All application object schemas

    Vertical Slice – Entire tables of static content Horizontal Slice – Subset of users and their data Mutation – Changed user email addresses
  9.  Composite Slice including Vertical Slice – All application object schemas

    pg_dump --clean --schema-only --schema public db-01 > slice.sql
  10.   Composite Slice including Vertical Slice – All application object

    schemas pg_dump --clean --schema-only --schema public db-01 > slice.sql Vertical Slice – Entire tables of static content pg_dump --data-only --schema public -t cards db-01 >> slice.sql
  11.   Composite Slice including Vertical Slice – All application object

    schemas pg_dump --clean --schema-only --schema public db-01 > slice.sql Vertical Slice – Entire tables of static content pg_dump --data-only --schema public -t cards db-01 >> slice.sql Horizontal Slice – Subset of users and their data Mutation – Changed user email addresses
  12.   Horizontal Slice   Custom SQL SELECT * INTO staging.users

    FROM users WHERE EXISTS (subset of users);   Dynamic relative to full data set or newly created slice SELECT * INTO staging.stuff FROM stuff WHERE EXISTS (stuff per staging.users);
  13.   Horizontal Slice   Custom SQL   Dynamic relative to

    full data set or newly created slice   Mutations   Email Addresses   Use regular expressions to clean non-admin addresses e.g. [email protected] => [email protected]   Cached Data   Clear cached short link from link-shortening API
  14.   Composite Slice including Vertical Slice – All application object

    schemas pg_dump --clean --schema-only --schema public db-01 > slice.sql Vertical Slice – Entire tables of static content pg_dump --data-only --schema public -t cards db-01 >> slice.sql Horizontal Slice – Subset of users and their data Mutation – Changed user email addresses pg_dump --data-only --schema staging db-01 >> slice.sql
  15.   Rebuild   Prepare new database as standby   Gracefully

    close connections   Rotate by renaming databases   Security   Dedicated database build user   Membership in application user role   Application user role & privileges remain
  16.   Rebuild   $ bzcat slice.sql.bz2 | psql db-new  

    Staging schema has not been created, so all data loads to default schema
  17.   We hacked our rebuild by importing across schemas!  

    Now our sequences are wrong, causing duplicate data errors every time we try to insert into tables.
  18. --Updates all serial sequences for ID columns only BEGIN FOR

    table_record IN SELECT pc.relname FROM pg_class pc WHERE pc.relkind = 'r' AND EXISTS (SELECT 1 FROM pg_attribute pa WHERE pa.attname = 'id' AND pa.attrelid = pc.oid) LOOP table_name = table_record.relname::text; EXECUTE 'SELECT setval(pg_get_serial_sequence(' || quote_literal (table_name) || ', ' || quote_literal('id')::text || '), MAX(id)) FROM ' || table_name || ' WHERE EXISTS (SELECT 1 FROM ' || table_name || ')'; END LOOP;
  19.   Rebuild   $ bzcat slice.sql.bz2 | psql db-new  

    Staging schema has not been created, so all data loads to default schema   echo “select 1 from update_id_sequences();” >> slice.sql   Vacuum   Reindex
  20.   Security   Database build user   CREATE DB privileges

      Member of Application user role   Application user remains database owner   Application user privileges remain limited   Build only works in predetermined environments
  21.  Requirements   Freshness – Daily, On command for non-developers  

    Shrinkage – Slices, Mutations  Resources   Source – extra disk space, RAM, and CPUs   Destination – limited, often entirely un-optimized   Development -- constrained DBA resources
  22.   Copies -- LVMSnapshots   See talk by Jon Erdman

    at PG Conf EU   Great for all reads   Data stays virtualized & doesn’t take up space until changed   Ideal for DDL changes without actual data changes
  23.   Copies, Slices -- pg_staging by dmitri http://github.com/dimitri/pg_staging   Simple

    -- pauses pgbouncer & restores backup   Efficient -- leverage bulk loading   Flexible -- supports varying psql files   Custom -- limited   Slices -- replicate by rtomayko of Github http://github.com/rtomayko/replicate   Simple - Preserves object relations via ActiveRecord   Inefficient -- Creates text-based .dump   Inflexible -- Corrupts id sequences on data insert   Custom -- highly