Slide 1

Slide 1 text

Andrew Godwin @andrewgodwin GOOD SCHEMA DESIGN WHY IT MATTERS and

Slide 2

Slide 2 text

Andrew Godwin Core Developer Senior Engineer Author & Maintainer

Slide 3

Slide 3 text

Schemas Explicit & Implicit

Slide 4

Slide 4 text

Explicit PostgreSQL MySQL Oracle SQLite CouchDB MongoDB Redis ZODB Implicit

Slide 5

Slide 5 text

Explicit Schema ID int Name text Weight uint 1 2 3 Alice Bob Charles 76 84 65 Implicit Schema { "id": 342, "name": "David", "weight": 44, }

Slide 6

Slide 6 text

Explicit Schema Normalised or semi normalised structure JOINs to retrieve related data Implicit Schema Embedded structure Related data retrieved naturally with object

Slide 7

Slide 7 text

Silent Failure { "id": 342, "name": "David", "weight": 74, } { "id": 342, "name": "Ellie", "weight": "85kg", } { "id": 342, "nom": "Frankie", "weight": 77, } { "id": 342, "name": "Frankie", "weight": -67, }

Slide 8

Slide 8 text

Schemas inform Storage

Slide 9

Slide 9 text

PostgreSQL

Slide 10

Slide 10 text

Adding NULLable columns: instant But must be at end of table

Slide 11

Slide 11 text

CREATE INDEX CONCURRENTLY Slower, and only one at a time

Slide 12

Slide 12 text

Constraints after column addition This is more general advice

Slide 13

Slide 13 text

MySQL Locks whole table Rewrites entire storage No DDL transactions

Slide 14

Slide 14 text

Oracle / MSSQL / etc. Look into their strengths

Slide 15

Slide 15 text

Changing the Schema Databases aren't code...

Slide 16

Slide 16 text

You can't put your database in a VCS You can put your schema in a VCS But your data won't always survive.

Slide 17

Slide 17 text

Django Migrations Codified schema change format

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

Migrations aren't enough You can't automate away a social problem!

Slide 20

Slide 20 text

What if we got rid of the schema? That pesky, pesky schema.

Slide 21

Slide 21 text

The Nesting Problem { "id": 123, "name": "Andrew", "friends": [ {"id": 456, "name": "David"}, {"id": 789, "name": "Mazz"}, ], "likes": [ {"id": 22, "liker": {"id": 789, "name", "Mazz"}}, ], }

Slide 22

Slide 22 text

You don't have to use a document DB (like CouchDB, MongoDB, etc.)

Slide 23

Slide 23 text

Schemaless Columns ID int Name text Weight uint Data json 1 Alice 76 { "nickname": "Al", "bgcolor": "#ff0033" }

Slide 24

Slide 24 text

But that must be slower... Right?

Slide 25

Slide 25 text

Comparison (never trust benchmarks) Loading 1.2 million records PostgreSQL MongoDB 76 sec 8 min Sequential scan PostgreSQL MongoDB 980 ms 980 ms Index scan (Postgres GINhash) PostgreSQL MongoDB 0.7 ms 1 ms

Slide 26

Slide 26 text

Load Shapes

Slide 27

Slide 27 text

Read-heavy Write-heavy Large size

Slide 28

Slide 28 text

Read-heavy Write-heavy Large size Wikipedia TV show page Minecraft Forums Amazon Glacier Eventbrite Logging

Slide 29

Slide 29 text

Read-heavy Write-heavy Large size Offline storage Append formats In-memory cache Many indexes Fewer indexes

Slide 30

Slide 30 text

Your load changes over time Scaling is not just a flat multiplier

Slide 31

Slide 31 text

General Advice Write heavy → Fewer indexes Read heavy → Denormalise Keep large data away from read/write heavy data Blob stores/filesystems are DBs too

Slide 32

Slide 32 text

Lessons They're near the end so you remember them.

Slide 33

Slide 33 text

Re-evaulate as you grow Different things matter at different sizes

Slide 34

Slide 34 text

Adding NULL columns is great Always prefer this if nothing else

Slide 35

Slide 35 text

You'll need more than one DBMS But don't use too many, you'll be swamped

Slide 36

Slide 36 text

Indexes aren't free You pay the price at write/restore time

Slide 37

Slide 37 text

Relational DBs are flexible They can do a lot more than JOINing normalised tables

Slide 38

Slide 38 text

Thanks! Andrew Godwin @andrewgodwin eventbrite.com/jobs are hiring: