Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Good Schema Design and Why It Matters

Good Schema Design and Why It Matters

A talk I gave at DjangoCon Europe 2014.

Andrew Godwin

May 15, 2014
Tweet

More Decks by Andrew Godwin

Other Decks in Programming

Transcript

  1. Andrew Godwin
    @andrewgodwin
    GOOD SCHEMA DESIGN
    WHY IT MATTERS
    and

    View Slide

  2. Andrew Godwin
    Core Developer
    Senior Engineer
    Author & Maintainer

    View Slide

  3. Schemas
    Explicit & Implicit

    View Slide

  4. Explicit
    PostgreSQL
    MySQL
    Oracle
    SQLite
    CouchDB
    MongoDB
    Redis
    ZODB
    Implicit

    View Slide

  5. Explicit Schema
    ID int Name text Weight uint
    1
    2
    3
    Alice
    Bob
    Charles
    76
    84
    65
    Implicit Schema
    {
    "id": 342,
    "name": "David",
    "weight": 44,
    }

    View Slide

  6. Explicit Schema
    Normalised or semi normalised structure
    JOINs to retrieve related data
    Implicit Schema
    Embedded structure
    Related data retrieved naturally with object

    View Slide

  7. Silent Failure
    {
    "id": 342,
    "name": "David",
    "weight": 74,
    }
    {
    "id": 342,
    "name": "Ellie",
    "weight": "85kg",
    }
    {
    "id": 342,
    "nom": "Frankie",
    "weight": 77,
    }
    {
    "id": 342,
    "name": "Frankie",
    "weight": -67,
    }

    View Slide

  8. Schemas
    inform
    Storage

    View Slide

  9. PostgreSQL

    View Slide

  10. Adding NULLable columns: instant
    But must be at end of table

    View Slide

  11. CREATE INDEX CONCURRENTLY
    Slower, and only one at a time

    View Slide

  12. Constraints after column addition
    This is more general advice

    View Slide

  13. MySQL
    Locks whole table
    Rewrites entire storage
    No DDL transactions

    View Slide

  14. Oracle / MSSQL / etc.
    Look into their strengths

    View Slide

  15. Changing the Schema
    Databases aren't code...

    View Slide

  16. You can't put your database in a VCS
    You can put your schema in a VCS
    But your data won't always survive.

    View Slide

  17. Django Migrations
    Codified schema change format

    View Slide

  18. View Slide

  19. Migrations aren't enough
    You can't automate away a social problem!

    View Slide

  20. What if we got rid of the schema?
    That pesky, pesky schema.

    View Slide

  21. The Nesting Problem
    {
    "id": 123,
    "name": "Andrew",
    "friends": [
    {"id": 456, "name": "David"},
    {"id": 789, "name": "Mazz"},
    ],
    "likes": [
    {"id": 22, "liker": {"id": 789, "name", "Mazz"}},
    ],
    }

    View Slide

  22. You don't have to use a document DB
    (like CouchDB, MongoDB, etc.)

    View Slide

  23. Schemaless Columns
    ID int Name text Weight uint Data json
    1 Alice 76 { "nickname": "Al",
    "bgcolor": "#ff0033" }

    View Slide

  24. But that must be slower...
    Right?

    View Slide

  25. Comparison
    (never trust benchmarks)
    Loading 1.2 million records
    PostgreSQL
    MongoDB
    76 sec
    8 min
    Sequential scan
    PostgreSQL
    MongoDB
    980 ms
    980 ms
    Index scan (Postgres GINhash)
    PostgreSQL
    MongoDB
    0.7 ms
    1 ms

    View Slide

  26. Load Shapes

    View Slide

  27. Read-heavy
    Write-heavy Large size

    View Slide

  28. Read-heavy
    Write-heavy Large size
    Wikipedia
    TV show page
    Minecraft
    Forums
    Amazon Glacier
    Eventbrite
    Logging

    View Slide

  29. Read-heavy
    Write-heavy Large size
    Offline storage
    Append
    formats
    In-memory cache
    Many indexes
    Fewer indexes

    View Slide

  30. Your load changes over time
    Scaling is not just a flat multiplier

    View Slide

  31. General Advice
    Write heavy → Fewer indexes
    Read heavy → Denormalise
    Keep large data away from read/write heavy data
    Blob stores/filesystems are DBs too

    View Slide

  32. Lessons
    They're near the end so you remember them.

    View Slide

  33. Re-evaulate as you grow
    Different things matter at different sizes

    View Slide

  34. Adding NULL columns is great
    Always prefer this if nothing else

    View Slide

  35. You'll need more than one DBMS
    But don't use too many, you'll be swamped

    View Slide

  36. Indexes aren't free
    You pay the price at write/restore time

    View Slide

  37. Relational DBs are flexible
    They can do a lot more than JOINing normalised tables

    View Slide

  38. Thanks!
    Andrew Godwin
    @andrewgodwin
    eventbrite.com/jobs
    are hiring:

    View Slide