Scaling with Django

Scaling the database of a Django application.

Ash Christopher

August 17, 2012

Transcript

  1. SCALE
    (and how to do it with Django)
    @ashchristopher
    Friday, 17 August, 12

  2. Scalability
    (What is it?)

  3. Scalability != Performance

  7. The Difference
    Performant:
    from django.http import HttpResponse

    def view(request):
        return HttpResponse('Hello World')

    ./manage.py run_gunicorn --workers=1

    Scalable:
    from time import sleep
    from django.http import HttpResponse

    def view(request):
        sleep(10)  # simulate a slow request
        return HttpResponse('Hello World')

    ./manage.py run_gunicorn --workers=10
    Yes, I realize this is a contrived example
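
To put rough numbers on the contrast, here is a minimal sketch that assumes synchronous workers, each serving one request at a time. `max_throughput` is a hypothetical helper, and the 10 ms latency for the fast view is an assumed figure, not one from the talk.

```python
def max_throughput(workers, seconds_per_request):
    """Upper bound on requests/second when each worker serves one request at a time."""
    return workers / seconds_per_request


# Fast view, one worker (10 ms per request is an assumed figure).
fast_single = max_throughput(workers=1, seconds_per_request=0.01)

# Slow view (sleep(10)), ten workers, as on the slide.
slow_ten = max_throughput(workers=10, seconds_per_request=10.0)

# Doubling the workers doubles the ceiling; per-request latency is unchanged.
slow_twenty = max_throughput(workers=20, seconds_per_request=10.0)

print(fast_single, slow_ten, slow_twenty)
```

The first setup is fast per request but cannot grow past one worker's ceiling; the second is slow per request, yet its capacity grows in step with the worker count.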

  8. What to scale...
    Application
    Database

  9. Focus on the Database
    (because it’s non-trivial)

  10. Two main reasons to scale your database

  11. Write Performance
    There is a limit to how fast
    data can be written to disk

  13. Massive Volumes of Data
    A database can only
    store so much...

  14. Before you scale, you need to know what to scale

  15. Collect Metrics
    What parts of your system are growing?
    How fast is your data growing?
    How long will your current infrastructure last?
    What would cause your growth rate to increase?
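
The question "How long will your current infrastructure last?" can be roughed out from two of these metrics. This is a minimal sketch assuming steady compound daily growth; `days_until_full` and the sample numbers are hypothetical, not from the talk.

```python
import math


def days_until_full(current_gb, capacity_gb, daily_growth_rate):
    """Days until the data reaches capacity, assuming compound daily growth."""
    if current_gb >= capacity_gb:
        return 0.0
    if daily_growth_rate <= 0:
        return math.inf  # not growing: capacity is never reached
    return math.log(capacity_gb / current_gb) / math.log(1 + daily_growth_rate)


# Example: 100 GB today, 400 GB of usable disk, growing 1% per day.
print(round(days_until_full(100, 400, 0.01)))
```

A sudden change in the growth rate (the last question on the slide) shortens this estimate quickly, so it is worth recomputing as new metrics arrive.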

  16. Analyze Data Models
    (Analyze Relationships between Models)
    [Diagram: "Easier to Scale" vs. "Harder to Scale"]

  17. You’re Ready to Start Scaling!

  18. Scale Up Before Out

  21. Invest in Faster Hardware
    More RAM
    SSD hard drives

  22. Outrun The Problem

  23. You Will Hit a Limit
    Limits of MySQL
    32 CPUs
    256 GB RAM

  24. Functional Partitioning
    (aka. Feature Partitioning)
    (aka. Vertical Partitioning)
    (...)

  25. Functional Partitioning
    [Diagram: the Internet, Users, Posts, Comments]

  26. More Databases
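    import dj_database_url

    # config() reads the DATABASE_URL environment variable unless env= is given;
    # default= is used only when that variable is unset.
    DATABASES = {
        'default': dj_database_url.config(default='mysql://localhost/default'),
        'posts': dj_database_url.config(default='mysql://localhost/posts'),
        'comments': dj_database_url.config(default='mysql://localhost/comments'),
        # ...
    }
    Recommended: dj-database-url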

  31. Django Routers
    class SimpleRouter(object):
        def db_for_read(self, model, **hints):
            # return a database alias, or None to defer to the next router
            return None

        def db_for_write(self, model, **hints):
            # return a database alias, or None
            return None

        def allow_relation(self, obj1, obj2, **hints):
            # return True, False or None
            return None

        def allow_syncdb(self, db, model):
            # return True, False or None
            # (renamed allow_migrate in Django 1.7)
            return None
    ‣ Split data across different databases
    ‣ Routing happens automatically
    ‣ Easy to stub in
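
A filled-in router for functional partitioning might look like the sketch below. It is not the speaker's router: `FeatureRouter`, `APP_TO_DB` and the app labels are hypothetical, and it uses the `allow_migrate(db, app_label, model_name=None, **hints)` signature from current Django rather than the 2012-era `allow_syncdb(db, model)`. Stand-in objects replace real model classes so the sketch runs without a Django project; in a project the class would be listed in the `DATABASE_ROUTERS` setting.

```python
from types import SimpleNamespace

# Hypothetical mapping from Django app label to database alias.
APP_TO_DB = {"posts": "posts", "comments": "comments"}


class FeatureRouter:
    """Route each app's models to its own database; everything else falls through."""

    def _db_for(self, model):
        # None means "no opinion": Django asks the next router, then uses 'default'.
        return APP_TO_DB.get(model._meta.app_label)

    def db_for_read(self, model, **hints):
        return self._db_for(model)

    def db_for_write(self, model, **hints):
        return self._db_for(model)

    def allow_relation(self, obj1, obj2, **hints):
        # Only allow relations between objects stored in the same database.
        db1 = APP_TO_DB.get(obj1._meta.app_label, "default")
        db2 = APP_TO_DB.get(obj2._meta.app_label, "default")
        return db1 == db2

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Create each app's tables only in its own database.
        return APP_TO_DB.get(app_label, "default") == db


# Stand-ins for model classes so the sketch runs without Django installed.
Post = SimpleNamespace(_meta=SimpleNamespace(app_label="posts"))
User = SimpleNamespace(_meta=SimpleNamespace(app_label="auth"))

router = FeatureRouter()
print(router.db_for_read(Post), router.db_for_read(User))
```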

  32. Remove ForeignKeys
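    Before:
    from django.db import models

    class Post(models.Model):
        text = models.TextField()

    class Comment(models.Model):
        post = models.ForeignKey('Post', on_delete=models.CASCADE)

    After:
    class Post(models.Model):
        text = models.TextField()

    class Comment(models.Model):
        # plain integer holding the Post's id; no database-level relation
        post_id = models.PositiveIntegerField()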

  33. Partitioning Isn’t Free
    ‣ No more ForeignKeys
    ‣ No more select_related()
    ‣ No more prefetch_related()
    ‣ More database calls*
    ‣ Lose the Django Admin
    ‣ No more cascading deletes
    ‣ Transactional test cases don’t roll back on secondary databases
    * there are strategies to minimize database calls
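
One such strategy, sketched under assumptions: collect the integer references first, then resolve them with a single batched lookup instead of one query per row. `attach_related` and `fetch_posts` are hypothetical names; in a Django project the lookup could be something like `lambda ids: Post.objects.using('posts').in_bulk(list(ids))`, since `in_bulk()` returns a dict keyed by id.

```python
from types import SimpleNamespace


def attach_related(children, fetch_many, fk_attr="post_id", target_attr="post"):
    """Resolve integer references with one batched lookup, not one query per row."""
    ids = {getattr(child, fk_attr) for child in children}
    parents = fetch_many(ids)  # expected to return {id: object}
    for child in children:
        setattr(child, target_attr, parents.get(getattr(child, fk_attr)))
    return children


# In-memory stand-in for the posts database, recording how often it is hit.
posts_db = {1: "first post", 2: "second post"}
calls = []


def fetch_posts(ids):
    calls.append(sorted(ids))
    return {i: posts_db[i] for i in ids if i in posts_db}


comments = [SimpleNamespace(post_id=1), SimpleNamespace(post_id=2), SimpleNamespace(post_id=1)]
attach_related(comments, fetch_posts)
print(len(calls), comments[2].post)
```

This replaces what `select_related()` would have done with two round trips in total: one for the comments, one for all of their posts.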

  34. Treat Databases as Lookup Tables

  35. When Should You Feature Partition?

  38. Partition Right Away...
    The Good
    • Easy
    • No data migrations
    • No refactoring
    The Bad
    • You don’t have metrics
    • A lot of overhead
    • Codebase gets ‘gross’ really quickly
    The Ugly
    • Might be a waste of time
    • Efficiency decrease
    • Shipping support code
    • Not shipping features
    • Premature optimization

  41. When You Need It...
    The Good
    • You know what to partition
    • You know usage patterns
    • Familiar with the system
    The Bad
    • Scaling a live system
    • Scaling a large system
    • Multi-part migrations
    • Massive amount of planning needed
    The Ugly
    • A lot of moving parts
    • Migration can often go in only one direction
    • Pressure (scaling because you NEED to)
    • No other options

  42. “replacing all components of a
    car while driving it at 100mph”
    Mike Krieger - Instagram

  43. Our Strategy

  44. Other Strategies?
    ‣ In-app Replication
    ‣ Out of app Replication
    ‣ Backfill
    ‣ Epic Downtime
    All valid - the best strategy depends on your app

  45. Phew...

  46. It’s Not Over Yet

  47. Horizontal Partitioning
    (aka. Sharding)

  48. Data Split Across Many Databases
    users_shard_01     users_shard_02     users_shard_03     users_shard_04     ...  users_shard_n
    posts_shard_01     posts_shard_02     posts_shard_03     posts_shard_04     ...  posts_shard_n
    comments_shard_01  comments_shard_02  comments_shard_03  comments_shard_04  ...  comments_shard_n

  49. Shards are just Databases
    import dj_database_url

    # config() reads the DATABASE_URL environment variable unless env= is given;
    # default= is used only when that variable is unset.
    DATABASES = {
        'default': dj_database_url.config(default='mysql://localhost/default'),
        'posts_shard_01': dj_database_url.config(default='mysql://localhost/posts_shard_01'),
        'posts_shard_02': dj_database_url.config(default='mysql://localhost/posts_shard_02'),
        'posts_shard_03': dj_database_url.config(default='mysql://localhost/posts_shard_03'),
        # ...
    }
    Recommended: dj-database-url

  50. Picking a Sharding Key
    ‣ Usually the primary key of a major entity in your system
    ‣ Different for every system you try to scale
    ‣ Look past the `User` model

  51. Pick the Wrong Sharding Key?
    [Diagram: a single Query sent to many shards]

  52. [Slide with no transcribed text]

  53. Denormalized Data
    ‣ Pre-process your data as it comes in rather than as it’s requested
    ‣ Perfect place to use NoSQL (while maintaining a canonical source of data)
    ‣ Perform query lookups in denormalized data rather than querying all the shards
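
A minimal sketch of the idea, with an in-memory dict standing in for the NoSQL store. `PostIndex`, the slug key and the shard aliases are hypothetical; the point is only that the index is written when the data arrives, so a lookup touches one shard instead of all of them.

```python
class PostIndex:
    """Denormalized lookup: slug -> (shard alias, post id), maintained on write."""

    def __init__(self):
        self._by_slug = {}  # stand-in for a key-value (NoSQL) store

    def record(self, slug, shard, post_id):
        # Called at write time, right after the post is saved to its shard.
        self._by_slug[slug] = (shard, post_id)

    def locate(self, slug):
        # Returns (shard, post_id), or None if the slug is unknown.
        return self._by_slug.get(slug)


index = PostIndex()
index.record("hello-world", "posts_shard_02", 42)
print(index.locate("hello-world"))
```

With the location in hand, a read can go straight to the right database, for example `Post.objects.using(shard).get(pk=post_id)`. The shards remain the canonical source; the index can be rebuilt from them.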

  54. Stop Using Auto-increment for Primary Key IDs
    ‣ Auto-increment IDs collide across shards, so you can’t migrate data between them
    ‣ Use globally unique primary keys instead
    ‣ Encode meta information in the primary key

  55. Generating IDs
    ‣ Single auto-incremented field in `default` database
    ‣ External software (eg. Twitter Snowflake)
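
As an illustration of a globally unique ID that also carries meta information, here is a Snowflake-style sketch. The 41/10/12-bit split mirrors Snowflake's timestamp/worker/sequence layout; using the middle field as a shard id, the custom epoch and the function names are assumptions of this sketch, not details from the talk.

```python
TIMESTAMP_BITS, SHARD_BITS, SEQUENCE_BITS = 41, 10, 12
EPOCH_MS = 1325376000000  # assumed custom epoch: 2012-01-01T00:00:00Z


def make_id(timestamp_ms, shard_id, sequence):
    """Pack a timestamp, shard id and per-millisecond sequence into one integer."""
    elapsed = timestamp_ms - EPOCH_MS
    if not 0 <= elapsed < (1 << TIMESTAMP_BITS):
        raise ValueError("timestamp out of range")
    if not 0 <= shard_id < (1 << SHARD_BITS):
        raise ValueError("shard_id out of range")
    if not 0 <= sequence < (1 << SEQUENCE_BITS):
        raise ValueError("sequence out of range")
    return (elapsed << (SHARD_BITS + SEQUENCE_BITS)) | (shard_id << SEQUENCE_BITS) | sequence


def shard_of(packed_id):
    """Recover the shard id encoded in an ID produced by make_id."""
    return (packed_id >> SEQUENCE_BITS) & ((1 << SHARD_BITS) - 1)


earlier = make_id(EPOCH_MS + 1000, shard_id=3, sequence=0)
later = make_id(EPOCH_MS + 2000, shard_id=1, sequence=7)
print(shard_of(earlier), shard_of(later), earlier < later)
```

Because the timestamp occupies the high bits, IDs sort roughly by creation time, and the shard can be read back from the ID without a lookup.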

  56. Sharding in the Code
    posts1 = Post.objects.using('posts_shard_01').all()
    posts2 = Post.objects.using('posts_shard_02').all()
    ...
    ‣ Manually select database to use
    ‣ Pass in the ‘shard’ you want to access
    QuerySet.using(...)
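
Rather than hard-coding the alias, the shard can be derived from the sharding key. This is a minimal sketch assuming an integer key and a fixed shard count; `shard_for`, the modulo scheme and the count of 4 are hypothetical choices.

```python
def shard_for(sharding_key, shard_count=4, prefix="posts_shard_"):
    """Map an integer sharding key to a database alias such as 'posts_shard_01'."""
    return "%s%02d" % (prefix, sharding_key % shard_count + 1)


print(shard_for(0), shard_for(5), shard_for(7))
```

A read would then look like `Post.objects.using(shard_for(user_id)).filter(...)`. Plain modulo is the simplest mapping, but changing `shard_count` later moves most keys to a different shard, so a lookup table or a fixed set of logical shards may suit a growing system better.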

  57. QuerySet + Routers
    ‣ Use your Django routers
    ‣ Automatically route data to the proper shard on write
    ‣ Still need to use QuerySet.using() on read
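
A sketch of why writes can be routed automatically while reads cannot: when a model is saved, Django passes the object to the router as the `instance` hint, but an ordinary `Post.objects.filter(...)` carries no such hint. `ShardRouter`, the `user_id` sharding key and the shard count are assumptions; stand-in objects replace real models so the sketch runs on its own.

```python
from types import SimpleNamespace

SHARD_COUNT = 4  # assumed number of post shards


def shard_for(sharding_key):
    return "posts_shard_%02d" % (sharding_key % SHARD_COUNT + 1)


class ShardRouter:
    """Pick the shard on write from the instance being saved."""

    def db_for_write(self, model, **hints):
        instance = hints.get("instance")
        if model._meta.app_label == "posts" and instance is not None:
            return shard_for(instance.user_id)  # user_id is the assumed sharding key
        return None

    def db_for_read(self, model, **hints):
        # Without an instance the router cannot know the shard, so it has no
        # opinion here; callers choose the shard explicitly with .using(...).
        return None


# Stand-ins for a model class and an unsaved instance.
PostModel = SimpleNamespace(_meta=SimpleNamespace(app_label="posts"))
post = SimpleNamespace(user_id=6)

router = ShardRouter()
print(router.db_for_write(PostModel, instance=post))
```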

  58. Links
    dj-database-url https://github.com/kennethreitz/dj-database-url
    django-multidb-patterns https://github.com/malcolmt/django-multidb-patterns
    High Scalability http://highscalability.com

  59. @ashchristopher