Slide 1

Slide 1 text

SCALE (and how to do it with Django) @ashchristopher [email protected] Friday, 17 August, 12

Slide 2

Slide 2 text

Scalability (What is it?)

Slide 3

Slide 3 text

Scalability != Performance

Slide 4

Slide 4 text

The Difference

Slide 7

Slide 7 text

The Difference

Performant:

    from django.http import HttpResponse

    def view(request):
        return HttpResponse('Hello World')

    ./manage.py run_gunicorn --workers=1

Scalable:

    from time import sleep

    from django.http import HttpResponse

    def view(request):
        sleep(10)  # simulate a slow request
        return HttpResponse('Hello World')

    ./manage.py run_gunicorn --workers=10

Yes, I realize this is a contrived example
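
One way to read the contrived example is as back-of-the-envelope arithmetic. The sketch below assumes synchronous workers that each handle one request at a time (gunicorn's default worker class); the function name is hypothetical and not part of the talk.

```python
def max_requests_per_second(workers, seconds_per_request):
    """Upper bound on throughput when each worker serves one request at a time."""
    if workers < 1 or seconds_per_request <= 0:
        raise ValueError('need at least one worker and a positive request time')
    return float(workers) / seconds_per_request


# The slow view is not performant (10 seconds per request), but it scales:
# doubling the workers doubles how many requests can be served per second.
slow_with_10_workers = max_requests_per_second(10, 10)
slow_with_20_workers = max_requests_per_second(20, 10)
```

The point of the sketch is that performance is about `seconds_per_request`, while scalability is about whether adding `workers` keeps raising the ceiling.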

Slide 8

Slide 8 text

What to scale... Application Database

Slide 9

Slide 9 text

Focus on the Database (because it’s non-trivial)

Slide 10

Slide 10 text

Two main reasons to scale your database

Slide 11

Slide 11 text

Write Performance
There is a limit to how fast data can be written to disk

Slide 12

Slide 12 text

Massive Volumes of Data

Slide 13

Slide 13 text

Massive Volumes of Data
A database can only store so much...

Slide 14

Slide 14 text

Before you scale, you need to know what to scale

Slide 15

Slide 15 text

Collect Metrics
‣ What parts of your system are growing?
‣ How fast is your data growing?
‣ How long will your current infrastructure last?
‣ What would cause your growth rate to increase?
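
The "how long will your current infrastructure last?" question can be roughed out once you have a growth metric. This is a minimal sketch that assumes constant compound monthly growth, which real systems rarely follow exactly; the function and parameter names are hypothetical.

```python
import math


def months_until_full(current_gb, capacity_gb, monthly_growth_rate):
    """Months until the data outgrows capacity, assuming steady compound growth."""
    if current_gb <= 0 or capacity_gb <= 0 or monthly_growth_rate <= 0:
        raise ValueError('sizes and growth rate must be positive')
    if current_gb >= capacity_gb:
        return 0.0  # already at or over capacity
    ratio = capacity_gb / float(current_gb)
    return math.log(ratio) / math.log(1.0 + monthly_growth_rate)


# Example: 100 GB today, 400 GB of room, data doubling every month.
runway = months_until_full(100, 400, 1.0)
```

Re-running the estimate with a higher growth rate is a cheap way to explore the last question on the slide.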

Slide 16

Slide 16 text

Analyze Data Models (Analyze Relationships between Models)
Easier to Scale / Harder to Scale

Slide 17

Slide 17 text

You’re Ready to Start Scaling!

Slide 18

Slide 18 text

Scale Up Before Out

Slide 19

Slide 19 text

Invest in Faster Hardware

Slide 20

Slide 20 text

Invest in Faster Hardware
More RAM

Slide 21

Slide 21 text

Invest in Faster Hardware
More RAM
SSD Hard Drives

Slide 22

Slide 22 text

Outrun The Problem

Slide 23

Slide 23 text

You Will Hit a Limit
Limits of MySQL: 32 CPU, 256GB RAM

Slide 24

Slide 24 text

Functional Partitioning
(aka. Feature Partitioning) (aka. Vertical Partitioning) (...)

Slide 25

Slide 25 text

Functional Partitioning
[Diagram labels: the Internet; Users; Posts; Comments]

Slide 26

Slide 26 text

More Databases

    import dj_database_url

    DATABASES = {
        'default': dj_database_url.config(default='mysql://localhost/default'),
        'posts': dj_database_url.config(default='mysql://localhost/posts'),
        'comments': dj_database_url.config(default='mysql://localhost/comments'),
        # ...
    }

Recommend using: dj-database-url

Slide 27

Slide 27 text

Django Routers

Slide 31

Slide 31 text

Django Routers

    class SimpleRouter(object):
        def db_for_read(self, model, **hints):
            # return database or None
            return None

        def db_for_write(self, model, **hints):
            # return database or None
            return None

        def allow_relation(self, obj1, obj2, **hints):
            # return True, False or None
            return None

        def allow_syncdb(self, db, model):
            # return True, False or None
            return None

‣ Split data to different databases
‣ Routing happens automatically
‣ Easy to stub in
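
To make the router interface concrete, here is a minimal sketch of a router for functional partitioning. It is not the router from the talk: the class name and the app labels in `APP_TO_DB` are assumptions, chosen to match the `posts` and `comments` database aliases used earlier. A router is a plain Python class, so the sketch needs no Django imports.

```python
class FeatureRouter(object):
    """Send each app's models to its own database (hypothetical mapping)."""

    # Assumed mapping: Django app label -> alias in settings.DATABASES.
    APP_TO_DB = {'posts': 'posts', 'comments': 'comments'}

    def _db_for(self, model):
        # None means "no opinion", so Django falls back to its default routing.
        return self.APP_TO_DB.get(model._meta.app_label)

    def db_for_read(self, model, **hints):
        return self._db_for(model)

    def db_for_write(self, model, **hints):
        return self._db_for(model)

    def allow_relation(self, obj1, obj2, **hints):
        # Only allow relations between objects that live in the same database.
        db1 = self.APP_TO_DB.get(obj1._meta.app_label, 'default')
        db2 = self.APP_TO_DB.get(obj2._meta.app_label, 'default')
        return db1 == db2

    def allow_syncdb(self, db, model):
        # Create a model's tables only in the database it is routed to.
        return db == self.APP_TO_DB.get(model._meta.app_label, 'default')
```

A router like this is enabled by listing its dotted path in the `DATABASE_ROUTERS` setting. Note that later Django releases renamed `allow_syncdb` to `allow_migrate`.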

Slide 32

Slide 32 text

Remove ForeignKeys

Before:

    from django.db import models

    class Post(models.Model):
        text = models.TextField()

    class Comment(models.Model):
        post = models.ForeignKey('Post')

After:

    from django.db import models

    class Post(models.Model):
        text = models.TextField()

    class Comment(models.Model):
        post_id = models.PositiveIntegerField()

Slide 33

Slide 33 text

Partitioning Isn’t Free
‣ No more ForeignKeys
‣ No more select_related()
‣ No more prefetch_related()
‣ More database calls*
‣ Lose the Django Admin
‣ No more cascading deletes
‣ TransactionalTestCase doesn’t roll back on secondary databases

* there are strategies to minimize database calls
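
One possible strategy for keeping database calls down once `select_related()` is gone is to batch the lookup and stitch the results together in Python. The sketch below is an assumption about what such a strategy could look like, not the talk's own code; comments are plain dicts and `fetch_posts_by_ids` is a hypothetical callable so the example runs without Django.

```python
def attach_posts(comments, fetch_posts_by_ids):
    """Pair each comment with its post using one batched lookup."""
    # Collect the distinct post IDs referenced by this page of comments.
    post_ids = sorted(set(c['post_id'] for c in comments))
    # One call to the posts database instead of one call per comment.
    posts_by_id = fetch_posts_by_ids(post_ids)
    # A missing post yields None, since nothing enforces the relation anymore.
    return [(c, posts_by_id.get(c['post_id'])) for c in comments]


# In a Django project the lookup could plausibly be something like:
#   lambda ids: Post.objects.using('posts').in_bulk(ids)
```

This trades the database join for two queries plus an in-memory join, which stays constant no matter how many comments are on the page.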

Slide 34

Slide 34 text

Treat Databases as Lookup Tables

Slide 35

Slide 35 text

When Should You Feature Partition?

Slide 38

Slide 38 text

Partition Right Away...

The Good
• Easy
• No data migrations
• No refactoring

The Bad
• You don’t have metrics
• A lot of overhead
• Codebase gets ‘gross’ really quickly

The Ugly
• Might be a waste of time
• Efficiency decrease
• Shipping support code
• Not shipping features
• Premature optimization

Slide 41

Slide 41 text

When You Need It...

The Good
• You know what to partition
• You know usage patterns
• Familiar with the system

The Bad
• Scaling a live system
• Scaling a large system
• Multi-part migrations
• Massive amount of planning needed

The Ugly
• A lot of moving parts
• Often migration can only be in 1 direction
• Pressure (scaling because you NEED to)
• No other options

Slide 42

Slide 42 text

“replacing all components of a car while driving it at 100mph”
Mike Krieger - Instagram

Slide 43

Slide 43 text

Our Strategy

Slide 44

Slide 44 text

Other Strategies?
‣ In-app Replication
‣ Out of app Replication
‣ Backfill
‣ Epic Downtime

All valid - the best strategy depends on your app

Slide 45

Slide 45 text

Phew...

Slide 46

Slide 46 text

It’s Not Over Yet

Slide 47

Slide 47 text

Horizontal Partitioning (aka. Sharding)

Slide 48

Slide 48 text

Data Split Across Many Databases
users_shard_01     users_shard_02     users_shard_03     users_shard_04     ...  users_shard_n
posts_shard_01     posts_shard_02     posts_shard_03     posts_shard_04     ...  posts_shard_n
comments_shard_01  comments_shard_02  comments_shard_03  comments_shard_04  ...  comments_shard_n

Slide 49

Slide 49 text

Shards are just Databases

    import dj_database_url

    DATABASES = {
        'default': dj_database_url.config(default='mysql://localhost/default'),
        'posts_shard_01': dj_database_url.config(default='mysql://localhost/posts_shard_01'),
        'posts_shard_02': dj_database_url.config(default='mysql://localhost/posts_shard_02'),
        'posts_shard_03': dj_database_url.config(default='mysql://localhost/posts_shard_03'),
        # ...
    }

Recommend using: dj-database-url

Slide 50

Slide 50 text

Picking a Sharding Key
‣ Usually the primary key of a major entity in your system
‣ Different for every system you try to scale
‣ Look past the `User` model
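
Once a sharding key is chosen, something has to turn it into a database alias. The sketch below assumes an integer key and simple modulo placement; both the function name and the shard count are hypothetical, and the alias format follows the `posts_shard_NN` names shown on the earlier slide.

```python
SHARD_COUNT = 4  # hypothetical number of shards per feature


def shard_for(sharding_key, prefix='posts', shard_count=SHARD_COUNT):
    """Map an integer sharding key to an alias such as 'posts_shard_02'."""
    if shard_count < 1:
        raise ValueError('shard_count must be at least 1')
    index = sharding_key % shard_count + 1  # aliases are numbered from 01
    return '%s_shard_%02d' % (prefix, index)
```

Modulo placement is the simplest option, but changing `shard_count` later moves most keys to a different shard, so a lookup table from key to shard may be easier to grow.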

Slide 51

Slide 51 text

Pick the Wrong Sharding Key? ... Query

Slide 53

Slide 53 text

Denormalized Data
‣ Pre-process your data as it comes in rather than as it’s requested
‣ Perfect place to use NoSQL (while maintaining a canonical source of data)
‣ Perform query lookups in denormalized data rather than querying all the shards
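
As an illustration of the idea, the sketch below keeps a denormalized lookup that is updated when data is written, so a read can go straight to the right shards instead of querying all of them. Everything here is an assumption made for the example: the `TagIndex` class, the tag-based query, and the in-memory dict standing in for a NoSQL store. The shards remain the canonical source of data.

```python
from collections import defaultdict


class TagIndex(object):
    """Hypothetical denormalized index: tag -> [(shard alias, post id), ...]."""

    def __init__(self):
        # In production this would live in an external store, not in memory.
        self._by_tag = defaultdict(list)

    def record_post(self, shard, post_id, tags):
        # Pre-process at write time: note where each tagged post is stored.
        for tag in tags:
            self._by_tag[tag].append((shard, post_id))

    def locate(self, tag):
        # Read time: one lookup, no fan-out query across every shard.
        return list(self._by_tag.get(tag, []))


index = TagIndex()
index.record_post('posts_shard_01', 10, ['django', 'scaling'])
index.record_post('posts_shard_03', 42, ['scaling'])
scaling_posts = index.locate('scaling')
```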

Slide 54

Slide 54 text

Stop Using Auto-increment for Primary Key IDs
‣ Can’t migrate data between shards
‣ Globally unique primary keys
‣ Encode meta information in primary key

Slide 55

Slide 55 text

Generating IDs
‣ Single auto-incremented field in `default` database
‣ External software (eg. Twitter Snowflake)
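
"Encode meta information in primary key" can be sketched with a little bit-packing. The layout below (timestamp, then shard ID, then a per-millisecond sequence) is loosely inspired by Snowflake-style IDs, but the epoch, bit widths, and function names are illustrative assumptions rather than any library's API.

```python
EPOCH_MS = 1345161600000  # arbitrary custom epoch, in milliseconds
SHARD_BITS = 10           # assumed: room for 1024 shards
SEQUENCE_BITS = 12        # assumed: 4096 IDs per shard per millisecond


def make_id(timestamp_ms, shard_id, sequence):
    """Pack a timestamp, shard ID and sequence number into one integer."""
    if timestamp_ms < EPOCH_MS:
        raise ValueError('timestamp is before the custom epoch')
    if not 0 <= shard_id < (1 << SHARD_BITS):
        raise ValueError('shard_id out of range')
    if not 0 <= sequence < (1 << SEQUENCE_BITS):
        raise ValueError('sequence out of range')
    elapsed = timestamp_ms - EPOCH_MS
    return (elapsed << (SHARD_BITS + SEQUENCE_BITS)) | (shard_id << SEQUENCE_BITS) | sequence


def shard_of(packed_id):
    """Recover the shard ID that was encoded into a packed ID."""
    return (packed_id >> SEQUENCE_BITS) & ((1 << SHARD_BITS) - 1)
```

Because the shard is recoverable from the ID itself, a lookup by primary key can be routed without consulting any other table, and IDs from different shards never collide.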

Slide 56

Slide 56 text

Sharding in the Code

    posts1 = Post.objects.using('posts_shard_01').all()
    posts2 = Post.objects.using('posts_shard_02').all()
    ...

QuerySet.using(...)
‣ Manually select database to use
‣ Pass in the ‘shard’ you want to access

Slide 57

Slide 57 text

QuerySet + Routers
‣ Use your Django routers
‣ Automatically route data to the proper shard on write
‣ Still need to use QuerySet.using() on read
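
A sketch of how write routing could work: when a model instance is saved, Django passes it to `db_for_write` as the `instance` hint, so the router can read the sharding key off the object. The class name, the `posts` app label, the `user_id` sharding key, and the shard count are all assumptions for the example.

```python
class PostShardRouter(object):
    """Route writes for the (assumed) 'posts' app to a shard by user_id."""

    SHARD_COUNT = 4  # hypothetical

    def alias_for(self, user_id):
        return 'posts_shard_%02d' % (user_id % self.SHARD_COUNT + 1)

    def db_for_write(self, model, **hints):
        instance = hints.get('instance')
        if model._meta.app_label != 'posts' or instance is None:
            return None  # no opinion; let other routers or the default decide
        return self.alias_for(instance.user_id)

    def db_for_read(self, model, **hints):
        # A read has no instance to inspect, so the caller still picks the
        # shard explicitly, e.g. Post.objects.using(router.alias_for(user_id)).
        return None
```

This matches the last bullet on the slide: the router can place new rows automatically, but a query only knows which shard to hit if the calling code supplies the sharding key.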

Slide 58

Slide 58 text

Links
dj-database-url: https://github.com/kennethreitz/dj-database-url
django-multidb-patterns: https://github.com/malcolmt/django-multidb-patterns
High Scalability: http://highscalability.com

Slide 59

Slide 59 text

@ashchristopher [email protected]