Slide 1

Slide 1 text

Database Sharding with Django Ash Christopher [email protected] @ashchristopher 1 DjangoCon 2013 1 Thursday, 5 September, 13

Slide 2

Slide 2 text

2 2 Thursday, 5 September, 13

Slide 3

Slide 3 text

3 History Why we decided to shard our data. 3 Thursday, 5 September, 13

Slide 4

Slide 4 text

4 4 Thursday, 5 September, 13

Slide 5

Slide 5 text

5 Focus on Learning and Experimentation “Move fast and break things.” 5 Thursday, 5 September, 13

Slide 6

Slide 6 text

6 6 Thursday, 5 September, 13

Slide 7

Slide 7 text

7 Scaling Up Solving your problems with money. 7 Thursday, 5 September, 13

Slide 8

Slide 8 text

8 8 Thursday, 5 September, 13

Slide 9

Slide 9 text

9 9 Thursday, 5 September, 13

Slide 10

Slide 10 text

10 Small Business Owner Signups per Month November 2010 - April 2012 10 Thursday, 5 September, 13

Slide 11

Slide 11 text

11 Data Generated per Month November 2010 - April 2012 11 Thursday, 5 September, 13

Slide 12

Slide 12 text

Introduced Sharding 12 12 Thursday, 5 September, 13

Slide 13

Slide 13 text

13 default 30 shards default + 13 Thursday, 5 September, 13

Slide 14

Slide 14 text

Our platform now supports over 1.5 million users. 14 14 Thursday, 5 September, 13

Slide 15

Slide 15 text

Process We Followed at Wave I. Scale Up II. Feature Partition/Sharding III. Horizontal Sharding 15 15 Thursday, 5 September, 13

Slide 16

Slide 16 text

Things to Know Before You Shard Things you should be aware of before going introducing sharding into your project. 16 16 Thursday, 5 September, 13

Slide 17

Slide 17 text

17 Sharding sucks! 17 Thursday, 5 September, 13

Slide 18

Slide 18 text

18 Sharding is awesome! 18 Thursday, 5 September, 13

Slide 19

Slide 19 text

19 Multiple Databases in Django How to make use of more than one database in your Django projects. 19 Thursday, 5 September, 13

Slide 20

Slide 20 text

Define multiple database in your settings.py file. 20 DATABASES = { 'default': { 'ENGINE': 'django.db.backends.mysql', 'NAME': 'default', 'USER': 'mysql_user', 'PASSWORD': '***', 'HOST': '0.0.0.0', 'PORT': '3306', }, 'posts': { 'ENGINE': 'django.db.backends.mysql', 'NAME': 'posts', 'USER': 'mysql_user', 'PASSWORD': '***', 'HOST': '0.0.0.0', 'PORT': '3306', }, 'comments': { 'ENGINE': 'django.db.backends.mysql', 'NAME': 'comments', 'USER': 'mysql_user', 'PASSWORD': '***', 'HOST': '0.0.0.0', 'PORT': '3306', }, } 20 Thursday, 5 September, 13

Slide 21

Slide 21 text

21 dj-database-url https://github.com/kennethreitz/dj-database-url 21 Thursday, 5 September, 13

Slide 22

Slide 22 text

Define databases in your settings.py file using dj-database-url. (Utility will be more apparent when we start sharding) 22 import dj_database_url DATABASES = { 'default': dj_database_url.config(default='mysql://mysql_user:***@localhost/default'), 'posts': dj_database_url.config(default='mysql://mysql_user:***@localhost/posts'), 'comments': dj_database_url.config(default='mysql://mysql_user:***@localhost/comments'), } 22 Thursday, 5 September, 13

Slide 23

Slide 23 text

Database Routers 23 The mechanism by which Django supports interacting with multiple databases. 23 Thursday, 5 September, 13

Slide 24

Slide 24 text

Define your routers in your settings.py file. 24 DATABASE_ROUTERS = ( 'posts.routers.PostsRouter', 'comments.routers.CommentsRouter', ) 24 Thursday, 5 September, 13

Slide 25

Slide 25 text

The anatomy of a Django database router. 25 class MyDatabaseRouter(object): def db_for_read(model, **hints): # Given a model class `model`, return the alias that maps to the # the database connection for reading, or None if there is no opinion. pass def db_for_write(model, **hints): # Given a model class `model`, return the alias that maps to the # the database connection for writing, or None if there is no opinion. pass def allow_relation(obj1, obj2, **hints): # Return True if the relation between `obj1` and `obj2` should be # explicitly allowed, return False if the relation between `obj1` # and `obj2` should be strictly disallowed, or return None if the # router should have no opinion. pass def allow_syncdb(db, model): # Given a database alias `db` and a Django model `model` return True # if the model should be synchronized for the given alias, return False # if the router should not be synchronized, or return None if the router # has no opinion. pass 25 Thursday, 5 September, 13

Slide 26

Slide 26 text

Defining Your Models 26 There is a small tweak needed when defining your models when using multiple databases. 26 Thursday, 5 September, 13

Slide 27

Slide 27 text

27 Defining your relationships without ForeignKeys. from django.db import models class Post(models.Model): user_pk = models.PositiveIntegerField() active = models.BooleanField(default=False) date_added = models.DateTimeField(auto_now_add=True) date_modified = models.DateTimeField(auto_now=True) class Comment(models.Model): user_pk = models.PositiveIntegerField() post_pk = models.PositiveIntegerField() active = models.BooleanField(default=False) date_added = models.DateTimeField(auto_now_add=True) date_modified = models.DateTimeField(auto_now=True) 27 Thursday, 5 September, 13

Slide 28

Slide 28 text

28 Defining your relationships without ForeignKeys. from django.db import models class Post(models.Model): user_pk = models.PositiveIntegerField() active = models.BooleanField(default=False) date_added = models.DateTimeField(auto_now_add=True) date_modified = models.DateTimeField(auto_now=True) class Comment(models.Model): user_pk = models.PositiveIntegerField() post_pk = models.PositiveIntegerField() active = models.BooleanField(default=False) date_added = models.DateTimeField(auto_now_add=True) date_modified = models.DateTimeField(auto_now=True) 28 Thursday, 5 September, 13

Slide 29

Slide 29 text

Sharding I. Feature partitioning II. Horizontal sharding 29 29 Thursday, 5 September, 13

Slide 30

Slide 30 text

Feature Partitioning 30 Dividing data storage and retrieval duties among different databases based on function. 30 Thursday, 5 September, 13

Slide 31

Slide 31 text

31 31 Thursday, 5 September, 13

Slide 32

Slide 32 text

Default (User accounts) Posts Comments 32 32 Thursday, 5 September, 13

Slide 33

Slide 33 text

class AppBasedRouter(object): def __init__(self, *args, **kwargs): if not (self.app_name or self.db_name): raise AttributeError('missing class variables `app_name` or `db_name`') def db_for_read(self, model, **hints): if model._meta.app_label == self.app_name: return self.db_name return None def db_for_write(self, model, **hints): if model._meta.app_label == self.app_name: return self.db_name return None def allow_relation(self, obj1, obj2, **hints): # only allow joins between objects in the same app if obj1._meta.app_label == self.app_name and obj2._meta.app_label == self.app_name: return True elif obj1._meta.app_label == self.app_name or obj2._meta.app_label == self.app_name: return False else: # neither of the models are from the app - we have no opinion. return None def allow_syncdb(self, db, model): if model._meta.app_label in ['south']: return True if db == self.db_name: return model._meta.app_label == self.app_name return None App-based feature partitioning database router. 33 33 Thursday, 5 September, 13

Slide 34

Slide 34 text

class AppBasedRouter(object): def __init__(self, *args, **kwargs): if not (self.app_name or self.db_name): raise AttributeError('missing class variables `app_name` or `db_name`') def db_for_read(self, model, **hints): if model._meta.app_label == self.app_name: return self.db_name return None def db_for_write(self, model, **hints): if model._meta.app_label == self.app_name: return self.db_name return None def allow_relation(self, obj1, obj2, **hints): # only allow joins between objects in the same app if obj1._meta.app_label == self.app_name and obj2._meta.app_label == self.app_name: return True elif obj1._meta.app_label == self.app_name or obj2._meta.app_label == self.app_name: return False else: # neither of the models are from the app - we have no opinion. return None def allow_syncdb(self, db, model): if model._meta.app_label in ['south']: return True if db == self.db_name: return model._meta.app_label == self.app_name return None App-based feature partitioning database router. We will come back to this later 33 33 Thursday, 5 September, 13

Slide 35

Slide 35 text

class AppBasedRouter(object): def __init__(self, *args, **kwargs): if not (self.app_name or self.db_name): raise AttributeError(...) def db_for_read(self, model, **hints): if model._meta.app_label == self.app_name: return self.db_name return None def db_for_write(self, model, **hints): if model._meta.app_label == self.app_name: return self.db_name return None App-based feature partitioning database router. 34 34 Thursday, 5 September, 13

Slide 36

Slide 36 text

App-based feature partitioning database router. 35 class AppBasedRouter(object): def __init__(self, *args, **kwargs): if not (self.app_name or self.db_name): raise AttributeError(...) def db_for_read(self, model, **hints): if model._meta.app_label == self.app_name: return self.db_name return None def db_for_write(self, model, **hints): if model._meta.app_label == self.app_name: return self.db_name return None 35 Thursday, 5 September, 13

Slide 37

Slide 37 text

App-based feature partitioning database router. 36 class AppBasedRouter(object): def allow_relation(self, obj1, obj2, **hints): # only allow joins between objects in the same app if obj1._meta.app_label == self.app_name and obj2._meta.app_label == self.app_name: return True elif obj1._meta.app_label == self.app_name or obj2._meta.app_label == self.app_name: return False else: # neither of the models are from the app - we have no opinion. return None def allow_syncdb(self, db, model): if model._meta.app_label in ['south']: return True if db == self.db_name: return model._meta.app_label == self.app_name return None 36 Thursday, 5 September, 13

Slide 38

Slide 38 text

App-based feature partitioning database router. 37 class AppBasedRouter(object): def allow_relation(self, obj1, obj2, **hints): # only allow joins between objects in the same app if obj1._meta.app_label == self.app_name and obj2._meta.app_label == self.app_name: return True elif obj1._meta.app_label == self.app_name or obj2._meta.app_label == self.app_name: return False else: # neither of the models are from the app - we have no opinion. return None def allow_syncdb(self, db, model): if model._meta.app_label in ['south']: return True if db == self.db_name: return model._meta.app_label == self.app_name return None 37 Thursday, 5 September, 13

Slide 39

Slide 39 text

Using the AppBasedRouter. 38 class PostsRouter(AppBasedRouter): app_name = 'posts' db_name = 'posts_db' class CommentsRouter(AppBasedRouter): app_name = 'comments' db_name = 'comments_db' 38 Thursday, 5 September, 13

Slide 40

Slide 40 text

Migrations with South South supports multiple databases but it’s not automatic. 39 39 Thursday, 5 September, 13

Slide 41

Slide 41 text

40 40 Thursday, 5 September, 13

Slide 42

Slide 42 text

Sync south_migrationhistory on to every database. class AppBasedRouter(object): ... def allow_syncdb(self, db, model): if model._meta.app_label in ['south']: return True if db == self.db_name: return model._meta.app_label == self.app_name return None Remember this?? 41 41 Thursday, 5 September, 13

Slide 43

Slide 43 text

Multi-database migrations using South. 42 ./manage.py migrate posts --database=default --fake ./manage.py migrate posts --database=posts_db ./manage.py migrate comments --database=default --fake ./manage.py migrate comments --database=comments_db ./manage.py migrate --database=default NOTE: Need to specify the apps to migrate! 42 Thursday, 5 September, 13

Slide 44

Slide 44 text

Running migrations with multiple databases. 43 ./manage.py migrate posts --database=default --fake ./manage.py migrate posts --database=posts_db ./manage.py migrate comments --database=default --fake ./manage.py migrate comments --database=comments_db ./manage.py migrate --database=default 43 Thursday, 5 September, 13

Slide 45

Slide 45 text

Running migrations with multiple databases. 44 ./manage.py migrate posts --database=default --fake ./manage.py migrate posts --database=posts_db ./manage.py migrate comments --database=default --fake ./manage.py migrate comments --database=comments_db ./manage.py migrate --database=default 44 Thursday, 5 September, 13

Slide 46

Slide 46 text

Running migrations with multiple databases. 45 ./manage.py migrate posts --database=default --fake ./manage.py migrate posts --database=posts_db ./manage.py migrate comments --database=default --fake ./manage.py migrate comments --database=comments_db ./manage.py migrate --database=default 45 Thursday, 5 September, 13

Slide 47

Slide 47 text

posts = Post.objects.using('default').filter(date_added__gte='2013-09-04') DatabaseError Traceback (most recent call last) in () ----> 1 Invoice.objects.filter(date_added__lte='2013-09-01') ... /data/virtualenv/djangocon_sharding/lib/python2.7/site-packages/MySQLdb/cursors.pyc in execute(self, query, args) 172 del tb 173 self.messages.append((exc, value)) --> 174 self.errorhandler(self, exc, value) 175 self._executed = query 176 if not self._defer_warnings: self._warning_check() /data/virtualenv/djangocon_sharding/lib/python2.7/site-packages/MySQLdb/connections.pyc in defaulterrorhandler(***failed resolving arguments***) 34 del cursor 35 del connection ---> 36 raise errorclass, errorvalue 37 38 re_numeric_part = re.compile(r"^(\d+)") DatabaseError: (1146, "Table 'djangocon_sharding.posts_post' doesn't exist") We want to generate exceptions when accessing data. 46 46 Thursday, 5 September, 13

Slide 48

Slide 48 text

47 47 Thursday, 5 September, 13

Slide 49

Slide 49 text

47 47 Thursday, 5 September, 13

Slide 50

Slide 50 text

Horizontally Sharding the Data Splitting data across numerous databases with the same schema. 48 48 Thursday, 5 September, 13

Slide 51

Slide 51 text

49 49 Thursday, 5 September, 13

Slide 52

Slide 52 text

Choosing a Sharding Key Pay special attention to the relationships between Models. 50 50 Thursday, 5 September, 13

Slide 53

Slide 53 text

Choosing a Sharding Key Pay special attention to the relationships between Models. This might make a great Shard Key 50 50 Thursday, 5 September, 13

Slide 54

Slide 54 text

Unlucky with your Schema? Finding an optimal shard key for some systems (like social networks) can be difficult. 51 51 Thursday, 5 September, 13

Slide 55

Slide 55 text

Sharded data Main data Sharding key 52 52 Thursday, 5 September, 13

Slide 56

Slide 56 text

Picked a Poor Sharding Key? 53 Choosing a key to shard on is hard - there are a few tell-tale signs that your pick wasn’t optimal. 53 Thursday, 5 September, 13

Slide 57

Slide 57 text

Resist your Urges 54 Not everything makes for a good sharding key. 54 Thursday, 5 September, 13

Slide 58

Slide 58 text

Define Your Sharded Databases Defining multiple databases with the same schema for use in Django applications. 55 55 Thursday, 5 September, 13

Slide 59

Slide 59 text

Extend dj-database-url to make multi-db easier. 56 import dj_database_url def config(default=None, **kwargs): """ Returns a database dictionary from a URI. Updates the resulting database dictionary with the `kwargs` passed in. """ database_config = dj_database_url.config(default=default) database_config.update(kwargs) return database_config 56 Thursday, 5 September, 13

Slide 60

Slide 60 text

Define your sharded databases in your settings.py file. Group the shards by a common identifier 57 import dj_multidb_url DATABASES = { 'default': dj_multidb_url.config(default='mysql://mysql_user:***@localhost/default'), 'blog_001': dj_multidb_url.config(default='mysql://mysql_user:***@localhost/blog_001', shard_group='blog_shards'), 'blog_002': dj_multidb_url.config(default='mysql://mysql_user:***@localhost/blog_002', shard_group='blog_shards'), 'blog_003': dj_multidb_url.config(default='mysql://mysql_user:***@localhost/blog_003', shard_group='blog_shards'), 'blog_004': dj_multidb_url.config(default='mysql://mysql_user:***@localhost/blog_004', shard_group='blog_shards'), 'blog_005': dj_multidb_url.config(default='mysql://mysql_user:***@localhost/blog_005', shard_group='blog_shards'), 'blog_006': dj_multidb_url.config(default='mysql://mysql_user:***@localhost/blog_006', shard_group='blog_shards'), 'blog_007': dj_multidb_url.config(default='mysql://mysql_user:***@localhost/blog_007', shard_group='blog_shards'), 'blog_008': dj_multidb_url.config(default='mysql://mysql_user:***@localhost/blog_008', shard_group='blog_shards'), 'blog_009': dj_multidb_url.config(default='mysql://mysql_user:***@localhost/blog_009', shard_group='blog_shards'), 'blog_010': dj_multidb_url.config(default='mysql://mysql_user:***@localhost/blog_010', shard_group='blog_shards'), } Share a common prefix 57 Thursday, 5 September, 13

Slide 61

Slide 61 text

Database Routers for Sharding 58 58 Thursday, 5 September, 13

Slide 62

Slide 62 text

Sample of a sharded Model. 59 class MyShardedModel(Model): sharding_key_pk = models.IntegerField() ... def get_shard(self): """ Returns the database alias that the data should reside on. """ if self._state.db: return self._state.db sharding_key = ShardingKey.objects.get(pk=self.sharding_key_pk) return sharding_key.shard_database_alias 59 Thursday, 5 September, 13

Slide 63

Slide 63 text

Sharded database router. 60 class BaseShardRouter(object): app_list = tuple() db_name_prefix = None def db_for_read(self, model, **hints): if model._meta.app_label in self.app_list: instance = hints.get('instance') if instance: return instance.get_shard() return None def db_for_write(self, model, **hints): if model._meta.app_label in self.app_list: instance = hints.get('instance') if instance: return instance.get_shard() return None def allow_syncdb(self, db, model): if model._meta.app_label in ['south']: return True this_app = (model._meta.app_label in self.app_list) _db = db.startswith(self.db_names_prefix) if this_app: return _db if _db: return False return None 60 Thursday, 5 September, 13

Slide 64

Slide 64 text

Sharded database router. 61 class BaseShardRouter(object): app_list = tuple() db_name_prefix = None def db_for_read(self, model, **hints): if model._meta.app_label in self.app_list: instance = hints.get('instance') if instance: return instance.get_shard() return None def db_for_write(self, model, **hints): if model._meta.app_label in self.app_list: instance = hints.get('instance') if instance: return instance.get_shard() return None def allow_syncdb(self, db, model): if model._meta.app_label in ['south']: return True this_app = (model._meta.app_label in self.app_list) _db = db.startswith(self.db_names_prefix) if this_app: return _db if _db: return False return None 61 Thursday, 5 September, 13

Slide 65

Slide 65 text

Sharded database router. 62 class BaseShardRouter(object): app_list = tuple() db_name_prefix = None def db_for_read(self, model, **hints): if model._meta.app_label in self.app_list: instance = hints.get('instance') if instance: return instance.get_shard() return None def db_for_write(self, model, **hints): if model._meta.app_label in self.app_list: instance = hints.get('instance') if instance: return instance.get_shard() return None def allow_syncdb(self, db, model): if model._meta.app_label in ['south']: return True this_app = (model._meta.app_label in self.app_list) _db = db.startswith(self.db_names_prefix) if this_app: return _db if _db: return False return None 62 Thursday, 5 September, 13

Slide 66

Slide 66 text

Sharded database router. 63 class BaseShardRouter(object): app_list = tuple() db_name_prefix = None def db_for_read(self, model, **hints): if model._meta.app_label in self.app_list: instance = hints.get('instance') if instance: return instance.get_shard() return None def db_for_write(self, model, **hints): if model._meta.app_label in self.app_list: instance = hints.get('instance') if instance: return instance.get_shard() return None def allow_syncdb(self, db, model): if model._meta.app_label in ['south']: return True this_app = (model._meta.app_label in self.app_list) _db = db.startswith(self.db_names_prefix) if this_app: return _db if _db: return False return None 63 Thursday, 5 September, 13

Slide 67

Slide 67 text

Sharded database router. 64 class BaseShardRouter(object): app_list = tuple() db_name_prefix = None def db_for_read(self, model, **hints): if model._meta.app_label in self.app_list: instance = hints.get('instance') if instance: return instance.get_shard() return None def db_for_write(self, model, **hints): if model._meta.app_label in self.app_list: instance = hints.get('instance') if instance: return instance.get_shard() return None def allow_syncdb(self, db, model): if model._meta.app_label in ['south']: return True this_app = (model._meta.app_label in self.app_list) _db = db.startswith(self.db_names_prefix) if this_app: return _db if _db: return False return None NOTE: We did not define an allow_relation() method because we deferred to the default behaviour. 64 Thursday, 5 September, 13

Slide 68

Slide 68 text

Using the BaseShardRouter. 65 class BlogRouter(BaseShardRouter): app_list = ('posts', 'comments', ) db_name_prefix = 'blog_' 65 Thursday, 5 September, 13

Slide 69

Slide 69 text

Reading and Writing to Shards 66 66 Thursday, 5 September, 13

Slide 70

Slide 70 text

Reading from a shard. 67 shard = sharding_key.shard_database_alias post = Post.objects.using(shard).get(...) 67 Thursday, 5 September, 13

Slide 71

Slide 71 text

Updating an instance on a shard. 68 shard = sharding_key.shard_database_alias post = Post.objects.using(shard).get(...) post.status = 'published' post.save() 68 Thursday, 5 September, 13

Slide 72

Slide 72 text

Creating a new instance on a shard. 69 shard = sharding_key.shard_database_alias post = Post(...) post.save(using=shard) 69 Thursday, 5 September, 13

Slide 73

Slide 73 text

Creating a new instance on a shard via QuerySet method. 70 shard = sharding_key.shard_database_alias post = Post.objects.using(shard).create(...) 70 Thursday, 5 September, 13

Slide 74

Slide 74 text

Globally Unique Identifiers 71 71 Thursday, 5 September, 13

Slide 75

Slide 75 text

http://code.flickr.net/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/ 72 72 Thursday, 5 September, 13

Slide 76

Slide 76 text

73 73 Thursday, 5 September, 13

Slide 77

Slide 77 text

74 64 BITS Timestamp (41-bits) Worker ID (11-bits) Sequence ID (12-bits) 74 Thursday, 5 September, 13

Slide 78

Slide 78 text

75 Python key generator proof-of-concept. class IDGenerator(object): def __init__(self): self.worker_id = 1 self.sequence = 0 self.timestamp = 0 self._last_key = 0 self._last_timestamp = 0 def _prime_timestamp(self): timestamp = int(time.time() * 1000) if timestamp < self.timestamp: raise Exception self._last_timestamp, self.timestamp = self.timestamp, timestamp def _generate_key(self, collision=False): if collision: self.sequence += (self.sequence % 4096) + 1 self._prime_timestamp() key = self.timestamp << (64 - 41) key |= self.worker_id << (64 - 41 - 11) key |= self.sequence return key def key(self): key = self._generate_key() while key <= self._last_key: # wait for milliseconds to change, then re-gen the key key = self._generate_key(collision=True) self._last_key = key return key 75 Thursday, 5 September, 13

Slide 79

Slide 79 text

76 Questions? 76 Thursday, 5 September, 13