Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Database Sharding with Django - DjangoCon 2013

Database Sharding with Django - DjangoCon 2013

Django works great for single database webapps, but did you know they give us the tools to create webapps with sharded data right out of the box? This talk will go over how we leveraged Django's features to add sharding to our webapp infrastructure.

Video: http://www.youtube.com/watch?v=bAaoSAoACEI

Ash Christopher

September 05, 2013
Tweet

More Decks by Ash Christopher

Other Decks in Programming

Transcript

  1. 10 Small Business Owner Signups per Month November 2010 -

    April 2012 10 Thursday, 5 September, 13
  2. Process We Followed at Wave I. Scale Up II. Feature

    Partition/Sharding III. Horizontal Sharding 15 15 Thursday, 5 September, 13
  3. Things to Know Before You Shard Things you should be

    aware of before going introducing sharding into your project. 16 16 Thursday, 5 September, 13
  4. 19 Multiple Databases in Django How to make use of

    more than one database in your Django projects. 19 Thursday, 5 September, 13
  5. Define multiple database in your settings.py file. 20 DATABASES =

    { 'default': { 'ENGINE': 'django.db.backends.mysql', 'NAME': 'default', 'USER': 'mysql_user', 'PASSWORD': '***', 'HOST': '0.0.0.0', 'PORT': '3306', }, 'posts': { 'ENGINE': 'django.db.backends.mysql', 'NAME': 'posts', 'USER': 'mysql_user', 'PASSWORD': '***', 'HOST': '0.0.0.0', 'PORT': '3306', }, 'comments': { 'ENGINE': 'django.db.backends.mysql', 'NAME': 'comments', 'USER': 'mysql_user', 'PASSWORD': '***', 'HOST': '0.0.0.0', 'PORT': '3306', }, } 20 Thursday, 5 September, 13
  6. Define databases in your settings.py file using dj-database-url. (Utility will

    be more apparent when we start sharding) 22 import dj_database_url DATABASES = { 'default': dj_database_url.config(default='mysql://mysql_user:***@localhost/default'), 'posts': dj_database_url.config(default='mysql://mysql_user:***@localhost/posts'), 'comments': dj_database_url.config(default='mysql://mysql_user:***@localhost/comments'), } 22 Thursday, 5 September, 13
  7. Database Routers 23 The mechanism by which Django supports interacting

    with multiple databases. 23 Thursday, 5 September, 13
  8. Define your routers in your settings.py file. 24 DATABASE_ROUTERS =

    ( 'posts.routers.PostsRouter', 'comments.routers.CommentsRouter', ) 24 Thursday, 5 September, 13
  9. The anatomy of a Django database router. 25 class MyDatabaseRouter(object):

    def db_for_read(model, **hints): # Given a model class `model`, return the alias that maps to the # the database connection for reading, or None if there is no opinion. pass def db_for_write(model, **hints): # Given a model class `model`, return the alias that maps to the # the database connection for writing, or None if there is no opinion. pass def allow_relation(obj1, obj2, **hints): # Return True if the relation between `obj1` and `obj2` should be # explicitly allowed, return False if the relation between `obj1` # and `obj2` should be strictly disallowed, or return None if the # router should have no opinion. pass def allow_syncdb(db, model): # Given a database alias `db` and a Django model `model` return True # if the model should be synchronized for the given alias, return False # if the router should not be synchronized, or return None if the router # has no opinion. pass 25 Thursday, 5 September, 13
  10. Defining Your Models 26 There is a small tweak needed

    when defining your models when using multiple databases. 26 Thursday, 5 September, 13
  11. 27 Defining your relationships without ForeignKeys. from django.db import models

    class Post(models.Model): user_pk = models.PositiveIntegerField() active = models.BooleanField(default=False) date_added = models.DateTimeField(auto_now_add=True) date_modified = models.DateTimeField(auto_now=True) class Comment(models.Model): user_pk = models.PositiveIntegerField() post_pk = models.PositiveIntegerField() active = models.BooleanField(default=False) date_added = models.DateTimeField(auto_now_add=True) date_modified = models.DateTimeField(auto_now=True) 27 Thursday, 5 September, 13
  12. 28 Defining your relationships without ForeignKeys. from django.db import models

    class Post(models.Model): user_pk = models.PositiveIntegerField() active = models.BooleanField(default=False) date_added = models.DateTimeField(auto_now_add=True) date_modified = models.DateTimeField(auto_now=True) class Comment(models.Model): user_pk = models.PositiveIntegerField() post_pk = models.PositiveIntegerField() active = models.BooleanField(default=False) date_added = models.DateTimeField(auto_now_add=True) date_modified = models.DateTimeField(auto_now=True) 28 Thursday, 5 September, 13
  13. Feature Partitioning 30 Dividing data storage and retrieval duties among

    different databases based on function. 30 Thursday, 5 September, 13
  14. class AppBasedRouter(object): def __init__(self, *args, **kwargs): if not (self.app_name or

    self.db_name): raise AttributeError('missing class variables `app_name` or `db_name`') def db_for_read(self, model, **hints): if model._meta.app_label == self.app_name: return self.db_name return None def db_for_write(self, model, **hints): if model._meta.app_label == self.app_name: return self.db_name return None def allow_relation(self, obj1, obj2, **hints): # only allow joins between objects in the same app if obj1._meta.app_label == self.app_name and obj2._meta.app_label == self.app_name: return True elif obj1._meta.app_label == self.app_name or obj2._meta.app_label == self.app_name: return False else: # neither of the models are from the app - we have no opinion. return None def allow_syncdb(self, db, model): if model._meta.app_label in ['south']: return True if db == self.db_name: return model._meta.app_label == self.app_name return None App-based feature partitioning database router. 33 33 Thursday, 5 September, 13
  15. class AppBasedRouter(object): def __init__(self, *args, **kwargs): if not (self.app_name or

    self.db_name): raise AttributeError('missing class variables `app_name` or `db_name`') def db_for_read(self, model, **hints): if model._meta.app_label == self.app_name: return self.db_name return None def db_for_write(self, model, **hints): if model._meta.app_label == self.app_name: return self.db_name return None def allow_relation(self, obj1, obj2, **hints): # only allow joins between objects in the same app if obj1._meta.app_label == self.app_name and obj2._meta.app_label == self.app_name: return True elif obj1._meta.app_label == self.app_name or obj2._meta.app_label == self.app_name: return False else: # neither of the models are from the app - we have no opinion. return None def allow_syncdb(self, db, model): if model._meta.app_label in ['south']: return True if db == self.db_name: return model._meta.app_label == self.app_name return None App-based feature partitioning database router. We will come back to this later 33 33 Thursday, 5 September, 13
  16. class AppBasedRouter(object): def __init__(self, *args, **kwargs): if not (self.app_name or

    self.db_name): raise AttributeError(...) def db_for_read(self, model, **hints): if model._meta.app_label == self.app_name: return self.db_name return None def db_for_write(self, model, **hints): if model._meta.app_label == self.app_name: return self.db_name return None App-based feature partitioning database router. 34 34 Thursday, 5 September, 13
  17. App-based feature partitioning database router. 35 class AppBasedRouter(object): def __init__(self,

    *args, **kwargs): if not (self.app_name or self.db_name): raise AttributeError(...) def db_for_read(self, model, **hints): if model._meta.app_label == self.app_name: return self.db_name return None def db_for_write(self, model, **hints): if model._meta.app_label == self.app_name: return self.db_name return None 35 Thursday, 5 September, 13
  18. App-based feature partitioning database router. 36 class AppBasedRouter(object): def allow_relation(self,

    obj1, obj2, **hints): # only allow joins between objects in the same app if obj1._meta.app_label == self.app_name and obj2._meta.app_label == self.app_name: return True elif obj1._meta.app_label == self.app_name or obj2._meta.app_label == self.app_name: return False else: # neither of the models are from the app - we have no opinion. return None def allow_syncdb(self, db, model): if model._meta.app_label in ['south']: return True if db == self.db_name: return model._meta.app_label == self.app_name return None 36 Thursday, 5 September, 13
  19. App-based feature partitioning database router. 37 class AppBasedRouter(object): def allow_relation(self,

    obj1, obj2, **hints): # only allow joins between objects in the same app if obj1._meta.app_label == self.app_name and obj2._meta.app_label == self.app_name: return True elif obj1._meta.app_label == self.app_name or obj2._meta.app_label == self.app_name: return False else: # neither of the models are from the app - we have no opinion. return None def allow_syncdb(self, db, model): if model._meta.app_label in ['south']: return True if db == self.db_name: return model._meta.app_label == self.app_name return None 37 Thursday, 5 September, 13
  20. Using the AppBasedRouter. 38 class PostsRouter(AppBasedRouter): app_name = 'posts' db_name

    = 'posts_db' class CommentsRouter(AppBasedRouter): app_name = 'comments' db_name = 'comments_db' 38 Thursday, 5 September, 13
  21. Sync south_migrationhistory on to every database. class AppBasedRouter(object): ... def

    allow_syncdb(self, db, model): if model._meta.app_label in ['south']: return True if db == self.db_name: return model._meta.app_label == self.app_name return None Remember this?? 41 41 Thursday, 5 September, 13
  22. Multi-database migrations using South. 42 ./manage.py migrate posts --database=default --fake

    ./manage.py migrate posts --database=posts_db ./manage.py migrate comments --database=default --fake ./manage.py migrate comments --database=comments_db ./manage.py migrate --database=default NOTE: Need to specify the apps to migrate! 42 Thursday, 5 September, 13
  23. Running migrations with multiple databases. 43 ./manage.py migrate posts --database=default

    --fake ./manage.py migrate posts --database=posts_db ./manage.py migrate comments --database=default --fake ./manage.py migrate comments --database=comments_db ./manage.py migrate --database=default 43 Thursday, 5 September, 13
  24. Running migrations with multiple databases. 44 ./manage.py migrate posts --database=default

    --fake ./manage.py migrate posts --database=posts_db ./manage.py migrate comments --database=default --fake ./manage.py migrate comments --database=comments_db ./manage.py migrate --database=default 44 Thursday, 5 September, 13
  25. Running migrations with multiple databases. 45 ./manage.py migrate posts --database=default

    --fake ./manage.py migrate posts --database=posts_db ./manage.py migrate comments --database=default --fake ./manage.py migrate comments --database=comments_db ./manage.py migrate --database=default 45 Thursday, 5 September, 13
  26. posts = Post.objects.using('default').filter(date_added__gte='2013-09-04') DatabaseError Traceback (most recent call last) <ipython-input-5-89aaceb5a56d>

    in <module>() ----> 1 Invoice.objects.filter(date_added__lte='2013-09-01') ... /data/virtualenv/djangocon_sharding/lib/python2.7/site-packages/MySQLdb/cursors.pyc in execute(self, query, args) 172 del tb 173 self.messages.append((exc, value)) --> 174 self.errorhandler(self, exc, value) 175 self._executed = query 176 if not self._defer_warnings: self._warning_check() /data/virtualenv/djangocon_sharding/lib/python2.7/site-packages/MySQLdb/connections.pyc in defaulterrorhandler(***failed resolving arguments***) 34 del cursor 35 del connection ---> 36 raise errorclass, errorvalue 37 38 re_numeric_part = re.compile(r"^(\d+)") DatabaseError: (1146, "Table 'djangocon_sharding.posts_post' doesn't exist") We want to generate exceptions when accessing data. 46 46 Thursday, 5 September, 13
  27. Horizontally Sharding the Data Splitting data across numerous databases with

    the same schema. 48 48 Thursday, 5 September, 13
  28. Choosing a Sharding Key Pay special attention to the relationships

    between Models. 50 50 Thursday, 5 September, 13
  29. Choosing a Sharding Key Pay special attention to the relationships

    between Models. This might make a great Shard Key 50 50 Thursday, 5 September, 13
  30. Unlucky with your Schema? Finding an optimal shard key for

    some systems (like social networks) can be difficult. 51 51 Thursday, 5 September, 13
  31. Picked a Poor Sharding Key? 53 Choosing a key to

    shard on is hard - there are a few tell-tale signs that your pick wasn’t optimal. 53 Thursday, 5 September, 13
  32. Resist your Urges 54 Not everything makes for a good

    sharding key. 54 Thursday, 5 September, 13
  33. Define Your Sharded Databases Defining multiple databases with the same

    schema for use in Django applications. 55 55 Thursday, 5 September, 13
  34. Extend dj-database-url to make multi-db easier. 56 import dj_database_url def

    config(default=None, **kwargs): """ Returns a database dictionary from a URI. Updates the resulting database dictionary with the `kwargs` passed in. """ database_config = dj_database_url.config(default=default) database_config.update(kwargs) return database_config 56 Thursday, 5 September, 13
  35. Define your sharded databases in your settings.py file. Group the

    shards by a common identifier 57 import dj_multidb_url DATABASES = { 'default': dj_multidb_url.config(default='mysql://mysql_user:***@localhost/default'), 'blog_001': dj_multidb_url.config(default='mysql://mysql_user:***@localhost/blog_001', shard_group='blog_shards'), 'blog_002': dj_multidb_url.config(default='mysql://mysql_user:***@localhost/blog_002', shard_group='blog_shards'), 'blog_003': dj_multidb_url.config(default='mysql://mysql_user:***@localhost/blog_003', shard_group='blog_shards'), 'blog_004': dj_multidb_url.config(default='mysql://mysql_user:***@localhost/blog_004', shard_group='blog_shards'), 'blog_005': dj_multidb_url.config(default='mysql://mysql_user:***@localhost/blog_005', shard_group='blog_shards'), 'blog_006': dj_multidb_url.config(default='mysql://mysql_user:***@localhost/blog_006', shard_group='blog_shards'), 'blog_007': dj_multidb_url.config(default='mysql://mysql_user:***@localhost/blog_007', shard_group='blog_shards'), 'blog_008': dj_multidb_url.config(default='mysql://mysql_user:***@localhost/blog_008', shard_group='blog_shards'), 'blog_009': dj_multidb_url.config(default='mysql://mysql_user:***@localhost/blog_009', shard_group='blog_shards'), 'blog_010': dj_multidb_url.config(default='mysql://mysql_user:***@localhost/blog_010', shard_group='blog_shards'), } Share a common prefix 57 Thursday, 5 September, 13
  36. Sample of a sharded Model. 59 class MyShardedModel(Model): sharding_key_pk =

    models.IntegerField() ... def get_shard(self): """ Returns the database alias that the data should reside on. """ if self._state.db: return self._state.db sharding_key = ShardingKey.objects.get(pk=self.sharding_key_pk) return sharding_key.shard_database_alias 59 Thursday, 5 September, 13
  37. Sharded database router. 60 class BaseShardRouter(object): app_list = tuple() db_name_prefix

    = None def db_for_read(self, model, **hints): if model._meta.app_label in self.app_list: instance = hints.get('instance') if instance: return instance.get_shard() return None def db_for_write(self, model, **hints): if model._meta.app_label in self.app_list: instance = hints.get('instance') if instance: return instance.get_shard() return None def allow_syncdb(self, db, model): if model._meta.app_label in ['south']: return True this_app = (model._meta.app_label in self.app_list) _db = db.startswith(self.db_names_prefix) if this_app: return _db if _db: return False return None 60 Thursday, 5 September, 13
  38. Sharded database router. 61 class BaseShardRouter(object): app_list = tuple() db_name_prefix

    = None def db_for_read(self, model, **hints): if model._meta.app_label in self.app_list: instance = hints.get('instance') if instance: return instance.get_shard() return None def db_for_write(self, model, **hints): if model._meta.app_label in self.app_list: instance = hints.get('instance') if instance: return instance.get_shard() return None def allow_syncdb(self, db, model): if model._meta.app_label in ['south']: return True this_app = (model._meta.app_label in self.app_list) _db = db.startswith(self.db_names_prefix) if this_app: return _db if _db: return False return None 61 Thursday, 5 September, 13
  39. Sharded database router. 62 class BaseShardRouter(object): app_list = tuple() db_name_prefix

    = None def db_for_read(self, model, **hints): if model._meta.app_label in self.app_list: instance = hints.get('instance') if instance: return instance.get_shard() return None def db_for_write(self, model, **hints): if model._meta.app_label in self.app_list: instance = hints.get('instance') if instance: return instance.get_shard() return None def allow_syncdb(self, db, model): if model._meta.app_label in ['south']: return True this_app = (model._meta.app_label in self.app_list) _db = db.startswith(self.db_names_prefix) if this_app: return _db if _db: return False return None 62 Thursday, 5 September, 13
  40. Sharded database router. 63 class BaseShardRouter(object): app_list = tuple() db_name_prefix

    = None def db_for_read(self, model, **hints): if model._meta.app_label in self.app_list: instance = hints.get('instance') if instance: return instance.get_shard() return None def db_for_write(self, model, **hints): if model._meta.app_label in self.app_list: instance = hints.get('instance') if instance: return instance.get_shard() return None def allow_syncdb(self, db, model): if model._meta.app_label in ['south']: return True this_app = (model._meta.app_label in self.app_list) _db = db.startswith(self.db_names_prefix) if this_app: return _db if _db: return False return None 63 Thursday, 5 September, 13
  41. Sharded database router. 64 class BaseShardRouter(object): app_list = tuple() db_name_prefix

    = None def db_for_read(self, model, **hints): if model._meta.app_label in self.app_list: instance = hints.get('instance') if instance: return instance.get_shard() return None def db_for_write(self, model, **hints): if model._meta.app_label in self.app_list: instance = hints.get('instance') if instance: return instance.get_shard() return None def allow_syncdb(self, db, model): if model._meta.app_label in ['south']: return True this_app = (model._meta.app_label in self.app_list) _db = db.startswith(self.db_names_prefix) if this_app: return _db if _db: return False return None NOTE: We did not define an allow_relation() method because we deferred to the default behaviour. 64 Thursday, 5 September, 13
  42. Reading from a shard. 67 shard = sharding_key.shard_database_alias post =

    Post.objects.using(shard).get(...) 67 Thursday, 5 September, 13
  43. Updating an instance on a shard. 68 shard = sharding_key.shard_database_alias

    post = Post.objects.using(shard).get(...) post.status = 'published' post.save() 68 Thursday, 5 September, 13
  44. Creating a new instance on a shard. 69 shard =

    sharding_key.shard_database_alias post = Post(...) post.save(using=shard) 69 Thursday, 5 September, 13
  45. Creating a new instance on a shard via QuerySet method.

    70 shard = sharding_key.shard_database_alias post = Post.objects.using(shard).create(...) 70 Thursday, 5 September, 13
  46. 75 Python key generator proof-of-concept. class IDGenerator(object): def __init__(self): self.worker_id

    = 1 self.sequence = 0 self.timestamp = 0 self._last_key = 0 self._last_timestamp = 0 def _prime_timestamp(self): timestamp = int(time.time() * 1000) if timestamp < self.timestamp: raise Exception self._last_timestamp, self.timestamp = self.timestamp, timestamp def _generate_key(self, collision=False): if collision: self.sequence += (self.sequence % 4096) + 1 self._prime_timestamp() key = self.timestamp << (64 - 41) key |= self.worker_id << (64 - 41 - 11) key |= self.sequence return key def key(self): key = self._generate_key() while key <= self._last_key: # wait for milliseconds to change, then re-gen the key key = self._generate_key(collision=True) self._last_key = key return key 75 Thursday, 5 September, 13