Scaling the World's Largest Django App - DjangoCon 2010

David Cramer
September 26, 2011

Transcript

1. What is DISQUS?
   We are a comment system with an emphasis on connecting communities.
   http://disqus.com/about/
   dis·cuss • dĭ-skŭs'
2. What is Scale?
   • 17,000 requests/second peak
   • 450,000 websites
   • 15 million profiles
   • 75 million comments
   • 250 million visitors (August 2010)
   [Chart: "Our traffic at a glance" - number of visitors]
3. Our Challenges
   • We can't predict when things will happen
     • Random celebrity gossip
     • Natural disasters
   • Discussions never expire
     • We can't keep those millions of articles from 2008 in the cache
   • You don't know in advance (generally) where the traffic will be
     • Especially with dynamic paging, realtime, sorting, personal prefs, etc.
4. Our Challenges (cont'd)
   • High availability
   • Not a destination site
   • Difficult to schedule maintenance
5. Server Architecture - Load Balancing
   • Load Balancing
     • Software: HAProxy
     • High performance, intelligent server availability checking
     • Bonus: nice statistics reporting
   • High Availability
     • heartbeat
   Image source: http://haproxy.1wt.eu/
6. Server Architecture
   • ~100 servers
   • 30% web servers (Apache + mod_wsgi)
   • 10% databases (PostgreSQL)
   • 25% cache servers (memcached)
   • 20% load balancing / high availability (HAProxy + heartbeat)
   • 15% utility servers (Python scripts)
7. Server Architecture - Web Servers
   • Apache 2.2
   • mod_wsgi
     • Using `maximum-requests` to plug memory leaks
   • Performance monitoring
     • Custom middleware (PerformanceLogMiddleware)
     • Ships performance statistics (DB queries, external calls, template rendering, etc.) through syslog
     • Collected and graphed through Ganglia
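   As a rough illustration (not the actual DISQUS middleware), such a middleware might time each request and ship the numbers to syslog; the logger name and the fields logged here are assumptions, and the real version also records external calls and template rendering:

       import logging
       import logging.handlers
       import time

       from django.db import connection

       logger = logging.getLogger('disqus.performance')
       logger.addHandler(logging.handlers.SysLogHandler(address='/dev/log'))

       class PerformanceLogMiddleware(object):
           def process_request(self, request):
               request._perf_start = time.time()

           def process_response(self, request, response):
               start = getattr(request, '_perf_start', None)
               if start is not None:
                   elapsed = time.time() - start
                   # connection.queries is only populated when query logging is enabled
                   logger.info('path=%s time=%.3fs queries=%d',
                               request.path, elapsed, len(connection.queries))
               return response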
8. Server Architecture - Database
   • PostgreSQL
   • Slony-I for replication
     • Trigger-based
   • Read slaves for extra read capacity
   • Failover master database for high availability
9. Server Architecture - Database
   • Make sure indexes fit in memory, and measure I/O
     • High I/O generally means slow queries due to missing indexes or indexes not in the buffer cache
   • Log slow queries
     • syslog-ng + pgFouine + cron to automate slow query logging
10. Server Architecture - Database
   • Use connection pooling
     • Django doesn't do this for you
     • We use pgbouncer
   • Limits the maximum number of connections your database needs to handle
   • Saves on costly opening and tearing down of new database connections
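   A minimal sketch of pointing Django at a local pgbouncer instead of PostgreSQL directly (Django 1.2-style settings); the host, port, and names below are illustrative, not the DISQUS configuration:

       DATABASES = {
           'default': {
               'ENGINE': 'django.db.backends.postgresql_psycopg2',
               'NAME': 'disqus',
               'USER': 'disqus',
               'HOST': '127.0.0.1',  # pgbouncer listens here...
               'PORT': '6432',       # ...and pools connections to the real PostgreSQL server
           },
       }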
11. Partitioning
   • Fairly easy to implement, quick wins
   • Done at the application level
   • Data is replayed by Slony
   • Two methods of data separation
12. Vertical Partitioning
   Vertical partitioning involves creating tables with fewer columns and using additional tables to store the remaining columns.
   http://en.wikipedia.org/wiki/Partition_(database)
   [Diagram: separate partitions for Posts, Users, Forums, Sentry]
13. Pythonic Joins
   Allows us to separate datasets:

       posts = Post.objects.all()[0:25]

       # store users in a dictionary based on primary key
       users = dict(
           (u.pk, u) for u in
           User.objects.filter(pk__in=set(p.user_id for p in posts))
       )

       # map users to their posts
       for p in posts:
           p._user_cache = users.get(p.user_id)
14. Pythonic Joins (cont'd)
   • Slower than at the database level
     • But not enough that you should care
     • Trading performance for scale
   • Allows us to separate data
     • Easy vertical partitioning
   • More efficient caching
     • get_many, object-per-row cache
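   A minimal sketch of the object-per-row idea layered on the Pythonic join above, assuming User is the model from that example; the key format and timeout are illustrative:

       from django.core.cache import cache

       def get_users(user_ids):
           # one memcached round-trip for every user we already have cached
           keys = dict(('user:%d' % uid, uid) for uid in user_ids)
           cached = cache.get_many(keys.keys())
           users = dict((keys[key], user) for key, user in cached.iteritems())

           # fall back to the database only for the misses
           missing = [uid for uid in user_ids if uid not in users]
           if missing:
               for user in User.objects.filter(pk__in=missing):
                   users[user.pk] = user
                   cache.set('user:%d' % user.pk, user, 300)
           return users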
15. Designating Masters
   • Alleviates some of the write load on your primary application master
   • Masters exist under specific conditions:
     • application use case
     • partitioned data
   • Database routers make this (fairly) easy
16. Routing by Application

       class ApplicationRouter(object):
           def db_for_read(self, model, **hints):
               instance = hints.get('instance')
               if not instance:
                   return None
               app_label = instance._meta.app_label
               return get_application_alias(app_label)
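   A router like this only takes effect once it is listed in settings; a minimal sketch, assuming a hypothetical disqus.routers module holding the class above:

       # settings.py -- the module path is illustrative
       DATABASE_ROUTERS = ['disqus.routers.ApplicationRouter']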
17. Horizontal Partitioning
   Horizontal partitioning (also known as sharding) involves splitting one set of data into different tables.
   http://en.wikipedia.org/wiki/Partition_(database)
   [Diagram: forums sharded by partner - Your Blog, CNN, Disqus, Telegraph]
18. Horizontal Partitions
   • Some forums have very large datasets
   • Partners need high availability
   • Helps scale the write load on the master
   • We rely more on vertical partitions
19. Routing by Partition

       class ForumPartitionRouter(object):
           def db_for_read(self, model, **hints):
               instance = hints.get('instance')
               if not instance:
                   return None
               forum_id = getattr(instance, 'forum_id', None)
               if not forum_id:
                   return None
               return get_forum_alias(forum_id)

       # Now, making sure hints are available
       forum.post_set.all()

       # What we used to do
       Post.objects.filter(forum=forum)
20. Optimizing QuerySets
   • We really dislike raw SQL
     • It creates more work when dealing with partitions
   • Built-in cache allows sub-slicing
     • But isn't always needed
     • We removed this cache
21. Removing the Cache
   • Django internally caches the results of your QuerySet
   • This adds additional memory overhead
   • Many times you only need to view a result set once
   • So we built SkinnyQuerySet

       # 1 query
       qs = Model.objects.all()[0:100]
       # 0 queries (we don't need this behavior)
       qs = qs[0:10]
       # 1 query
       qs = qs.filter(foo=bar)
22. Removing the Cache (cont'd)
   Optimizing memory usage by removing the cache:

       class SkinnyQuerySet(QuerySet):
           def __iter__(self):
               if self._result_cache is not None:
                   # __len__ must have been run
                   return iter(self._result_cache)

               has_run = getattr(self, 'has_run', False)
               if has_run:
                   raise QuerySetDoubleIteration("...")
               self.has_run = True

               # We wanted .iterator() as the default
               return self.iterator()

   http://gist.github.com/550438
23. Atomic Updates
   • Keeps your data consistent
   • save() isn't thread-safe
     • use update() instead
   • Great for things like counters
   • But should be considered for all write operations
24. Atomic Updates (cont'd)
   Thread safety is impossible with .save():

       # Request 1
       post = Post(pk=1)
       # a moderator approves
       post.approved = True
       post.save()

       # Request 2
       post = Post(pk=1)
       # the author adjusts their message
       post.message = 'Hello!'
       post.save()
25. Atomic Updates (cont'd)
   So we need atomic updates:

       # Request 1
       post = Post(pk=1)
       # a moderator approves
       Post.objects.filter(pk=post.pk)\
           .update(approved=True)

       # Request 2
       post = Post(pk=1)
       # the author adjusts their message
       Post.objects.filter(pk=post.pk)\
           .update(message='Hello!')
26. Atomic Updates (cont'd)
   A better way to approach updates:

       from django.db.models.expressions import ExpressionNode

       def update(obj, using=None, **kwargs):
           """
           Updates specified attributes on the current instance.
           """
           assert obj.pk, "Instance has not yet been created."
           obj.__class__._base_manager.using(using)\
               .filter(pk=obj.pk)\
               .update(**kwargs)
           for k, v in kwargs.iteritems():
               if isinstance(v, ExpressionNode):
                   # NotImplemented: F() expressions can't be copied back onto the instance
                   continue
               setattr(obj, k, v)

   http://github.com/andymccurdy/django-tips-and-tricks/blob/master/model_update.py
27. Delayed Signals
   • Queueing low priority tasks
     • even if they're fast
   • Asynchronous (delayed) signals
     • very friendly to the developer
     • ...but not as friendly as real signals
28. Delayed Signals (cont'd)
   We send a specific serialized version of the model for delayed signals.
   This is all handled through our queue.

       from disqus.common.signals import delayed_save

       def my_func(data, sender, created, **kwargs):
           print data['id']

       delayed_save.connect(my_func, sender=Post)
29. Caching
   • Memcached
   • Use pylibmc (newer, libmemcached-based)
     • Ticket #11675 (add pylibmc support)
   • Third-party applications:
     • django-newcache, django-pylibmc
30. Caching (cont'd)
   • libmemcached / pylibmc is configurable with "behaviors"
   • Memcached "single point of failure"
     • Distributed system, but we must take precautions
     • Connection timeout to memcached can stall requests
     • Use `_auto_eject_hosts` and `_retry_timeout` behaviors to prevent reconnecting to dead caches
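   A minimal sketch of setting such behaviors through pylibmc; behavior key names vary between pylibmc/libmemcached releases, so the spellings and values below are illustrative rather than the exact DISQUS configuration:

       import pylibmc

       mc = pylibmc.Client(['10.0.0.1:11211', '10.0.0.2:11211'], binary=True, behaviors={
           'ketama': True,            # consistent hashing (see the next slides)
           'auto_eject_hosts': True,  # stop retrying servers marked dead...
           'retry_timeout': 30,       # ...for 30 seconds after a failure
           'connect_timeout': 200,    # fail fast rather than stalling the request
       })
       mc.set('my_cache_key', 'value')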
31. Caching (cont'd)
   • Default (naive) hashing behavior
     • Modulo the hashed cache key for an index into the server list
     • Removal of a server causes the majority of cache keys to be remapped to new servers

       CACHE_SERVERS = ['10.0.0.1', '10.0.0.2']
       key = 'my_cache_key'
       cache_server = CACHE_SERVERS[hash(key) % len(CACHE_SERVERS)]
32. Caching (cont'd)
   • Better approach: consistent hashing
   • libmemcached (pylibmc) uses libketama (http://tinyurl.com/lastfm-libketama)
   • Addition / removal of a cache server remaps K/n cache keys (where K = number of keys and n = number of servers)
   Image source: http://sourceforge.net/apps/mediawiki/kai/index.php?title=Introduction
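   For intuition, a minimal consistent hash ring might look like the sketch below; this is illustrative only, since the real behavior comes from libketama inside libmemcached, which also adds server weighting:

       import bisect
       import hashlib

       class HashRing(object):
           def __init__(self, servers, points_per_server=100):
               self._ring = []  # sorted list of (position, server)
               for server in servers:
                   for i in xrange(points_per_server):
                       position = self._hash('%s#%d' % (server, i))
                       self._ring.append((position, server))
               self._ring.sort()

           def _hash(self, value):
               return int(hashlib.md5(value).hexdigest(), 16)

           def get_server(self, key):
               # walk clockwise to the first point at or after the key's position
               position = self._hash(key)
               index = bisect.bisect(self._ring, (position,))
               if index == len(self._ring):
                   index = 0  # wrap around the ring
               return self._ring[index][1]

       # Adding or removing a server only remaps the keys whose nearest points
       # belonged to that server (roughly K/n of them), instead of nearly all keys.
       ring = HashRing(['10.0.0.1', '10.0.0.2', '10.0.0.3'])
       server = ring.get_server('my_cache_key')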
33. Caching (cont'd)
   • Thundering herd (stampede) problem
     • Invalidating a heavily accessed cache key causes many clients to refill the cache
     • But everyone refetching from the data store (or reprocessing data) to fill the cache can make things even slower
     • Most times it's ideal to return the previously invalidated cache value and let a single client refill the cache
     • django-newcache or MintCache (http://djangosnippets.org/snippets/793/) will do this for you
   • Preferring to fill the cache on invalidation, rather than deleting from it, also helps prevent the thundering herd
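   A minimal sketch of the MintCache/django-newcache idea, serving a slightly stale value while a single client rebuilds it; the function names, key suffix, and timeouts here are assumptions, not DISQUS or django-newcache code:

       import time
       from django.core.cache import cache

       REFRESH_WINDOW = 60  # keep entries this many seconds past their soft expiry

       def mint_set(key, value, timeout=300):
           # store the soft expiry alongside the value, and keep the entry in
           # memcached a bit longer than the advertised timeout
           cache.set(key, (value, time.time() + timeout), timeout + REFRESH_WINDOW)

       def mint_get(key, rebuild):
           packed = cache.get(key)
           if packed is None:
               # true miss: we have no choice but to rebuild
               value = rebuild()
               mint_set(key, value)
               return value
           value, expires_at = packed
           if time.time() > expires_at and cache.add(key + ':refreshing', 1, REFRESH_WINDOW):
               # stale: the first client to win the add() rebuilds the value,
               # everyone else keeps returning the stale copy in the meantime
               value = rebuild()
               mint_set(key, value)
               cache.delete(key + ':refreshing')
           return value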
34. Transactions
   • TransactionMiddleware got us started, but down the road became a burden
   • For postgresql_psycopg2, there's a database option: OPTIONS['autocommit']
     • Each query is in its own transaction, which means each request won't start in a transaction
   • But sometimes we want transactions (e.g., saving multiple objects and rolling back on error)
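   A minimal sketch of that option in Django 1.2-era settings (database name and alias are illustrative):

       DATABASES = {
           'default': {
               'ENGINE': 'django.db.backends.postgresql_psycopg2',
               'NAME': 'disqus',
               'OPTIONS': {
                   'autocommit': True,  # each query runs in its own transaction;
                                        # requests no longer start inside one
               },
           },
       }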
35. Transactions (cont'd)
   • Tips:
     • Use autocommit for read slave databases
     • Isolate slow functions (e.g., external calls, template rendering) from transactions
   • Selective autocommit
     • Most read-only views don't need to be in transactions
     • Start in autocommit and switch to a transaction on write
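   One way to express "start in autocommit, switch to a transaction on write" with the Django 1.2-era API is to decorate only the write path; the views and the Post/Thread models below are hypothetical:

       from django.db import transaction
       from django.db.models import F
       from django.http import HttpResponse

       def thread_detail(request, thread_id):
           # read-only view: stays in autocommit, no transaction overhead
           return HttpResponse('ok')

       @transaction.commit_on_success
       def post_comment(request, thread_id):
           # write path: runs in a transaction, committed on success and
           # rolled back if anything below raises
           Post.objects.create(thread_id=thread_id, message=request.POST['message'])
           Thread.objects.filter(pk=thread_id).update(post_count=F('post_count') + 1)
           return HttpResponse('created')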
36. Scaling the Team
   • Small team of engineers
     • Roughly 40 million monthly users per developer
   • Which means writing tests...
   • ...and having a dead simple workflow
37. Keeping it Simple
   • A developer can be up and running in a few minutes
     • assuming Postgres and other server applications are already installed
   • pip, virtualenv
   • settings.py
38. Setting Up Local
   1. createdb -E UTF-8 disqus
   2. git clone git://repo
   3. mkvirtualenv disqus
   4. pip install -U -r requirements.txt
   5. ./manage.py syncdb && ./manage.py migrate
39. Sane Defaults

       # settings.py
       from disqus.conf.settings.default import *

       try:
           from local_settings import *
       except ImportError:
           import sys, traceback
           sys.stderr.write("Can't find 'local_settings.py'\n")
           sys.stderr.write("\nThe exception was:\n\n")
           traceback.print_exc()

       # local_settings.py
       from disqus.conf.settings.dev import *
40. Continuous Integration
   • Daily deploys with Fabric
     • several times an hour on some days
   • Hudson keeps our builds going
     • combined with Selenium
   • Post-commit hooks for quick testing
     • like Pyflakes
   • Reverting to a previous version is a matter of seconds
41. Testing
   • It's not fun breaking things when you're the new guy
   • Our testing process is fairly heavy
     • 70k (Python) LOC, 73% coverage, 20 min suite
   • Custom test runner (unittest)
     • We needed XML, Selenium, query counts
   • Database proxies (for read-slave testing)
   • Integration with our queue
42. Testing (cont'd)
   Query counts (failures yield a dump of queries):

       def test_read_slave(self):
           Model.objects.using('read_slave').count()
           self.assertQueryCount(1, 'read_slave')

   Selenium:

       def test_button(self):
           self.selenium.click('//a[@class="dsq-button"]')

   Queue integration:

       class WorkerTest(DisqusTest):
           workers = ['fire_signal']

           def test_delayed_signal(self):
               ...
43. Bug Tracking
   • Switched from Trac to Redmine
     • We wanted subtasks
   • Emailing exceptions is a bad idea
     • Even if it's localhost
   • Previously used django-db-log to aggregate errors to a single point
   • We've overhauled django-db-log and are releasing Sentry
44. Feature Switches
   • We needed a safety in case a feature wasn't performing well at peak
     • it had to respond without delay, globally, and without writing to disk
   • Allows us to work out of trunk (mostly)
   • Easy to release new features to a portion of your audience
   • Also nice for "Labs" type projects
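   A minimal sketch of a cache-backed switch along those lines; the real DISQUS switches (linked in the references) add per-audience rollout and an admin UI, and the names below are illustrative:

       from django.core.cache import cache

       SWITCH_PREFIX = 'feature:'

       def is_active(name, default=False):
           # a single memcached get: global, fast, and nothing touches disk
           value = cache.get(SWITCH_PREFIX + name)
           if value is None:
               return default
           return bool(value)

       def set_switch(name, active):
           # flipping a switch takes effect on the next request everywhere
           cache.set(SWITCH_PREFIX + name, int(active))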
45. Final Thoughts
   • The language (usually) isn't your problem
   • We like Django
     • But we maintain local patches
   • Some tickets don't have enough of a following
     • Patches, like #17, completely change Django...
     • ...arguably in a good way
     • Others don't have champions
   (Ticket #17 describes making the ORM an identity mapper.)
46. Housekeeping
   • Want to learn from others about performance and scaling problems? Birds of a Feather
   • We're hiring! DISQUS is looking for amazing engineers
   • Or play some StarCraft 2?
47. References
   • django-sentry: http://github.com/dcramer/django-sentry
   • Our feature switches: http://cl.ly/2FYt
   • Andy McCurdy's update(): http://github.com/andymccurdy/django-tips-and-tricks
   • Our Pyflakes fork: http://github.com/dcramer/pyflakes
   • SkinnyQuerySet: http://gist.github.com/550438
   • django-newcache: http://github.com/ericflo/django-newcache
   • attach_foreignkey (Pythonic joins): http://gist.github.com/567356