Scaling the World's Largest Django App - DjangoCon 2010

Slide 1

Slide 1 text

Jason Yan @jasonyan David Cramer @zeeg Scaling the World’s Largest Django App 1

Slide 2

Slide 2 text

What is DISQUS? 2

Slide 3

Slide 3 text

What is DISQUS? We are a comment system with an emphasis on connecting communities http://disqus.com/about/ dis·cuss • dĭ-skŭs' 3

Slide 4

Slide 4 text

What is Scale? • 17,000 requests/second peak • 450,000 websites • 15 million profiles • 75 million comments • 250 million visitors (August 2010) [Chart: "Our traffic at a glance"; y-axis: Number of Visitors, 50M to 300M]

Slide 5

Slide 5 text

Our Challenges • We can’t predict when things will happen • Random celebrity gossip • Natural disasters • Discussions never expire • We can’t keep those millions of articles from 2008 in the cache • You don’t know in advance (generally) where the traffic will be • Especially with dynamic paging, realtime, sorting, personal prefs, etc. 5

Slide 6

Slide 6 text

Our Challenges (cont’d) • High availability • Not a destination site • Difficult to schedule maintenance 6

Slide 7

Slide 7 text

Server Architecture 7

Slide 8

Slide 8 text

Server Architecture - Load Balancing • Load Balancing • Software, HAProxy • High performance, intelligent server availability checking • Bonus: Nice statistics reporting • High Availability • heartbeat Image Source: http://haproxy.1wt.eu/ 8

Slide 9

Slide 9 text

Server Architecture • ~100 Servers • 30% Web Servers (Apache + mod_wsgi) • 10% Databases (PostgreSQL) • 25% Cache Servers (memcached) • 20% Load Balancing / High Availability (HAProxy + heartbeat) • 15% Utility Servers (Python scripts) 9

Slide 10

Slide 10 text

Server Architecture - Web Servers • Apache 2.2 • mod_wsgi • Using `maximum-requests` to plug memory leaks. • Performance Monitoring • Custom middleware (PerformanceLogMiddleware) • Ships performance statistics (DB queries, external calls, template rendering, etc) through syslog • Collected and graphed through Ganglia 10

Slide 11

Slide 11 text

Server Architecture - Database • PostgreSQL • Slony-I for Replication • Trigger-based • Read slaves for extra read capacity • Failover master database for high availability 11

Slide 12

Slide 12 text

Server Architecture - Database • Make sure indexes fit in memory and measure I/O • High I/O generally means slow queries due to missing indexes or indexes not in buffer cache • Log Slow Queries • syslog-ng + pgFouine + cron to automate slow query logging 12

Slide 13

Slide 13 text

Server Architecture - Database • Use connection pooling • Django doesn’t do this for you • We use pgbouncer • Limits the maximum number of connections your database needs to handle • Save on costly opening and tearing down of new database connections 13

Slide 14

Slide 14 text

Our Data Model 14

Slide 15

Slide 15 text

Partitioning • Fairly easy to implement, quick wins • Done at the application level • Data is replayed by Slony • Two methods of data separation 15

Slide 16

Slide 16 text

Vertical Partitioning Vertical partitioning involves creating tables with fewer columns and using additional tables to store the remaining columns. http://en.wikipedia.org/wiki/Partition_(database) Posts Users Forums Sentry 16

Slide 17

Slide 17 text

Pythonic Joins

    posts = Post.objects.all()[0:25]

    # store users in a dictionary based on primary key
    users = dict(
        (u.pk, u) for u in
        User.objects.filter(pk__in=set(p.user_id for p in posts))
    )

    # map users to their posts
    for p in posts:
        p._user_cache = users.get(p.user_id)

Allows us to separate datasets

Slide 18

Slide 18 text

Pythonic Joins (cont’d) • Slower than at database level • But not enough that you should care • Trading performance for scale • Allows us to separate data • Easy vertical partitioning • More efficient caching • get_many, object-per-row cache 18
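The "get_many, object-per-row cache" point can be sketched in plain Python. This is a minimal illustration, not the DISQUS implementation: `DictCache`, `attach_users`, `load_users`, and the `user:<id>` key format are all hypothetical names, and posts are plain dictionaries instead of model instances. The idea is one `get_many` call for every user referenced by a page of posts, followed by at most one database query for the misses.

```python
class DictCache(object):
    """In-memory stand-in for a memcached client (hypothetical)."""

    def __init__(self):
        self._data = {}

    def get_many(self, keys):
        return dict((k, self._data[k]) for k in keys if k in self._data)

    def set_many(self, mapping):
        self._data.update(mapping)


def attach_users(posts, cache, load_users):
    """Attach a user to each post: one get_many, at most one query."""
    user_ids = set(p["user_id"] for p in posts)
    keys = dict(("user:%d" % uid, uid) for uid in user_ids)

    # One cache round trip for every user on the page.
    found = cache.get_many(list(keys))
    users = dict((keys[k], v) for k, v in found.items())

    # Only the misses go to the database, in a single query.
    missing = user_ids - set(users)
    if missing:
        loaded = load_users(missing)  # stands in for filter(pk__in=missing)
        cache.set_many(dict(("user:%d" % uid, u) for uid, u in loaded.items()))
        users.update(loaded)

    for p in posts:
        p["user"] = users.get(p["user_id"])
    return posts


# Usage: a fake "users table" and a log of the queries that were issued.
USERS = {1: {"id": 1, "name": "jason"}, 2: {"id": 2, "name": "david"}}
queries = []


def load_users(ids):
    queries.append(sorted(ids))
    return dict((i, USERS[i]) for i in ids if i in USERS)


cache = DictCache()
posts = [{"id": 10, "user_id": 1}, {"id": 11, "user_id": 2}, {"id": 12, "user_id": 1}]
attach_users(posts, cache, load_users)
more_posts = attach_users([{"id": 13, "user_id": 2}], cache, load_users)
```

The second call is served entirely from the cache, so `queries` still holds a single entry afterwards.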

Slide 19

Slide 19 text

Designating Masters • Alleviates some of the write load on your primary application master • Masters exist under specific conditions: • application use case • partitioned data • Database routers make this (fairly) easy 19

Slide 20

Slide 20 text

Routing by Application

    class ApplicationRouter(object):
        def db_for_read(self, model, **hints):
            instance = hints.get('instance')
            if not instance:
                return None
            app_label = instance._meta.app_label
            return get_application_alias(app_label)

Slide 21

Slide 21 text

Horizontal Partitioning Horizontal partitioning (also known as sharding) involves splitting one set of data into different tables. http://en.wikipedia.org/wiki/Partition_(database) Your Blog CNN Disqus Telegraph 21

Slide 22

Slide 22 text

Horizontal Partitions • Some forums have very large datasets • Partners need high availability • Helps scale the write load on the master • We rely more on vertical partitions 22

Slide 23

Slide 23 text

Routing by Partition

    class ForumPartitionRouter(object):
        def db_for_read(self, model, **hints):
            instance = hints.get('instance')
            if not instance:
                return None
            forum_id = getattr(instance, 'forum_id', None)
            if not forum_id:
                return None
            return get_forum_alias(forum_id)

    # Now, making sure hints are available
    forum.post_set.all()

    # What we used to do
    Post.objects.filter(forum=forum)
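The slide does not show `get_forum_alias`, so the following is only a guess at its shape: a lookup from forum id to a database alias with a fallback. The `FORUM_PARTITIONS` table, the alias names, and the `FakePost` helper are hypothetical. Routers are plain Python classes, so the sketch runs without Django; `db_for_read` and `db_for_write` are the standard router method names.

```python
# Hypothetical mapping: forums large enough to get their own partition.
FORUM_PARTITIONS = {42: "forum_cnn", 77: "forum_telegraph"}


def get_forum_alias(forum_id):
    """Return the database alias for a forum, falling back to 'default'."""
    return FORUM_PARTITIONS.get(forum_id, "default")


class ForumPartitionRouter(object):
    def _alias_for(self, hints):
        instance = hints.get("instance")
        forum_id = getattr(instance, "forum_id", None)
        if not forum_id:
            return None  # no opinion: let the next router decide
        return get_forum_alias(forum_id)

    def db_for_read(self, model, **hints):
        return self._alias_for(hints)

    def db_for_write(self, model, **hints):
        # Writes follow the same partition so a forum's data stays together.
        return self._alias_for(hints)


class FakePost(object):
    """Minimal stand-in for a model instance carrying a forum_id."""

    def __init__(self, forum_id):
        self.forum_id = forum_id


router = ForumPartitionRouter()
partitioned = router.db_for_read(None, instance=FakePost(42))
unpartitioned = router.db_for_read(None, instance=FakePost(5))
no_hint = router.db_for_read(None)
```

Returning `None` when no instance hint is available is why the slide prefers `forum.post_set.all()`: the related manager passes the forum as a hint, while `Post.objects.filter(forum=forum)` does not.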

Slide 24

Slide 24 text

Optimizing QuerySets • We really dislike raw SQL • It creates more work when dealing with partitions • Built-in cache allows sub-slicing • But isn’t always needed • We removed this cache 24

Slide 25

Slide 25 text

Removing the Cache • Django internally caches the results of your QuerySet • This adds additional memory overhead • Many times you only need to view a result set once • So we built SkinnyQuerySet

    # 1 query
    qs = Model.objects.all()[0:100]
    # 0 queries (we don't need this behavior)
    qs = qs[0:10]
    # 1 query
    qs = qs.filter(foo=bar)

Slide 26

Slide 26 text

Removing the Cache (cont’d)

    class SkinnyQuerySet(QuerySet):
        def __iter__(self):
            if self._result_cache is not None:
                # __len__ must have been run
                return iter(self._result_cache)
            has_run = getattr(self, 'has_run', False)
            if has_run:
                raise QuerySetDoubleIteration("...")
            self.has_run = True
            # We wanted .iterator() as the default
            return self.iterator()

Optimizing memory usage by removing the cache
http://gist.github.com/550438
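The guard in `SkinnyQuerySet` can be seen in isolation with a small framework-free sketch. `SinglePass` and `DoubleIteration` are hypothetical names; the `fetch` callable stands in for `QuerySet.iterator()`, which streams rows without filling a result cache.

```python
class DoubleIteration(Exception):
    """Raised when an uncached result set is iterated a second time."""


class SinglePass(object):
    def __init__(self, fetch):
        self._fetch = fetch      # returns a fresh iterator over the rows
        self._has_run = False

    def __iter__(self):
        if self._has_run:
            # Nothing was cached, so a second pass would silently re-query.
            raise DoubleIteration("results were already consumed")
        self._has_run = True
        return self._fetch()


rows = SinglePass(lambda: iter(range(3)))
first = list(rows)

try:
    list(rows)
    second_pass_failed = False
except DoubleIteration:
    second_pass_failed = True
```

Failing loudly on the second pass trades convenience for a guarantee that each result set costs exactly one query and is never held in memory twice.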

Slide 27

Slide 27 text

Atomic Updates • Keeps your data consistent • save() isn’t thread-safe • use update() instead • Great for things like counters • But should be considered for all write operations 27

Slide 28

Slide 28 text

Atomic Updates (cont’d)

Request 1:

    post = Post(pk=1)
    # a moderator approves
    post.approved = True
    post.save()

Request 2:

    post = Post(pk=1)
    # the author adjusts their message
    post.message = 'Hello!'
    post.save()

Thread safety is impossible with .save()

Slide 29

Slide 29 text

Atomic Updates (cont’d)

Request 1:

    post = Post(pk=1)
    # a moderator approves
    Post.objects.filter(pk=post.pk) \
        .update(approved=True)

Request 2:

    post = Post(pk=1)
    # the author adjusts their message
    Post.objects.filter(pk=post.pk) \
        .update(message='Hello!')

So we need atomic updates
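The difference between the two request pairs can be reproduced without Django. The sketch below uses the standard-library `sqlite3` module as a stand-in database; the `post` table and the `load`, `save`, and `update` helpers are hypothetical. `save` writes every column from the in-memory copy, as `Model.save()` did at the time, while `update` writes only the named columns, like `QuerySet.update()`.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE post (id INTEGER PRIMARY KEY, approved INTEGER, message TEXT)"
    )
    conn.execute("INSERT INTO post VALUES (1, 0, 'Hi')")
    return conn


def load(conn):
    row = conn.execute("SELECT id, approved, message FROM post WHERE id = 1").fetchone()
    return {"id": row[0], "approved": row[1], "message": row[2]}


def save(conn, obj):
    # Full-row write: stale values in obj overwrite newer database values.
    conn.execute(
        "UPDATE post SET approved = ?, message = ? WHERE id = ?",
        (obj["approved"], obj["message"], obj["id"]),
    )


def update(conn, pk, **fields):
    # Column-level write. Column names come from code, never from user input.
    assignments = ", ".join("%s = ?" % name for name in fields)
    conn.execute(
        "UPDATE post SET %s WHERE id = ?" % assignments,
        list(fields.values()) + [pk],
    )


# Both requests load the row before either one writes.
conn = make_db()
moderator, author = load(conn), load(conn)
moderator["approved"] = 1
save(conn, moderator)
author["message"] = "Hello!"
save(conn, author)            # writes back the stale approved = 0
after_save = load(conn)

# The same two changes expressed as column-level updates.
conn = make_db()
update(conn, 1, approved=1)
update(conn, 1, message="Hello!")
after_update = load(conn)
```

With full-row saves the moderator's approval is lost; with column-level updates both changes survive.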

Slide 30

Slide 30 text

Atomic Updates (cont’d)

    def update(obj, using=None, **kwargs):
        """
        Updates specified attributes on the current instance.
        """
        assert obj, "Instance has not yet been created."
        obj.__class__._base_manager.using(using) \
            .filter(pk=obj.pk) \
            .update(**kwargs)
        for k, v in kwargs.items():
            if isinstance(v, ExpressionNode):
                # NotImplemented: expression values (e.g. F()) are computed
                # by the database, so the local attribute is left untouched
                continue
            setattr(obj, k, v)

A better way to approach updates
http://github.com/andymccurdy/django-tips-and-tricks/blob/master/model_update.py

Slide 31

Slide 31 text

Delayed Signals • Queueing low priority tasks • even if they’re fast • Asynchronous (Delayed) signals • very friendly to the developer • ..but not as friendly as real signals 31

Slide 32

Slide 32 text

Delayed Signals (cont’d)

    from disqus.common.signals import delayed_save

    def my_func(data, sender, created, **kwargs):
        print(data['id'])

    delayed_save.connect(my_func, sender=Post)

We send a specific serialized version of the model for delayed signals
This is all handled through our Queue
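The internals of `delayed_save` are not shown in the slides, so the following is only a rough sketch of how a delayed signal could work: the request side serializes a payload onto a queue, and a worker later dispatches it to the connected receivers. `DelayedSignal`, `run_worker`, the in-process `deque`, and the use of a string sender name are all assumptions made for illustration.

```python
import json
from collections import deque


class DelayedSignal(object):
    def __init__(self, queue):
        self._queue = queue
        self._receivers = []

    def connect(self, receiver, sender):
        self._receivers.append((sender, receiver))

    def send(self, sender, data, **kwargs):
        # Request side: enqueue a serialized payload and return immediately.
        self._queue.append(
            json.dumps({"sender": sender, "data": data, "kwargs": kwargs})
        )

    def run_worker(self):
        # Worker side: drain the queue and call the matching receivers.
        while self._queue:
            message = json.loads(self._queue.popleft())
            for sender, receiver in self._receivers:
                if sender == message["sender"]:
                    receiver(message["data"], sender=sender, **message["kwargs"])


queue = deque()
delayed_save = DelayedSignal(queue)
seen = []


def my_func(data, sender, created, **kwargs):
    seen.append((data["id"], created))


delayed_save.connect(my_func, sender="Post")
delayed_save.send("Post", {"id": 1}, created=True)
pending_before = len(queue)
seen_before_worker = list(seen)
delayed_save.run_worker()
```

Because the receiver gets serialized data instead of a live model instance, it cannot call model methods directly, which is one reason delayed signals are less friendly than real ones.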

Slide 33

Slide 33 text

Caching • Memcached • Use pylibmc (newer libMemcached-based) • Ticket #11675 (add pylibmc support) • Third party applications: • django-newcache, django-pylibmc 33

Slide 34

Slide 34 text

Caching (cont’d) • libMemcached / pylibmc is configurable with “behaviors”. • Memcached “single point of failure” • Distributed system, but we must take precautions. • Connection timeout to memcached can stall requests. • Use `_auto_eject_hosts` and `_retry_timeout` behaviors to prevent reconnecting to dead caches. 34

Slide 35

Slide 35 text

Caching (cont’d) • Default (naive) hashing behavior • The hashed cache key, modulo the number of servers, gives the index into the server list. • Removal of a server causes the majority of cache keys to be remapped to new servers.

    CACHE_SERVERS = ['10.0.0.1', '10.0.0.2']
    key = 'my_cache_key'
    cache_server = CACHE_SERVERS[hash(key) % len(CACHE_SERVERS)]
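The remapping cost of modulo hashing is easy to measure. This sketch assumes four hypothetical servers and 10,000 made-up keys, and uses MD5 instead of the built-in `hash()` so results are stable between runs. It counts how many keys land on a different server after the last server is removed.

```python
import hashlib


def stable_hash(key):
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def pick_server(key, servers):
    # Naive scheme from the slide: hash modulo the number of servers.
    return servers[stable_hash(key) % len(servers)]


servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
keys = ["key:%d" % i for i in range(10000)]

before = dict((k, pick_server(k, servers)) for k in keys)
after = dict((k, pick_server(k, servers[:-1])) for k in keys)

moved = sum(1 for k in keys if before[k] != after[k])
moved_fraction = moved / float(len(keys))
```

Going from four servers to three, a key keeps its server only when its hash gives the same remainder modulo 4 and modulo 3, which happens for about a quarter of keys, so roughly three quarters of the cache is invalidated by losing one machine.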

Slide 36

Slide 36 text

Caching (cont’d) • Better approach: consistent hashing • libMemcached (pylibmc) uses libketama (http://tinyurl.com/lastfm-libketama) • Addition / removal of a cache server remaps (K/n) cache keys (where K=number of keys and n=number of servers) Image Source: http://sourceforge.net/apps/mediawiki/kai/index.php?title=Introduction 36
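A consistent hash ring can be sketched in a few lines. This is a simplified illustration and not libketama's actual algorithm: the `HashRing` class, the 100 virtual points per server, and the MD5-based hash are assumptions. Each server owns many points on a ring, and a key belongs to the first point at or after its own hash, wrapping around at the end.

```python
import bisect
import hashlib


def stable_hash(value):
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing(object):
    def __init__(self, servers, replicas=100):
        # Several points per server smooth out the key distribution.
        self._points = sorted(
            (stable_hash("%s#%d" % (server, i)), server)
            for server in servers
            for i in range(replicas)
        )
        self._hashes = [h for h, _ in self._points]

    def get_server(self, key):
        # First point clockwise from the key's hash, wrapping at the end.
        index = bisect.bisect(self._hashes, stable_hash(key)) % len(self._points)
        return self._points[index][1]


servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
keys = ["key:%d" % i for i in range(10000)]

before_ring = HashRing(servers)
after_ring = HashRing(servers[:-1])

moved_keys = [
    k for k in keys if before_ring.get_server(k) != after_ring.get_server(k)
]
only_removed_moved = all(
    before_ring.get_server(k) == servers[-1] for k in moved_keys
)
ring_moved_fraction = len(moved_keys) / float(len(keys))
```

Only keys that lived on the removed server change owner, which is where the K/n figure comes from: with four servers, about a quarter of the keys move instead of most of them.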

Slide 37

Slide 37 text

Caching (cont’d) • Thundering herd (stampede) problem • Invalidating a heavily accessed cache key causes many clients to refill the cache at once. • If everyone refetches from the data store or reprocesses data to fill the cache, things can get even slower. • Most times, it’s ideal to return the previously invalidated cache value and let a single client refill the cache. • django-newcache or MintCache (http://djangosnippets.org/snippets/793/) will do this for you. • Refilling the cache on invalidation, instead of deleting the key, also helps prevent the thundering herd problem. 37
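The "serve the stale value while one client refills" idea can be sketched as follows. This is written in the spirit of MintCache but is not its code: `StaleWhileRefillCache`, the `grace` parameter, and the injectable `clock` are hypothetical, and a dictionary stands in for memcached (so nothing is ever hard-expired). Each value is stored with a soft expiry; the first reader after that time gets a miss and is expected to refill, while later readers keep receiving the old value during the grace period.

```python
import time


class StaleWhileRefillCache(object):
    def __init__(self, grace=30, clock=time.time):
        self._data = {}
        self._grace = grace
        self._clock = clock

    def set(self, key, value, timeout):
        # Store the value together with its soft expiry time.
        self._data[key] = (value, self._clock() + timeout)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, stale_at = entry
        if self._clock() >= stale_at:
            # First reader after expiry: extend the soft expiry so other
            # readers still see the old value, and report a miss so that
            # this reader alone recomputes and calls set().
            self._data[key] = (value, self._clock() + self._grace)
            return None
        return value


# Usage with a fake clock so the example does not need to sleep.
now = [0.0]
cache = StaleWhileRefillCache(grace=30, clock=lambda: now[0])
cache.set("thread:1", "old", timeout=60)

now[0] = 61
first = cache.get("thread:1")    # miss: this caller refills
second = cache.get("thread:1")   # everyone else gets the stale value
cache.set("thread:1", "new", timeout=60)
third = cache.get("thread:1")
```

The check-and-extend step is not atomic here, so under real concurrency a few clients may still refill at once; the point is to reduce the herd from every client to a handful.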

Slide 38

Slide 38 text

Transactions • TransactionMiddleware got us started, but down the road became a burden • For postgresql_psycopg2, there’s a database option, OPTIONS['autocommit'] • Each query is in its own transaction. This means each request won’t start in a transaction. • But sometimes we want transactions (e.g., saving multiple objects and rolling back on error) 38

Slide 39

Slide 39 text

Transactions (cont’d) • Tips: • Use autocommit for read slave databases. • Isolate slow functions (e.g., external calls, template rendering) from transactions. • Selective autocommit • Most read-only views don’t need to be in transactions. • Start in autocommit and switch to a transaction on write. 39
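"Start in autocommit and switch to a transaction on write" can be illustrated with the standard-library `sqlite3` module standing in for PostgreSQL. The `transaction` context manager and the `post` table are hypothetical; the sketch only shows the shape of the pattern, not the DISQUS code. With `isolation_level=None`, the connection runs in autocommit mode until an explicit `BEGIN`.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def transaction(conn):
    """Group several writes; roll all of them back if any step fails."""
    conn.execute("BEGIN")
    try:
        yield conn
    except Exception:
        conn.execute("ROLLBACK")
        raise
    else:
        conn.execute("COMMIT")


# Autocommit mode: each statement is its own transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, message TEXT)")
conn.execute("INSERT INTO post (message) VALUES ('first')")

# A multi-object write that fails halfway leaves nothing behind.
try:
    with transaction(conn):
        conn.execute("INSERT INTO post (message) VALUES ('second')")
        raise ValueError("simulated failure")
except ValueError:
    pass
count_after_rollback = conn.execute("SELECT COUNT(*) FROM post").fetchone()[0]

# The same writes succeed together when no error occurs.
with transaction(conn):
    conn.execute("INSERT INTO post (message) VALUES ('second')")
    conn.execute("INSERT INTO post (message) VALUES ('third')")
count_after_commit = conn.execute("SELECT COUNT(*) FROM post").fetchone()[0]
```

Read-only code paths never enter the context manager, so they never hold a transaction open across slow work such as external calls or template rendering.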

Slide 40

Slide 40 text

Scaling the Team • Small team of engineers • Monthly users / developers = 40m • Which means writing tests.. • ..and having a dead simple workflow 40

Slide 41

Slide 41 text

Keeping it Simple • A developer can be up and running in a few minutes • assuming postgres and other server applications are already installed • pip, virtualenv • settings.py 41

Slide 42

Slide 42 text

Setting Up Local

    1. createdb -E UTF-8 disqus
    2. git clone git://repo
    3. mkvirtualenv disqus
    4. pip install -U -r requirements.txt
    5. ./manage.py syncdb && ./manage.py migrate

Slide 43

Slide 43 text

Sane Defaults

settings.py:

    from disqus.conf.settings.default import *

    try:
        from local_settings import *
    except ImportError:
        import sys, traceback
        sys.stderr.write("Can't find 'local_settings.py'\n")
        sys.stderr.write("\nThe exception was:\n\n")
        traceback.print_exc()

local_settings.py:

    from disqus.conf.settings.dev import *

Slide 44

Slide 44 text

Continuous Integration • Daily deploys with Fabric • several times an hour on some days • Hudson keeps our builds going • combined with Selenium • Post-commit hooks for quick testing • like Pyflakes • Reverting to a previous version is a matter of seconds 44

Slide 45

Slide 45 text

Continuous Integration (cont’d) Hudson makes integration easy 45

Slide 46

Slide 46 text

Testing • It’s not fun breaking things when you’re the new guy • Our testing process is fairly heavy • 70k (Python) LOC, 73% coverage, 20 min suite • Custom Test Runner (unittest) • We needed XML, Selenium, Query Counts • Database proxies (for read-slave testing) • Integration with our Queue 46

Slide 47

Slide 47 text

Testing (cont’d)

Query Counts:

    # failures yield a dump of queries
    def test_read_slave(self):
        Model.objects.using('read_slave').count()
        self.assertQueryCount(1, 'read_slave')

Selenium:

    def test_button(self):
        self.selenium.click('//a[@class="dsq-button"]')

Queue Integration:

    class WorkerTest(DisqusTest):
        workers = ['fire_signal']

        def test_delayed_signal(self):
            ...

Slide 48

Slide 48 text

Bug Tracking • Switched from Trac to Redmine • We wanted Subtasks • Emailing exceptions is a bad idea • Even if it’s localhost • Previously using django-db-log to aggregate errors to a single point • We’ve overhauled db log and are releasing Sentry 48

Slide 49

Slide 49 text

django-sentry Groups messages intelligently http://github.com/dcramer/django-sentry 49

Slide 50

Slide 50 text

django-sentry (cont’d) Similar feel to Django’s debugger http://github.com/dcramer/django-sentry 50

Slide 51

Slide 51 text

Feature Switches • We needed a safety in case a feature wasn’t performing well at peak • it had to respond without delay, globally, and without writing to disk • Allows us to work out of trunk (mostly) • Easy to release new features to a portion of your audience • Also nice for “Labs” type projects 51
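The slides do not show how the switches are implemented, so this is only a sketch of one way to meet the stated requirements: state is held in memory (standing in for a shared store that every web server can read without touching disk), and a percentage rollout hashes the switch name with the user id so each user gets a stable answer. `Switches`, `is_active`, and the `realtime` switch name are hypothetical.

```python
import hashlib


class Switches(object):
    def __init__(self):
        self._percent = {}   # switch name -> rollout percentage (0 to 100)

    def set(self, name, percent):
        self._percent[name] = max(0, min(100, percent))

    def is_active(self, name, user_id):
        percent = self._percent.get(name, 0)   # unknown switches are off
        digest = hashlib.md5(
            ("%s:%s" % (name, user_id)).encode("utf-8")
        ).hexdigest()
        # Stable bucket in [0, 100) per (switch, user) pair.
        bucket = int(digest, 16) % 100
        return bucket < percent


switches = Switches()
users = range(10000)

switches.set("realtime", 0)      # kill switch: off for everyone
none_active = not any(switches.is_active("realtime", u) for u in users)

switches.set("realtime", 20)     # release to a portion of the audience
partial = [u for u in users if switches.is_active("realtime", u)]
stable = partial == [u for u in users if switches.is_active("realtime", u)]
active_fraction = len(partial) / float(len(users))

switches.set("realtime", 100)    # full release
all_active = all(switches.is_active("realtime", u) for u in users)
```

Setting a switch to 0 acts as the safety valve at peak load, while intermediate values give the gradual rollout described above.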

Slide 52

Slide 52 text

Feature Switches (cont’d) 52

Slide 53

Slide 53 text

Final Thoughts • The language (usually) isn’t your problem • We like Django • But we maintain local patches • Some tickets don’t have enough of a following • Patches, like #17, completely change Django.. • ..arguably in a good way • Others don’t have champions Ticket #17 describes making the ORM an identity mapper 53

Slide 54

Slide 54 text

Housekeeping Want to learn from others about performance and scaling problems? Birds of a Feather We’re Hiring! DISQUS is looking for amazing engineers Or play some StarCraft 2? 54

Slide 55

Slide 55 text

Questions 55

Slide 56

Slide 56 text

References
django-sentry: http://github.com/dcramer/django-sentry
Our Feature Switches: http://cl.ly/2FYt
Andy McCurdy’s update(): http://github.com/andymccurdy/django-tips-and-tricks
Our PyFlakes Fork: http://github.com/dcramer/pyflakes
SkinnyQuerySet: http://gist.github.com/550438
django-newcache: http://github.com/ericflo/django-newcache
attach_foreignkey (Pythonic Joins): http://gist.github.com/567356