Building Lanyrd

The challenges involved in building large web applications using a variety of powerful open source components - presented at BrightonPy on 9th August 2011.

http://lanyrd.com/2011/brightonpy-building-lanyrd/sgptt/

Simon Willison

August 17, 2011

Transcript

  1. Lanyrd.com

    • Social event recommendation
    • Comprehensive speaker profiles
    • Archive of slides, notes and video
    • Definitive database of professional events and speakers
  2. • Aug 31st, 11:22: Launch! (1 linode)
     • Aug 31st, 12:41: Unlaunch
     • Aug 31st, 12:54: Read only mode
     • Aug 31st, 14:15: DB server (2 linodes)
     • Sep 1st: Limit 50 on dashboard
     • Sep 1st: disable-dashboard setting
  3. • Sep 3rd: dConstruct (and Twitter bot)
     • Sep 4th: TechCrunched (read only :( )
     • Sep 5th: 3 large EC2 + 1 RDS
     • Sep 6th: Downgrade to 3 small EC2
  4. • Dec 8: Calacanis + Scoble at the same time!
     • Upgrade to next size of RDS
     • (Sometimes scaling vertically does the job)
  5. • Load balancer (nginx)
     • HTTP cache (varnish)
     • lanyrd.com, badges.lanyrd.net
     • 3 × app server (django/mod_wsgi)
     • Search master (solr) + 2 × search slave (solr)
     • Database (MySQL RDS)
     • Redis (data structures + message queue)
     • 2 × worker (celery)
     • Logging (MongoDB)
  6. [Screenshot: the Apache Solr project home page (apache > lucene > solr). The news list runs from "Solr Joins Apache Incubator" (17 January 2006) to "Solr 3.2 Released" (May 2011).]
  7. [Screenshot: the Haystack home page]

    Find the needle you're looking for. Search doesn't have to be hard. Haystack lets you write your search code once and choose the search engine you want it to run on. With a familiar API that should make any Djangonaut feel right at home and an architecture that allows you to swap things in and out as you need to, it's how search ought to be. Haystack is BSD licensed, plays nicely with third-party apps without needing to modify the source and supports Solr, Whoosh and Xapian.

    Get started:
    1. Get the most recent source.
    2. Add haystack to your INSTALLED_APPS.
    3. Create search_indexes.py files for your models.
    4. Setup the main SearchIndex via autodiscover.
    5. Include haystack.urls to your URLconf.
    6. Search!

    News: "Sprinting to 1.1-final" (posted 2010/11/16 by Daniel). The 1.1 release will feature vastly improved faceting, Whoosh 1.X support and document & field boost support.

    Features: More Like This, Faceting, Stored (non-indexed) fields, Highlighting, Spelling Suggestions, Boost
  8. Model-oriented search

    • Define search_indexes.py (like admin.py) for your application
    • Hook up default haystack search views
    • Write a quick search.html template
    • Run ./manage.py rebuild_index
  9. [Screenshot: Lanyrd faceted search results for "django", filtered by type (Sessions), topic (NoSQL) and place (United States). Three sessions match: "NoSQL and Django Panel" and "Step Away From That Database" (both DjangoCon US 2010) and "Apache Cassandra in Action" (Strata 2011). Sidebars show "filter by" counts for type, topic and place.]
  10. class BookIndex(indexes.SearchIndex):
          text = indexes.CharField(document=True, use_template=True)
          speakers = indexes.MultiValueField()
          topics = indexes.MultiValueField()

          def prepare_speakers(self, obj):
              return [a.user.t_id for a in obj.authors.exclude(
                  user = None
              ).select_related('user')]

          def prepare_topics(self, obj):
              return list(obj.topics.values_list('pk', flat=True))
  11. search/indexes/books/book_text.txt

    {{ object.title }}
    {{ object.tagline }}
    {% for author in object.authors.all %}
      {{ author.display_name }}
      {{ author.user.t_screen_name }}
    {% endfor %}
    {% for topic in object.topics.all %}
      {{ topic.name_en }}
    {% endfor %}
  12. Staying fresh

    • Search engines usually don’t like accepting writes too frequently
    • RealTimeSearchIndex for low traffic sites
    • ./manage.py update_index --age=6 (hours)
    • Uses index.get_updated_field()
    • Roll your own (message queue or similar...)
  13. Smarter indexing

    class Article(models.Model):
        needs_indexing = models.BooleanField(
            default = True, db_index = True
        )
        ...

        def save(self, *args, **kwargs):
            self.needs_indexing = True
            super(Article, self).save(*args, **kwargs)
  14. index = site.get_index(model)
      updated_pks = []
      objects = index.load_all_queryset().filter(
          needs_indexing=True
      )[:100]
      if not objects:
          return
      for object in objects:
          updated_pks.append(object.pk)
          index.update_object(object)
      index.load_all_queryset().filter(
          pk__in = updated_pks
      ).update(needs_indexing = False)
  15. nginx + Solr replication trick

    upstream solrmaster {
        server 10.68.43.214:8080;
    }
    upstream solrslaves {
        server 10.68.43.214:8080;
        server 10.193.138.80:8080;
        server 10.204.143.106:8080;
    }
    server {
        listen 8983;
        location /solr/update {
            proxy_pass http://solrmaster;
        }
        location /solr/select {
            proxy_pass http://solrslaves;
        }
    }
  16. [Screenshot: the Lanyrd dashboard for a signed-in user: "We've found 182 conferences your Twitter contacts are interested in". A "Your contacts' calendar" lists upcoming events with the number of contacts speaking at or tracking each, next to recent posts from the Lanyrd blog.]
  17. # Original implementation
      twitter_ids = [11134, 223455, 33221, ...]  # fetch from Twitter
      attendees = Attendee.objects.filter(
          user__t_id__in = twitter_ids
      ).filter(
          conference__start_date__gte = datetime.date.today()
      )
  18. # Current implementation
      twitter_ids = [11134, 223455, 33221, ...]  # fetch from Twitter
      sqs = SearchQuerySet()
      sqs = sqs.models(Conference)
      # The IDs are integers, so convert them to strings before joining
      or_string = ' OR '.join(str(t_id) for t_id in twitter_ids)
      sqs = sqs.narrow('attendees:(%s)' % or_string)
  19. [Screenshot: the Redis home page]

    Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets. Redis 2.2.10 is the latest stable version.
  20. [Screenshot: the Lanyrd page for EuroPython 2011, "The European Python Conference" (Florence, Italy, 19–26 June 2011), showing 119 speakers, counts of people attending and tracking, the topics Django, Plone, Pyramid, Python and Twisted, links to the schedule and calendar export, and the short URL lanyrd.com/ccdpc.]
  21. [Screenshot: the Celery home page, "Distributed Task Queue"]

    Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well. The execution units, called tasks, are executed concurrently on one or more worker servers using multiprocessing, Eventlet, or gevent. Tasks can execute asynchronously (in the background) or synchronously (wait until ready).

    The recommended message broker is RabbitMQ, but limited support for Redis, Beanstalk, MongoDB, CouchDB, and databases (using SQLAlchemy or the Django ORM) is also available. Celery is easy to integrate with Django, Pylons and Flask, using the django-celery, celery-pylons and Flask-Celery add-on packages.

    Latest news on the page: Celery 2.2 released (2011-02-01).
  22. Tasks?

    • Anything that takes more than about 200ms
    • Updating a search index
    • Resizing images
    • Hitting external APIs
    • Generating reports
  23. Trivial example

    • Fetch the content of a web page

    import urllib
    from celery.task import task

    @task
    def fetch_url(url):
        return urllib.urlopen(url).read()

    >>> result = fetch_url.delay('http://cnn.com/')
    >>> html = result.wait()
  24. [Screenshot: the Lanyrd session page for "Python and MongoDB tutorial" by Andreas Jung (EuroPython 2011, 20th June 2011, 14:30–18:30 CET, short URL lanyrd.com/sftzh). An "Add coverage to this session" form accepts a URL to coverage such as videos, slides, podcasts, handouts, sketchnotes or photos; a SlideShare URL is being entered.]
  25. [Screenshot: the "Add coverage" form once the link has been processed. It shows the link title ("Python mongo db-training-europython-2011"), the type of coverage (Link, Write-up, Slides, Video, Audio, Sketch notes, Transcript, Handout, Liveblog, Photos, Notes) and a coverage preview from SlideShare.]
  26. The task itself...

    • Tries using http://embed.ly/ to find a preview
    • Fetches the HTTP headers and first 2048 bytes
    • If HTML, attempts to extract the <title>
    • If other, gets the file type and size from headers
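The title-extraction step could look roughly like the sketch below. The slides do not show how Lanyrd actually does it; `extract_title`, `TITLE_RE` and `PREVIEW_BYTES` are hypothetical names, and the regular expression is a simplification (a real HTML parser would be more robust).

```python
import re

# Hypothetical helper: the talk does not show Lanyrd's real implementation.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
PREVIEW_BYTES = 2048  # the "first 2048 bytes" mentioned on the slide


def extract_title(body, content_type="text/html"):
    """Return the <title> text found at the start of an HTML body, or None."""
    if "html" not in content_type.lower():
        return None  # non-HTML: the caller would fall back to type and size
    head = body[:PREVIEW_BYTES].decode("utf-8", errors="replace")
    match = TITLE_RE.search(head)
    if match is None:
        return None
    # Collapse internal whitespace so multi-line titles read cleanly.
    return " ".join(match.group(1).split()) or None
```

Only the first 2048 bytes are inspected, so a title that appears later in the document is deliberately missed in exchange for a cheap, bounded fetch.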
  27. Behind the scenes...

    ar = enhance_link.delay(url)
    poll_url = '/working/%s/' % signed.dumps({
        'task_id': ar.task_id,
        'on_done_url': on_done_url,
    })
    if 'ajax' in request.POST:
        return render_json(request, {
            'ok': True,
            'poll_url': poll_url,
        })
    else:
        return HttpResponseRedirect(poll_url)
  28. And when it’s done...

    from celery.backends import default_backend
    ...
    task_id = request.REQUEST.get('id', '')
    result = default_backend.get_result(task_id)
  29. Configuration

    # Carrot / Celery: queue uses Redis
    CARROT_BACKEND = "ghettoq.taproot.Redis"
    BROKER_HOST = "10.11.11.11"  # redis server
    BROKER_PORT = 6379
    BROKER_VHOST = "6"

    # Task results stored in memcached, so they can
    # expire automatically
    CELERY_RESULT_BACKEND = "cache"
    CELERY_CACHE_BACKEND = \
        "memcached://10.11.11.12:11211;..."
  30. Phantom load testing

    • Deploy a new architecture on a brand new EC2 cluster
    • Leave your existing site on the old cluster
    • Invisibly link to the new stack from an <img width=1 height=1> element on your live site (not for very long though)
    • (sensible alternative: find a way to replay log files)
  31. [Screenshot: the Lanyrd topic page for Django. It lists current and upcoming conferences (EuroPython 2011, DjangoCon US 2011, PyCON FR 2011, PyCon DE 2011) and Django coverage counts (52 videos, 52 slide decks, 3 audio clips, 27 write-ups, 11 handouts, 3 notes), each with a "most recent added" time.]
  32. from django.db.models import F

      class Conference(models.Model):
          ...
          cache_version = models.IntegerField(default = 0)

          def save(self, *args, **kwargs):
              self.cache_version += 1
              super(Conference, self).save(*args, **kwargs)

          def touch(self):
              # Bump the version in the database without a full save()
              Conference.objects.filter(pk = self.pk).update(
                  cache_version = F('cache_version') + 1
              )
  33. {% cache 36000 conf-topics conference.pk conference.cache_version %}
      <ul class="tags inline-tags meta">
        {% for topic in conference.topics.all %}
          <li><a href="{{ topic.get_absolute_url }}">{{ topic }}</a></li>
        {% endfor %}
      </ul>
      {% endcache %}
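The idea behind putting `cache_version` in the cache key can be shown without Django. The following is a minimal sketch, assuming a plain dict in place of memcached; `fragment_key`, `cached_topics` and `render_calls` are illustrative names, not part of the Lanyrd code.

```python
# A plain dict stands in for memcached; all names here are illustrative.
cache = {}
render_calls = []


def fragment_key(name, pk, cache_version):
    """Build a key that changes whenever the object's version changes."""
    return "%s:%s:%s" % (name, pk, cache_version)


def cached_topics(conference):
    key = fragment_key("conf-topics", conference["pk"], conference["cache_version"])
    if key not in cache:
        render_calls.append(key)  # record each real (expensive) render
        cache[key] = ", ".join(conference["topics"])
    return cache[key]


conference = {"pk": 1, "cache_version": 0, "topics": ["Django", "Python"]}
first = cached_topics(conference)
second = cached_topics(conference)  # same key, served from the cache

# Equivalent of save() / touch(): change the data and bump the version.
conference["topics"].append("Redis")
conference["cache_version"] += 1
third = cached_topics(conference)  # new key, so the fragment is rendered again
```

Nothing is ever explicitly invalidated: the old entry is simply never read again and is left to expire, which is presumably why the template can afford a long timeout.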
  34. Signing uses

    • "Unsubscribe" links in emails: lanyrd.com/un/ImN6VyI.ii0Hwm7p71DEcGfaVzziQaxeuu
    • ?redirect_to=URL protection
    • Signed cookies: "You are logged in as simonw" without hitting the database
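As a rough illustration of what signing means here, the sketch below appends an HMAC to a value so the server can later check that the value was not altered. It uses only the standard library and is not the Django or Lanyrd implementation; `SECRET_KEY`, `sign` and `unsign` are assumed names, and the `value.signature` layout only loosely mirrors the shape of the unsubscribe URL on the slide.

```python
import base64
import hashlib
import hmac

SECRET_KEY = b"not-a-real-secret"  # assumption: any secret known only to the server


def _b64(data):
    """URL-safe base64 without padding, so it can sit in a URL or cookie."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign(value, key=SECRET_KEY):
    """Return 'value.signature', where signature is an HMAC-SHA1 of value."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha1).digest()
    return "%s.%s" % (value, _b64(digest))


def unsign(signed_value, key=SECRET_KEY):
    """Return the original value, or raise ValueError if it was tampered with."""
    value, sep, signature = signed_value.rpartition(".")
    if not sep:
        raise ValueError("no signature found")
    expected = sign(value, key).rpartition(".")[2]
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        raise ValueError("signature does not match")
    return value
```

A cookie holding `sign("simonw")` can be trusted after `unsign` succeeds, which is how a "You are logged in as simonw" banner can be rendered without a database lookup.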
  35. Signing in Django 1.4

    from django.core import signing

    signed_string = signing.dumps({"foo": "bar"})
    signing.loads(signed_string)

    response.set_signed_cookie(key, value...)
    request.get_signed_cookie(key)
  36. Benefits

    • Far-future expiry headers
    • Cache-Control: max-age=315360000
    • Expires: Fri, 18 Jun 2021 06:45:00 -0000 GMT
    • Guaranteed updated CSS in IE
    • Deploy new assets in advance of application
    • Old versions stick around for rollbacks
  37. ./manage.py push_static

    • Minifies JavaScript and CSS
    • Renames files to include sha1(contents)[:6]
    • Pushes all assets to S3
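The renaming step can be sketched as below. `push_static` itself is not shown in the talk, so `hashed_name` is a hypothetical helper, and placing the hash just before the file extension is an assumption about the naming scheme.

```python
import hashlib
import os


def hashed_name(filename, contents):
    """Insert the first 6 hex characters of sha1(contents) before the extension."""
    digest = hashlib.sha1(contents).hexdigest()[:6]
    root, ext = os.path.splitext(filename)
    return "%s.%s%s" % (root, digest, ext)


old = hashed_name("css/site.css", b"body { color: black }")
new = hashed_name("css/site.css", b"body { color: navy }")
```

Because the name changes whenever the contents change, each version can be served with far-future expiry headers, and old and new versions can sit side by side on the asset host.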
  38. UserBasedExceptionMiddleware

    from django.views.debug import technical_500_response
    import sys

    class UserBasedExceptionMiddleware(object):
        def process_exception(self, request, exception):
            if request.user.is_superuser:
                return technical_500_response(request, *sys.exc_info())
  39. mysql-proxy

    • Very handy lua-customisable proxy for all of your MySQL traffic
    • Worst documented software ever
    • log.lua - logs out ALL queries
    • https://gist.github.com/1039751
  40. django_instrumented

    • (Unreleased) code I wrote for Lanyrd
    • Collects various runtime stats about the current request, stashes a profile JSON in memcached
    • Writes out the profile UUID as part of the HTML
    • A bookmarklet to view the profile
  41. mongodb logging

    • Super-fast inserts, log everything!
    • Capped collections
    • Structured queries
    • Ask me about it in a few months
  42. For the future...

    • Much better profiling, monitoring and alerts
    • Varnish in front of everything
    • Replicated MySQL for analytics + upgrades