Design patterns for mega traffic on a budget
Scaling web apps
Slide 2
Slide 2 text
Loading, please wait
Estimated time remaining: ages.
Slide 3
Slide 3 text
Crikey.
We hadn't quite counted on welcoming quite so many of you to OnOneMap in one go, and to be quite honest we've completely run out of oomph.
Please do come back tomorrow, and we promise to make you a lovely cup of tea to make up for not being quite on top form today.
Slide 4
Slide 4 text
Two problems
• Performance
You need to make your app more efficient
• Scaling
You need to increase capacity
Slide 5
Slide 5 text
performance != scaling
Slide 6
Slide 6 text
but you need both
Slide 7
Slide 7 text
Vertical scaling
Weedy server -> Powerful server -> Deep Thought
Easy, expensive, limited
Slide 8
Slide 8 text
Horizontal scaling
Cheap, limitless. HARD.
Slide 9
Slide 9 text
It's obvious, really…
Slide 10
Slide 10 text
1 Sessions
Recognise your reader
2 Caching
Love being lazy
3 Writing
Learn to feel the pain
Slide 11
Slide 11 text
Fear sessions, and you will scale well.
The master Jedi plans ahead.
Slide 13
Slide 13 text
Scenario
• Lots of non-personalised content (newspaper,
blog, web store)
• Some minor session-based data (e.g. 'Welcome Andrew')
Slide 14
Slide 14 text
Not cool.
Slide 15
Slide 15 text
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: post-check=0, must-revalidate, no-store, no-cache, pre-check=0
Last-Modified: not present
ETag: not present
Set-Cookie: path=/; phpsessid=b7977f7c69eb898bf42526652dda4c6c
BAD. BAD. BAD. BAD.
(19 Nov 1981: Sascha Schumann's birthday)
Slide 16
Slide 16 text
Client -> Server 1: b7977f7c69eb898bf42526652dda4c6c
Server 1:
name: Andrew
email: [email protected]
logindate: 2009-02-28 20:42:12
userid: 453245
Slide 17
Slide 17 text
Client -> Server 2: b7977f7c69eb898bf42526652dda4c6c
Server 2: unknown session
Slide 18
Slide 18 text
Defeat, summarised.
• Can't cache it
• Need sessions everywhere
• Sessions are lost if you switch server
• Session-enabled requests are processed
sequentially, due to file locking
• Nightmare.
Slide 19
Slide 19 text
1
Use JavaScript to inject
session state
Slide 20
Slide 20 text
Solution
• Generate only generic content
• Leave gaps (login status, shopping basket etc)
• Load session data from somewhere else
• Merge in browser using magic (or JavaScript)
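A rough server-side sketch of the idea above, in Python. The function names, the `/session.js` URL and the `login-status` element are made up for illustration: the page itself is generic and cacheable, and the personal bits come from a separate, uncached endpoint that the browser merges in.

```python
import json

def render_generic_page():
    """Return (headers, body) for a page that is identical for every visitor."""
    headers = {
        "Content-Type": "text/html; charset=utf-8",
        # No Set-Cookie and a long max-age, so browsers, proxies and CDNs may cache it.
        "Cache-Control": "public, max-age=86400",
    }
    body = (
        "<html><body>"
        '<div id="login-status"></div>'          # the gap, filled in by the browser
        "<h1>Today's headlines</h1>"
        '<script src="/session.js"></script>'    # hypothetical script that fetches session data
        "</body></html>"
    )
    return headers, body

def render_session_fragment(session):
    """Return (headers, body) for the small per-user payload. Never cached."""
    headers = {
        "Content-Type": "application/json",
        "Cache-Control": "private, no-store",
    }
    body = json.dumps({
        "loggedIn": session is not None,
        "name": session["name"] if session else None,
    })
    return headers, body
```

Only `render_session_fragment` needs to know about sessions, so only that endpoint has to live on the session-aware machines.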
Result.
• Most scripts don't need to track sessions
• You can cache stuff (even use a CDN)
• Cache it for ages. Reduce load on your kit.
• Sessions become a separate issue - build a
scalable session store on a separate vhost /
machine / cluster
Slide 23
Slide 23 text
Hmm.
Slide 24
Slide 24 text
OK, so….
• Your pages are mostly dynamic content
(webmail, identity manager etc)
• Almost entire page content is session-specific
Slide 25
Slide 25 text
2
Use cookies for client-side
session storage
Slide 26
Slide 26 text
Solution
• Don't use server sessions at all
• Store all session state data in a cookie
• Sign it with a keyed hash (e.g. HMAC with sha1) so it can't be tampered with
• Timestamp allows you to expire it
• You can get a lot in there
• Remember it's not encrypted on the wire, and
adds to your bandwidth
27478932510|triblondon|1231936510|2,4,6,52,183|a152c24d9874ba15235f
userid | username | sessionstart | groupmemberships | signature
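A minimal sketch of that cookie format in Python, assuming an HMAC-SHA1 signature with a server-side secret. `SECRET_KEY`, `MAX_AGE_SECONDS` and both function names are placeholders, not part of the original slides.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"change-me"   # hypothetical secret, known only to the servers
MAX_AGE_SECONDS = 3600      # assumed session lifetime

def _sign(payload):
    return hmac.new(SECRET_KEY, payload.encode("utf-8"), hashlib.sha1).hexdigest()

def make_cookie(userid, username, groups, now=None):
    """Build userid|username|sessionstart|groupmemberships|signature."""
    # Note: a username containing "|" would need escaping; not handled here.
    started = int(now if now is not None else time.time())
    payload = "|".join(
        [str(userid), username, str(started), ",".join(str(g) for g in groups)]
    )
    return payload + "|" + _sign(payload)

def read_cookie(cookie, now=None):
    """Return the session as a dict, or None if tampered with or expired."""
    payload, sep, signature = cookie.rpartition("|")
    if not sep:
        return None
    # Constant-time comparison of the supplied and expected signatures.
    if not hmac.compare_digest(signature.encode("utf-8"), _sign(payload).encode("utf-8")):
        return None
    parts = payload.split("|")
    if len(parts) != 4:
        return None
    userid, username, started, groups = parts
    current = int(now if now is not None else time.time())
    if current - int(started) > MAX_AGE_SECONDS:
        return None
    return {
        "userid": int(userid),
        "username": username,
        "sessionstart": int(started),
        "groups": [int(g) for g in groups.split(",") if g],
    }
```

Any server holding the secret can validate the cookie, so no shared session store is needed. The contents are readable by the user, so nothing secret goes in there.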
Slide 27
Slide 27 text
Other scalable session solutions
• memcached (php.net/memcache)
– Performs well, scales nicely.
– All the cool kids are doing it
• Sticky sessions (Varnish / Squid)
– Or redirect-and-stick, i.e.:
www.example.com -> (302) -> www4.example.com
– But doesn't work for some apps (WordPress)
• Database sessions
– Bit pointless. Definitely not cool.
Slide 28
Slide 28 text
Caching. Not using intelligence when
stupidity will do just fine.
Slide 29
Slide 29 text
Scenario
• Your CSS/JS/images don't change often, so
users should cache them
• But when they do change, you want everyone to
flush their cache, else the site will stop working.
Slide 30
Slide 30 text
Not cool.
Slide 31
Slide 31 text
3
Add query strings to enable
far-future caching
Slide 32
Slide 32 text
Solution
• /lib/img/my_website_header.png
Expires: Sun, 17 Jan 2038 19:26:00 GMT
But these are not the same object:
• /lib/img/my_website_header.png?v=2
• /lib/img/my_website_header.png?v=3
• /lib/img/my_website_header.png?v=4
Slide 33
Slide 33 text
Result.
• Changing the filename or adding a query string
will cause all browsers to re-request the file.
• All the benefits of long term caching
• No update latency
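One way to generate the version number automatically, sketched in Python. Deriving `v` from a hash of the file content is an assumption here (the slides use a plain counter), and `versioned_url` is a made-up helper name.

```python
import hashlib

def versioned_url(path, content):
    """Append ?v=<short hash of the file content> to an asset URL."""
    version = hashlib.sha1(content).hexdigest()[:8]
    separator = "&" if "?" in path else "?"
    return f"{path}{separator}v={version}"
```

The URL only changes when the file does, so it can be served with a far-future Expires header and nobody has to remember to bump a number.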
Slide 35
Slide 35 text
4
Use a CDN for huge capacity,
low cost, and no hassle
Slide 36
Slide 36 text
Solution
• Choose a reverse proxy CDN
• Put some thought into these headers
– Expires:
– Cache-control:
– Last-Modified:
– Content-Length:
– Etag:
• then offload your traffic …
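A small Python sketch of setting those five headers together. `cache_headers` and its parameters are illustrative names, and using a SHA-1 of the body as the ETag is just one reasonable choice of validator.

```python
import hashlib
from email.utils import formatdate

def cache_headers(body, last_modified_ts, max_age, now_ts):
    """Build cache-friendly response headers for a byte-string body.

    last_modified_ts and now_ts are Unix timestamps; max_age is in seconds.
    """
    return {
        "Expires": formatdate(now_ts + max_age, usegmt=True),
        "Cache-Control": f"public, max-age={max_age}",
        "Last-Modified": formatdate(last_modified_ts, usegmt=True),
        "Content-Length": str(len(body)),
        # Quoted hash of the content, so caches can revalidate cheaply.
        "ETag": '"' + hashlib.sha1(body).hexdigest() + '"',
    }
```

With a validator (Last-Modified or ETag) and an explicit lifetime in place, a reverse proxy CDN can serve the object without bothering the origin until it expires.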
Slide 37
Slide 37 text
CDN providers
• Velocix - 500GB/mo, free
http://www.velocix.com
• Edgecast - 1TB/mo, $500
http://www.edgecast.com
• Limelight - 1TB/mo, $1000
http://www.limelight.com
• Amazon Cloudfront - 1TB/mo, $400 (ish!)
http://aws.amazon.com (cheap for big files)
Note: I have no affiliation with any of these providers
Slide 38
Slide 38 text
Are you cacheable?
• http://www.ircache.net/cgi-bin/cacheability.py
Slide 39
Slide 39 text
It's not hard. Or is it?
• yahoo.co.uk - server clock is wrong
• microsoft.com - sends malformed headers
• timesonline.com - no cache control
• digg.com - Has PHP's 19 Nov 1981 expiry date
• msn.co.uk - Two redirects, no caching
• gumtree.com - tries to cache for 10 mins, but
has no validator or content length
Slide 40
Slide 40 text
Overburden your site with writes,
and you're going nowhere fast.
Slide 41
Slide 41 text
Scenario
• Your app runs on a single server / shared host
• You connect to a database using some kind of
DB abstraction class / framework
Slide 42
Slide 42 text
Doesn't scale.
Slide 43
Slide 43 text
Scales. (a bit)
Slide 44
Slide 44 text
5
Splitting database connections
for easier scaling later
Slide 45
Slide 45 text
Solution
• Always plan for your write queries to go
somewhere different to your reads.
– Even if they won't in the immediate future
• And assume that writes take a non-negligible
amount of time to become readable.
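A minimal Python sketch of that split, using sqlite3 only so it runs anywhere. The `Database` class and its method names are hypothetical. Both handles point at the same connection today, but calling code already has to say whether it is reading or writing.

```python
import sqlite3

class Database:
    """Thin wrapper that keeps read and write connections separate."""

    def __init__(self, read_conn, write_conn):
        self._read = read_conn
        self._write = write_conn

    def query(self, sql, params=()):
        """Reads go to the read connection (later: a replica)."""
        return self._read.execute(sql, params).fetchall()

    def execute(self, sql, params=()):
        """Writes go to the write connection (later: the master)."""
        with self._write:  # commits on success, rolls back on error
            self._write.execute(sql, params)

# Today: one server, so both roles share a connection.
conn = sqlite3.connect(":memory:")
db = Database(read_conn=conn, write_conn=conn)
db.execute("CREATE TABLE content (contentid INTEGER PRIMARY KEY, title TEXT)")
db.execute("INSERT INTO content VALUES (?, ?)", (1, "Hello"))
```

When a replica arrives, only the constructor arguments change. Code that reads straight after writing should not assume the new row is visible yet.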
Slide 46
Slide 46 text
Scenario
• 'Most viewed/emailed' widgets
• Thinking about doing this?
UPDATE content SET viewcount=viewcount+1 WHERE contentid=5309342;
NOT COOL
[Screenshot: obligatory BBC News Online screenshot]
Slide 47
Slide 47 text
• You're writing on every page load!
• Low read:write ratio
• High page generation overhead
• Can't cache.
• Disaster.
Slide 48
Slide 48 text
8
Using hosted analytics to
avoid logging
Slide 49
Slide 49 text
Solution
• You want to optimise for reads.
• You don't really need all this data. Just the
aggregated results.
• So let someone else do it!
Slide 50
Slide 50 text
Solution
• Hosted analytics:
– Google Analytics (free), SiteIntelligence, Webtrends
• But what about AJAX / downloads / outbound
links / JavaScript actions?
Slide 51
Slide 51 text
Scenario
• Script reads from cache, or regenerates and
stores in cache if cache is stale
• At the moment the cache expires, lots of threads
try to write to it at the same time.
• Evil writes kill your web server.
Slide 52
Slide 52 text
6
Prep content in advance to
avoid cache slamming
Slide 53
Slide 53 text
Solution
• Use a separate process to write to the cache,
periodically or event driven, but not triggered
by web requests.
• Scripts handling HTTP requests never write
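Sketched in Python with an in-process dict standing in for the real cache (memcached, files, whatever you use). All names here are illustrative. The point is that `refresh_cache` is run by cron or an event, and `handle_request` never regenerates anything.

```python
cache = {}  # stand-in for a shared cache

def build_front_page():
    """Placeholder for the expensive page generation."""
    return "<h1>Front page</h1>"

def refresh_cache():
    """Run periodically or on an event, never from a web request."""
    cache["front_page"] = build_front_page()

def handle_request():
    """Read-only: serve what is in the cache, never write to it."""
    page = cache.get("front_page")
    if page is None:
        # Cache not primed yet; fail fast rather than regenerate under load.
        return 503, "Warming up, please retry shortly"
    return 200, page
```

Because only one process ever writes, there is no stampede when an entry goes stale: requests keep getting the old copy until the refresher replaces it.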
Slide 54
Slide 54 text
Quick recap
• Sessions: Try JavaScript injection, cookie-stored session data, sticky sessions, memcached.
• Caching: Use a CDN and far-future caching
• Writes: Split reads and writes, reduce writes,
use hosted analytics, prep content on a schedule