“A web cache is a mechanism for the temporary storage (caching) of web documents, such as HTML pages and images, to reduce bandwidth usage, server load, and perceived lag. A web cache stores copies of documents passing through it; subsequent requests may be satisfied from the cache if certain conditions are met.”
Source: Wikipedia
What changed?
$ diff local hotel
1,2c1,2
< HTTP/1.1 200 OK
< Date: Wed, 29 Oct 2014 15:04:20 GMT
---
> HTTP/1.0 200 OK
> Date: Wed, 29 Oct 2014 15:05:32 GMT
8c8
< Expires: Thu, 29 Oct 2015 15:04:20 GMT
---
> Expires: Thu, 29 Oct 2015 15:05:32 GMT
10d9
< Connection: close
11a11,14
> X-Cache: MISS from localhost
> X-Cache-Lookup: MISS from localhost:3128
> Via: 1.1 localhost:3128 (squid/2.7.STABLE3)
> Connection: close
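The proxy-added headers in the diff can also be checked in code. The following is a minimal sketch, assuming the Squid-style `X-Cache: HIT|MISS from <host>` format shown above; `cache_status` is a hypothetical helper name, and other proxies may format this header differently.

```python
def cache_status(headers):
    """Return (status, host) parsed from an X-Cache header, or None if absent.

    `headers` is a plain dict mapping response header names to values.
    """
    # Header names are case-insensitive, so normalise them before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    value = lowered.get("x-cache")
    if value is None:
        return None  # no X-Cache header: no such proxy on the path
    # Assumed shape: "MISS from localhost"
    status, _, host = value.partition(" from ")
    return status.strip().upper(), host.strip()


# Headers copied from the "hotel" side of the diff.
hotel = {
    "X-Cache": "MISS from localhost",
    "X-Cache-Lookup": "MISS from localhost:3128",
    "Via": "1.1 localhost:3128 (squid/2.7.STABLE3)",
}
print(cache_status(hotel))
```

A response fetched directly (the "local" side of the diff) has no X-Cache header, so the helper returns None for it.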
ETag:
• Don’t use them.
• Generation is not specified by the HTTP standard and is often inconsistent across a cluster.
• Error-prone, and can be used to track users who refuse cookies.
• Turn them off; don’t use them.
Expires:
• Indicates when the resource becomes stale.
• Specifies an absolute date/time rather than delta seconds (Cache-Control: max-age=S).
• Mostly used for compatibility with HTTP/1.0; Cache-Control: is semantically richer.
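To make the difference between the two headers concrete, here is a rough sketch of how a cache might derive a freshness lifetime from either one. `freshness_lifetime` is a hypothetical helper, not part of any library; it assumes headers arrive as a plain dict and it ignores other directives (such as s-maxage or no-store) that a real cache must honor.

```python
from email.utils import parsedate_to_datetime


def freshness_lifetime(headers):
    """Return the freshness lifetime in seconds, or None if unspecified.

    Cache-Control: max-age (delta seconds) wins over Expires (absolute date).
    """
    lowered = {name.lower(): value for name, value in headers.items()}

    # Prefer max-age: it is already a number of seconds.
    for directive in lowered.get("cache-control", "").split(","):
        name, _, value = directive.strip().partition("=")
        value = value.strip('"')
        if name.lower() == "max-age" and value.isdigit():
            return int(value)

    # Fall back to Expires, measured relative to the response's Date header.
    if "expires" in lowered and "date" in lowered:
        try:
            expires = parsedate_to_datetime(lowered["expires"])
            date = parsedate_to_datetime(lowered["date"])
        except (TypeError, ValueError):
            return 0  # assumption: treat an unparseable date as already stale
        return max(0, int((expires - date).total_seconds()))

    return None


one_year = freshness_lifetime({
    "Date": "Wed, 29 Oct 2014 15:04:20 GMT",
    "Expires": "Thu, 29 Oct 2015 15:04:20 GMT",
})
print(one_year)  # 365 days expressed in seconds
```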
Is data cacheable?
• Highly cacheable: news stories, blog posts, aggregated data such as ratings or reviews (“likes”).
• Uncacheable: secure, private, or personal data such as user login information and credit card details.
• Also uncacheable: data that changes rapidly, such as stock quotes or health monitoring readings.
Example 4. Local+Remote Cache
[Diagram: four web servers, each with a local cache; two cache proxies linked by ICP; one backend service]
Squid
• Old, venerable; a long-standing, widely deployed caching proxy
• Single-threaded
• Can be tricky to configure (a multitude of options) but very high-performance
• Implements ICP (Internet Cache Protocol) for distributed and hierarchical caches
Varnish
• More modern implementation than Squid; relies on virtual memory and multi-threaded access
• Easier to set up and configure than Squid
• Does not support ICP or cache hierarchies
nginx
• Reverse proxy and web server in one; does not need a separate web server process
• Great for static content, according to users
• Uses asynchronous sockets; one-process-per-core architecture
DIY caching
• Tools let you build your own cache system.
• Not transparent to the application by default, but you can build transparency by wrapping the cache behind an interface.
• Most are simple key/value stores.
• Requires writing code.
DIY cache example
• An object retrieval interface fetches data from the service.
• Internal methods query the data store (memcached, Redis) first and use stored data if possible.
• If the data is not in the cache, fetch it from the backend service and store it in the cache.
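The three steps above can be sketched as follows. This is an illustration under stated assumptions, not a production design: a plain in-process dict stands in for memcached or Redis, and `CachedFetcher`, `slow_service`, and the 60-second TTL are hypothetical names and values chosen for the example.

```python
import time


class CachedFetcher:
    """Object retrieval interface that checks a cache before the backend."""

    def __init__(self, fetch_from_service, ttl_seconds=60, clock=time.monotonic):
        self._fetch = fetch_from_service  # callable that hits the backend service
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (expiry time, value); a dict standing in for memcached/Redis
        self._store = {}

    def get(self, key):
        # Step 1: query the data store first and use stored data if still fresh.
        entry = self._store.get(key)
        if entry is not None:
            expires_at, value = entry
            if self._clock() < expires_at:
                return value
        # Step 2: on a miss (or stale entry), fetch from the backend service...
        value = self._fetch(key)
        # Step 3: ...and store the result in the cache for later requests.
        self._store[key] = (self._clock() + self._ttl, value)
        return value


calls = []


def slow_service(key):
    """Hypothetical backend; records each call so misses can be counted."""
    calls.append(key)
    return f"value-for-{key}"


fetcher = CachedFetcher(slow_service, ttl_seconds=60)
fetcher.get("story:42")  # miss: goes to the backend
fetcher.get("story:42")  # hit: served from the cache
print(len(calls))  # number of backend calls made
```

Callers only see `get()`, so the caching stays invisible at the application level; swapping the dict for a real memcached or Redis client would change only the two lines that read and write `_store`.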
Upsides for DIY caching
• Provides a very clean programmatic interface (transparent at the application level).
• Can be tailored to specific solutions where you understand the data.
• Often very high performance.
Downsides to DIY caching
• Requires code to be written, tested, etc.
• Requires code maintenance if the underlying data model changes.
• No standard way, as HTTP has, to specify the age and freshness of data (i.e., a custom solution, not a generic one).
What is an “edge cache”?
• A content delivery network (CDN) that holds static content on the “edges” of the Internet
• Akamai is the biggest, but there are others: Limelight, Microsoft Azure, Amazon CloudFront
• Stores static content in multiple data centers
• Content like JavaScript, CSS, images, and other media
How does a CDN work?
• The primary site (www.example.com) serves the HTML page.
• <style>, <img>, etc. tags reference static content on the CDN.
• The user’s browser loads (and often stores) the static content locally, because it’s served with a Cache-Control: max-age=32767 header.
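A small sketch of the page side of this arrangement: the primary site renders HTML whose static references point at the CDN. The hostname `cdn.example.com`, the asset paths, and the helper names `cdn_url` and `render_page` are all assumptions made for illustration; only the max-age value comes from the slide.

```python
import html
from urllib.parse import urljoin

CDN_BASE = "https://cdn.example.com/"  # hypothetical CDN hostname
# Header the CDN would send with static files (value taken from the slide).
STATIC_HEADERS = {"Cache-Control": "max-age=32767"}


def cdn_url(path):
    """Map a site-relative static path to its URL on the CDN."""
    return urljoin(CDN_BASE, path.lstrip("/"))


def render_page(title):
    """Return HTML served by the primary site; static assets come from the CDN."""
    return (
        "<html><head>"
        f"<title>{html.escape(title)}</title>"
        f'<link rel="stylesheet" href="{cdn_url("/css/site.css")}">'
        "</head><body>"
        f'<img src="{cdn_url("/img/logo.png")}" alt="logo">'
        "</body></html>"
    )


print(cdn_url("/css/site.css"))
print(render_page("Home"))
```

Because the stylesheet and image are served with the long max-age header, the browser can reuse its local copies on later page views and request only the HTML from the primary site.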