Intro to Caching: For Fun or For Profit

Talk given at the SF PyLadies meetup in June 2013 on caching in Python and Django

Julia Grace

June 28, 2013

Transcript

  1. Disclaimer
    •  Many of the techniques and topics discussed in this talk were used by Tindie at one point or another over the past year.
    •  Tindie has grown substantially and our caching needs have changed.
    •  Thus, most of what is discussed will work for you most of the time, but may not work optimally at very large “web scale”.
  2. Definition: Caching
    •  Keeping duplicate data in multiple locations.
    •  Why on earth would you want to do this?
    •  The beauty of caching is that one of the locations is often faster and easier to access than the others.
  3. Analogy: Kitchen Cache
    •  Look at your kitchen. It’s a food cache (sort of)!
    •  You keep groceries there. But you could just go to the store every time you needed something.
    •  After all, the store has everything. So many delicious foods you can’t fit in your kitchen!
  4. Analogy: Kitchen Cache
    •  But going to the grocery store every time you need milk for your coffee is inconvenient and inefficient.
    •  So how often do you go?
    •  How do you balance wanting to buy everything with not having space for all of it?
    •  These are real problems of caching in life and in business.
  5. Analogy: Kitchen Cache
    •  Kitchen == fast but space-limited storage (memory)
    •  Grocery Store == slow storage but more space (database, file system, remote API)
    •  It’s easier to go to your kitchen than the store.
    •  Just like it’s faster to read from memory than to read from disk.
  6. Cache All Things! (not really)
    •  Our gut reaction to caching is usually “So let’s stick everything in memory! It’s so fast!”
    •  Memory is expensive (and has other limitations).
    •  You could fit the contents of the grocery store in your kitchen, but that would be expensive overkill. The same concept applies in programming.
  7. Surprise! It’s cached!
    •  You’re probably caching already and you don’t know it.
    •  Cache the results of a database query in a variable (using the Django ORM):
        a_user = User.objects.get(id=1)
        print(a_user.username)
        print(a_user.first_name)
    •  We could just query the DB every time we need a value (such as username or first_name), but that would probably be slower than accessing a variable.
  8. Caching to Avoid Computationally Intensive Tasks
    •  Store the result of the function in a variable to avoid repeatedly running a complex computation.
        # Don’t ever do this
        def really_hairy_function(a, b, c, d):
            for i in range(a):
                for j in range(b):
                    for k in range(c):
                        for m in range(d):
                            pass  # block which runs in O(n^2)

        # Call it once and keep the result in a variable
        a = really_hairy_function(106, 269, 844, 789)
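The same idea can be pushed one step further with memoization, where the function itself remembers results per argument set. This is only a minimal sketch, assuming the function is pure (the same arguments always produce the same result). `slow_sum` and `call_count` are hypothetical names used for illustration; `functools.lru_cache` is part of the Python 3 standard library.

```python
import functools

call_count = 0  # tracks how often the expensive body actually runs


@functools.lru_cache(maxsize=None)
def slow_sum(a, b):
    """Hypothetical stand-in for an expensive, pure computation."""
    global call_count
    call_count += 1
    total = 0
    for i in range(a):
        for j in range(b):
            total += i * j
    return total


first = slow_sum(200, 300)   # computed
second = slow_sum(200, 300)  # served from the cache, the body does not run again
```

The cache here is unbounded (`maxsize=None`), which is fine for a handful of argument combinations but would need a limit if the inputs vary widely.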
  9. Caching HTTP Requests
    •  Tindie used to use the IPInfoDB API to map IP addresses to locations. Example: 98.207.195.205 == California, USA
    •  If we see future requests from that IP, we could call the IPInfoDB API again or store the mapping in a cache.
    •  Here the cache could be a database, Memcache, or Redis (any of these is likely faster than calling an external API).
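A minimal sketch of that pattern, assuming a plain dict as the cache. `slow_country_lookup` is a hypothetical stand-in for the remote API call (it is not the IPInfoDB client), and in practice the dict would be replaced by Memcache, Redis, or a database table.

```python
lookup_calls = []  # records every call that reaches the "remote API"


def slow_country_lookup(ip):
    """Hypothetical stand-in for a remote geolocation API call."""
    lookup_calls.append(ip)
    return {"98.207.195.205": "California, USA"}.get(ip, "unknown")


ip_cache = {}  # stands in for Memcache, Redis, or a database table


def country_for_ip(ip):
    """Return the location for an IP, calling the API only on a cache miss."""
    if ip not in ip_cache:
        ip_cache[ip] = slow_country_lookup(ip)
    return ip_cache[ip]


country_for_ip("98.207.195.205")
country_for_ip("98.207.195.205")  # second request is answered from the cache
```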
  10. Caching Libraries
    •  There are Python libraries for storing the results of external HTTP requests:
    •  CacheControl (https://github.com/ionrock/cachecontrol)
  11. Caching data that changes
    •  In previous examples we were caching data that probably didn’t change often.
    •  But what about data that does change?
    •  Example: Tindie uses the GitHub API to get data about Open Hardware repositories.
    •  These repos could change at any time (add followers, accept pull requests, etc.)
  12. Accuracy vs Speed
    •  We query GitHub for a lot of data, and we don’t want to update our cached copy if nothing has changed.
    •  But we want our version to reflect the most recent version on GitHub. You can’t have your cake and eat it too!
    •  Tradeoff: either the data is behind/stale, or you spend computational resources ensuring it’s fresh.
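One common way to pick a point on that tradeoff is a time-to-live (TTL): accept data that is at most N seconds old, and refetch only after that. The sketch below is an illustration under assumptions, not how Tindie did it; `TTLCache` and the fake clock are hypothetical names, and the clock is injectable only so the example can simulate time passing.

```python
import time


class TTLCache:
    """Tiny cache whose entries expire after ttl seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock      # injectable so the example can fake time
        self._store = {}        # key -> (expiry_time, value)

    def get(self, key, fetch):
        """Return the cached value, calling fetch() only if missing or expired."""
        entry = self._store.get(key)
        now = self.clock()
        if entry is not None and entry[0] > now:
            return entry[1]     # possibly stale, but cheap
        value = fetch()         # fresh, but costs a call
        self._store[key] = (now + self.ttl, value)
        return value


fake_now = [0.0]
followers = [10]                # the "real" value on the remote service
cache = TTLCache(ttl=60, clock=lambda: fake_now[0])

a = cache.get("repo", lambda: followers[0])   # miss: fetches 10
followers[0] = 11                             # the remote data changes
b = cache.get("repo", lambda: followers[0])   # still 10: stale but fast
fake_now[0] = 61.0                            # the entry expires
c = cache.get("repo", lambda: followers[0])   # refetched: 11
```

A shorter TTL means fresher data and more fetches; a longer one means the opposite.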
  13. Push vs Poll
    •  This is a more difficult problem that is often solved by having the service push you notifications instead of polling (polling == querying the service at specific time intervals).
    •  For example purposes we’ll poll GitHub.
    •  Many APIs don’t support push notifications so you have to poll.
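  14. GitHub API Example
        import datetime
        import requests

        # Only update repos that have been modified in the past 3 weeks
        cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=21)
        headers = {'Authorization': 'token %s' % settings.GITHUB_TOKEN}
        # If-Modified-Since takes an HTTP date string, not a timedelta
        headers['If-Modified-Since'] = cutoff.strftime('%a, %d %b %Y %H:%M:%S GMT')
        r = requests.get("https://api.github.com/repos/%s/%s" % (repo_owner, repo_name), headers=headers)
        if r.status_code == 304:  # Not Modified: our cached copy is still current
            continue  # skip to the next repo (assumes this runs inside a loop over repos)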
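The `If-Modified-Since` header expects an HTTP date string such as `Fri, 07 Jun 2013 12:00:00 GMT`. Here is a minimal sketch of building one with the Python 3 standard library in a locale-independent way. The helper name `if_modified_since` and the fixed `now` value are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def if_modified_since(now, days):
    """Build an If-Modified-Since value for a cutoff `days` before `now`.

    `now` must be a timezone-aware UTC datetime.
    """
    cutoff = now - timedelta(days=days)
    return format_datetime(cutoff, usegmt=True)


# Fixed "now" so the result is reproducible
now = datetime(2013, 6, 28, 12, 0, 0, tzinfo=timezone.utc)
header_value = if_modified_since(now, days=21)
```

In real polling code `now` would come from `datetime.now(timezone.utc)`.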
  15. Query Caching
    •  Tindie’s Python layer queries our database very often.
    •  Sometimes we are simultaneously updating values in the database.
    •  It doesn’t always make sense to cache query results in a variable because they might quickly become incorrect or out of date.
  16. Query Cache
    •  Insert a cache layer (“query cache”) between your application and your database.
    •  Typically Memcache and/or Redis are used for this.
    •  Memcache, Redis == in-memory (so they are fast) key-value data stores.
    •  Looking up a key in a key-value data store is O(1) on average, so it is very fast.
  17. Query Cache Example
    •  The result of every select is cached:
        Key:
            select * from auth_user where id=1
        Value:
             id |     username      | first_name | last_name |      email
            ----+-------------------+------------+-----------+------------------
              1 | julialovescaching | Julia      | Grace     | [email protected]
    •  Updates and deletes invalidate the cache (invalidation == removing the value from the cache because it has changed).
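The read and invalidate cycle can be sketched in a few lines. This is a toy illustration under assumptions, not how Johnny Cache or any real query cache is implemented: a dict stands in for Memcache/Redis, an in-memory SQLite table stands in for the database, and `cached_select` and `write` are hypothetical helper names.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE auth_user (id INTEGER PRIMARY KEY, username TEXT)")
db.execute("INSERT INTO auth_user VALUES (1, 'julialovescaching')")

query_cache = {}   # stands in for Memcache/Redis
db_reads = 0       # counts selects that actually hit the database


def cached_select(sql, params=()):
    """Run a SELECT, reusing the cached rows when the same query repeats."""
    global db_reads
    key = (sql, params)
    if key not in query_cache:
        db_reads += 1
        query_cache[key] = db.execute(sql, params).fetchall()
    return query_cache[key]


def write(sql, params=()):
    """Run an UPDATE/DELETE and invalidate cached results.

    Clearing everything is the bluntest possible strategy; a real query
    cache would usually be more selective.
    """
    db.execute(sql, params)
    query_cache.clear()


select = "SELECT username FROM auth_user WHERE id = ?"
before = cached_select(select, (1,))
again = cached_select(select, (1,))          # cache hit, no DB read
write("UPDATE auth_user SET username = ? WHERE id = ?", ("julia", 1))
after = cached_select(select, (1,))          # cache was invalidated, so this re-reads
```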
  18. Johnny Cache
    •  Johnny Cache is one such query cache for Django
    •  http://pythonhosted.org/johnny-cache/
    •  Before you use this, ensure you understand cache invalidation and read http://jmoiron.net/blog/is-johnny-cache-for-you/
  19. Template/HTML caching
    •  Tindie is built on Python/Django
    •  Django templates must be compiled and rendered. This is typically “fast”, but what if you have hundreds of people on the same page and the content on that page hasn’t changed?
    •  Cache blocks of HTML in Memcache or Redis.
  20. Django Cache Example
    •  Django has built-in support for template fragment caching:
        {% load cache %}
        {% cache 500 sidebar %}
            <html goes here!>
        {% endcache %}
  21. Implications of Caching
    •  We did more Memcache reads than DB reads.
    •  This worked for us because Memcache reads are cheaper and faster than DB reads or compiling Django templates.
    •  But if your Memcache becomes slow then you have a problem.
    •  There is no free lunch (or silver bullet).
  22. Too Good to be True?
    •  Doesn’t caching almost seem too good to be true?
    •  Yes, sometimes it is.
    •  How do you decide which data “doesn’t change very often”?
    •  How often should you update your cached data? Hours? Days? Weeks?
    •  Every answer is wrong (or right! :)
  23. Cache Invalidation
    •  Not everything fits in our cache.
    •  Sometimes the data we have cached has changed and we have to update the cache (a query cache handles this by detecting updates/deletes).
    •  Example: if I update my username, then our cached version would return the old, wrong username.
  24. Cache Invalidation
    •  For template caching we invalidate the cache when, for example, a user updates or deletes their product.
    •  We don’t want to show stale data (“I updated my product but why isn’t it updated?!”)
    •  Alternatively we could let the data expire, and after 10 minutes it would be fresh.
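A minimal sketch of explicit invalidation for cached HTML, assuming one cache entry per product. The dict stands in for Memcache/Redis, and the product data, key format, and function names are hypothetical. The expiry alternative would instead store a timeout with each entry and skip the explicit delete.

```python
fragment_cache = {}   # stands in for Memcache/Redis
render_count = 0      # counts how often the "template" is really rendered
products = {42: "Blinky LED kit"}   # hypothetical product data


def render_product(product_id):
    """Return cached HTML for a product, rendering only on a cache miss."""
    global render_count
    key = "product:%d" % product_id
    if key not in fragment_cache:
        render_count += 1
        fragment_cache[key] = "<h1>%s</h1>" % products[product_id]
    return fragment_cache[key]


def update_product(product_id, name):
    """Change the product and drop its cached HTML so readers never see stale data."""
    products[product_id] = name
    fragment_cache.pop("product:%d" % product_id, None)


first = render_product(42)
render_product(42)                      # served from the cache
update_product(42, "Blinky LED kit v2")
second = render_product(42)             # re-rendered after invalidation
```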
  25. Gotchas
    •  Almost all of Tindie’s content that doesn’t change very often is cached.
    •  How long we cache pieces of data is something we continually tune.
    •  We actually got to the point where Memcache was a bottleneck (story for another PyLadies meetup :)