PyConZA 2013: "A Real World Example of Caching Gone Wrong" by Matt Hampton

Pycon ZA
October 04, 2013

So I’ve heard that there are only two hard things in Computer Science: cache invalidation, naming things, and off-by-one errors. Perhaps they are right.

In this talk, I’m going to tell a story of how we got caught out by a somewhat naive approach to caching ... or, to be more honest, a series of naive approaches to caching. It is my hope that by listening to my sorry tale, you will be introduced to some of the key concepts that will allow you to avoid our fate.

More specifically, I’m going to talk about our experience at St James Software working with in-process caches, and then with inter-process caching using Python and memcached. Along the way, I will introduce a couple of different caching strategies, including time-based and least-recently-used, and, of course, will devote some time to the perils of cache invalidation.
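As a rough illustration of the two strategies named above, here is a minimal in-process sketch that combines time-based expiry with least-recently-used eviction. The class name `ExpiringLRUCache`, its parameters, and the injectable `clock` are assumptions made for this example and are not taken from the talk; the 1200-second default simply mirrors the 20-minute period mentioned in the slides. The sketch is not thread-safe.

```python
import time
from collections import OrderedDict

_MISSING = object()  # sentinel so that a cached None is still a hit


class ExpiringLRUCache:
    """Hypothetical in-process cache: entries expire after ttl_seconds,
    and the least recently used entry is evicted when the cache is full."""

    def __init__(self, max_entries=128, ttl_seconds=1200.0, clock=time.monotonic):
        self._entries = OrderedDict()  # key -> (expires_at, value), oldest first
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping

    def get(self, key, default=None):
        item = self._entries.get(key, _MISSING)
        if item is _MISSING:
            return default
        expires_at, value = item
        if self._clock() >= expires_at:
            # Time-based expiry: drop the stale entry and report a miss.
            del self._entries[key]
            return default
        # Mark the entry as most recently used.
        self._entries.move_to_end(key)
        return value

    def set(self, key, value):
        self._entries[key] = (self._clock() + self._ttl, value)
        self._entries.move_to_end(key)
        # LRU eviction: discard from the least recently used end.
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)
```

Passing a fake `clock` makes the expiry behaviour easy to exercise in a unit test, which matters because cache bugs of this kind rarely show up with small test datasets.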

Finally, I will take a small diversion to discuss some pitfalls that you should watch out for, if and when you choose to add a new application to your solution.

Transcript

  1. “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.” Donald Knuth, 1974
  2. “Yet we should not pass up our opportunities in that critical 3%.” Donald Knuth
  3. # Both functions are methods of an object that owns self.cache
     def slow_operation(self, x, y): …

     def faster_operation(self, x, y):
         result = self.cache.get((x, y))
         if result is None:
             result = self.slow_operation(x, y)
             self.cache[(x, y)] = result
         return result
  4. def faster_operation(self, x, y):
         result = self.cache.get((x, y))
         if result is None:
             result = self.slow_operation(x, y)
             self.cache[(x, y)] = result
         return result

     A Race Condition
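One reading of the race in the slide above is that the check (`get`) and the store are separate steps, so two threads can both miss and both run the slow operation. A minimal sketch of one way to close that gap with a lock follows; the class name `LockedMemo` and its constructor are assumptions for illustration, and this is not necessarily the fix used in the talk.

```python
import threading


class LockedMemo:
    """Hypothetical wrapper: memoises slow_operation(x, y) under a lock."""

    def __init__(self, slow_operation):
        self._slow_operation = slow_operation
        self._cache = {}
        self._lock = threading.Lock()

    def faster_operation(self, x, y):
        key = (x, y)
        # Holding the lock across check and store makes the pair atomic,
        # so only one thread ever computes a given key.
        with self._lock:
            if key not in self._cache:  # also caches results that are None
                self._cache[key] = self._slow_operation(x, y)
            return self._cache[key]
```

The trade-off is that the lock is held while the slow operation runs, so all callers are serialised, including those asking for other keys. A per-key lock would reduce that contention at the cost of more bookkeeping.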
  5. Based On A True Story
     Any resemblance to developers and/or code, living or dead, is purely coincidental
  6. User Rights in j5
     • Horizontal Partitioning
     • acknowledge.modify = "approval_status in ('submitted', 'ope_acknowledged')"
     • userid.view = '(groups IN (${area1}) or ${area1} IS NULL) or j5username = ${user}'
  7. User Rights in j5
     • Parsing the user rights definition takes a while
     • Parsed group rights data structure is merged in a non-trivial way per user
     • Solution: self.cache[username] = merged_user_rights
  8. “There are only two hard things in Computer Science: cache invalidation and naming things.” Phil Karlton
  9. Enter the GIL
     • Consider this trivial CPU-bound function
           def count(n):
               while n > 0:
                   n -= 1
     • Run it twice in series:
           count(100000000)
           count(100000000)
     Copyright (C) 2009, David Beazley, http://www.dabeaz.com
  10. Enter the GIL
      • Now, run it in parallel in two threads
            t1 = Thread(target=count,args=(100000000,))
            t1.start()
            t2 = Thread(target=count,args=(100000000,))
            t2.start()
            t1.join(); t2.join()
      • On my laptop (8 cores):
        • Sequential: 13.2s (py2) 17.2s (py3)
        • Parallel: 18.1s (py2) 57.0s (py3)
      Copyright (C) 2009, David Beazley, http://www.dabeaz.com
  11. “There are only two hard things in Computer Science: cache invalidation and naming things.” Phil Karlton
  12. Time-based Expiry
      • Invalidate cache entry after a specified period of time
      • User rights don’t change often
      • 20 minute period is tolerable
  13. We Were Mistaken
      • “20 minute period is tolerable”
      • Weird mixed up user rights bug
      • Many long and boring conversations
      • When you don’t have a convincing reason for a limitation, remove it!
  14. Memcached
      • “We Need a Shared Cache”
      • Simple Key/Value Store
      • Forgetting Data is a Feature (LRU)
      • Smarts Half in Client, Half in Server
      • Distributed Independent Servers
      • O(1) Everything
  15. Operations
      • set, get, delete
      • cas – ‘Check and Set’
      • incr/decr
      • append/prepend
  16. “I have not failed. I've just found 10,000 ways that won't work.” Thomas Edison*
  17. Leaky Abstractions
      • python-memcached
      • Just cos it looks like a dict, doesn’t mean it is one!
      • Know when you’re doing a Remote Procedure Call
      • Key size limits
      • Data size limits
  18. The Deceitfulness of Cache
      • Transparent to your QA team
      • Test with appropriate datasets
      • Publish (and look at) hit / miss statistics*
  19. Minimal Configuration (Done Wrong)
      • Don’t try to be too smart
            if Ping.test_socket_available("localhost", 11211):
                self.cache = MemCache.MemCache("localhost:11211")
            else:
                self.cache = DictCache.DictCache()
  20. A Library of Surprises
      • Inconsistent handling of stale connections - ConnectionDeadError
      • Interesting approach to thread safety
      • Some errors swallowed quietly
      • Values bigger than 1MB
      • 1 unit test
      • New lines in key names
  21. A New Critical Component
      • Monitoring / Logging / Metrics?
      • Temporarily unavailable at start up?
      • (Another) Weird mixed up user rights bug
      • Temporarily unavailable during an operation?
  22. Transactional Assumptions
      • Cache invalidation by delete
      • What if that fails?
      • If you need a transaction use a transaction
      • That unlikely thing will happen.
      • (But not in test)
  23. Cache Version Table
      [Diagram. Version table: user_rights 4cfc2da9; area_heirarchy 510ca44e. Process Pool: Service-1, Service-2, Service-n. Cache key (shown twice): site1.user_rights.4cfc2da9.matth. Other labels: matth, operator, jonom, supervisor, operator, operator.]
  24. Bump Version on Change
      [Diagram. Process Pool: Service-1, Service-2, Service-n. Cache key: site1.user_rights.1da21f40.matth. Version table: user_rights 4cfc2da9, 1da21f40; area_heirarchy 510ca44e. Other labels: matth, operator, manager, jonom, supervisor, manager.]
  25. Solving the Right Problem
      [Diagram. Process Pool: Service-1 { }, Service-2 { }, Service-n { } ... Version table: user_rights 4cfc2da9, 1da21f40; area_heirarchy 510ca44e. Other labels: matth, operator, manager, jonom, supervisor.]
  26. “memcached allows you to take memory from parts of your system where you have more than you need and make it accessible to areas where you have less than you need.” memcached.org/about
  27. Misunderestimation
      • Infrastructure for monitoring and management
      • Documentation and Training
      • Debugging and Support
      • Deployment pain on Windows
  28. “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.” Donald Knuth
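The "Cache Version Table" and "Bump Version on Change" slides suggest keys of the form site.namespace.version.name (for example site1.user_rights.4cfc2da9.matth), where changing the version token makes every older key unreachable without deleting anything. A minimal sketch of that idea follows, under stated assumptions: plain dicts stand in for both the shared cache and the version table, and the class name `VersionedCache`, its methods, and the use of `uuid` for tokens are hypothetical rather than taken from the talk.

```python
import uuid


class VersionedCache:
    """Hypothetical sketch: invalidate a whole namespace by changing its version."""

    def __init__(self, site, store=None):
        self._site = site
        # Assumption: a dict stands in for the shared cache (e.g. memcached).
        self._store = {} if store is None else store
        # Assumption: a dict stands in for the cache version table.
        self._versions = {}

    def _new_token(self):
        # Short random token, similar in shape to the ones on the slides.
        return uuid.uuid4().hex[:8]

    def make_key(self, namespace, name):
        version = self._versions.setdefault(namespace, self._new_token())
        return "{}.{}.{}.{}".format(self._site, namespace, version, name)

    def get(self, namespace, name):
        return self._store.get(self.make_key(namespace, name))

    def set(self, namespace, name, value):
        self._store[self.make_key(namespace, name)] = value

    def bump(self, namespace):
        # Entries stored under the old token are never read again; in a real
        # LRU cache they would simply age out.
        self._versions[namespace] = self._new_token()
```

For instance, after `cache.set("user_rights", "matth", ["operator"])`, a call to `cache.bump("user_rights")` makes the next `cache.get("user_rights", "matth")` miss, so the rights are rebuilt from the source of truth. Because invalidation becomes a single write to the version table rather than a set of deletes, it sidesteps the "what if that delete fails?" question for individual keys, although the version table itself must then be read reliably by every process.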