PyConZA 2013: "A Real World Example of Caching Gone Wrong" by Matt Hampton

A Real World Example of Caching Gone Wrong Matt Hampton

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.”
Donald Knuth, 1974

The Last Minute

“Yet we should not pass up our opportunities in that critical 3%.”
Donald Knuth

Optimise Last + The Last Minute = Performing (Almost) As
Well As Is Strictly Necessary

self.cache = {}

def slow_operation(self, x, y):
    …

def faster_operation(self, x, y):
    result = self.cache.get((x, y))
    if result is None:
        result = self.slow_operation(x, y)
        self.cache[(x, y)] = result
    return result

A Race Condition

def faster_operation(self, x, y):
    result = self.cache.get((x, y))
    if result is None:
        result = self.slow_operation(x, y)
        self.cache[(x, y)] = result
    return result
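In the method above, two threads can both miss and both run the slow operation, and an invalidation that lands between the computation and the store is silently overwritten. One common way to close that gap is to guard the whole check-then-fill sequence with a lock. The sketch below is a minimal illustration under that assumption, not the approach the talk settles on; the class name `LockedCache` and the `invalidate` method are hypothetical.

```python
import threading

_MISSING = object()  # sentinel, so that a cached None still counts as a hit


class LockedCache:
    """Hypothetical wrapper: one lock guards the check-then-fill sequence."""

    def __init__(self, slow_operation):
        self._slow_operation = slow_operation
        self._cache = {}
        self._lock = threading.Lock()

    def faster_operation(self, x, y):
        key = (x, y)
        with self._lock:
            result = self._cache.get(key, _MISSING)
            if result is _MISSING:
                # Computed while holding the lock: simple, but it serialises
                # every caller, including those asking for other keys.
                result = self._slow_operation(x, y)
                self._cache[key] = result
        return result

    def invalidate(self, x, y):
        # Takes the same lock, so it cannot interleave with a fill.
        with self._lock:
            self._cache.pop((x, y), None)
```

The lock only helps threads inside one process; it does nothing for caches held by separate processes.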

Based On A True Story
Any resemblance to developers and/or code, living or dead, is purely coincidental

User Rights in j5
• Horizontal Partitioning
• acknowledge.modify = "approval_status in ('submitted', 'ope_acknowledged')"
• userid.view = '(groups IN (${area1}) or ${area1} IS NULL) or j5username = ${user}'

User Rights in j5
• Parsing the user rights definition takes a while
• Parsed group rights data structure is merged in a non-trivial way per user
• Solution: self.cache[username] = merged_user_rights

del self.cache[username]

“There are only two hard things in Computer Science: cache invalidation and naming things.”
Phil Karlton

Thread Pool
[Diagram: HTTP Requests → Server Socket → Thread-1, Thread-2, …, Thread-n]

Enter the GIL
• Consider this trivial CPU-bound function
    def count(n):
        while n > 0:
            n -= 1
• Run it twice in series:
    count(100000000)
    count(100000000)
Copyright (C) 2009, David Beazley, http://www.dabeaz.com

Enter the GIL
• Now, run it in parallel in two threads
    from threading import Thread

    t1 = Thread(target=count, args=(100000000,))
    t1.start()
    t2 = Thread(target=count, args=(100000000,))
    t2.start()
    t1.join(); t2.join()
• On my laptop (8 cores):
  • Sequential: 13.2s (py2) 17.2s (py3)
  • Parallel: 18.1s (py2) 57.0s (py3)
Copyright (C) 2009, David Beazley, http://www.dabeaz.com

Dashing and Daring Software Development Methodology

Multi-processing
[Diagram: HTTP Requests → Load Balancer → Process Pool (Service-1, Service-2, …, Service-n)]
The Last Minute

“There are only two hard things in Computer Science: cache invalidation and naming things.”
Phil Karlton

Time-based Expiry
• Invalidate cache entry after a specified period of time
• User rights don’t change often
• 20 minute period is tolerable
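Time-based expiry as described on this slide can be sketched with a dictionary that stores an expiry timestamp next to each value. This is a minimal illustration only: the class name `ExpiringCache`, its methods, and the injectable `clock` parameter are assumptions made for the example, and the 20 minute default simply mirrors the figure on the slide.

```python
import time


class ExpiringCache:
    """Hypothetical dict-backed cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds=20 * 60, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock   # injectable so that tests need not sleep
        self._entries = {}    # key -> (expires_at, value)

    def set(self, key, value):
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Stale entry: drop it so that the caller recomputes.
            del self._entries[key]
            return default
        return value
```

With this design a change to the underlying data can stay invisible for up to the full expiry period, and the sketch is not thread-safe as written.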

Bounded Caches
• Least Recently Used (LRU)
    @lru_cache(maxsize=32)
    def slow_operation(a1, a2, **kw):
        ...
• Least Frequently Used (LFU)
• Hybrid
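The decorator on this slide looks like `functools.lru_cache` from the Python 3 standard library (an assumption; the slide does not show the import). The short sketch below uses a deliberately tiny `maxsize` to make LRU eviction visible; `slow_square` and the `calls` list are hypothetical names used only for illustration.

```python
from functools import lru_cache

calls = []  # records every real (uncached) invocation


@lru_cache(maxsize=2)
def slow_square(n):
    calls.append(n)
    return n * n


slow_square(1)  # miss
slow_square(2)  # miss
slow_square(1)  # hit; 1 becomes the most recently used entry
slow_square(3)  # miss; the cache is full, so 2 (least recently used) is evicted
slow_square(2)  # miss again, because 2 was evicted

info = slow_square.cache_info()
print(info.hits, info.misses, info.currsize)  # prints: 1 4 2
```

`lru_cache` keys on the call arguments, so every positional and keyword argument must be hashable.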

We Were Mistaken
• “20 minute period is tolerable”
• Weird mixed up user rights bug
• Many long and boring conversations
• When you don’t have a convincing reason for a limitation, remove it!

Memcached
• “We Need a Shared Cache”
• Simple Key/Value Store
• Forgetting Data is a Feature (LRU)
• Smarts Half in Client, Half in Server
• Distributed Independent Servers
• O(1) Everything

Operations
• set, get, delete
• cas – ‘Check and Set’
• incr/decr
• append/prepend
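The idea behind ‘Check and Set’ is that a write succeeds only if nobody else has changed the item since it was read. The toy below models that idea in memory with a per-item token. It is not a memcached client and does not reproduce any real client library's API; `ToyStore`, `gets`, and `cas` as written here are hypothetical names chosen for the illustration.

```python
class ToyStore:
    """In-memory toy that mimics the idea of check-and-set; not a real client."""

    def __init__(self):
        self._data = {}       # key -> (value, token)
        self._next_token = 0

    def _new_token(self):
        self._next_token += 1
        return self._next_token

    def set(self, key, value):
        # An unconditional write always issues a fresh token.
        self._data[key] = (value, self._new_token())

    def gets(self, key):
        """Return (value, token), or (None, None) when the key is absent."""
        return self._data.get(key, (None, None))

    def cas(self, key, value, token):
        """Store value only if the key is unchanged since token was issued."""
        current = self._data.get(key)
        if current is None or current[1] != token:
            return False
        self._data[key] = (value, self._new_token())
        return True
```

A caller that gets `False` back would normally re-read the item and retry, which is what makes the pattern safe for read-modify-write updates.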

http://work.tinou.com/2011/04/memcached-for-dummies.html

“I have not failed. I've just found 10,000 ways that won't work.”
Thomas Edison*

Leaky Abstractions
• python-memcached
• Just cos it looks like a dict, doesn’t mean it is one!
• Know when you’re doing a Remote Procedure Call
• Key size limits
• Data size limits
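The key and data size limits listed above are where a dict-like interface stops behaving like a dict. memcached keys are limited to 250 bytes and may not contain whitespace or control characters, and the default item size limit is 1 MB (configurable on the server). A minimal sketch of guarding against both follows; the helper names `safe_key` and `fits_in_cache` and the `sha256:` prefix are hypothetical, and hashing is just one possible strategy.

```python
import hashlib

MAX_KEY_BYTES = 250             # memcached's key length limit
MAX_VALUE_BYTES = 1024 * 1024   # default item size limit; an approximate bound,
                                # since the real limit also covers item overhead


def safe_key(raw_key):
    """Return raw_key if it looks acceptable, else a SHA-256 digest of it."""
    encoded = raw_key.encode("utf-8")
    # Bytes 0-32 are control characters and space; 127 is DEL.
    has_bad_char = any(b <= 32 or b == 127 for b in encoded)
    if not encoded or len(encoded) > MAX_KEY_BYTES or has_bad_char:
        return "sha256:" + hashlib.sha256(encoded).hexdigest()
    return raw_key


def fits_in_cache(value_bytes):
    """Cheap pre-check so that oversized values are noticed, not dropped."""
    return len(value_bytes) <= MAX_VALUE_BYTES
```

Hashing keeps keys valid but makes them unreadable when debugging, so logging the original key alongside the hashed one is worth considering.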

The Deceitfulness of Cache
• Transparent to your QA team
• Test with appropriate datasets
• Publish (and look at) hit / miss statistics*
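Hit / miss statistics are cheap to collect at the point where the cache is consulted. The sketch below shows one way to do it, assuming a simple in-process cache; `CountingCache`, `get_or_compute`, and `hit_ratio` are hypothetical names, and how the numbers get published (logs, a metrics endpoint) is left open.

```python
class CountingCache:
    """Hypothetical dict-backed cache that counts hits and misses."""

    def __init__(self):
        self._data = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        if key in self._data:
            self.hits += 1
            return self._data[key]
        self.misses += 1
        value = compute()  # only called on a miss
        self._data[key] = value
        return value

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

A hit ratio near zero under realistic data suggests the cache is adding cost without benefit; a ratio near one in test may only mean the test dataset is too small.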

Minimal Configuration (Done Wrong)
• Don’t try to be too smart
    if Ping.test_socket_available("localhost", 11211):
        self.cache = MemCache.MemCache("localhost:11211")
    else:
        self.cache = DictCache.DictCache()
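The snippet above silently picks a backend based on whether a socket happens to be reachable, so two processes started at different moments can end up with different caches. One alternative, offered here as an assumption rather than as what the speaker's team did, is to make the backend an explicit setting and fail loudly when it is missing or unknown. Every name in this sketch (`make_cache`, `BACKENDS`, the `cache_backend` setting, and this stand-in `DictCache`) is hypothetical.

```python
class DictCache(dict):
    """Stand-in for a per-process cache (hypothetical)."""


def make_cache(settings, backends):
    """Build the cache named by settings["cache_backend"]; never guess."""
    name = settings.get("cache_backend")
    if name is None:
        raise ValueError("cache_backend must be set explicitly")
    try:
        factory = backends[name]
    except KeyError:
        raise ValueError("unknown cache_backend: %r" % (name,))
    return factory(settings)


# Registry of available backends. A shared-cache backend would be registered
# here in the same way, taking its address from the settings.
BACKENDS = {
    "dict": lambda settings: DictCache(),
}

cache = make_cache({"cache_backend": "dict"}, BACKENDS)
```

The point of the design is that a misconfigured deployment stops at start-up instead of quietly running with a different cache than intended.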

A Library of Surprises
• Inconsistent handling of stale connections - ConnectionDeadError
• Interesting approach to thread safety
• Some errors swallowed quietly
• Values bigger than 1MB
• 1 unit test
• New lines in key names

A New Critical Component
• Monitoring / Logging / Metrics?
• Temporarily unavailable at start up?
• (Another) Weird mixed up user rights bug
• Temporarily unavailable during an operation?

Transactional Assumptions
• Cache invalidation by delete
• What if that fails?
• If you need a transaction use a transaction
• That unlikely thing will happen.
• (But not in test)

Cache Version Table
[Diagram: a cache version table (user_rights 4cfc2da9, area_heirarchy 510ca44e) alongside the process pool (Service-1, Service-2, …, Service-n); cached entries are keyed as site1.user_rights.4cfc2da9.matth; user groups shown: matth (operator), jonom (supervisor)]

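Bump Version on Change
[Diagram: the user_rights version changes from 4cfc2da9 to 1da21f40 (area_heirarchy stays at 510ca44e); services in the process pool (Service-1, Service-2, …, Service-n) now use the key site1.user_rights.1da21f40.matth; user groups shown: matth (operator, manager), jonom (supervisor)]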
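The key format visible on these slides (site, cache name, version, item) suggests that invalidation is done by changing the version rather than by deleting entries: once the version moves on, old entries are simply never read again. A minimal sketch of that idea follows. Where the version table lives is not stated on the slides, so the plain dictionaries used for the store and the version table are stand-ins, and the class and method names are hypothetical.

```python
import uuid


class VersionedCache:
    """Hypothetical sketch of version-stamped cache keys."""

    def __init__(self, site, store=None, versions=None):
        self._site = site
        # Stand-in for the shared cache.
        self._store = {} if store is None else store
        # Stand-in for the cache version table.
        self._versions = {} if versions is None else versions

    def _version(self, cache_name):
        if cache_name not in self._versions:
            self._versions[cache_name] = uuid.uuid4().hex[:8]
        return self._versions[cache_name]

    def key(self, cache_name, item):
        return "%s.%s.%s.%s" % (
            self._site, cache_name, self._version(cache_name), item)

    def get(self, cache_name, item):
        return self._store.get(self.key(cache_name, item))

    def set(self, cache_name, item, value):
        self._store[self.key(cache_name, item)] = value

    def bump(self, cache_name):
        """Orphan every entry of cache_name by giving it a new version."""
        self._versions[cache_name] = uuid.uuid4().hex[:8]
```

For example, with `versions={"user_rights": "4cfc2da9"}` the key for `matth` is `site1.user_rights.4cfc2da9.matth`, and after `bump("user_rights")` a lookup for the same user misses. Orphaned entries stay in the store until something evicts them, which is acceptable for a cache with LRU eviction.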

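Solving the Right Problem
[Diagram: each service in the process pool (Service-1, Service-2, …, Service-n) holds its own local cache { }; version table: user_rights 4cfc2da9 then 1da21f40, area_heirarchy 510ca44e; user groups shown: matth (operator, manager), jonom (supervisor)]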

“memcached allows you to take memory from parts of your system where you have more than you need and make it accessible to areas where you have less than you need.”
memcached.org/about

Misunderestimation
• Infrastructure for monitoring and management
• Documentation and Training
• Debugging and Support
• Deployment pain on Windows

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”
Donald Knuth

Design for Performance
• Don’t optimise up front
• But be thoughtful
• And scientific

One Minute Left
Any Questions?
