A short overview of how we think about problems and create the mental shortcuts needed in a fast-paced environment, and of how we eventually need to reevaluate the conclusions we reached earlier in our careers.
• Telco, ISP, webhosting, video streaming, SaaS • Windows office automation consulting • Consumer and enterprise web apps • 20 years (almost) and 13 companies • Currently an SRE at Twitter
It’s always been done that way • We don’t run $common_tech, it doesn’t work • Cargo-culting of configs • Processes that are inappropriate for the workloads • Institutional Inertia
Rely on experience first • Match current circumstances against past experiences • If that fails, try a bunch of things • Finally, if all else fails, start from the beginning and work logically towards a solution
Effect • Predisposition to solve a problem in a specific manner based on recent successes • The human brain's way of finding an appropriate solution/behavior as efficiently as possible, even if it is not the most efficient solution
limited by the available information, the tractability of the decision problem, the cognitive limitations of their minds, and the time available to make the decision” --wikipedia
skills, cognitions, emotions and behaviors with more adaptive ones’, by challenging an individual's way of thinking and the way that they react to certain habits or behaviors” --wikipedia
it differently • We were seven years old • Left toy at supermarket, had to go back • Caused a fight between our parents • This fight was actually one of dozens • Other stress factors in the relationship were also recalled, such as moving, a new job, a death in the family, etc.
• One of the DB slaves stopped replicating • Didn’t notice the alert on that slave • Web app using the bad slave couldn’t serve API traffic • Distracted by the drop-off in serving out of cache • Even after the fix, the site was slow for a few hours
Release went out • All DB slaves experienced replication lag. It wasn’t until they started to catch up that it was possible to notice one was wedged • Replication alert was lost in the general alert noise • Site performance was worse than expected
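The point about only spotting the wedged slave once the others began catching up can be sketched as a simple check. This is a minimal illustration, not the tooling used in the incident: the function name `find_wedged_replicas`, the host names, and the lag readings are all hypothetical, and it assumes you already collect periodic replication-lag samples (in seconds) per replica.

```python
from typing import Dict, List


def find_wedged_replicas(lag_samples: Dict[str, List[float]]) -> List[str]:
    """Return replicas whose lag is not shrinking while at least one peer recovers.

    lag_samples maps a replica name to its lag readings in seconds, oldest first.
    All names and numbers here are illustrative assumptions.
    """
    # Trend = newest reading minus oldest reading; negative means catching up.
    trends = {
        name: samples[-1] - samples[0]
        for name, samples in lag_samples.items()
        if len(samples) >= 2
    }

    # While every replica is still falling behind (for example during a long
    # table lock), a wedged replica looks the same as a healthy one.
    if not any(trend < 0 for trend in trends.values()):
        return []

    # Peers are recovering, so a replica that is still behind and not
    # improving stands out.
    return sorted(
        name
        for name, trend in trends.items()
        if trend >= 0 and lag_samples[name][-1] > 0
    )


if __name__ == "__main__":
    samples = {
        "db1": [480.0, 300.0, 120.0],  # catching up
        "db2": [480.0, 360.0, 200.0],  # catching up
        "db3": [480.0, 540.0, 600.0],  # still falling behind
    }
    print(find_wedged_replicas(samples))
```

Comparing each replica against its peers, rather than against a fixed lag threshold, is one way to keep this signal from being buried when every replica is alerting at once.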
system • Schema update locked busiest table for 8 minutes • All DB slaves experienced replication lag. It wasn’t until they started to catch up that it was possible to notice one was wedged • The TTL on hot data was 5 minutes, so it expired during the schema update • API could not serve the high traffic event with a cold cache
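The cold-cache finding comes down to simple arithmetic, sketched below. The 8-minute lock and 5-minute TTL come from the slide; the function names are hypothetical, and the sketch assumes the cache could not be refreshed while the table was locked and that the freshest entries were written just before the lock began.

```python
def cache_goes_cold(lock_seconds: float, ttl_seconds: float) -> bool:
    """True if every entry cached before the lock expires before the lock ends."""
    return lock_seconds >= ttl_seconds


def cold_window_seconds(lock_seconds: float, ttl_seconds: float) -> float:
    """Seconds of the lock during which no pre-lock cache entry is still valid."""
    return max(0.0, lock_seconds - ttl_seconds)


if __name__ == "__main__":
    lock = 8 * 60  # schema update held the table for 8 minutes
    ttl = 5 * 60   # TTL on hot data was 5 minutes
    print(cache_goes_cold(lock, ttl))      # True
    print(cold_window_seconds(lock, ttl))  # 180.0
```

Under these assumptions the cache is completely cold for at least the last three minutes of the lock, so the API faces full traffic with nothing cached at the moment the table becomes available again.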
avoid on production-sized data. • Compare the request rate of the API proxy cache vs. the API in cases where we have a cold cache • Throwing hardware at the problem is nearly impossible at the request rates we have on this system
Releases are dangerous • Releases with schema changes are dangerous • Releases with schema changes on busy tables are dangerous • Releases with schema changes on large, busy tables can affect the system's ability to serve traffic at production levels
least) • Keeps you honest with yourself • Keeps you from taking too much responsibility • Asks the questions you didn’t think to ask • “use your words” • Post mortems, done well, should look like this