system probably has latent bugs: behaviour that is obviously incorrect once observed, but simply hasn't been triggered or noticed yet. New changes may also invalidate previously valid constraints, which are not always explicit and not always known.
to have opinions on, but often low value changes*. Ask questions that are hard but necessary and that provide high value: deployment strategy, interaction with existing systems, monitoring, documentation. * Do this during pairing, mentoring, onboarding. Have a "house style" or at least a "repo style".
"type system" to "but we implemented bounded eventual consistency" to "my PaaS takes care of that" “All human actions are equivalent and all are on principle doomed to failure.” – Jean-Paul Sartre
a race condition! That's OK, we have an automatic fail-safe. A bug in the fail-safe code used an empty config! No problem, there's a canary. A previously unknown bug ignored the canary! All of GCE goes off the internet.
config change to bring down a global service. Two of these bugs were in systems designed to automatically correct or avert failures. Reverted to known-good config before root-causing the outage.
Resolving Internet Service Outages" In-depth examination of an outage at Etsy, concentrating on human-human interactions while dealing with the failure. It also has a great literature review. Seriously go read it.
Etsy homepage. Systems involved: personalized homepage module (two separate submodules), caching layer, and specific production data. The team focused on mitigation (which worked this time) and investigation of the live system.
change
2. Widen the scope of potential signals you're looking at (look for anything unusual).
3. Look for recurrences of "familiar" failures
3.a) Investigate specific past problem systems
3.b) Investigate specific recent problem systems
levels & layers of the system. Neither of these failures were related to "new code" or other aspects of deployment as typically understood by developers.
caused later. How can you facilitate this? Logging. Machine-parsable, easily greppable (see trentm/node-bunyan). Application-level monitors. Core files.
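As a rough illustration of "machine-parsable, easily greppable" logging, here is a minimal sketch in TypeScript using bunyan; the service name, request/order fields, and the simulated failure are invented for the example, not taken from the talk.

```typescript
// Minimal structured-logging sketch with bunyan (trentm/node-bunyan).
// Service and field names here are illustrative only.
import * as bunyan from "bunyan";

const log = bunyan.createLogger({
  name: "checkout-service",           // hypothetical service name
  serializers: bunyan.stdSerializers, // built-in err/req/res serializers
});

function placeOrder(reqId: string, orderId: string): void {
  // Each record is a single JSON line: easy to grep, easy to feed into a
  // log pipeline, and it carries its context as fields rather than prose.
  log.info({ req_id: reqId, orderId }, "placing order");

  try {
    throw new Error("inventory service timed out"); // stand-in failure
  } catch (err) {
    // Log the error as structured data, not as an interpolated string.
    log.error({ req_id: reqId, orderId, err }, "order placement failed");
  }
}

placeOrder("r-1001", "o-42");
```

The point is not bunyan specifically: because every record keeps its context as fields, whoever is debugging later can filter on them (for example with `bunyan -c 'this.orderId === "o-42"'`) instead of parsing free-form text.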
debugging methodologies, like Gregg's USE method, to limit the scope of inquiries or redirect away from "it could be anything". Have simple and discoverable tools for at least initial investigations.
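As a concrete sketch of what a simple, discoverable starting point can look like, here is the USE method expressed as a checklist in TypeScript; the metric names are common examples for each resource, not an exhaustive or prescribed list.

```typescript
// Sketch of Brendan Gregg's USE method as a checklist: for each resource,
// check Utilization, Saturation, and Errors before guessing anywhere else.
interface UseCheck {
  resource: string;
  utilization: string; // how busy the resource is
  saturation: string;  // extra work queued that it can't service yet
  errors: string;      // error events for that resource
}

const checklist: UseCheck[] = [
  { resource: "CPU",     utilization: "per-CPU busy %",           saturation: "run-queue length",        errors: "CPU error counters" },
  { resource: "Memory",  utilization: "used vs. total",           saturation: "swapping / OOM kills",    errors: "failed allocations" },
  { resource: "Disk",    utilization: "device busy %",            saturation: "I/O wait queue length",   errors: "device I/O errors" },
  { resource: "Network", utilization: "throughput vs. bandwidth", saturation: "dropped/overrun packets", errors: "interface error counters" },
];

// Walking one resource at a time keeps the investigation scoped to a single
// question instead of "it could be anything".
for (const check of checklist) {
  console.log(
    `${check.resource}: utilization=${check.utilization}; ` +
    `saturation=${check.saturation}; errors=${check.errors}`
  );
}
```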
exert a minimal human cost in order to successfully scale. Understanding how people recover from production failure is crucial to producing operable software.