
Empathy is a fundamental engineering skill


A talk delivered at devopsdays Vancouver on operational empathy as a method to improve your software engineering practice.

Art by @choplogik.

Matt Smillie

April 15, 2016

Transcript

  1. Empathy is everywhere
    Product: understanding customer needs (usually).
    Management: understanding employee needs (hopefully).
    #hugops
    Devops' "People over process"
  2. Why get better at ops? (but how?)
    smaller teams + more software = scaling problem
  3. "developer induction"
    The system is functioning now.
    I make a well-tested, internally-consistent, "correct" change.
    The system will continue functioning.
  4. "developer induction"
    The system is functioning now.
    I make a well-tested, internally-consistent, "correct" change.
    The system will continue functioning.
    Are you sure?
  5. "developer induction"
    The system is functioning now.
    I make a well-tested, internally-consistent, "correct" change.
    The system will continue functioning.
    Are you sure?
    Well...
  6. "developer induction"
    The system is functioning now.
    I make a well-tested, internally-consistent, "correct" change.
    The system will continue functioning.
    Are you sure?
    Well...
    Good luck with that.
  7. Bugs vs. "Bugs"
    It's important to draw this distinction. Your system probably has latent bugs: obviously incorrect behaviour that hasn't been found yet. New changes may also invalidate previously valid constraints, which are not always explicit, and not always known.
  8. What about tests?
    Tests are great, but recognize their limitations. The better your tests, the worse your eventual failures will be. (They'll also be further apart, which is a huge net win.)
  9. What about code review?
    Formatting, style, "manual" linting, idiom: easy to have opinions on, but often low-value changes*.
    Ask questions that are hard, but necessary, and provide high value: deployment strategy, interaction with existing systems, monitoring, documentation.
    * Do this during pairing, mentoring, onboarding. Have a "house style" or at least a "repo style".
  10. What about $X?
    Where $X is anything from "framework" to "type system" to "but we implemented bounded eventual consistency" to "my PaaS takes care of that".
    “All human actions are equivalent and all are on principle doomed to failure.” – Jean-Paul Sartre
  11. The GCE Failure
    A routine change to networking config triggered a race condition!
    That's OK, we have an automatic fail-safe.
  12. The GCE Failure
    A routine change to networking config triggered a race condition!
    That's OK, we have an automatic fail-safe.
    A bug in the fail-safe code used an empty config!
  13. The GCE Failure
    A routine change to networking config triggered a race condition!
    That's OK, we have an automatic fail-safe.
    A bug in the fail-safe code used an empty config!
    No problem, there's a canary.
  14. The GCE Failure
    A routine change to networking config triggered a race condition!
    That's OK, we have an automatic fail-safe.
    A bug in the fail-safe code used an empty config!
    No problem, there's a canary.
    A previously unknown bug ignored the canary!
  15. The GCE Failure
    A routine change to networking config triggered a race condition!
    That's OK, we have an automatic fail-safe.
    A bug in the fail-safe code used an empty config!
    No problem, there's a canary.
    A previously unknown bug ignored the canary!
    All of GCE goes off the internet.
  16. The GCE Failure
    Three novel bugs interacted with a routine config change to bring down a global service.
    Two of these bugs were in systems designed to automatically correct or avert failures.
    Reverted to known-good config before root-causing the outage.
  17. Allspaw's thesis
    "Trade-offs Under Pressure: Heuristics and Observations of Teams Resolving Internet Service Outages"
    In-depth examination of an outage at Etsy, concentrating on human-human interactions while dealing with the failure. It also has a great literature review. Seriously, go read it.
  18. Etsy outage
    Not a complete outage; performance degradation on the Etsy homepage.
    Systems involved: personalized homepage module (two separate submodules), caching layer, and specific production data.
    Team focused on mitigation (worked this time) and investigation of the live system.
  19. Allspaw's Heuristics
    Allspaw identified four heuristics in use during the outage.
    Important: these aren't recommendations, but observations of an experienced team dealing with an outage in a familiar environment.
  20. Allspaw's Heuristics
    1. Look for correlation to the most recent change.
    2. Widen the scope of potential signals you're looking at (look for anything unusual).
    3. Look for recurrences of "familiar" failures.
      3.a) Investigate specific past problem systems.
      3.b) Investigate specific recent problem systems.
  21. 4. "In outage scenarios, use peer review of any code changes to gain confidence, as opposed to other methods, such as automated tests or other procedures."
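The first heuristic (correlate with the most recent change) is the kind of thing a simple, discoverable tool can support. What follows is only a minimal sketch, not anything taken from the thesis or from Etsy's tooling: the `recent_changes` function, the `(timestamp, description)` change-log format, and the 24-hour default window are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def recent_changes(changes, incident_start, window=timedelta(hours=24)):
    """List changes made shortly before an incident, most recent first.

    `changes` is an iterable of (timestamp, description) tuples; this
    record format is hypothetical, not a real change-log schema.
    """
    earliest = incident_start - window
    # Keep only changes that landed inside the window before the incident.
    candidates = [c for c in changes if earliest <= c[0] <= incident_start]
    # The most recent change is the first suspect, so sort newest first.
    return sorted(candidates, key=lambda c: c[0], reverse=True)


if __name__ == "__main__":
    start = datetime(2020, 1, 1, 12, 0)  # made-up incident start time
    log = [
        (start - timedelta(hours=30), "library upgrade"),
        (start - timedelta(hours=2), "network config push"),
        (start - timedelta(minutes=10), "feature flag flip"),
    ]
    for when, what in recent_changes(log, start):
        print(when.isoformat(), what)
```

A list like this only narrows where to look first. Both failures in this talk involved interactions that the most recent change alone would not explain.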
  22. Interesting points
    Both of these failures involved interactions between distinct levels & layers of the system.
    Neither of these failures was related to "new code" or other aspects of deployment as typically understood by developers.
  23. Post-mortem debugging
    The GCE outage was recovered first, and root-caused later. How can you facilitate this?
    Logging. Machine-parsable, easily greppable (see trentm/node-bunyan).
    Application-level monitors.
    Core files.
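To make "machine-parsable, easily greppable" concrete, here is a minimal sketch of one-JSON-object-per-line logging using only the Python standard library. It is in the spirit of node-bunyan but does not reproduce its record format. The `JsonLineFormatter`, `make_logger`, and `grep_json` names, and the convention of passing structured data as `extra={"fields": {...}}`, are assumptions made for this example.

```python
import io
import json
import logging
import time


class JsonLineFormatter(logging.Formatter):
    """Format each log record as a single JSON object on one line."""

    def format(self, record):
        entry = {
            "time": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "name": record.name,
            "msg": record.getMessage(),
        }
        # Structured fields passed as extra={"fields": {...}} are merged in.
        # This "fields" convention is hypothetical, chosen for the sketch.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry, sort_keys=True)


def make_logger(name, stream):
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter())
    logger.addHandler(handler)
    return logger


def grep_json(lines, **wanted):
    """Return parsed entries whose fields equal every wanted value."""
    entries = (json.loads(line) for line in lines if line.strip())
    return [e for e in entries if all(e.get(k) == v for k, v in wanted.items())]


if __name__ == "__main__":
    buffer = io.StringIO()
    log = make_logger("homepage", buffer)
    log.info("cache miss", extra={"fields": {"req_id": "abc123"}})
    log.error("upstream timeout", extra={"fields": {"req_id": "abc123", "ms": 5000}})
    print(buffer.getvalue(), end="")
    print(grep_json(buffer.getvalue().splitlines(), level="ERROR"))
```

Because every line is a complete JSON object, the same log can be searched with plain `grep` or filtered by field after the outage is over, which is what recover-first, root-cause-later debugging needs.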
  24. Prevent flailing
    Allspaw's 2nd Heuristic: "Cast a wide net"
    Learn debugging methodologies, like Gregg's USE method, to limit the scope of inquiries or redirect from "it could be anything".
    Have simple and discoverable tools for at least initial investigations.
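The USE method asks three questions of every resource: utilization, saturation, errors. A minimal sketch of that checklist as code might look like the following. The `ResourceSample` type, the `use_findings` function, the 0.8 utilization threshold, and the sample numbers are all assumptions for illustration. A real investigation would read these values from system tools instead of hard-coding them.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """One observation of a resource (hypothetical record shape)."""
    name: str
    utilization: float  # fraction of capacity in use, 0.0 to 1.0
    saturation: int     # queued or waiting work, e.g. run-queue length
    errors: int         # error events counted in the sampling interval


def use_findings(samples, util_threshold=0.8):
    """Walk every resource and flag utilization, saturation, and errors.

    Returns (resource, check, value) tuples. The threshold is an
    illustrative default, not a recommended value.
    """
    findings = []
    for s in samples:
        # Errors are checked first here because they are quick to interpret.
        if s.errors > 0:
            findings.append((s.name, "errors", s.errors))
        if s.saturation > 0:
            findings.append((s.name, "saturation", s.saturation))
        if s.utilization >= util_threshold:
            findings.append((s.name, "utilization", s.utilization))
    return findings


if __name__ == "__main__":
    samples = [
        ResourceSample("cpu", utilization=0.95, saturation=3, errors=0),
        ResourceSample("disk", utilization=0.10, saturation=0, errors=2),
        ResourceSample("network", utilization=0.20, saturation=0, errors=0),
    ]
    for finding in use_findings(samples):
        print(finding)
```

The point of the fixed loop is that it replaces "it could be anything" with a finite list of questions, which is how a methodology limits the scope of an inquiry.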
  25. Allspaw's 4th Heuristic
    How do you scale expertise to perform peer review in a crisis?
    Safe environments, Just Culture.
    Good project structure.
  26. "For an occurrence to become an adventure, it is necessary and sufficient for one to recount it." – Jean-Paul Sartre
  27. Empathy, conclusions:
    Operable software must exert a minimal human cost in order to successfully scale.
    Understanding how people recover from production failure is crucial to producing operable software.
    (Questions?) (PS: References to follow)