
It's Not Your Fault

j.hand
July 22, 2014

A look at blameless post-mortems





  1. It’s Not Your Fault: (Blameless) post-mortems @jasonhand

  2. Jason Hand, DevOps “Handyman” (@jasonhand)

  3. A little about me…

    Dir. of Platform Support - AppDirect
    Dir. of Technical Support - Standing Cloud
    Dir. of Operational Systems - American Fasteners, Inc.
    Hiker, climber, brewer, runner, biker, boarder, surfer, painter, singer, reader, writer, picker, coder, racer, camper, volunteer … all the usual “Colorado 1-upper” crap.
  4. Alternative names

    Also known as (note: public and internal):
    Project Retrospectives
    Post-mortem Analysis
    Post-project Review
    Project Analysis Review
    Quality Improvement Review
    Autopsy Review
    Santayana Review
    After Action Review
    Touchdown Meeting
  5. Post-mortem Defined: What?

    A process intended to inform improvements by determining aspects that were successful or unsuccessful.
  6. Post-mortem Defined: When?

    As soon as feasible after the incident is resolved.
  7. Post-mortem Defined: Who? Everybody

  8. Post-mortem Defined: Why?

    To communicate with your team
    To understand what happened for learning and improving
  9. Post-mortem Defined: How?

    Talk about the incident timeline
    Escalation steps
    What was done to resolve the problem
    Create a remediation plan
    Make it available
  10. The Three R’s (simple format) - Dave Zwieback, VP Engineering - Next Big Sound

    Regret: Acknowledgement and apology
    Reason: Initial incident detection to resolution, including the so-called “root causes.”
    Remedy: Actionable remediation items
  11. (Remedy) Use SMART recommendations

    Specific
    Measurable
    Agreed Upon/Agreeable
    Realistic
    Timebound
    Moving from Reaction to Action
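The SMART checklist above can be turned into a simple completeness check on each remediation item. What follows is only a sketch: the `RemediationItem` class, its field names, and the example items are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class RemediationItem:
    """One "Remedy" entry. All field names are hypothetical."""
    action: str                 # Specific: what exactly will be done
    success_metric: str = ""    # Measurable: how we know it is done
    owner: str = ""             # Agreed upon: who accepted the work
    realistic: bool = False     # Realistic: the team judged it feasible
    due: Optional[date] = None  # Timebound: when it is due

    def missing_criteria(self) -> List[str]:
        """Return the SMART criteria this item does not yet satisfy."""
        missing = []
        if not self.action.strip():
            missing.append("Specific")
        if not self.success_metric.strip():
            missing.append("Measurable")
        if not self.owner.strip():
            missing.append("Agreed upon")
        if not self.realistic:
            missing.append("Realistic")
        if self.due is None:
            missing.append("Timebound")
        return missing


# Invented example items, for illustration only.
vague = RemediationItem(action="Improve monitoring")
smart = RemediationItem(
    action="Add a disk-usage alert at 80% on the database hosts",
    success_metric="Alert fires in a staging test",
    owner="platform-support",
    realistic=True,
    due=date(2014, 8, 15),
)

print(vague.missing_criteria())
print(smart.missing_criteria())
```

An item that still reports missing criteria is a reaction rather than an action, so it can be sent back for more detail before the post-mortem is published.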
  12. Blameless (image from “Across the Universe”)

  13. Cool story, bro

    2011 - Hired at Standing Cloud
    Cloud marketplace & automated deployment of apps
    Build Support team
    Provide Managed services
  14. Cool story, bro

  15. “Reprimanding bad apples may seem like a quick and rewarding fix, but it’s like peeing in your pants. You feel relieved and perhaps even nice and warm for a little while, but then it gets cold and uncomfortable. And you look like a fool” – Sidney Dekker

    Quote first seen in J. Paul Reed’s “A Look at Looking in the Mirror”
  16. What is a blameless post-mortem?

    Team members are accountable but not responsible
    Complete transparency
    Deeper look at circumstances
    What happened and how to improve it (specific details)
    Real conditions of failure in complex systems
  17. “Your organization must continually affirm that individuals are NEVER the “root cause” of outages.” – Dave Zwieback
  18. Paraphrased from “Fallible Humans” by Ian Malpass - DevOpsDays - Minneapolis
  19. ETTO (Efficiency Thoroughness Trade Off)

    The trade-off between being efficient and being thorough
  20. “We can be thorough and really dig into the task at hand and understand it well but this takes time: it is inefficient.” – Ian Malpass
  21. Cause & Effect

    There are many factors that played a part in the problem
    source: “may be”
  22. Stress & Cognitive Bias

  23. Yerkes-Dodson Model (source: The Human Side of Postmortems)

  25. Reduce Stress?

    … build muscle memory
    Simulate many types of problems and outages as “practice” …
  26. Evaluative Threat

    Being negatively judged plays a big role in stress
  27. What is stress surface?

    Variables of a situation:
    Novel or unusual
    Unpredictable
    Controllable situation
    Negative judgement (evaluative threats)
    ALSO:
    Lack of sleep
    Problems at home
    Health
    Relationships
    Etc…
  28. Capturing the Human-side: Ask questions

  29. Stress Questionnaire

    During the outage, how often have you felt or thought that:
    The situation was novel or unusual?
    The situation was unpredictable?
    You were unable to control the situation?
    Others could judge your actions negatively?
    0 = Never, 1 = Almost Never, 2 = Sometimes, 3 = Fairly Often, 4 = Very Often
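One way to work with the questionnaire answers is to average each question across everyone who responded, so the team can see which stress variable dominated the outage. This is a minimal sketch under assumptions: the talk does not say how answers are aggregated, so the per-question mean, the `summarize` function, and the anonymous respondent labels are all hypothetical.

```python
from statistics import mean
from typing import Dict, List

# The 0-4 answer scale from the questionnaire slide.
SCALE = {
    0: "Never",
    1: "Almost Never",
    2: "Sometimes",
    3: "Fairly Often",
    4: "Very Often",
}

# The four questions from the questionnaire slide, in order.
QUESTIONS = [
    "The situation was novel or unusual?",
    "The situation was unpredictable?",
    "You were unable to control the situation?",
    "Others could judge your actions negatively?",
]


def summarize(responses: Dict[str, List[int]]) -> Dict[str, float]:
    """Average each question's answers across all respondents.

    Aggregating by mean is an assumption, not something the talk prescribes.
    """
    for person, answers in responses.items():
        if len(answers) != len(QUESTIONS):
            raise ValueError(f"{person}: expected {len(QUESTIONS)} answers")
        if any(a not in SCALE for a in answers):
            raise ValueError(f"{person}: answers must be integers from 0 to 4")
    # zip(*...) regroups the answers so each tuple holds one question's scores.
    per_question = zip(*responses.values())
    return {q: mean(scores) for q, scores in zip(QUESTIONS, per_question)}


# Invented answers, keyed by anonymous labels to keep the exercise blameless.
example = {
    "responder-a": [4, 3, 2, 1],
    "responder-b": [2, 3, 4, 3],
}

for question, average in summarize(example).items():
    print(f"{average:.1f}  {question}")
```

Keeping the respondent labels anonymous fits the evaluative-threat point made earlier: people answer more candidly when the scores cannot be used to judge them.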
  30. Why we don’t punish

    De-incentivized to give the details
    Practically guarantees a repeat of the problem
    Understand why actions made sense (at the time)
    Create safety AND accountability
    Move away from idea of “individuals are problems”
    Create new “experts”
  32. Promoting from within: Where do we start?

    The basics:
    • Document your timeline or log data
    • Document conversations
    • Leave room for notes
    • Mean time to resolution / Time calculations
    • Level of severity
    • Archive it for historical retrieval
    • Remediation. Make it actionable
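The basics listed above (a timeline, a severity level, and time calculations) map naturally onto a small record type. The sketch below assumes that time to resolution is measured from the earliest to the latest timeline entry; the `Incident` and `TimelineEntry` classes and the example incidents are hypothetical and not taken from any tool named in the talk.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class TimelineEntry:
    at: datetime  # when it happened
    note: str     # what happened, described without assigning blame


@dataclass
class Incident:
    """A minimal post-mortem record. All names are hypothetical."""
    title: str
    severity: int
    timeline: List[TimelineEntry] = field(default_factory=list)

    def log(self, at: datetime, note: str) -> None:
        self.timeline.append(TimelineEntry(at, note))

    def time_to_resolution(self) -> timedelta:
        """Assumed definition: earliest entry to latest entry."""
        if len(self.timeline) < 2:
            raise ValueError("need at least a start and an end entry")
        times = [entry.at for entry in self.timeline]
        return max(times) - min(times)


def mean_time_to_resolution(incidents: List[Incident]) -> timedelta:
    """Average the time to resolution over a list of incidents."""
    if not incidents:
        raise ValueError("no incidents to average")
    durations = [incident.time_to_resolution() for incident in incidents]
    return sum(durations, timedelta()) / len(durations)


# Invented incidents, for illustration only.
db_outage = Incident("Database disk full", severity=1)
db_outage.log(datetime(2014, 7, 1, 9, 0), "Alert received")
db_outage.log(datetime(2014, 7, 1, 9, 10), "Escalated to on-call engineer")
db_outage.log(datetime(2014, 7, 1, 9, 30), "Old logs purged, service restored")

deploy_failure = Incident("Failed deployment", severity=2)
deploy_failure.log(datetime(2014, 7, 8, 14, 0), "Deployment errors reported")
deploy_failure.log(datetime(2014, 7, 8, 15, 30), "Rollback complete")

incidents = [db_outage, deploy_failure]
print(db_outage.time_to_resolution())
print(mean_time_to_resolution(incidents))
```

Storing the raw timeline rather than only the computed duration keeps the record useful for historical retrieval, since later readers can see the escalation steps and not just the total.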
  33. Tools: Etsy’s Morgue, VictorOps Post-mortem Report, Internal Wiki

  34. Thank You

    Seek the truth
    Don’t blame others …
    Don’t blame yourself
  35. Questions?

  36. Resources

    “The Human Side of Postmortems” - Dave Zwieback
    “The Field Guide to Understanding Human Error” - Sidney Dekker
    “A Look at Looking in the Mirror” - J. Paul Reed
    “Fallible Humans” - Ian Malpass
    “4 Questions to ask for an effective Technical Post Mortem” - Jeffrey O’Brien ( 4-questions-effective-technical-post-mortem/)
    “Nine steps to IT post-mortem excellence” - Michael Krigsman ( post-mortem-excellence/1069)
    “Postmortem reviews: purpose and approaches in software engineering” - Torgeir Dingsøyr ( emner/matnat/ifi/INF5180/v10/undervisningsmateriale/reading-materials/p08/post-mortems.pdf)
    “Blameless PostMortems and a Just Culture” - John Allspaw
    “What blameless really means” - Jessica Harllee
    “Each necessary, but only jointly sufficient” - John Allspaw ( only-jointly-sufficient/)