It's Not Your Fault

j.hand
July 22, 2014

A look at blameless post-mortems

Transcript

  1. A little about me… Dir. of Platform Support - AppDirect; Dir. of Technical Support - Standing Cloud; Dir. of Operational Systems - American Fasteners, Inc. Hiker, climber, brewer, runner, biker, boarder, surfer, painter, singer, reader, writer, picker, coder, racer, camper, volunteer … all the usual “Colorado 1-upper” crap. @jasonhand
  2. Alternative names (Note: Public & Internal). Also known as: Project Retrospectives, Post-mortem analysis, Post-project review, Project Analysis Review, Quality Improvement Review, Autopsy Review, Santayana Review, After Action Review, Touchdown Meeting.
  3. Post-mortem Defined. What? A process intended to inform improvements by determining aspects that were successful or unsuccessful.
  4. Post-mortem Defined. Why? To communicate with your team. To understand what happened for learning and improving.
  5. Post-mortem Defined. How? Talk about the incident timeline, the escalation steps, and what was done to resolve the problem. Create a remediation plan. Make it available.
  6. The Three R’s (simple format). Regret: acknowledgement and apology. Reason: initial incident detection to resolution, including the so-called “root causes.” Remedy: actionable remediation items. Dave Zwieback, VP Engineering - Next Big Sound
  7. Cool story, bro. 2011 - Hired at Standing Cloud: cloud marketplace & automated deployment of apps. Build Support team. Provide Managed services.
  8. “Reprimanding bad apples may seem like a quick and rewarding fix, but it’s like peeing in your pants. You feel relieved and perhaps even nice and warm for a little while, but then it gets cold and uncomfortable. And you look like a fool.” – Sidney Dekker. Quote first seen in J. Paul Reed’s “A Look at Looking in the Mirror”
  9. What is a blameless post-mortem? Team members are accountable but not responsible. Complete transparency. Deeper look at circumstances. What happened and how to improve it (specific details). Real conditions of failure in complex systems.
  10. “Your organization must continually affirm that individuals are NEVER the ‘root cause’ of outages.” – Dave Zwieback
  11. Paraphrased from “Fallible Humans” by Ian Malpass - DevOpsDays Minneapolis. Source: http://www.indecorous.com/fallible_humans/
  12. ETTO (Efficiency Thoroughness Trade Off): the trade off between being efficient vs being thorough.
  13. “We can be thorough and really dig into the task at hand and understand it well but this takes time: it is inefficient.” - Ian Malpass
  14. Cause & Effect. There are (“may be”) many factors that played a part in the problem. Source: http://xkcd.com
  15. Reduce Stress? … build muscle memory. Simulate many types of problems and outages as “practice” …
  16. What is stress surface? Variables of a situation: novel or unusual; unpredictable; controllable situation; negative judgement (evaluative threats). ALSO: lack of sleep, problems at home, health, relationships, etc…
  17. Stress Questionnaire. During the outage, how often have you felt or thought that:

    The situation was novel or unusual?
    The situation was unpredictable?
    You were unable to control the situation?
    Others could judge your actions negatively?

    0 = Never, 1 = Almost Never, 2 = Sometimes, 3 = Fairly Often, 4 = Very Often
  18. Why we don’t punish. Punishment de-incentivizes people from giving the details and practically guarantees a repeat of the problem. Instead: understand why actions made sense (at the time); create safety AND accountability; move away from the idea of “individuals are problems”; create new “experts”.
  19. Promoting from within. Where do we start? The basics:

    • Document your timeline or log data
    • Document conversations
    • Leave room for notes
    • Mean time to resolution / Time calculations
    • Level of severity
    • Archive it for historical retrieval
    • Remediation. Make it actionable
  20. Resources

    “The Human Side of Postmortems” - Dave Zwieback
    “The Field Guide to Understanding Human Error” - Sidney Dekker
    “A Look at Looking in the Mirror” - J. Paul Reed
    “Fallible Humans” - Ian Malpass (http://www.indecorous.com/fallible_humans/)
    “4 Questions to ask for an effective Technical Post Mortem” - Jeffrey O’Brien (http://www.maintenanceassistant.com/blog/4-questions-effective-technical-post-mortem/)
    “Nine steps to IT post-mortem excellence” - Michael Krigsman (http://www.zdnet.com/blog/projectfailures/nine-steps-to-it-post-mortem-excellence/1069)
    “Postmortem reviews: purpose and approaches in software engineering” - Torgeir Dingsøyr (http://www.uio.no/studier/emner/matnat/ifi/INF5180/v10/undervisningsmateriale/reading-materials/p08/post-mortems.pdf)
    “Blameless PostMortems and a Just Culture” - John Allspaw (http://codeascraft.com/2012/05/22/blameless-postmortems/)
    “What blameless really means” - Jessica Harllee (http://www.jessicaharllee.com/notes/what-blameless-really-means/)
    “Each necessary, but only jointly sufficient” - John Allspaw (http://www.kitchensoap.com/2012/02/10/each-necessary-but-only-jointly-sufficient/)
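
The “time calculations” from slide 19 and the questionnaire from slide 17 can be tallied with very little code. What follows is a minimal sketch, not something taken from the deck: the `Incident` record, its field names, and both helper functions are hypothetical, the timestamps are made up, and summing the four questionnaire answers into one score is an assumption, since the slides do not say how the answers are combined.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical post-mortem record; all field names are illustrative."""
    severity: int          # level of severity (the scale itself is an assumption)
    detected_at: datetime  # initial incident detection
    resolved_at: datetime  # resolution

    @property
    def time_to_resolution(self) -> timedelta:
        return self.resolved_at - self.detected_at


def mean_time_to_resolution(incidents: list[Incident]) -> timedelta:
    """Average detection-to-resolution time over archived incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.time_to_resolution for i in incidents), timedelta())
    return total / len(incidents)


def stress_score(answers: list[int]) -> int:
    """Sum four questionnaire answers, each 0 (Never) to 4 (Very Often).

    Summing is an assumption; the slides do not say how answers combine.
    """
    if len(answers) != 4 or any(a not in range(5) for a in answers):
        raise ValueError("expected four answers between 0 and 4")
    return sum(answers)


if __name__ == "__main__":
    # Made-up example data: a 30-minute and a 90-minute incident.
    archive = [
        Incident(2, datetime(2014, 7, 1, 9, 0), datetime(2014, 7, 1, 9, 30)),
        Incident(1, datetime(2014, 7, 8, 14, 0), datetime(2014, 7, 8, 15, 30)),
    ]
    print(mean_time_to_resolution(archive))  # 1:00:00
    print(stress_score([4, 3, 2, 1]))        # 10
```

Keeping detection and resolution as plain timestamps on each archived record means the mean can be recomputed later over any subset, for example per severity level.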