Human Factors and PostMortems

Our daily work takes place in a myriad of systems made up of software, hardware, and humans. And everybody who has worked with complex systems at any scale knows: failure is not an option, it's inevitable.

At Etsy we embrace the fact that failures happen, and that the only way to understand how an accident happened is to investigate it without blaming the humans involved. This is why we hold a blameless postmortem for every outage. It is an open meeting: everybody is invited to join, find out what happened, and discuss how we can make the system safer.

This talk explains how postmortems at Etsy are conducted and how we maintain and scale the process as the team grows and new people join. It covers the tools we built and use to make postmortems efficient, and how we share what we learn from each one with everyone in the company.

Daniel Schauenberg

November 11, 2014

Transcript

  1. Human Factors &
    PostMortems
    Daniel Schauenberg
    [email protected]
    @mrtazz

  3. We deploy quite a lot

  4. MTTR

  5. MTTR
    >
    MTBF

  12. realtalk:
    things break

  13. New View

  14. Complex Socio-
    Technical Systems

  15. Knowledge and error flow
    from the same mental
    sources; only success can
    tell one from the other.
    — Ernst Mach, Erkenntnis und Irrtum (p. 116)

  16. Things made sense at
    the time

  17. People don't come to
    work to do a bad job

  18. Nietzschean
    Anxiety

  19. So I always get off the
    hook whatever I do?

  20. There is a difference
    between explaining and
    excusing human
    performance.
    — Sidney Dekker, The Field Guide to Understanding
    Human Error (p. 196)

  21. Blameless
    Postmortems

  22. Open Meeting

  23. Everybody is Invited

  24. What
    happened?

  25. Timeline

  26. Describe the past
    Don't excuse it away

  27. The Facilitator

  28. Guide the Discussion

  29. Look out for
    indicators of Old View
    thinking

  30. Counterfactuals

  31. - she should have
    - if he would have
    - if they just had
    - you failed to

  32. Biases

  33. Hindsight Bias
    Confirmation Bias
    Outcome Bias

  34. there are many
    more

  35. Who is in
    charge?

  36. Etsy School

  37. Taught Facilitator
    Course

  38. 3 x 90 minutes

  39. Remediation Items

  40. incorporate learning
    and takeaway from
    the meeting

  42. turn surprises into
    known factors

  43. MORGUE

  51. https://github.com/etsy/morgue

  52. Near Miss

  53. Pre Mortem

  54. "Hey all, I just ran rm -rf $DIR/
    and since the variable was empty
    I deleted my whole VM. This
    would have been bad in
    production. Don't do that."

  55. Architecture
    Reviews

  56. Operability
    Reviews

  57. It is also worth pointing
    out that the bias towards
    investigating failures
    rather than success itself
    represents a trade-off.
    — Erik Hollnagel, The ETTO Principle: Efficiency-
    Thoroughness Trade-Off

  58. Investigate
    Success

  59. Why did it work?

  60. Human Error is where
    you stopped looking

  61. peakscale.com/postmortems

  62. codeascraft.com
    etsy.com/codeascraft/talks
    etsy.com/careers

  63. Questions?

  64. Human Factors &
    PostMortems
    Daniel Schauenberg
    [email protected]
    @mrtazz
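
The near miss quoted on slide 54 comes down to an empty shell variable in front of a trailing slash. As a minimal sketch of one way to guard against it, assuming a POSIX shell, the function below refuses to run when its argument is empty. The name `remove_dir` is hypothetical and not from the talk.

```shell
#!/bin/sh
# Hypothetical helper, not from the talk: remove a directory named by a variable.
remove_dir() {
    dir="$1"
    # Refuse to run when the argument is empty, the situation from slide 54.
    if [ -z "$dir" ]; then
        echo "remove_dir: refusing to run with an empty directory argument" >&2
        return 1
    fi
    # ${dir:?} is a second guard: the expansion fails with an error if dir
    # is unset or empty, so this line cannot expand to a bare "/".
    # Quoting keeps paths with spaces intact; -- stops option parsing.
    rm -rf -- "${dir:?}/"
}
```

With this in place, `remove_dir "$DIR"` returns a non-zero status and prints an error when `DIR` is empty, instead of passing `/` to `rm`.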
