Human Factors and PostMortems

Our daily work takes place in a myriad of systems. They comprise software, hardware, and humans. And everybody who has worked with complex systems at any scale knows: failure is not an option, it's inevitable.

At Etsy we embrace the fact that failures happen, and that the only way to understand how an accident happened is to investigate it without blaming the humans involved. This is why we hold a blameless postmortem for every outage. It is an open meeting: everybody is invited to join, find out what happened, and learn how we can make the system safer.

This talk explains how postmortems at Etsy are conducted and how we maintain and scale the process as the team grows and new people join. It covers the tools we built and use to make postmortems efficient, and how we share the learnings from each one with everyone in the company.

Daniel Schauenberg

November 11, 2014

Transcript

  1. Human Factors & PostMortems Daniel Schauenberg d@etsy.com @mrtazz

  2. None
  3. We deploy quite a lot

  4. MTTR

  5. MTTR > MTBF

  6. None
  7. None
  8. None
  9. None
  10. None
  11. None
  12. realtalk: things break

  13. New View

  14. Complex Socio-Technical Systems

  15. Knowledge and error flow from the same mental sources; only success can tell the one from the other. — Ernst Mach, Erkenntnis und Irrtum (p. 116)
  16. Things made sense at the time

  17. People don't come to work to do a bad job

  18. Nietzschean Anxiety

  19. So I always get off the hook whatever I do?

  20. There is a difference between explaining and excusing human performance. — Sidney Dekker, The Field Guide to Understanding Human Error (p. 196)
  21. Blameless Postmortems

  22. Open Meeting

  23. Everybody is Invited

  24. What happened?

  25. Timeline

  26. Describe the past. Don't excuse it away.

  27. The Facilitator

  28. Guide the Discussion

  29. Look out for indicators of Old View thinking

  30. Counterfactuals

  31. "she should have", "if he would have", "if they just had", "you failed to"
  32. Biases

  33. Hindsight Bias, Confirmation Bias, Outcome Bias

  34. there are many more

  35. Who is in charge?

  36. Etsy School

  37. Taught Facilitator Course

  38. 3 x 90 minutes

  39. Remediation Items

  40. incorporate learning and takeaway from the meeting

  41. None
  42. turn surprises into known factors

  43. MORGUE

  44. None
  45. None
  46. None
  47. None
  48. None
  49. None
  50. None
  51. https://github.com/etsy/morgue

  52. Near Miss

  53. Pre Mortem

  54. "Hey all, I just ran rm -rf $DIR/ and since the variable was empty I deleted my whole VM. This would have been bad in production. Don't do that."
  55. Architecture Reviews

  56. Operability Reviews

  57. It is also worth pointing out that the bias towards investigating failures rather than success itself represents a trade-off. — Erik Hollnagel, The ETTO Principle: Efficiency-Thoroughness Trade-Off
  58. Investigate Success

  59. Why did it work?

  60. Human Error is where you stopped looking

  61. peakscale.com/postmortems

  62. codeascraft.com, etsy.com/codeascraft/talks, etsy.com/careers

  63. Questions?

  64. Human Factors & PostMortems Daniel Schauenberg d@etsy.com @mrtazz
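
The near miss quoted on slide 54 (rm -rf $DIR/ with an empty variable) is the kind of surprise that can be turned into a known factor with a small guard. The following is only a sketch of one possible guard, not something shown in the talk: the function name clean_dir and its error messages are hypothetical, and it assumes a POSIX shell.

```shell
#!/bin/sh
# clean_dir is a hypothetical helper, not part of the talk or of any Etsy tool.
# It removes a directory only when a non-empty path other than / was given.
clean_dir() {
    dir="${1:-}"

    # An empty value would turn "rm -rf $DIR/" into "rm -rf /".
    if [ -z "$dir" ]; then
        echo "clean_dir: refusing to run with an empty directory" >&2
        return 1
    fi

    # Refuse the root directory explicitly as well.
    if [ "$dir" = "/" ]; then
        echo "clean_dir: refusing to remove /" >&2
        return 1
    fi

    # Quote the variable and end option parsing with -- before the path.
    rm -rf -- "$dir/"
}
```

For a one-off command, the POSIX parameter expansion ${DIR:?} offers a similar safeguard: rm -rf "${DIR:?}/" makes a non-interactive shell print an error and exit instead of running rm when DIR is unset or empty.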