Human Factors and PostMortems

Our daily work takes place in a myriad of systems. They are composed of software, hardware and humans. And everybody who has worked with complex systems at any scale knows: failure is not an option, it's inevitable.

At Etsy we embrace the fact that failures happen and that the only way to understand how an accident happened is to investigate it without blaming the humans involved. This is why we hold a blameless postmortem for every outage that occurs. It is an open meeting, and everybody is invited to join and find out what happened and how we can make the system safer.

This talk will explain how postmortems at Etsy are conducted and how we maintain and scale the process as the team grows and new people start. It will go over the tools we built and use to make postmortems efficient and to share the learnings from each one with everyone in the company.

Daniel Schauenberg

November 11, 2014

Transcript

  1. Knowledge and error flow from the same mental sources; only success can tell the one from the other. — Ernst Mach, Erkenntnis und Irrtum (p. 116)
  2. There is a difference between explaining and excusing human performance. — Sidney Dekker, The Field Guide to Understanding Human Error (p. 196)
  3. - she should have - if he would have - if they just had - you failed to
  4. "Hey all, I just ran rm -rf $DIR/ and since the variable was empty I deleted my whole VM. This would have been bad in production. Don't do that."
  5. It is also worth pointing out that the bias towards investigating failures rather than success itself represents a trade-off. — Erik Hollnagel, The ETTO Principle: Efficiency-Thoroughness Trade-Off
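
The shell mistake quoted in slide 4 (an empty `$DIR` turning `rm -rf $DIR/` into `rm -rf /`) can be made harder to repeat with a small guard. The following is a minimal sketch and is not taken from the talk: `safe_rm_rf` is a hypothetical helper name, and the check covers only the empty-string and `/` cases.

```shell
#!/bin/sh
# Hypothetical helper (not from the talk): refuse to delete when the
# target is empty or the filesystem root.
safe_rm_rf() {
    target="$1"
    # An empty argument is the failure mode described in the quoted email.
    if [ -z "$target" ] || [ "$target" = "/" ]; then
        echo "refusing to remove: empty or root path" >&2
        return 1
    fi
    # "--" stops option parsing, so a path starting with "-" is not
    # mistaken for a flag.
    rm -rf -- "$target"
}
```

A call such as `safe_rm_rf "$DIR"` then fails loudly instead of deleting everything when `DIR` is empty. POSIX shells also offer the `${DIR:?}` expansion, which aborts a non-interactive script with an error when `DIR` is unset or empty, so `rm -rf "${DIR:?}/"` is a shorter way to get a similar effect.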