DjangoCon 2019 - The Blameless Post Mortem: How Embracing Failure Makes Us Better

Christopher Wilcox

September 24, 2019

Transcript

  1. @chriswilcox47 https://chriswilcox.dev The blameless post mortem: How embracing failure makes us better.

  2. Software is pretty amazing

  3. And so are its bugs...

  4. What if we could learn from the bugs?

  5. Hi! I’m Chris. crwilcox @chriswilcox47 https://crwilcox.com https://chriswilcox.racing https://speakerdeck.com/crwilcox

  6. https://www.history.navy.mil

  7. A look at other industries.

  8. Sometimes decisions are actually, not in hyperbole, life and death.

  9. Every “Mistake” is an opportunity to better the system.

  10. The NTSB. National Transportation Safety Board

  11. The NTSB has no formal prosecutorial authority.

  12. And has provided 13,000+ safety recommendations.

  15. https://www.ntsb.gov/investigations/AccidentReports/Reports/MAB1917.pdf

  16. Industry Process: M&M Conferences https://www.marsfoodservices.com

  17. Morbidity & Mortality conferences
      - Provide medical education.
      - Improve quality of care.
      - Highlight cases with diagnostic uncertainty, complex management, etc.
      https://www.imperial.ac.uk

  18. Does it matter who is present?

  19. Adverse Event Procedure Patient Environment Medication

  20. What can we, as technologists, learn from these other industries?

  21. How might we conduct a post mortem?

  22. The structure of a post mortem.
      - Why did this happen?
      - Could it have been worse? Better?
      - How can we make sure it doesn’t happen again?

  23. An example incident.
      - Summary.
      - Root Cause/Trigger.
      - User Impact.
      - Detection.
      - Resolution.
      - Duration/Timeline.
      - What Went Well?
      - What Went Poorly?
      - Where We Got Lucky?
      - Action Items.
      Summary: The documentation for the Google Cloud Python libraries was unavailable for users.

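The ten sections listed on this slide could be captured as a reusable template. The following is a minimal Python sketch, not something shown in the talk: the `PostMortem` class, its method names, and the report title are hypothetical, and only the section names and the summary text come from the slide.

```python
from dataclasses import dataclass, field

# Section names copied from the slide, in slide order.
SECTIONS = [
    "Summary",
    "Root Cause/Trigger",
    "User Impact",
    "Detection",
    "Resolution",
    "Duration/Timeline",
    "What Went Well?",
    "What Went Poorly?",
    "Where We Got Lucky?",
    "Action Items",
]


@dataclass
class PostMortem:
    """Hypothetical container mapping each slide section to free-form text."""
    title: str
    entries: dict = field(default_factory=dict)

    def missing_sections(self):
        """Return the sections that have not been filled in yet."""
        return [name for name in SECTIONS if not self.entries.get(name, "").strip()]

    def render(self):
        """Render a plain-text report, marking unfilled sections as TODO."""
        lines = [self.title, ""]
        for name in SECTIONS:
            lines.append(name)
            lines.append("    " + (self.entries.get(name, "").strip() or "TODO"))
        return "\n".join(lines)


# The title is a placeholder; the summary text is quoted from the slide.
report = PostMortem(
    title="Reference documentation outage",
    entries={
        "Summary": "The documentation for the Google Cloud Python libraries "
                   "was unavailable for users.",
    },
)
print(report.render())
print("Still to write:", report.missing_sections())
```

Keeping the section list in one place means every write-up asks the same questions, including the easily skipped "Where We Got Lucky?".
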
  24. An example incident. Root Cause/Trigger: As part of repository cleanup, an engineer with write access to the development repository deleted the gh-pages branch.

  25. An example incident. User Impact: During a two-hour period, hosted reference documentation for our libraries was unavailable.

  26. An example incident. Detection: This was first detected via an external customer. Shortly after, two internal teams also noticed and notified us. The initial report came 30 minutes after the start of the outage.

  27. An example incident. Resolution: The documentation was available after republishing to gh-pages.

  28. An example incident. Duration/Timeline: 2 hours. 2019-12-05 (all times PDT)
      09:45 Branch gh-pages deleted during cleanup of repository branches.
      10:21 GitHub issue filed stating that docs are not available.
      10:45 Team responds to issue.
      10:55 Branch has been republished.
      11:45 GitHub is serving docs again.

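To see where the two hours went, the timeline can be broken into the intervals between consecutive events. This is a small Python sketch under the assumption that all timestamps fall on the same day in the same time zone, as the slide states. The function names and the shortened event labels are illustrative; the times themselves are from the slide.

```python
from datetime import datetime

# Timestamps copied from the timeline slide (same day, same time zone).
TIMELINE = [
    ("09:45", "Branch gh-pages deleted"),
    ("10:21", "GitHub issue filed"),
    ("10:45", "Team responds to issue"),
    ("10:55", "Branch republished"),
    ("11:45", "GitHub is serving docs again"),
]


def minutes_between(start, end):
    """Minutes from one HH:MM timestamp to a later one on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def phase_durations(timeline):
    """Return (from_event, to_event, minutes) for each consecutive pair."""
    return [
        (a_label, b_label, minutes_between(a_time, b_time))
        for (a_time, a_label), (b_time, b_label) in zip(timeline, timeline[1:])
    ]


phases = phase_durations(TIMELINE)
for start_label, end_label, minutes in phases:
    print(f"{minutes:3d} min  {start_label} -> {end_label}")

total = minutes_between(TIMELINE[0][0], TIMELINE[-1][0])
print(f"{total:3d} min  total outage")
```

With the slide's timestamps the intervals are 36, 24, 10, and 50 minutes, 120 minutes in total. The largest single interval is the wait between republishing the branch and GitHub serving the docs again.
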
  29. An example incident. What Went Well? We were notified fairly quickly, and the team monitors GitHub issues well enough that we knew about it pretty fast.

  30. An example incident. What Went Poorly? The branch should have been protected from deletion. GitHub Pages publishing is a bit opaque. While the branch was pushed within about 5 minutes of finding out, it took an additional hour or so to get it actually serving the files. The lack of debugging information made this more difficult.

  31. An example incident. Where We Got Lucky? A member of the team had a local copy of the branch handy and was able to republish.

  32. An example incident. Action Items:
      - Protect gh-pages branch. Don’t allow deletion.
      - Investigate if debugging gh-pages could be improved.
      - Investigate if there are other technologies we ought to be using instead of gh-pages.

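The first action item, protecting gh-pages from deletion, can be done in the repository settings or scripted. Below is a hedged Python sketch that builds, but does not send, a request to GitHub's REST endpoint for updating branch protection. The slides do not say how the team applied the protection, so this is only one possible approach. The owner, repository, and token values are placeholders, `enforce_admins=True` is an assumption on my part, and the field names should be checked against the current GitHub documentation before use.

```python
import json
import urllib.request


def build_protection_request(owner, repo, branch, token):
    """Build (but do not send) a request that forbids deleting a branch."""
    payload = {
        # The endpoint requires these four keys; null disables that rule.
        "required_status_checks": None,
        "enforce_admins": True,  # assumption: apply the rule to admins too
        "required_pull_request_reviews": None,
        "restrictions": None,
        # The setting that addresses this incident.
        "allow_deletions": False,
    }
    url = f"https://api.github.com/repos/{owner}/{repo}/branches/{branch}/protection"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


# "example-org", "example-repo", and "TOKEN" are placeholders.
request = build_protection_request("example-org", "example-repo", "gh-pages", "TOKEN")
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the sketch safe to run and makes the payload easy to review before it touches a real repository.
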
  33. We fail so we can learn.

  34. Blame isn’t a deterrent.

  35. Excusing failure as human error stops the conversation.

  36. Why we don’t blame; an example.

  38. Building a blameless post mortem culture.

  39. Create psychological safety.

  41. Find positive examples. https://landing.google.com/sre/books

  42. Outages happen and no one wanted them to.

  43. A Test for Blamelessness
      1. A script was executed that deleted the production database.
      2. An engineer executed a script that deleted the production database without confirmation.
      3. Chris executed a script that deleted the production database without confirmation.

  44. “But Chris, we never have outages at my work, we are perfect, in every way” -Someone, Probably.

  45. “If you've never missed a flight, you're probably spending too much time in airports” -George Stigler

  46. Raise your own bar. Break stuff.

  47. Add some 9’s to that SLA!
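For a sense of what adding a 9 means in practice, here is a short Python sketch that converts a number of nines into a yearly downtime budget. It assumes a 365-day year and availability measured over the whole year. The function name is illustrative and nothing here comes from the talk beyond the idea of counting nines.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # non-leap year


def downtime_budget_minutes(nines):
    """Allowed downtime per year for an availability of N nines (2 -> 99%)."""
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


for nines in range(2, 6):
    availability = 100 * (1 - 10 ** -nines)
    budget = downtime_budget_minutes(nines)
    print(f"{nines} nines ({availability:.3f}%): {budget:8.1f} min/year")
```

Three nines allows about 525.6 minutes of downtime per year. A 120-minute outage like the earlier example would use a bit under a quarter of that budget, if it counted against an SLA, which the slides do not say.
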

  48. Thanks! @chriswilcox47 https://chriswilcox.dev

  49. References.
      https://www.ntsb.gov/investigations/AccidentReports/Reports/MAB1917.pdf
      https://acphospitalist.org/archives/2018/07/the-new-improved-m-and-m.htm
      https://insights.ovid.com/pubmed?pmid=26983075&_ga=2.103365704.971425802.1563899461-1002013713.1563899459
      https://www.ama-assn.org/residents-students/residency/presenting-your-first-mm-conference-5-things-you-need-know
      https://landing.google.com/sre/books
      https://cloud.google.com/blog/products/gcp/fearless-shared-postmortems-cre-life-lessons
      https://rework.withgoogle.com/blog/five-keys-to-a-successful-google-team/