Blameless Post-mortems (And why the blameless bit really matters)

sophieweston
November 26, 2021

Many of us are familiar with the practice of conducting blameless post-mortems after production outages, but are we really making the most of these opportunities to learn, or has it just become something we do out of habit?

In this talk, Sophie looks again at the reasons for holding post-mortems, explores some best practices and what goes into making a successful post-mortem, and explains why the blameless bit really matters.

Transcript

  1. Blameless Post-mortems (And why the blameless bit really matters)

  2. About me...

    - Worked in tech for over 25 years, mostly as a Software Engineer
    - I’m a bit obsessed with DevOps
    - Ambassador for WiT York and co-organiser of DevOpsDays London
    - Find me on Twitter (usually tweeting about DevOps or Agile) at @srwestons

  3. Outline

    - What is a blameless post-mortem?
    - Why have post-mortems?
    - How to run a good post-mortem
    - Summary

  4. What is a blameless post-mortem?

  5. Post-mortem?!

  6. Post-mortems are also known as...

    - Retrospective
    - Post Incident Review
    - Root Cause Analysis
    - Debriefing

  7. An incident post-mortem brings the team together to figure out:

    - what happened
    - why it happened
    - how the team responded
    - how to prevent repeat incidents
    - how to improve future responses

  8. Being blameless means no finger-pointing. Photo by Adi Goldstein on unsplash

  9. Why have post-mortems? (And what’s the deal with the blameless bit?)

  10. “A blamelessly written post-mortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. If a culture of finger pointing and shaming individuals or teams for doing the "wrong" thing prevails, people will not bring issues to light for fear of punishment.” Google SRE Handbook

  11. We don’t throw anyone under the bus... Photo by Umanoide on unsplash

  12. Not even ‘the intern’

  13. We don’t want to make people feel bad. We don’t want to make people afraid of failing, and of trying new ways of doing things. Photo by Hello I’m Nik on Unsplash

  14. It’s too easy to blame failure on ‘human error’. Photo by Adri Ramdeane on unsplash

  15. “A blamelessly written post-mortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. If a culture of finger pointing and shaming individuals or teams for doing the "wrong" thing prevails, people will not bring issues to light for fear of punishment.” Google SRE Handbook

  16. Assume good intent. Photo by Di Maitland on unsplash

  17. Being blameless isn’t just about not saying blamey things. It’s about not thinking blamey things. Photo by Ryan Franco on unsplash

  18. If you think that someone is to blame for an incident, you’ll stop asking questions. You’ll stop being curious. You need to stay curious. Genuinely curious.

  19. It doesn’t matter who’s doing the blaming.

  20. “Incidents are unplanned investments where all the costs are paid up front.”

  21. Don’t view post-mortems as a chore. Photo by Priscilla du Preez on unsplash

  22. View them as an opportunity to learn!

  23. Incidents are messy. Photo by Amy Elting on unsplash

  24. None
  25. There is so much more that we can ask.

  26. Questions we can ask about the response:

    - What do people do during the response to incidents that they don’t make explicit?
    - What tricks or shortcuts do they use that others aren’t aware of?
    - Which tools proved to be useful?
    - Which tools proved to be distracting or unhelpful?
    - How do teams perceive which parts of their application or systems are too “risky” to touch? How did this view arise? How universal is that perspective amongst all teams?
    - Are there any sources of data about the system (logs, graphs, etc.) that people regularly dismiss or are suspicious of?

  27. Let’s not forget the ‘socio’ part of our socio-technical system. Photo by NeONBRAND on unsplash

  28. Questions we can ask about the people involved:

    - Who was on call?
    - How often are they on call?
    - How familiar are they with the system or application they're supporting?
    - Who do they escalate to?
    - Who's in the incident channels?
    - How many incident channels are there?
    - Is this incident part of a pattern?

  29. How to run good post-mortems

  30. Photo by Victoria Palacios on unsplash

  31. Find someone else to act as scribe and take notes. Photo by Aaron Burden on unsplash

  32. Schedule the post-mortem to take place quite soon after the incident. If you leave it too long, people will start to forget the details of what happened. Photo by Behnam Norouzi on unsplash

  33. None
  34. It’s all just work.

  35. Post-mortem meetings should be open to all.

  36. Write up the discussion and findings and publish them for anyone to read. Photo by Janko Ferlic on unsplash

  37. Hold post-mortems for near misses as well as for full-blown incidents. Photo by Adam Griffith on unsplash

  38. Ask all the questions. Photo by Matt Walsh on unsplash

  39. Don’t forget to retro your post-mortem process from time to time. Photo by Erik Eastman on unsplash

  40. Summary

    - Post-mortems are about improving our understanding of our systems, both technical and social.
    - Focus on learning, not actions.
    - Assume good intent and always stay curious!

  41. Thank you!

  42. References

    • Google SRE Handbook: https://sre.google/sre-book/postmortem-culture/
    • How Complex Systems Fail – Dr Richard Cook: https://how.complexsystems.fail/ and https://www.youtube.com/watch?v=2S0k12uZR14
    • People are the adaptable element of complex systems – John Allspaw: https://2019.leanagile.scot/programme/people-are-the-adaptable-element-of-complex-systems
    • Incident Analysis – Your Organization’s Secret Weapon – Nora Jones: https://videolibrary.doesvirtual.com/?video=551641823
    • Salesforce DNS outage – The Register: https://www.theregister.com/2021/05/19/salesforce_root_cause/