sophieweston
November 26, 2021

Blameless Post-mortems (And why the blameless bit really matters)

Many of us are familiar with the practice of conducting blameless post-mortems after production outages, but are we really making the most of these opportunities to learn, or has the practice just become something we do out of habit?

In this talk, Sophie looks again at the reasons for holding post-mortems, explores some best practices and what goes into making a successful post-mortem, and explains why the blameless bit really matters.

Transcript

  1. About me... Worked in tech for over 25 years, mostly as a Software Engineer. I’m a bit obsessed with DevOps. Ambassador for WiT York and co-organiser of DevOpsDays London. Find me on Twitter (usually tweeting about DevOps or Agile) at @srwestons
  2. An incident post-mortem brings the team together to figure out:
    - what happened
    - why it happened
    - how the team responded
    - how to prevent repeat incidents
    - how to improve future responses
  3. “A blamelessly written post-mortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. If a culture of finger pointing and shaming individuals or teams for doing the "wrong" thing prevails, people will not bring issues to light for fear of punishment.” Google SRE Handbook
  4. We don’t want to make people feel bad. We don’t want to make people afraid of failing, and of trying new ways of doing things. Photo by Hello I’m Nik on Unsplash
  5. “A blamelessly written post-mortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. If a culture of finger pointing and shaming individuals or teams for doing the "wrong" thing prevails, people will not bring issues to light for fear of punishment.” Google SRE Handbook
  6. Being blameless isn’t just about not saying blamey things. It’s about not thinking blamey things. Photo by Ryan Franco on Unsplash
  7. If you think that someone is to blame for an incident, you’ll stop asking questions. You’ll stop being curious. You need to stay curious. Genuinely curious.
  8. What do people do during the response to incidents that they don’t make explicit? What tricks or shortcuts do they use that others aren’t aware of? Which tools proved to be useful? Which tools proved to be distracting or unhelpful? How do teams perceive which parts of their application or systems are too “risky” to touch? How did this view arise? How universal is that perspective amongst all teams? Are there any sources of data about the system (logs, graphs, etc.) that people regularly dismiss or are suspicious of?
  9. Who was on call? How often are they on call? How familiar are they with the system or application they're supporting? Who do they escalate to? Who's in the incident channels? How many incident channels are there? Is this incident part of a pattern?
  10. Find someone else to act as scribe and take notes. Photo by Aaron Burden on Unsplash
  11. Schedule the post-mortem to take place quite soon after the incident. If you leave it too long, people will start to forget the details of what happened. Photo by Behnam Norouzi on Unsplash
  12. Write up the discussion and findings and publish it for anyone to read. Photo by Janko Ferlic on Unsplash
  13. Hold post-mortems for near misses as well as for full-blown incidents. Photo by Adam Griffith on Unsplash
  14. Don’t forget to retro your post-mortem process from time to time. Photo by Erik Eastman on Unsplash
  15. Summary: Post-mortems are about improving our understanding of our systems, both technical and social. Focus on learning, not actions. Assume good intent and always stay curious!
  16. References
    - Google SRE Handbook: https://sre.google/sre-book/postmortem-culture/
    - How Complex Systems Fail – Dr Richard Cook: https://how.complexsystems.fail/ and https://www.youtube.com/watch?v=2S0k12uZR14
    - People are the adaptable element of complex systems – John Allspaw: https://2019.leanagile.scot/programme/people-are-the-adaptable-element-of-complex-systems
    - Incident Analysis – Your Organization’s Secret Weapon – Nora Jones: https://videolibrary.doesvirtual.com/?video=551641823
    - Salesforce DNS outage – The Register: https://www.theregister.com/2021/05/19/salesforce_root_cause/