Blameless Post-mortems (And why the blameless bit really matters)

About me...
- Worked in tech for over 25 years, mostly as a Software Engineer
- I’m a bit obsessed with DevOps
- Ambassador for WiT York and co-organiser of DevOpsDays London
- Find me on Twitter (usually tweeting about DevOps or Agile) at @srwestons

Outline
- What is a blameless post-mortem?
- Why have post-mortems?
- How to run a good post-mortem
- Summary

What is a blameless post-mortem?

Post-mortem?!

Post-mortems are also known as...
- Retrospective
- Post Incident Review
- Root Cause Analysis
- Debriefing

An incident post-mortem brings the team together to figure out:
- what happened
- why it happened
- how the team responded
- how to prevent repeat incidents
- how to improve future responses

Being blameless means no finger-pointing. Photo by Adi Goldstein on unsplash

Why have post-mortems? (And what’s the deal with the blameless bit?)

“A blamelessly written post-mortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. If a culture of finger pointing and shaming individuals or teams for doing the "wrong" thing prevails, people will not bring issues to light for fear of punishment.”
Google SRE Handbook

We don’t throw anyone under the bus... Photo by Umanoide on unsplash

Not even ‘the intern’

We don’t want to make people feel bad. We don’t want to make people afraid of failing, and of trying new ways of doing things. Photo by Hello I’m Nik on Unsplash

It’s too easy to blame failure on ‘human error’. Photo by Adri Ramdeane on unsplash

“A blamelessly written post-mortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. If a culture of finger pointing and shaming individuals or teams for doing the "wrong" thing prevails, people will not bring issues to light for fear of punishment.”
Google SRE Handbook

Assume good intent. Photo by Di Maitland on unsplash

Being blameless isn’t just about not saying blamey things. It’s about not thinking blamey things. Photo by Ryan Franco on unsplash

If you think that someone is to blame for an incident, you’ll stop asking questions. You’ll stop being curious. You need to stay curious. Genuinely curious.

It doesn’t matter who’s doing the blaming.

“Incidents are unplanned investments where all the costs are paid up front.”

Don’t view post-mortems as a chore. Photo by Priscilla du Preez on unsplash

View them as an opportunity to learn!

Incidents are messy. Photo by Amy Elting on unsplash

There is so much more that we can ask.

- What do people do during the response to incidents that they don’t make explicit?
- What tricks or shortcuts do they use that others aren’t aware of?
- Which tools proved to be useful?
- Which tools proved to be distracting or unhelpful?
- How do teams perceive which parts of their application or systems are too “risky” to touch?
- How did this view arise?
- How universal is that perspective amongst all teams?
- Are there any sources of data about the system (logs, graphs, etc.) that people regularly dismiss or are suspicious of?

Let’s not forget the ‘socio’ part of our socio-technical system. Photo by NeONBRAND on unsplash

- Who was on call?
- How often are they on call?
- How familiar are they with the system or application they're supporting?
- Who do they escalate to?
- Who's in the incident channels?
- How many incident channels are there?
- Is this incident part of a pattern?

How to run good post-mortems

Photo by Victoria Palacios on unsplash

Find someone else to act as scribe and take notes. Photo by Aaron Burden on unsplash

Schedule the post-mortem to take place soon after the incident. If you leave it too long, people will start to forget the details of what happened. Photo by Behnam Norouzi on unsplash

It’s all just work.

Post-mortem meetings should be open to all.

Write up the discussion and findings and publish them for anyone to read. Photo by Janko Ferlic on unsplash

Hold post-mortems for near misses as well as for full-blown incidents. Photo by Adam Griffith on unsplash

Ask all the questions. Photo by Matt Walsh on unsplash

Don’t forget to retro your post-mortem process from time to time. Photo by Erik Eastman on unsplash

Summary
- Post-mortems are about improving our understanding of our systems, both technical and social.
- Focus on learning, not actions.
- Assume good intent and always stay curious!

Thank you!

References
- Google SRE Handbook
  - https://sre.google/sre-book/postmortem-culture/
- How Complex Systems Fail – Dr Richard Cook
  - https://how.complexsystems.fail/
  - https://www.youtube.com/watch?v=2S0k12uZR14
- People are the adaptable element of complex systems – John Allspaw
  - https://2019.leanagile.scot/programme/people-are-the-adaptable-element-of-complexsystems
- Incident Analysis – Your Organization’s Secret Weapon – Nora Jones
  - https://videolibrary.doesvirtual.com/?video=551641823
- Salesforce DNS outage – The Register
  - https://www.theregister.com/2021/05/19/salesforce_root_cause/
