Slide 1

Slide 1 text

Incident Analysis: How Learning Is Different Than Fixing
John Allspaw
Adaptive Capacity Labs

Slide 2

Slide 2 text

The Main Gist
• Current and typical approaches to “learning from incidents” have very little to do with actual learning.
• Learning is not the same as fixing.
• Most post-incident review documents are written to be filed, not written to be read.
• Changing the primary focus from fixing to learning will result in a significant competitive advantage.

Slide 3

Slide 3 text

When you focus on fixing things, they tend to get fixed quickly, with the materials and methods at hand. The quality and effectiveness of the fixing — and preventing future issues — is proportional to how well you understand how the thing works.

Slide 4

Slide 4 text

Attention paid to learning will always yield higher-quality fixing.
Focusing exclusively on fixing is a barrier to learning.

Slide 5

Slide 5 text

Conventional View on Where Post-Incident Analysis Has Value
• incident happens…
• (maybe) someone preps a timeline…
• a postmortem meeting, where (maybe) you fill out some template…
• (maybe) someone compiles a report…
• ACTION ITEMS FINALLY!
Callouts on the diagram: “Tendency is to think the only/greatest value is here”; “very expensive meeting”

Slide 6

Slide 6 text

On “learning”
• Learning is happening all of the time; it’s a core part of being human.
• What is learned, who learns, when they learn, and how they learn depend on how well practices are set up to support it.
• No one can understand everything about everything. We should be surprised that things work as well as they do, given this!
• Frequency of incidents has nothing to do with how well an organization learns! (see first bullet) It may be a signal about what they’re learning.

Slide 7

Slide 7 text

Conventional Myth
A canonical set of “lessons” can be extracted from an incident, which is then “shared”* with a group.
The perceived problem to be solved, then, is to somehow “share” better.
* We tend to say “share” when typically we mean “make available to others.”

Reality
Different people will have varying understandings before and after an incident, and what mysteries remain for them cannot be captured or addressed in a “one size fits all” package.
What is important/notable/interesting will differ from person to person.
what they actually remember!

Slide 8

Slide 8 text

if you can’t remember something, you can’t say you’ve learned it

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

When we ask people about incidents:
• they become animated when they tell the story
• they include elements of suspense in the structure of how they tell it
• they include elements of surprise (“what we didn’t know at the time was…”)
• they set some context (“now remember, this is the day we did our IPO…”)
• they recall it in detail even if it has been many years since

Slide 11

Slide 11 text

stories that you remember have elements of challenge, struggle, and difficulty.

Slide 12

Slide 12 text

Interesting incident analysis documents get read.
Compelling incident analysis documents get read and shared with others.
Uninteresting documents…don’t.
Fascinating documents get read, shared with others, commented on, asked about, referenced in code comments, …in pull requests, …in architecture diagrams, …in other incident writeups, …in new-hire onboarding, …
Image caption: This film was made available in thousands of theaters around the world.

Slide 13

Slide 13 text

Make Effort to Highlight The Messy Details
• What was difficult for people to understand during the incident?
• What was surprising for people about the incident?
• How do people understand the origins of the incident?
• What mysteries still remain for people?

Slide 14

Slide 14 text

I want to know what was difficult about this, and I want to be able to ask questions about that.

Slide 15

Slide 15 text

Flip from “severity” to difficulty
• “Customer impact” is not equivalent to the difficulty of solving the issue.
• Multiple difficulties can exist in the same incident.
• Fielding questions about what was, or still is, difficult is how critical understandings are spread and how lasting memories are formed.

Slide 16

Slide 16 text

“this is how it all works” “this is how it all works” “this is how it all works” “this is how it all works” “this is how it all works”

Slide 19

Slide 19 text

“oh…I thought it did X nightly, not weekly…”
“wait - I thought everyone knew that Y was an issue… only I knew that?”
“I knew about N but didn’t know how it got to be that way…”
“ok, got it - A feeds B, but C also feeds B…”
“I didn’t know M could break silently like that…”

Slide 20

Slide 20 text

Fine Goals For A Post-Incident Meeting
…when participants in a post-incident group meeting leave the meeting knowing:
a. new things they didn’t know when they entered the meeting
b. new things they didn’t know about what their colleagues know
c. how to continue discussions and where to capture it

Slide 21

Slide 21 text

Fine Goals For A Post-Incident Review Writeup
…when readers of a post-incident review writeup are finished reading, they know:
a. new things they didn’t know when they started reading
b. new things they didn’t know about what their colleagues know
c. how to continue discussions and where to capture it

Slide 22

Slide 22 text

Some things to experiment with

Slide 23

Slide 23 text

• Separate the generation of “follow-up” items from the group incident review meeting
• Record in the document who responded to the incident, and who attended the group meeting
• Capture in the document things that were done after the incident but before the group meeting
• Give write-ups to brand new engineers and ask them to record any and all questions they have after reading them
• Link company-specific jargon/terms to documents that describe them
• Ask more people to draw diagrams in debriefings and include them in the writeup

Slide 24

Slide 24 text

“If the incident analyst participated in the incident, they will inevitably have a deeper understanding and bias towards the incident that will be impossible to remove in the process of analysis.”
Ryan Kitchens, https://www.learningfromincidents.io/blog/

Have someone who was not involved in the event lead the analysis.

Slide 25

Slide 25 text

“The cliche idea that we would do this work to reduce the number of incidents or to lessen the time to remediate is too simplistic. Of course organizations want to have fewer incidents, however stating this as an end goal actually hurts our organizations. Indeed, it will lead to a reduction in incident count–not from actually reducing the number of incidents, but rather lessening how and how often they are reported.”
Ryan Kitchens, https://www.learningfromincidents.io/blog/

Resist focusing on reducing the number of incidents. Focus instead on increasing the number of people who want to read reports and attend the PIR meetings.

Slide 26

Slide 26 text

Thank You!