Incident Analysis: How *Learning* is Different Than *Fixing*

I presented this at the excellent AllTheTalks Conference during the COVID-19 pandemic. :/

What does it mean to actually "learn" from an incident?

This talk will describe what we can do differently in the industry on this front, based on foundational methods from Cognitive Systems Engineering, Human Factors, and Resilience Engineering.

John Allspaw

April 15, 2020

Transcript

  1. The Main Gist
    • Current and typical approaches to “learning from incidents” have very little to do with actual learning.
    • Learning is not the same as fixing.
    • Most post-incident review documents are written to be filed, not written to be read.
    • Changing the primary focus from fixing to learning will result in a significant competitive advantage.
  2. When you focus on fixing things, they tend to get fixed quickly, with the materials and methods at hand. The quality and effectiveness of the fixing (and of preventing future issues) is proportional to how well you understand how the thing works.
  3. Attention paid to learning will always yield higher-quality fixing.
    Focusing exclusively on fixing is a barrier to learning.
  4. Conventional View on Where Post-Incident Analysis Has Value
    incident happens… → (maybe) someone preps a timeline… → a postmortem meeting, where (maybe) you fill out some template… → (maybe) someone compiles a report… → ACTION ITEMS FINALLY!
    Slide annotations: “Tendency is to think the only/greatest value is here”; “very expensive meeting”
  5. On “learning”
    • Learning is happening all of the time; it’s a core part of being human.
    • What is learned, who learns, when they learn, and how they learn depend on how well practices are set up to support it.
    • No one can understand everything about everything. We should be surprised that things work as well as they do, given this!
    • Frequency of incidents has nothing to do with how well an organization learns! (see first bullet) It may be a signal about what they’re learning.
  6. Conventional Myth: A canonical set of “lessons” can be extracted from an incident, which is then “shared” to a group. The perceived problem to be solved, then, is to somehow “share” better.
    * we tend to say “share” when typically we mean “make available to others”
    Reality: Different people will have varying understandings before and after an incident, and what mysteries remain for them cannot be captured or addressed in a “one size fits all” package. What is important/notable/interesting will differ from person to person.
    what they actually remember!
  7. When we ask people about incidents
    • they become animated when they tell the story
    • they include elements of suspense in the structure of how they tell it
    • they include elements of surprise (“what we didn’t know at the time was…”)
    • they set some context (“now remember, this is the day we did our IPO…”)
    • they recall it in detail even if it’s many years since
  8. Interesting incident analysis documents get read.
    Compelling incident analysis documents get read and shared with others.
    Uninteresting documents…don’t.
    Fascinating documents get read, shared with others, commented on, asked about, referenced in code comments, …in pull requests, …in architecture diagrams, …in other incident writeups, …in new-hire onboarding, …
    This film was made available in thousands of theaters around the world.
  9. Make Effort to Highlight The Messy Details
    • What was difficult for people to understand during the incident?
    • What was surprising for people about the incident?
    • How do people understand the origins of the incident?
    • What mysteries still remain for people?
  10. I want to know what was difficult about this, and I want to be able to ask questions about that.
  11. Flip from “severity” to difficulty
    • “Customer impact” is not equivalent to the difficulty of solving the issue.
    • Multiple difficulties can exist in the same incident.
    • Fielding questions about what was, or still is, difficult is how critical understandings are spread and how lasting memories are formed.
  12. “this is how it all works” “this is how it all works” “this is how it all works” “this is how it all works” “this is how it all works”
  13. “this is how it all works” “this is how it all works” “this is how it all works” “this is how it all works” “this is how it all works”
  14. “this is how it all works” “this is how it all works” “this is how it all works” “this is how it all works” “this is how it all works”
  15. “oh…I thought it did X nightly, not weekly…”
    “wait - I thought everyone knew that Y was an issue… only I knew that?”
    “I knew about N but didn’t know how it got to be that way…”
    “ok, got it - A feeds B, but C also feeds B…”
    “I didn’t know M could break silently like that…”
  16. Fine Goals For A Post-Incident Meeting
    …when participants in a post-incident group meeting leave the meeting knowing:
    a. new things they didn’t know when they entered the meeting
    b. new things they didn’t know about what their colleagues know
    c. how to continue discussions and where to capture them
  17. Fine Goals For A Post-Incident Review Writeup
    …when readers of a post-incident review writeup are finished reading, they know:
    a. new things they didn’t know when they started reading
    b. new things they didn’t know about what their colleagues know
    c. how to continue discussions and where to capture them
  18. • Separate the generation of “follow-up” items from a group incident review meeting
    • Record in the document who responded to the incident, and who attended the group meeting
    • Capture things that were done after the incident but before the group meeting in the document
    • Give write-ups to brand new engineers and ask them to record any and all questions they have after reading them
    • Link company-specific jargon/terms to documents that describe them
    • Ask more people to draw diagrams in debriefings and include them in the writeup
  19. “If the incident analyst participated in the incident, they will inevitably have a deeper understanding and bias towards the incident that will be impossible to remove in the process of analysis.”
    Ryan Kitchens, https://www.learningfromincidents.io/blog/
    Have someone who was not involved in the event lead the analysis.
  20. “The cliche idea that we would do this work to reduce the number of incidents or to lessen the time to remediate is too simplistic. Of course organizations want to have fewer incidents, however stating this as an end goal actually hurts our organizations. Indeed, it will lead to a reduction in incident count–not from actually reducing the number of incidents, but rather lessening how and how often they are reported.”
    Ryan Kitchens, https://www.learningfromincidents.io/blog/
    Resist focusing on reducing the number of incidents. Focus instead on increasing the number of people who want to read reports and attend the PIR meetings.