Incident Analysis: How *Learning* is Different Than *Fixing*

I presented this at the excellent AllTheTalks Conference during the COVID-19 pandemic. :/

What does it mean to actually "learn" from an incident?

This talk will describe what we can do differently in the industry on this front, based on foundational methods from Cognitive Systems Engineering, Human Factors, and Resilience Engineering.

John Allspaw

April 15, 2020

Transcript

  1. Incident Analysis:
    How Learning Is Different Than Fixing
    John Allspaw
    Adaptive Capacity Labs

  2. The Main Gist
    • Current and typical approaches to “learning from incidents” have very little
    to do with actual learning.
    • Learning is not the same as fixing.
    • Most post-incident review documents are written to be filed, not written
    to be read.
    • Changing the primary focus from fixing to learning will result in a
    significant competitive advantage.

  3. When you focus on fixing things,
    they tend to get fixed quickly, with
    the materials and methods at hand.
    The quality and effectiveness of the fixing
    — and preventing future issues — is
    proportional to how well you understand
    how the thing works.

  4. Attention paid to learning will
    always yield higher-quality fixing.
    Focusing exclusively on fixing is a
    barrier to learning.

  5. Conventional View on Where Post-Incident Analysis Has Value
    incident happens… → (maybe) someone preps a timeline… →
    a postmortem meeting, where (maybe) you fill out some template… →
    (maybe) someone compiles a report… → ACTION ITEMS
    Callouts on the diagram: “very expensive meeting”; “FINALLY!”;
    “Tendency is to think the only/greatest value is here”

  6. On “learning”
    • Learning is happening all of the time; it’s a core part of being human.
    • What is learned, who learns, when they learn, and how they learn depend
    on how well practices are set up to support it.
    • No one can understand everything about everything. We should be
    surprised that things work as well as they do, given this!
    • Frequency of incidents has nothing to do with how well an organization
    learns! (see first bullet) It may be a signal about what they’re learning.

  7. Conventional Myth
    A canonical set of “lessons” can be extracted from an
    incident, which is then “shared”* with a group.
    The perceived problem to be solved, then, is to
    somehow “share” better.
    * we tend to say “share” when typically we mean “make available to others”
    Reality
    Different people will have varying understandings
    before and after an incident, and what mysteries
    remain for them cannot be captured or addressed in
    a “one size fits all” package.
    What is important/notable/interesting will differ from
    person to person.
    what they actually remember!

  8. if you can’t remember something,
    you can’t say you’ve learned it

  10. when we ask people about incidents
    • they become animated when they tell the story
    • they include elements of suspense in the structure of how they tell it
    • they include elements of surprise (“what we didn’t know at the time was…”)
    • they set some context (“now remember, this is the day we did our IPO…”)
    • they recall it in detail even if it happened many years ago

  11. stories that you remember have
    elements of challenge, struggle, and
    difficulty.

  12. Interesting incident analysis documents get read.
    Compelling incident analysis documents get read and shared with others.
    Uninteresting documents…don’t.
    Fascinating documents get read, shared with others,
    commented on, asked about, referenced in code comments,
    …in pull requests,
    …in architecture diagrams,
    …in other incident writeups,
    …in new-hire onboarding.
    This film was made available in
    thousands of theaters around the
    world.

  13. Make an Effort to Highlight the Messy Details
    • What was difficult for people to understand during the incident?
    • What was surprising for people about the incident?
    • How do people understand the origins of the incident?
    • What mysteries still remain for people?

  14. I want to know what was difficult
    about this, and I want to be able to
    ask questions about that.

  15. Flip from “severity” to difficulty
    • “Customer impact” is not equivalent to the difficulty of solving the issue.
    • Multiple difficulties can exist in the same incident.
    • Fielding questions about what was — or still is — difficult is how critical
    understandings are spread and how lasting memories are formed.

  16. “this is how it all works” (repeated five times on the slide)

  17. “this is how it all works” (repeated five times on the slide)

  18. “this is how it all works” (repeated five times on the slide)

  19. “oh…I thought it did X nightly, not weekly…”
    “wait - I thought everyone knew that Y was an issue… only I knew that?”
    “I knew about N but didn’t know how it got to be that way…”
    “ok, got it - A feeds B, but C also feeds B…”
    “I didn’t know M could break silently like that…”

  20. Fine Goals For A Post-Incident Meeting
    …when participants in a post-incident group meeting leave the meeting
    knowing:
    a. new things they didn’t know when they entered the meeting
    b. new things they didn’t know about what their colleagues know
    c. how to continue discussions and where to capture them

  21. Fine Goals For A Post-Incident Review Writeup
    …when readers of a post-incident review writeup are finished reading, they
    know:
    a. new things they didn’t know when they started reading
    b. new things they didn’t know about what their colleagues know
    c. how to continue discussions and where to capture them

  22. Some things to
    experiment with

  23. • Separate the generation of “follow-up” items from a group incident review meeting
    • Record in the document who responded to the incident, and who attended the
    group meeting
    • Capture in the document the things that were done after the incident but
    before the group meeting
    • Give write-ups to brand new engineers and ask them to record any and all
    questions they have after reading them
    • Link company-specific jargon/terms to documents that describe them
    • Ask more people to draw diagrams in debriefings and include them in the writeup

  24. “If the incident analyst
    participated in the incident, they
    will inevitably have a deeper
    understanding and bias towards
    the incident that will be
    impossible to remove in the
    process of analysis.”
    Ryan Kitchens
    https://www.learningfromincidents.io/blog/
    Have someone who was
    not involved in the event
    lead the analysis.

  25. “The cliche idea that we would do this work to
    reduce the number of incidents or to lessen
    the time to remediate is too simplistic.
    Of course organizations want to have fewer
    incidents, however stating this as an end goal
    actually hurts our organizations. Indeed, it will
    lead to a reduction in incident count–not from
    actually reducing the number of incidents, but
    rather lessening how and how often they are
    reported.”
    Ryan Kitchens
    https://www.learningfromincidents.io/blog/
    Resist focusing on
    reducing the number of
    incidents.
    Focus instead on
    increasing the number
    of people who want to
    read reports and attend
    the PIR meetings.

  26. Thank You!
