Findings From The Field - DevOps Enterprise London 2020

In the past two years, we have had the opportunity to observe and explore the real nitty-gritty of how organizations handle, perceive, value, and treat the incidents that they experience. While the size, type, and character of these companies vary wildly, we have observed some common patterns across them.

This talk will outline these patterns we've discovered and explored, some of which run counter to traditional beliefs in the industry, some of which pose dilemmas for businesses, and others that point to undiscovered competitive advantages that are being "left on the table."

These patterns include:
- A mismatch between leadership’s views on incidents and the lived experience that engineers have with them.
- Learning from incidents is given low organizational priority, and the quality and depth of the results reflect this.
- “Fixing” rather than “learning” - the main focus of post-incident activity is repair.
- Engineers learn from incidents in unconventional ways that are largely hidden from management (and therefore unsupported).

John Allspaw

June 23, 2020

Transcript

  1. Findings From The Field
    Two Years of Studying Incidents Closely
    Adaptive Capacity Labs
    John Allspaw

  2. about me
    Consortium for Resilient
    Internet-Facing Business IT
    Adaptive
    Capacity
    Labs

  3. disclosure
    1. these are only a few of the most common patterns
    2. these are not judgements/comments on any single organization

  4. Bottom Line, Up Front:
    what we’ve observed across the industry
    1. The state of maturity in the industry on learning from incidents is low.
    2. Significant gaps exist between technology leaders and hands-on
    practitioners on what it means to learn from incidents.
    3. Learning from incidents is given low priority, resulting in a narrow focus on
    fixing.
    4. Overconfidence in what shallow incident metrics mean and significant
    energy wasted on tabulating them.

  5. Technology Leaders ↔ Hands-on Practitioners
    • what is actually learned
    • how learning actually takes place
    • what the incident actually means (for the past, for now, and for the future)
    a gap exists here

  6. “Blunt” End
    “Sharp” End
    Technology Leaders
    Hands-on Practitioners
    • summaries
    • simplifications
    • abstractions
    • statistics

  7. “Blunt” End
    “Sharp” End
    Technology Leaders
    Hands-on Practitioners
    • summaries
    • simplifications
    • abstractions
    • statistics

  8. Technology Leaders
    Hands-on Practitioners

    24.53 Mean Time To Resolve
    32.13 Mean Time To Oversimplify
    14.45 Mean Time To Something
    22 incidents in Q3
    12 SEVERITY DEFCON events

  9. Technology Leaders
    • typically are far away from the “messy details” of incidents
    • frequently believe their presence and participation in incident response
    channels (chat, bridges, etc.) has a positive influence (it doesn’t)
    • typically believe incidents are adverse events in an otherwise “quiet” and
    healthy reality (they’re not)
    • typically fear how incidents reflect poorly on their performance more
    than they fear practitioners not learning effectively from them

  10. Technology Leaders
    • typically believe abstract incident metrics tell enough of a story for them
    to understand the state of the “system” (they don’t)
    • typically believe abstract incident metrics reflect more about their teams’
    performance than they reflect the complexity those teams have to cope
    with
    • typically believe the above observations don’t apply to them

  11. (M)TTR/(M)TTD
    Frequency
    Severity
    Customer impact

    shallow
    metrics
    no predictive value forward
    no explanatory value backward
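
One way to see why an averaged metric such as MTTR explains so little is a small sketch. The resolution times below are invented for illustration (they are not data from the talk), and the variable names are hypothetical. Both quarters report the same mean even though they describe very different experiences.

```python
from statistics import mean, median

# Hypothetical resolution times in minutes, invented for illustration only.
quarter_a = [28, 30, 30, 32, 30]   # five broadly similar incidents
quarter_b = [5, 5, 5, 5, 130]      # four quick ones and one long outage

mttr_a = mean(quarter_a)
mttr_b = mean(quarter_b)

# The headline number is identical for both quarters...
print(f"MTTR A: {mttr_a} min, MTTR B: {mttr_b} min")

# ...while the shape underneath it is not.
print(f"A: median {median(quarter_a)} min, longest {max(quarter_a)} min")
print(f"B: median {median(quarter_b)} min, longest {max(quarter_b)} min")
```

Even the extra summary statistics say nothing about how difficult each incident was to handle or what people learned from it, which is the point the slide makes.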

  12. “but they help us ask deeper questions”
    You don’t need this chart to ask deeper
    questions about incidents.
    Just ask the questions.
    and record both the questions and
    answers so others can find them in the
    future

  13. Technology Leaders
    How can you tell the difference between…
    A difficult case
    handled well.
    A straightforward
    case handled
    poorly.
    ?

  14. Difficulty in handling
    the incident
    Consequences
    or impact of the
    incident
    Performance in
    handling the incident
    Technology Leaders
    incident metrics
    only signal these
    without these, you cannot understand
    what incidents mean in context

  15. incident metrics do not do what you think they do
    More on this topic:
    https://bit.ly/beyond-shallow-data

  16. Hands-on Practitioners
    • typically view post-incident activities as a “check-the-box” chore
    • typically believe in a future world where automation will make incidents
    disappear
    • typically do not capture what makes an incident difficult, only what
    technical solution there was for it.
    • typically do not write up the incident for readers beyond
    their local team

  17. Hands-on Practitioners
    • typically do not read post-incident review write-ups from other teams
    • typically fear what leadership thinks of incident metrics more than they
    fear misunderstanding the origins and sources of the incident
    • typically have to exercise significant restraint to avoid jumping immediately to
    “fixes” before understanding an incident beyond a surface level
    • typically believe the above observations don’t apply to them

  18. Learning is not the
    same as fixing.
    More about this here:
    https://bit.ly/learning-not-fixing

  19. Ok! We get it!
    What are solutions, wiseguy?

  20. Technology Leaders
    Learning from incidents effectively requires
    skill and expertise that most do not have
    These are skills that can be learned and improved.
    Prioritize it when things are going well.
    It will accelerate the expertise in your org.
    More on this:
    https://www.learningfromincidents.io/

  21. Technology Leaders
    Focus less on incident metrics and more on signals that people are learning
    • analytics on how often incident write-ups are being read
    • analytics on who is reading the write-ups
    • analytics on where incident write-ups are being linked from
    • make group incident review meetings optional, and track attendance
    • track which write-ups link to prior relevant incident write-ups
    More about this here:
    https://bit.ly/learning-markers
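
A minimal sketch of how the read-related signals above could be tallied, assuming you can export read events from wherever write-ups are hosted. The `ReadEvent` record, the incident IDs, team names, and the `learning_signals` helper are all hypothetical and are not taken from the talk or from any particular tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadEvent:
    """One view of a write-up (hypothetical export format)."""
    writeup_id: str
    reader: str
    reader_team: str


# Hypothetical sample data, invented for illustration.
OWNING_TEAM = {"INC-101": "payments", "INC-102": "search"}
LINKS_TO_PRIOR = {"INC-101": [], "INC-102": ["INC-101"]}
READS = [
    ReadEvent("INC-101", "ana", "payments"),
    ReadEvent("INC-101", "bo", "search"),
    ReadEvent("INC-101", "bo", "search"),
    ReadEvent("INC-102", "cy", "search"),
]


def learning_signals(reads, owning_team, links_to_prior):
    """Summarize how often, and by whom, each write-up is read."""
    total_reads = Counter(event.writeup_id for event in reads)
    signals = {}
    for writeup_id, team in owning_team.items():
        events = [e for e in reads if e.writeup_id == writeup_id]
        readers = {e.reader for e in events}
        # Readers from outside the team closest to the incident.
        outside = {e.reader for e in events if e.reader_team != team}
        signals[writeup_id] = {
            "reads": total_reads[writeup_id],
            "distinct_readers": len(readers),
            "outside_team_readers": len(outside),
            "links_to_prior": len(links_to_prior.get(writeup_id, [])),
        }
    return signals


signals = learning_signals(READS, OWNING_TEAM, LINKS_TO_PRIOR)
for writeup_id, summary in signals.items():
    print(writeup_id, summary)
```

Counts like these only hint at whether write-ups are reaching people; they are not a measure of learning in themselves.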

  22. Practitioners
    Don’t place all the burden on a group review meeting!
    Use this meeting to present and discuss analysis that has already been done.
    this is an important meeting — prepare for it like it’s expensive — because it is!
    • HiPPO (“highest paid person's opinion”)
    • Groupthink
    • Tangents
    • Redirections
    • Elephants in the room
    • “Down in the weeds”
    Too many potential pitfalls to bet everything
    on a single meeting…

  23. Practitioners
    Incident analysts should NOT be stakeholders
    • Your role is not to tell the One True Story™ of what happened.
    • Your role is not to dictate or suggest what to do.
    • Maintaining a non-stakeholder stance signals to others that you are willing
    to hear a minority viewpoint
    • Half of your job is to get people to genuinely look forward to and participate
    in the next incident analysis.

  24. Practitioners
    Separate generating action items from the group review meeting
    Action Items Generation
    Group Review Meeting
    “soak time”

  25. ACL Challenge
    Technology Leaders
    Practitioners
    For every incident that has a “red herring” episode…capture the red herring
    part of the story in detail in the write-up, especially on what made following
    the “rabbit hole” seem reasonable at the time.
    Start tracking how often post-incident write-ups are voluntarily read by
    people outside of the team(s) closest to the incident.
    Start tracking how often incident review meetings are voluntarily attended by
    people outside of the team(s) closest to the incident.

  26. Help I’m Looking For

  27. Thank You!
    Help I’m Looking For
