
“Problem Detection” (Klein et al., 2005) - Papers We Love (NYC)

John Allspaw
October 26, 2018


(meetup description for this talk is here: https://www.meetup.com/papers-we-love/events/254505298/)

Published in 2005 in the journal Cognition, Technology and Work, "Problem Detection" explores the "process by which people first become concerned that events may be taking an unexpected and undesirable direction that potentially requires action." While this paper primarily centers on empirically rebutting previous theories of how problems are detected, it also puts forth many important observations and concepts for software engineering to pay close attention to. This talk won't just be a re-statement of the paper's core views; I will place these into a software engineering and operations context and connect them to SRE and DevOps worlds in ways that may be consequential.

The paper's authors are Gary Klein, Rebecca Pliske, Beth Crandall, and David Woods.

Paper: https://www.researchgate.net/publication/220579480_Problem_detection



Transcript

  1. Problem Detection
    Gary Klein, Rebecca Pliske, Beth Crandall, David Woods
    in
    “Cognition, Technology, and Work”
    March 2005, Volume 7, Issue 1, pp 14-28
    John Allspaw
    Adaptive Capacity Labs


  2. (image)

  3. (the journal)
    Cognition, Technology, and Work
    “…focuses on the practical issues of human interaction with technology within
    the context of work and, in particular, how human cognition affects, and is
    affected by, work and working conditions.”


  4. “…process by which people first become concerned
    that events may be taking an unexpected and
    undesirable direction that potentially requires action”


  5. problem detection
    • critical in complex, real-world situations
    • in order to improve people’s ability to detect problems, we first have to
    understand how problem detection works in real-world situations.


  6. Once detection happens, people can then...
    • seek more information
    • track events more carefully
    • try to diagnose or identify the problem
    • raise the concern to other people
    • “explain away” the anomaly
    • cope with it by finding action(s) that might counter the trajectory of events
    • accept that the situation has changed in fundamental ways and that
    goals and plans need to be revised

  7. (image)

  8. “At 12:40 pm ET, an engineer noticed anomalies in our grafana
    dashboards.”
    “On Monday, just after 8:30am, we noticed that a couple of large
    websites that are hosted on Amazon were having problems and
    displaying errors.”


  9. reaction to existing work
    Cowan’s discrepancy accumulation model (1986) described problem detection
    ‘as the accumulation of discrepancies until a threshold was reached’
    Klein and Co. say this is not the case, because:

    a. cues to problems may be subtle and context-dependent

    b. what counts as a discrepancy depends on the problem-solver’s
    experience and the stance taken in interpreting the situation.

    In many cases, detecting a problem is equivalent to reconceptualizing the
    situation.

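Cowan’s accumulation account is easy to state in code, which also makes Klein and Co.’s objection concrete. A minimal sketch, assuming a fixed expected value, a relative tolerance, and a count threshold (all names and values here are illustrative; none come from the paper):

```python
# A minimal sketch of Cowan's (1986) discrepancy-accumulation model,
# the account that Klein et al. argue against. The function name,
# tolerance, and threshold are illustrative assumptions.

def cowan_detector(observations, expected, tolerance=0.1, threshold=3):
    """Flag a problem once enough discrepancies have accumulated.

    A 'discrepancy' is any observation deviating from `expected` by more
    than `tolerance` (relative). Detection fires when the running count
    of discrepancies reaches `threshold`.
    """
    discrepancies = 0
    for i, obs in enumerate(observations):
        if abs(obs - expected) / expected > tolerance:
            discrepancies += 1
        if discrepancies >= threshold:
            return i  # index at which the model "detects" a problem
    return None  # never detected

# Klein et al.'s objection in these terms: there is no fixed `expected`
# or `tolerance` prior to interpretation -- what counts as a discrepancy
# depends on the observer's expertise and stance.
latencies_ms = [100, 102, 98, 130, 99, 135, 140, 101]
print(cowan_detector(latencies_ms, expected=100))  # → 6
```

The model detects at the third out-of-tolerance sample; the paper’s point is that deciding which samples are “out of tolerance” is itself the hard, expertise-laden part.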

  10. problem detection
    initial factors that arouse concern
    (this is the focus of the paper)

    problem identification
    the ability to specify the problem


  11. existing cases
    • 19 NICU nurses
    • 37 weather forecasters
    • US Navy Commanders
    • Weapons directors aboard AWACS
    • 26 fireground commanders
    • Space shuttle mission control
    • anesthetic management during surgery
    • aviation flight decks
    Review of >1000 previous critical incidents, data from Critical
    Decision Method and other cognitive task analysis techniques


  12. new cases
    Wildland firefighting (5)
    Minimally invasive surgery (3)


  13. Case 1 and 2
    #1 - NICU nurse case
    #2 - AEGIS naval battle group exercise

    not all incidents wield the same potential; some cases have elements
    and qualities that others don’t


  14. disturbances that trigger
    problem detection



  15. Cues are not primitive events—they are constructions generated by
    people trying to understand situations.
    “…cues are only ‘‘objective’’ in a limited sense”
    “…rather, the knowledge and expectancies a person has will
    determine what counts as a cue and whether it will be noticed.”


  16. faults
    events that threaten to block an intended outcome
    we DO NOT directly perceive faults…
    …we notice the disturbances they produce: symptoms or cues
    whether we notice these or not depends on several factors, including
    data from “sensors”


  17. a shift in a situation’s potential: routine → deteriorating
    • increased wind velocity during firefighting
    • loss of ability to increase compute or storage capacity
    faults: single or multiple

  18. symptoms or cues
    speed of change
    • in <1s a driving hazard can appear
    • problems at mining operations or dams may develop over years
    • “going sour” pattern (Cook, 1991): a situation slowly deteriorates but
    goes unnoticed because each symptom, considered in isolation, does not
    signify that a problem exists
    number and variety
    from a single dominant symptom to a set of multiple symptoms
    trajectory
    the difference between “safe” and “unsafe” trajectories is clear only in
    hindsight
    bifurcation point
    an unstable, temporary state that can evolve into several stable states
    absence of data
    expertise is needed to notice these “negative” events: what is not present

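The “going sour” pattern maps directly onto monitoring practice: per-metric alerts can all stay green while the joint trajectory deteriorates. A minimal sketch, with metric names and thresholds that are illustrative assumptions (not from the talk or the paper):

```python
# "Going sour" (Cook, 1991) in monitoring terms: every metric stays
# under its own alert threshold, so isolated checks stay green, while
# a view across metrics shows steady deterioration.

thresholds = {"cpu_pct": 90, "error_rate": 0.05, "p99_latency_ms": 500}

samples = [  # one snapshot per interval; every value is below threshold
    {"cpu_pct": 60, "error_rate": 0.010, "p99_latency_ms": 200},
    {"cpu_pct": 70, "error_rate": 0.020, "p99_latency_ms": 280},
    {"cpu_pct": 80, "error_rate": 0.030, "p99_latency_ms": 360},
    {"cpu_pct": 88, "error_rate": 0.045, "p99_latency_ms": 460},
]

def isolated_alerts(snapshot):
    """Per-symptom check: fires only if a metric crosses its threshold."""
    return [m for m, v in snapshot.items() if v >= thresholds[m]]

def trajectory_alert(history):
    """Trend check: fires if every metric has worsened at every step."""
    worsening = all(
        later[m] > earlier[m]
        for earlier, later in zip(history, history[1:])
        for m in thresholds
    )
    return worsening and len(history) >= 3

print([isolated_alerts(s) for s in samples])  # [[], [], [], []] -- all green
print(trajectory_alert(samples))              # True -- going sour
```

The isolated checks never fire; only the cross-metric trajectory reveals the problem, which is exactly the pattern the slide describes.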

  19. Case 3
    Inbound Exocet missile
    symptoms have to be viewed against a background


  20. “sensors”
    completeness
    • number of them may not be adequate
    • placement of them may not be adequate
    sensitivity
    • temp probes that can’t go > 150° can’t tell you it’s climbed to 500°
    • if teammate or system detects early signs of danger but doesn’t announce them
    update rates
    • slow update rate can make it hard to gauge trajectories
    • wasn’t an issue for the surgeons, but was for the firefighters
    direct (such as visual inspection) vs. indirect (such as software displays)
    costs
    • effort and risk - not all data can be collected safely
    • ease of adjustment
    credibility
    • perceived reliability
    • uncertainty of sensitivity
    • history of the data


  21. “sensors”
    turbulence of the background
    • Operational settings are typically data rich and noisy.
    • Many data elements are present that could be relevant to the problem solver
    • There are a large number of data channels and the signals on these channels usually
    are changing.
    • The raw values are rarely constant even when the system is stable and normal.
    • The default case is detecting emerging signs of trouble against a dynamic background
    of signals rather than detecting a change from a quiescent, stable, or static
    background.
    • The noisiness of the background makes it easy to miss symptoms or to explain them
    away as part of a different pattern

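In monitoring terms, detecting against a turbulent background means comparing new readings to a moving picture of “normal” rather than to a static threshold. A minimal sketch using an exponentially weighted moving average baseline (the smoothing factor and band width are illustrative assumptions):

```python
# Detection against a "turbulent background": raw values are never
# constant, so each new reading is compared to an exponentially
# weighted moving baseline instead of a static threshold.

def ewma_detector(values, alpha=0.3, band=3.0):
    """Return indices where a value deviates from the moving baseline.

    Maintains an EWMA of the signal and of its absolute deviation;
    flags points more than `band` typical deviations from the baseline.
    """
    baseline = values[0]
    spread = 1.0  # initial guess at the typical deviation
    flagged = []
    for i, v in enumerate(values[1:], start=1):
        if abs(v - baseline) > band * spread:
            flagged.append(i)
        # update the moving picture of "normal" after the check
        spread = alpha * abs(v - baseline) + (1 - alpha) * spread
        baseline = alpha * v + (1 - alpha) * baseline
    return flagged

noisy = [50, 52, 49, 51, 53, 50, 48, 52, 75, 51, 50]
print(ewma_detector(noisy))  # → [8], the spike at value 75
```

Note that the spike itself then inflates the learned spread, so later symptoms are judged against a looser band: one mechanical analogue of the slide’s point that a noisy background makes it easy to explain symptoms away.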

  22. Problem detection as a sensemaking activity
    • data are used to construct a frame (a story or script or schema) that
    accounts for the data and guides the search for additional data
    • the frame a person is using to understand events will determine what
    counts as data
    • both activities occur in parallel: the data generating the frame, and
    the frame defining what counts as data


  23. (image)

  24. (image)

  25. critical factors that determine whether
    cues will be noticed
    • expertise
    • stance
    • attention management


  26. expertise
    • ability to perceive subtle complexities of signs
    • ability to generate expectancies
    • mental models


  27. Case 4


  28. stance
    the orientation the person has to the situation; it can range from:
    • denial that anything could go wrong,
    • to a positive “can-do” attitude that is confident of being able to
    overcome difficulties,
    • to an alert attitude that expects some serious problems might arise,
    • to a level of hysteria that over-reacts to minor signs and transient
    signals


  29. attention management
    handling the configuration of “sensors”


  30. Future directions (as of 2005)
    • More intensive empirical studies
    • Effects of variations in stance, expertise, and attention management
    • Domain-specific failures
    • Nonlinear resistance
    • Human-automation teamwork
    • Coping with massive amounts of data


  31. Conclusions
    • Problem detection hasn’t been researched closely
    • Cowan (1986) got some things right but he’s mostly wrong
    • Detecting problems in real-world situations is not trivial
    • What counts as important cues is context-dependent and heavily
    dependent on expertise


  32. Implications for us
    • major, for tool makers
    • find cases that have elements that support probing for problem detection
    expertise - and get it out of people’s heads


  33. but wait…what about problem detection in teams?
    HOMEWORK!


  34. (image)

  35. stella.report


  36. Questions?
