“Problem Detection” (Klein et al., 2005) - Papers We Love (NYC)

John Allspaw
October 26, 2018

(meetup description for this talk is here: https://www.meetup.com/papers-we-love/events/254505298/)

Published in 2005 in the journal Cognition, Technology and Work, "Problem Detection" explores the "process by which people first become concerned that events may be taking an unexpected and undesirable direction that potentially requires action." While the paper primarily centers on empirically rebutting previous theories of how problems are detected, it also puts forth many observations and concepts that software engineering should pay close attention to. This talk won't just restate the paper's core views; I will place them in a software engineering and operations context and connect them to the SRE and DevOps worlds in ways that may be consequential.

The paper's authors are Gary Klein, Rebecca Pliske, Beth Crandall, and David Woods.

Paper: https://www.researchgate.net/publication/220579480_Problem_detection

  1. Problem Detection
    Gary Klein, Rebecca Pliske, Beth Crandall, David Woods, in “Cognition, Technology, and Work”, March 2005, Volume 7, Issue 1, pp 14-28
    John Allspaw, Adaptive Capacity Labs
  2. (the journal) Cognition, Technology, and Work: “…focuses on the practical issues of human interaction with technology within the context of work and, in particular, how human cognition affects, and is affected by, work and working conditions.”
  3. “…process by which people first become concerned that events may be taking an unexpected and undesirable direction that potentially requires action”
  4. problem detection
    • critical in complex, real-world situations
    • in order to improve people’s ability to detect problems, we first have to understand how problem detection works in real-world situations.
  5. Once detection happens, people can then...
    • seek more information
    • track events more carefully
    • try to diagnose or identify the problem
    • raise the concern to other people
    • “explain away” the anomaly
    • cope with it by finding action(s) that might counter the trajectory of events
    • accept that the situation has changed in fundamental ways and that goals and plans need to be revised
  6. “At 12:40 pm ET, an engineer noticed anomalies in our grafana dashboards.”
    “On Monday, just after 8:30am, we noticed that a couple of large websites that are hosted on Amazon were having problems and displaying errors.”
  7. reaction to existing work
    Cowan’s discrepancy accumulation model (1986): ‘as the accumulation of discrepancies until a threshold was reached’
    Klein and Co. say this is not the case, because:
    a. cues to problems may be subtle and context-dependent
    b. what counts as a discrepancy depends on the problem-solver’s experience and the stance taken in interpreting the situation. In many cases, detecting a problem is equivalent to reconceptualizing the situation.
  8. problem detection: initial factors that arouse concern (this is the focus of the paper)
    problem identification: the ability to specify the problem
  9. existing cases
    • 19 NICU nurses
    • 37 weather forecasters
    • US Navy Commanders
    • Weapons directors aboard AWACS
    • 26 fireground commanders
    • Space shuttle mission control
    • anesthetic management during surgery
    • aviation flight decks
    Review of >1000 previous critical incidents, data from Critical Decision Method and other cognitive task analysis techniques
  10. Case 1 and 2
    #1 - NICU nurse case
    #2 - AEGIS naval battle group exercise
    not all incidents wield the same potential
    some cases have elements and qualities that others don’t
  11. Cues are not primitive events; they are constructions generated by people trying to understand situations.
    “…cues are only ‘objective’ in a limited sense”
    “…rather, the knowledge and expectancies a person has will determine what counts as a cue and whether it will be noticed.”
  12. faults: events that threaten to block an intended outcome
    we DO NOT directly perceive them… …we notice the disturbances they produce
    symptoms or cues: whether we notice these or not depends on several factors, including data from … “sensors”
  13. faults
    • a shift in situations: from routine to deteriorating
    • potential: increased wind velocity during firefighting; loss of ability to increase compute or storage capacity
    • single or multiple
  14. symptoms or cues
    speed of change
    • in <1s a driving hazard can appear
    • problems in mining operations or dams may develop over years
    • “going sour” pattern (Cook, 1991) - situation slowly deteriorating but goes unnoticed because each symptom considered in isolation does not signify a problem exists
    number and variety: from a single dominant symptom to a set of multiple symptoms
    trajectory: differences between “safe” and “unsafe” trajectories are clear only in hindsight
    bifurcation point: an unstable, temporary state that can evolve into several stable states
    absence of data: expertise is needed to notice these “negative” events - what is not present
  15. “sensors”
    completeness
    • number of them may not be adequate
    • placement of them may not be adequate
    sensitivity
    • temp probes that can’t go > 150° can’t tell you it’s climbed to 500°
    • if teammate or system detects early signs of danger but doesn’t announce them
    update rates
    • slow update rate can make it hard to gauge trajectories
    • wasn’t issue with surgeons but was with firefighters
    direct (such as visual inspection)
    indirect (such as sw displays)
    costs
    • effort and risk - not all data can be collected safely
    • ease of adjustment
    credibility
    • perceived reliability
    • uncertainty of sensitivity
    • history of the data
  16. “sensors”: turbulence of the background
    • Operational settings are typically data rich and noisy.
    • Many data elements are present that could be relevant to the problem solver
    • There are a large number of data channels and the signals on these channels usually are changing.
    • The raw values are rarely constant even when the system is stable and normal.
    • The default case is detecting emerging signs of trouble against a dynamic background of signals rather than detecting a change from a quiescent, stable, or static background.
    • The noisiness of the background makes it easy to miss symptoms or to explain them away as part of a different pattern
  17. Problem detection as sensemaking activity
    data are used to construct a frame (a story or script or schema) that accounts for the data and guides the search for additional data
    the frame a person is using to understand events will determine what counts as data
    Both activities occur in parallel: the data generating the frame, and the frame defining what counts as data
  18. critical factors that determine whether cues will be noticed
    • expertise
    • stance
    • attention management
  19. expertise
    • ability to perceive subtle complexities of signs
    • ability to generate expectancies
    • mental models
  20. stance: the orientation the person has to the situation
    can range from:
    • denial that anything could go wrong,
    • to a positive ‘can-do’ attitude that is confident of being able to overcome difficulties,
    • to an alert attitude that expects some serious problems might arise,
    • to a level of hysteria that over-reacts to minor signs and transient signals
  21. Future directions (as of 2005)
    • More intensive empirical studies
    • Effects of variations in stance, expertise, and attention management
    • Domain-specific failures
    • Nonlinear resistance
    • Human-automation teamwork
    • Coping with massive amounts of data
  22. Conclusions
    • Problem detection hasn’t been researched closely
    • Cowan (1986) got some things right but he’s mostly wrong
    • Detecting problems in real-world situations is not trivial
    • What counts as important cues is context-dependent and heavily dependent on expertise
  23. Implications for us
    • major, for tool makers
    • find cases that have elements that support probing for problem detection expertise - and get it out of people’s heads
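
For tool makers, two ideas from the slides are easy to make concrete in a few lines of Python. This is only an illustrative sketch, not anything taken from the paper: the function names, the discrepancy scores, and the thresholds are all hypothetical. The first function is a toy version of the threshold model from slide 7; the second is the kind of per-symptom check that misses a "going sour" pattern (slide 14); the third is the saturating temperature probe from slide 15. Note that the toy treats discrepancy scores as fixed numbers, which is exactly the assumption the paper disputes, since what counts as a discrepancy depends on the observer's expertise and stance.

```python
from typing import Optional, Sequence


def first_alarm_index(discrepancies: Sequence[float], threshold: float) -> Optional[int]:
    """Toy threshold model: return the index of the observation at which the
    running total of discrepancy scores first reaches `threshold`, else None."""
    total = 0.0
    for i, d in enumerate(discrepancies):
        total += d
        if total >= threshold:
            return i
    return None


def any_single_alarm(discrepancies: Sequence[float], per_symptom_limit: float) -> bool:
    """Per-symptom check: does any one observation, considered in isolation,
    exceed its own limit?"""
    return any(d > per_symptom_limit for d in discrepancies)


def probe_reading(true_temp: float, probe_max: float = 150.0) -> float:
    """A saturating sensor: it cannot report anything above `probe_max`."""
    return min(true_temp, probe_max)


# Hypothetical discrepancy scores for a slowly deteriorating situation.
scores = [0.2, 0.3, 0.2, 0.4, 0.3]

# No single symptom stands out on its own...
print(any_single_alarm(scores, per_symptom_limit=0.5))
# ...yet the running total does eventually cross the threshold.
print(first_alarm_index(scores, threshold=1.0))
# The probe reports the same value for 150 degrees and 500 degrees.
print(probe_reading(150.0) == probe_reading(500.0))
```

Even in this toy, the outcome depends entirely on who picked the scores and limits, which is one way to read the paper's point that cues are constructed rather than given.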