Problem Detection
Gary Klein, Rebecca Pliske, Beth Crandall, David Woods
in
“Cognition, Technology & Work”
March 2005, Volume 7, Issue 1, pp 14-28
John Allspaw
Adaptive Capacity Labs
Slide 2
(no text content)
Slide 3
(the journal)
Cognition, Technology & Work
“…focuses on the practical issues of human interaction with technology within
the context of work and, in particular, how human cognition affects, and is
affected by, work and working conditions.”
Slide 4
“…process by which people first become concerned
that events may be taking an unexpected and
undesirable direction that potentially requires action”
Slide 5
problem detection
• critical in complex, real-world situations
• in order to improve people’s ability to detect problems, we first have to
understand how problem detection works in real-world situations.
Slide 6
Once detection happens, people can then...
• seek more information
• track events more carefully
• try to diagnose or identify the problem
• raise the concern to other people
• “explain away” the anomaly
• cope with it by finding action(s) that might counter the trajectory of events
• accept that the situation has changed in fundamental ways and revise goals and plans accordingly
Slide 7
(no text content)
Slide 8
“At 12:40 pm ET, an engineer noticed anomalies in our grafana
dashboards.”
“On Monday, just after 8:30am, we noticed that a couple of large
websites that are hosted on Amazon were having problems and
displaying errors.”
Slide 9
reaction to existing work
Cowan’s discrepancy accumulation model (1986) described problem detection “as the accumulation of discrepancies until a threshold was reached.”
Klein and co-authors say this is not the case, because:
a. cues to problems may be subtle and context-dependent
b. what counts as a discrepancy depends on the problem-solver’s experience and the stance taken in interpreting the situation
In many cases, detecting a problem is equivalent to reconceptualizing the situation.
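To make the contrast concrete, here is a minimal sketch of a Cowan-style detector in Python. The metric values, tolerance, and threshold are invented for illustration, not from the paper; the point is what the model must assume up front: a pre-defined notion of "expected," "discrepancy," and "enough."

```python
# A Cowan-style (1986) discrepancy-accumulation detector, sketched for
# contrast. All names and numbers are illustrative, not from the paper.

def detect_problem(readings, expected=100.0, tolerance=5.0, threshold=3):
    """Declare a problem once enough out-of-tolerance readings accumulate."""
    discrepancies = 0
    for value in readings:
        if abs(value - expected) > tolerance:  # a "discrepancy" is pre-defined
            discrepancies += 1
        if discrepancies >= threshold:         # fixed accumulation threshold
            return True
    return False

print(detect_problem([101, 99, 120, 130, 97, 140]))  # True

# This only works because `expected`, `tolerance`, and `threshold` were
# known in advance -- exactly the assumption Klein et al. reject: in real
# settings, what counts as a discrepancy depends on expertise and stance,
# and detection often means reconceptualizing the situation, not counting.
```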
Slide 10
problem detection: the initial factors that arouse concern (this is the focus of the paper)
problem identification: the ability to specify the problem
Slide 11
existing cases
• 19 NICU nurses
• 37 weather forecasters
• US Navy Commanders
• Weapons directors aboard AWACS
• 26 fireground commanders
• Space shuttle mission control
• anesthetic management during surgery
• aviation flight decks
Review of >1000 previous critical incidents, data from Critical
Decision Method and other cognitive task analysis techniques
Slide 12
new cases
Wildland firefighting (5)
Minimally invasive surgery (3)
Slide 13
Cases 1 and 2
#1 - NICU nurse case
#2 - AEGIS naval battle group exercise
Not all incidents hold the same potential: some cases have elements and qualities that others don’t.
Slide 14
disturbances that trigger problem detection
Slide 15
Cues are not primitive events—they are constructions generated by
people trying to understand situations.
“…cues are only ‘objective’ in a limited sense”
“…rather, the knowledge and expectancies a person has will
determine what counts as a cue and whether it will be noticed.”
Slide 16
faults: events that threaten to block an intended outcome
we do NOT directly perceive faults…we notice the disturbances they produce: symptoms or cues
whether we notice these or not depends on several factors, including data from “sensors”
Slide 17
a shift in a situation’s potential: routine → deteriorating
examples: increased wind velocity during firefighting; loss of the ability to increase compute or storage capacity
faults can be single or multiple
Slide 18
symptoms or cues
speed of change
• in under a second, a driving hazard can appear
• problems in mining operations or dams may develop over years
• “going sour” pattern (Cook, 1991): a situation slowly deteriorates but goes unnoticed because each symptom, considered in isolation, does not signify that a problem exists (see the sketch after this list)
number and variety
• from a single dominant symptom to a set of multiple symptoms
trajectory
• the difference between “safe” and “unsafe” trajectories is clear only in hindsight
bifurcation point
• an unstable, temporary state that can evolve into several stable states
absence of data
• expertise is needed to notice these “negative” events: what is not present
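In monitoring terms, the “going sour” pattern is exactly what per-metric alerting misses. A hedged sketch (Python, with invented metric names and thresholds): each symptom stays under its own alert threshold, so isolated checks stay quiet even as all of them rise together.

```python
# "Going sour" (Cook, 1991): each symptom alone stays under its alert
# threshold, so per-symptom checks never fire, yet the situation as a
# whole is steadily deteriorating. Metric names/values are invented.

samples = [  # (error rate %, p99 latency ms, queue depth), one per minute
    (0.5, 210, 110),
    (0.8, 260, 150),
    (1.2, 330, 210),
    (1.7, 420, 300),
]
thresholds = (2.0, 500, 400)  # per-symptom alert thresholds

for minute, sample in enumerate(samples):
    isolated_alarm = any(v >= t for v, t in zip(sample, thresholds))
    print(minute, sample, "alarm" if isolated_alarm else "quiet")

# Every minute prints "quiet": no single symptom crosses its threshold,
# even though all three have been rising together the whole time.
```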
Slide 19
Case 3
Inbound Exocet missile
symptoms have to be viewed against a background
Slide 20
“sensors”
sensors can be direct (such as visual inspection) or indirect (such as software displays)
completeness
• the number of them may not be adequate
• the placement of them may not be adequate
sensitivity
• temp probes that can’t read above 150° can’t tell you it’s climbed to 500° (see the sketch after this list)
• a teammate or system may detect early signs of danger but not announce them
update rates
• a slow update rate can make it hard to gauge trajectories
• this wasn’t an issue for the surgeons, but it was for the firefighters
costs
• effort and risk: not all data can be collected safely
• ease of adjustment
credibility
• perceived reliability
• uncertainty of sensitivity
• history of the data
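The sensitivity point has a direct software analogue: a saturated gauge reports its ceiling rather than the true value, so everything above the ceiling looks identical. A minimal sketch, with invented numbers:

```python
# Sensor saturation: a probe capped at 150 degrees reports the same value
# whether the true temperature is 150 or 500, hiding the trajectory above
# the cap. Values are illustrative only.

SENSOR_MAX = 150.0

def read_probe(true_temp):
    """A saturating sensor: clips anything above its measurable range."""
    return min(true_temp, SENSOR_MAX)

true_temps = [120.0, 140.0, 150.0, 220.0, 350.0, 500.0]  # escalating
readings = [read_probe(t) for t in true_temps]
print(readings)  # [120.0, 140.0, 150.0, 150.0, 150.0, 150.0]

# From the third reading onward the display is flat at 150: the data look
# stable exactly when the real trajectory is steepest.
```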
Slide 21
“sensors”
turbulence of the background
• Operational settings are typically data rich and noisy.
• Many data elements are present that could be relevant to the problem solver
• There are a large number of data channels and the signals on these channels usually
are changing.
• The raw values are rarely constant even when the system is stable and normal.
• The default case is detecting emerging signs of trouble against a dynamic background
of signals rather than detecting a change from a quiescent, stable, or static
background.
• The noisiness of the background makes it easy to miss symptoms or to explain them
away as part of a different pattern
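Read in monitoring terms: because raw values are rarely constant, a fixed threshold against a turbulent background either fires constantly or never. One common alternative, sketched below with invented data (this is a standard rolling z-score, not a technique from the paper), is to score each new point against a rolling estimate of the background itself:

```python
# Scoring new points against a rolling estimate of a noisy background,
# rather than against a fixed value. Data and window size are invented.

from statistics import mean, stdev

noisy_baseline = [100, 104, 97, 102, 99, 103, 98, 101]  # normal turbulence
emerging_problem = [106, 111, 117, 124]                 # drifting upward
series = noisy_baseline + emerging_problem

WINDOW = 6
for i in range(WINDOW, len(series)):
    window = series[i - WINDOW:i]
    z = (series[i] - mean(window)) / stdev(window)
    flag = "  <- anomalous?" if abs(z) > 2.5 else ""
    print(f"t={i:2d} value={series[i]:3d} z={z:+.1f}{flag}")

# The jittery-but-normal points score near zero; the upward drift stands
# out only relative to the background's own recent variation.
```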
Slide 22
Problem detection as a sensemaking activity
data are used to construct a frame (a story, script, or schema) that accounts for the data and guides the search for additional data
the frame a person is using to understand events will determine what counts as data
both activities occur in parallel: the data generating the frame, and the frame defining what counts as data
Slide 23
(no text content)
Slide 24
(no text content)
Slide 25
critical factors that determine whether
cues will be noticed
• expertise
• stance
• attention management
Slide 26
expertise
• ability to perceive subtle complexities of signs
• ability to generate expectancies
• mental models
Slide 27
Case 4
Slide 28
stance
the orientation the person has to the situation
can range from:
• denial that anything could go wrong,
• to a positive “can-do” attitude that is confident of being able to overcome difficulties,
• to an alert attitude that expects some serious problems might arise,
• to a level of hysteria that over-reacts to minor signs and transient signals
Slide 29
attention management
handling the configuration of “sensors”
Slide 30
Future directions (as of 2005)
• More intensive empirical studies
• Effects of variations in stance, expertise, and attention management
• Domain-specific failures
• Nonlinear resistance
• Human-automation teamwork
• Coping with massive amounts of data
Slide 31
Conclusions
• Problem detection hasn’t been researched closely
• Cowan (1986) got some things right but he’s mostly wrong
• Detecting problems in real-world situations is not trivial
• What counts as important cues is context-dependent and heavily
dependent on expertise
Slide 32
Implications for us
• major, for tool makers
• find cases with elements that support probing for problem-detection expertise, and get that expertise out of people’s heads
Slide 33
but wait…what about problem detection in teams?
HOMEWORK!