Taking Human Performance Seriously

John Allspaw (@allspaw)
Adaptive Capacity Labs (@adaptiveclabs)

previously, on Allspaw Speaks At Monitorama…

observability alerts monitoring tracing logs telemetry metrics

observability alerts monitoring tracing logs telemetry metrics
coordinating anticipating inferring diagnosing planning modifying reacting correcting

some context

beliefs about safety (1940s-1970s)
• Safety can be encoded in the design of technology.
• Accidents can be avoided by having more automation.
• Procedures can be specified to be objective and comprehensive.
• Operators just have to follow the procedures to get work done.
• “Humans Are Better At” versus “Machines Are Better At” List (HABA-MABA)

March 28, 1979

new beliefs about safety, post-TMI
• Automation is necessary in modern systems, and also introduces new forms of challenges and risk.
• Rules and procedures are always underspecified, so therefore can’t guarantee safety by themselves without interpreting them in local context.
• Events in these environments will require operators to make decisions and take action that cannot be pre-specified.
• The methods and models for “risk” that rely on “human error” categories, accounting, taxonomies, etc. are fraught.

What we thought we knew about human contributions to successful work in complex domains was wrong.

By “human” performance, we mean cognitive performance.

We study cognitive work by studying incidents
• time pressure
• high (or potentially increasing) consequences
• uncertainty
• ambiguity

who is “we”?
• Resilience In Business-Critical Digital Services Consortium
• Adaptive Capacity Labs

“…nonroutine, challenging events, because these tough cases have the greatest potential for uncovering elements of expertise and related cognitive phenomena.” (Klein, Crandall, Hoffman, 2006)

methods, approaches, and techniques
• cognitive task analysis
• cognitive work analysis
• process tracing
• conversation analysis
• Critical Decision Method
• Critical Incident Technique
• more…

what we ﬁnd when we study incidents

• logs
• time of year
• day of year
• time of day
• observations and hypotheses others share
• what has been investigated thus far
• what’s been happening in the world (news, service provider outages, etc.)
• time-series data
• alerts
• tracing/observability tools
• recent changes in existing tech
• new dependencies
• who is on vacation, at a conference, traveling, etc.
• status of other ongoing work

“Cues are not primitive events—they are constructions generated by people
trying to understand situations. …cues are only ‘objective’ in a limited sense. …rather, the knowledge and expectancies a person has will determine what counts as a cue and whether it will be noticed.”

• DBA: 2 weeks on the job
• Infra Engineer: 2.5 years
• Network Engineer: 5 years
• Product/App Engineer: 3 years
• Security Engineer: 1 year

- problem detection and identification
- generating hypotheses
- diagnostic actions
- therapeutic actions
- sacrifice decisions
- coordinating
- (re) planning
- preparing for potential escalation/cascades

multiple threads of activity: some productive, some unproductive

time pressure high consequences

this is not “debugging” or “troubleshooting”

people will pursue what they think will be productive

I mean I could ssh into one of the servers, and I might find something helpful by doing that…but… NO I REFUSE TO DO THAT BECAUSE I SHOULDN’T HAVE TO!!!

people will pursue what they think will be productive
• “people”: who are these people? what roles do they play…actually?
• “be productive”: for “fixing”…? for understanding? for ‘stemming the bleeding’? for customer support? for…?
• “think”: via hypotheses? via past experience? via…?

what does this research look like?

• Anomalous signals and representations
• Interventions and results
• Tentative, evolving, shared hypotheses
• Collective hypotheses ➝ plans acted on
• line of certainty and commitment to action

Approaching Overload: Diagnosis and Response to Anomalies in Complex and Automated Production Software Systems
Marisa Grayson, Ohio State University

monitoring/observability are inextricably coupled with other activities

what can you do?

Build your own internal resources to do incident analysis

• Are there any sources of data about the systems (logs, graphs, etc.) that people regularly dismiss or are suspicious of?
• How do people improvise new tools to help them understand what is happening?
• What tricks do people or teams use to understand how otherwise opaque 3rd party services are behaving?

Select a few incidents for closer and deeper analysis

Build or adjust tooling to capture data streams of incidents and their handling

Make company-wide postmortem sessions regular events

Suggestions for vendors

Hire and retain expertise to do qualitative research
“dogfooding” is not sufficient

Research on supporting work in complex cognitive domains already exists!
It will prove to be a competitive advantage for you.

Summary
• Understanding cognitive work in software engineering and operations is critically important. (The stakes are already too high, and we’re behind.)
• Doing this well will mean new language, concepts, paradigms, and practices, some of which may be unintuitive and/or controversial.
• Must be driven by both research/academia and industry/practitioners.
• Vendors: if you pay attention, this will be a competitive advantage for you.

“...irony that the more advanced a control system is, so the more crucial may be the contribution of the human operator.”
–Lisanne Bainbridge, 1983, “Ironies of Automation”

Thank You! @allspaw https://www.adaptivecapacitylabs.com/blog @AdaptiveCLabs
