Slide 1

Slide 1 text

Taking Human Performance Seriously
John Allspaw (@allspaw)
Adaptive Capacity Labs (@adaptiveclabs)

Slide 2

Slide 2 text

previously, on Allspaw Speaks At Monitorama…

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

observability alerts monitoring tracing logs telemetry metrics

Slide 5

Slide 5 text

observability alerts monitoring tracing logs telemetry metrics
coordinating anticipating inferring diagnosing planning modifying reacting correcting

Slide 6

Slide 6 text

some context

Slide 7

Slide 7 text

beliefs about safety (1940s-1970s)
• Safety can be encoded in the design of technology.
• Accidents can be avoided by having more automation.
• Procedures can be specified to be objective and comprehensive.
• Operators just have to follow the procedures to get work done.
• “Humans Are Better At” versus “Machines Are Better At” lists (HABA-MABA)

Slide 8

Slide 8 text

March 28, 1979

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

new beliefs about safety, post-TMI
• Automation is necessary in modern systems, and it also introduces new forms of challenges and risk.
• Rules and procedures are always underspecified, so they can’t guarantee safety by themselves; they have to be interpreted in local context.
• Events in these environments will require operators to make decisions and take actions that cannot be pre-specified.
• The methods and models for “risk” that rely on “human error” categories, accounting, taxonomies, etc. are fraught.

Slide 11

Slide 11 text

What we thought we knew about human contributions to successful work in complex domains was wrong.

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

By “human” performance, we mean cognitive performance.

Slide 14

Slide 14 text

We study cognitive work by studying incidents
• time pressure
• high (or potentially increasing) consequences
• uncertainty
• ambiguity

Slide 15

Slide 15 text

who is “we”?
• Resilience In Business-Critical Digital Services Consortium
• Adaptive Capacity Labs

Slide 16

Slide 16 text

“…nonroutine, challenging events, because these tough cases have the greatest potential for uncovering elements of expertise and related cognitive phenomena.” (Klein, Crandall, Hoffman, 2006)

methods, approaches, and techniques
• cognitive task analysis
• cognitive work analysis
• process tracing
• conversation analysis
• Critical Decision Method
• Critical Incident Technique
• more…

Slide 17

Slide 17 text

what we find when we study incidents

Slide 18

Slide 18 text

• logs
• time of year
• day of year
• time of day
• observations and hypotheses others share
• what has been investigated thus far
• what’s been happening in the world (news, service provider outages, etc.)
• time-series data
• alerts
• tracing/observability tools
• recent changes in existing tech
• new dependencies
• who is on vacation, at a conference, traveling, etc.
• status of other ongoing work

Slide 19

Slide 19 text

“Cues are not primitive events—they are constructions generated by people trying to understand situations. …cues are only ‘objective’ in a limited sense. …rather, the knowledge and expectancies a person has will determine what counts as a cue and whether it will be noticed.”

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

• DBA: 2 weeks on the job
• Infra Engineer: 2.5 years
• Network Engineer: 5 years
• Product/App Engineer: 3 years
• Security Engineer: 1 year

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

- problem detection and identification
- generating hypotheses
- diagnostic actions
- therapeutic actions
- sacrifice decisions
- coordinating
- (re)planning
- preparing for potential escalation/cascades

multiple threads of activity: some productive, some unproductive

Slide 26

Slide 26 text

time pressure
high consequences

Slide 27

Slide 27 text

this is not “debugging” or “troubleshooting”

Slide 28

Slide 28 text

people will pursue what they think will be productive

Slide 29

Slide 29 text

I mean I could ssh into one of the servers, and I might find something helpful by doing that…but… NO I REFUSE TO DO THAT BECAUSE I SHOULDN’T HAVE TO!!!

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

people will pursue what they think will be productive
• “people”: who are these people? what roles do they play…actually?
• “be productive”: for “fixing”…? for understanding? for ‘stemming the bleeding’? for customer support? for…?
• “think”: via hypotheses? via past experience? via…?

Slide 32

Slide 32 text

what does this research look like?

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

• Anomalous signals and representations
• Interventions and results
• Tentative, evolving, shared hypotheses
• Collective hypotheses ➝ plans acted on
• line of certainty and commitment to action

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

Approaching Overload: Diagnosis and Response to Anomalies in Complex and Automated Production Software Systems
Marisa Grayson
Ohio State University

Slide 39

Slide 39 text

monitoring/observability are inextricably coupled with other activities

Slide 40

Slide 40 text

what can you do?

Slide 41

Slide 41 text

Build your own internal resources to do incident analysis

Slide 42

Slide 42 text

• Are there any sources of data about the systems (logs, graphs, etc.) that people regularly dismiss or are suspicious of?
• How do people improvise new tools to help them understand what is happening?
• What tricks do people or teams use to understand how otherwise opaque 3rd party services are behaving?

Slide 43

Slide 43 text

Select a few incidents for closer and deeper analysis

Slide 44

Slide 44 text

Build or adjust tooling to capture data streams of incidents and their handling
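One way to picture this suggestion is a small tool that folds separate records of an incident (chat messages, alerts, deploys) into a single chronological timeline that analysts can read afterwards. The sketch below is only an illustration under stated assumptions: the `IncidentEvent` fields, the stream names, and the `build_timeline` and `render` helpers are all hypothetical and do not come from the talk or from any particular product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List


@dataclass(frozen=True)
class IncidentEvent:
    """One captured moment in an incident (hypothetical schema)."""
    timestamp: datetime  # when it happened; assumed timezone-aware
    source: str          # assumed stream name, e.g. "chat", "alerts", "deploys"
    actor: str           # person or system that produced the event
    text: str            # what was said, fired, or changed


def build_timeline(*streams: Iterable[IncidentEvent]) -> List[IncidentEvent]:
    """Merge any number of event streams into one list ordered by time."""
    merged = [event for stream in streams for event in stream]
    # sorted() is stable, so events with equal timestamps keep their input order.
    return sorted(merged, key=lambda event: event.timestamp)


def render(timeline: Iterable[IncidentEvent]) -> str:
    """Format the timeline as plain text, one event per line."""
    return "\n".join(
        f"{e.timestamp.isoformat()} [{e.source}] {e.actor}: {e.text}"
        for e in timeline
    )


if __name__ == "__main__":
    # Made-up sample events, for illustration only.
    def at(minute: int) -> datetime:
        return datetime(2024, 1, 1, 12, minute, tzinfo=timezone.utc)

    chat = [IncidentEvent(at(3), "chat", "engineer-a", "is anyone else seeing timeouts?")]
    alerts = [IncidentEvent(at(1), "alerts", "pager", "latency threshold exceeded")]
    print(render(build_timeline(chat, alerts)))
```

Keeping what people said alongside what the systems reported is the point of the merge: it preserves the order in which cues appeared and hypotheses were voiced, which is the kind of material a closer incident analysis works from.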

Slide 45

Slide 45 text

Make company-wide postmortem sessions regular events

Slide 46

Slide 46 text

Suggestions for vendors

Slide 47

Slide 47 text

Hire and retain expertise to do qualitative research
“dogfooding” is not sufficient

Slide 48

Slide 48 text

Research on supporting work in complex cognitive domains already exists! It will prove to be a competitive advantage for you.

Slide 49

Slide 49 text

Summary
• Understanding cognitive work in software engineering and operations is critically important. (The stakes are already too high, and we’re behind.)
• Doing this well will mean new language, concepts, paradigms, and practices, some of which may be unintuitive and/or controversial.
• Must be driven by both research/academia and industry/practitioners.
• Vendors: if you pay attention, this will be a competitive advantage for you.

Slide 50

Slide 50 text

“...irony that the more advanced a control system is, so the more crucial may be the contribution of the human operator.”
–Lisanne Bainbridge, “Ironies of Automation”, 1983

Slide 51

Slide 51 text

Thank You!
@allspaw
@AdaptiveCLabs
https://www.adaptivecapacitylabs.com/blog