
Taking Human Performance Seriously

John Allspaw
June 03, 2019

Monitorama PDX 2019

Transcript

  1. Taking Human Performance Seriously. John Allspaw (@allspaw), Adaptive Capacity Labs (@adaptiveclabs)
  2. previously, on Allspaw Speaks At Monitorama…

  3. None
  4. observability alerts monitoring tracing logs telemetry metrics

  5. observability alerts monitoring tracing logs telemetry metrics coordinating anticipating inferring diagnosing planning modifying reacting correcting
  6. some context

  7. beliefs about safety (1940s-1970s) • Safety can be encoded in the design of technology. • Accidents can be avoided by having more automation. • Procedures can be specified to be objective and comprehensive. • Operators just have to follow the procedures to get work done. • “Humans Are Better At” versus “Machines Are Better At” List (HABA-MABA)
  8. March 28, 1979

  9. None
  10. new beliefs about safety, post-TMI • Automation is necessary in modern systems, and it also introduces new forms of challenges and risk. • Rules and procedures are always underspecified, and therefore cannot guarantee safety by themselves; they must be interpreted in local context. • Events in these environments will require operators to make decisions and take actions that cannot be pre-specified. • The methods and models for “risk” that rely on “human error” categories, accounting, taxonomies, etc. are fraught.

  11. What we thought we knew about human contributions to successful work in complex domains was wrong.
  12. None
  13. By “human” performance, we mean cognitive performance.

  14. We study cognitive work by studying incidents: time pressure, high (or potentially increasing) consequences, uncertainty, ambiguity

  15. who is “we”? Resilience In Business-Critical Digital Services Consortium, Adaptive Capacity Labs

  16. “…nonroutine, challenging events, because these tough cases have the greatest potential for uncovering elements of expertise and related cognitive phenomena.” (Klein, Crandall, Hoffman, 2006) Methods, approaches, and techniques: cognitive task analysis, cognitive work analysis, process tracing, conversation analysis, Critical Decision Method, Critical Incident Technique, more…
  17. what we find when we study incidents

  18. logs, time of year, day of year, time of day, observations and hypotheses others share, what has been investigated thus far, what’s been happening in the world (news, service provider outages, etc.), time-series data, alerts, tracing/observability tools, recent changes in existing tech, new dependencies, who is on vacation, at a conference, traveling, etc., status of other ongoing work

  19. “Cues are not primitive events—they are constructions generated by people trying to understand situations. …cues are only ‘objective’ in a limited sense. …rather, the knowledge and expectancies a person has will determine what counts as a cue and whether it will be noticed.”
  20. None
  21. None
  22. DBA, 2 weeks on the job; Infra Engineer, 2.5 years; Network Engineer, 5 years; Product/App Engineer, 3 years; Security Engineer, 1 year
  23. None
  24. None
  25. multiple threads of activity, some productive, some unproductive: problem detection and identification, generating hypotheses, diagnostic actions, therapeutic actions, sacrifice decisions, coordinating, (re)planning, preparing for potential escalation/cascades
  26. time pressure high consequences

  27. this is not “debugging” or “troubleshooting”

  28. people will pursue what they think will be productive

  29. I mean I could ssh into one of the servers, and I might find something helpful by doing that…but… NO I REFUSE TO DO THAT BECAUSE I SHOULDN’T HAVE TO!!!
  30. None
  31. people will pursue what they think will be productive. who are these people? what roles do they play…actually? people for “fixing”…? for understanding? for ‘stemming the bleeding’? for customer support? for…? be productive via hypotheses? via past experience? via…? think
  32. what does this research look like?

  33. None
  34. None
  35. None
  36. Anomalous signals and representations; interventions and results; tentative, evolving, shared hypotheses; collective hypotheses ➝ plans acted on; line of certainty and commitment to action
  37. None
  38. Approaching Overload: Diagnosis and Response to Anomalies in Complex and Automated Production Software Systems, Marisa Grayson, Ohio State University
  39. monitoring/observability are inextricably coupled with other activities

  40. what can you do?

  41. Build your own internal resources to do incident analysis

  42. Are there any sources of data about the systems (logs, graphs, etc.) that people regularly dismiss or are suspicious of? How do people improvise new tools to help them understand what is happening? What tricks do people or teams use to understand how otherwise opaque 3rd-party services are behaving?
  43. Select a few incidents for closer and deeper analysis

  44. Build or adjust tooling to capture data streams of incidents and their handling
  45. Make company-wide postmortem sessions regular events

  46. Suggestions for vendors

  47. Hire and retain expertise to do qualitative research; “dogfooding” is not sufficient

  48. Research on supporting work in complex cognitive domains already exists! It will prove to be a competitive advantage for you.

  49. Summary • Understanding cognitive work in software engineering and operations is critically important. (The stakes are already too high, and we’re behind.) • Doing this well will mean new language, concepts, paradigms, and practices — some of which may be unintuitive and/or controversial. • Must be driven by both research/academia and industry/practitioners. • Vendors: if you pay attention, this will be a competitive advantage for you.

  50. “...irony that the more advanced a control system is, so the more crucial may be the contribution of the human operator.” –Lisanne Bainbridge, 1983, “Ironies of Automation”
  51. Thank You! @allspaw https://www.adaptivecapacitylabs.com/blog @AdaptiveCLabs