Taking Human Performance Seriously

John Allspaw
June 03, 2019
Monitorama PDX 2019

Transcript

  1. Taking Human Performance
    Seriously
    John Allspaw (@allspaw)
    Adaptive Capacity Labs (@adaptiveclabs)

  2. previously, on Allspaw Speaks At Monitorama…

  4. observability
    alerts
    monitoring
    tracing
    logs
    telemetry
    metrics

  5. observability
    alerts
    monitoring
    tracing
    logs
    telemetry
    metrics
    coordinating
    anticipating
    inferring
    diagnosing
    planning
    modifying
    reacting
    correcting

  6. some context

  7. beliefs about safety (1940s-1970s)
    • Safety can be encoded in the design of technology.
    • Accidents can be avoided by having more automation.
    • Procedures can be specified to be objective and comprehensive.
    • Operators just have to follow the procedures to get work done.
    • “Humans Are Better At” versus “Machines Are Better At” List (HABA-MABA)

  8. March 28, 1979

  10. new beliefs about safety, post-TMI
    • Automation is necessary in modern systems, and also introduces new
    forms of challenges and risk.
    • Rules and procedures are always underspecified, so they can’t
    guarantee safety by themselves; they have to be interpreted in local
    context.
    • Events in these environments will require operators to make decisions
    and take action that cannot be pre-specified.
    • The methods and models for “risk” that rely on “human error” categories,
    accounting, taxonomies, etc. are fraught.

  11. What we thought we knew about human
    contributions to successful work in complex
    domains was wrong.

  13. By “human” performance,
    we mean cognitive performance.

  14. We study cognitive work by
    studying incidents
    time pressure
    high (or potentially increasing) consequences
    uncertainty
    ambiguity

  15. who is “we”?
    Resilience In Business-Critical Digital Services Consortium
    Adaptive Capacity Labs

  16. “…nonroutine, challenging events, because these tough cases have the
    greatest potential for uncovering elements of expertise and related
    cognitive phenomena.” (Klein, Crandall, Hoffman, 2006)
    methods, approaches, and techniques
    cognitive task analysis
    cognitive work analysis
    process tracing
    conversation analysis
    Critical Decision Method
    Critical Incident Technique
    more…

  17. what we find when we study
    incidents

  18. logs
    time of year
    day of year
    time of day
    observations and hypotheses others share
    what has been investigated thus far
    what’s been happening in the world (news, service provider outages, etc.)
    time-series data
    alerts
    tracing/observability tools
    recent changes in existing tech
    new dependencies
    who is on vacation, at a conference, traveling, etc.
    status of other ongoing work

  19. “Cues are not primitive events—they are constructions generated by
    people trying to understand situations.
    …cues are only ‘objective’ in a limited sense.
    …rather, the knowledge and expectancies a person has will
    determine what counts as a cue and whether it will be noticed.”

  22. DBA: 2 weeks on the job
    Infra Engineer: 2.5 years
    Network Engineer: 5 years
    Product/App Engineer: 3 years
    Security Engineer: 1 year

  25. - problem detection and identification
    - generating hypotheses
    - diagnostic actions
    - therapeutic actions
    - sacrifice decisions
    - coordinating
    - (re) planning
    - preparing for potential escalation/cascades
    multiple threads of activity
    some productive
    some unproductive

  26. time pressure
    high consequences

  27. this is not
    “debugging”
    “troubleshooting”

  28. people will pursue what they
    think will be productive

  29. I mean I could ssh into one of the servers,
    and I might find something helpful by doing
    that…but…
    NO I REFUSE TO DO THAT BECAUSE I
    SHOULDN’T HAVE TO!!!

  31. people will pursue what they
    think will be productive
    people: who are these people? what roles do they play…actually?
    be productive: for “fixing”…? for understanding? for ‘stemming the bleeding’? for customer support? for…?
    think: via hypotheses? via past experience? via…?

  32. what does this research look
    like?

  36. Anomalous signals and representations
    Interventions and results
    Tentative, evolving, shared hypotheses
    Collective hypotheses ➝ plans acted on
    line of certainty and commitment to action

  38. Approaching Overload:
    Diagnosis and Response to Anomalies in
    Complex and Automated Production Software Systems
    Marisa Grayson
    Ohio State University

  39. monitoring/observability
    are inextricably coupled with other activities

  40. what can you do?

  41. Build your own internal
    resources to do incident analysis

  42. Are there any sources of data about the
    systems (logs, graphs, etc.) that people
    regularly dismiss or are suspicious of?
    How do people improvise new
    tools to help them understand
    what is happening?
    What tricks do people or teams use to
    understand how otherwise opaque 3rd
    party services are behaving?

  43. Select a few incidents for closer and
    deeper analysis

  44. Build or adjust tooling to capture data
    streams of incidents and their handling

  45. Make company-wide postmortem
    sessions regular events

  46. Suggestions for vendors

  47. Hire and retain expertise to do
    qualitative research
    “dogfooding” is not sufficient

  48. Research on supporting work in
    complex cognitive domains already
    exists!
    It will prove to be a competitive
    advantage for you.

  49. Summary
    • Understanding cognitive work in software engineering and operations is
    critically important. (The stakes are already too high, and we’re behind.)
    • Doing this well will mean new language, concepts, paradigms, and
    practices — some of which may be unintuitive and/or controversial.
    • Must be driven by both research/academia and industry/practitioners.
    • Vendors: if you pay attention, this will be a competitive advantage for you.

  50. “...irony that the more advanced a control system is,
    so the more crucial may be the contribution of the
    human operator.”
    –Lisanne Bainbridge, 1983 “Ironies of Automation”

  51. Thank You!
    @allspaw
    https://www.adaptivecapacitylabs.com/blog
    @AdaptiveCLabs
