
DevOpsDays Seattle 2018


Taking Human Performance Seriously In Software

John Allspaw

April 24, 2018


Transcript

  1. Taking Human Performance Seriously In Software John Allspaw Adaptive Capacity

    Labs
  2. What You Are In For • Grounded research on how

    engineers cope with complexity • Incidents are the entry point for learning, but not in the way(s) you might think. • What this means for you, and potential directions you might take
  3. about me

  4. http://bitly.com/AllspawThesis

  5. Human Factors Systems Safety Cognitive Systems Engineering Resilience Engineering

  6. http://stella.report A year-long project: researchers analyzed 3 incidents. Six themes:

    • Postmortems as re-calibration • Blameless vs. sanctionless after-action reviews • Controlling the costs of coordination • Visualizations during anomaly management • Strange Loops • Dark Debt
  7. Researchers Practitioners @UberGeekGirl @caseyrosenthal @nora_js @jpaulreed

  8. Fundamental View

  9. (image-only slide)
  10. Diagram: the running system. Externally sourced code (e.g. a DB) and internally sourced code make up the technology stack, which produces results delivered to “the using world.”
  11. Diagram: the same stack, with macro descriptions of the system layered above it.
  12. Diagram: the artifacts and tools around the stack. Code stuff: code repositories, macro descriptions, testing/validation suites, test cases, scripts and rules (“meta rules”). Tools: code generating tools, testing tools, deploy tools, organization/encapsulation tools, “monitoring” tools.
  13. Diagram: the tools grouped by purpose, framing the “system” and the “doing” around the stack.
  14. Diagram: what each tool family is for. Code generating and testing tools: getting stuff ready to be part of the running system. Deploy tools: adding stuff to the running system. Organization/encapsulation tools: architectural & structural framing. “Monitoring” tools: keeping track of what “the system” is doing.
  15. Diagram: the full view, annotated “The Work Is Done Here” and “Your Product Or Service: The Stuff You Build and Maintain With.”
  16. Diagram: the full tools-and-stack view, repeated without annotations.
  17. Diagram: “the line of representation” (Copyright © 2016 by R.I. Cook for ACL, all rights reserved; ack: Michael Angeles, http://konigi.com/tools/). Below the line sit the technology stack and the tools. Above the line sit people, with goals, purposes, risks, and cognition, each holding a unique model of “the system.” They reach the system only through representations and artifacts, via actions and interactions (speech, gestures, clicks, signals). Above-the-line questions: What’s it doing? Why is it doing that? What needs to change? How should this work? What is happening? What should be happening? What does it mean? Above-the-line activities: observing, inferring, anticipating, planning, troubleshooting, diagnosing, correcting, modifying, reacting. Caption: “What matters. Why what matters matters.”
  18. Diagram: the line-of-representation figure, repeated.
  19. Diagram: a close-up of the line of representation, emphasizing the above-the-line questions and activities.
  20. Diagram: the tools-and-stack view on a Time axis: “things are changing here… and things are changing here” (both above and below the line).
  21. “above the line” …is not “management” …is not “organization design”

    or reporting structures …is not “culture” …is how people work (cognition)
  22. How does our software work, really? How does our software

    break, really? What do we do to keep it all working?
  23. We Study Incidents • Incidents bring everyone’s attention to a

    specific place and time • All incidents are surprises - they show us where our understanding isn’t accurate • Incidents can have many influences on an organization, most of them invisible without digging for them deliberately
  24. How We Think Incidents Unfold: normal → detection → diagnosis → repair → normal (over time)
  25. “normal” → notice something is abnormal → watch this thing,

    examine other things → realize whatever is happening, it’s getting worse → yep, this is a big deal → figure out what to do about it → attempt to repair → confirm it’s actually fixed
  26. “normal” → notice something is going critical → come up

    with three plausible explanations for what’s happening (a, b, and c) → test hypotheses, work out which it is → confident that it’s “c”, start working on fixing “c” → assess if the “c” fix caused any other issues → oops, wait, no, it’s “b” (pretty sure) → working on “b” → confirm it’s actually stable
  27. “normal” → OMG this is serious → figure out who to

    call for help, contact them, bring them up to speed → figure out what to do, settle on doing X → do X → realize that doing X made things worse, figure out the fix is Y → do Y → clean up damage done by doing X → confirm it’s actually stable
  28. (Diagram: the three incident timelines from slides 25-27, shown side by side for comparison.)
  29. (image-only slide)
  30. (image-only slide)
  31. dynamic fault management • “detection” can be different and separate

    from “identification” • hypothesis generation is dynamic and emergent • interventions can have ambiguous or uncertain consequences
  32. Far-Reaching Influence of Incidents

  33. medium/long-term business decisions • product/service problem anticipation • 3rd-party vendors

    and partner evaluation/viability • varying engagement with external dependencies • leadership/management “philosophy” changes • external stakeholder or shareholder expectancies • …
  34. • Perceptions of risk (“what keeps you up at night?”)

    • Staffing forecasts • Product/project roadmap coordination/priority • Budget forecasts • … (Diagram: these concerns flow both “top”-down and “bottom”-up across C-Suite or Board, Leadership Team, Director/Middle, and Hands-on practitioners.)
  35. influences on velocity and robustness of change • Agency of

    practitioners “on the ground” • Unintentional vs intentional org and product change • Local vs global adaptations • Regulation/compliance, rules and norms • Introduction of new/novel tech, architectures, languages • “Legacy” migration, retirement, decommissioning • Technical and “dark” debt • Threats to esoteric but critical knowledge of systems behavior • …
  36. Implications For You Food For Thought

  37. • Consider treating incidents as unplanned investments. • Avoid using

    shallow metrics (TTR, frequency, severity, etc.) as critically valuable data. People’s experiences represent the critical data. • Collect the questions being asked in your post-incident review meetings. Are some better than others? What makes a good question? A bad question? • Levels of “severity” != levels of terror/confusion/struggling for engineers. Find the shape of how these are different.
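The caution about shallow metrics in slide 37 can be made concrete: time-to-restore distributions tend to be heavy-tailed, so any single summary number hides more than it reveals. A minimal sketch with invented numbers (not data from the talk):

```python
import statistics

# Hypothetical time-to-restore values (minutes) for ten incidents.
# The values are made up purely for illustration.
ttr_minutes = [12, 8, 15, 9, 11, 14, 10, 13, 240, 480]

mean_ttr = statistics.mean(ttr_minutes)      # dragged upward by two long incidents
median_ttr = statistics.median(ttr_minutes)  # describes the "typical" incident

print(f"mean TTR:   {mean_ttr:.1f} min")    # 81.2 min
print(f"median TTR: {median_ttr:.1f} min")  # 12.5 min
```

The mean and the median disagree by more than a factor of six, and neither says anything about what responders actually experienced during those two long incidents, which is the slide's point: the people's experiences, not the compressed number, carry the critical data.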
  38. consider topics such as these • How do people who are on

    different teams or who have different domain expertise explain to each other what is familiar and important to watch for during an incident? • Are there any sources of data about the systems (logs, graphs, etc.) that people regularly dismiss or are suspicious of? • How do people improvise new tools to help them understand what is happening? • How do people assess when they need to call for help? • How does a new person on a team learn the nuances of system behavior that aren’t documented? • What do the elder/veteran engineers know about their systems that others don’t? What esoteric knowledge do they have, and how did they get it? • What tricks do people or teams use to understand how otherwise opaque 3rd-party services are behaving? • Are there specific types of problems (or particular systems) that trigger engineers to call for assistance from other teams, and others that don’t?
  39. STOP LOOKING FOR A ROOT CAUSE

  40. HUMAN ERROR IS NOT A THING

  41. END