DevOpsDays Seattle 2018

Taking Human Performance Seriously In Software

John Allspaw

April 24, 2018

Transcript

  1. What You Are In For
    • Grounded research on how engineers cope with complexity
    • Incidents are the entry point for learning, but not in the way(s) you might think
    • What this means for you, and potential directions you might take
  2. http://stella.report
    Year-long project
    Researchers analyzed 3 incidents, at:
    Six themes
    • Postmortems as re-calibration
    • Blameless v. sanctionless after-action actions
    • Controlling the costs of coordination
    • Visualizations during anomaly management
    • Strange Loops
    • Dark Debt
  3. [Diagram labels: internally sourced code; externally sourced code (e.g. DB); results; delivery technology stack; the using world]
  4. [Diagram: labels from slide 3, plus: macro descriptions]
  5. [Diagram labels: code repositories; macro descriptions; testing/validation suites; code; code stuff; meta rules; scripts, rules, etc.; test cases; code generating tools; testing tools; deploy tools; organization/encapsulation tools; “monitoring” tools; externally sourced code (e.g. DB); internally sourced code; results; delivery technology stack; the using world]
  6. [Diagram: labels from slide 5, with a second group of labels partially revealed: code generating tools; testing tools; deploy tools; organization/encapsulation tools; “monitoring” tools; and the fragments “system”, “framing”, “doing”]
  7. [Diagram detail: code, deploy, organization/encapsulation, and “monitoring” tools, grouped under the labels:]
    • Adding stuff to the running system
    • Getting stuff ready to be part of the running system
    • architectural & structural framing
    • keeping track of what “the system” is doing
  8. [Diagram: system labels from slides 5 and 7, annotated with:]
    • The Work Is Done Here
    • Your Product Or Service
    • The Stuff You Build and Maintain With
  9. [Diagram: system labels from slides 5 and 7, without annotations]
  10. [Diagram: “above the line / below the line”. Copyright © 2016 by R.I. Cook for ACL, all rights reserved; ack: Michael Angeles http://konigi.com/tools/]
    Regions: above the line; below the line; the line of representation
    System labels: as on slides 5 and 7
    Other labels: What matters. Why what matters matters.; goals, purposes, risks; cognition, actions, interactions; speech, gestures, clicks, signals; representations, artifacts
    Questions: Why is it doing that? What needs to change? What does it mean? How should this work? What’s it doing? What is happening? What should be happening?
    Activities: observing, inferring, anticipating, planning, troubleshooting, diagnosing, correcting, modifying, reacting
    Caption: individuals have unique models of the “system”
  11. [Diagram: same “above the line / below the line” diagram and labels as slide 10]
  12. [Diagram: zoomed view of the “above the line” region of the slide 10 diagram; some labels are cropped at the slide edge]
    Labels: What matters. Why what matters matters.
    Questions: Why is it doing that? What needs to change? What does it mean? How should this work? What’s it doing? What is happening? What should be happening?
    Activities: observing, inferring, anticipating, planning, troubleshooting, diagnosing, correcting, modifying, reacting
  13. [Diagram: system labels from slides 5 and 7, annotated with:]
    • Time
    • …and things are changing here
    • things are changing here…
  14. “above the line”
    …is not “management”
    …is not “organization design” or reporting structures
    …is not “culture”
    …is how people work (cognition)
  15. How does our software work, really?
    How does our software break, really?
    What do we do to keep it all working?
  16. We Study Incidents
    • Incidents bring everyone’s attention to a specific place and time
    • All incidents are surprises - they show us where our understanding isn’t accurate
    • Incidents can have many influences on an organization, most of them invisible without digging for them deliberately
  17. “normal”
    yep this is a big deal
    notice something is abnormal
    watch this thing - examine other things - realize whatever is happening, it’s getting worse
    figure out what to do about it
    attempt to repair
    confirm it’s actually fixed
  18. “normal”
    notice something is going critical
    test hypotheses, work out which it is
    come up with three plausible explanations for what’s happening (a, b, and c)
    oops, wait, no, it’s “b” (pretty sure)
    confident that it’s “c” - start working on fixing “c”
    working on “b”
    assess if the “c” fix caused any other issues
    confirm it’s actually stable
  19. “normal”
    OMG this is serious
    confirm it’s actually stable
    figure out who to call for help, contact them, bring them up to speed
    figure out what to do, settle on doing X
    do X
    realize that doing X made things worse, figure out the fix is Y
    clean up damage done by doing X
    do Y
  20. [Diagram: the three incident timelines from slides 17, 18, and 19 shown together]
  21. dynamic fault management
    • “detection” can be different and separate from “identification”
    • hypothesis generation is dynamic and emergent
    • interventions can have ambiguous or uncertain consequences
  22. medium/long-term business decisions
    • product/service problem anticipation
    • 3rd-party vendors and partner evaluation/viability
    • varying engagement with external dependencies
    • leadership/management “philosophy” changes
    • external stakeholder or shareholder expectancies
    • …
  23. perceptions of risk (“what keeps you up at night?”)
    • Staffing forecasts
    • Product/project roadmap coordination/priority
    • Budget forecasts
    • …
    [Diagram labels: C-Suite or Board; Leadership Team; Director/Middle; Hands-on practitioners; “top”-down; “bottom”-up]
  24. influences on velocity and robustness of change
    • Agency of practitioners “on the ground”
    • Unintentional vs intentional org and product change
    • Local vs global adaptations
    • Regulation/compliance, rules and norms
    • Introduction of new/novel tech, architectures, languages
    • “Legacy” migration, retirement, decommissioning
    • Technical and “dark” debt
    • Threats to esoteric but critical knowledge of systems behavior
    • …
  25. • Consider treating incidents as unplanned investments.
    • Avoid using shallow metrics (TTR, frequency, severity, etc.) as critically valuable data. People’s experiences represent the critical data.
    • Collect the questions being asked in your post-incident review meetings. Are some better than others? What makes a good question? A bad question?
    • Levels of “severity” != levels of terror/confusion/struggling for engineers. Find the shape of how these are different.
  26. consider topics such as these
    • How do people on different teams, or with different domain expertise, explain to each other what is familiar and important to watch for during an incident?
    • Are there any sources of data about the systems (logs, graphs, etc.) that people regularly dismiss or are suspicious of?
    • How do people improvise new tools to help them understand what is happening?
    • How do people assess when they need to call for help?
    • How does a new person on a team learn the nuances of systems behavior that aren’t documented?
    • What do the elder/veteran engineers know about their systems that others don’t? What esoteric knowledge do they have, and how did they get it?
    • What tricks do people or teams use to understand how otherwise opaque 3rd party services are behaving?
    • Are there specific types of problems (or particular systems) that trigger engineers to call for assistance from other teams, and others that don’t?
  27. END