Slide 1

Slide 1 text

Taking Human Performance Seriously In Software
John Allspaw
Adaptive Capacity Labs

Slide 2

Slide 2 text

What You Are In For
• Grounded research on how engineers cope with complexity
• Incidents are the entry point for learning, but not in the way(s) you might think
• What this means for you, and potential directions you might take

Slide 3

Slide 3 text

about me

Slide 4

Slide 4 text

http://bitly.com/AllspawThesis

Slide 5

Slide 5 text

Human Factors Systems Safety Cognitive Systems Engineering Resilience Engineering

Slide 6

Slide 6 text

http://stella.report
Year-long project
Researchers analyzed 3 incidents, at:
Six themes
• Postmortems as re-calibration
• Blameless v. sanctionless after-action actions
• Controlling the costs of coordination
• Visualizations during anomaly management
• Strange Loops
• Dark Debt

Slide 7

Slide 7 text

Researchers Practitioners @UberGeekGirl @caseyrosenthal @nora_js @jpaulreed

Slide 8

Slide 8 text

Fundamental View

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

[Diagram labels: internally sourced code; externally sourced code (e.g. DB); technology stack; delivery; results; the using world]

Slide 11

Slide 11 text

[Diagram labels: macro descriptions; internally sourced code; externally sourced code (e.g. DB); technology stack; delivery; results; the using world]

Slide 12

Slide 12 text

[Diagram labels]
• code repositories; macro descriptions; testing/validation suites; meta rules; code; code stuff; scripts, rules, etc.; test cases
• code generating tools; testing tools; deploy tools; organization/encapsulation tools; “monitoring” tools
• internally sourced code; externally sourced code (e.g. DB); technology stack; delivery; results; the using world

Slide 13

Slide 13 text

[Same diagram as Slide 12, with partial labels: system, framing, doing]

Slide 14

Slide 14 text

[Diagram labels]
• Getting stuff ready to be part of the running system
• Adding stuff to the running system
• architectural & structural framing
• keeping track of what “the system” is doing
Tool labels (partly clipped): code, deploy, organization/, “monitoring”

Slide 15

Slide 15 text

[Diagram labels]
• Activities: Getting stuff ready to be part of the running system; Adding stuff to the running system; architectural & structural framing; keeping track of what “the system” is doing
• code repositories; macro descriptions; testing/validation suites; meta rules; code; code stuff; scripts, rules, etc.; test cases
• code generating tools; testing tools; deploy tools; organization/encapsulation tools; “monitoring” tools
• internally sourced code; externally sourced code (e.g. DB); technology stack; delivery; results; the using world
Captions:
• The Work Is Done Here
• Your Product Or Service
• The Stuff You Build and Maintain With

Slide 16

Slide 16 text

[Same diagram as Slide 15, without the captions]

Slide 17

Slide 17 text

[Diagram: “above the line” and “below the line”, separated by “the line of representation”]
What matters. Why what matters matters.
Above the line: individuals have unique models of the “system”
• goals, purposes, risks, cognition, actions, interactions, speech, gestures, clicks, signals, representations, artifacts
• observing, inferring, anticipating, planning, troubleshooting, diagnosing, correcting, modifying, reacting
Questions:
• What’s it doing? Why is it doing that? What does it mean?
• What is happening? What should be happening? What does it mean?
• How should this work? What needs to change? What does it mean?
Activities:
• Getting stuff ready to be part of the running system
• Adding stuff to the running system
• architectural & structural framing
• keeping track of what “the system” is doing
Below the line:
• code repositories; macro descriptions; testing/validation suites; meta rules; code; code stuff; scripts, rules, etc.; test cases
• code generating tools; testing tools; deploy tools; organization/encapsulation tools; “monitoring” tools
• internally sourced code; externally sourced code (e.g. DB); technology stack; delivery; results; the using world
Copyright © 2016 by R.I. Cook for ACL, all rights reserved. ack: Michael Angeles http://konigi.com/tools/

Slide 18

Slide 18 text

[Same diagram as Slide 17: “above the line” and “below the line”, separated by “the line of representation”]

Slide 19

Slide 19 text

[Partial view of the Slide 17 diagram; several labels are cut off at the slide edge]
What matters. Why what matters matters.
Questions:
• What’s it doing? Why is it doing that? What does it mean?
• What is happening? What should be happening? What does it mean?
• How should this work? What needs to change? What does it mean?
Activities:
• Getting stuff ready to be part of the running system
• Adding stuff to the running system
• architectural & structural framing
• keeping track of what “the system” is doing
observing, inferring, anticipating, planning, troubleshooting, diagnosing, correcting, modifying, reacting

Slide 20

Slide 20 text

[Same diagram as Slide 15, with added labels]
Time
things are changing here…
…and things are changing here

Slide 21

Slide 21 text

“above the line”
…is not “management”
…is not “organization design” or reporting structures
…is not “culture”
…is how people work (cognition)

Slide 22

Slide 22 text

How does our software work, really?
How does our software break, really?
What do we do to keep it all working?

Slide 23

Slide 23 text

We Study Incidents
• Incidents bring everyone’s attention to a specific place and time
• All incidents are surprises - they show us where our understanding isn’t accurate
• Incidents can have many influences on an organization, most of them invisible without digging for them deliberately

Slide 24

Slide 24 text

How We Think Incidents Unfold
Timeline (time axis): normal, detection, diagnosis, repair, normal

Slide 25

Slide 25 text

[Incident timeline]
• “normal”
• notice something is abnormal
• watch this thing - examine other things
• realize whatever is happening - it’s getting worse
• yep this is a big deal
• figure out what to do about it
• attempt to repair
• confirm it’s actually fixed

Slide 26

Slide 26 text

[Incident timeline]
• “normal”
• notice something is going critical
• come up with three plausible explanations for what’s happening (a, b, and c)
• test hypotheses, work out which it is
• confident that it’s “c” - start working on fixing “c”
• oops, wait, no - it’s “b” (pretty sure)
• working on “b”
• assess if the “c” fix caused any other issues
• confirm it’s actually stable

Slide 27

Slide 27 text

[Incident timeline]
• “normal”
• OMG this is serious
• figure out who to call for help, contact them, bring them up to speed
• figure out what to do, settle on doing X
• do X
• realize that doing X made things worse, figure out the fix is Y
• clean up damage done by doing X
• do Y
• confirm it’s actually stable

Slide 28

Slide 28 text

[The three incident timelines from Slides 25, 26, and 27, shown together]

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

dynamic fault management
• “detection” can be different and separate from “identification”
• hypothesis generation is dynamic and emergent
• interventions can have ambiguous or uncertain consequences

Slide 32

Slide 32 text

Far Reaching Influence of Incidents

Slide 33

Slide 33 text

medium/long-term business decisions
• product/service problem anticipation
• 3rd-party vendors and partner evaluation/viability
• varying engagement with external dependencies
• leadership/management “philosophy” changes
• external stakeholder or shareholder expectancies
• …

Slide 34

Slide 34 text

perceptions of risk (“what keeps you up at night?”)
• Staffing forecasts
• Product/project roadmap coordination/priority
• Budget forecasts
• …
C-Suite or Board
Leadership Team
Director/Middle
Hands-on practitioners
…
“top”-down, “bottom”-up

Slide 35

Slide 35 text

influences on velocity and robustness of change
• Agency of practitioners “on the ground”
• Unintentional vs intentional org and product change
• Local vs global adaptations
• Regulation/compliance, rules and norms
• Introduction of new/novel tech, architectures, languages
• “Legacy” migration, retirement, decommissioning
• Technical and “dark” debt
• Threats to esoteric but critical knowledge of systems behavior
• …

Slide 36

Slide 36 text

Implications For You
Food For Thought

Slide 37

Slide 37 text

• Consider treating incidents as unplanned investments.
• Avoid using shallow metrics (TTR, frequency, severity, etc.) as critically valuable data. People’s experiences represent the critical data.
• Collect the questions being asked in your post-incident review meetings. Are some better than others? What makes a good question? A bad question?
• Levels of “severity” != levels of terror/confusion/struggling for engineers. Find the shape of how these are different.
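One way to start finding where "severity" and struggle diverge is sketched below in Python. Everything in it is an assumption made for illustration: the `IncidentReview` record, both rating scales, the thresholds, and the three sample incidents are invented and do not come from the talk or from any real data. The output is only a pointer to which incidents to ask people about; it is not a substitute for their accounts.

```python
from dataclasses import dataclass


@dataclass
class IncidentReview:
    """One post-incident review record. All fields are hypothetical."""
    name: str
    severity: int    # assumed scale: 1 (most severe) to 4 (least severe)
    difficulty: int  # assumed scale, rated by responders: 1 (routine) to 5 (bewildering)


def mismatches(reviews, low_severity=3, high_difficulty=4):
    """Return names of incidents labeled minor that responders still found hard."""
    return [
        r.name
        for r in reviews
        if r.severity >= low_severity and r.difficulty >= high_difficulty
    ]


# Made-up records, for illustration only.
reviews = [
    IncidentReview("checkout outage", severity=1, difficulty=2),
    IncidentReview("stale cache entries", severity=4, difficulty=5),
    IncidentReview("slow report export", severity=3, difficulty=1),
]

print(mismatches(reviews))
```

In this made-up sample, the only incident flagged is the one labeled least severe, which is the kind of gap the last bullet asks you to look for.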

Slide 38

Slide 38 text

consider topics such as these
• How do people who are on different teams or have different domain expertise explain to each other what is familiar and important to watch for during an incident?
• Are there any sources of data about the systems (logs, graphs, etc.) that people regularly dismiss or are suspicious of?
• How do people improvise new tools to help them understand what is happening?
• How do people assess when they need to call for help?
• How does a new person on a team learn the nuances of systems behavior that aren’t documented?
• What do the elder/veteran engineers know about their systems that others don’t? What esoteric knowledge do they have, and how did they get it?
• What tricks do people or teams use to understand how otherwise opaque 3rd party services are behaving?
• Are there specific types of problems (or particular systems) that trigger engineers to call for assistance from other teams, and others that don’t?

Slide 39

Slide 39 text

STOP LOOKING FOR A ROOT CAUSE

Slide 40

Slide 40 text

HUMAN ERROR IS NOT A THING

Slide 41

Slide 41 text

END