Slide 1

Slide 1 text

Poised To Adapt: Continuous Delivery’s Relationship to Resilience Engineering
John Allspaw
Principal, Adaptive Capacity Labs

Slide 2

Slide 2 text

example #1 rm -rf $PATHNAME

Slide 3

Slide 3 text

example #2
index.html: Showing 1 changed file with 1 addition and 1 deletion.
@@ -1,2 +1,2 @@

Slide 4

Slide 4 text

all work is contextual

Slide 5

Slide 5 text

Adaptive Capacity Labs

Slide 6

Slide 6 text

Velocity Conference 2009

Slide 7

Slide 7 text

http://bitly.com/AllspawThesis

Slide 8

Slide 8 text

http://stella.report
Year-long project
Researchers analyzed 3 incidents, at:
Six themes
• Postmortems as re-calibration
• Blameless v. sanctionless after action actions
• Controlling the costs of coordination
• Visualizations during anomaly management
• Strange Loops
• Dark Debt

Slide 9

Slide 9 text

What You Are In For
1. Resilience Engineering: a field and a community
2. A cognitive systems perspective of the CD/CI community
3. Poise/potential/capacity to adapt
4. Some (hopefully) thought-provoking questions

Slide 10

Slide 10 text

Resilience Engineering
• A field of study that emerged largely from Cognitive Systems Engineering, early 2000s.
• 7 symposia over 12 years

Slide 11

Slide 11 text

Resilience Engineering Community
is largely made up of practitioners and researchers from…
Human Factors & Ergonomics, Cognitive Systems Engineering, Cybernetics, Complexity Science, Engineering*, Psychology, Sociology, Ecology, Safety Science
…working in these domains…
Aviation/ATM, Rail, Maritime, Space, Surgery, Power Plants, Intelligence Agencies, Law Enforcement, Mining, Construction, Explosives, Firefighting, Anesthesia, Pediatrics, Power Grid & Distribution, Military Agencies, Software Engineering

Slide 12

Slide 12 text

Some of the cast of characters
• David Woods, CSEL/OSU
• Shawna Perry, Univ of Florida Emergency Medicine
• Dr. Richard Cook, Anesthesiologist, Researcher
• Ivonne Andrade Herrera, SINTEF
• Erik Hollnagel, Univ of S. Denmark
• Anne-Sophie Nyssen, University de Liege
• Johan Bergström, Lund University
• Sidney Dekker, Griffith University
• Asher Balkin, CSEL/OSU
• Laura Maguire, CSEL/OSU

Slide 13

Slide 13 text

Sample of Research
• Experiences in Fukushima Dai-ichi nuclear power plant in light of resilience engineering
• Unmanned Aircraft Systems in (Inter)national Airspace: Resilience as a Lever in the Debate
• Sociotechnical Networks for Power Grid Resilience: South Korean Case Study
• Limits on adaptation: Modeling Resilience and Brittleness in Hospital Emergency Departments

Slide 14

Slide 14 text

Books

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

Diagram labels: delivery technology stack (internally sourced code; externally sourced code (e.g. DB); results); the using world.

Slide 17

Slide 17 text

Diagram labels: macro descriptions; delivery technology stack (internally sourced code; externally sourced code (e.g. DB); results); the using world.

Slide 18

Slide 18 text

Diagram labels: code repositories; macro descriptions; testing/validation suites; code; code stuff; meta rules; scripts, rules, etc.; test cases; code generating tools; testing tools; deploy tools; organization/encapsulation tools; “monitoring” tools; delivery technology stack (internally sourced code; externally sourced code (e.g. DB); results); the using world.

Slide 19

Slide 19 text

Diagram labels: code repositories; macro descriptions; testing/validation suites; code; code stuff; meta rules; scripts, rules, etc.; test cases; code generating tools; testing tools; deploy tools; organization/encapsulation tools; “monitoring” tools; delivery technology stack (internally sourced code; externally sourced code (e.g. DB); results); the using world. Partially revealed labels: system; framing; doing.

Slide 20

Slide 20 text

Diagram labels: Getting stuff ready to be part of the running system; Adding stuff to the running system; architectural & structural framing; keeping track of what “the system” is doing. Partially shown tool labels: code; deploy; organization/; “monitoring”.

Slide 21

Slide 21 text

Diagram labels: code repositories; macro descriptions; testing/validation suites; code; code stuff; meta rules; scripts, rules, etc.; test cases; code generating tools; testing tools; deploy tools; organization/encapsulation tools; “monitoring” tools; Getting stuff ready to be part of the running system; Adding stuff to the running system; architectural & structural framing; keeping track of what “the system” is doing; delivery technology stack (internally sourced code; externally sourced code (e.g. DB); results); the using world.

Annotations: The Work Is Done Here; Your Product Or Service; The Stuff You Build and Maintain With

Slide 22

Slide 22 text

Diagram labels: code repositories; macro descriptions; testing/validation suites; code; code stuff; meta rules; scripts, rules, etc.; test cases; code generating tools; testing tools; deploy tools; organization/encapsulation tools; “monitoring” tools; Getting stuff ready to be part of the running system; Adding stuff to the running system; architectural & structural framing; keeping track of what “the system” is doing; delivery technology stack (internally sourced code; externally sourced code (e.g. DB); results); the using world.

Slide 23

Slide 23 text

Diagram: above the line / below the line
Copyright © 2016 by R.I. Cook for ACL, all rights reserved
ack: Michael Angeles http://konigi.com/tools/

Below the line: code repositories; macro descriptions; testing/validation suites; code; code stuff; meta rules; scripts, rules, etc.; test cases; code generating tools; testing tools; deploy tools; organization/encapsulation tools; “monitoring” tools; Getting stuff ready to be part of the running system; Adding stuff to the running system; architectural & structural framing; keeping track of what “the system” is doing; delivery technology stack (internally sourced code; externally sourced code (e.g. DB); results); the using world.

The line of representation

Above the line: individuals have unique models of the “system”; goals, purposes, risks; cognition, actions, interactions; speech, gestures, clicks, signals, representations, artifacts.

What matters. Why what matters matters.

Questions: Why is it doing that? What needs to change? What does it mean? How should this work? What’s it doing? What is happening? What should be happening?

Activities: observing, inferring, anticipating, planning, troubleshooting, diagnosing, correcting, modifying, reacting

Slide 24

Slide 24 text

Diagram: above the line / below the line

Below the line: code repositories; macro descriptions; testing/validation suites; code; code stuff; meta rules; scripts, rules, etc.; test cases; code generating tools; testing tools; deploy tools; organization/encapsulation tools; “monitoring” tools; Getting stuff ready to be part of the running system; Adding stuff to the running system; architectural & structural framing; keeping track of what “the system” is doing; delivery technology stack (internally sourced code; externally sourced code (e.g. DB); results); the using world.

The line of representation

Above the line: individuals have unique models of the “system”; goals, purposes, risks; cognition, actions, interactions; speech, gestures, clicks, signals, representations, artifacts.

What matters. Why what matters matters.

Questions: Why is it doing that? What needs to change? What does it mean? How should this work? What’s it doing? What is happening? What should be happening?

Activities: observing, inferring, anticipating, planning, troubleshooting, diagnosing, correcting, modifying, reacting

Slide 25

Slide 25 text

Diagram: zoomed-in detail around the line of representation (labels at the edges are cropped)

What matters. Why what matters matters.

Tool labels: code; deploy; organization/encapsulation; “monitoring”

Questions: Why is it doing that? What needs to change? What does it mean? How should this work? What’s it doing? What is happening? What should be happening?

Uses: Getting stuff ready to be part of the running system; Adding stuff to the running system; architectural & structural framing; keeping track of what “the system” is doing

Activities: observing, inferring, anticipating, planning, troubleshooting, diagnosing, correcting, modifying, reacting

Slide 26

Slide 26 text

Diagram labels: code repositories; macro descriptions; testing/validation suites; code; code stuff; meta rules; scripts, rules, etc.; test cases; code generating tools; testing tools; deploy tools; organization/encapsulation tools; “monitoring” tools; Getting stuff ready to be part of the running system; Adding stuff to the running system; architectural & structural framing; keeping track of what “the system” is doing; delivery technology stack (internally sourced code; externally sourced code (e.g. DB); results); the using world.

Annotations: Time; things are changing here… …and things are changing here

Slide 27

Slide 27 text

Resilience is something that a system does, not what a system has.

Slide 28

Slide 28 text

“Resilience is an expression of how people, alone or together, cope with everyday situations – large and small – by adjusting their performance to the conditions. An organisation’s performance is resilient if it can function as required under expected and unexpected conditions alike (changes/disturbances/opportunities).” Hollnagel, Erik. Safety-II in Practice: Developing the Resilience Potentials

Slide 29

Slide 29 text

–David Woods (2015) “Resilience is sustained adaptive capacity.”

Slide 30

Slide 30 text

Resilience is the story of the outage that didn’t happen.

Slide 31

Slide 31 text

Continuous Delivery?

Slide 32

Slide 32 text

small and frequent? risk? shorten lead time?

Slide 33

Slide 33 text

#include <stdio.h>
int main(void)
{
    printf("Hello World\n");
    return 0;
}

6 lines, 79 characters

tr G-t F-s<<

Slide 34

Slide 34 text

• canaries are different than feature flags
• feature flags are different than ramp-ups
• ramp-ups are different than db schema changes
• network, hardware, “front-end”, “fault injection/chaos”
• etc. etc. etc.
a change is not a change is not a change
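One way to see why a feature flag and a ramp-up are different kinds of change is to sketch both in a few lines of Python. This is only an illustration under assumptions: the function names (flag_enabled, ramp_enabled, bucket), the hashing scheme, and the feature names are hypothetical and do not come from the talk or from any particular tool.

```python
import hashlib


def bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given feature.

    Hypothetical scheme: hash the feature and user id together so that the
    same user always lands in the same bucket for that feature.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def flag_enabled(flags: dict, feature: str) -> bool:
    """Feature flag: one switch, the same answer for every user."""
    return bool(flags.get(feature, False))


def ramp_enabled(user_id: str, feature: str, percent: int) -> bool:
    """Ramp-up: on for roughly `percent` of users, stable per user."""
    return bucket(user_id, feature) < percent


if __name__ == "__main__":
    flags = {"new_checkout": True}  # hypothetical flag store
    print(flag_enabled(flags, "new_checkout"))
    print(ramp_enabled("user-42", "new_checkout", 10))
```

Even in this toy form the two differ in what can go wrong and in what an operator has to watch: flipping the flag changes behavior for everyone at once, while raising the percentage changes it for a growing subset. A schema change or a canary would look different again.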

Slide 35

Slide 35 text

Hints that the situation is more complicated than usually described…
• Deploys get different evaluations based on their perceived risk.
• Freedom to deploy is sometimes restricted.
• Risk hedging is common.
• Conventions are everywhere.
Dr. Richard Cook, Velocity Conf 2016, Santa Clara, CA

Slide 36

Slide 36 text

all work is contextual

Slide 37

Slide 37 text

Diagram labels: code repositories; macro descriptions; testing/validation suites; code; code stuff; meta rules; scripts, rules, etc.; test cases; code generating tools; testing tools; deploy tools; organization/encapsulation tools; “monitoring” tools; Getting stuff ready to be part of the running system; Adding stuff to the running system; architectural & structural framing; keeping track of what “the system” is doing; delivery technology stack (internally sourced code; externally sourced code (e.g. DB); results); the using world.

Annotation: Context Is Constructed Here

Slide 38

Slide 38 text

Poised To Deploy
1. Knowing what the platform is supposed to do
2. Knowing how the platform works
3. What the platform’s behavior means
4. Being able to devise a change that addresses 1, 2, & 3
5. Being able to predict the effects of that change
6. Being able to force the platform to change in that way
7. Being prepared to deal with the consequences
Dr. Richard Cook, Velocity Conf 2016, Santa Clara, CA

Slide 39

Slide 39 text

Diagram labels: Known; Unknown; ∞; Stuff you can test
complete testing is impossible
credit: Noah Sussman

Slide 40

Slide 40 text

How does our software work, really? How does our software break, really? What do we do to keep it all working?

Slide 41

Slide 41 text

“it depends”

Slide 42

Slide 42 text

how to discover what happens “above the line”?

Slide 43

Slide 43 text

incidents (outages, degradations, breaches, accidents, near-misses, “glitches”, untoward/unexpected events, etc.)

Slide 44

Slide 44 text

what makes incidents interesting & valuable?

Slide 45

Slide 45 text

Diagram: above the line / below the line. Labels: code repositories; macro descriptions; testing/validation suites; code; code stuff; meta rules; scripts, rules, etc.; test cases; code generating tools; testing tools; deploy tools; organization/encapsulation tools; “monitoring” tools; delivery technology stack (internally sourced code; externally sourced code (e.g. DB); results); the using world.

incidents as… drivers of software design
- “incidents of yesterday inform the architectures of tomorrow”
- incidents “below the line” drive changes “above the line”
- staffing, budgets, planning, roadmaps, etc.
- shape the design of new components, subsystems, architectures

Slide 46

Slide 46 text

incidents as… motivators for policy

5/6/2010 - “Flash Crash” - loss of $1 trillion in market value in <10min
3/23/2012 - BATS IPO - systems issue halted the exchange’s own IPO
5/23/2012 - Facebook IPO - systems issue delayed IPO trading
8/1/2012 - Knight Capital - $461 million in 45 minutes
“Regulation SCI”

PCI-DSS: 1988-1998, Visa and MasterCard reported credit card losses due to fraud of $750 million

- tend also to give birth to new forms of regulations, policies, norms, compliance requirements, explosion of documentation, auditing, constraints, etc.
- “incidents of yesterday inform the rules of tomorrow”
- influence staffing, budgets, planning, roadmaps, etc.

Slide 47

Slide 47 text

incidents tend to focus our attention on what matters

Slide 48

Slide 48 text

incidents help us gauge the delta (Δ) between
how the system works
how we think the system works
Δ: almost always greater than we imagine

Slide 49

Slide 49 text

What is it doing?! Why is it doing that?! What will it do next? How did it get into this state? WTF is happening? If we do Y, will it help us figure out what to do? Is it getting worse? It looks like it’s fixed…but is it…? If we do X, will it prevent it from getting worse…or make it worse? Who else should we call that can help us? Is this OUR issue, or are we BEING ATTACKED?!

Slide 50

Slide 50 text

research validates these opportunities

“…nonroutine, challenging events, because these tough cases have the greatest potential for uncovering elements of expertise and related cognitive phenomena.” (Klein, Crandall, Hoffman, 2006)

A family of well-worn methods, approaches, and techniques
• Cognitive task/work analysis
• Process tracing
• Conversation analysis
• Critical decision method
• Critical incident technique
• more…

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

Timeline: one incident, from start to resolve; label: 54 minutes

Slide 53

Slide 53 text

Timeline: one incident, marked start, detect, resolve; labels: 12 minutes, 54 minutes

Slide 54

Slide 54 text

Timeline: two incidents, each marked start, detect, resolve; labels: 12 minutes, 54 minutes; 20 minutes, 73 minutes

Slide 55

Slide 55 text

Timeline: three incidents, each marked start, detect, resolve; labels: 12 minutes, 54 minutes; 20 minutes, 73 minutes; 5, 25 minutes

Slide 56

Slide 56 text

Timeline: four incidents, each marked start, detect, resolve; labels: 12 minutes, 54 minutes; 20 minutes, 73 minutes; 5, 25 minutes; 135 minutes, 100 minutes

Slide 57

Slide 57 text

Timeline: four incidents, each marked start, detect, resolve; labels: 12 minutes, 54 minutes; 20 minutes, 73 minutes; 5, 25 minutes; 135 minutes, 100 minutes; axis: minutes

Slide 58

Slide 58 text

Chart: incidents; axis: minutes

Slide 59

Slide 59 text

Chart: incidents; axes: minutes, month (jan to jun)

Slide 60

Slide 60 text

Chart: incidents; axes: minutes, month (jan to jun)

Slide 61

Slide 61 text

Chart: incidents; axes: minutes, month (jan to jun)

Slide 62

Slide 62 text

Chart: incidents; axes: minutes, month (jan to jun)

Slide 63

Slide 63 text

“Resilience is an expression of how people, alone or together, cope with everyday situations – large and small – by adjusting their performance to the conditions. An organisation’s performance is resilient if it can function as required under expected and unexpected conditions alike (changes/disturbances/opportunities).”

Slide 64

Slide 64 text

“Resilience is sustained adaptive capacity.”

Slide 65

Slide 65 text

Chart: incidents; axes: minutes, month (jan to jun)

Slide 66

Slide 66 text

What is it doing?! Why is it doing that?! What will it do next? How did it get into this state? WTF is happening? If we do Y, will it help us figure out what to do? Is it getting worse? It looks like it’s fixed…but is it…? If we do X, will it prevent it from getting worse…or make it worse? Who else should we call that can help us? Is this OUR issue, or are we BEING ATTACKED?!

Slide 67

Slide 67 text

incidents provide calibration about…
• how decisions are focused
• how attention flows
• how work is coordinated
• how escalation manifests
• the weight of time pressure
• the effects of uncertainty
• the impact of ambiguity
• what consequences are consequential

Slide 68

Slide 68 text

What can we learn about these…
• how decisions are focused
• how attention flows
• how work is coordinated
• how escalation manifests
• the weight of time pressure
• the effects of uncertainty
• the impact of ambiguity
• what consequences are consequential

…from these?
• (M)TTR?
• (M)TTD?
• Frequency of incidents?
• Severity of incidents?
• Customer impact?
• Number of deploys?

“…while there is value in the items on the right, we value the items on the left more.”
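For contrast, the (M)TTD and (M)TTR figures are plain arithmetic over a couple of timestamps per incident. The Python sketch below is a hedged illustration: the (detect, resolve) pairs are made-up values in minutes that loosely echo the timeline slides, and the function names are hypothetical.

```python
from statistics import mean

# Hypothetical incident records: (minutes from start to detect,
# minutes from start to resolve). Illustrative values only.
INCIDENTS = [
    (12, 54),
    (20, 73),
    (5, 25),
]


def mean_time_to_detect(records):
    """Average of the start-to-detect durations, in minutes."""
    return mean(detect for detect, _ in records)


def mean_time_to_resolve(records):
    """Average of the start-to-resolve durations, in minutes."""
    return mean(resolve for _, resolve in records)


if __name__ == "__main__":
    print(f"MTTD: {mean_time_to_detect(INCIDENTS):.1f} minutes")
    print(f"MTTR: {mean_time_to_resolve(INCIDENTS):.1f} minutes")
```

Two numbers per incident go in and two averages come out. Nothing in the calculation records how attention flowed, how work was coordinated, or what the responders were uncertain about, which is why these measures say so little about the first list.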

Slide 69

Slide 69 text

Thought Food
• We cannot comprehensively understand how our systems behave - we continually build and revise our understandings based on (relatively sparse) signals our tech sends us.
• Applying CD approaches/techniques is an implicit acknowledgement of this sparse and fleeting understanding, and these approaches represent coping strategies for this state of affairs.
• Activities “above the line” are basically unexplored or ignored in our industry, and this needs to change.

Slide 70

Slide 70 text

Taking Human Performance Seriously
This discussion is happening…
• in nuclear power
• in air traffic control
• in firefighting
• in medicine
• …
We need to do more than just acknowledge this - we need to embrace it.