Slide 1

Slide 1 text

WHAT IT IS
WHAT IT IS NOT
HOW AND WHY IT MATTERS
John Allspaw
Adaptive Capacity Labs

Slide 2

Slide 2 text

me
2009 Velocity Conf
Consortium for Resilient Internet-Facing Business IT
Adaptive Capacity Labs

Slide 4

Slide 4 text

Bottom Line, Up Front
• Resilience Engineering is a nascent field aiming to create and sustain conditions where resilience can manifest productively.
• Resilience is something a system (your organization, not your software) does, not what it has.
• Resilience is sustained adaptive capacity, or continuous adaptability to unforeseen situations.
• Our world (software) has opportunities to further the state of the field, but faces real challenges.

Slide 5

Slide 5 text

RESILIENCE ENGINEERING
• field
• community
• practice
“resilience” ?

Slide 6

Slide 6 text

Resilience Engineering is not
• SRE
• DevOps
• Invented by any $COMPANY
• Chaos Engineering
• automation

Slide 7

Slide 7 text

resilience is not a synonym for these things:
• redundancy
• robustness
• high-availability
• fault-tolerance
• anything about software or hardware

Slide 8

Slide 8 text

A FIELD
A COMMUNITY

Slide 9

Slide 9 text

Resilience Engineering Is a Field
• Multidisciplinary, emerged from Cognitive Systems Engineering
• Early 2000s, largely in response to NASA events in 1999 and 2000
• 8 symposia over 13 years

Slide 11

Slide 11 text

Resilience Engineering is a Community
largely made up of practitioners and researchers from…
Cybernetics
Engineering*
Ecology
Safety Science
Biology
Control Systems
Human Factors & Ergonomics
Cognitive Systems Engineering
Complexity Science
Cognitive Psychology
Sociology
Operations Research

Slide 12

Slide 12 text

Resilience Engineering is a Community
working in domains such as…
Rail
Maritime
Surgery
Intelligence Agencies
Law Enforcement
Aviation/ATM
Space
Mining
Construction
Explosives
Firefighting
Anesthesia
Pediatrics
Power Grid & Distribution
Military Agencies
Software Engineering

Slide 13

Slide 13 text

Some of the cast of characters
David Woods, CSEL/OSU
Shawna Perry, Univ of Florida Emergency Medicine
Dr. Richard Cook, Anesthesiologist Researcher
Ivonne Andrade Herrera, SINTEF
Erik Hollnagel, Univ of S. Denmark
Gesa Praetorius, Linnaeus University
Johan Bergström, Lund University
Sidney Dekker, Griffith University
Asher Balkin, CSEL/OSU
Laura Maguire, CSEL/OSU

Slide 14

Slide 14 text

Some of the cast of characters
J. Paul Reed
Jessica DeVita
Casey Rosenthal
Nora Jones
(me)
David Woods
Dr. Shawna Perry
Dr. Richard Cook
Ivonne Herrera
Erik Hollnagel
Johan Bergström
Sidney Dekker
Asher Balkin
Laura Maguire
Gesa Praetorius

Slide 15

Slide 15 text

resiliencepapers.club Lorin Hochstein

Slide 16

Slide 16 text

“resilience”

Slide 17

Slide 17 text

resilience is:
• proactive activities aimed at preparing to be unprepared, without an ability to justify it economically!
• sustaining the potential for future adaptive action when conditions change
• something that a system does, not what it has

Slide 18

Slide 18 text

unforeseen unanticipated unexpected fundamentally surprising

Slide 19

Slide 19 text

“things that have never happened before happen all the time”
– Scott Sagan, “The Limits of Safety”

Slide 20

Slide 20 text

robustness redundancy

Slide 21

Slide 21 text

capacity to find ways of getting to your destination
• cash in local currency
• requisite fluency in local language
• rail schedules
• bus schedules
• flight schedules
• postponing your appointment
• taking appointment partially via phone until arrival
• colleague to take your place until you arrive
… … …

Slide 22

Slide 22 text

resilience is a verb

Slide 23

Slide 23 text

sustained adaptive capacity

Slide 24

Slide 24 text

sustained adaptive capacity continuous adaptability graceful extensibility

Slide 25

Slide 25 text

Can resilience be found “in the wild”? (yes!)
How? By looking closely at incidents and near-incidents for novel adaptations that depended on prior investments in expertise and flexibility.

Slide 26

Slide 26 text

all incidents can be worse
what are the things (people, maneuvers, knowledge, etc.) that went into preventing it from being worse?

Slide 27

Slide 27 text

How can I find this “adaptive capacity”?
Find incidents:
• that had a high degree of surprise
• whose consequences were not severe
Then:
• look closely at the details about what went into making it not nearly as bad as it could have been
• protect and acknowledge explicitly the sources you find
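
The selection step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Incident` record, its 1 to 5 `surprise` and `severity` ratings, the thresholds, and the sample incidents are all hypothetical and not from the talk. Surprise in particular is a judgment made by the people who were there, not something a tool can measure.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical record: both ratings are human judgments,
    # from 1 (low) to 5 (high), given by people involved in the incident.
    title: str
    surprise: int
    severity: int


def candidate_incidents(incidents, min_surprise=4, max_severity=3):
    """Return high-surprise, not-severe incidents, most surprising first."""
    picked = [
        incident
        for incident in incidents
        if incident.surprise >= min_surprise and incident.severity <= max_severity
    ]
    return sorted(picked, key=lambda incident: incident.surprise, reverse=True)


# Made-up sample data, for illustration only.
incidents = [
    Incident("cache stampede after deploy", surprise=5, severity=2),
    Incident("routine disk-full page", surprise=1, severity=1),
    Incident("multi-hour checkout outage", surprise=4, severity=5),
    Incident("retry storm caught early", surprise=4, severity=3),
]

for incident in candidate_incidents(incidents):
    print(incident.title)
```

The filter only narrows down where to look. The close reading of what kept each incident from being worse still has to be done by people.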

Slide 28

Slide 28 text

indications of surprise and novelty
“wtf happened here”
“I have no idea what is going on”
“well that's terrifying”

Slide 29

Slide 29 text

indications about contrasting mental models
“so you want to rebuild {server01} first? neither box has been touched yet and im a tad nervous to do both at once”
“wait wait, i thought the X table was small”
“I'm still a bit confused why B and A are different if A got to 0 and B is still at 3099”
“oh I see.. the retry interval is pretty aggressive”
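
One rough way to locate moments like these in an incident chat transcript is a simple phrase scan. The sketch below is an assumption-laden illustration, not a method from the talk: the marker list is drawn loosely from the quotes on the two slides above, `flag_surprise` is a hypothetical helper, and the line "restarting the worker now" is an invented filler line.

```python
# Hypothetical marker phrases, loosely based on the quotes above.
SURPRISE_MARKERS = ("wtf", "no idea", "terrifying", "i thought", "confused", "oh i see")


def flag_surprise(lines, markers=SURPRISE_MARKERS):
    """Return (line_number, matched_markers) for lines containing any marker."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        hits = [marker for marker in markers if marker in lowered]
        if hits:
            flagged.append((number, hits))
    return flagged


transcript = [
    "wtf happened here",
    "restarting the worker now",  # invented filler line, not from the slides
    "wait wait, i thought the X table was small",
    "oh I see.. the retry interval is pretty aggressive",
]

for number, hits in flag_surprise(transcript):
    print(number, hits)
```

A scan like this only points at places worth reading closely. Working out what each person believed at that moment, and why, still takes an interview or a careful read of the surrounding context.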

Slide 30

Slide 30 text

why not look at incidents with severe consequences?
• scrutiny from stakeholders with face-saving agendas tends to block deep inquiry
• with “medium-severe” incidents the cost of getting details/descriptions of people’s perspectives is low relative to the potential gain
• “Goldilocks” incidents are the ideal

Slide 31

Slide 31 text

initiative
1. the ability of a unit to adapt when the plan no longer fits the situation, as seen from that unit’s perspective;
2. the willingness (even the audacity) to adapt planned activities to work around impasses or to seize opportunities in order to better meet the goals/intent behind the plan; and
3. when taking the initiative, the unit begins to adapt on its own, using information and knowledge available at that point, without asking for and then waiting for explicit authorization or tasking from other units.

Slide 32

Slide 32 text

case of brittleness
• 2012 Knight Capital collapse incident
• new changes deployed to participate in a new market
• unexpected algorithmic mechanisms led to unbounded automated trading activity
• team rolls back changes, situation gets much worse
• team did not believe it had authority to halt system
• $440M loss in ~45 minutes

Slide 33

Slide 33 text

in responding to an incident…
• do you have access to contact details for everyone in your organization?
• what actions do you need permission to take?
• what repercussions exist for “violating” procedures or compliance rules?
• can you anticipate what “neighboring” teams may need in the future that you have (expertise, staff, resources, etc.) and can donate to them before they need it, even if it sacrifices some of your local goals?

Slide 34

Slide 34 text

Can resilience be engineered? Maybe! We think so! Not entirely sure how yet, exactly.

Slide 35

Slide 35 text

Challenges to DevOps+SRE communities w.r.t. Resilience Engineering

Slide 36

Slide 36 text

Challenges
• Inertia towards the status quo, oversimplifications
• Chronic inability to learn from other domains
• Technofetishization and automation naïvety

Slide 37

Slide 37 text

The Status Quo Beliefs
• Tyranny of metrics and "shallow data"
• Under-investment in real incident analysis expertise
• Oversimplified methods such as one-size-fits-all postmortem templates

Slide 38

Slide 38 text

Inconvenient realities of shallow data
• “mean time to X” numbers are negotiated, not objective
• all incident data is reactive and scoped to unwanted events; it tells us nothing about wanted situations
• “trending” these numbers tells us nothing about learning, prevention, expertise, proactiveness, or adaptive capacity.

Slide 39

Slide 39 text

Bottom Line, Revisited
• Resilience Engineering is a nascent field aiming to create and sustain conditions where resilience can manifest productively.
• Resilience is something a system (your organization, not your software) does, not what it has.
• Resilience is sustained adaptive capacity, or continuous adaptability to unforeseen situations.
• Our world (software) has opportunities to further the state of the field, but faces real challenges.

Slide 40

Slide 40 text

Thank You!
@allspaw
Resilience Is A Verb (Woods, 2018) http://bit.ly/ResilienceIsAVerb
Stella Report http://stella.report
https://www.adaptivecapacitylabs.com/blog
@AdaptiveCLabs
SRE Cognitive Work (chapter in Seeking SRE, O’Reilly Media) http://bit.ly/SRECognitiveWork
How Complex Systems Fail (Cook, 1998) http://bit.ly/ComplexSystemsFailure