Resilience Engineering: How and What

DevOpsDays DC 2019

John Allspaw

July 08, 2019

Transcript

  1. WHAT IT IS WHAT IT IS NOT HOW AND WHY

    IT MATTERS John Allspaw Adaptive Capacity Labs
  2. Bottom Line, Up Front • Resilience Engineering is a nascent

    field aiming to create and sustain conditions where resilience can manifest productively. • Resilience is something a system (your organization, not your software) does, not what it has. • Resilience is sustained adaptive capacity, or continuous adaptability to unforeseen situations. • Our world (software) has opportunities to further the state of the field, but faces real challenges.
  3. Resilience Engineering is not • SRE • DevOps • Invented

    by any $COMPANY • Chaos Engineering • automation
  4. resilience is not a synonym for these things • redundancy • robustness • high-availability •

    fault-tolerance • anything about software or hardware
  5. Resilience Engineering Is a Field • Multidisciplinary, emerged from Cognitive

    Systems Engineering • Early 2000s, largely in response to NASA events in 1999 and 2000 • 8 symposia over 13 years
  6. Resilience Engineering is a Community, largely made up of

    practitioners and researchers from… Cybernetics • Engineering* • Ecology • Safety Science • Biology • Control Systems • Human Factors & Ergonomics • Cognitive Systems Engineering • Complexity Science • Cognitive Psychology • Sociology • Operations Research
  7. Resilience Engineering is a Community working in domains such as… Rail • Maritime • Surgery • Intelligence Agencies •

    Law Enforcement • Aviation/ATM • Space • Mining • Construction • Explosives • Firefighting • Anesthesia • Pediatrics • Power Grid & Distribution • Military Agencies • Software Engineering
  8. Some of the cast of characters • David Woods (CSEL/OSU) •

    Shawna Perry (Univ of Florida, Emergency Medicine) • Dr. Richard Cook (Anesthesiologist, Researcher) • Ivonne Andrade Herrera (SINTEF) • Erik Hollnagel (Univ of S. Denmark) • Gesa Praetorius (Linnaeus University) • Johan Bergström (Lund University) • Sidney Dekker (Griffith University) • Asher Balkin (CSEL/OSU) • Laura Maguire (CSEL/OSU)
  9. Some of the cast of characters • J. Paul Reed •

    Jessica DeVita • Casey Rosenthal • Nora Jones • (me) • David Woods • Dr. Shawna Perry • Dr. Richard Cook • Ivonne Herrera • Erik Hollnagel • Johan Bergström • Sidney Dekker • Asher Balkin • Laura Maguire • Gesa Praetorius
  10. resilience is: • proactive activities aimed at preparing to be

    unprepared — without an ability to justify it economically! • sustaining the potential for future adaptive action when conditions change • something that a system does, not what it has
  11. capacity to find ways of getting to your destination •

    cash in local currency • requisite fluency in local language • rail schedules • bus schedules • flight schedules • postponing your appointment • taking appointment partially via phone until arrival • colleague to take your place until you arrive • … … …
  12. Can resilience be found “in the wild”? (yes!) How? By

    looking closely at incidents and near-incidents for novel adaptations made which required prior investments to be made in expertise and flexibility.
  13. all incidents can be worse what are things (people, maneuvers,

    knowledge, etc.) that went into preventing it from being worse?
  14. How can I find this “adaptive capacity”? • Find incidents that

    had a high degree of surprise but whose consequences were not severe • look closely at the details of what went into making them not nearly as bad as they could have been • protect and acknowledge explicitly the sources you find
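
The selection step on this slide can be sketched in a few lines of Python. This is only an illustration, not anything shown in the talk: the `Incident` record, the 1 to 5 scales for `surprise` and `severity`, the thresholds, and the sample incidents are all hypothetical. It produces a shortlist of incidents worth a closer look; the close reading of each one is still done by people.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    title: str
    surprise: int  # hypothetical scale: 1 (routine) to 5 (nobody saw it coming)
    severity: int  # hypothetical scale: 1 (negligible) to 5 (severe consequences)


def candidates_for_close_study(incidents, min_surprise=4, max_severity=2):
    """Keep incidents that were highly surprising but not severe,
    ordered from most to least surprising."""
    picked = [
        i for i in incidents
        if i.surprise >= min_surprise and i.severity <= max_severity
    ]
    return sorted(picked, key=lambda i: i.surprise, reverse=True)


# Made-up sample data, for illustration only.
incidents = [
    Incident("cache stampede caught by on-call", surprise=5, severity=2),
    Incident("routine disk-full alert", surprise=1, severity=1),
    Incident("multi-hour checkout outage", surprise=4, severity=5),
    Incident("replication lag nobody could explain", surprise=4, severity=1),
]

shortlist = candidates_for_close_study(incidents)
for incident in shortlist:
    print(incident.title)
```

The thresholds are assumptions to tune locally; the point is only that surprise and severity are judged separately, so a surprising event with mild consequences is not filtered out for being "minor".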
  15. indications of surprise and novelty <murphy> wtf happened here <steve>

    I have no idea what is going on <laurie> well that's terrifying
  16. indications about contrasting mental models <laurie> so you want to

    rebuild {server01} first? <laurie> neither box has been touched yet <laurie> and im a tad nervous to do both at once <lisa> wait wait, i thought the X table was small <jeremy> I'm still a bit confused why B and A are different if A got to 0 and B is still at 3099 <tim>: oh I see.. the retry interval is pretty aggressive
  17. why not look at incidents with severe consequences? • scrutiny

    from stakeholders with a face-saving agenda tends to block deep inquiry • with “medium-severe” incidents the cost of getting details/descriptions of people’s perspectives is low relative to the potential gain • “Goldilocks” incidents are the ideal
  18. initiative 1. the ability of a unit to adapt when

    the plan no longer fits the situation, as seen from that unit’s perspective; 
 2. the willingness (even the audacity) to adapt planned activities to work around impasses or to seize opportunities in order to better meet the goals/intent behind the plan; and
 3. when taking the initiative, the unit begins to adapt on its own, using information and knowledge available at that point, without asking for and then waiting for explicit authorization or tasking from other units.
  19. case of brittleness • 2012 Knight Capital collapse incident •

    new changes deployed to participate in a new market • unexpected algorithmic mechanisms led to unbounded automated trading activity • team rolls back changes, situation gets much worse • team did not believe it had authority to halt system • $440M loss in ~45 minutes
  20. in responding to an incident… • do you have access

    to contact details for everyone in your organization? • what actions do you need permission to take? • what repercussions exist for “violating” procedures or compliance rules? • can you anticipate what “neighboring” teams may need in the future that you have (expertise, staff, resources, etc.) and can donate to them before they need it, even if it sacrifices some of your local goals?
  21. Challenges • Inertia towards the status quo, oversimplifications • Chronic

    inability to learn from other domains • Technofetishization and automation naïvety
  22. The Status Quo Beliefs • Tyranny of metrics and "shallow

    data" • Under-investment in real incident analysis expertise • Oversimplified methods such as one-size-fits-all postmortem templates
  23. Inconvenient realities of shallow data • “mean time to X” numbers are negotiated, not objective

    • all incident data is reactive and scoped to unwanted events; it tells us nothing about wanted situations • “trending” these numbers tells us nothing about learning, prevention, expertise, proactiveness, or adaptive capacity.
  24. Bottom Line, Revisited • Resilience Engineering is a nascent field

    aiming to create and sustain conditions where resilience can manifest productively. • Resilience is something a system (your organization, not your software) does, not what it has. • Resilience is sustained adaptive capacity, or continuous adaptability to unforeseen situations. • Our world (software) has opportunities to further the state of the field, but faces real challenges.
  25. Thank You! @allspaw • Resilience Is A Verb (Woods, 2018) http://bit.ly/ResilienceIsAVerb •

    Stella Report http://stella.report • https://www.adaptivecapacitylabs.com/blog • @AdaptiveCLabs • SRE Cognitive Work (chapter in Seeking SRE, O’Reilly Media) http://bit.ly/SRECognitiveWork • How Complex Systems Fail (Cook, 1998) http://bit.ly/ComplexSystemsFailure