Resilience Engineering: How and What

DevOpsDays DC 2019

John Allspaw

July 08, 2019

Transcript

  1. WHAT IT IS
    WHAT IT IS NOT
    HOW AND WHY IT MATTERS
    John Allspaw
    Adaptive Capacity Labs

  2. me
    2009 Velocity Conf
    Consortium for Resilient
    Internet-Facing Business IT
    Adaptive
    Capacity
    Labs

  4. Bottom Line, Up Front
    • Resilience Engineering is a nascent field aiming to create and sustain
    conditions where resilience can manifest productively.
    • Resilience is something a system (your organization, not your software)
    does, not what it has.
    • Resilience is sustained adaptive capacity, or continuous adaptability to
    unforeseen situations.
    • Our world (software) has opportunities to further the state of the field, but
    faces real challenges.

  5. RESILIENCE ENGINEERING
    • field
    • community
    • practice
    “resilience”
    ?

  6. Resilience Engineering is not
    • SRE
    • DevOps
    • Invented by any $COMPANY
    • Chaos Engineering
    • automation

  7. resilience is not
    • redundancy
    • robustness
    • high-availability
    • fault-tolerance
    • anything about software or hardware
    a synonym for these things

  8. A FIELD
    A COMMUNITY

  9. Resilience Engineering
    Is a Field
    • Multidisciplinary, emerged from Cognitive Systems Engineering
    • Early 2000s, largely in response to NASA events in 1999 and 2000
    • 8 symposia over 13 years

  11. Resilience Engineering
    is a Community
    is largely made up of practitioners and researchers from…
    Cybernetics
    Engineering*
    Ecology
    Safety Science
    Biology
    Control Systems
    Human Factors & Ergonomics
    Cognitive Systems Engineering
    Complexity Science
    Cognitive Psychology
    Sociology
    Operations Research

  12. Resilience Engineering
    is a Community
    working in domains such as…
    Rail
    Maritime
    Surgery
    Intelligence Agencies
    Law Enforcement
    Aviation/ATM
    Space
    Mining
    Construction
    Explosives
    Firefighting
    Anesthesia
    Pediatrics
    Power Grid & Distribution
    Military Agencies
    Software Engineering

  13. Some of the cast of characters
    David Woods
    CSEL/OSU
    Shawna Perry
    Univ of Florida
    Emergency Medicine
    Dr. Richard Cook
    Anesthesiologist
    Researcher
    Ivonne Andrade Herrera
    SINTEF
    Erik Hollnagel
    Univ of S. Denmark
    Gesa Praetorius
    Linnaeus University
    Johan Bergström
    Lund University
    Sidney Dekker
    Griffith University
    Asher Balkin
    CSEL/OSU
    Laura Maguire
    CSEL/OSU

  14. Some of the cast of characters
    J. Paul Reed
    Jessica DeVita
    Casey Rosenthal
    Nora Jones (me)
    David Woods
    Dr. Shawna Perry
    Dr. Richard Cook
    Ivonne Herrera
    Erik Hollnagel
    Johan Bergström
    Sidney Dekker
    Asher Balkin
    Laura Maguire
    Gesa Praetorius

  15. resiliencepapers.club
    Lorin Hochstein

  16. “resilience”

  17. resilience is:
    • proactive activities aimed at preparing to be unprepared
    — without an ability to justify it economically!
    • sustaining the potential for future adaptive action when
    conditions change
    • something that a system does, not what it has

  18. unforeseen
    unanticipated
    unexpected
    fundamentally surprising

  19. “things that have never happened before
    happen all the time”
    –Scott Sagan, “The Limits of Safety”

  20. robustness
    redundancy

  21. capacity to find ways of getting to your destination
    cash in local currency
    requisite fluency in local language
    rail schedules
    bus schedules
    flight schedules
    postponing your appointment
    taking appointment partially via phone until arrival
    colleague to take your place until you arrive



  22. resilience is a verb

  23. sustained
    adaptive capacity

  24. sustained adaptive capacity
    continuous adaptability
    graceful extensibility

  25. Can resilience be found “in the wild”?
    (yes!)
    How?
    By looking closely at incidents and near-incidents for novel adaptations
    that depended on prior investments in expertise and flexibility.

  26. all incidents can be worse
    what are things (people, maneuvers, knowledge, etc.) that went into
    preventing it from being worse?

  27. How can I find this “adaptive capacity”?
    Find incidents that have:
    • a high degree of surprise
    • consequences that were not severe
    Then:
    • look closely at the details about what went into
    making it not nearly as bad as it could have been
    • protect and acknowledge explicitly the sources you find

  28. indications of surprise and novelty
    wtf happened here
    I have no idea what is going on
    well that's terrifying
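
One way to surface such indications is to scan an incident chat transcript for phrases that signal surprise. The sketch below is only an illustration, not a method from the talk: the pattern list, the `surprise_lines` helper, and the sample `chat` lines are all hypothetical, with three sample lines borrowed from the slide. Real transcripts would need a much richer, locally tuned vocabulary and a human reading the context around each hit.

```python
import re

# Hypothetical surprise markers; these echo the slide's examples and are
# not an exhaustive or validated vocabulary.
SURPRISE_PATTERNS = [
    r"\bwtf\b",
    r"\bno idea what is going on\b",
    r"\bterrifying\b",
]


def surprise_lines(transcript_lines):
    """Return (index, line) pairs whose text matches any surprise pattern."""
    compiled = [re.compile(p, re.IGNORECASE) for p in SURPRISE_PATTERNS]
    return [
        (i, line)
        for i, line in enumerate(transcript_lines)
        if any(c.search(line) for c in compiled)
    ]


# Invented sample transcript: lines 1, 2 and 4 come from the slide,
# lines 0 and 3 are made-up filler.
chat = [
    "deploy finished",
    "wtf happened here",
    "I have no idea what is going on",
    "restarting the worker",
    "well that's terrifying",
]

hits = surprise_lines(chat)
for index, line in hits:
    print(index, line)
```

A keyword match only points at a moment worth reading closely; it says nothing on its own about what people understood or did at that moment.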

  29. indications about contrasting mental models
    so you want to rebuild {server01} first?
    neither box has been touched yet
    and im a tad nervous to do both at once
    wait wait, i thought the X table was small
    I'm still a bit confused why B and A are different
    if A got to 0 and B is still at 3099
    oh I see.. the retry interval is pretty aggressive

  30. why not look at incidents with
    severe consequences?
    • scrutiny from stakeholders with face-saving agendas tends to block deep
    inquiry
    • with “medium-severe” incidents the cost of getting details/descriptions
    of people’s perspectives is low relative to the potential gain
    • “Goldilocks” incidents are the ideal

  31. initiative
    1. the ability of a unit to adapt when the plan no longer fits the situation, as
    seen from that unit’s perspective; 

    2. the willingness (even the audacity) to adapt planned activities to work
    around impasses or to seize opportunities in order to better meet the goals/
    intent behind the plan; and 

    3. when taking the initiative, the unit begins to adapt on its own, using
    information and knowledge available at that point, without asking for and
    then waiting for explicit authorization or tasking from other units.

  32. case of brittleness
    • 2012 Knight Capital collapse incident
    • new changes deployed to participate in a new market
    • unexpected algorithmic mechanisms led to unbounded automated trading
    activity
    • team rolls back changes, situation gets much worse
    • team did not believe it had authority to halt system
    • $440M loss in ~45 minutes

  33. in responding to an incident…
    • do you have access to contact details for everyone in your organization?
    • what actions do you need permission to take?
    • what repercussions exist for “violating” procedures or compliance rules?
    • can you anticipate what “neighboring” teams may need in the future that
    you have (expertise, staff, resources, etc.) and can donate to them before
    they need it, even if it sacrifices some of your local goals?

  34. Can resilience be engineered?
    Maybe! We think so!
    Not entirely sure how yet, exactly.

  35. Challenges to DevOps+SRE
    communities w.r.t.
    Resilience Engineering

  36. Challenges
    • Inertia towards the status quo, oversimplifications
    • Chronic inability to learn from other domains

    • Technofetishization and automation naïvety

  37. The Status Quo Beliefs
    • Tyranny of metrics and "shallow data"
    • Under-investment in real incident analysis expertise
    • Oversimplified methods such as one-size-fits-all postmortem templates

  38. Inconvenient realities of shallow data
    • “mean time to X” numbers are negotiated, not objective
    • all incident data is reactive and scoped to unwanted events; it tells us
    nothing about wanted situations
    • “trending” these numbers tells us nothing about learning, prevention,
    expertise, proactiveness, or adaptive capacity.
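
As a rough illustration of why a "mean time to X" figure can be called negotiated, the sketch below computes a duration for a single incident under two different choices of start and end event. This is one reading of the slide's point, not an example from the talk: the timeline, the event names, and the `minutes_between` helper are all invented for the example.

```python
from datetime import datetime

# Hypothetical timeline for one incident; every timestamp is invented.
timeline = {
    "first_error_logged": datetime(2019, 7, 8, 10, 0),
    "alert_fired": datetime(2019, 7, 8, 10, 12),
    "incident_declared": datetime(2019, 7, 8, 10, 30),
    "mitigation_applied": datetime(2019, 7, 8, 11, 0),
    "fix_verified": datetime(2019, 7, 8, 12, 15),
}


def minutes_between(start_key, end_key):
    """Whole minutes elapsed between two named events in the timeline."""
    delta = timeline[end_key] - timeline[start_key]
    return int(delta.total_seconds() // 60)


# Same incident, two defensible definitions of "time to resolve".
narrow = minutes_between("incident_declared", "mitigation_applied")
wide = minutes_between("first_error_logged", "fix_verified")

print(narrow, wide)  # 30 135
```

Which pair of events counts as the start and the end is a decision people make, so the resulting number reflects that agreement as much as it reflects the incident.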

  39. Bottom Line, Revisited
    • Resilience Engineering is a nascent field aiming to create and sustain
    conditions where resilience can manifest productively.
    • Resilience is something a system (your organization, not your software)
    does, not what it has.
    • Resilience is sustained adaptive capacity, or continuous adaptability to
    unforeseen situations.
    • Our world (software) has opportunities to further the state of the field, but
    faces real challenges.

  40. Thank You!
    @allspaw
    Resilience Is A Verb (Woods, 2018)
    http://bit.ly/ResilienceIsAVerb
    Stella Report
    http://stella.report
    https://www.adaptivecapacitylabs.com/blog
    @AdaptiveCLabs
    SRE Cognitive Work
    (chapter in Seeking SRE, O’Reilly Media)
    http://bit.ly/SRECognitiveWork
    How Complex Systems Fail (Cook, 1998)
    http://bit.ly/ComplexSystemsFailure
