Resilience Engineering: It Might Not Mean What You Think It Means

This is a talk I gave at the Chaos Community Day event in 2017.
The goal of the talk was to give the audience a high-level introduction to Resilience Engineering (both the field of study and the community of researchers) and to describe a couple of its core perspectives.

John Allspaw

January 25, 2018

Transcript

  1. Resilience Engineering
    It Might Not Mean What You Think It Means
    John Allspaw
    MSc., Human Factors and Systems Safety
    Adaptive Capacity Labs
    SNAFU Catchers

  2. What You Are In For
    1. Resilience Engineering: a field and a community
    2. Recalibration: the “resilience” label
    3. Strong assertions on how to think about resilience
    4. How RE might approach the topic of fault injection
    5. A request

  3. Resilience Engineering
    • A field of study that emerged largely from Cognitive Systems Engineering,
    early 2000s.
    • David Woods, Erik Hollnagel, Nancy Leveson, Richard Cook, Sidney Dekker,
    Jean Pariès, Bob Wears, more…
    • 7 symposia over 12 years

  4. Resilience Engineering Community
    is largely made up of practitioners and researchers from…
    Human Factors & Ergonomics, Cognitive Systems Engineering, Cybernetics, Complexity Science, Engineering*, Psychology, Sociology, Ecology, Safety Science
    working in these domains…
    Aviation/ATM, Rail, Maritime, Space, Surgery, Power Plants, Intelligence Agencies, Law Enforcement, Mining, Construction, Explosives, Firefighting, Anesthesia, Pediatrics, Power Grid & Distribution, Military Agencies, Software Engineering

  5. Some of the cast of characters
    David Woods, CSEL/OSU
    Shawna Perry, Univ of Florida, Emergency Medicine
    Dr. Richard Cook, Anesthesiologist, Researcher
    Ivonne Andrade Herrera, SINTEF
    Erik Hollnagel, Univ of S. Denmark
    Anne-Sophie Nyssen, University de Liege
    Johan Bergström, Lund University
    Sidney Dekker, Griffith University
    Asher Balkin, CSEL/OSU
    Laura Maguire, CSEL/OSU

  6. Sample of Research
    Experiences in Fukushima Dai-ichi nuclear power plant in light of resilience engineering
    Unmanned Aircraft Systems in (Inter)national Airspace: Resilience as a Lever in the Debate
    Sociotechnical Networks for Power Grid Resilience: South Korean Case Study
    Limits on adaptation: Modeling Resilience and Brittleness in Hospital Emergency Departments

  7. Resilience is something that a system
    does, not what a system has.

  8. Resilience is the story of the outage
    that didn’t happen.

  9. A Mental Model

  10. [Diagram]
    externally sourced code (e.g. DB)
    results
    delivery technology stack
    internally sourced code
    results
    the outside world

  11. [Diagram]
    code generating tools
    testing tools
    deployment tools
    organization/encapsulation tools
    “monitoring” tools
    code repositories
    code stuff
    testing/validation suites
    scripts, rules, etc.
    test cases
    neo-assemblers
    pseudo/meta/ rules code
    externally sourced code (e.g. DB)
    results
    delivery technology stack
    internally sourced code
    results
    the outside world

  12. [Diagram]
    code generating tools
    testing tools
    deployment tools
    organization/encapsulation tools
    “monitoring” tools
    pseudo/meta/ rules code
    getting stuff ready to be part of the running system
    adding stuff to the running system
    architectural and structural framing
    keeping track of what “the system” is doing

  13. [Diagram]
    code generating tools
    testing tools
    deployment tools
    organization/encapsulation tools
    “monitoring” tools
    getting stuff ready to be part of the running system
    adding stuff to the running system
    architectural and structural framing
    keeping track of what “the system” is doing
    code repositories
    code stuff
    testing/validation suites
    scripts, rules, etc.
    test cases
    neo-assemblers
    pseudo/meta/ rules code
    externally sourced code (e.g. DB)
    results
    delivery technology stack
    internally sourced code
    results
    “below the line”
    “above the line”

  14. [Diagram]
    code generating tools
    testing tools
    deployment tools
    organization/encapsulation tools
    “monitoring” tools
    getting stuff ready to be part of the running system
    adding stuff to the running system
    architectural and structural framing
    keeping track of what “the system” is doing
    code repositories
    code stuff
    testing/validation suites
    scripts, rules, etc.
    test cases
    neo-assemblers
    pseudo/meta/ rules code
    externally sourced code (e.g. DB)
    results
    delivery technology stack
    internally sourced code
    results
    The Thing You’re Building
    The Stuff You Build and Maintain With
    The People Doing The Work

  15. [Diagram]
    code generating tools
    testing tools
    deployment tools
    organization/encapsulation tools
    “monitoring” tools
    getting stuff ready to be part of the running system
    adding stuff to the running system
    architectural and structural framing
    keeping track of what “the system” is doing
    coordinating, testing, anticipating, learning, modeling, troubleshooting, organizing, remembering, revising, planning, monitoring

  16. If you haven’t found people responsible for
    outcomes, you haven’t seen the system.

  17. [Diagram]
    externally sourced code (e.g. DB)
    results
    delivery technology stack
    internally sourced code
    results
    Representations
    } Interactions
    Communications
    Signaling
    Why is it doing that?
    What needs to change?
    What does it mean?
    How should this work?
    What’s it doing?
    What does it mean?
    What is happening?
    What should happen?
    What does it mean?
    Cognition
    Goals
    Purposes
    Risks
    What matters
    Why what matters matters
    getting stuff ready to be part of the running system
    adding stuff to the running system
    architectural and structural framing
    keeping track of what “the system” is doing
    code generating tools
    testing tools
    deployment tools
    organization/encapsulation tools
    “monitoring” tools
    code repositories
    code stuff
    testing/validation suites
    scripts, rules, etc.
    test cases
    neo-assemblers
    pseudo/meta/ rules code
    } Artifacts

  18. [Diagram]
    externally sourced code (e.g. DB)
    results
    delivery technology stack
    internally sourced code
    results
    Representations
    } Interactions
    Communications
    Signaling
    Why is it doing that?
    What needs to change?
    What does it mean?
    How should this work?
    What’s it doing?
    What does it mean?
    What is happening?
    What should happen?
    What does it mean?
    Cognition
    Goals
    Purposes
    Risks
    What matters
    Why what matters matters
    getting stuff ready to be part of the running system
    adding stuff to the running system
    architectural and structural framing
    keeping track of what “the system” is doing
    code generating tools
    testing tools
    deployment tools
    organization/encapsulation tools
    “monitoring” tools
    code repositories
    code stuff
    testing/validation suites
    scripts, rules, etc.
    test cases
    neo-assemblers
    pseudo/meta/ rules code
    } Artifacts
    Time
    A Resilience Engineering Unit of Analysis

  19. Traditional view on the role of people (“Safety-I”):
    Humans are predominantly seen as a liability or hazard.
    They are a problem to be fixed.
    RE view on the role of people in complex systems (“Safety-II”):
    Humans are seen as a resource necessary for system flexibility and resilience.
    They provide flexible solutions to many potential problems.

  20. “above the line”
    …is not “management”
    …is not “organization design” or reporting structures
    …is how people work (detect/diagnose/solve problems, both acute and
    chronic) alongside and with technology and each other, under continual
    trade-off scenarios, that provide the…

  21. the AUDACITY to build and sustain the
    potential to…
    • respond
    • monitor
    • learn
    • anticipate

  22. Why “audacity”?

  23. fault injection as a focus

  24. Questions from an RE perspective
    • What resources (funding, incentives, etc.) encourage engineering groups to
    invest time and effort into designing new fault injection cases?
    • What criteria direct our focus of attention for fault injection scenarios?
    • How do teams assess the level of effort needed to maintain the stuff that
    makes fault injection work, and work safely?
    • How do teams assess the ongoing value of specific fault injections versus
    others?

  25. the AUDACITY to build and sustain the
    potential to…
    • respond
    • monitor
    • learn
    • anticipate

  26. [Diagram]
    code generating tools
    testing tools
    deployment tools
    organization/encapsulation tools
    “monitoring” tools
    getting stuff ready to be part of the running system
    adding stuff to the running system
    architectural and structural framing
    keeping track of what “the system” is doing
    code repositories
    code stuff
    testing/validation suites
    scripts, rules, etc.
    test cases
    neo-assemblers
    pseudo/meta/ rules code
    externally sourced code (e.g. DB)
    results
    delivery technology stack
    internally sourced code
    results
    The Thing You’re Building
    The Stuff You Build and Maintain With
    The People Doing The Work

  27. Resilience
    sustained adaptive capacity

  28. Copyright © 2016 by R.I.Cook
    Copyright ⓒ 2016 by Richard Cook for SNAFUmasters
    www.snafucatchers.org
    http://bit.ly/ResilienceConsortium

  29. RE Needs Cases From Our World
    • This is not a field isolated in academia! Progress in RE depends on
    exploring a wide and diverse set of cases.

    • Incidents (especially minor ones) make good cases for exploring adaptive
    capacity.

  30. Vehicles For Doing This Research
    Adaptive Capacity Labs

  31. Short Course In Resilience Engineering
    David Woods
    http://bit.ly/REShortCourse
