DevOpsDays NYC 2020: Can Resilience Engineering be sufficiently described in 5 minutes?

Of course the answer to the question in the title is “no”, because this twenty-year-old multidisciplinary field is as broad and deep as Distributed Systems. Bringing perspectives, methods, and concepts from Resilience Engineering is a long game; my goal is to whet your appetite and lay down enough compelling threads for you to pull on as this important long game unfolds.

John Allspaw

March 03, 2020

Transcript

  1. There is no “one weird trick” to understand Resilience Engineering, and that’s ok! John Allspaw, Adaptive Capacity Labs
  2. Resilience Engineering is a FIELD: Cybernetics, Ecology, Safety Science, Biology, Control Systems, Human Factors & Ergonomics, Cognitive Systems Engineering, Complexity Science, Cognitive Psychology, Sociology, Operations Research
  3. Resilience Engineering is a COMMUNITY: Rail, Maritime, Surgery, Intelligence Agencies, Law Enforcement, Aviation/ATM, Space, Mining, Construction, Explosives, Firefighting, Anesthesia, Pediatrics, Power Grid & Distribution, Military Agencies, Software Engineering
  4. Resilience Engineering is a COMMUNITY: Rail, Maritime, Surgery, Intelligence Agencies, Law Enforcement, Aviation/ATM, Space, Mining, Construction, Explosives, Firefighting, Anesthesia, Pediatrics, Power Grid & Distribution, Military Agencies, Software Engineering
  5. resilience is not these things: • redundancy • robustness • high-availability / fault-tolerance • Chaos Engineering • anything about software or hardware!
  6. Buy a ticket and win: situational surprise. Don’t buy a ticket and win: fundamental surprise.
  7. example “in the wild”: we continually invest in the ability to deploy to production when it’s needed; it’s needed for handling fundamental surprises
  8. • automated tests • availability of (and expertise involved in) peer code review • availability of (and familiarity with using) feature/config flags • people available and looking for signs of trouble, focusing attention • the ability to contact others who can help if necessary
  9. Change Is Afoot 2018 2019 J. Paul Reed 2018 Nora

    Jones Casey Rosenthal 2020 Jessica DeVita Chad Todd Tim Tischler 2021 Learning From Incidents In Software http://learningfromincidents.io