Resilience Engineering: Why break your software with Chaos Engineering?

Yury Nino

June 11, 2020

Transcript

  1. Women Tech Global Conference 2020, June 10-12, virtual. Unite 100,000 Women in Tech to Drive Change with Purpose and Impact.
  2. Resilience Engineering
    • Miracle on the Hudson River
    • Recording, tracking, and solving incidents
    • Why does resilience matter?
    • 4 Capabilities
      ◦ Ability to Respond
      ◦ Ability to Monitor
      ◦ Ability to Anticipate
      ◦ Ability to Learn
    • Resilience & Reliability
    • Reliability & Chaos Engineering
    • Resilient & Reliable Engineering Teams
  3. US Airways Flight 1549 ditched in the Hudson River, New York, on January 15th, 2009. This emergency ditching and evacuation, with the loss of no lives, was a heroic and unique aviation achievement.
  4. Resilience Engineering
    • To be able to construct a mental representation of the situation.
    • To be able to assess risks and threats as relevant for the flight.
    • To be able to switch from a situation under control.
    • To be able to maintain a relevant level of confidence.
    • To be able to make a decision in a complex situation.
  5. Resilience Engineering
    • To be able to make intelligent use of procedures.
    • To be able to use available technical and human resources.
    • To be able to manage time and time pressure.
    • To be able to cooperate with crew members and other staff.
    • To be able to properly use and manage information.
  6. Recording, triaging, tracking, and measuring the incidents that impact our applications is difficult and time-consuming, and requires talented, experienced engineers.
  7. "Resilience is the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations under both expected and unexpected conditions." (Erik Hollnagel)
  8. Resilience is the ability of a system to adapt to changes, failures, and anomalies.
  9. Ability to Respond: the capability to address the actual. Knowing what to do, or being able to respond to regular and irregular variability. The system must first detect that something has happened, then recognise the event and rate it as being so serious that a response is necessary, and finally know how and when to respond.
  10. Ability to Learn: the capability to address the factual. Knowing what has happened, or being able to learn from experience, in particular to learn the right lessons. Learning is generally defined as "a change in behaviour as a result of experience". Future performance can only be improved if something is learned from past performance.
  11. Ability to Monitor: the capability to address the critical. Knowing what to look for, or being able to monitor that which changes, or may change. Monitoring enables the system to address possible near-term threats and opportunities before they become reality. A resilient system must be able to flexibly monitor its own performance as well as changes in the environment.
  12. Ability to Anticipate: the capability to address the potential. Knowing what to expect, or being able to anticipate developments, threats, and opportunities. The anticipation of future opportunities has little support in current methods, although it is considered just as important as the search for threats. Risk assessment focuses on future threats and is suitable for systems where the principles of functioning are known.
  13. Resilience means that the critical parts of a system can mitigate, survive, and/or recover from high-impact threats. A distributed system in production needs to be resilient in order to be reliable! Reliability means that the lights always come on when you throw the switch.
  14. Chaos Engineering introduces the injection of failures as a discipline for building confidence in the resilience of systems.
  15. What is Chaos Engineering? It is the discipline of experimenting with failures in production systems in order to reveal their weaknesses and to build confidence in their resilience. https://principlesofchaos.org/
  16. If we want distributed systems that are highly available, reliable, and resilient, we must be reliable and resilient ourselves! Take care of yourself!
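
The failure-injection idea from slides 14 and 15 can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not the speaker's method or any real chaos tool: `flaky_dependency`, `handle_request`, `run_experiment`, and `steady_state_holds` are hypothetical names, the "dependency" is simulated in-process rather than in production, and the 99% success threshold is an arbitrary steady-state hypothesis chosen for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    requests: int
    successes: int

    @property
    def success_rate(self) -> float:
        return self.successes / self.requests


def flaky_dependency(rng: random.Random, failure_rate: float) -> str:
    """Hypothetical downstream call; the injected fault fires with probability failure_rate."""
    if rng.random() < failure_rate:
        raise ConnectionError("injected failure")
    return "fresh response"


def handle_request(rng: random.Random, failure_rate: float, use_fallback: bool) -> str:
    """Service under test: optionally degrades to a cached answer when the dependency fails."""
    try:
        return flaky_dependency(rng, failure_rate)
    except ConnectionError:
        if use_fallback:
            return "cached response"
        raise


def run_experiment(failure_rate: float, use_fallback: bool,
                   requests: int = 1000, seed: int = 42) -> ExperimentResult:
    """Send simulated requests while injecting failures, and count the ones that succeed."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    successes = 0
    for _ in range(requests):
        try:
            handle_request(rng, failure_rate, use_fallback)
            successes += 1
        except ConnectionError:
            pass
    return ExperimentResult(requests, successes)


def steady_state_holds(result: ExperimentResult, min_success_rate: float = 0.99) -> bool:
    """Steady-state hypothesis (assumed): at least 99% of requests succeed."""
    return result.success_rate >= min_success_rate


if __name__ == "__main__":
    for label, rate, fallback in [
        ("baseline, no faults", 0.0, False),
        ("30% faults, no fallback", 0.3, False),
        ("30% faults, with fallback", 0.3, True),
    ]:
        result = run_experiment(rate, fallback)
        print(f"{label}: success rate {result.success_rate:.3f}, "
              f"steady state holds: {steady_state_holds(result)}")
```

Comparing the run without a fallback against the run with one shows the point of the experiment: injecting the failure reveals a weakness, and repeating it after adding a mitigation builds confidence that the system stays within its steady state.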