Chaos Engineering Considering Failures from Development

Yury Nino

November 10, 2020
Transcript

  1. ▪ Chaos Engineering ▪ Software Development Lifecycle ▪ Designing for Chaos ▪ Coding for Chaos ▪ Deploying for Chaos ▪ Operating with Chaos
  2. Chaos Engineering: the discipline of experimenting with failures in production in order to reveal a system's weaknesses and to build confidence in its resilience. https://principlesofchaos.org/
  3. History ▪ 2008 Chaos Engineering was born at Netflix ▪ 2010 Chaos Monkey & Simian Army were launched ▪ 2016 Gremlin was born ▪ 2019 Chaos Massification ▪ 2017 SRE USENIX Chaos IQ ChaosConf ▪ 2018 Book Chaos Eng ▪ 2020 Book Chaos Eng
  4. Designing: Requirements focus on general attributes rather than specific behaviors. How to guarantee resilience? Nonfunctional requirements are not easy to design and build for, since they are primarily emergent properties. Design to fail
  5. Principles: Design • Design for Least Privilege • Design for Understandability • Design for Changing Landscape • Design for Resilience • Design for Recovery. Where are failures?
  6. Principles: Design • Design for Least Privilege • Design for Understandability • Design for Changing Landscape • Design for Chaos • Design for Resilience • Design for Recovery
  7. Google reported that 85% of all bugs in Android were caused by memory management errors. They concluded that “they need to move towards memory safe languages”. How to guarantee resilience? Code to fail
  8. Principles: Coding • Programming Language Choice • Complexity vs Understandability • Securing Third-Party Software • Testing Code • Data Validation. Where is Resilience?
  9. Principles: Coding • Programming Language Choice • Complexity vs Understandability • Securing Third-Party Software • Functions based Chaos • Testing Code • Data Validation
  10. Deployment Practices • Require Code Reviews • Rely on Automation • Verify Artifacts, Not Just People • Treat Configuration as Code • Securing Against the Threat Model • Policies Verifiable Builds • Post-Deployment Verification
  11. Complex systems can fail in both simple and complex ways, ranging from unexpected service outages to attacks by malicious actors seeking unauthorized access. (Netflix, Twitter)
  12. Principles: Operating • Define a Disaster • Prepare a Disaster Plan • Identify Team and Roles • Establish a Team Charter • Establish Severity Models • Develop Response Plans • Create Detailed Playbooks. It considers Resilience :)
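
Slide 9 lists "Functions based Chaos" among the coding principles without showing what it looks like in code. One possible reading is fault injection at the function level: wrap a function so that a controlled fraction of calls fail, then check whether the calling code still meets its steady-state target. The sketch below is only an illustration under that assumption; `InjectedFailure`, `chaos`, `with_retries`, `success_rate` and `fetch_profile` are hypothetical names, not part of the deck or of any chaos tool.

```python
import functools
import random


class InjectedFailure(Exception):
    """Simulated fault raised by the chaos decorator (hypothetical name)."""


def chaos(failure_rate, rng):
    """Wrap a function so that roughly `failure_rate` of its calls fail on purpose."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < failure_rate:
                raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def with_retries(func, attempts):
    """Call func up to `attempts` times; re-raise if the last attempt also fails."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return func(*args, **kwargs)
            except InjectedFailure:
                if attempt == attempts - 1:
                    raise
    return wrapper


def success_rate(func, calls):
    """Fraction of calls that complete without an injected failure."""
    ok = 0
    for _ in range(calls):
        try:
            func()
            ok += 1
        except InjectedFailure:
            pass
    return ok / calls


rng = random.Random(42)  # fixed seed so the experiment is repeatable


@chaos(failure_rate=0.3, rng=rng)
def fetch_profile():
    # Stand-in for a real dependency call (assumption for the example).
    return {"user": "demo"}


# Experiment: compare the bare call with a retrying caller under the same fault rate.
baseline = success_rate(fetch_profile, calls=1000)
resilient = success_rate(with_retries(fetch_profile, attempts=3), calls=1000)
print(f"baseline={baseline:.3f} resilient={resilient:.3f}")
```

With a 30% injected failure rate, the bare call should succeed roughly 70% of the time, while three attempts should push the success rate well above 90%. Keeping the fault rate and the random source as explicit parameters makes the experiment repeatable and easy to switch off by setting `failure_rate=0.0`.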