How to start with Chaos Engineering for Organizations

How to use Chaos Engineering to build Reliable Systems Productively

Yury Nino

April 11, 2020

Transcript

  1. YURY NIÑO, Site Reliability Engineer, Chaos Engineering Advocate. Garagoa is a town located in Boyacá, a department in Colombia.
  2. Agenda
    • What is Chaos Engineering?
    • Slack: Disasterpiece
    • LinkedIn: Mindful
    • Capital One: Evolution
    • Microsoft: Variation
    • Chaos Maturity Model
    • How to start?
  3. What is Chaos Engineering? It is the discipline of experimenting with failures in production in order to reveal a system's weaknesses and to build confidence in its resilience. https://principlesofchaos.org/
  4. History
    • 2008: Chaos Engineering began at Netflix.
    • 2010: Chaos Monkey & Simian Army were launched.
    • 2016: Gremlin born.
    • 2017: SRE USENIX; Chaos IQ born.
    • 2018: ChaosConf; 1 Book; Chaos Monkey for Spring Boot.
    • 2019: 1 Book; Chaos massification.
    • 2020: 1 Book was published.
  5. Who is a Chaos Engineer? What my mom thinks I do. What my friends think I do. What software engineers think I do. What I really do: help service owners increase their resilience through education, tools and encouragement.
  6. Slack: Disasterpiece. Whenever they launch features or make changes, they test the fault tolerance of that new code. In January 2018, they started a rigorous process of identifying failures that are likely to happen and that they must be able to tolerate, and then purposely causing them to happen in production. This isn't Chaos Engineering as practiced and evangelized by Netflix. It's the first step; they call it Disasterpiece Theater. Taken from Chaos Engineering Book - 2020
  7. Slack: Disasterpiece
    1. Decide on a server or service that will be caused to fail.
    2. Survey the server or service in dev and prod.
    3. Identify alerts, dashboards, logs, and metrics.
    4. Identify redundancies and automated remediations.
    5. Invite all the relevant people to the event!
    Taken from Chaos Engineering Book - 2020
  8. Slack: Disasterpiece
    1. Announce the exercise and incite the failure in DEV!
    2. Announce the exercise and incite the failure in PROD!
    3. Receive alerts and inspect dashboards.
    4. Give automated remediations time to be triggered.
    5. Follow runbooks to restore service in prod.
    6. Debrief & distribute the recording!
    Taken from Chaos Engineering Book - 2020
  9. LinkedIn: LinkedOut. Look back at famous incidents that went awry to help draw conclusions about how to plan chaos experiments. When you're running failure experiments for the first time, it's important to start small. Even shutting down one server can be hard to recover from. The disruptor has 3 failure modes: Error, Delay and Timeout. Taken from Chaos Engineering Book - 2020
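
The three failure modes named on the LinkedIn slide (Error, Delay, Timeout) can be illustrated with a small wrapper around a downstream call. This is only a sketch under stated assumptions, not LinkedOut's implementation: `disrupt`, `InjectedError`, and the default durations are hypothetical names and values chosen for the example.

```python
import time
from enum import Enum


class FailureMode(Enum):
    ERROR = "error"
    DELAY = "delay"
    TIMEOUT = "timeout"


class InjectedError(Exception):
    """Raised instead of making the real call when ERROR is active."""


def disrupt(call, mode=None, delay_s=0.05, timeout_s=0.2, sleep=time.sleep):
    """Run `call`, optionally injecting one failure mode around it.

    Hypothetical helper: names and default durations are assumptions.
    `sleep` is injectable so that tests do not have to wait.
    """
    if mode is FailureMode.ERROR:
        # Fail immediately without reaching the downstream dependency.
        raise InjectedError("injected error")
    if mode is FailureMode.DELAY:
        # Add latency, then let the real call proceed.
        sleep(delay_s)
        return call()
    if mode is FailureMode.TIMEOUT:
        # Wait out the configured timeout, then fail as a timeout would.
        sleep(timeout_s)
        raise TimeoutError("injected timeout")
    # No failure mode selected: behave like the undisturbed call.
    return call()
```

Starting small, as the slide suggests, could mean applying a single mode to a single call site before widening the scope.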
  10. Capital One: Chaos Continuous. The rise of neobanks brings financial capabilities powered by blockchain, AI, machine learning, and business intelligence. They are building cloud native apps and adopting practices like CI/CD pipelines, templated frameworks, secret management and Chaos Engineering. They started with a list of around 25 experiments. Taken from Chaos Engineering Book - 2020
  11. Capital One: Chaos Continuous
    1. Clear documentation of the expected behavior.
    2. Potential/possible failures.
    3. Impact to in-flight transactions.
    4. Monitoring of the infrastructure and application.
    5. Risk score for each experiment.
    Taken from Chaos Engineering Book - 2020
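
The five items on the Capital One slide can be captured as one record per experiment. The sketch below is an assumption-laden illustration: the class name, field names, the 1 to 5 risk scale, and the idea of running the lowest-risk experiments first are all hypothetical and are not taken from the book.

```python
from dataclasses import dataclass


@dataclass
class ExperimentPlan:
    """One planned chaos experiment (hypothetical structure)."""
    name: str
    expected_behavior: str
    possible_failures: list[str]
    inflight_transaction_impact: str
    monitoring: list[str]
    risk_score: int  # assumed scale: 1 (low) to 5 (high)

    def missing_items(self) -> list[str]:
        """Return the checklist items that have not been filled in yet."""
        checks = {
            "expected behavior": self.expected_behavior,
            "possible failures": self.possible_failures,
            "impact to in-flight transactions": self.inflight_transaction_impact,
            "monitoring": self.monitoring,
        }
        return [label for label, value in checks.items() if not value]


def order_by_risk(plans: list[ExperimentPlan]) -> list[ExperimentPlan]:
    """Sort plans so the lowest-risk experiments come first (an assumption)."""
    return sorted(plans, key=lambda plan: plan.risk_score)
```

A plan whose `missing_items()` list is not empty is a reasonable candidate to hold back until the documentation is complete.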
  12. Elementary
    • Experiments are not run in production.
    • The process is administered manually.
    • Results do not reflect business metrics.
    • Simple events are applied, like "turn it off".
    Taken from Chaos Engineering Book - 2018
  13. Simple
    • Experiments are run with replayed, production-like traffic.
    • Self-service setup, automatic execution.
    • Results reflect aggregated business metrics.
    • Expanded events like latency are applied.
    • Results are manually curated.
    Taken from Chaos Engineering Book - 2018
  14. Sophisticated
    • Experiments run in production.
    • Result analysis and termination are automated.
    • Experimentation is integrated with CD.
    • Business metrics are compared.
    • Combination of failures.
    • Results are tracked over time.
    Taken from Chaos Engineering Book - 2018
  15. Advanced
    • Experiments run in each step of development.
    • Design and execution are fully automated.
    • Events include state mutation.
    • Experiments have dynamic scope.
    • Revenue loss can be projected from results.
    • Capacity forecasting can be performed.
    Taken from Chaos Engineering Book - 2018
  16. In shadows
    • Skunkworks projects are unsanctioned.
    • Few systems covered.
    • There is low or no organizational awareness.
    • Early adopters infrequently perform chaos experimentation.
    Taken from Chaos Engineering Book - 2018
  17. Investment
    • Experimentation is officially sanctioned.
    • Resources are dedicated to the practice.
    • Multiple teams are interested and engaged.
    • A few critical services run chaos.
    Taken from Chaos Engineering Book - 2018
  18. Adoption
    • A team is dedicated to Chaos Engineering.
    • Incidents are used to create regression experiments.
    • Critical services practice regular Chaos.
    • Occasional Gamedays are performed.
    Taken from Chaos Engineering Book - 2018
  19. Expectation
    • All critical services run frequent chaos.
    • Most noncritical services use chaos.
    • Chaos experimentation is part of onboarding.
    • Participation is the default behavior.
    Taken from Chaos Engineering Book - 2018
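
The Chaos Maturity Model slides describe two ordered scales: sophistication (Elementary, Simple, Sophisticated, Advanced) and adoption (In shadows, Investment, Adoption, Expectation). A minimal sketch of how an organization might record its position on each scale follows; the enum names, the numeric values, and the `next_level` helper are hypothetical and only encode the order in which the slides present the levels.

```python
from enum import IntEnum


class Sophistication(IntEnum):
    ELEMENTARY = 1
    SIMPLE = 2
    SOPHISTICATED = 3
    ADVANCED = 4


class Adoption(IntEnum):
    IN_SHADOWS = 1
    INVESTMENT = 2
    ADOPTION = 3
    EXPECTATION = 4


def next_level(level):
    """Return the next level on the same scale, or the level itself at the top."""
    levels = list(type(level))  # members in definition order
    index = levels.index(level)
    return levels[min(index + 1, len(levels) - 1)]
```

For example, a team that runs manual, non-production experiments with little organizational awareness would sit at `(Sophistication.ELEMENTARY, Adoption.IN_SHADOWS)`, and `next_level` names the step to aim for on either scale.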