How to start with Chaos Engineering for Organizations

Chaos Engineering, Chaos Maturity Model, Organizations
Quarantine Tech Talks, April 11th

YURY NIÑO
Site Reliability Engineer, Chaos Engineering Advocate
Garagoa is a town located in Boyacá, a department in Colombia.

TITANIC, QUEBEC BRIDGE

What do they have in common?

Agenda
• What is Chaos Engineering?
• Slack: Disasterpiece
• LinkedIn: Mindful
• CapitalOne: Evolution
• Microsoft: Variation
• Chaos Maturity Model
• How to start?

What is Chaos Engineering?
It is the discipline of experimenting with failures in production in order to reveal a system's weaknesses and to build confidence in its resilience.
https://principlesofchaos.org/

History
2008: Chaos Engineering began at Netflix
2010: Chaos Monkey & Simian Army were launched
2016: Gremlin born
2019: 1 Book, Chaos massification
2017: SRE USENIX, Chaos IQ born, ChaosConf
2018: 1 Book, Chaos Monkey for Spring Boot
2020: 1 Book was published

Who is a Chaos Engineer?
What my mom thinks I do. What my friends think I do. What software engineers think I do. What I really do.
Help service owners increase their resilience through education, tools and encouragement.

Common Questions

Who are they?

They are practicing Chaos Engineering

Slack Disasterpiece Theater
Whenever they launch features or make changes, they test the fault tolerance of the new code. In January of 2018, they started a rigorous process of identifying failures that are likely to happen and that they must be able to tolerate, and then purposely causing them to happen in production. This isn't Chaos Engineering as practiced and evangelized by Netflix. It's the first step; they call it Disasterpiece Theater.
Taken from Chaos Engineering Book - 2020

Slack Disasterpiece Theater
1. Decide on a server or service that will be caused to fail.
2. Survey the server or service in dev and prod.
3. Identify alerts, dashboards, logs, and metrics.
4. Identify redundancies and automated remediations.
5. Invite all the relevant people to the event!
Taken from Chaos Engineering Book - 2020

Slack Disasterpiece Theater
1. Announce the exercise and incite the failure in DEV!
2. Announce the exercise and incite the failure in PROD!
3. Receive alerts and inspect dashboards.
4. Give automated remediations time to be triggered.
5. Follow runbooks to restore service in prod.
6. Debrief & distribute the recording!
Taken from Chaos Engineering Book - 2020
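
The dev-then-prod ordering of the exercise steps can be sketched in Python. This is a minimal sketch, not Slack's tooling: `run_exercise`, `incite_failure` and `recovered` are hypothetical names, and stopping before prod when dev does not recover on its own is an assumption rather than something the slides state.

```python
def run_exercise(incite_failure, recovered, environments=("dev", "prod")):
    """Walk the exercise through each environment in order.

    incite_failure(env) causes the chosen failure (hypothetical callback).
    recovered(env) reports whether automated remediation restored service
    (hypothetical callback).
    """
    log = []
    for env in environments:
        log.append(f"announce {env}")
        incite_failure(env)
        if recovered(env):
            log.append(f"recovered {env}")
        else:
            # Assumption: fall back to the runbook and do not move on
            # to the next environment.
            log.append(f"runbook {env}")
            break
    # The debrief happens whether or not every environment was reached.
    log.append("debrief")
    return log
```

Passing the failure and the recovery check as callbacks keeps the ordering logic separate from whatever actually stops a server or reads a dashboard.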

LinkedIn LinkedOut
Look back at famous incidents that went awry to help draw conclusions about how to plan chaos experiments. When you're running failure experiments for the first time, it's important to start small. Even shutting down one server can be hard to recover from. The disruptor injects 3 failure modes: Error, Delay and Timeout.
Taken from Chaos Engineering Book - 2020
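
The three failure modes (Error, Delay, Timeout) can be illustrated with a small wrapper around a call. This is a sketch under stated assumptions, not LinkedOut's implementation: `FailureMode`, `InjectedError`, `disrupt` and its parameters are hypothetical names chosen for illustration.

```python
import time
from enum import Enum


class FailureMode(Enum):
    ERROR = "error"
    DELAY = "delay"
    TIMEOUT = "timeout"


class InjectedError(Exception):
    """Raised instead of making the real call when ERROR is injected."""


def disrupt(call, mode=None, delay_s=0.05, timeout_s=0.2, sleep=time.sleep):
    """Run `call`, optionally injecting one failure mode (hypothetical helper)."""
    if mode is FailureMode.ERROR:
        # Fail immediately without reaching the downstream dependency.
        raise InjectedError("injected error")
    if mode is FailureMode.DELAY:
        # Add latency, then let the real call proceed.
        sleep(delay_s)
        return call()
    if mode is FailureMode.TIMEOUT:
        # Wait for the timeout budget, then give up without a response.
        sleep(timeout_s)
        raise TimeoutError("injected timeout")
    # No mode selected: behave like the undisturbed call.
    return call()
```

Starting small here means calling `disrupt` around a single downstream request with one mode, and only widening the scope once the caller is seen to handle it.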

LinkedIn LinkedOut
Taken from Chaos Engineering Book - 2020

Who are they?

Capital One Chaos Continuous
The rise of neobanks brings financial capabilities powered by blockchain, AI, machine learning, and business intelligence. They are building cloud native apps and adopting practices like CI/CD pipelines, templated frameworks, secret management and Chaos Engineering. They started with a list of around 25 experiments.
Taken from Chaos Engineering Book - 2020

Capital One Chaos Continuous
1. Clear documentation of the expected behavior.
2. Potential/possible failures.
3. Impact to in-flight transactions.
4. Monitoring of the infrastructure and application.
5. Risk score for each experiment.
Taken from Chaos Engineering Book - 2020
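
The five items above can be captured as one record per experiment. This is a minimal sketch, not Capital One's format: the `ExperimentPlan` class, its field names, the 1 to 5 risk scale and the `max_risk` threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExperimentPlan:
    name: str
    expected_behavior: str             # 1. documented expected behavior
    potential_failures: List[str]      # 2. potential/possible failures
    inflight_transaction_impact: str   # 3. impact to in-flight transactions
    monitoring: List[str]              # 4. infrastructure and app monitoring
    risk_score: int                    # 5. hypothetical scale: 1 (low) to 5 (high)

    def missing_fields(self):
        """Return the names of items that are still empty (or a zero score)."""
        return [key for key, value in vars(self).items() if not value]

    def ready(self, max_risk=3):
        """Fully documented and at or below the (assumed) risk threshold."""
        return not self.missing_fields() and self.risk_score <= max_risk
```

A list of around 25 such records could then be sorted by `risk_score` so that the lowest-risk experiments run first.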

Capital One Evolution
Testing CI/CD Tooling Team Culture Evangelism
Taken from Chaos Engineering Book - 2020

How can WE start?

Chaos Maturity Model
Taken from Chaos Engineering Book - 2018

Chaos Maturity Model: Sophistication
Elementary, Simple, Sophisticated, Advanced

Sophistication Elementary

Elementary
• Experiments are not run in production.
• The process is administered manually.
• Results do not reflect business metrics.
• Simple events are applied, like "turn it off".
Taken from Chaos Engineering Book - 2018

Sophistication Simple

Simple
• Experiments are run with production-like (replayed) traffic.
• Self-service setup, automatic execution.
• Results reflect aggregated business metrics.
• Expanded events like latency are applied.
• Results are manually curated.
Taken from Chaos Engineering Book - 2018

Sophistication Sophisticated

Sophisticated
• Experiments run in production.
• Result analysis and termination are automated.
• Experimentation is integrated with CD.
• Business metrics are compared.
• Combinations of failures are applied.
• Results are tracked over time.
Taken from Chaos Engineering Book - 2018

Sophistication Advanced

Advanced
• Experiments run in each step of development.
• Design and execution are fully automated.
• Events include state mutation.
• Experiments have dynamic scope.
• Revenue loss can be projected from results.
• Capacity forecasting can be performed.
Taken from Chaos Engineering Book - 2018

Chaos Maturity Model: Adoption
In the shadows, Investment, Adoption, Cultural Expectation

In the shadows

In the shadows
• Skunkworks projects are unsanctioned.
• Few systems are covered.
• There is low or no organizational awareness.
• Early adopters infrequently perform chaos experimentation.
Taken from Chaos Engineering Book - 2018

Investment

Investment
• Experimentation is officially sanctioned.
• Resources are dedicated to the practice.
• Multiple teams are interested and engaged.
• A few critical services run chaos.
Taken from Chaos Engineering Book - 2018

Adoption

Adoption
• A team is dedicated to Chaos Engineering.
• Incidents are used to create regression experiments.
• Critical services practice regular chaos.
• Occasional Gamedays are performed.
Taken from Chaos Engineering Book - 2018

Cultural Expectation

Cultural Expectation
• All critical services run frequent chaos.
• Most noncritical services use chaos.
• Chaos experimentation is part of onboarding.
• Participation is the default behavior.
Taken from Chaos Engineering Book - 2018

How to start?
• https://chaosengineering.slack.com
• https://github.com/dastergon/awesome-chaos-engineering
• https://www.infoq.com/chaos-engineering
• https://www.gremlin.com/community/

Never let a good crisis go to waste! Winston Churchill

Thanks for coming!!! @yurynino
