Verifying Unknown Conditions with Chaos Engineering

Verifying Unknown Conditions: Chaos Engineering in AWS
© 2021 ADL - AWS www.sitereliabilityengineering.co

Speakers: @yurynino (Colombia) and @jhonnatangil (Colombia)
Site Reliability Engineers, Chaos Engineering Advocates
www.ingenieriadelcaos.com
www.sitereliabilityengineering.co
www.yurynino.com

AGENDA
Topics to be covered:
• Challenges with distributed systems
• What is Chaos Engineering?
• Why is testing in production hard?
• AWS Fault Injection Simulator
• FIS Features
• FIS Use Cases
• FIS Demo

Humans are central to both the problem and the solution of incidents in engineering!

Distributed Systems
The infrastructure required by a software system can be as complex as the software itself. Every production failure is unique: no two incidents share the precise chain of failure. Distributed systems are a particular challenge to program (Netflix, Twitter).

Implementation
What does it mean for a system to be distributed?

Distributed Systems Patterns
Patterns provide a structured way of looking at a problem space, along with solutions that have been seen multiple times and proven. The concept of patterns was introduced by Christopher Alexander. Looking at distributed systems as a series of patterns is a useful way to gain insight into their implementation.

• Consistent Core
• Follower Reads
• Generation Clock
• Gossip Dissemination
• HeartBeat
• Hybrid Clock
• Idempotent Receiver
• State Watch
• Quorum


What could go wrong?
Our systems face all kinds of adversity: hard disks fail, networks go down, customer traffic overloads capacity, and cyberattacks happen. In this chaotic world, how can they stay alive?

Process Crashes
A process can be taken down for routine maintenance by system administrators. It can be killed while doing file I/O because the disk is full and the exception is not properly handled. In cloud environments it can be even trickier, as unrelated events can bring the servers down.
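The disk-full case can be sketched in a few lines of Python. This is an illustration only, not code from the talk: `append_record`, `full_disk_opener`, and the `opener` parameter are hypothetical names, and the fault is simulated rather than produced by a real full disk.

```python
import errno


def append_record(path, record, opener=open):
    """Append a record; return False instead of crashing when the disk is full.

    `append_record` and `opener` are hypothetical names used for illustration.
    """
    try:
        with opener(path, "a") as handle:
            handle.write(record + "\n")
        return True
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            # Disk full: report the failure so the caller can alert or shed load.
            return False
        raise  # Other I/O errors are deliberately not handled in this sketch.


def full_disk_opener(path, mode):
    """Simulated fault: every open fails as if no space were left on the device."""
    raise OSError(errno.ENOSPC, "No space left on device")


if __name__ == "__main__":
    # The process survives the injected fault and reports it.
    print(append_record("unused.log", "event", opener=full_disk_opener))
```

Passing the opener in as a parameter is a design choice that makes the failure injectable in a test, which is the same idea a fault injection experiment applies at system scale.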

Network Delays
There are two problems to tackle here:
• A server cannot wait indefinitely to learn whether another server has crashed.
• There should not be two sets of servers, each considering the other set to have failed and therefore continuing to serve different sets of clients. This is called split brain.
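Both problems are commonly approached with a heartbeat timeout and a majority quorum. The following Python sketch shows the idea under stated assumptions: `FailureDetector`, `has_quorum`, and the injectable `clock` are hypothetical names, and a real cluster would exchange heartbeats over the network rather than through method calls.

```python
import time


class FailureDetector:
    """Tracks heartbeats and suspects peers that stay silent past a timeout."""

    def __init__(self, peers, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the example can control time
        now = clock()
        self.last_seen = {peer: now for peer in peers}

    def record_heartbeat(self, peer):
        self.last_seen[peer] = self.clock()

    def alive_peers(self):
        """Peers whose last heartbeat arrived within the timeout window."""
        now = self.clock()
        return {p for p, seen in self.last_seen.items()
                if now - seen <= self.timeout_s}


def has_quorum(alive_peer_count, cluster_size):
    """True when this node plus its reachable peers form a strict majority."""
    return (alive_peer_count + 1) > cluster_size // 2


if __name__ == "__main__":
    fake_time = [0.0]
    detector = FailureDetector(["b", "c", "d", "e"], timeout_s=2.0,
                               clock=lambda: fake_time[0])
    fake_time[0] = 1.0
    detector.record_heartbeat("b")
    detector.record_heartbeat("c")
    fake_time[0] = 2.5  # "d" and "e" have now been silent for too long
    alive = detector.alive_peers()
    print(sorted(alive), has_quorum(len(alive), cluster_size=5))
```

The timeout bounds how long a server waits before suspecting a peer. The majority rule means at most one side of a partition keeps serving clients, which is what prevents split brain.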

Fallacies in Distributed Systems

Everything fails, all the time!
Werner Vogels

What is Chaos Engineering?
It is the discipline of experimenting with failures in production in order to reveal a system's weaknesses and to build confidence in its resilience.
https://principlesofchaos.org/
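The shape of such an experiment can be sketched in Python: check a steady-state hypothesis, inject a fault, check again, and always roll back. Everything here is a toy under stated assumptions: `ToyService`, `run_experiment`, and `experiment_on` are hypothetical names, and the "system" is an in-memory counter rather than a production service.

```python
class ToyService:
    """Hypothetical in-memory stand-in for a replicated service."""

    def __init__(self, replicas):
        self.initial = replicas
        self.healthy = replicas

    def is_serving(self):
        # Steady state: at least one healthy replica answers requests.
        return self.healthy >= 1

    def kill_one_replica(self):
        self.healthy = max(0, self.healthy - 1)

    def restore(self):
        self.healthy = self.initial


def run_experiment(steady_state, inject_fault, rollback):
    """Run one experiment and report whether the steady-state hypothesis held."""
    if not steady_state():
        return "aborted: system not in steady state"
    inject_fault()
    try:
        held = steady_state()
    finally:
        rollback()  # always undo the fault, even if the check itself fails
    return "hypothesis held" if held else "weakness found"


def experiment_on(service):
    return run_experiment(service.is_serving,
                          service.kill_one_replica,
                          service.restore)


if __name__ == "__main__":
    for replicas in (2, 1):
        print(replicas, "replica(s):", experiment_on(ToyService(replicas)))
```

With two replicas the hypothesis holds. With one replica the same fault reveals a weakness, and finding that is the purpose of the experiment.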

Chaos History
2008: Chaos Engineering began at Netflix
2010: Chaos Monkey and Simian Army were launched
2016: Gremlin born
2017: SRE Usenix; Chaos IQ born; ChaosConf
2018: 1 book; Chaos Monkey for Spring Boot
2019: 1 book; Chaos massification
2020: 1 book was published

Chaos Tools
• Chaos Monkey
• Chaos Toolkit
• Gremlin
• Chaos Monkey for Spring Boot
• ChaosMesh
• AWS Fault Injection Simulator

You can measure your success with Chaos Engineering by counting the number of vulnerabilities.
Nora Jones

AWS Fault Injection Simulator
AWS FIS is a service for running fault injection experiments on AWS to improve an application's performance, observability, and resiliency. Fault injection experiments are used in chaos engineering to stress an application in testing or production environments by creating disruptive events, observing how the system responds, and implementing improvements. FIS provides controls and guardrails for running experiments, such as automatically rolling back or stopping the experiment if specific conditions are met.
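As a rough illustration of those guardrails, the Python sketch below builds a dictionary shaped like an FIS experiment template: one action that stops a tagged EC2 instance, and one stop condition tied to a CloudWatch alarm. The helper name `build_experiment_template`, the tag, and both ARNs are placeholders invented for this example, and the field names should be checked against the current AWS FIS documentation before use. Nothing is sent to AWS.

```python
import json


def build_experiment_template(role_arn, alarm_arn, tag_key, tag_value):
    """Return a dict shaped like an AWS FIS experiment template (a sketch)."""
    return {
        "description": "Stop one tagged EC2 instance, restart it after 5 minutes",
        "roleArn": role_arn,  # IAM role FIS assumes to run the experiment
        "targets": {
            "oneInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # pick a single matching instance
            }
        },
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "oneInstance"},
            }
        },
        # Guardrail: the experiment stops if this alarm goes into ALARM state.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
    }


if __name__ == "__main__":
    # Placeholder ARNs and tag, not real resources.
    template = build_experiment_template(
        role_arn="arn:aws:iam::111122223333:role/ExampleFisRole",
        alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:ExampleAlarm",
        tag_key="chaos-ready",
        tag_value="true",
    )
    print(json.dumps(template, indent=2))
```

Scoping the target by tag and selecting a single instance keeps the blast radius small, and the stop condition is what lets the experiment halt automatically when the alarm fires.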

How to begin?
https://www.ingenieriadelcaos.com
https://chaosengineering.slack.com
https://github.com/dastergon/awesome-chaos-engineering
https://www.infoq.com/chaos-engineering

Thanks for coming! @yurynino
