Slide 1

Slide 1 text

Verifying Unknown Conditions: Chaos Engineering in AWS
© 2021 ADL - AWS www.sitereliabilityengineering.co

Slide 2

Slide 2 text

Site Reliability Engineers, Chaos Engineering Advocates
@yurynino (Colombia)
@jhonnatangil (Colombia)
www.ingenieriadelcaos.com
www.sitereliabilityengineering.co
www.yurynino.com

Slide 3

Slide 3 text

AGENDA: Topics to be covered
● Challenges with distributed systems
● What is Chaos Engineering?
● Why is testing in production hard?
● AWS Fault Injection Simulator
● FIS Features
● FIS Use Cases
● FIS Demo

Slide 4

Slide 4 text

Introduction: Distributed Systems

Slide 5

Slide 5 text

Humans are central to both the problem and the solution of incidents in engineering!

Slide 6

Slide 6 text

Distributed Systems

The infrastructure required by a software system can be as complex as the software itself. Every production failure is unique. No two incidents will share the precise chain of failure!

They provide a particular challenge to program! (Netflix, Twitter)

Slide 7

Slide 7 text

Implementation: What does it mean for a system to be distributed?

Slide 8

Slide 8 text

Distributed Systems Patterns

Patterns provide a structured way of looking at a problem space, along with solutions that have been seen multiple times and proven. The concept of patterns was introduced by Christopher Alexander. Looking at distributed systems as a series of patterns is a useful way to gain insights into their implementation.

Slide 9

Slide 9 text

Distributed Systems Patterns
● Consistent Core
● Follower Reads
● Generation Clock
● Gossip Dissemination
● HeartBeat
● Hybrid Clock
● Idempotent Receiver
● State Watch
● Quorum

Slide 10

Slide 10 text

Distributed Systems Patterns

Slide 11

Slide 11 text

Distributed Systems: Challenges

Slide 12

Slide 12 text

What could go wrong?

Our systems face all kinds of adversity: hard disks fail, networks go down, customer traffic can overload them, and cyberattacks can happen. In this chaotic world, how can they still be alive?

Slide 13

Slide 13 text

Process Crashes

A process can be taken down for routine maintenance by system administrators. It can be killed while doing file IO because the disk is full and the exception is not properly handled. In cloud environments it can be even trickier, as some unrelated events can bring the servers down.
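The disk-full case can be sketched as follows. This is a minimal illustration, not code from the talk: `append_record` and `full_disk_write` are hypothetical names, and the full disk is simulated by raising the error directly.

```python
import errno


def append_record(write, record):
    """Try to persist a record; return False instead of crashing on a full disk.

    `write` is any callable that performs the file IO. It is injected here
    (an assumption of this sketch) so that the failure can be simulated.
    """
    try:
        write(record)
        return True
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            # Disk is full: degrade gracefully (shed load, raise an alert)
            # rather than let an unhandled exception kill the process.
            return False
        raise  # Any other IO error is out of scope for this sketch.


def full_disk_write(record):
    """Simulates a write to a disk with no space left."""
    raise OSError(errno.ENOSPC, "No space left on device")


stored = []
ok = append_record(stored.append, "event-1")         # normal write succeeds
failed = append_record(full_disk_write, "event-2")   # full disk is handled
```

Whether returning a flag, retrying, or shedding load is the right reaction depends on the system; the point is only that the process survives the error.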

Slide 14

Slide 14 text

Network Delays

There are two problems to be tackled here:
● A particular server cannot wait indefinitely to know whether another server has crashed.
● There should not be two sets of servers, each considering the other set to have failed and therefore continuing to serve different sets of clients. This is called split brain.
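Both problems have well-known mitigations from the patterns listed earlier: a heartbeat with a timeout bounds how long a server waits, and a majority quorum prevents two partitions from serving at once. The sketch below is illustrative only; the class and function names are assumptions, and time is passed in explicitly to keep the example deterministic.

```python
class HeartbeatMonitor:
    """Marks a peer as failed when no heartbeat arrives within `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # peer name -> time of the last heartbeat

    def record_heartbeat(self, peer, now):
        self.last_seen[peer] = now

    def is_alive(self, peer, now):
        seen = self.last_seen.get(peer)
        # A peer we never heard from, or heard from too long ago, is suspected dead.
        return seen is not None and (now - seen) <= self.timeout


def has_quorum(reachable, cluster_size):
    """A partition may keep serving only if it sees a strict majority of nodes."""
    return reachable > cluster_size // 2


monitor = HeartbeatMonitor(timeout=3.0)
monitor.record_heartbeat("server-b", now=10.0)
alive_soon = monitor.is_alive("server-b", now=12.0)  # within the timeout
alive_late = monitor.is_alive("server-b", now=20.0)  # timeout exceeded

# A 5-node cluster split 3/2: only the side that sees 3 nodes keeps serving,
# so the two sides can never serve clients at the same time.
majority_side = has_quorum(3, 5)
minority_side = has_quorum(2, 5)
```

Because two disjoint partitions cannot both hold a strict majority, at most one side continues to accept writes.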

Slide 15

Slide 15 text

Fallacies in Distributed Systems

Slide 16

Slide 16 text

Foundations: Chaos Engineering

Slide 17

Slide 17 text

"Everything fails, all the time!"
Werner Vogels

Slide 18

Slide 18 text

What is Chaos Engineering?

It is the discipline of experimenting with failures in production in order to reveal a system's weaknesses and to build confidence in its resilience.

https://principlesofchaos.org/
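One way to picture such an experiment is as a loop around a steady-state hypothesis: verify the system is healthy, inject a fault, check whether the hypothesis still holds, and always undo the fault. The sketch below is a toy under stated assumptions; `run_experiment` and the replica dictionary are hypothetical and stand in for a real system and real tooling.

```python
def run_experiment(steady_state, inject_fault, rollback):
    """Run one chaos experiment and report whether the hypothesis held."""
    if not steady_state():
        # Never inject faults into a system that is already unhealthy.
        return "aborted: system not healthy before the experiment"
    inject_fault()
    try:
        held = steady_state()
    finally:
        rollback()  # always undo the fault, even if the check itself fails
    return "hypothesis held" if held else "weakness found"


# Toy system: two replicas; the service is "up" while any replica is alive.
replicas = {"a": True, "b": True}


def steady_state():
    return any(replicas.values())


def kill_replica_a():
    replicas["a"] = False


def restore_replica_a():
    replicas["a"] = True


result = run_experiment(steady_state, kill_replica_a, restore_replica_a)
```

With a single replica, the same experiment would report a weakness, which is exactly the kind of finding the definition above is after.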

Slide 19

Slide 19 text

Chaos History
● 2008: Chaos Engineering began at Netflix
● 2010: Chaos Monkey & Simian Army were launched
● 2016: Gremlin born
● 2017: SRE Usenix, Chaos IQ born
● 2018: ChaosConf, 1 book, Chaos Monkey for Spring Boot
● 2019: 1 book, Chaos massification
● 2020: 1 book was published

Slide 20

Slide 20 text

Chaos Tools
● Chaos Monkey
● Chaos Toolkit
● Gremlin
● Chaos Monkey for Spring Boot
● ChaosMesh
● AWS Fault Injection Simulator

Slide 21

Slide 21 text

Chaos Resources

Slide 22

Slide 22 text

Foundations: AWS Fault Injection Simulator

Slide 23

Slide 23 text

"You can measure your success with Chaos Engineering by counting the number of vulnerabilities"
Nora Jones

Slide 24

Slide 24 text

AWS Fault Injection Simulator

AWS FIS is a service for running fault injection experiments on AWS to improve an application's performance, observability, and resiliency. Fault injection experiments are used in chaos engineering for stressing an application in testing or production environments by creating disruptive events, observing how the system responds, and implementing improvements. FIS provides controls and guardrails to run experiments, such as automatically rolling back or stopping the experiment if specific conditions are met.
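The guardrail idea (stop the experiment and roll back when a condition is met) can be sketched in plain Python. This is a conceptual simulation only and does not use or mirror the FIS API: `run_with_guardrail`, the step names, and the toy error metric are all assumptions made for illustration.

```python
def run_with_guardrail(steps, error_percent, max_error_percent, rollback):
    """Apply fault steps one by one; stop and roll back if the guardrail trips.

    Returns a (status, applied_steps) tuple.
    """
    applied = []
    for step in steps:
        applied.append(step)  # "inject" the next fault
        if error_percent(applied) > max_error_percent:
            rollback()  # stop condition met: undo and end the experiment early
            return "stopped", applied
    rollback()  # normal end of the experiment also restores the system
    return "completed", applied


steps = ["stop-instance-1", "stop-instance-2", "stop-instance-3", "stop-instance-4"]
rollbacks = []


def toy_error_percent(applied):
    # Assumed metric: every stopped instance adds 2% to the error rate.
    return 2 * len(applied)


status, applied = run_with_guardrail(
    steps, toy_error_percent, 5, lambda: rollbacks.append("rolled back")
)
```

In this toy run the third step pushes the error rate past the 5% threshold, so the fourth fault is never injected.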

Slide 25

Slide 25 text

AWS Fault Injection Simulator

Slide 26

Slide 26 text

AWS Fault Injection Simulator

Slide 27

Slide 27 text

Challenges with Distributed Systems

Slide 28

Slide 28 text

Components

Slide 29

Slide 29 text

Components

Slide 30

Slide 30 text

Components

Slide 31

Slide 31 text

Components

Slide 32

Slide 32 text

Components

Slide 33

Slide 33 text

Components

Slide 34

Slide 34 text


Slide 35

Slide 35 text


Slide 36

Slide 36 text


Slide 37

Slide 37 text

How to begin?
● https://www.ingenieriadelcaos.com
● https://chaosengineering.slack.com
● https://github.com/dastergon/awesome-chaos-engineering
● https://www.infoq.com/chaos-engineering

Slide 38

Slide 38 text

Practice Chaos Gamedays

Slide 39

Slide 39 text

Thanks for coming! @yurynino