Slide 1

Slide 1 text

Training Engineering Teams with Chaos Engineering

Slide 2

Slide 2 text

YURY NIÑO ROA Site Reliability Engineer Chaos Engineering Advocate @yurynino https://www.yurynino.dev/

Slide 3

Slide 3 text

AGENDA: Topics that will be covered
Motivations
● Identifying needs
● Training Engineers
How to
● Trainer Role
● Instruction Principles
GameDays
● Chaos Engineering
● Chaos GameDays

Slide 4

Slide 4 text

Motivations
● Identifying needs
● Training Engineers

Slide 5

Slide 5 text

Humans are central to both the problem and the solution of challenges in engineering!

Slide 6

Slide 6 text

Identify Needs
● Avoid Toil
● NaLSD
● Curiosity

Slide 7

Slide 7 text

Organizations Training SREs

Slide 8

Slide 8 text


Slide 9

Slide 9 text

Training Do’s
● Be prepared.
● State your objectives.
● Be organized.
● Use visuals.
● Answer questions.
● Be enthusiastic.
● Provide feedback.
● Be flexible.
● Prepare for emergencies.
● Encourage participation.
● Establish rapport.
● Be yourself.

Slide 10

Slide 10 text

Mistakes to avoid
● Starting late and wasting time.
● Being poorly prepared and lacking knowledge.
● Displaying distracting habits.
● Ignoring participants and interrupting their questions.
● Lacking enthusiasm.
● Reading from a script.

Slide 11

Slide 11 text

How to Ensure Participation
● Establish an informal atmosphere.
● Encourage participants to take control.
● Accept participants where they are.
● Communicate openly and honestly.
● Tap participants for ideas.

Slide 12

Slide 12 text

In an emergency
● To be able to construct a mental representation.
● To be able to assess risks and threats as relevant.
● To be able to switch from a situation under control.
● To be able to maintain a relevant level of confidence.
● To be able to make a decision in a complex situation.

Slide 13

Slide 13 text

In an emergency
● To be able to make intelligent use of procedures.
● To be able to use available resources.
● To be able to manage time and pressure.
● To be able to cooperate with crew members.
● To be able to properly use and manage information.

Slide 14

Slide 14 text

Human Factors

Slide 15

Slide 15 text


Slide 16

Slide 16 text

Humans operate differently when they expect things to fail! Aaron Rinehart

Slide 17

Slide 17 text

How to
● Training Roles
● Instruction Principles

Slide 18

Slide 18 text

The world is an imperfect place. We cannot control the environment! But we can control how we face failures.

Slide 19

Slide 19 text

4 Essential Capabilities
4 sets of answers to construct a resilience profile

Slide 20

Slide 20 text

GameDays
● Chaos Engineering
● Chaos GameDays

Slide 21

Slide 21 text

Everything fails, all the time! Werner Vogels

Slide 22

Slide 22 text

What is Chaos Engineering?
It is the discipline of experimenting with failures in production in order to reveal a system's weaknesses and to build confidence in its resilience capability.
https://principlesofchaos.org/

Slide 23

Slide 23 text

What is Chaos Engineering?
It is a scientific method that consists of specifying and evaluating resilience hypotheses by:
1) injecting faults in production
2) observing the impact
3) building resilience
https://principlesofchaos.org/
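The three steps above can be sketched as a small experiment loop. This is only an illustrative sketch: the service is simulated in-process rather than running in production, and every name in it (`call_service`, `steady_state_ok`, `run_experiment`, the 100 ms threshold) is a hypothetical choice made for the example, not part of any chaos tool.

```python
import random
import statistics
from typing import Dict, List, Optional


def call_service(fault_latency_ms: float = 0.0,
                 rng: Optional[random.Random] = None) -> float:
    """Simulated request (assumption): returns a response time in ms."""
    rng = rng or random.Random()
    base = rng.uniform(20.0, 40.0)   # normal response time
    return base + fault_latency_ms   # injected fault adds latency


def steady_state_ok(latencies: List[float], threshold_ms: float) -> bool:
    """Hypothesis: median latency stays at or below the threshold."""
    return statistics.median(latencies) <= threshold_ms


def run_experiment(injected_latency_ms: float,
                   threshold_ms: float = 100.0,
                   samples: int = 50,
                   seed: int = 42) -> Dict[str, object]:
    rng = random.Random(seed)

    # Verify the steady state before injecting anything.
    baseline = [call_service(rng=rng) for _ in range(samples)]
    if not steady_state_ok(baseline, threshold_ms):
        return {"status": "aborted: system not in steady state"}

    # 1) Inject the fault, 2) observe the impact.
    faulty = [call_service(injected_latency_ms, rng) for _ in range(samples)]
    held = steady_state_ok(faulty, threshold_ms)

    # 3) A failed hypothesis is a weakness to fix, which builds resilience.
    return {
        "status": "hypothesis held" if held else "weakness found",
        "baseline_median_ms": statistics.median(baseline),
        "fault_median_ms": statistics.median(faulty),
    }


if __name__ == "__main__":
    print(run_experiment(injected_latency_ms=30.0))
    print(run_experiment(injected_latency_ms=200.0))
```

With the assumed threshold of 100 ms, a 30 ms injected delay leaves the hypothesis intact, while a 200 ms delay reports a weakness.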

Slide 24

Slide 24 text

Chaos Engineering History
2008 Chaos Engineering began at Netflix
2010 Chaos Monkey & Simian Army were launched
2016 Gremlin born
2017 SRE Usenix, Chaos IQ born, ChaosConf
2018 1 Book, Chaos Monkey for Spring Boot
2019 1 Book, Chaos massification
2020 1 Book was published

Slide 25

Slide 25 text

Chaos Principles
1. Pick a Hypothesis: Recipe!
2. Choose the tools: Ingredients!
3. Launch an attack: Cook!
4. Notify the Org: Invite!
5. Run the Experiment: Enjoy!
6. Analyze the Results
7. Automate

Slide 26

Slide 26 text

Chaos: Principles
● Hypothesize about Steady State
● Run Experiments
● Vary Real-World Events
● Automate Experiments

Slide 27

Slide 27 text

Chaos: Plan

Slide 28

Slide 28 text

Chaos: Hypothesis
https://github.com/yurynino/learning-chaos-springboot

Slide 29

Slide 29 text

Chaos: Running
https://github.com/yurynino/learning-chaos-springboot

Slide 30

Slide 30 text

Chaos: Running

Slide 31

Slide 31 text

Chaos: Running

Slide 32

Slide 32 text

Chaos: Running

Slide 33

Slide 33 text

Chaos: Running

Slide 34

Slide 34 text

Chaos: Running

Slide 35

Slide 35 text

Chaos: Automate

Slide 36

Slide 36 text

Chaos Engineering Tools
● Chaos Toolkit
● Gremlin

Slide 37

Slide 37 text

Chaos Methodology
Before
● Pick a hypothesis.
● Pick a style.
● Decide who.
● Decide where.
● Decide when.
● Document.
● Get approval!
During
● Detect the situation.
● Take a deep breath.
● Communicate.
● Visit dashboards.
● Analyze data.
● Propose solutions.
● Apply and solve!
After
● Write a postmortem.
● What Happened
● Impact
● Duration
● Resolution Time
● Resolution
● Timeline
● Action Items
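The postmortem fields listed for the "After" phase can be captured in a small record type. The sketch below is one possible shape, not a standard template: the class name `Postmortem`, the three timestamps, and the reading of "Duration" as start-to-resolution and "Resolution Time" as detection-to-resolution are all assumptions made for illustration, as is the example incident data.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class Postmortem:
    """Hypothetical record holding the postmortem fields from the slide."""
    what_happened: str
    impact: str
    started_at: datetime
    detected_at: datetime
    resolved_at: datetime
    resolution: str
    timeline: List[Tuple[datetime, str]] = field(default_factory=list)
    action_items: List[str] = field(default_factory=list)

    @property
    def duration(self) -> timedelta:
        # Assumption: duration runs from start of the failure to resolution.
        return self.resolved_at - self.started_at

    @property
    def resolution_time(self) -> timedelta:
        # Assumption: resolution time runs from detection to resolution.
        return self.resolved_at - self.detected_at

    def render(self) -> str:
        """Render the postmortem as plain text, timeline in time order."""
        lines = [
            f"What Happened: {self.what_happened}",
            f"Impact: {self.impact}",
            f"Duration: {self.duration}",
            f"Resolution Time: {self.resolution_time}",
            f"Resolution: {self.resolution}",
            "Timeline:",
        ]
        lines += [f"  {ts:%H:%M} {event}" for ts, event in sorted(self.timeline)]
        lines.append("Action Items:")
        lines += [f"  - {item}" for item in self.action_items]
        return "\n".join(lines)


# Made-up GameDay incident, for illustration only.
start = datetime(2021, 1, 1, 10, 0)
example = Postmortem(
    what_happened="Injected latency made checkout requests time out.",
    impact="Checkout unavailable for test users.",
    started_at=start,
    detected_at=start + timedelta(minutes=5),
    resolved_at=start + timedelta(minutes=45),
    resolution="Removed the latency fault and added a client timeout.",
    timeline=[
        (start + timedelta(minutes=5), "Alert fired on dashboard"),
        (start, "Master of Disaster started the attack"),
        (start + timedelta(minutes=45), "Fault removed, service recovered"),
    ],
    action_items=["Add retry with backoff", "Alert on p99 latency"],
)

if __name__ == "__main__":
    print(example.render())
```

Keeping the timestamps as data and deriving the durations avoids the two figures drifting apart when the timeline is corrected later.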

Slide 38

Slide 38 text

Chaos Engineering Science

Slide 39

Slide 39 text

Chaos Engineering Motivations
The infrastructure required by a software system can be as complex as the software itself. We need a hands-on guide to exploring the world of Chaos!
Netflix, Twitter

Slide 40

Slide 40 text

Chaos Maturity Model
From the Chaos Engineering book, 2020

Slide 41

Slide 41 text

Who is practicing?

Slide 42

Slide 42 text

Who is practicing?

Slide 43

Slide 43 text

Who is practicing: Disasterpiece Theater
Whenever they launch features or make changes, they test the fault tolerance of that new code. In January of 2018, they started a rigorous process of identifying failures that are likely to happen and that they must be able to tolerate, and then purposely causing them to happen in production. This isn’t Chaos Engineering as practiced and evangelized by Netflix. It’s the first step; they call it Disasterpiece Theater.

Slide 44

Slide 44 text

Who is practicing: LinkedOut
Taken from the Chaos Engineering book, 2020

Slide 45

Slide 45 text

Who is practicing: Evolution
● CI/CD
● Tooling
● Culture
● Evangelism
● Team
Taken from the Chaos Engineering book, 2020

Slide 46

Slide 46 text

Chaos GameDays
GameDays are interactive, real-world learning exercises. They are designed to give players a chance to put their skills in a technology to the test. GameDays were created by Jesse Robbins, inspired by his experience and training as a firefighter.

Slide 47

Slide 47 text

Chaos References
GameDays: GameDays are interactive team-based learning exercises designed to give players a chance to put their skills to the test in a real-world, gamified, risk-free environment.
Chaos GameDays: A Chaos GameDay is a practice event, and although it can take a whole day, it usually requires only a few hours. The goal of a GameDay is to practice how you, your team, and your supporting systems deal with real-world turbulent conditions.

Slide 48

Slide 48 text

Chaos GameDays
● First on Call: Monitors, triages, and tries to mitigate failures caused by the Master of Disaster.
● Master of Disaster: Decides the failure and declares the start of the incident and attack!
● Team: Finds and solves the exhibited issues, and writes up the postmortem.

Slide 49

Slide 49 text

Chaos References

Slide 50

Slide 50 text

Chaos References

Slide 51

Slide 51 text

How to begin?
● https://chaosengineering.slack.com
● https://github.com/dastergon/awesome-chaos-engineering
● https://www.infoq.com/chaos-engineering
● @yurynino

Slide 52

Slide 52 text

Picasso Napkin. Mark Manson, The Subtle Art of Not Giving a F*ck

Slide 53

Slide 53 text

Thanks for coming!!! @yurynino https://www.yurynino.dev/