An Optimist's Guide to Chaos Engineering

Slide 1

Slide 1 text

@andyfleener AN OPTIMIST’S GUIDE TO CHAOS ENGINEERING BY ANDY FLEENER

Slide 2

Slide 2 text

HI, I’M ANDY AND I’M AN OPTIMIST

Slide 3

Slide 3 text

OPTIMISM CAN BE HARD WHEN EVERYTHING IS BROKEN ALL THE TIME

Slide 4

Slide 4 text

ABOUT ME
▸ Senior Software Engineer
▸ Ruby Developer for 9 years
▸ In Operations for the last 4 years
▸ Complexity / System Safety Nerd

Slide 5

Slide 5 text


Slide 6

Slide 6 text

Creative Commons Image: NASA Solar System Exploration

Slide 7

Slide 7 text

REAL BUGS I’VE SEEN IN THE WILD
▸ Ruby’s Timeout library sometimes doesn’t actually time out
▸ A Node framework mishandles an invalid URL and crashes
▸ A Ruby app server flips its shit when a stream is closed early
Creative Commons Image: Matteo X

Slide 8

Slide 8 text

THESE BUGS WERE AWESOME
▸ I learned a shitload about how each of these parts of an application works
▸ We tracked down the problem and worked through solutions for each
▸ These are the kinds of challenges that get me excited to go to work
Creative Commons Image: Neil Moralee

Slide 9

Slide 9 text

#REALTALK: I ENJOY BREAKING THINGS

Slide 10

Slide 10 text


Slide 11

Slide 11 text

SYSTEMS THINKING
▸ It can be very counterintuitive
▸ Success and Failure are not easily defined
▸ Emergent properties of a system are the feedback required to close the loop
Creative Commons Image: Liam Ross

Slide 12

Slide 12 text

COMPLEX SYSTEMS ARE PRETTY DOPE
Creative Commons Image: John Smith

Slide 13

Slide 13 text

THREE TRUTHS ABOUT COMPLEX SYSTEMS
▸ They are inherently unsafe
▸ Failure is a normal state of the system
▸ Failure can actually make a system stronger
Creative Commons Image: Bjoern von Thuelen

Slide 14

Slide 14 text

ENTER CHAOS ENGINEERING

Slide 15

Slide 15 text

“CHAOS ENGINEERING IS THE DISCIPLINE OF EXPERIMENTING ON A DISTRIBUTED SYSTEM IN ORDER TO BUILD CONFIDENCE IN THE SYSTEM’S CAPABILITY TO WITHSTAND TURBULENT CONDITIONS IN PRODUCTION.”
PRINCIPLES OF CHAOS ENGINEERING

Slide 16

Slide 16 text

DISTRIBUTED SYSTEMS ARE COMPLEX SYSTEMS WE TOUCH EVERY DAY

Slide 17

Slide 17 text

BASIC PRINCIPLES
▸ Build a Hypothesis around Steady State Behavior
▸ Vary Real-world Events
▸ Learn from the changes to the Steady State
Creative Commons Image: Hamed Sabe

Slide 18

Slide 18 text

BUILD A HYPOTHESIS AROUND STEADY STATE BEHAVIOR
▸ Attempt to understand as much of the system’s steady, normal running state as you can
▸ The network is the backbone of distributed systems, and it is designed to withstand failure
▸ Measure outcomes of the system by the value it provides
▸ APM is generally a good way to understand Steady State
Creative Commons Image: Seabamirum

Slide 19

Slide 19 text

VARY REAL-WORLD EVENTS
▸ Variables should reflect real-world events
▸ Prioritize by both frequency and potential impact
▸ Systems that resist failure tend to fail catastrophically
▸ Rule #1: Never Trust the Network
▸ Chaos Variables can be any event capable of disrupting steady state
Creative Commons Image: Leo Fung
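
One way to act on "prioritize by both frequency and potential impact" is to score each candidate variable and sort. This is only a sketch: the event names, the two scales, and the simple frequency-times-impact product are all assumptions made for illustration, not something the talk prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChaosVariable:
    name: str
    frequency: float  # assumed scale: expected occurrences per month
    impact: float     # assumed scale: 1 (minor) to 10 (outage)

    @property
    def priority(self) -> float:
        # Plain product of the two scores; a team may prefer other weights.
        return self.frequency * self.impact


def prioritize(variables):
    """Return the variables sorted from highest to lowest priority."""
    return sorted(variables, key=lambda v: v.priority, reverse=True)


# Hypothetical candidates with made-up scores.
candidates = [
    ChaosVariable("network latency spike", frequency=20, impact=4),
    ChaosVariable("instance termination", frequency=2, impact=6),
    ChaosVariable("dependency returns errors", frequency=5, impact=7),
]

for variable in prioritize(candidates):
    print(f"{variable.priority:6.1f}  {variable.name}")
```

With these made-up scores the frequent-but-moderate latency spike ranks first, which is the point of weighing frequency alongside impact.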

Slide 20

Slide 20 text

LEARN FROM THE CHANGES TO THE STEADY STATE
▸ How did the system react to the variable change?
▸ Sometimes it’s obvious
▸ Hopefully, more often, your system can tolerate the failure
Creative Commons Image: Simon_sees
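
The three principles can be sketched as a tiny experiment loop: observe a steady-state metric, vary one event, and compare. Everything here is simulated and hypothetical: `handle_request`, the retry count, the failure rates, and the 2% tolerance are assumptions chosen for illustration, and a real experiment would read the metric from monitoring rather than a random number generator.

```python
import random


def handle_request(dependency_failure_rate, retries, rng):
    """Simulated request: succeeds if any attempt at the dependency succeeds."""
    return any(rng.random() >= dependency_failure_rate for _ in range(retries + 1))


def success_rate(dependency_failure_rate, retries, rng, requests=2000):
    """Steady-state metric: the fraction of requests that deliver value."""
    successes = sum(
        handle_request(dependency_failure_rate, retries, rng) for _ in range(requests)
    )
    return successes / requests


def run_experiment(baseline_rate, chaos_rate, retries, tolerance=0.02, seed=7):
    rng = random.Random(seed)
    # 1. Observe the steady state.
    steady_state = success_rate(baseline_rate, retries, rng)
    # 2. Vary a real-world event: the dependency fails more often.
    under_chaos = success_rate(chaos_rate, retries, rng)
    # 3. Learn: did the metric stay within the hypothesized tolerance?
    return {
        "steady_state": steady_state,
        "under_chaos": under_chaos,
        "hypothesis_holds": steady_state - under_chaos <= tolerance,
    }


if __name__ == "__main__":
    for retries in (0, 3):
        print(retries, run_experiment(baseline_rate=0.01, chaos_rate=0.30, retries=retries))
```

Running it with and without retries shows the two outcomes from the slide: either the drop in the metric is obvious, or the system absorbs the failure.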

Slide 21

Slide 21 text

CHAOS ENGINEERING WAS FOUNDED ON RUNNING EXPERIMENTS IN PRODUCTION

Slide 22

Slide 22 text


Slide 23

Slide 23 text

WHY WOULD YOU DO THAT?
▸ It’s impossible to create a duplicate complex system
▸ Failure in a production-like environment won’t be exactly the same
▸ Fake it till you make it
▸ Do not run a Chaos Experiment on something you never tested
▸ Running experiments on production-like systems is better than not running them at all
Creative Commons Image: Toms River Fire Dept

Slide 24

Slide 24 text

COOL, SO HOW DO I DO THIS?

Slide 25

Slide 25 text

NETWORK LATENCY IS A DISTRIBUTED SYSTEM’S KRYPTONITE

Slide 26

Slide 26 text

USE TOOLS THAT EXIST
▸ Toxiproxy by Shopify
▸ Comcast by tylertreat
▸ Chaos Monkey by Netflix
Creative Commons Image: Toms River Fire Dept

Slide 27

Slide 27 text

DISTRIBUTED SYSTEMS NEED TO TALK TO EACH OTHER. SLOW THAT DOWN AND I BET YOU’LL SEE SOMETHING INTERESTING
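
To get a feel for "slow that down" without any extra tooling, a delay can be wrapped around a call in-process. This is a minimal sketch, not how Toxiproxy or Comcast work (they act at the network level, and none of their APIs are used here); `fetch_profile`, the 0.2 s delay, and the 0.1 s budget are hypothetical.

```python
import time


def with_latency(func, delay_seconds, sleep=time.sleep):
    """Wrap func so every call is delayed, imitating a slow network hop."""
    def slowed(*args, **kwargs):
        sleep(delay_seconds)
        return func(*args, **kwargs)
    return slowed


def fetch_profile(user_id):
    """Hypothetical downstream call; a real one would cross the network."""
    return {"id": user_id, "name": "example"}


def timed_call(func, *args, budget_seconds):
    """Call func and report whether it finished inside the latency budget."""
    started = time.monotonic()
    result = func(*args)
    elapsed = time.monotonic() - started
    return result, elapsed, elapsed <= budget_seconds


# Inject 200 ms of latency into a call that has a 100 ms budget.
slow_fetch = with_latency(fetch_profile, delay_seconds=0.2)
result, elapsed, within_budget = timed_call(slow_fetch, 42, budget_seconds=0.1)
print(f"took {elapsed:.2f}s, within budget: {within_budget}")
```

The interesting part is what the caller does once the budget is blown: whether it times out, retries, queues up, or falls over.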

Slide 28

Slide 28 text


Slide 29

Slide 29 text

GOOGLE “CIRCUIT BREAKER PATTERN”
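
For anyone who hasn't searched yet, the pattern can be sketched in a few lines: after enough consecutive failures the breaker "opens" and rejects calls immediately, then lets a single trial call through once a timeout has passed. This is a simplified illustration, not the implementation of any particular library; the class name, threshold, and timeout values are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the timeout can be tested
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call as `breaker.call(fetch, url)` and treat `CircuitOpenError` as the cue to serve a fallback instead of waiting on a dependency that is already struggling.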

Slide 30

Slide 30 text

NETFLIX HAS YOUR BACK: CHECK OUT HYSTRIX

Slide 31

Slide 31 text

BUT WHY SHOULD I EVEN DO THIS?

Slide 32

Slide 32 text

TO DISCOVER LATENT FAILURES

Slide 33

Slide 33 text

IT’S EASY TO ASSUME THE SYSTEM IS SAFE

Slide 34

Slide 34 text

THE PRIME DIRECTIVE OF CHAOS ENGINEERING IS TO LEARN

Slide 35

Slide 35 text

BY LEARNING, OUR SYSTEM CAN CONTINUE TO GROW AND IMPROVE

Slide 36

Slide 36 text

WE’RE ACTIVELY ADDING TO THE SYSTEM’S CAPACITY TO FAIL

Slide 37

Slide 37 text

ADDING CAPACITY TO FAIL IS THE KEY TO CREATING NOT JUST RELIABLE SYSTEMS BUT RESILIENT ONES

Slide 38

Slide 38 text

AND NOW FOR SOMETHING COMPLETELY DIFFERENT

Slide 39

Slide 39 text

ORGANIZATIONS ARE DISTRIBUTED SYSTEMS

Slide 40

Slide 40 text

WAIT, WHAT?

Slide 41

Slide 41 text


Slide 42

Slide 42 text

HOW DO WE APPLY CHAOS ENGINEERING TO OTHER SOCIOTECHNICAL SYSTEMS?

Slide 43

Slide 43 text

THAT’S NOT A THING

Slide 44

Slide 44 text

“ORGANIZATIONS WHICH DESIGN SYSTEMS ARE CONSTRAINED TO PRODUCE DESIGNS WHICH ARE COPIES OF THE COMMUNICATION STRUCTURES OF THESE ORGANIZATIONS”
CONWAY’S LAW

Slide 45

Slide 45 text

CHAOS ENGINEERING AN ORGANIZATION
▸ Resilient Teams create Resilient Systems
▸ Teams are like services
▸ Communication is the Network of Teams
Creative Commons Image: Meenakshi Madhavan

Slide 46

Slide 46 text

LEARNING ORGANIZATIONS HAVE CAPACITY TO FAIL

Slide 47

Slide 47 text

FAILURE INJECTION IS ONE OF THE BEST WAYS TO LEARN ABOUT A SYSTEM

Slide 48

Slide 48 text

WHAT DOES “FAILURE” MEAN HERE?

Slide 49

Slide 49 text

MOVE PEOPLE BETWEEN TEAMS

Slide 50

Slide 50 text

YOUR TEAM LEADS MAY NOT LIKE THIS

Slide 51

Slide 51 text

HIGHLY COLLABORATIVE TEAMS CAN FALL APART IN COMMUNICATION BREAKDOWNS

Slide 52

Slide 52 text

CREATE PATHWAYS FOR COMMUNICATION

Slide 53

Slide 53 text

ADDING MORE AVENUES OF COMMUNICATION MAKES ONE FAILURE LESS IMPACTFUL