Slide 1

Slide 1 text

CHAOS ENGINEERING @tammybütow, Gremlin

Slide 2

Slide 2 text

“Build, test, revise, build. That was our motto.” @tammybütow DEVOPS @ NIKE DAY

Slide 3

Slide 3 text

“People constantly updated testing documents, and every day we made changes in the moment”

Slide 4

Slide 4 text

NIKE HYPERADAPT 1.0

Slide 5

Slide 5 text


Slide 6

Slide 6 text

People Put Nike Hyperadapt 1.0 To The Test:
• On The Court - 20 Basketball Research Athletes
• Pounding Pavement - running at Nike campus
• In The Gym - Nike Employee Training Classes
• Walking And Working - 150 Employees

Slide 7

Slide 7 text

How does this relate to Chaos Engineering?

Slide 8

Slide 8 text

You can use Chaos Engineering to ensure your systems are as resilient as your sneakers.

Slide 9

Slide 9 text


Slide 10

Slide 10 text

TAMMY BÜTOW
Principal SRE, Gremlin
Causing chaos in prod since 2009
@tammybütow

Slide 11

Slide 11 text

GREMLIN
• We are practitioners of Chaos Engineering
• We build software that helps engineers build resilient systems
• We offer 11 ways to inject chaos for your Chaos Engineering experiments

Slide 12

Slide 12 text

PART 1: LAYING THE FOUNDATION

Slide 13

Slide 13 text

It would be silly to give an Olympic pole-vaulter a broom and ban them from practicing!

Slide 14

Slide 14 text

“Thoughtful planned experiments designed to reveal the weaknesses in our systems” - Kolton Andrus, Gremlin CEO

Slide 15

Slide 15 text

Eventually systems will break in many undesired ways. Break them first on purpose with controlled chaos!

Slide 16

Slide 16 text

DOGFOODING
• Using your own product.
• For us that means using Gremlin for our Chaos Engineering experiments.
• Failure Fridays

Slide 17

Slide 17 text

Failure Fridays are dedicated time for teams to collaboratively focus on using Chaos Engineering practices to reveal weaknesses in their services.

Slide 18

Slide 18 text

WHY DO DISTRIBUTED SYSTEMS NEED CHAOS?
• Unusual, hard-to-debug failures are common
• Systems and companies scale rapidly, and Chaos Engineering helps you learn along the way

Slide 19

Slide 19 text

FULL-STACK CHAOS ENGINEERING
• You can inject chaos at any layer.
• API, App, Cache, Database, OS, Host, Network, Power & more.

Slide 20

Slide 20 text

WHY RUN CHAOS ENGINEERING EXPERIMENTS?

Slide 21

Slide 21 text

Are you confident that your metrics and alerting are as good as they should be? #pagerpain

Slide 22

Slide 22 text

Are you confident your customers are getting as good an experience as they should be? #customerpain

Slide 23

Slide 23 text

Are you losing money due to downtime and broken features? #businesspain

Slide 24

Slide 24 text

HOW DO YOU RUN CHAOS ENGINEERING EXPERIMENTS?

Slide 25

Slide 25 text

HOW TO RUN A CHAOS ENGINEERING EXPERIMENT
• Form a hypothesis
• Consider blast radius
• Run experiment
• Measure results
• Find & fix issues or scale
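The five steps above can be captured in a small harness. This is a minimal sketch, not Gremlin's API: the `Experiment` class, the `run_experiment` function, the `error_rate` metric key, and the 1% threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Experiment:
    """One chaos experiment (hypothetical structure, for illustration only)."""
    hypothesis: str                            # what we expect to stay true
    blast_radius: List[str]                    # hosts the attack may touch
    attack: Callable[[str], None]              # injects the failure on one host
    measure: Callable[[], Dict[str, float]]    # collects metrics after the attack
    max_error_rate: float = 0.01               # assumed threshold for "hypothesis holds"


def run_experiment(exp: Experiment) -> dict:
    """Attack every host in the blast radius, measure, and suggest a next step."""
    for host in exp.blast_radius:
        exp.attack(host)
    metrics = exp.measure()
    holds = metrics.get("error_rate", 0.0) <= exp.max_error_rate
    return {
        "hypothesis": exp.hypothesis,
        "metrics": metrics,
        "hypothesis_holds": holds,
        # Mirrors the last step on the slide: fix what broke, or widen the scope.
        "next_step": "scale" if holds else "find and fix issues",
    }
```

Starting with a blast radius of one host keeps the first run small; the list can grow once the hypothesis holds.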

Slide 26

Slide 26 text

Don’t run before you can walk

Slide 27

Slide 27 text

HOW TO CHOOSE A CHAOS EXPERIMENT
• Identify top 5 critical systems
• Choose 1 system
• Whiteboard the system
• Select attack: resource/state/network
• Determine scope

Slide 28

Slide 28 text

WHAT SHOULD WE MEASURE?
• Availability: 500s
• Service specific KPIs
• System metrics: CPU, IO, Disk
• Customer complaints
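As one way to turn "Availability: 500s" into a number, the sketch below computes the share of requests that did not return a 5xx status. It assumes the status codes have already been collected from access logs or a metrics store; the `availability` function name is hypothetical.

```python
from typing import Iterable


def availability(status_codes: Iterable[int]) -> float:
    """Fraction of requests that did not end in an HTTP 5xx response."""
    codes = list(status_codes)
    if not codes:
        # No traffic observed: treat as fully available rather than divide by zero.
        return 1.0
    server_errors = sum(1 for code in codes if 500 <= code <= 599)
    return 1.0 - server_errors / len(codes)
```

Comparing this value before, during, and after an attack shows whether the experiment affected customers.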

Slide 29

Slide 29 text

EXAMPLE SYSTEM: KUBERNETES RETAIL STORE
User
Primary: kube-01
Node: kube-02
Node: kube-03
Node: kube-04

Slide 30

Slide 30 text

PART 2: RESOURCE CHAOS ENGINEERING

Slide 31

Slide 31 text

RESOURCE CHAOS
We can increase CPU, disk, IO and memory consumption to ensure monitoring is set up to catch problems. It is important to catch issues before they turn into high-severity incidents (unable to purchase new product!) and downtime for customers.

Slide 32

Slide 32 text

CPU CHAOS

Slide 33

Slide 33 text

LET’S CREATE A “KNOWN-KNOWN” EXPERIMENT
https://github.com/tammybutow/chaosengineeringbootcamp

Slide 34

Slide 34 text

CHAOS IN TOP

Slide 35

Slide 35 text

LET’S KILL THE CHAOS NOW

Slide 36

Slide 36 text

NO MORE CHAOS IN TOP
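The start-then-kill flow of the CPU demo can be approximated with a child process that spins in a busy loop. This is a sketch under assumptions, not the script from the bootcamp repository: `start_cpu_burn` and `stop_cpu_burn` are hypothetical helper names, and the child burns one core only.

```python
import subprocess
import sys


def start_cpu_burn() -> subprocess.Popen:
    """Start a child Python process that spins in a tight loop on one core."""
    return subprocess.Popen([sys.executable, "-c", "while True: pass"])


def stop_cpu_burn(proc: subprocess.Popen, timeout: float = 5.0) -> int:
    """Ask the child to stop, force-kill it if it does not exit in time."""
    proc.terminate()
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        return proc.wait()
```

While the child runs it should show up near the top of `top`; after `stop_cpu_burn` returns it should be gone.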

Slide 37

Slide 37 text

DISK CHAOS

Slide 38

Slide 38 text

DISK CHAOS
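A bounded way to try disk chaos is to write a file of a known size and watch the usage percentage that monitoring should alert on. The sketch below assumes a scratch directory you are allowed to fill; `fill_disk`, `percent_used`, and the file name `chaos_fill.bin` are hypothetical.

```python
import os
import shutil

MEGABYTE = 1024 * 1024


def fill_disk(directory: str, megabytes: int) -> str:
    """Write `megabytes` MiB of zero bytes into `directory`, return the file path."""
    path = os.path.join(directory, "chaos_fill.bin")
    chunk = b"\0" * MEGABYTE
    with open(path, "wb") as handle:
        for _ in range(megabytes):
            handle.write(chunk)
        handle.flush()
        os.fsync(handle.fileno())  # push the data to disk, not just the page cache
    return path


def percent_used(directory: str) -> float:
    """Percentage of the filesystem holding `directory` that is in use."""
    usage = shutil.disk_usage(directory)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total
```

Deleting the returned file with `os.remove` ends the experiment.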

Slide 39

Slide 39 text

MEMORY CHAOS

Slide 40

Slide 40 text

MEMORY CHAOS
free -m
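To see memory consumption move in `free -m`, a process can simply hold on to a known amount of memory. This is a minimal sketch with hypothetical helper names (`allocate_memory`, `release_memory`); how quickly the operating system reports the change, and when memory is returned after release, depends on the platform.

```python
MEGABYTE = 1024 * 1024


def allocate_memory(megabytes: int) -> list:
    """Hold `megabytes` MiB of non-zero bytes until the returned list is released."""
    # Non-zero content is written into every block so the pages are actually touched.
    return [bytearray(b"\x01" * MEGABYTE) for _ in range(megabytes)]


def release_memory(blocks: list) -> None:
    """Drop the references so the interpreter can free the blocks."""
    blocks.clear()
```

Keeping the amount small relative to total memory limits the blast radius of the experiment.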

Slide 41

Slide 41 text

PART 3: STATE CHAOS ENGINEERING

Slide 42

Slide 42 text

PROCESS CHAOS

Slide 43

Slide 43 text

PROCESS CHAOS
Ways to create process chaos on purpose:
• Kill one process
• Loop kill a process
• Spawn new processes
• Fork bomb
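Killing one process is most informative when something is expected to bring it back. The sketch below pairs a kill with a tiny restart check; the `Supervisor` class is a hypothetical stand-in for whatever process manager the real system uses.

```python
import subprocess
from typing import List


class Supervisor:
    """Hypothetical minimal supervisor: restarts its child if it has exited."""

    def __init__(self, command: List[str]) -> None:
        self.command = command
        self.proc = subprocess.Popen(command)
        self.restarts = 0

    def ensure_running(self) -> None:
        """Start a fresh child if the current one is no longer alive."""
        if self.proc.poll() is not None:
            self.proc = subprocess.Popen(self.command)
            self.restarts += 1

    def stop(self) -> None:
        """Kill the child and wait for it to exit."""
        self.proc.kill()
        self.proc.wait()
```

Calling `ensure_running` in a loop while repeatedly killing the child gives a small version of the "loop kill" case.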

Slide 44

Slide 44 text

PROCESS CHAOS
pkill -u chaos

Slide 45

Slide 45 text

SHUTDOWN CHAOS

Slide 46

Slide 46 text

SHUTDOWN CHAOS
shutdown -h

Slide 47

Slide 47 text

WHAT ARE OTHER WAYS YOU CAN TURN OFF A SERVER?
WHAT IF YOU WANT TO TURN OFF EVERY SERVER WHEN IT’S ONE WEEK OLD?
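The one-week rule reduces to an age check over launch times. The sketch below only selects which servers are due and leaves the actual shutdown to other tooling; the function name and the shape of the `launch_times` mapping are assumptions.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def servers_due_for_shutdown(
    launch_times: Dict[str, datetime],
    now: datetime,
    max_age: timedelta = timedelta(days=7),
) -> List[str]:
    """Return the names of servers whose age is at least `max_age`, sorted by name."""
    return sorted(
        name for name, launched in launch_times.items()
        if now - launched >= max_age
    )
```

Running this on a schedule and shutting down the result would keep every server in the fleet under a week old.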

Slide 48

Slide 48 text

HALT, REBOOT & POWEROFF CHAOS
halt

Slide 49

Slide 49 text

WHAT ABOUT SHUTTING DOWN CONTAINERS AND K8S PODS?

Slide 50

Slide 50 text

THE MANY WAYS TO KILL CONTAINERS
• Kill self
• Kill a container from the host
• Use one container to kill another
• Use one container to kill several containers
• Use several containers to kill several

Slide 51

Slide 51 text

The average lifespan of a container is 2.5 days, and they fail in many unexpected ways.

Slide 52

Slide 52 text

TIME TRAVEL CHAOS

Slide 53

Slide 53 text

TIME TRAVEL CHAOS AKA CLOCK SKEW
ntpq
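Clock skew matters because time-based logic silently changes its answer. As an illustration, the sketch below measures skew against a reference clock and shows a token expiry check that a skewed clock would get wrong; both function names and the token scenario are assumptions, not part of the talk.

```python
from datetime import datetime, timedelta


def clock_skew_seconds(local: datetime, reference: datetime) -> float:
    """Positive when the local clock is ahead of the reference clock."""
    return (local - reference).total_seconds()


def token_is_valid(issued_at: datetime, ttl: timedelta, now: datetime) -> bool:
    """A token is valid from its issue time until the TTL has elapsed."""
    return issued_at <= now < issued_at + ttl
```

Passing a deliberately shifted `now` into `token_is_valid` is a cheap way to rehearse the failure before skewing a real host clock.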

Slide 54

Slide 54 text

PART 4: NETWORK CHAOS ENGINEERING

Slide 55

Slide 55 text

BLACKHOLE CHAOS

Slide 56

Slide 56 text

BLACKHOLE CHAOS
ip route show
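From the client side, blackholed traffic typically shows up as a connection that hangs until a timeout, rather than one that is refused straight away. The sketch below classifies a TCP connection attempt so the two cases can be told apart; the `probe` function and its return labels are hypothetical.

```python
import socket


def probe(host: str, port: int, timeout: float = 1.0) -> str:
    """Try a TCP connection and report 'open', 'timeout', 'refused', or 'error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all within the timeout: what a dropped route tends to look like.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered but nothing is listening on the port.
        return "refused"
    except OSError:
        return "error"
```

Clients without an explicit timeout are the ones most likely to hang during a blackhole experiment.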

Slide 57

Slide 57 text

DNS CHAOS

Slide 58

Slide 58 text

DNS CHAOS

Slide 59

Slide 59 text

DNS CHAOS

Slide 60

Slide 60 text

LATENCY CHAOS

Slide 61

Slide 61 text

LATENCY CHAOS
mtr google.com
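Latency can also be injected at the application layer by delaying a call before it runs. This is a sketch of that idea, not of network-level injection: `with_latency` is a hypothetical wrapper, and the `sleep` parameter exists only so the delay can be swapped out in tests.

```python
import time
from typing import Any, Callable


def with_latency(
    func: Callable[..., Any],
    delay_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[..., Any]:
    """Return a version of `func` that waits `delay_seconds` before each call."""
    def wrapped(*args: Any, **kwargs: Any) -> Any:
        sleep(delay_seconds)
        return func(*args, **kwargs)
    return wrapped
```

Wrapping a dependency client this way shows whether callers enforce their own timeouts when the dependency slows down.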

Slide 62

Slide 62 text

PACKET LOSS CHAOS

Slide 63

Slide 63 text

PACKET LOSS CHAOS
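The effect of packet loss on a retrying client can be explored with a simulation before touching a real network. The sketch below drops sends at a given rate and counts the attempts needed; `lossy_send`, `send_with_retries`, and the retry limit are hypothetical, and no real packets are involved.

```python
import random


def lossy_send(loss_rate: float, rng: random.Random) -> bool:
    """Simulate one send: True if delivered, False if dropped."""
    return rng.random() >= loss_rate


def send_with_retries(loss_rate: float, rng: random.Random, max_attempts: int = 3) -> int:
    """Return the attempt number that succeeded, or 0 if every attempt was dropped."""
    for attempt in range(1, max_attempts + 1):
        if lossy_send(loss_rate, rng):
            return attempt
    return 0
```

Passing a seeded `random.Random` makes a simulated run repeatable, which helps when comparing retry settings.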

Slide 64

Slide 64 text

PART 5: COMPLEX OUTAGES

Slide 65

Slide 65 text

We can combine different types of chaos engineering experiments to reproduce complicated outages. Reproducing an outage gives you confidence that you can handle it if and when it happens again.

Slide 66

Slide 66 text

Let’s go back in time to look at some of the worst outage stories that kicked off the introduction of chaos engineering.

Slide 67

Slide 67 text

DROPBOX’S WORST OUTAGE EVER
Some master-replica pairs were impacted, which resulted in the site going down.
https://blogs.dropbox.com/tech/2014/01/outage-post-mortem/

Slide 68

Slide 68 text

UBER’S DATABASE OUTAGE
1. Master log replication to S3 failed
2. Logs backed up on the primary
3. Alerts fired to the engineer but were ignored
4. Disk filled up on the database primary
5. Engineer deleted unarchived WAL files
6. Error in config prevented promotion
- Matt Ranney, Uber, 2015

Slide 69

Slide 69 text

OUTAGES HAPPEN.

Slide 70

Slide 70 text

THERE ARE MANY MORE OUTAGES YOU CAN READ ABOUT HERE: https://github.com/danluu/post-mortems

Slide 71

Slide 71 text

HOW CAN YOU CONTINUE YOUR CHAOS ENGINEERING JOURNEY?

Slide 72

Slide 72 text

JOIN THE CHAOS SLACK
GREMLIN.COM/CHAOS

Slide 73

Slide 73 text

LEARN WITH THE GREMLIN COMMUNITY
GREMLIN.COM/COMMUNITY

Slide 74

Slide 74 text

THANK YOU
DEVOPS @ NIKE DAY
@tammybütow
#CHAOSENGINEERING