The Road To Resilience: Chaos Engineering, Disaster Recovery & GameDays

Slide 1

Slide 1 text

The Road To Resilience: Chaos Engineering, Disaster Recovery & GameDays Tammy Bryant Butow, SRE @ Gremlin @tambryantbutow

Slide 2

Slide 2 text

Get Certified
Demonstrate your Chaos Engineering expertise, increase your visibility, and advance your career with a Gremlin Chaos Engineering Practitioner certification.
gremlin.com/certification

Slide 3

Slide 3 text

Today’s News Headlines

Slide 4

Slide 4 text

Gene Kim, Jez Humble, Nicole Forsgren

Slide 5

Slide 5 text

Today we will explore three key practices that improve tempo and stability.

Slide 6

Slide 6 text

What are tempo and stability?
Tempo is measured by deployment frequency and change lead time.
Stability is measured by mean time to recover (MTTR) and change failure rate.

Slide 7

Slide 7 text

What are tempo and stability?
Tempo: Deployment Frequency. The rate at which software is deployed to production or an app store (e.g. within a range of multiple times a day to once a year).
Tempo: Change Lead Time. The time it takes to go from a customer making a request to the request being satisfied.
Stability: Mean Time To Recover (MTTR). The mean time it takes a company to recover from downtime of their software.
Stability: Change Failure Rate. The likelihood that a change introduces a defect (e.g. ⅕).
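
As a rough illustration of the four measures defined on this slide, here is a minimal Python sketch. The records, dates, and function names are all hypothetical and invented for this example; real numbers would come from your own deployment and incident tooling.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical records, invented purely for illustration.
# Each deployment: (deployed_at, caused_a_failure)
deployments = [
    (datetime(2021, 3, 1, 10, 0), False),
    (datetime(2021, 3, 2, 15, 0), True),
    (datetime(2021, 3, 4, 9, 0), False),
    (datetime(2021, 3, 5, 11, 0), False),
    (datetime(2021, 3, 8, 14, 0), False),
]
# Each customer request: (requested_at, satisfied_at)
requests = [
    (datetime(2021, 3, 1, 9, 0), datetime(2021, 3, 3, 9, 0)),
    (datetime(2021, 3, 2, 9, 0), datetime(2021, 3, 6, 9, 0)),
]
# Each incident: (started_at, recovered_at)
incidents = [
    (datetime(2021, 3, 2, 15, 10), datetime(2021, 3, 2, 15, 40)),
    (datetime(2021, 3, 6, 1, 0), datetime(2021, 3, 6, 2, 30)),
]


def mean_duration(pairs):
    """Mean of (end - start) over a list of (start, end) pairs."""
    seconds = mean((end - start).total_seconds() for start, end in pairs)
    return timedelta(seconds=seconds)


def deployment_frequency(deploys, period_days):
    """Tempo: deployments per day over the observed period."""
    return len(deploys) / period_days


def change_lead_time(reqs):
    """Tempo: mean time from a request being made to it being satisfied."""
    return mean_duration(reqs)


def mean_time_to_recover(incs):
    """Stability: mean time from the start of downtime to recovery."""
    return mean_duration(incs)


def change_failure_rate(deploys):
    """Stability: fraction of deployments that caused a failure."""
    return sum(1 for _, failed in deploys if failed) / len(deploys)


print("deploys/day:", deployment_frequency(deployments, period_days=7))
print("lead time:  ", change_lead_time(requests))
print("MTTR:       ", mean_time_to_recover(incidents))
print("CFR:        ", change_failure_rate(deployments))
```

With this sample data, one failed deployment out of five gives the ⅕ change failure rate used as the example on the slide.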

Slide 8

Slide 8 text

What are the three key practices?

Slide 9

Slide 9 text

What are the three key practices?
1. Chaos Engineering
2. GameDays
3. Disaster Recovery

Slide 10

Slide 10 text

“Build systems that are designed to be deployed easily, can detect and tolerate failures, and can have various components of the system updated independently.” - Accelerate

Slide 11

Slide 11 text

What is the best way to know if your system can detect and tolerate failure?

Slide 12

Slide 12 text

Chaos Engineering (Let’s see a dependency demo!)

Slide 13

Slide 13 text

The architecture of our banking application
https://github.com/GoogleCloudPlatform/bank-of-anthos

Slide 14

Slide 14 text

Does blackholing a critical path service like the Balance Reader result in graceful degradation of the customer experience?
https://4503-f37e5de5-39bf-4406-acbc-9c7f2abb0d16.cs-us-east1-wzxb.cloudshell.dev/home

Slide 15

Slide 15 text

Does blackholing a critical path service like the Balance Reader result in graceful degradation of the customer experience?
https://app.gremlin.com/attacks/new/kubernetes

Slide 16

Slide 16 text

The balance appears as $---. This could make the user think they have no money in their account.

Slide 17

Slide 17 text

The user is still able to make a deposit of $1000 while the Balance Reader service is in a blackhole.

Slide 18

Slide 18 text

The user is unable to send payments. They will see an error that the payment failed due to Balance Reader.

Slide 19

Slide 19 text

The user is unable to send payments. They will see an error that the payment failed due to Balance Reader.

Slide 20

Slide 20 text


Slide 21

Slide 21 text

Free demo environment to learn about black holes
1. Use this link to install with minikube on Google Cloud Shell: https://ssh.cloud.google.com/cloudshell/editor?show=ide&cloudshell_git_repo=https://github.com/GoogleCloudPlatform/bank-of-anthos&cloudshell_workspace=.&cloudshell_tutorial=extras/cloudshell/tutorial.md
2. Click minikube → start.
3. In the Cloud Shell terminal, run: kubectl apply -f bank-of-anthos/extras/jwt/jwt-secret.yaml
4. Click <> Cloud Code → Run on Kubernetes, and change to port 4503.
5. To create black holes, create a namespace for Gremlin and install Gremlin as a Helm chart: https://github.com/gremlin/helm

Slide 22

Slide 22 text

Now let’s see how a blackhole impacts our Transaction History service
https://github.com/GoogleCloudPlatform/bank-of-anthos

Slide 23

Slide 23 text

Does blackholing Transaction History result in graceful degradation of the customer experience?
https://4503-f37e5de5-39bf-4406-acbc-9c7f2abb0d16.cs-us-east1-wzxb.cloudshell.dev/home

Slide 24

Slide 24 text

We will get an error message: “Error: Could Not Load Transactions”
https://4503-f37e5de5-39bf-4406-acbc-9c7f2abb0d16.cs-us-east1-wzxb.cloudshell.dev/home

Slide 25

Slide 25 text

What can we do to mitigate against a blackhole?
Depending on the service, scaling replicas may work well:
kubectl scale deployment transactionhistory --replicas=2

Slide 26

Slide 26 text

Now we have 2 Transaction History pods:
kubectl get pods

Slide 27

Slide 27 text

Let’s send 50% of Transaction History pods into a blackhole
https://app.gremlin.com/attacks/new/kubernetes

Slide 28

Slide 28 text

There will be a very short outage and then the other pod will take over.
[Diagram: Deployment Set: Transaction History, replicas=2, containing Pod 1: Transaction History and Pod 2: Transaction History]
https://github.com/GoogleCloudPlatform/bank-of-anthos
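
The arithmetic behind this mitigation can be sketched in a few lines of Python. The function name is hypothetical, and the rounding rule (a fractional pod count rounds up) is an assumption made for illustration; the actual tool may select targets differently.

```python
import math
from typing import Tuple


def pods_after_blackhole(replicas: int, percent_targeted: float) -> Tuple[int, int]:
    """Return (blackholed, reachable) pod counts for one deployment.

    Assumption: a fractional pod count rounds up. This is an illustrative
    rule, not a statement about how any specific tool picks its targets.
    """
    blackholed = min(replicas, math.ceil(replicas * percent_targeted / 100))
    return blackholed, replicas - blackholed


# One replica, fully blackholed: nothing is left to serve requests.
print(pods_after_blackhole(replicas=1, percent_targeted=100))

# Two replicas, 50% blackholed: one pod remains reachable and can take over.
print(pods_after_blackhole(replicas=2, percent_targeted=50))
```

As long as the second number stays above zero, there is still a pod that can serve Transaction History requests, which is why scaling to two replicas helps here.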

Slide 29

Slide 29 text

We are still able to see transaction history and no longer receive error messages.
https://4503-f37e5de5-39bf-4406-acbc-9c7f2abb0d16.cs-us-east1-wzxb.cloudshell.dev/home

Slide 30

Slide 30 text

GoogleCloudPlatform/bank-of-anthos

Slide 31

Slide 31 text

Key Practice #2 - GameDays

Slide 32

Slide 32 text

How can GameDays improve tempo and stability?
Accelerate shares that GameDays are a great way to build relationships within an organization. This is a cultural must-do to become a high-performing organization.

Slide 33

Slide 33 text

What is a GameDay and how do you run one?
gremlin.com/gameday

Slide 34

Slide 34 text

What is an example GameDay?
Invite 4+ people to attend (2+ teams). Spike load (Gatling) and introduce failure (Gremlin). Minimum time required = 10 minutes.

Slide 35

Slide 35 text

What are more example GameDays?
● Dependency Testing (Flex)
● Capacity Plan Testing
● Autoscaling Testing

Slide 36

Slide 36 text

How can Disaster Recovery improve tempo and stability?
Outages are simulated or actually created according to a pre-prepared plan, and teams must work together to maintain and restore service levels.

Slide 37

Slide 37 text

How can Disaster Recovery improve tempo and stability?
“For DiRT-style events to be successful, an organization first needs to accept system and process failures as a means of learning… we design tests that require engineers from several groups who might not normally work together to interact with each other. That way, should a real large-scale disaster ever strike, these people will already have strong working relationships.” - Kripa Krishnan, Director of Cloud Operations @ Google

Slide 38

Slide 38 text

How do you perform DR with Gremlin?
gremlin.com/community/tutorials/testing-disaster-recovery-with-chaos-engineering/

Slide 39

Slide 39 text

What is an example DiRT?
Invite 4+ people to attend. Fail over 5 core services using the Gremlin blackhole (safer than a shutdown and faster to recover). Minimum time required = 5 minutes.

Slide 40

Slide 40 text

Can you automate this?
Yes, by using the Gremlin API or integrating this into your CI/CD pipelines.
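
One possible shape for such an automated check is sketched below in Python. Everything in it is hypothetical: `run_chaos_check` and its parameters are names invented for this example, and the attack and probe steps are passed in as plain callables instead of real Gremlin API calls, which are deliberately left out. In a pipeline, `start_attack` and `halt_attack` would wrap whatever your fault-injection tool provides.

```python
from typing import Callable


def run_chaos_check(
    start_attack: Callable[[], None],
    halt_attack: Callable[[], None],
    probe: Callable[[], bool],
    probes: int = 10,
    max_failure_ratio: float = 0.1,
) -> bool:
    """Run an attack, probe the service, and report pass/fail.

    Hypothetical helper for illustration. Returns True when the share of
    failed probes stays at or below max_failure_ratio.
    """
    if probes <= 0:
        raise ValueError("probes must be positive")
    failures = 0
    start_attack()
    try:
        for _ in range(probes):
            if not probe():
                failures += 1
    finally:
        # Always halt the attack, even if a probe raises an exception.
        halt_attack()
    return failures / probes <= max_failure_ratio


# Stubbed demo: nothing here talks to a real cluster or fault-injection tool.
events = []
probe_results = iter([True] * 9 + [False])
passed = run_chaos_check(
    start_attack=lambda: events.append("start"),
    halt_attack=lambda: events.append("halt"),
    probe=lambda: next(probe_results),
)
print(passed, events)
```

A CI/CD job could fail the build when the function returns False. The `finally` clause is the important design choice: the attack is halted whatever happens during probing.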

Slide 41

Slide 41 text

Slide 42

Slide 42 text

State of Chaos Engineering Report
“Top-performing Chaos Engineering teams boast four nines of availability (52 min of downtime a year) with an MTTR of less than one hour.” - Kolton Andrus
gremlin.com/state-of-chaos-engineering/2021
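
The "52 min of downtime a year" figure follows directly from the definition of four nines (99.99% availability). A small Python sketch of the arithmetic, assuming a 365-day year; the function name is made up for this example:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines.

    Two nines is 99%, three nines is 99.9%, four nines is 99.99%, and so on,
    so the unavailable fraction of the year is 1 / 10**nines.
    """
    return MINUTES_PER_YEAR / 10 ** nines


for n in (2, 3, 4, 5):
    print(f"{n} nines: {downtime_minutes_per_year(n):.2f} minutes/year")
```

Four nines works out to about 52.56 minutes per year, which the quote rounds to 52.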

Slide 43

Slide 43 text

Thank you! @tambryantbutow