Slide 1

Slide 1 text

The Road To Resilience: Chaos Engineering, Disaster Recovery & GameDays
Tammy Bryant Butow, SRE @ Gremlin
@tambryantbutow

Slide 2

Slide 2 text

Get Certified
Demonstrate your Chaos Engineering expertise, increase your visibility, and advance your career with a Gremlin Chaos Engineering Practitioner certification.

Slide 3

Slide 3 text

Today’s News Headlines

Slide 4

Slide 4 text

Gene Kim, Jez Humble, Nicole Forsgren

Slide 5

Slide 5 text

Today we will explore three key practices that improve tempo and stability.

Slide 6

Slide 6 text

What is tempo and stability?
Tempo is measured by deployment frequency and change lead time.
Stability is measured by mean time to recover (MTTR) and change failure rate.

Slide 7

Slide 7 text

What is tempo and stability?
Tempo: Deployment Frequency. The rate at which software is deployed to production or an app store (e.g. anywhere from multiple times a day to once a year).
Tempo: Change Lead Time. The time it takes to go from a customer making a request to the request being satisfied.
Stability: Mean Time To Recover (MTTR). The mean time it takes a company to recover from downtime of their software.
Stability: Change Failure Rate. The likelihood that a change is defective (e.g. ⅕).
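The four measures above can be computed from ordinary deployment and incident logs. The following is a minimal sketch, not a standard tool: the record layouts, the sample data, and the function names are all assumptions made for illustration, and change lead time follows the slide's definition (customer request to satisfied request).

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical deployment records: (requested_at, deployed_at, caused_failure)
deployments = [
    (datetime(2021, 3, 1, 9, 0), datetime(2021, 3, 1, 15, 0), False),
    (datetime(2021, 3, 2, 9, 0), datetime(2021, 3, 3, 9, 0), True),
    (datetime(2021, 3, 4, 9, 0), datetime(2021, 3, 4, 12, 0), False),
    (datetime(2021, 3, 5, 9, 0), datetime(2021, 3, 5, 18, 0), False),
    (datetime(2021, 3, 8, 9, 0), datetime(2021, 3, 8, 11, 0), False),
]

# Hypothetical incident records: (down_at, recovered_at)
incidents = [
    (datetime(2021, 3, 3, 10, 0), datetime(2021, 3, 3, 10, 30)),
    (datetime(2021, 3, 6, 22, 0), datetime(2021, 3, 6, 23, 30)),
]


def deployment_frequency(deploys, window_days):
    """Tempo: deployments per day over the reporting window."""
    return len(deploys) / window_days


def change_lead_time(deploys):
    """Tempo: mean time from request to the change being deployed."""
    return timedelta(
        seconds=mean((done - asked).total_seconds() for asked, done, _ in deploys)
    )


def mttr(incs):
    """Stability: mean time from going down to being recovered."""
    return timedelta(
        seconds=mean((up - down).total_seconds() for down, up in incs)
    )


def change_failure_rate(deploys):
    """Stability: fraction of deployments that caused a failure."""
    return sum(1 for _, _, failed in deploys if failed) / len(deploys)


print(deployment_frequency(deployments, window_days=10))
print(change_lead_time(deployments))
print(mttr(incidents))
print(change_failure_rate(deployments))
```

With this sample data one deployment in five caused a failure, which matches the ⅕ example on the slide.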

Slide 8

Slide 8 text

What are the three key practices?

Slide 9

Slide 9 text

What are the three key practices?
1. Chaos Engineering
2. GameDays
3. Disaster Recovery

Slide 10

Slide 10 text

“Build systems that are designed to be deployed easily, can detect and tolerate failures, and can have various components of the system updated independently.” - Accelerate

Slide 11

Slide 11 text

What is the best way to know if your system can detect and tolerate failure?

Slide 12

Slide 12 text

Chaos Engineering (Let’s see a dependency demo!)

Slide 13

Slide 13 text

The architecture of our banking application

Slide 14

Slide 14 text

Does blackholing a critical path service like the Balance Reader result in graceful degradation of the customer experience?

Slide 15

Slide 15 text

Does blackholing a critical path service like the Balance Reader result in graceful degradation of the customer experience?

Slide 16

Slide 16 text

The balance appears as $---. This could make the user think they have no money in their account.
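One way to make this failure mode less alarming is to show an explicit message when the balance cannot be loaded. The sketch below is an illustration only and is not taken from the demo application: `render_balance`, `BalanceUnavailable`, and the message wording are hypothetical names and choices.

```python
class BalanceUnavailable(Exception):
    """Raised when the (hypothetical) balance service cannot be reached."""


def render_balance(fetch_balance):
    """Return the text shown to the user in place of their balance.

    fetch_balance is a hypothetical callable that returns the balance in
    dollars, or raises BalanceUnavailable or TimeoutError when the
    dependency is blackholed.
    """
    try:
        balance = fetch_balance()
    except (BalanceUnavailable, TimeoutError):
        # Say plainly that the data is missing rather than showing a
        # placeholder that could be read as an empty account.
        return "Balance temporarily unavailable. Please try again shortly."
    return f"${balance:,.2f}"


def blackholed_fetch():
    """Stand-in for a balance lookup whose traffic is being dropped."""
    raise TimeoutError("no response from balance service")


print(render_balance(lambda: 1234.5))
print(render_balance(blackholed_fetch))
```

The design choice here is to treat "unknown" as a distinct state from "zero", so the degraded view cannot be mistaken for an empty account.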

Slide 17

Slide 17 text

The user is still able to make a deposit of $1000 while the Balance Reader service is in a blackhole.

Slide 18

Slide 18 text

The user is unable to send payments. They will see an error saying that the payment failed because of the Balance Reader.

Slide 19

Slide 19 text

The user is unable to send payments. They will see an error saying that the payment failed because of the Balance Reader.

Slide 20

Slide 20 text


Slide 21

Slide 21 text

Free demo environment to learn about black holes
1. Use this link to install with minikube on Google Cloud Shell: torial=extras/cloudshell/
2. Click minikube → start
3. In the Cloud Shell terminal, run: kubectl apply -f bank-of-anthos/extras/jwt/jwt-secret.yaml
4. Click <> Cloud Code → Run on Kubernetes, and change to port 4503
5. To create black holes, create a namespace for Gremlin and install Gremlin as a Helm chart

Slide 22

Slide 22 text

Now let’s see how a blackhole impacts our Transaction History service

Slide 23

Slide 23 text

Does blackholing transaction history result in graceful degradation of the customer experience?

Slide 24

Slide 24 text

We will get an error message: “Error: Could Not Load Transactions”

Slide 25

Slide 25 text

What can we do to mitigate against a blackhole?
Depending on the service, scaling replicas may work well:
kubectl scale deployment transactionhistory --replicas=2

Slide 26

Slide 26 text

kubectl get pods
Now we have 2 transaction history pods.

Slide 27

Slide 27 text

Let’s send 50% of transaction history pods into a blackhole

Slide 28

Slide 28 text

There will be a very short outage and then the other pod will take over.
Deployment Set: Transaction History (replicas=2)
Pod 1: Transaction History
Pod 2: Transaction History

Slide 29

Slide 29 text

We are still able to see transaction history and no longer receive error messages.
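The effect of the second replica can be illustrated with a toy model. This is a simplified sketch, not how Kubernetes routes traffic: the `Replica` class and `route` function are hypothetical, and a real cluster relies on Services and health checks rather than client-side retries.

```python
class Replica:
    """Toy stand-in for one transaction history pod."""

    def __init__(self, name):
        self.name = name
        self.blackholed = False

    def handle(self, request):
        if self.blackholed:
            # A blackholed pod never answers, which the caller sees as a timeout.
            raise TimeoutError(f"{self.name} dropped the request")
        return f"{self.name} served {request}"


def route(replicas, request):
    """Try each replica in turn and return the first successful response."""
    for replica in replicas:
        try:
            return replica.handle(request)
        except TimeoutError:
            continue
    raise RuntimeError("Error: Could Not Load Transactions")


pod_1, pod_2 = Replica("pod-1"), Replica("pod-2")
pod_1.blackholed = True  # send 50% of the pods into a blackhole

# With a single replica the request fails; with two, pod-2 takes over.
try:
    route([pod_1], "history")
except RuntimeError as err:
    print(err)
print(route([pod_1, pod_2], "history"))
```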

Slide 30

Slide 30 text

GoogleCloudPlatform/bank-of-anthos

Slide 31

Slide 31 text

Key Practice #2 - GameDays

Slide 32

Slide 32 text

How can GameDays improve tempo and stability?
Accelerate shares that GameDays are a great way to build relationships within an organization. This is a cultural must-do to become a high-performing organization.

Slide 33

Slide 33 text

What is a GameDay and how do you run one?

Slide 34

Slide 34 text

What is an example GameDay?
Invite 4+ people to attend (2+ teams).
Spike load (Gatling) and introduce failure (Gremlin).
Minimum time required = 10 minutes

Slide 35

Slide 35 text

What are more example GameDays?
● Dependency Testing (Flex)
● Capacity Plan Testing
● Autoscaling Testing

Slide 36

Slide 36 text

How can Disaster Recovery improve tempo and stability?
Outages are simulated or actually created according to a pre-prepared plan, and teams must work together to maintain and restore service levels.

Slide 37

Slide 37 text

How can Disaster Recovery improve tempo and stability?
“For DiRT-style events to be successful, an organization first needs to accept system and process failures as a means of learning… we design tests that require engineers from several groups who might not normally work together to interact with each other. That way, should a real large-scale disaster ever strike, these people will already have strong working relationships” - Kripa Krishnan, Director of Cloud Operations @ Google

Slide 38

Slide 38 text

How do you perform DR with Gremlin?

Slide 39

Slide 39 text

What is an example DiRT?
Invite 4+ people to attend.
Fail over 5 core services using the Gremlin blackhole (safer than a shutdown and faster to recover).
Minimum time required = 5 minutes

Slide 40

Slide 40 text

Can you automate this?
Yes, by using the Gremlin API or by integrating it into your CI/CD pipelines.
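As a rough idea of what an automated pipeline step might look like, the sketch below wraps an experiment in a pass/fail gate. It deliberately does not show the Gremlin API: `start_attack`, `stop_attack`, and `probe` are hypothetical callables that a team would implement against whatever tooling it uses, and `chaos_gate` is an illustrative name.

```python
def chaos_gate(start_attack, stop_attack, probe, checks=3):
    """Run one experiment and report whether the service stayed healthy.

    start_attack / stop_attack: hypothetical callables that begin and halt
    the failure injection (for example, a blackhole on one service).
    probe: hypothetical callable returning True while the customer-facing
    check still passes.
    """
    start_attack()
    try:
        results = [probe() for _ in range(checks)]
    finally:
        # Always halt the attack, even if a probe raises.
        stop_attack()
    return all(results)


# Example wiring with stand-in callables that only record what happened.
events = []
passed = chaos_gate(
    start_attack=lambda: events.append("attack started"),
    stop_attack=lambda: events.append("attack halted"),
    probe=lambda: True,
)
print(passed, events)
```

A CI/CD job could fail the build when the gate returns False, which turns the manual experiment into a repeatable regression check.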

Slide 41

Slide 41 text

Get Certified
Demonstrate your Chaos Engineering expertise, increase your visibility, and advance your career with a Gremlin Chaos Engineering Practitioner certification.

Slide 42

Slide 42 text

State of Chaos Engineering Report
“Top-performing Chaos Engineering teams boast four nines of availability (52 min of downtime a year) with an MTTR of less than one hour.” - Kolton Andrus
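The "52 min of downtime a year" figure follows directly from the availability target. A small worked calculation, assuming a 365-day year (the function name is an illustrative choice):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability):
    """Minutes of allowed downtime per year for a given availability fraction."""
    return (1 - availability) * MINUTES_PER_YEAR


four_nines = downtime_minutes_per_year(0.9999)
print(round(four_nines, 2))
```

Four nines (99.99%) leaves about 52.56 minutes of downtime per year, so a single incident at the quoted MTTR of nearly one hour would already consume the whole annual budget.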

Slide 43

Slide 43 text

Thank you! @tambryantbutow