The Road To Resilience: Chaos Engineering, Disaster Recovery & GameDays

Tammy Bryant Butow, SRE @ Gremlin @tambryantbutow

Get Certified
Demonstrate your Chaos Engineering expertise, increase your visibility, and advance your career with a Gremlin Chaos Engineering Practitioner certification. gremlin.com/certification

Today’s News Headlines

Gene Kim, Jez Humble, Nicole Forsgren

Today we will explore three key practices that improve tempo and stability.

What is tempo and stability?
Tempo is measured by deployment frequency and change lead time. Stability is measured by mean time to recover (MTTR) and change failure rate.

Tempo: Deployment Frequency. The rate at which software is deployed to production or an app store (e.g. anywhere from multiple times a day to once a year).
Tempo: Change Lead Time. The time it takes to go from a customer making a request to the request being satisfied.
Stability: Mean Time To Recover (MTTR). The mean time it takes a company to recover from downtime of their software.
Stability: Change Failure Rate. The likelihood that a change introduces a defect (e.g. ⅕).
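As a rough illustration of the four definitions above, the metrics can be computed from simple event records. This is a minimal sketch in Python: the sample records, the five-day window, and every function name are invented for illustration and do not come from the talk or from Accelerate.

```python
from datetime import datetime
from statistics import mean

# Hypothetical sample data covering a five-day window.
# Each deployment is (deployed_at, caused_failure).
deployments = [
    (datetime(2021, 3, 1, 10, 0), False),
    (datetime(2021, 3, 1, 15, 0), True),
    (datetime(2021, 3, 2, 11, 0), False),
    (datetime(2021, 3, 3, 9, 0), False),
    (datetime(2021, 3, 4, 16, 0), False),
]
# Each incident is (downtime_started, service_recovered).
incidents = [
    (datetime(2021, 3, 1, 15, 5), datetime(2021, 3, 1, 15, 45)),
    (datetime(2021, 3, 3, 2, 0), datetime(2021, 3, 3, 2, 20)),
]
# Each request is (customer_asked, request_satisfied).
requests = [
    (datetime(2021, 3, 1, 9, 0), datetime(2021, 3, 4, 9, 0)),
    (datetime(2021, 3, 2, 9, 0), datetime(2021, 3, 3, 9, 0)),
]


def deployment_frequency(deployments, period_days):
    """Tempo: deployments per day over the observed period."""
    return len(deployments) / period_days


def change_lead_time_days(requests):
    """Tempo: mean days from customer request to request satisfied."""
    return mean((done - asked).total_seconds() / 86400 for asked, done in requests)


def mttr_minutes(incidents):
    """Stability: mean minutes from start of downtime to recovery."""
    return mean((end - start).total_seconds() / 60 for start, end in incidents)


def change_failure_rate(deployments):
    """Stability: fraction of deployments that caused a failure."""
    failures = sum(1 for _, failed in deployments if failed)
    return failures / len(deployments)


print(deployment_frequency(deployments, period_days=5))  # deployments per day
print(change_lead_time_days(requests))                   # days
print(mttr_minutes(incidents))                           # minutes
print(change_failure_rate(deployments))                  # 1 failure in 5 changes
```

With this sample data, one failed change out of five gives the same ⅕ change failure rate used as the example in the definition.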

What are the three key practices?
1. Chaos Engineering
2. GameDays
3. Disaster Recovery

“Build systems that are designed to be deployed easily, can detect and tolerate failures, and can have various components of the system updated independently.” - Accelerate

What is the best way to know if your system can detect and tolerate failure?

Chaos Engineering (Let’s see a dependency demo!)

The architecture of our banking application: https://github.com/GoogleCloudPlatform/bank-of-anthos

Does blackholing a critical path service like the Balance Reader result in graceful degradation of the customer experience?
Demo application: https://4503-f37e5de5-39bf-4406-acbc-9c7f2abb0d16.cs-us-east1-wzxb.cloudshell.dev/home
Gremlin Kubernetes attack: https://app.gremlin.com/attacks/new/kubernetes

The balance appears as $---. This could make the user think they have no money in their account.

The user is still able to make a deposit of $1000 while the Balance Reader service is in a blackhole.

The user is unable to send payments. They will see an error that the payment failed due to Balance Reader.
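The "$---" balance is a case where the page stays up but the message can mislead the user. One common way to degrade more gracefully is to catch the dependency failure and show an explicit "unavailable" message. The sketch below is only an illustration of that idea in Python: it is not how Bank of Anthos is implemented, and `DependencyUnavailable`, `render_balance`, and both reader functions are hypothetical names.

```python
class DependencyUnavailable(Exception):
    """Hypothetical error raised when a downstream service cannot be reached."""


def render_balance(fetch_balance):
    """Return the balance text for the page, degrading gracefully on failure.

    fetch_balance is any callable that returns the balance in cents or
    raises DependencyUnavailable (for example after a request timeout).
    """
    try:
        cents = fetch_balance()
    except DependencyUnavailable:
        # Say clearly that the data is missing instead of showing "$---",
        # which could be read as an empty account.
        return "Balance temporarily unavailable"
    return f"${cents / 100:,.2f}"


def healthy_balance_reader():
    """Stand-in for a reachable Balance Reader (hypothetical)."""
    return 123456


def blackholed_balance_reader():
    """Stand-in for a Balance Reader whose traffic is being dropped."""
    raise DependencyUnavailable("balance reader timed out")


print(render_balance(healthy_balance_reader))
print(render_balance(blackholed_balance_reader))
```

The same pattern applies to the payment error: the failure is still reported, but in terms the user can act on.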

Free demo environment to learn about black holes
1. Use this link to install with minikube on Google Cloud Shell: https://ssh.cloud.google.com/cloudshell/editor?show=ide&cloudshell_git_repo=https://github.com/GoogleCloudPlatform/bank-of-anthos&cloudshell_workspace=.&cloudshell_tutorial=extras/cloudshell/tutorial.md
2. Click minikube → start
3. In the Cloud Shell terminal, run kubectl apply -f bank-of-anthos/extras/jwt/jwt-secret.yaml
4. Click <> Cloud Code → Run on Kubernetes, change to port 4503
5. To create black holes, create a namespace for Gremlin and install Gremlin as a Helm chart: https://github.com/gremlin/helm

Now let’s see how a blackhole impacts our Transaction History service.

Does blackholing Transaction History result in graceful degradation of the customer experience?

We will get an error message: “Error: Could Not Load Transactions”

What can we do to mitigate a blackhole? Depending on the service, scaling replicas may work well:
kubectl scale deployment transactionhistory --replicas=2

kubectl get pods
Now we have 2 Transaction History pods.

Let’s send 50% of Transaction History pods into a blackhole: https://app.gremlin.com/attacks/new/kubernetes

There will be a very short outage and then the other pod will take over.
Deployment Set: Transaction History, replicas=2
Pod 1: Transaction History
Pod 2: Transaction History

We are still able to see transaction history and no longer receive error messages.
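A simplified way to reason about why the second replica helps is the toy model below, written in Python. It assumes that traffic is only routed to pods that are still reachable once the blackholed pod has dropped out of rotation, which is the "other pod will take over" behaviour described above. The pod names and the `serve` function are hypothetical, and this is not a model of Kubernetes internals.

```python
def serve(pods, request_id):
    """Route a request round-robin across the pods that are still reachable.

    pods maps a pod name to True (reachable) or False (blackholed).
    """
    healthy = [name for name, reachable in pods.items() if reachable]
    if not healthy:
        # No replica left to answer: the user sees the error from the demo.
        return "Error: Could Not Load Transactions"
    return f"transactions served by {healthy[request_id % len(healthy)]}"


# replicas=1 with the only pod in a blackhole.
one_replica = {"transactionhistory-1": False}

# replicas=2 with 50% of pods (one of two) in a blackhole.
two_replicas = {"transactionhistory-1": False, "transactionhistory-2": True}

print(serve(one_replica, request_id=0))
print(serve(two_replicas, request_id=0))
print(serve(two_replicas, request_id=1))
```

The model leaves out the short outage before the healthy pod takes over, which is exactly the kind of detail the real experiment surfaces.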

GoogleCloudPlatform/bank-of-anthos

Key Practice #2 - GameDays

How can GameDays improve tempo and stability?
Accelerate shares that GameDays are a great way to build relationships within an organization. This is a cultural must-do to become a high-performing organization.

What is a GameDay and how do you run one? gremlin.com/gameday

What is an example GameDay?
Invite 4+ people to attend (2+ teams). Spike load (Gatling) and introduce failure (Gremlin). Minimum time required = 10 minutes.

What are more example GameDays?
• Dependency Testing (Flex)
• Capacity Plan Testing
• Autoscaling Testing

How can Disaster Recovery improve tempo and stability?
Outages are simulated or actually created according to a pre-prepared plan, and teams must work together to maintain and restore service levels.

“For DiRT-style events to be successful, an organization first needs to accept system and process failures as a means of learning… we design tests that require engineers from several groups who might not normally work together to interact with each other. That way, should a real large-scale disaster ever strike, these people will already have strong working relationships” - Kripa Krishnan, Director of Cloud Operations @ Google

How do you perform DR with Gremlin? gremlin.com/community/tutorials/testing-disaster-recovery-with-chaos-engineering/

What is an example DiRT?
Invite 4+ people to attend. Fail over 5 core services using the Gremlin blackhole (safer than a shutdown and faster to recover). Minimum time required = 5 minutes.

Can you automate this?
Yes, by using the Gremlin API or by integrating this into your CI/CD pipelines.
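One possible shape for the CI/CD integration is a pipeline step that starts an experiment, probes the service, always halts the experiment, and fails the build if any probe failed. The Python sketch below deliberately does not call the Gremlin API: `chaos_gate` and its three callables are hypothetical placeholders, and a real pipeline would wire them to whatever client and health check it already uses.

```python
def chaos_gate(start_attack, stop_attack, check_health, probes=5):
    """Run one experiment and report whether every health probe passed.

    start_attack and stop_attack are callables that begin and halt the
    experiment; check_health returns True while the service is healthy.
    All three are placeholders supplied by the caller.
    """
    start_attack()
    try:
        results = [check_health() for _ in range(probes)]
    finally:
        # Always halt the experiment, even if a probe raises.
        stop_attack()
    return all(results)


# Fake callables standing in for a real attack client and health check.
events = []
passed = chaos_gate(
    start_attack=lambda: events.append("start"),
    stop_attack=lambda: events.append("stop"),
    check_health=lambda: True,
    probes=3,
)

# A CI job would typically turn the result into its exit status.
exit_code = 0 if passed else 1
print(events, passed, exit_code)
```

Passing the callables in keeps the gate independent of any one tool, so the same step can be exercised with fakes before it is pointed at a real environment.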

State of Chaos Engineering Report
“Top-performing Chaos Engineering teams boast four nines of availability (52 min of downtime a year) with an MTTR of less than one hour.” - Kolton Andrus
gremlin.com/state-of-chaos-engineering/2021
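The "52 min of downtime a year" figure follows directly from the availability percentage. A quick check in Python, assuming a 365-day year (the function name is just for illustration):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability):
    """Minutes of downtime per year permitted at a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


four_nines = allowed_downtime_minutes(0.9999)   # 99.99% availability
three_nines = allowed_downtime_minutes(0.999)   # 99.9%, for comparison

print(round(four_nines, 2))   # about 52.56 minutes a year
print(round(three_nines, 1))  # about 525.6 minutes a year
```

Each additional nine cuts the yearly downtime budget by a factor of ten.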

Thank you! @tambryantbutow
