The Road To Resilience: Chaos Engineering, Disaster Recovery & GameDays

Tammy Bryant Butow

September 15, 2021

Transcript

  1. The Road To Resilience: Chaos Engineering, Disaster Recovery & GameDays

    Tammy Bryant Butow, SRE @ Gremlin @tambryantbutow
  2. Get Certified

    Demonstrate your Chaos Engineering expertise, increase your visibility, and advance your career with a Gremlin Chaos Engineering Practitioner certification. gremlin.com/certification @tambryantbutow
  3. What is tempo and stability? @tambryantbutow

    Tempo is measured by deployment frequency and change lead time. Stability is measured by mean time to recover (MTTR) and change failure rate.
  4. What is tempo and stability? @tambryantbutow

    Tempo, Deployment Frequency: the rate at which software is deployed to production or an app store (e.g. anywhere from multiple times a day to once a year). Tempo, Change Lead Time: the time it takes to go from a customer making a request to the request being satisfied. Stability, Mean Time To Recover (MTTR): the mean time it takes a company to recover from downtime of their software. Stability, Change Failure Rate: the likelihood that a change introduces a defect (e.g. ⅕).
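As a rough illustration of the four metrics on this slide, the sketch below computes them from a small, made-up deployment log. The `Deployment` record, its field names, and the sample data are assumptions for illustration only; they do not come from the talk or from Accelerate.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Deployment:
    requested_at: datetime  # when the change was requested (hypothetical field)
    deployed_at: datetime   # when the change reached production
    caused_failure: bool    # whether the change introduced a defect


def deployment_frequency(deploys, period_days):
    """Tempo: deployments per day over the observed period."""
    return len(deploys) / period_days


def change_lead_time(deploys):
    """Tempo: mean time from request to the change running in production."""
    seconds = mean((d.deployed_at - d.requested_at).total_seconds() for d in deploys)
    return timedelta(seconds=seconds)


def change_failure_rate(deploys):
    """Stability: fraction of changes that introduced a defect."""
    return sum(d.caused_failure for d in deploys) / len(deploys)


def mttr(outages):
    """Stability: mean duration of outages given as (start, end) pairs."""
    seconds = mean((end - start).total_seconds() for start, end in outages)
    return timedelta(seconds=seconds)


# Made-up sample: 5 deployments in 10 days, each with a 2-day lead time,
# and exactly one of them causing a failure.
start = datetime(2021, 9, 1)
deploys = [
    Deployment(start + timedelta(days=2 * i),
               start + timedelta(days=2 * i + 2),
               caused_failure=(i == 0))
    for i in range(5)
]
# Two made-up outages lasting 30 and 90 minutes.
outages = [
    (start, start + timedelta(minutes=30)),
    (start + timedelta(days=5), start + timedelta(days=5, minutes=90)),
]

print(deployment_frequency(deploys, period_days=10))
print(change_lead_time(deploys))
print(change_failure_rate(deploys))
print(mttr(outages))
```

With this sample, one failing change out of five gives the ⅕ change failure rate used as the example on the slide.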
  5. What are the three key practices? @tambryantbutow

    1. Chaos Engineering 2. GameDays 3. Disaster Recovery
  6. “Build systems that are designed to be deployed easily, can

    detect and tolerate failures, and can have various components of the system updated independently.” - Accelerate @tambryantbutow
  7. @tambryantbutow What is the best way to know if your

    system can detect and tolerate failure?
  8. Does blackholing a critical path service like the Balance Reader

    result in graceful degradation of the customer experience? https://4503-f37e5de5-39bf-4406-acbc-9c7f2abb0d16.cs-us-east1-wzxb.cloudshell.dev/home @tambryantbutow
  9. Does blackholing a critical path service like the Balance Reader result in graceful degradation of the customer experience? @tambryantbutow

    https://app.gremlin.com/attacks/new/kubernetes
  10. The balance appears as $---.

    This could make the user think they have no money in their account. @tambryantbutow
  11. The user is still able to make a deposit of

    $1000 while the Balance Reader service is in a blackhole. @tambryantbutow
  12. The user is unable to send payments. They will see

    an error that the payment failed due to Balance Reader. @tambryantbutow
  13. The user is unable to send payments. They will see

    an error that the payment failed due to Balance Reader. @tambryantbutow
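One way to think about graceful degradation in the blackhole experiment above is sketched below. It is a minimal toy model, not code from Bank of Anthos: `ServiceUnavailable`, `render_balance`, `send_payment`, and the message strings are all hypothetical names chosen for illustration. The idea is that when the balance dependency is unreachable, the UI says so explicitly instead of showing a placeholder the user could misread.

```python
class ServiceUnavailable(Exception):
    """Raised by the hypothetical client when a dependency does not respond."""


def render_balance(fetch_balance):
    """Return the text shown in the balance widget."""
    try:
        return f"${fetch_balance():,.2f}"
    except ServiceUnavailable:
        # An explicit message avoids implying that the account is empty.
        return "Balance temporarily unavailable"


def send_payment(amount, fetch_balance):
    """Refuse the payment with a clear reason if the balance cannot be read."""
    try:
        balance = fetch_balance()
    except ServiceUnavailable:
        return "Payment not sent: balance could not be verified. Please try again shortly."
    if amount > balance:
        return "Payment not sent: insufficient funds."
    return f"Payment of ${amount:,.2f} sent."


def blackholed_balance_reader():
    """Stand-in for a balance service whose traffic is being dropped."""
    raise ServiceUnavailable("balance reader timed out")


print(render_balance(blackholed_balance_reader))
print(send_payment(50, blackholed_balance_reader))
print(send_payment(50, lambda: 1000))
```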
  14. Free demo environment to learn about black holes @tambryantbutow

    1. Use this link to install with minikube on Google Cloud Shell: https://ssh.cloud.google.com/cloudshell/editor?show=ide&cloudshell_git_repo=https://github.com/GoogleCloudPlatform/bank-of-anthos&cloudshell_workspace=.&cloudshell_tutorial=extras/cloudshell/tutorial.md
    2. Click minikube → start.
    3. In the Cloud Shell terminal, run kubectl apply -f bank-of-anthos/extras/jwt/jwt-secret.yaml
    4. Click <> Cloud Code → Run on Kubernetes, and change to port 4503.
    5. To create black holes, create a namespace for Gremlin and install Gremlin as a Helm chart: https://github.com/gremlin/helm
  15. Does blackholing transaction history result in graceful degradation of the

    customer experience? https://4503-f37e5de5-39bf-4406-acbc-9c7f2abb0d16.cs-us-east1-wzxb.cloudshell.dev/home @tambryantbutow
  16. What can we do to mitigate against a blackhole? @tambryantbutow

    Depending on the service, scaling replicas may work well: kubectl scale deployment transactionhistory --replicas=2
  17. There will be a very short outage and then the other pod will take over. @tambryantbutow

    Deployment Set: Transaction History, replicas=2. Pod 1: Transaction History. Pod 2: Transaction History. https://github.com/GoogleCloudPlatform/bank-of-anthos
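The effect of a second replica can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the `Replica` class and `fetch_history` function are hypothetical, and a plain loop stands in for whatever routing the cluster actually performs. It shows why one blackholed pod is an outage with a single replica but only a brief delay with two.

```python
class Replica:
    """Toy stand-in for one Transaction History pod."""

    def __init__(self, name, blackholed=False):
        self.name = name
        self.blackholed = blackholed

    def get_history(self):
        if self.blackholed:
            # A blackholed pod never answers, so the caller times out.
            raise TimeoutError(f"{self.name} did not respond")
        return [f"served by {self.name}"]


def fetch_history(replicas):
    """Try each replica in turn and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica.get_history()
        except TimeoutError as exc:
            errors.append(str(exc))
    raise RuntimeError("all replicas unavailable: " + "; ".join(errors))


# replicas=1: the only pod is blackholed, so the request fails.
try:
    fetch_history([Replica("transactionhistory-1", blackholed=True)])
except RuntimeError as exc:
    print(exc)

# replicas=2: the first pod times out and the second one answers.
print(fetch_history([
    Replica("transactionhistory-1", blackholed=True),
    Replica("transactionhistory-2"),
]))
```

The timeout on the first replica corresponds to the very short outage mentioned on the slide.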
  18. We are still able to see transaction history and no

    longer receive error messages. https://4503-f37e5de5-39bf-4406-acbc-9c7f2abb0d16.cs-us-east1-wzxb.cloudshell.dev/home @tambryantbutow
  19. How can GameDays improve tempo and stability? @tambryantbutow

    Accelerate notes that GameDays are a great way to build relationships within an organization. This is a cultural must-do to become a high-performing organization.
  20. What is an example GameDay? @tambryantbutow

    Invite 4+ people to attend (2+ teams). Spike load (Gatling) and introduce failure (Gremlin). Minimum time required = 10 minutes.
  21. What are more example GameDays? @tambryantbutow

    • Dependency Testing (Flex) • Capacity Plan Testing • Autoscaling Testing
  22. How can Disaster Recovery improve tempo and stability? @tambryantbutow

    Outages are simulated or actually created according to a pre-prepared plan, and teams must work together to maintain and restore service levels.
  23. How can Disaster Recovery improve tempo and stability? @tambryantbutow

    “For DiRT-style events to be successful, an organization first needs to accept system and process failures as a means of learning… we design tests that require engineers from several groups who might not normally work together to interact with each other. That way, should a real large-scale disaster ever strike, these people will already have strong working relationships” - Kripa Krishnan, Director of Cloud Operations @ Google
  24. What is an example DiRT? @tambryantbutow

    Invite 4+ people to attend. Fail over 5 core services using the Gremlin blackhole (safer than a shutdown and faster to recover). Minimum time required = 5 minutes.
  25. Can you automate this? @tambryantbutow

    Yes, by using the Gremlin API or by integrating this into your CI/CD pipelines.
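A pipeline integration could take roughly the following shape. This is a hedged sketch, not the Gremlin API: `run_chaos_gate` and the three callables it takes (`start_attack`, `halt_attack`, `is_healthy`) are hypothetical placeholders that a team would wire up to its own attack tooling and health checks.

```python
import time


def run_chaos_gate(start_attack, halt_attack, is_healthy, checks=3, interval_s=0.0):
    """Run an attack and report whether the system stayed healthy throughout.

    start_attack, halt_attack, and is_healthy are hypothetical callables
    supplied by the pipeline; none of them is a real Gremlin API call.
    """
    attack_id = start_attack()
    try:
        for _ in range(checks):
            if not is_healthy():
                return False  # fail the pipeline stage
            time.sleep(interval_s)
        return True
    finally:
        # Always halt the attack, even when a check fails or raises.
        halt_attack(attack_id)


# Fake wiring that shows the control flow without touching a real system.
halted = []
passed = run_chaos_gate(
    start_attack=lambda: "attack-1",
    halt_attack=halted.append,
    is_healthy=lambda: True,
)
print(passed, halted)
```

Putting the halt in a `finally` block means a failed or crashed check cannot leave the attack running.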
  26. Get Certified

    Demonstrate your Chaos Engineering expertise, increase your visibility, and advance your career with a Gremlin Chaos Engineering Practitioner certification. gremlin.com/certification @tambryantbutow
  27. State of Chaos Engineering Report “Top-performing Chaos Engineering teams boast

    four nines of availability (52 min of downtime a year) with an MTTR of less than one hour.” - Kolton Andrus gremlin.com/state-of-chaos-engineering/2021 @tambryantbutow
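The "52 min of downtime a year" figure follows from simple arithmetic, sketched below. The function name is an illustrative choice, and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines):
    """Yearly downtime budget for an availability of `nines` nines.

    Four nines means 99.99% availability, so 0.01% (1 / 10**4) of the
    year may be spent down.
    """
    return MINUTES_PER_YEAR / 10 ** nines


for n in (2, 3, 4, 5):
    print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes per year")
```

For four nines this gives 52.56 minutes, which the quote rounds to 52.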