[GOTOBER] GameDays: Practice Thoughtful Chaos Engineering

Ho Ming Li
November 02, 2018

A GameDay is a dedicated time to intentionally create failure scenarios in a safe environment. Regularly running GameDays is an effective Chaos Engineering practice to test the resiliency of your services, validate technical intricacies, and surface conversations around observability and incident management. GameDays can also expose blind spots that appear when systems are operating under suboptimal conditions. In this talk, Ho Ming shares what it takes to run successful GameDays.

Transcript

  1. Ho Ming Li (@HoReaL): Lead Solutions Architect @ Gremlin. Ex-AWS, NetApp, IBM. Ran 10+ GameDays with 5+ companies. A retired gamer.
  2. GameDay: Dedicated time for teams to collaboratively focus on using Chaos Engineering practices to reveal weaknesses in your systems.
  3. “Computers aren’t the thing. They’re the thing that gets us to the thing.” - Halt and Catch Fire. Chaos Engineering isn’t the thing. It’s the thing that gets us to Resilience.
  4. Why resilience matters - Motivations. Business Case: Avoid very costly downtime. Engineering Case: Improve quality and increase agility. On-Call Case: Avoid constant fire fighting and pager fatigue.
  5. Let’s do… Active-Active Multi-Region AI-Enabled Detection with Auto-magic Recovery System Using Request Tokens on Globally Decentralized Blockchain. But… Can you withstand a critical host going down right now?
  6. Chaos Party. CTO / VP Engineering: Budget / Objectives. Organizer: Invitations / Coordination. Engineering Director / Manager: Prioritization / Engineer Availability. Engineers / Subject Matter Expert: Architecture / Experiments. New Hires / Interns: Learning / New Perspectives. Other Stakeholders / Observers: Understand Impact on their Functions.
  7. Anatomy of a GameDay: A GameDay contains Experiments (#1, #2, #3, ...). Each Experiment contains one or more Attacks (Inject Failure).
  8. World Cup 2018: Guess which team won? If they play again, do you know which team will win? (Three matchups pictured: vs, vs, vs.)
  9. World Cup 2018: Guess which team won? If they play again, do you know which team will win? Scores: 1(3) - 1(4), 0 - 2, 0 - 1.
  10. Scenarios and Attacks. Bad Hosts Going Offline: Terminate Hosts. Service is Slow: Inject Latency. Workload runs Hot: Consume CPU. Logs not Rotating: Fill up Disk.
  11. Observability: Get rid of the Fog of War so you can clearly see the map and strategize accordingly. Gain Deep Insight with: Metrics, Logging, Request Tracing.
  12. Level Up. Manual Runs: Exploratory, GameDays. Automated Runs: Scheduled Runs, Include in Pipeline.
  13. GameDay #1 .. N: GameDay is not just a one-time event. Think about the next GameDay. Track and Measure Success over time.
  14. Chaos All the Things. A very loose breakdown of “an application”: “Edge”: DNS, CDN, Cloudflare. “Front End”: LB, API, Nginx. “Back End”: MySQL, Kafka, ES. “Infrastructure”: Kubernetes, Container, Virtual Machine, Physical Server, Data Center. "AWS Lambda protects you against some infrastructure failures, but you still need to defend against weakness in your own code." -- Yan Cui, Principal Engineer at DAZN
  15. Magnified Wait: App Server to Database slowed. Attack: Inject a small amount of latency between services. Hypothesis: The slowness users experience roughly equals the injected delay.
  16. I can see this, but I can’t see that: Requiring Database to process messages from Queue. Attack: Consumer cannot connect to Database. Hypothesis: Consumer can no longer process messages.
  17. Loosely coupled... … or Not: Orchestrating Containers in Microservices. Attack: Container dies. Hypothesis: Orchestrator will spawn new container.
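
The last three slides each pair an attack with a hypothesis. As a rough illustration of how such a pairing could be written down and checked, here is a minimal Python sketch built around the "Magnified Wait" slide. The slides do not say how that experiment turned out, so everything here is an assumption made for illustration: the `Experiment` record, the helper names, the sequential-call latency model, the 25% tolerance, and the numbers are not from the talk and not from any chaos engineering tool.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    """One GameDay experiment: an attack plus the hypothesis it tests (hypothetical record)."""
    name: str
    attack: str
    hypothesis: str


def request_latency_ms(base_ms: float, db_calls: int, injected_ms: float) -> float:
    """Latency of one user request that makes `db_calls` database calls.

    Assumption (not from the slides): the calls run one after another,
    so the injected delay is paid once per call.
    """
    return base_ms + db_calls * injected_ms


def added_delay_matches_injection(base_ms: float, db_calls: int,
                                  injected_ms: float, tolerance: float = 0.25) -> bool:
    """True if the slowdown a user sees is within `tolerance` of the injected delay."""
    added = request_latency_ms(base_ms, db_calls, injected_ms) - base_ms
    return abs(added - injected_ms) <= tolerance * injected_ms


magnified_wait = Experiment(
    name="Magnified Wait",
    attack="Inject small amount of latency between services",
    hypothesis="User-facing slowness roughly equals the injected delay",
)

# Compare a request that touches the database once with one that touches it five times.
for calls in (1, 5):
    holds = added_delay_matches_injection(base_ms=40.0, db_calls=calls, injected_ms=100.0)
    print(f"{magnified_wait.name}: {calls} call(s) per request, hypothesis holds: {holds}")
```

Under this simplified model the hypothesis holds only when a request makes a single database call. With several sequential calls the user-facing delay is a multiple of the injected one, which is one plausible reading of the slide title.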