Definitive Guide to GameDays - Road to Resilient Services

A GameDay is dedicated time to intentionally create failure scenarios in a safe environment. Regularly running GameDays is an effective Chaos Engineering practice for testing the resiliency of your services, validating technical intricacies, and prompting conversations around observability and incident management. GameDays can also expose blind spots that appear when systems operate under suboptimal conditions. In this talk, Ho Ming shares what it takes to run successful GameDays.

Ho Ming Li

May 15, 2018

Transcript

  1. Definitive Guide to GameDays - Road to Resilient Services. Ho Ming Li (@HoReaL), Solutions Architect @ Gremlin
  2. Ho Ming Li, Solutions Architect (@HoReaL): SW Development, Release Engineering, QA Engineering, Professional Services, Enterprise Support, Solutions Architecture
  3. Dedicated time for teams to collaboratively focus on using Chaos Engineering practices to reveal weaknesses in your services
  4. Who | Topics to expect | Useful Topics
     Influencers: Engineers On-Call | How to practice CE | Continuous Chaos
     Waverers: Senior Management | What is CE | Cost of Downtime
     Passives: Engineering Managers | Why practice CE | Incident & on-call reduction
     Moaners: Specific Individuals | “I’m too busy” | We don’t learn by “always doing things the way we’ve always done them”
     Opponents: Support | “There’s already so much chaos” | Impact of incident management
     Fanatics: Specific Individuals | “I believe in unit tests” | CE is unit tests for alerting and monitoring
     Skeptics: Specific Individuals | “We won’t get value from this” | Defence protection & training
     Mutineers: Specific Individuals | “We don’t need to do this” | Data on top 5 most unreliable services & focus on resilience
  5. THE Chaos Crew to make it happen
     Executive (CTO / VP Engineering): Budget / Objectives
     Executive Assistant / Organizer: Invitations / Coordination
     Engineering Director / Manager: Prioritization / E. Availability
     Engineers / Subject Matter Expert: Architecture / Experiments
     New Hires / Interns: Learning / New Perspectives
  6. How long did it take to launch? How long till service recovers? How much time left before Fallback fails?
  7. Break things together! Join us. Learn from us. Teach us. Chaos Engineering Community Slack (https://tinyurl.com/chaoseng)
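
Slide 6 asks how long a service takes to recover. One way to put a number on that during a GameDay is to poll a health check after injecting the failure and record the elapsed time. The sketch below is a minimal illustration, not something taken from the talk: the function name `time_until_healthy`, its parameters, and the simulated probe in the demo are all hypothetical, and a real probe would call whatever health endpoint your service exposes.

```python
import time
from typing import Callable, Optional


def time_until_healthy(
    probe: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll `probe` until it reports healthy.

    Returns the elapsed seconds at the first healthy result, or None if the
    probe is still unhealthy once `timeout_s` has passed. `clock` and `sleep`
    are injectable so the function can be tested without real waiting.
    """
    start = clock()
    while True:
        healthy = probe()
        elapsed = clock() - start
        if healthy:
            return elapsed
        if elapsed >= timeout_s:
            return None
        sleep(interval_s)


if __name__ == "__main__":
    # Hypothetical stand-in for a real health check: the "service" becomes
    # healthy again 0.3 seconds after the demo starts.
    recovered_at = time.monotonic() + 0.3
    took = time_until_healthy(
        lambda: time.monotonic() >= recovered_at,
        timeout_s=5.0,
        interval_s=0.1,
    )
    if took is None:
        print("did not recover before timeout")
    else:
        print(f"recovered after {took:.1f}s")
```

The measured value is only as precise as the polling interval, so a shorter `interval_s` gives a tighter estimate at the cost of more probe traffic. The same helper can time the other questions on the slide by swapping in a probe for "launched" or "fallback still serving".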