Chaos: breaking your systems to make them unbreakable - Jason Yee, DevOpsDays Chicago 2020

As applications become more distributed and complex, so do our failure modes. In this presentation, I’ll share why you shouldn’t just embrace failure, but should intentionally induce it and learn from it.

This presentation will start with some basic information on why you should start running Chaos experiments (sometimes called Game Days). I’ll then share how to do it and include advice from running Chaos Engineering at Gremlin and Datadog. We’ll end the session with a live, interactive Chaos experiment.

By the end of the session, attendees will be able to make a strong case to their managers for the value of Chaos Engineering and will know how to begin running Chaos experiments in their own environments.

DevOpsDays Chicago

September 01, 2020

Transcript

  1. @gitbisect @gremlininc Jason Yee, Director of Advocacy

    [email protected] @gitbisect
    Whiskey nerd, Pokemon trainer, chocolate maker
    Previously: Datadog, MongoDB, O’Reilly Media
  2. Jason Yee

    I’m a really good storyteller
    [email protected] @gitbisect
    Whiskey nerd, Pokemon trainer, chocolate maker
    I’ve worked at places that hire smart people. I’m probably smart too?
  3. Chaos Engineering FTW!

    http://j.mp/chaos-gcn
    How we* did Chaos Engineering at Datadog
    * The royal we. In other words, “they.”
  4. Chaos Engineering: It’s science!

    Make a hypothesis about system reliability
    Experiment by causing controlled failure
  5. Chaos Engineering: It’s science!

    Make a hypothesis about system reliability
    Experiment by causing controlled failure
    Analyze & share the results
  6. Chaos Engineering: It’s science!

    Make a hypothesis about system reliability
    Experiment by causing controlled failure
    Analyze & share the results
    Improve your system
  7. The top 3 challenges

    1. Lack of time
    2. Lack of process
    3. Lack of priority
  8. Solutions to the top 3 challenges

    1. Lack of time - make it smaller
    2. Lack of process
    3. Lack of priority
  9. Solutions to the top 3 challenges

    1. Lack of time - make it smaller
    2. Lack of process - make a run book
    3. Lack of priority
  10. Solutions to the top 3 challenges

    1. Lack of time - make it smaller
    2. Lack of process - make a run book
    3. Lack of priority - limit the scope
  11. The tools

    • Donut (Slack bot) for team building
      • 3 engineers
      • Google Calendar integration for scheduling
      • Zoom integration for communicating during the GameDay
  12. The tools

    • Donut (Slack bot) for team building
    • Google Form for the run book
  13. The run book

    5 mins - Assign roles and select a scenario
    • Game Master - Launches the attack, evaluates abort conditions
    • Responder - Monitors the application, observes effects, declares incident & implements run book
    • Scribe - Records findings & observations
  14. The run book

    5 mins - Assign roles and select a scenario
    • Keep scenarios simple. Verify what you think you know.
    • Examples:
      • Raise CPU by X% and verify this can be seen in monitoring
      • Kill a pod and verify that it restarts
      • Block access to 3rd party service & verify alerts/errors
  15. The run book

    5 mins - Assign roles and select a scenario
    20 mins* - Run the attack & observe effects
    1. Notify #ops of the GameDay
    2. Start the attack
    3. Is it visible in monitoring? Should it throw an alert?
    4. Is there a runbook or docs?
    * Can add time if the attack causes an incident
  16. The run book

    5 mins - Assign roles and select a scenario
    20 mins* - Run the attack & observe effects
    5 mins - Create the ticket
    * Can add time if the attack causes an incident
  17. The tools

    • Donut (Slack bot) for team building
    • Google Form for the run book
    • Slite for notes
  18. The tools

    • Donut (Slack bot) for team building
    • Google Form for the run book
    • Slite for notes
    • Gremlin for the Chaos
  19. Takeaways

    • Smaller GameDay teams
    • Shorter GameDays
    • Simpler attacks
    • Singular outcomes
  20. Resources

    • Running GameDays at Datadog: http://j.mp/chaos-gcn
    • More resources: http://gremlin.com/community
    • Gremlin Free: http://gremlin.com/free
    • Chaos Conf: http://chaosconf.io
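
The run book described on slides 13 to 16 could be modeled in code roughly as follows. This is a minimal sketch and not something shown in the talk: the `GameDay` and `Scenario` classes, their method names, and the engineer names are hypothetical. Only the three roles, the three phases, and their durations are taken from the slides.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Roles and duties as listed on the run book slide.
ROLES = {
    "Game Master": "Launches the attack, evaluates abort conditions",
    "Responder": "Monitors the application, observes effects, "
                 "declares incident and implements run book",
    "Scribe": "Records findings and observations",
}

# Phases and planned durations (in minutes) from the run book slides.
PHASES = [
    ("Assign roles and select a scenario", 5),
    ("Run the attack and observe effects", 20),
    ("Create the ticket", 5),
]


@dataclass
class Scenario:
    """Hypothetical container for one simple scenario."""
    attack: str       # e.g. "Kill a pod"
    hypothesis: str   # e.g. "The pod restarts"


@dataclass
class GameDay:
    """Hypothetical model of one small GameDay."""
    engineers: list[str]
    scenario: Scenario
    findings: list[str] = field(default_factory=list)

    def assign_roles(self) -> dict[str, str]:
        """Pair each role with one engineer, in the order given."""
        if len(self.engineers) != len(ROLES):
            raise ValueError(f"expected {len(ROLES)} engineers")
        return dict(zip(ROLES, self.engineers))

    def planned_minutes(self) -> int:
        """Total scheduled time, before any extension for a real incident."""
        return sum(minutes for _, minutes in PHASES)

    def record(self, observation: str) -> None:
        """The Scribe's job: append one observation to the findings."""
        self.findings.append(observation)


if __name__ == "__main__":
    day = GameDay(
        engineers=["alice", "bob", "carol"],  # placeholder names
        scenario=Scenario(attack="Kill a pod", hypothesis="The pod restarts"),
    )
    for role, engineer in day.assign_roles().items():
        print(f"{role}: {engineer}")
    print(f"{day.planned_minutes()} minutes planned")
```

Keeping the phases as plain data makes the time budget explicit: the three phases add up to 30 minutes, and anything longer has to be a deliberate extension, such as when the attack causes a real incident.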