Chaos: breaking your systems to make them unbreakable - Jason Yee, DevOpsDays Chicago 2020

As applications become more distributed and complex, so do our failure modes. In this presentation, I’ll share why you shouldn’t just embrace failure, but should intentionally induce it and learn from it.

This presentation will start with some basic information on why you should start running Chaos experiments (sometimes called Game Days). I’ll then share how to do it and include advice from running Chaos Engineering at Gremlin and Datadog. We’ll end the session with a live, interactive Chaos experiment.

By the end of the session, attendees will be able to make a strong case to their managers for the value of Chaos Engineering and will know enough to begin running Chaos experiments in their own environments.

DevOpsDays Chicago

September 01, 2020

Transcript

  1. @gitbisect @gremlininc unmet expectations.

  2. @gitbisect @gremlininc

  3. @gitbisect @gremlininc

  4. @gitbisect @gremlininc

  5. @gitbisect @gremlininc complexity keeps increasing

  6. @gitbisect @gremlininc complexity keeps increasing
    incidents are more frequent & cost more money

  7. @gitbisect @gremlininc complexity keeps increasing
    incidents are more frequent & cost more money
    Chaos Engineering FTW!

  8. @gitbisect @gremlininc Jason Yee
    Director of Advocacy
    jyee@gremlin.com @gitbisect
    Whiskey nerd, Pokemon trainer, chocolate maker
    Previously: Datadog, MongoDB, O’Reilly Media

  9. @gitbisect @gremlininc Jason Yee
    I’m a really good storyteller
    jyee@gremlin.com @gitbisect
    Whiskey nerd, Pokemon trainer, chocolate maker
    I’ve worked at places that hire smart people. I’m probably smart too?

  10. @gitbisect @gremlininc Chaos Engineering FTW!
    http://j.mp/chaos-gcn
    How we* did Chaos Engineering at Datadog
    * The royal we. In other words, “they.”

  11. @gitbisect @gremlininc Gremlin Chaos Engineering Platform http://gremlin.com/community

  12. @gitbisect @gremlininc Chaos Engineering

  13. @gitbisect @gremlininc Thoughtful, planned experiments designed to reveal the weaknesses in our systems

  14. @gitbisect @gremlininc Chaos Engineering
    It’s science!
    Make a hypothesis about system reliability

  15. @gitbisect @gremlininc Chaos Engineering
    It’s science!
    Make a hypothesis about system reliability
    Experiment by causing controlled failure

  16. @gitbisect @gremlininc Chaos Engineering
    It’s science!
    Make a hypothesis about system reliability
    Experiment by causing controlled failure
    Analyze & share the results

  17. @gitbisect @gremlininc Chaos Engineering
    It’s science!
    Make a hypothesis about system reliability
    Experiment by causing controlled failure
    Analyze & share the results
    Improve your system

  18. @gitbisect @gremlininc unmet expectations.

  19. @gitbisect @gremlininc Expectations Reality

  20. @gitbisect @gremlininc The top 3 challenges

  21. @gitbisect @gremlininc The top 3 challenges
    1. Lack of time

  22. @gitbisect @gremlininc The top 3 challenges
    1. Lack of time
    2. Lack of process

  23. @gitbisect @gremlininc The top 3 challenges
    1. Lack of time
    2. Lack of process
    3. Lack of priority

  24. @gitbisect @gremlininc Solutions to the top 3 challenges
    1. Lack of time - make it smaller
    2. Lack of process
    3. Lack of priority

  25. @gitbisect @gremlininc Solutions to the top 3 challenges
    1. Lack of time - make it smaller
    2. Lack of process - make a run book
    3. Lack of priority

  26. @gitbisect @gremlininc Solutions to the top 3 challenges
    1. Lack of time - make it smaller
    2. Lack of process - make a run book
    3. Lack of priority - limit the scope

  27. @gitbisect @gremlininc Mini GameDays

  28. @gitbisect @gremlininc The tools
    • Donut (slack bot) for team building
    • 3 engineers
    • Google Calendar integration for scheduling
    • Zoom integration for communicating during the GameDay

  29. @gitbisect @gremlininc The tools
    • Donut (slack bot) for team building
    • Google form for the run book

  30. @gitbisect @gremlininc The run book
    5mins - Assign roles and select a scenario
    • Game Master - Launches the attack, evaluates abort conditions.
    • Responder - Monitors the application, observes effects, declares incident & implements run book
    • Scribe - Records findings & observations

  31. @gitbisect @gremlininc The run book
    5mins - Assign roles and select a scenario
    • Keep scenarios simple. Verify what you think you know.
    • Examples:
      • Raise CPU by X% and verify this can be seen in monitoring
      • Kill a pod and verify that it restarts
      • Block access to 3rd party service & verify alerts/errors

  32. @gitbisect @gremlininc The run book
    5mins - Assign roles and select a scenario
    20mins* - Run the attack & observe effects
    1. Notify #ops of the GameDay.
    2. Start the attack
    3. Is it visible in monitoring? Should it throw an alert?
    4. Is there a runbook or docs?
    *can add time if the attack causes an incident

  33. @gitbisect @gremlininc The run book
    5mins - Assign roles and select a scenario
    20mins* - Run the attack & observe effects
    5mins - Create the ticket
    *can add time if the attack causes an incident

  34. @gitbisect @gremlininc The tools
    • Donut (slack bot) for team building
    • Google form for the run book
    • Slite for notes

  35. @gitbisect @gremlininc The tools
    • Donut (slack bot) for team building
    • Google form for the run book
    • Slite for notes
    • Gremlin for the Chaos

  36. @gitbisect @gremlininc Take aways
    • Smaller GameDay teams
    • Shorter GameDays
    • Simpler attacks
    • Singular outcomes

  37. @gitbisect @gremlininc Start small. Build practice.

  38. @gitbisect @gremlininc Resources
    • Running GameDays at Datadog: http://j.mp/chaos-gcn
    • More resources: http://gremlin.com/community
    • Gremlin Free: http://gremlin.com/free
    • Chaos Conf: http://chaosconf.io

  39. @gitbisect @gremlininc Thanks!
    jyee@gremlin.com @gitbisect
    splcenter.org
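
The mini GameDay run book on slides 30 to 33 can be read as a small, time-boxed plan: three roles, three phases, and a short checklist for the attack phase. The Python sketch below is only an illustration of that structure and was not shown in the talk. The `Phase` class, the `assign_roles` and `total_minutes` helpers, and the engineer names are hypothetical; the roles, phase names, durations, and checklist steps are copied from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    """One time-boxed step of the run book (hypothetical helper type)."""
    name: str
    minutes: int
    steps: tuple = ()


# Phase names, durations, and checklist steps come from the run book slides.
RUN_BOOK = (
    Phase("Assign roles and select a scenario", 5),
    Phase(
        "Run the attack & observe effects",
        20,  # the slides note this can be extended if the attack causes an incident
        (
            "Notify #ops of the GameDay",
            "Start the attack",
            "Is it visible in monitoring? Should it throw an alert?",
            "Is there a runbook or docs?",
        ),
    ),
    Phase("Create the ticket", 5),
)

# The three roles named on the slides.
ROLES = ("Game Master", "Responder", "Scribe")


def assign_roles(engineers):
    """Pair each role with one engineer; the slides call for three engineers."""
    if len(engineers) != len(ROLES):
        raise ValueError(f"expected {len(ROLES)} engineers, got {len(engineers)}")
    return dict(zip(ROLES, engineers))


def total_minutes(run_book=RUN_BOOK):
    """Planned length of the GameDay, before any incident extension."""
    return sum(phase.minutes for phase in run_book)


if __name__ == "__main__":
    # Engineer names are placeholders, not from the talk.
    print(assign_roles(["alice", "bob", "carol"]))
    print(f"Planned length: {total_minutes()} minutes")
    for phase in RUN_BOOK:
        print(f"{phase.minutes}mins - {phase.name}")
        for number, step in enumerate(phase.steps, start=1):
            print(f"  {number}. {step}")
```

Keeping the phases as plain data makes the time budget easy to check: the three phases add up to a 30 minute session, which is the "make it smaller" answer to the lack-of-time challenge.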