Chaos: breaking your systems to make them unbreakable - Jason Yee, DevOpsDays Chicago 2020

As applications become more distributed and complex, so do our failure modes. In this presentation, I’ll share why you shouldn’t just embrace failure, but should intentionally induce it and learn from it.

This presentation will start with some basic information on why you should start running Chaos experiments (sometimes called Game Days). I’ll then share how to do it and include advice from running Chaos Engineering at Gremlin and Datadog. We’ll end the session with a live, interactive Chaos experiment.

By the end of the session, attendees will be able to make a strong case to their managers for the value of Chaos Engineering and will know enough to begin running Chaos experiments in their own environments.

DevOpsDays Chicago

September 01, 2020

Transcript

  1. @gitbisect @gremlininc
    unmet expectations.

  2.

  3.

  4.

  5.
    complexity keeps increasing

  6.
    complexity keeps increasing
    incidents are more frequent & cost more money

  7.
    complexity keeps increasing
    incidents are more frequent & cost more money
    Chaos Engineering FTW!

  8.
    Jason Yee
    Director of Advocacy
    [email protected]
    @gitbisect
    Whiskey nerd, Pokemon trainer, chocolate maker
    Previously: Datadog, MongoDB, O’Reilly Media

  9.
    Jason Yee
    I’m a really good storyteller
    [email protected]
    @gitbisect
    Whiskey nerd, Pokemon trainer, chocolate maker
    I’ve worked at places that hire smart people. I’m probably smart too?

  10.
    Chaos Engineering FTW!
    http://j.mp/chaos-gcn
    How we* did Chaos Engineering at Datadog
    * The royal we. In other words, “they.”

  11.
    Gremlin
    Chaos Engineering Platform
    http://gremlin.com/community

  12.
    Chaos Engineering

  13.
    Thoughtful, planned experiments designed to reveal the weaknesses in our systems

  14.
    Chaos Engineering
    It’s science!
    Make a hypothesis about system reliability

  15.
    Chaos Engineering
    It’s science!
    Make a hypothesis about system reliability
    Experiment by causing controlled failure

  16.
    Chaos Engineering
    It’s science!
    Make a hypothesis about system reliability
    Experiment by causing controlled failure
    Analyze & share the results

  17.
    Chaos Engineering
    It’s science!
    Make a hypothesis about system reliability
    Experiment by causing controlled failure
    Analyze & share the results
    Improve your system
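
The four steps above (hypothesize, experiment, analyze, improve) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ChaosExperiment`, its fields, and the simulated `replicas` dictionary are hypothetical names invented for this sketch, not part of Gremlin or any other tool.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ChaosExperiment:
    """Hypothetical container for one experiment (not a real tool's API)."""
    hypothesis: str
    inject_failure: Callable[[], None]    # causes the controlled failure
    rollback: Callable[[], None]          # undoes the failure afterwards
    hypothesis_holds: Callable[[], bool]  # checked while the failure is active
    findings: List[str] = field(default_factory=list)

    def run(self) -> bool:
        self.inject_failure()
        try:
            held = self.hypothesis_holds()
        finally:
            # Roll back even if the check itself raises.
            self.rollback()
        verdict = "held" if held else "did NOT hold"
        # Record the result so it can be analyzed and shared.
        self.findings.append(f"Hypothesis {verdict}: {self.hypothesis}")
        return held


# Simulated system: two replicas of a service, both healthy.
replicas = {"a": True, "b": True}

experiment = ChaosExperiment(
    hypothesis="The service stays up when one replica is lost",
    inject_failure=lambda: replicas.update(a=False),
    rollback=lambda: replicas.update(a=True),
    hypothesis_holds=lambda: any(replicas.values()),
)
result = experiment.run()
print(experiment.findings[0])
```

A finding that says the hypothesis did not hold is the input to the last step: it tells you what to improve.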

  18.
    unmet expectations.

  19.
    Expectations Reality

  20.
    The top 3 challenges

  21.
    The top 3 challenges
    1. Lack of time

  22.
    The top 3 challenges
    1. Lack of time
    2. Lack of process

  23.
    The top 3 challenges
    1. Lack of time
    2. Lack of process
    3. Lack of priority

  24.
    Solutions to the top 3 challenges
    1. Lack of time - make it smaller
    2. Lack of process
    3. Lack of priority

  25.
    Solutions to the top 3 challenges
    1. Lack of time - make it smaller
    2. Lack of process - make a run book
    3. Lack of priority

  26.
    Solutions to the top 3 challenges
    1. Lack of time - make it smaller
    2. Lack of process - make a run book
    3. Lack of priority - limit the scope

  27.
    Mini GameDays

  28.
    The tools
    • Donut (Slack bot) for team building
    • 3 engineers
    • Google Calendar integration for scheduling
    • Zoom integration for communicating during the GameDay

  29.
    The tools
    • Donut (Slack bot) for team building
    • Google form for the run book

  30.
    The run book
    5mins - Assign roles and select a scenario
    • Game Master - Launches the attack, evaluates abort conditions.
    • Responder - Monitors the application, observes effects, declares incident & implements run book
    • Scribe - Records findings & observations
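
As a rough illustration of the role-assignment step, the sketch below shuffles three engineers into the three roles named above. The function name `assign_roles`, the `seed` parameter, and the engineer names are hypothetical; the talk does not say how roles were actually chosen.

```python
import random
from typing import Dict, List, Optional

# Role names taken from the run book slide.
ROLES = ["Game Master", "Responder", "Scribe"]


def assign_roles(engineers: List[str], seed: Optional[int] = None) -> Dict[str, str]:
    """Randomly map each role to one engineer (hypothetical helper)."""
    if len(engineers) != len(ROLES):
        raise ValueError(f"expected exactly {len(ROLES)} engineers")
    rng = random.Random(seed)  # a seed makes the draw repeatable
    shuffled = list(engineers)
    rng.shuffle(shuffled)
    return dict(zip(ROLES, shuffled))


assignment = assign_roles(["Ana", "Ben", "Chi"], seed=7)
for role, person in assignment.items():
    print(f"{role}: {person}")
```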

  31.
    The run book
    5mins - Assign roles and select a scenario
    • Keep scenarios simple. Verify what you think you know.
    • Examples:
    • Raise CPU by X% and verify this can be seen in monitoring
    • Kill a pod and verify that it restarts
    • Block access to 3rd party service & verify alerts/errors

  32.
    The run book
    5mins - Assign roles and select a scenario
    20mins* - Run the attack & observe effects
    1. Notify #ops of the GameDay.
    2. Start the attack
    3. Is it visible in monitoring? Should it throw an alert?
    4. Is there a runbook or docs?
    *can add time if the attack causes an incident
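
The Game Master's job of launching the attack and evaluating abort conditions can be sketched as a loop that always stops the attack, whether it finishes or is aborted. Everything here is an assumption made for illustration: `run_attack`, the error-rate readings, and the 10% abort threshold are hypothetical, and the callbacks stand in for whatever tool actually starts and halts the attack.

```python
from typing import Callable, Iterable, List


def run_attack(start_attack: Callable[[], None],
               stop_attack: Callable[[], None],
               error_rates: Iterable[float],
               abort_threshold: float) -> str:
    """Run an attack, aborting as soon as a reading crosses the threshold."""
    start_attack()
    try:
        for rate in error_rates:
            if rate > abort_threshold:
                return "aborted"
        return "completed"
    finally:
        # The attack is halted on every path, including exceptions.
        stop_attack()


log: List[str] = []
outcome = run_attack(
    start_attack=lambda: log.append("attack started"),
    stop_attack=lambda: log.append("attack stopped"),
    error_rates=[0.01, 0.02, 0.30, 0.02],  # simulated monitoring readings
    abort_threshold=0.10,
)
print(outcome, log)
```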

  33.
    The run book
    5mins - Assign roles and select a scenario
    20mins* - Run the attack & observe effects
    5mins - Create the ticket
    *can add time if the attack causes an incident
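
The three time boxes add up to a 30-minute session. As a small worked example, the sketch below turns them into a clock-time agenda. The step names and durations come from the run book above; `build_agenda` and the 14:00 start time are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# (step, minutes) pairs from the run book slide.
RUN_BOOK = [
    ("Assign roles and select a scenario", 5),
    ("Run the attack & observe effects", 20),
    ("Create the ticket", 5),
]


def build_agenda(start: datetime) -> List[Tuple[str, str, str]]:
    """Return (start, end, step) rows for a mini GameDay beginning at `start`."""
    agenda = []
    cursor = start
    for step, minutes in RUN_BOOK:
        end = cursor + timedelta(minutes=minutes)
        agenda.append((cursor.strftime("%H:%M"), end.strftime("%H:%M"), step))
        cursor = end
    return agenda


agenda = build_agenda(datetime(2020, 9, 1, 14, 0))  # hypothetical start time
for begin, end, step in agenda:
    print(f"{begin}-{end}  {step}")
```

If the attack causes an incident, the middle block can be extended, as the footnote on the slide allows.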

  34.
    The tools
    • Donut (Slack bot) for team building
    • Google form for the run book
    • Slite for notes

  35.
    The tools
    • Donut (Slack bot) for team building
    • Google form for the run book
    • Slite for notes
    • Gremlin for the Chaos

  36.
    Takeaways
    • Smaller GameDay teams
    • Shorter GameDays
    • Simpler attacks
    • Singular outcomes

  37.
    Start small. Build practice.

  38.
    Resources
    • Running GameDays at Datadog: http://j.mp/chaos-gcn
    • More resources: http://gremlin.com/community
    • Gremlin Free: http://gremlin.com/free
    • Chaos Conf: http://chaosconf.io

  39.
    Thanks!
    [email protected]
    @gitbisect
    splcenter.org
