Dallas Chaos Engineering Community Meetup - Feb 2018

A presentation from the Dallas Chaos Engineering Community Meetup in Feb 2018

Tammy Bütow

February 08, 2018
Transcript

  1. Chaos Engineering Community Meetup. Feb 6, 2018

    Tammy Butow, Principal SRE, Gremlin.com @Gremlininc @tammybutow
  2. Welcome

    Hello, I’m Tammy Butow; you can find me on Twitter @tammybutow. I’m a Site Reliability Engineer at Gremlin. I work remotely from Melbourne right now; our head office is in Silicon Valley.

    Where else can you find me?
    Twitter: twitter.com/tammybutow
    Twitch: twitchtv.com/tammychaos
    Website: tammybutow.com
  3. None
  4. None
  5. What is Chaos Engineering? A brief introduction to the practice of CE
  6. What is Chaos Engineering?

    Chaos Engineering is an emerging discipline, but the underlying concepts are not new. Failure is going to happen. Are you ready?

    Put simply, Chaos Engineering is one approach to “breaking things on purpose” that teaches us new information about our systems through experimentation. By triggering incidents intentionally in a controlled way, we gain confidence that our systems can deal with those failures before they occur in production. By practicing Chaos Engineering you’ll learn how to build systems and organizations that improve in the face of failure.
  7. What is Chaos Engineering?

    The lesson we should learn and remember is that sooner or later, all complex systems will fail. It’s not a matter of if, it’s a matter of when. There will always be something that can, and will, go wrong.

    Break Things on Purpose. Building resilient systems requires experience with failure. Waiting for things to break in production is not an option. Instead, we should inject failures proactively in a controlled way to gain confidence that our production systems can withstand those failures. By simulating potential errors in advance, we can verify that our systems behave as we expect, and fix them if they don’t.
  8. A Word of Caution

    You should never conduct a chaos experiment in production if you already know that it will cause severe damage, possibly affecting customers and, with them, your reputation. Always try to fix known problems first! Chaos Engineering requires a base level of resilience.
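
The approach described on the slides above (inject a failure in a controlled way, verify the system behaves as expected, and do not start if the system is already known to be unhealthy) can be sketched roughly as follows. This is only an illustration, not Gremlin's implementation: `run_experiment`, `ExperimentResult`, and the toy two-replica service are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_before: bool   # was the system healthy before we touched it?
    steady_during: bool   # did it stay healthy while the failure was active?
    injected: bool        # did we actually inject the failure?


def run_experiment(
    steady_state: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Run one controlled failure injection and report what was observed."""
    # Word of caution: do not inject anything into a system that is already unhealthy.
    if not steady_state():
        return ExperimentResult(steady_before=False, steady_during=False, injected=False)

    inject()
    try:
        # Observe the system while the failure is active.
        steady_during = steady_state()
    finally:
        # Always undo the failure, even if the check itself raises.
        rollback()
    return ExperimentResult(steady_before=True, steady_during=steady_during, injected=True)


# Toy system (hypothetical): healthy while at least one of two replicas is up.
replicas = {"a": True, "b": True}


def healthy() -> bool:
    return any(replicas.values())


def kill_a() -> None:
    replicas["a"] = False


def restore_a() -> None:
    replicas["a"] = True


result = run_experiment(healthy, kill_a, restore_a)
print(result)
```

In this toy setup the service survives losing one replica, so the hypothesis holds; if `steady_during` came back false, that would be the weakness to fix before running the experiment anywhere near production.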
  9. The History of Chaos Engineering https://coggle.it/diagram/WiKceGDAwgABrmyv/0a2d4968c94723e48e1256e67df51d0f4217027143924b23517832f53c536e62

  10. What is the state of Chaos Engineering right now?

  11. https://twitter.com/TechCrunch/status/960179520610492417

  12. https://techcrunch.com/2018/02/04/the-rise-of-chaos-engineering/

  13. meetup.com/pro/chaos

  14. gremlin.com/community

  15. Which service teams should use Chaos Engineering? Where should we focus first?

    My top 3 recommendations for selecting services/systems:
    1. Tier 0 / critical services: “what are your top 5 most critical systems?”
    2. Services which serve critical functions, e.g. a bushfire warning system
    3. Services which store critical data, e.g. data storage/big data
  16. None
  17. Establishing a High Severity Incident Management Program

  18. Establishing a High Severity Incident Management Program

  19. Establishing a High Severity Incident Management Program

  20. Mini Bootcamp: Chaos Engineering + Docker. Be prepared for outages.

  21. Mini Bootcamp Materials

    We have the following:
    1. A Droplet from DigitalOcean (cloud infrastructure)
    2. Docker (containers)
    3. Docker Voting App (demo application)
    4. Gremlin (chaos engineering)
    5. Datadog (monitoring)
  22. Bootcamp Materials: a DigitalOcean Droplet

  23. Bootcamp Materials: Docker and Docker Compose on your DigitalOcean Droplet

  24. Bootcamp Materials: a demo Docker application, the Cats vs Dogs voting app

    Vote: http://159.65.74.124:5000/
    Results: http://159.65.74.124:5001/
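
Before attacking the demo app it helps to have a simple check that both pages respond, so the same check can be repeated during the attack. The sketch below is one way to do that with the Python standard library; `http_status` and `check_endpoints` are hypothetical helper names, and the URLs are the ones shown on the slide, which you would replace with the address of your own Droplet.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

# URLs as shown on the slide; substitute your own Droplet's address.
ENDPOINTS = {
    "vote": "http://159.65.74.124:5000/",
    "results": "http://159.65.74.124:5001/",
}


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails entirely."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 500.
        return err.code
    except OSError:
        # Connection refused, DNS failure, timeout, and similar (URLError is an OSError).
        return 0


def check_endpoints(
    endpoints: Dict[str, str],
    fetch: Callable[[str], int] = http_status,
) -> Dict[str, bool]:
    """Map each endpoint name to True if it answered with HTTP 200."""
    return {name: fetch(url) == 200 for name, url in endpoints.items()}


if __name__ == "__main__":
    for name, ok in check_endpoints(ENDPOINTS).items():
        print(f"{name}: {'up' if ok else 'DOWN'}")
```

The `fetch` parameter exists so the check can be exercised without a network, for example with a stub that returns fixed status codes.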
  25. Gremlin

    Gremlin’s Failure as a Service finds weaknesses in your system before they cause problems. https://app.gremlin.com/dashboard
  26. Datadog: monitoring agent and dashboards for your application and containers

    docker run -d --name dd-agent \
      -v /var/run/docker.sock:/var/run/docker.sock:ro \
      -v /proc/:/host/proc/:ro \
      -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
      -e API_KEY=faff9c88d8cdd357d76505f595f23797 \
      -e SD_BACKEND=docker \
      datadog/docker-dd-agent:latest
  27. Let’s get started… Time for hands-on Chaos Engineering

  28. Create an attack using Gremlin (UI or CLI) https://app.gremlin.com/dashboard

  29. Create an attack using Gremlin (UI or CLI) https://app.gremlin.com/dashboard

  30. Create an attack using Gremlin (UI or CLI) https://app.gremlin.com/dashboard

    docker run -it \
      --cap-add=NET_ADMIN \
      -e GREMLIN_ORG_ID="${GREMLIN_ORG_ID}" \
      -e GREMLIN_ORG_SECRET="${GREMLIN_ORG_SECRET}" \
      -v /var/run/docker.sock:/var/run/docker.sock \
      gremlin/gremlin attack-container 466bbb0e5246 cpu
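
If you want to launch the same attack from a script rather than typing it by hand, the command on the slide can be assembled as an argument list. The sketch below only reproduces the flags shown on the slide; `gremlin_attack_command` is a hypothetical helper, the credential values are placeholders, and the command syntax may differ in current Gremlin releases.

```python
import shlex
from typing import List


def gremlin_attack_command(
    container_id: str,
    org_id: str,
    org_secret: str,
    attack: str = "cpu",
) -> List[str]:
    """Build the docker argument list from the slide for one container attack."""
    if not org_id or not org_secret:
        raise ValueError("GREMLIN_ORG_ID and GREMLIN_ORG_SECRET must both be set")
    return [
        "docker", "run", "-it",
        "--cap-add=NET_ADMIN",
        "-e", f"GREMLIN_ORG_ID={org_id}",
        "-e", f"GREMLIN_ORG_SECRET={org_secret}",
        # The Docker socket is mounted so the attack can reach the target container.
        "-v", "/var/run/docker.sock:/var/run/docker.sock",
        "gremlin/gremlin", "attack-container", container_id, attack,
    ]


# Placeholder credentials; the container ID is the one shown on the slide.
command = gremlin_attack_command("466bbb0e5246", "example-org-id", "example-secret")
print(shlex.join(command))
```

Passing the list to `subprocess.run(command)` would start the attack; printing it first gives you a chance to review exactly what will run.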
  31. atop: monitoring from within the container you are attacking

  32. Datadog: a sidecar will be created to perform the attack

  33. How can you learn more about Chaos Engineering? Useful resources and ways to learn

    1. Chaos Engineering Community on Slack @ https://tinyurl.com/chaoseng
    2. Follow Gremlin on Twitter @gremlininc
    3. Technical Papers @ https://blog.gremlin.com/
    4. Conferences (QCon, Velocity and SREcon)
    5. Follow Chaos Engineers on Twitter (@koltonandrus & @callmeforni)
  34. Q & A: What’s on your mind?

  35. Thank You:

    • Everyone who attended tonight
    • Joel & Jennifer

    Tammy Butow, Gremlin.com @Gremlininc @tammybutow