Embracing Failure in a Container World

Sooner or later, all complex systems will fail. It's not a matter of "if" but of "when". There will always be something that can -- and will -- go wrong, especially with today's distributed systems. Accept it and focus on the things you can control: creating a quality service that is resilient to failure.

Building resilient systems requires experience with failure, and waiting for things to break is not an option. Instead, we should proactively inject failures in a controlled way to gain confidence that our production systems can withstand them. In this talk, Mathias is going to show you how to apply this idea to the wonderful world of containers.

(Talk given at ContainerDays 2017: http://www.containerdays.io/)

Mathias Lafeldt

June 21, 2017

Transcript

  1. Embracing Failure in a Container World

  2. Hello, I'm Mathias Lafeldt. @mlafeldt

  4. Outages in 2017
     • 2017-01-31: GitLab database outage
     • 2017-02-28: Amazon S3 service disruption in us-east-1
     • 2017-05-16: Starbucks server outage ☕
     • ...
     • Add your incidents here

  5. Sooner or later, all complex systems will fail

  6. There's always something with Docker in production.
     -- Henning Jacobs, Zalando (ContainerDays 2017)

  7. Accept it and create a quality product that is resilient to failures

  8. Building resilient systems requires experience with failure

  9. Chaos Engineering 101
     • Proactively inject failures by simulating real-world events
     • Verify that our systems behave as we expect
     • Fix them if they don't
     • Discover new properties of our systems
     • Netflix: http://principlesofchaos.org/
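
The loop on this slide (check, inject, verify, fix) can be sketched as a small shell function. This is only an illustration under stated assumptions, not a tool shown in the talk: `run_experiment` and the three command strings passed to it are hypothetical placeholders for whatever health check, failure injection, and rollback fit your system.

```shell
#!/bin/sh
# Minimal chaos experiment loop (illustrative sketch; all names are hypothetical).
# Usage: run_experiment <steady-state check> <inject> <rollback>
# Each argument is a shell command string; exit status 0 means success.
run_experiment() {
  local check="$1" inject="$2" rollback="$3"

  # 1. Confirm the system is healthy before touching anything.
  if ! eval "$check"; then
    echo "ABORT: system not in steady state"
    return 2
  fi

  # 2. Inject the failure (simulate a real-world event).
  eval "$inject"

  # 3. Verify that the system still behaves as expected.
  local result=0
  if eval "$check"; then
    echo "PASS: steady state held during failure"
  else
    echo "FAIL: weakness found, create a follow-up ticket"
    result=1
  fi

  # 4. Always undo the failure, whatever the outcome.
  eval "$rollback"
  return "$result"
}
```

A call might look like `run_experiment 'curl -fsS http://localhost:8080/health >/dev/null' 'docker pause web' 'docker unpause web'`, where the health URL and the container name `web` are made up for the example.
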
  10. Wonderland: Jimdo's PaaS
      speakerdeck.com/mlafeldt/a-journey-through-wonderland

  11. GameDays at Jimdo
      1. Gather the team in front of a big screen
      2. Think up failure modes and estimate expected impact
      3. Go through chaos experiments together
      4. Write down measured impact
      5. Create follow-up tickets for all flaws

  13. docker pull mlafeldt/simianarmy
      github.com/mlafeldt/docker-simianarmy

  14. Unleash the monkey!

      $ docker run -it --rm \
          -e SIMIANARMY_CLIENT_AWS_ACCOUNTKEY=$AWS_ACCESS_KEY_ID \
          -e SIMIANARMY_CLIENT_AWS_SECRETKEY=$AWS_SECRET_ACCESS_KEY \
          -e SIMIANARMY_CLIENT_AWS_REGION=$AWS_REGION \
          -e SIMIANARMY_CALENDAR_ISMONKEYTIME=true \
          -e SIMIANARMY_CHAOS_ASG_ENABLED=true \
          -e SIMIANARMY_CHAOS_LEASHED=false \
          mlafeldt/simianarmy
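
The command on this slide takes its AWS settings from the calling shell, and with SIMIANARMY_CHAOS_LEASHED=false the monkey is allowed to terminate instances for real. A small preflight check, sketched below, can catch an unset variable before the container starts. `require_env` is a hypothetical helper written for this example; it is not part of docker-simianarmy.

```shell
#!/bin/sh
# Preflight sketch (hypothetical helper, not part of docker-simianarmy).

# require_env NAME...: fail if any of the named variables is unset or empty.
require_env() {
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      return 1
    fi
  done
}
```

Used as a guard, this would read `require_env AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_REGION && docker run ...`, followed by the arguments from the slide.
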
  15. On-demand termination
      /simianarmy/api/v1/chaos
      github.com/mlafeldt/chaosmonkey

  16. Trigger new chaos events

      $ go get -u github.com/mlafeldt/chaosmonkey
      $ chaosmonkey -endpoint http://$DOCKER_HOST_IP:8080 \
          -group ExampleAutoScalingGroup \
          -strategy ShutdownInstance
      $ chaosmonkey -endpoint http://$DOCKER_HOST_IP:8080 \
          -group ExampleAutoScalingGroup \
          -strategy DetachVolumes \
          -count 5 -interval 10s -probability 0.2
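
The chaosmonkey CLI talks to the /simianarmy/api/v1/chaos endpoint named on the previous slide. For readers who want to see roughly what such a request might look like, here is a sketch using curl. The JSON field names (`eventType`, `groupType`, `groupName`, `chaosType`) are an assumption based on the Simian Army REST documentation as recalled and should be verified against your version; `chaos_payload` and `trigger_chaos` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: trigger a chaos event over the REST API directly.
# ASSUMPTION: the JSON field names follow the Simian Army REST docs as
# recalled; check them against the version you run before relying on this.

# chaos_payload GROUP STRATEGY: print the JSON request body.
# Assumes both values contain no quotes or backslashes.
chaos_payload() {
  printf '{"eventType":"CHAOS_TERMINATION","groupType":"ASG","groupName":"%s","chaosType":"%s"}\n' "$1" "$2"
}

# trigger_chaos ENDPOINT GROUP STRATEGY: POST the event with curl.
trigger_chaos() {
  chaos_payload "$2" "$3" |
    curl --fail --silent --show-error -X POST \
      -H 'Content-Type: application/json' \
      --data @- "$1/simianarmy/api/v1/chaos"
}
```

With the values from the slide, the call would be `trigger_chaos "http://$DOCKER_HOST_IP:8080" ExampleAutoScalingGroup ShutdownInstance`.
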
  17. Bonus: Slack notifications

  18. Failure injection points
      • Infrastructure provider
      • Hosts
      • Internal and external dependencies
      • Client libraries
      • Applications
      • Containers
      ...

  19. Pumba
      github.com/gaia-adm/pumba

  20. Pumba example

      $ docker run -it --rm --name ubuntu ubuntu:16.04
      $ pumba netem --tc-image gaiadocker/iproute2 \
          --duration 60s delay --time 3000 ubuntu

  21. How does it work?

      $ docker run --rm \
          --cap-add NET_ADMIN \
          --net=container:ubuntu \
          gaiadocker/iproute2 qdisc add dev eth0 root netem delay 3000ms
      $ docker run --rm \
          --cap-add NET_ADMIN \
          --net=container:ubuntu \
          gaiadocker/iproute2 qdisc del dev eth0 root netem
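
The two commands on this slide add a netem rule inside the target container's network namespace and later remove it. One way to wrap them so that the rule is removed even when the wait is interrupted is sketched below. This is not Pumba's implementation: `netem`, `delay_container`, and the `DRY_RUN` switch are hypothetical names for this example, while the docker and tc arguments are copied from the slide.

```shell
#!/bin/sh
# Sketch: delay a container's network for a fixed time, then always clean up.
# Helper names and DRY_RUN are hypothetical; arguments mirror the slide.

# netem CONTAINER TC_ARGS...: run tc inside CONTAINER's network namespace.
# Set DRY_RUN=1 to print the docker command instead of executing it.
netem() {
  container="$1"; shift
  set -- docker run --rm --cap-add NET_ADMIN \
    "--net=container:$container" gaiadocker/iproute2 "$@"
  if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}

# delay_container CONTAINER DELAY_MS DURATION_S
# The body runs in a subshell so the traps stay local to this call.
delay_container() (
  target="$1" delay_ms="$2" duration_s="$3"
  netem "$target" qdisc add dev eth0 root netem delay "${delay_ms}ms" || exit 1
  # Remove the rule when the subshell exits, including after Ctrl-C.
  trap 'netem "$target" qdisc del dev eth0 root netem' EXIT
  trap 'exit 130' INT TERM
  sleep "$duration_s"
)
```

Running `DRY_RUN=1; delay_container ubuntu 3000 60` prints the add and del commands without touching Docker, which is a cheap way to review what would be executed before trying it on a real container.
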
  23. Gremlin
      • Failure as a Service
      • Attack hosts, containers, applications
      • Impact CPU, RAM, I/O, network traffic, system time, etc.
      • Web UI + CLI
      • Safe and secure

  24. Takeaways
      • Building resilient systems requires experience with failure
      • Don't wait for things to break in production
      • Proactively inject failures in a controlled way
      • Use existing chaos tools
      • Start small!

  25. Production Ready mailing list
      https://tinyletter.com/production-ready
      34 articles, including:
        The Discipline of Chaos Engineering
        A Little Story about Amazon ECS, systemd, and Chaos Monkey
        The Power of Less Code
        Writing Your First Postmortem
        Go, Mental Models, and Side Effects

  26. Thank you.
      https://tinyletter.com/production-ready
      @mlafeldt