Embracing Failure in a Container World

Sooner or later, all complex systems will fail. It's not a matter of "if", it's a matter of "when". There will always be something that can -- and will -- go wrong, especially with today's distributed systems. Accept it and focus on the things you can control: creating a quality service that is resilient to failure.

Building resilient systems requires experience with failure. Waiting for things to break is not an option. Instead, we should inject failures proactively and in a controlled way to gain confidence that our production systems can withstand them. In this talk, Mathias shows how to apply this idea to the wonderful world of containers.

(Talk given at ContainerDays 2017: http://www.containerdays.io/)

Mathias Lafeldt

June 21, 2017

Transcript

  2. 4.

    Outages in 2017
    • 2017-01-31: GitLab database outage
    • 2017-02-28: Amazon S3 service disruption in us-east-1
    • 2017-05-16: Starbucks server outage ☕
    • ...
    • Add your incidents here
  3. 9.

    Chaos Engineering 101
    • Proactively inject failures by simulating real-world events
    • Verify that our systems behave as we expect
    • Fix them if they don't
    • Discover new properties of our systems
    • Netflix: http://principlesofchaos.org/
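The loop on this slide (inject, verify, fix) can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name `run_experiment` and its three command arguments are hypothetical and do not come from the talk.

```shell
# Hypothetical helper, not from the talk: run one chaos experiment.
#   $1 = command that checks the steady state (exit 0 = healthy)
#   $2 = command that injects the failure
#   $3 = command that rolls the failure back
run_experiment() {
  local check="$1"
  local inject="$2"
  local rollback="$3"
  local result=0

  # Only experiment on a system that is healthy to begin with.
  if ! eval "$check"; then
    echo "aborted: steady state not met before injection"
    return 2
  fi

  eval "$inject"

  # Verify that the system still behaves as expected under failure.
  if eval "$check"; then
    echo "passed: system withstood the failure"
  else
    echo "failed: create a follow-up ticket"
    result=1
  fi

  # Always undo the failure, whatever the outcome.
  eval "$rollback"
  return "$result"
}
```

A call might pass a health-check command as the first argument and the injection and cleanup commands as the second and third. The non-zero exit status on failure makes the helper usable in scripts that collect results.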
  4. 11.

    GameDays at Jimdo
    1. Gather the team in front of a big screen
    2. Think up failure modes and estimate expected impact
    3. Go through chaos experiments together
    4. Write down measured impact
    5. Create follow-up tickets for all flaws
  6. 14.

    Unleash the monkey!

    $ docker run -it --rm \
        -e SIMIANARMY_CLIENT_AWS_ACCOUNTKEY=$AWS_ACCESS_KEY_ID \
        -e SIMIANARMY_CLIENT_AWS_SECRETKEY=$AWS_SECRET_ACCESS_KEY \
        -e SIMIANARMY_CLIENT_AWS_REGION=$AWS_REGION \
        -e SIMIANARMY_CALENDAR_ISMONKEYTIME=true \
        -e SIMIANARMY_CHAOS_ASG_ENABLED=true \
        -e SIMIANARMY_CHAOS_LEASHED=false \
        mlafeldt/simianarmy
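The command on this slide reads three AWS variables from the calling shell. If one of them is empty, the container would start with an empty credential or region. A preflight check along the following lines could catch that first. The helper name `require_env` is hypothetical and not part of the talk or of the `mlafeldt/simianarmy` image.

```shell
# Hypothetical preflight check, not part of the talk: fail early if any
# of the named environment variables is unset or empty.
require_env() {
  local name value missing=0
  for name in "$@"; do
    # Indirect lookup via eval keeps this portable to plain POSIX sh.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

It could be chained in front of the command from the slide, for example `require_env AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_REGION && docker run ...`. For a first run it may also be worth setting `SIMIANARMY_CHAOS_LEASHED=true`. Assuming the image maps that variable to Simian Army's `simianarmy.chaos.leashed` property, the monkey should then only log what it would do.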
  7. 16.

    Trigger new chaos events

    $ go get -u github.com/mlafeldt/chaosmonkey

    $ chaosmonkey -endpoint http://$DOCKER_HOST_IP:8080 \
        -group ExampleAutoScalingGroup \
        -strategy ShutdownInstance

    $ chaosmonkey -endpoint http://$DOCKER_HOST_IP:8080 \
        -group ExampleAutoScalingGroup \
        -strategy DetachVolumes \
        -count 5 -interval 10s -probability 0.2
  8. 18.

    Failure injection points
    • Infrastructure provider
    • Hosts
    • Internal and external dependencies
    • Client libraries
    • Applications
    • Containers
    ...
  9. 20.

    Pumba example

    $ docker run -it --rm --name ubuntu ubuntu:16.04

    $ pumba netem --tc-image gaiadocker/iproute2 \
        --duration 60s delay --time 3000 ubuntu
  10. 21.

    How does it work?

    $ docker run --rm \
        --cap-add NET_ADMIN \
        --net=container:ubuntu \
        gaiadocker/iproute2 qdisc add dev eth0 root netem delay 3000ms

    $ docker run --rm \
        --cap-add NET_ADMIN \
        --net=container:ubuntu \
        gaiadocker/iproute2 qdisc del dev eth0 root netem
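The two commands on this slide can be combined into one add, wait, remove sequence. The sketch below is an illustration, not Pumba's actual implementation. The function names `netem` and `delay_for` and the `DRY_RUN` switch are hypothetical. It also assumes, as the slide's commands suggest, that `tc` is the entrypoint of the `gaiadocker/iproute2` image.

```shell
# Hypothetical wrapper around the two commands from the slide.
# Set DRY_RUN=1 to print the docker commands instead of running them.
netem() {
  # $1 = target container; the remaining arguments are passed to the
  # image, whose entrypoint is assumed to be tc.
  local target="$1"
  shift
  ${DRY_RUN:+echo} docker run --rm \
    --cap-add NET_ADMIN \
    --net=container:"$target" \
    gaiadocker/iproute2 "$@"
}

# Delay all traffic on eth0 of container $1 by $2 milliseconds
# for $3 seconds, then remove the rule again.
delay_for() {
  local target="$1" ms="$2" seconds="$3"
  netem "$target" qdisc add dev eth0 root netem delay "${ms}ms" || return 1
  # Sleep in normal mode only; a dry run just prints both commands.
  [ -n "${DRY_RUN:-}" ] || sleep "$seconds"
  netem "$target" qdisc del dev eth0 root netem
}
```

With these definitions, `delay_for ubuntu 3000 60` would roughly correspond to the Pumba invocation shown earlier. Note that interrupting the script during the sleep would leave the delay rule in place, so a real tool needs more careful cleanup.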
  12. 23.

    Gremlin
    • Failure as a Service
    • Attack hosts, containers, applications
    • Impact CPU, RAM, I/O, network traffic, system time, etc.
    • Web UI + CLI
    • Safe and secure
  13. 24.

    Takeaways
    • Building resilient systems requires experience with failure
    • Don't wait for things to break in production
    • Proactively inject failures in a controlled way
    • Use existing chaos tools
    • Start small!
  14. 25.

    Production Ready mailing list
    https://tinyletter.com/production-ready

    34 articles, including:
    • The Discipline of Chaos Engineering
    • A Little Story about Amazon ECS, systemd, and Chaos Monkey
    • The Power of Less Code
    • Writing Your First Postmortem
    • Go, Mental Models, and Side Effects