Embracing Failure in a Container World

Slide 1

Slide 1 text

Embracing Failure in a Container World

Slide 2

Slide 2 text

Hello, I'm Mathias Lafeldt.
@mlafeldt

Slide 3

Slide 3 text

Slide 4

Slide 4 text

Outages in 2017
• 2017-01-31: GitLab database outage
• 2017-02-28: Amazon S3 service disruption in us-east-1
• 2017-05-16: Starbucks server outage ☕
• ...
• Add your incidents here

Slide 5

Slide 5 text

Sooner or later, all complex systems will fail

Slide 6

Slide 6 text

"There's always something with Docker in production."
Henning Jacobs, Zalando (ContainerDays 2017)

Slide 7

Slide 7 text

Accept it and create a quality product that is resilient to failures

Slide 8

Slide 8 text

Building resilient systems requires experience with failure

Slide 9

Slide 9 text

Chaos Engineering 101
• Proactively inject failures by simulating real-world events
• Verify that our systems behave as we expect
• Fix them if they don't
• Discover new properties of our systems
• Netflix: http://principlesofchaos.org/
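
The inject-then-verify loop on this slide can be sketched as a small shell function. This is only an illustration and not something shown in the talk: `run_experiment` and its three arguments (a health check, a failure injector, and a rollback command) are hypothetical names, and what counts as "healthy" is left entirely to the caller.

```shell
#!/usr/bin/env bash
# Hypothetical chaos experiment skeleton; none of these names come from the talk.
# Usage: run_experiment <check_cmd> <inject_cmd> <rollback_cmd>
run_experiment() {
  local check="$1" inject="$2" rollback="$3"

  # 1. Confirm the steady state before touching anything.
  if ! "$check"; then
    echo "aborted: system is not healthy before the experiment"
    return 2
  fi

  # 2. Inject the failure.
  "$inject"

  # 3. Verify that the system still behaves as expected.
  local result=0
  if "$check"; then
    echo "passed: steady state held during failure"
  else
    echo "failed: steady state violated, create a follow-up ticket"
    result=1
  fi

  # 4. Always undo the failure, whatever the outcome.
  "$rollback"
  return "$result"
}
```

In practice the check might be an HTTP health probe and the injector one of the tools shown later in the deck. The return codes (0 passed, 1 failed, 2 aborted) are a design choice of this sketch.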

Slide 10

Slide 10 text

Wonderland: Jimdo's PaaS
speakerdeck.com/mlafeldt/a-journey-through-wonderland

Slide 11

Slide 11 text

GameDays at Jimdo
1. Gather the team in front of a big screen
2. Think up failure modes and estimate expected impact
3. Go through chaos experiments together
4. Write down measured impact
5. Create follow-up tickets for all flaws

Slide 12

Slide 12 text

Slide 13

Slide 13 text

docker pull mlafeldt/simianarmy
github.com/mlafeldt/docker-simianarmy

Slide 14

Slide 14 text

Unleash the monkey!

$ docker run -it --rm \
    -e SIMIANARMY_CLIENT_AWS_ACCOUNTKEY=$AWS_ACCESS_KEY_ID \
    -e SIMIANARMY_CLIENT_AWS_SECRETKEY=$AWS_SECRET_ACCESS_KEY \
    -e SIMIANARMY_CLIENT_AWS_REGION=$AWS_REGION \
    -e SIMIANARMY_CALENDAR_ISMONKEYTIME=true \
    -e SIMIANARMY_CHAOS_ASG_ENABLED=true \
    -e SIMIANARMY_CHAOS_LEASHED=false \
    mlafeldt/simianarmy

Slide 15

Slide 15 text

On-demand termination
/simianarmy/api/v1/chaos
github.com/mlafeldt/chaosmonkey
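
As a rough sketch, a chaos event could be posted to this endpoint with plain curl. The JSON field names (`eventType`, `groupType`, `groupName`, `chaosType`) are an assumption based on the Simian Army REST documentation as I recall it, not something shown on the slide, and the helper names `chaos_payload` and `trigger_chaos` are hypothetical.

```shell
#!/usr/bin/env bash
# Build the JSON body for an on-demand chaos event.
# ASSUMPTION: field names follow the Simian Army REST API docs; verify
# them against the version you run.
chaos_payload() {
  local group="$1" strategy="${2:-ShutdownInstance}"
  printf '{"eventType":"CHAOS_TERMINATION","groupType":"ASG","groupName":"%s","chaosType":"%s"}\n' \
    "$group" "$strategy"
}

# Send the event to a running Simian Army container (hypothetical helper).
trigger_chaos() {
  local endpoint="$1" group="$2" strategy="${3:-ShutdownInstance}"
  curl -sS -X POST -d "$(chaos_payload "$group" "$strategy")" \
    "$endpoint/simianarmy/api/v1/chaos"
}
```

For example, `trigger_chaos "http://$DOCKER_HOST_IP:8080" ExampleAutoScalingGroup` would target the same Auto Scaling group used elsewhere in the deck.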

Slide 16

Slide 16 text

Trigger new chaos events

$ go get -u github.com/mlafeldt/chaosmonkey

$ chaosmonkey -endpoint http://$DOCKER_HOST_IP:8080 \
    -group ExampleAutoScalingGroup \
    -strategy ShutdownInstance

$ chaosmonkey -endpoint http://$DOCKER_HOST_IP:8080 \
    -group ExampleAutoScalingGroup \
    -strategy DetachVolumes \
    -count 5 -interval 10s -probability 0.2

Slide 17

Slide 17 text

Bonus: Slack notifications

Slide 18

Slide 18 text

Failure injection points
• Infrastructure provider
• Hosts
• Internal and external dependencies
• Client libraries
• Applications
• Containers ...

Slide 19

Slide 19 text

Pumba
github.com/gaia-adm/pumba

Slide 20

Slide 20 text

Pumba example

$ docker run -it --rm --name ubuntu ubuntu:16.04

$ pumba netem --tc-image gaiadocker/iproute2 \
    --duration 60s delay --time 3000 ubuntu

Slide 21

Slide 21 text

How does it work?

$ docker run --rm \
    --cap-add NET_ADMIN \
    --net=container:ubuntu \
    gaiadocker/iproute2 qdisc add dev eth0 root netem delay 3000ms

$ docker run --rm \
    --cap-add NET_ADMIN \
    --net=container:ubuntu \
    gaiadocker/iproute2 qdisc del dev eth0 root netem
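
The two commands on this slide can be wrapped into one helper that adds the delay, waits, and removes it again. This is a minimal sketch and not Pumba's actual implementation: `tc_in_container`, `netem_delay`, and the `DOCKER` override variable are hypothetical names introduced here, and the interface is assumed to be `eth0` as on the slide.

```shell
#!/usr/bin/env bash
# Set DOCKER="echo docker" for a dry run that only prints the commands.
DOCKER="${DOCKER:-docker}"

# Run tc from a helper container inside the target's network namespace.
tc_in_container() {
  local target="$1"; shift
  $DOCKER run --rm \
    --cap-add NET_ADMIN \
    --net="container:$target" \
    gaiadocker/iproute2 "$@"
}

# Delay outgoing traffic on eth0 of <target> by <delay_ms> for <seconds>.
netem_delay() {
  local target="$1" delay_ms="$2" seconds="$3"
  tc_in_container "$target" qdisc add dev eth0 root netem delay "${delay_ms}ms" || return 1
  sleep "$seconds"
  # Remove the rule again so the container returns to normal.
  tc_in_container "$target" qdisc del dev eth0 root netem
}
```

Calling `netem_delay ubuntu 3000 60` would roughly correspond to the Pumba example with `--duration 60s delay --time 3000`. With `DOCKER="echo docker"` the function only prints the two `docker run` commands, which is a cheap way to check them before running against a real container.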

Slide 22

Slide 22 text

Slide 23

Slide 23 text

Gremlin
• Failure as a Service
• Attack hosts, containers, applications
• Impact CPU, RAM, I/O, network traffic, system time, etc.
• Web UI + CLI
• Safe and secure

Slide 24

Slide 24 text

Takeaways
• Building resilient systems requires experience with failure
• Don't wait for things to break in production
• Proactively inject failures in a controlled way
• Use existing chaos tools
• Start small!

Slide 25

Slide 25 text

Production Ready mailing list
https://tinyletter.com/production-ready

34 articles, including:
• The Discipline of Chaos Engineering
• A Little Story about Amazon ECS, systemd, and Chaos Monkey
• The Power of Less Code
• Writing Your First Postmortem
• Go, Mental Models, and Side Effects

Slide 26

Slide 26 text

Thank you.
https://tinyletter.com/production-ready
@mlafeldt