Embracing Failure in a Container World

Hello, I'm Mathias Lafeldt. @mlafeldt

Outages in 2017
• 2017-01-31: GitLab database outage
• 2017-02-28: Amazon S3 service disruption in us-east-1
• 2017-05-16: Starbucks server outage ☕
• ...
• Add your incidents here

Sooner or later, all complex systems will fail.

"There's always something with Docker in production."
Henning Jacobs, Zalando (ContainerDays 2017)

Accept it and create a quality product that is resilient to failures.

Building resilient systems requires experience with failure.

Chaos Engineering 101
• Proactively inject failures by simulating real-world events
• Verify that our systems behave as we expect
• Fix them if they don't
• Discover new properties of our systems
• Netflix: http://principlesofchaos.org/
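
The inject-and-verify loop can be sketched as a small shell function. This is a minimal illustration, not taken from any tool in this talk: `run_experiment` and its three arguments (a steady-state check, an injection command, and a rollback command) are hypothetical names chosen for the example.

```shell
#!/usr/bin/env bash
# Minimal chaos experiment runner (hypothetical helper, for illustration only).
# Usage: run_experiment <steady_state_cmd> <inject_cmd> <rollback_cmd>
run_experiment() {
  local check="$1" inject="$2" rollback="$3"

  # 1. Make sure the system is healthy before touching anything.
  if ! eval "$check"; then
    echo "ABORT: system not in steady state before experiment"
    return 2
  fi

  # 2. Inject the failure.
  eval "$inject"

  # 3. Verify that the system still behaves as expected during the failure.
  local result=0
  if eval "$check"; then
    echo "PASS: steady state held during failure"
  else
    echo "FAIL: steady state violated, create a follow-up ticket"
    result=1
  fi

  # 4. Always undo the injected failure, whatever the outcome.
  eval "$rollback"
  return "$result"
}
```

A call might look like `run_experiment "curl -fs http://localhost:8080/health" "<inject command>" "<rollback command>"`, where the health URL is an assumed placeholder for whatever steady-state check fits the system under test.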

Wonderland: Jimdo's PaaS
speakerdeck.com/mlafeldt/a-journey-through-wonderland

GameDays at Jimdo
1. Gather the team in front of a big screen
2. Think up failure modes and estimate expected impact
3. Go through chaos experiments together
4. Write down measured impact
5. Create follow-up tickets for all flaws

docker pull mlafeldt/simianarmy
github.com/mlafeldt/docker-simianarmy

Unleash the monkey!
$ docker run -it --rm \
    -e SIMIANARMY_CLIENT_AWS_ACCOUNTKEY=$AWS_ACCESS_KEY_ID \
    -e SIMIANARMY_CLIENT_AWS_SECRETKEY=$AWS_SECRET_ACCESS_KEY \
    -e SIMIANARMY_CLIENT_AWS_REGION=$AWS_REGION \
    -e SIMIANARMY_CALENDAR_ISMONKEYTIME=true \
    -e SIMIANARMY_CHAOS_ASG_ENABLED=true \
    -e SIMIANARMY_CHAOS_LEASHED=false \
    mlafeldt/simianarmy

On-demand termination
/simianarmy/api/v1/chaos
github.com/mlafeldt/chaosmonkey
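
For a rough idea of what a request to that endpoint could look like, here is a sketch that builds a JSON payload for an on-demand chaos event. The field names (`eventType`, `groupType`, `groupName`, `chaosType`) and the `CHAOS_TERMINATION` value are my recollection of the Simian Army REST documentation and should be treated as assumptions to verify. `chaos_event_payload` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Build the JSON body for an on-demand chaos event.
# Field names are assumptions based on the Simian Army REST docs; verify them
# against the version you run. Inputs are inserted as-is, without JSON escaping.
# Usage: chaos_event_payload <auto_scaling_group> <strategy>
chaos_event_payload() {
  local group="$1" strategy="$2"
  printf '{"eventType":"CHAOS_TERMINATION","groupType":"ASG","groupName":"%s","chaosType":"%s"}\n' \
    "$group" "$strategy"
}

# Possible usage against a running Simian Army container (not executed here):
#   curl -X POST "http://$DOCKER_HOST_IP:8080/simianarmy/api/v1/chaos" \
#     -d "$(chaos_event_payload ExampleAutoScalingGroup ShutdownInstance)"
```

The chaosmonkey CLI wraps this kind of call, so in practice there is little reason to hand-craft the request.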

Trigger new chaos events
$ go get -u github.com/mlafeldt/chaosmonkey
$ chaosmonkey -endpoint http://$DOCKER_HOST_IP:8080 \
    -group ExampleAutoScalingGroup \
    -strategy ShutdownInstance
$ chaosmonkey -endpoint http://$DOCKER_HOST_IP:8080 \
    -group ExampleAutoScalingGroup \
    -strategy DetachVolumes \
    -count 5 -interval 10s -probability 0.2

Bonus: Slack notifications

Failure injection points
• Infrastructure provider
• Hosts
• Internal and external dependencies
• Client libraries
• Applications
• Containers ...

Pumba
github.com/gaia-adm/pumba

Pumba example
$ docker run -it --rm --name ubuntu ubuntu:16.04
$ pumba netem --tc-image gaiadocker/iproute2 \
    --duration 60s delay --time 3000 ubuntu

How does it work?
$ docker run --rm \
    --cap-add NET_ADMIN \
    --net=container:ubuntu \
    gaiadocker/iproute2 qdisc add dev eth0 root netem delay 3000ms
$ docker run --rm \
    --cap-add NET_ADMIN \
    --net=container:ubuntu \
    gaiadocker/iproute2 qdisc del dev eth0 root netem
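
The add, wait, delete sequence can be wrapped in a small script. This is a sketch of the idea only, not Pumba's actual implementation. `tc_in_container`, `with_delay`, and the `DOCKER` override variable are hypothetical names introduced for this example.

```shell
#!/usr/bin/env bash
# Run a tc command inside the target container's network namespace,
# using the same image and flags as the commands in this deck.
# DOCKER can be overridden (for example DOCKER=echo) for a dry run.
tc_in_container() {
  local container="$1"; shift
  "${DOCKER:-docker}" run --rm --cap-add NET_ADMIN \
    "--net=container:${container}" gaiadocker/iproute2 "$@"
}

# Delay all egress traffic on eth0 for a limited time, then clean up.
# Usage: with_delay <container> <delay_ms> <duration_s>
with_delay() {
  local container="$1" delay_ms="$2" duration_s="$3"

  # Add the netem delay rule; stop here if that fails.
  tc_in_container "$container" qdisc add dev eth0 root netem delay "${delay_ms}ms" || return 1

  # Keep the failure active for the requested duration.
  sleep "$duration_s"

  # Remove the rule so the container's network returns to normal.
  tc_in_container "$container" qdisc del dev eth0 root netem
}
```

Setting `DOCKER=echo` before calling `with_delay ubuntu 3000 60` prints the two commands instead of running them, which is a cheap way to check what would happen before trying it on a real container.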

Gremlin
• Failure as a Service
• Attack hosts, containers, applications
• Impact CPU, RAM, I/O, network traffic, system time, etc.
• Web UI + CLI
• Safe and secure

Takeaways
• Building resilient systems requires experience with failure
• Don't wait for things to break in production
• Proactively inject failures in a controlled way
• Use existing chaos tools
• Start small!

Production Ready mailing list
https://tinyletter.com/production-ready
34 articles, including:
• The Discipline of Chaos Engineering
• A Little Story about Amazon ECS, systemd, and Chaos Monkey
• The Power of Less Code
• Writing Your First Postmortem
• Go, Mental Models, and Side Effects

Thank you.
https://tinyletter.com/production-ready
@mlafeldt
