Embracing Failure in a Container World

Sooner or later, all complex systems will fail. It's not a matter of "if", it's a matter of "when". There will always be something that can -- and will -- go wrong, especially with today's distributed systems. Accept it and focus on the things you can control: creating a quality service that is resilient to failure.

Building resilient systems requires experience with failure. Waiting for things to break is not an option. Instead, we should inject failures proactively and in a controlled way to gain confidence that our production systems can withstand them. In this talk, Mathias shows how to apply this idea to the wonderful world of containers.

(Talk given at ContainerDays 2017: http://www.containerdays.io/)

Mathias Lafeldt

June 21, 2017

Transcript

  1. Embracing Failure
    in a Container World

  2. Hello
    I'm Mathias Lafeldt.
    @mlafeldt

  4. Outages in 2017
    • 2017-01-31: GitLab database outage
    • 2017-02-28: Amazon S3 service disruption in us-east-1
    • 2017-05-16: Starbucks server outage ☕
    • ...
    • Add your incidents here

  5. Sooner or later,
    all complex
    systems
    will fail

  6. There's always something
    with Docker in production.
    — Henning Jacobs, Zalando
    (ContainerDays 2017)

  7. Accept it and create a
    quality product that is
    resilient to failures

  8. Building
    resilient systems
    requires
    experience
    with failure

  9. Chaos Engineering 101
    • Proactively inject failures by simulating real-world events
    • Verify that our systems behave as we expect
    • Fix them if they don't
    • Discover new properties of our systems
    • Netflix: http://principlesofchaos.org/
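
The loop described above (check the steady state, inject a failure, verify, clean up) can be sketched as a small shell function. This is only an illustration and not code from the talk: `run_experiment` and its three arguments are hypothetical names, and each argument is any shell command of your choice.

```shell
#!/bin/sh
# Minimal chaos experiment skeleton (illustrative sketch, not from the talk).
# All three arguments are shell command strings supplied by the caller.
run_experiment() {
    steady_state_check=$1   # exits 0 while the system behaves as expected
    inject_failure=$2       # injects the failure
    rollback=$3             # removes the failure again

    # 1. Only experiment on a system that is healthy to begin with.
    if ! sh -c "$steady_state_check"; then
        echo "ABORT: system unhealthy before injection"
        return 2
    fi

    # 2. Inject the failure, then verify that the expectation still holds.
    sh -c "$inject_failure"
    if sh -c "$steady_state_check"; then
        result="PASS: steady state held during failure"
        status=0
    else
        result="FAIL: steady state violated, create a follow-up ticket"
        status=1
    fi

    # 3. Always undo the injection, whatever the outcome.
    sh -c "$rollback"
    echo "$result"
    return "$status"
}
```

The check command is deliberately the same before and after the injection, so a failing run points at the injected failure rather than at a system that was already broken.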

  10. Wonderland: Jimdo's PaaS
    speakerdeck.com/mlafeldt/a-journey-through-wonderland

  11. GameDays at Jimdo
    1. Gather the team in front of a big screen
    2. Think up failure modes and estimate expected impact
    3. Go through chaos experiments together
    4. Write down measured impact
    5. Create follow-up tickets for all flaws

  13. docker pull mlafeldt/simianarmy
    github.com/mlafeldt/docker-simianarmy

  14. Unleash the monkey!
    $ docker run -it --rm \
        -e SIMIANARMY_CLIENT_AWS_ACCOUNTKEY=$AWS_ACCESS_KEY_ID \
        -e SIMIANARMY_CLIENT_AWS_SECRETKEY=$AWS_SECRET_ACCESS_KEY \
        -e SIMIANARMY_CLIENT_AWS_REGION=$AWS_REGION \
        -e SIMIANARMY_CALENDAR_ISMONKEYTIME=true \
        -e SIMIANARMY_CHAOS_ASG_ENABLED=true \
        -e SIMIANARMY_CHAOS_LEASHED=false \
        mlafeldt/simianarmy
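
In the spirit of starting small, a first run could keep the monkey leashed. In Chaos Monkey's leashed mode it only logs what it would terminate. The wrapper below is a sketch that reuses the exact settings from the command above. `run_monkey` and the `DOCKER` override are hypothetical names introduced here for illustration.

```shell
#!/bin/sh
# Start the Simian Army container with Chaos Monkey leashed by default.
# Usage: run_monkey [true|false]   (value passed to SIMIANARMY_CHAOS_LEASHED)
# Set DOCKER=echo to print the docker arguments instead of running them.
run_monkey() {
    leashed=${1:-true}
    ${DOCKER:-docker} run -it --rm \
        -e SIMIANARMY_CLIENT_AWS_ACCOUNTKEY="$AWS_ACCESS_KEY_ID" \
        -e SIMIANARMY_CLIENT_AWS_SECRETKEY="$AWS_SECRET_ACCESS_KEY" \
        -e SIMIANARMY_CLIENT_AWS_REGION="$AWS_REGION" \
        -e SIMIANARMY_CALENDAR_ISMONKEYTIME=true \
        -e SIMIANARMY_CHAOS_ASG_ENABLED=true \
        -e SIMIANARMY_CHAOS_LEASHED="$leashed" \
        mlafeldt/simianarmy
}
```

Once the log output of `run_monkey` looks right, `run_monkey false` unleashes the monkey as in the original command.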

  15. On-demand termination
    /simianarmy/api/v1/chaos
    github.com/mlafeldt/chaosmonkey

  16. Trigger new chaos events
    $ go get -u github.com/mlafeldt/chaosmonkey
    $ chaosmonkey -endpoint http://$DOCKER_HOST_IP:8080 \
        -group ExampleAutoScalingGroup \
        -strategy ShutdownInstance
    $ chaosmonkey -endpoint http://$DOCKER_HOST_IP:8080 \
        -group ExampleAutoScalingGroup \
        -strategy DetachVolumes \
        -count 5 -interval 10s -probability 0.2
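
Under the hood, the CLI talks to the `/simianarmy/api/v1/chaos` endpoint mentioned earlier. The sketch below shows roughly what a single request might look like with plain `curl`. The JSON field names (`eventType`, `groupType`, `groupName`, `chaosType`) are an assumption based on the Simian Army REST documentation and are not taken from the slides, so verify them against your version. `chaos_payload` and `trigger_chaos` are hypothetical helper names.

```shell
#!/bin/sh
# Build the JSON body for one chaos event.
# ASSUMPTION: field names follow the Simian Army REST docs; check your version.
# Usage: chaos_payload GROUP STRATEGY
chaos_payload() {
    printf '{"eventType":"CHAOS_TERMINATION","groupType":"ASG","groupName":"%s","chaosType":"%s"}\n' "$1" "$2"
}

# POST the event to a running Simian Army instance.
# Usage: trigger_chaos ENDPOINT GROUP STRATEGY
trigger_chaos() {
    curl -sS -X POST \
        -H 'Content-Type: application/json' \
        -d "$(chaos_payload "$2" "$3")" \
        "$1/simianarmy/api/v1/chaos"
}
```

For example, `trigger_chaos "http://$DOCKER_HOST_IP:8080" ExampleAutoScalingGroup ShutdownInstance` would correspond to the first `chaosmonkey` call above.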

  17. Bonus: Slack notifications

  18. Failure injection points
    • Infrastructure provider
    • Hosts
    • Internal and external dependencies
    • Client libraries
    • Applications
    • Containers ...

  19. Pumba
    github.com/gaia-adm/pumba

  20. Pumba example
    $ docker run -it --rm --name ubuntu ubuntu:16.04
    $ pumba netem --tc-image gaiadocker/iproute2 \
        --duration 60s delay --time 3000 ubuntu

  21. How does it work?
    $ docker run --rm \
        --cap-add NET_ADMIN \
        --net=container:ubuntu \
        gaiadocker/iproute2 qdisc add dev eth0 root netem delay 3000ms
    $ docker run --rm \
        --cap-add NET_ADMIN \
        --net=container:ubuntu \
        gaiadocker/iproute2 qdisc del dev eth0 root netem
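
When running these two steps by hand, it is easy to forget the second one and leave the delay in place. One way to pair them is a small wrapper that adds the delay, runs a command, and then always removes the delay again. This is a sketch built only from the two commands above. `tc_in`, `with_delay`, and the `DOCKER` override are hypothetical names, and it assumes (as the commands above do) that the helper image passes its arguments to `tc`.

```shell
#!/bin/sh
# tc_in CONTAINER ARGS...: run the iproute2 helper image inside CONTAINER's
# network namespace, with the same flags as the manual commands.
# Set DOCKER=echo to print the docker arguments instead of running them.
tc_in() {
    container=$1; shift
    ${DOCKER:-docker} run --rm --cap-add NET_ADMIN \
        --net=container:"$container" gaiadocker/iproute2 "$@"
}

# with_delay CONTAINER DELAY_MS COMMAND...: add the netem delay, run COMMAND,
# then remove the delay again even if COMMAND fails.
with_delay() {
    target=$1; delay_ms=$2; shift 2
    tc_in "$target" qdisc add dev eth0 root netem delay "${delay_ms}ms" || return 1
    status=0
    "$@" || status=$?
    tc_in "$target" qdisc del dev eth0 root netem
    return "$status"
}
```

For example, `with_delay ubuntu 3000 sleep 60` should behave much like the Pumba invocation with `--duration 60s`.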

  23. Gremlin
    • Failure as a Service
    • Attack hosts, containers, applications
    • Impact CPU, RAM, I/O, network traffic, system time, etc.
    • Web UI + CLI
    • Safe and secure

  24. Takeaways
    • Building resilient systems requires experience with failure
    • Don't wait for things to break in production
    • Proactively inject failures in a controlled way
    • Use existing chaos tools
    • ⚠ Start small! ⚠

  25. Production Ready mailing list
    https://tinyletter.com/production-ready
    34 articles, including:
    The Discipline of Chaos Engineering
    A Little Story about Amazon ECS, systemd, and Chaos Monkey
    The Power of Less Code
    Writing Your First Postmortem
    Go, Mental Models, and Side Effects

  26. Thank you.
    https://tinyletter.com/production-ready
    @mlafeldt