Practical Chaos Engineering: Breaking things on purpose to make them more resilient against failure

With the wide adoption of micro-services and large-scale distributed systems, architectures have grown increasingly complex and hard to understand. Worse, the software systems running on them have become extremely difficult to debug and test, increasing the risk of outages. These new challenges call for new tools, and because failures have become more and more chaotic in nature, we must turn to chaos engineering to reveal failures before they become outages. In this talk, we will first introduce chaos engineering and show the audience how to start practicing it on the AWS cloud. We will walk through the tools and methods they can use to inject failures into their architectures in order to make them more resilient.

Following the previous introduction to Chaos Engineering, in this hands-on session I will show the audience how to inject failures into software systems in practice, using a few different tools and methods such as Gremlin, Chaos Toolkit, AWS Systems Manager, AWS Lambda, and Toxiproxy.

Adrian Hornsby

November 04, 2019

Transcript

  1. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
    Practical Chaos Engineering:
    Breaking things on purpose to make them more resilient
    against failure
    Adrian Hornsby
    Principal Evangelist
    Amazon Web Services

  2. Failures are a given and
    everything will eventually fail
    over time.
    Werner Vogels
    CTO – Amazon.com

  3. Building confidence through testing
    Unit testing of components:
    • Tested in isolation to ensure function meets expectations.
    Functional testing of integrations:
    • Each execution path tested to assure expected results.
    Is it enough???

  5. GameDay at Amazon
    Creating Resiliency Through Destruction
    https://www.youtube.com/watch?v=zoz0ZjfrQ9s

  6. Chaos Engineering
    https://github.com/Netflix/SimianArmy

  7. Chaos engineering is NOT about breaking
    things randomly without a purpose. Chaos
    engineering is about breaking things in a
    controlled environment, through well-planned
    experiments, in order to build confidence in
    your application’s ability to withstand
    turbulent conditions.

  8. Break your systems on purpose.
    Find out their weaknesses and fix
    them before they break when
    least expected.

  9. Chaos Engineering

  10. Operations
    Infrastructure
    Application
    Software

  11. https://aws.amazon.com/wellarchitected

  12. https://medium.com/@adhorn

  13. Chaos Engineering

  14. Phases of Chaos Engineering
    STEADY STATE → HYPOTHESIS → RUN EXPERIMENT → VERIFY → IMPROVE

  15. What is steady state?
    • “normal” behavior of your system
    https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero

  16. What is steady state?
    • “normal” behavior of your system
    • Business + Ops metrics
    https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a
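
A steady-state check can be sketched as comparing a metric sampled during the experiment against a baseline. This is a minimal illustration, not the method from the talk: the function name `is_steady`, the 5% tolerance, and the sample values are all assumptions.

```python
from statistics import mean
from typing import Sequence


def is_steady(baseline: Sequence[float], current: Sequence[float],
              tolerance: float = 0.05) -> bool:
    """Return True when the current mean stays within `tolerance`
    (relative) of the baseline mean."""
    base = mean(baseline)
    if base == 0:
        return mean(current) == 0
    return abs(mean(current) - base) / base <= tolerance


# Hypothetical business metric samples, e.g. stream starts per second.
baseline_sps = [100.0, 102.0, 98.0, 101.0]
during_experiment = [97.0, 99.0, 96.0, 98.0]
print(is_steady(baseline_sps, during_experiment))
```

A business metric is usually a better steady-state signal than CPU or memory alone, because it reflects what customers actually experience.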

  17. Phases of Chaos Engineering
    STEADY STATE → HYPOTHESIS → RUN EXPERIMENT → VERIFY → IMPROVE

  18. What if…?
    “What if this load balancer breaks?”
    “What if Redis becomes slow?”
    “What if a host on Cassandra goes away?”
    “What if latency increases by 300ms?”
    “What if the database stops?”
    Make it everyone’s problem!

  19. Phases of Chaos Engineering
    STEADY STATE → HYPOTHESIS → RUN EXPERIMENT → VERIFY → IMPROVE

  20. Failure injection
    • Start small & build confidence
    • Application level
    • Host level
    • Resource attacks (CPU, memory, …)
    • Network attacks (dependencies, latency, …)
    • AZ attack
    • Region attack
    • People attack

  21. Benefits of Failure Injection
    - Practicing failure contingency.
    - Understanding the effects of real-world failures.
    - Understanding the efficacy of the fault tolerance mechanisms.
    - Removing design faults in the fault tolerance mechanisms.
    - Understanding the effectiveness of your observability.
    - Understanding the blast radius of failures and helping to reduce it.
    - Understanding the weak links in the design – especially single points of failure.
    - Understanding failure propagation between system components and learning to avoid
    cascading effects.
    - …

  22. Blast radius
    • How many customers?
    • What functionality?
    • How many locations?
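
One way to keep the blast radius small is to target only a fraction of a fleet. The sketch below is an illustration under assumptions: `pick_targets`, the host names, and the 10% fraction are hypothetical and not from the talk.

```python
import math
import random


def pick_targets(hosts, fraction, seed=None):
    """Pick ceil(fraction * len(hosts)) hosts at random, at least one."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    count = max(1, math.ceil(len(hosts) * fraction))
    # A seeded generator makes the selection reproducible for reviews.
    return random.Random(seed).sample(list(hosts), count)


fleet = [f"host-{i}" for i in range(20)]  # hypothetical host names
targets = pick_targets(fleet, fraction=0.1, seed=42)
print(targets)
```

Starting with a small fraction and growing it as confidence builds matches the "start small" advice on the failure injection slide.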

  23. Rules of thumb
    Have an emergency STOP or a good exit plan!
    Be careful with state that can’t be rolled back
    (corrupt or incorrect data)
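
An emergency stop can be sketched as an experiment loop that polls a health check and always rolls back. This is a minimal sketch: `run_experiment` and its parameters are hypothetical names, and the inject, rollback, and health-check callables stand in for real tooling.

```python
import time


def run_experiment(inject, rollback, is_healthy, duration_s=5.0, poll_s=0.5):
    """Inject a fault, watch a health check, and always roll back.

    Returns "completed" if the system stayed healthy for the whole
    duration, or "aborted" if the health check tripped the emergency stop.
    """
    inject()
    try:
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            if not is_healthy():
                return "aborted"
            time.sleep(poll_s)
        return "completed"
    finally:
        rollback()  # runs on completion, abort, or exception


# Stand-in callables: the second health check fails and trips the stop.
events = []
checks = iter([True, False])
result = run_experiment(
    inject=lambda: events.append("inject"),
    rollback=lambda: events.append("rollback"),
    is_healthy=lambda: next(checks),
    duration_s=1.0,
    poll_s=0.01,
)
print(result, events)
```

Note that a rollback like this only undoes the injected fault. It cannot repair data that was corrupted while the fault was active.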

  24. Phases of Chaos Engineering
    STEADY STATE → HYPOTHESIS → RUN EXPERIMENT → VERIFY → IMPROVE

  25. Quantifying the result of the experiment
    • Time to detect?
    • Time for notification? And escalation?
    • Time to public notification?
    • Time for graceful degradation to kick in?
    • Time for self-healing to happen?
    • Time to recovery – partial and full?
    • Time to all-clear and stable?
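
These timings can be derived from a recorded timeline of the experiment. The sketch below assumes a simple dictionary of milestone timestamps. The milestone names and times are made up for illustration.

```python
from datetime import datetime


def experiment_timings(events):
    """Compute seconds elapsed from fault injection to each later milestone."""
    start = events["injected"]
    return {
        name: (ts - start).total_seconds()
        for name, ts in events.items()
        if name != "injected"
    }


# Hypothetical timeline of one experiment.
timeline = {
    "injected": datetime(2019, 11, 4, 10, 0, 0),
    "detected": datetime(2019, 11, 4, 10, 0, 45),
    "notified": datetime(2019, 11, 4, 10, 2, 0),
    "recovered": datetime(2019, 11, 4, 10, 9, 30),
}
print(experiment_timings(timeline))
```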

  26. PostMortems – COE (Correction of Errors)
    The 5 WHYs
    Outage → Because of … → Because of … → Because of … → Because of …

  27. More questions to ask
    • Can you clarify if there were any preceding events?
    • Why would they believe acting in this way was the best course of action to
    deliver the desired outcome?
    • Is there another failure mode that could present here?
    • What decisions or events prior to this made this work before?
    • Why stop there – are there places to dig deeper that could shine a light more
    on this?
    • Did others step in to help, to advise, or to intercede?

  28. DON’T blame people!!!

  29. Rules to remember!
    There is no isolated ‘cause’ of an accident.

  30. Phases of Chaos Engineering
    STEADY STATE → HYPOTHESIS → RUN EXPERIMENT → VERIFY → IMPROVE

  31. How to do Failure Injection

  32. Start simple and local!!
    $ docker stop 94a214bbeebd

  33. Burn CPU with stress(-ng)
    $ stress-ng --random 50 -t 60 --metrics-brief --times
    https://kernel.ubuntu.com/~cking/stress-ng/

  34. Adding delay to the network
    $ tc qdisc add dev eth0 root netem delay 200ms 40ms 25% \
        loss 15.3% 25% duplicate 1% corrupt 0.1% \
        reorder 25% 50%
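
When scripting this, it helps to generate the matching teardown command together with the injection. The sketch below only builds argument lists. `netem_commands` is a hypothetical helper, and running the commands (for example with `subprocess.run`) needs root privileges on a Linux host.

```python
def netem_commands(dev, delay_ms, jitter_ms=0, loss_pct=0.0):
    """Return (add, delete) argument lists for a netem qdisc on `dev`."""
    add = ["tc", "qdisc", "add", "dev", dev, "root", "netem",
           "delay", f"{delay_ms}ms"]
    if jitter_ms:
        add.append(f"{jitter_ms}ms")
    if loss_pct:
        add += ["loss", f"{loss_pct}%"]
    # Deleting the root netem qdisc removes the delay and loss again.
    delete = ["tc", "qdisc", "del", "dev", dev, "root", "netem"]
    return add, delete


add_cmd, del_cmd = netem_commands("eth0", delay_ms=200, jitter_ms=40, loss_pct=15.3)
print(" ".join(add_cmd))
print(" ".join(del_cmd))
```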

  35. Block DNS resolution
    $ iptables -I OUTPUT -p udp -d <nameserver-ip> --dport 53 -j DROP
    Get the DNS:
    $ cat /etc/resolv.conf
    search eu-west-1.compute.internal
    nameserver 172.31.0.2
    $ dig showme.mynameserver.xyz
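
The lookup and the rule can be combined in a small script. This sketch parses `resolv.conf` content and builds one iptables rule per nameserver. The helper names are hypothetical, the rules cover UDP only as on the slide, and nothing is executed here.

```python
def nameservers(resolv_conf_text):
    """Extract nameserver addresses from the contents of a resolv.conf file."""
    servers = []
    for line in resolv_conf_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def block_dns_rules(servers):
    """Build one iptables argument list per nameserver (UDP port 53 only)."""
    return [
        ["iptables", "-I", "OUTPUT", "-p", "udp", "-d", ip,
         "--dport", "53", "-j", "DROP"]
        for ip in servers
    ]


sample = "search eu-west-1.compute.internal\nnameserver 172.31.0.2\n"
for rule in block_dns_rules(nameservers(sample)):
    print(" ".join(rule))
```

Replacing `-I` with `-D` in the same rule removes it again, which gives a simple rollback.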

  36. How to DDoS yourself
    $ wrk -t12 -c400 -d30s http://127.0.0.1/api/health

  37. Other fun things to do
    • Fill up disk
    • Network packet loss (using traffic-shaping)
    • Network packet corruption (using traffic-shaping)
    • Kill random processes
    • Detach (force) all EBS volumes
    • Mess with /etc/hosts

  38. https://github.com/Netflix/SimianArmy
    Set of scheduled agents that:
    • shut down services randomly
    • slow down performance
    • check conformity
    • break an entire region
    • integrate with Spinnaker (CI/CD)

  39. Toxiproxy
    • HTTP API
    • Built with automated testing in mind
    • Not for production environments
    • Fast
    • Toxics for:
      • Timeouts, latency, connections and bandwidth limitation, etc.
    • CLI
    • Stable and well tested (used for 3 years at Shopify)
    • Open Source: https://github.com/Shopify/toxiproxy
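
As an illustration of the HTTP API, the sketch below builds the two requests that create a proxy in front of Redis and add a 300 ms latency toxic to it. It assumes a Toxiproxy server on its default API address, and the proxy name and ports are hypothetical. The requests are only constructed, not sent.

```python
import json
from urllib.request import Request

TOXIPROXY_API = "http://127.0.0.1:8474"  # Toxiproxy's default API address


def _post(path, body):
    """Build (but do not send) a JSON POST request to the Toxiproxy API."""
    return Request(
        TOXIPROXY_API + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_proxy(name, listen, upstream):
    """Request that creates a proxy listening on `listen` for `upstream`."""
    return _post("/proxies", {"name": name, "listen": listen,
                              "upstream": upstream, "enabled": True})


def add_latency(proxy_name, latency_ms, jitter_ms=0):
    """Request that adds a latency toxic to responses through the proxy."""
    return _post(f"/proxies/{proxy_name}/toxics", {
        "type": "latency",
        "stream": "downstream",
        "toxicity": 1.0,
        "attributes": {"latency": latency_ms, "jitter": jitter_ms},
    })


# Hypothetical proxy in front of a local Redis.
proxy_req = create_proxy("redis", "127.0.0.1:26379", "127.0.0.1:6379")
toxic_req = add_latency("redis", latency_ms=300)
print(toxic_req.get_method(), toxic_req.full_url)
# To send them: urllib.request.urlopen(proxy_req), then urlopen(toxic_req).
```

The application under test then connects to the proxy's listen address instead of Redis directly.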

  40. https://atscaleconference.com/videos/resiliency-testing-with-toxiproxy/

  41. https://github.com/asobti/kube-monkey

  42. Pumba
    https://github.com/alexei-led/pumba/

  43. https://blog.thundra.io/chaos-test-your-lambda-functions-with-thundra

  44. APIs :)

  45. Fault Injection Queries for Amazon Aurora
    SQL commands issued to simulate:
    • A crash of the master instance or an Aurora Replica
    • A failure of an Aurora Replica
    • A disk failure
    • Disk congestion
    https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Managing.FaultInjectionQueries.html

  46. Fault Injection Queries for Amazon Aurora
    SQL commands issued to simulate:
    • A crash of the master instance or an Aurora Replica
    • A failure of an Aurora Replica
    • A disk failure
    • Disk congestion
    ALTER SYSTEM SIMULATE percentage_of_failure PERCENT DISK FAILURE [ IN
    DISK index | NODE index ] FOR INTERVAL quantity { YEAR | QUARTER |
    MONTH | WEEK | DAY | HOUR | MINUTE | SECOND };
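
A small helper can fill in this template and reject invalid values before the statement reaches the database. This is a sketch: `disk_failure_sql` is a hypothetical name, and it assumes the `IN` keyword applies to both the `DISK` and `NODE` alternatives.

```python
UNITS = {"YEAR", "QUARTER", "MONTH", "WEEK", "DAY", "HOUR", "MINUTE", "SECOND"}


def disk_failure_sql(percent, quantity, unit, disk_index=None, node_index=None):
    """Render the disk-failure fault injection statement from the slide."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    unit = unit.upper()
    if unit not in UNITS:
        raise ValueError(f"unsupported interval unit: {unit}")
    if disk_index is not None and node_index is not None:
        raise ValueError("choose either a disk index or a node index")
    target = ""
    if disk_index is not None:
        target = f" IN DISK {int(disk_index)}"
    elif node_index is not None:
        # Assumption: IN also prefixes the NODE alternative.
        target = f" IN NODE {int(node_index)}"
    return (f"ALTER SYSTEM SIMULATE {int(percent)} PERCENT DISK FAILURE"
            f"{target} FOR INTERVAL {int(quantity)} {unit};")


print(disk_failure_sql(25, 5, "minute"))
```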

  47. $ aws lambda put-function-concurrency --function-name <function-name> \
        --reserved-concurrent-executions 0
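
The same throttling can be scripted with a guaranteed cleanup. The sketch below assumes a boto3 Lambda client is passed in. A recording stand-in is used here so the example runs without AWS credentials, and the function name is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def throttled(lambda_client, function_name):
    """Set reserved concurrency to 0 inside the block, then remove the limit."""
    lambda_client.put_function_concurrency(
        FunctionName=function_name, ReservedConcurrentExecutions=0)
    try:
        yield
    finally:
        # Removes the reserved concurrency setting entirely. It does not
        # restore a value that was configured before the experiment.
        lambda_client.delete_function_concurrency(FunctionName=function_name)


class RecordingClient:
    """Stand-in for boto3.client("lambda") that records the calls made."""

    def __init__(self):
        self.calls = []

    def put_function_concurrency(self, **kwargs):
        self.calls.append(("put", kwargs))

    def delete_function_concurrency(self, **kwargs):
        self.calls.append(("delete", kwargs))


client = RecordingClient()
with throttled(client, "my-function"):  # hypothetical function name
    pass  # observe how callers behave while every invocation is throttled
print(client.calls)
```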

  48. Injecting Chaos to AWS Lambda
    $ pip install chaos-lambda

  49. https://github.com/adhorn/aws-lambda-chaos-injection
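
The general idea of injecting failure into a Lambda handler can be sketched with a plain decorator. This is not the chaos-lambda API: `inject_latency` and its parameters are hypothetical and only show the technique of wrapping a handler.

```python
import functools
import random
import time


def inject_latency(delay_s, rate=1.0, enabled=lambda: True, rng=random.random):
    """Decorator: sleep `delay_s` before the handler on a `rate` share of calls.

    `enabled` is checked on every call so injection can be switched off
    without redeploying, for example from an environment flag.
    """
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            if enabled() and rng() < rate:
                time.sleep(delay_s)
            return handler(event, context)
        return wrapper
    return decorator


@inject_latency(delay_s=0.2)
def handler(event, context):
    return {"statusCode": 200, "body": "ok"}


print(handler({}, None))
```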

  50. The Chaos Toolkit
    • Simplifying Adoption of Chaos Engineering
    • An Open API to Chaos Engineering
    • Open source extensions for:
      • Infrastructure/Platform Fault Injections
      • Application Fault Injections
      • Observability
    • Integrates easily into CI/CD pipelines
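
A Chaos Toolkit experiment is a declarative JSON (or YAML) document. The sketch below generates a minimal one that reuses the container ID and health URL from the earlier slides. The title, probe name, and action names are made up for illustration.

```python
import json

CONTAINER_ID = "94a214bbeebd"  # container ID taken from the docker stop slide

experiment = {
    "version": "1.0.0",
    "title": "Service stays healthy when a container stops",
    "description": "Stop one container and check the health endpoint.",
    "steady-state-hypothesis": {
        "title": "Health endpoint responds with 200",
        "probes": [{
            "type": "probe",
            "name": "health-endpoint-ok",
            "tolerance": 200,  # expected HTTP status code
            "provider": {
                "type": "http",
                "url": "http://127.0.0.1/api/health",
                "timeout": 3,
            },
        }],
    },
    "method": [{
        "type": "action",
        "name": "stop-container",
        "provider": {"type": "process", "path": "docker",
                     "arguments": f"stop {CONTAINER_ID}"},
    }],
    "rollbacks": [{
        "type": "action",
        "name": "start-container",
        "provider": {"type": "process", "path": "docker",
                     "arguments": f"start {CONTAINER_ID}"},
    }],
}

print(json.dumps(experiment, indent=2))
```

Saved as `experiment.json`, a file like this is run with `chaos run experiment.json`, which checks the hypothesis before and after the method.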

  55. Injecting Chaos to Amazon EC2 using AWS Systems Manager
    https://medium.com/@adhorn/injecting-chaos-to-amazon-ec2-using-amazon-system-manager-ca95ee7878f5

  56. https://github.com/adhorn/chaos-ssm-documents

  57. SSM Run (send) Command
    $ aws ssm send-command \
        --document-name "cpu-stress" \
        --document-version "1" \
        --targets '[{"Key":"InstanceIds","Values":["i-094c8367024633d96","i-04d0976f9fb658c23"]}]' \
        --parameters '{"duration":["60"],"cpu":["0"]}' \
        --timeout-seconds 600 \
        --max-concurrency "50" \
        --max-errors "0" \
        --output-s3-bucket-name "adhorn-chaos-ssm-output" \
        --region eu-west-1
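
The same call can be prepared from Python. The sketch below only builds the keyword arguments for boto3's SSM `send_command`. `cpu_stress_request` is a hypothetical helper, and it assumes the `cpu-stress` document from the slide is already registered in the account.

```python
def cpu_stress_request(instance_ids, duration_s=60):
    """Build keyword arguments for ssm.send_command mirroring the CLI call."""
    return {
        "DocumentName": "cpu-stress",
        "DocumentVersion": "1",
        "Targets": [{"Key": "InstanceIds", "Values": list(instance_ids)}],
        "Parameters": {"duration": [str(duration_s)], "cpu": ["0"]},
        "TimeoutSeconds": 600,
        "MaxConcurrency": "50",
        "MaxErrors": "0",  # stop sending after the first failed instance
    }


request = cpu_stress_request(["i-094c8367024633d96", "i-04d0976f9fb658c23"])
print(request)
# With credentials configured, the request would be sent with:
#   boto3.client("ssm", region_name="eu-west-1").send_command(**request)
```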

  58. Big challenges to chaos engineering
    • Chaos Engineering won’t make your system more robust;
    people will.
    • Chaos Engineering won’t replace ALL the rest (testing, quality, …)
    • Chaos Engineering is NOT the only way to learn from failure
    • Rollbacks are HARD because of state.
    • Your systems will continue to fail, sorry.

  59. Big challenges to chaos engineering
    Mostly cultural:
    • No time or flexibility to simulate disasters.
    • Teams already spending all of their time fixing things.
    • Can be very political.
    • Might force deep conversations.
    • Teams deeply invested in a specific technical roadmap (micro-services) that
    chaos engineering tests show is not as resilient to failures as originally
    predicted.

  60. Changing culture takes time!
    Be patient…

  61. Thank you!
    Adrian Hornsby
    https://medium.com/@adhorn
    adhorn
