Building Resilient Applications using Chaos Engineering on AWS

Architectures have grown increasingly complex and hard to understand. Worse, the software systems running them have become extremely difficult to debug and test, increasing the risk of outages. These new challenges call for new tools, and since failures have become more and more chaotic in nature, we must turn to chaos engineering to reveal failures before they become outages. In this talk, I will introduce chaos engineering, a discipline that promotes breaking things on purpose in order to learn how to build more robust systems. I will demo the tools and methods used to inject failures in order to make systems more resilient.

Adrian Hornsby

January 30, 2020

Transcript

  1. © 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
    Building Resilient Applications using
    Chaos Engineering on AWS
    Adrian Hornsby
    Principal Technical Evangelist
    Amazon Web Services

  2. Can you afford to lose $100,000?

  3. Because that is the average cost of one hour
    of downtime, as reported by an ITIC study this year.
    Source: Information Technology Intelligence Consulting Research

  4. GameDay at Amazon
    Jesse Robbins, “Master of Disaster”
    • A volunteer firefighter
    • Created GameDay in 2006 to purposefully create regular major failures.
    • Founded Chef and the Velocity Web Performance & Operations Conference.

  5. Rise of the monkeys
    “Simian Army to keep our cloud safe, secure, and highly available.”
    - 2011 Netflix blog
    Set of scheduled agents that:
    • shut down services randomly
    • slow down performance
    • check conformity
    • break an entire region
    • integrate with Spinnaker (CI/CD)
    https://github.com/Netflix/SimianArmy

  6. Chaos Engineering formalized by Netflix (mid-2015)
    principlesofchaos.org

  7. Chaos engineering is NOT about breaking
    things randomly without a purpose. Chaos
    engineering is about breaking things in a
    controlled environment, through well-planned
    experiments, in order to build confidence in
    your application’s ability to withstand
    turbulent conditions.

  8. Chaos Engineering
    A scientific method
    STEADY STATE
    HYPOTHESIS
    RUN EXPERIMENT
    VERIFY
    IMPROVE

  9. Break your systems on purpose.
    Find out their weaknesses and fix
    them before they break when
    least expected.

  10. Chaos Engineering

  11. Operations
    Infrastructure
    Application
    Software

  12. https://aws.amazon.com/wellarchitected

  13. More Information
    https://aws.amazon.com/builders-library

  14. Chaos Engineering

  15. Phases of Chaos Engineering
    STEADY STATE
    HYPOTHESIS
    RUN EXPERIMENT
    VERIFY
    IMPROVE

  16. What is steady state?
    • “normal” behavior of your system
    https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero

  17. What is steady state?
    • Business + Ops Metric
    https://medium.com/netflix-techblog/
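
As a rough illustration of treating steady state as a measurable band around a business metric, the sketch below derives bounds from baseline samples and checks a new observation against them. The metric (orders per second), the sample values, and the three-standard-deviation tolerance are all assumptions made for illustration; none of them come from the talk.

```python
from statistics import mean, stdev


def steady_state_band(samples, k=3.0):
    """Return (low, high) bounds as mean +/- k sample standard deviations."""
    mu = mean(samples)
    sigma = stdev(samples)
    return mu - k * sigma, mu + k * sigma


def within_steady_state(value, band):
    """True when the observed value falls inside the steady-state band."""
    low, high = band
    return low <= value <= high


# Hypothetical baseline: orders per second sampled during normal operation.
baseline = [100, 102, 98, 101, 99]
band = steady_state_band(baseline)

print(within_steady_state(101, band))  # an observation close to the baseline
print(within_steady_state(60, band))   # a sharp drop, as an experiment might cause
```

In practice the band would more likely come from a monitoring system than from a hard-coded list, and a fixed tolerance may be too naive for metrics with daily or weekly seasonality.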

  18. Phases of Chaos Engineering
    STEADY STATE
    HYPOTHESIS
    RUN EXPERIMENT
    VERIFY
    IMPROVE

  19. What if…?
    “What if this load balancer breaks?”
    “What if Redis becomes slow?”
    “What if a host on Cassandra goes away?”
    “What if latency increases by 300ms?”
    “What if the database stops?”
    Make it everyone’s problem!

  20. Phases of Chaos Engineering
    STEADY STATE
    HYPOTHESIS
    RUN EXPERIMENT
    VERIFY
    IMPROVE

  21. Failure injection
    • Start small & build confidence
    • Application level (exceptions, errors, …)
    • Host level (services, processes, …)
    • Resource attacks (CPU, memory, IO, …)
    • Network attacks (dependencies, latency, packet loss…)
    • AZ attack
    • Region attack
    • People attack
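
The smallest item on the list above, application-level injection, can be sketched as a decorator that makes a function fail on a configurable share of calls. This is only an illustrative sketch: `InjectedFailure`, `inject_exception`, and `get_price` are hypothetical names, not part of any tool mentioned in the talk.

```python
import functools
import random


class InjectedFailure(Exception):
    """Raised by the injector instead of calling the wrapped function."""


def inject_exception(rate, rng=random.random):
    """Decorator that raises InjectedFailure on roughly `rate` of calls."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # rng() returns a float in [0, 1), so rate=0 never fails
            # and rate=1 always fails.
            if rng() < rate:
                raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@inject_exception(rate=0.0)  # start small: injection is off by default
def get_price(item):
    """Hypothetical application function used only for this example."""
    return {"apple": 3}.get(item)


print(get_price("apple"))
```

Raising the rate gradually, and only for a subset of traffic, is one way to follow the "start small and build confidence" advice.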

  22. Rules of thumb
    • Start very small
    • As close as possible to production
    • Minimize the blast radius
    • Have an emergency STOP!
    • Be careful with state that can’t be rolled back
    (corrupt or incorrect data)

  23. Blast radius
    • How many customers?
    • What functionality?
    • How many locations?
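
One possible way to bound "how many customers" is to place each customer in a stable bucket and only expose a fixed percentage of buckets to the experiment. The sketch below assumes string customer IDs and a hypothetical experiment name; the hashing scheme is an illustration, not something prescribed by the talk.

```python
import hashlib


def in_blast_radius(customer_id, percent, experiment="experiment-1"):
    """Return True if this customer falls inside the experiment's blast radius.

    The decision is deterministic: the same customer and experiment name
    always map to the same bucket, so a customer does not flip in and out.
    """
    digest = hashlib.sha256(f"{experiment}:{customer_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100  # percent=1 selects buckets 0..99


# Hypothetical customer IDs, used only to show the selection.
customers = [f"customer-{i}" for i in range(10_000)]
affected = [c for c in customers if in_blast_radius(c, percent=5)]
print(len(affected))  # roughly 5% of the customers
```

Salting the hash with the experiment name means different experiments tend to affect different customers rather than always hitting the same ones.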

  24. Phases of Chaos Engineering
    STEADY STATE
    HYPOTHESIS
    RUN EXPERIMENT
    VERIFY
    IMPROVE

  25. Quantifying the result of the experiment
    • Time to detect?
    • Time for notification? And escalation?
    • Time to public notification?
    • Time for graceful degradation to kick in?
    • Time for self-healing to happen?
    • Time to recovery – partial and full?
    • Time to all-clear and stable?
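
Several of the questions above reduce to differences between timestamps recorded during the experiment. A minimal sketch, assuming the team records three events under the hypothetical keys `injected`, `detected`, and `recovered`:

```python
from datetime import datetime, timedelta


def experiment_timings(events):
    """Compute durations in seconds, measured from the moment of injection.

    `events` maps the hypothetical keys 'injected', 'detected' and
    'recovered' to datetime objects.
    """
    start = events["injected"]
    return {
        "time_to_detect_s": (events["detected"] - start).total_seconds(),
        "time_to_recovery_s": (events["recovered"] - start).total_seconds(),
    }


# Made-up timestamps for illustration only.
t0 = datetime(2020, 1, 30, 12, 0, 0)
events = {
    "injected": t0,
    "detected": t0 + timedelta(seconds=45),
    "recovered": t0 + timedelta(minutes=5),
}
print(experiment_timings(events))
```

The same pattern extends to the other questions (notification, escalation, all-clear) by recording one more timestamp per event.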

  26. Postmortems – COE (Correction of Errors)
    • What happened?
    • What was the impact on customers and your business?
    • What were the contributing factors?
    • What data do you have to support this?
    (especially metrics and graphs)
    • What lessons did you learn?
    • What corrective actions are you taking?
    • Action items
    • Related items (trouble tickets, etc.)

  27. Tools Processes
    Culture
    Technology

  28. Two rules to remember ALWAYS!

  29. DON’T blame that one person …

  30. There is no isolated ‘cause’ of an accident.

  31. Phases of Chaos Engineering
    STEADY STATE
    HYPOTHESIS
    RUN EXPERIMENT
    VERIFY
    IMPROVE

  32. Monkeys

  33. Start simple and local!!
    $ docker stop 94a214bbeebd

  34. DDoS yourself
    $ wrk -t12 -c400 -d30s http://127.0.0.1/api/health

  35. Burn CPU with Stress(–ng)
    $ stress-ng --cpu 0 --cpu-method matrixprod -t 60s
    https://kernel.ubuntu.com/~cking/stress-ng/

  36. Adding latency to the network
    $ tc qdisc add dev eth0 root netem delay 300ms
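
A delay added with `tc` stays in place until the qdisc is deleted, so it can help to pair the injection with a guaranteed rollback. The sketch below wraps the command above in a context manager that always removes the qdisc on exit, even if the experiment code raises. The function names are hypothetical, and running it for real needs root privileges on a Linux host.

```python
import subprocess
from contextlib import contextmanager


def run_command(cmd):
    """Run a command and raise if it exits with a non-zero status."""
    subprocess.run(cmd, check=True)


@contextmanager
def network_delay(dev="eth0", delay_ms=300, run=run_command):
    """Add a netem delay on `dev` and always remove it when the block ends."""
    run(["tc", "qdisc", "add", "dev", dev, "root", "netem",
         "delay", f"{delay_ms}ms"])
    try:
        yield
    finally:
        # Rollback: delete the root qdisc so traffic returns to normal.
        run(["tc", "qdisc", "del", "dev", dev, "root"])


# Example (needs root on a Linux host, so it is left commented out):
# with network_delay(dev="eth0", delay_ms=300):
#     run_steady_state_checks()  # hypothetical function
```

The `run` parameter exists so the command runner can be swapped for a fake in tests or a dry run.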

  37. Block DNS resolution
    $ iptables -A INPUT -p tcp -m tcp --dport 53 -j DROP
    $ iptables -A INPUT -p udp -m udp --dport 53 -j DROP

  38. Other fun things to do
    • Fill up the disk
    • Network packet loss (using traffic shaping)
    • Network packet corruption (using traffic shaping)
    • Kill random processes
    • Force-detach all EBS volumes
    • Mess with /etc/hosts

  39. Simian Army
    https://github.com/Netflix/SimianArmy
    Set of scheduled agents that:
    • shut down services randomly
    • slow down performance
    • check conformity
    • break an entire region
    • integrate with Spinnaker (CI/CD)

  40. The Chaos Toolkit
    • Simplifying Adoption of Chaos Engineering
    • An Open API to Chaos Engineering
    • Open source extensions for
    • Infrastructure/Platform Fault Injections
    • Application Fault Injections
    • Observability
    • Integrates easily into CI/CD pipelines

  41. ToxiProxy
    • HTTP API
    • Built with automated testing in mind
    • Not for production environments
    • Fast
    • Toxics for timeouts, latency, connection and bandwidth limitations, etc.
    • CLI
    • Stable and well tested (used for 3 years at Shopify)
    • Open Source: https://github.com/Shopify/toxiproxy

  42. https://atscaleconference.com/videos/resiliency-testing-with-toxiproxy/

  43. Pumba
    https://github.com/alexei-led/pumba/

  44. Fault Injection Queries for Amazon Aurora
    SQL commands issued to simulate:
    • A crash of the master instance or an Aurora Replica
    • A failure of an Aurora Replica
    • A disk failure
    • Disk congestion
    https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Managing.FaultInjectionQueries.html

  45. Fault Injection Queries for Amazon Aurora
    SQL commands issued to simulate:
    • A crash of the master instance or an Aurora Replica
    • A failure of an Aurora Replica
    • A disk failure
    • Disk congestion
    ALTER SYSTEM SIMULATE percentage_of_failure PERCENT DISK FAILURE
    [ IN DISK index | NODE index ]
    FOR INTERVAL quantity { YEAR | QUARTER | MONTH | WEEK | DAY | HOUR | MINUTE | SECOND };
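
To make the syntax above concrete, the sketch below fills in the template for the simplest case and validates the inputs first. It only builds the SQL string; `disk_failure_query` is a hypothetical helper, the optional `IN DISK` clause is supported while the `NODE` variant is left out, and actually running the statement would require a connection to an Aurora MySQL cluster.

```python
VALID_UNITS = {
    "YEAR", "QUARTER", "MONTH", "WEEK", "DAY", "HOUR", "MINUTE", "SECOND",
}


def disk_failure_query(percent, quantity, unit, disk_index=None):
    """Build an ALTER SYSTEM SIMULATE ... DISK FAILURE statement."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    unit = unit.upper()
    if unit not in VALID_UNITS:
        raise ValueError(f"unsupported interval unit: {unit}")
    # The target clause is optional in the syntax shown on the slide.
    target = "" if disk_index is None else f" IN DISK {int(disk_index)}"
    return (
        f"ALTER SYSTEM SIMULATE {int(percent)} PERCENT DISK FAILURE"
        f"{target} FOR INTERVAL {int(quantity)} {unit};"
    )


print(disk_failure_query(25, 5, "minute"))
```

Keeping the interval short is a simple way to respect the earlier advice about starting small, since the simulated failure ends on its own when the interval expires.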

  46. ❯ aws lambda put-function-concurrency --function-name <function-name> --reserved-concurrent-executions 0

  47. Injecting Chaos to AWS Lambda
    $ pip install chaos-lambda
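
The general idea behind injecting chaos into a Lambda function can be sketched as a decorator around the handler that adds latency when a configuration flag is on. The code below is not the chaos-lambda API; `with_injected_delay` and the `is_enabled` and `delay_ms` keys are hypothetical names chosen for this illustration, and the configuration is passed in directly rather than fetched from a remote store.

```python
import functools
import time


def with_injected_delay(config, sleep=time.sleep):
    """Decorator that delays a Lambda-style handler when injection is enabled."""

    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            delay_ms = config.get("delay_ms", 0)
            if config.get("is_enabled") and delay_ms > 0:
                sleep(delay_ms / 1000.0)  # time.sleep expects seconds
            return handler(event, context)

        return wrapper

    return decorator


# Injection is disabled here, so the handler runs without any added latency.
chaos_config = {"is_enabled": False, "delay_ms": 400}


@with_injected_delay(chaos_config)
def handler(event, context):
    """Hypothetical Lambda handler used only for this example."""
    return {"statusCode": 200, "body": "Hello"}


print(handler({}, None))
```

Reading the flag at call time, rather than at import time, means the injection can be switched off quickly, which serves as a simple emergency stop.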

  48. https://medium.com/@adhorn/injecting-chaos-to-aws-lambda-functions-using-lambda-layers-2963f996e0ba
    Injecting Chaos to AWS Lambda functions using
    Lambda Layers

  49. https://github.com/adhorn/aws-lambda-chaos-injection

  50. https://github.com/gunnargrosch/failure-lambda

  51. Injecting Chaos to Amazon EC2 using AWS Systems Manager
    https://medium.com/@adhorn/injecting-chaos-to-amazon-ec2-using-amazon-system-manager-ca95ee7878f5

  52. SSM Run (send) Command
    $ aws ssm send-command \
    --document-name "cpu-stress" \
    --document-version "1" \
    --targets '[{"Key":"InstanceIds","Values":["i-094c8367024633d96","i-04d0976f9fb658c23"]}]' \
    --parameters '{"duration":["60"],"cpu":["0"]}' \
    --timeout-seconds 600 \
    --max-concurrency "50" \
    --max-errors "0" \
    --output-s3-bucket-name "adhorn-chaos-ssm-output" \
    --region eu-west-1
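
The JSON arguments in the command above are easy to get wrong when quoted by hand. One way to avoid that is to build the argument list in code and let `json.dumps` produce the JSON. The sketch below only constructs the argument list for a subset of the options; `ssm_send_command_args` is a hypothetical helper, and the list would still need to be passed to something like `subprocess.run` with valid AWS credentials.

```python
import json


def ssm_send_command_args(document, instance_ids, parameters, region):
    """Build the argv list for `aws ssm send-command` with JSON-encoded options."""
    targets = [{"Key": "InstanceIds", "Values": list(instance_ids)}]
    return [
        "aws", "ssm", "send-command",
        "--document-name", document,
        "--targets", json.dumps(targets),
        "--parameters", json.dumps(parameters),
        "--region", region,
    ]


args = ssm_send_command_args(
    document="cpu-stress",
    instance_ids=["i-094c8367024633d96", "i-04d0976f9fb658c23"],
    parameters={"duration": ["60"], "cpu": ["0"]},
    region="eu-west-1",
)
print(args)
```

Passing the arguments as a list, rather than as one shell string, also sidesteps shell quoting entirely.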

  53. https://github.com/adhorn/chaos-ssm-documents

  54. https://github.com/adhorn/aws-chaos-scripts

  55. Chaos Engineering

  56. Big challenges to chaos engineering
    • Chaos Engineering won’t make your system more robust;
    people will.
    • Chaos Engineering won’t replace ALL the rest (testing, quality, …)
    • Chaos Engineering is NOT the only way to learn from failure
    • Rollbacks are HARD because of state.
    • Your systems will continue to fail, sorry.
    • Starting is perceived as hard!

  57. Big challenges to chaos engineering
    Mostly cultural:
    • No time or flexibility to simulate disasters.
    • Teams already spend all of their time fixing things.
    • Can be very political.
    • Might force deep conversations.
    • Teams may be deeply invested in a specific technical roadmap (e.g., microservices) that
    chaos engineering tests show is not as resilient to failures as originally
    predicted.

  58. Tools Processes
    Culture
    Technology

  59. Thank you!
    Adrian Hornsby
    https://medium.com/@adhorn
    adhorn
