Applying chaos engineering principles for building fault-tolerant applications

Failures are inevitable. Regardless of the engineering effort put into building fault-tolerant applications and handling edge cases, one day a case beyond our reach will turn a benign failure into a catastrophic one. Therefore, we must test and continuously improve our application’s resilience to failures to minimise their blast radius and their impact on user experience. Chaos engineering has emerged as one of the best methods to do that. However, while interest is growing, few have managed to build sustainable chaos engineering practices. In this two-part seminar, I will first introduce chaos engineering and its principles, and explain how to get started with it. I will then walk through and demo some of the tools and methods that you can use today to inject failures into software systems to make them more resilient.

Adrian Hornsby

April 16, 2020

Transcript

  1. Chaos Engineering on AWS
    Building Resilient Systems
    Adrian Hornsby
    Principal Technical Evangelist
    Amazon Web Services
    © 2020, Amazon Web Services, Inc. or its Affiliates.

  2. Challenges with distributed systems

  3. Amazon Twitter Netflix

  4. 1. POST REQUEST: CLIENT puts request MESSAGE onto NETWORK.
    2. DELIVER REQUEST: NETWORK delivers MESSAGE to SERVER.
    3. VALIDATE REQUEST: SERVER validates MESSAGE.
    4. UPDATE SERVER STATE: SERVER updates its state, if necessary, based on MESSAGE.
    5. POST REPLY: SERVER puts reply REPLY onto NETWORK.
    6. DELIVER REPLY: NETWORK delivers REPLY to CLIENT.
    7. VALIDATE REPLY: CLIENT validates REPLY.
    8. UPDATE CLIENT STATE: CLIENT updates its state, if necessary, based on REPLY.
    https://aws.amazon.com/builders-library/challenges-with-distributed-systems
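The eight steps of a request/reply round trip can be sketched as a small simulation. This is a minimal illustration, not code from the talk: the names `Step`, `round_trip` and `server_updated` are hypothetical, and the uniform per-step failure rate is an assumption made only to keep the example short.

```python
import random
from enum import Enum


class Step(Enum):
    """The eight steps of one request/reply round trip, in order."""
    POST_REQUEST = 1
    DELIVER_REQUEST = 2
    VALIDATE_REQUEST = 3
    UPDATE_SERVER_STATE = 4
    POST_REPLY = 5
    DELIVER_REPLY = 6
    VALIDATE_REPLY = 7
    UPDATE_CLIENT_STATE = 8


def round_trip(failure_rate, rng):
    """Walk the steps in order; return the first step that fails, or None."""
    for step in Step:
        if rng.random() < failure_rate:
            return step
    return None


def server_updated(failed_step):
    """True if the server had already applied the request when the failure hit.

    Assumption: a failure during UPDATE_SERVER_STATE itself leaves the
    server state unchanged.
    """
    return failed_step is None or failed_step.value > Step.UPDATE_SERVER_STATE.value


if __name__ == "__main__":
    rng = random.Random(2020)
    for _ in range(5):
        failed = round_trip(failure_rate=0.1, rng=rng)
        print(failed, "server updated:", server_updated(failed))
```

A failure at step 5, 6 or 7 is the awkward case: the client sees an error, yet the server has already changed state, so the client cannot tell from the failure alone whether its request was applied.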

  5. “Failures are a given and everything will eventually fail over time.”
    Werner Vogels
    CTO, Amazon.com

  6. Distributed systems are hard because:
    • Errors can happen at any time, often in combination with other errors.
    • Results of network operations can be unknown (succeeded, failed, or received but not processed).
    • Problems occur at all logical levels.
    • Problems get worse at higher levels of the system, due to recursion.
    • Bugs often show up long after they are deployed to a system.
    • Bugs can spread across an entire system.
    • Many problems derive from the laws of physics and can’t be changed.

  7. Is traditional testing enough?
    Testing: verifying a KNOWN condition:
    e.g. assert(A == B)?
    Unit testing of components:
    • Tested in isolation to ensure function meets expectations.
    Functional testing of integrations:
    • Each execution path tested to assure expected results.

  9. The Scientific Method

  10. The Scientific Method
    • Make Observation
    • Think of Interesting Questions
    • Formulate Hypotheses
    • Develop Testable Predictions
    • Gather Data to Test Predictions
    • Develop General Theories
    • Refine, Alter, Expand or Reject Hypotheses

  11. Chaos Engineering
    A scientific method
    Steady State → Hypothesis → Run Experiment → Verify → Improve

  12. Chaos Engineering formalized
    principlesofchaos.org

  13. CHAOS ENGINEERING?
    THAT’S MY THING!!

  14. Chaos engineering is NOT about breaking things randomly without a purpose.

  15. Chaos engineering is about breaking things in a controlled environment, through well-planned experiments, in order to build confidence in your application’s ability to withstand turbulent conditions.

  16. “CHAOS DOESN’T CAUSE PROBLEMS. IT REVEALS THEM.”
    - Nora Jones

  17. ✓ Building confidence against failure

  19. GameDay at Amazon
    Jesse Robbins, “Master of Disaster”
    • A volunteer firefighter
    • Created GameDay in 2006 to purposefully create regular major failures.
    • Founded Chef and the Velocity Web Performance & Operations Conference.

  20. GameDay at Amazon
    Jesse Robbins, “Master of Disaster”
    • Test, train and prepare Amazon systems, software, and people to respond to a disaster.
    • Increase Amazon retail website resiliency by purposely injecting failures into critical systems.

  21. Jesse Robbins, mid-2000s
    https://www.youtube.com/watch?v=zoz0ZjfrQ9s

  22. Find weaknesses and fix them before they break when least expected.

  23. ✓ Building confidence against failures
    ✓ Reducing Skill Atrophy

  24. Training is not a one-time occurrence. It should be an ongoing process of expanding knowledge, exercising skills, and passing on these abilities for the benefit of the organization.

  25. ✓ Building confidence against failures
    ✓ Reducing Skill Atrophy
    ✓ Improving Recovery Time

  26. Can you afford to lose $100,000?
    Because that is the average cost of one hour of downtime reported by an ITIC study this year.
    Source: Information Technology Intelligence Consulting Research

  27. System Availability
    \[ \text{Availability} = \frac{\text{Normal Operation Time}}{\text{Total Time}} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}} \]
    MTBF: Mean Time Between Failures
    MTTR: Mean Time To Repair
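A short worked example of the availability formula, to show why shortening repair time matters. The MTBF and MTTR figures are made-up illustrative numbers, not values from the talk, and `availability` and `downtime_hours_per_year` are hypothetical helper names.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail):
    """Expected hours of downtime per year for a given availability."""
    return (1 - avail) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Illustrative numbers only: same failure frequency, faster repair.
    slow_repair = availability(mtbf_hours=1000, mttr_hours=2)
    fast_repair = availability(mtbf_hours=1000, mttr_hours=0.5)
    print(f"MTTR 2h:   {slow_repair:.5f}, {downtime_hours_per_year(slow_repair):.1f} h/year down")
    print(f"MTTR 0.5h: {fast_repair:.5f}, {downtime_hours_per_year(fast_repair):.1f} h/year down")
```

With MTBF held constant, every reduction in MTTR raises availability, which is the point of practising recovery.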

  28. ✓ Building confidence against failures
    ✓ Reducing Skill Atrophy
    ✓ Improving Recovery Time
    ✓ And a lot more …

  29. Prerequisites to chaos engineering

  30. ✓ People
    ✓ Operations
    ✓ Application
    ✓ Network & Data
    ✓ Infrastructure

  31. Operations
    Infrastructure
    Application
    Software

  33. https://medium.com/@adhorn

  34. https://aws.amazon.com/wellarchitected

  35. More Information
    https://aws.amazon.com/builders-library

  36. Don’t break things in prod before you have done your homework!

  37. Phases of Chaos Engineering

  38. Phases of Chaos Engineering
    Steady State → Hypothesis → Run Experiment → Verify → Improve

  39. Steady State: What is it?
    • “Normal” behavior of your system
    • Not the internal attributes of the system (CPU, memory, etc.)
    • Operational metrics tied to customer experience yield the best results.
    A deviation from the steady state means an unmitigated failure has triggered an unexpected problem, and it should cause the chaos experiment to be aborted.
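One way to picture "abort when the steady state is violated" is a guard that watches a customer-facing metric while the fault is active. This is a minimal sketch under stated assumptions: `SteadyState`, `run_with_guard`, the `orders_per_second` metric and the 5% tolerance are all hypothetical, and real metric samples would come from a monitoring system rather than a list.

```python
from dataclasses import dataclass


@dataclass
class SteadyState:
    """A business or ops metric with the band considered 'normal'."""
    metric: str
    baseline: float
    tolerance: float  # allowed relative deviation, e.g. 0.05 for 5%

    def holds(self, observed):
        return abs(observed - self.baseline) <= self.tolerance * self.baseline


def run_with_guard(steady_state, samples, inject, rollback):
    """Inject a fault, watch the metric, and abort on the first deviation.

    The rollback always runs, whether the experiment completes or aborts.
    """
    inject()
    try:
        for observed in samples:
            if not steady_state.holds(observed):
                return "aborted"
        return "completed"
    finally:
        rollback()


if __name__ == "__main__":
    state = SteadyState(metric="orders_per_second", baseline=100.0, tolerance=0.05)
    outcome = run_with_guard(
        state,
        samples=[101, 99, 80, 100],  # the third sample is outside the 5% band
        inject=lambda: print("fault injected"),
        rollback=lambda: print("fault removed"),
    )
    print(outcome)
```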

  40. Steady State: What is steady state?
    “Normal” behavior of your system

  41. Steady State: What is steady state?
    • Business + Ops Metric
    https://medium.com/netflix-techblog/

  42. Phases of Chaos Engineering
    Steady State → Hypothesis → Run Experiment → Verify → Improve

  43. Hypothesis: What if…?
    “What if this load balancer breaks?”
    “What if Redis becomes slow?”
    “What if a host on Cassandra goes away?”
    “What if latency increases by 300ms?”
    “What if the database stops?”
    Make it everyone’s problem!

  44. Hypothesis

  45. Hypothesis: IF YOU HAVEN’T VERIFIED IT, IT’S PROBABLY BROKEN.

  46. Hypothesis: Where to start?
    • Pick hypothesis
    • Scope the experiment
    • Identify metrics
    • Notify the organization
    • Think Blast radius!!!
    • Simulating AZ failure
    • Injecting latency between services
    • Randomly throwing exceptions.
    • Maxing out CPU to verify scaling policies.
    • Database failovers & backups

  47. Hypothesis: Disclaimer!
    Don’t make a hypothesis that you know will break you!

  48. Hypothesis: Rules of thumb
    • Start very small
    • As close as possible to production
    • Minimize the blast radius.
    • Have an emergency STOP!
    • Be careful with state that can’t be rolled back (corrupt or incorrect data)
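"Start very small" and "minimize the blast radius" can be enforced in code when choosing targets. The sketch below is an illustration only: `pick_targets`, the instance IDs, and the fraction and cap values are hypothetical, not taken from the talk.

```python
import math
import random


def pick_targets(instances, fraction, max_targets, rng):
    """Choose a small random subset of instances to experiment on.

    The subset is limited both by a fraction of the fleet and by a hard cap,
    and always contains at least one instance when the fleet is non-empty.
    """
    if not instances:
        return []
    wanted = max(1, math.floor(len(instances) * fraction))
    count = min(wanted, max_targets, len(instances))
    return rng.sample(instances, count)


if __name__ == "__main__":
    fleet = [f"i-{n:03d}" for n in range(50)]  # made-up instance IDs
    # 10% of 50 would be 5 instances, but the hard cap keeps it at 3.
    print(pick_targets(fleet, fraction=0.1, max_targets=3, rng=random.Random(7)))
```

Passing the random generator in explicitly makes a run reproducible, which helps when the same experiment has to be repeated after a fix.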

  49. Phases of Chaos Engineering
    Steady State → Hypothesis → Run Experiment → Verify → Improve

  50. Run Experiment: Failure injection
    Start small and build confidence
    • Application level (exceptions, errors, etc.)
    • Host level (services, processes, etc.)
    • Resource attacks (CPU, memory, IO, etc.)
    • Network attacks (dependencies, latency, packet loss, etc.)
    • AZ attack
    • Region attack
    • People attack
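Application-level injection, the first item in the list above, can be as small as a decorator that raises an exception for a fraction of calls. This is a generic sketch, not the API of any library mentioned in the talk: `inject_exception`, `InjectedFailure` and the `enabled` kill switch are hypothetical names.

```python
import functools
import random


class InjectedFailure(Exception):
    """Raised instead of calling the wrapped function."""


def inject_exception(rate, rng=random, enabled=lambda: True):
    """Fail roughly `rate` of calls while `enabled()` returns True.

    `enabled` acts as the emergency stop: flip it to False and the
    wrapped function behaves normally again.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if enabled() and rng.random() < rate:
                raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_exception(rate=0.2, rng=random.Random(1))
def get_profile(user_id):
    return {"id": user_id}


if __name__ == "__main__":
    for user_id in range(5):
        try:
            print(get_profile(user_id))
        except InjectedFailure as exc:
            print("caller must handle:", exc)
```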

  51. Run Experiment: Canary deployment
    Users → Routing mechanism
    • 90% traffic → Old application version
    • 10% traffic → New application version
    https://medium.com/@adhorn/immutable-infrastructure-21f6613e7a23
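The routing mechanism in a canary deployment can be approximated by a weighted random choice per request. This is a simplified sketch: `choose_version` is a hypothetical name, the 10% weight mirrors the traffic split on the slide, and a real router would usually work at the load balancer or DNS level rather than in application code.

```python
import random
from collections import Counter


def choose_version(canary_weight, rng):
    """Send a request to the new version with probability `canary_weight`."""
    return "new" if rng.random() < canary_weight else "old"


if __name__ == "__main__":
    rng = random.Random(42)
    counts = Counter(choose_version(0.10, rng) for _ in range(10_000))
    print(counts)  # roughly one request in ten reaches the new version
```

Running the experiment only against the canary keeps the blast radius to the small share of traffic it receives.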

  52. Phases of Chaos Engineering
    Steady State → Hypothesis → Run Experiment → Verify → Improve

  53. Verify: Quantifying the result of the experiment
    • Time to detect?
    • Time for notification? And escalation?
    • Time to public notification?
    • Time for graceful degradation to kick in?
    • Time for self-healing to happen?
    • Time to recovery – partial and full?
    • Time to all-clear and stable?
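Each "time to ..." question above becomes a number once the experiment timeline is recorded. The sketch below assumes a hypothetical timeline dictionary with made-up event names and timestamps; `seconds_since_injection` is an illustrative helper, not part of any tool from the talk.

```python
from datetime import datetime, timedelta


def seconds_since_injection(timeline):
    """Return, for every recorded event, the seconds elapsed since the fault."""
    start = timeline["fault_injected"]
    return {
        event: (moment - start).total_seconds()
        for event, moment in timeline.items()
        if event != "fault_injected"
    }


# Made-up timestamps for illustration.
t0 = datetime(2020, 4, 16, 10, 0, 0)
timeline = {
    "fault_injected": t0,
    "detected": t0 + timedelta(seconds=45),
    "notified": t0 + timedelta(minutes=2),
    "degraded_gracefully": t0 + timedelta(minutes=3),
    "fully_recovered": t0 + timedelta(minutes=11),
}
report = seconds_since_injection(timeline)

if __name__ == "__main__":
    for event, seconds in report.items():
        print(f"{event}: {seconds:.0f}s")
```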

  54. Verify: Postmortems – COE (Correction of Errors)
    • What happened?
    • What was the impact on customers and your business?
    • What were the contributing factors?
    • What data do you have to support this?
      • especially metrics and graphs
    • What lessons did you learn?
    • What corrective actions are you taking?
      • Action items
      • Related items (trouble tickets etc.)

  55. Verify: Dive deep on the causes
    Question: Why did the associate damage his thumb?
    Answer: Because his thumb got caught in the conveyor.
    Question: Why did his thumb get caught in the conveyor?
    Answer: Because he was chasing his bag, which was on a running conveyor belt.
    Question: Why did he chase his bag?
    Answer: Because he placed his bag on the conveyor, but it then turned on by surprise.
    Question: Why was his bag on the conveyor?
    Answer: Because he used the conveyor as a table.
    Possible Conclusion: So, one likely cause of the associate’s damaged thumb is that he needed a table, there wasn’t one around, so he used a conveyor as a table.

  56. Verify: Tools, Processes, Culture, Technology

  57. Verify: Never Let a Good Crisis Go To Waste

  58. Verify: Two rules to remember
    ALWAYS!

  59. Verify: DON’T blame that one person …

  60. Verify: There is no isolated ‘cause’ of an accident.

  61. Phases of Chaos Engineering
    Steady State → Hypothesis → Run Experiment → Verify → Improve

  62. Improve: Fix it!

  63. Improve: Audit
    Weekly Operational Metrics Review
    • Continuous inspection mechanism
    • Maintains focus on operations
    • Foundation of a healthy operations program
    Typical agenda, usually divided into fifteen-minute slots:
    • Share successes and failings
    • Action item follow-up
    • Review COEs
    • Review key service metrics
    • Identify new best practices

  64. Bananas for Monkeys

  65. INTRODUCE CHAOS ENGINEERING EARLY IN THE JOURNEY.
    DON’T WAIT!

  66. Eleanor
    https://github.com/adhorn/eleanor

  67. Start simple and local!!
    $ docker stop database
    or anything else ;-)

  68. Demo

  69. DDoS yourself
    $ wrk -t12 -c400 -d30s http://127.0.0.1/api/health

  70. Demo

  71. Burn CPU with stress(-ng)
    $ stress-ng --cpu 0 --cpu-method matrixprod -t 60s
    https://kernel.ubuntu.com/~cking/stress-ng/

  72. Demo

  73. Adding latency to the network
    $ tc qdisc add dev eth0 root netem delay 300ms
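Injected latency stays in place until the netem qdisc is deleted, so it helps to pair the `tc ... add` command with its `del` counterpart. The sketch below wraps both in a context manager so the cleanup runs even if the experiment raises. `netem_commands` and `added_latency` are hypothetical helpers, the `eth0` device name comes from the slide, and running the commands for real requires root privileges on a Linux host.

```python
import contextlib
import subprocess


def netem_commands(device, delay_ms):
    """Build the tc commands that add and later remove a netem delay."""
    add = ["tc", "qdisc", "add", "dev", device, "root", "netem", "delay", f"{delay_ms}ms"]
    delete = ["tc", "qdisc", "del", "dev", device, "root", "netem"]
    return add, delete


@contextlib.contextmanager
def added_latency(device, delay_ms, run=subprocess.check_call):
    """Add latency on entry and always remove it on exit.

    `run` is injectable so the commands can be inspected without executing them.
    """
    add, delete = netem_commands(device, delay_ms)
    run(add)
    try:
        yield
    finally:
        run(delete)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    with added_latency("eth0", 300, run=lambda cmd: print(" ".join(cmd))):
        print("... observe the steady state while latency is active ...")
```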

  74. Blocking DNS resolution
    $ iptables -A INPUT -p tcp -m tcp --dport 53 -j DROP

  75. Other fun things to do
    • Fill up disk
    • Network packet loss (using traffic-shaping)
    • Network packet corruption (using traffic-shaping)
    • Kill random processes
    • Detach (force) all EBS volumes
    • Mess with config files
    • …

  76. Simian Army
    “Simian Army to keep our cloud safe, secure, and highly available.”
    - 2011 Netflix blog
    Set of scheduled agents that:
    • shut down services randomly
    • slow down performance
    • check conformity
    • break an entire region
    • integrate with Spinnaker (CI/CD)
    https://github.com/Netflix/SimianArmy

  77. https://chaosiq.io

  78. The Chaos Toolkit
    https://chaostoolkit.org
    • Simplifying Adoption of Chaos Engineering
    • An Open API to Chaos Engineering
    • Open source extensions for
      • Infrastructure/Platform Fault Injections
      • Application Fault Injections
      • Observability
    • Integrates easily into CI/CD pipelines
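A Chaos Toolkit experiment is a declarative file that pairs a steady-state hypothesis with a method and rollbacks. The sketch below builds such a file in Python, following the experiment layout as I understand it from the Chaos Toolkit documentation (`steady-state-hypothesis`, `method`, `rollbacks`, with `http` and `process` providers); treat the exact fields as something to verify against the current documentation. `build_experiment` is a hypothetical helper, and the health URL and container name are reused from the earlier local examples in this deck.

```python
import json


def build_experiment(health_url, container):
    """Describe: 'the API stays healthy when the database container stops'."""
    return {
        "version": "1.0.0",
        "title": "API stays healthy when the database stops",
        "description": "Stop the local database container and check the health endpoint.",
        "steady-state-hypothesis": {
            "title": "Health endpoint answers with HTTP 200",
            "probes": [
                {
                    "type": "probe",
                    "name": "api-health-responds",
                    "tolerance": 200,
                    "provider": {"type": "http", "url": health_url, "timeout": 3},
                }
            ],
        },
        "method": [
            {
                "type": "action",
                "name": "stop-database",
                "provider": {"type": "process", "path": "docker", "arguments": f"stop {container}"},
            }
        ],
        "rollbacks": [
            {
                "type": "action",
                "name": "start-database",
                "provider": {"type": "process", "path": "docker", "arguments": f"start {container}"},
            }
        ],
    }


if __name__ == "__main__":
    experiment = build_experiment("http://127.0.0.1/api/health", "database")
    with open("experiment.json", "w") as handle:
        json.dump(experiment, handle, indent=2)
    print("wrote experiment.json")
```

Assuming the Chaos Toolkit CLI is installed, a file like this would typically be executed with `chaos run experiment.json`.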

  80. Demo

  83. Demo

  84. Verica.io

  85. https://github.com/adhorn/aws-chaos-scripts
    SDKs :-)

  86. Demo

  87. Injecting Chaos to Amazon EC2 using AWS Systems Manager

  88. https://github.com/adhorn/chaos-ssm-documents

  89. Demo

  90. Demo (SSM + ChaosToolkit)

  91. Injecting Chaos to AWS Lambda
    $ pip install chaos-lambda

  92. https://github.com/adhorn/aws-lambda-chaos-injection

  93. Demo

  94. https://github.com/gunnargrosch/failure-lambda

  95. Fault Injection Queries for Amazon Aurora
    SQL commands issued to simulate:
    • A crash of the master instance or an Aurora Replica
    • A failure of an Aurora Replica
    • A disk failure
    • Disk congestion
    https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Managing.FaultInjectionQueries.html

  96. Fault Injection Queries for Amazon Aurora
    SQL commands issued to simulate:
    • A crash of the master instance or an Aurora Replica
    • A failure of an Aurora Replica
    • A disk failure
    • Disk congestion
    ALTER SYSTEM SIMULATE percentage_of_failure PERCENT DISK FAILURE
        [ IN DISK index | NODE index ]
        FOR INTERVAL quantity { YEAR | QUARTER | MONTH | WEEK | DAY | HOUR | MINUTE | SECOND };
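To make the syntax template concrete, the sketch below fills it in for a disk-failure simulation. `disk_failure_query` is a hypothetical helper that only formats the statement; it does not connect to a database, and it covers just the optional `IN DISK index` form (the `NODE index` alternative is left out). The 25% and one-minute values are made-up examples.

```python
ALLOWED_UNITS = ("YEAR", "QUARTER", "MONTH", "WEEK", "DAY", "HOUR", "MINUTE", "SECOND")


def disk_failure_query(percent, quantity, unit, disk_index=None):
    """Format an ALTER SYSTEM SIMULATE ... DISK FAILURE statement."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    unit = unit.upper()
    if unit not in ALLOWED_UNITS:
        raise ValueError(f"unsupported interval unit: {unit}")
    # The IN DISK clause is optional in the syntax template.
    target = f" IN DISK {int(disk_index)}" if disk_index is not None else ""
    return (
        f"ALTER SYSTEM SIMULATE {int(percent)} PERCENT DISK FAILURE"
        f"{target} FOR INTERVAL {int(quantity)} {unit};"
    )


if __name__ == "__main__":
    print(disk_failure_query(25, 1, "minute"))
    print(disk_failure_query(25, 1, "minute", disk_index=2))
```

Keeping the interval short is a built-in emergency stop: the simulated fault ends on its own when the interval expires.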

  97. ToxiProxy
    • HTTP API
    • Built with automated testing in mind
    • Not for production environments
    • Fast
    • Toxics for:
      • Timeouts, latency, connections and bandwidth limitation, etc.
    • CLI
    • Stable and well tested (used for 3 years at Shopify)
    • Open Source: https://github.com/Shopify/toxiproxy

  98. https://atscaleconference.com/videos/resiliency-testing-with-toxiproxy/

  99. Challenges of Chaos Engineering

  100. https://xkcd.com/1428/
    Mister Chaos

  101. Big challenges to chaos engineering
    Mostly cultural:
    • Starting is perceived as hard!
    • No time or flexibility to simulate disasters.
    • Teams already spend all of their time fixing things.
    • It can be very political.
    • It might force deep conversations.
    • Teams may be deeply invested in a specific technical roadmap (micro-services) that chaos engineering tests show is not as resilient to failures as originally predicted.

  102. Chaos Engineering won’t make your system more robust. People will.

  103. Thank you!
    Adrian Hornsby
    https://medium.com/@adhorn
    @adhorn