Applying chaos engineering principles for building fault-tolerant applications

Failures are inevitable. Regardless of the engineering effort put into building fault-tolerant applications and handling edge cases, one day a case beyond our reach will turn a benign failure into a catastrophic one. Therefore, we must test and continuously improve our application’s resilience to minimise the blast radius of failures and their impact on user experience. Chaos engineering has emerged as one of the best methods to do that. However, while interest is growing, few have managed to build sustainable chaos engineering practices. In this two-part seminar, I will first introduce chaos engineering and its principles, and explain how to get started with it. I will then walk through and demo some of the tools and methods that you can use today to inject failures into software systems to make them more resilient.

Adrian Hornsby

April 16, 2020

Transcript

  1. © 2020, Amazon Web Services, Inc. or its Affiliates. Adrian Hornsby, Principal Technical Evangelist, Amazon Web Services. Chaos Engineering on AWS: Building Resilient Systems
  2. Challenges with distributed systems
  3. Amazon, Twitter, Netflix
  4. Request/reply round trip:
    1. POST REQUEST: CLIENT puts request MESSAGE onto NETWORK.
    2. DELIVER REQUEST: NETWORK delivers MESSAGE to SERVER.
    3. VALIDATE REQUEST: SERVER validates MESSAGE.
    4. UPDATE SERVER STATE: SERVER updates its state, if necessary, based on MESSAGE.
    5. POST REPLY: SERVER puts reply REPLY onto NETWORK.
    6. DELIVER REPLY: NETWORK delivers REPLY to CLIENT.
    7. VALIDATE REPLY: CLIENT validates REPLY.
    8. UPDATE CLIENT STATE: CLIENT updates its state, if necessary, based on REPLY.
    https://aws.amazon.com/builders-library/challenges-with-distributed-systems
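The eight steps above can be sketched as a small in-process simulation. This is only an illustration: the names `Step`, `round_trip` and `fail_at` are hypothetical, and nothing here touches a real network. It shows why a client that receives no reply cannot tell whether the server already updated its state.

```python
from enum import Enum
from typing import Dict, Optional


class Step(Enum):
    """The eight steps of one request/reply round trip, in order."""
    POST_REQUEST = 1
    DELIVER_REQUEST = 2
    VALIDATE_REQUEST = 3
    UPDATE_SERVER_STATE = 4
    POST_REPLY = 5
    DELIVER_REPLY = 6
    VALIDATE_REPLY = 7
    UPDATE_CLIENT_STATE = 8


def round_trip(fail_at: Optional[Step] = None) -> Dict[str, bool]:
    """Walk the steps in order and stop just before `fail_at` (hypothetical knob).

    Returns which side managed to update its state.
    """
    server_updated = False
    client_updated = False
    for step in Step:  # Enum iteration follows definition order
        if step is fail_at:
            break  # the injected failure: this step never completes
        if step is Step.UPDATE_SERVER_STATE:
            server_updated = True
        elif step is Step.UPDATE_CLIENT_STATE:
            client_updated = True
    return {"server_updated": server_updated, "client_updated": client_updated}
```

For example, `round_trip(Step.DELIVER_REQUEST)` and `round_trip(Step.DELIVER_REPLY)` both leave the client without a reply, but only the second one has changed the server state.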
  5. “Failures are a given and everything will eventually fail over time”. Werner Vogels, CTO, Amazon.com
  6. Distributed systems are hard because:
    • Errors happen at any time, often in combination with other errors.
    • Results of network operations can be unknown (succeeded, failed, or received but not processed).
    • Problems occur at all logical levels.
    • Problems get worse at higher levels of the system, due to recursion.
    • Bugs often show up long after they are deployed to a system.
    • Bugs can spread across an entire system.
    • Many problems derive from the laws of physics and can’t be changed.
  7. Is traditional testing enough? Testing: verifying a KNOWN condition, e.g. assert(A = B)?
    Unit testing of components:
    • Tested in isolation to ensure function meets expectations.
    Functional testing of integrations:
    • Each execution path tested to assure expected results.
  9. The Scientific Method
  10. The Scientific Method: Make Observation; Think of Interesting Questions; Formulate Hypotheses; Develop Testable Predictions; Gather Data to Test Predictions; Develop General Theories; Refine, Alter, Expand or Reject Hypotheses
  11. Chaos Engineering, a scientific method: STEADY STATE, HYPOTHESIS, RUN EXPERIMENT, VERIFY, IMPROVE
  13. Chaos Engineering formalized: principlesofchaos.org
  14. CHAOS ENGINEERING? THAT’S MY THING!!
  15. Chaos engineering is NOT about breaking things randomly without a purpose.
  16. Chaos engineering is about breaking things in a controlled environment and through well-planned experiments in order to build confidence in your application’s ability to withstand turbulent conditions.
  17. “CHAOS DOESN’T CAUSE PROBLEMS. IT REVEALS THEM.” - Nora Jones
  18. ✓ Building confidence against failure
  20. Jesse Robbins, “Master of Disaster”: GameDay at Amazon
    • A volunteer firefighter
    • Created GameDay in 2006 to purposefully create regular major failures.
    • Founded Chef and the Velocity Web Performance & Operations Conference.
  21. Jesse Robbins, “Master of Disaster”: GameDay at Amazon
    • Test, train and prepare Amazon systems, software, and people to respond to a disaster.
    • Increase Amazon retail website resiliency by purposely injecting failures into critical systems.
  22. Jesse Robbins, mid-2000s: https://www.youtube.com/watch?v=zoz0ZjfrQ9s
  23. Find weaknesses and fix them before they break when least expected.
  24. ✓ Building confidence against failures ✓ Reducing Skill Atrophy
  25. Training is not a one-time occurrence. It should be an ongoing process of expanding knowledge, exercising skills, and passing on these abilities for the benefit of the organization.
  26. ✓ Building confidence against failures ✓ Reducing Skill Atrophy ✓ Improving Recovery Time
  27. Can you afford to lose $100,000? Because that is the average cost of one hour of downtime reported by an ITIC study this year. Source: Information Technology Intelligence Consulting Research
  28. System Availability
    Availability = Normal Operation Time / Total Time = MTBF / (MTBF + MTTR)
    MTBF: Mean Time Between Failures. MTTR: Mean Time To Repair.
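A minimal sketch of the availability formula above, to show why improving recovery time raises availability even when failures are just as frequent. The function name and the sample numbers are assumptions made for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), as a fraction between 0 and 1."""
    if mtbf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTBF and MTTR must be non-negative")
    total = mtbf_hours + mttr_hours
    if total == 0:
        raise ValueError("MTBF + MTTR must be greater than zero")
    return mtbf_hours / total


# Hypothetical numbers: one failure every 999 hours of normal operation.
slow_recovery = availability(mtbf_hours=999, mttr_hours=1.0)   # one hour to repair
fast_recovery = availability(mtbf_hours=999, mttr_hours=0.5)   # half an hour to repair
```

With the same MTBF, halving MTTR moves the result closer to 1, which is the point of practicing recovery.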
  29. ✓ Building confidence against failures ✓ Reducing Skill Atrophy ✓ Improving Recovery Time ✓ And a lot more …
  30. Prerequisites to chaos engineering
  31. ✓ People ✓ Operations ✓ Application ✓ Network & Data ✓ Infrastructure
  32. Operations, Infrastructure, Application, Software
  34. https://medium.com/@adhorn
  35. https://aws.amazon.com/wellarchitected
  36. More Information: https://aws.amazon.com/builders-library
  37. Don’t break things in prod before you have done your homework!
  38. Phases of Chaos Engineering
  39. Phases of Chaos Engineering: STEADY STATE, HYPOTHESIS, RUN EXPERIMENT, VERIFY, IMPROVE
  40. Steady State: what is it?
    • “Normal” behavior of your system
    • Not the internal attributes of the system (CPU, memory, etc.)
    • Operational metrics tied to customer experience yield the best results.
    The steady state varies when an unmitigated failure triggers an unexpected problem, and that should cause the chaos experiment to be aborted.
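As a rough sketch of the idea above, a steady-state check can compare a customer-facing metric against its normal baseline. The function name, the 5% tolerance and the "orders per second" metric are assumptions made for illustration, not values from the talk.

```python
from statistics import mean
from typing import Sequence


def within_steady_state(samples: Sequence[float],
                        baseline: float,
                        tolerance: float = 0.05) -> bool:
    """Return True if the mean of `samples` is within `tolerance`
    (a fraction of `baseline`) of the normal value.

    `baseline` is assumed to be positive. No data is treated as a
    deviation, so an experiment would not continue blind.
    """
    if not samples:
        return False
    return abs(mean(samples) - baseline) <= tolerance * baseline


# Hypothetical metric: orders per second, normally around 100.
normal = within_steady_state([98, 101, 100], baseline=100)    # small wobble
degraded = within_steady_state([70, 75], baseline=100)        # clear drop
```

A result of `False` during an experiment is the signal to abort it.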
  41. Steady State: what is steady state? “Normal” behavior of your system
  42. Steady State: what is steady state?
    • Business + Ops Metric
    https://medium.com/netflix-techblog/
  43. Phases of Chaos Engineering: STEADY STATE, HYPOTHESIS, RUN EXPERIMENT, VERIFY, IMPROVE
  44. Hypothesis: what if…?
    “What if this load balancer breaks?”
    “What if Redis becomes slow?”
    “What if a host on Cassandra goes away?”
    “What if latency increases by 300ms?”
    “What if the database stops?”
    Make it everyone’s problem!
  45. Hypothesis
  46. Hypothesis: IF YOU HAVEN’T VERIFIED IT, IT’S PROBABLY BROKEN.
  47. Hypothesis: where to start?
    • Pick hypothesis
    • Scope the experiment
    • Identify metrics
    • Notify the organization
    • Think Blast radius!!!
    • Simulating AZ failure
    • Injecting latency between services
    • Randomly throwing exceptions.
    • Maxing out CPU to verify scaling policies.
    • Database failovers & backups
  48. Hypothesis: Disclaimer! Don’t make a hypothesis that you know will break you!
  49. Hypothesis: rules of thumb
    • Start very small
    • As close as possible to production
    • Minimize the blast radius.
    • Have an emergency STOP!
    • Careful with state that can’t be rolled back (corrupt or incorrect data)
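The "emergency STOP" rule can be sketched as an experiment loop that aborts as soon as the steady state is lost and always rolls back. This is only an illustration: `run_experiment` and its callback parameters are hypothetical names, and a real setup would poll metrics over time instead of calling the check in a tight loop.

```python
from typing import Callable


def run_experiment(inject: Callable[[], None],
                   rollback: Callable[[], None],
                   steady_state_ok: Callable[[], bool],
                   checks: int = 5) -> str:
    """Inject a failure, watch the steady state, and stop early if it breaks."""
    if not steady_state_ok():
        return "not-started"      # never start from an unhealthy system
    inject()
    try:
        for _ in range(checks):
            if not steady_state_ok():
                return "aborted"  # emergency stop
        return "completed"
    finally:
        rollback()                # runs on both the abort and the success path
```

Placing the rollback in `finally` means that an exception raised while checking the steady state also triggers it.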
  50. Phases of Chaos Engineering: STEADY STATE, HYPOTHESIS, RUN EXPERIMENT, VERIFY, IMPROVE
  51. Run Experiment: failure injection. Start small and build confidence.
    • Application level (exceptions, errors, etc.)
    • Host level (services, processes, etc.)
    • Resource attacks (CPU, memory, IO, etc.)
    • Network attacks (dependencies, latency, packet loss, etc.)
    • AZ attack
    • Region attack
    • People attack
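Application-level injection, the first item in the list above, can be sketched with a plain decorator that adds latency or raises an exception at a configurable rate. This is not the API of any library mentioned in the talk: `inject_failure` and its parameters are hypothetical, and the `rng` argument exists only to make the behaviour testable.

```python
import functools
import random
import time
from typing import Callable, Type


def inject_failure(rate: float = 0.0,
                   delay_s: float = 0.0,
                   exc: Type[Exception] = RuntimeError,
                   rng: Callable[[], float] = random.random):
    """Wrap a function so that each call is delayed by `delay_s` seconds
    and fails with `exc` with probability `rate`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if delay_s > 0:
                time.sleep(delay_s)           # injected latency
            if rng() < rate:                  # rng() is in [0, 1)
                raise exc("injected failure")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_failure(rate=0.1, delay_s=0.0)
def get_recommendations(user_id: int) -> list:
    # Stand-in for a real dependency call.
    return [user_id, user_id + 1]
```

Starting with a low `rate` on a single function keeps the blast radius small while callers' error handling is verified.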
  52. Run Experiment: canary deployment. Users reach a routing mechanism that splits traffic between the old application version and the new application version (10% traffic, 90% traffic). https://medium.com/@adhorn/immutable-infrastructure-21f6613e7a23
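The routing mechanism for a canary can be sketched as a weighted random choice per request. This sketch assumes the 10% share goes to the new version, as is usual for a canary; `pick_version` and its parameters are hypothetical names, and real deployments would normally do this in a load balancer or DNS layer.

```python
import random
from typing import Callable


def pick_version(canary_weight: float = 0.10,
                 rng: Callable[[], float] = random.random) -> str:
    """Route one request: 'new' with probability `canary_weight`, else 'old'."""
    return "new" if rng() < canary_weight else "old"


# Simulate 10,000 requests with a seeded generator so the run is repeatable.
seeded = random.Random(42)
picks = [pick_version(0.10, seeded.random) for _ in range(10_000)]
new_share = picks.count("new") / len(picks)
```

The same split is a natural place to run a chaos experiment, since only the canary share of users is exposed to it.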
  53. Phases of Chaos Engineering: STEADY STATE, HYPOTHESIS, RUN EXPERIMENT, VERIFY, IMPROVE
  54. Verify: quantifying the result of the experiment
    • Time to detect?
    • Time for notification? And escalation?
    • Time to public notification?
    • Time for graceful degradation to kick in?
    • Time for self-healing to happen?
    • Time to recovery, partial and full?
    • Time to all-clear and stable?
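The questions above all reduce to durations measured from the moment the failure was injected. A minimal sketch, assuming the experiment timeline has been recorded as named timestamps (the event names and the function are hypothetical):

```python
from datetime import datetime, timedelta
from typing import Dict


def experiment_timings(events: Dict[str, datetime],
                       start_event: str = "injected") -> Dict[str, timedelta]:
    """Return how long after `start_event` each other event happened."""
    start = events[start_event]
    return {name: ts - start
            for name, ts in events.items()
            if name != start_event}


# Hypothetical timeline of one experiment.
t0 = datetime(2020, 4, 16, 10, 0, 0)
timeline = {
    "injected": t0,
    "detected": t0 + timedelta(seconds=90),
    "notified": t0 + timedelta(minutes=2),
    "recovered": t0 + timedelta(minutes=10),
}
timings = experiment_timings(timeline)
```

Tracking these numbers across repeated experiments shows whether detection and recovery are actually improving.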
  55. Verify: postmortems, COE (Correction of Errors)
    • What happened?
    • What was the impact on customers and your business?
    • What were the contributing factors?
    • What data do you have to support this?
        • especially metrics and graphs
    • What lessons did you learn?
    • What corrective actions are you taking?
        • Action items
        • Related items (trouble tickets etc.)
  56. Verify: dive deep on the causes
    Question: Why did the associate damage his thumb?
    Answer: Because his thumb got caught in the conveyor.
    Question: Why did his thumb get caught in the conveyor?
    Answer: Because he was chasing his bag, which was on a running conveyor belt.
    Question: Why did he chase his bag?
    Answer: Because he placed his bag on the conveyor, but it then turned on by surprise.
    Question: Why was his bag on the conveyor?
    Answer: Because he used the conveyor as a table.
    Possible conclusion: So, one likely cause of the associate’s damaged thumb is that he needed a table, there wasn’t one around, so he used a conveyor as a table.
  57. Verify: Tools, Processes, Culture, Technology
  58. Verify: Never Let a Good Crisis Go To Waste
  59. Verify: Two rules to remember ALWAYS!
  60. Verify: DON’T blame that one person …
  61. Verify: There is no isolated ‘cause’ of an accident.
  62. Phases of Chaos Engineering: STEADY STATE, HYPOTHESIS, RUN EXPERIMENT, VERIFY, IMPROVE
  63. Improve: Fix it!
  64. Improve: Audit. Weekly Operational Metrics Review
    • Continuous inspection mechanism
    • Maintains focus on operations
    • Foundation of a healthy operations program
    Typical agenda, usually divided into fifteen-minute slots:
    • Share successes and failings
    • Action items follow up
    • Review COEs
    • Review key service metrics
    • Identify new best practices
  65. Bananas for Monkeys
  66. INTRODUCE CHAOS ENGINEERING EARLY IN THE JOURNEY. DON’T WAIT!
  67. Eleanor: https://github.com/adhorn/eleanor
  68. Start simple and local!!
    $ docker stop database
    or anything else ;-)
  69. Demo
  70. DDoS yourself
    $ wrk -t12 -c400 -d30s http://127.0.0.1/api/health
  71. Demo
  72. Burn CPU with stress(-ng)
    $ stress-ng --cpu 0 --cpu-method matrixprod -t 60s
    https://kernel.ubuntu.com/~cking/stress-ng/
  73. Demo
  74. Adding latency to the network
    $ tc qdisc add dev eth0 root netem delay 300ms
  75. Blocking DNS resolution
    $ iptables -A INPUT -p tcp -m tcp --dport 53 -j DROP
  76. Other fun things to do
    • Fill up disk
    • Network packet loss (using traffic-shaping)
    • Network packet corruption (using traffic-shaping)
    • Kill random processes
    • Detach (force) all EBS volumes
    • Mess with config files
    • …
  77. Simian Army: “Simian Army to keep our cloud safe, secure, and highly available.” - 2011 Netflix blog
    A set of scheduled agents that:
    • shut down services randomly
    • slow down performance
    • check conformity
    • break an entire region
    • integrate with Spinnaker (CI/CD)
    https://github.com/Netflix/SimianArmy
  78. https://chaosiq.io
  79. The Chaos Toolkit: https://chaostoolkit.org
    • Simplifying Adoption of Chaos Engineering
    • An Open API to Chaos Engineering
    • Open source extensions for
        • Infrastructure/Platform Fault Injections
        • Application Fault Injections
        • Observability
    • Integrates easily into CI/CD pipelines
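To give a feel for what a Chaos Toolkit experiment looks like, here is a hedged sketch that builds one as a Python dictionary and serialises it to JSON. The field names follow my understanding of the Chaos Toolkit experiment format and should be checked against the official documentation; the experiment itself is hypothetical and simply reuses the health URL and the `docker stop database` command from the earlier slides.

```python
import json

# Hypothetical experiment: does the API stay healthy when the database stops?
experiment = {
    "version": "1.0.0",
    "title": "API stays healthy when the database container stops",
    "description": "Stop the local database and probe the health endpoint.",
    "steady-state-hypothesis": {
        "title": "Health endpoint answers 200",
        "probes": [
            {
                "type": "probe",
                "name": "api-health-responds",
                "tolerance": 200,  # expected HTTP status code
                "provider": {
                    "type": "http",
                    "url": "http://127.0.0.1/api/health",
                    "timeout": 3,
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "stop-database",
            "provider": {"type": "process", "path": "docker",
                         "arguments": "stop database"},
        }
    ],
    "rollbacks": [
        {
            "type": "action",
            "name": "start-database",
            "provider": {"type": "process", "path": "docker",
                         "arguments": "start database"},
        }
    ],
}


def to_experiment_json(exp: dict) -> str:
    """Serialise the experiment so it can be saved to a file."""
    return json.dumps(exp, indent=2)
```

Saved as, say, `experiment.json`, a file like this would typically be executed with `chaos run experiment.json`. The structure mirrors the phases of the talk: steady state, the injected failure, then rollback.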
  81. Demo
  84. Demo
  85. Verica.io
  86. https://github.com/adhorn/aws-chaos-scripts SDKs :-)
  87. Demo
  88. Injecting Chaos to Amazon EC2 using AWS Systems Manager
  89. https://github.com/adhorn/chaos-ssm-documents
  90. Demo
  91. Demo (SSM + ChaosToolkit)
  92. Injecting Chaos to AWS Lambda
    $ pip install chaos-lambda
  93. https://github.com/adhorn/aws-lambda-chaos-injection
  94. Demo
  95. https://github.com/gunnargrosch/failure-lambda
  96. Fault Injection Queries for Amazon Aurora. SQL commands issued to simulate:
    • A crash of the master instance or an Aurora Replica
    • A failure of an Aurora Replica
    • A disk failure
    • Disk congestion
    https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Managing.FaultInjectionQueries.html
  97. Fault Injection Queries for Amazon Aurora. SQL commands issued to simulate:
    • A crash of the master instance or an Aurora Replica
    • A failure of an Aurora Replica
    • A disk failure
    • Disk congestion
    ALTER SYSTEM SIMULATE percentage_of_failure PERCENT DISK FAILURE
        [ IN DISK index | NODE index ]
        FOR INTERVAL quantity { YEAR | QUARTER | MONTH | WEEK | DAY | HOUR | MINUTE | SECOND };
  98. ToxiProxy
    • HTTP API
    • Built with automated testing in mind
        • Not for production environments
        • Fast
    • Toxics for:
        • Timeouts, latency, connections and bandwidth limitation, etc.
    • CLI
    • Stable and well tested (used for 3 years at Shopify)
    • Open Source: https://github.com/Shopify/toxiproxy
  99. https://atscaleconference.com/videos/resiliency-testing-with-toxiproxy/
  100. Challenges of Chaos Engineering
  101. Mister Chaos: https://xkcd.com/1428/
  102. Big challenges to chaos engineering: mostly cultural
    • Starting is perceived as hard!
    • No time or flexibility to simulate disasters.
    • Teams already spend all of their time fixing things.
    • Can be very political.
    • Might force deep conversations.
    • Teams can be deeply invested in a specific technical roadmap (micro-services) that chaos engineering tests show is not as resilient to failures as originally predicted.
  103. Chaos Engineering won’t make your system more robust; people will.
  104. Thank you! Adrian Hornsby, https://medium.com/@adhorn, @adhorn