Applying chaos engineering principles for building fault-tolerant applications

Failures are inevitable. Regardless of the engineering efforts put into building fault-tolerant applications and handling edge cases, one day, a case beyond our reach will turn a benign failure into a catastrophic one. Therefore, we must test and continuously improve our application’s resilience to failures to minimise its blast radius and its impact on user experience. Chaos engineering has emerged as one of the best methods to do that. However, while interest is growing, few have managed to build sustainable chaos engineering practices. In this two-part seminar, I will first introduce chaos engineering and its principles, and explain how to get started with it. I will then walk through and demo some of the tools and methods that you can use today to inject failures into software systems to make them more resilient.

Adrian Hornsby

April 16, 2020
Transcript

  1. Chaos Engineering on AWS: Building Resilient Systems
    Adrian Hornsby, Principal Technical Evangelist, Amazon Web Services
    © 2020, Amazon Web Services, Inc. or its Affiliates.
  2. 1. POST REQUEST: CLIENT puts request MESSAGE onto NETWORK.
    2. DELIVER REQUEST: NETWORK delivers MESSAGE to SERVER.
    3. VALIDATE REQUEST: SERVER validates MESSAGE.
    4. UPDATE SERVER STATE: SERVER updates its state, if necessary, based on MESSAGE.
    5. POST REPLY: SERVER puts reply REPLY onto NETWORK.
    6. DELIVER REPLY: NETWORK delivers REPLY to CLIENT.
    7. VALIDATE REPLY: CLIENT validates REPLY.
    8. UPDATE CLIENT STATE: CLIENT updates its state, if necessary, based on REPLY.
    https://aws.amazon.com/builders-library/challenges-with-distributed-systems
  3. “Failures are a given and everything will eventually fail over time.” Werner Vogels, CTO – Amazon.com
  4. Distributed systems are hard because
    • Errors happen anytime, often in combination with other errors.
    • Results of network operations can be unknown (succeeded, failed, or received but not processed).
    • Problems occur at all logical levels.
    • Problems get worse at higher levels of the system, due to recursion.
    • Bugs often show up long after they are deployed to a system.
    • Bugs can spread across an entire system.
    • Many problems derive from the laws of physics and can’t be changed.
  5. Is traditional testing enough?
    Testing: verifying a KNOWN condition, e.g. assert(A = B) ?
    Unit testing of components:
    • Tested in isolation to ensure function meets expectations.
    Functional testing of integrations:
    • Each execution path tested to assure expected results.
  6. Is traditional testing enough?
    Testing: verifying a KNOWN condition, e.g. assert(A = B) ?
    Unit testing of components:
    • Tested in isolation to ensure function meets expectations.
    Functional testing of integrations:
    • Each execution path tested to assure expected results.
  7. The Scientific Method
    • Make Observation
    • Think of Interesting Questions
    • Formulate Hypotheses
    • Develop Testable Predictions
    • Gather Data to Test Predictions
    • Develop General Theories
    • Refine, Alter, Expand or Reject Hypotheses
  8. Chaos Engineering: A scientific method
    STEADY STATE, HYPOTHESIS, RUN EXPERIMENT, VERIFY, IMPROVE
  9. Chaos Engineering formalized: principlesofchaos.org
  10. Chaos engineering is NOT about breaking things randomly without a purpose.
  11. Chaos engineering is about breaking things in a controlled environment and through well-planned experiments in order to build confidence in your application to withstand turbulent conditions.
  12. “CHAOS DOESN’T CAUSE PROBLEMS. IT REVEALS THEM.” - Nora Jones
  13. ✓ Building confidence against failure
  14. Jesse Robbins, “Master of Disaster”: GameDay at Amazon
    • A volunteer firefighter
    • Created GameDay in 2006 to purposefully create regular major failures.
    • Founded Chef, the Velocity Web Performance & Operations Conference.
  15. Jesse Robbins, “Master of Disaster”: GameDay at Amazon
    • Test, train and prepare Amazon systems, software, and people to respond to a disaster.
    • Increase Amazon retail website resiliency by purposely injecting failures into critical systems.
  16. Find weaknesses and fix them before they break when least expected.
  17. ✓ Building confidence against failures
    ✓ Reducing Skill Atrophy
  18. Training is not a one-time occurrence. It should be an ongoing process of expanding knowledge, exercising skills, and passing on these abilities for the benefit of the organization.
  19. ✓ Building confidence against failures
    ✓ Reducing Skill Atrophy
    ✓ Improving Recovery Time
  20. Can you afford to lose $100,000?
    Because that is the average cost of one hour of downtime reported by an ITIC study this year.
    Source: Information Technology Intelligence Consulting Research
  21. System Availability
    Availability = Normal Operation Time / Total Time = MTBF / (MTBF + MTTR)
    MTTR: Mean Time To Repair
    MTBF: Mean Time Between Failure
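To make the ratio MTBF / (MTBF + MTTR) concrete, here is a minimal sketch that computes availability as a percentage. The function name `availability` and the figures used (one failure every 720 hours, repaired in one hour or in fifteen minutes) are assumptions chosen for illustration, not numbers from the talk.

```shell
# availability MTBF MTTR
# Prints availability as a percentage; both arguments use the same time unit.
availability() {
    awk -v mtbf="$1" -v mttr="$2" \
        'BEGIN { printf "%.4f\n", 100 * mtbf / (mtbf + mttr) }'
}

# Hypothetical figures: a failure every 720 hours on average.
availability 720 1      # one-hour repair
availability 720 0.25   # fifteen-minute repair
```

With MTBF held constant, shortening the repair time is what moves the result, which is why recovery time appears among the benefits listed earlier in the deck.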
  22. ✓ Building confidence against failures
    ✓ Reducing Skill Atrophy
    ✓ Improving Recovery Time
    ✓ And a lot more …
  23. ✓ People
    ✓ Operations
    ✓ Application
    ✓ Network & Data
    ✓ Infrastructure
  24. More Information: https://aws.amazon.com/builders-library
  25. Don’t break things in prod before you have done your homework!
  26. Phases of Chaos Engineering: STEADY STATE, HYPOTHESIS, RUN EXPERIMENT, VERIFY, IMPROVE
  27. Steady State: What is it?
    • “Normal” behavior of your system
    • Not the internal attributes of the system (CPU, memory, etc.)
    • Operational metrics tied to customer experience yield the best results.
    The steady state varies when an unmitigated failure triggers an unexpected problem; such a deviation should cause the chaos experiment to be aborted.
  28. Steady State: What is steady state?
    “Normal” behavior of your system
  29. Steady State: What is steady state?
    • Business + Ops Metric
    https://medium.com/netflix-techblog/
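One way to turn a steady state into something an experiment can check automatically is a tolerance band around a baseline metric. The sketch below is illustrative only: the function name `within_steady_state`, the "orders per minute" metric, the baseline of 1000 and the 5% tolerance are all assumptions, and it assumes a positive baseline.

```shell
# within_steady_state VALUE BASELINE TOLERANCE_PERCENT
# Exit status 0 when VALUE lies within +/- TOLERANCE_PERCENT of BASELINE,
# exit status 1 otherwise. Assumes BASELINE is positive.
within_steady_state() {
    awk -v v="$1" -v b="$2" -v t="$3" 'BEGIN {
        d = v - b
        if (d < 0) d = -d
        exit (d <= b * t / 100) ? 0 : 1
    }'
}

# Hypothetical metric: orders per minute, baseline 1000, tolerance 5%.
if within_steady_state 970 1000 5; then
    echo "steady state holds"
else
    echo "steady state violated: abort the experiment"
fi
```

In practice the value would come from a monitoring system rather than a literal, and the check would run repeatedly for the duration of the experiment.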
  30. Phases of Chaos Engineering: STEADY STATE, HYPOTHESIS, RUN EXPERIMENT, VERIFY, IMPROVE
  31. Hypothesis: What if…?
    “What if this load balancer breaks?”
    “What if Redis becomes slow?”
    “What if a host on Cassandra goes away?”
    “What if latency increases by 300ms?”
    “What if the database stops?”
    Make it everyone’s problem!
  32. Hypothesis: Where to start?
    • Pick hypothesis
    • Scope the experiment
    • Identify metrics
    • Notify the organization
    • Think Blast radius!!!
    • Simulating AZ failure
    • Injecting latency between services
    • Randomly throwing exceptions.
    • Maxing out CPU to verify scaling policies.
    • Database failovers & backups
  33. Hypothesis: Disclaimer!
    Don’t make a hypothesis that you know will break you!
  34. Hypothesis: Rules of thumb
    • Start very small
    • As close as possible to production
    • Minimize the blast radius.
    • Have an emergency STOP!
    • Be careful with state that can’t be rolled back (corrupt or incorrect data)
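The "emergency STOP" rule can be approximated in a shell-driven experiment with a `trap` that undoes the fault however the experiment ends. This is a minimal sketch under stated assumptions: `inject_fault`, `rollback_fault` and `run_experiment` are hypothetical names, and the two placeholder functions only print a message where a real fault and its exact undo would go.

```shell
# Hypothetical placeholders: replace with a real fault and its exact undo.
inject_fault()   { echo "fault injected"; }
rollback_fault() { echo "fault rolled back"; }

# run_experiment DURATION_SECONDS
# The body runs in a subshell so the traps do not leak into the caller.
run_experiment() (
    trap rollback_fault EXIT   # undo the fault however the subshell ends
    trap 'exit 130' INT TERM   # Ctrl-C is the emergency stop; a TERM signal
                               # is handled once the current sleep returns
    inject_fault
    sleep "$1"                 # observation window
)

run_experiment 1
```

The rollback only helps for faults that can be undone; state that cannot be rolled back, as the last bullet warns, needs a different safeguard.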
  35. Phases of Chaos Engineering: STEADY STATE, HYPOTHESIS, RUN EXPERIMENT, VERIFY, IMPROVE
  36. Run Experiment: Failure injection
    Start small and build confidence
    • Application level (exceptions, errors, etc.)
    • Host level (services, processes, etc.)
    • Resource attacks (CPU, memory, IO, etc.)
    • Network attacks (dependencies, latency, packet loss, etc.)
    • AZ attack
    • Region attack
    • People attack
  37. Run Experiment: Canary deployment
    [Diagram: Users reach a routing mechanism that splits traffic between the old application version and the new application version (10% traffic, 90% traffic).]
    https://medium.com/@adhorn/immutable-infrastructure-21f6613e7a23
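The traffic split in the canary diagram can be sketched as a deterministic routing rule. The sketch assumes, as is usual for canaries but not stated in the transcript, that the 10% share goes to the new version; `route_request` and the use of numeric request IDs are hypothetical.

```shell
# route_request REQUEST_ID [CANARY_PERCENT]
# Prints "new" for the canary share of request IDs and "old" for the rest.
# CANARY_PERCENT defaults to 10 (assumed to be the new version's share).
route_request() {
    if [ $(( $1 % 100 )) -lt "${2:-10}" ]; then
        echo new
    else
        echo old
    fi
}

# Count how many of 1000 consecutive request IDs reach the new version.
new=0
i=0
while [ "$i" -lt 1000 ]; do
    if [ "$(route_request "$i")" = new ]; then
        new=$((new + 1))
    fi
    i=$((i + 1))
done
echo "$new of 1000 requests routed to the new version"
```

Applied to fault injection, the same idea keeps the blast radius small: only the canary share of traffic is exposed to the experiment.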
  38. Phases of Chaos Engineering: STEADY STATE, HYPOTHESIS, RUN EXPERIMENT, VERIFY, IMPROVE
  39. Verify: Quantifying the result of the experiment
    • Time to detect?
    • Time for notification? And escalation?
    • Time to public notification?
    • Time for graceful degradation to kick in?
    • Time for self-healing to happen?
    • Time to recovery – partial and full?
    • Time to all-clear and stable?
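Most of these questions reduce to the difference between two timestamps recorded during the experiment. A minimal sketch, assuming the events are logged as Unix epoch seconds; the function name `elapsed` and the timeline values are invented for illustration.

```shell
# elapsed START_EPOCH END_EPOCH
# Prints the number of seconds between two Unix timestamps.
elapsed() { echo $(( $2 - $1 )); }

# Hypothetical timeline of one experiment, as Unix epoch seconds.
injected=1587038400    # fault injected
detected=1587038445    # first alarm fired
recovered=1587038700   # steady state restored

echo "time to detect:   $(elapsed "$injected" "$detected")s"
echo "time to recovery: $(elapsed "$injected" "$recovered")s"
```

Recording the same intervals for every run makes it possible to see whether detection and recovery are actually improving over time.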
  40. Verify: Postmortems – COE (Correction of Errors)
    • What happened?
    • What was the impact on customers and your business?
    • What were the contributing factors?
    • What data do you have to support this?
      • especially metrics and graphs
    • What lessons did you learn?
    • What corrective actions are you taking?
      • Action items
      • Related items (trouble tickets etc.)
  41. Verify: Dive deep on the causes
    Question: Why did the associate damage his thumb?
    Answer: Because his thumb got caught in the conveyor.
    Question: Why did his thumb get caught in the conveyor?
    Answer: Because he was chasing his bag, which was on a running conveyor belt.
    Question: Why did he chase his bag?
    Answer: Because he placed his bag on the conveyor, but it then turned on by surprise.
    Question: Why was his bag on the conveyor?
    Answer: Because he used the conveyor as a table.
    Possible Conclusion: So, one likely cause of the associate’s damaged thumb is that he needed a table, there wasn’t one around, so he used a conveyor as a table.
  42. Verify: There is no isolated ‘cause’ of an accident.
  43. Phases of Chaos Engineering: STEADY STATE, HYPOTHESIS, RUN EXPERIMENT, VERIFY, IMPROVE
  44. Improve: Audit
    Weekly Operational Metrics Review
    • Continuous inspection mechanism
    • Maintains focus on operations
    • Foundation of a healthy operations program
    Typical Agenda - typically divided into fifteen-minute slots
    • Share successes and failings
    • Action items follow up
    • Review COEs
    • Review key service metrics
    • Identify new best practices
  45. INTRODUCE CHAOS ENGINEERING EARLY IN THE JOURNEY. DON’T WAIT!
  46. Start simple and local!!
    $ docker stop database
    or anything else ;-)
  47. Burn CPU with Stress(-ng)
    $ stress-ng --cpu 0 --cpu-method matrixprod -t 60s
    https://kernel.ubuntu.com/~cking/stress-ng/
  48. Adding latency to the network
    $ tc qdisc add dev eth0 root netem delay 300ms
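An injected delay stays in place until it is removed, so it helps to pair the `tc` command with its undo. The sketch below wraps both in functions; `run`, `add_latency`, `remove_latency` and the `DRY_RUN` switch are hypothetical helpers, and only the `tc qdisc add` and `tc qdisc del` invocations are real commands. Running them for real requires root privileges and an interface that actually exists on the host.

```shell
# With DRY_RUN=1 (the default in this sketch) commands are printed, not run.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

# add_latency INTERFACE DELAY: attach a netem qdisc that delays packets.
add_latency()    { run tc qdisc add dev "$1" root netem delay "$2"; }

# remove_latency INTERFACE: delete the netem qdisc again.
remove_latency() { run tc qdisc del dev "$1" root netem; }

add_latency eth0 300ms
remove_latency eth0
```

Setting `DRY_RUN=0` executes the commands instead of printing them; keeping the default makes it safe to review what an experiment would do first.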
  49. Blocks DNS resolution
    $ iptables -A INPUT -p tcp -m tcp --dport 53 -j DROP
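The rule on the slide matches TCP packets arriving at local port 53, so it mainly affects a DNS server running on the host itself. The following sketch assumes a different goal, breaking the host's own lookups, which normally leave over UDP with TCP as a fallback, and therefore drops outgoing traffic to port 53 for both protocols. `run`, `block_dns`, `unblock_dns` and `DRY_RUN` are hypothetical helpers; running the rules for real requires root privileges.

```shell
# With DRY_RUN=1 (the default in this sketch) commands are printed, not run.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

# Drop outgoing DNS queries over both UDP and TCP.
block_dns() {
    run iptables -A OUTPUT -p udp --dport 53 -j DROP
    run iptables -A OUTPUT -p tcp --dport 53 -j DROP
}

# Delete the same two rules to restore resolution.
unblock_dns() {
    run iptables -D OUTPUT -p udp --dport 53 -j DROP
    run iptables -D OUTPUT -p tcp --dport 53 -j DROP
}

block_dns
unblock_dns
```

Answers already cached by the application or a local resolver are unaffected, so the failure may only become visible once those entries expire.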
  50. Other fun things to do
    • Fill up disk
    • Network packet loss (using traffic-shaping)
    • Network packet corruption (using traffic-shaping)
    • Kills random processes
    • Detach (force) all EBS volumes
    • Mess with config files
    • …
  51. Simian Army
    “Simian Army to keep our cloud safe, secure, and highly available.” - 2011 Netflix blog
    Set of scheduled agents:
    • shuts down services randomly
    • slows down performance
    • checks conformity
    • breaks an entire region
    • Integrates with Spinnaker (CI/CD)
    https://github.com/Netflix/SimianArmy
  52. The Chaos Toolkit
    https://chaostoolkit.org
    • Simplifying Adoption of Chaos Engineering
    • An Open API to Chaos Engineering
    • Open source extensions for
      • Infrastructure/Platform Fault Injections
      • Application Fault Injections
      • Observability
    • Integrates easily into CI/CD pipelines
  53. Injecting Chaos to Amazon EC2 using AWS Systems Manager
  54. Injecting Chaos to AWS Lambda
    $ pip install chaos-lambda
  55. Fault Injection Queries for Amazon Aurora
    SQL commands issued to simulate:
    • A crash of the master instance or an Aurora Replica
    • A failure of an Aurora Replica
    • A disk failure
    • Disk congestion
    https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Managing.FaultInjectionQueries.html
  56. Fault Injection Queries for Amazon Aurora
    SQL commands issued to simulate:
    • A crash of the master instance or an Aurora Replica
    • A failure of an Aurora Replica
    • A disk failure
    • Disk congestion
    ALTER SYSTEM SIMULATE percentage_of_failure PERCENT DISK FAILURE
        [ IN DISK index | NODE index ]
        FOR INTERVAL quantity { YEAR | QUARTER | MONTH | WEEK | DAY | HOUR | MINUTE | SECOND };
  57. ToxiProxy
    • HTTP API
    • Built with automated testing in mind
      • Not for production environment
    • Fast
    • Toxics for:
      • Timeouts, latency, connections and bandwidth limitation, etc.
    • CLI
    • Stable and well tested (used for 3 years at Shopify)
    • Open Source: https://github.com/Shopify/toxiproxy
  58. Big challenges to chaos engineering
    Mostly Cultural
    • Starting is perceived as hard!
    • No time or flexibility to simulate disasters.
    • Teams already spend all of their time fixing things.
    • Can be very political.
    • Might force deep conversations.
    • Deeply invested in a specific technical roadmap (micro-services) that chaos engineering tests show is not as resilient to failures as originally predicted.
  59. Chaos Engineering won’t make your system more robust. People will.
  60. Thank you!
    Adrian Hornsby
    https://medium.com/@adhorn
    @adhorn