Applying chaos engineering principles for building fault-tolerant applications

© 2020, Amazon Web Services, Inc. or its Affiliates.
Adrian Hornsby, Principal Technical Evangelist, Amazon Web Services
Chaos Engineering on AWS
Building Resilient Systems

Challenges with distributed systems

Amazon Twitter Netflix

1. POST REQUEST: CLIENT puts request MESSAGE onto NETWORK.
2. DELIVER REQUEST: NETWORK delivers MESSAGE to SERVER.
3. VALIDATE REQUEST: SERVER validates MESSAGE.
4. UPDATE SERVER STATE: SERVER updates its state, if necessary, based on MESSAGE.
5. POST REPLY: SERVER puts reply REPLY onto NETWORK.
6. DELIVER REPLY: NETWORK delivers REPLY to CLIENT.
7. VALIDATE REPLY: CLIENT validates REPLY.
8. UPDATE CLIENT STATE: CLIENT updates its state, if necessary, based on REPLY.
https://aws.amazon.com/builders-library/challenges-with-distributed-systems
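
The eight steps above can be walked through in a small simulation. This is a minimal sketch, not anything from the talk: the per-step failure probability, the function names and the outcome labels are all assumptions made for illustration. It shows why a failed round trip is ambiguous for the client: the failure may land before or after the server has already updated its state.

```python
import random

STEPS = [
    "POST REQUEST",
    "DELIVER REQUEST",
    "VALIDATE REQUEST",
    "UPDATE SERVER STATE",
    "POST REPLY",
    "DELIVER REPLY",
    "VALIDATE REPLY",
    "UPDATE CLIENT STATE",
]


def round_trip(failure_rate, rng):
    """Walk the eight steps; return (failed_step, server_state_updated).

    failed_step is None when every step succeeds.
    """
    server_updated = False
    for step in STEPS:
        if rng.random() < failure_rate:
            return step, server_updated
        if step == "UPDATE SERVER STATE":
            server_updated = True
    return None, server_updated


def summarize(trials, failure_rate, seed=0):
    """Count how many round trips succeed or fail before/after the server update."""
    rng = random.Random(seed)
    outcomes = {"ok": 0, "failed_before_update": 0, "failed_after_update": 0}
    for _ in range(trials):
        failed_step, server_updated = round_trip(failure_rate, rng)
        if failed_step is None:
            outcomes["ok"] += 1
        elif server_updated:
            outcomes["failed_after_update"] += 1
        else:
            outcomes["failed_before_update"] += 1
    return outcomes
```

Calling summarize(1000, 0.05) gives a feel for how often a client sees an error even though the server did the work, which is the case that retries and idempotency have to handle.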

“Failures are a given and everything will eventually fail over time”.
Werner Vogels, CTO – Amazon.com

Distributed systems are hard because
• Errors happen anytime, often in combination with other errors.
• Results of network operations can be unknown (succeeded, failed, or received but not processed).
• Problems occur at all logical levels.
• Problems get worse at higher levels of the system, due to recursion.
• Bugs often show up long after they are deployed to a system.
• Bugs can spread across an entire system.
• Many problems derive from the laws of physics and can’t be changed.

Is traditional testing enough?
Testing: verifying a KNOWN condition, e.g. assert(A = B)?
Unit testing of components:
• Tested in isolation to ensure function meets expectations.
Functional testing of integrations:
• Each execution path tested to assure expected results.

The Scientific Method

The Scientific Method
• Make Observation
• Think of Interesting Questions
• Formulate Hypotheses
• Develop Testable Predictions
• Gather Data to Test Predictions
• Develop General Theories
• Refine, Alter, Expand or Reject Hypotheses

Chaos Engineering: A scientific method
STEADY STATE → HYPOTHESIS → RUN EXPERIMENT → VERIFY → IMPROVE

Chaos Engineering formalized
principlesofchaos.org

CHAOS ENGINEERING? THAT’S MY THING!!

Chaos engineering is NOT about breaking things randomly without a purpose.

Chaos engineering is about breaking things in a controlled environment, through well-planned experiments, in order to build confidence in your application’s ability to withstand turbulent conditions.

“CHAOS DOESN’T CAUSE PROBLEMS. IT REVEALS THEM.” - Nora Jones

✓ Building confidence against failure

Jesse Robbins, “Master of Disaster”
GameDay at Amazon
• A volunteer firefighter
• Created GameDay in 2006 to purposefully create regular major failures.
• Founded Chef, the Velocity Web Performance & Operations Conference.

Jesse Robbins, “Master of Disaster”
GameDay at Amazon
• Test, train and prepare Amazon systems, software, and people to respond to a disaster.
• Increase Amazon retail website resiliency by purposely injecting failures into critical systems.

https://www.youtube.com/watch?v=zoz0ZjfrQ9s
Jesse Robbins – mid 2000’s

Find weaknesses and fix them before they break when least expected.

✓ Building confidence against failures
✓ Reducing Skill Atrophy

Training is not a one-time occurrence. It should be an ongoing process of expanding knowledge, exercising skills, and passing on these abilities for the benefit of the organization.

✓ Building confidence against failures
✓ Reducing Skill Atrophy
✓ Improving Recovery Time

Can you afford to lose $100,000?
Because that is the average cost of one hour of downtime reported by an ITIC study this year.
Source: Information Technology Intelligence Consulting Research

System Availability
Availability = Normal Operation Time / Total Time = MTBF / (MTBF + MTTR)
MTBF: Mean Time Between Failure
MTTR: Mean Time To Repair
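
A short worked example of the availability formula, as a sketch. The MTBF and MTTR figures are hypothetical and only chosen to show that, for the same failure frequency, shortening recovery time raises availability.

```python
def availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR), as a fraction between 0 and 1."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail):
    """Expected downtime in hours over a 365-day year for a given availability."""
    return (1.0 - avail) * 365 * 24


# Hypothetical numbers: one failure every 1000 hours on average.
slow_recovery = availability(1000, 4)  # repairs take 4 hours
fast_recovery = availability(1000, 1)  # repairs take 1 hour
```

Here fast_recovery is higher than slow_recovery although the system fails just as often, which is why improving recovery time appears on the list of benefits.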

✓ Building confidence against failures
✓ Reducing Skill Atrophy
✓ Improving Recovery Time
✓ And a lot more …

Prerequisites to chaos engineering

✓ People
✓ Operations
✓ Application
✓ Network & Data
✓ Infrastructure

Operations Infrastructure Application Software

More Information
https://aws.amazon.com/builders-library

Phases of Chaos Engineering

Phases of Chaos Engineering
STEADY STATE → HYPOTHESIS → RUN EXPERIMENT → VERIFY → IMPROVE

Steady State
What is it?
• “Normal” behavior of your system
• Not the internal attributes of the system (CPU, memory, etc.)
• Operational metrics tied to customer experience yield the best results.
The steady state varies when an unmitigated failure triggers an unexpected problem. When that happens, the chaos experiment should be aborted.
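
One way to make a steady state concrete is a tolerated band around a customer-facing metric, checked while the experiment runs. The sketch below assumes a hypothetical metric (checkouts per second) and made-up thresholds and samples; none of these come from the talk.

```python
from dataclasses import dataclass


@dataclass
class SteadyState:
    """Tolerated band for one customer-facing metric (hypothetical example)."""
    metric: str
    low: float
    high: float

    def holds(self, value):
        return self.low <= value <= self.high


def first_violation(steady_state, samples):
    """Return the index of the first sample outside the band, or None."""
    for index, value in enumerate(samples):
        if not steady_state.holds(value):
            return index
    return None


# Hypothetical metric and samples collected during an experiment.
checkouts = SteadyState(metric="checkouts_per_second", low=90.0, high=130.0)
samples_during_experiment = [104.0, 99.5, 97.0, 71.0, 68.0]
abort_at = first_violation(checkouts, samples_during_experiment)
```

In this made-up run the fourth sample leaves the band, so that is the point at which the experiment would be aborted.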

Steady State
What is steady state?
“Normal” behavior of your system

Steady State
What is steady state?
• Business + Ops Metric
https://medium.com/netflix-techblog/

Hypothesis
What if…?
“What if this load balancer breaks?”
“What if Redis becomes slow?”
“What if a host on Cassandra goes away?”
“What if latency increases by 300ms?”
“What if the database stops?”
Make it everyone’s problem!

Hypothesis

Hypothesis
IF YOU HAVEN’T VERIFIED IT, IT’S PROBABLY BROKEN.

Hypothesis
Where to start?
• Pick hypothesis
• Scope the experiment
• Identify metrics
• Notify the organization
• Think blast radius!!!
• Simulating AZ failure
• Injecting latency between services
• Randomly throwing exceptions.
• Maxing out CPU to verify scaling policies.
• Database failovers & backups

Hypothesis
Disclaimer! Don’t make a hypothesis that you know will break you!

Hypothesis
Rules of thumb
• Start very small
• As close as possible to production
• Minimize the blast radius.
• Have an emergency STOP!
• Careful with state that can’t be rolled back (corrupt or incorrect data)
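
The “emergency STOP” and rollback rules can be sketched as a small experiment runner. This is an illustration under assumptions: the names emergency_stop, injected_fault and run_experiment are hypothetical, and the inject, rollback and observe callables stand in for whatever fault and measurement a team actually uses.

```python
import threading
from contextlib import contextmanager

# Hypothetical stop switch; something outside the loop (operator, alarm) sets it.
emergency_stop = threading.Event()


@contextmanager
def injected_fault(inject, rollback):
    """Apply a fault and guarantee rollback, even if the experiment raises."""
    inject()
    try:
        yield
    finally:
        rollback()


def run_experiment(inject, rollback, observe, max_checks):
    """Run up to max_checks observations; stop early on the emergency stop."""
    completed = 0
    with injected_fault(inject, rollback):
        for _ in range(max_checks):
            if emergency_stop.is_set():
                break
            observe()
            completed += 1
    return completed
```

The rollback runs whether the loop finishes, the stop is triggered, or an observation raises. It cannot undo state that is not reversible, such as corrupted data, which is why the last rule of thumb still needs human care.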

Run Experiment
Failure injection
Start small and build confidence
• Application level (exceptions, errors, etc)
• Host level (services, processes, etc)
• Resource attacks (CPU, memory, IO, etc)
• Network attacks (dependencies, latency, packet loss, etc)
• AZ attack
• Region attack
• People attack
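
For the application level, the smallest possible injection is a wrapper around a function call. The sketch below is a hypothetical decorator, not a tool named in the talk: chaos, InjectedFailure and get_price are made-up names, and the failure rate and delay are example values.

```python
import functools
import random
import time


class InjectedFailure(Exception):
    """Raised by the wrapper instead of calling the real function."""


def chaos(failure_rate=0.0, delay_seconds=0.0, enabled=lambda: True, rng=random):
    """While enabled, delay every call and make a fraction of calls raise."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if enabled():
                if delay_seconds > 0:
                    time.sleep(delay_seconds)
                if rng.random() < failure_rate:
                    raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@chaos(failure_rate=0.2, rng=random.Random(42))
def get_price(item_id):
    # Stand-in for a real downstream call.
    return {"item": item_id, "price": 9.99}
```

The enabled callable is the kill switch: pointing it at a feature flag or environment check lets the injection be turned off without redeploying.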

Run Experiment
Canary deployment
Diagram: users reach a routing mechanism that splits traffic (10% / 90%) between the old application version and the new application version.
https://medium.com/@adhorn/immutable-infrastructure-21f6613e7a23
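
The routing mechanism can be approximated with a weighted random choice. This sketch assumes the usual canary convention that the smaller share (10%) goes to the new version; the slide text itself does not say which side gets which share, and the function names are hypothetical.

```python
import random


def choose_version(canary_fraction, rng):
    """Return "new" for roughly canary_fraction of calls, otherwise "old"."""
    return "new" if rng.random() < canary_fraction else "old"


def route_many(requests, canary_fraction, seed=0):
    """Route a number of simulated requests and count where they went."""
    rng = random.Random(seed)
    counts = {"old": 0, "new": 0}
    for _ in range(requests):
        counts[choose_version(canary_fraction, rng)] += 1
    return counts
```

Running route_many(10000, 0.10) sends about one request in ten to the new version, which bounds the blast radius of an experiment to that share of traffic.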

Verify
Quantifying the result of the experiment
• Time to detect?
• Time for notification? And escalation?
• Time to public notification?
• Time for graceful degradation to kick in?
• Time for self healing to happen?
• Time to recovery – partial and full?
• Time to all-clear and stable?
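
These questions become numbers once each event in the experiment is timestamped. A minimal sketch, assuming a hypothetical timeline with made-up event names and times:

```python
from datetime import datetime


def durations_since_injection(timeline):
    """Map each event name to seconds elapsed since the "fault_injected" event."""
    start = timeline["fault_injected"]
    return {
        name: (moment - start).total_seconds()
        for name, moment in timeline.items()
        if name != "fault_injected"
    }


# Hypothetical timeline recorded during one experiment.
timeline = {
    "fault_injected": datetime(2020, 1, 1, 10, 0, 0),
    "detected": datetime(2020, 1, 1, 10, 1, 30),
    "notified": datetime(2020, 1, 1, 10, 3, 0),
    "partial_recovery": datetime(2020, 1, 1, 10, 9, 0),
    "full_recovery": datetime(2020, 1, 1, 10, 20, 0),
}
report = durations_since_injection(timeline)
```

Tracking the same durations across repeated experiments shows whether detection and recovery are actually getting faster.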

Verify
Postmortems – COE (Correction of Errors)
• What happened?
• What was the impact on customers and your business?
• What were the contributing factors?
• What data do you have to support this?
  • especially metrics and graphs
• What lessons did you learn?
• What corrective actions are you taking?
  • Action items
  • Related items (trouble tickets etc.)

Verify
Dive deep on the causes
Question: Why did the associate damage his thumb?
Answer: Because his thumb got caught in the conveyor.
Question: Why did his thumb get caught in the conveyor?
Answer: Because he was chasing his bag, which was on a running conveyor belt.
Question: Why did he chase his bag?
Answer: Because he placed his bag on the conveyor, but it then turned on by surprise.
Question: Why was his bag on the conveyor?
Answer: Because he used the conveyor as a table.
Possible Conclusion: So, one likely cause of the associate’s damaged thumb is that he needed a table, there wasn’t one around, so he used a conveyor as a table.

Verify
Tools, Processes, Culture, Technology

Verify
Never Let a Good Crisis Go To Waste

Verify
Two rules to remember ALWAYS!

Verify
DON’T blame that one person …

Verify
There is no isolated ‘cause’ of an accident.

Improve
Fix it!

Improve
Audit
Weekly Operational Metrics Review
• Continuous inspection mechanism
• Maintains focus on operations
• Foundation of a healthy operations program
Typical Agenda (typically divided into fifteen-minute slots)
• Share successes and failings
• Action items follow up
• Review COEs
• Review key service metrics
• Identify new best practices

Bananas for Monkeys

INTRODUCE CHAOS ENGINEERING EARLY IN THE JOURNEY. DON’T WAIT!

Start simple and local!!
$ docker stop database
or anything else ;-)

DDoS yourself
$ wrk -t12 -c400 -d30s http://127.0.0.1/api/health

Burn CPU with Stress(-ng)
$ stress-ng --cpu 0 --cpu-method matrixprod -t 60s
https://kernel.ubuntu.com/~cking/stress-ng/

Adding latency to the network
$ tc qdisc add dev eth0 root netem delay 300ms

Blocking DNS resolution
$ iptables -A INPUT -p tcp -m tcp --dport 53 -j DROP

Other fun things to do
• Fill up disk
• Network packet loss (using traffic-shaping)
• Network packet corruption (using traffic-shaping)
• Kill random processes
• Detach (force) all EBS volumes
• Mess with config files
• …

Simian Army
“Simian Army to keep our cloud safe, secure, and highly available.” - 2011 Netflix blog
Set of scheduled agents:
• shuts down services randomly
• slows down performance
• checks conformity
• breaks an entire region
• integrates with Spinnaker (CI/CD)
https://github.com/Netflix/SimianArmy

https://chaosiq.io

The Chaos Toolkit
https://chaostoolkit.org
• Simplifying Adoption of Chaos Engineering
• An Open API to Chaos Engineering
• Open source extensions for
  • Infrastructure/Platform Fault Injections
  • Application Fault Injections
  • Observability
• Integrates easily into CI/CD pipelines

(SSM + ChaosToolkit)

https://github.com/gunnargrosch/failure-lambda

Fault Injection Queries for Amazon Aurora
SQL commands issued to simulate:
• A crash of the master instance or an Aurora Replica
• A failure of an Aurora Replica
• A disk failure
• Disk congestion
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Managing.FaultInjectionQueries.html

Fault Injection Queries for Amazon Aurora
ALTER SYSTEM SIMULATE percentage_of_failure PERCENT DISK FAILURE
    [ IN DISK index | NODE index ]
    FOR INTERVAL quantity { YEAR | QUARTER | MONTH | WEEK | DAY | HOUR | MINUTE | SECOND };

ToxiProxy
• HTTP API
• Built with automated testing in mind
• Not for production environments
• Fast
• Toxics for:
  • Timeouts, latency, connections and bandwidth limitation, etc.
• CLI
• Stable and well tested (used for 3 years at Shopify)
• Open Source: https://github.com/Shopify/toxiproxy
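
As a rough illustration of the HTTP API, the sketch below creates a proxy and attaches a latency toxic using only the Python standard library. It reflects my reading of the Toxiproxy README (a server on port 8474, POST /proxies, POST /proxies/{name}/toxics) and should be checked against the version you run. The proxy name, ports and the Redis upstream are hypothetical examples.

```python
import json
import urllib.request

TOXIPROXY_URL = "http://localhost:8474"  # assumption: default local server address


def proxy_payload(name, listen, upstream):
    """Body for POST /proxies: where to listen and where to forward."""
    return {"name": name, "listen": listen, "upstream": upstream, "enabled": True}


def latency_toxic_payload(latency_ms, jitter_ms=0):
    """Body for POST /proxies/{name}/toxics adding latency to replies."""
    return {
        "type": "latency",
        "stream": "downstream",
        "toxicity": 1.0,
        "attributes": {"latency": latency_ms, "jitter": jitter_ms},
    }


def post_json(path, payload):
    """POST a JSON body to the Toxiproxy server and return the decoded reply."""
    request = urllib.request.Request(
        TOXIPROXY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def add_redis_latency():
    # Hypothetical proxy in front of a local Redis; names and ports are examples.
    post_json("/proxies", proxy_payload("redis", "127.0.0.1:26379", "127.0.0.1:6379"))
    return post_json("/proxies/redis/toxics", latency_toxic_payload(300))
```

The application under test would then connect to the proxy's listen address instead of Redis directly, which makes the “What if Redis becomes slow?” hypothesis testable in an automated test run.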

Big challenges to chaos engineering
Mostly Cultural
• Starting is perceived as hard!
• No time or flexibility to simulate disasters.
• Teams already spend all of their time fixing things.
• Can be very political.
• Might force deep conversations.
• Teams can be deeply invested in a specific technical roadmap (micro-services) that chaos engineering tests show is not as resilient to failures as originally predicted.

Chaos Engineering won’t make your system more robust. People will.
