Slide 1

Slide 1 text

Chaos Engineering on AWS
Building Resilient Systems
Adrian Hornsby, Principal Technical Evangelist, Amazon Web Services
© 2020, Amazon Web Services, Inc. or its Affiliates.

Slide 2

Slide 2 text

Challenges with distributed systems

Slide 3

Slide 3 text

Amazon, Twitter, Netflix

Slide 4

Slide 4 text

1. POST REQUEST: CLIENT puts request MESSAGE onto NETWORK.
2. DELIVER REQUEST: NETWORK delivers MESSAGE to SERVER.
3. VALIDATE REQUEST: SERVER validates MESSAGE.
4. UPDATE SERVER STATE: SERVER updates its state, if necessary, based on MESSAGE.
5. POST REPLY: SERVER puts reply REPLY onto NETWORK.
6. DELIVER REPLY: NETWORK delivers REPLY to CLIENT.
7. VALIDATE REPLY: CLIENT validates REPLY.
8. UPDATE CLIENT STATE: CLIENT updates its state, if necessary, based on REPLY.
https://aws.amazon.com/builders-library/challenges-with-distributed-systems
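The eight steps above can be sketched as a small simulation in which any step may fail independently. This is only an illustration under stated assumptions: the `Step` enum, the `round_trip` and `server_state_updated` helpers, and the uniform per-step failure rate are hypothetical and do not come from the slides or the Builders' Library article.

```python
import random
from enum import Enum


class Step(Enum):
    """The eight steps of a single request/reply round trip."""
    POST_REQUEST = 1
    DELIVER_REQUEST = 2
    VALIDATE_REQUEST = 3
    UPDATE_SERVER_STATE = 4
    POST_REPLY = 5
    DELIVER_REPLY = 6
    VALIDATE_REPLY = 7
    UPDATE_CLIENT_STATE = 8


def round_trip(failure_rate, rng):
    """Walk the steps in order; return the first step that fails, or None."""
    for step in Step:
        # Assumption: every step fails independently with the same probability.
        if rng.random() < failure_rate:
            return step
    return None


def server_state_updated(failed_step):
    """True if the server applied the request, whatever the client observed."""
    return failed_step is None or failed_step.value > Step.UPDATE_SERVER_STATE.value


if __name__ == "__main__":
    rng = random.Random(42)
    outcomes = [round_trip(0.01, rng) for _ in range(10_000)]
    failed = [step for step in outcomes if step is not None]
    ambiguous = [step for step in failed if server_state_updated(step)]
    print(f"failed round trips: {len(failed)}")
    print(f"failed, but the server state changed anyway: {len(ambiguous)}")
```

The second count is the interesting one: a failure in steps 5 to 8 looks like an error to the client even though the server already acted on the request.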

Slide 5

Slide 5 text

“Failures are a given and everything will eventually fail over time.”
Werner Vogels, CTO, Amazon.com

Slide 6

Slide 6 text

Distributed systems are hard because
• Errors can happen at any time, often in combination with other errors.
• Results of network operations can be unknown (succeeded, failed, or received but not processed).
• Problems occur at all logical levels.
• Problems get worse at higher levels of the system, due to recursion.
• Bugs often show up long after they are deployed to a system.
• Bugs can spread across an entire system.
• Many problems derive from the laws of physics and can’t be changed.

Slide 7

Slide 7 text

Is traditional testing enough?
Testing: verifying a KNOWN condition, e.g. assert(A == B)
Unit testing of components:
• Tested in isolation to ensure function meets expectations.
Functional testing of integrations:
• Each execution path tested to assure expected results.

Slide 9

Slide 9 text

The Scientific Method

Slide 10

Slide 10 text

The Scientific Method
• Make Observations
• Think of Interesting Questions
• Formulate Hypotheses
• Develop Testable Predictions
• Gather Data to Test Predictions
• Develop General Theories
• Refine, Alter, Expand or Reject Hypotheses

Slide 11

Slide 11 text

Chaos Engineering: A scientific method
STEADY STATE → HYPOTHESIS → RUN EXPERIMENT → VERIFY → IMPROVE

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

Chaos Engineering formalized
principlesofchaos.org

Slide 14

Slide 14 text

CHAOS ENGINEERING? THAT’S MY THING!!

Slide 15

Slide 15 text

Chaos engineering is NOT about breaking things randomly without a purpose.

Slide 16

Slide 16 text

Chaos engineering is about breaking things in a controlled environment, through well-planned experiments, in order to build confidence in your application's ability to withstand turbulent conditions.

Slide 17

Slide 17 text

“CHAOS DOESN’T CAUSE PROBLEMS. IT REVEALS THEM.” - Nora Jones

Slide 18

Slide 18 text

✓ Building confidence against failure

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

GameDay at Amazon
Jesse Robbins, “Master of Disaster”
• A volunteer firefighter
• Created GameDay in 2006 to purposefully create regular major failures.
• Founded Chef and the Velocity Web Performance & Operations Conference.

Slide 21

Slide 21 text

GameDay at Amazon
Jesse Robbins, “Master of Disaster”
• Test, train and prepare Amazon systems, software, and people to respond to a disaster.
• Increase Amazon retail website resiliency by purposely injecting failures into critical systems.

Slide 22

Slide 22 text

Jesse Robbins, mid-2000s
https://www.youtube.com/watch?v=zoz0ZjfrQ9s

Slide 23

Slide 23 text

Find weaknesses and fix them before they break when least expected.

Slide 24

Slide 24 text

✓ Building confidence against failures
✓ Reducing Skill Atrophy

Slide 25

Slide 25 text

Training is not a one-time occurrence. It should be an ongoing process of expanding knowledge, exercising skills, and passing on these abilities for the benefit of the organization.

Slide 26

Slide 26 text

✓ Building confidence against failures
✓ Reducing Skill Atrophy
✓ Improving Recovery Time

Slide 27

Slide 27 text

Can you afford to lose $100,000?
That is the average cost of one hour of downtime reported by an ITIC study this year.
Source: Information Technology Intelligence Consulting Research

Slide 28

Slide 28 text

System Availability
$$\text{Availability} = \frac{\text{Normal Operation Time}}{\text{Total Time}} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$
MTTR: Mean Time To Repair
MTBF: Mean Time Between Failures
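As a worked illustration of the availability formula, the sketch below shows how shortening MTTR raises availability even when MTBF stays the same. The function names and the sample figures (720 hours between failures, 4 hours versus 1 hour to repair) are assumptions chosen for the example, not numbers from the talk.

```python
def availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_per_year_hours(avail):
    """Expected hours of downtime in a 365-day year at the given availability."""
    return (1.0 - avail) * 365 * 24


if __name__ == "__main__":
    # Hypothetical service: one failure every 30 days (720 hours) on average.
    for mttr in (4.0, 1.0):
        a = availability(720.0, mttr)
        print(f"MTTR {mttr:>4} h -> availability {a:.4%}, "
              f"about {downtime_per_year_hours(a):.1f} h of downtime per year")
```

This is why improving recovery time sits next to building confidence in the list of benefits: practicing recovery reduces MTTR directly.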

Slide 29

Slide 29 text

✓ Building confidence against failures
✓ Reducing Skill Atrophy
✓ Improving Recovery Time
✓ And a lot more …

Slide 30

Slide 30 text

Prerequisites to chaos engineering

Slide 31

Slide 31 text

✓ People
✓ Operations
✓ Application
✓ Network & Data
✓ Infrastructure

Slide 32

Slide 32 text

Operations, Infrastructure, Application, Software

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

https://medium.com/@adhorn

Slide 35

Slide 35 text

https://aws.amazon.com/wellarchitected

Slide 36

Slide 36 text

More Information
https://aws.amazon.com/builders-library

Slide 37

Slide 37 text

Don’t break things in prod before you have done your homework!

Slide 38

Slide 38 text

Phases of Chaos Engineering

Slide 39

Slide 39 text

Phases of Chaos Engineering
STEADY STATE → HYPOTHESIS → RUN EXPERIMENT → VERIFY → IMPROVE

Slide 40

Slide 40 text

Steady State
What is it?
• “Normal” behavior of your system
• Not the internal attributes of the system (CPU, memory, etc.)
• Operational metrics tied to customer experience yield the best results.
The steady state shifts when an unmitigated failure triggers an unexpected problem; such a deviation should cause the chaos experiment to be aborted.
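The abort-on-deviation idea can be sketched as a guard around the experiment. Everything in this sketch is hypothetical: the `SteadyState` class, the `run_experiment` helper, and the tolerance band are illustrative names and values, and the probe stands in for whatever customer-facing metric an organization actually tracks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SteadyState:
    """A customer-facing metric and the band in which it counts as normal."""
    name: str
    probe: Callable[[], float]  # returns the current value of the metric
    lower: float
    upper: float

    def holds(self) -> bool:
        return self.lower <= self.probe() <= self.upper


class ExperimentAborted(Exception):
    """Raised when the steady state is violated before or during injection."""


def run_experiment(steady_state, inject, rollback, checks=5):
    """Inject a fault, re-check the steady state, and always roll back."""
    if not steady_state.holds():
        raise ExperimentAborted("steady state not met before injection")
    inject()
    try:
        for _ in range(checks):
            if not steady_state.holds():
                raise ExperimentAborted(
                    f"{steady_state.name} left its tolerance band")
    finally:
        # Emergency stop: the rollback runs whether or not the hypothesis held.
        rollback()
    return True


if __name__ == "__main__":
    metric = {"success_rate": 0.99}  # stand-in for a real monitoring query
    state = SteadyState("checkout success rate",
                        lambda: metric["success_rate"], 0.95, 1.0)
    print(run_experiment(state, inject=lambda: None, rollback=lambda: None))
```

In a real setup the loop would wait between checks and the probe would query a monitoring system; both are left out to keep the sketch self-contained.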

Slide 41

Slide 41 text

Steady State
What is steady state?
“Normal” behavior of your system

Slide 42

Slide 42 text

Steady State
What is steady state?
• Business + Ops Metric
https://medium.com/netflix-techblog/

Slide 43

Slide 43 text

Phases of Chaos Engineering
STEADY STATE → HYPOTHESIS → RUN EXPERIMENT → VERIFY → IMPROVE

Slide 44

Slide 44 text

Hypothesis
What if…?
• “What if this load balancer breaks?”
• “What if Redis becomes slow?”
• “What if a host on Cassandra goes away?”
• “What if latency increases by 300ms?”
• “What if the database stops?”
Make it everyone’s problem!

Slide 45

Slide 45 text

Hypothesis

Slide 46

Slide 46 text

Hypothesis
IF YOU HAVEN’T VERIFIED IT, IT’S PROBABLY BROKEN.

Slide 47

Slide 47 text

Hypothesis
Where to start?
• Pick hypothesis
• Scope the experiment
• Identify metrics
• Notify the organization
• Think Blast radius!!!
• Simulating AZ failure
• Injecting latency between services
• Randomly throwing exceptions.
• Maxing out CPU to verify scaling policies.
• Database failovers & backups

Slide 48

Slide 48 text

Hypothesis
Disclaimer! Don’t make a hypothesis that you know will break you!

Slide 49

Slide 49 text

Hypothesis
Rules of thumb
• Start very small
• As close as possible to production
• Minimize the blast radius.
• Have an emergency STOP!
• Be careful with state that can’t be rolled back (corrupt or incorrect data)

Slide 50

Slide 50 text

Phases of Chaos Engineering
STEADY STATE → HYPOTHESIS → RUN EXPERIMENT → VERIFY → IMPROVE

Slide 51

Slide 51 text

Run Experiment
Failure injection
Start small and build confidence
• Application level (exceptions, errors, etc.)
• Host level (services, processes, etc.)
• Resource attacks (CPU, memory, IO, etc.)
• Network attacks (dependencies, latency, packet loss, etc.)
• AZ attack
• Region attack
• People attack
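The smallest entry in the list above, application-level injection, can be sketched as a decorator that makes a function fail or slow down for a fraction of calls. The `with_injected_fault` name and its parameters are hypothetical; this is not the API of any library mentioned in the talk, only a minimal illustration of the idea.

```python
import functools
import random
import time


def with_injected_fault(rate, exception_type=None, delay_s=0.0, rng=random):
    """Wrap a function so that a fraction `rate` of calls is delayed and/or fails.

    All parameter names here are illustrative assumptions.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < rate:
                if delay_s:
                    time.sleep(delay_s)  # simulate a slow dependency
                if exception_type is not None:
                    raise exception_type("injected fault")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@with_injected_fault(rate=0.1, exception_type=ConnectionError)
def fetch_profile(user_id):
    # Stand-in for a call to a downstream service.
    return {"id": user_id, "name": "example"}


if __name__ == "__main__":
    failures = 0
    for i in range(1000):
        try:
            fetch_profile(i)
        except ConnectionError:
            failures += 1
    print(f"{failures} of 1000 calls failed by design")
```

Keeping the rate low and the decorator on a single function is one way to start small and limit the blast radius before moving on to host, network, or AZ level attacks.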

Slide 52

Slide 52 text

Run Experiment
Canary deployment
[Figure: users reach a routing mechanism that splits traffic 10% / 90% between the old application version and the new application version]
https://medium.com/@adhorn/immutable-infrastructure-21f6613e7a23
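A routing mechanism of this kind can be sketched with a stable hash of the user identifier, so that each user consistently lands in the same group. This sketch assumes the smaller 10% share goes to the canary group that receives the experiment; the `bucket` and `route` functions and the group labels are hypothetical.

```python
import hashlib


def bucket(user_id: str, buckets: int = 100) -> int:
    """Map a user id to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def route(user_id: str, canary_percent: int = 10) -> str:
    """Send roughly `canary_percent`% of users to the canary group."""
    return "canary" if bucket(user_id) < canary_percent else "baseline"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    canary = sum(1 for u in users if route(u) == "canary")
    print(f"{canary / len(users):.1%} of users routed to the canary group")
```

Hashing rather than random choice keeps a user's experience consistent across requests, and setting `canary_percent` to 0 acts as an immediate stop switch.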

Slide 53

Slide 53 text

Phases of Chaos Engineering
STEADY STATE → HYPOTHESIS → RUN EXPERIMENT → VERIFY → IMPROVE

Slide 54

Slide 54 text

Verify
Quantifying the result of the experiment
• Time to detect?
• Time for notification? And escalation?
• Time to public notification?
• Time for graceful degradation to kick in?
• Time for self-healing to happen?
• Time to recovery, partial and full?
• Time to all-clear and stable?
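The questions above become numbers once each event of the experiment is timestamped. The sketch below assumes a simple timeline dictionary with a `fault_injected` entry; the event names and sample timestamps are illustrative, not a schema from the talk.

```python
from datetime import datetime, timedelta


def elapsed_since_injection(timeline):
    """Return how long after the fault injection each recorded event occurred."""
    start = timeline["fault_injected"]
    return {event: at - start
            for event, at in timeline.items()
            if event != "fault_injected"}


if __name__ == "__main__":
    t0 = datetime(2020, 1, 1, 12, 0, 0)
    # Hypothetical timestamps collected during one experiment.
    timeline = {
        "fault_injected": t0,
        "detected": t0 + timedelta(seconds=45),
        "on_call_notified": t0 + timedelta(minutes=2),
        "partial_recovery": t0 + timedelta(minutes=6),
        "full_recovery": t0 + timedelta(minutes=11),
        "all_clear": t0 + timedelta(minutes=20),
    }
    for event, delta in elapsed_since_injection(timeline).items():
        print(f"time to {event}: {delta.total_seconds():.0f}s")
```

Recording the same events for every experiment makes it possible to compare runs and to see whether detection and recovery times improve over time.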

Slide 55

Slide 55 text

Verify
Postmortems: COE (Correction of Errors)
• What happened?
• What was the impact on customers and your business?
• What were the contributing factors?
• What data do you have to support this?
  • especially metrics and graphs
• What lessons did you learn?
• What corrective actions are you taking?
  • Action items
  • Related items (trouble tickets, etc.)

Slide 56

Slide 56 text

Verify
Dive deep on the causes
Question: Why did the associate damage his thumb?
Answer: Because his thumb got caught in the conveyor.
Question: Why did his thumb get caught in the conveyor?
Answer: Because he was chasing his bag, which was on a running conveyor belt.
Question: Why did he chase his bag?
Answer: Because he placed his bag on the conveyor, which then turned on by surprise.
Question: Why was his bag on the conveyor?
Answer: Because he used the conveyor as a table.
Possible conclusion: One likely cause of the associate’s damaged thumb is that he needed a table, there wasn’t one around, so he used a conveyor as a table.

Slide 57

Slide 57 text

Verify
Tools, Processes, Culture, Technology

Slide 58

Slide 58 text

Verify
Never Let a Good Crisis Go To Waste

Slide 59

Slide 59 text

Verify
Two rules to remember ALWAYS!

Slide 60

Slide 60 text

Verify
DON’T blame that one person …

Slide 61

Slide 61 text

Verify
There is no isolated ‘cause’ of an accident.

Slide 62

Slide 62 text

Phases of Chaos Engineering
STEADY STATE → HYPOTHESIS → RUN EXPERIMENT → VERIFY → IMPROVE

Slide 63

Slide 63 text

Improve
Fix it!

Slide 64

Slide 64 text

Improve
Audit
Weekly Operational Metrics Review
• Continuous inspection mechanism
• Maintains focus on operations
• Foundation of a healthy operations program
Typical agenda (usually divided into fifteen-minute slots)
• Share successes and failings
• Action items follow-up
• Review COEs
• Review key service metrics
• Identify new best practices

Slide 65

Slide 65 text

Bananas for Monkeys

Slide 66

Slide 66 text

INTRODUCE CHAOS ENGINEERING EARLY IN THE JOURNEY. DON’T WAIT!

Slide 67

Slide 67 text

Eleanor
https://github.com/adhorn/eleanor

Slide 68

Slide 68 text

Start simple and local!!
$ docker stop database
or anything else ;-)

Slide 69

Slide 69 text

Demo

Slide 70

Slide 70 text

DDoS yourself
$ wrk -t12 -c400 -d30s http://127.0.0.1/api/health

Slide 71

Slide 71 text

Demo

Slide 72

Slide 72 text

Burn CPU with stress(-ng)
$ stress-ng --cpu 0 --cpu-method matrixprod -t 60s
https://kernel.ubuntu.com/~cking/stress-ng/

Slide 73

Slide 73 text

Demo

Slide 74

Slide 74 text

Adding latency to the network
$ tc qdisc add dev eth0 root netem delay 300ms

Slide 75

Slide 75 text

Block DNS resolution
$ iptables -A INPUT -p tcp -m tcp --dport 53 -j DROP
$ iptables -A INPUT -p udp -m udp --dport 53 -j DROP

Slide 76

Slide 76 text

Other fun things to do
• Fill up disk
• Network packet loss (using traffic-shaping)
• Network packet corruption (using traffic-shaping)
• Kill random processes
• Force-detach all EBS volumes
• Mess with config files
• …

Slide 77

Slide 77 text

Simian Army
“Simian Army to keep our cloud safe, secure, and highly available.” - 2011 Netflix blog
Set of scheduled agents that:
• shut down services randomly
• slow down performance
• check conformity
• break an entire region
• integrate with Spinnaker (CI/CD)
https://github.com/Netflix/SimianArmy

Slide 78

Slide 78 text

https://chaosiq.io

Slide 79

Slide 79 text

The Chaos Toolkit
https://chaostoolkit.org
• Simplifying Adoption of Chaos Engineering
• An Open API to Chaos Engineering
• Open source extensions for
  • Infrastructure/Platform Fault Injections
  • Application Fault Injections
  • Observability
• Integrates easily into CI/CD pipelines

Slide 80

Slide 80 text

No content

Slide 81

Slide 81 text

Demo

Slide 82

Slide 82 text

No content

Slide 83

Slide 83 text

No content

Slide 84

Slide 84 text

Demo

Slide 85

Slide 85 text

Verica.io

Slide 86

Slide 86 text

SDKs :-)
https://github.com/adhorn/aws-chaos-scripts

Slide 87

Slide 87 text

Demo

Slide 88

Slide 88 text

Injecting Chaos into Amazon EC2 using AWS Systems Manager

Slide 89

Slide 89 text

https://github.com/adhorn/chaos-ssm-documents

Slide 90

Slide 90 text

Demo

Slide 91

Slide 91 text

Demo (SSM + ChaosToolkit)

Slide 92

Slide 92 text

Injecting Chaos into AWS Lambda
$ pip install chaos-lambda

Slide 93

Slide 93 text

https://github.com/adhorn/aws-lambda-chaos-injection

Slide 94

Slide 94 text

Demo

Slide 95

Slide 95 text

https://github.com/gunnargrosch/failure-lambda

Slide 96

Slide 96 text

Fault Injection Queries for Amazon Aurora
SQL commands issued to simulate:
• A crash of the master instance or an Aurora Replica
• A failure of an Aurora Replica
• A disk failure
• Disk congestion
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Managing.FaultInjectionQueries.html

Slide 97

Slide 97 text

Fault Injection Queries for Amazon Aurora
SQL commands issued to simulate:
• A crash of the master instance or an Aurora Replica
• A failure of an Aurora Replica
• A disk failure
• Disk congestion
ALTER SYSTEM SIMULATE percentage_of_failure PERCENT DISK FAILURE
    [ IN DISK index | NODE index ]
    FOR INTERVAL quantity { YEAR | QUARTER | MONTH | WEEK | DAY | HOUR | MINUTE | SECOND };

Slide 98

Slide 98 text

Toxiproxy
• HTTP API
• Built with automated testing in mind
• Not for production environments
• Fast
• Toxics for:
  • Timeouts, latency, connections and bandwidth limitation, etc.
• CLI
• Stable and well tested (used for 3 years at Shopify)
• Open source: https://github.com/Shopify/toxiproxy

Slide 99

Slide 99 text

https://atscaleconference.com/videos/resiliency-testing-with-toxiproxy/

Slide 100

Slide 100 text

Challenges of Chaos Engineering

Slide 101

Slide 101 text

Mister Chaos
https://xkcd.com/1428/

Slide 102

Slide 102 text

Big challenges to chaos engineering
Mostly cultural
• Starting is perceived as hard!
• No time or flexibility to simulate disasters.
• Teams already spend all of their time fixing things.
• Can be very political.
• Might force deep conversations.
• Teams are deeply invested in a specific technical roadmap (micro-services) that chaos engineering tests show is not as resilient to failures as originally predicted.

Slide 103

Slide 103 text

Chaos Engineering won’t make your system more robust. People will.

Slide 104

Slide 104 text

Thank you!
Adrian Hornsby
https://medium.com/@adhorn
@adhorn