Slide 1

Slide 1 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Practical Chaos Engineering: Breaking things on purpose to make them more resilient against failure Adrian Hornsby Principal Evangelist Amazon Web Services

Slide 2

Slide 2 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Failures are a given and everything will eventually fail over time. Werner Vogels CTO – Amazon.com “ “

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Building confidence through testing Unit testing of components: • Tested in isolation to ensure function meets expectations. Functional testing of integrations: • Each execution path tested to assure expected results. Is it enough???

Slide 5

Slide 5 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Slide 6

Slide 6 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. GameDay at Amazon Creating Resiliency Through Destruction https://www.youtube.com/watch?v=zoz0ZjfrQ9s

Slide 7

Slide 7 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Chaos Engineering https://github.com/Netflix/SimianArmy

Slide 8

Slide 8 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Chaos engineering is NOT about breaking things randomly without a purpose, chaos engineering is about breaking things in a controlled environment and through well- planned experiments in order to build confidence in your application to withstand turbulent conditions.

Slide 9

Slide 9 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Break your systems on purpose. Find out their weaknesses and fix them before they break when least expected.

Slide 10

Slide 10 text

S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Chaos Engineering

Slide 11

Slide 11 text

Operations Infrastructure Application Software

Slide 12

Slide 12 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. https://aws.amazon.com/wellarchitected

Slide 13

Slide 13 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. https://medium.com/@adhorn

Slide 14

Slide 14 text

S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Chaos Engineering

Slide 15

Slide 15 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. STEADY STATE HYPOTHESIS RUN EXPERIMENT VERIFY IMPROVE Phases of Chaos Engineering

Slide 16

Slide 16 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. What is steady state? • ”normal” behavior of your system https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero

Slide 17

Slide 17 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. What is steady state? • ”normal” behavior of your system • Business + Ops Metric https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a

Slide 18

Slide 18 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. STEADY STATE HYPOTHESIS RUN EXPERIMENT VERIFY IMPROVE Phases of Chaos Engineering

Slide 19

Slide 19 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. What if…? “What if this load balancer breaks?” “What if Redis becomes slow?” “What if a host on Cassandra goes away?” ”What if latency increases by 300ms?” ”What if the database stops?” Make it everyone’s problem!

Slide 20

Slide 20 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. STEADY STATE HYPOTHESIS RUN EXPERIMENT VERIFY IMPROVE Phases of Chaos Engineering

Slide 21

Slide 21 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Failure injection • Start small & build confidence • Application level • Host level • Resource attacks (CPU, memory, …) • Network attacks (dependencies, latency, …) • AZ attack • Region attack • People attack

Slide 22

Slide 22 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Benefits of Failure Injection - Practicing failure contingency. - Understanding of the effects of real world failures. - Understanding the efficacy of the fault tolerance mechanisms. - Removing designs faults in the fault tolerance mechanisms. - Understanding the effectiveness of your observability. - Understanding the blast-radius of failures and help reduce it. - Understanding the weak links in the design – especially single points of failures. - Understanding failure propagation between system component. Learn to avoid cascading effect. - …

Slide 23

Slide 23 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Blast radius • How many customers? • What functionality? • How many locations?

Slide 24

Slide 24 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Rules of thumbs Have an emergency STOP or a good exit plan! Careful with state that can’t be rolled back (corrupt or incorrect data)

Slide 25

Slide 25 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. STEADY STATE HYPOTHESIS RUN EXPERIMENT VERIFY IMPROVE Phases of Chaos Engineering

Slide 26

Slide 26 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Quantifying the result of the experiment • Time to detect? • Time for notification? And escalation? • Time to public notification? • Time for graceful degradation to kick-in? • Time for self healing to happen? • Time to recovery – partial and full? • Time to all-clear and stable?

Slide 27

Slide 27 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. PostMortems – COE (Correction of Errors) The 5 WHYs Outage Because of … Because of … Because of … Because of …

Slide 28

Slide 28 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. More questions to ask • Can you clarify if there were any preceding events? • Why would they believe acting in this way was the best course of action to deliver the desired outcome? • Is there another failure mode that could present here? • What decisions or events prior to this made this work before? • Why stop there – are there places to dig deeper that could shine a light more on this? • Did others step in to help, to advise, or to intercede?

Slide 29

Slide 29 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. DON’T blame people!!!

Slide 30

Slide 30 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Rules to remember! There is no isolated ‘cause’ of an accident.

Slide 31

Slide 31 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. STEADY STATE HYPOTHESIS RUN EXPERIMENT VERIFY IMPROVE Phases of Chaos Engineering

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. How to do Failure Injection

Slide 34

Slide 34 text

S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Start simple and local!! $ docker stop 94a214bbeebd

Slide 35

Slide 35 text

S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Burn CPU with Stress(–ng) $ stress-ng --random 50 -t 60 --metrics-brief --times https://kernel.ubuntu.com/~cking/stress-ng/

Slide 36

Slide 36 text

S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Adding delay to the network $ tc qdisc add dev eth0 root netem delay 200ms 40ms 25% loss 15.3% 25% duplicate 1% corrupt 0.1% reorder 25% 50%

Slide 37

Slide 37 text

S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Blocks DNS resolution $ iptables -I OUTPUT -p udp -d --dport 53 -j DROP Get the DNS: $ cat /etc/resolv.conf search eu-west-1.compute.internal nameserver 172.31.0.2 $ dig showme.mynameserver.xyz

Slide 38

Slide 38 text

S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. How to DDoS yourself $ wrk -t12 -c400 -d30s http://127.0.0.1/api/health

Slide 39

Slide 39 text

S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Other fun things to do • Fill up disk • Network packet loss (using traffic-shaping) • Network packet corruption (using traffic-shaping) • Kills random processes • Detach (force) all EBS volumes • Mess with /etc/hosts

Slide 40

Slide 40 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. https://github.com/Netflix/SimianArmy Set of scheduled agent: • shuts down services randomly • slows down performances • checks conformity • breaks an entire region • Integrates with spinnaker (CI/CD)

Slide 41

Slide 41 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. ToxiProxy • HTTP API • Build for Automated testing in mind • Not for production environment • Fast • Toxics for: • Timeouts, latency, connections and bandwidth limitation, etc.. • CLI • Stable and well tested (used for 3 years at Shopify) • Open Source: https://github.com/Shopify/toxiproxy

Slide 42

Slide 42 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. https://atscaleconference.com/videos/resiliency-testing-with-toxiproxy/

Slide 43

Slide 43 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. https://github.com/asobti/kube-monkey

Slide 44

Slide 44 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Pumba https://github.com/alexei-led/pumba/

Slide 45

Slide 45 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. https://blog.thundra.io/chaos-test-your-lambda-functions-with-thundra

Slide 46

Slide 46 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. APIs J

Slide 47

Slide 47 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Fault Injection Queries for Amazon Aurora SQL commands issued to simulate: • A crash of the master instance or an Aurora Replica • A failure of an Aurora Replica • A disk failure • Disk congestion https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Managing.FaultInjectionQueries.html

Slide 48

Slide 48 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Fault Injection Queries for Amazon Aurora SQL commands issued to simulate: • A crash of the master instance or an Aurora Replica • A failure of an Aurora Replica • A disk failure • Disk congestion ALTER SYSTEM SIMULATE percentage_of_failure PERCENT DISK FAILURE [ IN DISK index | NODE index ] FOR INTERVAL quantity { YEAR | QUARTER | MONTH | WEEK | DAY | HOUR | MINUTE | SECOND };

Slide 49

Slide 49 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. ❯ aws lambda put-function-concurrency --function-name -- reserved-concurrent-executions 0

Slide 50

Slide 50 text

Injecting Chaos to AWS Lambda $ pip install chaos-lambda

Slide 51

Slide 51 text

https://github.com/adhorn/aws-lambda-chaos-injection

Slide 52

Slide 52 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. The Chaos Toolkit • Simplifying Adoption of Chaos Engineering • An Open API to Chaos Engineering • Open source extensions for • Infrastructure/Platform Fault Injections • Application Fault Injections • Observability • Integrates easily into CI/CD pipelines

Slide 53

Slide 53 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Slide 54

Slide 54 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Slide 55

Slide 55 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Slide 56

Slide 56 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Slide 57

Slide 57 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Injecting Chaos to Amazon EC2 using AWS System Manager https://medium.com/@adhorn/injecting-chaos-to-amazon-ec2-using-amazon-system-manager-ca95ee7878f5

Slide 58

Slide 58 text

https://github.com/adhorn/chaos-ssm-documents

Slide 59

Slide 59 text

SSM Run (send) Command $ aws ssm send-command --document-name "cpu-stress" --document-version "1" --targets '[{"Key":"InstanceIds","Values":[ " i-094c8367024633d96 ","i-04d0976f9fb658c23"]}]’ --parameters '{"duration":["60"],"cpu":["0"]}’ --timeout-seconds 600 --max-concurrency "50" --max-errors "0" --output-s3-bucket-name "adhorn-chaos-ssm-output" --region eu-west-1

Slide 60

Slide 60 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Big challenges to chaos engineering • Chaos Engineering won’t make your system more robust, People will. • Chaos Engineering won’t replace __all__ the rest (test, quality, …) • Chaos Engineering is NOT the only way to learn from failure • Rollbacks are HARD because of state. • Your systems will continue to fail, sorry.

Slide 61

Slide 61 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Big challenges to chaos engineering Mostly Cultural • no time or flexibility to simulate disasters. • teams already spending all of its time fixing things. • can be very political. • might force deep conversations. • deeply invested in a specific technical roadmap (micro-services) that chaos engineering tests show is not as resilient to failures as originally predicted.

Slide 62

Slide 62 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Changing culture takes time! Be patient…

Slide 63

Slide 63 text

No content

Slide 64

Slide 64 text

No content

Slide 65

Slide 65 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Thank you! Adrian Hornsby https://medium.com/@adhorn adhorn