Building Resilient Applications using Chaos Engineering on AWS

Slide 1

Slide 1 text

© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Building Resilient Applications using Chaos Engineering on AWS
Adrian Hornsby, Principal Technical Evangelist, Amazon Web Services

Slide 2

Slide 2 text

Can you afford to lose $100,000?

Slide 3

Slide 3 text

Because that is the average cost of one hour of downtime, as reported by an ITIC study this year.
Source: Information Technology Intelligence Consulting Research

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

Jesse Robbins, “Master of Disaster”: GameDay at Amazon
• A volunteer firefighter
• Created GameDay in 2006 to purposefully create regular major failures.
• Founded Chef and the Velocity Web Performance & Operations Conference.

Slide 6

Slide 6 text

Rise of the monkeys
“Simian Army to keep our cloud safe, secure, and highly available.” - 2011 Netflix blog
Set of scheduled agents that:
• shut down services randomly
• slow down performance
• check conformity
• break an entire region
• integrate with Spinnaker (CI/CD)
https://github.com/Netflix/SimianArmy

Slide 7

Slide 7 text

Chaos Engineering formalized by Netflix (mid-2015)
principlesofchaos.org

Slide 8

Slide 8 text

Chaos engineering is NOT about breaking things randomly without a purpose. It is about breaking things in a controlled environment, through well-planned experiments, in order to build confidence in your application's ability to withstand turbulent conditions.

Slide 9

Slide 9 text

Chaos Engineering: a scientific method
STEADY STATE, HYPOTHESIS, RUN EXPERIMENT, VERIFY, IMPROVE

Slide 10

Slide 10 text

Break your systems on purpose. Find out their weaknesses and fix them before they break when least expected.

Slide 11

Slide 11 text

Slide 12

Slide 12 text

Operations Infrastructure Application Software

Slide 13

Slide 13 text

https://aws.amazon.com/wellarchitected

Slide 14

Slide 14 text

More Information
https://aws.amazon.com/builders-library

Slide 15

Slide 15 text

Slide 16

Slide 16 text

Phases of Chaos Engineering
STEADY STATE, HYPOTHESIS, RUN EXPERIMENT, VERIFY, IMPROVE

Slide 17

Slide 17 text

What is steady state?
• “Normal” behavior of your system
https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero

Slide 18

Slide 18 text

What is steady state?
• Business + ops metrics
https://medium.com/netflix-techblog/
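A steady-state check can be as simple as comparing one metric during the experiment against its baseline. The sketch below is only an illustration: the choice of p99 latency, the 10% tolerance, and the function names are assumptions, not something the slides prescribe. In practice a business metric (orders or stream starts per second, for example) is usually checked alongside the ops metric.

```python
def percentile(samples, pct):
    """Nearest-rank percentile (pct is an integer 0-100) of a non-empty list."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def within_steady_state(baseline_ms, observed_ms, tolerance=0.10):
    """True if the observed p99 latency is at most `tolerance` above baseline.

    `tolerance` is a hypothetical threshold; pick one that matches your SLOs.
    """
    baseline_p99 = percentile(baseline_ms, 99)
    observed_p99 = percentile(observed_ms, 99)
    return observed_p99 <= baseline_p99 * (1 + tolerance)
```

Usage: collect latency samples before the experiment as `baseline_ms`, collect them again while the fault is active, and treat a `False` result as a signal that the hypothesis did not hold.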

Slide 19

Slide 19 text

Phases of Chaos Engineering
STEADY STATE, HYPOTHESIS, RUN EXPERIMENT, VERIFY, IMPROVE

Slide 20

Slide 20 text

What if…?
“What if this load balancer breaks?”
“What if Redis becomes slow?”
“What if a host on Cassandra goes away?”
“What if latency increases by 300ms?”
“What if the database stops?”
Make it everyone’s problem!

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

Phases of Chaos Engineering
STEADY STATE, HYPOTHESIS, RUN EXPERIMENT, VERIFY, IMPROVE

Slide 23

Slide 23 text

Failure injection
• Start small & build confidence
• Application level (exceptions, errors, …)
• Host level (services, processes, …)
• Resource attacks (CPU, memory, IO, …)
• Network attacks (dependencies, latency, packet loss, …)
• AZ attack
• Region attack
• People attack

Slide 24

Slide 24 text

Rules of thumb
• Start very small
• As close as possible to production
• Minimize the blast radius
• Have an emergency STOP!
• Be careful with state that can’t be rolled back (corrupt or incorrect data)
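The emergency STOP can be built into the experiment runner itself. The following is a minimal sketch under stated assumptions: `inject`, `rollback`, and `steady_state_ok` are hypothetical callables you supply, and the return strings are illustrative. It does not cover state that cannot be rolled back.

```python
import time


def run_experiment(inject, rollback, steady_state_ok,
                   duration_s=60.0, poll_s=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Inject a fault, watch the steady state, and always roll back.

    Returns "skipped" if the system is unhealthy before starting,
    "aborted" if the steady state is lost during the experiment,
    and "completed" otherwise.
    """
    # Never start an experiment on a system that is already degraded.
    if not steady_state_ok():
        return "skipped"

    inject()
    try:
        deadline = clock() + duration_s
        while clock() < deadline:
            # Emergency stop: bail out as soon as the steady state is lost.
            if not steady_state_ok():
                return "aborted"
            sleep(poll_s)
        return "completed"
    finally:
        # Runs on completion, abort, and unexpected exceptions alike.
        rollback()
```

Putting the rollback in `finally` is the design choice that matters here: the fault is removed even if the monitoring code itself raises.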

Slide 25

Slide 25 text

Blast radius
• How many customers?
• What functionality?
• How many locations?
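One way to bound the "how many customers?" question is to expose only a fixed percentage of customers to the fault, chosen deterministically so the same customers are affected for the whole experiment. This is a sketch, not a prescribed method; the function name and the hashing scheme are assumptions.

```python
import hashlib


def in_blast_radius(customer_id, experiment_name, percent):
    """Return True if this customer falls inside the experiment's blast radius.

    Hashing the experiment name together with the customer ID gives a stable
    bucket in 0..99, so different experiments hit different customers.
    """
    key = f"{experiment_name}:{customer_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent
```

Usage: start with `percent=1`, verify the results, and only then widen the radius.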

Slide 26

Slide 26 text

Phases of Chaos Engineering
STEADY STATE, HYPOTHESIS, RUN EXPERIMENT, VERIFY, IMPROVE

Slide 27

Slide 27 text

Quantifying the result of the experiment
• Time to detect?
• Time for notification? And escalation?
• Time to public notification?
• Time for graceful degradation to kick in?
• Time for self-healing to happen?
• Time to recovery – partial and full?
• Time to all-clear and stable?
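These questions become numbers once each milestone is timestamped. A minimal sketch, assuming you record a timeline keyed by event name; the event names used here (`fault_injected`, `detected`, and so on) are hypothetical.

```python
from datetime import datetime


def experiment_timings(timeline):
    """Seconds elapsed from fault injection to every other recorded event.

    `timeline` maps event names to datetime objects and must contain
    a "fault_injected" entry.
    """
    start = timeline["fault_injected"]
    return {
        name: (timestamp - start).total_seconds()
        for name, timestamp in timeline.items()
        if name != "fault_injected"
    }
```

Usage: record `detected`, `notified`, `recovered`, and `all_clear` timestamps during the experiment, then compare the resulting durations from run to run.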

Slide 28

Slide 28 text

Postmortems – COE (Correction of Errors)
• What happened?
• What was the impact on customers and your business?
• What were the contributing factors?
• What data do you have to support this?
  • Especially metrics and graphs
• What lessons did you learn?
• What corrective actions are you taking?
  • Action items
  • Related items (trouble tickets etc.)

Slide 29

Slide 29 text

Tools Processes Culture Technology

Slide 30

Slide 30 text

Two rules to remember ALWAYS!

Slide 31

Slide 31 text

DON’T blame that one person …

Slide 32

Slide 32 text

There is no isolated ‘cause’ of an accident.

Slide 33

Slide 33 text

Phases of Chaos Engineering
STEADY STATE, HYPOTHESIS, RUN EXPERIMENT, VERIFY, IMPROVE

Slide 34

Slide 34 text

Fix it!

Slide 35

Slide 35 text

Slide 36

Slide 36 text

Start simple and local!
$ docker stop 94a214bbeebd

Slide 37

Slide 37 text

DDoS yourself
$ wrk -t12 -c400 -d30s http://127.0.0.1/api/health

Slide 38

Slide 38 text

Burn CPU with stress(-ng)
$ stress-ng --cpu 0 --cpu-method matrixprod -t 60s
https://kernel.ubuntu.com/~cking/stress-ng/

Slide 39

Slide 39 text

Adding latency to the network
$ tc qdisc add dev eth0 root netem delay 300ms

Slide 40

Slide 40 text

Block DNS resolution (outgoing queries to port 53, UDP and TCP)
$ iptables -A OUTPUT -p udp --dport 53 -j DROP
$ iptables -A OUTPUT -p tcp --dport 53 -j DROP

Slide 41

Slide 41 text

Other fun things to do
• Fill up disk
• Network packet loss (using traffic shaping)
• Network packet corruption (using traffic shaping)
• Kill random processes
• Force-detach all EBS volumes
• Mess with /etc/hosts

Slide 42

Slide 42 text

Simian Army
Set of scheduled agents that:
• shut down services randomly
• slow down performance
• check conformity
• break an entire region
• integrate with Spinnaker (CI/CD)
https://github.com/Netflix/SimianArmy

Slide 43

Slide 43 text

The Chaos Toolkit
• Simplifying adoption of Chaos Engineering
• An open API to Chaos Engineering
• Open source extensions for
  • Infrastructure/platform fault injections
  • Application fault injections
  • Observability
• Integrates easily into CI/CD pipelines

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

Toxiproxy
• HTTP API
• Built with automated testing in mind
• Not for production environments
• Fast
• Toxics for:
  • Timeouts, latency, connections, bandwidth limitation, etc.
• CLI
• Stable and well tested (used for 3 years at Shopify)
• Open source: https://github.com/Shopify/toxiproxy
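To illustrate the HTTP API, the sketch below builds the two requests needed to put a proxy in front of a dependency and add a latency toxic to it. It assumes a Toxiproxy server on its default API address (localhost:8474); the helper names and the Redis addresses in the usage note are made up for illustration. The functions only build the requests, so nothing is sent until you pass them to `urllib.request.urlopen`.

```python
import json
import urllib.request

TOXIPROXY_API = "http://localhost:8474"  # assumed: Toxiproxy's default API address


def _post(url, body):
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_proxy_request(name, listen, upstream, api=TOXIPROXY_API):
    """Request that creates a proxy listening on `listen` and forwarding to `upstream`."""
    body = {"name": name, "listen": listen, "upstream": upstream, "enabled": True}
    return _post(f"{api}/proxies", body)


def add_latency_request(proxy_name, latency_ms, jitter_ms=0, api=TOXIPROXY_API):
    """Request that adds a latency toxic to responses flowing back to the client."""
    body = {
        "type": "latency",
        "stream": "downstream",
        "toxicity": 1.0,
        "attributes": {"latency": latency_ms, "jitter": jitter_ms},
    }
    return _post(f"{api}/proxies/{proxy_name}/toxics", body)
```

Usage: `urllib.request.urlopen(create_proxy_request("redis", "127.0.0.1:26379", "127.0.0.1:6379"))`, then `urllib.request.urlopen(add_latency_request("redis", 300))`, and point the application under test at port 26379 instead of 6379.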

Slide 50

Slide 50 text

https://atscaleconference.com/videos/resiliency-testing-with-toxiproxy/

Slide 51

Slide 51 text

Pumba https://github.com/alexei-led/pumba/

Slide 52

Slide 52 text

Fault Injection Queries for Amazon Aurora
SQL commands issued to simulate:
• A crash of the master instance or an Aurora Replica
• A failure of an Aurora Replica
• A disk failure
• Disk congestion
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Managing.FaultInjectionQueries.html

Slide 53

Slide 53 text

Fault Injection Queries for Amazon Aurora
SQL commands issued to simulate instance crashes, replica failures, disk failures, and disk congestion. Syntax for a disk failure:
ALTER SYSTEM SIMULATE percentage_of_failure PERCENT DISK FAILURE
    [ IN DISK index | NODE index ]
    FOR INTERVAL quantity { YEAR | QUARTER | MONTH | WEEK | DAY | HOUR | MINUTE | SECOND };
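As a concrete instance of the syntax above, a small helper can build the statement with the optional `IN DISK | NODE` clause omitted. This is a sketch: the function name and validation rules are assumptions, and it only produces the SQL string. Running it requires a connection to an Aurora MySQL cluster and the appropriate privileges.

```python
INTERVAL_UNITS = {"YEAR", "QUARTER", "MONTH", "WEEK", "DAY", "HOUR", "MINUTE", "SECOND"}


def disk_failure_sql(percent, quantity, unit="SECOND"):
    """Build an Aurora MySQL disk-failure fault injection statement."""
    unit = unit.upper()
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    if unit not in INTERVAL_UNITS:
        raise ValueError(f"unsupported interval unit: {unit}")
    return (
        f"ALTER SYSTEM SIMULATE {int(percent)} PERCENT DISK FAILURE "
        f"FOR INTERVAL {int(quantity)} {unit};"
    )
```

Usage: `disk_failure_sql(25, 1, "minute")` yields a statement that simulates a 25 percent disk failure for one minute; keep the interval short while you are still learning how the application reacts.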

Slide 54

Slide 54 text

APIs :)

Slide 55

Slide 55 text

❯ aws lambda put-function-concurrency --function-name <function-name> --reserved-concurrent-executions 0
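Setting reserved concurrency to 0 throttles every invocation of the function, so the experiment needs a reliable way to undo it. The sketch below wraps the same API call in a context manager. It assumes a boto3 Lambda client is passed in, and that the function had no reserved concurrency configured beforehand, since the cleanup step removes the setting entirely. The name `throttled` is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def throttled(lambda_client, function_name):
    """Throttle all invocations of a Lambda function for the duration of the block."""
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=0,
    )
    try:
        yield
    finally:
        # Assumption: no reserved concurrency was set before the experiment,
        # so deleting the setting restores the original state.
        lambda_client.delete_function_concurrency(FunctionName=function_name)
```

Usage: `with throttled(boto3.client("lambda"), "my-function"): ...` and observe how callers behave inside the block. `my-function` is a placeholder name.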

Slide 56

Slide 56 text

Injecting Chaos to AWS Lambda
$ pip install chaos-lambda
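The general idea of application-level injection in a Lambda function can be shown with a plain decorator that delays the handler. To be clear, this is not the chaos-lambda API; it is a standalone sketch with hypothetical names (`with_injected_delay`, `delay_ms`, `rate`) that only illustrates the technique.

```python
import functools
import random
import time


def with_injected_delay(delay_ms, rate=1.0, sleep=time.sleep, rng=random.random):
    """Decorator that delays a fraction (`rate`, 0.0-1.0) of calls by `delay_ms`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # rng() is in [0.0, 1.0), so rate=1.0 always delays and rate=0.0 never does.
            if rng() < rate:
                sleep(delay_ms / 1000)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@with_injected_delay(delay_ms=300, rate=0.0)  # rate=0.0 keeps this example fast
def handler(event, context):
    """Hypothetical Lambda handler used only for illustration."""
    return {"statusCode": 200, "body": "ok"}
```

In a real setup the delay and rate would come from external configuration, so the experiment can be switched off without redeploying the function.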

Slide 57

Slide 57 text

https://medium.com/@adhorn/injecting-chaos-to-aws-lambda-functions-using-lambda-layers-2963f996e0ba Injecting Chaos to AWS Lambda functions using Lambda Layers

Slide 58

Slide 58 text

https://github.com/adhorn/aws-lambda-chaos-injection

Slide 59

Slide 59 text

https://github.com/gunnargrosch/failure-lambda

Slide 60

Slide 60 text

Injecting Chaos to Amazon EC2 using AWS Systems Manager
https://medium.com/@adhorn/injecting-chaos-to-amazon-ec2-using-amazon-system-manager-ca95ee7878f5

Slide 61

Slide 61 text

SSM Run (send) Command
$ aws ssm send-command \
    --document-name "cpu-stress" \
    --document-version "1" \
    --targets '[{"Key":"InstanceIds","Values":["i-094c8367024633d96","i-04d0976f9fb658c23"]}]' \
    --parameters '{"duration":["60"],"cpu":["0"]}' \
    --timeout-seconds 600 \
    --max-concurrency "50" \
    --max-errors "0" \
    --output-s3-bucket-name "adhorn-chaos-ssm-output" \
    --region eu-west-1
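The same command can be issued from code. The sketch below builds the keyword arguments for boto3's `send_command` call, mirroring the CLI flags above. It assumes the `cpu-stress` document already exists in your account with `duration` and `cpu` parameters; the helper name is hypothetical, and keeping it separate from the API call makes it easy to inspect before anything runs.

```python
def cpu_stress_command(instance_ids, duration_s=60, output_bucket=None):
    """Build keyword arguments for ssm.send_command targeting the given instances."""
    kwargs = {
        "DocumentName": "cpu-stress",   # assumed to exist in the account
        "DocumentVersion": "1",
        "Targets": [{"Key": "InstanceIds", "Values": list(instance_ids)}],
        "Parameters": {"duration": [str(duration_s)], "cpu": ["0"]},
        "TimeoutSeconds": 600,
        "MaxConcurrency": "50",
        "MaxErrors": "0",
    }
    if output_bucket:
        kwargs["OutputS3BucketName"] = output_bucket
    return kwargs
```

Usage: `boto3.client("ssm", region_name="eu-west-1").send_command(**cpu_stress_command(["i-094c8367024633d96"]))`. Start with a single instance to keep the blast radius small.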

Slide 62

Slide 62 text

https://github.com/adhorn/chaos-ssm-documents

Slide 63

Slide 63 text

https://github.com/adhorn/aws-chaos-scripts

Slide 64

Slide 64 text

Slide 65

Slide 65 text

Big challenges to chaos engineering
• Chaos Engineering won’t make your system more robust; people will.
• Chaos Engineering won’t replace all the rest (tests, quality, …).
• Chaos Engineering is NOT the only way to learn from failure.
• Rollbacks are HARD because of state.
• Your systems will continue to fail, sorry.
• Starting is perceived as hard!

Slide 66

Slide 66 text

Big challenges to chaos engineering: mostly cultural
• No time or flexibility to simulate disasters.
• Teams already spending all of their time fixing things.
• Can be very political.
• Might force deep conversations.
• Teams deeply invested in a specific technical roadmap (microservices) that chaos engineering tests show is not as resilient to failures as originally predicted.