Building Resilient Applications using Chaos Engineering on AWS

© 2020, Amazon Web Services, Inc. or its affiliates. All
rights reserved. Building Resilient Applications using Chaos Engineering on AWS Adrian Hornsby Principal Technical Evangelist Amazon Web Services

Can you afford to lose $100,000?

Because that is the average amount of one hour of
downtime reported by an ITIC study this year. Source: Information Technology Intelligence Consulting Research

• A volunteer firefighter • Created GameDay in 2006 to
purposefully create regular major failures. • Founded Chef, the Velocity Web Performance & Operations Conference. Jesse Robbins, “Master of Disaster” GameDay at Amazon

“Simian Army to keep our cloud safe, secure, and highly
available.” - 2011 Netflix blog Set of scheduled agent: • shuts down services randomly • slows down performances • checks conformity • breaks an entire region • Integrates with spinnaker (CI/CD) Rise of the monkeys https://github.com/Netflix/SimianArmy

Chaos Engineering formalized by Netflix (mid-2015 ) principlesofchaos.org

Chaos engineering is NOT about breaking things randomly without a
purpose, chaos engineering is about breaking things in a controlled environment and through well- planned experiments in order to build confidence in your application to withstand turbulent conditions.

STEADY STATE HYPOTHESIS RUN EXPERIMENT VERIFY IMPROVE Chaos Engineering A
scientific method

Break your systems on purpose. Find out their weaknesses and
fix them before they break when least expected.

rights reserved. Chaos Engineering

Operations Infrastructure Application Software

https://aws.amazon.com/wellarchitected

M ore Inform ation https://aws.amazon.com/builders-library

STEADY STATE HYPOTHESIS RUN EXPERIMENT VERIFY IMPROVE Phases of Chaos
Engineering

What is steady state? • ”normal” behavior of your system
https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero

What is steady state? • Business + Ops Metric https://medium.com/netflix-techblog/

Engineering

What if…? “What if this load balancer breaks?” “What if
Redis becomes slow?” “What if a host on Cassandra goes away?” ”What if latency increases by 300ms?” ”What if the database stops?” Make it everyone’s problem!

Engineering

Failure injection • Start small & build confidence • Application
level (exceptions, errors, …) • Host level (services, processes, …) • Resource attacks (CPU, memory, IO, …) • Network attacks (dependencies, latency, packet loss…) • AZ attack • Region attack • People attack

Rules of thumbs • Start very small • As close
as possible to production • Minimize the blast radius. • Have an emergency STOP! • Careful with state that can’t be rolled back (corrupt or incorrect data)

Blast radius • How many customers? • What functionality? •
How many locations?

Engineering

Quantifying the result of the experiment • Time to detect?
• Time for notification? And escalation? • Time to public notification? • Time for graceful degradation to kick-in? • Time for self healing to happen? • Time to recovery – partial and full? • Time to all-clear and stable?

Postmortems – COE (Correction of Errors) • What happened? •
What was the impact on customers and your business? • What were the contributing factors? • What data do you have to support this? • especially metrics and graphs • What lessons did you learn? • What corrective actions are you taking? • Actions items • Related items (trouble tickets etc.)

Tools Processes Culture Technology

Two rules to remember ALWAYS!

DON’T blame that one person …

There is no isolated ‘cause’ of an accident.

Engineering

Fix it!

rights reserved. Monkeys

Start simple and local!! $ docker stop 94a214bbeebd

DDoS yourself $ wrk -t12 -c400 -d30s http://127.0.0.1/api/health

Burn CPU with Stress(–ng) $ stress-ng --cpu 0--cpu-method matrixprod -t
60s https://kernel.ubuntu.com/~cking/stress-ng/

Adding latency to the network $ tc qdisc add dev
eth0 root netem delay 300ms

Blocks DNS resolution $ iptables -A INPUT -p tcp -m
tcp --dport 53 -j DROP

Other fun things to do • Fill up disk •
Network packet loss (using traffic-shaping) • Network packet corruption (using traffic-shaping) • Kills random processes • Detach (force) all EBS volumes • Mess with /etc/hosts

https://github.com/Netflix/SimianArmy Set of scheduled agent: • shuts down services randomly
• slows down performances • checks conformity • breaks an entire region • Integrates with spinnaker (CI/CD) Simian Army

The Chaos Toolkit • Simplifying Adoption of Chaos Engineering •
An Open API to Chaos Engineering • Open source extensions for • Infrastructure/Platform Fault Injections • Application Fault Injections • Observability • Integrates easily into CI/CD pipelines

ToxiProxy • HTTP API • Build for Automated testing in
mind • Not for production environment • Fast • Toxics for: • Timeouts, latency, connections and bandwidth limitation, etc.. • CLI • Stable and well tested (used for 3 years at Shopify) • Open Source: https://github.com/Shopify/toxiproxy

https://atscaleconference.com/videos/resiliency-testing-with-toxiproxy/

Pumba https://github.com/alexei-led/pumba/

Fault Injection Queries for Amazon Aurora SQL commands issued to
simulate: • A crash of the master instance or an Aurora Replica • A failure of an Aurora Replica • A disk failure • Disk congestion https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Managing.FaultInjectionQueries.html

Fault Injection Queries for Amazon Aurora SQL commands issued to
simulate: • A crash of the master instance or an Aurora Replica • A failure of an Aurora Replica • A disk failure • Disk congestion ALTER SYSTEM SIMULATE percentage_of_failure PERCENT DISK FAILURE [ IN DISK index | NODE index ] FOR INTERVAL quantity { YEAR | QUARTER | MONTH | WEEK | DAY | HOUR | MINUTE | SECOND };

APIs J

❯ aws lambda put-function-concurrency --function-name <value> -- reserved-concurrent-executions 0

Injecting Chaos to AWS Lambda $ pip install chaos-lambda

https://medium.com/@adhorn/injecting-chaos-to-aws-lambda-functions-using-lambda-layers-2963f996e0ba Injecting Chaos to AWS Lambda functions using Lambda Layers

https://github.com/adhorn/aws-lambda-chaos-injection

https://github.com/gunnargrosch/failure-lambda

Injecting Chaos to Amazon EC2 using AWS System Manager https://medium.com/@adhorn/injecting-chaos-to-amazon-ec2-using-amazon-system-manager-ca95ee7878f5

SSM Run (send) Command $ aws ssm send-command --document-name "cpu-stress"
--document-version "1" --targets '[{"Key":"InstanceIds","Values":[ " i-094c8367024633d96 ","i-04d0976f9fb658c23"]}]’ --parameters '{"duration":["60"],"cpu":["0"]}’ --timeout-seconds 600 --max-concurrency "50" --max-errors "0" --output-s3-bucket-name "adhorn-chaos-ssm-output" --region eu-west-1

https://github.com/adhorn/chaos-ssm-documents

https://github.com/adhorn/aws-chaos-scripts

Big challenges to chaos engineering • Chaos Engineering won’t make
your system more robust, People will. • Chaos Engineering won’t replace __all__ the rest (test, quality, …) • Chaos Engineering is NOT the only way to learn from failure • Rollbacks are HARD because of state. • Your systems will continue to fail, sorry. • Starting is perceived as hard!

Big challenges to chaos engineering Mostly Cultural • no time
or flexibility to simulate disasters. • teams already spending all of its time fixing things. • can be very political. • might force deep conversations. • deeply invested in a specific technical roadmap (micro-services) that chaos engineering tests show is not as resilient to failures as originally predicted.

Tools Processes Culture Technology

Building Resilient Applications using Chaos En...

Building Resilient Applications using Chaos Engineering on AWS

More Decks by Adrian Hornsby

Other Decks in Technology

Featured

Transcript