Creating Resiliency Through Destruction

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. SUMMIT

Creating Resiliency Through Destruction
Adrian Hornsby, Sr. Technical Evangelist, Amazon Web Services
adhorn

Can you guess what will happen?

Distributed Systems are hard
Amazon, Twitter, Netflix

“Failures are a given and everything will eventually fail over time.”
Werner Vogels, CTO, Amazon.com

Resiliency: the ability of a system to handle and eventually recover from unexpected conditions

Resiliency at work

How do we build resilient software systems?

• People
• Application
• Network & Data
• Infrastructure

Building confidence through testing
Unit testing of components:
• Tested in isolation to ensure function meets expectations.
Functional testing of integrations:
• Each execution path tested to assure expected results.
Is it enough?

GameDay at Amazon
Creating Resiliency Through Destruction
https://www.youtube.com/watch?v=zoz0ZjfrQ9s

Chaos engineering
https://github.com/Netflix/SimianArmy

Failure injection
• Start small & build confidence
• Application level
• Host failure
• Resource attacks (CPU, memory, …)
• Network attacks (dependencies, latency, …)
• Region attack
• Human attack

Break your systems on purpose. Find out their weaknesses and fix them before they break when least expected.

Chaos engineering is NOT about breaking things randomly without a purpose. It is about breaking things in a controlled environment, through well-planned experiments, in order to build confidence in your application's ability to withstand turbulent conditions.

• Build Resilient Systems
• Steady State
• Hypothesis
• Design & Run Experiment
• Verify & Learn
• Fix

Build Resilient Systems

Cascading and Overload Failures

Operations Infrastructure Application Software

Shameless plug - https://medium.com/@adhorn

https://aws.amazon.com/wellarchitected

Steady State

What is steady state?
• “normal” behavior of your system
https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero

What is steady state?
• “normal” behavior of your system
• Business Metric
https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a
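
One way to make "steady state" concrete is to compare a business metric against a tolerance band around its usual value. The following is a minimal sketch under stated assumptions: the function name, the baseline of 1,000 orders per minute, and the 5% tolerance are all hypothetical and not taken from the talk.

```python
def within_steady_state(observed, baseline, tolerance=0.05):
    """Return True if `observed` is within +/- `tolerance` (a fraction) of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return abs(observed - baseline) / baseline <= tolerance


# Hypothetical business metric: orders per minute, assumed baseline of 1,000.
print(within_steady_state(980, 1000))  # True: a 2% deviation is inside the band
print(within_steady_state(900, 1000))  # False: a 10% deviation is outside the band
```

In practice the baseline would probably vary with time of day and day of week, so a fixed number like this is only a starting point.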

Business metrics at work
• Amazon: 100 ms of extra load time caused a 1% drop in sales (Greg Linden).
• Google: 500 ms of extra load time caused 20% fewer searches (Marissa Mayer).
• Yahoo!: 400 ms of extra load time caused a 5–9% increase in the number of people who clicked “back” before the page even loaded (Nicole Sullivan).

Hypothesis

What if…?
• “What if this load balancer breaks?”
• “What if Redis becomes slow?”
• “What if a host on Cassandra goes away?”
• “What if latency increases by 300ms?”
• “What if the database stops?”
Make it everyone’s problem!

Disclaimer! Don’t make a hypothesis that you know will break you!

Design & Run Experiment

Designing experiment
• Pick hypothesis
• Scope the experiment
• Identify metrics
• Notify the organization

Rules of thumb
• Start very small
• As close as possible to production
• Minimize the blast radius
• Have an emergency STOP!
• Careful with state that can’t be rolled back (corrupt or incorrect data)
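
The emergency stop and rollback rules can be sketched as a small experiment loop. This is an illustrative sketch only: `run_experiment` and its callback parameters are hypothetical names, and a real runner would wait between probes and read real metrics.

```python
def run_experiment(inject, rollback, steady_state_ok, checks=5,
                   stop_requested=lambda: False):
    """Inject a fault, probe steady state `checks` times, and always roll back.

    Returns "completed" if every probe passed, "aborted" if the emergency
    stop was requested or a probe found the system outside steady state.
    """
    inject()
    try:
        for _ in range(checks):
            if stop_requested() or not steady_state_ok():
                return "aborted"
        return "completed"
    finally:
        # The rollback runs on every path, including exceptions.
        rollback()


events = []
result = run_experiment(
    inject=lambda: events.append("inject"),
    rollback=lambda: events.append("rollback"),
    steady_state_ok=lambda: False,  # simulate a breached steady state
)
print(result, events)  # aborted ['inject', 'rollback']
```

Putting the rollback in `finally` means the fault is removed even when a probe raises, which is the property an emergency stop needs. It does not help with state that cannot be rolled back, which is why the last rule above stands on its own.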

Running Chaos Experiment
Start with ..
• Normal version: 99% of users
• Canary deployment: 1% of users
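
A 99% / 1% split like this is often implemented by hashing a stable user identifier into a bucket. The sketch below assumes that approach; the function name and the user ID format are hypothetical, and the talk does not say how the split is done.

```python
import hashlib


def in_canary(user_id: str, percent: float = 1.0) -> bool:
    """Deterministically place roughly `percent` % of users in the canary group."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000  # stable bucket in 0..9999
    return bucket < percent * 100      # 1.0% maps to buckets 0..99


users = [f"user-{i}" for i in range(20_000)]  # made-up user IDs
canary_users = sum(in_canary(u) for u in users)
print(f"{canary_users} of {len(users)} users routed to the canary")
```

Because the bucket depends only on the user ID, a given user stays on the same side of the split for the whole experiment, which keeps the blast radius fixed.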

Fault Injection Queries for Amazon Aurora
SQL commands issued to simulate:
• A crash of the master instance or an Aurora Replica
• A failure of an Aurora Replica
• A disk failure
• Disk congestion
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Managing.FaultInjectionQueries.html

Fault Injection Queries for Amazon Aurora
Syntax for simulating a disk failure:
    ALTER SYSTEM SIMULATE percentage_of_failure PERCENT DISK FAILURE
        [ IN DISK index | NODE index ]
        FOR INTERVAL quantity { YEAR | QUARTER | MONTH | WEEK | DAY | HOUR | MINUTE | SECOND };
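
To show how the placeholders in that syntax get filled in, here is a small sketch that builds a concrete statement string. The helper `disk_failure_query` is hypothetical, it only covers the `IN DISK index` form of the optional clause, and it does not connect to a database; running the resulting statement against an Aurora cluster is left to whatever client is already in use.

```python
VALID_UNITS = {"YEAR", "QUARTER", "MONTH", "WEEK", "DAY", "HOUR", "MINUTE", "SECOND"}


def disk_failure_query(percent, quantity, unit, disk_index=None):
    """Build a disk-failure fault injection statement from the slide's syntax."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    unit = unit.upper()
    if unit not in VALID_UNITS:
        raise ValueError(f"unsupported interval unit: {unit}")
    # Optional clause: restrict the simulated failure to one disk.
    target = "" if disk_index is None else f" IN DISK {int(disk_index)}"
    return (
        f"ALTER SYSTEM SIMULATE {int(percent)} PERCENT DISK FAILURE"
        f"{target} FOR INTERVAL {int(quantity)} {unit};"
    )


print(disk_failure_query(25, 1, "minute", disk_index=3))
# ALTER SYSTEM SIMULATE 25 PERCENT DISK FAILURE IN DISK 3 FOR INTERVAL 1 MINUTE;
```

Starting with a low percentage and a short interval is in line with the "start very small" rule of thumb.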

https://github.com/Netflix/SimianArmy
A set of scheduled agents that:
• shut down services randomly
• slow down performance
• check conformity
• break an entire region
• More …

The Chaos Toolkit
• Simplifying Adoption of Chaos Engineering
• An Open API to Chaos Engineering
• Open source extensions for
  • Infrastructure/Platform Fault Injections
  • Application Fault Injections
  • Observability
• Integrates easily into CI/CD pipelines

ToxiProxy
• HTTP API
• Built with automated testing in mind
• Not for production environments
• Fast
• Toxics for timeouts, latency, connections, bandwidth limitation, etc.
• CLI
• Stable and well tested (used for 3 years at Shopify)
• Open Source: https://github.com/Shopify/toxiproxy

https://atscaleconference.com/videos/resiliency-testing-with-toxiproxy/

https://github.com/asobti/kube-monkey

Pumba
https://github.com/alexei-led/pumba/

https://blog.thundra.io/chaos-test-your-lambda-functions-with-thundra

Injecting Chaos to AWS Lambda functions using Lambda Layers
https://medium.com/@adhorn/injecting-chaos-to-aws-lambda-functions-using-lambda-layers-2963f996e0ba
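
The general idea of injecting failure into a function can be sketched as a decorator that delays a handler before it runs. This is not the API from the linked article or of any real library: `inject_latency` and all of its parameters are hypothetical and serve only to illustrate the technique.

```python
import functools
import random
import time


def inject_latency(delay_ms, rate=1.0, enabled=lambda: True,
                   sleep=time.sleep, rng=random.random):
    """Delay a (event, context) handler by `delay_ms` for a fraction `rate` of calls."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            # `enabled` acts as the emergency stop: flip it off to disable injection.
            if enabled() and rng() < rate:
                sleep(delay_ms / 1000.0)
            return handler(event, context)
        return wrapper
    return decorator


@inject_latency(delay_ms=300, rate=0.5)
def handler(event, context):
    return {"statusCode": 200}


print(handler({}, None))  # {'statusCode': 200}, delayed on about half of the calls
```

Making `sleep` and `rng` injectable keeps the decorator testable without real delays, and the `enabled` callback could read a configuration value so the experiment can be switched off without redeploying.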

Verify & Learn

Quantifying the result of the experiment
• Time to detect?
• Time for notification? And escalation?
• Time to public notification?
• Time for graceful degradation to kick in?
• Time for self-healing to happen?
• Time to recovery (partial and full)?
• Time to all-clear and stable?
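
These questions can be answered from a timeline of timestamps recorded during the experiment. The sketch below assumes such a timeline exists; the event names and times are made up for illustration.

```python
from datetime import datetime


def durations_since_start(timeline, start_key="fault_injected"):
    """Return seconds elapsed from the start event to every other event."""
    start = timeline[start_key]
    return {
        name: (ts - start).total_seconds()
        for name, ts in timeline.items()
        if name != start_key
    }


# Hypothetical timeline of one experiment.
timeline = {
    "fault_injected": datetime(2019, 1, 1, 12, 0, 0),
    "detected": datetime(2019, 1, 1, 12, 0, 45),
    "notified": datetime(2019, 1, 1, 12, 2, 0),
    "recovered": datetime(2019, 1, 1, 12, 10, 0),
}
print(durations_since_start(timeline))
# {'detected': 45.0, 'notified': 120.0, 'recovered': 600.0}
```

Tracking the same durations across repeated experiments shows whether fixes actually shorten detection and recovery.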

PostMortems: COE (Correction of Errors)
The 5 WHYs
Outage
• Because of …
• Because of …
• Because of …
• Because of …

Rules to remember!
There is no isolated ‘cause’ of an accident.

More questions to ask
• Can you clarify if there were any preceding events?
• Why would they believe acting in this way was the best course of action to deliver the desired outcome?
• Is there another failure mode that could present here?
• What decisions or events prior to this made this work before?
• Why stop there? Are there places to dig deeper that could shine more light on this?
• Did others step in to help, to advise, or to intercede?

DON’T blame that one person …

Fix

Big challenges to chaos engineering
Mostly cultural:
• No time or flexibility to simulate disasters.
• Teams already spend all of their time fixing things.
• It can be very political.
• It might force deep conversations.
• The organization is deeply invested in a specific technical roadmap (micro-services) that chaos engineering tests show is not as resilient to failures as originally predicted.

Big challenges to chaos engineering
• Chaos Engineering won’t make your system more robust, people will.
• Chaos Engineering won’t replace all the rest (test, quality, …).
• Chaos Engineering is NOT the only way to learn from failure.
• Rollbacks are HARD because of state.
• Your systems will continue to fail, sorry.

Changing culture takes time! Be patient…

More Resources
• https://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf
• https://www.gremlin.com
• https://queue.acm.org/detail.cfm?id=2353017
• https://softwareengineeringdaily.com/
• https://github.com/dastergon/awesome-sre
• https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf
• https://medium.com/@NetflixTechBlog
• http://principlesofchaos.org
• https://speakerdeck.com/tammybutow/chaos-engineering-bootcamp
• https://github.com/adhorn/awesome-chaos-engineering
• https://www.infoq.com/presentations/netflix-chaos-microservices
• http://royal.pingdom.com/wp-content/uploads/2015/04/pingdom_uptime_cheat_sheet.pdf
• http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy
• https://medium.com/@adhorn
