Chaos Engineering: Why breaking things should be practiced

© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Confidential and Trademark.
Adrian Hornsby, Cloud Architecture Evangelist, @adhorn

https://xkcd.com/1428/

Complex systems: Amazon, Twitter, Netflix

Partial failure mode

Resiliency: the ability of a system to handle and eventually recover from unexpected conditions

People, Application, Network & Data, Infrastructure

Building confidence through testing
Unit testing of components:
• Tested in isolation to ensure function meets expectations.
Functional testing of integrations:
• Each execution path tested to assure expected results.
Is it enough?

GameDay at Amazon: Creating Resiliency Through Destruction
https://www.youtube.com/watch?v=zoz0ZjfrQ9s

Chaos engineering
https://github.com/Netflix/SimianArmy

Failure injection
• Start small & build confidence
• Application level
• Host failure
• Resource attacks (CPU, memory, …)
• Network attacks (dependencies, latency, …)
• Region attacks
• “Paul” attack
https://www.gremlin.com
https://github.com/Netflix/SimianArmy
https://chaostoolkit.org

“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
http://principlesofchaos.org

Break your systems on purpose. Find out their weaknesses and fix them before they break when least expected.

“Chaos doesn’t cause problems. It reveals them.”
Nora Jones, Senior Chaos Engineer, Netflix

Steady State → Hypothesis → Design & Run Experiment → Verify & Learn → Fix

Steady State

What is steady state?
• “Normal” behavior of your system
https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero

What is steady state?
• “Normal” behavior of your system
• Business metric
https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a
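
One way to make a steady state testable is to compare a business metric against a baseline with a tolerance band. The following is a minimal sketch under assumptions: the function name, the 5% tolerance, and the sample readings are illustrative and do not come from the talk.

```python
from statistics import mean

def within_steady_state(samples, baseline, tolerance=0.05):
    """Return True if the mean of samples stays within +/- tolerance of baseline.

    Assumes a positive baseline; tolerance is a fraction (0.05 means 5%).
    """
    if not samples:
        raise ValueError("no samples to evaluate")
    observed = mean(samples)
    return abs(observed - baseline) <= tolerance * baseline

# Hypothetical readings of a business metric taken during an experiment.
small_deviation = within_steady_state([980, 1010, 995], baseline=1000)
large_drop = within_steady_state([700, 650, 720], baseline=1000)
print(small_deviation, large_drop)
```

A check like this could serve as the pass/fail condition of an experiment: if the metric leaves the band, the hypothesis is rejected and the experiment stops.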

Business metrics at work
• Amazon: 100 ms of extra load time caused a 1% drop in sales (Greg Linden).
• Google: 500 ms of extra load time caused 20% fewer searches (Marissa Mayer).
• Yahoo!: 400 ms of extra load time caused a 5–9% increase in the number of people who clicked “back” before the page even loaded (Nicole Sullivan).

Hypothesis

What if…?
• “What if this load balancer breaks?”
• “What if Redis becomes slow?”
• “What if a host on Cassandra goes away?”
• “What if latency increases by 300ms?”
• “What if the database stops?”
Make it everyone’s problem!

Disclaimer! Don’t test a hypothesis that you already know will break you!

Design & Run Experiment

Designing the experiment
• Pick a hypothesis
• Scope the experiment
• Identify metrics
• Notify the organization

Rules of thumb
• Start very small
• Run as close as possible to production
• Minimize the blast radius
• Have an emergency STOP!

Running a chaos experiment
Diagram: users are split between the normal version (99% of users) and a canary deployment (1% of users). Start with ..
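
The 99% / 1% split can be approximated by hashing a stable user identifier into buckets, so that the same user always lands on the same side. This is a sketch under assumptions: the function name, the `user-N` identifiers, and the use of SHA-256 are illustrative choices, not something the talk prescribes.

```python
import hashlib

def routes_to_canary(user_id: str, canary_percent: float = 1.0) -> bool:
    """Deterministically place a user in the canary group via a hash bucket."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # bucket in 0..9999
    return bucket < canary_percent * 100                 # 1.0% -> buckets 0..99

# Hypothetical user population, used only to show the resulting share.
users = [f"user-{i}" for i in range(100_000)]
share = sum(routes_to_canary(u) for u in users) / len(users)
print(f"canary share: {share:.2%}")
```

Keeping the assignment deterministic limits the blast radius to a fixed group of users for the duration of the experiment.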

Verify & Learn

Quantifying the result of the experiment
• Time to detect?
• Time for notification? And escalation?
• Time to public notification?
• Time for graceful degradation to kick in?
• Time for self-healing to happen?
• Time to recovery – partial and full?
• Time to all-clear and stable?

DON’T blame that one person …

Postmortems – COE (Correction of Errors)
The 5 WHYs: Outage → Because of … → Because of … → Because of … → Because of …
NOT ENOUGH

More questions to ask
• Can you clarify if there were any preceding events?
• Why would they believe acting in this way was the best course of action to deliver the desired outcome?
• Is there another failure mode that could present here?
• What decisions or events prior to this made this work before?
• Why stop there – are there places to dig deeper that could shine a light more on this?
• Did others step in to help, to advise, or to intercede?

Rules to remember!
1. Failure requires multiple faults.
2. There is no isolated ‘cause’ of an accident.
3. There are multiple contributors to accidents.

Fix

Availability in parallel
$A = 1 - (1 - A_x)^2$
Diagram: two instances of Part X in parallel.

Availability in parallel
Component              Availability          Downtime (per year)
X                      99% (2-nines)         3 days 15 hours
Two X in parallel      99.99% (4-nines)      52 minutes
Three X in parallel    99.9999% (6-nines)    31 seconds
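
The figures for parallel availability can be reproduced with a few lines of arithmetic. This sketch generalizes the slide's formula from 2 to n redundant copies and assumes that the copies fail independently and that downtime is measured over a 365-day year; the function names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` independent redundant parts: 1 - (1 - A_x)^n."""
    return 1 - (1 - component_availability) ** copies

def yearly_downtime_seconds(availability: float) -> float:
    """Expected unavailable time per 365-day year, in seconds."""
    return (1 - availability) * SECONDS_PER_YEAR

for copies in (1, 2, 3):
    a = parallel_availability(0.99, copies)
    print(copies, f"{a:.6%}", f"{yearly_downtime_seconds(a):.1f} seconds/year")
```

The independence assumption matters: copies that share a failure domain (same host, same zone, same bad deployment) will not reach these numbers.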

Multi-AZ architecture
Diagram: an application deployed in one Region across Availability zones a, b, and c.

Stateless services
Diagram: an AWS Region with Availability zone 1 and Availability zone 2; Service A and Service B run in both zones, in Auto Scaling groups.

Auto-Scaling
• Compute efficiency
• Node failure
• Traffic spikes
• Performance bugs

Auto-scaling
Diagram: an AWS Region with Availability zone 1 and Availability zone 2; Service A and Service B run in both zones, in Auto Scaling groups.

Decoupling with async pattern
Diagram labels: Listener, Pub-Sub, Queue; components A and B.

API: {DO foo}
PUT JOB: {JobID: 0001, Task: DO foo}
API: {JobID: 0001}
GET JOB: {JobID: 0001, Task: DO foo}
{JobID: 0001, Result: bar}
Diagram components: API instances, queue, worker instances, cache node.
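
The job flow on this slide (the API enqueues a job and returns a job ID, a worker picks the job up and stores the result) can be sketched in a single process. This is only an illustration under assumptions: an in-memory `queue.Queue` stands in for the queue, a dict stands in for the cache node, and the `submit` and `worker` names are hypothetical.

```python
import queue
import threading

jobs = queue.Queue()   # stands in for the queue between API and workers
results = {}           # stands in for the cache node holding results

def submit(job_id, task):
    """API side: enqueue the job and return the job ID immediately."""
    jobs.put({"JobID": job_id, "Task": task})
    return {"JobID": job_id}

def worker():
    """Worker side: take jobs off the queue and store each result by job ID."""
    while True:
        job = jobs.get()
        if job is None:          # sentinel used here to stop the worker
            jobs.task_done()
            break
        results[job["JobID"]] = "bar" if job["Task"] == "DO foo" else None
        jobs.task_done()

thread = threading.Thread(target=worker)
thread.start()
ticket = submit("0001", "DO foo")   # the caller is not blocked on the work
jobs.put(None)
jobs.join()                         # wait until everything queued is processed
thread.join()
print(ticket, {"JobID": "0001", "Result": results["0001"]})
```

Because the caller only holds a job ID, a slow or restarted worker delays the result but does not tie up the API instance.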

Push notification
Diagram: User, API instances, queue, worker instances, cache node; the user receives a push notification and fetches the results.

Degrade & prioritize traffic with queues
Diagram: API instances put work on a high priority queue and a low priority queue; worker instances consume from both.
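
A worker that always drains the high priority queue first gives a simple form of degradation: under load, low priority work waits. The sketch below assumes in-memory queues and hypothetical request names; it is not taken from the talk.

```python
import queue

high_priority = queue.Queue()
low_priority = queue.Queue()

def next_request():
    """Return the next request, always preferring the high priority queue."""
    for q in (high_priority, low_priority):
        try:
            return q.get_nowait()
        except queue.Empty:
            continue
    return None

def drain(capacity):
    """Serve at most `capacity` requests; whatever is left stays queued."""
    served = []
    while len(served) < capacity:
        request = next_request()
        if request is None:
            break
        served.append(request)
    return served

# Hypothetical traffic: more requests than the worker can serve right now.
low_priority.put("recommendations")
high_priority.put("checkout")
low_priority.put("analytics")
high_priority.put("login")
served = drain(capacity=3)
print(served, "left in low priority queue:", low_priority.qsize())
```

Strict preference can starve the low priority queue under sustained load, so a real system may want a quota or aging rule on top of this.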

Read / Write Sharding
Diagram: application instances, one DB instance, and three DB instance read replicas.

Database Federation
Diagram: application instances; a Users DB and a Products DB, each a DB instance with a DB instance read replica.

Database Sharding
User      ShardID
002345    A
002346    B
002347    C
002348    B
002349    A
Diagram: application instances; shards A, B, and C, each a DB instance with a DB instance read replica.
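
The user-to-shard table on this slide behaves like a directory lookup. A minimal sketch, assuming the directory fits in a dict; the endpoint host names are hypothetical placeholders.

```python
# User ID -> shard ID, mirroring the lookup table on the slide.
SHARD_DIRECTORY = {
    "002345": "A",
    "002346": "B",
    "002347": "C",
    "002348": "B",
    "002349": "A",
}

# Shard ID -> database endpoint (hypothetical host names).
SHARD_ENDPOINTS = {
    "A": "db-a.example.internal",
    "B": "db-b.example.internal",
    "C": "db-c.example.internal",
}

def endpoint_for(user_id: str) -> str:
    """Resolve the database endpoint that holds this user's data."""
    shard_id = SHARD_DIRECTORY[user_id]   # raises KeyError for unknown users
    return SHARD_ENDPOINTS[shard_id]

print(endpoint_for("002346"))
```

With this layout, losing one shard affects only the users mapped to it rather than the whole user base.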

Transient state does not belong in the database.

Cascading Failures

Let’s talk about timeouts & retries!

What happens if the DB “slows down”?
Diagram: Users → App → Conn Pool → DB, with several INSERTs in flight. Timeout client side: ? Timeout backend side: ?

Diagram: User 1 → App → Conn Pool → DB. Timeout client side = 10s. Timeout backend side = not implemented. The user retries the INSERT again and again until: ERROR: Failed to get connection from pool
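
The pool exhaustion in this scenario can be reproduced with a toy pool. The sketch below is an assumption-laden illustration, not a real database driver: connections are plain strings, and the `ConnectionPool` class with its `acquire` and `release` methods is hypothetical. It shows a bounded wait when acquiring a connection, and how releasing a connection makes the pool usable again.

```python
import queue

class ConnectionPool:
    """Toy pool: a queue of connection tokens (not a real DB driver)."""

    def __init__(self, size):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")

    def acquire(self, timeout):
        """Wait at most `timeout` seconds for a free connection."""
        try:
            return self._free.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError("Failed to get connection from pool") from None

    def release(self, conn):
        self._free.put(conn)

pool = ConnectionPool(size=2)
# Two slow INSERTs hold both connections and never time out on the backend.
held = [pool.acquire(timeout=0.1), pool.acquire(timeout=0.1)]

error = None
try:
    pool.acquire(timeout=0.1)      # a retry arrives while the pool is empty
except TimeoutError as exc:
    error = str(exc)

pool.release(held.pop())           # a backend timeout would free the connection
conn = pool.acquire(timeout=0.1)
print(error, "| then acquired:", conn)
```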

http://docs.python-requests.org/en/master/user/advanced/#timeouts

    import requests
    import timeout_decorator

    # Abort the call if it has not returned within 5 seconds.
    @timeout_decorator.timeout(5, timeout_exception=StopIteration)
    def timed_get(url):
        return requests.get(url)

https://pypi.org/project/timeout-decorator/

Timeouts

How else could we have prevented the error?
Diagram: User 1 → Conn Pool → DB. INSERT, retry, INSERT, retry, INSERT, retry, then ERROR: Failed to get connection from pool

Backoff
Diagram: User 1 → Conn Pool → DB. Timeout client side = 10s. Timeout backend side = 10s. Wait 2s before the first retry, then 4s, 8s, and 16s. Backing off between retries; releasing connections.

Simple Exponential Backoff is not enough: Add Jitter
Figure: retries with no jitter vs. with jitter.
https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
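
The 2s, 4s, 8s, 16s schedule and its jittered variant can be sketched as a small delay generator. This follows the "Full Jitter" idea from the linked AWS post (sleep for a random time between 0 and the capped exponential value); the function name and the default base, cap, and attempt count are assumptions chosen to match the slide.

```python
import random

def backoff_delays(base=2.0, cap=16.0, attempts=4, jitter=True, rng=random):
    """Return the wait before each retry: exponential, capped, optional full jitter."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        # Full jitter: pick uniformly between 0 and the exponential ceiling.
        delays.append(rng.uniform(0, ceiling) if jitter else ceiling)
    return delays

plain = backoff_delays(jitter=False)
jittered = backoff_delays(rng=random.Random(42))  # seeded only for repeatability
print(plain)
print([round(d, 2) for d in jittered])
```

Without jitter, clients that failed together retry together; randomizing the wait spreads those retries out over time.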

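Example: add jitter 0-1000ms

    import random
    import time

    import requests

    def get_item(self, url, n=1):
        # Method of a client class (hence `self`); retries recursively.
        MAX_TRIES = 12
        try:
            res = requests.get(url)
        except requests.exceptions.RequestException:
            # Retry only on request errors instead of a bare `except`.
            if n > MAX_TRIES:
                return None
            n += 1
            # Exponential wait plus 0-1000 ms of random jitter.
            time.sleep((2 ** n) + (random.randint(0, 1000) / 1000.0))
            return self.get_item(url, n)
        else:
            return res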

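    import backoff

    # on_exception takes a wait generator and the exception type(s) to retry on.
    # backoff.full_jitter is the jitter function, not the wait generator.
    # Exception is a broad placeholder here; narrow it to the errors you expect.
    @backoff.on_exception(backoff.expo, Exception, max_time=60, jitter=backoff.full_jitter)
    def poll_for_message(queue):
        return queue.get()

https://pypi.org/project/backoff/
As of version 1.2, the default jitter function backoff.full_jitter implements the ‘Full Jitter’ algorithm as defined in the AWS Architecture Blog’s Exponential Backoff And Jitter post.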

Idempotent operation: no additional effect if it is called more than once with the same input parameters.
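
Idempotency is what makes retries safe. One common way to get it is to attach a request ID to each operation and remember which IDs were already applied. The sketch below assumes that approach; the `PaymentService` class and its in-memory bookkeeping are hypothetical.

```python
class PaymentService:
    """Toy service: repeating a call with the same request ID has no extra effect."""

    def __init__(self):
        self.balance = 0
        self._processed = {}   # request ID -> result of the first call

    def deposit(self, request_id, amount):
        if request_id in self._processed:
            # A retry of a request already applied: return the stored result.
            return self._processed[request_id]
        self.balance += amount
        self._processed[request_id] = self.balance
        return self.balance

svc = PaymentService()
first = svc.deposit("req-1", 50)
retry = svc.deposit("req-1", 50)   # e.g. the client timed out and retried
print(first, retry, svc.balance)
```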

Service Degradation & Fallbacks

Circuit Breaker
• Wrap a protected function call in a circuit breaker object, which monitors for failures.
• If failures reach a certain threshold, the circuit breaker trips.
Diagram: Producer → Circuit Breaker → Consumer. Labels: connection monitoring, timeouts, breaking circuit.
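
The two bullets above can be sketched as a small wrapper class. This is a minimal illustration rather than a production implementation: the class and exception names, the failure threshold, and the reset interval are assumptions, and it counts consecutive failures only.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            self.opened_at = None   # half-open: let one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()   # trip the breaker
            raise
        self.failures = 0   # a success closes the circuit again
        return result

def flaky():
    raise ConnectionError("backend unavailable")

breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError:
        outcomes.append("rejected")
    except ConnectionError:
        outcomes.append("failed")
print(outcomes)
```

Once the breaker is open, callers get an immediate error that they can map to a fallback instead of waiting on a dependency that is already struggling.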

Non-blocking UI
https://medium.com/@sophie_paxtonUX/stop-getting-in-my-way-non-blocking-ux-5cbbfe0f0158

Fire Drills

Big challenges to chaos engineering
Mostly cultural:
• No time or flexibility to simulate disasters.
• Teams already spend all of their time fixing things.
• Can be very political.
• Might force deep conversations.
• Deeply invested in a specific technical roadmap (micro-services) that chaos engineering tests show is not as resilient to failures as originally predicted.

Changing culture takes time! Be patient…

Thank you!
@adhorn
https://medium.com/@adhorn
