Chaos engineering - why breaking things should be practised.

© 2018, Amazon Web Services, Inc. or its Affiliates. All
rights reserved. Adrian Hornsby, Cloud Architecture Evangelist @ AWS @adhorn Chaos Engineering: Why Breaking Things Should Be Practiced.

rights reserved. https://xkcd.com/1428/

rights reserved. Looks familiar?

rights reserved. Failures are a given and everything will eventually fail over time. Werner Vogels CTO – Amazon.com “ “

rights reserved. … at the Edge

rights reserved. Jesse Robbins GameDay: Creating Resiliency Through Destruction https://www.youtube.com/watch?v=zoz0ZjfrQ9s

rights reserved. Netflix 2013 https://medium.com/netflix-techblog

rights reserved. Chaos Monkeys

rights reserved.

rights reserved. What “really” is Chaos Engineering?

rights reserved. “Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” http://principlesofchaos.org

rights reserved. Break your systems on purpose. Find out their weaknesses and fix them before they break when least expected.

rights reserved. Building Confidence Through Testing Unit testing of components: • Tested in isolation to ensure function meets expectations. Functional testing of integrations: • Each execution path tested to assure expected results. Is it enough???

rights reserved. Failure Injection • Start small & build confidence • Application level • Host failure • Resource attacks (CPU, memory, …) • Network attacks (dependencies, latency, …) • Region attacks!

rights reserved.

rights reserved. “CHAOS DOESN’T CAUSE PROBLEMS. IT REVEALS THEM.” Nora Jones Senior Chaos Engineer, Netflix

rights reserved. Phases of Chaos Engineering

rights reserved. Steady State Hypothesis Design & Run Experiment Verify & Learn Fix

rights reserved. Steady State

rights reserved. What is Steady State? • ”normal” behavior of your system https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero

What is Steady State? • ”normal” behavior of your system
• Business Metric https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a

rights reserved. Business Metrics at work Amazon: 100 ms of extra load time caused a 1% drop in sales (Greg Linden). Google: 500 ms of extra load time caused 20% fewer searches (Marissa Mayer). Yahoo!: 400 ms of extra load time caused a 5–9% increase in the number of people who clicked “back” before the page even loaded (Nicole Sullivan).

rights reserved. Steady State Important: • Know the value range of Healthy State!

rights reserved. Hypothesis

rights reserved. What if…? “What if this load balancer breaks?” “What if Redis becomes slow?” “What if a host on Cassandra goes away?” ”What if latency increases by 300ms?” ”What if the database stops?” Make it everyone’s problem!

rights reserved. Important! Don’t make an hypothesis that you know will break you!

rights reserved. Netflix team members Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, and Ali Basiri, suggest the following inputs for Chaos experiments: • Simulating the failure of an entire region or datacenter. • Partially deleting Kafka topics over a variety of instances to recreate an issue that occurred in production. • Injecting latency between services for a select percentage of traffic over a predetermined period of time. • Function-based chaos (runtime injection): Randomly causing functions to throw exceptions. • Code insertion: Adding instructions to the target program and allowing fault injection to occur prior to certain instructions. • Time travel: Forcing system clocks out of sync with each other. • Executing a routine in driver code emulating I/O errors. • Maxing out CPU cores on an Elasticsearch cluster.

rights reserved. Design & Run Experiment

rights reserved. Designing Experiment • Pick hypothesis • Scope the experiment • Identify metrics • Notify the organization

rights reserved. Rules of thumbs • Start with very small • As close as possible to production • Minimize the blast radius. • Have an emergency STOP!

rights reserved. New Version Users Canary deployment Old Version 99% Users 1% Users Start with .. Dynamic Routing (Route53)

rights reserved.

rights reserved. Quantifying the result of the experiment • Time to detect? • Time for notification? And escalation? • Time to public notification? • Time for graceful degradation to kick-in? • Time for self healing to happen? • Time to recovery – partial and full? • Time to all-clear and stable?

rights reserved. DON’T blame that one person …

rights reserved. PostMortems – COE (Correction of Errors) The 5 WHYs Outage Because of … Because of … Because of … Because of …

rights reserved. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. More questions to ask. • Can you clarify if there were any preceding events? • Why would they believe acting in this way was the best course of action to deliver the desired outcome? • Is there another failure mode that could present here? • What decisions or events prior to this made this work before? • Why stop there – are there places to dig deeper that could shine a light more on this? • Did others step in to help, to advise, or to intercede?

rights reserved. Fix

rights reserved. Big Challenges to Chaos Engineering Mostly Cultural • no time or flexibility to simulate disasters. • teams already spending all of its time fixing things. • can be very political. • might force deep conversations. • deeply invested in a specific technical roadmap (microservices) that chaos engineering tests show is not as resilient to failures as originally predicted.

rights reserved. Before breaking things …

rights reserved. People Application Network & Data Infrastructure

rights reserved. Patterns for Resilient Architectures Infrastructure

rights reserved. Availability Availability Downtime per year 99% (2-nines) 3 days 15 hours 99.99% (4-nines) 52 minutes 99.999% (5-nines) 5 minutes 99.9999% (6-nines) 31 seconds

rights reserved. System Availability Availability = Normal Operation Time Total Time MTBF** MTBF** + MTTR* = * Mean Time To Repair (MTTR) **Mean Time Between Failure (MTBF)

rights reserved. Availability in Series Component Availability Downtime X 99% (2-nines) 3 days 15 hours Y 99.99% (4-nines) 52 minutes X and Y Combined 98.99% 3 days 16 hours 33 minutes

rights reserved. Availability in Parallel Component Availability Downtime X 99% (2-nines) 3 days 15 hours Two X in parallel 99.99% (4-nines) 52 minutes Three X in parallel 99.9999% (6-nines) 31 seconds

rights reserved. Availability Zone 1 Availability Zone 2 Availability Zone n Multi-AZ Support Instance Failure Application

rights reserved. Auto-Scaling • Compute efficiency • Node failure • Traffic spikes • Performance bugs

rights reserved. Infrastructure as Code • Template of the infrastructure in code. • Version controlled infrastructure. • Repeatable template. • Testable infrastructure. • Automate it!

rights reserved. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Immutable Infrastructure • No updates on live systems • Always start from a new resource being provisioned • Deploy the new software • Test in different environments (dev, staging) • Deploy to prod (inactive) • Change references (DNS or Load Balancer) • Keep old version around (inactive) • Fast rollback if things go wrong

rights reserved. Patterns for Resilient Architectures Network & Data

rights reserved. CAP Theorem Consistency Availability Partition Tolerance Data is consistent. All nodes see the same state. Every request is non-failing. Service still responds as expected if some nodes crash. Distributed System In the presence of a network partition, you must choose between consistency and availability!

rights reserved. Eventual Consistency … if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. Availability An eventually consistent system can return any value before it converges!! https://en.wikipedia.org/wiki/Eventual_consistency Distributed System Every request is non-failing.

rights reserved. Process A Process B Process A Process B Synchronous Asynchronous Waiting Working Continues get or fetch result Get result

rights reserved. Decoupling with async pattern A Queue B A Queue B Listener Pub-Sub

rights reserved. Web Instances Worker Instance Worker Instance Queue API Instance API Instance API Instance API: {DO foo} PUT JOB: {JobID: 0001, Task: DO foo} API: {JobID: 0001} GET JOB: {JobID: 0001, Task: DO foo} Cache Result: { JobID: 0001, Result: bar }

rights reserved. Worker Instance Worker Instance Queue API Instance API Instance API Instance Cache Amazon SNS Push Notification User

rights reserved. Read / Write Sharding RDS DB Instance Read Replica App Instance App Instance App Instance RDS DB Instance Master (Multi-AZ) RDS DB Instance Read Replica RDS DB Instance Read Replica

rights reserved. Database Federation Users DB Products DB App Instance App Instance App Instance

rights reserved. Database Sharding User ShardID 002345 A 002346 B 002347 C 002348 B 002349 A C B A App Instance App Instance App Instance

rights reserved. Transient state does not belong in the database.

rights reserved. Patterns for Resilient Architectures Application

rights reserved. Stateless Services AZ1 AZ2 AWS Region Data Store Cache Auto-Scaling Group User

rights reserved. Cascading Failures

rights reserved. Timeouts

rights reserved. Retries & Exponential Backoff

rights reserved. No jitter With jitter https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/

rights reserved. Idempotent operation No additional effect if it is called more than once with the same input parameters.

rights reserved. Service Degradation & Fallbacks

rights reserved. Avoiding Cascading Failures • Timeouts • Retries & Exponential Backoff • Idempotent operations • Exception Handling • Rejection • Intermittent and transient errors • Service degradation & fallbacks

rights reserved. Circuit Breaker • Wrap a protected function call in a circuit breaker object, which monitors for failures. • If failures reach a certain threshold, the circuit breaker trips. Producer Circuit Breaker Consumer Connection Monitoring Timeouts Breaking Circuit

rights reserved. Non-blocking UI https://medium.com/@sophie_paxtonUX/stop-getting-in-my-way-non-blocking-ux-5cbbfe0f0158

rights reserved. Patterns for Resilient Architectures People

rights reserved. Fire Drills

rights reserved. “It is not failure itself that holds you back; it is the fear of failure that paralyses you.” Brian Tracy

rights reserved. Changing Culture takes time! Be patient…

rights reserved. More Resources • https://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf • https://www.gremlin.com • https://queue.acm.org/detail.cfm?id=2353017 • https://softwareengineeringdaily.com/ • https://github.com/dastergon/awesome-sre • https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf • https://medium.com/@NetflixTechBlog • http://principlesofchaos.org • https://speakerdeck.com/tammybutow/chaos-engineering-bootcamp • https://github.com/adhorn/awesome-chaos-engineering • https://www.infoq.com/presentations/netflix-chaos-microservices • http://royal.pingdom.com/wp-content/uploads/2015/04/pingdom_uptime_cheat_sheet.pdf • http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy

rights reserved. Thanks you! @adhorn https://medium.com/@adhorn

Chaos engineering - why breaking things should ...

Chaos engineering - why breaking things should be practised.

More Decks by Adrian Hornsby

Other Decks in Technology

Featured

Transcript