Slide 1

Slide 1 text

Chaos Engineering: Why Breaking Things Should Be Practiced.
Adrian Hornsby, Cloud Architecture Evangelist @ AWS (@adhorn)
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Slide 2

Slide 2 text

Looks familiar?

Slide 3

Slide 3 text

The micro-services architecture

Slide 4

Slide 4 text

“Failures are a given and everything will eventually fail over time.”
Werner Vogels, CTO – Amazon.com

Slide 5

Slide 5 text

… at the Edge

Slide 6

Slide 6 text

Jesse Robbins, GameDay: Creating Resiliency Through Destruction
https://www.youtube.com/watch?v=zoz0ZjfrQ9s

Slide 7

Slide 7 text

Netflix 2013
https://medium.com/netflix-techblog

Slide 8

Slide 8 text

Chaos Monkeys

Slide 9

Slide 9 text

https://bit.ly/2uKOJMQ

Slide 10

Slide 10 text

What “really” is Chaos Engineering?

Slide 11

Slide 11 text

“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
http://principlesofchaos.org

Slide 12

Slide 12 text

Break your systems on purpose. Find out their weaknesses and fix them before they break when least expected.

Slide 13

Slide 13 text

Building Confidence Through Testing
Unit testing of components:
• Tested in isolation to ensure function meets expectations.
Functional testing of integrations:
• Each execution path tested to assure expected results.
Is it enough?

Slide 14

Slide 14 text

Failure Injection
• Start small & build confidence
• Application level
• Host failure
• Resource attacks (CPU, memory, …)
• Network attacks (dependencies, latency, …)
• Region attacks!

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

“CHAOS DOESN’T CAUSE PROBLEMS. IT REVEALS THEM.”
Nora Jones, Senior Chaos Engineer, Netflix

Slide 17

Slide 17 text

Before breaking things …

Slide 18

Slide 18 text

• People
• Application
• Network & Data
• Infrastructure

Slide 19

Slide 19 text

Patterns for Resilient Architectures: Infrastructure

Slide 20

Slide 20 text

Availability
Availability          Downtime per year
99% (2-nines)         3 days 15 hours
99.99% (4-nines)      52 minutes
99.999% (5-nines)     5 minutes
99.9999% (6-nines)    31 seconds
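
The downtime figures above follow from simple arithmetic. A minimal sketch, assuming a 365-day year (the function name and constant are illustrative, not from the talk):

```python
# Assumption: a non-leap year of 365 days.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in seconds, for a given availability."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


if __name__ == "__main__":
    for pct in (99.0, 99.99, 99.999, 99.9999):
        seconds = downtime_seconds_per_year(pct)
        print(f"{pct}% -> {seconds / 60:.2f} minutes per year")
```

The slide rounds these values down to whole hours, minutes, or seconds.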

Slide 21

Slide 21 text

System Availability
Availability = Normal Operation Time / Total Time = MTBF** / (MTBF** + MTTR*)
* Mean Time To Repair (MTTR)
** Mean Time Between Failure (MTBF)
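
As a worked example of the formula, here is a minimal sketch. The function name and the sample numbers are assumptions chosen for illustration:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), both in the same time unit."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical system: fails every 999 hours on average, takes 1 hour to repair.
    print(f"{availability(999, 1):.3%}")
```

The formula shows two levers: fail less often (raise MTBF) or recover faster (lower MTTR).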

Slide 22

Slide 22 text

Availability in Parallel
Component              Availability          Downtime
X                      99% (2-nines)         3 days 15 hours
Two X in parallel      99.99% (4-nines)      52 minutes
Three X in parallel    99.9999% (6-nines)    31 seconds
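
The table can be reproduced with a short calculation. This sketch assumes the redundant copies fail independently, which real deployments only approximate; the function name is illustrative:

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components, assuming independent failures.

    The system is down only when every copy is down at the same time.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - (1 - component_availability) ** copies


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} x 99% in parallel -> {parallel_availability(0.99, n):.6%}")
```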

Slide 23

Slide 23 text

Availability Zone 1 Availability Zone 2 Availability Zone n Multi-AZ Support Instance Failure Application

Slide 24

Slide 24 text

Auto-Scaling
• Compute efficiency
• Node failure
• Traffic spikes
• Performance bugs

Slide 25

Slide 25 text

Infrastructure as Code
• Template of the infrastructure in code.
• Version controlled infrastructure.
• Repeatable template.
• Testable infrastructure.
• Automate it!

Slide 26

Slide 26 text

Immutable Infrastructure
• No updates on live systems
• Always start from a new resource being provisioned
• Deploy the new software
• Test in different environments (dev, staging)
• Deploy to prod (inactive)
• Change references (DNS or Load Balancer)
• Keep old version around (inactive)
• Fast rollback if things go wrong

Slide 27

Slide 27 text

Patterns for Resilient Architectures: Network & Data

Slide 28

Slide 28 text

CAP Theorem (Distributed System)
• Consistency: Data is consistent. All nodes see the same state.
• Availability: Every request is non-failing.
• Partition Tolerance: Service still responds as expected if some nodes crash.
In the presence of a network partition, you must choose between consistency and availability!

Slide 29

Slide 29 text

Eventual Consistency (Distributed System)
… if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.
https://en.wikipedia.org/wiki/Eventual_consistency
• Availability: Every request is non-failing.
An eventually consistent system can return any value before it converges!

Slide 30

Slide 30 text

Synchronous: Process A calls Process B and is waiting while B is working, then gets the result.
Asynchronous: Process A calls Process B and continues working, then gets or fetches the result.

Slide 31

Slide 31 text

Message passing for async. patterns A Queue B A Queue B Listener Pub-Sub

Slide 32

Slide 32 text

[Diagram: Web Instances, API Instances (×3), Queue, Worker Instances (×2), Cache]
API: {DO foo}
PUT JOB: {JobID: 0001, Task: DO foo}
API: {JobID: 0001}
GET JOB: {JobID: 0001, Task: DO foo}
Cache Result: { JobID: 0001, Result: bar }
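
The job flow on this slide can be sketched in-process. This is only an illustration under loud assumptions: a `queue.Queue` stands in for the managed queue, a dict stands in for the cache, and `submit_job`, `worker`, `run_task`, and `get_result` are hypothetical names, not an API from the talk.

```python
import queue
import threading

job_queue = queue.Queue()   # stand-in for the Queue in the diagram
result_cache = {}           # stand-in for the Cache in the diagram


def run_task(task):
    """Hypothetical task runner: 'DO foo' yields 'bar', as on the slide."""
    return "bar" if task == "DO foo" else f"done: {task}"


def submit_job(job_id, task):
    """API side: PUT JOB on the queue and return the JobID right away."""
    job_queue.put({"JobID": job_id, "Task": task})
    return {"JobID": job_id}


def worker():
    """Worker side: GET JOB, run it, and write the result to the cache."""
    while True:
        job = job_queue.get()
        try:
            if job is None:  # sentinel used here to stop the worker
                return
            result_cache[job["JobID"]] = {
                "JobID": job["JobID"],
                "Result": run_task(job["Task"]),
            }
        finally:
            job_queue.task_done()


def get_result(job_id):
    """API side: return the cached result, or None if it is not ready yet."""
    return result_cache.get(job_id)


if __name__ == "__main__":
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    print(submit_job("0001", "DO foo"))
    job_queue.join()          # wait until the worker has processed the job
    print(get_result("0001"))
    job_queue.put(None)
    thread.join()
```

The caller never blocks on the worker: it receives a JobID immediately and fetches the result later, so a slow or failed worker does not tie up the API instances.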

Slide 33

Slide 33 text

[Diagram: API Instances (×3), Queue, Worker Instances (×2), Cache, Amazon SNS, Push Notification, User]

Slide 34

Slide 34 text

Read / Write Sharding
[Diagram: App Instances (×3), RDS DB Instance Master (Multi-AZ), RDS DB Instance Read Replicas (×3)]

Slide 35

Slide 35 text

Database Federation
[Diagram: App Instances (×3), Users DB, Products DB]

Slide 36

Slide 36 text

Database Sharding
User      ShardID
002345    A
002346    B
002347    C
002348    B
002349    A
[Diagram: App Instances (×3), shards A, B, C]
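
The lookup table above suggests directory-based sharding, where the application consults a mapping to find a user's shard. A minimal sketch under that assumption; the endpoint hostnames and function names are hypothetical:

```python
# Shard directory copied from the slide's User -> ShardID table.
SHARD_DIRECTORY = {
    "002345": "A",
    "002346": "B",
    "002347": "C",
    "002348": "B",
    "002349": "A",
}

# Hypothetical connection endpoints, one per shard.
SHARD_ENDPOINTS = {
    "A": "shard-a.example.internal",
    "B": "shard-b.example.internal",
    "C": "shard-c.example.internal",
}


def shard_for_user(user_id: str) -> str:
    """Return the ShardID that holds this user's data."""
    try:
        return SHARD_DIRECTORY[user_id]
    except KeyError:
        raise LookupError(f"no shard assigned for user {user_id}") from None


def endpoint_for_user(user_id: str) -> str:
    """Return the database endpoint an app instance should query for this user."""
    return SHARD_ENDPOINTS[shard_for_user(user_id)]


if __name__ == "__main__":
    print(endpoint_for_user("002347"))
```

With this layout, losing one shard affects only the users mapped to it, which limits the blast radius of a database failure.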

Slide 37

Slide 37 text

Patterns for Resilient Architectures: Application

Slide 38

Slide 38 text

Stateless Services
[Diagram: User, AWS Region, AZ1, AZ2, Auto-Scaling Group, Cache, Data Store]

Slide 39

Slide 39 text

Transient state does not belong in the database.

Slide 40

Slide 40 text

Cascading Failures, a.k.a. the Punisher!
• Timeouts
• Retries & Exponential Backoff
• Idempotent operations
• Exception Handling
• Rejection
• Intermittent and transient errors
• Service degradation & fallbacks

Slide 41

Slide 41 text

Timeouts
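
One way to bound how long a caller waits on a dependency is sketched below. It is an illustration, not the talk's implementation: `call_with_timeout` is a hypothetical helper, and in practice a timeout set on the HTTP or database client itself is usually preferable, because the abandoned call here keeps running in its worker thread.

```python
import concurrent.futures
import time


def call_with_timeout(func, timeout_seconds, fallback=None):
    """Run func() in a worker thread and stop waiting after timeout_seconds.

    Returns func's result, or `fallback` if the deadline passes first.
    The slow call is not cancelled; the caller simply stops waiting for it.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        return fallback
    finally:
        executor.shutdown(wait=False)


if __name__ == "__main__":
    def slow_dependency():
        time.sleep(0.5)  # simulated slow downstream service
        return "late answer"

    print(call_with_timeout(slow_dependency, 0.05, fallback="cached answer"))
```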

Slide 42

Slide 42 text

Retries & Exponential Backoff
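
A minimal sketch of retrying with exponential backoff. The random jitter is an addition not mentioned on the slide, and `retry_with_backoff` and its parameters are illustrative names. Retrying is only safe when the operation is idempotent.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=2.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    The wait before retry n is a random value between 0 and
    min(max_delay, base_delay * 2**n), so clients do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay_cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay_cap))


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky():
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ConnectionError("transient failure")
        return "ok"

    print(retry_with_backoff(flaky), "after", attempts["count"], "attempts")
```

Capping the delay and the number of attempts keeps retries from piling extra load onto a dependency that is already struggling.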

Slide 43

Slide 43 text

Service Degradation & Fallbacks

Slide 44

Slide 44 text

Circuit Breaker
• Wrap a protected function call in a circuit breaker object, which monitors for failures.
• If failures reach a certain threshold, the circuit breaker trips.
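
The two bullets above can be sketched as a small class. The reset timeout and the single trial call after it elapses are assumptions added for completeness (the slide only describes monitoring and tripping), and all names are illustrative.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # assumption: not on the slide
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        """Invoke func through the breaker, failing fast while it is open."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; call rejected")
            # Reset timeout elapsed: let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail fast instead of waiting on a dependency that is known to be unhealthy, which helps stop a local failure from cascading.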

Slide 45

Slide 45 text

Non-blocking UI
https://medium.com/@sophie_paxtonUX/stop-getting-in-my-way-non-blocking-ux-5cbbfe0f0158

Slide 46

Slide 46 text

Patterns for Resilient Architectures: People

Slide 47

Slide 47 text

Fire Drills

Slide 48

Slide 48 text

“It is not failure itself that holds you back; it is the fear of failure that paralyses you.”
Brian Tracy

Slide 49

Slide 49 text

Phases of Chaos Engineering

Slide 50

Slide 50 text

• Steady State
• Hypothesis
• Design Experiment
• Verify & Learn
• Fix

Slide 51

Slide 51 text

What is Steady State?
• “normal” behavior of your system
https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero

Slide 52

Slide 52 text

What is Steady State?
• “normal” behavior of your system
• Business Metric
https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a

Slide 53

Slide 53 text

Business Metrics at work
• Amazon: 100 ms of extra load time caused a 1% drop in sales (Greg Linden).
• Google: 500 ms of extra load time caused 20% fewer searches (Marissa Mayer).
• Yahoo!: 400 ms of extra load time caused a 5–9% increase in the number of people who clicked “back” before the page even loaded (Nicole Sullivan).

Slide 54

Slide 54 text

Steady State
Important:
• Know the value range of Healthy State!
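
Knowing the healthy value range makes the steady-state check mechanical. A minimal sketch; the function name, the tolerance parameter, and the sample latencies are assumptions for illustration:

```python
def within_steady_state(samples, low, high, tolerance=0.0):
    """Return True if at most `tolerance` (a fraction) of samples fall outside [low, high]."""
    samples = list(samples)
    if not samples:
        raise ValueError("need at least one sample to judge steady state")
    outside = sum(1 for value in samples if not (low <= value <= high))
    return outside / len(samples) <= tolerance


if __name__ == "__main__":
    # Hypothetical latency samples in milliseconds, healthy range 50-200 ms.
    latencies_ms = [120, 135, 98, 110, 450, 105, 99, 130, 125, 101]
    print(within_steady_state(latencies_ms, 50, 200))
    print(within_steady_state(latencies_ms, 50, 200, tolerance=0.1))
```

The same check can be run before an experiment to confirm the baseline and during it to decide whether the hypothesis held.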

Slide 55

Slide 55 text

What if…?
• “What if this load balancer breaks?”
• “What if Redis becomes slow?”
• “What if a host on Cassandra goes away?”
• “What if latency increases by 300ms?”
• “What if the database stops?”
Make it everyone’s problem!

Slide 56

Slide 56 text

Disclaimer! Don’t make a hypothesis that you know will break you!

Slide 57

Slide 57 text

Designing Experiment
• Pick hypothesis
• Scope the experiment
• Identify metrics
• Notify the organization

Slide 58

Slide 58 text

Rules of thumb
• Start very small
• As close as possible to production
• Minimize the blast radius.
• Have an emergency STOP!

Slide 59

Slide 59 text

Canary deployment
Start with ..
• 99% Users → Old Version
• 1% Users → New Version
Dynamic Routing (Route53)
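
The 99% / 1% split can be illustrated at the application level. This sketch is an assumption-heavy stand-in: it hashes a user ID so each user consistently lands on the same version, whereas the slide's Route 53 routing works at the DNS layer. `route_version` and its parameters are hypothetical names.

```python
import hashlib


def route_version(user_id: str, canary_percent: float = 1.0) -> str:
    """Return 'new' for roughly canary_percent of users, 'old' for the rest.

    The decision is derived from a hash of the user ID, so it is stable
    across requests for the same user.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return "new" if bucket < canary_percent * 100 else "old"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(100_000)]
    on_canary = sum(route_version(u) == "new" for u in users)
    print(f"{on_canary / len(users):.2%} of users routed to the new version")
```

Setting `canary_percent` back to 0 acts as the emergency stop: all users return to the old version.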

Slide 60

Slide 60 text

Quantifying the result of the experiment
• Time to detect?
• Time for notification? And escalation?
• Time to public notification?
• Time for graceful degradation to kick in?
• Time for self healing to happen?
• Time to recovery – partial and full?
• Time to all-clear and stable?

Slide 61

Slide 61 text

DON’T blame that one person …

Slide 62

Slide 62 text

PostMortems – COE (Correction of Errors)
The 5 WHYs
Outage
• Because of …
• Because of …
• Because of …
• Because of …

Slide 63

Slide 63 text

More questions to ask.
• Can you clarify if there were any preceding events?
• Why would they believe acting in this way was the best course of action to deliver the desired outcome?
• Is there another failure mode that could present here?
• What decisions or events prior to this made this work before?
• Why stop there – are there places to dig deeper that could shine a light more on this?
• Did others step in to help, to advise, or to intercede?

Slide 64

Slide 64 text

Fix

Slide 65

Slide 65 text

Big Challenges to Chaos Engineering
Mostly Cultural
• no time or flexibility to simulate disasters.
• teams already spending all of their time fixing things.
• can be very political.
• might force deep conversations.
• deeply invested in a specific technical roadmap (microservices) that chaos engineering tests show is not as resilient to failures as originally predicted.

Slide 66

Slide 66 text

Changing Culture takes time! Be patient…

Slide 67

Slide 67 text

No content

Slide 68

Slide 68 text

More Resources
• https://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf
• https://www.gremlin.com
• https://queue.acm.org/detail.cfm?id=2353017
• https://softwareengineeringdaily.com/
• https://github.com/dastergon/awesome-sre
• https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf
• https://medium.com/@NetflixTechBlog
• http://principlesofchaos.org
• https://speakerdeck.com/tammybutow/chaos-engineering-bootcamp
• https://github.com/adhorn/awesome-chaos-engineering
• https://www.infoq.com/presentations/netflix-chaos-microservices
• http://royal.pingdom.com/wp-content/uploads/2015/04/pingdom_uptime_cheat_sheet.pdf
• http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy

Slide 69

Slide 69 text

Thank you!
@adhorn
https://medium.com/@adhorn