Slide 1

Slide 1 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. SUMMIT
Creating Resiliency Through Destruction
Adrian Hornsby, Sr. Technical Evangelist, Amazon Web Services
adhorn

Slide 2

Slide 2 text

Can you guess what will happen?

Slide 3

Slide 3 text

Distributed systems are hard
Amazon, Twitter, Netflix

Slide 4

Slide 4 text

“Failures are a given and everything will eventually fail over time.”
Werner Vogels, CTO, Amazon.com

Slide 5

Slide 5 text

Resiliency: the ability of a system to handle and eventually recover from unexpected conditions

Slide 6

Slide 6 text

Resiliency at work

Slide 7

Slide 7 text

How do we build resilient software systems?

Slide 8

Slide 8 text

• People
• Application
• Network & Data
• Infrastructure

Slide 9

Slide 9 text

Building confidence through testing
Unit testing of components:
• Tested in isolation to ensure function meets expectations.
Functional testing of integrations:
• Each execution path tested to assure expected results.
Is it enough???

Slide 10

Slide 10 text

GameDay at Amazon
Creating Resiliency Through Destruction
https://www.youtube.com/watch?v=zoz0ZjfrQ9s

Slide 11

Slide 11 text

Chaos engineering
https://github.com/Netflix/SimianArmy

Slide 12

Slide 12 text

Failure injection
• Start small & build confidence
• Application level
• Host failure
• Resource attacks (CPU, memory, …)
• Network attacks (dependencies, latency, …)
• Region attack
• Human attack

Slide 13

Slide 13 text

Break your systems on purpose. Find out their weaknesses and fix them before they break when least expected.

Slide 14

Slide 14 text

Chaos engineering is NOT about breaking things randomly without a purpose. It is about breaking things in a controlled environment, through well-planned experiments, in order to build confidence in your application's ability to withstand turbulent conditions.

Slide 15

Slide 15 text

Chaos Engineering

Slide 16

Slide 16 text

• Build Resilient Systems
• Steady State
• Hypothesis
• Design & Run Experiment
• Verify & Learn
• Fix

Slide 17

Slide 17 text

Build Resilient Systems

Slide 18

Slide 18 text


Slide 19

Slide 19 text

Cascading and Overload Failures

Slide 20

Slide 20 text

Operations Infrastructure Application Software

Slide 21

Slide 21 text

Shameless plug - https://medium.com/@adhorn

Slide 22

Slide 22 text

https://aws.amazon.com/wellarchitected

Slide 23

Slide 23 text

Steady State

Slide 24

Slide 24 text

What is steady state?
• “Normal” behavior of your system
https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero

Slide 25

Slide 25 text

What is steady state?
• “Normal” behavior of your system
• Business metric
https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a
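One way to make "steady state" concrete is to compare a live business metric against a baseline and allow a small tolerance. The sketch below is illustrative only: the function name, the sample values, and the 10% tolerance are assumptions and do not come from the talk.

```python
from statistics import mean


def within_steady_state(baseline, observed, tolerance=0.10):
    """Return True if the observed mean stays within `tolerance`
    (expressed as a fraction) of the baseline mean."""
    expected = mean(baseline)
    if expected == 0:
        raise ValueError("baseline mean must be non-zero")
    deviation = abs(mean(observed) - expected) / abs(expected)
    return deviation <= tolerance


# Hypothetical "stream starts per second" samples.
baseline_sps = [1000, 1020, 980, 1010]
during_experiment = [990, 1005, 970, 1000]
degraded = [700, 650, 720, 690]

print(within_steady_state(baseline_sps, during_experiment))  # True
print(within_steady_state(baseline_sps, degraded))           # False
```

A real check would read these samples from a monitoring system and would probably account for daily or weekly seasonality in the baseline.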

Slide 26

Slide 26 text

Business metrics at work
• Amazon: 100 ms of extra load time caused a 1% drop in sales (Greg Linden).
• Google: 500 ms of extra load time caused 20% fewer searches (Marissa Mayer).
• Yahoo!: 400 ms of extra load time caused a 5–9% increase in the number of people who clicked “back” before the page even loaded (Nicole Sullivan).

Slide 27

Slide 27 text

Hypothesis

Slide 28

Slide 28 text

What if…?
• “What if this load balancer breaks?”
• “What if Redis becomes slow?”
• “What if a host on Cassandra goes away?”
• “What if latency increases by 300ms?”
• “What if the database stops?”
Make it everyone’s problem!

Slide 29

Slide 29 text

Disclaimer! Don’t make a hypothesis that you know will break you!

Slide 30

Slide 30 text

Design & Run Experiment

Slide 31

Slide 31 text

Designing an experiment
• Pick a hypothesis
• Scope the experiment
• Identify metrics
• Notify the organization

Slide 32

Slide 32 text

Rules of thumb
• Start very small
• Stay as close as possible to production
• Minimize the blast radius
• Have an emergency STOP!
• Be careful with state that can’t be rolled back (corrupt or incorrect data)
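The rules above can be sketched as a small experiment runner with a built-in emergency stop. This is a minimal illustration, not a real chaos tool: the `Experiment` class, its fields, and the callbacks are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Experiment:
    hypothesis: str
    blast_radius_percent: float            # share of traffic or hosts affected
    metrics: List[str] = field(default_factory=list)

    def run(self, inject: Callable[[], None], rollback: Callable[[], None],
            healthy: Callable[[], bool], checks: int = 5) -> str:
        """Inject the fault, poll the health check, and always roll back."""
        inject()
        try:
            for _ in range(checks):
                if not healthy():
                    return "aborted"       # emergency stop
            return "completed"
        finally:
            rollback()                     # runs on abort, success, or error


events = []
readings = iter([True, False, True])       # hypothetical health-check results
experiment = Experiment(
    hypothesis="Checkout survives 200 ms of extra latency",
    blast_radius_percent=1.0,
    metrics=["error_rate", "p99_latency"],
)
result = experiment.run(
    inject=lambda: events.append("inject"),
    rollback=lambda: events.append("rollback"),
    healthy=lambda: next(readings),
)
print(result, events)  # aborted ['inject', 'rollback']
```

Putting the rollback in a `finally` block means the fault is removed even when the experiment is stopped early. It does not help with state that cannot be rolled back, which is why that rule stands on its own.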

Slide 33

Slide 33 text

Running a chaos experiment
[Diagram: users are split between the normal version (99% of users) and a canary deployment (1% of users)]
Start with ..
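A canary split like the one on this slide can be approximated by hashing a stable user identifier into buckets. The sketch below is one possible approach under stated assumptions: the function name and the bucket scheme are hypothetical, and real deployments usually do this at the load balancer or in a feature-flag service.

```python
import hashlib


def in_canary(user_id: str, canary_percent: float = 1.0) -> bool:
    """Deterministically place roughly `canary_percent` of users in the canary."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000   # buckets 0..9999
    return bucket < canary_percent * 100                  # 1% -> buckets 0..99


users = [f"user-{i}" for i in range(20_000)]
canary_users = [u for u in users if in_canary(u)]
print(f"{len(canary_users) / len(users):.2%} of users routed to the canary")
```

Because the assignment depends only on the user ID, the same user stays on the same side of the split for the whole experiment, which keeps the blast radius stable.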

Slide 34

Slide 34 text


Slide 35

Slide 35 text

Fault Injection Queries for Amazon Aurora
SQL commands issued to simulate:
• A crash of the master instance or an Aurora Replica
• A failure of an Aurora Replica
• A disk failure
• Disk congestion
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Managing.FaultInjectionQueries.html

Slide 36

Slide 36 text

Fault Injection Queries for Amazon Aurora
SQL commands issued to simulate:
• A crash of the master instance or an Aurora Replica
• A failure of an Aurora Replica
• A disk failure
• Disk congestion

ALTER SYSTEM SIMULATE percentage_of_failure PERCENT DISK FAILURE
    [ IN DISK index | NODE index ]
    FOR INTERVAL quantity { YEAR | QUARTER | MONTH | WEEK | DAY | HOUR | MINUTE | SECOND };
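To show how the placeholders in that syntax are filled in, here is a small sketch that builds a concrete statement and validates its arguments first. The helper name is hypothetical, it leaves out the optional `IN DISK` / `NODE` clause, and it only produces the SQL text. Running the statement would require a MySQL client connected to an Aurora MySQL cluster.

```python
VALID_UNITS = {"YEAR", "QUARTER", "MONTH", "WEEK", "DAY", "HOUR", "MINUTE", "SECOND"}


def disk_failure_query(percent: int, quantity: int, unit: str) -> str:
    """Build a disk-failure fault injection statement from the slide's syntax."""
    unit = unit.upper()
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    if unit not in VALID_UNITS:
        raise ValueError(f"unsupported interval unit: {unit}")
    return (
        f"ALTER SYSTEM SIMULATE {percent} PERCENT DISK FAILURE "
        f"FOR INTERVAL {quantity} {unit};"
    )


print(disk_failure_query(25, 1, "minute"))
# ALTER SYSTEM SIMULATE 25 PERCENT DISK FAILURE FOR INTERVAL 1 MINUTE;
```

A short interval and a low failure percentage are in keeping with the "start small" rule from earlier in the deck.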

Slide 37

Slide 37 text

How to DDoS yourself
~ wrk -t12 -c400 -d30s http://127.0.0.1/api/health

Slide 38

Slide 38 text

Come late to work
~ tc qdisc add dev eth0 root netem delay 200ms

Slide 39

Slide 39 text

https://github.com/Netflix/SimianArmy
A set of scheduled agents that:
• shut down services randomly
• slow down performance
• check conformity
• break an entire region
• More …

Slide 40

Slide 40 text

The Chaos Toolkit
• Simplifying Adoption of Chaos Engineering
• An Open API to Chaos Engineering
• Open source extensions for
  • Infrastructure/Platform Fault Injections
  • Application Fault Injections
  • Observability
• Integrates easily into CI/CD pipelines

Slide 41

Slide 41 text


Slide 42

Slide 42 text


Slide 43

Slide 43 text


Slide 44

Slide 44 text


Slide 45

Slide 45 text

ToxiProxy
• HTTP API
• Built with automated testing in mind
• Not for production environments
• Fast
• Toxics for:
  • Timeouts, latency, connections and bandwidth limitation, etc.
• CLI
• Stable and well tested (used for 3 years at Shopify)
• Open Source: https://github.com/Shopify/toxiproxy

Slide 46

Slide 46 text

https://atscaleconference.com/videos/resiliency-testing-with-toxiproxy/

Slide 47

Slide 47 text

https://github.com/asobti/kube-monkey

Slide 48

Slide 48 text

Pumba
https://github.com/alexei-led/pumba/

Slide 49

Slide 49 text

https://blog.thundra.io/chaos-test-your-lambda-functions-with-thundra

Slide 50

Slide 50 text

Injecting Chaos to AWS Lambda functions using Lambda Layers
https://medium.com/@adhorn/injecting-chaos-to-aws-lambda-functions-using-lambda-layers-2963f996e0ba
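The general idea of application-level fault injection in a function handler can be sketched with a decorator that adds latency or raises an error before the real handler runs. This is an illustration only and is not the API of the library described in the linked article: `inject_chaos`, its parameters, and the injectable `rng` and `sleep` hooks are assumptions made for this example.

```python
import functools
import random
import time


def inject_chaos(delay_seconds=0.0, error_rate=0.0,
                 rng=random.random, sleep=time.sleep):
    """Wrap a handler so it can be delayed or made to fail on purpose."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            if delay_seconds > 0:
                sleep(delay_seconds)                 # injected latency
            if rng() < error_rate:                   # injected failure
                raise RuntimeError("injected failure")
            return handler(event, context)
        return wrapper
    return decorator


@inject_chaos(delay_seconds=0.05, error_rate=0.0)
def handler(event, context):
    return {"statusCode": 200, "body": "ok"}


print(handler({}, None))  # {'statusCode': 200, 'body': 'ok'}
```

In practice the delay and error rate would come from configuration that can be switched off quickly, so the experiment has an emergency stop.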

Slide 51

Slide 51 text

Verify & Learn

Slide 52

Slide 52 text

Quantifying the result of the experiment
• Time to detect?
• Time for notification? And escalation?
• Time to public notification?
• Time for graceful degradation to kick in?
• Time for self-healing to happen?
• Time to recovery (partial and full)?
• Time to all-clear and stable?
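These questions can be answered from a timeline of events recorded during the experiment. A minimal sketch, assuming a hypothetical dictionary of timestamps whose key names are invented for this example:

```python
from datetime import datetime


def experiment_timings(timeline: dict) -> dict:
    """Return seconds elapsed from fault injection to each later event."""
    start = timeline["fault_injected"]
    return {
        name: (timestamp - start).total_seconds()
        for name, timestamp in timeline.items()
        if name != "fault_injected"
    }


# Hypothetical timeline of one experiment.
timeline = {
    "fault_injected": datetime(2019, 1, 1, 12, 0, 0),
    "detected": datetime(2019, 1, 1, 12, 0, 45),
    "notified": datetime(2019, 1, 1, 12, 2, 0),
    "recovered": datetime(2019, 1, 1, 12, 10, 30),
}
print(experiment_timings(timeline))
# {'detected': 45.0, 'notified': 120.0, 'recovered': 630.0}
```

Tracking the same numbers across repeated experiments shows whether detection and recovery are actually improving.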

Slide 53

Slide 53 text

PostMortems – COE (Correction of Errors)
The 5 WHYs
• Outage
• Because of …
• Because of …
• Because of …
• Because of …

Slide 54

Slide 54 text

Rules to remember!
There is no isolated ‘cause’ of an accident.

Slide 55

Slide 55 text

More questions to ask
• Can you clarify if there were any preceding events?
• Why would they believe acting in this way was the best course of action to deliver the desired outcome?
• Is there another failure mode that could present here?
• What decisions or events prior to this made this work before?
• Why stop there – are there places to dig deeper that could shine a light more on this?
• Did others step in to help, to advise, or to intercede?

Slide 56

Slide 56 text

DON’T blame that one person …

Slide 57

Slide 57 text

Fix

Slide 58

Slide 58 text

Fix

Slide 59

Slide 59 text

Big challenges to chaos engineering
Mostly cultural:
• No time or flexibility to simulate disasters.
• Teams already spend all of their time fixing things.
• Can be very political.
• Might force deep conversations.
• Teams are deeply invested in a specific technical roadmap (micro-services) that chaos engineering tests show is not as resilient to failures as originally predicted.

Slide 60

Slide 60 text

Big challenges to chaos engineering
• Chaos Engineering won’t make your system more robust, people will.
• Chaos Engineering won’t replace __all__ the rest (test, quality, …)
• Chaos Engineering is NOT the only way to learn from failure
• Rollbacks are HARD because of state.
• Your systems will continue to fail, sorry.

Slide 61

Slide 61 text

Changing culture takes time! Be patient…

Slide 62

Slide 62 text


Slide 63

Slide 63 text

More Resources
• https://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf
• https://www.gremlin.com
• https://queue.acm.org/detail.cfm?id=2353017
• https://softwareengineeringdaily.com/
• https://github.com/dastergon/awesome-sre
• https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf
• https://medium.com/@NetflixTechBlog
• http://principlesofchaos.org
• https://speakerdeck.com/tammybutow/chaos-engineering-bootcamp
• https://github.com/adhorn/awesome-chaos-engineering
• https://www.infoq.com/presentations/netflix-chaos-microservices
• http://royal.pingdom.com/wp-content/uploads/2015/04/pingdom_uptime_cheat_sheet.pdf
• http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy
• https://medium.com/@adhorn

Slide 64

Slide 64 text


Slide 65

Slide 65 text

Thank you!
Adrian Hornsby
https://medium.com/@adhorn