Slide 1

Slide 1 text

Chaos Engineering: Why breaking things should be practiced
Adrian Hornsby, Sr. Technical Evangelist, Amazon Web Services (@adhorn)
Qais Ammouri, Head of Technology, Almosafer
SUMMIT. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Slide 2

Slide 2 text

Been there?

Slide 3

Slide 3 text

Distributed Systems are hard
Amazon, Twitter, Netflix

Slide 4

Slide 4 text

“Failures are a given and everything will eventually fail over time.”
Werner Vogels, CTO – Amazon.com

Slide 5

Slide 5 text

Resiliency: the ability of a system to handle and eventually recover from unexpected conditions

Slide 6

Slide 6 text

Partial failure mode

Slide 7

Slide 7 text

How do we build resilient software systems?

Slide 8

Slide 8 text

• People
• Application
• Network & Data
• Infrastructure

Slide 9

Slide 9 text

Building confidence through testing
Unit testing of components:
• Tested in isolation to ensure function meets expectations.
Functional testing of integrations:
• Each execution path tested to assure expected results.
Is it enough?

Slide 10

Slide 10 text

GameDay at Amazon
Creating Resiliency Through Destruction
https://www.youtube.com/watch?v=zoz0ZjfrQ9s

Slide 11

Slide 11 text

Chaos engineering
https://github.com/Netflix/SimianArmy

Slide 12

Slide 12 text

Failure injection
• Start small & build confidence
• Application level
• Host failure
• Resource attacks (CPU, memory, …)
• Network attacks (dependencies, latency, …)
• Region attack
• Human attack
https://www.gremlin.com
https://github.com/Netflix/SimianArmy
https://chaostoolkit.org
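
To make the "application level" bullet concrete, here is a minimal Python sketch of failure injection as a decorator: a configurable fraction of calls gets extra latency or raises an error. The names `inject_failure` and `InjectedFailure` and all parameters are hypothetical and are not taken from Gremlin, SimianArmy, or Chaos Toolkit.

```python
import functools
import random
import time


class InjectedFailure(Exception):
    """Raised by the wrapper to simulate a failing dependency."""


def inject_failure(error_rate=0.0, latency_s=0.0, latency_rate=0.0,
                   rng=None, sleep=time.sleep):
    """Decorator that injects latency and errors into a fraction of calls.

    error_rate and latency_rate are probabilities between 0.0 and 1.0.
    rng and sleep are injectable so the behavior can be tested.
    """
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Optionally slow the call down before it runs.
            if rng.random() < latency_rate:
                sleep(latency_s)
            # Optionally fail the call instead of running it.
            if rng.random() < error_rate:
                raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

Starting small could mean applying the decorator to a single non-critical call with a low `error_rate`, then widening the scope as confidence grows.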

Slide 13

Slide 13 text

Break your systems on purpose. Find out their weaknesses and fix them before they break when least expected.

Slide 14

Slide 14 text

Chaos engineering is NOT about breaking things randomly without a purpose. Chaos engineering is about breaking things in a controlled environment, through well-planned experiments, in order to build confidence in your application's ability to withstand turbulent conditions.

Slide 15

Slide 15 text

Chaos Engineering

Slide 16

Slide 16 text

Build Resilient Systems
Steady State → Hypothesis → Design & Run Experiment → Verify & Learn → Fix

Slide 17

Slide 17 text

Build Resilient Systems

Slide 18

Slide 18 text


Slide 19

Slide 19 text

2012
Our sales were less than 1 million SAR.
It all started with a handful of people between Riyadh and Egypt. In 2012, Almosafer started between Egypt and Riyadh with a focus on hotels, through social media and a call center.

Slide 20

Slide 20 text

2015
We grew to 70 employees and our sales reached 74 million SAR.
Al Tayyar Travel Group (now Seera Group) acquired 60% of Almosafer…

Slide 21

Slide 21 text

2018
We grew to 1000+ employees and our sales exceeded 1.3 billion SAR.
Crossing the billion line. Becoming the largest OTA in Saudi, fully acquired by Seera Group.
In 2018, Almosafer became the largest OTA in Saudi Arabia in the flight market.

Slide 22

Slide 22 text

AWS's largest client in KSA and the first on EKS in MENA

Slide 23

Slide 23 text

Before Chaos Engineering

Slide 24

Slide 24 text


Slide 25

Slide 25 text


Slide 26

Slide 26 text


Slide 27

Slide 27 text

● Monitoring (Eagle Eye)
● Tech Capabilities
● Culture

Slide 28

Slide 28 text

Start with people
● Try to avoid the word “Chaos” when talking to your business.

Slide 29

Slide 29 text

Start with people
● Try to avoid the word “Chaos” when talking to your business.
● Embrace failure, and fix it.

Slide 30

Slide 30 text

Start with people
● Try to avoid the word “Chaos” when talking to your business.
● Embrace failure, and fix it.
● Replace “if it fails” with “when it fails”.

Slide 31

Slide 31 text

Start with people
● Try to avoid the word “Chaos” when talking to your business.
● Embrace failure, and fix it.
● Replace “if it fails” with “when it fails”.
● Everything fails, at least once!

Slide 32

Slide 32 text

Start with people
● Try to avoid the word “Chaos” when talking to your business.
● Embrace failure, and fix it.
● Replace “if it fails” with “when it fails”.
● Everything fails, at least once!
● Do fire drills, at least once a month.

Slide 33

Slide 33 text

Resiliency in Almosafer
● Monitor everything, or die trying.

Slide 34

Slide 34 text

Resiliency in Almosafer
● Monitor everything, or die trying.
● Architect with failure in mind; it is not an edge case.

Slide 35

Slide 35 text

Resiliency in Almosafer
● Monitor everything, or die trying.
● Architect with failure in mind; it is not an edge case.
● Resiliency starts in the frontend; avoid a blocking UI.

Slide 36

Slide 36 text

Resiliency in Almosafer
● Monitor everything, or die trying.
● Architect with failure in mind; it is not an edge case.
● Resiliency starts in the frontend; avoid a blocking UI.
● Automation testing is not a “nice to have”; it is a “must have”.

Slide 37

Slide 37 text

Resiliency in Almosafer
● Monitor everything, or die trying.
● Architect with failure in mind; it is not an edge case.
● Resiliency starts in the frontend; avoid a blocking UI.
● Automation testing is not a luxury product.
● Use circuit breaking: timeouts, retries and fallbacks.
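
The circuit-breaking bullet can be illustrated with a minimal Python sketch. It is not Almosafer's implementation: the class name `CircuitBreaker`, its parameters, and its defaults are assumptions made for this example. The wrapped call is expected to enforce its own timeout (for instance through the timeout option of whatever client it uses) and to raise an exception when that timeout expires.

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then short-circuits
    to the fallback until `reset_after_s` seconds have passed."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so tests can control time
        self.failures = 0           # consecutive failures seen so far
        self.opened_at = None       # None means the circuit is closed

    def is_open(self):
        if self.opened_at is None:
            return False
        # After the reset period, let a trial call through (half-open).
        return self.clock() - self.opened_at < self.reset_after_s

    def call(self, func, fallback, retries=1):
        """Call func(), retrying up to `retries` times; use fallback() when
        the circuit is open or every attempt fails."""
        if self.is_open():
            return fallback()
        for _ in range(retries + 1):
            try:
                result = func()     # func is assumed to time out on its own
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = self.clock()
                    break           # stop retrying once the circuit opens
            else:
                self.failures = 0
                self.opened_at = None
                return result
        return fallback()
```

While the circuit is open the dependency is not called at all, which protects it from extra load and keeps the caller's latency low; the fallback might return cached or default data.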

Slide 38

Slide 38 text

Redundancy is fundamental.
● Don’t put all your eggs in the same basket: be multi-Region and multi-AZ.

Slide 39

Slide 39 text

What is Next?

Slide 40

Slide 40 text


Slide 41

Slide 41 text

Steady State

Slide 42

Slide 42 text

What is steady state?
• “Normal” behavior of your system
https://www.elastic.co/blog/timelion-tutorial-from-zero-to-hero

Slide 43

Slide 43 text

What is steady state?
• “Normal” behavior of your system
• Business Metric
https://medium.com/netflix-techblog/sps-the-pulse-of-netflix-streaming-ae4db0e05f8a
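
One way to turn "normal behavior" into something checkable is to compare a recent window of a business metric with a known baseline. The Python sketch below is only an illustration: the function name `within_steady_state`, the 5% default tolerance, and the use of a simple mean are assumptions, not something the slides prescribe.

```python
from statistics import mean


def within_steady_state(samples, baseline, tolerance=0.05):
    """Return True if the mean of `samples` is within `tolerance`
    (a fraction, e.g. 0.05 for 5%) of the `baseline` value."""
    if not samples:
        raise ValueError("need at least one sample")
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    # Relative deviation of the observed mean from the baseline.
    deviation = abs(mean(samples) - baseline) / baseline
    return deviation <= tolerance
```

For example, with a hypothetical baseline of 100 orders per minute, samples of 98, 101 and 102 count as steady state, while samples of 80, 85 and 90 do not.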

Slide 44

Slide 44 text

Business metrics at work
• Amazon: 100 ms of extra load time caused a 1% drop in sales (Greg Linden).
• Google: 500 ms of extra load time caused 20% fewer searches (Marissa Mayer).
• Yahoo!: 400 ms of extra load time caused a 5–9% increase in the number of people who clicked “back” before the page even loaded (Nicole Sullivan).

Slide 45

Slide 45 text

Hypothesis

Slide 46

Slide 46 text

What if…?
“What if this load balancer breaks?”
“What if Redis becomes slow?”
“What if a host on Cassandra goes away?”
“What if latency increases by 300 ms?”
“What if the database stops?”
Make it everyone’s problem!

Slide 47

Slide 47 text

Disclaimer!
Don’t make a hypothesis that you know will break you!

Slide 48

Slide 48 text

Design & Run Experiment

Slide 49

Slide 49 text

Designing experiment
• Pick hypothesis
• Scope the experiment
• Identify metrics
• Notify the organization

Slide 50

Slide 50 text

Rules of thumb
• Start very small
• As close as possible to production
• Minimize the blast radius.
• Have an emergency STOP!
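
The emergency-stop rule can be sketched as a small experiment runner in Python. This is a hedged illustration, not a real chaos tool: `run_experiment` and its callables (`steady_state_ok`, `inject`, `rollback`) are hypothetical names, and a real runner would wait between checks instead of polling in a tight loop.

```python
def run_experiment(steady_state_ok, inject, rollback, checks=5):
    """Run one chaos experiment with an emergency stop.

    steady_state_ok: callable returning True while the system looks normal.
    inject / rollback: callables that start and undo the injected failure.
    Returns "skipped", "aborted", or "passed".
    """
    if not steady_state_ok():
        return "skipped"  # never start from an already unhealthy system
    inject()
    try:
        for _ in range(checks):
            if not steady_state_ok():
                return "aborted"  # emergency stop: steady state was lost
        return "passed"
    finally:
        rollback()  # always undo the injected failure, whatever happened
```

Putting the rollback in `finally` means the failure is undone even when a check raises, which is one way to keep the blast radius bounded.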

Slide 51

Slide 51 text

Running Chaos Experiment
Diagram: users are split by a canary deployment, with 99% of users on the normal version and 1% of users on the canary.
Start with ..
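
A 99%/1% split like this one is often done by hashing a stable user identifier into buckets. The Python sketch below shows that idea under stated assumptions: the function name `in_experiment`, the choice of SHA-256, and the 10,000-bucket granularity are illustrative and not taken from the talk.

```python
import hashlib


def in_experiment(user_id, percent=1.0):
    """Deterministically place about `percent`% of users in the experiment.

    The same user_id always gets the same answer, so a user does not flip
    between the normal version and the experiment from request to request.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100  # 1.0% maps to buckets 0..99
```

Raising `percent` gradually widens the experiment, and setting it to 0 acts as a simple kill switch for new assignments.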

Slide 52

Slide 52 text

Verify & Learn

Slide 53

Slide 53 text

Quantifying the result of the experiment
• Time to detect?
• Time for notification? And escalation?
• Time to public notification?
• Time for graceful degradation to kick in?
• Time for self-healing to happen?
• Time to recovery – partial and full?
• Time to all-clear and stable?
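
Answering these questions mostly comes down to recording timestamps during the experiment and subtracting. A minimal Python sketch, assuming a hypothetical `events` mapping whose keys (such as "detected" or "recovered") are chosen by the team running the experiment:

```python
from datetime import datetime


def experiment_timings(events):
    """Seconds from failure injection to each later milestone.

    `events` maps milestone names to datetimes and must contain "injected".
    """
    start = events["injected"]
    return {
        f"time_to_{name}": (ts - start).total_seconds()
        for name, ts in events.items()
        if name != "injected"
    }
```

Tracking these numbers across repeated experiments shows whether detection and recovery are getting faster.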

Slide 54

Slide 54 text

PostMortems – COE (Correction of Errors)
The 5 WHYs: Outage → Because of … → Because of … → Because of … → Because of …
NOT ENOUGH

Slide 55

Slide 55 text

More questions to ask
• Can you clarify if there were any preceding events?
• Why would they believe acting in this way was the best course of action to deliver the desired outcome?
• Is there another failure mode that could present here?
• What decisions or events prior to this made this work before?
• Why stop there – are there places to dig deeper that could shine a light more on this?
• Did others step in to help, to advise, or to intercede?

Slide 56

Slide 56 text

Rules to remember!
1. Failure requires multiple faults.
2. There is no isolated ‘cause’ of an accident.
3. There are multiple contributors to accidents.

Slide 57

Slide 57 text

DON’T blame that one person …

Slide 58

Slide 58 text

Fix

Slide 59

Slide 59 text

Fix

Slide 60

Slide 60 text

Big challenges to chaos engineering
Mostly cultural:
• no time or flexibility to simulate disasters.
• teams already spending all of their time fixing things.
• can be very political.
• might force deep conversations.
• deeply invested in a specific technical roadmap (micro-services) that chaos engineering tests show is not as resilient to failures as originally predicted.

Slide 61

Slide 61 text

Changing culture takes time! Be patient…

Slide 62

Slide 62 text


Slide 63

Slide 63 text

More Resources
• https://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf
• https://www.gremlin.com
• https://queue.acm.org/detail.cfm?id=2353017
• https://softwareengineeringdaily.com/
• https://github.com/dastergon/awesome-sre
• https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf
• https://medium.com/@NetflixTechBlog
• http://principlesofchaos.org
• https://speakerdeck.com/tammybutow/chaos-engineering-bootcamp
• https://github.com/adhorn/awesome-chaos-engineering
• https://www.infoq.com/presentations/netflix-chaos-microservices
• http://royal.pingdom.com/wp-content/uploads/2015/04/pingdom_uptime_cheat_sheet.pdf
• http://willgallego.com/2018/04/02/no-seriously-root-cause-is-a-fallacy
• https://medium.com/@adhorn

Slide 64

Slide 64 text


Slide 65

Slide 65 text

Thank you!
Adrian Hornsby
@adhorn
https://medium.com/@adhorn