Slide 1

Slide 1 text

Forming Failure Hypothesis September 26, 2019

Slide 2

Slide 2 text

Subbu Allamaraju @sallamar Expedia Group See https://www.subbu.org for slides.

Slide 3

Slide 3 text

“Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
From Principles of Chaos Engineering (https://principlesofchaos.org/)
Photo by Jilbert Ebrahimi on Unsplash

Slide 4

Slide 4 text

[Diagram labels: Stable; Ok. Back to Stable; Your assumed fault boundary; Actual fault boundary]
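The gap between the assumed and the actual fault boundary on this slide can be made concrete with a small experiment harness. The following is only an illustrative sketch and not from the talk: `Experiment`, `run`, the component names, and the in-memory `healthy` map are all hypothetical, and the fault is simulated rather than injected into a real system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set


@dataclass
class Experiment:
    """A failure hypothesis: which components we expect a fault to reach."""
    name: str
    assumed_boundary: Set[str]        # components we believe the fault can affect
    inject: Callable[[], Set[str]]    # injects the fault, returns components actually affected
    rollback: Callable[[], None]      # restores the system
    steady_state: Callable[[], bool]  # True when the system looks healthy


def run(exp: Experiment) -> dict:
    """Run one experiment and compare the actual blast radius to the assumed one."""
    if not exp.steady_state():
        return {"ran": False, "reason": "system not stable before experiment"}
    try:
        actual = exp.inject()
    finally:
        exp.rollback()  # always try to get back to stable
    escaped = actual - exp.assumed_boundary
    return {
        "ran": True,
        "hypothesis_held": not escaped,
        "escaped_components": sorted(escaped),
        "back_to_stable": exp.steady_state(),
    }


# Simulated system (hypothetical): "search" secretly depends on "cache".
healthy: Dict[str, bool] = {"cache": True, "search": True, "checkout": True}


def inject_cache_fault() -> Set[str]:
    healthy["cache"] = False
    healthy["search"] = False  # hidden dependency the hypothesis did not account for
    return {name for name, ok in healthy.items() if not ok}


def restore() -> None:
    for name in healthy:
        healthy[name] = True


experiment = Experiment(
    name="cache outage",
    assumed_boundary={"cache"},
    inject=inject_cache_fault,
    rollback=restore,
    steady_state=lambda: all(healthy.values()),
)
result = run(experiment)
print(result)
```

In this simulated run the hypothesis does not hold: the fault escapes the assumed boundary and reaches `search`, which is the situation the diagram describes.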

Slide 5

Slide 5 text

What is the system?
How do you form a hypothesis?
How do you ensure system safety?
Why should anyone listen to you?

Slide 6

Slide 6 text

Cloud adoption

Slide 7

Slide 7 text

Cloud adoption
How to build resilience?

Slide 8

Slide 8 text

Chaos engineering
Cloud adoption
How to build resilience?

Slide 9

Slide 9 text

Chaos engineering
Cloud adoption
How to build resilience?
Not everyone likes you to attack their apps

Slide 10

Slide 10 text

Chaos engineering
Cloud adoption
How to build resilience?
Randomly killing servers uncovers trivial issues only
Not everyone likes you to attack their apps

Slide 11

Slide 11 text

Chaos engineering
Randomly killing servers uncovers trivial issues only
Not everyone likes you to attack their apps
You can’t/won’t test more serious failures

Slide 12

Slide 12 text

Chaos engineering
Randomly killing servers uncovers trivial issues only
Not everyone likes you to attack their apps
You can’t/won’t test more serious failures
Self-doubt

Slide 13

Slide 13 text

Chaos engineering
Randomly killing servers uncovers trivial issues only
Not everyone likes you to attack their apps
You can’t/won’t test more serious failures
Self-doubt

Slide 14

Slide 14 text

Null hypothesis
Chaos engineering has nothing to do with the system’s capability to withstand turbulent conditions in production.

Slide 15

Slide 15 text

How is the system behaving in production today?
How do we make the system withstand turbulent conditions?
Photo by Hush Naidoo on Unsplash

Slide 16

Slide 16 text

“as designed”: Biased by your expectation of how the system is supposed to work
“as it is”: The real world
Metrics, Alerts, Logs, Docs, Diagrams, Code, Incidents

Slide 17

Slide 17 text

Let’s observe the real world

Slide 18

Slide 18 text

1. Changes are contributing to the majority of impact

Slide 19

Slide 19 text

2. Second/higher order effects are hard to troubleshoot
[Diagram labels: Retired App, Big App, Tech Debt, Another Big App]

Slide 20

Slide 20 text

3. We don’t understand where a failure stops
Contributing to cascading failures and long recovery

Slide 21

Slide 21 text

1. Improve release safety through progressive delivery
2. Ensure tighter fault domain boundaries in the “as designed” state
3. Implement safety in the “as designed” state
4. Only then pick what to test
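One way to picture the first item, progressive delivery, is a staged rollout that stops and rolls back at the first unhealthy check. This is a minimal sketch under stated assumptions, not the speaker's implementation: `progressive_rollout`, `set_traffic_percent`, `is_healthy`, and the stage percentages are hypothetical, and the health signal is simulated.

```python
from typing import Callable, Dict, List, Sequence


def progressive_rollout(
    stages: Sequence[int],
    set_traffic_percent: Callable[[int], None],
    is_healthy: Callable[[], bool],
) -> List[int]:
    """Shift traffic to a new release stage by stage.

    Rolls back to 0% at the first unhealthy check.
    Returns the traffic percentages that were applied, in order.
    """
    applied: List[int] = []
    for percent in stages:
        set_traffic_percent(percent)
        applied.append(percent)
        if not is_healthy():
            set_traffic_percent(0)  # roll back and stop
            applied.append(0)
            break
    return applied


# Simulated release (hypothetical): it becomes unhealthy at 25% traffic or more.
current: Dict[str, int] = {"percent": 0}


def set_traffic(percent: int) -> None:
    current["percent"] = percent


def healthy() -> bool:
    return current["percent"] < 25


history = progressive_rollout([1, 5, 25, 100], set_traffic, healthy)
print(history)
```

In the simulated run the rollout never reaches 100%: the problem surfaces at the 25% stage, so only a fraction of traffic is exposed before the rollback.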

Slide 22

Slide 22 text

These observations are relevant, but not as much as the act of learning from incidents, because the “as it is” state might tell you what to do.

Slide 23

Slide 23 text

Pick the most critical areas
But how to prioritize such work?
Articulate value

Slide 24

Slide 24 text

Randomly killing servers uncovers trivial issues only
Not everyone likes you to attack their apps
You can’t/won’t test more serious failures
Self-doubt
Learn from incidents

Slide 25

Slide 25 text

Randomly killing servers uncovers trivial issues only
Not everyone likes you to attack their apps
You can’t/won’t test more serious failures
Self-doubt
Learn from incidents
Make value-based decisions

Slide 26

Slide 26 text

How to learn from incidents?

Slide 27

Slide 27 text

How does it feel when you learn from incidents?

Slide 28

Slide 28 text

1. You have developed mental models of how the system works when it does, and doesn’t when it doesn’t.
2. You’re not chasing symptoms but are beginning to understand the system as a whole.

Slide 29

Slide 29 text

3. You start to understand the role of people, processes and tools in success as well as failure.
4. You are able to articulate the value of hygiene investments.

Slide 30

Slide 30 text

Lessons learned

Slide 31

Slide 31 text

1. Learn from incidents.
2. There is no lesson 2.

Slide 32

Slide 32 text

@sallamar – https://www.subbu.org
Source: https://www.trover.com/li/vCu0/nZk3
Thank you