Slide 1

Slide 1 text

Confidential + Proprietary
Chaos Engineering For People Systems
Dave Rensin
@drensin [email protected]

Slide 2

Slide 2 text

Audience Participation!

Slide 3

Slide 3 text

Why Chaos Engineering?
●  Traditional testing assumes we know the properties of a system.
●  Large distributed systems exhibit emergent properties.
●  Therefore, we have to experiment to find out how our systems really work!

Slide 4

Slide 4 text

A discipline for systematically minimizing bad luck.

Slide 5

Slide 5 text

“Good luck is when opportunity meets preparation, while bad luck is when lack of preparation meets reality.”
- Eliyahu Goldratt

Slide 6

Slide 6 text

Companies are Distributed Systems
(Most of the complexity comes from the humans, not the machines)

Slide 7

Slide 7 text

●  Semi-autonomous units of execution with inconsistent outputs and opaque system internals.
●  Buggy biological microservices.

Slide 8

Slide 8 text

“Errare humanum est, sed perseverare diabolicum.”
(To err is human, but to persist in error is diabolical.)
- Seneca

Slide 9

Slide 9 text

The Wheel of Staycation
Rules
1.  Once a week, pick a random person on your team. The lucky person gets a “staycation”.
2.  They stay at work, but:
    a.  Cannot answer any work questions
    b.  Cannot have any work conversations
    c.  Should set an OOO message for email, etc.
3.  Totally cool to still be social, have lunch, etc. Just don’t talk about work.
4.  Have a third-party ‘proctor’ who can decide if and when you need to break glass and end the experiment.
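
The weekly pick in rule 1 could be automated with a few lines of Python. This is only a minimal sketch: the roster names and the `pick_staycation` helper are hypothetical, and the optional `rng` parameter is there just to make a draw reproducible for auditing.

```python
import random


def pick_staycation(team, rng=None):
    """Return one team member chosen uniformly at random."""
    if not team:
        raise ValueError("team must not be empty")
    rng = rng or random.Random()
    return rng.choice(team)


# Hypothetical roster, for illustration only.
team = ["Ana", "Bo", "Chen", "Dee", "Eli"]
chosen = pick_staycation(team, random.Random(2024))
print(f"This week's staycation goes to: {chosen}")
```

A uniform draw means the same person can be picked two weeks in a row, which is arguably part of the point: real absences are not scheduled fairly either.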

Slide 10

Slide 10 text

The Wheel of Staycation
Goals
1.  How much impact did the sudden absence have?
    a.  Could the team manage?
    b.  What bits of tribal knowledge were unexpectedly lost?
2.  Once a month, review all the staycation tests to look for SPOF patterns.
3.  The team should notice the absence, but be able to work around it effectively.
4.  If you need to break glass, then you have a SPOF and you need to fix that. If the team feels no impact, then it might be time for the person to find a new project/team.

Slide 11

Slide 11 text

Tortoise Time
Rules
1.  Select 20% of the team at random.
2.  For one work week (5 days), they cannot answer any work email in less than 1 hour.
3.  The proctor decides if and when you need to break glass.
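
As a rough sketch of rules 1 and 2, the snippet below draws 20% of a roster and computes the earliest moment a selected person may reply to an email. The helper names are hypothetical. Rounding the head count up, with a minimum of one person, is an assumption; the slide does not say how to handle team sizes that are not multiples of five.

```python
import random
from datetime import datetime, timedelta


def pick_tortoises(team, percent=20, rng=None):
    """Pick `percent` of the team at random, rounded up (an assumption)."""
    if not team:
        return []
    rng = rng or random.Random()
    # Integer ceiling division avoids floating-point surprises.
    count = max(1, (len(team) * percent + 99) // 100)
    return rng.sample(team, count)


def earliest_reply(received_at, min_delay=timedelta(hours=1)):
    """Earliest time a tortoise is allowed to answer an email."""
    return received_at + min_delay


# Hypothetical ten-person roster.
team = [f"person-{i}" for i in range(10)]
tortoises = pick_tortoises(team, rng=random.Random(0))
print("Tortoises this week:", tortoises)
print("Mail received at 09:00 may be answered from",
      earliest_reply(datetime(2024, 1, 1, 9, 0)).strftime("%H:%M"))
```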

Slide 12

Slide 12 text

Tortoise Time
Goals
1.  How long did the team manage before the latency became unbearable? (Hint: probably not more than 2 days.)
2.  How quickly did the senders fall back to “alternate” sources, including thin air?
3.  The goal is to expose hidden layers of your business that are particularly latency sensitive.

Slide 13

Slide 13 text

Liar Liar!
Rules
1.  Once a month, pick 1-2 people at random.
2.  For one work day they will give wrong answers.
    a.  The proctor picks the % per person.
    b.  Answers must be incorrect but plausible.
    c.  Keep a list of wrong answers and correct them the next work day.
3.  Each email for that day begins with a disclaimer: “Today, I am the Designated Liar and have been randomly selected to be buggy. If you ask me a question today, some of my answers will be intentionally incorrect. Can you tell which ones?”
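
One way to keep the bookkeeping in rule 2 honest is to let a small script decide when to lie and to log every wrong answer for the next day's corrections. The sketch below is illustrative only: the `DesignatedLiar` class, its method names, and the sample question are all hypothetical, and `lie_percent` stands in for the percentage the proctor picks.

```python
import random
from dataclasses import dataclass, field

DISCLAIMER = (
    "Today, I am the Designated Liar and have been randomly selected to be "
    "buggy. If you ask me a question today, some of my answers will be "
    "intentionally incorrect. Can you tell which ones?"
)


@dataclass
class DesignatedLiar:
    name: str
    lie_percent: int  # chosen by the proctor, 0-100
    rng: random.Random = field(default_factory=random.Random)
    wrong_answers: list = field(default_factory=list)

    def should_lie(self):
        """Decide whether the next answer should be wrong."""
        return self.rng.randrange(100) < self.lie_percent

    def record_lie(self, question, wrong, correct):
        """Log a wrong answer so it can be corrected the next work day."""
        self.wrong_answers.append((question, wrong, correct))

    def corrections(self):
        """Build the next-day correction notes."""
        return [
            f"Re {q!r}: I said {w!r}; the correct answer is {c!r}."
            for q, w, c in self.wrong_answers
        ]


liar = DesignatedLiar(name="Sam", lie_percent=30, rng=random.Random(7))
if liar.should_lie():
    liar.record_lie("When is the release?", "Tuesday", "Thursday")
for note in liar.corrections():
    print(note)
```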

Slide 14

Slide 14 text

Liar Liar!
Goals
1.  This is a fuzz testing exercise.
2.  Are recipients able to discern correct / incorrect answers? Could they have?
    a.  If not, then you’ve found an information SPOF and need to fix that.
    b.  If so, were your answers plausible enough?
3.  The goal is to test the principle of Nullius in Verba. (The motto of the Royal Society since 1660. It means “not any in words”, i.e., take nobody’s word for it.)

Slide 15

Slide 15 text

War of the Worlds
Rules
1.  Named after the 1938 radio adaptation of the H.G. Wells story, which caused a minor mass panic when people thought it was real!
2.  Simulate the most existentially threatening event you can think of for your company.
    a.  Massive security breach
    b.  Regulatory failure
    c.  Major customer meltdown
3.  Only the bare minimum number of people can know it’s a simulation.
    a.  CEO
    b.  Head of PR
    c.  Legal
    d.  Proctor

Slide 16

Slide 16 text

War of the Worlds
Goals
1.  Will people “do the right thing” in the face of an existential threat?
2.  Do people panic?
3.  Does it leak to Twitter / press?
4.  The goal is to make sure that the company can react calmly and ethically to the worst possible news.

Slide 17

Slide 17 text

Buy-In != All-in

Slide 18

Slide 18 text

X-Func #FTW

Slide 19

Slide 19 text

You Can Do This

Slide 20

Slide 20 text

In Conclusion
@drensin [email protected]