Chaos Conf
September 26, 2019


Dave Rensin, Google
The rise of highly distributed computing systems based on microservices has made predicting and debugging our products more complex than ever. In response, Chaos Engineering has developed as a way to discover, diagnose, and debug the inevitable emergent properties (and problems) that come with this new reality.

What about our human systems? Can we apply the techniques of chaos engineering to build better teams? Happier employees? More successful companies? Dave thinks so and wants to convince you, too. Come hear him try!

In this keynote, Dave will share his experiences building stronger systems, teams, and companies at Google over the last 5 years.

  1. Confidential + Proprietary
    Chaos Engineering For People Systems
    Dave Rensin @drensin [email protected]
  2. Audience Participation!
  3. Why Chaos Engineering?
    • Traditional testing assumes we know the properties of a system.
    • Large distributed systems exhibit emergent properties.
    • Therefore, we have to experiment to find out how our systems really work!
  4. A discipline for systematically minimizing bad luck.
  5. “Good luck is when opportunity meets preparation, while bad luck is when lack of preparation meets reality.”
    - Eliyahu Goldratt
  6. Companies are Distributed Systems
    (Most of the complexity comes from the humans, not the machines)
  7. • Semi-autonomous units of execution with inconsistent outputs and opaque system internals.
    • Buggy biological microservices
  8. “Errare humanum est, sed perseverare diabolicum.”
    (To err is human, but to persist in error is diabolical.)
  9. The Wheel of Staycation: Rules
    1. Once a week, pick a random person on your team. The lucky person gets a “staycation”.
    2. They stay at work, but:
      a. Cannot answer any work questions
      b. Cannot have any work conversations
      c. Should set an OOO message for email, etc.
    3. Totally cool to still be social, have lunch, etc.; just don’t talk about work.
    4. Have a 3rd-party “proctor” who can decide if/when you need to break glass and end the experiment.
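The weekly pick in the rules above could be sketched as follows. This is only an illustration, not something shown in the talk: the team roster, the `spin_wheel` name, and the fixed seed are all hypothetical, and a uniform random choice is assumed.

```python
import random


def spin_wheel(team, rng=None):
    """Pick one team member uniformly at random for this week's staycation.

    `team` is a hypothetical list of names; `rng` lets the proctor pass a
    seeded generator so the draw can be reproduced later.
    """
    if not team:
        raise ValueError("team must not be empty")
    rng = rng or random.Random()
    return rng.choice(team)


# Hypothetical roster, for illustration only.
team = ["Ana", "Bo", "Chen", "Dee", "Eli"]
pick = spin_wheel(team, random.Random(2019))
print(f"This week's staycation goes to: {pick}")
```

Passing a seeded generator is a design choice that lets the proctor show afterwards that the draw was really random rather than targeted.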
  10. The Wheel of Staycation: Goals
    1. How much impact did the sudden absence have?
      a. Could the team manage?
      b. What bits of tribal knowledge were unexpectedly lost?
    2. Once a month, review all the staycation tests to look for SPOF patterns.
    3. The team should notice the absence, but be able to work around it effectively.
    4. If you need to break glass, then you have a SPOF and you need to fix that. If the team feels no impact, then it might be time for the person to find a new project/team.
  11. Tortoise Time: Rules
    1. Select 20% of the team at random.
    2. For one work week (5 days), they cannot answer any work email in less than 1 hour.
    3. The proctor decides if/when you need to break glass.
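A minimal sketch of the selection and delay rules above, under stated assumptions: the function names and roster are hypothetical, and rounding the 20% up to at least one person is an assumption the slide does not specify.

```python
import math
import random
from datetime import datetime, timedelta

# From the rules: no work email may be answered in less than 1 hour.
MIN_REPLY_DELAY = timedelta(hours=1)


def pick_tortoises(team, percent=20, rng=None):
    """Select `percent` of the team at random (rounded up, at least one person)."""
    if not team:
        raise ValueError("team must not be empty")
    rng = rng or random.Random()
    count = max(1, math.ceil(len(team) * percent / 100))
    return rng.sample(team, count)


def earliest_reply(received_at):
    """Earliest time a selected person may answer an email received at `received_at`."""
    return received_at + MIN_REPLY_DELAY


# Hypothetical ten-person team, for illustration only.
team = [f"person-{i}" for i in range(10)]
tortoises = pick_tortoises(team, rng=random.Random(7))
print("Tortoises this week:", tortoises)
print("Mail from 09:00 may be answered at:", earliest_reply(datetime(2019, 9, 26, 9, 0)))
```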
  12. Tortoise Time: Goals
    1. How long did the team manage before the latency became unbearable? (Hint: probably not more than 2 days.)
    2. How quickly did the senders fall back to “alternate” sources, including thin air?
    3. The goal is to expose hidden layers of your business that are particularly latency sensitive.
  13. Liar Liar!: Rules
    1. Once a month, pick 1-2 people at random.
    2. For one work day they will give wrong answers.
      a. The proctor picks the % per person.
      b. Answers must be incorrect but plausible.
      c. Keep a list of wrong answers and correct them the next work day.
    3. Each email for that day begins with a disclaimer: “Today, I am the Designated Liar and have been randomly selected to be buggy. If you ask me a question today, some of my answers will be intentionally incorrect. Can you tell which ones?”
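One way to keep the bookkeeping for these rules honest is sketched below. It is an assumption-laden illustration rather than anything from the talk: the `DesignatedLiar` class, `pick_liars` helper, and sample question are hypothetical, and each answer is assumed to be an independent draw against the proctor's percentage.

```python
import random
from dataclasses import dataclass, field

DISCLAIMER = (
    "Today, I am the Designated Liar and have been randomly selected to be buggy. "
    "If you ask me a question today, some of my answers will be intentionally "
    "incorrect. Can you tell which ones?"
)


def pick_liars(team, rng):
    """Pick 1-2 people at random (fewer if the team is smaller)."""
    count = min(len(team), rng.randint(1, 2))
    return rng.sample(team, count)


@dataclass
class DesignatedLiar:
    name: str
    lie_percent: int  # chosen by the proctor, 0-100
    rng: random.Random = field(default_factory=random.Random)
    wrong_answers: list = field(default_factory=list)

    def should_lie(self):
        """Decide, per question, whether this answer should be wrong."""
        return self.rng.randrange(100) < self.lie_percent

    def record_wrong_answer(self, question, given, correct):
        """Log a wrong answer so it can be corrected the next work day."""
        self.wrong_answers.append((question, given, correct))

    def corrections(self):
        """Next-day correction notes, one per logged wrong answer."""
        return [
            f"Re: {q} -> I said {g!r}; the correct answer is {c!r}."
            for q, g, c in self.wrong_answers
        ]


# Hypothetical usage.
liar = DesignatedLiar("Ana", lie_percent=30, rng=random.Random(42))
liar.record_wrong_answer("Which region hosts the backup?", "us-east", "us-west")
print(DISCLAIMER)
print(liar.corrections()[0])
```

The log of wrong answers is the important part: without it, rule 2c (correcting everything the next work day) depends on memory.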
  14. Liar Liar!: Goals
    1. This is a fuzz testing exercise.
    2. Are recipients able to discern correct / incorrect answers? Could they have?
      a. If not, then you’ve found an information SPOF and need to fix that.
      b. If so, were your answers plausible enough?
    3. The goal is to test the principle of Nullius in Verba. (The motto of the Royal Society since 1660. Literally “on the word of no one”, i.e. take nobody’s word for it.)
  15. War of the Worlds: Rules
    1. 1938 radio adaptation of the H.G. Wells story. Caused a minor mass panic when people thought it was real!
    2. Simulate the most existentially threatening event you can think of for your company.
      a. Massive security breach
      b. Regulatory failure
      c. Major customer meltdown
    3. Only the bare minimum # of people can know it’s a simulation.
      a. CEO
      b. Head of PR
      c. Legal
      d. Proctor
  16. War of the Worlds: Goals
    1. Will people “do the right thing” in the face of an existential threat?
    2. Do people panic?
    3. Does it leak to Twitter / press?
    4. The goal is to make sure that the company can react calmly and ethically to the worst possible news.
  17. Buy-In != All-in
  18. X-Func #FTW
  19. You Can Do This
  20. In Conclusion
    @drensin [email protected]