KEYNOTE: CHAOS ENGINEERING FOR PEOPLE SYSTEMS

Chaos Conf
September 26, 2019

Dave Rensin, Google
The rise of highly distributed computing systems based on microservices has made predicting and debugging our products more complex than ever. In response, Chaos Engineering has developed as a way to discover, diagnose, and debug the inevitable emergent properties (and problems) that come with this new reality.

What about our human systems? Can we apply the techniques of chaos engineering to build better teams? Happier employees? More successful companies? Dave thinks so and wants to convince you, too. Come hear him try!

In this keynote, Dave will share his experiences building stronger systems, teams, and companies at Google over the last 5 years.

Transcript

  1. Confidential + Proprietary
    Chaos Engineering For People Systems
    Dave Rensin
    @drensin
    [email protected]

  2. Confidential + Proprietary
    Audience Participation!

  3. Confidential + Proprietary
    Why Chaos Engineering?
    ●  Traditional testing assumes we know the properties of a system.
    ●  Large distributed systems exhibit emergent properties.
    ●  Therefore, we have to experiment to find out how our systems really work!

  4. Confidential + Proprietary
    A discipline for systematically minimizing bad luck.

  5. Confidential + Proprietary
    “Good luck is when opportunity meets preparation, while bad luck is when lack of preparation meets reality.”
    -  Eliyahu Goldratt

  6. Confidential + Proprietary
    Companies are Distributed Systems
    (Most of the complexity comes from the humans, not the machines)

  7. Confidential + Proprietary
    ●  Semi-autonomous units of execution with inconsistent outputs and opaque system internals
    ●  Buggy biological microservices

  8. Confidential + Proprietary
    “Errare humanum est, sed perseverare diabolicum.”
    -  Seneca

  9. Confidential + Proprietary
    The Wheel of Staycation: Rules
    1.  Once a week, pick a random person on your team. The lucky person gets a “staycation”.
    2.  They stay at work, but:
        a.  Cannot answer any work questions
        b.  Cannot have any work conversations
        c.  Should set an OOO message for email, etc.
    3.  Totally cool to still be social / have lunch / etc.; just don’t talk about work.
    4.  Have a 3rd-party “proctor” who can decide if/when you need to break glass and end the experiment.
  10. Confidential + Proprietary
    The Wheel of Staycation: Goals
    1.  How much impact did the sudden absence have?
        a.  Could the team manage?
        b.  What bits of tribal knowledge were unexpectedly lost?
    2.  Once a month, review all the staycation tests to look for SPOF patterns.
    3.  The team should notice the absence, but be able to work around it effectively.
    4.  If you need to break glass, then you have a SPOF and you need to fix that.
    If the team feels no impact, then it might be time for the person to find a new project/team.

  11. Confidential + Proprietary
    Tortoise Time: Rules
    1.  Select 20% of the team at random.
    2.  For one work week (5 days), they cannot answer any work email in less than 1 hour.
    3.  The proctor decides if/when you need to break glass.
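
Selecting 20% of the team at random, as the first rule asks, might look like the sketch below. The function name `pick_tortoises` is hypothetical, and the rounding behavior (round to the nearest whole person, with at least one person for a non-empty team) is an assumption, since the slide does not say what to do when 20% is not a whole number.

```python
import random


def pick_tortoises(team, fraction=0.2, rng=None):
    """Return a random subset of the team, sized at roughly `fraction` of it."""
    members = list(team)
    if not members:
        return []
    rng = rng or random.Random()
    # Assumption: round to the nearest whole person, never fewer than one.
    count = max(1, round(len(members) * fraction))
    return rng.sample(members, count)


# Hypothetical ten-person team; 20% of ten is two people.
team = [f"person-{i}" for i in range(10)]
slow_responders = pick_tortoises(team)
print(slow_responders)
```

`random.sample` draws without replacement, so nobody can be selected twice in the same week.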

  12. Confidential + Proprietary
    Tortoise Time: Goals
    1.  How long did the team manage before the latency became unbearable? (Hint: probably not more than 2 days.)
    2.  How quickly did the senders fall back to “alternate” sources -- including thin air?
    3.  The goal is to expose hidden layers of your business that are particularly latency sensitive.

  13. Confidential + Proprietary
    Liar Liar!: Rules
    1.  Once a month, pick 1-2 people at random.
    2.  For one work day they will give wrong answers.
        a.  The proctor picks the % per person.
        b.  Answers must be incorrect but plausible.
        c.  Keep a list of wrong answers and correct them the next work day.
    3.  Each email for that day begins with a disclaimer:
        “Today, I am the Designated Liar and have been randomly selected to be buggy. If you ask me a question today, some of my answers will be intentionally incorrect. Can you tell which ones?”
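
The bookkeeping in rule 2 (a proctor-chosen percentage of wrong answers, plus a list of lies to correct the next day) can be illustrated with a small sketch. The class name `DesignatedLiar`, its fields, and the sample question are all hypothetical; the slides describe a human exercise, not a tool.

```python
import random
from dataclasses import dataclass, field


@dataclass
class DesignatedLiar:
    """Tracks which answers were deliberately wrong so they can be corrected."""

    lie_rate: float  # fraction of answers to falsify, chosen by the proctor
    rng: random.Random = field(default_factory=random.Random)
    corrections: list = field(default_factory=list)

    def answer(self, question, truth, plausible_lie):
        """Return the plausible lie with probability `lie_rate`, else the truth."""
        if self.rng.random() < self.lie_rate:
            # Record the lie so it can be corrected the next work day.
            self.corrections.append((question, plausible_lie, truth))
            return plausible_lie
        return truth


# Hypothetical usage: the proctor picked 30% for this person.
liar = DesignatedLiar(lie_rate=0.3)
reply = liar.answer("Which port does the service use?", "8443", "8080")
print(reply, liar.corrections)
```

At the end of the day, `corrections` holds every question, the wrong answer given, and the right answer, which is the list rule 2c asks the liar to keep.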

  14. Confidential + Proprietary
    Liar Liar!: Goals
    1.  This is a fuzz testing exercise.
    2.  Are recipients able to discern correct / incorrect answers? Could they have?
        a.  If not, then you’ve found an information SPOF and need to fix that.
        b.  If so, were your answers plausible enough?
    3.  The goal is to test the principle of Nullius in Verba. (The motto of the Royal Society since 1660. It means “not any in words”, i.e., take nobody’s word for it.)

  15. Confidential + Proprietary
    War of the Worlds: Rules
    1.  1938 radio adaptation of the H.G. Wells story. Caused a minor mass panic when people thought it was real!
    2.  Simulate the most existentially threatening event you can think of for your company.
        a.  Massive security breach
        b.  Regulatory failure
        c.  Major customer meltdown
    3.  Only the bare minimum # of people can know it’s a simulation.
        a.  CEO
        b.  Head of PR
        c.  Legal
        d.  Proctor

  16. Confidential + Proprietary
    War of the Worlds: Goals
    1.  Will people “do the right thing” in the face of an existential threat?
    2.  Do people panic?
    3.  Does it leak to Twitter / press?
    4.  The goal is to make sure that the company can react calmly and ethically to the worst possible news.

  17. Confidential + Proprietary
    Buy-In != All-in

  18. Confidential + Proprietary
    X-Func #FTW

  19. Confidential + Proprietary
    You Can Do This

  20. Confidential + Proprietary
    In Conclusion
    @drensin
    [email protected]
