FORMING FAILURE HYPOTHESES

Chaos Conf
September 26, 2019

Subbu Allamaraju, Expedia

At Expedia, Subbu is leading a large-scale migration of Expedia’s travel platforms from enterprise data centers to a highly available architecture on the cloud. Before joining Expedia, as a Distinguished Engineer at eBay Inc., Subbu helped build private cloud infrastructure and platforms for eBay and PayPal.

Transcript

  1. “Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” From Principles of Chaos Engineering (https://principlesofchaos.org/). Photo by Jilbert Ebrahimi on Unsplash.
  2. Stable. Ok. Back to stable. Your assumed fault boundary. Actual fault boundary.
  3. Q: What is the system? How do you form a hypothesis? How do you ensure system safety? Why should anyone listen to you?
  4. Chaos engineering oud ption build nce? Randomly killing servers uncovers trivial issues only. Not everyone likes you to attack their apps.
  5. Chaos engineering: Randomly killing servers uncovers trivial issues only. Not everyone likes you to attack their apps. You can’t/won’t test more serious failures.
  6. Chaos engineering: Randomly killing servers uncovers trivial issues only. Not everyone likes you to attack their apps. You can’t/won’t test more serious failures. Self-doubt.
  7. Chaos engineering: Randomly killing servers uncovers trivial issues only. Not everyone likes you to attack their apps. You can’t/won’t test more serious failures. Self-doubt.
  8. Null hypothesis: Chaos engineering has nothing to do with a system’s capability to withstand turbulent conditions in production.
  9. How is the system behaving in production today? How do we make the system withstand turbulent conditions? Photo by Hush Naidoo on Unsplash.
  10. “As designed”: biased by your expectation of how the system is supposed to work. “As it is”: the real world. Metrics, alerts, logs, docs, diagrams, code, incidents.
  11. 3. We don’t understand where a failure stops. Contributing to cascading failures and long recovery.
  12. 1. Improve release safety through progressive delivery. 2. Ensure tighter fault domain boundaries in the “as designed” state. 3. Implement safety in the “as designed” state. 4. Only then pick what to test.
  13. These observations are relevant but not as much as the act of learning from incidents. Because the “as it is” state might tell you what to do.
  14. Randomly killing servers uncovers trivial issues only. Not everyone likes you to attack their apps. You can’t/won’t test more serious failures. Self-doubt. Learn from incidents.
  15. Randomly killing servers uncovers trivial issues only. Not everyone likes you to attack their apps. You can’t/won’t test more serious failures. Self-doubt. Learn from incidents. Make value-based decisions.
  16. 1. Developed mental models of how the system works when it does, and doesn’t when it doesn’t. 2. You’re not chasing symptoms but are beginning to understand the system as a whole.
  17. 3. You start to understand the role of people, processes and tools for success as well as failure. 4. You are able to articulate the value of hygiene investments.