FORMING FAILURE HYPOTHESES

Chaos Conf
September 26, 2019

Subbu Allamaraju, Expedia

At Expedia, Subbu is leading a large-scale migration of Expedia’s travel platforms from enterprise data centers to a highly available architecture on the cloud. Before joining Expedia, as a Distinguished Engineer at eBay Inc., Subbu helped build private cloud infrastructure and platforms for eBay and PayPal.


Transcript

  1. Forming Failure Hypothesis September 26, 2019

  2. Subbu Allamaraju (@sallamar), Expedia Group. See https://www.subbu.org for slides.

  3. “Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” From Principles of Chaos Engineering (https://principlesofchaos.org/). Photo by Jilbert Ebrahimi on Unsplash.
  4. [Diagram] Stable. Ok. Back to stable. Your assumed fault boundary. Actual fault boundary.
  5. Q: What is the system? How do you form a hypothesis? How do you ensure system safety? Why should anyone listen to you?
  6. Cloud adoption

  7. Cloud adoption. How to build resilience?

  8. Chaos engineering. Cloud adoption. How to build resilience?

  9. Chaos engineering. Cloud adoption. How to build resilience? Not everyone likes you to attack their apps.
  10. Chaos engineering. Cloud adoption. How to build resilience? Randomly killing servers uncovers trivial issues only. Not everyone likes you to attack their apps.
  11. Chaos engineering. Randomly killing servers uncovers trivial issues only. Not everyone likes you to attack their apps. You can’t/won’t test more serious failures.
  12. Chaos engineering. Randomly killing servers uncovers trivial issues only. Not everyone likes you to attack their apps. You can’t/won’t test more serious failures. Self-doubt.
  13. Chaos engineering. Randomly killing servers uncovers trivial issues only. Not everyone likes you to attack their apps. You can’t/won’t test more serious failures. Self-doubt.
  14. Null hypothesis: Chaos engineering has nothing to do with the system’s capability to withstand turbulent conditions in production.
  15. How is the system behaving in production today? How do we make the system withstand turbulent conditions? Photo by Hush Naidoo on Unsplash.
  16. “As designed” vs. “as it is”. “As designed”: biased by your expectation of how the system is supposed to work. “As it is”: the real world. Metrics, alerts, logs, docs, diagrams, code, incidents.
  17. Let’s observe the real world

  18. 1. Changes are contributing to the majority of impact
  19. 2. Second/higher order effects are hard to troubleshoot. [Diagram labels: Retired App, Big App, Tech Debt, Another Big App]
  20. 3. We don’t understand where a failure stops, contributing to cascading failures and long recovery
  21. What to do:
    1. Improve release safety through progressive delivery
    2. Ensure tighter fault domain boundaries in the “as designed” state
    3. Implement safety in the “as designed” state
    4. Only then pick what to test
  22. These observations are relevant but not as much as the act of learning from incidents. Because the “as it is” state might tell you what to do.
  23. Pick the most critical areas. But how to prioritize such work? Articulate value.
  24. Randomly killing servers uncovers trivial issues only. Not everyone likes you to attack their apps. You can’t/won’t test more serious failures. Self-doubt. Learn from incidents.
  25. Randomly killing servers uncovers trivial issues only. Not everyone likes you to attack their apps. You can’t/won’t test more serious failures. Self-doubt. Learn from incidents. Make value-based decisions.
  26. How to learn from incidents?

  27. How does it feel when you learn from incidents?

  28. How it feels:
    1. Developed mental models of how the system works when it does, and doesn’t when it doesn’t.
    2. You’re not chasing symptoms but are beginning to understand the system as a whole.
  29. How it feels (continued):
    3. You start to understand the role of people, processes and tools for success as well as failure.
    4. You are able to articulate the value of hygiene investments.
  30. Lessons learned

  31. 1. Learn from incidents. 2. There is no lesson 2.

  32. Thank you. @sallamar – https://www.subbu.org. Source: https://www.trover.com/li/vCu0/nZk3
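
Editor's sketch (not part of the talk): slide 21 recommends improving release safety through progressive delivery before picking what to test. The Python below illustrates one way that idea might look, assuming a release is exposed to growing percentages of traffic and stops at the first stage whose observed error rate exceeds a threshold. The names `progressive_rollout`, `StageResult`, `observe_error_rate`, and `max_error_rate` are hypothetical, as are the stage percentages; none of them come from the slides or from any particular deployment tool.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StageResult:
    """Outcome of one rollout stage (hypothetical structure for illustration)."""
    percent: int        # share of traffic exposed to the new release
    error_rate: float   # error rate observed at that exposure
    promoted: bool      # True if the stage passed the safety check


def progressive_rollout(
    stages: List[int],
    observe_error_rate: Callable[[int], float],
    max_error_rate: float,
) -> List[StageResult]:
    """Walk through traffic stages in order, halting at the first unsafe stage.

    `observe_error_rate` is an assumed callback that returns the error rate
    measured while `percent` of traffic runs the new release. In a real
    system this would query metrics; here it is supplied by the caller.
    """
    results: List[StageResult] = []
    for percent in stages:
        rate = observe_error_rate(percent)
        ok = rate <= max_error_rate
        results.append(StageResult(percent, rate, ok))
        if not ok:
            # Stop widening exposure; later stages are never attempted,
            # which keeps the impact of a bad change bounded.
            break
    return results
```

A caller would pass stages such as `[1, 5, 25, 100]` and a metrics-backed callback. The design choice illustrated is that each stage bounds how far a bad change can spread, which is one reading of the "tighter fault domain boundaries" the slide asks for.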