
The Case for Chaos: Thinking About Failure Holistically

Pat Higgins

May 15, 2018

Transcript

  1. THE CASE FOR CHAOS: THINKING ABOUT FAILURE HOLISTICALLY
  2. None
  3. ~ WHO AM I
    Patrick Higgins
    @higgyCodes
    UI Engineer @ Gremlin
  4. ~ WHO AM I
    • From Sydney, Australia
    • Former Salt Lake Citizen
    • Lives in San Francisco
  5. OUTLINE
    • Chaos Engineering
    • GameDays
    • Holistic Failure Mitigation
  6. CHAOS ENGINEERING
    • “Thoughtful, planned experiments designed to reveal the weaknesses in our system” - Kolton Andrus
    • Like a vaccine, we inject harm into our system to help build immunity.
  7. CHAOS IS A PRACTICE
  8. WHY CHAOS ENGINEER?
    • The motivations are different depending on role:
    • Business case - avoiding costly downtime
    • On call case - avoiding 3am pages
    • Engineering - service availability
  9. “The lesson we should learn and remember is that sooner or later, all complex systems will fail.”
    – MATHIAS LAFELDT
  10. DOWNTIME IS COSTLY
    • Prevents sales
    • Affects customer trust
    • Contributes to engineer burnout
  11. PREREQUISITES FOR CHAOS
    • Have a High Severity Incident Management (SEV) Program
    • Have sufficient monitoring to observe effects
    • Have alerts and paging that notify a human during a SEV
  12. CHAOS ENGINEERING LIFECYCLE
  13. WORD OF WARNING
    • Never run a chaos experiment (in production) if you know it will cause severe damage.
  14. CHAOS MITIGATION IS MULTIFACETED
    • People get better at mitigating failure.
    • Product is engineered with failure in mind.
  15. GAMEDAYS

  16. WHAT IS A GAMEDAY?
    “Dedicated time for teams to collaboratively focus on using Chaos Engineering practices to reveal weaknesses in your services”
    – HO MING LI
  17. WHO SHOULD PARTICIPATE?
  18. WHY EVERYBODY?
    • Everybody benefits from observing failure
    • Encourages cross-organization collaboration
    • Find your champions across the company
    • Encourages varied perspectives
  19. THINGS TO REMEMBER
  20. MY FIRST GAMEDAY
    • Gremlin holds Failure Fridays
    • Degradation of my features in the UI was less than desirable
    • Mapped out the critical failures, dropped tickets into tech debt, dealt with the tickets gradually as time allowed.
  21. HOLISTIC FAILURE MITIGATION
  22. CHAOS ENGINEERING AND UI
    • Graceful Degradation in UI implementation
    • Critical User Paths
    • Auxiliary Paths
    • Sometimes the two are mixed
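One way to picture the split between critical and auxiliary paths is a page renderer that treats the two differently when a loader fails. The following is a minimal sketch, not anything shown in the talk: the `Section` type, `render_page`, the section names, and the placeholder wording are all hypothetical, and Python stands in for whatever language the UI is actually written in.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Section:
    """One region of a page and the function that loads its content (hypothetical)."""
    name: str
    load: Callable[[], str]
    critical: bool  # True for a critical user path, False for an auxiliary one


class CriticalPathError(Exception):
    """Raised when a section the user cannot do without fails to load."""


def render_page(sections: List[Section]) -> Dict[str, str]:
    """Load every section, degrading auxiliary failures to a placeholder."""
    rendered: Dict[str, str] = {}
    for section in sections:
        try:
            rendered[section.name] = section.load()
        except Exception as exc:
            if section.critical:
                # A critical path failed: surface the error instead of hiding it.
                raise CriticalPathError(section.name) from exc
            # An auxiliary path failed: keep the page usable with a placeholder.
            rendered[section.name] = f"[{section.name} is temporarily unavailable]"
    return rendered


def failing_recommendations() -> str:
    # Simulates an auxiliary dependency that is down.
    raise TimeoutError("recommendations service timed out")


page = render_page([
    Section("checkout", lambda: "Checkout form", critical=True),
    Section("recommendations", failing_recommendations, critical=False),
])
print(page)
```

The "sometimes the two are mixed" case is the hard one: a single section that carries both critical and auxiliary content would need to be split before a flag like `critical` can describe it.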
  23. None
  24. None
  25. None
  26. CHAOS ENGINEERING AND UI
    • End-to-End testing of failure scenarios is not enough.
    • OSS Developer tooling around failure mitigation in UI is underdeveloped.
    • Tooling is regularly company specific.
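As an illustration of what small, component-level failure tooling might look like, the sketch below wraps a data-fetching function so that chosen calls fail on purpose, then checks how a component copes. It is an assumption about one possible approach, not a tool referenced in the talk: `with_injected_faults`, `InjectedFault`, and `widget_text` are hypothetical names.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Marks a failure that the test harness injected on purpose."""


def with_injected_faults(fn: Callable[[], T], fail_on_calls: Iterable[int]) -> Callable[[], T]:
    """Wrap fn so that the listed call numbers (1-based) raise InjectedFault."""
    failing = set(fail_on_calls)
    calls = 0

    def wrapper() -> T:
        nonlocal calls
        calls += 1
        if calls in failing:
            raise InjectedFault(f"injected failure on call {calls}")
        return fn()

    return wrapper


def widget_text(fetch_count: Callable[[], int]) -> str:
    """Hypothetical UI component: shows a count, or a fallback if the fetch fails."""
    try:
        return f"{fetch_count()} items"
    except Exception:
        return "Count unavailable"


# Fail only the second call, then observe what the component renders each time.
flaky_fetch = with_injected_faults(lambda: 3, fail_on_calls=[2])
results = [widget_text(flaky_fetch) for _ in range(3)]
print(results)
```

Because the failing calls are chosen explicitly, the check is deterministic and can run alongside ordinary unit tests rather than only in an end-to-end suite.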
  27. CHAOS ENGINEERING AND PRODUCT
    • Mapping out potential alternative states (reroute, retry)
    • Product specs that include comprehensive failure scenarios are rare
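The two alternative states named on this slide, retry and reroute, can be combined in a single helper: try the primary source a few times, then fall back to another one. This is a minimal sketch under stated assumptions; `retry_then_reroute`, its parameters, and the "cached profile" fallback are hypothetical and not taken from the talk.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_then_reroute(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
) -> T:
    """Call primary up to `attempts` times, then reroute to fallback."""
    for attempt in range(attempts):
        try:
            return primary()
        except Exception:
            if attempt < attempts - 1:
                # Retry state: wait with exponential backoff before the next try.
                time.sleep(base_delay * (2 ** attempt))
    # Reroute state: every retry failed, so use the alternative source.
    return fallback()


calls = {"primary": 0}


def flaky_primary() -> str:
    # Simulates a primary endpoint that is down for the whole experiment.
    calls["primary"] += 1
    raise ConnectionError("primary endpoint unreachable")


result = retry_then_reroute(flaky_primary, lambda: "cached profile", attempts=3, base_delay=0.0)
print(result, calls["primary"])
```

A product spec that covers failure would state what the fallback is allowed to show (stale data, a reduced view, an error message) before a helper like this is written.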
  28. RESOURCES
    AWESOME CHAOS ENGINEERING
    dastergon/awesome-chaos-engineering
  29. RESOURCES
    GAME DAY RESOURCES
    gremlin.com/gameday
  30. GET INVOLVED
    CHAOS COMMUNITY SLACK
    gremlin.com/slack
  31. GET INVOLVED
    SLC CHAOS ENGINEERING MEETUP
    meetup.com/Salt-Lake-City-Chaos-Engineering-Community/
  32. GET INVOLVED
    CHAOS CONF (SF)
    September 28th, 2018
    chaosconf.io
  33. THANKS!
    Patrick Higgins
    @higgyCodes
    UI Engineer @ Gremlin