The Case for Chaos: Thinking About Failure Holistically

Pat Higgins

May 15, 2018

Transcript

  1. THE CASE FOR CHAOS: THINKING ABOUT FAILURE HOLISTICALLY
  2. ~ WHOAMI
     Patrick Higgins
     @higgyCodes
     UI Engineer @ Gremlin
  3. ~ WHOAMI
     • From Sydney, Australia
     • Former Salt Lake Citizen
     • Lives in San Francisco
  4. OUTLINE
     • Chaos Engineering
     • GameDays
     • Holistic Failure Mitigation
  5. CHAOS ENGINEERING
     • “Thoughtful, planned experiments designed to reveal the weaknesses in our system” - Kolton Andrus
     • Like a vaccine, we inject harm into our system to help build immunity.
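As a rough illustration of "injecting harm", the sketch below wraps a dependency call so that a chosen fraction of invocations fail on purpose. It is not taken from the talk or from any Gremlin tooling: `with_fault_injection`, `InjectedFault`, `fetch_profile`, and `run_experiment` are all hypothetical names, and a real experiment would target actual infrastructure rather than an in-process function.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised in place of a real dependency failure during an experiment."""


def with_fault_injection(
    call: Callable[[], T],
    failure_rate: float,
    rng: random.Random,
) -> Callable[[], T]:
    """Wrap `call` so that roughly `failure_rate` of invocations fail on purpose."""

    def wrapped() -> T:
        # rng.random() is in [0.0, 1.0): a rate of 0.0 never fails, 1.0 always fails.
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency failure")
        return call()

    return wrapped


def fetch_profile() -> dict:
    # Hypothetical stand-in for a real downstream call.
    return {"name": "example"}


def run_experiment(trials: int, failure_rate: float, seed: int = 0) -> int:
    """Return how many of `trials` calls failed under injection."""
    flaky = with_fault_injection(fetch_profile, failure_rate, random.Random(seed))
    failures = 0
    for _ in range(trials):
        try:
            flaky()
        except InjectedFault:
            failures += 1
    return failures
```

Keeping the failure rate and the random source as explicit parameters makes the experiment planned and repeatable, which is the point of the quote above: the harm is deliberate and bounded, not random breakage.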
  6. CHAOS IS A PRACTICE
  7. WHY CHAOS ENGINEER?
     • The motivations are different depending on role:
       • Business case - avoiding costly downtime
       • On-call case - avoiding 3am pages
       • Engineering - service availability
  8. “The lesson we should learn and remember is that sooner or later, all complex systems will fail.”
     – MATHIAS LAFELDT
  9. DOWNTIME IS COSTLY
     • Prevents sales
     • Affects customer trust
     • Contributes to engineer burnout
  10. PREREQUISITES FOR CHAOS
     • Have a High Severity Incident Management (SEV) Program
     • Have sufficient monitoring to observe effects
     • Have alerts and paging that notify a human during a SEV
  11. CHAOS ENGINEERING LIFECYCLE
  12. WORD OF WARNING
     • Never run a chaos experiment (in production) if you know it will cause severe damage.
  13. CHAOS MITIGATION IS MULTIFACETED
     • People get better at mitigating failure.
     • Product is engineered with failure in mind.
  14. WHAT IS A GAMEDAY?
     Dedicated time for teams to collaboratively focus on using Chaos Engineering practices to reveal weaknesses in your services
     – HO MING LI
  15. WHO SHOULD PARTICIPATE?
  16. WHY EVERYBODY?
     • Everybody benefits from observing failure
     • Encourages cross-organization collaboration
     • Find your champions across the company
     • Encourages varied perspectives
  17. THINGS TO REMEMBER
  18. MY FIRST GAMEDAY
     • Gremlin holds Failure Fridays
     • Degradation of my features in the UI was less than desirable
     • Mapped out the critical failures, dropped tickets into tech debt, dealt with the tickets gradually as time allowed.
  19. HOLISTIC FAILURE MITIGATION
  20. CHAOS ENGINEERING AND UI
     • Graceful Degradation in UI implementation
     • Critical User Paths
     • Auxiliary Paths
     • Sometimes the two are mixed
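One way to picture the split between critical and auxiliary paths is a page made of sections, each marked as one or the other. The sketch below is an assumption-laden illustration, not the speaker's implementation: `Section`, `render_page`, `PageUnavailable`, and the "checkout" and "recommendations" sections are hypothetical. A failing auxiliary section degrades to a placeholder, while a failing critical section fails the whole page.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Section:
    name: str
    load: Callable[[], str]
    critical: bool
    fallback: Optional[str] = None  # placeholder shown when an auxiliary section fails


class PageUnavailable(Exception):
    """Raised when a critical section cannot be rendered."""


def render_page(sections: List[Section]) -> List[str]:
    rendered = []
    for section in sections:
        try:
            rendered.append(section.load())
        except Exception as err:
            if section.critical:
                # A critical path has no acceptable degraded form: surface the failure.
                raise PageUnavailable(section.name) from err
            # An auxiliary path degrades to a placeholder instead of breaking the page.
            rendered.append(section.fallback or "")
    return rendered


def broken() -> str:
    # Simulates an auxiliary dependency being down during a GameDay.
    raise ConnectionError("recommendations service is down")


page = [
    Section("checkout", lambda: "checkout form", critical=True),
    Section("recommendations", broken, critical=False,
            fallback="(recommendations unavailable)"),
]
```

Calling `render_page(page)` returns the checkout form plus the placeholder text. The "sometimes the two are mixed" case is where this simple flag stops being enough, since a single section may then contain both kinds of content.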
  21. CHAOS ENGINEERING AND UI
     • End-to-End testing of failure scenarios is not enough.
     • OSS Developer tooling around failure mitigation in UI is underdeveloped.
     • Tooling is regularly company specific.
  22. CHAOS ENGINEERING AND PRODUCT
     • Mapping out potential alternative states (reroute, retry)
     • Product specs that include comprehensive failure scenarios are rare
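The "retry" and "reroute" states can be made explicit in code, which also gives a product spec something concrete to enumerate. The following is a minimal sketch under stated assumptions: `call_with_alternatives` and `make_flaky` are hypothetical helpers, the attempt limit of three is arbitrary, and a real client would normally wait between retries.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def call_with_alternatives(
    primary: Callable[[], T],
    alternate: Callable[[], T],
    max_attempts: int = 3,
) -> Tuple[T, List[str]]:
    """Retry `primary`, then reroute to `alternate`.

    Returns the result together with the list of states that were visited.
    """
    states: List[str] = []
    for attempt in range(1, max_attempts + 1):
        try:
            result = primary()
            states.append(f"primary ok (attempt {attempt})")
            return result, states
        except Exception:
            states.append(f"primary failed (attempt {attempt})")
    # Retries exhausted: reroute. If the alternate also fails, the error propagates.
    states.append("rerouted")
    return alternate(), states


def make_flaky(failures_before_success: int) -> Callable[[], str]:
    """Build a hypothetical call that fails a fixed number of times, then succeeds."""
    remaining = [failures_before_success]

    def call() -> str:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TimeoutError("simulated timeout")
        return "live"

    return call
```

For example, `call_with_alternatives(make_flaky(1), lambda: "cached")` recovers on the second attempt, while `make_flaky(5)` exhausts the retries and falls through to the alternate. The returned state list is the kind of map of alternative states that the slide suggests writing down.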
  23. RESOURCES
     AWESOME CHAOS ENGINEERING
     dastergon/awesome-chaos-engineering
  24. RESOURCES
     GAME DAY RESOURCES
     gremlin.com/gameday
  25. GET INVOLVED
     CHAOS COMMUNITY SLACK
     gremlin.com/slack
  26. GET INVOLVED
     CHAOS CONF (SF)
     September 28th, 2018
     chaosconf.io
  27. THANKS!
     Patrick Higgins
     @higgyCodes
     UI Engineer @ Gremlin