An Optimist's Guide to Chaos Engineering

Andy Fleener
February 06, 2017

Chaos Engineering: ya or nah? Let’s discuss how Chaos Engineering (CE) can fit into your organization. Failure injection is not just for unicorns, and you don’t need a Simian Army to build resilience into your systems. Failure injection can be as simple as purposefully creating opportunities to learn about the system.

Transcript

  1. @andyfleener ABOUT ME ▸ Senior Software Engineer ▸ Ruby Developer for 9 years ▸ In Operations for the last 4 years ▸ Complexity / System Safety Nerd
  2. REAL BUGS I’VE SEEN IN THE WILD ▸ Ruby’s Timeout library sometimes doesn’t actually time out ▸ A Node framework mishandles an invalid URL and crashes ▸ A Ruby app server flips its shit when a stream is closed early Creative Commons Image: Matteo X
  3. THESE BUGS WERE AWESOME ▸ I learned a shitload about how each of these parts of an application works ▸ We tracked down the problem and worked through solutions for each ▸ These are the kinds of challenges that get me excited to go to work Creative Commons Image: Neil Moralee
  4. SYSTEMS THINKING ▸ It can be very counterintuitive ▸ Success and Failure are not easily defined ▸ Emergent properties of a system are the feedback required to close the loop Creative Commons Image: Liam Ross
  5. THREE TRUTHS ABOUT COMPLEX SYSTEMS ▸ They are inherently unsafe ▸ Failure is a normal state of the system ▸ Failure can actually make a system stronger Creative Commons Image: Bjoern von Thuelen
  6. “CHAOS ENGINEERING IS THE DISCIPLINE OF EXPERIMENTING ON A DISTRIBUTED SYSTEM IN ORDER TO BUILD CONFIDENCE IN THE SYSTEM’S CAPABILITY TO WITHSTAND TURBULENT CONDITIONS IN PRODUCTION.” (PRINCIPLES OF CHAOS ENGINEERING)
  7. BASIC PRINCIPLES ▸ Build a Hypothesis around Steady State Behavior ▸ Vary Real-world Events ▸ Learn from the changes to the Steady State Creative Commons Image: Hamed Sabe
  8. BUILD A HYPOTHESIS AROUND STEADY STATE BEHAVIOR ▸ Attempt to understand as much of the system’s steady, normal running state as you can ▸ Network is the backbone of distributed systems and it is designed to withstand failure ▸ Measure outcomes of the system by the value it provides ▸ APM is generally a good way to understand Steady State Creative Commons Image: Seabamirum
  9. VARY REAL-WORLD EVENTS ▸ Variables should reflect real-world events ▸ Prioritize by both frequency and potential impact ▸ Systems that resist failure tend to fail catastrophically ▸ Rule #1: Never Trust the Network ▸ Chaos Variables can be any event capable of disrupting steady state Creative Commons Image: Leo Fung
  10. LEARN FROM THE CHANGES TO THE STEADY STATE ▸ How did the system react to the variable change? ▸ Sometimes it’s obvious ▸ Hopefully, more often than not, your system can tolerate the failure Creative Commons Image: Simon_sees
  11. WHY WOULD YOU DO THAT? ▸ It’s impossible to create a duplicate complex system ▸ Failure in a production-like environment won’t be exactly the same ▸ Fake it till you make it ▸ Do not run a Chaos Experiment on something you never tested ▸ Running experiments on production-like systems is better than not running them at all Creative Commons Image: Toms River Fire Dept
  12. USE TOOLS THAT EXIST ▸ Toxiproxy by Shopify ▸ Comcast by tylertreat ▸ Chaos Monkey by Netflix Creative Commons Image: Toms River Fire Dept
  13. DISTRIBUTED SYSTEMS NEED TO TALK TO EACH OTHER. SLOW THAT DOWN AND I BET YOU’LL SEE SOMETHING INTERESTING
  14. ADDING CAPACITY TO FAIL IS THE KEY TO CREATING NOT JUST RELIABLE SYSTEMS BUT RESILIENT ONES
  15. “ORGANIZATIONS WHICH DESIGN SYSTEMS ARE CONSTRAINED TO PRODUCE DESIGNS WHICH ARE COPIES OF THE COMMUNICATION STRUCTURES OF THESE ORGANIZATIONS” (CONWAY’S LAW)
  16. CHAOS ENGINEERING AN ORGANIZATION ▸ Resilient Teams create Resilient Systems ▸ Teams are like services ▸ Communication is the Network of Teams Creative Commons Image: Meenakshi Madhavan
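
The three basic principles from the slides (build a hypothesis around steady state, vary a real-world event, learn from the change) can be sketched as a tiny simulated experiment. This is only an illustration, not code from the talk: the function names, the 200 ms timeout, the 50 ms base latency, the 20 ms jitter, and the 99% success threshold are all assumptions, and the "network" is simulated with numbers rather than a real tool such as Toxiproxy.

```python
import random


def call_dependency(latency_ms, timeout_ms):
    """Hypothetical client call: it succeeds only if the reply beats the timeout."""
    return latency_ms <= timeout_ms


def run_traffic(rng, requests, base_latency_ms, injected_latency_ms, timeout_ms):
    """Simulate `requests` calls and return the fraction that succeeded."""
    ok = 0
    for _ in range(requests):
        # Assumed normal jitter of up to 20 ms, plus whatever latency we inject.
        latency = base_latency_ms + rng.uniform(0, 20) + injected_latency_ms
        if call_dependency(latency, timeout_ms):
            ok += 1
    return ok / requests


def run_experiment(injected_latency_ms, timeout_ms=200.0, threshold=0.99, seed=42):
    """Compare the steady state against the same traffic with added latency."""
    rng = random.Random(seed)
    # 1. Steady state: measure the outcome with no failure injected.
    steady = run_traffic(rng, 1000, 50.0, 0.0, timeout_ms)
    # 2. Vary a real-world event: slow the (simulated) network down.
    chaos = run_traffic(rng, 1000, 50.0, injected_latency_ms, timeout_ms)
    # 3. Learn: did the success rate stay above the hypothesized threshold?
    return {
        "steady_state": steady,
        "under_chaos": chaos,
        "hypothesis_held": chaos >= threshold,
    }


if __name__ == "__main__":
    for extra_ms in (100, 300):
        print(extra_ms, run_experiment(extra_ms))
```

With these assumed numbers, 100 ms of extra latency stays under the timeout and the hypothesis holds, while 300 ms pushes every call past it. In a real experiment the interesting part is the same comparison, made against metrics from an APM tool instead of a simulation.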