Increasing confidence in your system before the next outage

As our systems are growing more complex, we’ve come up with a standard set of ideas to help manage the dependencies between components. Ideas like timeouts, concurrency bounds, and circuit breakers let us control these interactions.

However, these ideas are complex too. Each of them has a variety of knobs to tune, and they’re rarely exercised in steady state. So there’s a lot of opportunity for them to behave poorly when called upon, especially when a few of them kick in at once.

My talk will show how to use chaos engineering to increase confidence in your usage and tuning of timeouts, concurrency bounds, and circuit breakers. You can use these ideas to understand your system better and hopefully make future incidents more manageable.

Matt Jacobs

October 08, 2019

Transcript

  4. • Fail Recommendations Service at 10%
     • Expectation is that overall service will fail at 10%, since we haven’t done any work to remediate (yet)
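
The experiment on this slide can be approximated in a few lines. The following is a minimal sketch, not the tooling used in the talk: `with_failure_injection`, `get_recommendations`, and `handle_request` are hypothetical names, and the fault is injected in-process rather than at the network layer. It shows that, with no remediation, a 10% downstream failure rate surfaces as roughly a 10% error rate for the overall service.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of a real downstream error during an experiment."""


def with_failure_injection(call, failure_rate, rng=None):
    """Wrap `call` so that roughly `failure_rate` of invocations raise."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure("injected failure")
        return call(*args, **kwargs)

    return wrapped


def get_recommendations(user_id):
    # Hypothetical stand-in for the real Recommendations Service client.
    return [f"item-for-{user_id}"]


def handle_request(user_id, recommendations):
    # No remediation yet: a downstream failure propagates to the caller.
    return {"user": user_id, "recommendations": recommendations(user_id)}


def measure_error_rate(handler, recommendations, n=10_000):
    """Send `n` requests and return the fraction that ended in an error."""
    errors = 0
    for user_id in range(n):
        try:
            handler(user_id, recommendations)
        except InjectedFailure:
            errors += 1
    return errors / n


# Fail the (simulated) Recommendations Service at 10%.
flaky = with_failure_injection(get_recommendations, 0.10, random.Random(42))
error_rate = measure_error_rate(handle_request, flaky)
print(f"overall error rate: {error_rate:.1%}")
```

Writing the expectation down before running the experiment, then comparing it with the measured rate, is what turns the fault injection into a test of your understanding of the system.
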
  6. • Now fallback data from S3 is being served whenever a Users OR Recommendations failure is encountered
     • Fail Recommendations Service at 10%
     • Expectation is that customers get 90% personalized responses, 10% unpersonalized responses, and no errors
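
A fallback of this shape might look like the sketch below. It is an illustration under assumptions: the fallback payload is a hard-coded list standing in for the data the talk serves from S3, and `make_flaky`, `get_user`, `get_recommendations`, and `handle_request` are hypothetical names. The point is that a failure in either dependency is converted into an unpersonalized response instead of an error.

```python
import random


class DownstreamError(Exception):
    """Any failure from the Users or Recommendations dependency."""


# Hypothetical unpersonalized payload; in the talk this data comes from S3.
FALLBACK_RECOMMENDATIONS = ["popular-1", "popular-2"]


def make_flaky(call, failure_rate, rng):
    """Wrap `call` so that roughly `failure_rate` of invocations fail."""
    def wrapped(*args):
        if rng.random() < failure_rate:
            raise DownstreamError("injected failure")
        return call(*args)
    return wrapped


def get_user(user_id):
    return {"id": user_id}


def get_recommendations(user):
    return [f"personalized-for-{user['id']}"]


def handle_request(user_id, users, recommendations):
    try:
        user = users(user_id)
        return {"personalized": True, "items": recommendations(user)}
    except DownstreamError:
        # A Users OR Recommendations failure lands here: serve fallback data.
        return {"personalized": False, "items": FALLBACK_RECOMMENDATIONS}


rng = random.Random(7)
flaky_recs = make_flaky(get_recommendations, 0.10, rng)
responses = [handle_request(i, get_user, flaky_recs) for i in range(10_000)]
personalized_share = sum(r["personalized"] for r in responses) / len(responses)
print(f"personalized: {personalized_share:.1%}, errors: 0")
```
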
  7. • What’s the maximum we want the customer to wait?
     • How many resources can we spend per-request waiting?
  9. • Add 7500ms of latency to 100% of Recommendations Service calls
     • Expectation is that resources will be spent waiting for a response - some requests will succeed and some will fail
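
To make the timeout questions concrete, here is a small sketch using a thread pool from the Python standard library. The durations are scaled down so it runs quickly (0.3 s of injected latency standing in for 7500 ms, and an assumed 0.05 s timeout); `slow_recommendations` and `call_with_timeout` are hypothetical names. The caller stops waiting at the timeout, but the worker thread stays occupied until the slow call returns, which is the "resources spent waiting" effect the slide describes.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def slow_recommendations(user_id, added_latency_s):
    # Simulates the Recommendations Service with injected latency.
    time.sleep(added_latency_s)
    return [f"item-for-{user_id}"]


def call_with_timeout(pool, fn, timeout_s, *args):
    """Run `fn` on the pool; give up waiting after `timeout_s` seconds."""
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The caller is released here, but the worker thread is still busy.
        return None


pool = ThreadPoolExecutor(max_workers=4)
start = time.monotonic()
result = call_with_timeout(pool, slow_recommendations, 0.05, 1, 0.3)
waited = time.monotonic() - start
print(f"result={result}, caller waited {waited:.2f}s")

# Shutting down waits for the abandoned call to finish on its worker thread.
pool.shutdown(wait=True)
```

The timeout answers the first question (how long the customer waits); the pool size is one crude answer to the second (how much can be spent waiting).
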
  10. • Add concurrency bounds to Users and Recommendations
      • Add 7500ms of latency to 100% of Recommendations Service calls
      • Expectation is that resources will be spent waiting for a response - but the service will remain healthy
      • No errors, but fallback logic should get hit
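
One common way to bound concurrency is a semaphore that rejects work instead of queueing it. The sketch below assumes that approach; the `Bulkhead` class, the limit of 2, and the scaled-down 0.2 s latency are illustrative choices, not values from the talk. When the dependency is slow, only a fixed number of callers wait on it and everyone else gets the fallback immediately.

```python
import threading
import time


class Bulkhead:
    """Allow at most `max_concurrent` in-flight calls; reject the rest."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, fallback, *args):
        # Non-blocking acquire: if no slot is free, fail fast to the fallback.
        if not self._slots.acquire(blocking=False):
            return fallback
        try:
            return fn(*args)
        finally:
            self._slots.release()


def slow_recommendations(user_id):
    time.sleep(0.2)  # injected latency, scaled down from 7500 ms
    return [f"personalized-for-{user_id}"]


FALLBACK = ["popular-1"]
bulkhead = Bulkhead(max_concurrent=2)
results = []
results_lock = threading.Lock()


def worker(user_id):
    value = bulkhead.call(slow_recommendations, FALLBACK, user_id)
    with results_lock:
        results.append(value)


threads = [threading.Thread(target=worker, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

fallback_count = sum(r == FALLBACK for r in results)
print(f"{fallback_count} of {len(results)} requests served from fallback")
```
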
  11. • Explicit state transitions are a useful way to think about the problem
      • Can fail even faster than timeouts
      • Manual triggering of circuit open/closed is really valuable for operators
      • Relieves pressure on downstreams
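
The state transitions can be written out explicitly. This is a minimal single-threaded sketch with assumed names and defaults (`CircuitBreaker`, `failure_threshold`, `reset_after_s`), not the implementation of any particular library. It has closed, open, and half-open states, skips the downstream call entirely while open, and exposes `force_open` and `force_closed` for operators.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold=3, reset_after_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self.state = self.CLOSED
        self._failures = 0
        self._opened_at = 0.0

    def force_open(self):
        # Used by the breaker itself and available to operators.
        self.state = self.OPEN
        self._opened_at = self._clock()

    def force_closed(self):
        self.state = self.CLOSED
        self._failures = 0

    def call(self, fn, *args):
        if self.state == self.OPEN:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Fail without touching the downstream at all.
                raise CircuitOpenError("circuit open; call skipped")
            self.state = self.HALF_OPEN  # let one trial call through
        try:
            result = fn(*args)
        except Exception:
            self._failures += 1
            if (self.state == self.HALF_OPEN
                    or self._failures >= self.failure_threshold):
                self.force_open()
            raise
        self.force_closed()  # any success resets the breaker
        return result


breaker = CircuitBreaker(failure_threshold=2, reset_after_s=10.0)
breaker.force_open()  # operator-triggered open
try:
    breaker.call(lambda: "never runs")
except CircuitOpenError as exc:
    print(f"state={breaker.state}: {exc}")
breaker.force_closed()
```

In this sketch a manually opened circuit still moves to half-open after `reset_after_s`; whether a manual override should be sticky is a design decision worth testing in its own experiment.
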
  12. • Lot of configuration to set and validate
      • If timeouts are already being used, that should be sufficient for user-facing latency
      • If concurrency bounds are already being used, that should be sufficient for system stability
      • Can introduce many more failures than would happen naturally
      • Which failures open the circuit?
  14. • UI resiliency - what happens when certain endpoints fail or are slow?
      • Finding hidden dependencies
      • Testing your monitoring
      • Testing your incident response
      • Testing your autoscaling rules
      • Testing your region evacuation strategy
      • Onboarding new SREs
      • (many more)