Increasing confidence in your system before the next outage

recommendations-demo

• Fail Recommendations Service at 10%
• Expectation is that overall service will fail at 10%, since we haven’t done any work to remediate (yet)
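
A minimal sketch of this first experiment, assuming an in-process wrapper stands in for the injected fault (a real experiment would typically inject the failure at the network or host level). The names `make_flaky`, `get_recommendations`, and `handle_request` are hypothetical and not from the demo. With no remediation, every dependency failure becomes a request failure, so the measured error rate should sit near the injected 10%.

```python
import random


class DependencyError(Exception):
    """Raised when the (simulated) Recommendations Service fails."""


def make_flaky(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls raise DependencyError."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise DependencyError("injected failure")
        return func(*args, **kwargs)
    return wrapper


def get_recommendations(user_id):
    # Hypothetical stand-in for the real Recommendations Service call.
    return [f"item-for-{user_id}"]


def handle_request(user_id, recommendations):
    # No remediation yet: a dependency failure propagates to the caller.
    return {"user": user_id, "recommendations": recommendations(user_id)}


def measure_error_rate(handler, recommendations, requests=10_000):
    """Send `requests` calls through handler and return the failed fraction."""
    errors = 0
    for user_id in range(requests):
        try:
            handler(user_id, recommendations)
        except DependencyError:
            errors += 1
    return errors / requests


# Seeded RNG so the run is repeatable.
flaky = make_flaky(get_recommendations, 0.10, random.Random(42))
error_rate = measure_error_rate(handle_request, flaky)
print(f"overall error rate: {error_rate:.1%}")
```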

• Now fallback data from S3 is being served whenever a Users OR Recommendations failure is encountered
• Fail Recommendations Service at 10%
• Expectation is that customers get 90% personalized responses, 10% unpersonalized responses, and no errors
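
The fallback behaviour described above could look roughly like the sketch below. It assumes the fallback payload has already been loaded from S3 into memory; `FALLBACK_ITEMS`, `fetch_user`, and `fetch_recommendations` are hypothetical names used only for illustration.

```python
class DependencyError(Exception):
    """Raised when the Users or Recommendations call fails."""


# Hypothetical stand-in for the unpersonalized data served from S3.
FALLBACK_ITEMS = ["popular-1", "popular-2", "popular-3"]


def recommend_with_fallback(user_id, fetch_user, fetch_recommendations):
    """Return personalized items, or the static fallback on any dependency failure."""
    try:
        user = fetch_user(user_id)
        items = fetch_recommendations(user)
        return {"personalized": True, "items": items}
    except DependencyError:
        # A failure in Users OR Recommendations degrades the response
        # instead of surfacing an error to the customer.
        return {"personalized": False, "items": list(FALLBACK_ITEMS)}


def failing_recommendations(user):
    # Simulates the injected Recommendations Service failure.
    raise DependencyError("recommendations unavailable")


healthy = recommend_with_fallback(1, lambda uid: {"id": uid}, lambda user: ["for-you"])
degraded = recommend_with_fallback(1, lambda uid: {"id": uid}, failing_recommendations)
```

The customer always gets a response; only the `personalized` flag changes, which is also a convenient thing to count when checking the 90% / 10% expectation.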

• What’s the maximum we want the customer to wait?
• How many resources can we spend per-request waiting?
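
Both questions come down to a timeout on the dependency call. A minimal sketch, assuming the call runs on a small thread pool and the caller gives up after a deadline; `call_with_timeout` and `slow_recommendations` are hypothetical names, and the 50ms deadline is chosen only to keep the example fast.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# The pool size is itself a bound on how many threads can be stuck waiting.
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, timeout_s, fallback):
    """Run func on the pool; return fallback if it takes longer than timeout_s."""
    future = _pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # cancel() only helps if the task has not started yet; a running
        # task keeps its worker thread until it finishes on its own.
        future.cancel()
        return fallback


def slow_recommendations():
    # Simulates a dependency with added latency.
    time.sleep(0.5)
    return ["personalized"]


start = time.monotonic()
result = call_with_timeout(slow_recommendations, 0.05, ["unpersonalized"])
elapsed = time.monotonic() - start
```

The timeout caps what the customer waits, but the worker thread is still occupied until the slow call returns, so the timeout alone does not cap the resources spent per request.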

• Threads
• File Descriptors
• Locks held
• Memory
• CPU scheduler

• Add 7500ms of latency to 100% of Recommendations Service requests
• Expectation is that resources will be spent waiting for response - some will succeed and some will fail

• Add concurrency bounds to users and recommendations
• Add 7500ms of latency to 100% of Recommendations Service requests
• Expectation is that resources will be spent waiting for response - but service will remain healthy
• No errors, but fallback logic should get hit
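
One way to express a concurrency bound is a semaphore that rejects work immediately once a fixed number of calls are in flight. This is a sketch under that assumption, not the demo's actual implementation; `Bulkhead` and `BulkheadFull` are hypothetical names.

```python
import threading


class BulkheadFull(Exception):
    """Raised when the concurrency limit for a dependency is reached."""


class Bulkhead:
    """Allow at most max_concurrent in-flight calls; reject the rest right away."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Non-blocking acquire: when every slot is taken we fail fast
        # rather than queueing another waiter behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("concurrency limit reached")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


def recommend(bulkhead, fetch, fallback):
    """Serve the fallback when the bulkhead rejects the call."""
    try:
        return bulkhead.call(fetch)
    except BulkheadFull:
        return fallback
```

When the dependency is slow, only `max_concurrent` callers wait on it; everyone else gets the fallback immediately, which is why the service can stay healthy while the fallback logic gets hit.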

• Explicit state transitions are a useful way to think about the problem
• Can fail even faster than timeouts
• Manual triggering of circuit open/closed is really valuable for operators
• Relieves pressure on downstreams
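
The state transitions and the manual open/closed controls can be sketched as a small state machine. This is an illustrative, single-threaded sketch and not any particular library's circuit breaker; the class name, the thresholds, and the `force_open` / `force_closed` methods are assumptions made for the example.

```python
import time


class CircuitOpen(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold=3, reset_after_s=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self.state = self.CLOSED
        self._failures = 0
        self._opened_at = 0.0
        self._forced = False

    def force_open(self):
        """Operator control: reject all calls until force_closed() is called."""
        self._forced = True
        self._trip()

    def force_closed(self):
        """Operator control: resume normal operation immediately."""
        self._forced = False
        self._reset()

    def _trip(self):
        self.state = self.OPEN
        self._opened_at = self._clock()

    def _reset(self):
        self.state = self.CLOSED
        self._failures = 0

    def call(self, func):
        if self.state == self.OPEN:
            waited = self._clock() - self._opened_at
            if self._forced or waited < self.reset_after_s:
                # No call is made at all, so this fails faster than a timeout.
                raise CircuitOpen("failing fast")
            self.state = self.HALF_OPEN  # let one trial call through
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self.state == self.HALF_OPEN or self._failures >= self.failure_threshold:
                self._trip()
            raise
        self._reset()
        return result
```

While the circuit is open the downstream service receives no traffic from this caller, which is where the pressure relief comes from.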

• Lot of configuration to set and validate
• If timeouts are already being used, that should be sufficient for user-facing latency
• If concurrency-bounds are already being used, that should be sufficient for system stability
• Can introduce many more failures than would happen naturally
◦ Which failures open?

• UI resiliency - what happens when certain endpoints fail or are slow?
• Finding hidden dependencies
• Testing your monitoring
• Testing your incident response
• Testing your autoscaling rules
• Testing your region evacuation strategy
• Onboarding new SREs
• (many more)

• Gremlin.com
• https://github.com/dastergon/awesome-chaos-engineering

Matt Jacobs
