Increasing confidence in your system before the next outage

As our systems are growing more complex, we’ve come up with a standard set of ideas to help manage the dependencies between components. Ideas like timeouts, concurrency bounds, and circuit breakers let us control these interactions.
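
As a rough illustration of two of these ideas working together, here is a minimal Python sketch of a dependency call guarded by a concurrency bound and a timeout, with a fallback when either one kicks in. The class name `BoundedCaller` and its parameters are hypothetical and chosen for illustration; this is not the code used in the talk.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class BoundedCaller:
    """Hypothetical wrapper: one dependency, one concurrency bound, one timeout."""

    def __init__(self, max_concurrent, timeout_s):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent)
        self._timeout_s = timeout_s

    def call(self, fn, fallback):
        # Concurrency bound: if every slot is busy, do not queue; fall back now.
        if not self._slots.acquire(blocking=False):
            return fallback()
        future = self._pool.submit(self._run, fn)
        try:
            # Timeout: the caller waits at most timeout_s for the dependency.
            return future.result(timeout=self._timeout_s)
        except Exception:
            # Either the wait timed out or the dependency raised an error.
            return fallback()

    def _run(self, fn):
        try:
            return fn()
        finally:
            # The slot is freed only when the work really finishes, so a slow
            # call keeps counting against the bound after the caller gave up.
            self._slots.release()
```

With `max_concurrent=1`, a call that outlives its timeout keeps holding the only slot, so the next caller is rejected immediately and gets the fallback instead of waiting. That interaction between the two knobs is the kind of behavior that rarely shows up in steady state.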

However, these ideas are complex too. Each of them has a variety of knobs to tune, and they’re rarely exercised in steady-state. So there’s a lot of opportunity for them to behave poorly when called upon, and especially when a few of them kick in at once.

My talk will show how to use chaos engineering to increase confidence in your usage and tuning of timeouts, concurrency bounds, and circuit breakers. You can use these ideas to understand your system better and hopefully make future incidents more manageable.
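
To make the experiment loop concrete, the following Python sketch wraps a dependency call with an injected failure rate and added latency, then measures the observed error rate so it can be compared with a hypothesis. The names `with_chaos`, `run_experiment`, and `InjectedFailure` are assumptions made for this example. A real experiment would normally inject faults with a dedicated tool rather than in application code.

```python
import random
import time


class InjectedFailure(Exception):
    """Stands in for a real dependency error during an experiment."""


def with_chaos(fn, failure_rate=0.0, added_latency_s=0.0, rng=None, sleep=time.sleep):
    """Return a version of fn that sometimes fails and always waits added_latency_s."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if added_latency_s > 0:
            sleep(added_latency_s)  # latency injection
        if rng.random() < failure_rate:
            raise InjectedFailure("injected by chaos experiment")
        return fn(*args, **kwargs)

    return wrapped


def run_experiment(call, attempts):
    """Call the wrapped dependency `attempts` times and return the error rate."""
    errors = 0
    for _ in range(attempts):
        try:
            call()
        except InjectedFailure:
            errors += 1
    return errors / attempts


if __name__ == "__main__":
    flaky = with_chaos(lambda: ["item-1", "item-2"], failure_rate=0.1)
    print(f"observed error rate: {run_experiment(flaky, 1000):.3f}")
```

The value of the exercise is in stating the expected rate before running it and then checking whether the measured rate agrees.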

Matt Jacobs

October 08, 2019

Transcript

  17. recommendations-demo

  23. • Fail Recommendations Service at 10%
      • Expectation is that overall service will fail at 10%, since we haven’t done any work to remediate (yet)
  34. • Now fallback data from S3 is being served whenever a Users OR Recommendations failure is encountered
      • Fail Recommendations Service at 10%
      • Expectation is that customers get 90% personalized responses, 10% unpersonalized responses, and no errors
  41. • What’s the maximum we want the customer to wait?
      • How many resources can we spend per-request waiting?
  42. • Threads
      • File Descriptors
      • Locks held
      • Memory
      • CPU scheduler
  49. • Add 7500ms to 100% of Recommendations Service
      • Expectation is that resources will be spent waiting for response - some will succeed and some will fail
  57. • Add concurrency bounds to users and recommendations
      • Add 7500ms to 100% of Recommendations Service
      • Expectation is that resources will be spent waiting for response - but service will remain healthy.
      • No errors, but fallback logic should get hit
  63. • Explicit state transitions are a useful way to think about the problem.
      • Can fail even faster than timeouts
      • Manual triggering of circuit open/closed is really valuable for operators
      • Relieves pressure on downstreams
  64. • Lot of configuration to set and validate
      • If timeouts are already being used, that should be sufficient for user-facing latency
      • If concurrency-bounds are already being used, that should be sufficient for system stability
      • Can introduce many more failures than would happen naturally
      • Which failures open?
  68. • UI resiliency - what happens when certain endpoints fail or are slow?
      • Finding hidden dependencies
      • Testing your monitoring
      • Testing your incident response
      • Testing your autoscaling rules
      • Testing your region evacuation strategy
      • Onboarding new SREs
      • (many more)
  70. • Gremlin.com
      • https://github.com/dastergon/awesome-chaos-engineering
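
The circuit-breaker points in the transcript (explicit state transitions, failing faster than a timeout, and manual open/close for operators) can be sketched in Python as follows. The class, its thresholds, and the `force_open`/`force_closed` method names are hypothetical and not taken from the talk or from any particular library. The sketch is also not thread-safe.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # calls flow through normally
    OPEN = "open"            # calls fail fast without touching the dependency
    HALF_OPEN = "half_open"  # one trial call decides the next state


class CircuitOpenError(Exception):
    """Raised instead of calling the dependency while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self.state = State.CLOSED
        self._failures = 0
        self._opened_at = 0.0

    def force_open(self):
        """Manual trigger for operators: stop calling the dependency now."""
        self.state = State.OPEN
        self._opened_at = self._clock()

    def force_closed(self):
        """Manual trigger for operators: resume normal calls."""
        self.state = State.CLOSED
        self._failures = 0

    def call(self, fn):
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("failing fast; dependency not called")
            self.state = State.HALF_OPEN  # cool-down elapsed: allow a trial call
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if self.state is State.HALF_OPEN or self._failures >= self.failure_threshold:
                self.force_open()
            raise
        self.force_closed()  # any success resets the failure count
        return result
```

Even this small version has two knobs (`failure_threshold` and `reset_after_s`) plus a choice about which exceptions count as failures, which matches the configuration burden noted on slide 64.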