Micro-services resiliency patterns

Marc Aubé
February 25, 2021

Running a distributed application is HARD! You have to deal with unreliable remote services, network partitions, request timeouts, slow responses... the sources of possible failures are endless. There are patterns that you can apply in your service to save the day and make sure it's not your status page that goes red.

Transcript

  1. None
  2. Our goals today are...

    1. improve stability
    2. reduce the blast radius
    3. make the system self-healing
    4. sleep better at night
  3. Just a tiny bit of background... [1]

    [1] https://mcfunley.com/choose-boring-technology

  4. None
  5. None
  6. None
  7. The first law of distributed systems is: don't distribute your system. [2]

    [2] https://martinfowler.com/articles/distributed-objects-microservices.html
  8. What is your cost of unavailability?

  9. Service Level Agreements/Objectives (SLAs/SLOs) [3]

    Availability        Downtime (Year)   Downtime (Month)
    99% (2 nines)       3.65 days         7.31 hours
    99.9% (3 nines)     8.77 hours        43.83 minutes
    99.99% (4 nines)    52.60 minutes     4.38 minutes
    99.999% (5 nines)   5.26 minutes      26.30 seconds

    [3] https://en.wikipedia.org/wiki/High_availability
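The downtime budgets in the availability table can be recomputed from the percentage alone. The sketch below assumes a 365.25-day year and a month of one twelfth of a year, which is the convention that reproduces the table's figures; the function name `allowed_downtime` is made up for this example.

```python
from datetime import timedelta

# Assumption: a year is 365.25 days and a month is one twelfth of a year.
YEAR = timedelta(days=365.25)
MONTH = YEAR / 12


def allowed_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Return the downtime budget for an availability target over a period."""
    return period * (1 - availability_pct / 100)


for pct in (99, 99.9, 99.99, 99.999):
    yearly = allowed_downtime(pct, YEAR)
    monthly = allowed_downtime(pct, MONTH)
    print(
        f"{pct}%: {yearly.total_seconds() / 60:.2f} min/year, "
        f"{monthly.total_seconds() / 60:.2f} min/month"
    )
```

Each extra nine divides the budget by ten, which is why the jump from three to five nines turns hours per year into minutes.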
  10. The normal state of a large distributed system may very well be "a little bit broken".
  11. The technical patterns

  12. 1. Fail Fast (Timeout)

  13. Set an explicit timeout on any remote call.

  14. Timeouts

    ‣ connection timeout (establishing a connection)
    ‣ request timeout (receiving a response)
    ‣ value based on production data (e.g. p99)
  15. If there's not a lot of latency variation, keep a healthy margin.
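One possible way to turn "p99 plus a healthy margin" into a number is sketched below. The latency samples, the 1.5x margin, the 100 ms floor and the 3.05 s connection timeout are all assumptions made for illustration, not values from the talk.

```python
import statistics


def suggest_timeout(latencies_s, margin=1.5, floor_s=0.1):
    """Suggest a request timeout (seconds) from observed latencies."""
    # quantiles(n=100) returns 99 cut points; the last one is the 99th percentile.
    p99 = statistics.quantiles(latencies_s, n=100, method="inclusive")[-1]
    # Keep a margin above p99, and never go below a small floor.
    return max(floor_s, p99 * margin)


# Hypothetical production sample: most calls take 200 ms, a few take about 1 s.
samples = [0.2] * 95 + [0.9, 1.0, 1.1, 1.2, 1.3]
read_timeout = suggest_timeout(samples)
connect_timeout = 3.05  # assumed value; establishing a connection should be quick
print(f"timeout=({connect_timeout}, {read_timeout:.2f})")
```

With an HTTP client such as requests, the two values could then be passed as `timeout=(connect_timeout, read_timeout)`, which covers both timeouts listed on the slide.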
  16. What's your fallback?
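A fallback can be as simple as a second callable that is used when the primary call times out. This is a minimal sketch; `call_with_fallback`, `fetch_live_price` and the cached value are hypothetical names invented for the example.

```python
def call_with_fallback(primary, fallback, handled=(TimeoutError, ConnectionError)):
    """Return primary(), or fallback() if primary fails with a handled error."""
    try:
        return primary()
    except handled:
        # Only the listed errors trigger the fallback; anything else propagates.
        return fallback()


def fetch_live_price():
    # Stand-in for a remote call that hits its timeout.
    raise TimeoutError("pricing service did not answer in time")


def last_known_price():
    # Stand-in for a cached or default value.
    return 42.0


print(call_with_fallback(fetch_live_price, last_known_price))
```

Whether a stale value, a default, or an explicit error is the right fallback depends on what the caller can tolerate.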

  17. 2. Retry

  18. Sources of failure

    ‣ dropped packet
    ‣ change in network topology
    ‣ partial failure (small %)
    ‣ transient failure (short time)
  19. Ask yourself...

    1. Can I even retry the call? (side effects, etc.)
    2. Am I amplifying the problem?
    3. How many times is reasonable?
    4. How long am I prepared to try?
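  20. import random
      from tenacity import retry, stop_after_attempt

      @retry(stop=stop_after_attempt(5))  # give up after the fifth attempt
      def never_gonna_give_you_up():
          # Fail about half of the time to simulate a flaky remote call.
          if random.choice([True, False]):
              raise Exception
          return 42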
  22. The multiplicative effect of retries
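The multiplication is easy to underestimate, so here is the arithmetic as a small sketch. The three-layer call chain and the five attempts per layer are assumptions chosen for the example.

```python
import math
from typing import Sequence


def worst_case_calls(attempts_per_layer: Sequence[int]) -> int:
    """Worst-case calls reaching the deepest service for one user request."""
    # Every attempt at one layer can trigger the full set of attempts below it.
    return math.prod(attempts_per_layer)


# Hypothetical chain: gateway -> service A -> service B, 5 attempts at each hop.
print(worst_case_calls([5, 5, 5]))
```

In that chain a single user request can turn into 5 x 5 x 5 = 125 calls against the service that is already struggling.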

  23. Not all failures are worth retrying
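As a rough illustration, the standard-library sketch below retries only errors that are likely to be transient and lets everything else fail immediately. Which exceptions count as retryable is an assumption here; tenacity offers the same idea through `retry_if_exception_type`.

```python
import time

# Assumption: only these error types are treated as transient.
RETRYABLE = (TimeoutError, ConnectionError)


def call_with_retry(func, attempts=3, wait_s=0.5, sleep=time.sleep):
    """Call func, retrying only when it raises one of the RETRYABLE errors."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except RETRYABLE:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(wait_s)
        # Any other exception (for example a ValueError caused by a bad
        # request) is not caught, so it propagates without being retried.
```

A rejected payload or an authentication failure will fail the same way on every attempt, so retrying it only adds load.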

  24. A note on background tasks and retries...

  25. ‣ make 5 connection attempts
    ‣ exponential wait time (powers of 2): 4 seconds, 8 seconds, 16 seconds, 32 seconds
    ‣ will wait up to 60s
    ‣ repeated for each model in the collection
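The numbers on this slide can be added up with a short sketch. The function name, the 60-second cap per wait and the collection size of 100 are assumptions for the example; the slide does not say whether "up to 60s" refers to a per-wait cap or to the total.

```python
def backoff_schedule(attempts=5, first_wait_s=4, factor=2, cap_s=60):
    """Return the waits (seconds) between consecutive attempts."""
    waits = []
    wait = first_wait_s
    # N attempts are separated by N - 1 waits.
    for _ in range(attempts - 1):
        waits.append(min(wait, cap_s))
        wait *= factor
    return waits


schedule = backoff_schedule()
per_item_s = sum(schedule)
models_in_collection = 100  # hypothetical size of the collection
print(schedule, per_item_s, per_item_s * models_in_collection)
```

Five attempts with waits of 4, 8, 16 and 32 seconds add up to 60 seconds for one item. Repeating that for every model in a collection is what makes a background task run for a very long time during an outage.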
  26. 3. Circuit Breaker [4]

    [4] http://martinfowler.com/bliki/CircuitBreaker.html

  27. Circuit breaker in a nutshell

    ‣ wraps a remote call
    ‣ counts the number of failures
    ‣ trips when the failures reach a given threshold
    ‣ can retry the remote service after a grace period (half-open)
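The four points above can be sketched with the standard library alone. This is a simplified, single-threaded illustration and not how any particular library implements the pattern; the class name, defaults and the injectable `clock` are assumptions made so the example is easy to follow and test.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the remote service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.recovery_timeout:
            return "half-open"  # grace period over: allow one trial call
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit is open, failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of waiting on a service that is known to be failing.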
  28. Circuit breaker states [4]

    [4] http://martinfowler.com/bliki/CircuitBreaker.html

  29. import json

      import requests
      from circuitbreaker import circuit

      @circuit  # uses the library's default settings
      def external_call():
          response = requests.get('https://unreliable-api.com/')
          return json.loads(response.content)
  31. import json

      import requests
      from circuitbreaker import circuit
      # requests raises its own ConnectionError, not the builtin one
      from requests.exceptions import ConnectionError

      @circuit(
          # How many times it can fail
          failure_threshold=10,
          # How long the breaker stays open (seconds)
          recovery_timeout=30,
          # Only catch this error
          expected_exception=ConnectionError,
      )
      def external_call():
          response = requests.get('https://unreliable-api.com/')
          return json.loads(response.content)
  32. When to use a circuit breaker?

  33. Circuit breakers can significantly increase time to recovery.

  34. 4. Bulkheads

  35. Bulkhead architecture [5]

    [5] https://docs.microsoft.com/en-us/azure/architecture/patterns/bulkhead

  37. Each compartment in a bulkhead is a complete, independent instance of the service.
  38. Bulkheads can be implemented on the client-side too!

    ‣ partition service instances into different groups
    ‣ divide available connections between clients
    ‣ if one client overloads a service instance, others aren't impacted
    ‣ high-priority clients can be assigned more resources
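One way to divide connections between clients is a separate semaphore per client, so that one client exhausting its share cannot starve the others. The sketch below is an illustration only; the `Bulkhead` class, the client names and the limits are hypothetical.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(Exception):
    """Raised when a client has used up its own share of connections."""


class Bulkhead:
    def __init__(self, limits):
        # One independent semaphore (compartment) per client.
        self._slots = {
            client: threading.BoundedSemaphore(limit)
            for client, limit in limits.items()
        }

    @contextmanager
    def slot(self, client):
        semaphore = self._slots[client]
        # Fail fast instead of queueing when the compartment is full.
        if not semaphore.acquire(blocking=False):
            raise BulkheadFullError(f"no free connection for {client!r}")
        try:
            yield
        finally:
            semaphore.release()


# Hypothetical split: the high-priority client gets a larger share.
bulkhead = Bulkhead({"checkout": 8, "reports": 2})
with bulkhead.slot("checkout"):
    pass  # the remote call would go here
```

If the "reports" client uses up its two slots, "checkout" still has all eight of its own.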
  39. Bulkheads also work for data

    ‣ partitions / shards
    ‣ separate tenant schema
    ‣ separate database
  40. Consider combining bulkheads with retries, circuit breakers and throttling.

  41. 5. Throttling

  44. Implement multiple throttling strategies

    ‣ burst rate limit [6]
    ‣ sustained rate limit
    ‣ base those limits on real usage data

    [6] Sliding window algorithm
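A burst limit and a sustained limit can be combined by checking two sliding windows of different lengths. The sketch below uses a sliding window log (one timestamp per accepted request), which is only one variant of the algorithm; the class name, the `allow` helper and the limits are assumptions, and real limits should come from usage data as the slide says.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    def __init__(self, limit, window_s, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self._hits = deque()  # timestamps of accepted requests

    def would_allow(self):
        now = self.clock()
        # Drop timestamps that have slid out of the window.
        while self._hits and now - self._hits[0] >= self.window_s:
            self._hits.popleft()
        return len(self._hits) < self.limit

    def record(self):
        self._hits.append(self.clock())


def allow(limiters):
    """Accept a request only if every limiter has room, then record it."""
    if all(limiter.would_allow() for limiter in limiters):
        for limiter in limiters:
            limiter.record()
        return True
    return False


# Hypothetical limits: 10 requests per second (burst), 100 per minute (sustained).
limits = [SlidingWindowLimiter(10, 1.0), SlidingWindowLimiter(100, 60.0)]
print(allow(limits))
```

Recording a request only after both checks pass keeps a rejected request from consuming quota in either window.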
  45. Build and test your throttling mechanism before you need it

    - Younger me
  46. 6. Set trip wires

  47. 1. Collect data
    2. Identify KPIs
      ‣ what moves when your service is about to fail?
      ‣ response time / execution time
      ‣ CPU usage
      ‣ memory usage
      ‣ queue backlog
      ‣ number of connections
    3. Configure monitoring and alerting
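In its simplest form, a trip wire is a threshold on one of those KPIs. The sketch below is only an illustration of the idea: the metric names and threshold values are hypothetical, and in practice this check would live in a monitoring and alerting tool rather than in application code.

```python
# Hypothetical thresholds; real values should come from the collected data.
THRESHOLDS = {
    "response_time_p99_s": 1.0,
    "cpu_usage_pct": 85.0,
    "memory_usage_pct": 90.0,
    "queue_backlog": 1000,
    "open_connections": 500,
}


def tripped_wires(metrics, thresholds=THRESHOLDS):
    """Return the names of the KPIs whose current value exceeds its threshold."""
    return sorted(
        name
        for name, limit in thresholds.items()
        if name in metrics and metrics[name] > limit
    )


# Hypothetical snapshot of current metrics.
snapshot = {"response_time_p99_s": 0.4, "cpu_usage_pct": 93.0, "queue_backlog": 2500}
for name in tripped_wires(snapshot):
    print(f"ALERT: {name}={snapshot[name]} exceeds {THRESHOLDS[name]}")
```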
  48. Learn the failure modes of the tools you use.

  49. Resources

    ‣ The Amazon Builder's Library (free!)
    ‣ Azure's list of Cloud Design Patterns (free!)
    ‣ Release It!, an excellent book by Michael T. Nygard
    ‣ The series of SRE books by Google (free!)
  50. Conclusion

  51. Slides

  52. Thank you!