
Micro-services resiliency patterns

Marc Aubé
February 25, 2021


Running a distributed application is HARD! You have to deal with unreliable remote services, network partitions, request timeouts, slow responses... the sources of possible failures are endless. There are patterns that you can apply in your service to save the day and make sure it's not your status page that goes red.


Transcript

  1. Our goals today are... 1. improve stability 2. reduce the

    blast radius 3. make the system self-healing 4. sleep better at night
  2. The first law of distributed systems is don't distribute your

    system² (² https://martinfowler.com/articles/distributed-objects-microservices.html)
  3. Service Level Agreements/Objectives (SLAs/SLOs)³

     Availability          Downtime (Year)    Downtime (Month)
     99%     (2 nines)     3.65 days          7.31 hours
     99.9%   (3 nines)     8.77 hours         43.83 minutes
     99.99%  (4 nines)     52.60 minutes      4.38 minutes
     99.999% (5 nines)     6.26 minutes       26.30 seconds

     ³ https://en.wikipedia.org/wiki/High_availability
  4. Timeouts ‣ connection timeout (establishing a connection) ‣ request timeout

    (receiving a response) ‣ value based on production data (e.g. p99)
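The two timeouts can be seen with a raw socket: the connection timeout covers establishing the connection, the read timeout covers waiting for a response. A minimal sketch (the function name and values are illustrative; as the slide says, derive real values from production data such as the p99 latency):

```python
import socket

def fetch_first_bytes(host, port, connect_timeout=3.0, read_timeout=1.0):
    """Open a TCP connection with separate connect and read timeouts.

    Returns the first bytes the server sends, or raises socket.timeout.
    The timeout values here are placeholders, not recommendations.
    """
    # connection timeout: how long to wait while establishing the connection
    sock = socket.create_connection((host, port), timeout=connect_timeout)
    try:
        # request timeout: how long to wait for the server to respond
        sock.settimeout(read_timeout)
        return sock.recv(1024)
    finally:
        sock.close()
```

Without explicit timeouts, a hung dependency can hold the calling thread indefinitely, which is how one slow service takes down its callers.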
  5. Source of failure ‣ dropped packet ‣ change in network

    topology ‣ partial failure (small %) ‣ transient failure (short time)
  6. Ask yourself... 1. Can I even retry the call? (side

    effects, etc) 2. Am I amplifying the problem? 3. How many times is reasonable? 4. How long am I prepared to try?
  7. ‣ make 5 connection attempts ‣ exponential wait time (powers

    of 2) ‣ 4 seconds ‣ 8 seconds ‣ 16 seconds ‣ 32 seconds ‣ will wait up to 60s ‣ repeated for each model in the collection
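The policy above can be sketched as a small retry helper (the function name and parameters are assumptions, not the deck's code; the `sleep` argument is injectable so the backoff schedule can be tested without actually waiting):

```python
import time

def retry_with_backoff(call, attempts=5, base_delay=4.0, max_delay=60.0,
                       retry_on=(ConnectionError,), sleep=time.sleep):
    """Retry `call` with exponential backoff (powers of 2), mirroring
    the slide's policy: 5 attempts with waits of 4s, 8s, 16s, 32s,
    each wait capped at 60s. Only exceptions in `retry_on` are retried,
    so non-transient errors surface immediately.
    """
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # 4 * 2**0 = 4s, then 8s, 16s, 32s, capped at max_delay
            delay = min(base_delay * (2 ** attempt), max_delay)
            sleep(delay)
```

Restricting `retry_on` to transient errors is one answer to "can I even retry the call?", and the cap plus fixed attempt count bounds how long you are prepared to try.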
  8. Circuit breaker in a nutshell ‣ wraps a remote call

    ‣ counts the number of failures ‣ trips when the failures reach a given threshold ‣ can retry the remote service after a grace period (half-open)
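Those mechanics can be sketched by hand in a few lines (a toy illustration; the class name, parameters, and state handling are assumptions, not the deck's code). The `clock` argument is injectable so the grace period can be tested without waiting:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker with the three states from the slide:
    closed (calls pass, failures are counted), open (calls are rejected
    immediately), and half-open (one trial call is allowed after the
    grace period).
    """
    def __init__(self, failure_threshold=10, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open: failing fast")
            # grace period elapsed: half-open, let one trial call through
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if (self._failures >= self.failure_threshold
                    or self._opened_at is not None):
                self._opened_at = self._clock()  # trip (or re-trip)
            raise
        self._failures = 0
        self._opened_at = None  # a success closes the circuit
        return result
```

Failing fast while open is the point: callers get an immediate error instead of queueing up behind a dependency that is already down.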
  9. from circuitbreaker import circuit

     import json
     import requests

     @circuit(
         # How many times it can fail
         failure_threshold=10,
         # How long the breaker stays open
         recovery_timeout=30,
         # Only catch this error
         expected_exception=ConnectionError,
     )
     def external_call():
         response = requests.get('https://unreliable-api.com/')
         return json.loads(response.content)
  10. Bulkheads can be implemented on the client-side too! ‣ partition

    service instances into different groups ‣ divide available connections between clients ‣ if one client overloads a service instance, others aren't impacted ‣ high-priority clients can be assigned more resources
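A client-side bulkhead can be sketched with a semaphore that caps concurrent calls per partition (the class name and API are assumptions; a real implementation would partition an actual connection pool rather than wrap calls):

```python
import threading

class Bulkhead:
    """Client-side bulkhead: caps how many concurrent calls one client
    may make to a dependency, so a single noisy caller cannot exhaust
    the shared resources that other callers depend on.
    """
    def __init__(self, max_concurrent):
        self._slots = threading.Semaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Reject immediately instead of queueing when the partition is full
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()

# As on the slide: a high-priority client can be assigned more resources
reporting_bulkhead = Bulkhead(max_concurrent=2)
checkout_bulkhead = Bulkhead(max_concurrent=8)
```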
  11. Bulkheads also work for data ‣ partitions / shards ‣

    separate tenant schema ‣ separate database
  12. Implement multiple throttling strategies ‣ burst rate limit⁶ ‣

     sustained rate limit (⁶ sliding window algorithm)
  13. Implement multiple throttling strategies ‣ burst rate limit⁶ ‣

     sustained rate limit ‣ base those limits on real usage data (⁶ sliding window algorithm)
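Both limits can be sketched with the sliding window algorithm the footnote mentions: keep the timestamps of recent requests per window and reject once either window is full (the class name and limit values are illustrative; as the slide says, base real limits on usage data). The `clock` argument is injectable for testing:

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Sliding-window rate limiter combining a burst limit (short
    window) and a sustained limit (long window).
    """
    def __init__(self, burst_limit=10, burst_window=1.0,
                 sustained_limit=100, sustained_window=60.0,
                 clock=time.monotonic):
        self._rules = [
            (burst_limit, burst_window, deque()),
            (sustained_limit, sustained_window, deque()),
        ]
        self._clock = clock

    def allow(self):
        now = self._clock()
        # drop timestamps that have slid out of each window
        for _limit, window, hits in self._rules:
            while hits and now - hits[0] >= window:
                hits.popleft()
        # reject if either the burst or the sustained window is full
        if any(len(hits) >= limit for limit, _window, hits in self._rules):
            return False
        for _limit, _window, hits in self._rules:
            hits.append(now)
        return True
```

The burst window stops short spikes; the sustained window stops a client that stays just under the burst limit for minutes at a time.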
  14. 1. Collect data 2. Identify KPIs ‣ what moves when

    your service is about to fail? ‣ response time / execution time ‣ CPU usage ‣ memory usage ‣ queue backlog ‣ number of connections 3. Configure monitoring and alerting
  15. Resources ‣ The Amazon Builders' Library (free!) ‣ Azure's list

    of Cloud Design Patterns (free!) ‣ Release It!, an excellent book by Michael T. Nygard ‣ The series of SRE books by Google (free!)