
Micro-services resiliency patterns

Marc Aubé
February 25, 2021


Running a distributed application is HARD! You have to deal with unreliable remote services, network partitions, request timeouts, slow responses... the sources of possible failures are endless. There are patterns that you can apply in your service to save the day and make sure it's not your status page that goes red.


Transcript

  1. Our goals today are... 1. improve stability 2. reduce the

    blast radius 3. make the system self-healing 4. sleep better at night
  2. The first law of distributed systems is don't distribute your

    system² (² https://martinfowler.com/articles/distributed-objects-microservices.html)
  3. Service Level Agreements/Objectives (SLAs/SLOs)³

     Availability          Downtime (Year)    Downtime (Month)
     99%     (2 nines)     3.65 days          7.31 hours
     99.9%   (3 nines)     8.77 hours         43.83 minutes
     99.99%  (4 nines)     52.60 minutes      4.38 minutes
     99.999% (5 nines)     6.26 minutes       26.30 seconds

     ³ https://en.wikipedia.org/wiki/High_availability
  4. Timeouts ‣ connection timeout (establishing a connection) ‣ request timeout

    (receiving a response) ‣ value based on production data (e.g. p99)
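The two timeouts can be seen with a raw socket: the connection timeout covers establishing the connection, the read timeout covers waiting for a response. A minimal sketch (the function name and values are illustrative; as the slide says, derive real values from production data such as the p99 latency):

```python
import socket

def fetch_first_bytes(host, port, connect_timeout=3.0, read_timeout=1.0):
    """Open a TCP connection with separate connect and read timeouts.

    Returns the first bytes the server sends, or raises socket.timeout.
    The timeout values here are placeholders, not recommendations.
    """
    # connection timeout: how long to wait while establishing the connection
    sock = socket.create_connection((host, port), timeout=connect_timeout)
    try:
        # request timeout: how long to wait for the server to respond
        sock.settimeout(read_timeout)
        return sock.recv(1024)
    finally:
        sock.close()
```

Without explicit timeouts, a hung dependency can hold the calling thread indefinitely, which is how one slow service takes down its callers.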
  5. Source of failure ‣ dropped packet ‣ change in network

    topology ‣ partial failure (small %) ‣ transient failure (short time)
  6. Ask yourself... 1. Can I even retry the call? (side

    effects, etc) 2. Am I amplifying the problem? 3. How many times is reasonable? 4. How long am I prepared to try?
  7. ‣ make 5 connection attempts ‣ exponential wait time (powers

    of 2) ‣ 4 seconds ‣ 8 seconds ‣ 16 seconds ‣ 32 seconds ‣ will wait up to 60s ‣ repeated for each model in the collection
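The policy above can be sketched as a small retry helper (the function name and parameters are assumptions, not the deck's code; the `sleep` argument is injectable so the backoff schedule can be tested without actually waiting):

```python
import time

def retry_with_backoff(call, attempts=5, base_delay=4.0, max_delay=60.0,
                       retry_on=(ConnectionError,), sleep=time.sleep):
    """Retry `call` with exponential backoff (powers of 2), mirroring
    the slide's policy: 5 attempts with waits of 4s, 8s, 16s, 32s,
    each wait capped at 60s. Only exceptions in `retry_on` are retried,
    so non-transient errors surface immediately.
    """
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # 4 * 2**0 = 4s, then 8s, 16s, 32s, capped at max_delay
            delay = min(base_delay * (2 ** attempt), max_delay)
            sleep(delay)
```

Restricting `retry_on` to transient errors is one answer to "can I even retry the call?", and the cap plus fixed attempt count bounds how long you are prepared to try.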
  8. Circuit breaker in a nutshell ‣ wraps a remote call

    ‣ counts the number of failures ‣ trips when the failures reach a given threshold ‣ can retry the remote service after a grace period (half-open)
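Those mechanics can be sketched by hand in a few lines (a toy illustration; the class name, parameters, and state handling are assumptions, not the deck's code). The `clock` argument is injectable so the grace period can be tested without waiting:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker with the three states from the slide:
    closed (calls pass, failures are counted), open (calls are rejected
    immediately), and half-open (one trial call is allowed after the
    grace period).
    """
    def __init__(self, failure_threshold=10, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open: failing fast")
            # grace period elapsed: half-open, let one trial call through
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if (self._failures >= self.failure_threshold
                    or self._opened_at is not None):
                self._opened_at = self._clock()  # trip (or re-trip)
            raise
        self._failures = 0
        self._opened_at = None  # a success closes the circuit
        return result
```

Failing fast while open is the point: callers get an immediate error instead of queueing up behind a dependency that is already down.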
  9. from circuitbreaker import circuit

     import json
     import requests

     @circuit(
         # How many times it can fail
         failure_threshold=10,
         # How long the breaker stays open
         recovery_timeout=30,
         # Only catch this error
         expected_exception=ConnectionError,
     )
     def external_call():
         response = requests.get('https://unreliable-api.com/')
         return json.loads(response.content)
  10. Bulkheads can be implemented on the client-side too! ‣ partition

    service instances into different groups ‣ divide available connections between clients ‣ if one client overloads a service instance, others aren't impacted ‣ high-priority clients can be assigned more resources
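A client-side bulkhead can be sketched with a semaphore that caps concurrent calls per partition (the class name and API are assumptions; a real implementation would partition an actual connection pool rather than wrap calls):

```python
import threading

class Bulkhead:
    """Client-side bulkhead: caps how many concurrent calls one client
    may make to a dependency, so a single noisy caller cannot exhaust
    the shared resources that other callers depend on.
    """
    def __init__(self, max_concurrent):
        self._slots = threading.Semaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Reject immediately instead of queueing when the partition is full
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()

# As on the slide: a high-priority client can be assigned more resources
reporting_bulkhead = Bulkhead(max_concurrent=2)
checkout_bulkhead = Bulkhead(max_concurrent=8)
```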
  11. Bulkheads also work for data ‣ partitions / shards ‣

    separate tenant schema ‣ separate database
  12. Implement multiple throttling strategies ‣ burst rate limit⁶ ‣

     sustained rate limit (⁶ sliding window algorithm)
  13. Implement multiple throttling strategies ‣ burst rate limit⁶ ‣

     sustained rate limit ‣ base those limits on real usage data (⁶ sliding window algorithm)
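Both limits can be sketched with the sliding window algorithm the footnote mentions: keep the timestamps of recent requests per window and reject once either window is full (the class name and limit values are illustrative; as the slide says, base real limits on usage data). The `clock` argument is injectable for testing:

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Sliding-window rate limiter combining a burst limit (short
    window) and a sustained limit (long window).
    """
    def __init__(self, burst_limit=10, burst_window=1.0,
                 sustained_limit=100, sustained_window=60.0,
                 clock=time.monotonic):
        self._rules = [
            (burst_limit, burst_window, deque()),
            (sustained_limit, sustained_window, deque()),
        ]
        self._clock = clock

    def allow(self):
        now = self._clock()
        # drop timestamps that have slid out of each window
        for _limit, window, hits in self._rules:
            while hits and now - hits[0] >= window:
                hits.popleft()
        # reject if either the burst or the sustained window is full
        if any(len(hits) >= limit for limit, _window, hits in self._rules):
            return False
        for _limit, _window, hits in self._rules:
            hits.append(now)
        return True
```

The burst window stops short spikes; the sustained window stops a client that stays just under the burst limit for minutes at a time.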
  14. 1. Collect data 2. Identify KPIs ‣ what moves when

    your service is about to fail? ‣ response time / execution time ‣ CPU usage ‣ memory usage ‣ queue backlog ‣ number of connections 3. Configure monitoring and alerting
  15. Resources ‣ The Amazon Builders' Library (free!) ‣ Azure's list

    of Cloud Design Patterns (free!) ‣ Release It!, an excellent book by Michael T. Nygard ‣ The series of SRE books by Google (free!)