Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

Our goals today are...
1. improve stability
2. reduce the blast radius
3. make the system self-healing
4. sleep better at night

Slide 3

Slide 3 text

Just a tiny bit of background... [1]
[1] https://mcfunley.com/choose-boring-technology

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

The first law of distributed systems is: don't distribute your system. [2]
[2] https://martinfowler.com/articles/distributed-objects-microservices.html

Slide 8

Slide 8 text

What is your cost of unavailability?

Slide 9

Slide 9 text

Service Level Agreements / Objectives (SLAs / SLOs) [3]

| Availability      | Downtime (Year) | Downtime (Month) |
|-------------------|-----------------|------------------|
| 99% (2 nines)     | 3.65 days       | 7.31 hours       |
| 99.9% (3 nines)   | 8.77 hours      | 43.83 minutes    |
| 99.99% (4 nines)  | 52.60 minutes   | 4.38 minutes     |
| 99.999% (5 nines) | 5.26 minutes    | 26.30 seconds    |

[3] https://en.wikipedia.org/wiki/High_availability
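
The downtime figures follow from simple arithmetic: the allowed downtime is the unavailable fraction multiplied by the length of the period. A minimal sketch, assuming an average year of 365.25 days and a month of one twelfth of that (the function and constant names are made up for illustration):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumption: average year length
SECONDS_PER_MONTH = SECONDS_PER_YEAR / 12  # assumption: month = year / 12


def allowed_downtime(availability_percent: float, period_seconds: float) -> float:
    """Return the downtime budget in seconds for an availability target."""
    return (100.0 - availability_percent) / 100.0 * period_seconds


# Three nines over a month, expressed in minutes
three_nines_month_minutes = allowed_downtime(99.9, SECONDS_PER_MONTH) / 60

# Five nines over a year, expressed in minutes
five_nines_year_minutes = allowed_downtime(99.999, SECONDS_PER_YEAR) / 60
```

Three nines over a month works out to roughly 43.83 minutes, which is a useful number to compare against how long an incident takes to detect and fix.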

Slide 10

Slide 10 text

The normal state of a large distributed system may very well be "a little bit broken"

Slide 11

Slide 11 text

The technical patterns

Slide 12

Slide 12 text

1. Fail Fast (Timeout)

Slide 13

Slide 13 text

Set an explicit timeout on any remote call.

Slide 14

Slide 14 text

Timeouts
‣ connection timeout (establishing a connection)
‣ request timeout (receiving a response)
‣ value based on production data (e.g. p99)

Slide 15

Slide 15 text

If there's not a lot of latency variation, keep a healthy margin.
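
One way to turn production data into a timeout value is to take a high percentile of observed latencies and multiply it by a safety margin. The sketch below uses the nearest-rank percentile; the function names, the sample data, and the 1.5x margin are illustrative assumptions, not recommendations.

```python
def nearest_rank_percentile(samples, percent):
    """Smallest sample with at least `percent`% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = max(1, -(-percent * len(ordered) // 100))  # integer ceiling
    return ordered[rank - 1]


def suggest_timeout(latencies, percent=99, margin=1.5):
    """Percentile latency scaled by a margin (the margin is an assumption)."""
    return nearest_rank_percentile(latencies, percent) * margin


# Hypothetical latency samples in milliseconds: 1, 2, ..., 100
observed_ms = list(range(1, 101))
timeout_ms = suggest_timeout(observed_ms)
```

With these made-up samples the p99 is 99 ms, so the suggested timeout is 148.5 ms.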

Slide 16

Slide 16 text

What's your fallback?
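
A fallback can be as simple as returning a default or previously cached value when the remote call times out. A minimal sketch, assuming the remote call signals failure with `TimeoutError` or `ConnectionError` (the helper name and the example data are hypothetical):

```python
def call_with_fallback(remote_call, fallback,
                       handled=(TimeoutError, ConnectionError)):
    """Return remote_call(), or `fallback` if it fails in an expected way."""
    try:
        return remote_call()
    except handled:
        return fallback


def slow_recommendations():
    # Stand-in for a remote call that exceeded its timeout
    raise TimeoutError("recommendation service took too long")


# Degrade gracefully: show bestsellers instead of personal recommendations
items = call_with_fallback(slow_recommendations, fallback=["bestseller-1"])
```

Only the listed exception types trigger the fallback; anything else still propagates, so programming errors are not silently hidden.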

Slide 17

Slide 17 text

2. Retry

Slide 18

Slide 18 text

Sources of failure
‣ dropped packet
‣ change in network topology
‣ partial failure (small %)
‣ transient failure (short time)

Slide 19

Slide 19 text

Ask yourself...
1. Can I even retry the call? (side effects, etc.)
2. Am I amplifying the problem?
3. How many times is reasonable?
4. How long am I prepared to try?

Slide 20

Slide 20 text

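    import random

    from tenacity import retry, stop_after_attempt


    @retry(stop=stop_after_attempt(5))  # give up after 5 attempts
    def never_gonna_give_you_up():
        # Fail roughly half the time to simulate a flaky call
        if random.choice([True, False]):
            raise Exception
        return 42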

Slide 22

Slide 22 text

The multiplicative effect of retries
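
To see why retries multiply, consider a chain of services where every layer retries the layer below it. In the worst case the attempts compound. A small worked example, assuming each layer retries independently and the bottom dependency fails every time (the function name is illustrative):

```python
def worst_case_calls(attempts_per_layer: int, layers: int) -> int:
    """Calls reaching the bottom service for one request at the top."""
    return attempts_per_layer ** layers


# Three layers, each making up to 3 attempts
three_by_three = worst_case_calls(3, 3)

# Three layers, each making up to 5 attempts
five_by_three = worst_case_calls(5, 3)
```

Three layers with 3 attempts each can send 27 calls to a struggling dependency for a single user request, and 5 attempts each raises that to 125.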

Slide 23

Slide 23 text

Not all failures are worth retrying
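
As an illustration, an HTTP client might retry only responses that suggest a temporary condition. The set of status codes below is a common convention and an assumption for this sketch, not a rule from the talk.

```python
# Assumption: statuses that usually indicate a temporary condition
RETRYABLE_STATUS = frozenset({408, 429, 500, 502, 503, 504})


def is_retryable(status_code: int) -> bool:
    """True if a retry has a reasonable chance of succeeding."""
    return status_code in RETRYABLE_STATUS


# A 503 may clear up on its own; a 404 or 400 will fail the same way again
decisions = {code: is_retryable(code) for code in (400, 404, 429, 503)}
```

Retrying a request the server rejected as invalid only adds load without changing the outcome.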

Slide 24

Slide 24 text

A note on background tasks and retries...

Slide 25

Slide 25 text

‣ make 5 connection attempts
‣ exponential wait time (powers of 2)
  ‣ 4 seconds
  ‣ 8 seconds
  ‣ 16 seconds
  ‣ 32 seconds
‣ will wait up to 60s
‣ repeated for each model in the collection
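
The schedule above can be sketched with the standard library. This is an illustration, not the configuration from the talk: the function names, the per-wait cap, and the collection size are assumptions. Note that the four waits add up to 60 seconds, which is one way to read the 60s figure.

```python
import time


def backoff_schedule(attempts=5, first_wait=4, factor=2, max_wait=60):
    """Waits between attempts: one fewer wait than attempts."""
    return [min(first_wait * factor ** i, max_wait) for i in range(attempts - 1)]


def retry_with_backoff(func, waits, sleep=time.sleep):
    """Call func, sleeping between failed attempts; the last error propagates."""
    for wait in waits:
        try:
            return func()
        except ConnectionError:
            sleep(wait)
    return func()  # final attempt


waits = backoff_schedule()          # waits between the 5 attempts
total_wait = sum(waits)             # time spent waiting for one item
models = 10                         # hypothetical collection size
worst_case_seconds = total_wait * models
```

Because the schedule repeats for every item, a background task over a collection can wait far longer than the per-item figure suggests.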

Slide 26

Slide 26 text

3. Circuit Breaker [4]
[4] http://martinfowler.com/bliki/CircuitBreaker.html

Slide 27

Slide 27 text

Circuit breaker in a nutshell
‣ wraps a remote call
‣ counts the number of failures
‣ trips when the failures reach a given threshold
‣ can retry the remote service after a grace period (half-open)
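
The behaviour described above fits in a few lines of Python. This is a simplified, single-threaded sketch for illustration, not the implementation of any particular library; the class name, parameter names, and the injectable clock are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.recovery_timeout:
            return "half-open"  # grace period over: allow a trial call
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call re-opens; otherwise trip at the threshold
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the breaker and resets the failure count
        self._failures = 0
        self._opened_at = None
        return result
```

While the breaker is open, callers get an immediate `CircuitOpenError` instead of waiting on a dependency that is known to be failing.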

Slide 28

Slide 28 text

Circuit breaker states [4]
[4] http://martinfowler.com/bliki/CircuitBreaker.html

Slide 29

Slide 29 text

    import json

    import requests
    from circuitbreaker import circuit


    @circuit  # uses the library's default threshold and recovery timeout
    def external_call():
        response = requests.get('https://unreliable-api.com/')
        return json.loads(response.content)

Slide 31

Slide 31 text

    import json

    import requests
    from circuitbreaker import circuit


    @circuit(
        # How many times it can fail
        failure_threshold=10,
        # How long the breaker stays open (seconds)
        recovery_timeout=30,
        # Only catch this error. requests raises its own ConnectionError,
        # which is not a subclass of the built-in one.
        expected_exception=requests.exceptions.ConnectionError,
    )
    def external_call():
        response = requests.get('https://unreliable-api.com/')
        return json.loads(response.content)

Slide 32

Slide 32 text

When to use a circuit breaker?

Slide 33

Slide 33 text

Circuit breakers can significantly increase time to recovery.

Slide 34

Slide 34 text

4. Bulkheads

Slide 35

Slide 35 text

Bulkhead architecture [5]
[5] https://docs.microsoft.com/en-us/azure/architecture/patterns/bulkhead

Slide 37

Slide 37 text

Each compartment in a bulkhead is a complete, independent instance of the service.

Slide 38

Slide 38 text

Bulkheads can be implemented on the client-side too!
‣ partition service instances into different groups
‣ divide available connections between clients
‣ if one client overloads a service instance, others aren't impacted
‣ high-priority clients can be assigned more resources
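
One way to divide connections between clients is to give each client group its own bounded pool of slots. A minimal sketch using semaphores from the standard library; the class name, the group names, and the limits are hypothetical.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    def __init__(self, limits):
        # One independent semaphore per compartment
        self._slots = {name: threading.BoundedSemaphore(n)
                       for name, n in limits.items()}

    @contextmanager
    def slot(self, name):
        semaphore = self._slots[name]
        # Reject immediately rather than queueing behind a saturated group
        if not semaphore.acquire(blocking=False):
            raise BulkheadFullError(name)
        try:
            yield
        finally:
            semaphore.release()


# High-priority checkout traffic gets more slots than reporting
bulkhead = Bulkhead({"checkout": 8, "reports": 2})
```

A caller wraps each outbound request in `with bulkhead.slot("reports"): ...`; when reporting exhausts its two slots, checkout requests still find theirs free.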

Slide 39

Slide 39 text

Bulkheads also work for data
‣ partitions / shards
‣ separate tenant schema
‣ separate database
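
For the partitioning option, a stable hash of the tenant identifier is one simple way to pin each tenant to a shard. This is a sketch under assumptions: the function name and shard count are made up, and plain modulo hashing reshuffles tenants if the shard count changes.

```python
import hashlib


def shard_for(tenant_id: str, shard_count: int) -> int:
    """Map a tenant to a shard index that is stable across processes."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Hypothetical tenants spread over 4 shards
placement = {tenant: shard_for(tenant, 4) for tenant in ("acme", "globex")}
```

A noisy tenant then only affects the other tenants that share its shard.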

Slide 40

Slide 40 text

Consider combining bulkheads with retries, circuit breakers and throttling.

Slide 41

Slide 41 text

5. Throttling

Slide 42

Slide 42 text

Implement multiple throttling strategies
‣ burst rate limit [6]
[6] Sliding window algorithm

Slide 43

Slide 43 text

Implement multiple throttling strategies
‣ burst rate limit [6]
‣ sustained rate limit
[6] Sliding window algorithm

Slide 44

Slide 44 text

Implement multiple throttling strategies
‣ burst rate limit [6]
‣ sustained rate limit
‣ base those limits on real usage data
[6] Sliding window algorithm
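
Burst and sustained limits can be combined by checking the same request against two sliding windows of different lengths. A minimal single-threaded sketch of a sliding window log; the class, the helper, and the limits are illustrative assumptions.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self._clock = clock
        self._hits = deque()  # timestamps of accepted requests

    def would_allow(self):
        now = self._clock()
        # Drop timestamps that have slid out of the window
        while self._hits and now - self._hits[0] >= self.window:
            self._hits.popleft()
        return len(self._hits) < self.limit

    def record(self):
        self._hits.append(self._clock())


def allow(limiters):
    """Accept a request only if every limiter has room, then record it."""
    if all(limiter.would_allow() for limiter in limiters):
        for limiter in limiters:
            limiter.record()
        return True
    return False


# Hypothetical limits: 10 requests per second, 100 requests per minute
limits = [SlidingWindowLimiter(10, 1), SlidingWindowLimiter(100, 60)]
```

Checking all limiters before recording means a rejected request does not consume quota in either window.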

Slide 45

Slide 45 text

Build and test your throttling mechanism before you need it - Younger me

Slide 46

Slide 46 text

6. Set trip wires

Slide 47

Slide 47 text

1. Collect data
2. Identify KPIs
  ‣ what moves when your service is about to fail?
  ‣ response time / execution time
  ‣ CPU usage
  ‣ memory usage
  ‣ queue backlog
  ‣ number of connections
3. Configure monitoring and alerting
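
In its simplest form a trip wire is a threshold check on those KPIs. The sketch below is only an illustration: the metric names and threshold values are hypothetical, and real limits should come from the data collected in step 1.

```python
def breached(metrics, thresholds):
    """Return the names of metrics that exceed their threshold, sorted."""
    return sorted(name for name, limit in thresholds.items()
                  if metrics.get(name, 0) > limit)


# Hypothetical thresholds and a hypothetical snapshot of current metrics
thresholds = {"p99_response_ms": 800, "cpu_percent": 85, "queue_backlog": 1000}
snapshot = {"p99_response_ms": 950, "cpu_percent": 40, "queue_backlog": 4200}

alerts = breached(snapshot, thresholds)
```

In practice this logic usually lives in the monitoring system's alert rules rather than in application code.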

Slide 48

Slide 48 text

Learn the failure modes of the tools you use.

Slide 49

Slide 49 text

Resources
‣ The Amazon Builders' Library (free!)
‣ Azure's list of Cloud Design Patterns (free!)
‣ Release It!, an excellent book by Michael T. Nygard
‣ The series of SRE books by Google (free!)

Slide 50

Slide 50 text

Conclusion

Slide 51

Slide 51 text

Slides

Slide 52

Slide 52 text

Thank you!