Slide 1

Slide 1 text

Graceful degradation when services fail
Remote Calls != Local Calls

Slide 2

Slide 2 text

Dan Riti
Senior Software Engineer
@danriti
github.com/danriti

Slide 3

Slide 3 text

Monolithic

Slide 4

Slide 4 text

Monolithic Services

Slide 5

Slide 5 text

Dependencies

Slide 6

Slide 6 text

Points of Failure

Slide 7

Slide 7 text

Remote Calls != Local Calls

Slide 8

Slide 8 text

Q: What approaches support graceful degradation when (services, networks, data_stores) fail?

Slide 9

Slide 9 text

Q: What approaches support graceful degradation when (services, networks, data_stores) fail?
A:
1. Timeouts
2. Circuit Breaker Pattern
3. Retries
4. Bulkhead Pattern

Slide 10

Slide 10 text

1. Timeouts
   ○ Forcing an error when a dependency is unhealthy
2. Circuit Breaker Pattern
   ○ Preventing operations when a dependency is unhealthy
3. Retries
   ○ Forcing extra attempts where extra latency is acceptable if a recovery provides more value
4. Bulkhead Pattern
   ○ Partitioning a system to enforce the principle of damage containment
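Timeouts and circuit breakers are worked through later in the deck; retries and bulkheads are only named here. The following is a minimal sketch of both, not taken from the talk. The names `retry_call`, `Bulkhead`, and `BulkheadFullError`, the doubling backoff, and the reject-when-full behaviour are all assumptions made for illustration.

```python
import threading
import time


def retry_call(func, attempts=3, delay=0.1, exceptions=(Exception,)):
    """Call func(), trying up to `attempts` times in total.

    Every failed attempt adds latency (`delay` doubles each time), so this
    only pays off when a late success is worth more than a fast failure.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay)
            delay *= 2  # simple exponential backoff (illustrative choice)


class BulkheadFullError(Exception):
    """Raised when a partition has no free slots."""


class Bulkhead:
    """Cap concurrent calls to one dependency so it cannot exhaust the caller."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately instead of queueing behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slots for this dependency")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

One `Bulkhead` per dependency keeps a slow time service from consuming the threads the user service needs, which is the damage containment the last bullet describes.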

Slide 11

Slide 11 text

Diagram: Web App, Time Service, User Service

Slide 12

Slide 12 text

Diagram: Web App, Time Service (RESTful), User Service (RESTful)

Slide 13

Slide 13 text

Diagram: Web App, Time Service (RESTful, HTTP), User Service (RESTful, HTTP)

Slide 14

Slide 14 text

Time Service Response
{ "time": "2015-09-11T17:33:48.940483" }

User Service Response
{ "name": "Dan Riti" }

Slide 15

Slide 15 text


Slide 16

Slide 16 text

web app

Slide 17

Slide 17 text

web app, time service

Slide 18

Slide 18 text

web app, time service, user service

Slide 19

Slide 19 text

Q: How should we degrade when the time_service is unavailable?

Slide 20

Slide 20 text

Q: How should we degrade when the time_service is unavailable?
A:
● Present “Unavailable” to user
● Give up on requests after 3 seconds
● Provide fault isolation
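A minimal sketch of that answer at the page level, assuming the web app builds its response from two independent fetch functions. `call_or_fallback` and `render_page` are hypothetical names, not code from the short-circuit repository; the 3-second limit is assumed to live inside the fetch functions themselves.

```python
FALLBACK = "Unavailable"  # text shown in place of a value we could not fetch


def call_or_fallback(fetch, fallback=FALLBACK):
    """Run one remote call; on any failure return the fallback instead."""
    try:
        return fetch()
    except Exception:
        # A timeout, a refused connection, or a bad payload all degrade the same way.
        return fallback


def render_page(fetch_time, fetch_name):
    """Build the page data from two independent dependencies."""
    return {
        "time": call_or_fallback(fetch_time),
        "name": call_or_fallback(fetch_name),
    }
```

Because each dependency is wrapped separately, a failing time service costs the page only the time field; the user name still renders.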

Slide 21

Slide 21 text

https://github.com/danriti/short-circuit

Slide 22

Slide 22 text

Timeouts
“Your code can't just wait forever for a response that might never come; sooner or later, it needs to give up. Hope is not a design method.”
- Michael T. Nygard, Release It!

Slide 23

Slide 23 text


Slide 24

Slide 24 text

response = requests.get('http://localhost:3001/time')

Slide 25

Slide 25 text

response = requests.get('http://localhost:3001/time', timeout=3.0)
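The call above raises once the timeout elapses, so the caller still has to turn that error into a degraded result. Below is a minimal sketch of that step. It swaps `requests` for the standard library's `urllib.request` so it has no third-party dependency; with `requests` the exception to catch would be `requests.exceptions.Timeout`. The function name `fetch_time` and the injectable `opener` parameter are assumptions made for illustration.

```python
import json
import urllib.request


def fetch_time(url, timeout=3.0, opener=urllib.request.urlopen):
    """Return the "time" field from the service, or "Unavailable" on failure."""
    try:
        with opener(url, timeout=timeout) as resp:
            return json.load(resp)["time"]
    except (OSError, ValueError, KeyError):
        # OSError covers socket timeouts, refused connections and URLError;
        # ValueError/KeyError cover a malformed or unexpected payload.
        return "Unavailable"
```

Note that in both libraries the timeout bounds each blocking socket operation rather than the whole exchange, so a service that trickles bytes can still hold the caller longer than 3 seconds.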

Slide 26

Slide 26 text

RTFM

Slide 27

Slide 27 text

The time service is unhealthy
Diagram: Web App, Time Service, User Service
https://github.com/danriti/short-circuit

Slide 28

Slide 28 text

3.01 s

Slide 29

Slide 29 text

3.01 s
Broken Pipe

Slide 30

Slide 30 text

Timeouts are not perfect
● Easy to get started with
● Provides some fault isolation
● Response bound to timeout value
● Still applying load to unhealthy service(s)

Slide 31

Slide 31 text

Circuit Breaker Pattern

Slide 32

Slide 32 text

Circuit Breaker Pattern
● “Allow one subsystem (an electrical circuit) to fail (excessive current draw) without destroying the entire system (the house)”
● “Once the danger has passed, the circuit breaker can be reset to restore full function to the system”
● “This differs from retries, in that circuit breakers exist to prevent operations rather than re-execute them”
- Michael T. Nygard, Release It!

Slide 33

Slide 33 text


Slide 34

Slide 34 text

Circuit Breaker Pattern
● Release It! by Michael T. Nygard (2007)
  ○ https://pragprog.com/book/mnee/release-it
● Netflix (2011)
  ○ http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html
● Martin Fowler (2014)
  ○ http://martinfowler.com/bliki/CircuitBreaker.html

Slide 35

Slide 35 text


Slide 36

Slide 36 text

time_breaker = pybreaker.CircuitBreaker(fail_max=3, reset_timeout=30)

Slide 37

Slide 37 text

time_breaker = pybreaker.CircuitBreaker(
    fail_max=3,
    reset_timeout=30)

Slide 38

Slide 38 text

The system is healthy
Diagram: Web App, Time Service, User Service
https://github.com/danriti/short-circuit

Slide 39

Slide 39 text

#   Request  Time Service   User Service  Circuit Breaker State
1.  GET /    50 ms          50 ms         Closed

Slide 40

Slide 40 text

#   Request  Time Service   User Service  Circuit Breaker State
1.  GET /    50 ms          50 ms         Closed
2.  GET /    1 s            50 ms         Closed

Slide 41

Slide 41 text

#   Request  Time Service   User Service  Circuit Breaker State
1.  GET /    50 ms          50 ms         Closed
2.  GET /    1 s            50 ms         Closed
3.  GET /    2 s            50 ms         Closed

Slide 42

Slide 42 text

#   Request  Time Service   User Service  Circuit Breaker State
1.  GET /    50 ms          50 ms         Closed
2.  GET /    1 s            50 ms         Closed
3.  GET /    2 s            50 ms         Closed
4.  GET /    3 s (timeout)  50 ms         Closed (1 Failure)

Slide 43

Slide 43 text

#   Request  Time Service   User Service  Circuit Breaker State
1.  GET /    50 ms          50 ms         Closed
2.  GET /    1 s            50 ms         Closed
3.  GET /    2 s            50 ms         Closed
4.  GET /    3 s (timeout)  50 ms         Closed (1 Failure)
5.  GET /    3 s (timeout)  50 ms         Closed (2 Failures)

Slide 44

Slide 44 text

#   Request  Time Service   User Service  Circuit Breaker State
1.  GET /    50 ms          50 ms         Closed
2.  GET /    1 s            50 ms         Closed
3.  GET /    2 s            50 ms         Closed
4.  GET /    3 s (timeout)  50 ms         Closed (1 Failure)
5.  GET /    3 s (timeout)  50 ms         Closed (2 Failures)
6.  GET /    3 s (timeout)  50 ms         Closed => Open (3 Failures)

Slide 45

Slide 45 text

The time service is unhealthy
Diagram: Web App, Time Service, User Service
https://github.com/danriti/short-circuit

Slide 46

Slide 46 text

The circuit breaker is open
Diagram: Web App, Time Service, User Service
https://github.com/danriti/short-circuit

Slide 47

Slide 47 text

#   Request  Time Service   User Service  Circuit Breaker State
1.  GET /    50 ms          50 ms         Closed
2.  GET /    1 s            50 ms         Closed
3.  GET /    2 s            50 ms         Closed
4.  GET /    3 s (timeout)  50 ms         Closed (1 Failure)
5.  GET /    3 s (timeout)  50 ms         Closed (2 Failures)
6.  GET /    3 s (timeout)  50 ms         Closed => Open (3 Failures)

Slide 48

Slide 48 text

#   Request  Time Service   User Service  Circuit Breaker State
1.  GET /    50 ms          50 ms         Closed
2.  GET /    1 s            50 ms         Closed
3.  GET /    2 s            50 ms         Closed
4.  GET /    3 s (timeout)  50 ms         Closed (1 Failure)
5.  GET /    3 s (timeout)  50 ms         Closed (2 Failures)
6.  GET /    3 s (timeout)  50 ms         Closed => Open (3 Failures)
7.  GET /    Prevented      50 ms         Open

Slide 49

Slide 49 text

#   Request  Time Service   User Service  Circuit Breaker State
1.  GET /    50 ms          50 ms         Closed
2.  GET /    1 s            50 ms         Closed
3.  GET /    2 s            50 ms         Closed
4.  GET /    3 s (timeout)  50 ms         Closed (1 Failure)
5.  GET /    3 s (timeout)  50 ms         Closed (2 Failures)
6.  GET /    3 s (timeout)  50 ms         Closed => Open (3 Failures)
7.  GET /    Prevented      50 ms         Open
8.  GET /    Prevented      50 ms         Open

Slide 50

Slide 50 text

6 ms

Slide 51

Slide 51 text

#   Request  Time Service   User Service  Circuit Breaker State
8.  GET /    Prevented      50 ms         Open

Slide 52

Slide 52 text

#   Request  Time Service   User Service  Circuit Breaker State
8.  GET /    Prevented      50 ms         Open
9.  GET /    Prevented      50 ms         Open

Slide 53

Slide 53 text

#   Request  Time Service   User Service  Circuit Breaker State
8.  GET /    Prevented      50 ms         Open
9.  GET /    Prevented      50 ms         Open
    ~30 seconds passed, so reset timeout triggers: Open => Half-Open

Slide 54

Slide 54 text

#   Request  Time Service   User Service  Circuit Breaker State
8.  GET /    Prevented      50 ms         Open
9.  GET /    Prevented      50 ms         Open
    ~30 seconds passed, so reset timeout triggers: Open => Half-Open
10. GET /    3 s (timeout)  50 ms         Half-Open => Open

Slide 55

Slide 55 text

#   Request  Time Service   User Service  Circuit Breaker State
8.  GET /    Prevented      50 ms         Open
9.  GET /    Prevented      50 ms         Open
    ~30 seconds passed, so reset timeout triggers: Open => Half-Open
10. GET /    3 s (timeout)  50 ms         Half-Open => Open
11. GET /    Prevented      50 ms         Open

Slide 56

Slide 56 text

#   Request  Time Service   User Service  Circuit Breaker State
8.  GET /    Prevented      50 ms         Open
9.  GET /    Prevented      50 ms         Open
    ~30 seconds passed, so reset timeout triggers: Open => Half-Open
10. GET /    3 s (timeout)  50 ms         Half-Open => Open
11. GET /    Prevented      50 ms         Open
12. GET /    Prevented      50 ms         Open

Slide 57

Slide 57 text

#   Request  Time Service   User Service  Circuit Breaker State
8.  GET /    Prevented      50 ms         Open
9.  GET /    Prevented      50 ms         Open
    ~30 seconds passed, so reset timeout triggers: Open => Half-Open
10. GET /    3 s (timeout)  50 ms         Half-Open => Open
11. GET /    Prevented      50 ms         Open
12. GET /    Prevented      50 ms         Open

Slide 58

Slide 58 text

#   Request  Time Service   User Service  Circuit Breaker State
13. GET /    Prevented      50 ms         Open

Slide 59

Slide 59 text

#   Request  Time Service   User Service  Circuit Breaker State
13. GET /    Prevented      50 ms         Open
    Ops saves the day, so time service is healthy: Open

Slide 60

Slide 60 text

#   Request  Time Service   User Service  Circuit Breaker State
13. GET /    Prevented      50 ms         Open
    Ops saves the day, so time service is healthy: Open
    ~30 seconds passed, so reset timeout triggers: Open => Half-Open

Slide 61

Slide 61 text

#   Request  Time Service   User Service  Circuit Breaker State
13. GET /    Prevented      50 ms         Open
    Ops saves the day, so time service is healthy: Open
    ~30 seconds passed, so reset timeout triggers: Open => Half-Open
14. GET /    50 ms          50 ms         Half-Open => Closed

Slide 62

Slide 62 text

#   Request  Time Service   User Service  Circuit Breaker State
13. GET /    Prevented      50 ms         Open
    Ops saves the day, so time service is healthy: Open
    ~30 seconds passed, so reset timeout triggers: Open => Half-Open
14. GET /    50 ms          50 ms         Half-Open => Closed
15. GET /    50 ms          50 ms         Closed

Slide 63

Slide 63 text

#   Request  Time Service   User Service  Circuit Breaker State
13. GET /    Prevented      50 ms         Open
    Ops saves the day, so time service is healthy: Open
    ~30 seconds passed, so reset timeout triggers: Open => Half-Open
14. GET /    50 ms          50 ms         Half-Open => Closed
15. GET /    50 ms          50 ms         Closed
16. GET /    50 ms          50 ms         Closed
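The state changes in the walk-through (Closed, Open, Half-Open) can be reproduced with a small state machine. This is a simplified sketch for illustration, not pybreaker's implementation: the class and exception names are hypothetical, the `clock` parameter exists only so time can be faked, and the Open => Half-Open transition is evaluated lazily on the next call rather than by a timer.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is prevented because the breaker is open."""


class CircuitBreaker:
    def __init__(self, fail_max=3, reset_timeout=30, clock=time.monotonic):
        self.fail_max = fail_max            # consecutive failures before opening
        self.reset_timeout = reset_timeout  # seconds to stay open before a trial call
        self._clock = clock
        self.state = "closed"
        self.failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("call prevented")
            self.state = "half-open"  # reset timeout elapsed: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or the fail_max-th consecutive failure,
            # opens the circuit and restarts the reset timer.
            if self.state == "half-open" or self.failures >= self.fail_max:
                self.state = "open"
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.state = "closed"
        self.failures = 0
        return result
```

While the breaker is open the wrapped function is never invoked, which is why the prevented requests in the tables return quickly and put no load on the time service.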

Slide 64

Slide 64 text

14 ms

Slide 65

Slide 65 text

Timeouts + Circuit Breaker Pattern
● Graceful degradation of user experience
● Fail fast and rapidly recover
● Reduces load on unhealthy service
● Avoid unhealthy service affecting the system
● (Bonus) Interface to monitor and measure points of integration in a system

Slide 66

Slide 66 text

Timeouts + Circuit Breaker Pattern
● Under-provisioned services can cause “flapping”
  ○ Service is overwhelmed due to load
  ○ Circuit breaker trips to “open”
  ○ Service no longer receiving requests, so it “recovers”
  ○ Circuit breaker “closes”
● Important to understand service performance constraints

Slide 67

Slide 67 text

Parameters for Timeouts + Circuit Breaker
● What do you present during failure?
● How many times do you accept failure? (fail_max)
● How long until you attempt reset? (reset_timeout)
● How long will you wait? (timeout)
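One way to keep these four decisions visible is to group them in a single object. The sketch below is an assumption for illustration: `DegradationPolicy` and `worst_case_wait` are hypothetical names, and the defaults are simply the values used earlier in the deck ("Unavailable", 3 failures, 30 seconds, 3 seconds).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DegradationPolicy:
    fallback: str = "Unavailable"  # what to present during failure
    fail_max: int = 3              # failures accepted before the breaker opens
    reset_timeout: float = 30.0    # seconds to wait before attempting a reset
    timeout: float = 3.0           # seconds to wait for a single response

    def worst_case_wait(self):
        """Seconds spent waiting before the breaker opens, assuming every
        failure is a full timeout and requests arrive one after another."""
        return self.fail_max * self.timeout
```

With the deck's values, `DegradationPolicy().worst_case_wait()` is 9 seconds of timed-out requests before the breaker starts preventing calls, which makes the trade-off between `fail_max` and `timeout` explicit.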

Slide 68

Slide 68 text


Slide 69

Slide 69 text


Slide 70

Slide 70 text


Slide 71

Slide 71 text


Slide 72

Slide 72 text

Holiday Weekend

Slide 73

Slide 73 text

Circuit Breaker Libraries
● Python - https://github.com/danielfm/pybreaker
● Go - https://github.com/rubyist/circuitbreaker
● Java - https://github.com/Netflix/Hystrix
● Ruby - https://github.com/wsargent/circuit_breaker
● .NET - https://github.com/michael-wolfenden/Polly
● JavaScript - https://github.com/yammer/circuit-breaker-js
● PHP - https://github.com/ejsmont-artur/php-circuit-breaker

Slide 74

Slide 74 text

https://github.com/Netflix/Hystrix/wiki/Dashboard

Slide 75

Slide 75 text

Thank You
@danriti

Slide 76

Slide 76 text

Backup

Slide 77

Slide 77 text

Source: https://github.com/Netflix/Hystrix/wiki

Slide 78

Slide 78 text

#   Request  Time Service   User Service  Circuit Breaker State
1.  GET /    50 ms          50 ms         Closed
2.  GET /    1 s            50 ms         Closed
3.  GET /    2 s            50 ms         Closed
4.  GET /    3 s (timeout)  50 ms         Closed (1 Failure)
5.  GET /    3 s (timeout)  50 ms         Closed (2 Failures)
6.  GET /    3 s (timeout)  50 ms         Closed => Open (3 Failures)
7.  GET /    Skipped        50 ms         Open
8.  GET /    Skipped        50 ms         Open

Slide 79

Slide 79 text

#   Request  Time Service   User Service  Circuit Breaker State
8.  GET /    Prevented      50 ms         Open
9.  GET /    Prevented      50 ms         Open
    ~30 seconds passed, so reset timeout triggers: Open => Half-Open
10. GET /    3 s (timeout)  50 ms         Half-Open => Open
11. GET /    Prevented      50 ms         Open
12. GET /    Prevented      50 ms         Open

Slide 80

Slide 80 text

#   Request  Time Service   User Service  Circuit Breaker State
13. GET /    Prevented      50 ms         Open
    Ops saves the day, so time service is healthy: Open
    ~30 seconds passed, so reset timeout triggers: Open => Half-Open
14. GET /    50 ms          50 ms         Half-Open => Closed
15. GET /    50 ms          50 ms         Closed
16. GET /    50 ms          50 ms         Closed

Slide 81

Slide 81 text

Fallacies of Distributed Computing
1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.
https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing

Slide 82

Slide 82 text
