Remote Calls != Local Calls @ PyCon 2016

A 30-minute talk for PyCon 2016 on graceful degradation when services fail.

https://us.pycon.org/2016/schedule/presentation/2027/

Video of the talk can be found here:

https://www.youtube.com/watch?v=dY-SkuENZP8

If you want to see speaker notes, see the original Google presentation:

https://docs.google.com/presentation/d/1ZyM9Mo9NlRvT6QuVsU2anfofyZv2XY5IP2iQCaSCfZQ/edit#slide=id.gc7ab45e34_0_0

Dan Riti

May 31, 2016

Transcript

  1. 9.

    @danriti Dan Riti

    Q: What approaches support graceful degradation when (services, networks, data_stores) fail?

    A:
    1. Timeouts
    2. Circuit Breaker Pattern
    3. Retries
    4. Bulkhead Pattern
  2. 10.

    1. Timeouts
       ◦ Forcing an error when a dependency is unhealthy
    2. Circuit Breaker Pattern
       ◦ Prevent operations when a dependency is unhealthy
    3. Retries
       ◦ Forcing extra attempts where extra latency is acceptable if a recovery provides more value
    4. Bulkhead Pattern
       ◦ Partitioning a system to enforce the principle of damage containment
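The transcript does not show code for the Bulkhead Pattern, so the following is only a minimal sketch of one common way to partition capacity: a semaphore per dependency that caps concurrent calls and fails fast when the compartment is full. The names `Bulkhead`, `BulkheadFullError`, and `max_concurrent` are hypothetical and do not come from the talk.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    """Caps concurrent calls to one dependency so it cannot use up shared capacity."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Fail fast instead of queueing when the compartment is full.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slots for this dependency")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# Hypothetical usage: one compartment per dependency.
time_service_bulkhead = Bulkhead(max_concurrent=2)
```

With one `Bulkhead` per dependency, a slow time service can tie up at most its own slots, and calls to other dependencies keep their capacity.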
  3. 19.
  4. 20.

    Q: How should we degrade when the time_service is unavailable?

    A:
    • Present “Unavailable” to user
    • Give up on requests after 3 seconds
    • Provide fault isolation
  5. 21.

    Timeouts

    “Your code can't just wait forever for a response that might never come; sooner or later, it needs to give up. Hope is not a design method.”
    - Michael T. Nygard, Release It!
  6. 25.

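        import requests

        def get_time():
            try:
                # Give up when the time service does not respond within the timeout.
                response = requests.get('http://localhost:3001/time', timeout=3.0)
            except requests.exceptions.Timeout:
                # Degrade gracefully instead of propagating the error.
                return 'Unavailable'
            return response.json().get('datetime')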
  7. 30.

    Timeouts are not perfect

    • Easy to get started with
    • Provides some fault isolation
    • Response bound to timeout value
    • Still applying load to unhealthy service(s)
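The last two points can be illustrated with a small standard-library sketch. It is not the `requests` code from the talk: `call_with_timeout`, `slow_time_service`, and the sample values are hypothetical stand-ins. The caller gets a bounded response time, but the abandoned call keeps running in its worker thread, so the dependency still does the work.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, timeout_s, fallback):
    """Return func()'s result, or fallback if no result arrives within timeout_s."""
    future = _pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The caller stops waiting, but the worker thread keeps running:
        # the slow dependency still receives and processes the call.
        return fallback


def slow_time_service():
    time.sleep(0.2)  # stand-in for an unhealthy dependency
    return "12:00"


print(call_with_timeout(slow_time_service, 0.05, "Unavailable"))
```

This is the gap the circuit breaker addresses: it stops sending calls to the unhealthy service instead of only bounding how long each caller waits.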
  8. 32.

    Circuit Breaker Pattern

    • “Allow one subsystem (an electrical circuit) to fail (excessive current draw) without destroying the entire system (the house)”
    • “Once the danger has passed, the circuit breaker can be reset to restore full function to the system”
    • “This differs from retries, in that circuit breakers exist to prevent operations rather than re-execute them”

    - Michael T. Nygard, Release It!
  9. 34.

    Circuit Breaker Pattern

    • Release It! by Michael T. Nygard (2007)
      ◦ https://pragprog.com/book/mnee/release-it
    • Netflix (2011)
      ◦ http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html
    • Martin Fowler (2014)
      ◦ http://martinfowler.com/bliki/CircuitBreaker.html
  10. 41.

        import pybreaker

        # Open the circuit after 3 failures; attempt a reset after 30 seconds.
        time_breaker = pybreaker.CircuitBreaker(fail_max=3, reset_timeout=30)

        @time_breaker
        def get_time():
            ...
            # signal a failure to the circuit breaker
            raise pybreaker.CircuitBreakerError
            ...
  19. 53.

        #    Request  Time Service          User Service    Circuit Breaker State
        1.   GET /    50 ms                 50 ms           Closed
        2.   GET /    1 s                   50 ms           Closed
        3.   GET /    2 s                   50 ms           Closed
        4.   GET /    3 s (timeout)         50 ms           Closed (1 Failure)
        5.   GET /    3 s (timeout)         50 ms           Closed (2 Failures)
        6.   GET /    3 s (timeout)         50 ms           Closed => Open (3 Failures)
        7.   GET /    Prevented             50 ms           Open
        8.   GET /    Prevented             50 ms           Open
  26. 61.

        #    Request  Time Service          User Service    Circuit Breaker State
        8.   GET /    Prevented             50 ms           Open
        9.   GET /    Prevented             50 ms           Open
             ~30 seconds passed, so reset timeout triggers  Open => Half-Open
        10.  GET /    3 s (timeout)         50 ms           Half-Open => Open
        11.  GET /    Prevented             50 ms           Open
        12.  GET /    Prevented             50 ms           Open
  32. 67.

        #    Request  Time Service          User Service    Circuit Breaker State
        13.  GET /    Prevented             50 ms           Open
             Ops saves the day, so time service is healthy  Open
             ~30 seconds passed, so reset timeout triggers  Open => Half-Open
        14.  GET /    50 ms                 50 ms           Half-Open => Closed
        15.  GET /    50 ms                 50 ms           Closed
        16.  GET /    50 ms                 50 ms           Closed
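The Closed, Open, and Half-Open transitions walked through above can be sketched as a small state machine. This is a hand-rolled illustration under stated assumptions, not pybreaker's implementation: the class and method names are hypothetical, and the injectable `clock` exists only so the reset timeout can be exercised without waiting 30 seconds.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is prevented because the circuit is open."""


class CircuitBreaker:
    def __init__(self, fail_max=3, reset_timeout=30, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = None

    def call(self, func):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("call prevented")
            # Reset timeout elapsed: let one trial call through.
            self.state = "half-open"
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.state == "half-open" or self._failures >= self.fail_max:
                self.state = "open"
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.state = "closed"
        self._failures = 0
        return result
```

While the circuit is open, `call` raises `CircuitOpenError` without invoking `func`, which corresponds to the "Prevented" rows: the user service still answers quickly, and the time service receives no load.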
  33. 69.

    Timeouts + Circuit Breaker Pattern

    • Graceful degradation of user experience
    • Fail fast and rapidly recover
    • Reduces load on unhealthy service
    • Avoid unhealthy service affecting the system
    • (Bonus) Interface to monitor and measure points of integration in a system
  34. 70.

    Timeouts + Circuit Breaker Pattern

    • Underprovisioned services can cause “flapping”
      ◦ Service is overwhelmed due to load
      ◦ Circuit breaker trips to “open”
      ◦ Service no longer receiving requests, so it “recovers”
      ◦ Circuit breaker “closes”
    • Important to understand service performance constraints
  35. 71.

    Parameters for Timeouts + Circuit Breaker

    • What do you present during failure?
    • How many times do you accept failure? (fail_max)
    • How long until you attempt reset? (reset_timeout)
    • How long will you wait? (timeout)
  36. 77.

    Circuit Breaker Libraries

    • Python - https://github.com/danielfm/pybreaker
    • Go - https://github.com/rubyist/circuitbreaker
    • Java - https://github.com/Netflix/Hystrix
    • Ruby - https://github.com/wsargent/circuit_breaker
    • .NET - https://github.com/michael-wolfenden/Polly
    • JavaScript - https://github.com/yammer/circuit-breaker-js
    • PHP - https://github.com/ejsmont-artur/php-circuit-breaker
  37. 81.

    When should a retry be used?

    Does the benefit of obtaining a response from a service outweigh potentially increasing load on the service?
  38. 82.

    Retry Considerations

    • Limit the number of retries per request
    • Introduce delay between retry attempts
      ◦ Exponential backoff
      ◦ Randomized jitter [1]

    [1] https://www.awsarchitectureblog.com/2015/03/backoff.html
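The considerations above (a retry limit, exponential backoff, randomized jitter) can be sketched without any library. The function names, the one-second base, the 30-second cap, and the half-second jitter are illustrative assumptions, not values from the talk.

```python
import random
import time


def backoff_delays(max_attempts=3, base_s=1.0, cap_s=30.0, jitter_s=0.5, rng=random):
    """Yield the delay before each retry: capped exponential plus random jitter."""
    for attempt in range(1, max_attempts):
        exponential = min(cap_s, base_s * 2 ** attempt)
        yield exponential + rng.uniform(0, jitter_s)


def call_with_retries(func, max_attempts=3, sleep=time.sleep, **backoff):
    """Call func up to max_attempts times, sleeping between failed attempts."""
    delays = backoff_delays(max_attempts=max_attempts, **backoff)
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the last error
            sleep(next(delays))
```

The jitter spreads out clients that failed at the same moment, so they do not all retry in lockstep against a recovering service.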
  39. 85.

        from retrying import retry

        @retry(stop_max_attempt_number=3,
               wait_exponential_multiplier=1000,  # 2^N * 1000ms
               wait_jitter_max=500)               # 500ms
        def get_user():
            ...
            # signal a failure to the retry decorator
            raise Exception
            ...
  40. 91.
  41. 93.

    Combinational Retry Explosion

    Diagram layers: Database, Backend, Frontend, Javascript

    • 1st attempt failed, so make 2nd attempt
    • Makes 4 attempts and then bubbles error up a level
  42. 95.

    Combinational Retry Explosion

    Diagram layers: Database, Backend, Frontend, Javascript

    • Makes 4 attempts and then bubbles error up a level
    • 2nd attempt failed, so make 3rd attempt
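The arithmetic behind the explosion is worth spelling out. Assuming (this is not stated on the slide) that each of the three calling layers, Backend, Frontend, and Javascript, makes up to 4 attempts before bubbling the error up, the worst-case number of calls reaching the database multiplies at every level. The helper name is hypothetical.

```python
def worst_case_database_calls(attempts_per_layer, calling_layers):
    """Each calling layer multiplies the attempts made by the layer below it."""
    return attempts_per_layer ** calling_layers


# Assumption: Backend, Frontend and Javascript each make up to 4 attempts.
print(worst_case_database_calls(attempts_per_layer=4, calling_layers=3))  # prints 64
```

A single user action can therefore turn into 64 database calls against a database that is already failing.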
  43. 97.

    Retry Strategies

    • Use clear response codes
      ◦ Separate retriable and non-retriable errors
      ◦ Return a specific status when overloaded
    • Retry budgets
      ◦ Per-request retry budget
      ◦ Per-client retry budget
      ◦ Server-wide retry budget
    • Monitor your retry rates
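A per-client retry budget combined with a retriable/non-retriable split might look like the sketch below. The talk does not show an implementation, so `RetryBudget`, `should_retry`, the 10% ratio, and the choice of 503 as the retriable status are all assumptions made for illustration.

```python
class RetryBudget:
    """Allows retries only while they stay under a fraction of all requests."""

    def __init__(self, max_retry_ratio=0.1):
        self.max_retry_ratio = max_retry_ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_spend(self):
        """Return True and count a retry if the budget allows one more."""
        if self.retries + 1 > self.requests * self.max_retry_ratio:
            return False
        self.retries += 1
        return True


RETRIABLE_STATUSES = {503}  # assumption: 503 marks a transient, retriable failure


def should_retry(status_code, budget):
    # Non-retriable errors (e.g. 400) never consume budget.
    return status_code in RETRIABLE_STATUSES and budget.try_spend()
```

The `requests` and `retries` counters also give a direct way to monitor the retry rate.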
  44. 98.

    Retries

    • Effective when applied responsibly
    • Harmful when applied irresponsibly
    • Implement retry strategies
      ◦ Use clear response codes
      ◦ Retry budgets
      ◦ Monitor your retry rates
  45. 103.

    Fallacies of Distributed Computing

    1. The network is reliable.
    2. Latency is zero.
    3. Bandwidth is infinite.
    4. The network is secure.
    5. Topology doesn't change.
    6. There is one administrator.
    7. Transport cost is zero.
    8. The network is homogeneous.

    https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
  46. 105.

        #    Request  Time Service          User Service    Circuit Breaker State
        1.   GET /    50 ms                 50 ms           Closed
        2.   GET /    1 s                   50 ms           Closed
        3.   GET /    2 s                   50 ms           Closed
        4.   GET /    3 s (timeout)         50 ms           Closed (1 Failure)
        5.   GET /    3 s (timeout)         50 ms           Closed (2 Failures)
        6.   GET /    3 s (timeout)         50 ms           Closed => Open (3 Failures)
        7.   GET /    Skipped               50 ms           Open
        8.   GET /    Skipped               50 ms           Open