Remote Calls != Local Calls @ PyCon 2016

30 minute talk for PyCon (2016) on graceful degradation when services fail.

https://us.pycon.org/2016/schedule/presentation/2027/

Video of the talk can be found here:

https://www.youtube.com/watch?v=dY-SkuENZP8

If you want to see speaker notes, see the original Google presentation:

https://docs.google.com/presentation/d/1ZyM9Mo9NlRvT6QuVsU2anfofyZv2XY5IP2iQCaSCfZQ/edit#slide=id.gc7ab45e34_0_0

Dan Riti

May 31, 2016

Transcript

  1. Graceful degradation when services fail Remote Calls != Local Calls

  2. @danriti Dan Riti Senior Software Engineer @ Dan Riti @danriti

    dmriti@gmail.com github.com/danriti
  3. @danriti Dan Riti Monolithic

  4. @danriti Dan Riti Monolithic Services

  5. @danriti Dan Riti Dependencies

  6. @danriti Dan Riti Points of Failure

  7. @danriti Dan Riti Remote Calls != Local Calls

  8. @danriti Dan Riti Q: What approaches support graceful degradation when (services, networks, data_stores) fail?

  9. @danriti Dan Riti Q: What approaches support graceful degradation when (services, networks, data_stores) fail?

    A:
    1. Timeouts
    2. Circuit Breaker Pattern
    3. Retries
    4. Bulkhead Pattern
  10. @danriti Dan Riti

    1. Timeouts
        ◦ Forcing an error when a dependency is unhealthy
    2. Circuit Breaker Pattern
        ◦ Prevent operations when a dependency is unhealthy
    3. Retries
        ◦ Forcing extra attempts where extra latency is acceptable if a recovery provides more value
    4. Bulkhead Pattern
        ◦ Partitioning a system to enforce the principle of damage containment
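
The bulkhead pattern is listed above but not shown in code in these slides. The following is a minimal sketch of one common way to do it, assuming a per-dependency cap on concurrent calls; the `Bulkhead` class and its `call` method are hypothetical names for illustration, not part of the talk.

```python
import threading


class Bulkhead:
    """Caps concurrent calls to one dependency so it cannot exhaust shared workers."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, fallback):
        # Non-blocking acquire: when the compartment is full, reject immediately
        # instead of queueing more work behind an unhealthy dependency.
        if not self._slots.acquire(blocking=False):
            return fallback
        try:
            return func()
        finally:
            self._slots.release()
```

Giving each dependency (for example the time service and the user service) its own `Bulkhead` means a slow time service can only occupy its own slots, which is the damage containment the slide describes.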
  11. @danriti Dan Riti Time Service User Service Web App

  12. @danriti Dan Riti Time Service User Service Web App RESTful

    RESTful
  13. @danriti Dan Riti Time Service User Service Web App RESTful

    RESTful HTTP HTTP
  14. @danriti Dan Riti

    Time Service Response: { "time": "2015-09-11T17:33:48.940483" }
    User Service Response: { "name": "Dan Riti" }
  15. @danriti Dan Riti

  16. @danriti Dan Riti web app

  17. @danriti Dan Riti web app time service

  18. @danriti Dan Riti web app time service user service

  19. @danriti Dan Riti Q: How should we degrade when the time_service is unavailable?

  20. @danriti Dan Riti Q: How should we degrade when the time_service is unavailable?

    A:
    • Present “Unavailable” to user
    • Give up on requests after 3 seconds
    • Provide fault isolation
  21. @danriti Dan Riti Timeouts

    “Your code can't just wait forever for a response that might never come; sooner or later, it needs to give up. Hope is not a design method.” - Michael T. Nygard, Release It!
  22. @danriti Dan Riti

  23. @danriti Dan Riti response = requests.get('http://localhost:3001/time')

  24. @danriti Dan Riti response = requests.get('http://localhost:3001/time', timeout=3.0)

  25. @danriti Dan Riti

    def get_time():
        try:
            response = requests.get('http://localhost:3001/time', timeout=3.0)
        except requests.exceptions.Timeout:
            # give up after 3 seconds and degrade gracefully
            return 'Unavailable'
        # the time service responds with { "time": ... }
        return response.json().get('time')
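
The slide's handler catches only `requests.exceptions.Timeout`, so a refused connection would still surface as an unhandled `requests.exceptions.ConnectionError`. As a sketch of a slightly broader fallback, here is the same idea written with only the standard library so it runs without extra packages. The default URL is the one from the slide; the choice to treat every network error and malformed body as "Unavailable" is an assumption for illustration.

```python
import json
import urllib.request


def get_time(url='http://localhost:3001/time', timeout=3.0):
    """Return the time service's value, or 'Unavailable' on any network failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError):
        # OSError covers timeouts, refused connections and URLError;
        # ValueError covers a body that is not valid JSON.
        return 'Unavailable'
    return payload.get('time', 'Unavailable')
```

Calling `get_time()` while nothing is listening on the port returns `'Unavailable'` quickly instead of raising.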
  26. @danriti Dan Riti RTFM

  27. @danriti Dan Riti Time Service User Service https://github.com/danriti/short-circuit The time

    service is unhealthy Web App
  28. @danriti Dan Riti 3.01 s

  29. @danriti Dan Riti 3.01 s Broken Pipe

  30. @danriti Dan Riti Timeouts are not perfect

    • Easy to get started with
    • Provides some fault isolation
    • Response bound to timeout value
    • Still applying load to unhealthy service(s)
  31. @danriti Dan Riti Circuit Breaker Pattern

  32. @danriti Dan Riti Circuit Breaker Pattern

    • “Allow one subsystem (an electrical circuit) to fail (excessive current draw) without destroying the entire system (the house)”
    • “Once the danger has passed, the circuit breaker can be reset to restore full function to the system”
    • “This differs from retries, in that circuit breakers exist to prevent operations rather than re-execute them”
    - Michael T. Nygard, Release It!
  33. @danriti Dan Riti

  34. @danriti Dan Riti Circuit Breaker Pattern

    • Release It! by Michael T. Nygard (2007)
        ◦ https://pragprog.com/book/mnee/release-it
    • Netflix (2011)
        ◦ http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html
    • Martin Fowler (2014)
        ◦ http://martinfowler.com/bliki/CircuitBreaker.html
  35. @danriti Dan Riti

  36. @danriti Dan Riti Closed

  37. @danriti Dan Riti Closed Open

  38. @danriti Dan Riti Closed Open Half-open
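
The three states can be sketched as a small state machine. This is a hand-rolled illustration, not pybreaker's implementation: the `CircuitBreaker` class, `CircuitOpenError`, and the injectable `clock` argument are assumptions made so the example is self-contained. The parameter names `fail_max` and `reset_timeout` mirror the ones used later in the talk.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is prevented because the breaker is open."""


class CircuitBreaker:
    def __init__(self, fail_max=3, reset_timeout=30, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self._clock = clock
        self.state = 'closed'
        self._failures = 0
        self._opened_at = None

    def call(self, func):
        if self.state == 'open':
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError('call prevented')
            self.state = 'half-open'  # reset timeout elapsed: allow one trial call
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures while closed, opens the circuit.
            if self.state == 'half-open' or self._failures >= self.fail_max:
                self.state = 'open'
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the consecutive-failure count.
        self.state = 'closed'
        self._failures = 0
        return result
```

In this sketch a success resets the failure count, so only consecutive failures trip the breaker. Whether to count consecutive failures or failures within a time window is a design choice each library makes differently.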

  39. @danriti Dan Riti time_breaker = pybreaker.CircuitBreaker(fail_max=3, reset_timeout=30)

  40. @danriti Dan Riti time_breaker = pybreaker.CircuitBreaker(fail_max=3, reset_timeout=30)

  41. @danriti Dan Riti

    time_breaker = pybreaker.CircuitBreaker(fail_max=3, reset_timeout=30)

    @time_breaker
    def get_time():
        ...
        # signal a failure to the circuit breaker
        raise pybreaker.CircuitBreakerError
        ...
  42. @danriti Dan Riti Time Service User Service https://github.com/danriti/short-circuit The system

    is healthy Web App
  43. @danriti Dan Riti # Request Time Service User Service Circuit

    Breaker State 1. GET / 50 ms 50 ms Closed
  44. @danriti Dan Riti # Request Time Service User Service Circuit

    Breaker State 1. GET / 50 ms 50 ms Closed 2. GET / 1 s 50 ms Closed
  45. @danriti Dan Riti # Request Time Service User Service Circuit

    Breaker State 1. GET / 50 ms 50 ms Closed 2. GET / 1 s 50 ms Closed 3. GET / 2 s 50 ms Closed
  46. @danriti Dan Riti # Request Time Service User Service Circuit

    Breaker State 1. GET / 50 ms 50 ms Closed 2. GET / 1 s 50 ms Closed 3. GET / 2 s 50 ms Closed 4. GET / 3 s (timeout) 50 ms Closed (1 Failure)
  47. @danriti Dan Riti # Request Time Service User Service Circuit

    Breaker State 1. GET / 50 ms 50 ms Closed 2. GET / 1 s 50 ms Closed 3. GET / 2 s 50 ms Closed 4. GET / 3 s (timeout) 50 ms Closed (1 Failure) 5. GET / 3 s (timeout) 50 ms Closed (2 Failure)
  48. @danriti Dan Riti # Request Time Service User Service Circuit

    Breaker State 1. GET / 50 ms 50 ms Closed 2. GET / 1 s 50 ms Closed 3. GET / 2 s 50 ms Closed 4. GET / 3 s (timeout) 50 ms Closed (1 Failure) 5. GET / 3 s (timeout) 50 ms Closed (2 Failure) 6. GET / 3 s (timeout) 50 ms Closed => Open (3 Failure)
  49. @danriti Dan Riti Time Service User Service https://github.com/danriti/short-circuit The time

    service is unhealthy Web App
  50. @danriti Dan Riti Time Service User Service https://github.com/danriti/short-circuit The circuit

    breaker is open Web App
  51. @danriti Dan Riti # Request Time Service User Service Circuit

    Breaker State 1. GET / 50 ms 50 ms Closed 2. GET / 1 s 50 ms Closed 3. GET / 2 s 50 ms Closed 4. GET / 3 s (timeout) 50 ms Closed (1 Failure) 5. GET / 3 s (timeout) 50 ms Closed (2 Failure) 6. GET / 3 s (timeout) 50 ms Closed => Open (3 Failure)
  52. @danriti Dan Riti # Request Time Service User Service Circuit

    Breaker State 1. GET / 50 ms 50 ms Closed 2. GET / 1 s 50 ms Closed 3. GET / 2 s 50 ms Closed 4. GET / 3 s (timeout) 50 ms Closed (1 Failure) 5. GET / 3 s (timeout) 50 ms Closed (2 Failure) 6. GET / 3 s (timeout) 50 ms Closed => Open (3 Failure) 7. GET / Prevented 50 ms Open
  53. @danriti Dan Riti

    #    Request  Time Service            User Service  Circuit Breaker State
    1.   GET /    50 ms                   50 ms         Closed
    2.   GET /    1 s                     50 ms         Closed
    3.   GET /    2 s                     50 ms         Closed
    4.   GET /    3 s (timeout)           50 ms         Closed (1 Failure)
    5.   GET /    3 s (timeout)           50 ms         Closed (2 Failure)
    6.   GET /    3 s (timeout)           50 ms         Closed => Open (3 Failure)
    7.   GET /    Prevented               50 ms         Open
    8.   GET /    Prevented               50 ms         Open
  54. @danriti Dan Riti ~50 ms

  55. @danriti Dan Riti # Request Time Service User Service Circuit

    Breaker State 8. GET / Prevented 50 ms Open
  56. @danriti Dan Riti # Request Time Service User Service Circuit

    Breaker State 8. GET / Prevented 50 ms Open 9. GET / Prevented 50 ms Open
  57. @danriti Dan Riti # Request Time Service User Service Circuit

    Breaker State 8. GET / Prevented 50 ms Open 9. GET / Prevented 50 ms Open ~30 seconds passed, so reset timeout triggers Open => Half-Open
  58. @danriti Dan Riti # Request Time Service User Service Circuit

    Breaker State 8. GET / Prevented 50 ms Open 9. GET / Prevented 50 ms Open ~30 seconds passed, so reset timeout triggers Open => Half-Open 10. GET / 3 s (timeout) 50 ms Half-Open => Open
  59. @danriti Dan Riti # Request Time Service User Service Circuit

    Breaker State 8. GET / Prevented 50 ms Open 9. GET / Prevented 50 ms Open ~30 seconds passed, so reset timeout triggers Open => Half-Open 10. GET / 3 s (timeout) 50 ms Half-Open => Open 11. GET / Prevented 50 ms Open
  60. @danriti Dan Riti # Request Time Service User Service Circuit

    Breaker State 8. GET / Prevented 50 ms Open 9. GET / Prevented 50 ms Open ~30 seconds passed, so reset timeout triggers Open => Half-Open 10. GET / 3 s (timeout) 50 ms Half-Open => Open 11. GET / Prevented 50 ms Open 12. GET / Prevented 50 ms Open
  61. @danriti Dan Riti

    #    Request  Time Service            User Service  Circuit Breaker State
    8.   GET /    Prevented               50 ms         Open
    9.   GET /    Prevented               50 ms         Open
         ~30 seconds passed, so reset timeout triggers  Open => Half-Open
    10.  GET /    3 s (timeout)           50 ms         Half-Open => Open
    11.  GET /    Prevented               50 ms         Open
    12.  GET /    Prevented               50 ms         Open
  62. @danriti Dan Riti # Request Time Service User Service Circuit

    Breaker State 13. GET / Prevented 50 ms Open
  63. @danriti Dan Riti # Request Time Service User Service Circuit

    Breaker State 13. GET / Prevented 50 ms Open Ops saves the day, so time service is healthy Open
  64. @danriti Dan Riti # Request Time Service User Service Circuit

    Breaker State 13. GET / Prevented 50 ms Open Ops saves the day, so time service is healthy Open ~30 seconds passed, so reset timeout triggers Open => Half-Open
  65. @danriti Dan Riti # Request Time Service User Service Circuit

    Breaker State 13. GET / Prevented 50 ms Open Ops saves the day, so time service is healthy Open ~30 seconds passed, so reset timeout triggers Open => Half-Open 14. GET / 50 ms 50 ms Half-Open => Closed
  66. @danriti Dan Riti # Request Time Service User Service Circuit

    Breaker State 13. GET / Prevented 50 ms Open Ops saves the day, so time service is healthy Open ~30 seconds passed, so reset timeout triggers Open => Half-Open 14. GET / 50 ms 50 ms Half-Open => Closed 15. GET / 50 ms 50 ms Closed
  67. @danriti Dan Riti

    #    Request  Time Service            User Service  Circuit Breaker State
    13.  GET /    Prevented               50 ms         Open
         Ops saves the day, so time service is healthy  Open
         ~30 seconds passed, so reset timeout triggers  Open => Half-Open
    14.  GET /    50 ms                   50 ms         Half-Open => Closed
    15.  GET /    50 ms                   50 ms         Closed
    16.  GET /    50 ms                   50 ms         Closed
  68. @danriti Dan Riti ~50 ms

  69. @danriti Dan Riti Timeouts + Circuit Breaker Pattern

    • Graceful degradation of user experience
    • Fail fast and rapidly recover
    • Reduces load on unhealthy service
    • Avoid unhealthy service affecting the system
    • (Bonus) Interface to monitor and measure points of integration in a system
  70. @danriti Dan Riti Timeouts + Circuit Breaker Pattern

    • Under-provisioned services can cause “flapping”
        ◦ Service is overwhelmed due to load
        ◦ Circuit breaker trips to “open”
        ◦ Service no longer receiving requests, so it “recovers”
        ◦ Circuit breaker “closes”
    • Important to understand service performance constraints
  71. @danriti Dan Riti Parameters for Timeouts + Circuit Breaker

    • What do you present during failure?
    • How many times do you accept failure? (fail_max)
    • How long until you attempt reset? (reset_timeout)
    • How long will you wait? (timeout)
  72. @danriti Dan Riti

  73. @danriti Dan Riti

  74. @danriti Dan Riti

  75. @danriti Dan Riti

  76. @danriti Dan Riti Holiday Weekend

  77. @danriti Dan Riti Circuit Breaker Libraries

    • Python - https://github.com/danielfm/pybreaker
    • Go - https://github.com/rubyist/circuitbreaker
    • Java - https://github.com/Netflix/Hystrix
    • Ruby - https://github.com/wsargent/circuit_breaker
    • .NET - https://github.com/michael-wolfenden/Polly
    • Javascript - https://github.com/yammer/circuit-breaker-js
    • PHP - https://github.com/ejsmont-artur/php-circuit-breaker
  78. @danriti Dan Riti https://github.com/Netflix/Hystrix/wiki/Dashboard

  79. @danriti Dan Riti Retries

  80. @danriti Dan Riti Retries If at first you don’t succeed,

    attempt the operation again
  81. @danriti Dan Riti When should a retry be used? Does

    the benefit of obtaining a response from a service outweigh potentially increasing load on the service?
  82. @danriti Dan Riti Retry Considerations

    • Limit the number of retries per request
    • Introduce delay between retry attempts
        ◦ Exponential backoff
        ◦ Randomized jitter [1]
    [1] https://www.awsarchitectureblog.com/2015/03/backoff.html
  83. @danriti Dan Riti

    from retrying import retry

    @retry(stop_max_attempt_number=3,
           wait_exponential_multiplier=1000,
           wait_jitter_max=500)
    def get_user():
        ...
  84. @danriti Dan Riti

    from retrying import retry

    @retry(stop_max_attempt_number=3,
           wait_exponential_multiplier=1000,  # 2^N * 1000ms
           wait_jitter_max=500)               # 500ms
    def get_user():
        ...
  85. @danriti Dan Riti

    from retrying import retry

    @retry(stop_max_attempt_number=3,
           wait_exponential_multiplier=1000,  # 2^N * 1000ms
           wait_jitter_max=500)               # 500ms
    def get_user():
        ...
        # signal a failure to the retry decorator
        raise Exception
        ...
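
To make the backoff arithmetic concrete, here is a hand-rolled decorator that follows the same shape: a capped number of attempts, a delay that doubles on each attempt, and a random jitter added on top. It is a sketch under stated assumptions, not the `retrying` library's implementation; the parameter names `max_attempts`, `base_delay`, `jitter_max`, and the injectable `sleep` are hypothetical, and delays are in seconds rather than milliseconds.

```python
import functools
import random
import time


def retry(max_attempts=3, base_delay=1.0, jitter_max=0.5, sleep=time.sleep):
    """Retry a function with exponential backoff plus random jitter (seconds)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # attempts exhausted: let the caller see the error
                    # Delay doubles each attempt: base * 2^attempt, plus jitter.
                    delay = base_delay * 2 ** attempt + random.uniform(0, jitter_max)
                    sleep(delay)
        return wrapper
    return decorator
```

With the defaults, a call that fails twice waits roughly 2 s and then roughly 4 s before the third and final attempt. The jitter spreads out clients that failed at the same moment so they do not all retry in lockstep.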
  86. @danriti Dan Riti Time Service User Service https://github.com/danriti/short-circuit The network

    to user service is unhealthy Web App
  87. @danriti Dan Riti Time Service User Service https://github.com/danriti/short-circuit The network

    to user service is unhealthy Web App
  88. @danriti Dan Riti Time Service User Service https://github.com/danriti/short-circuit The network

    to user service is healthy Web App
  89. @danriti Dan Riti Time Service User Service https://github.com/danriti/short-circuit The network

    to user service is healthy Web App
  90. @danriti Dan Riti Combinational Retry Explosion Database Backend Frontend Javascript

  91. @danriti Dan Riti Combinational Retry Explosion Database Backend Frontend Javascript

    Makes 4 attempts and then bubbles error up a level
  92. @danriti Dan Riti Combinational Retry Explosion Database Backend Frontend Javascript

    1st attempt failed, so make 2nd attempt
  93. @danriti Dan Riti Combinational Retry Explosion Database Backend Frontend Javascript

    1st attempt failed, so make 2nd attempt Makes 4 attempts and then bubbles error up a level
  94. @danriti Dan Riti Combinational Retry Explosion Database Backend Frontend Javascript

    2nd attempt failed, so make 3rd attempt
  95. @danriti Dan Riti Combinational Retry Explosion Database Backend Frontend Javascript

    Makes 4 attempts and then bubbles error up a level 2nd attempt failed, so make 3rd attempt
  96. @danriti Dan Riti Combinational Retry Explosion 4 attempts ^ 3

    levels = 64 attempts
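
The 4 ^ 3 = 64 figure can be checked with a small simulation. In this sketch the layer names follow the slide (Javascript, Frontend, Backend, Database), while `call_with_retries` and the counter are hypothetical helpers: each of the three calling layers makes 4 attempts against the layer below it, and the database always fails.

```python
ATTEMPTS_PER_LEVEL = 4  # each layer tries 4 times before bubbling the error up

calls = {'database': 0}


def call_with_retries(downstream):
    """Invoke `downstream` up to ATTEMPTS_PER_LEVEL times, re-raising on exhaustion."""
    for attempt in range(ATTEMPTS_PER_LEVEL):
        try:
            return downstream()
        except RuntimeError:
            if attempt == ATTEMPTS_PER_LEVEL - 1:
                raise


def database():
    calls['database'] += 1
    raise RuntimeError('database is down')


def backend():
    return call_with_retries(database)


def frontend():
    return call_with_retries(backend)


def javascript():
    return call_with_retries(frontend)


try:
    javascript()
except RuntimeError:
    pass

print(calls['database'])  # prints 64
```

A single user action therefore lands on the already failing database 64 times, which is why the slides that follow recommend bounding retries across the whole call chain rather than per layer.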
  97. @danriti Dan Riti Retry Strategies

    • Use clear response codes
        ◦ Separate retriable and non-retriable errors
        ◦ Return a specific status when overloaded
    • Retry budgets
        ◦ Per-request retry budget
        ◦ Per-client retry budget
        ◦ Server-wide retry budget
    • Monitor your retry rates
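
One way to act on clear response codes is a small client-side policy function. The status sets below are an assumed policy for illustration only (the talk does not name specific codes), and `should_retry` is a hypothetical helper.

```python
# Assumed policy for this sketch; tune it to what your services actually return.
RETRIABLE_STATUSES = {502, 503, 504}   # transient upstream or gateway failures
OVERLOADED_STATUS = 429                # "Too Many Requests": back off, do not retry now


def should_retry(status_code):
    """Return True only for responses that a later attempt could plausibly fix."""
    if status_code == OVERLOADED_STATUS:
        return False
    return status_code in RETRIABLE_STATUSES
```

Here `should_retry(503)` is true while `should_retry(404)` and `should_retry(429)` are false: a client error will not improve on a second attempt, and an explicit overload signal is a request to stop adding load.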
  98. @danriti Dan Riti Retries

    • Effective when applied responsibly
    • Harmful when applied irresponsibly
    • Implement retry strategies
        ◦ Use clear response codes
        ◦ Retry budgets
        ◦ Monitor your retry rates
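
A per-client retry budget can be sketched as a ratio check: retries are allowed only while they stay below a fixed fraction of the requests the client has sent. The `RetryBudget` class and its `ratio` parameter are hypothetical names for illustration, and a production version would typically track a sliding time window rather than lifetime totals.

```python
class RetryBudget:
    """Per-client budget: retries may not exceed `ratio` of all requests sent."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self):
        # Spend one unit of budget only if the retry keeps us within the ratio.
        if self.retries + 1 > self.requests * self.ratio:
            return False
        self.retries += 1
        return True
```

A client would call `record_request()` for every first attempt and check `can_retry()` before each retry. During a widespread outage the budget runs out quickly, so the extra load from retries stays bounded.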
  99. @danriti Dan Riti Resources

  100. @danriti Dan Riti Thank You @danriti

  101. @danriti Dan Riti Backup

  102. @danriti Dan Riti https://github.com/danriti/short-circuit

  103. @danriti Dan Riti Fallacies of Distributed Computing

    1. The network is reliable.
    2. Latency is zero.
    3. Bandwidth is infinite.
    4. The network is secure.
    5. Topology doesn't change.
    6. There is one administrator.
    7. Transport cost is zero.
    8. The network is homogeneous.
    https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
  104. @danriti Dan Riti Source: https://github.com/Netflix/Hystrix/wiki

  105. @danriti Dan Riti

    #    Request  Time Service            User Service  Circuit Breaker State
    1.   GET /    50 ms                   50 ms         Closed
    2.   GET /    1 s                     50 ms         Closed
    3.   GET /    2 s                     50 ms         Closed
    4.   GET /    3 s (timeout)           50 ms         Closed (1 Failure)
    5.   GET /    3 s (timeout)           50 ms         Closed (2 Failure)
    6.   GET /    3 s (timeout)           50 ms         Closed => Open (3 Failure)
    7.   GET /    Skipped                 50 ms         Open
    8.   GET /    Skipped                 50 ms         Open
  106. @danriti Dan Riti

    #    Request  Time Service            User Service  Circuit Breaker State
    8.   GET /    Prevented               50 ms         Open
    9.   GET /    Prevented               50 ms         Open
         ~30 seconds passed, so reset timeout triggers  Open => Half-Open
    10.  GET /    3 s (timeout)           50 ms         Half-Open => Open
    11.  GET /    Prevented               50 ms         Open
    12.  GET /    Prevented               50 ms         Open
  107. @danriti Dan Riti

    #    Request  Time Service            User Service  Circuit Breaker State
    13.  GET /    Prevented               50 ms         Open
         Ops saves the day, so time service is healthy  Open
         ~30 seconds passed, so reset timeout triggers  Open => Half-Open
    14.  GET /    50 ms                   50 ms         Half-Open => Closed
    15.  GET /    50 ms                   50 ms         Closed
    16.  GET /    50 ms                   50 ms         Closed
  108. @danriti Dan Riti