
Remote Calls != Local Calls @ PyCon 2016

30 minute talk for PyCon (2016) on graceful degradation when services fail.

https://us.pycon.org/2016/schedule/presentation/2027/

Video of the talk can be found here:

https://www.youtube.com/watch?v=dY-SkuENZP8

If you want to see speaker notes, see the original Google presentation:

https://docs.google.com/presentation/d/1ZyM9Mo9NlRvT6QuVsU2anfofyZv2XY5IP2iQCaSCfZQ/edit#slide=id.gc7ab45e34_0_0

Dan Riti

May 31, 2016

Transcript

  1. Graceful degradation when services fail
    Remote Calls != Local Calls

    View Slide

  2. @danriti
    Dan Riti
    Senior Software Engineer @
    Dan Riti
    @danriti
    [email protected]
    github.com/danriti

    View Slide

  3. @danriti
    Dan Riti
    Monolithic

    View Slide

  4. @danriti
    Dan Riti
    Monolithic Services

    View Slide

  5. @danriti
    Dan Riti
    Dependencies

    View Slide

  6. @danriti
    Dan Riti
    Points of Failure

    View Slide

  7. @danriti
    Dan Riti
    Remote Calls != Local Calls

    View Slide

  8. @danriti
    Dan Riti
    Q:
    What approaches support graceful degradation when
    (services, networks, data_stores) fail?

    View Slide

  9. @danriti
    Dan Riti
    Q:
    What approaches support graceful degradation when
    (services, networks, data_stores) fail?
    A:
    1. Timeouts
    2. Circuit Breaker Pattern
    3. Retries
    4. Bulkhead Pattern

    View Slide

  10. @danriti
    Dan Riti
    1. Timeouts
    ○ Forcing an error when a dependency is unhealthy
    2. Circuit Breaker Pattern
    ○ Prevent operations when a dependency is unhealthy
    3. Retries
    ○ Forcing extra attempts where extra latency is acceptable if a
    recovery provides more value
    4. Bulkhead Pattern
    ○ Partitioning a system to enforce the principle of damage
    containment

    View Slide
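
The bulkhead pattern is only named in the list above, so here is a minimal sketch of one common way to realise it: capping the number of concurrent calls into a single dependency. The `Bulkhead` class, its `max_concurrent` parameter, and the choice to reject rather than queue are assumptions for illustration, not code from the talk.

```python
import threading


class Bulkhead:
    """Cap concurrent calls to one dependency so it cannot exhaust the app."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, fallback):
        # Reject immediately when the compartment is full instead of queueing,
        # so a slow dependency cannot tie up every worker.
        if not self._slots.acquire(blocking=False):
            return fallback
        try:
            return func()
        finally:
            self._slots.release()
```

Giving each dependency (for example the time service and the user service) its own `Bulkhead` instance keeps a failure in one from consuming capacity needed by the other.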

  11. @danriti
    Dan Riti
    Time Service
    User Service
    Web App

    View Slide

  12. @danriti
    Dan Riti
    Time Service
    User Service
    Web App
    RESTful
    RESTful

    View Slide

  13. @danriti
    Dan Riti
    Time Service
    User Service
    Web App
    RESTful
    RESTful
    HTTP
    HTTP

    View Slide

  14. @danriti
    Dan Riti
    Time Service Response
    {
    "time": "2015-09-11T17:33:48.940483"
    }
    User Service Response
    {
    "name": "Dan Riti"
    }

    View Slide

  15. @danriti
    Dan Riti

    View Slide

  16. @danriti
    Dan Riti
    web app

    View Slide

  17. @danriti
    Dan Riti
    web app
    time service

    View Slide

  18. @danriti
    Dan Riti
    web app
    time service
    user service

    View Slide

  19. @danriti
    Dan Riti
    Q:
    How should we degrade when the time_service is
    unavailable?

    View Slide

  20. @danriti
    Dan Riti
    Q:
    How should we degrade when the time_service is
    unavailable?
    A:
    Present “Unavailable” to user
    Give up on requests after 3 seconds
    Provide fault isolation

    View Slide

  21. @danriti
    Dan Riti
    “Your code can't just wait forever for a response
    that might never come; sooner or later, it needs to
    give up. Hope is not a design method.”
    - Michael T. Nygard, Release It!
    Timeouts

    View Slide

  22. @danriti
    Dan Riti

    View Slide

  23. @danriti
    Dan Riti
    response = requests.get('http://localhost:3001/time')

    View Slide

  24. @danriti
    Dan Riti
    response = requests.get('http://localhost:3001/time',
    timeout=3.0)

    View Slide

  25. @danriti
    Dan Riti
    import requests

    def get_time():
        try:
            response = requests.get('http://localhost:3001/time',
                                    timeout=3.0)
        except requests.exceptions.Timeout:
            return 'Unavailable'
        # The time service returns its value under the "time" key.
        return response.json().get('time')

    View Slide

  26. @danriti
    Dan Riti
    RTFM

    View Slide

  27. @danriti
    Dan Riti
    Time Service
    User Service
    https://github.com/danriti/short-circuit
    The time service is unhealthy
    Web App

    View Slide

  28. @danriti
    Dan Riti
    3.01 s

    View Slide

  29. @danriti
    Dan Riti
    3.01 s
    Broken Pipe

    View Slide

  30. @danriti
    Dan Riti
    Timeouts are not perfect
    Easy to get started with
    Provides some fault isolation
    Response bound to timeout value
    Still applying load to unhealthy service(s)

    View Slide
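
The trade-offs above can be seen in a slightly fuller version of the `get_time` helper. This is a minimal sketch that uses only the standard library rather than `requests`; the `url` argument, the `FALLBACK` constant, and the decision to treat every network or parsing error as "Unavailable" are assumptions for illustration, not code from the talk.

```python
import http.client
import json
import urllib.request

FALLBACK = "Unavailable"  # hypothetical placeholder shown to the user


def get_time(url, timeout=3.0):
    """Return the time service's "time" field, or FALLBACK on any failure."""
    try:
        # The timeout bounds each blocking socket operation (connect, read),
        # so the caller is never stuck waiting on an unhealthy service.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers timeouts, refused connections and HTTP error
        # statuses; ValueError covers a body that is not valid JSON.
        return FALLBACK
    if not isinstance(payload, dict):
        return FALLBACK
    return payload.get("time", FALLBACK)
```

Note that the response time is still bound to the timeout value while the service is unhealthy, and every request still reaches the service, which is the gap the circuit breaker addresses.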

  31. @danriti
    Dan Riti
    Circuit Breaker Pattern

    View Slide

  32. @danriti
    Dan Riti
    Circuit Breaker Pattern
    ● “Allow one subsystem (an electrical circuit) to fail (excessive current
    draw) without destroying the entire system (the house)”
    ● “Once the danger has passed, the circuit breaker can be reset to
    restore full function to the system”
    ● “This differs from retries, in that circuit breakers exist to prevent
    operations rather than re-execute them”
    - Michael T. Nygard, Release It!

    View Slide

  33. @danriti
    Dan Riti

    View Slide

  34. @danriti
    Dan Riti
    Circuit Breaker Pattern
    ● Release It! by Michael T. Nygard (2007)
    ○ https://pragprog.com/book/mnee/release-it
    ● Netflix (2011)
    ○ http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html
    ● Martin Fowler (2014)
    ○ http://martinfowler.com/bliki/CircuitBreaker.html

    View Slide

  35. @danriti
    Dan Riti

    View Slide

  36. @danriti
    Dan Riti
    Closed

    View Slide

  37. @danriti
    Dan Riti
    Closed
    Open

    View Slide

  38. @danriti
    Dan Riti
    Closed
    Open
    Half-open

    View Slide

  39. @danriti
    Dan Riti
    time_breaker = pybreaker.CircuitBreaker(fail_max=3,
    reset_timeout=30)

    View Slide

  40. @danriti
    Dan Riti
    time_breaker = pybreaker.CircuitBreaker(fail_max=3,
    reset_timeout=30)

    View Slide

  41. @danriti
    Dan Riti
    time_breaker = pybreaker.CircuitBreaker(fail_max=3,
                                            reset_timeout=30)

    @time_breaker
    def get_time():
        ...
        # signal a failure to the circuit breaker
        raise pybreaker.CircuitBreakerError
        ...

    View Slide

  42. @danriti
    Dan Riti
    Time Service
    User Service
    https://github.com/danriti/short-circuit
    The system is healthy
    Web App

    View Slide

  43. @danriti
    Dan Riti
    # Request Time Service User Service Circuit Breaker State
    1. GET / 50 ms 50 ms Closed

    View Slide

  44. @danriti
    Dan Riti
    # Request Time Service User Service Circuit Breaker State
    1. GET / 50 ms 50 ms Closed
    2. GET / 1 s 50 ms Closed

    View Slide

  45. @danriti
    Dan Riti
    # Request Time Service User Service Circuit Breaker State
    1. GET / 50 ms 50 ms Closed
    2. GET / 1 s 50 ms Closed
    3. GET / 2 s 50 ms Closed

    View Slide

  46. @danriti
    Dan Riti
    # Request Time Service User Service Circuit Breaker State
    1. GET / 50 ms 50 ms Closed
    2. GET / 1 s 50 ms Closed
    3. GET / 2 s 50 ms Closed
    4. GET / 3 s (timeout) 50 ms Closed (1 Failure)

    View Slide

  47. @danriti
    Dan Riti
    # Request Time Service User Service Circuit Breaker State
    1. GET / 50 ms 50 ms Closed
    2. GET / 1 s 50 ms Closed
    3. GET / 2 s 50 ms Closed
    4. GET / 3 s (timeout) 50 ms Closed (1 Failure)
    5. GET / 3 s (timeout) 50 ms Closed (2 Failure)

    View Slide

  48. @danriti
    Dan Riti
    # Request Time Service User Service Circuit Breaker State
    1. GET / 50 ms 50 ms Closed
    2. GET / 1 s 50 ms Closed
    3. GET / 2 s 50 ms Closed
    4. GET / 3 s (timeout) 50 ms Closed (1 Failure)
    5. GET / 3 s (timeout) 50 ms Closed (2 Failure)
    6. GET / 3 s (timeout) 50 ms Closed => Open (3 Failure)

    View Slide

  49. @danriti
    Dan Riti
    Time Service
    User Service
    https://github.com/danriti/short-circuit
    The time service is unhealthy
    Web App

    View Slide

  50. @danriti
    Dan Riti
    Time Service
    User Service
    https://github.com/danriti/short-circuit
    The circuit breaker is open
    Web App

    View Slide

  51. @danriti
    Dan Riti
    # Request Time Service User Service Circuit Breaker State
    1. GET / 50 ms 50 ms Closed
    2. GET / 1 s 50 ms Closed
    3. GET / 2 s 50 ms Closed
    4. GET / 3 s (timeout) 50 ms Closed (1 Failure)
    5. GET / 3 s (timeout) 50 ms Closed (2 Failure)
    6. GET / 3 s (timeout) 50 ms Closed => Open (3 Failure)

    View Slide

  52. @danriti
    Dan Riti
    # Request Time Service User Service Circuit Breaker State
    1. GET / 50 ms 50 ms Closed
    2. GET / 1 s 50 ms Closed
    3. GET / 2 s 50 ms Closed
    4. GET / 3 s (timeout) 50 ms Closed (1 Failure)
    5. GET / 3 s (timeout) 50 ms Closed (2 Failure)
    6. GET / 3 s (timeout) 50 ms Closed => Open (3 Failure)
    7. GET / Prevented 50 ms Open

    View Slide

  53. @danriti
    Dan Riti
    #    Request   Time Service     User Service   Circuit Breaker State
    1.   GET /     50 ms            50 ms          Closed
    2.   GET /     1 s              50 ms          Closed
    3.   GET /     2 s              50 ms          Closed
    4.   GET /     3 s (timeout)    50 ms          Closed (1 Failure)
    5.   GET /     3 s (timeout)    50 ms          Closed (2 Failures)
    6.   GET /     3 s (timeout)    50 ms          Closed => Open (3 Failures)
    7.   GET /     Prevented        50 ms          Open
    8.   GET /     Prevented        50 ms          Open

    View Slide

  54. @danriti
    Dan Riti
    ~50 ms

    View Slide

  55. @danriti
    Dan Riti
    # Request Time Service User Service Circuit Breaker State
    8. GET / Prevented 50 ms Open

    View Slide

  56. @danriti
    Dan Riti
    # Request Time Service User Service Circuit Breaker State
    8. GET / Prevented 50 ms Open
    9. GET / Prevented 50 ms Open

    View Slide

  57. @danriti
    Dan Riti
    # Request Time Service User Service Circuit Breaker State
    8. GET / Prevented 50 ms Open
    9. GET / Prevented 50 ms Open
    ~30 seconds passed, so reset timeout triggers Open => Half-Open

    View Slide

  58. @danriti
    Dan Riti
    # Request Time Service User Service Circuit Breaker State
    8. GET / Prevented 50 ms Open
    9. GET / Prevented 50 ms Open
    ~30 seconds passed, so reset timeout triggers Open => Half-Open
    10. GET / 3 s (timeout) 50 ms Half-Open => Open

    View Slide

  59. @danriti
    Dan Riti
    # Request Time Service User Service Circuit Breaker State
    8. GET / Prevented 50 ms Open
    9. GET / Prevented 50 ms Open
    ~30 seconds passed, so reset timeout triggers Open => Half-Open
    10. GET / 3 s (timeout) 50 ms Half-Open => Open
    11. GET / Prevented 50 ms Open

    View Slide

  60. @danriti
    Dan Riti
    #    Request   Time Service     User Service   Circuit Breaker State
    8.   GET /     Prevented        50 ms          Open
    9.   GET /     Prevented        50 ms          Open
         ~30 seconds passed, so reset timeout triggers Open => Half-Open
    10.  GET /     3 s (timeout)    50 ms          Half-Open => Open
    11.  GET /     Prevented        50 ms          Open
    12.  GET /     Prevented        50 ms          Open

    View Slide

  61. @danriti
    Dan Riti
    # Request Time Service User Service Circuit Breaker State
    8. GET / Prevented 50 ms Open
    9. GET / Prevented 50 ms Open
    ~30 seconds passed, so reset timeout triggers Open => Half-Open
    10. GET / 3 s (timeout) 50 ms Half-Open => Open
    11. GET / Prevented 50 ms Open
    12. GET / Prevented 50 ms Open

    View Slide

  62. @danriti
    Dan Riti
    # Request Time Service User Service Circuit Breaker State
    13. GET / Prevented 50 ms Open

    View Slide

  63. @danriti
    Dan Riti
    # Request Time Service User Service Circuit Breaker State
    13. GET / Prevented 50 ms Open
    Ops saves the day, so time service is healthy Open

    View Slide

  64. @danriti
    Dan Riti
    # Request Time Service User Service Circuit Breaker State
    13. GET / Prevented 50 ms Open
    Ops saves the day, so time service is healthy Open
    ~30 seconds passed, so reset timeout triggers Open => Half-Open

    View Slide

  65. @danriti
    Dan Riti
    # Request Time Service User Service Circuit Breaker State
    13. GET / Prevented 50 ms Open
    Ops saves the day, so time service is healthy Open
    ~30 seconds passed, so reset timeout triggers Open => Half-Open
    14. GET / 50 ms 50 ms Half-Open => Closed

    View Slide

  66. @danriti
    Dan Riti
    # Request Time Service User Service Circuit Breaker State
    13. GET / Prevented 50 ms Open
    Ops saves the day, so time service is healthy Open
    ~30 seconds passed, so reset timeout triggers Open => Half-Open
    14. GET / 50 ms 50 ms Half-Open => Closed
    15. GET / 50 ms 50 ms Closed

    View Slide

  67. @danriti
    Dan Riti
    #    Request   Time Service     User Service   Circuit Breaker State
    13.  GET /     Prevented        50 ms          Open
         Ops saves the day, so time service is healthy (state: Open)
         ~30 seconds passed, so reset timeout triggers Open => Half-Open
    14.  GET /     50 ms            50 ms          Half-Open => Closed
    15.  GET /     50 ms            50 ms          Closed
    16.  GET /     50 ms            50 ms          Closed

    View Slide
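
The state transitions walked through in the request tables can be sketched as a small state machine. This is a simplified illustration, not pybreaker's implementation: the `CircuitBreaker` and `CircuitOpenError` names, the injectable `clock` parameter, and the rule that any exception counts as a failure are all assumptions made for the sketch. The `fail_max` and `reset_timeout` defaults mirror the values used in the talk.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is prevented because the breaker is open."""


class CircuitBreaker:
    def __init__(self, fail_max=3, reset_timeout=30.0, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so the sketch can be tested
        self._failures = 0
        self._opened_at = None
        self.state = "closed"

    def call(self, func):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                # Fail fast without touching the unhealthy service.
                raise CircuitOpenError("call prevented")
            self.state = "half-open"  # reset timeout elapsed, allow one trial
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call, or reaching fail_max, opens the breaker.
            if self.state == "half-open" or self._failures >= self.fail_max:
                self.state = "open"
                self._opened_at = self._clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self._failures = 0
        self.state = "closed"
        return result
```

A caller would wrap the remote call, for example `breaker.call(get_time)`, and present "Unavailable" when `CircuitOpenError` is raised, which is why prevented requests return in roughly the time the healthy user service takes.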

  68. @danriti
    Dan Riti
    ~50 ms

    View Slide

  69. @danriti
    Dan Riti
    Timeouts + Circuit Breaker Pattern
    ● Graceful degradation of user experience
    ● Fail fast and rapidly recover
    ● Reduces load on unhealthy service
    ● Avoid unhealthy service affecting the system
    ● (Bonus) Interface to monitor and measure points of
    integration in a system

    View Slide

  70. @danriti
    Dan Riti
    Timeouts + Circuit Breaker Pattern
    ● Under provisioned services can cause “flapping”
    ○ Service is overwhelmed due to load
    ○ Circuit breaker trips to “open”
    ○ Service no longer receiving requests, so it
    “recovers”
    ○ Circuit breaker “closes”
    ● Important to understand service performance
    constraints

    View Slide

  71. @danriti
    Dan Riti
    Parameters for Timeouts + Circuit Breaker
    ● What do you present during failure?
    ● How many times do you accept failure? (fail_max)
    ● How long until you attempt reset? (reset_timeout)
    ● How long will you wait? (timeout)

    View Slide

  72. @danriti
    Dan Riti

    View Slide

  73. @danriti
    Dan Riti

    View Slide

  74. @danriti
    Dan Riti

    View Slide

  75. @danriti
    Dan Riti

    View Slide

  76. @danriti
    Dan Riti
    Holiday Weekend

    View Slide

  77. @danriti
    Dan Riti
    Circuit Breaker Libraries
    ● Python - https://github.com/danielfm/pybreaker
    ● Go - https://github.com/rubyist/circuitbreaker
    ● Java - https://github.com/Netflix/Hystrix
    ● Ruby - https://github.com/wsargent/circuit_breaker
    ● .NET - https://github.com/michael-wolfenden/Polly
    ● Javascript - https://github.com/yammer/circuit-breaker-js
    ● PHP - https://github.com/ejsmont-artur/php-circuit-breaker

    View Slide

  78. @danriti
    Dan Riti
    https://github.com/Netflix/Hystrix/wiki/Dashboard

    View Slide

  79. @danriti
    Dan Riti
    Retries

    View Slide

  80. @danriti
    Dan Riti
    Retries
    If at first you don’t succeed,
    attempt the operation again

    View Slide

  81. @danriti
    Dan Riti
    When should a retry be used?
    Does the benefit of obtaining a response from a
    service outweigh potentially increasing load on
    the service?

    View Slide

  82. @danriti
    Dan Riti
    Retry Considerations
    ● Limit the number of retries per request
    ● Introduce delay between retry attempts
    ○ Exponential backoff
    ○ Randomized jitter [1]
    [1] https://www.awsarchitectureblog.com/2015/03/backoff.html

    View Slide
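
The considerations above (a retry limit, exponential backoff, and randomized jitter) can be combined in a few lines. This is a minimal standard-library sketch, not the `retrying` package used in the talk; the `retry_call` name, its parameters, and the injectable `sleep` argument are assumptions for illustration.

```python
import random
import time


def retry_call(func, max_attempts=3, base_delay=1.0, jitter_max=0.5,
               sleep=time.sleep):
    """Call func, retrying on any exception with backoff plus random jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # retry limit reached, surface the last error
            # The delay doubles on each retry (2^N * base_delay) and gets up
            # to jitter_max seconds added so clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, jitter_max)
            sleep(delay)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; production code would normally leave the default.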

  83. @danriti
    Dan Riti
    from retrying import retry

    @retry(stop_max_attempt_number=3,
           wait_exponential_multiplier=1000,
           wait_jitter_max=500)
    def get_user():
        ...

    View Slide

  84. @danriti
    Dan Riti
    from retrying import retry

    @retry(stop_max_attempt_number=3,
           wait_exponential_multiplier=1000,  # 2^N * 1000ms
           wait_jitter_max=500)               # 500ms
    def get_user():
        ...

    View Slide

  85. @danriti
    Dan Riti
    from retrying import retry

    @retry(stop_max_attempt_number=3,
           wait_exponential_multiplier=1000,  # 2^N * 1000ms
           wait_jitter_max=500)               # 500ms
    def get_user():
        ...
        # signal a failure to the retry decorator
        raise Exception
        ...

    View Slide

  86. @danriti
    Dan Riti
    Time Service
    User Service
    https://github.com/danriti/short-circuit
    The network to user service is
    unhealthy
    Web App

    View Slide

  87. @danriti
    Dan Riti
    Time Service
    User Service
    https://github.com/danriti/short-circuit
    The network to user service is
    unhealthy
    Web App

    View Slide

  88. @danriti
    Dan Riti
    Time Service
    User Service
    https://github.com/danriti/short-circuit
    The network to user service is
    healthy
    Web App

    View Slide

  89. @danriti
    Dan Riti
    Time Service
    User Service
    https://github.com/danriti/short-circuit
    The network to user service is
    healthy
    Web App

    View Slide

  90. @danriti
    Dan Riti
    Combinational Retry Explosion
    Database
    Backend
    Frontend
    Javascript

    View Slide

  91. @danriti
    Dan Riti
    Combinational Retry Explosion
    Database
    Backend
    Frontend
    Javascript
    Makes 4 attempts and then
    bubbles error up a level

    View Slide

  92. @danriti
    Dan Riti
    Combinational Retry Explosion
    Database
    Backend
    Frontend
    Javascript
    1st attempt failed, so
    make 2nd attempt

    View Slide

  93. @danriti
    Dan Riti
    Combinational Retry Explosion
    Database
    Backend
    Frontend
    Javascript
    1st attempt failed, so
    make 2nd attempt
    Makes 4 attempts and then
    bubbles error up a level

    View Slide

  94. @danriti
    Dan Riti
    Combinational Retry Explosion
    Database
    Backend
    Frontend
    Javascript
    2nd attempt failed, so
    make 3rd attempt

    View Slide

  95. @danriti
    Dan Riti
    Combinational Retry Explosion
    Database
    Backend
    Frontend
    Javascript
    Makes 4 attempts and then
    bubbles error up a level
    2nd attempt failed, so
    make 3rd attempt

    View Slide

  96. @danriti
    Dan Riti
    Combinational Retry Explosion
    4 attempts ^ 3 levels = 64 attempts

    View Slide
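
The multiplication above can be checked with a small simulation. This is an illustrative sketch only: the `with_retries` helper and the layer functions are hypothetical stand-ins for the Javascript, Frontend, Backend, and Database layers, each of which makes 4 attempts before bubbling the error up.

```python
def with_retries(operation, attempts=4):
    """Wrap operation so it is attempted up to `attempts` times."""
    def wrapped():
        last_error = None
        for _ in range(attempts):
            try:
                return operation()
            except RuntimeError as err:
                last_error = err
        raise last_error  # all attempts failed, bubble the error up a level
    return wrapped


def count_database_calls(attempts=4):
    """Count how often a failing database is hit by one top-level request."""
    calls = {"database": 0}

    def database():
        calls["database"] += 1
        raise RuntimeError("database down")

    # Three retrying layers sit above the database.
    backend = with_retries(database, attempts)
    frontend = with_retries(backend, attempts)
    javascript = with_retries(frontend, attempts)

    try:
        javascript()
    except RuntimeError:
        pass
    return calls["database"]


print(count_database_calls())  # 64
```

A single user action therefore produces 64 calls against a database that is already struggling, which is the motivation for the retry strategies that follow.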

  97. @danriti
    Dan Riti
    Retry Strategies
    ● Use clear response codes
    ○ Separate retriable and non retriable errors
    ○ Return a specific status when overloaded
    ● Retry budgets
    ○ Per-request retry budget
    ○ Per-client retry budget
    ○ Server-wide retry budget
    ● Monitor your retry rates

    View Slide
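
One way to picture a per-client retry budget is as a cap on retries relative to the number of requests the client has sent. This is a minimal sketch under that assumption; the `RetryBudget` class, its `ratio` and `min_retries` parameters, and the absence of any time window are simplifications for illustration, not something specified in the talk.

```python
class RetryBudget:
    """Permit retries only while they stay under a fraction of all requests."""

    def __init__(self, ratio=0.1, min_retries=1):
        self.ratio = ratio              # retries allowed per request sent
        self.min_retries = min_retries  # floor so low-traffic clients can retry
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_acquire_retry(self):
        """Return True and consume budget if a retry is currently allowed."""
        allowed = max(self.min_retries, self.ratio * self.requests)
        if self.retries < allowed:
            self.retries += 1
            return True
        return False
```

A client would call `record_request()` for every first attempt and only retry when `try_acquire_retry()` returns `True`, so a widespread outage cannot multiply its traffic beyond the configured ratio.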

  98. @danriti
    Dan Riti
    Retries
    ● Effective when applied responsibly
    ● Harmful when applied irresponsibly
    ● Implement retry strategies
    ○ Use clear response codes
    ○ Retry budgets
    ○ Monitor your retry rates

    View Slide

  99. @danriti
    Dan Riti
    Resources

    View Slide

  100. @danriti
    Dan Riti
    Thank You
    @danriti

    View Slide

  101. @danriti
    Dan Riti
    Backup

    View Slide

  102. @danriti
    Dan Riti
    https://github.com/danriti/short-circuit

    View Slide

  103. @danriti
    Dan Riti
    Fallacies of Distributed Computing
    1. The network is reliable.
    2. Latency is zero.
    3. Bandwidth is infinite.
    4. The network is secure.
    5. Topology doesn't change.
    6. There is one administrator.
    7. Transport cost is zero.
    8. The network is homogeneous.
    https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing

    View Slide

  104. @danriti
    Dan Riti
    Source: https://github.com/Netflix/Hystrix/wiki

    View Slide

  105. @danriti
    Dan Riti
    # Request Time Service User Service Circuit Breaker State
    1. GET / 50 ms 50 ms Closed
    2. GET / 1 s 50 ms Closed
    3. GET / 2 s 50 ms Closed
    4. GET / 3 s (timeout) 50 ms Closed (1 Failure)
    5. GET / 3 s (timeout) 50 ms Closed (2 Failure)
    6. GET / 3 s (timeout) 50 ms Closed => Open (3 Failure)
    7. GET / Skipped 50 ms Open
    8. GET / Skipped 50 ms Open

    View Slide

  106. @danriti
    Dan Riti
    # Request Time Service User Service Circuit Breaker State
    8. GET / Prevented 50 ms Open
    9. GET / Prevented 50 ms Open
    ~30 seconds passed, so reset timeout triggers Open => Half-Open
    10. GET / 3 s (timeout) 50 ms Half-Open => Open
    11. GET / Prevented 50 ms Open
    12. GET / Prevented 50 ms Open

    View Slide

  107. @danriti
    Dan Riti
    # Request Time Service User Service Circuit Breaker State
    13. GET / Prevented 50 ms Open
    Ops saves the day, so time service is healthy Open
    ~30 seconds passed, so reset timeout triggers Open => Half-Open
    14. GET / 50 ms 50 ms Half-Open => Closed
    15. GET / 50 ms 50 ms Closed
    16. GET / 50 ms 50 ms Closed

    View Slide

  108. @danriti
    Dan Riti

    View Slide