Remote Calls != Local Calls @ DevOps Days Boston 2015

Dan Riti
September 15, 2015

A 20-minute talk for DevOps Days Boston 2015 on graceful degradation when services fail.

http://www.devopsdays.org/events/2015-boston/program/#desc2
https://twitter.com/AppNeta/status/643807883029803008
https://github.com/danriti/short-circuit

Speaker notes are available in the original Google presentation:

https://docs.google.com/presentation/d/1VyEuNQoA149ZUw3DBk19ugE1C4vV6RocrOQHUM9WUQs


Transcript

  1. Graceful degradation when services fail
    Remote Calls != Local Calls


  2. @danriti
    Dan Riti
    Senior Software Engineer @
    Dan Riti
    @danriti
    [email protected]
    github.com/danriti


  3. @danriti
    Dan Riti
    Monolithic


  4. @danriti
    Dan Riti
    Monolithic Services


  5. @danriti
    Dan Riti
    Dependencies


  6. @danriti
    Dan Riti
    Points of Failure


  7. @danriti
    Dan Riti
    Remote Calls != Local Calls


  8. @danriti
    Dan Riti
    Q:
    What approaches support graceful degradation when
    (services, networks, data_stores) fail?


  9. @danriti
    Dan Riti
    Q:
    What approaches support graceful degradation when
    (services, networks, data_stores) fail?
    A:
    1. Timeouts
    2. Circuit Breaker Pattern
    3. Retries
    4. Bulkhead Pattern


  10. @danriti
    Dan Riti
    1. Timeouts
    ○ Forcing an error when a dependency is unhealthy
    2. Circuit Breaker Pattern
    ○ Prevent operations when a dependency is unhealthy
    3. Retries
    ○ Forcing extra attempts where extra latency is acceptable if a
    recovery provides more value
    4. Bulkhead Pattern
    ○ Partitioning a system to enforce the principle of damage
    containment
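
As an illustration of item 3, a bounded retry loop might look like the sketch below. The helper name `call_with_retries`, the attempt count, and the exponential backoff delays are assumptions made for illustration; none of them come from the talk.

```python
import time


def call_with_retries(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func() up to `attempts` times, backing off between failures.

    Hypothetical helper: the extra latency of each retry is only worth
    paying when a late success is more valuable than a fast failure.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:  # a real caller would catch narrower errors
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * (2 ** attempt))
    raise last_error
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real delays.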


  11. @danriti
    Dan Riti
    Time Service
    User Service
    Web App


  12. @danriti
    Dan Riti
    Time Service
    User Service
    Web App
    RESTful
    RESTful


  13. @danriti
    Dan Riti
    Time Service
    User Service
    Web App
    RESTful
    RESTful
    HTTP
    HTTP


  14. @danriti
    Dan Riti
    Time Service Response
    {
    "time": "2015-09-11T17:33:48.940483"
    }
    User Service Response
    {
    "name": "Dan Riti"
    }
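
For readers who want something to experiment with, a time service that returns a response of this shape could be sketched with the standard library alone. This is not the code from the short-circuit repository; the handler name, the `time_payload` helper, and the choice of `http.server` are assumptions. Port 3001 matches the URL used in the timeout slides.

```python
import json
from datetime import datetime
from http.server import BaseHTTPRequestHandler, HTTPServer


def time_payload(now=None):
    """Build the JSON body, e.g. {"time": "2015-09-11T17:33:48.940483"}."""
    now = now or datetime.now()
    return json.dumps({"time": now.isoformat()})


class TimeHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with the current time."""

    def do_GET(self):
        body = time_payload().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=3001):
    """Run the service until interrupted."""
    HTTPServer(("localhost", port), TimeHandler).serve_forever()
```

Calling `serve()` starts the service; the user service would follow the same pattern with a `"name"` field.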


  15. @danriti
    Dan Riti


  16. @danriti
    Dan Riti
    web app


  17. @danriti
    Dan Riti
    web app
    time service


  18. @danriti
    Dan Riti
    web app
    time service
    user service


  19. @danriti
    Dan Riti
    Q:
    How should we degrade when the time_service is
    unavailable?


  20. @danriti
    Dan Riti
    Q:
    How should we degrade when the time_service is
    unavailable?
    A:
    Present “Unavailable” to user
    Give up on requests after 3 seconds
    Provide fault isolation
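
The first two answers can be sketched as a single function: wait at most 3 seconds, and fall back to "Unavailable" on any failure. The function name `fetch_time` and the injectable `opener` parameter are assumptions for illustration, and the standard-library `urllib` stands in for an HTTP client so that the sketch has no third-party dependency.

```python
import json
import urllib.request


def fetch_time(url="http://localhost:3001/time", timeout=3.0,
               opener=urllib.request.urlopen):
    """Return the time string from the service, or "Unavailable".

    Hypothetical helper: a timeout, a refused connection, or a malformed
    body all degrade to the same placeholder instead of an error page.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))["time"]
    except (OSError, ValueError, KeyError):
        # OSError covers timeouts and connection errors; ValueError covers
        # invalid JSON; KeyError covers a body without a "time" field.
        return "Unavailable"
```

The web app can render whatever `fetch_time()` returns without caring whether the call succeeded.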


  21. @danriti
    Dan Riti
    https://github.com/danriti/short-circuit


  22. @danriti
    Dan Riti
    Timeouts
    “Your code can't just wait forever for a response
    that might never come; sooner or later, it needs to
    give up. Hope is not a design method.”
    - Michael T. Nygard, Release It!


  23. @danriti
    Dan Riti


  24. @danriti
    Dan Riti
    response = requests.get('http://localhost:3001/time')


  25. @danriti
    Dan Riti
    response = requests.get('http://localhost:3001/time',
    timeout=3.0)


  26. @danriti
    Dan Riti
    RTFM


  27. @danriti
    Dan Riti
    Time Service
    User Service
    https://github.com/danriti/short-circuit
    The time service is unhealthy
    Web App


  28. @danriti
    Dan Riti
    3.01 s


  29. @danriti
    Dan Riti
    3.01 s
    Broken Pipe


  30. @danriti
    Dan Riti
    Timeouts are not perfect
    Easy to get started with
    Provides some fault isolation
    Response bound to timeout value
    Still applying load to unhealthy service(s)
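
The "response bound to timeout value" point can be reproduced locally with a deliberately slow server. The sketch below is an assumption-laden stand-in for the demo app: it uses the standard library rather than the talk's stack, and it shrinks the numbers (a 1 second stall, a caller-chosen timeout) so it runs quickly. `SlowHandler`, `timed_call`, and `SERVICE_DELAY` are hypothetical names.

```python
import http.client
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

SERVICE_DELAY = 1.0  # hypothetical: how long the unhealthy service stalls


class SlowHandler(BaseHTTPRequestHandler):
    """Stands in for an unhealthy time service."""

    def do_GET(self):
        time.sleep(SERVICE_DELAY)
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"{}")
        except OSError:
            pass  # the client already gave up (the "Broken Pipe" case)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def timed_call(timeout):
    """Call the slow service once; return (outcome, elapsed seconds)."""
    server = HTTPServer(("127.0.0.1", 0), SlowHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port,
                                      timeout=timeout)
    start = time.monotonic()
    try:
        conn.request("GET", "/time")
        conn.getresponse().read()
        outcome = "ok"
    except OSError:  # socket timeouts are OSError subclasses
        outcome = "Unavailable"
    finally:
        elapsed = time.monotonic() - start
        conn.close()
        server.shutdown()
        server.server_close()
    return outcome, elapsed
```

With a timeout shorter than the stall, the caller gets "Unavailable" after roughly the timeout value, while the service still did the work for a request nobody was waiting on.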


  31. @danriti
    Dan Riti
    Circuit Breaker Pattern


  32. @danriti
    Dan Riti
    Circuit Breaker Pattern
    ● “Allow one subsystem (an electrical circuit) to fail (excessive current
    draw) without destroying the entire system (the house)”
    ● “Once the danger has passed, the circuit breaker can be reset to
    restore full function to the system”
    ● “This differs from retries, in that circuit breakers exist to prevent
    operations rather than re-execute them”
    - Michael T. Nygard, Release It!


  33. @danriti
    Dan Riti


  34. @danriti
    Dan Riti
    Circuit Breaker Pattern
    ● Release It! by Michael T. Nygard (2007)
    ○ https://pragprog.com/book/mnee/release-it
    ● Netflix (2011)
    ○ http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html
    ● Martin Fowler (2014)
    ○ http://martinfowler.com/bliki/CircuitBreaker.html


  35. @danriti
    Dan Riti


  36. @danriti
    Dan Riti
    time_breaker = pybreaker.CircuitBreaker(fail_max=3,
    reset_timeout=30)


  37. @danriti
    Dan Riti
    time_breaker = pybreaker.CircuitBreaker(
    fail_max=3,
    reset_timeout=30)
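
To make the two parameters concrete, here is a minimal, dependency-free sketch of the closed / open / half-open state machine. It is not pybreaker's implementation and may differ from it in detail; the class name `SimpleBreaker`, the `CircuitOpenError` exception, and the injectable `clock` are assumptions for illustration. Only `fail_max` and `reset_timeout` mirror the parameter names shown above.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is prevented because the breaker is open."""


class SimpleBreaker:
    """Illustrative circuit breaker, not a drop-in for pybreaker."""

    def __init__(self, fail_max=3, reset_timeout=30, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # reset timeout elapsed: allow one trial call
        return "open"

    def call(self, func, *args, **kwargs):
        state = self.state
        if state == "open":
            raise CircuitOpenError("call prevented")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, opens it.
            if state == "half-open" or self.failures >= self.fail_max:
                self.opened_at = self.clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap the remote call, for example `breaker.call(fetch)`, and treat `CircuitOpenError` the same way as a timeout when choosing what to present.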


  38. @danriti
    Dan Riti
    Time Service
    User Service
    https://github.com/danriti/short-circuit
    The system is healthy
    Web App


  39. @danriti
    Dan Riti
    #    Request  Time Service   User Service  Circuit Breaker State
    1.   GET /    50 ms          50 ms         Closed


  40. @danriti
    Dan Riti
    #    Request  Time Service   User Service  Circuit Breaker State
    1.   GET /    50 ms          50 ms         Closed
    2.   GET /    1 s            50 ms         Closed


  41. @danriti
    Dan Riti
    #    Request  Time Service   User Service  Circuit Breaker State
    1.   GET /    50 ms          50 ms         Closed
    2.   GET /    1 s            50 ms         Closed
    3.   GET /    2 s            50 ms         Closed


  42. @danriti
    Dan Riti
    #    Request  Time Service   User Service  Circuit Breaker State
    1.   GET /    50 ms          50 ms         Closed
    2.   GET /    1 s            50 ms         Closed
    3.   GET /    2 s            50 ms         Closed
    4.   GET /    3 s (timeout)  50 ms         Closed (1 Failure)


  43. @danriti
    Dan Riti
    #    Request  Time Service   User Service  Circuit Breaker State
    1.   GET /    50 ms          50 ms         Closed
    2.   GET /    1 s            50 ms         Closed
    3.   GET /    2 s            50 ms         Closed
    4.   GET /    3 s (timeout)  50 ms         Closed (1 Failure)
    5.   GET /    3 s (timeout)  50 ms         Closed (2 Failures)


  44. @danriti
    Dan Riti
    #    Request  Time Service   User Service  Circuit Breaker State
    1.   GET /    50 ms          50 ms         Closed
    2.   GET /    1 s            50 ms         Closed
    3.   GET /    2 s            50 ms         Closed
    4.   GET /    3 s (timeout)  50 ms         Closed (1 Failure)
    5.   GET /    3 s (timeout)  50 ms         Closed (2 Failures)
    6.   GET /    3 s (timeout)  50 ms         Closed => Open (3 Failures)


  45. @danriti
    Dan Riti
    Time Service
    User Service
    https://github.com/danriti/short-circuit
    The time service is unhealthy
    Web App


  46. @danriti
    Dan Riti
    Time Service
    User Service
    https://github.com/danriti/short-circuit
    The circuit breaker is open
    Web App


  47. @danriti
    Dan Riti
    #    Request  Time Service   User Service  Circuit Breaker State
    1.   GET /    50 ms          50 ms         Closed
    2.   GET /    1 s            50 ms         Closed
    3.   GET /    2 s            50 ms         Closed
    4.   GET /    3 s (timeout)  50 ms         Closed (1 Failure)
    5.   GET /    3 s (timeout)  50 ms         Closed (2 Failures)
    6.   GET /    3 s (timeout)  50 ms         Closed => Open (3 Failures)


  48. @danriti
    Dan Riti
    #    Request  Time Service   User Service  Circuit Breaker State
    1.   GET /    50 ms          50 ms         Closed
    2.   GET /    1 s            50 ms         Closed
    3.   GET /    2 s            50 ms         Closed
    4.   GET /    3 s (timeout)  50 ms         Closed (1 Failure)
    5.   GET /    3 s (timeout)  50 ms         Closed (2 Failures)
    6.   GET /    3 s (timeout)  50 ms         Closed => Open (3 Failures)
    7.   GET /    Prevented      50 ms         Open


  49. @danriti
    Dan Riti
    #    Request  Time Service   User Service  Circuit Breaker State
    1.   GET /    50 ms          50 ms         Closed
    2.   GET /    1 s            50 ms         Closed
    3.   GET /    2 s            50 ms         Closed
    4.   GET /    3 s (timeout)  50 ms         Closed (1 Failure)
    5.   GET /    3 s (timeout)  50 ms         Closed (2 Failures)
    6.   GET /    3 s (timeout)  50 ms         Closed => Open (3 Failures)
    7.   GET /    Prevented      50 ms         Open
    8.   GET /    Prevented      50 ms         Open


  50. @danriti
    Dan Riti
    6 ms


  51. @danriti
    Dan Riti
    #    Request  Time Service   User Service  Circuit Breaker State
    8.   GET /    Prevented      50 ms         Open


  52. @danriti
    Dan Riti
    #    Request  Time Service   User Service  Circuit Breaker State
    8.   GET /    Prevented      50 ms         Open
    9.   GET /    Prevented      50 ms         Open


  53. @danriti
    Dan Riti
    #    Request  Time Service   User Service  Circuit Breaker State
    8.   GET /    Prevented      50 ms         Open
    9.   GET /    Prevented      50 ms         Open
         ~30 seconds passed, so reset timeout triggers  Open => Half-Open


  54. @danriti
    Dan Riti
    #    Request  Time Service   User Service  Circuit Breaker State
    8.   GET /    Prevented      50 ms         Open
    9.   GET /    Prevented      50 ms         Open
         ~30 seconds passed, so reset timeout triggers  Open => Half-Open
    10.  GET /    3 s (timeout)  50 ms         Half-Open => Open


  55. @danriti
    Dan Riti
    #    Request  Time Service   User Service  Circuit Breaker State
    8.   GET /    Prevented      50 ms         Open
    9.   GET /    Prevented      50 ms         Open
         ~30 seconds passed, so reset timeout triggers  Open => Half-Open
    10.  GET /    3 s (timeout)  50 ms         Half-Open => Open
    11.  GET /    Prevented      50 ms         Open


  56. @danriti
    Dan Riti
    #    Request  Time Service   User Service  Circuit Breaker State
    8.   GET /    Prevented      50 ms         Open
    9.   GET /    Prevented      50 ms         Open
         ~30 seconds passed, so reset timeout triggers  Open => Half-Open
    10.  GET /    3 s (timeout)  50 ms         Half-Open => Open
    11.  GET /    Prevented      50 ms         Open
    12.  GET /    Prevented      50 ms         Open


  57. @danriti
    Dan Riti
    #    Request  Time Service   User Service  Circuit Breaker State
    8.   GET /    Prevented      50 ms         Open
    9.   GET /    Prevented      50 ms         Open
         ~30 seconds passed, so reset timeout triggers  Open => Half-Open
    10.  GET /    3 s (timeout)  50 ms         Half-Open => Open
    11.  GET /    Prevented      50 ms         Open
    12.  GET /    Prevented      50 ms         Open


  58. @danriti
    Dan Riti
    #    Request  Time Service   User Service  Circuit Breaker State
    13.  GET /    Prevented      50 ms         Open


  59. @danriti
    Dan Riti
    #    Request  Time Service   User Service  Circuit Breaker State
    13.  GET /    Prevented      50 ms         Open
         Ops saves the day, so time service is healthy  Open


  60. @danriti
    Dan Riti
    #    Request  Time Service   User Service  Circuit Breaker State
    13.  GET /    Prevented      50 ms         Open
         Ops saves the day, so time service is healthy  Open
         ~30 seconds passed, so reset timeout triggers  Open => Half-Open


  61. @danriti
    Dan Riti
    #    Request  Time Service   User Service  Circuit Breaker State
    13.  GET /    Prevented      50 ms         Open
         Ops saves the day, so time service is healthy  Open
         ~30 seconds passed, so reset timeout triggers  Open => Half-Open
    14.  GET /    50 ms          50 ms         Half-Open => Closed


  62. @danriti
    Dan Riti
    #    Request  Time Service   User Service  Circuit Breaker State
    13.  GET /    Prevented      50 ms         Open
         Ops saves the day, so time service is healthy  Open
         ~30 seconds passed, so reset timeout triggers  Open => Half-Open
    14.  GET /    50 ms          50 ms         Half-Open => Closed
    15.  GET /    50 ms          50 ms         Closed


  63. @danriti
    Dan Riti
    #    Request  Time Service   User Service  Circuit Breaker State
    13.  GET /    Prevented      50 ms         Open
         Ops saves the day, so time service is healthy  Open
         ~30 seconds passed, so reset timeout triggers  Open => Half-Open
    14.  GET /    50 ms          50 ms         Half-Open => Closed
    15.  GET /    50 ms          50 ms         Closed
    16.  GET /    50 ms          50 ms         Closed


  64. @danriti
    Dan Riti
    14 ms


  65. @danriti
    Dan Riti
    Timeouts + Circuit Breaker Pattern
    ● Graceful degradation of user experience
    ● Fail fast and rapidly recover
    ● Reduces load on unhealthy service
    ● Avoid unhealthy service affecting the system
    ● (Bonus) Interface to monitor and measure points of
    integration in a system


  66. @danriti
    Dan Riti
    Timeouts + Circuit Breaker Pattern
    ● Underprovisioned services can cause “flapping”
    ○ Service is overwhelmed due to load
    ○ Circuit breaker trips to “open”
    ○ Service no longer receiving requests, so it
    “recovers”
    ○ Circuit breaker “closes”
    ○ Load returns and the cycle repeats
    ● Important to understand service performance
    constraints


  67. @danriti
    Dan Riti
    Parameters for Timeouts + Circuit Breaker
    ● What do you present during failure?
    ● How many times do you accept failure? (fail_max)
    ● How long until you attempt reset? (reset_timeout)
    ● How long will you wait? (timeout)
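
One way to keep these four decisions visible is to collect them in a single small object. The sketch below is an assumption for illustration (the class name `DegradationPolicy` and its method are hypothetical); the default values are the ones used in the walkthrough: "Unavailable", 3 failures, a 30 second reset timeout, and a 3 second timeout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DegradationPolicy:
    """Hypothetical container for the four tuning decisions."""

    fallback: str = "Unavailable"   # what to present during failure
    fail_max: int = 3               # failures accepted before opening
    reset_timeout: float = 30.0     # seconds until a reset is attempted
    timeout: float = 3.0            # seconds to wait on each call

    def seconds_waited_before_open(self):
        """Total time callers spend waiting on timeouts before the breaker
        opens, assuming the failing requests arrive one after another."""
        return self.fail_max * self.timeout


policy = DegradationPolicy()
```

With the defaults, requests 4 through 6 in the walkthrough each wait the full timeout, so `policy.seconds_waited_before_open()` comes to 9 seconds of user-visible waiting before calls start being prevented.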


  68. @danriti
    Dan Riti


  69. @danriti
    Dan Riti


  70. @danriti
    Dan Riti


  71. @danriti
    Dan Riti


  72. @danriti
    Dan Riti
    Holiday Weekend


  73. @danriti
    Dan Riti
    Circuit Breaker Libraries
    ● Python - https://github.com/danielfm/pybreaker
    ● Go - https://github.com/rubyist/circuitbreaker
    ● Java - https://github.com/Netflix/Hystrix
    ● Ruby - https://github.com/wsargent/circuit_breaker
    ● .NET - https://github.com/michael-wolfenden/Polly
    ● JavaScript - https://github.com/yammer/circuit-breaker-js
    ● PHP - https://github.com/ejsmont-artur/php-circuit-breaker


  74. @danriti
    Dan Riti
    https://github.com/Netflix/Hystrix/wiki/Dashboard


  75. @danriti
    Dan Riti
    Thank You
    @danriti


  76. @danriti
    Dan Riti
    Backup


  77. @danriti
    Dan Riti
    Source: https://github.com/Netflix/Hystrix/wiki


  78. @danriti
    Dan Riti
    #    Request  Time Service   User Service  Circuit Breaker State
    1.   GET /    50 ms          50 ms         Closed
    2.   GET /    1 s            50 ms         Closed
    3.   GET /    2 s            50 ms         Closed
    4.   GET /    3 s (timeout)  50 ms         Closed (1 Failure)
    5.   GET /    3 s (timeout)  50 ms         Closed (2 Failures)
    6.   GET /    3 s (timeout)  50 ms         Closed => Open (3 Failures)
    7.   GET /    Skipped        50 ms         Open
    8.   GET /    Skipped        50 ms         Open


  79. @danriti
    Dan Riti
    #    Request  Time Service   User Service  Circuit Breaker State
    8.   GET /    Prevented      50 ms         Open
    9.   GET /    Prevented      50 ms         Open
         ~30 seconds passed, so reset timeout triggers  Open => Half-Open
    10.  GET /    3 s (timeout)  50 ms         Half-Open => Open
    11.  GET /    Prevented      50 ms         Open
    12.  GET /    Prevented      50 ms         Open


  80. @danriti
    Dan Riti
    #    Request  Time Service   User Service  Circuit Breaker State
    13.  GET /    Prevented      50 ms         Open
         Ops saves the day, so time service is healthy  Open
         ~30 seconds passed, so reset timeout triggers  Open => Half-Open
    14.  GET /    50 ms          50 ms         Half-Open => Closed
    15.  GET /    50 ms          50 ms         Closed
    16.  GET /    50 ms          50 ms         Closed


  81. @danriti
    Dan Riti
    Fallacies of Distributed Computing
    1. The network is reliable.
    2. Latency is zero.
    3. Bandwidth is infinite.
    4. The network is secure.
    5. Topology doesn't change.
    6. There is one administrator.
    7. Transport cost is zero.
    8. The network is homogeneous.
    https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing


  82. @danriti
    Dan Riti
