Remote Calls != Local Calls @ DevOps Days Boston 2015

Dan Riti
September 15, 2015

A 20-minute talk for DevOps Days Boston (2015) on graceful degradation when services fail.

http://www.devopsdays.org/events/2015-boston/program/#desc2
https://twitter.com/AppNeta/status/643807883029803008
https://github.com/danriti/short-circuit

If you want to see speaker notes, see the original Google presentation:

https://docs.google.com/presentation/d/1VyEuNQoA149ZUw3DBk19ugE1C4vV6RocrOQHUM9WUQs

Transcript

  1. Remote Calls != Local Calls: graceful degradation when services fail

  2. Dan Riti (@danriti), Senior Software Engineer. dmriti@gmail.com, github.com/danriti

  3. Monolithic

  4. Monolithic Services

  5. Dependencies

  6. Points of Failure

  7. Remote Calls != Local Calls

  9. Q: What approaches support graceful degradation when (services, networks, data_stores) fail?
    A:
    1. Timeouts
    2. Circuit Breaker Pattern
    3. Retries
    4. Bulkhead Pattern

  10. 1. Timeouts ◦ Forcing an error when a dependency is unhealthy
    2. Circuit Breaker Pattern ◦ Preventing operations when a dependency is unhealthy
    3. Retries ◦ Forcing extra attempts when the extra latency is acceptable because a recovery provides more value
    4. Bulkhead Pattern ◦ Partitioning a system to enforce the principle of damage containment

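As an illustration of the retries item, here is a minimal sketch of bounded retries with exponential backoff. The helper name `call_with_retries`, its parameters, and the injectable `sleep` function are assumptions made for this example; they do not come from the talk or from the short-circuit repository.

```python
import time


def call_with_retries(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation() up to `attempts` times, backing off between failures.

    Hypothetical helper for illustration only.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                # Out of attempts: re-raise so the caller can degrade gracefully.
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

The attempt limit is what keeps the added latency bounded: with `attempts=3` and `base_delay=0.1`, the caller waits at most 0.3 seconds in backoff on top of the time spent in the calls themselves.
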
  13. Web App, Time Service, User Service (both services: RESTful, HTTP)

  14. Time Service Response: { "time": "2015-09-11T17:33:48.940483" }
    User Service Response: { "name": "Dan Riti" }

  18. web app, time service, user service

  20. Q: How should we degrade when the time_service is unavailable?
    A:
    • Present “Unavailable” to user
    • Give up on requests after 3 seconds
    • Provide fault isolation

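The three answers above can be combined in one small wrapper. This is a minimal sketch under stated assumptions: `get_time_or_unavailable` is a hypothetical function, and the `fetch` callable stands in for the remote call to the time service (it is not taken from the talk's repository).

```python
UNAVAILABLE = "Unavailable"  # what the user sees when the time service is down


def get_time_or_unavailable(fetch, timeout=3.0):
    """Return the time reported by fetch(timeout), or UNAVAILABLE on failure.

    `fetch` is a hypothetical callable that performs the remote call and
    raises an exception on timeout, connection error, or bad response.
    """
    try:
        payload = fetch(timeout)
        return payload["time"]
    except Exception:
        # Fault isolation: a failing dependency degrades one field of the
        # page instead of failing the whole request.
        return UNAVAILABLE
```

Passing the remote call in as a parameter keeps the degradation policy separate from the HTTP client, so the fallback can be exercised without a running service.
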
  21. https://github.com/danriti/short-circuit

  22. Timeouts: “Your code can't just wait forever for a response that might never come; sooner or later, it needs to give up. Hope is not a design method.” - Michael T. Nygard, Release It!

  24. response = requests.get('http://localhost:3001/time')

  25. response = requests.get('http://localhost:3001/time', timeout=3.0)
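To see what a timeout buys, the sketch below uses only the Python standard library rather than `requests`, so it runs without extra packages. It assumes a hypothetical local server that accepts connections and never replies, standing in for an unhealthy time service; `start_hanging_server` and `fetch_time` are illustrative names, not code from the talk.

```python
import socket
import threading
import urllib.request


def start_hanging_server():
    """Listen on a free local port, accept connections, and never reply."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(5)
    held = []  # keep accepted sockets open so the client sees neither data nor EOF

    def accept_forever():
        while True:
            try:
                conn, _ = listener.accept()
            except OSError:
                return
            held.append(conn)

    threading.Thread(target=accept_forever, daemon=True).start()
    return listener.getsockname()[1]


def fetch_time(url, timeout):
    """Return the response body, or None if the call fails or times out."""
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except OSError:
        # Socket timeouts and urllib.error.URLError are both OSError subclasses.
        return None
```

Calling `fetch_time("http://127.0.0.1:%d/time" % start_hanging_server(), timeout=0.5)` gives up after roughly half a second instead of blocking indefinitely. With `requests`, the `timeout=3.0` argument on slide 25 plays the same role, and a timed-out call raises `requests.exceptions.Timeout`.
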

  26. RTFM

  27. The time service is unhealthy (diagram: Web App, Time Service, User Service). https://github.com/danriti/short-circuit

  29. 3.01 s, Broken Pipe

  30. Timeouts are not perfect
    • Easy to get started with
    • Provides some fault isolation
    • Response bound to timeout value
    • Still applying load to unhealthy service(s)

  31. Circuit Breaker Pattern

  32. Circuit Breaker Pattern
    • “Allow one subsystem (an electrical circuit) to fail (excessive current draw) without destroying the entire system (the house)”
    • “Once the danger has passed, the circuit breaker can be reset to restore full function to the system”
    • “This differs from retries, in that circuit breakers exist to prevent operations rather than re-execute them”
    - Michael T. Nygard, Release It!

  34. Circuit Breaker Pattern
    • Release It! by Michael T. Nygard (2007)
      ◦ https://pragprog.com/book/mnee/release-it
    • Netflix (2011)
      ◦ http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html
    • Martin Fowler (2014)
      ◦ http://martinfowler.com/bliki/CircuitBreaker.html

  36. time_breaker = pybreaker.CircuitBreaker(fail_max=3, reset_timeout=30)

  38. The system is healthy (diagram: Web App, Time Service, User Service). https://github.com/danriti/short-circuit

  44. #    Request  Time Service   User Service  Circuit Breaker State
      1.   GET /    50 ms          50 ms         Closed
      2.   GET /    1 s            50 ms         Closed
      3.   GET /    2 s            50 ms         Closed
      4.   GET /    3 s (timeout)  50 ms         Closed (1 Failure)
      5.   GET /    3 s (timeout)  50 ms         Closed (2 Failures)
      6.   GET /    3 s (timeout)  50 ms         Closed => Open (3 Failures)

  45. The time service is unhealthy (diagram: Web App, Time Service, User Service). https://github.com/danriti/short-circuit

  46. The circuit breaker is open (diagram: Web App, Time Service, User Service). https://github.com/danriti/short-circuit

  49. #    Request  Time Service   User Service  Circuit Breaker State
      1.   GET /    50 ms          50 ms         Closed
      2.   GET /    1 s            50 ms         Closed
      3.   GET /    2 s            50 ms         Closed
      4.   GET /    3 s (timeout)  50 ms         Closed (1 Failure)
      5.   GET /    3 s (timeout)  50 ms         Closed (2 Failures)
      6.   GET /    3 s (timeout)  50 ms         Closed => Open (3 Failures)
      7.   GET /    Prevented      50 ms         Open
      8.   GET /    Prevented      50 ms         Open

  50. 6 ms

  57. #    Request  Time Service   User Service  Circuit Breaker State
      8.   GET /    Prevented      50 ms         Open
      9.   GET /    Prevented      50 ms         Open
           ~30 seconds passed, so reset timeout triggers  Open => Half-Open
      10.  GET /    3 s (timeout)  50 ms         Half-Open => Open
      11.  GET /    Prevented      50 ms         Open
      12.  GET /    Prevented      50 ms         Open

  63. #    Request  Time Service   User Service  Circuit Breaker State
      13.  GET /    Prevented      50 ms         Open
           Ops saves the day, so time service is healthy  Open
           ~30 seconds passed, so reset timeout triggers  Open => Half-Open
      14.  GET /    50 ms          50 ms         Half-Open => Closed
      15.  GET /    50 ms          50 ms         Closed
      16.  GET /    50 ms          50 ms         Closed

  64. 14 ms
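The closed, open, and half-open transitions walked through above can be sketched as a small state machine. This is a toy illustration, not pybreaker's implementation: the class `SimpleCircuitBreaker`, the exception `CircuitOpenError`, and the injectable `clock` are names assumed for this example. Only the parameter names `fail_max` and `reset_timeout` mirror the slide.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is prevented because the breaker is open."""


class SimpleCircuitBreaker:
    """Toy closed / open / half-open state machine (not pybreaker itself)."""

    def __init__(self, fail_max=3, reset_timeout=30, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so tests need not wait 30 seconds
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still inside the reset window: prevent the operation.
                raise CircuitOpenError("call prevented")
            # Reset timeout elapsed: allow a single trial call.
            self.state = "half-open"
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.fail_max:
                # A failed trial call, or too many failures, opens the breaker.
                self.state = "open"
                self.opened_at = self.clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self.state = "closed"
        self.failures = 0
        return result
```

With `fail_max=3` and `reset_timeout=30`, three consecutive timeouts open the breaker, later calls raise `CircuitOpenError` without touching the service, and the first call after 30 seconds decides whether the breaker closes again or reopens.
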

  65. Timeouts + Circuit Breaker Pattern
    • Graceful degradation of user experience
    • Fail fast and rapidly recover
    • Reduces load on unhealthy service
    • Avoid unhealthy service affecting the system
    • (Bonus) Interface to monitor and measure points of integration in a system

  66. Timeouts + Circuit Breaker Pattern
    • Under-provisioned services can cause “flapping”
      ◦ Service is overwhelmed due to load
      ◦ Circuit breaker trips to “open”
      ◦ Service is no longer receiving requests, so it “recovers”
      ◦ Circuit breaker “closes”
    • Important to understand service performance constraints

  67. Parameters for Timeouts + Circuit Breaker
    • What do you present during failure?
    • How many times do you accept failure? (fail_max)
    • How long until you attempt reset? (reset_timeout)
    • How long will you wait? (timeout)

  72. Holiday Weekend

  73. Circuit Breaker Libraries
    • Python - https://github.com/danielfm/pybreaker
    • Go - https://github.com/rubyist/circuitbreaker
    • Java - https://github.com/Netflix/Hystrix
    • Ruby - https://github.com/wsargent/circuit_breaker
    • .NET - https://github.com/michael-wolfenden/Polly
    • JavaScript - https://github.com/yammer/circuit-breaker-js
    • PHP - https://github.com/ejsmont-artur/php-circuit-breaker

  74. https://github.com/Netflix/Hystrix/wiki/Dashboard

  75. Thank You @danriti

  76. Backup

  77. Source: https://github.com/Netflix/Hystrix/wiki

  78. #    Request  Time Service   User Service  Circuit Breaker State
      1.   GET /    50 ms          50 ms         Closed
      2.   GET /    1 s            50 ms         Closed
      3.   GET /    2 s            50 ms         Closed
      4.   GET /    3 s (timeout)  50 ms         Closed (1 Failure)
      5.   GET /    3 s (timeout)  50 ms         Closed (2 Failures)
      6.   GET /    3 s (timeout)  50 ms         Closed => Open (3 Failures)
      7.   GET /    Skipped        50 ms         Open
      8.   GET /    Skipped        50 ms         Open

  81. Fallacies of Distributed Computing
    1. The network is reliable.
    2. Latency is zero.
    3. Bandwidth is infinite.
    4. The network is secure.
    5. Topology doesn't change.
    6. There is one administrator.
    7. Transport cost is zero.
    8. The network is homogeneous.
    https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing