Remote Calls != Local Calls @ DevOps Days Boston 2015

Dan Riti
September 15, 2015

A 20-minute talk for DevOps Days Boston (2015) on graceful degradation when services fail.

http://www.devopsdays.org/events/2015-boston/program/#desc2
https://twitter.com/AppNeta/status/643807883029803008
https://github.com/danriti/short-circuit

If you want to see speaker notes, see the original Google presentation:

https://docs.google.com/presentation/d/1VyEuNQoA149ZUw3DBk19ugE1C4vV6RocrOQHUM9WUQs

Transcript

  1. 9.

    @danriti Dan Riti

    Q: What approaches support graceful degradation when (services, networks, data_stores) fail?

    A:
    1. Timeouts
    2. Circuit Breaker Pattern
    3. Retries
    4. Bulkhead Pattern
  2. 10.

    1. Timeouts
       ◦ Forcing an error when a dependency is unhealthy
    2. Circuit Breaker Pattern
       ◦ Prevent operations when a dependency is unhealthy
    3. Retries
       ◦ Forcing extra attempts where extra latency is acceptable if a recovery provides more value
    4. Bulkhead Pattern
       ◦ Partitioning a system to enforce the principle of damage containment
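Of the four approaches, retries are the quickest to sketch. The Python below is a minimal illustration and is not taken from the talk: the `retry` helper, its `attempts` and `base_delay` parameters, and the `flaky_time_service` function are hypothetical names chosen for the example. It trades extra latency (a doubling delay between attempts) for a chance at recovery.

```python
import time


def retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation() up to `attempts` times, doubling the delay after each failure."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(delay)
            delay *= 2


# Hypothetical flaky dependency: fails twice, then succeeds.
calls = {"count": 0}


def flaky_time_service():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("time service unavailable")
    return "12:00:00"


# Record the requested delays instead of sleeping so the demo runs instantly.
waits = []
result = retry(flaky_time_service, attempts=3, base_delay=0.1, sleep=waits.append)
print(result, waits)
```

Retries only make sense where the added latency is acceptable; against a dependency that stays unhealthy they add load rather than value.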
  3. 19.
  4. 20.

    Q: How should we degrade when the time_service is unavailable?

    A:
    • Present “Unavailable” to user
    • Give up on requests after 3 seconds
    • Provide fault isolation
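One way to give up after a fixed time and present a fallback is sketched below in Python. It is an assumption-laden illustration, not the talk's implementation: `call_with_timeout`, `healthy`, and `hung` are hypothetical names, and the timings are scaled down from the 3 seconds on the slide so the demo finishes quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool; the size is an arbitrary choice for this sketch.
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, timeout, fallback):
    """Return func() if it finishes within `timeout` seconds, else `fallback`."""
    future = _pool.submit(func)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        # The caller stops waiting here, but the worker thread keeps running,
        # so the slow dependency is still being called in the background.
        return fallback


def healthy():
    return "12:00:00"


def hung():
    time.sleep(1.0)  # stands in for a dependency that responds too slowly
    return "12:00:00"


fast = call_with_timeout(healthy, timeout=0.2, fallback="Unavailable")
slow = call_with_timeout(hung, timeout=0.2, fallback="Unavailable")
print(fast, slow)
```

The user-facing response time is bounded by the timeout, but nothing here stops further requests from reaching the slow dependency.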
  5. 22.

    Timeouts

    “Your code can't just wait forever for a response that might never come; sooner or later, it needs to give up. Hope is not a design method.”
    - Michael T. Nygard, Release It!
  6. 30.

    Timeouts are not perfect

    • Easy to get started with
    • Provides some fault isolation
    • Response bound to timeout value
    • Still applying load to unhealthy service(s)
  7. 32.

    Circuit Breaker Pattern

    • “Allow one subsystem (an electrical circuit) to fail (excessive current draw) without destroying the entire system (the house)”
    • “Once the danger has passed, the circuit breaker can be reset to restore full function to the system”
    • “This differs from retries, in that circuit breakers exist to prevent operations rather than re-execute them”

    - Michael T. Nygard, Release It!
  8. 34.

    Circuit Breaker Pattern

    • Release It! by Michael T. Nygard (2007)
      ◦ https://pragprog.com/book/mnee/release-it
    • Netflix (2011)
      ◦ http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html
    • Martin Fowler (2014)
      ◦ http://martinfowler.com/bliki/CircuitBreaker.html
  9. 39.

    #   Request  Time Service   User Service  Circuit Breaker State
    1.  GET /    50 ms          50 ms         Closed
  10. 40.

    #   Request  Time Service   User Service  Circuit Breaker State
    1.  GET /    50 ms          50 ms         Closed
    2.  GET /    1 s            50 ms         Closed
  11. 41.

    #   Request  Time Service   User Service  Circuit Breaker State
    1.  GET /    50 ms          50 ms         Closed
    2.  GET /    1 s            50 ms         Closed
    3.  GET /    2 s            50 ms         Closed
  12. 42.

    #   Request  Time Service   User Service  Circuit Breaker State
    1.  GET /    50 ms          50 ms         Closed
    2.  GET /    1 s            50 ms         Closed
    3.  GET /    2 s            50 ms         Closed
    4.  GET /    3 s (timeout)  50 ms         Closed (1 Failure)
  13. 43.

    #   Request  Time Service   User Service  Circuit Breaker State
    1.  GET /    50 ms          50 ms         Closed
    2.  GET /    1 s            50 ms         Closed
    3.  GET /    2 s            50 ms         Closed
    4.  GET /    3 s (timeout)  50 ms         Closed (1 Failure)
    5.  GET /    3 s (timeout)  50 ms         Closed (2 Failures)
  14. 44.

    #   Request  Time Service   User Service  Circuit Breaker State
    1.  GET /    50 ms          50 ms         Closed
    2.  GET /    1 s            50 ms         Closed
    3.  GET /    2 s            50 ms         Closed
    4.  GET /    3 s (timeout)  50 ms         Closed (1 Failure)
    5.  GET /    3 s (timeout)  50 ms         Closed (2 Failures)
    6.  GET /    3 s (timeout)  50 ms         Closed => Open (3 Failures)
  16. 48.

    #   Request  Time Service   User Service  Circuit Breaker State
    1.  GET /    50 ms          50 ms         Closed
    2.  GET /    1 s            50 ms         Closed
    3.  GET /    2 s            50 ms         Closed
    4.  GET /    3 s (timeout)  50 ms         Closed (1 Failure)
    5.  GET /    3 s (timeout)  50 ms         Closed (2 Failures)
    6.  GET /    3 s (timeout)  50 ms         Closed => Open (3 Failures)
    7.  GET /    Prevented      50 ms         Open
  17. 49.

    #   Request  Time Service   User Service  Circuit Breaker State
    1.  GET /    50 ms          50 ms         Closed
    2.  GET /    1 s            50 ms         Closed
    3.  GET /    2 s            50 ms         Closed
    4.  GET /    3 s (timeout)  50 ms         Closed (1 Failure)
    5.  GET /    3 s (timeout)  50 ms         Closed (2 Failures)
    6.  GET /    3 s (timeout)  50 ms         Closed => Open (3 Failures)
    7.  GET /    Prevented      50 ms         Open
    8.  GET /    Prevented      50 ms         Open
  18. 51.

    #   Request  Time Service   User Service  Circuit Breaker State
    8.  GET /    Prevented      50 ms         Open
  19. 52.

    #   Request  Time Service   User Service  Circuit Breaker State
    8.  GET /    Prevented      50 ms         Open
    9.  GET /    Prevented      50 ms         Open
  20. 53.

    #   Request  Time Service   User Service  Circuit Breaker State
    8.  GET /    Prevented      50 ms         Open
    9.  GET /    Prevented      50 ms         Open
        ~30 seconds passed, so reset timeout triggers  Open => Half-Open
  21. 54.

    #   Request  Time Service   User Service  Circuit Breaker State
    8.  GET /    Prevented      50 ms         Open
    9.  GET /    Prevented      50 ms         Open
        ~30 seconds passed, so reset timeout triggers  Open => Half-Open
    10. GET /    3 s (timeout)  50 ms         Half-Open => Open
  22. 55.

    #   Request  Time Service   User Service  Circuit Breaker State
    8.  GET /    Prevented      50 ms         Open
    9.  GET /    Prevented      50 ms         Open
        ~30 seconds passed, so reset timeout triggers  Open => Half-Open
    10. GET /    3 s (timeout)  50 ms         Half-Open => Open
    11. GET /    Prevented      50 ms         Open
  23. 56.

    #   Request  Time Service   User Service  Circuit Breaker State
    8.  GET /    Prevented      50 ms         Open
    9.  GET /    Prevented      50 ms         Open
        ~30 seconds passed, so reset timeout triggers  Open => Half-Open
    10. GET /    3 s (timeout)  50 ms         Half-Open => Open
    11. GET /    Prevented      50 ms         Open
    12. GET /    Prevented      50 ms         Open
  25. 58.

    #   Request  Time Service   User Service  Circuit Breaker State
    13. GET /    Prevented      50 ms         Open
  26. 59.

    #   Request  Time Service   User Service  Circuit Breaker State
    13. GET /    Prevented      50 ms         Open
        Ops saves the day, so time service is healthy  Open
  27. 60.

    #   Request  Time Service   User Service  Circuit Breaker State
    13. GET /    Prevented      50 ms         Open
        Ops saves the day, so time service is healthy  Open
        ~30 seconds passed, so reset timeout triggers  Open => Half-Open
  28. 61.

    #   Request  Time Service   User Service  Circuit Breaker State
    13. GET /    Prevented      50 ms         Open
        Ops saves the day, so time service is healthy  Open
        ~30 seconds passed, so reset timeout triggers  Open => Half-Open
    14. GET /    50 ms          50 ms         Half-Open => Closed
  29. 62.

    #   Request  Time Service   User Service  Circuit Breaker State
    13. GET /    Prevented      50 ms         Open
        Ops saves the day, so time service is healthy  Open
        ~30 seconds passed, so reset timeout triggers  Open => Half-Open
    14. GET /    50 ms          50 ms         Half-Open => Closed
    15. GET /    50 ms          50 ms         Closed
  30. 63.

    #   Request  Time Service   User Service  Circuit Breaker State
    13. GET /    Prevented      50 ms         Open
        Ops saves the day, so time service is healthy  Open
        ~30 seconds passed, so reset timeout triggers  Open => Half-Open
    14. GET /    50 ms          50 ms         Half-Open => Closed
    15. GET /    50 ms          50 ms         Closed
    16. GET /    50 ms          50 ms         Closed
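The sixteen requests walked through in slides 39 to 63 can be replayed against a small state machine. The Python below is a sketch under stated assumptions, not the code from the talk or from any of the listed libraries: the `CircuitBreaker` class, `CircuitOpenError`, and the injected `clock` are hypothetical, and the breaker is assumed to count consecutive failures and to reset that count on any success. A fake clock replaces the real 30-second wait.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is prevented because the breaker is open."""


class CircuitBreaker:
    def __init__(self, max_fails=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_fails = max_fails
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "closed"
        self.fails = 0
        self.opened_at = None

    def call(self, func):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("call prevented")
            self.state = "half-open"  # reset timeout elapsed: allow one trial call
        try:
            result = func()
        except Exception:
            self.fails += 1
            if self.state == "half-open" or self.fails >= self.max_fails:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        self.state = "closed"  # any success closes the breaker and clears the count
        self.fails = 0
        return result


now = [0.0]  # fake clock, in seconds
breaker = CircuitBreaker(max_fails=3, reset_timeout=30.0, clock=lambda: now[0])
attempts = {"unhealthy": 0}


def healthy():
    return "12:00:00"


def timed_out():
    attempts["unhealthy"] += 1
    raise TimeoutError("time service took longer than 3 s")


def request(func):
    """Send one request through the breaker and report the resulting state."""
    try:
        breaker.call(func)
    except Exception:
        pass  # the caller would present "Unavailable" here
    return breaker.state


states = [request(healthy) for _ in range(3)]      # requests 1-3
states += [request(timed_out) for _ in range(3)]   # requests 4-6: third failure opens
states += [request(timed_out) for _ in range(3)]   # requests 7-9: prevented
now[0] = 30.0
states.append(request(timed_out))                  # request 10: trial call fails
states += [request(timed_out) for _ in range(3)]   # requests 11-13: prevented
now[0] = 60.0
states += [request(healthy) for _ in range(3)]     # requests 14-16: trial call succeeds
print(states)
print(attempts["unhealthy"])
```

Of the ten requests sent while the time service is unhealthy, only four reach it; the other six are prevented by the open breaker.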
  31. 65.

    Timeouts + Circuit Breaker Pattern

    • Graceful degradation of user experience
    • Fail fast and rapidly recover
    • Reduces load on unhealthy service
    • Avoid unhealthy service affecting the system
    • (Bonus) Interface to monitor and measure points of integration in a system
  32. 66.

    Timeouts + Circuit Breaker Pattern

    • Underprovisioned services can cause “flapping”
      ◦ Service is overwhelmed due to load
      ◦ Circuit breaker trips to “open”
      ◦ Service no longer receiving requests, so it “recovers”
      ◦ Circuit breaker “closes”
    • Important to understand service performance constraints
  33. 67.

    Parameters for Timeouts + Circuit Breaker

    • What do you present during failure?
    • How many times do you accept failure? (max_fails)
    • How long until you attempt reset? (reset_timeout)
    • How long will you wait? (timeout)
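These four questions map naturally onto a small settings object. The Python sketch below is illustrative only: the `BreakerSettings` class, its `fallback` field, and the `wait_before_open` helper are hypothetical, while `max_fails`, `reset_timeout`, and `timeout` reuse the parameter names from the slide. The helper shows one consequence of the choices, assuming sequential requests that all time out: the total time callers spend waiting before the breaker opens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerSettings:
    fallback: str         # what to present during failure
    max_fails: int        # failures accepted before the breaker opens
    reset_timeout: float  # seconds to stay open before attempting a reset
    timeout: float        # seconds to wait on a single call

    def __post_init__(self):
        if self.max_fails < 1:
            raise ValueError("max_fails must be at least 1")
        if self.timeout <= 0 or self.reset_timeout <= 0:
            raise ValueError("timeout and reset_timeout must be positive")

    def wait_before_open(self):
        """Seconds spent waiting on timed-out calls before the breaker opens."""
        return self.max_fails * self.timeout


# Values used in the talk's walkthrough: 3 failures, 30 s reset, 3 s timeout.
settings = BreakerSettings(
    fallback="Unavailable", max_fails=3, reset_timeout=30.0, timeout=3.0
)
print(settings.wait_before_open())
```

With the walkthrough's values, three consecutive 3-second timeouts mean 9 seconds of waiting before requests start being prevented.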
  34. 73.

    Circuit Breaker Libraries

    • Python - https://github.com/danielfm/pybreaker
    • Go - https://github.com/rubyist/circuitbreaker
    • Java - https://github.com/Netflix/Hystrix
    • Ruby - https://github.com/wsargent/circuit_breaker
    • .NET - https://github.com/michael-wolfenden/Polly
    • JavaScript - https://github.com/yammer/circuit-breaker-js
    • PHP - https://github.com/ejsmont-artur/php-circuit-breaker
  35. 78.

    #   Request  Time Service   User Service  Circuit Breaker State
    1.  GET /    50 ms          50 ms         Closed
    2.  GET /    1 s            50 ms         Closed
    3.  GET /    2 s            50 ms         Closed
    4.  GET /    3 s (timeout)  50 ms         Closed (1 Failure)
    5.  GET /    3 s (timeout)  50 ms         Closed (2 Failures)
    6.  GET /    3 s (timeout)  50 ms         Closed => Open (3 Failures)
    7.  GET /    Skipped        50 ms         Open
    8.  GET /    Skipped        50 ms         Open
  36. 79.

    #   Request  Time Service   User Service  Circuit Breaker State
    8.  GET /    Prevented      50 ms         Open
    9.  GET /    Prevented      50 ms         Open
        ~30 seconds passed, so reset timeout triggers  Open => Half-Open
    10. GET /    3 s (timeout)  50 ms         Half-Open => Open
    11. GET /    Prevented      50 ms         Open
    12. GET /    Prevented      50 ms         Open
  37. 80.

    #   Request  Time Service   User Service  Circuit Breaker State
    13. GET /    Prevented      50 ms         Open
        Ops saves the day, so time service is healthy  Open
        ~30 seconds passed, so reset timeout triggers  Open => Half-Open
    14. GET /    50 ms          50 ms         Half-Open => Closed
    15. GET /    50 ms          50 ms         Closed
    16. GET /    50 ms          50 ms         Closed
  38. 81.

    Fallacies of Distributed Computing

    1. The network is reliable.
    2. Latency is zero.
    3. Bandwidth is infinite.
    4. The network is secure.
    5. Topology doesn't change.
    6. There is one administrator.
    7. Transport cost is zero.
    8. The network is homogeneous.

    https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing