Fault Tolerance in a High Volume, Distributed System

Presentation given at LinkedIn in February 2012 about how the Netflix API achieves fault tolerance.

Ben Christensen

June 27, 2012

Transcript

  1. Fault Tolerance in a High Volume, Distributed System

    Ben Christensen, Software Engineer – API Platform at Netflix
    @benjchristensen
    http://www.linkedin.com/in/benjchristensen
  2. Dozens of dependencies. One going down takes everything down.

    99.99%^30 = 99.7% uptime
    0.3% of 1 billion = 3,000,000 failures
    2+ hours downtime/month even if all dependencies have excellent uptime.
    Reality is generally worse.
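
    The arithmetic on this slide can be reproduced with a short sketch. It assumes 30 independent dependencies at 99.99% uptime each, 1 billion requests and a 30-day month; the class and method names are illustrative and not taken from the talk.

```java
public class CompoundAvailability {

    /** Uptime of a request path that needs every one of n independent dependencies. */
    public static double combinedUptime(double perDependencyUptime, int dependencies) {
        return Math.pow(perDependencyUptime, dependencies);
    }

    public static void main(String[] args) {
        double uptime = combinedUptime(0.9999, 30);
        double failureRate = 1.0 - uptime;

        // Assumed workload: 1 billion requests, 30-day month (720 hours).
        long failedRequests = Math.round(failureRate * 1_000_000_000L);
        double downtimeHours = failureRate * 30 * 24;

        System.out.printf("combined uptime: %.2f%%%n", uptime * 100);
        System.out.printf("failed requests per billion: %,d%n", failedRequests);
        System.out.printf("downtime per month: %.2f hours%n", downtimeHours);
    }
}
```

    The exact figures land slightly under the slide's rounded numbers (just under 3,000,000 failures and a little over 2 hours), which is consistent with the slide rounding the failure rate to 0.3%.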
  3. 3

  4. 4

  5. 5

  6. No single dependency should take down the entire app.

    Fail fast. Fail silent. Fallback. Shed load.
  7. Semaphores (Tryable): Limited Concurrency

        TryableSemaphore executionSemaphore = getExecutionSemaphore();
        // acquire a permit
        if (executionSemaphore.tryAcquire()) {
            try {
                return executeCommand();
            } finally {
                executionSemaphore.release();
            }
        } else {
            circuitBreaker.markSemaphoreRejection();
            // permit not available so return fallback
            return getFallback();
        }
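
    A self-contained version of the same idea might look like the sketch below. It swaps the slide's `TryableSemaphore` for the standard `java.util.concurrent.Semaphore` and leaves out the circuit breaker bookkeeping; `SemaphoreGuard` and its constructor parameters are hypothetical names chosen for illustration.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

public class SemaphoreGuard<T> {
    private final Semaphore permits;
    private final Supplier<T> command;
    private final Supplier<T> fallback;

    public SemaphoreGuard(int maxConcurrent, Supplier<T> command, Supplier<T> fallback) {
        this.permits = new Semaphore(maxConcurrent);
        this.command = command;
        this.fallback = fallback;
    }

    /** Runs the command if a permit is free; otherwise returns the fallback without blocking. */
    public T execute() {
        if (permits.tryAcquire()) {
            try {
                return command.get();
            } finally {
                permits.release(); // always hand the permit back, even if the command throws
            }
        }
        return fallback.get(); // no permit available: shed load instead of queueing
    }

    public static void main(String[] args) {
        SemaphoreGuard<String> available = new SemaphoreGuard<>(1, () -> "live", () -> "fallback");
        System.out.println(available.execute()); // prints "live"

        // Zero permits stands in for "all permits currently held by other callers".
        SemaphoreGuard<String> saturated = new SemaphoreGuard<>(0, () -> "live", () -> "fallback");
        System.out.println(saturated.execute()); // prints "fallback"
    }
}
```

    Because `tryAcquire()` never blocks, the calling thread is not held up when the limit is reached, which matches the "fail fast" goal stated earlier in the deck.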
  10. Separate Threads: Limited Concurrency

        try {
            if (!threadPool.isQueueSpaceAvailable()) {
                // we are at the property defined max so want to throw the RejectedExecutionException to simulate
                // reaching the real max and go through the same codepath and behavior
                throw new RejectedExecutionException("Rejected command because thread-pool queueSize is at rejection threshold.");
            }
            ... define Callable that performs executeCommand() ...
            // submit the work to the thread-pool
            return threadPool.submit(command);
        } catch (RejectedExecutionException e) {
            circuitBreaker.markThreadPoolRejection();
            // rejected so return fallback
            return getFallback();
        }
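
    The rejection path can be tried out with a plain `ThreadPoolExecutor`, as in the following sketch. It relies on the executor's default abort policy to throw `RejectedExecutionException` when the bounded queue is full, rather than the slide's property-driven `isQueueSpaceAvailable()` pre-check; `BoundedPoolGuard` is a hypothetical name, and wrapping the fallback in a completed future is an assumption made to keep the return type uniform.

```java
import java.util.concurrent.*;

public class BoundedPoolGuard {
    private final ThreadPoolExecutor pool;

    public BoundedPoolGuard(int threads, int queueSize) {
        // Fixed-size pool with a bounded queue; the default AbortPolicy rejects overflow.
        this.pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize));
    }

    /** Submits the command, or returns an already-completed fallback if the pool is saturated. */
    public <T> Future<T> submit(Callable<T> command, T fallback) {
        try {
            return pool.submit(command);
        } catch (RejectedExecutionException e) {
            // rejected so return fallback
            return CompletableFuture.completedFuture(fallback);
        }
    }

    public void shutdown() {
        pool.shutdownNow();
    }

    public static void main(String[] args) throws Exception {
        BoundedPoolGuard guard = new BoundedPoolGuard(1, 1);
        CountDownLatch release = new CountDownLatch(1);
        Callable<String> slow = () -> { release.await(); return "live"; };

        Future<String> first = guard.submit(slow, "fallback");  // occupies the only thread
        Future<String> second = guard.submit(slow, "fallback"); // waits in the queue
        Future<String> third = guard.submit(slow, "fallback");  // pool and queue full: rejected

        System.out.println(third.get()); // prints "fallback"
        release.countDown();
        System.out.println(first.get());  // prints "live"
        System.out.println(second.get()); // prints "live"
        guard.shutdown();
    }
}
```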
  13. Separate Threads: Timeout

        Override of Future.get()

        public K get() throws CancellationException, InterruptedException, ExecutionException {
            try {
                long timeout = getCircuitBreaker().getCommandTimeoutInMilliseconds();
                return get(timeout, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // report timeout failure
                circuitBreaker.markTimeout(System.currentTimeMillis() - startTime);
                // retrieve the fallback
                return getFallback();
            }
        }
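
    The same timeout-then-fallback behavior can be sketched as a helper around any `Future`, instead of an override of `Future.get()`. `TimeoutGuard` and `getOrFallback` are hypothetical names, and the `cancel(true)` call is an addition that the slide does not show; it is included here so the worker thread is freed in the demo.

```java
import java.util.concurrent.*;

public class TimeoutGuard {

    /** Waits up to timeoutMillis for the result; on timeout, gives up and returns the fallback. */
    public static <T> T getOrFallback(Future<T> future, long timeoutMillis, T fallback)
            throws InterruptedException, ExecutionException {
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Assumption, not on the slide: interrupt the worker so it stops holding a thread.
            future.cancel(true);
            return fallback;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> slow = pool.submit(() -> { Thread.sleep(5_000); return "live"; });
            System.out.println(getOrFallback(slow, 50, "fallback")); // slow call times out

            Future<String> fast = pool.submit(() -> "live");
            System.out.println(getOrFallback(fast, 2_000, "fallback")); // fast call completes
        } finally {
            pool.shutdownNow();
        }
    }
}
```

    The caller waits at most the configured timeout no matter how long the dependency takes, which is the "give up and move on" behavior the deck describes.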
  16. Circuit Breaker

        if (circuitBreaker.allowRequest()) {
            return executeCommand();
        } else {
            // short-circuit and go directly to fallback
            circuitBreaker.markShortCircuited();
            return getFallback();
        }
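
    The slide shows only the call site, not how `allowRequest()` decides. One possible policy is sketched below: trip after a number of consecutive failures and let requests through again once a sleep window has passed. The threshold, the sleep window, the injected clock and the `SimpleCircuitBreaker` class are all assumptions for illustration, not the Netflix implementation.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

public class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long sleepWindowMillis;
    private final LongSupplier clock; // injected so the demo does not need to sleep
    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAt = 0;

    public SimpleCircuitBreaker(int failureThreshold, long sleepWindowMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    /** True when closed, or when open long enough that a trial request may go through. */
    public synchronized boolean allowRequest() {
        return !open || clock.getAsLong() - openedAt >= sleepWindowMillis;
    }

    public synchronized void markSuccess() {
        consecutiveFailures = 0;
        open = false;
    }

    public synchronized void markFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            open = true;
            openedAt = clock.getAsLong(); // (re)start the sleep window
        }
    }

    public <T> T execute(Supplier<T> command, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get(); // short-circuit and go directly to fallback
        }
        try {
            T result = command.get();
            markSuccess();
            return result;
        } catch (RuntimeException e) {
            markFailure();
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        long[] now = {0};
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(2, 1_000, () -> now[0]);
        Supplier<String> failing = () -> { throw new IllegalStateException("dependency down"); };
        Supplier<String> healthy = () -> "live";
        Supplier<String> fallback = () -> "fallback";

        System.out.println(breaker.execute(failing, fallback)); // "fallback" (failure 1)
        System.out.println(breaker.execute(failing, fallback)); // "fallback" (failure 2, trips)
        System.out.println(breaker.execute(healthy, fallback)); // "fallback" (short-circuited)
        now[0] = 1_000;                                         // sleep window has elapsed
        System.out.println(breaker.execute(healthy, fallback)); // "live" (breaker closes)
    }
}
```

    This sketch lets every request through once the sleep window has elapsed until one of them fails; a production breaker would more likely admit a single trial request at a time.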
  19. 25

  20. Tryable semaphores for “trusted” clients and fallbacks

    Separate threads for “untrusted” clients
    Aggressive timeouts on threads and network calls to “give up and move on”
    Circuit breakers as the “release valve”
  21. 27

  22. 28

  23. 29

  24. Benefits of Separate Threads

    Protection from client libraries
    Lower risk to accept new/updated clients
    Quick recovery from failure
    Client misconfiguration
    Client service performance characteristic changes
    Built-in concurrency
  25. Drawbacks of Separate Threads

    Some computational overhead
    Load on machine can be pushed too far
    ...
    Benefits outweigh drawbacks when clients are “untrusted”
  26. 32

  27. Rolling 10 second counter – 1 second granularity

    Median, Mean, 90th, 99th, 99.5th
    Latent, Error, Timeout, Rejected
    Error Percentage = (error + timeout + rejected) / (success + latent success + error + timeout + rejected)
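
    A rolling counter with these properties could be sketched as ten one-second buckets in a ring, with the error percentage computed from the formula above. The class, the enum and the choice to pass the current time in explicitly are illustrative assumptions; the talk does not show the actual implementation.

```java
import java.util.Arrays;

public class RollingCounter {
    public enum Event { SUCCESS, LATENT_SUCCESS, ERROR, TIMEOUT, REJECTED }

    private static final int BUCKETS = 10;           // 10 second window
    private static final long BUCKET_MILLIS = 1_000; // 1 second granularity

    private final long[][] counts = new long[BUCKETS][Event.values().length];
    private final long[] bucketSecond = new long[BUCKETS]; // which second each slot holds

    /** Records one event at the given time (milliseconds, non-negative). */
    public synchronized void record(Event event, long nowMillis) {
        long second = nowMillis / BUCKET_MILLIS;
        int slot = (int) (second % BUCKETS);
        if (bucketSecond[slot] != second) {
            Arrays.fill(counts[slot], 0); // slot still holds an expired second: reuse it
            bucketSecond[slot] = second;
        }
        counts[slot][event.ordinal()]++;
    }

    /** Sums one event type over the buckets that fall inside the last 10 seconds. */
    public synchronized long sum(Event event, long nowMillis) {
        long second = nowMillis / BUCKET_MILLIS;
        long total = 0;
        for (int slot = 0; slot < BUCKETS; slot++) {
            if (second - bucketSecond[slot] < BUCKETS) {
                total += counts[slot][event.ordinal()];
            }
        }
        return total;
    }

    /** (error + timeout + rejected) / (success + latent success + error + timeout + rejected), as a percentage. */
    public synchronized double errorPercentage(long nowMillis) {
        long failed = sum(Event.ERROR, nowMillis) + sum(Event.TIMEOUT, nowMillis)
                + sum(Event.REJECTED, nowMillis);
        long total = failed + sum(Event.SUCCESS, nowMillis) + sum(Event.LATENT_SUCCESS, nowMillis);
        return total == 0 ? 0.0 : 100.0 * failed / total;
    }

    public static void main(String[] args) {
        RollingCounter counter = new RollingCounter();
        for (int i = 0; i < 8; i++) {
            counter.record(Event.SUCCESS, 100);
        }
        counter.record(Event.LATENT_SUCCESS, 200);
        counter.record(Event.TIMEOUT, 500);

        System.out.println(counter.errorPercentage(900));    // prints 10.0 (1 failure in 10 requests)
        System.out.println(counter.errorPercentage(10_500)); // prints 0.0 (first second has rolled out)
    }
}
```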
  28. 47

  29. Questions & More Information

    Fault Tolerance in a High Volume, Distributed System
    http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html

    Making the Netflix API More Resilient
    http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html

    Ben Christensen
    @benjchristensen
    http://www.linkedin.com/in/benjchristensen