Slide 1

Slide 1 text

Fault Tolerance in a High Volume, Distributed System
Ben Christensen
Software Engineer – API Platform at Netflix
@benjchristensen
http://www.linkedin.com/in/benjchristensen

Slide 2

Slide 2 text

Dozens of dependencies. One going down takes everything down.
(99.99%)^30 = 99.7% uptime
0.3% of 1 billion = 3,000,000 failures
2+ hours downtime/month even if all dependencies have excellent uptime.
Reality is generally worse.
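The arithmetic on this slide can be checked with a small sketch. The class and method names are hypothetical, and the downtime figure assumes a 30-day month; the slide rounds the unavailability to 0.3%, so the computed failure count comes out slightly under 3,000,000.

```java
// Hypothetical helper for the slide's arithmetic; not code from the talk.
final class CompoundUptime {
    /** Availability of a request path that needs every one of n dependencies to be up. */
    static double combined(double perDependency, int dependencies) {
        return Math.pow(perDependency, dependencies);
    }

    public static void main(String[] args) {
        double uptime = combined(0.9999, 30);                      // 30 dependencies at 99.99% each
        double failuresPerBillion = (1.0 - uptime) * 1_000_000_000L;
        double downtimeHoursPerMonth = (1.0 - uptime) * 30 * 24;   // assumes a 30-day month
        System.out.printf("uptime=%.2f%% failures=%.0f downtime=%.2f h/month%n",
                uptime * 100, failuresPerBillion, downtimeHoursPerMonth);
    }
}
```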

Slide 3

Slide 3 text


Slide 4

Slide 4 text


Slide 5

Slide 5 text


Slide 6

Slide 6 text

No single dependency should take down the entire app.
Fail fast.
Fail silent.
Fallback.
Shed load.

Slide 7

Slide 7 text

Options
Aggressive Network Timeouts
Semaphores (Tryable)
Separate Threads
Circuit Breaker

Slide 8

Slide 8 text

Options
Aggressive Network Timeouts
Semaphores (Tryable)
Separate Threads
Circuit Breaker
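The deck shows no code for aggressive network timeouts, so the following is only a minimal sketch using the JDK's HttpURLConnection. The class name, the URL, and the timeout values are assumptions chosen for illustration, not Netflix settings.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper: opens a connection object with short connect and read timeouts.
final class AggressiveTimeouts {
    static HttpURLConnection open(String url, int connectMillis, int readMillis) {
        try {
            // openConnection() only creates the object; no network traffic happens yet.
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(connectMillis); // give up quickly if the host is unreachable
            conn.setReadTimeout(readMillis);       // give up quickly if the response stalls
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would then treat the resulting SocketTimeoutException as a failure and return a fallback rather than wait.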

Slide 9

Slide 9 text

Options
Aggressive Network Timeouts
Semaphores (Tryable)
Separate Threads
Circuit Breaker

Slide 10

Slide 10 text

Semaphores (Tryable): Limited Concurrency

    TryableSemaphore executionSemaphore = getExecutionSemaphore();
    // acquire a permit
    if (executionSemaphore.tryAcquire()) {
        try {
            return executeCommand();
        } finally {
            executionSemaphore.release();
        }
    } else {
        circuitBreaker.markSemaphoreRejection();
        // permit not available so return fallback
        return getFallback();
    }

Slide 11

Slide 11 text

Semaphores (Tryable): Limited Concurrency

Same code as Slide 10, with this part highlighted:

    if (executionSemaphore.tryAcquire()) {
    } else {
    }

Slide 12

Slide 12 text

Semaphores (Tryable): Limited Concurrency

Same code as Slide 10, with this part highlighted:

    if (executionSemaphore.tryAcquire()) {
    } else {
        return getFallback();
    }
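TryableSemaphore is a Netflix class that the slides do not show. A runnable sketch of the same idea, assuming the JDK's java.util.concurrent.Semaphore as a stand-in, could look like this; the SemaphoreGuard name and the Supplier-based signature are hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical wrapper: caps concurrent executions and never blocks the caller.
final class SemaphoreGuard {
    private final Semaphore permits;

    SemaphoreGuard(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    <T> T execute(Supplier<T> command, Supplier<T> fallback) {
        // tryAcquire() returns immediately: true if a permit was taken, false otherwise.
        if (permits.tryAcquire()) {
            try {
                return command.get();
            } finally {
                permits.release(); // always hand the permit back
            }
        }
        // No permit available: shed the load and return the fallback.
        return fallback.get();
    }
}
```

Because the caller never waits for a permit, a slow dependency can tie up at most maxConcurrent calling threads.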

Slide 13

Slide 13 text

Options
Aggressive Network Timeouts
Semaphores (Tryable)
Separate Threads
Circuit Breaker

Slide 14

Slide 14 text

Separate Threads: Limited Concurrency

    try {
        if (!threadPool.isQueueSpaceAvailable()) {
            // we are at the property defined max so want to throw the RejectedExecutionException to simulate
            // reaching the real max and go through the same codepath and behavior
            throw new RejectedExecutionException("Rejected command because thread-pool queueSize is at rejection threshold.");
        }
        ... define Callable that performs executeCommand() ...
        // submit the work to the thread-pool
        return threadPool.submit(command);
    } catch (RejectedExecutionException e) {
        circuitBreaker.markThreadPoolRejection();
        // rejected so return fallback
        return getFallback();
    }

Slide 15

Slide 15 text

Separate Threads: Limited Concurrency

Same code as Slide 14, with this part highlighted:

    if (!threadPool.isQueueSpaceAvailable()) {
        throw new RejectedExecutionException
    }
    } catch (RejectedExecutionException e) {
    }

Slide 16

Slide 16 text

Separate Threads: Limited Concurrency

Same code as Slide 14, with this part highlighted:

    if (!threadPool.isQueueSpaceAvailable()) {
        throw new RejectedExecutionException
    }
    } catch (RejectedExecutionException e) {
        return getFallback();
    }
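The slide's threadPool and its isQueueSpaceAvailable() check are Netflix internals. A minimal sketch of the same rejection path, assuming a plain ThreadPoolExecutor with a bounded queue and the default AbortPolicy, is shown here; BoundedPool and its method names are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper: a fixed-size pool whose bounded queue limits outstanding work.
final class BoundedPool {
    private final ThreadPoolExecutor pool;

    BoundedPool(int threads, int queueSize) {
        // The default handler (AbortPolicy) throws RejectedExecutionException
        // once all threads are busy and the queue is full.
        pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize));
    }

    <T> Future<T> submit(Callable<T> command, T fallback) {
        try {
            return pool.submit(command);
        } catch (RejectedExecutionException e) {
            // Rejected: hand back an already-completed future holding the fallback.
            return CompletableFuture.completedFuture(fallback);
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

With one thread and a queue of one, a third concurrent submission is rejected and resolves to the fallback immediately.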

Slide 17

Slide 17 text

Separate Threads: Timeout
Override of Future.get()

    public K get() throws CancellationException, InterruptedException, ExecutionException {
        try {
            long timeout = getCircuitBreaker().getCommandTimeoutInMilliseconds();
            return get(timeout, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // report timeout failure
            circuitBreaker.markTimeout(System.currentTimeMillis() - startTime);
            // retrieve the fallback
            return getFallback();
        }
    }

Slide 18

Slide 18 text

Separate Threads: Timeout
Override of Future.get()

Same code as Slide 17, with this part highlighted:

    try {
        return get(timeout, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
    }
    }

Slide 19

Slide 19 text

Separate Threads: Timeout
Override of Future.get()

Same code as Slide 17, with this part highlighted:

    try {
        return get(timeout, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
        return getFallback();
    }
    }
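A standalone sketch of the timeout-then-fallback pattern is given below. It assumes a static helper instead of the slide's Future.get() override, and unlike the slide it also returns the fallback on execution failure and interruption; TimedCall and getOrFallback are hypothetical names.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: waits a bounded time for a result, then gives up and moves on.
final class TimedCall {
    static <T> T getOrFallback(Future<T> future, long timeoutMillis, T fallback) {
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // ask the worker to stop; the caller does not wait for it
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // assumption: a failed command is also served by the fallback
        }
    }
}
```

The timeout bounds how long the calling thread is held up; the worker thread may keep running until its own network timeout fires.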

Slide 20

Slide 20 text

Options
Aggressive Network Timeouts
Semaphores (Tryable)
Separate Threads
Circuit Breaker

Slide 21

Slide 21 text

Circuit Breaker

    if (circuitBreaker.allowRequest()) {
        return executeCommand();
    } else {
        // short-circuit and go directly to fallback
        circuitBreaker.markShortCircuited();
        return getFallback();
    }

Slide 22

Slide 22 text

Circuit Breaker

Same code as Slide 21, with this part highlighted:

    if (circuitBreaker.allowRequest()) {
    } else {
    }

Slide 23

Slide 23 text

Circuit Breaker

Same code as Slide 21, with this part highlighted:

    if (circuitBreaker.allowRequest()) {
    } else {
        return getFallback();
    }
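The slides do not show how allowRequest() reaches its decision. The sketch below is a deliberate simplification, not the Netflix implementation: it trips after a fixed number of consecutive failures and lets requests through again after a sleep window. The class name, the thresholds, and the injectable clock are all assumptions.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical, simplified breaker: opens on consecutive failures, retries after a sleep window.
final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long sleepWindowMillis;
    private final LongSupplier clock;     // injectable so the behavior can be tested without sleeping
    private int consecutiveFailures;
    private long openedAt = -1;           // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    synchronized boolean allowRequest() {
        if (openedAt < 0) return true;                              // closed: let traffic through
        return clock.getAsLong() - openedAt >= sleepWindowMillis;   // open: allow a trial after the window
    }

    synchronized void markSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;                                              // close the circuit again
    }

    synchronized void markFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) openedAt = clock.getAsLong();
    }

    <T> T execute(Supplier<T> command, Supplier<T> fallback) {
        if (!allowRequest()) return fallback.get();                 // short-circuit straight to fallback
        try {
            T result = command.get();
            markSuccess();
            return result;
        } catch (RuntimeException e) {
            markFailure();
            return fallback.get();
        }
    }
}
```

While the circuit is open the dependency receives no traffic at all, which is what lets it act as a release valve.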

Slide 24

Slide 24 text

Netflix uses all 4 in combination

Slide 25

Slide 25 text


Slide 26

Slide 26 text

Tryable semaphores for “trusted” clients and fallbacks
Separate threads for “untrusted” clients
Aggressive timeouts on threads and network calls to “give up and move on”
Circuit breakers as the “release valve”

Slide 27

Slide 27 text


Slide 28

Slide 28 text


Slide 29

Slide 29 text


Slide 30

Slide 30 text

Benefits of Separate Threads
Protection from client libraries
Lower risk to accept new/updated clients
Quick recovery from failure
Client misconfiguration
Client service performance characteristic changes
Built-in concurrency

Slide 31

Slide 31 text

Drawbacks of Separate Threads
Some computational overhead
Load on machine can be pushed too far
...
Benefits outweigh drawbacks when clients are “untrusted”

Slide 32

Slide 32 text


Slide 33

Slide 33 text

Visualizing Circuits in Realtime (generally sub-second latency)
Video available at https://vimeo.com/33576628

Slide 34

Slide 34 text

Rolling 10 second counter – 1 second granularity
Median, Mean, 90th, 99th, 99.5th
Latent, Error, Timeout, Rejected
Error Percentage: (error + timeout + rejected) / (success + latent success + error + timeout + rejected)
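A rolling counter with ten one-second buckets, and the error percentage derived from it, can be sketched as follows. This is an illustration under stated assumptions rather than the Netflix code: the class name is hypothetical, "ok" stands for success plus latent success, "failed" stands for error plus timeout plus rejected, and the clock is injected and assumed to return non-negative milliseconds.

```java
import java.util.Arrays;
import java.util.function.LongSupplier;

// Hypothetical rolling counter: 10 buckets of 1 second each, reused in a ring.
final class RollingErrorRate {
    private static final int BUCKETS = 10;
    private static final long BUCKET_MILLIS = 1000;
    private final long[] ok = new long[BUCKETS];
    private final long[] failed = new long[BUCKETS];
    private final long[] bucketStart = new long[BUCKETS];
    private final LongSupplier clock;

    RollingErrorRate(LongSupplier clock) {
        this.clock = clock;
        Arrays.fill(bucketStart, Long.MIN_VALUE); // no bucket has been used yet
    }

    private int currentBucket() {
        long now = clock.getAsLong();
        long start = now - (now % BUCKET_MILLIS);
        int i = (int) ((now / BUCKET_MILLIS) % BUCKETS);
        if (bucketStart[i] != start) {            // slot holds an older second: reset and reuse it
            bucketStart[i] = start;
            ok[i] = 0;
            failed[i] = 0;
        }
        return i;
    }

    synchronized void markOk() { ok[currentBucket()]++; }

    synchronized void markFailed() { failed[currentBucket()]++; }

    /** (failed) / (ok + failed) over the last 10 seconds, as a percentage. */
    synchronized double errorPercentage() {
        long now = clock.getAsLong();
        long oldest = now - (now % BUCKET_MILLIS) - (BUCKETS - 1) * BUCKET_MILLIS;
        long good = 0, bad = 0;
        for (int i = 0; i < BUCKETS; i++) {
            if (bucketStart[i] >= oldest && bucketStart[i] <= now) { // skip stale buckets
                good += ok[i];
                bad += failed[i];
            }
        }
        long total = good + bad;
        return total == 0 ? 0.0 : 100.0 * bad / total;
    }
}
```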

Slide 35

Slide 35 text

Netflix DependencyCommand Implementation

Slide 41

Slide 41 text

Netflix DependencyCommand Implementation
Fallbacks:
Cache
Eventual Consistency
Stubbed Data
Empty Response
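Two of the fallback styles listed here, cache and stubbed data, can be combined in a small sketch. It assumes an in-memory map of last known good values; CachedFallback and its fields are hypothetical and are not part of DependencyCommand.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical fallback chain: fresh value, else last known good value, else a stub.
final class CachedFallback<K, V> {
    private final Map<K, V> lastKnownGood = new ConcurrentHashMap<>();
    private final Function<K, V> primary;
    private final V stub;

    CachedFallback(Function<K, V> primary, V stub) {
        this.primary = primary;
        this.stub = stub;
    }

    V get(K key) {
        try {
            // Assumption: the primary call never returns null (ConcurrentHashMap rejects nulls).
            V fresh = primary.apply(key);
            lastKnownGood.put(key, fresh);   // remember it for later failures
            return fresh;
        } catch (RuntimeException e) {
            // Fail silent: serve possibly stale data, or the stub if nothing was cached.
            return lastKnownGood.getOrDefault(key, stub);
        }
    }
}
```

An empty response is the same pattern with an empty collection or default object passed as the stub.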

Slide 44

Slide 44 text

Rolling Number
Realtime Stats and Decision Making

Slide 45

Slide 45 text

Request Collapsing
Take advantage of resiliency to improve efficiency
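The slides name request collapsing without showing code. One way to sketch the idea, assuming callers can accept a future and that a single batch call can serve many keys, is below. RequestCollapser, submit, and flush are hypothetical names; a real collapser would presumably trigger flush() from a timer over a short window, whereas here it is called explicitly to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical collapser: gathers individual requests and serves them with one batch call.
final class RequestCollapser<K, V> {
    private final Function<List<K>, Map<K, V>> batchCall;
    private final Map<K, CompletableFuture<V>> pending = new LinkedHashMap<>();

    RequestCollapser(Function<List<K>, Map<K, V>> batchCall) {
        this.batchCall = batchCall;
    }

    /** Registers a request; duplicate keys within one window share the same future. */
    synchronized CompletableFuture<V> submit(K key) {
        return pending.computeIfAbsent(key, k -> new CompletableFuture<V>());
    }

    /** Executes one batch call for everything collected so far. */
    void flush() {
        Map<K, CompletableFuture<V>> batch;
        synchronized (this) {
            if (pending.isEmpty()) return;
            batch = new LinkedHashMap<>(pending);
            pending.clear();
        }
        try {
            Map<K, V> results = batchCall.apply(new ArrayList<>(batch.keySet()));
            // A key missing from the batch result completes with null.
            batch.forEach((k, f) -> f.complete(results.get(k)));
        } catch (RuntimeException e) {
            batch.values().forEach(f -> f.completeExceptionally(e));
        }
    }
}
```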

Slide 47

Slide 47 text


Slide 48

Slide 48 text

Fail fast.
Fail silent.
Fallback.
Shed load.

Slide 49

Slide 49 text

Questions & More Information
Fault Tolerance in a High Volume, Distributed System
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
Making the Netflix API More Resilient
http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html
Ben Christensen
@benjchristensen
http://www.linkedin.com/in/benjchristensen