Slide 1

Slide 1 text

Performance and Fault Tolerance for the Netflix API
Ben Christensen
Software Engineer – API Platform at Netflix
@benjchristensen
http://www.linkedin.com/in/benjchristensen
http://techblog.netflix.com/
QCon São Paulo – August 4th 2012

Slide 2

Slide 2 text

2 Netflix has over 27 million video streaming customers in 47 countries across North and South America, the United Kingdom and Ireland who get unlimited access to movies and TV shows from over 800 different devices for $7.99USD/month (about the same converted price in each country's local currency). In June 2012 Netflix customers streamed over 1 billion hours of content.

Slide 3

Slide 3 text

Discovery Streaming 3 Streaming devices talk to 2 major edge services: the first is the Netflix API that provides functionality related to discovering and browsing content while the second handles the playback of video streams.

Slide 4

Slide 4 text

Netflix API Streaming 4 This presentation focuses on the “Discovery” portion of traffic that the Netflix API handles.

Slide 5

Slide 5 text

5 The Netflix API powers the “Discovery” user experience on the 800+ devices up until a user hits the play button at which point the “Streaming” edge service takes over.

Slide 6

Slide 6 text

Open API Netflix Devices API Request Volume by Audience 6 Traffic to the Netflix API is predominantly focused on serving the discovery UIs of Netflix streaming devices. This means it is primarily an internal API used by Netflix development teams.

Slide 7

Slide 7 text

Netflix API Dependency A Dependency D Dependency G Dependency J Dependency M Dependency P Dependency B Dependency E Dependency H Dependency K Dependency N Dependency Q Dependency C Dependency F Dependency I Dependency L Dependency O Dependency R 7 The Netflix API serves all streaming devices and acts as the broker between backend Netflix systems and the user interfaces running on the 800+ devices that support Netflix streaming. More than 1 billion incoming calls are received per day, which in turn fan out to several billion outgoing calls (averaging a ratio of 1:6) to dozens of underlying subsystems, with peaks of over 200k dependency requests per second.

Slide 8

Slide 8 text

Netflix API Dependency A Dependency D Dependency G Dependency J Dependency M Dependency P Dependency B Dependency E Dependency H Dependency K Dependency N Dependency Q Dependency C Dependency F Dependency I Dependency L Dependency O Dependency R 8 First half of the presentation discusses resilience engineering implemented to handle failure and latency at the integration points with the various dependencies.

Slide 9

Slide 9 text

Dozens of dependencies. One going bad takes everything down. 99.99%^30 = 99.7% uptime. 0.3% of 1 billion = 3,000,000 failures. 2+ hours downtime/month even if all dependencies have excellent uptime. Reality is generally worse. 9 Even when all dependencies are performing well the aggregate impact of even 0.01% downtime on each of dozens of services equates to potentially hours a month of downtime if not engineered for resilience.
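The arithmetic on this slide can be checked with a few lines of Java. This is only a worked version of the slide's own round numbers: 30 dependencies at 99.99% uptime each and 1 billion requests. The 30-day month and the class and method names are assumptions made here for illustration.

```java
// Worked version of the slide's arithmetic; names are illustrative only.
class AggregateUptime {
    /** Uptime of a system that needs every one of its independent dependencies to be up. */
    static double aggregateUptime(double perDependencyUptime, int dependencies) {
        return Math.pow(perDependencyUptime, dependencies);
    }

    public static void main(String[] args) {
        double uptime = aggregateUptime(0.9999, 30);            // 99.99% each, 30 dependencies
        double failedRequests = (1 - uptime) * 1_000_000_000L;  // share of 1 billion requests
        double downtimeHours = (1 - uptime) * 30 * 24;          // assumed 30-day month
        System.out.printf("uptime=%.2f%% failures=%.0f downtime=%.1fh%n",
                uptime * 100, failedRequests, downtimeHours);
    }
}
```

The result lands at roughly 99.7% uptime, about 3 million failed requests and a little over 2 hours of downtime per month, which matches the rounded figures on the slide.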

Slide 10

Slide 10 text

10

Slide 11

Slide 11 text

11

Slide 12

Slide 12 text

12 Latency is far worse for system resilience than failure. Failures naturally “fail fast” and shed load whereas latency backs up queues, threads and system resources and if isolation techniques are not used it can cause an entire system to fail.

Slide 13

Slide 13 text

"Timeout guard" daemon prio=10 tid=0x00002aaacd5e5000 nid=0x3aac runnable [0x00002aaac388f000]
   java.lang.Thread.State: RUNNABLE
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    - locked <0x000000055c7e8bd8> (a java.net.SocksSocketImpl)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
    at java.net.Socket.connect(Socket.java:579)
    at java.net.Socket.connect(Socket.java:528)
    at java.net.Socket.<init>(Socket.java:425)
    at java.net.Socket.<init>(Socket.java:280)
    at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
    at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$1.doit(ControllerThreadSocketFactory.java:91)
    at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$SocketTask.run(ControllerThreadSocketFactory.java:158)
    at java.lang.Thread.run(Thread.java:722)

[Sat Jun 30 04:01:37 2012] [error] proxy: HTTP: disabled connection for (127.0.0.1)

> 80% of requests rejected
Median Latency

This is an example of what a system looks like when high latency occurs without load shedding and isolation. Backend latency spiked (from <100ms to >1000ms at the median, >10,000ms at the 90th percentile) and saturated all available resources, resulting in the HTTP layer rejecting over 80% of requests.

Slide 14

Slide 14 text

No single dependency should take down the entire app. Fallback. Fail silent. Fail fast. 14 It is a requirement of high volume, high availability applications to build fault and latency tolerance into their architecture. Infrastructure is an aspect of resilience engineering but it cannot be relied upon by itself - software must be resilient.

Slide 15

Slide 15 text

15 Netflix uses a combination of aggressive network timeouts, tryable semaphores and thread pools to isolate dependencies and limit impact of both failure and latency.

Slide 16

Slide 16 text

Tryable semaphores for “trusted” clients and fallbacks
Separate threads for “untrusted” clients
Aggressive timeouts on threads and network calls to “give up and move on”
Circuit breakers as the “release valve”

Slide 17

Slide 17 text

17 With isolation techniques the application container is now segmented according to how it uses its underlying dependencies instead of using a single shared resource pool to communicate with all of them.

Slide 18

Slide 18 text

18 A single dependency failing will no longer be permitted to take more resources than it was allocated and can have its impact controlled.

Slide 19

Slide 19 text

19 In this case the backend service has become latent and saturates all available threads allocated to it so further requests to it are rejected (the orange line) instead of blocking or using up all available system threads.
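A minimal sketch of this kind of thread-pool isolation, using only the JDK's ThreadPoolExecutor. It is not the Netflix implementation: the class name IsolatedDependency, the call() method and its parameters are hypothetical. Each dependency gets its own small pool with a bounded queue, and when both are full the call is rejected and the fallback is returned immediately instead of blocking the caller.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration: one bounded pool per dependency.
class IsolatedDependency {
    private final ThreadPoolExecutor pool;

    IsolatedDependency(int threads, int queueSize) {
        pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),    // small, bounded queue
                runnable -> {                           // daemon threads so the JVM can exit
                    Thread t = new Thread(runnable, "dependency-pool");
                    t.setDaemon(true);
                    return t;
                },
                new ThreadPoolExecutor.AbortPolicy());  // a full pool rejects, it never blocks
    }

    /** Runs the call on the dependency's own pool; returns the fallback on any problem. */
    String call(Callable<String> remoteCall, String fallback, long timeoutMs) {
        Future<String> future;
        try {
            future = pool.submit(remoteCall);
        } catch (RejectedExecutionException e) {
            return fallback;                            // pool and queue full: shed load now
        }
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);                        // give up and move on
            return fallback;
        } catch (ExecutionException e) {
            return fallback;                            // the call itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        }
    }
}
```

A latent backend can then tie up at most the threads and queue slots given to its own IsolatedDependency instance; calls to other dependencies run on their own pools and are unaffected.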

Slide 20

Slide 20 text

20

Slide 21

Slide 21 text

30 rps x 0.2 seconds = 6 + breathing room = 10 threads
Thread-pool Queue size: 5-10 (0 doesn't work but get close to it)
Thread-pool Size + Queue Size
Queuing is Not Free

Requests in the queue block user threads and thus must be considered part of the resources being allocated to a dependency. Setting a queue to 100 is equivalent to saying 100 incoming requests can block while waiting for this dependency. There is typically not a good reason for having a queue size higher than 5-10. Bursting should be handled through batching and throughput should be accommodated by a large enough thread pool. It is better to increase the thread-pool size rather than the queue size, as commands executing in the thread pool make forward progress whereas items in the queue do not.
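The sizing rule on this slide can be written out as a small calculation. The sketch below uses the slide's own numbers (30 rps, 0.2 seconds, a pool of 10); the breathing room of 4 threads is simply what the slide's "6 + breathing room = 10" implies, and the class and method names are made up for illustration.

```java
// Illustrative helper; the names are assumptions, the numbers come from the slide.
class PoolSizing {
    /** Average number of requests in flight: request rate x latency, rounded up. */
    static int threadsInUse(int requestsPerSecond, int latencyMs) {
        return (requestsPerSecond * latencyMs + 999) / 1000;
    }

    /** Pool size: threads needed on average plus some breathing room. */
    static int poolSize(int requestsPerSecond, int latencyMs, int breathingRoom) {
        return threadsInUse(requestsPerSecond, latencyMs) + breathingRoom;
    }

    /** Callers one dependency can tie up: those executing plus those waiting in the queue. */
    static int maxBlockedCallers(int poolSize, int queueSize) {
        return poolSize + queueSize;
    }

    public static void main(String[] args) {
        int inUse = threadsInUse(30, 200);         // 30 rps x 0.2 s = 6
        int pool = poolSize(30, 200, 4);           // 6 + 4 = 10
        int blocked = maxBlockedCallers(pool, 5);  // 10 + 5 = 15
        System.out.println(inUse + " " + pool + " " + blocked); // prints "6 10 15"
    }
}
```

The last number is the point of "Queuing is Not Free": with a queue of 100 instead of 5, the same dependency could hold 110 user threads rather than 15.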

Slide 22

Slide 22 text

Cost of Thread @ ~60rps mean - median - 90th - 99th (time in ms) Time for thread to execute Time user thread waited 22 The Netflix API has ~30 thread pools with 5-20 threads in each. A common question and concern is what impact this has on performance. Here is a sample of a dependency circuit for 24 hours from the Netflix API production cluster with a rate of 60rps per server. Each execution occurs in a separate thread with mean, median, 90th and 99th percentile latencies shown in the first 4 legend values. The second group of 4 values is the user thread waiting on the dependency thread and shows the total time including queuing, scheduling, execution and waiting for the return value from the Future. The calling thread median, 90th and 99th percentiles are the last 3 legend values. This example was chosen since it is relatively high volume and low latency so the cost of a separate thread is potentially more of a concern than if the backend network latency was 100ms or higher.

Slide 23

Slide 23 text

Cost of Thread @ ~60rps mean - median - 90th - 99th (time in ms) Time for thread to execute Time user thread waited Cost: 0ms 23 At the median (and lower) there is no cost to having a separate thread.

Slide 24

Slide 24 text

Cost of Thread @ ~60rps mean - median - 90th - 99th (time in ms) Time for thread to execute Time user thread waited Cost: 3ms 24 At the 90th percentile there is a cost of 3ms for having a separate thread.

Slide 25

Slide 25 text

Cost of Thread @ ~60rps mean - median - 90th - 99th (time in ms) Time for thread to execute Time user thread waited Cost: 9ms 25 At the 99th percentile there is a cost of 9ms for having a separate thread. Note however that the increase in cost is far smaller than the increase in execution time of the separate thread which jumped from 2 to 28 whereas the cost jumped from 0 to 9. This overhead at the 90th percentile and higher for circuits such as these has been deemed acceptable for the benefits of resilience achieved. For circuits that wrap very low latency requests (such as those primarily hitting in-memory caches) the overhead can be too high and in those cases we choose to use tryable semaphores which do not allow for timeouts but provide most of the resilience benefits without the overhead. The overhead in general though is small enough that we prefer the isolation benefits of a separate thread.

Slide 26

Slide 26 text

Cost of Thread @ ~75rps mean - median - 90th - 99th (time in ms) Time user thread waited Time for thread to execute 26 This is a second sample of a dependency circuit for 24 hours from the Netflix API production cluster with a rate of 75rps per server. As with the first example this was chosen since it is relatively high volume and low latency so the cost of a separate thread is potentially more of a concern than if the backend network latency was 100ms or higher. Each execution occurs in a separate thread with mean, median, 90th and 99th percentile latencies shown in the first 4 legend values. The second group of 4 values is the user thread waiting on the dependency thread and shows the total time including queuing, scheduling, execution and waiting for the return value from the Future. The calling thread median, 90th and 99th percentiles are the last 3 legend values.

Slide 27

Slide 27 text

Cost of Thread @ ~75rps mean - median - 90th - 99th (time in ms) Time user thread waited Time for thread to execute Cost: 0ms 27 At the median (and lower) there is no cost to having a separate thread.

Slide 28

Slide 28 text

Cost of Thread @ ~75rps mean - median - 90th - 99th (time in ms) Time user thread waited Time for thread to execute Cost: 2ms 28 At the 90th percentile there is a cost of 2ms for having a separate thread.

Slide 29

Slide 29 text

Cost of Thread @ ~75rps mean - median - 90th - 99th (time in ms) Time user thread waited Time for thread to execute Cost: 2ms 29 At the 99th percentile there is a cost of 2ms for having a separate thread.

Slide 30

Slide 30 text

Semaphores Effectively No Cost ~5000rps per instance 30 Semaphore isolation on the other hand is used for dependencies which are very high-volume in-memory lookups that never result in a synchronous network request. The cost is practically zero (atomic compare-and-set counter for semaphore).
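A tryable semaphore of this kind can be sketched with a single atomic counter. The class below is an illustration under assumptions, not the Netflix code: the name TryableSemaphore and the execute() helper are hypothetical. The JDK's java.util.concurrent.Semaphore with its non-blocking tryAcquire() could be used the same way.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch: limits concurrent calls without a separate thread or a timeout.
class TryableSemaphore {
    private final int maxConcurrent;
    private final AtomicInteger inFlight = new AtomicInteger();

    TryableSemaphore(int maxConcurrent) {
        this.maxConcurrent = maxConcurrent;
    }

    /** Never blocks: either takes a permit or reports that none is available. */
    boolean tryAcquire() {
        if (inFlight.incrementAndGet() > maxConcurrent) {
            inFlight.decrementAndGet();
            return false;
        }
        return true;
    }

    void release() {
        inFlight.decrementAndGet();
    }

    /** Runs the work on the calling thread, or the fallback if the limit is reached or work fails. */
    <T> T execute(Supplier<T> work, Supplier<T> fallback) {
        if (!tryAcquire()) {
            return fallback.get();
        }
        try {
            return work.get();
        } catch (RuntimeException e) {
            return fallback.get();
        } finally {
            release();
        }
    }
}
```

Because the work runs on the caller's thread, nothing here can time it out; the semaphore only caps how many callers can be inside the dependency at once.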

Slide 31

Slide 31 text

Netflix DependencyCommand Implementation 31

Slide 32

Slide 32 text

Netflix DependencyCommand Implementation

(1) Construct DependencyCommand Object
On each dependency invocation its DependencyCommand object will be constructed with the arguments necessary to make the call to the server. For example:
    DependencyCommand command = new DependencyCommand(arg1, arg2)

(2) Execution Synchronously or Asynchronously
Execution of the command can then be performed synchronously or asynchronously:
    K value = command.execute()
    Future<K> value = command.queue()
The synchronous call execute() invokes queue().get() unless the command is specified to not run in a thread.

(3) Is Circuit Open?
Upon execution of the command it first checks with the circuit-breaker to ask "is the circuit open?". If the circuit is open (tripped) then the command will not be executed and flow is routed to (8) DependencyCommand.getFallback(). If the circuit is closed then the command will be executed and flow continues to (5) DependencyCommand.run().

(4) Is Thread Pool/Queue Full?
If the thread-pool and queue associated with the command are full then the execution will be rejected and immediately routed through fallback (8). If the command does not run within a thread then this logic will be skipped.

(5) DependencyCommand.run()
The concrete implementation run() method is executed.

(5a) Command Timeout
The run() method occurs within a thread with a timeout and if it takes too long the thread will throw a TimeoutException. In that case the response is routed through fallback (8) and the eventual run() method response is discarded. If the command does not run within a thread then this logic will not be applicable.
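Steps (1), (2), (5) and (5a) can be sketched as a small abstract class. This is a simplified illustration, not the real DependencyCommand: the class name SketchCommand, the shared 4-thread pool, the timeout parameter and the GetTitle example are all assumptions, and the circuit-breaker check (3) and the thread-pool rejection check (4) are left out to keep it short.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical, simplified stand-in for a DependencyCommand-style class.
abstract class SketchCommand<K> {
    private static final ExecutorService POOL = Executors.newFixedThreadPool(4, runnable -> {
        Thread t = new Thread(runnable, "sketch-command");
        t.setDaemon(true);
        return t;
    });

    private final long timeoutMs;

    protected SketchCommand(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** (5) The concrete call to the dependency. */
    protected abstract K run() throws Exception;

    /** (8) Fallback; not implementing it leaves the caller to deal with the exception. */
    protected K getFallback() {
        throw new UnsupportedOperationException("no fallback implemented");
    }

    /** (2) Asynchronous execution: the caller gets a Future and no fallback handling. */
    Future<K> queue() {
        Callable<K> task = this::run;
        return POOL.submit(task);
    }

    /** (2) Synchronous execution: queue().get() with a timeout (5a) and fallback routing. */
    K execute() {
        Future<K> future = queue();
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);          // the eventual run() response is discarded
            return getFallback();
        } catch (ExecutionException e) {
            return getFallback();         // run() threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return getFallback();
        }
    }
}

// (1) Example command, constructed with the arguments needed for the call.
class GetTitle extends SketchCommand<String> {
    private final int videoId;

    GetTitle(int videoId) {
        super(200);
        this.videoId = videoId;
    }

    @Override protected String run() { return "title-" + videoId; } // stands in for a network call
    @Override protected String getFallback() { return "unknown title"; }
}
```

With this sketch, new GetTitle(7).execute() returns the title when run() succeeds within 200 ms and "unknown title" when run() throws or times out.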

Slide 33

Slide 33 text

Netflix DependencyCommand Implementation

(6) Is Command Successful?
Application flow is routed based on the response from the run() method.

(6a) Successful Response
If no exceptions are thrown and a response is returned (including a null value) then it proceeds to return the response after some logging and a performance check.

(6b) Failed Response
When run() throws an exception the execution is marked as "failed", which contributes to potentially tripping the circuit open, and application flow is routed to (8) DependencyCommand.getFallback().

(7) Calculate Circuit Health
Successes, failures, rejections and timeouts are all reported to the circuit breaker to maintain a rolling set of counters which calculate statistics. These stats are then used to determine when the circuit should "trip" and become open, at which point subsequent requests are short-circuited until a period of time passes and requests are permitted again after health checks succeed.

(8) DependencyCommand.getFallback()
The fallback is performed whenever a command execution fails (an exception is thrown by (5) DependencyCommand.run()) or when it is (3) short-circuited because the circuit is open. The intent of the fallback is to provide a generic response without any network dependency from an in-memory cache or other static logic.

(8a) Fallback Not Implemented
If DependencyCommand.getFallback() is not implemented then an exception will be thrown and the caller left to deal with it.

(8b) Fallback Successful
If the fallback returns a response then it will be returned to the caller.

(8c) Fallback Failed
If DependencyCommand.getFallback() fails and throws an exception then the caller is left to deal with it. It is considered poor practice to have a fallback implementation that can fail. A fallback should be implemented such that it is not performing any logic that would fail. Semaphores are wrapped around fallback execution to protect against software bugs that do not comply with this principle, particularly if the fallback itself tries to perform a network call that can be latent.

(9) Return Successful Response
If (6a) occurred the successful response will be returned to the caller regardless of whether it was latent or not.
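The circuit-breaker behaviour in (3) and (7) can be sketched as follows. This is an illustration under stated assumptions rather than the actual implementation: the class name, the minimum request count, the error percentage threshold and the sleep window are all hypothetical, and plain counters that reset when the circuit closes stand in for the rolling counters described above. Time is passed in explicitly so the behaviour is easy to follow.

```java
// Hypothetical sketch of a circuit breaker; thresholds and names are assumptions.
class SketchCircuitBreaker {
    private final int minRequests;         // do not trip on a handful of calls
    private final int errorPercentToTrip;  // error percentage at which the circuit opens
    private final long sleepWindowMs;      // how long to short-circuit before retrying
    private int successes;
    private int failures;
    private long openedAtMs = -1;          // -1 means the circuit is closed

    SketchCircuitBreaker(int minRequests, int errorPercentToTrip, long sleepWindowMs) {
        this.minRequests = minRequests;
        this.errorPercentToTrip = errorPercentToTrip;
        this.sleepWindowMs = sleepWindowMs;
    }

    /** (3) Returns true if the call may proceed, false if it must be short-circuited. */
    synchronized boolean allowRequest(long nowMs) {
        if (openedAtMs < 0) {
            return true;                               // closed: everything passes
        }
        return nowMs - openedAtMs >= sleepWindowMs;    // open: trial calls after the sleep window
    }

    /** (7) A success while the circuit is open acts as the passing health check. */
    synchronized void recordSuccess() {
        if (openedAtMs >= 0) {
            openedAtMs = -1;
            successes = 0;
            failures = 0;
        }
        successes++;
    }

    /** (7) Failures, timeouts and rejections are all reported here. */
    synchronized void recordFailure(long nowMs) {
        failures++;
        if (openedAtMs >= 0) {
            openedAtMs = nowMs;                        // trial call failed: restart the sleep window
            return;
        }
        int total = successes + failures;
        if (total >= minRequests && failures * 100 >= errorPercentToTrip * total) {
            openedAtMs = nowMs;                        // trip the circuit open
        }
    }

    synchronized boolean isOpen() {
        return openedAtMs >= 0;
    }
}
```

A command would call allowRequest() before run(), route to getFallback() when it returns false, and report the outcome of every executed call back through recordSuccess() or recordFailure().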

Slide 34

Slide 34 text

Netflix DependencyCommand Implementation Fallbacks Cache Eventual Consistency Stubbed Data Empty Response 34

Slide 35

Slide 35 text

Netflix DependencyCommand Implementation 35

Slide 36

Slide 36 text

So, how does it work in the real world? 36

Slide 37

Slide 37 text

Visualizing Circuits in Near-Realtime (latency is single-digit seconds, generally 1-2) 37 This is an example of our monitoring system which provides low-latency (1-2 seconds typically) visibility into the traffic and health of all DependencyCommand circuits across a cluster.

Slide 38

Slide 38 text

last minute latency percentiles
Request rate
2 minutes of request rate to show relative changes in traffic
circle color and size represent health and traffic volume
hosts reporting from cluster
Error percentage of last 10 seconds
Circuit-breaker status
Rolling 10 second counters with 1 second granularity
Failures/Exceptions
Thread-pool Rejections
Thread timeouts
Successes
Short-circuited (rejected)
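A rolling 10 second counter with 1 second granularity can be sketched as a ring of ten one-second buckets. The code below is a hypothetical illustration of that idea, not the dashboard's actual implementation; the class and method names are assumptions, and timestamps are passed in explicitly to keep the behaviour deterministic.

```java
import java.util.Arrays;

// Hypothetical sketch: ten 1-second buckets reused in a ring.
class RollingCounter {
    private static final int BUCKETS = 10;               // 10 buckets x 1 second
    private final long[] bucketSecond = new long[BUCKETS];
    private final int[] bucketCount = new int[BUCKETS];

    RollingCounter() {
        Arrays.fill(bucketSecond, -1L);                  // no bucket has been used yet
    }

    /** Counts one event at the given time. */
    synchronized void increment(long nowMs) {
        long second = nowMs / 1000;
        int index = (int) (second % BUCKETS);
        if (bucketSecond[index] != second) {             // bucket holds an older second: recycle it
            bucketSecond[index] = second;
            bucketCount[index] = 0;
        }
        bucketCount[index]++;
    }

    /** Sum of all events in the 10 one-second buckets ending at the current second. */
    synchronized int sum(long nowMs) {
        long second = nowMs / 1000;
        int total = 0;
        for (int i = 0; i < BUCKETS; i++) {
            if (second - bucketSecond[i] < BUCKETS) {
                total += bucketCount[i];
            }
        }
        return total;
    }

    /** Error percentage over the window, from separate failure and success counters. */
    static int errorPercentage(RollingCounter failures, RollingCounter successes, long nowMs) {
        int failed = failures.sum(nowMs);
        int total = failed + successes.sum(nowMs);
        return total == 0 ? 0 : failed * 100 / total;
    }
}
```

One counter per event type (successes, failures, timeouts, rejections, short-circuits) would be enough to derive the error percentage shown for the last 10 seconds.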

Slide 39

Slide 39 text

39 This view of the dashboard was captured during a latency monkey simulation to test resilience against latency (http://techblog.netflix.com/2011/07/netflix-simian-army.html) and shows how several of the DependencyCommands degraded in health and showed timeouts, threadpool rejections, short-circuiting and failures. The DependencyCommands of dependencies not affected by latency were unaffected. During this test no users were prevented from using Netflix on any devices. Instead fallbacks and graceful degradation occurred and as soon as latency was removed all systems returned to health within seconds.

Slide 40

Slide 40 text

40 This was another latency monkey simulation that affected a single DependencyCommand.

Slide 41

Slide 41 text

Peak at 100M+ incoming requests (30k+/second) Success drops off, Timeouts and Short Circuiting shed load Latency spikes from ~30ms median to first 2000+ then 10000+ ms 41 These graphs show the full duration of a latency monkey simulation (and look similar to real production events) when latency occurred and the DependencyCommand timed out, short-circuited the requests and returned fallbacks.

Slide 42

Slide 42 text

42

Slide 43

Slide 43 text

Fallback. Fail silent. Fail fast. Shed load. 43

Slide 44

Slide 44 text

Netflix API Dependency A Dependency D Dependency G Dependency J Dependency M Dependency P Dependency B Dependency E Dependency H Dependency K Dependency N Dependency Q Dependency C Dependency F Dependency I Dependency L Dependency O Dependency R 44 Second half of the presentation discusses architectural changes to enable optimizing the API for each Netflix device as opposed to a generic one-size-fits-all API which treats all devices the same.

Slide 45

Slide 45 text

Single Network Request from Clients (use LAN instead of WAN) landing page requires ~dozen API requests Netflix API Device Server 45 The one-size-fits-all API results in chatty clients, some requiring ~dozen requests to render a page.

Slide 46

Slide 46 text

Single Network Request from Clients (use LAN instead of WAN) some clients are limited in the number of concurrent network connections 46

Slide 47

Slide 47 text

Single Network Request from Clients (use LAN instead of WAN) network latency makes this even worse (mobile, home, wifi, geographic distance, etc) 47

Slide 48

Slide 48 text

Single Network Request from Clients (use LAN instead of WAN) push call pattern to server ... Netflix API Device Server 48 The client should make a single request and push the 'chatty' part to the server where low-latency networks and multi-core servers can perform the work far more efficiently.

Slide 49

Slide 49 text

Single Network Request from Clients (use LAN instead of WAN) ... and eliminate redundant calls Netflix API Device Server 49

Slide 50

Slide 50 text

50 With dozens of classes of devices to support it wasn’t feasible for the API team to create custom endpoints for each device, otherwise a single team would be the bottleneck for all client teams and it would be an explosion of complexity for a single team to try and manage. Also, the subject matter expertise of what each device needs does not reside with the API team. Instead, the API team provides a platform and allows each client team to build their own custom endpoints that are optimized to the device they are targeting.

Slide 51

Slide 51 text

Send Only The Bytes That Matter (optimize responses for each client) part of client now on server Netflix API Client Client Device Server 51 The client now extends over the network barrier and runs a portion in the server itself. The client sends requests over HTTP to its other half running in the server which then can access a Java API at a very granular level to access exactly what it needs and return an optimized response suited to the device's exact requirements and user experience.

Slide 52

Slide 52 text

Send Only The Bytes That Matter (optimize responses for each client) client retrieves and delivers exactly what their device needs in its optimal format Netflix API Client Client Device Server 52

Slide 53

Slide 53 text

Send Only The Bytes That Matter (optimize responses for each client) interface is now a Java API that client interacts with at a granular level Netflix API Service Layer Client Client Device Server 53

Slide 54

Slide 54 text

Netflix API Service Layer Client Client Device Server Leverage Concurrency (but abstract away its complexity) 54

Slide 55

Slide 55 text

Leverage Concurrency (but abstract away its complexity) no synchronized, volatile, locks, Futures or Atomic*/Concurrent* classes in client-server code Netflix API Service Layer Client Client Device Server 55 Concurrency is abstracted away behind an asynchronous API and data is retrieved, transformed and composed using higher-order functions (such as map, mapMany, merge, zip, take, toList, etc). Groovy is used for its closure support that lends itself well to the functional programming style.

Slide 56

Slide 56 text

Functional Reactive Programming
composable asynchronous functions
Fully asynchronous API - Clients can’t block

    def video1Call = api.getVideos(api.getUser(), 123456, 7891234);
    def video2Call = api.getVideos(api.getUser(), 6789543);
    // higher-order functions used to compose asynchronous calls together
    wx.merge(video1Call, video2Call).subscribe([
        onNext: { video ->
            // called for each ‘video’ from the merge
            response.getWriter().println("{id: " + video.id + ", title: '" + video.title + "'}");
        },
        onError: { exception ->
            response.getWriter().println("{errorMessage: '" + exception.getMessage() + "'}");
        }
    ])

Service calls are all asynchronous
Functional programming with higher-order functions

Slide 57

Slide 57 text

57

Slide 58

Slide 58 text

Bursts to Single Dependency Duplicate Requests 58

Slide 59

Slide 59 text

Request Collapsing batch don’t burst 59 The DependencyCommand resilience layer is leveraged for concurrency including optimizations such as request collapsing (automated batching) which bundles bursts of calls to the same service into batches without the client code needing to understand or manually optimize for batching. This is particularly important when client code becomes highly concurrent and data is requested in multiple different code paths sometimes written by different engineers. Request collapsing automatically captures and batches the calls together. The collapsing functionality also supports sharded architectures so a batch of requests can be sharded into sub-batches if the client-server relationship requires requests to be routed to a sharded backend.

Slide 60

Slide 60 text

Request Collapsing batch don’t burst 100:1 collapsing ratio (batch size of ~100) 60 This graph shows an extreme example of a dependency where we collapse requests at a ratio of 100:1

Slide 61

Slide 61 text

Request Collapsing batch don’t burst 100:1 collapsing ratio (batch size of ~100) 4000 rps instead of 400,000 rps 61 This is the same graph but on a power scale instead of linear so the blue line (actual network requests) shows up.

Slide 62

Slide 62 text

62 When multiple calls to the same backend occur concurrently or within a short time-window (10ms for example) ...

Slide 63

Slide 63 text

Multiple network calls collapsed into one 63 ... they are collapsed into a single batched request.
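Request collapsing can be sketched as a small class that parks individual requests and periodically sends them as one batch. This is an illustration under assumptions, not the DependencyCommand collapser: the name RequestCollapser, the batchCall function and the scheduleFlush() helper are hypothetical, and sharding of batches is not covered. In this sketch identical keys within a window also share one future.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical sketch: callers ask for one key, the backend sees one batched call.
class RequestCollapser<K, V> {
    private final Function<List<K>, Map<K, V>> batchCall;   // one network call for many keys
    private final Map<K, CompletableFuture<V>> pending = new LinkedHashMap<>();

    RequestCollapser(Function<List<K>, Map<K, V>> batchCall) {
        this.batchCall = batchCall;
    }

    /** Called by client code for a single key; completes when the batch returns. */
    synchronized CompletableFuture<V> submit(K key) {
        return pending.computeIfAbsent(key, k -> new CompletableFuture<>());
    }

    /** Sends everything collected so far as one batch; returns the batch size. */
    int flush() {
        Map<K, CompletableFuture<V>> batch;
        synchronized (this) {
            if (pending.isEmpty()) {
                return 0;
            }
            batch = new LinkedHashMap<>(pending);
            pending.clear();
        }
        try {
            Map<K, V> results = batchCall.apply(new ArrayList<>(batch.keySet()));
            batch.forEach((key, future) -> future.complete(results.get(key)));
        } catch (RuntimeException e) {
            batch.values().forEach(future -> future.completeExceptionally(e));
        }
        return batch.size();
    }

    /** Flushes once per window, for example every 10 ms. */
    ScheduledFuture<?> scheduleFlush(ScheduledExecutorService timer, long windowMs) {
        return timer.scheduleAtFixedRate(this::flush, windowMs, windowMs, TimeUnit.MILLISECONDS);
    }
}
```

Client code only ever calls submit(key) and works with the returned future, so it does not need to know that batching happens at all.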

Slide 64

Slide 64 text

Request Scoped Caching short-lived and concurrency aware 64 Another use of the DependencyCommand layer is to allow client code to perform requests without concern about duplicate network calls due to concurrency. The Future is atomically cached using “putIfAbsent” in the request scope shared via ThreadLocals of each thread so clients can request data in multiple code paths without inefficiency concerns.

Slide 65

Slide 65 text

Request Caching stateless 65 Some examples of request caching de-duplicating backend calls. On some the impact is reasonably high while on most it is a small percentage or none at all, but overall it provided a measurable drop in network calls and, in some client use cases, significantly improved latency by eliminating unnecessary network calls.

Slide 66

Slide 66 text

66 Within a single user request when multiple duplicate calls are executed ...

Slide 67

Slide 67 text

Extra network call de-duped 67 ... they are de-duped through concurrency-aware request-scoped caches.
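The de-duplication can be sketched with a ConcurrentHashMap of Futures and putIfAbsent. This is a minimal illustration, not the DependencyCommand code: the class name RequestScopedCache and its get() method are hypothetical, and sharing the cache across a request's threads (done via ThreadLocals in the notes above) is left out. One instance is assumed to live for a single user request.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

// Hypothetical sketch: one instance per user request, discarded when the request ends.
class RequestScopedCache<K, V> {
    private final ConcurrentMap<K, FutureTask<V>> futures = new ConcurrentHashMap<>();

    /** Returns the value for the key, performing the backend call at most once per request. */
    V get(K key, Callable<V> backendCall) {
        FutureTask<V> mine = new FutureTask<>(backendCall);
        FutureTask<V> winner = futures.putIfAbsent(key, mine);  // atomic: only one task is stored
        if (winner == null) {
            winner = mine;
            mine.run();                  // this caller won the race, so it makes the call
        }
        try {
            return winner.get();         // every other caller waits on the same Future
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for " + key, e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("backend call failed for " + key, e.getCause());
        }
    }
}
```

Caching the Future rather than the value is what makes this concurrency aware: a second caller arriving while the first call is still in flight waits for that call instead of starting its own.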

Slide 68

Slide 68 text

Optimize for each device. Leverage the server. Netflix API Device Server 68 The Netflix API is becoming a platform that empowers user-interface teams to build their own API endpoints that are optimized to their client applications and devices.

Slide 69

Slide 69 text

/ps3/home Dependency F 10 Threads Dependency G 10 Threads Dependency H 10 Threads Dependency I 5 Threads Dependency J 8 Threads Dependency A 10 Threads Dependency B 8 Threads Dependency C 10 Threads Dependency D 15 Threads Dependency E 5 Threads Dependency K 15 Threads Dependency L 4 Threads Dependency M 5 Threads Dependency N 10 Threads Dependency O 10 Threads Dependency P 10 Threads Dependency Q 8 Threads Dependency R 10 Threads Dependency S 8 Threads Dependency T 10 Threads /android/home /tv/home Functional Reactive Dynamic Endpoints Asynchronous Java API 69

Slide 70

Slide 70 text

Fault Tolerance in a High Volume, Distributed System
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html

Making the Netflix API More Resilient
http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html

Embracing the Differences : Inside the Netflix API Redesign
http://techblog.netflix.com/2012/07/embracing-differences-inside-netflix.html

Ben Christensen
@benjchristensen
http://www.linkedin.com/in/benjchristensen

Netflix is Hiring
http://jobs.netflix.com