1) How Netflix does resilience engineering to tolerate failures and latency.
2) Changes to the API architecture that allow service endpoints to be optimized for each of the hundreds of unique streaming devices, rather than forcing all device clients to use the same one-size-fits-all API that is optimal for none.
Presented June 28th 2012 at Silicon Valley Cloud Computing Group
http://www.meetup.com/cloudcomputing/events/68006112/
Notes
Slide 2) The Netflix API serves all streaming devices and acts as the broker between backend Netflix systems and the user interfaces running on the 800+ devices that support Netflix streaming.
More than 1 billion incoming calls per day are received, which in turn fan out to several billion outgoing calls (an average ratio of 1:7) to dozens of underlying subsystems, with peaks of over 200,000 dependency requests per second.
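A quick sanity check of those traffic numbers (the 2.5x peak-to-average figure below is derived here, not stated in the talk):

```java
// Back-of-the-envelope check of the note's traffic numbers: 1 billion
// incoming calls/day with a 1:7 fan-out implies ~7 billion outgoing
// calls/day to dependencies.
public class TrafficMath {
    // Average requests per second for a given daily call volume.
    static double avgRps(double callsPerDay) {
        return callsPerDay / 86_400; // seconds per day
    }

    public static void main(String[] args) {
        double incomingRps = avgRps(1e9);     // ~11.6k rps average incoming
        double outgoingRps = incomingRps * 7; // ~81k rps average outgoing
        // The quoted peak of 200k dependency requests/sec is therefore
        // roughly 2.5x the average outgoing rate.
        System.out.printf("avg in=%.0f rps, avg out=%.0f rps%n",
                          incomingRps, outgoingRps);
    }
}
```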
Slide 3) First half of the presentation discusses resilience engineering implemented to handle failure and latency at the integration points with the various dependencies.
Slide 4) Even when all dependencies are performing well, the aggregate impact of just 0.01% downtime on each of dozens of services equates to potentially hours a month of downtime if the system is not engineered for resilience.
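The arithmetic behind that claim can be sketched as follows; the count of 30 dependencies is an assumed illustration of "dozens of services":

```java
// If each of n independent dependencies is down 0.01% of the time
// (99.99% uptime), the probability that ALL are up simultaneously
// shrinks multiplicatively, so aggregate downtime adds up fast.
public class AggregateDowntime {
    // Expected hours per 30-day month during which at least one of n
    // dependencies is failing, given a per-service downtime fraction.
    static double downtimeHoursPerMonth(int n, double perServiceDowntime) {
        double allUp = Math.pow(1.0 - perServiceDowntime, n);
        double hoursInMonth = 30 * 24;
        return (1.0 - allUp) * hoursInMonth;
    }

    public static void main(String[] args) {
        // 30 services at 99.99% uptime each: ~2.16 hours/month in which
        // some dependency is down, without resilience engineering.
        System.out.printf("%.2f hours/month%n",
                          downtimeHoursPerMonth(30, 0.0001));
    }
}
```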
Slide 8) High-volume, high-availability applications must build fault and latency tolerance into their architecture rather than expect the infrastructure to solve it for them.
Slide 17) Sample of one dependency circuit over 12 hours from a production cluster, running at a rate of 75 requests per second on a single server.
Each execution occurs in a separate thread; the first 3 legend values show the median, 90th, and 99th percentile latencies of the execution thread, and the last 3 show the same percentiles as seen from the calling thread.
Thus, the median cost of the thread isolation is 1.62ms - 1.57ms = 0.05ms; at the 90th percentile it is 4.57ms - 2.05ms = 2.52ms.
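A minimal sketch of the thread-isolation pattern these slides describe: each dependency call runs on its own thread pool and is bounded by a timeout, so a slow dependency costs a bounded wait plus the small per-thread overhead measured above. Class and method names here are illustrative, not Netflix's actual API.

```java
import java.util.concurrent.*;

// Isolate a dependency call on a dedicated thread pool with a timeout;
// on timeout or failure, return a fallback instead of propagating the
// dependency's latency to the caller.
public class IsolatedCall {
    private static final ExecutorService DEPENDENCY_POOL =
            Executors.newFixedThreadPool(10);

    static String execute(Callable<String> dependencyCall,
                          long timeoutMs, String fallback) {
        Future<String> f = DEPENDENCY_POOL.submit(dependencyCall);
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            f.cancel(true); // free the worker thread if it is still blocked
            return fallback;
        }
    }

    public static void main(String[] args) {
        // A fast call succeeds; a hung call falls back after 50ms.
        System.out.println(execute(() -> "ok", 50, "fallback"));
        System.out.println(execute(
                () -> { Thread.sleep(5_000); return "late"; },
                50, "fallback"));
        DEPENDENCY_POOL.shutdownNow();
    }
}
```

The fallback keeps the caller's latency bounded; whether to serve a default, a cached value, or an error is a per-dependency decision.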
Slide 28) Second half of the presentation discusses architectural changes to enable optimizing the API for each Netflix device as opposed to a generic one-size-fits-all API which treats all devices the same.
Slide 29) Netflix has over 800 unique devices that fall into several dozen device classes, each with unique user experiences, different calling patterns, and different capabilities and needs from the data, and thus from the API.
Slide 30) The one-size-fits-all API results in chatty clients, some requiring about a dozen requests to render a single page.
Slide 33) The client should make a single request and push the "chatty" part to the server where low-latency networks and multi-core servers can perform the work far more efficiently.
Slides 35-37) The client now extends over the network barrier and runs a portion on the server itself. The client sends requests over HTTP to its other half running on the server, which can then access a Java API at a very granular level to fetch exactly what it needs and return an optimized response suited to the device's exact requirements and user experience.
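The server-side fan-out can be sketched as below. The backend calls are stand-in stubs (hypothetical names), not Netflix's real granular Java API; the point is that one device request triggers many concurrent server-side calls composed into one tailored payload.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of the "chatty part moved to the server": the device makes one
// HTTP request, and the server-side half of the client fans out to
// several granular backend calls concurrently, then composes a single
// response shaped for that device class.
public class ServerSideEndpoint {
    private static final ExecutorService POOL = Executors.newFixedThreadPool(8);

    // Unwrap a Future, rethrowing failures unchecked to keep the sketch tidy.
    private static String join(Future<String> f) {
        try { return f.get(); }
        catch (Exception e) { throw new RuntimeException(e); }
    }

    static Map<String, String> renderHomePage(String userId) {
        // Fan out: each granular call runs concurrently on the server,
        // where server-to-server latency is far lower than device-to-server.
        Future<String> user    = POOL.submit(() -> "user:" + userId);
        Future<String> queue   = POOL.submit(() -> "queue-for:" + userId);
        Future<String> similar = POOL.submit(() -> "similar-for:" + userId);

        // Compose exactly the fields this device needs into one payload.
        Map<String, String> response = new LinkedHashMap<>();
        response.put("user", join(user));
        response.put("queue", join(queue));
        response.put("similar", join(similar));
        return response;
    }

    public static void main(String[] args) {
        System.out.println(renderHomePage("123"));
        POOL.shutdown();
    }
}
```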
Slides 39-40) Concurrency is abstracted away behind an asynchronous API, and data is retrieved, transformed, and composed using higher-order functions (such as map, mapMany, merge, zip, take, toList, etc.). Groovy is used for its closure support, which lends itself well to the functional programming style.
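The composition style can be illustrated with Java streams as a stand-in: map transforms each element and flatMap plays the role of mapMany, flattening one-to-many results. The real API composed asynchronous sequences; this synchronous sketch with stubbed data only illustrates the higher-order-function style.

```java
import java.util.*;
import java.util.stream.*;

// Compose a response by transforming and flattening stubbed data with
// higher-order functions, rather than hand-written loops.
public class Compose {
    // For each video id, derive a (stubbed) title, then flatten each
    // title's list of artwork urls into one composed list.
    static List<String> artworkFor(List<Integer> videoIds) {
        return videoIds.stream()
                .map(id -> "video-" + id)                  // map: id -> title
                .flatMap(t -> Stream.of(t + "/box.jpg",    // mapMany: one title
                                        t + "/banner.jpg"))//  -> many urls
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(artworkFor(List.of(1, 2)));
        // -> [video-1/box.jpg, video-1/banner.jpg,
        //     video-2/box.jpg, video-2/banner.jpg]
    }
}
```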
Slide 41) The Netflix API is becoming a platform that empowers user-interface teams to build their own API endpoints that are optimized to their client applications and devices.