Performance and Fault Tolerance for the Netflix API

1) How Netflix does resilience engineering to tolerate failures and latency.

2) Changes in API architecture that allow service endpoints to be optimized for each of the hundreds of unique streaming devices, rather than forcing all device clients to use the same one-size-fits-all API that is optimized for none.

Presented June 28th, 2012, at the Silicon Valley Cloud Computing Group

http://www.meetup.com/cloudcomputing/events/68006112/

Notes

Slide 2) The Netflix API serves all streaming devices and acts as the broker between backend Netflix systems and the user interfaces running on the 800+ devices that support Netflix streaming.

More than 1 billion incoming calls per day are received, which in turn fan out to several billion outgoing calls (an average ratio of 1:7) to dozens of underlying subsystems, with peaks of over 200k dependency requests per second.

Slide 3) First half of the presentation discusses resilience engineering implemented to handle failure and latency at the integration points with the various dependencies.

Slide 4) Even when all dependencies are performing well, the aggregate impact of even 0.01% downtime on each of dozens of services equates to potentially hours of downtime per month if not engineered for resilience.
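The arithmetic behind this can be checked directly; a quick sketch assuming ~30 dependencies at 99.99% uptime each and a 720-hour month:

```python
# Compound availability: all ~30 dependencies must be up for a request to succeed.
per_dependency_uptime = 0.9999   # 99.99%, i.e. 0.01% downtime each
dependencies = 30

aggregate_uptime = per_dependency_uptime ** dependencies
downtime_hours_per_month = (1 - aggregate_uptime) * 720  # ~720 hours in a month

print(f"{aggregate_uptime:.1%} uptime")               # 99.7% uptime
print(f"{downtime_hours_per_month:.1f} hours/month")  # ~2.2 hours of downtime
```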

Slide 8) It is a requirement of high volume, high availability applications to build fault and latency tolerance into their architecture and not expect infrastructure to solve it for them.

Slide 17) Sample of one dependency circuit over 12 hours from a production cluster, at a rate of 75rps on a single server.

Each execution occurs in a separate thread with median, 90th and 99th percentile latencies shown in the first 3 legend values.

The calling thread median, 90th and 99th percentiles are the last 3 legend values.

Thus, the median cost of the thread is 1.62ms - 1.57ms = 0.05ms; at the 90th percentile it is 4.57ms - 2.05ms = 2.52ms.
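That overhead calculation, using the latencies from the slide (the variable names are mine):

```python
# Latency (ms) seen by the calling user thread vs. the dependency thread's own
# execution time; the difference is the cost of the thread isolation.
calling_thread = {"median": 1.62, "p90": 4.57}
execution = {"median": 1.57, "p90": 2.05}

overhead = {p: round(calling_thread[p] - execution[p], 2) for p in calling_thread}
print(overhead)  # {'median': 0.05, 'p90': 2.52}
```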

Slide 28) Second half of the presentation discusses architectural changes to enable optimizing the API for each Netflix device as opposed to a generic one-size-fits-all API which treats all devices the same.

Slide 29) Netflix has over 800 unique devices that fall into several dozen classes, each with unique user experiences, different calling patterns, and different capabilities and needs from the data, and thus from the API.

Slide 30) The one-size-fits-all API results in chatty clients, some requiring a dozen or so requests to render a page.

Slide 33) The client should make a single request and push the "chatty" part to the server where low-latency networks and multi-core servers can perform the work far more efficiently.

Slides 35-37) The client now extends over the network barrier and runs a portion of itself in the server. The client sends requests over HTTP to its other half running in the server, which can then access a Java API at a very granular level to retrieve exactly what it needs and return an optimized response suited to the device's exact requirements and user experience.

Slides 39-40) Concurrency is abstracted away behind an asynchronous API, and data is retrieved, transformed and composed using higher-order functions (such as map, mapMany, merge, zip, take, toList, etc.). Groovy is used for its closure support, which lends itself well to the functional programming style.
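As an illustration of the composition style described, here is a rough Python asyncio analog of merging two asynchronous calls into one list; fetch_videos and the video ids are hypothetical stand-ins, not Netflix's API:

```python
import asyncio

async def fetch_videos(*ids):
    # Stand-in for an asynchronous service call returning video records.
    await asyncio.sleep(0)  # simulate non-blocking I/O
    return [{"id": i, "title": f"Video {i}"} for i in ids]

async def main():
    call1 = fetch_videos(123456, 7891234)
    call2 = fetch_videos(6789543)
    # Analog of merge(...).toList(): run both calls concurrently, flatten results.
    results = await asyncio.gather(call1, call2)
    return [video for batch in results for video in batch]

videos = asyncio.run(main())
print([v["id"] for v in videos])  # [123456, 7891234, 6789543]
```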

Slide 41) The Netflix API is becoming a platform that empowers user-interface teams to build their own API endpoints that are optimized to their client applications and devices.

Ben Christensen

June 29, 2012

Transcript

  1. Performance and Fault Tolerance for the Netflix API Ben Christensen

    Software Engineer – API Platform at Netflix @benjchristensen http://www.linkedin.com/in/benjchristensen http://techblog.netflix.com/ 1
  2. Netflix API Dependency A Dependency D Dependency G Dependency J

    Dependency M Dependency P Dependency B Dependency E Dependency H Dependency K Dependency N Dependency Q Dependency C Dependency F Dependency I Dependency L Dependency O Dependency R 2
  3. Netflix API Dependency A Dependency D Dependency G Dependency J

    Dependency M Dependency P Dependency B Dependency E Dependency H Dependency K Dependency N Dependency Q Dependency C Dependency F Dependency I Dependency L Dependency O Dependency R 3
  4. Dozens of dependencies. One going bad takes everything down. 99.99%^30 = 99.7% uptime. 0.3% of 1 billion = 3,000,000 failures. 2+ hours downtime/month even if all dependencies have excellent uptime. Reality is generally worse. 4
  5. 5

  6. 6

  7. 7

  8. 10

  9. Tryable semaphores for “trusted” clients and fallbacks. Separate threads for “untrusted” clients. Aggressive timeouts on threads and network calls to “give up and move on”. Circuit breakers as the “release valve”. 11
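The "release valve" pattern named above can be sketched minimally; the thresholds, timeout, and fallback wiring here are illustrative assumptions, not Netflix's actual implementation:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=10.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, command, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()  # open: fail fast without calling the dependency
            self.opened_at = None  # half-open: let one trial request through
        try:
            result = command()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0  # success closes the circuit again
        return result

breaker = CircuitBreaker(failure_threshold=2, reset_timeout=30.0)
print(breaker.call(lambda: "live data", lambda: "cached fallback"))  # live data
```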
  10. 12

  11. 13

  12. 14

  13. 15

  14. 30 rps x 0.2 seconds = 6 + breathing room = 10 threads. Thread-pool queue size: 5-10 (0 doesn't work but get close to it). Thread-pool Size + Queue Size: Queuing is Not Free. 16
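The sizing rule is Little's Law (concurrent requests = request rate x latency) plus headroom; a sketch with the slide's numbers, where the breathing-room value is my assumption of what "breathing room" means here:

```python
import math

def thread_pool_size(requests_per_second, peak_latency_seconds, breathing_room=4):
    # Steady-state concurrent requests in flight, plus headroom for bursts.
    in_flight = requests_per_second * peak_latency_seconds
    return math.ceil(in_flight) + breathing_room

print(thread_pool_size(30, 0.2))  # 30 rps x 0.2 s = 6 in flight, +4 headroom -> 10
```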
  15. Cost of Thread @ 75rps: median - 90th - 99th (time in ms). Time for thread to execute vs. time user thread waited. 17
  16. Rolling 10-second counters. 1-minute latency percentiles. 2-minute rate change. Circle color and size represent health and traffic volume. 23
  17. API Daily Incoming vs Outgoing: 8-10 Billion DependencyCommand executions (threaded) vs 1.2-1.6 Billion incoming requests, with visible weekend peaks. 24
  18. API Hourly Incoming vs Outgoing: peak at 700M+ threaded DependencyCommand executions (200k+/second); peak at 100M+ incoming requests (30k+/second). 25
  19. 26

  20. Netflix API Dependency A Dependency D Dependency G Dependency J

    Dependency M Dependency P Dependency B Dependency E Dependency H Dependency K Dependency N Dependency Q Dependency C Dependency F Dependency I Dependency L Dependency O Dependency R 28
  21. Single Network Request from Clients (use LAN instead of WAN)

    Send Only The Bytes That Matter (optimize responses for each client) Leverage Concurrency (but abstract away its complexity) 29
  22. Single Network Request from Clients (use LAN instead of WAN)

    landing page requires ~dozen API requests Netflix API Device Server 30
  23. Single Network Request from Clients (use LAN instead of WAN)

    some clients are limited in the number of concurrent network connections 31
  24. Single Network Request from Clients (use LAN instead of WAN)

    network latency makes this even worse (mobile, home, wifi, geographic distance, etc) 32
  25. Single Network Request from Clients (use LAN instead of WAN)

    push call pattern to server ... Netflix API Device Server 33
  26. Single Network Request from Clients (use LAN instead of WAN)

    ... and eliminate redundant calls Netflix API Device Server 34
  27. Send Only The Bytes That Matter (optimize responses for each

    client) part of client now on server Netflix API Client Client Device Server 35
  28. Send Only The Bytes That Matter (optimize responses for each

    client) client retrieves and delivers exactly what their device needs in its optimal format Netflix API Client Client Device Server 36
  29. Send Only The Bytes That Matter (optimize responses for each

    client) interface is now a Java API that client interacts with at a granular level Netflix API Service Layer Client Client Device Server 37
  30. Leverage Concurrency (but abstract away its complexity) no synchronized, volatile,

    locks, Futures or Atomic*/Concurrent* classes in client-server code Netflix API Service Layer Client Client Device Server 39
  31. Leverage Concurrency (but abstract away its complexity) Fully asynchronous API - Clients can’t block

     def video1Call = api.getVideos(api.getUser(), 123456, 7891234);
     def video2Call = api.getVideos(api.getUser(), 6789543);
     // higher-order functions used to compose asynchronous calls together
     wx.merge(video1Call, video2Call).toList().subscribe([
         onNext: { listOfVideos ->
             for (video in listOfVideos) {
                 response.getWriter().println("video: " + video.id + " " + video.title);
             }
         },
         onError: { exception ->
             response.setStatus(500);
             response.getWriter().println("Error: " + exception.getMessage());
         }
     ])

     Service calls are all asynchronous. Functional programming with higher-order functions. 40
  32. Fault Tolerance in a High Volume, Distributed System
     http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
     Making the Netflix API More Resilient
     http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html
     Why REST Keeps Me Up At Night
     http://blog.programmableweb.com/2012/05/15/why-rest-keeps-me-up-at-night/
     Ben Christensen @benjchristensen http://www.linkedin.com/in/benjchristensen
     Netflix is Hiring http://jobs.netflix.com 42