Performance and Fault Tolerance for the Netflix API

Ben Christensen
Software Engineer – API Platform at Netflix
@benjchristensen
http://www.linkedin.com/in/benjchristensen
http://techblog.netflix.com/

[Diagram: the Netflix API calling out to Dependencies A through R]

Dozens of dependencies. One going bad takes everything down.
99.99%^30 = 99.7% uptime
0.3% of 1 billion = 3,000,000 failures
2+ hours downtime/month even if all dependencies have excellent uptime.
Reality is generally worse.
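The arithmetic on this slide can be checked with a short sketch. It assumes 30 dependencies that fail independently, each with 99.99% uptime, and a 30-day month; the class and method names are made up for illustration.

```java
// Sketch of the slide's arithmetic; names are illustrative, not from the deck.
class CompoundUptime {
    // Probability that every one of `dependencies` calls succeeds,
    // assuming failures are independent.
    static double combined(double perDependencyUptime, int dependencies) {
        return Math.pow(perDependencyUptime, dependencies);
    }

    // Expected number of failed requests out of `requests`.
    static double expectedFailures(double uptime, double requests) {
        return (1.0 - uptime) * requests;
    }

    // Expected downtime in hours over a 30-day month (assumption: 30 days).
    static double downtimeHoursPerMonth(double uptime) {
        return (1.0 - uptime) * 30 * 24;
    }

    public static void main(String[] args) {
        double uptime = combined(0.9999, 30);
        System.out.printf("combined uptime: %.2f%%%n", uptime * 100);           // roughly 99.70%
        System.out.printf("failures per billion: %.0f%n",
                expectedFailures(uptime, 1_000_000_000.0));                     // close to 3 million
        System.out.printf("downtime hours/month: %.2f%n",
                downtimeHoursPerMonth(uptime));                                 // a little over 2 hours
    }
}
```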

No single dependency should take down the entire app.
Fallback. Fail silent. Fail fast. Shed load.

Options
Aggressive Network Timeouts
Semaphores (Tryable)
Separate Threads
Circuit Breaker
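As one illustration of the last option, a circuit breaker can be sketched in a few lines. The deck does not say how a circuit decides to trip, so the consecutive-failure threshold, the sleep window, and every name below are assumptions rather than the Netflix implementation.

```java
// Illustrative only: the trip rule (consecutive failures) and the sleep
// window are assumptions, not taken from the deck.
class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long sleepWindowMillis;
    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAtMillis = 0L;

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMillis) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
    }

    // Closed circuit: always allow. Open circuit: reject until the sleep
    // window has passed, then allow calls again as a trial.
    synchronized boolean allowRequest(long nowMillis) {
        return !open || nowMillis - openedAtMillis >= sleepWindowMillis;
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        open = false;
    }

    synchronized void recordFailure(long nowMillis) {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            open = true;
            openedAtMillis = nowMillis; // (re)start the sleep window
        }
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(2, 1000);
        breaker.recordFailure(0);
        breaker.recordFailure(10);
        System.out.println("allowed at 20 ms: " + breaker.allowRequest(20));
        System.out.println("allowed at 1010 ms: " + breaker.allowRequest(1010));
    }
}
```

Callers would check `allowRequest` before each call and go straight to the fallback when it returns false, which is how an open circuit sheds load from a failing dependency.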

Tryable semaphores for “trusted” clients and fallbacks
Separate threads for “untrusted” clients
Aggressive timeouts on threads and network calls to “give up and move on”
Circuit breakers as the “release valve”
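A tryable semaphore, as named on this slide, might look like the sketch below. It uses `java.util.concurrent.Semaphore.tryAcquire()`, which never blocks; the guard class and its `call` method are hypothetical helpers, not the Netflix code.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical helper: caps concurrent calls to a dependency and falls back
// immediately instead of queuing when no permit is free.
class TryableSemaphoreGuard {
    private final Semaphore permits;

    TryableSemaphoreGuard(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    <T> T call(Supplier<T> primary, Supplier<T> fallback) {
        // tryAcquire() returns at once: no permit means shed load right away.
        if (!permits.tryAcquire()) {
            return fallback.get();
        }
        try {
            return primary.get();
        } catch (RuntimeException e) {
            // Fail silent: hide the failure behind the fallback value.
            return fallback.get();
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        TryableSemaphoreGuard guard = new TryableSemaphoreGuard(1);
        // The nested call runs while the only permit is held, so it is shed.
        String result = guard.call(
                () -> guard.call(() -> "inner", () -> "shed"),
                () -> "outer fallback");
        System.out.println(result);
    }
}
```

Because the call runs on the caller's own thread, this guard cannot time out a hung call, which fits the slide's split: semaphores for "trusted" clients, separate threads for "untrusted" ones.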

30 rps x 0.2 seconds = 6 + breathing room = 10 threads
Thread-pool Queue size: 5-10 (0 doesn't work but get close to it)
Thread-pool Size + Queue Size
Queuing is Not Free
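The sizing rule on this slide can be sketched with the standard `ThreadPoolExecutor`. The helper names are made up, and the choice of `AbortPolicy` (reject once threads and queue are both full) is an assumption about how load would be shed.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch of the sizing rule; helper names are illustrative.
class DependencyPoolSizing {
    // Threads busy at steady state: request rate x time each call holds a
    // thread. Integer math (milliseconds), rounded up.
    static int busyThreads(int requestsPerSecond, int latencyMillis) {
        return (requestsPerSecond * latencyMillis + 999) / 1000;
    }

    // Fixed-size pool with a small bounded queue. Once all threads are busy
    // and the queue is full, submit() throws RejectedExecutionException.
    static ThreadPoolExecutor newPool(int threads, int queueSize) {
        return new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),
                new ThreadPoolExecutor.AbortPolicy());
    }

    public static void main(String[] args) {
        int busy = busyThreads(30, 200);   // 30 rps x 0.2 s = 6
        int threads = busy + 4;            // breathing room, giving 10
        ThreadPoolExecutor pool = newPool(threads, 5);
        System.out.println("threads=" + pool.getMaximumPoolSize()
                + " queue=" + pool.getQueue().remainingCapacity());
        pool.shutdown();
    }
}
```

With these numbers, at most 15 calls (10 running plus 5 queued) are in the pool at once; anything beyond that is rejected immediately rather than left waiting.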

[Chart: Cost of Thread @ 75rps, median - 90th - 99th (time in ms), comparing "Time for thread to execute" with "Time user thread waited"]

Netflix DependencyCommand Implementation

Netflix DependencyCommand Implementation
Fallbacks: Cache, Eventual Consistency, Stubbed Data, Empty Response
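The deck does not show the DependencyCommand API itself, so the following is only a minimal sketch of the general pattern it describes: run the call on a separate thread, give up after a timeout, and return a fallback such as stubbed data. `SketchCommand`, `run`, `getFallback` and `execute` are hypothetical names.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical command wrapper; not the actual Netflix DependencyCommand.
abstract class SketchCommand<T> {
    protected abstract T run() throws Exception;   // the dependency call
    protected abstract T getFallback();            // cache, stub, or empty response

    T execute(ExecutorService pool, long timeoutMillis) {
        Callable<T> task = this::run;
        Future<T> future;
        try {
            future = pool.submit(task);
        } catch (RejectedExecutionException e) {
            return getFallback();                  // pool is saturated: shed load
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return getFallback();
        } catch (ExecutionException | TimeoutException e) {
            future.cancel(true);                   // give up and move on
            return getFallback();
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        SketchCommand<String> slow = new SketchCommand<String>() {
            protected String run() throws Exception {
                Thread.sleep(2000);                // simulated slow dependency
                return "live data";
            }
            protected String getFallback() {
                return "stubbed data";
            }
        };
        System.out.println(slow.execute(pool, 50));
        pool.shutdownNow();
    }
}
```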

Netflix DependencyCommand Implementation

So, how does it work in the real world?

Visualizing Circuits in Near-Realtime (latency is single-digit seconds, generally 1-2)
Video available at https://vimeo.com/33576628

Rolling 10 second counters
1 minute latency percentiles
2 minute rate change
Circle color and size represent health and traffic volume
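One common way to keep a rolling 10 second counter is a ring of one-second buckets, sketched below. The bucket granularity and all names are assumptions; the deck only states the window length.

```java
import java.util.Arrays;

// Sketch of a rolling counter built from one-second buckets (an assumed
// design; the deck only says the counters cover a rolling 10 seconds).
class RollingCounter {
    private final long[] counts;
    private final long[] bucketSecond; // which second each slot currently holds

    RollingCounter(int windowSeconds) {
        counts = new long[windowSeconds];
        bucketSecond = new long[windowSeconds];
        Arrays.fill(bucketSecond, -1L);  // -1 marks a slot that was never used
    }

    synchronized void increment(long nowMillis) {
        long second = nowMillis / 1000;
        int slot = (int) (second % counts.length);
        if (bucketSecond[slot] != second) {
            // The slot holds an older second: reuse it for the current one.
            bucketSecond[slot] = second;
            counts[slot] = 0;
        }
        counts[slot]++;
    }

    // Sum of all buckets that still fall inside the window ending now.
    synchronized long sum(long nowMillis) {
        long second = nowMillis / 1000;
        long total = 0;
        for (int i = 0; i < counts.length; i++) {
            if (bucketSecond[i] >= 0 && second - bucketSecond[i] < counts.length) {
                total += counts[i];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        RollingCounter errors = new RollingCounter(10);
        errors.increment(0);
        errors.increment(500);
        errors.increment(9_000);
        System.out.println("at 9.999 s: " + errors.sum(9_999));
        System.out.println("at 10 s: " + errors.sum(10_000));
    }
}
```

Timestamps are passed in explicitly to keep the sketch deterministic; a live version would read the clock itself.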

API Daily Incoming vs Outgoing
8-10 Billion DependencyCommand Executions (threaded)
1.2 - 1.6 Billion Incoming Requests
[Chart: daily incoming vs outgoing volume, with weekends marked]

API Hourly Incoming vs Outgoing
Peak at 700M+ threaded DependencyCommand executions (200k+/second)
Peak at 100M+ incoming requests (30k+/second)

Fallback. Fail silent. Fail fast. Shed load.

Single Network Request from Clients (use LAN instead of WAN)
Send Only The Bytes That Matter (optimize responses for each client)
Leverage Concurrency (but abstract away its complexity)

Landing page requires ~dozen API requests
[Diagram: Device, Server, Netflix API]

Some clients are limited in the number of concurrent network connections

Network latency makes this even worse (mobile, home, wifi, geographic distance, etc.)

Push call pattern to server ...
[Diagram: Device, Server, Netflix API]

... and eliminate redundant calls
[Diagram: Device, Server, Netflix API]

Send Only The Bytes That Matter (optimize responses for each client)
Part of client now on server
[Diagram: Device, Server, Client, Netflix API]

Send Only The Bytes That Matter (optimize responses for each client)
Client retrieves and delivers exactly what their device needs in its optimal format
[Diagram: Device, Server, Client, Netflix API]

Send Only The Bytes That Matter (optimize responses for each client)
Interface is now a Java API that client interacts with at a granular level
[Diagram: Device, Server, Client, Service Layer, Netflix API]

Leverage Concurrency (but abstract away its complexity)
[Diagram: Device, Server, Client, Service Layer, Netflix API]

Leverage Concurrency (but abstract away its complexity)
No synchronized, volatile, locks, Futures or Atomic*/Concurrent* classes in client-server code
[Diagram: Device, Server, Client, Service Layer, Netflix API]

Leverage Concurrency (but abstract away its complexity)
Fully asynchronous API - Clients can’t block

    def video1Call = api.getVideos(api.getUser(), 123456, 7891234);
    def video2Call = api.getVideos(api.getUser(), 6789543);
    // higher-order functions used to compose asynchronous calls together
    wx.merge(video1Call, video2Call).toList().subscribe([
        onNext: { listOfVideos ->
            for(video in listOfVideos) {
                response.getWriter().println("video: " + video.id + " " + video.title);
            }
        },
        onError: { exception ->
            response.setStatus(500);
            response.getWriter().println("Error: " + exception.getMessage());
        }
    ])

Service calls are all asynchronous
Functional programming with higher-order functions
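For readers more familiar with plain Java, the same idea (start two asynchronous calls, then combine their results with a higher-order function) can be sketched with `CompletableFuture`. This swaps in a standard-library type for the observable-style API used on the slide, and `getVideos` here is a made-up stand-in for a real service call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Sketch only: CompletableFuture replaces the slide's observable-style API,
// and getVideos is a hypothetical stand-in for an asynchronous service call.
class AsyncComposition {
    static CompletableFuture<List<String>> getVideos(int... videoIds) {
        return CompletableFuture.supplyAsync(() -> {
            List<String> titles = new ArrayList<>();
            for (int id : videoIds) {
                titles.add("video-" + id);
            }
            return titles;
        });
    }

    // Both calls are started before either result is needed; thenCombine
    // merges the two lists once both have completed.
    static CompletableFuture<List<String>> merged() {
        CompletableFuture<List<String>> first = getVideos(123456, 7891234);
        CompletableFuture<List<String>> second = getVideos(6789543);
        return first.thenCombine(second, (a, b) -> {
            List<String> all = new ArrayList<>(a);
            all.addAll(b);
            return all;
        });
    }

    public static void main(String[] args) {
        // join() blocks only here, at the edge of the program.
        for (String video : merged().join()) {
            System.out.println("video: " + video);
        }
    }
}
```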

Optimize for each device. Leverage the server.
[Diagram: Device, Server, Netflix API]

Fault Tolerance in a High Volume, Distributed System
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
Making the Netflix API More Resilient
http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html
Why REST Keeps Me Up At Night
http://blog.programmableweb.com/2012/05/15/why-rest-keeps-me-up-at-night/
Ben Christensen
@benjchristensen
http://www.linkedin.com/in/benjchristensen
Netflix is Hiring
http://jobs.netflix.com
