Slide 1

Slide 1 text

Performance and Fault Tolerance for the Netflix API
Ben Christensen
Software Engineer – API Platform at Netflix
@benjchristensen
http://www.linkedin.com/in/benjchristensen
http://techblog.netflix.com/

Slide 2

Slide 2 text

Diagram labels: Netflix API; Dependency A through Dependency R

Slide 3

Slide 3 text

Diagram labels: Netflix API; Dependency A through Dependency R

Slide 4

Slide 4 text

Dozens of dependencies. One going bad takes everything down.
99.99%^30 = 99.7% uptime
0.3% of 1 billion = 3,000,000 failures
2+ hours downtime/month even if all dependencies have excellent uptime.
Reality is generally worse.
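The arithmetic on this slide can be checked with a short sketch. It assumes 30 dependencies at 99.99% uptime each, 1 billion requests per month, and a 30-day month; the class and method names are illustrative only.

```java
// Sketch of the slide's availability arithmetic; all names are illustrative.
class AvailabilityMath {
    // Combined uptime when every request needs all dependencies to be up.
    static double combinedUptime(double perDependencyUptime, int dependencies) {
        return Math.pow(perDependencyUptime, dependencies);
    }

    public static void main(String[] args) {
        double uptime = combinedUptime(0.9999, 30);
        double failureRate = 1.0 - uptime;
        double failures = failureRate * 1_000_000_000L; // out of 1 billion requests
        double hoursDown = failureRate * 30 * 24;       // assuming a 30-day month
        System.out.printf("uptime=%.2f%% failures=%.0f downtime=%.1f h/month%n",
                uptime * 100, failures, hoursDown);
    }
}
```

The unrounded failure count comes out just under 3 million; the slide rounds the failure rate to 0.3% first.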

Slide 8

Slide 8 text

No single dependency should take down the entire app.
Fallback.
Fail silent.
Fail fast.
Shed load.

Slide 9

Slide 9 text

Options
Aggressive Network Timeouts
Semaphores (Tryable)
Separate Threads
Circuit Breaker
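Of these options, the tryable semaphore is the least familiar. A minimal sketch, assuming the standard `java.util.concurrent.Semaphore` and a hypothetical `TryableSemaphoreGuard` wrapper (not taken from the deck): the caller never waits for a permit, and a rejected call gets the fallback immediately.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Sketch of a "tryable" semaphore guard: callers never block waiting for a permit.
class TryableSemaphoreGuard {
    private final Semaphore permits;

    TryableSemaphoreGuard(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Runs the call if a permit is free right now; otherwise returns the fallback.
    <T> T execute(Supplier<T> call, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get();      // shed load instead of queuing
        }
        try {
            return call.get();
        } catch (RuntimeException e) {
            return fallback.get();      // the dependency call failed: fall back
        } finally {
            permits.release();
        }
    }
}
```

The call still runs on the caller's thread, so a semaphore caps concurrency but cannot time out a call that hangs.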

Slide 11

Slide 11 text

Tryable semaphores for “trusted” clients and fallbacks
Separate threads for “untrusted” clients
Aggressive timeouts on threads and network calls to “give up and move on”
Circuit breakers as the “release valve”
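A minimal sketch of the "separate thread plus aggressive timeout" idea, using only standard `java.util.concurrent` classes. `IsolatedCall` and `callWithTimeout` are hypothetical names, and the pool size and timeouts are arbitrary illustration values.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class IsolatedCall {
    // Runs the call on the dependency's own pool and gives up after timeoutMs.
    static <T> T callWithTimeout(ExecutorService pool, Callable<T> call,
                                 long timeoutMs, T fallback) {
        Future<T> future = pool.submit(call);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);        // interrupt the worker; the caller moves on
            return fallback;
        } catch (ExecutionException e) {
            return fallback;            // the dependency call itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        String fast = callWithTimeout(pool, () -> "live", 1000, "fallback");
        String slow = callWithTimeout(pool, () -> {
            Thread.sleep(5000);         // simulated slow dependency
            return "late";
        }, 50, "fallback");
        System.out.println(fast + " / " + slow);
        pool.shutdownNow();
    }
}
```

Because the caller waits on a `Future` rather than on the network call, it can walk away even when the client library ignores its own timeouts.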

Slide 16

Slide 16 text

30 rps x 0.2 seconds = 6 + breathing room = 10 threads
Thread-pool Queue size: 5-10 (0 doesn't work but get close to it)
Thread-pool Size + Queue Size
Queuing is Not Free
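The sizing rule above can be sketched with a standard `ThreadPoolExecutor`. The class and method names are hypothetical, and the pool of 10 with a queue of 5 simply mirrors the slide's numbers; anything beyond pool size plus queue size is rejected rather than queued.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class PoolSizing {
    // Threads busy at steady state = requests per second x seconds per request.
    static int threadsInUse(int requestsPerSecond, int latencyMillis) {
        return (int) Math.ceil(requestsPerSecond * latencyMillis / 1000.0);
    }

    // Fixed-size pool with a small bounded queue; work beyond
    // poolSize + queueSize is rejected immediately (AbortPolicy).
    static ThreadPoolExecutor boundedPool(int poolSize, int queueSize) {
        return new ThreadPoolExecutor(poolSize, poolSize, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(queueSize),
                new ThreadPoolExecutor.AbortPolicy());
    }

    public static void main(String[] args) {
        int busy = threadsInUse(30, 200);             // 30 rps x 0.2 s = 6
        ThreadPoolExecutor pool = boundedPool(10, 5); // 6 + breathing room, small queue
        int capacity = pool.getMaximumPoolSize() + pool.getQueue().remainingCapacity();
        System.out.println(busy + " busy threads; capacity " + capacity);
        pool.shutdown();
    }
}
```

A caller that catches `RejectedExecutionException` can return a fallback at once, which is one way to shed load.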

Slide 17

Slide 17 text

Cost of Thread @ 75rps
median - 90th - 99th (time in ms)
Time for thread to execute
Time user thread waited

Slide 18

Slide 18 text

Netflix DependencyCommand Implementation

Slide 19

Slide 19 text

Netflix DependencyCommand Implementation
Fallbacks
Cache
Eventual Consistency
Stubbed Data
Empty Response
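The deck does not show the DependencyCommand API, so the following is only a sketch of the fallback idea with hypothetical names (`FallbackCommand`, `RecommendationsCommand`). It is not Netflix's implementation. The example falls back to an empty response, one of the options listed above.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.List;

// Hypothetical command wrapper; not Netflix's actual DependencyCommand API.
abstract class FallbackCommand<T> {
    protected abstract T run() throws Exception; // the real dependency call
    protected abstract T fallback();             // cache, stub, or empty response

    final T execute() {
        try {
            return run();
        } catch (Exception e) {
            return fallback();
        }
    }
}

// Example: fall back to an empty list ("fail silent") when the call fails.
class RecommendationsCommand extends FallbackCommand<List<String>> {
    @Override
    protected List<String> run() throws Exception {
        throw new IOException("recommendation service unavailable"); // simulated failure
    }

    @Override
    protected List<String> fallback() {
        return Collections.emptyList();
    }
}
```

A cache-backed or stubbed fallback would have the same shape, with `fallback()` returning the last known or default value instead.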

Slide 20

Slide 20 text

Netflix DependencyCommand Implementation

Slide 21

Slide 21 text

So, how does it work in the real world?

Slide 22

Slide 22 text

Visualizing Circuits in Near-Realtime
(latency is single-digit seconds, generally 1-2)
Video available at https://vimeo.com/33576628

Slide 23

Slide 23 text

Rolling 10 second counters
1 minute latency percentiles
2 minute rate change
circle color and size represent health and traffic volume
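The slide does not say how the rolling counters are built. One common approach, sketched here under the assumption of ten 1-second buckets, is a ring of buckets whose expired slots are reused. `RollingCounter` is a hypothetical name, and timestamps are passed in (non-negative milliseconds) so the behavior is easy to follow.

```java
import java.util.Arrays;

// Sketch of a rolling counter: ten 1-second buckets cover the last 10 seconds.
class RollingCounter {
    private static final int BUCKETS = 10;
    private final long[] counts = new long[BUCKETS];
    private final long[] bucketSecond = new long[BUCKETS]; // second held by each slot

    RollingCounter() {
        Arrays.fill(bucketSecond, -1); // -1 marks a slot that was never used
    }

    synchronized void increment(long nowMillis) {
        long second = nowMillis / 1000;
        int slot = (int) (second % BUCKETS);
        if (bucketSecond[slot] != second) { // slot holds an expired second: reuse it
            bucketSecond[slot] = second;
            counts[slot] = 0;
        }
        counts[slot]++;
    }

    // Sum of all buckets that still fall inside the 10-second window.
    synchronized long sum(long nowMillis) {
        long second = nowMillis / 1000;
        long total = 0;
        for (int i = 0; i < BUCKETS; i++) {
            if (bucketSecond[i] >= 0 && second - bucketSecond[i] < BUCKETS) {
                total += counts[i];
            }
        }
        return total;
    }
}
```

One counter per event type (success, failure, timeout, rejection) would give the rates a dashboard or circuit breaker needs.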

Slide 24

Slide 24 text

API Daily Incoming vs Outgoing
8-10 Billion DependencyCommand Executions (threaded)
1.2 - 1.6 Billion Incoming Requests
(chart annotations mark three weekends)

Slide 25

Slide 25 text

API Hourly Incoming vs Outgoing
Peak at 700M+ threaded DependencyCommand executions (200k+/second)
Peak at 100M+ incoming requests (30k+/second)

Slide 27

Slide 27 text

Fallback.
Fail silent.
Fail fast.
Shed load.
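Failing fast is what the circuit breaker named among the earlier options provides. A minimal sketch, assuming a consecutive-failure threshold and a fixed retry window; both are illustrative choices, since the deck does not state the trip conditions actually used. `SimpleCircuitBreaker` is a hypothetical name.

```java
// Minimal circuit breaker sketch; thresholds and names are assumptions.
class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long retryAfterMillis;
    private int consecutiveFailures = 0;
    private long openedAt = -1; // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long retryAfterMillis) {
        this.failureThreshold = failureThreshold;
        this.retryAfterMillis = retryAfterMillis;
    }

    // Fail fast while open; once the retry window has passed, allow trial traffic.
    synchronized boolean allowRequest(long nowMillis) {
        return openedAt < 0 || nowMillis - openedAt >= retryAfterMillis;
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        openedAt = -1; // close the circuit
    }

    synchronized void recordFailure(long nowMillis) {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = nowMillis; // open (or re-open) the circuit
        }
    }
}
```

When `allowRequest` returns false, the caller skips the dependency entirely and goes straight to its fallback.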

Slide 28

Slide 28 text

Diagram labels: Netflix API; Dependency A through Dependency R

Slide 29

Slide 29 text

Single Network Request from Clients (use LAN instead of WAN)
Send Only The Bytes That Matter (optimize responses for each client)
Leverage Concurrency (but abstract away its complexity)

Slide 30

Slide 30 text

Single Network Request from Clients (use LAN instead of WAN)
landing page requires ~dozen API requests
Diagram labels: Netflix API, Device, Server

Slide 31

Slide 31 text

Single Network Request from Clients (use LAN instead of WAN)
some clients are limited in the number of concurrent network connections

Slide 32

Slide 32 text

Single Network Request from Clients (use LAN instead of WAN)
network latency makes this even worse (mobile, home, wifi, geographic distance, etc.)

Slide 33

Slide 33 text

Single Network Request from Clients (use LAN instead of WAN)
push call pattern to server ...
Diagram labels: Netflix API, Device, Server

Slide 34

Slide 34 text

Single Network Request from Clients (use LAN instead of WAN)
... and eliminate redundant calls
Diagram labels: Netflix API, Device, Server

Slide 35

Slide 35 text

Send Only The Bytes That Matter (optimize responses for each client)
part of client now on server
Diagram labels: Netflix API, Client, Client, Device, Server

Slide 36

Slide 36 text

Send Only The Bytes That Matter (optimize responses for each client)
client retrieves and delivers exactly what their device needs in its optimal format
Diagram labels: Netflix API, Client, Client, Device, Server

Slide 37

Slide 37 text

Send Only The Bytes That Matter (optimize responses for each client)
interface is now a Java API that client interacts with at a granular level
Diagram labels: Netflix API, Service Layer, Client, Client, Device, Server

Slide 38

Slide 38 text

Leverage Concurrency (but abstract away its complexity)
Diagram labels: Netflix API, Service Layer, Client, Client, Device, Server

Slide 39

Slide 39 text

Leverage Concurrency (but abstract away its complexity)
no synchronized, volatile, locks, Futures or Atomic*/Concurrent* classes in client-server code
Diagram labels: Netflix API, Service Layer, Client, Client, Device, Server

Slide 40

Slide 40 text

Leverage Concurrency (but abstract away its complexity)
Fully asynchronous API - Clients can’t block

    def video1Call = api.getVideos(api.getUser(), 123456, 7891234);
    def video2Call = api.getVideos(api.getUser(), 6789543);
    // higher-order functions used to compose asynchronous calls together
    wx.merge(video1Call, video2Call).toList().subscribe([
        onNext: { listOfVideos ->
            for(video in listOfVideos) {
                response.getWriter().println("video: " + video.id + " " + video.title);
            }
        },
        onError: { exception ->
            response.setStatus(500);
            response.getWriter().println("Error: " + exception.getMessage());
        }
    ])

Service calls are all asynchronous
Functional programming with higher-order functions
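For readers without the `wx` library from the snippet, the same composition idea can be sketched in plain Java with `CompletableFuture` (Java 8 and later). This swaps in a different mechanism from the one on the slide. `AsyncCompose`, its `getVideos` stand-in, and the `video-<id>` strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

class AsyncCompose {
    // Stand-in for an asynchronous service call; a real call would do network I/O.
    static CompletableFuture<List<String>> getVideos(long... ids) {
        return CompletableFuture.supplyAsync(() -> {
            List<String> titles = new ArrayList<>();
            for (long id : ids) {
                titles.add("video-" + id);
            }
            return titles;
        });
    }

    // Merges both results without blocking; the caller attaches callbacks.
    static CompletableFuture<List<String>> merged() {
        CompletableFuture<List<String>> first = getVideos(123456, 7891234);
        CompletableFuture<List<String>> second = getVideos(6789543);
        return first.thenCombine(second, (a, b) -> {
            List<String> all = new ArrayList<>(a);
            all.addAll(b);
            return all;
        });
    }

    public static void main(String[] args) {
        merged().whenComplete((videos, error) -> {
            if (error != null) {
                System.out.println("Error: " + error.getMessage());
            } else {
                videos.forEach(v -> System.out.println("video: " + v));
            }
        }).join(); // join only so this demo does not exit before the callback runs
    }
}
```

Both calls start before either result is needed, and the combining function runs only once both have completed.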

Slide 41

Slide 41 text

Optimize for each device.
Leverage the server.
Diagram labels: Netflix API, Device, Server

Slide 42

Slide 42 text

Fault Tolerance in a High Volume, Distributed System
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
Making the Netflix API More Resilient
http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html
Why REST Keeps Me Up At Night
http://blog.programmableweb.com/2012/05/15/why-rest-keeps-me-up-at-night/
Ben Christensen
@benjchristensen
http://www.linkedin.com/in/benjchristensen
Netflix is Hiring
http://jobs.netflix.com