Performance and Fault Tolerance for the Netflix API - July 18 2012

Slide 1

Slide 1 text

Performance and Fault Tolerance for the Netflix API
Ben Christensen
Software Engineer, API Platform at Netflix
@benjchristensen
http://www.linkedin.com/in/benjchristensen
http://techblog.netflix.com/
Silicon Valley Cloud Computing Group - July 18 2012

Slide 27

Slide 27 text

Netflix DependencyCommand Implementation

(6) Is Command Successful?
Application flow is routed based on the response from the run() method.

(6a) Successful Response
If no exception is thrown and a response is returned (including a null value), the command returns the response after some logging and a performance check.

(6b) Failed Response
When run() throws an exception, the execution is marked as "failed", which contributes to potentially tripping the circuit open, and application flow is routed to (8) DependencyCommand.getFallback().

(7) Calculate Circuit Health
Successes, failures, rejections and timeouts are all reported to the circuit breaker, which maintains a rolling set of counters used to calculate statistics. These statistics determine when the circuit should "trip" and become open, at which point subsequent requests are short-circuited until a period of time passes and requests are permitted again after health checks succeed.

(8) DependencyCommand.getFallback()
The fallback is performed whenever a command execution fails (an exception is thrown by (5) DependencyCommand.run()) or when it is (3) short-circuited because the circuit is open. The intent of the fallback is to provide a generic response without any network dependency, from an in-memory cache or other static logic.

(8a) Fallback Not Implemented
If DependencyCommand.getFallback() is not implemented, an exception will be thrown and the caller is left to deal with it.

(8b) Fallback Successful
If the fallback returns a response, it is returned to the caller.

(8c) Fallback Failed
If DependencyCommand.getFallback() fails and throws an exception, the caller is left to deal with it. A fallback implementation that can fail is considered poor practice: a fallback should not perform any logic that could fail. Semaphores are wrapped around fallback execution to protect against software bugs that do not comply with this principle, particularly if the fallback itself tries to perform a network call that can be latent.

(9) Return Successful Response
If (6a) occurred, the successful response is returned to the caller regardless of whether it was latent or not.
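The flow in steps (6) through (9) can be sketched in a few lines of Java. This is a minimal illustration and not the Netflix implementation: the names SketchCircuit, SketchCommand, FlakyCommand and execute() are hypothetical, the circuit trips on a count of consecutive failures rather than on the rolling statistical counters the slide describes, and timeouts and thread-pool rejections are left out.

```java
import java.util.concurrent.Semaphore;

// Hypothetical stand-in for the circuit breaker in step (7).
// Assumption: trips after N consecutive failures, not on rolling statistics.
class SketchCircuit {
    private final int failureThreshold;
    private final long sleepWindowMillis;
    private int consecutiveFailures;
    private long openedAtMillis = -1; // -1 means the circuit is closed

    SketchCircuit(int failureThreshold, long sleepWindowMillis) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
    }

    synchronized boolean allowRequest() {
        if (openedAtMillis < 0) {
            return true;
        }
        // Once the sleep window has passed, permit requests again as a health check.
        return System.currentTimeMillis() - openedAtMillis >= sleepWindowMillis;
    }

    synchronized void markSuccess() {
        consecutiveFailures = 0;
        openedAtMillis = -1; // a success closes the circuit
    }

    synchronized void markFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAtMillis = System.currentTimeMillis(); // trip the circuit open
        }
    }
}

// Hypothetical command base class following steps (3), (5), (6), (8) and (9).
abstract class SketchCommand<T> {
    private final SketchCircuit circuit;
    private final Semaphore fallbackPermits;

    SketchCommand(SketchCircuit circuit, int maxConcurrentFallbacks) {
        this.circuit = circuit;
        this.fallbackPermits = new Semaphore(maxConcurrentFallbacks);
    }

    // (5) The dependency call; any exception counts as a failure.
    protected abstract T run() throws Exception;

    // (8a) Default: no fallback implemented, so the caller sees an exception.
    protected T getFallback() {
        throw new UnsupportedOperationException("No fallback implemented");
    }

    final T execute() {
        if (!circuit.allowRequest()) {
            return fallback();         // (3) short-circuited
        }
        T response;
        try {
            response = run();
        } catch (Exception e) {
            circuit.markFailure();     // (6b) and (7)
            return fallback();         // (8)
        }
        circuit.markSuccess();         // (6a) and (7)
        return response;               // (9) a null response is also valid
    }

    // Fallback execution is guarded by a semaphore, as in (8c).
    private T fallback() {
        if (!fallbackPermits.tryAcquire()) {
            throw new IllegalStateException("Too many concurrent fallbacks");
        }
        try {
            return getFallback();      // (8b) returns; (8a)/(8c) throw to the caller
        } finally {
            fallbackPermits.release();
        }
    }
}

// Example command: fails on demand and falls back to a static value.
class FlakyCommand extends SketchCommand<String> {
    private final boolean fail;

    FlakyCommand(SketchCircuit circuit, boolean fail) {
        super(circuit, 10);
        this.fail = fail;
    }

    @Override
    protected String run() throws Exception {
        if (fail) {
            throw new java.io.IOException("dependency unavailable");
        }
        return "live";
    }

    @Override
    protected String getFallback() {
        return "cached"; // static logic, no network dependency
    }
}
```

With a circuit created as new SketchCircuit(2, 60_000), two failing FlakyCommand executions return the fallback value and trip the circuit, after which even a healthy command is short-circuited to its fallback until the sleep window has passed.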
