Slide 1

Slide 1 text

Application Resilience Engineering and Operations at Netflix with Hystrix
Ben Christensen – @benjchristensen – Software Engineer on Edge Platform at Netflix

Slide 2

Slide 2 text

Netflix is a subscription service for movies and TV shows for USD $7.99/month (roughly the equivalent price in each country's local currency).

Slide 3

Slide 3 text

More than 37 million subscribers in 50+ countries and territories

Netflix has over 37 million video streaming customers in 50+ countries and territories across North & South America, the United Kingdom, Ireland, the Netherlands and the Nordics.

Slide 4

Slide 4 text

Netflix accounts for 33% of peak downstream Internet traffic in North America

Netflix subscribers are watching more than 1 billion hours a month. Sandvine report available with a free account at http://www.sandvine.com/news/global_broadband_trends.asp

Slide 5

Slide 5 text

API traffic has grown from ~20 million requests/day in 2010 to more than 2 billion/day today. [Chart: millions of API requests per day, 2010 through today]

Slide 6

Slide 6 text

Streaming devices talk to 2 major edge services: the first is the Netflix API, which provides functionality related to discovering and browsing content, while the second handles the playback of video streams.

Slide 7

Slide 7 text

This presentation focuses on architectural choices made for the “Discovery” portion of traffic that the Netflix API handles.

Slide 8

Slide 8 text

The Netflix API powers the “Discovery” user experience on the 800+ devices up until a user hits the play button, at which point the “Streaming” edge service takes over.

Slide 9

Slide 9 text

The Netflix API serves all streaming devices and acts as the broker between backend Netflix systems and the user interfaces running on the 800+ devices that support Netflix streaming. 2+ billion incoming calls per day are received, which in turn fan out to several billion outgoing calls (averaging a ratio of 1:6) to dozens of underlying subsystems.

Slide 10

Slide 10 text

Global deployment spread across data centers in multiple AWS regions. Geographic isolation, active/active with regional failover coming (http://techblog.netflix.com/2013/05/denominating-multi-region-sites.html)

Slide 11

Slide 11 text

3 data centers (AWS Availability Zones) operate in each region, with deployments split across them for redundancy in the event of losing an entire zone.

Slide 12

Slide 12 text

Each zone is populated with application clusters (‘auto-scaling groups’ or ASGs) that make up the service-oriented distributed system. Application clusters operate independently of each other, with software and hardware load balancing routing traffic between them.

Slide 13

Slide 13 text

Application clusters are made up of 1 to 100s of machine instances per zone. Service registry and discovery work with software load balancing to allow machines to launch and disappear (for planned or unplanned reasons) at any time and become part of the distributed system and serve requests. Auto-scaling enables system-wide adaptation to demand as it launches instances to meet increasing traffic and load or handle instance failure.

Slide 14

Slide 14 text

Failed instances are dropped from discovery so traffic stops routing to them. Software load balancers on client applications detect and skip them until discovery removes them.

Slide 15

Slide 15 text

Auto-scale policies bring on new instances to replace failed ones or to adapt to increasing demand.

Slide 16

Slide 16 text

Applications communicate with dozens of other applications in the service-oriented architecture. Each of these client/server dependencies represents a relationship within the complex distributed system.

Slide 17

Slide 17 text

User request blocked by latency in a single network call. Any one of these relationships can fail at any time. Failures can be intermittent or cluster-wide, immediate (thrown exceptions or returned error codes) or latency from various causes. Latency is particularly challenging for applications to deal with as it causes resource utilization in queues and pools and blocks user requests (even with async IO).

Slide 18

Slide 18 text

At high volume all request threads can block in seconds. Latency at high volume can quickly saturate all application resources (queues, pools, sockets, etc.), causing total application failure and the inability to serve user requests even if all other dependencies are healthy.

Slide 19

Slide 19 text

Dozens of dependencies. One going bad takes everything down. 99.99%^30 = 99.7% uptime (0.9999^30 ≈ 0.997). 0.3% of 1 billion requests = 3,000,000 failures and 2+ hours of downtime/month. Reality is generally worse. Large distributed systems are complex and failure will occur. If failures from every component are allowed to cascade across the system, they will all affect the user.

Slide 20

Slide 20 text

CONSTRAINTS: Speed of Iteration, Client Libraries, Mixed Environment. Solution design was done with the constraints, context and priorities of the Netflix environment.

Slide 21

Slide 21 text

CONSTRAINTS: Speed of Iteration, Client Libraries, Mixed Environment. Speed of iteration is optimized for, and this leads to client/server relationships where client libraries are provided rather than each team writing their own client code against a server protocol. This means “3rd party” code from many developers and teams is constantly being deployed into applications across the system. Large applications such as the Netflix API have dozens of client libraries.

Slide 22

Slide 22 text

CONSTRAINTS: Speed of Iteration, Client Libraries, Mixed Environment. Speed of iteration is optimized for, and this leads to client/server relationships where client libraries are provided rather than each team writing their own client code against a server protocol. This means “3rd party” code from many developers and teams is constantly being deployed into applications across the system. Large applications such as the Netflix API have dozens of client libraries.

Slide 23

Slide 23 text

CONSTRAINTS: Speed of Iteration, Client Libraries, Mixed Environment. The environment is also diverse, with different types of client/server communications and protocols. This heterogeneous and always-changing environment affects the approach to resilience engineering and is potentially very different from approaches taken for a tightly controlled codebase or homogeneous architecture.

Slide 24

Slide 24 text

Each dependency - or distributed system relationship - must be isolated so its failure does not cascade or saturate all resources.

Slide 25

Slide 25 text

It is not just the network that can fail and needs isolation, but the full request/response loop including business logic and serialization/deserialization:

Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc
Serialization - URL and/or body generation
Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
Deserialization - JSON/XML/Thrift/Protobuf/etc
Logic - validation, decoration, object model, caching, metrics, logging, etc

Protecting against a network failure only to return a response that causes application logic to fail elsewhere in the application only moves the problem.

Slide 26

Slide 26 text

"Timeout guard" daemon prio=10 tid=0x00002aaacd5e5000 nid=0x3aac runnable [0x00002aaac388f000] java.lang.Thread.State: RUNNABLE at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) - locked <0x000000055c7e8bd8> (a java.net.SocksSocketImpl) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391) at java.net.Socket.connect(Socket.java:579) at java.net.Socket.connect(Socket.java:528) at java.net.Socket.(Socket.java:425) at java.net.Socket.(Socket.java:280) at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80) at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$1.doit(ControllerThreadSocketFactory.java:91) at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$SocketTask.run(ControllerThreadSocketFactory.java:158) at java.lang.Thread.run(Thread.java:722) [Sat Jun 30 04:01:37 2012] [error] proxy: HTTP: disabled connection for (127.0.0.1) > 80% of requests rejected Median Latency This is an example of what a system looks like when high latency occurs without load shedding and isolation. Backend latency spiked (from <100ms to >1000ms at the median, >10,000 at the 90th percentile) and saturated all available resources resulting in the HTTP layer rejecting over 80% of requests.

Slide 27

Slide 27 text

Bulkheading is an approach to isolating failure and latency. It can be used to compartmentalize each system relationship so their failure impact is limited and controllable.

Slide 28

Slide 28 text

Bulkheading is an approach to isolating failure and latency. It can be used to compartmentalize each system relationship so their failure impact is limited and controllable.

Slide 29

Slide 29 text

Responses can be intercepted and replaced with fallbacks.

Slide 30

Slide 30 text

A user request can continue in a degraded state with a fallback response from the failing dependency.

Slide 31

Slide 31 text

A bulkhead wraps around the entire client behavior, not just the network portion:

Logic - validation, decoration, object model, caching, metrics, logging, etc
Deserialization - JSON/XML/Thrift/Protobuf/etc
Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
Serialization - URL and/or body generation
Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc

Slide 32

Slide 32 text

Tryable Semaphore: Permitted / Rejected. An effective form of bulkheading is a tryable semaphore that restricts concurrent execution around the entire client call. Read more at https://github.com/Netflix/Hystrix/wiki/How-it-Works#semaphores
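The following is a minimal sketch of the idea, using java.util.concurrent.Semaphore rather than Hystrix's internal implementation (the SemaphoreIsolation class and its method names are illustrative, not Hystrix API): callers beyond the concurrency limit are rejected immediately and given the fallback instead of blocking.

import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

public class SemaphoreIsolation<T> {

    private final Semaphore semaphore;

    public SemaphoreIsolation(int maxConcurrent) {
        this.semaphore = new Semaphore(maxConcurrent);
    }

    public T execute(Supplier<T> work, Supplier<T> fallback) {
        // tryAcquire() never blocks: it either obtains a permit or rejects
        if (semaphore.tryAcquire()) {
            try {
                return work.get();
            } finally {
                semaphore.release();
            }
        }
        // rejected: shed load by returning the fallback immediately
        return fallback.get();
    }
}

Note what this cannot do: if work.get() hangs, the calling thread hangs with it. A semaphore alone cannot time out, which is what the thread-pool form on the next slide addresses.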

Slide 33

Slide 33 text

Thread-pool: Permitted / Rejected / Timeout. A thread-pool also limits concurrent execution while additionally offering the ability to time out and walk away from a latent thread. Read more at https://github.com/Netflix/Hystrix/wiki/How-it-Works#threads--thread-pools
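A minimal sketch of thread-pool isolation with a timeout, using plain java.util.concurrent rather than Hystrix internals (class and method names are illustrative): a bounded pool caps concurrency, and Future.get with a timeout lets the caller walk away from a latent execution.

import java.util.concurrent.*;

public class ThreadPoolIsolation {

    // bounded pool + SynchronousQueue: excess submissions are rejected, not queued
    private final ExecutorService pool = new ThreadPoolExecutor(
            10, 10, 1, TimeUnit.MINUTES, new SynchronousQueue<Runnable>());

    public <T> T execute(Callable<T> work, T fallback, long timeoutMs) {
        try {
            Future<T> future = pool.submit(work); // throws RejectedExecutionException when saturated
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (RejectedExecutionException | TimeoutException e) {
            return fallback; // shed load, or walk away from latency
        } catch (Exception e) {
            return fallback; // underlying failure also falls back
        }
    }
}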

Slide 34

Slide 34 text

Netflix uses a combination of aggressive network timeouts, tryable semaphores and thread pools to isolate dependencies and limit impact of both failure and latency.

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

No content

Slide 41

Slide 41 text

Tryable semaphores for “trusted” clients and fallbacks
Separate threads for “untrusted” clients
Aggressive timeouts on threads and network calls to “give up and move on”
Circuit breakers as the “release valve”

Slide 42

Slide 42 text

[Hystrix execution flow chart: construct command → execute via .execute()/.queue()/.observe() → circuit open? → rate limit? → run() → success or exception → circuit-health feedback loop → fallback paths] Read more at https://github.com/Netflix/Hystrix/wiki/How-it-Works#flow-chart

Slide 43

Slide 43 text

Execution can be synchronous (.execute()) or asynchronous (.queue() returning a Future, or .observe() returning an Observable).

Slide 44

Slide 44 text

Current state is queried before allowing execution to determine if it is short-circuited or throttled and should reject.

Slide 45

Slide 45 text

If not rejected, execution proceeds to the run() method, which performs the underlying work.

Slide 46

Slide 46 text

Successful responses return.

Slide 47

Slide 47 text

All requests, successful and failed, contribute to a feedback loop used to make decisions and publish metrics.

Slide 48

Slide 48 text

All failure states are routed through the same path.

Slide 49

Slide 49 text

Every failure is given the opportunity to retrieve a fallback, which can produce one of three results: a successful fallback, a failed fallback, or “not implemented.”

Slide 50

Slide 50 text

Hystrix execution flow chart. Read more at https://github.com/Netflix/Hystrix/wiki/How-it-Works#flow-chart

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

(If using LinkedBlockingQueue instead of SynchronousQueue)

Slide 54

Slide 54 text

Queuing is Not Free: thread-pool size + queue size. Example sizing: 30 rps x 0.2 seconds = 6 concurrent, + breathing room = 10 threads; queue size of 5-10 (0 doesn't work, but get close to it) if LinkedBlockingQueue is used (the default uses SynchronousQueue).

If LinkedBlockingQueue is used instead of the default SynchronousQueue, requests in the queue block user threads and thus must be considered part of the resources being allocated to a dependency. Setting a queue to 100 is equivalent to saying 100 incoming requests can block while waiting for this dependency. There is typically not a good reason for having a queue size higher than 5-10. Bursting should be handled through batching, and throughput should be accommodated by a large enough thread-pool. It is better to increase the thread-pool size rather than the queue, as commands executing in the thread-pool make forward progress whereas items in the queue do not. See https://github.com/Netflix/Hystrix/wiki/Configuration#maxqueuesize for more information.
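As a hedged illustration only, the sizing above maps onto Hystrix's thread-pool properties roughly like this (these setters exist in HystrixThreadPoolProperties; the values are this example's, not a recommendation):

HystrixThreadPoolProperties.Setter()
        .withCoreSize(10)                     // 30 rps x 0.2s = 6 concurrent, plus breathing room
        .withMaxQueueSize(5)                  // opting into LinkedBlockingQueue; the default of -1 uses SynchronousQueue
        .withQueueSizeRejectionThreshold(5);  // reject before the queue fills; adjustable at runtime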

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

No content

Slide 57

Slide 57 text

[Chart: Cost of Thread @ ~60rps; mean - median - 90th - 99th (time in ms); time for thread to execute vs time user thread waited]

The Netflix API has ~40 thread pools with 5-20 threads in each. A common question and concern is what impact this has on performance. Here is a sample of a dependency circuit over 24 hours from the Netflix API production cluster, at a rate of 60rps per server. Each execution occurs in a separate thread, with mean, median, 90th and 99th percentile latencies shown in the first 4 legend values. The second group of 4 values is the user thread waiting on the dependency thread, showing the total time including queuing, scheduling, execution and waiting for the return value from the Future. The calling thread median, 90th and 99th percentiles are the last 3 legend values. This example was chosen because it is relatively high volume and low latency, so the cost of a separate thread is potentially more of a concern than if the backend network latency were 100ms or higher.

Slide 58

Slide 58 text

[Chart: Cost of Thread @ ~60rps — Cost: 0ms] At the median (and lower) there is no cost to having a separate thread.

Slide 59

Slide 59 text

[Chart: Cost of Thread @ ~60rps — Cost: 3ms] At the 90th percentile there is a cost of 3ms for having a separate thread.

Slide 60

Slide 60 text

[Chart: Cost of Thread @ ~60rps — Cost: 9ms] At the 99th percentile there is a cost of 9ms for having a separate thread. Note, however, that the increase in cost is far smaller than the increase in execution time of the separate thread, which jumped from 2ms to 28ms, whereas the cost jumped from 0ms to 9ms. This overhead at the 90th percentile and higher has been deemed acceptable for circuits such as these given the resilience benefits achieved. For circuits that wrap very low latency requests (such as those primarily hitting in-memory caches) the overhead can be too high, and in those cases we choose tryable semaphores, which do not allow for timeouts but provide most of the resilience benefits without the overhead. In general, though, the overhead is small enough that we prefer the isolation benefits of a separate thread.

Slide 61

Slide 61 text

[Chart: Cost of Thread @ ~75rps; mean - median - 90th - 99th (time in ms); time for thread to execute vs time user thread waited]

This is a second sample of a dependency circuit over 24 hours from the Netflix API production cluster, at a rate of 75rps per server. As with the first example, it was chosen because it is relatively high volume and low latency, so the cost of a separate thread is potentially more of a concern than if the backend network latency were 100ms or higher. Each execution occurs in a separate thread, with mean, median, 90th and 99th percentile latencies shown in the first 4 legend values. The second group of 4 values is the user thread waiting on the dependency thread, showing the total time including queuing, scheduling, execution and waiting for the return value from the Future. The calling thread median, 90th and 99th percentiles are the last 3 legend values.

Slide 62

Slide 62 text

[Chart: Cost of Thread @ ~75rps — Cost: 0ms] At the median (and lower) there is no cost to having a separate thread.

Slide 63

Slide 63 text

[Chart: Cost of Thread @ ~75rps — Cost: 2ms] At the 90th percentile there is a cost of 2ms for having a separate thread.

Slide 64

Slide 64 text

[Chart: Cost of Thread @ ~75rps — Cost: 2ms] At the 99th percentile there is a cost of 2ms for having a separate thread.

Slide 65

Slide 65 text

Semaphores: effectively no cost at ~5000rps per instance. Semaphore isolation, on the other hand, is used for dependencies that are very high-volume in-memory lookups and never result in a synchronous network request. The cost is practically zero (an atomic compare-and-set counter implements the semaphore).
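A hedged sketch of why the cost is near zero: the tryable semaphore can be implemented as a simple atomic counter (in the spirit of Hystrix's implementation, not its actual code), so acquiring a permit is a single compare-and-set with no locking, blocking or thread handoff.

import java.util.concurrent.atomic.AtomicInteger;

public class TryableSemaphore {

    private final AtomicInteger count = new AtomicInteger(0);
    private final int max;

    public TryableSemaphore(int max) {
        this.max = max;
    }

    public boolean tryAcquire() {
        int current = count.incrementAndGet();
        if (current > max) {
            count.decrementAndGet(); // over the limit: undo and reject
            return false;
        }
        return true;
    }

    public void release() {
        count.decrementAndGet();
    }
}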

Slide 66

Slide 66 text

HystrixCommand run()

public class CommandHelloWorld extends HystrixCommand<String> {
    ...
    protected String run() {
        return "Hello " + name + "!";
    }
}

Basic successful execution pattern and sample code. Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#wiki-Hello-World
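For reference, the complete Hello World command from the Hystrix wiki looks roughly like this (constructor included; the “ExampleGroup” key is the wiki example's, not a Netflix production value):

public class CommandHelloWorld extends HystrixCommand<String> {

    private final String name;

    public CommandHelloWorld(String name) {
        // every command needs at least a group key for grouping and reporting
        super(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"));
        this.name = name;
    }

    @Override
    protected String run() {
        // the wrapped work, executed inside the bulkhead
        return "Hello " + name + "!";
    }
}

String s = new CommandHelloWorld("World").execute(); // "Hello World!"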

Slide 67

Slide 67 text

(Same code as above.) run() invokes “client” logic: the run() method is where the wrapped logic goes.

Slide 68

Slide 68 text

Fail Fast: run() throws an exception. Failing fast is the default behavior if no fallback is implemented. Even without a fallback this is useful, as it prevents resource saturation beyond the bulkhead so the rest of the application can continue functioning, and it enables rapid recovery once the underlying problem is resolved. Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#fail-fast
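A minimal fail-fast sketch, patterned on the wiki's fail-fast example (the class name and message are illustrative): with no getFallback() implemented, the exception surfaces to the caller as a HystrixRuntimeException while the bulkhead contains the damage.

public class CommandThatFailsFast extends HystrixCommand<String> {

    public CommandThatFailsFast() {
        super(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"));
    }

    @Override
    protected String run() {
        // no fallback implemented, so this propagates to the caller
        // as a HystrixRuntimeException: fail fast behind the bulkhead
        throw new RuntimeException("failure from CommandThatFailsFast");
    }
}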

Slide 69

Slide 69 text

Fail Silent

return null;
return new Option();
return Collections.emptyList();
return Collections.emptyMap();

Silent failure is an approach for removing non-essential functionality from the user experience by returning a value that equates to “no data”, “not available” or “don't display”. Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#fail-silent

Slide 70

Slide 70 text

Static Fallback

return true;
return DEFAULT_OBJECT;

Static fallbacks can be used when default data or behavior can be returned to the user. Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#fallback-static

Slide 71

Slide 71 text

Stubbed Fallback

return new UserAccount(customerId, "Unknown Name",
                countryCodeFromGeoLookup, true, true, false);
return new VideoBookmark(movieId, 0);

Stubbed fallbacks are an extension of static fallbacks used when some data is available (such as from request arguments, authentication tokens or other functioning system calls) and is combined with default values for data that cannot be retrieved. Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#fallback-stubbed

Slide 72

Slide 72 text

Stubbed Fallback

public class CommandHelloWorld extends HystrixCommand<String> {
    ...
    protected String run() {
        return "Hello " + name + "!";
    }

    protected String getFallback() {
        return "Hello Failure " + name + "!";
    }
}

Slide 73

Slide 73 text

(Same code as above.) The getFallback() method is executed whenever failure occurs (after a run() invocation fails, or on rejection without run() ever being invoked) to provide an opportunity to fall back.

Slide 74

Slide 74 text

Fallback via network: a common approach for falling back to a stale cache (such as a memcache server) or a less personalized value when unable to fetch from the primary source. Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#fallback-cache-via-network

Slide 75

Slide 75 text

Fallback via network, then local: when the fallback performs a network call, it is preferable for it to also have a fallback that does not go over the network; otherwise, if both primary and secondary systems fail, it will fail by throwing an exception (similar to fail fast, except after two fallback attempts).

Slide 76

Slide 76 text

A typical blocking client call ...

BookmarkClient.getInstance().getBookmark(user, movie);

Slide 77

Slide 77 text

... gets wrapped inside a HystrixCommand within the run() method:

public class GetBookmarkCommand extends HystrixCommand<Bookmark> {
    private final User user;
    private final Movie movie;

    public GetBookmarkCommand(User user, Movie movie) {
        super(HystrixCommandGroupKey.Factory.asKey("VideoHistory"));
        this.user = user;
        this.movie = movie;
    }

    @Override
    protected Bookmark run() throws Exception {
        return BookmarkClient.getInstance().getBookmark(user, movie);
    }
}

Slide 78

Slide 78 text

(Same code as above.) Arguments are accepted by the constructor to make them available to the run() method when invoked.

Slide 79

Slide 79 text

(Same code as above.) The minimal config of a command is providing a HystrixCommandGroupKey.

Slide 80

Slide 80 text

Various other config options are also available ...

public class GetBookmarkCommand extends HystrixCommand<Bookmark> {
    private final User user;
    private final Movie movie;

    public GetBookmarkCommand(User user, Movie movie) {
        super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("VideoHistory"))
                .andCommandKey(HystrixCommandKey.Factory.asKey("GetBookmark"))
                .andThreadPoolKey(HystrixThreadPoolKey.Factory.asKey("VideoHistoryRead"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        .withExecutionIsolationThreadTimeoutInMilliseconds(500)));
        this.user = user;
        this.movie = movie;
    }

    @Override
    protected Bookmark run() throws Exception {
        return BookmarkClient.getInstance().getBookmark(user, movie);
    }
}

Slide 81

Slide 81 text

(Same code as above.) ... the GroupKey ...

Slide 82

Slide 82 text

(Same code as above.) ... the CommandKey (normally defaults to the class name) ...

Slide 83

Slide 83 text

(Same code as above.) ... a ThreadPoolKey (normally defaults to the GroupKey) ...

Slide 84

Slide 84 text

(Same code as above.) ... and various properties, the most common to change being the timeout (which defaults to 1000ms) ...

Slide 85

Slide 85 text

public class GetBookmarkCommand extends HystrixCommand<Bookmark> {
    private final User user;
    private final Movie movie;

    public GetBookmarkCommand(User user, Movie movie) {
        super(HystrixCommandGroupKey.Factory.asKey("VideoHistory"));
        this.user = user;
        this.movie = movie;
    }

    @Override
    protected Bookmark run() throws Exception {
        return BookmarkClient.getInstance().getBookmark(user, movie);
    }

    @Override
    protected Bookmark getFallback() {
        return new Bookmark(0);
    }

    @Override
    protected String getCacheKey() {
        return movie.getId() + "_" + user.getId();
    }
}

Generally, however, this is all that is needed. More can be read on configuration options at https://github.com/Netflix/Hystrix/wiki/Configuration

Slide 86

Slide 86 text

(Same code as above.) The getFallback() method can be implemented to provide fallback responses when failure occurs. More information is available at https://github.com/Netflix/Hystrix/wiki/How-To-Use#wiki-Fallback

Slide 87

Slide 87 text

(Same code as above.) The getCacheKey() method allows de-duping calls within a single request context. More information is available at https://github.com/Netflix/Hystrix/wiki/How-To-Use#wiki-Caching
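Request-scoped caching via getCacheKey() only applies within an initialized HystrixRequestContext (in a web application this is typically managed by a servlet filter). A minimal hedged usage sketch:

HystrixRequestContext context = HystrixRequestContext.initializeContext();
try {
    Bookmark b1 = new GetBookmarkCommand(user, movie).execute(); // executes run()
    Bookmark b2 = new GetBookmarkCommand(user, movie).execute(); // same cache key: served from the request cache
} finally {
    context.shutdown(); // release request-scoped state
}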

Slide 88

Slide 88 text

public class GetMoviesCommand extends HystrixCommand<Movies> {
    private final User user;

    public GetMoviesCommand(User user) {
        super(HystrixCommandGroupKey.Factory.asKey("MovieListsService"));
        this.user = user;
    }

    @Override
    protected Movies run() throws Exception {
        return MoviesClient.getInstance().getMovies(user);
    }

    @Override
    protected Movies getFallback() {
        return new GetDefaultMoviesCommand(user).execute();
    }

    @Override
    protected String getCacheKey() {
        return String.valueOf(user.getId());
    }
}

This example demonstrates a different fallback strategy ...

Slide 89

Slide 89 text

(Same code as above.) ... it uses a network client that can fail ...

Slide 90

Slide 90 text

(Same code as above.) ... and a cache key for de-duping ...

Slide 91

Slide 91 text

(Same code as above.) ... but the fallback executes another HystrixCommand, since the fallback is retrieved over the network as well.

Slide 92

Slide 92 text

public class GetDefaultMoviesCommand extends HystrixCommand<Movies> {
    private final User user;

    public GetDefaultMoviesCommand(User user) {
        super(HystrixCommandGroupKey.Factory.asKey("MovieListsService")); // group key assumed; not shown on the slide
        this.user = user;
    }

    @Override
    protected Movies run() throws Exception {
        return MoviesClient.getInstance().getDefaultMovies(user);
    }

    @Override
    protected Movies getFallback() {
        return MoviesClient.getInstance().getDefaultMoviesSnapshot();
    }
}

The fallback HystrixCommand will execute the network call within its run() method ...

Slide 93

Slide 93 text

(Same code as above.) ... and if it also fails, it too has a fallback, but this one executes locally. Thus we first try to get personalized movies for a user, then a generic fallback from a remote cache, and if both of those fail, a global fallback from a local cache.

Slide 94

Slide 94 text

public class GetUserCommand extends HystrixCommand<User> {
    private final int id;

    public GetUserCommand(int id) {
        super(HystrixCommandGroupKey.Factory.asKey("User"));
        this.id = id;
    }

    @Override
    protected User run() throws Exception {
        return UserClient.getInstance().getUser(id);
    }
}

This use case of fetching user data provides another variant of fallback behavior.

Slide 95

Slide 95 text

(Same code as above.) At first it appears this does not have a valid fallback and will only be able to fail fast. What could we return, if we can't fetch a user, that would still allow things to work?

Slide 96

Slide 96 text

public class GetUserCommand extends HystrixCommand<User> {
    private final int id;
    private final Cookie[] requestCookies;

    public GetUserCommand(int id, Cookie[] requestCookies)

        ... for brevity on slide ...

    @Override
    protected User getFallback() {
        if (... cookies valid ...) {
            User stubbedUser = new User(id);
            // logic for retrieving defaults from cookies
            return stubbedUser;
        } else {
            throw new RuntimeException("Unable to retrieve user from service or cookies");
        }
    }
}

To do so, we change the input arguments and accept the stateful cookies sent with every authenticated request, using those to retrieve key user data from which a stubbed response can be made.

Slide 97

Slide 97 text

(Same code as above.) If the cookies have the necessary data (cookies for all authenticated users should), then we can fall back to a stubbed response.

Slide 98

Slide 98 text

(Same code as above.) Most of the fields of the User object will be defaults and may affect the user experience in certain ways, but the critical pieces are available from the cookies. This allows the system to degrade rather than fail. Since the User object is critical to almost every incoming request, this is essential to application resiliency. We favor degrading the experience over outright failure.

Slide 99

Slide 99 text

User user = new GetUserCommand(id, requestCookies).execute();

Future<User> user = new GetUserCommand(id, requestCookies).queue();

Observable<User> user = new GetUserCommand(id, requestCookies).observe();

Once a command exists it can be executed in 3 ways: execute(), queue() and observe(). More information can be found at https://github.com/Netflix/Hystrix/wiki/How-To-Use
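As a hedged usage sketch (RxJava 1.x with Java 8 lambdas; the render and log handlers are illustrative, not part of Hystrix), the Observable form can be consumed without blocking the calling thread:

Observable<User> userObservable = new GetUserCommand(id, requestCookies).observe();
userObservable.subscribe(
    user -> render(user),  // onNext: the value, possibly a fallback
    error -> log(error)    // onError: fallbacks exhausted or not implemented
);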

Slide 100

Slide 100 text

So now what? Code is only part of the solution. Operations is the other critical half.

Slide 101

Slide 101 text

Historical metrics represent all possible states of success, failure, decision making and performance related to each bulkhead.

Slide 102

Slide 102 text

>40,000 successes, 1 timeout, 4 rejected. Looking closely at high-volume systems, it is common to find constant failure.

Slide 103

Slide 103 text

The rejection spikes on the left correlate with, and in fact represent the cause of, the fallback spikes on the right.

Slide 104

Slide 104 text

Latency percentiles are captured at every 5th percentile and a few extra such as 99.5th (though this graph is only showing 50th/99th/99.5th).

Slide 105

Slide 105 text

No content

Slide 106

Slide 106 text

>40,000 successes, 0.10 exceptions. “Exceptions Thrown” helps identify whether a failure state is being handled successfully by a fallback. In this case we are seeing <0.1 exceptions per second being thrown, but the previous set of metrics showed 5-40 fallbacks occurring each second; thus the fallbacks are doing their job, though we may want to look for the very small number of edge cases where fallbacks fail and result in an exception.

Slide 107

Slide 107 text

We found that historical metrics with 1 datapoint per minute and 1-2 minutes latency were not sufficient during operational events such as deployments, rollbacks, production alerts and configuration changes so we built near realtime monitoring and data visualizations to help us consume large amounts of data easily. This dashboard is the aggregate view of a production cluster with ~1-2 second latency from the time an event occurs to being rendered in the browser. Read more at https://github.com/Netflix/Hystrix/wiki/Dashboard

Slide 108

Slide 108 text

Each bulkhead is represented with a visualization like this.

Slide 109

Slide 109 text

circle color and size represent health and traffic volume

Slide 110

Slide 110 text

2 minutes of request rate to show relative changes in traffic circle color and size represent health and traffic volume

Slide 111

Slide 111 text

2 minutes of request rate to show relative changes in traffic circle color and size represent health and traffic volume hosts reporting from cluster

Slide 112

Slide 112 text

last minute latency percentiles 2 minutes of request rate to show relative changes in traffic circle color and size represent health and traffic volume hosts reporting from cluster

Slide 113

Slide 113 text

last minute latency percentiles 2 minutes of request rate to show relative changes in traffic circle color and size represent health and traffic volume hosts reporting from cluster Circuit-breaker status

Slide 114

Slide 114 text

last minute latency percentiles Request rate 2 minutes of request rate to show relative changes in traffic circle color and size represent health and traffic volume hosts reporting from cluster Circuit-breaker status

Slide 115

Slide 115 text

Error percentage of last 10 seconds; last minute latency percentiles; request rate; 2 minutes of request rate to show relative changes in traffic; circle color and size represent health and traffic volume; hosts reporting from cluster; circuit-breaker status

Slide 116

Slide 116 text

last minute latency percentiles Request rate 2 minutes of request rate to show relative changes in traffic circle color and size represent health and traffic volume hosts reporting from cluster Error percentage of last 10 seconds Rolling 10 second counters with 1 second granularity Failures/Exceptions Thread-pool Rejections Thread timeouts Successes Short-circuited (rejected) Circuit-breaker status

Slide 117

Slide 117 text

Low Latency Granular Metrics: rolling 10-second window, 1-second resolution. [Diagram: 10 one-second buckets, each holding Success / Timeout / Failure / Rejection counts; on getLatestBucket, if the 1-second window has passed a new bucket is created, the rest slide over, and the oldest is dropped.] All metrics are captured in both absolute cumulative counters and rolling windows with 1-second granularity. Read more at https://github.com/Netflix/Hystrix/wiki/Metrics-and-Monitoring

Slide 118

Slide 118 text

The rolling counters default to 10-second windows with 1-second buckets.

Slide 119

Slide 119 text

As each second passes the oldest bucket is dropped (soon to be overwritten, since it is a ring buffer) ...

Slide 120

Slide 120 text

... and a new bucket is created.
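A minimal sketch of the idea, deliberately simplified relative to Hystrix's actual rolling-number implementation (the class name is illustrative): a ring buffer of one-second buckets where a stale slot is reset in place, which is the “oldest dropped, new created” step described above.

import java.util.concurrent.atomic.AtomicLongArray;

public class RollingCounter {

    private static final int BUCKETS = 10;       // rolling 10-second window
    private static final long BUCKET_MS = 1000;  // 1-second resolution
    private final AtomicLongArray counts = new AtomicLongArray(BUCKETS);
    private final long[] bucketStart = new long[BUCKETS];

    public synchronized void increment() {
        counts.incrementAndGet(currentBucket());
    }

    public synchronized long sum() {
        long now = System.currentTimeMillis();
        long total = 0;
        for (int i = 0; i < BUCKETS; i++) {
            // only include buckets still inside the rolling window
            if (now - bucketStart[i] < BUCKETS * BUCKET_MS) {
                total += counts.get(i);
            }
        }
        return total;
    }

    private int currentBucket() {
        long now = System.currentTimeMillis();
        int index = (int) ((now / BUCKET_MS) % BUCKETS);
        if (now - bucketStart[index] >= BUCKET_MS) {
            // the slot holds a stale bucket: drop it and start a new one
            bucketStart[index] = now - (now % BUCKET_MS);
            counts.set(index, 0);
        }
        return index;
    }
}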

Slide 121

Slide 121 text

Turbine stream aggregator: an aggregated stream with ~1-second latency. Metrics are subscribed to from all servers in a cluster and aggregated with ~1 second of latency from event to aggregation. This stream can then be consumed by the dashboard, an alerting system or anything else wanting low-latency metrics.

Slide 122

Slide 122 text

Low Latency Configuration Changes: propagate across a cluster in seconds. The low-latency loop is completed with the ability to propagate configuration changes across a cluster in seconds. This enables rapid iteration: seeing behavior in production, pushing config changes, and then watching them take effect immediately as the changes roll across a cluster of servers. Low-latency operations require both visibility into metrics and the ability to effect change within similar latency windows.
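Netflix drives this with Archaius-backed dynamic properties. A hedged sketch of changing one command's timeout at runtime (the property name format is from the Hystrix configuration wiki; the “GetBookmark” command key is this deck's example):

// each instance picks up the new value without a redeploy as it propagates
ConfigurationManager.getConfigInstance().setProperty(
        "hystrix.command.GetBookmark.execution.isolation.thread.timeoutInMilliseconds", 300);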

Slide 123

Slide 123 text

Auditing via Simulation: simulating failure states in production has proven an effective approach for auditing our applications, to either prove resilience or find weakness.

Slide 124

Slide 124 text

In this example, failure was injected into a single dependency, which caused the bulkhead to return fallbacks and trip all circuits, since the failure rate was almost 100%, well above the threshold for circuits to trip.

Slide 125

Slide 125 text

When the 'TitleStatesGetAllRentStates' bulkhead began returning fallbacks, the 'atv_mdp' endpoint shot to the top of the dashboard with a 99% error rate. There was a bug in how the fallback was handled, so we immediately stopped the simulation, fixed the bug over the following days, and repeated the simulation to prove it was fixed and that the rest of the system remained resilient. This was caught in a controlled simulation where we could catch it and act in less than a minute, rather than in a true production incident where we likely wouldn't have been able to do anything.

Slide 126

Slide 126 text

This shows another simulation where latency was injected. Read more at http://techblog.netflix.com/2011/07/netflix-simian-army.html

Slide 127

Slide 127 text

125 → 1500+: 1000+ ms of latency was injected into a dependency that normally completes with a median latency of ~15-20ms and a 99.5th percentile of 120-130ms.

Slide 128

Slide 128 text

~5000: the latency spike caused timeouts, short-circuiting and rejection, resulting in up to ~5000 fallbacks per second across these various failure states.

Slide 129

Slide 129 text

~1: while delivering ~5000 fallbacks per second, exceptions thrown didn't go beyond ~1 per second, demonstrating that user impact was negligible (as perceived from the server; client behavior must also be validated during a simulation, but is not part of this dataset).

Slide 130

Slide 130 text

Zuul Routing Layer: Canary vs Baseline, Squeeze, Production “Coalmine”. Other approaches to auditing take advantage of our routing layer to route traffic to different clusters. Read more at http://techblog.netflix.com/2013/06/announcing-zuul-edge-service-in-cloud.html

Slide 131

Slide 131 text

Every code deployment is preceded by a canary test where a small number of instances are launched to take production traffic, half with new code (canary), half with existing production code (baseline), and compared for differences. Thousands of system, application and bulkhead metrics are compared to make a decision on whether the new code should continue to full deployment. Many issues are found via production canaries that are not found in dev and test environments. Read more at http://techblog.netflix.com/2013/08/deploying-netflix-api.html

Slide 132

Slide 132 text

New instances are also put through a squeeze test before full rollout to find the point at which performance degrades. This is used to identify performance and throughput changes of each deployment.

Slide 133

Slide 133 text

Long-term canaries are kept in a cluster we call “coalmine”, with agents intercepting all network traffic. These run the same code as the production cluster and are used to identify network traffic without a bulkhead that starts happening due to unknown code paths being enabled via configuration, A/B tests and other changes. Read more at https://github.com/Netflix/Hystrix/tree/master/hystrix-contrib/hystrix-network-auditor-agent

Slide 134

Slide 134 text

System relationship over the network without a bulkhead: for example, a network relationship could exist in production code but not be triggered in dev, test or production canaries, and then be enabled via a condition that changes days after deployment to production. This can be a vulnerability, so we use the “coalmine” to identify these situations and inform decisions.

Slide 135

Slide 135 text

Zuul Routing Layer Canary vs Baseline Squeeze Production "Coalmine"

Slide 136

Slide 136 text

Failure inevitably happens ... A good read on complex systems is Drift into Failure by Sidney Dekker: http://www.amazon.com/Drift-into-Failure-ebook/dp/B009KOKXKY/ref=tmm_kin_title_0

Slide 137

Slide 137 text

Failure isolated, cluster adapts: when the backing system for the 'SocialGetTitleContext' bulkhead became latent, the impact was contained and fallbacks were returned.

Slide 138

Slide 138 text

Failure isolated, cluster adapts: when the backing system for the 'SocialGetTitleContext' bulkhead became latent, the impact was contained and fallbacks were returned.

Slide 139

Slide 139 text

Since the failure rate was above the threshold, circuit breakers began tripping. As portions of the cluster tripped circuits, pressure was released on the underlying system so it could successfully perform some work.

Slide 140

Slide 140 text

The cluster naturally adapts as bulkheads constrain throughput and circuits open and close in a rolling manner across the instances in the cluster.

Slide 141

Slide 141 text

In this example the ‘CinematchGetPredictions’ functionality began failing.

Slide 142

Slide 142 text

The red metric shows it was exceptions thrown by the client, not latency or concurrency constraints.

Slide 143

Slide 143 text

The 20% error rate from the realtime visualization is also seen in the historical metrics with accompanying drop in successes.

Slide 144

Slide 144 text

Matching the increase in failures is the increase of fallbacks being delivered for every failure.

Slide 145

Slide 145 text

Distributed Systems are Complex. Distributed applications need to be treated as complex systems, and we must recognize that no machine or human can comprehend all of their state or interactions.

Slide 146

Slide 146 text

Isolate Relationships. One way of dealing with a complex system is to isolate its relationships so they can each fail independently of one another. Bulkheads have proven an effective approach for isolating and managing failure.

Slide 147

Slide 147 text

Auditing & Operations are Essential. Resilient code is only part of the solution. Systems drift, have latent bugs, and exhibit failure states that emerge from the complex interactions of their many relationships; constant auditing can be part of the solution. Human operations must handle everything the system can't, which by definition is unknown, so the system must strive to expose clear insights and effective tooling so humans can make informed decisions.

Slide 148

Slide 148 text

Hystrix
https://github.com/Netflix/Hystrix

Application Resilience in a Service-oriented Architecture
http://programming.oreilly.com/2013/06/application-resilience-in-a-service-oriented-architecture.html

Fault Tolerance in a High Volume, Distributed System
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html

Making the Netflix API More Resilient
http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html

Ben Christensen
@benjchristensen
http://www.linkedin.com/in/benjchristensen

jobs.netflix.com