Application Resilience Engineering and Operations at Netflix with Hystrix - JavaOne 2013

Learn how the Netflix API achieves fault tolerance in a distributed architecture that depends on dozens of systems which can fail at any time, while serving more than two billion web service calls each day to 1000+ different devices. Topics include common patterns, production examples, and operational lessons from how Netflix incorporates fault and latency tolerance into its distributed systems using circuit breakers, bulkheads, and other patterns embodied in the open source Hystrix library, and how it operates them using real-time metrics and data visualization tools.

Presented at JavaOne 2013: https://oracleus.activeevents.com/2013/connect/sessionDetail.ww?SESSION_ID=2624&tclass=popup

Hystrix at Netflix: http://techblog.netflix.com/2012/11/hystrix.html

Hystrix on Github: https://github.com/Netflix/Hystrix

Video: https://www.youtube.com/watch?v=RzlluokGi1w

Ben Christensen

September 25, 2013

Transcript

  1. Application Resilience Engineering
    and Operations at Netflix with Hystrix
    Ben Christensen – @benjchristensen – Software Engineer on Edge Platform at Netflix

  2. Netflix is a subscription service for movies and TV shows for US$7.99/month (about the same converted price in each country's local currency).

  3. More than 37 million Subscribers
    in 50+ Countries and Territories
    Netflix has over 37 million video streaming customers in 50+ countries and territories across North & South America, the United Kingdom, Ireland, the Netherlands and the Nordics.

  4. Netflix accounts for 33% of Peak Downstream
    Internet Traffic in North America
    Netflix subscribers are watching
    more than 1 billion hours a month
    Sandvine report available with free account at http://www.sandvine.com/news/global_broadband_trends.asp

  5. API traffic has grown from
    ~20 million/day in 2010 to >2 billion/day
    [Bar chart: millions of API requests per day, 2010 through today, scale 0–2000]

  6. Discovery Streaming
    Streaming devices talk to 2 major edge services: the first is the Netflix API, which provides functionality related to discovering and browsing content, while the second handles the playback of video streams.

  7. Netflix API Streaming
    This presentation focuses on architectural choices made for the “Discovery” portion of traffic that the Netflix API handles.

  8. The Netflix API powers the “Discovery” user experience on the 800+ devices up until a user hits the play button, at which point the “Streaming” edge service takes over.

  9. Netflix API
    [Diagram: the Netflix API brokering between devices and Dependencies A through R]
    The Netflix API serves all streaming devices and acts as the broker between backend Netflix systems and the user interfaces running on the 800+ devices that support Netflix streaming.
    2+ billion incoming calls per day are received, which in turn fan out to several billion outgoing calls (an average ratio of 1:6) to dozens of underlying subsystems.

  10. Global deployment spread across data centers in multiple AWS regions.
    Geographic isolation, active/active with regional failover coming (http://techblog.netflix.com/2013/05/denominating-multi-region-sites.html)

  11. [Diagram: three AWS Availability Zones within a region]
    3 data centers (AWS Availability Zones) operate in each region, with deployments split across them for redundancy in the event of losing an entire zone.

  12. Each zone is populated with application clusters (‘auto-scaling groups’ or ASGs) that make up the service-oriented distributed system.
    Application clusters operate independently of each other with software and hardware load balancing routing traffic between them.

  13. Application clusters are made up of 1 to 100s of machine instances per zone. Service registry and discovery work with software load balancing to allow machines to launch and disappear (for planned or unplanned
    reasons) at any time and become part of the distributed system and serve requests. Auto-scaling enables system-wide adaptation to demand as it launches instances to meet increasing traffic and load or handle
    instance failure.

  14. Failed instances are dropped from discovery so traffic stops routing to them. Software load balancers on client applications detect and skip them until discovery removes them.

  15. Auto-scale policies bring up new instances to replace failed ones or to adapt to increasing demand.

  16. [Diagram: a user request fanning out to Dependencies A through R]
    Applications communicate with dozens of other applications in the service-oriented architecture. Each of these client/server dependencies represents a relationship within the complex distributed system.

  17. [Diagram: a user request blocked by latency in a single network call to one of Dependencies A through R]
    Any one of these relationships can fail at any time. Failures can be intermittent or cluster-wide, immediate (thrown exceptions or returned error codes) or latent from various causes. Latency is particularly challenging for applications to deal with, as it ties up resources in queues and pools and blocks user requests (even with async IO).

  18. [Diagram: at high volume, all request threads can block in seconds behind one latent dependency]
    Latency at high volume can quickly saturate all application resources (queues, pools, sockets, etc.), causing total application failure and the inability to serve user requests even if all other dependencies are healthy.

  19. Dozens of dependencies.
    One going bad takes everything down.
    99.99%^30 = 99.7% uptime
    0.3% of 1 billion = 3,000,000 failures
    2+ hours downtime/month
    Reality is generally worse.
    Large distributed systems are complex and failure will occur. If failures from every component are allowed to cascade across the system, they will all affect the user.
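    The arithmetic on this slide can be checked directly (a quick sketch; the 30-dependency count and the 99.99% per-service uptime are the slide's own assumptions):

```java
public class Availability {
    // Aggregate uptime when a request touches N dependencies,
    // each independently available perServiceUptime of the time.
    static double aggregateUptime(double perServiceUptime, int dependencies) {
        return Math.pow(perServiceUptime, dependencies);
    }

    public static void main(String[] args) {
        double uptime = aggregateUptime(0.9999, 30);      // ~0.997, i.e. 99.7%
        double failures = 1_000_000_000L * (1 - uptime);  // ~3 million per billion requests
        System.out.printf("uptime=%.4f failures=%.0f%n", uptime, failures);
    }
}
```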

  20. CONSTRAINTS
    Speed of Iteration
    Client Libraries
    Mixed Environment
    Solution design was done with constraints, context and priorities of the Netflix environment.

  21. CONSTRAINTS
    Speed of Iteration
    Client Libraries
    Mixed Environment
    Speed of iteration is optimized for, and this leads to client/server relationships where client libraries are provided rather than each team writing its own client code against a server protocol. This means “3rd party” code from many developers and teams is constantly being deployed into applications across the system. Large applications such as the Netflix API have dozens of client libraries.

  22. CONSTRAINTS
    Speed of Iteration
    Client Libraries
    Mixed Environment
    Speed of iteration is optimized for, and this leads to client/server relationships where client libraries are provided rather than each team writing its own client code against a server protocol. This means “3rd party” code from many developers and teams is constantly being deployed into applications across the system. Large applications such as the Netflix API have dozens of client libraries.

  23. CONSTRAINTS
    Speed of Iteration
    Client Libraries
    Mixed Environment
    The environment is also diverse, with different types of client/server communications and protocols. This heterogeneous and always-changing environment affects the approach for resilience engineering and is potentially very different from approaches taken for a tightly controlled codebase or homogeneous architecture.

  24. [Diagram: user requests fanning out to Dependencies A through R]
    Each dependency - or distributed system relationship - must be isolated so its failure does not cascade or saturate all resources.

  25. [Diagram: the layers inside each client dependency call]
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Logic - argument validation, caches, metrics, logging,
    multivariate testing, routing, etc
    Serialization - URL and/or body generation
    Logic - validation, decoration, object model, caching,
    metrics, logging, etc
    It is not just the network that can fail and needs isolation, but the full request/response loop, including business logic and serialization/deserialization. Protecting against a network failure only to return a response that causes application logic to fail elsewhere merely moves the problem.

  26. "Timeout guard" daemon prio=10 tid=0x00002aaacd5e5000 nid=0x3aac runnable [0x00002aaac388f000] java.lang.Thread.State: RUNNABLE
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    - locked <0x000000055c7e8bd8> (a java.net.SocksSocketImpl)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
    at java.net.Socket.connect(Socket.java:579)
    at java.net.Socket.connect(Socket.java:528)
    at java.net.Socket.<init>(Socket.java:425)
    at java.net.Socket.<init>(Socket.java:280)
    at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
    at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$1.doit(ControllerThreadSocketFactory.java:91)
    at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$SocketTask.run(ControllerThreadSocketFactory.java:158)
    at java.lang.Thread.run(Thread.java:722)
    [Sat Jun 30 04:01:37 2012] [error] proxy: HTTP: disabled connection for (127.0.0.1)
    [Chart: median latency spike; >80% of requests rejected]
    This is an example of what a system looks like when high latency occurs without load shedding and isolation. Backend latency spiked (median from <100ms to >1,000ms; 90th percentile >10,000ms) and saturated all available resources, resulting in the HTTP layer rejecting over 80% of requests.

  27. [Diagram: a user request and Dependencies A through R, each isolated in its own bulkhead]
    Bulkheading is an approach to isolating failure and latency. It can be used to compartmentalize each system relationship so that its failure impact is limited and controllable.

  28. [Diagram: a user request and Dependencies A through R, each isolated in its own bulkhead]
    Bulkheading is an approach to isolating failure and latency. It can be used to compartmentalize each system relationship so that its failure impact is limited and controllable.

  29. [Diagram: a failing dependency's response intercepted at its bulkhead]
    Responses can be intercepted and replaced with fallbacks.

  30. [Diagram: a fallback response returned in place of the failing dependency]
    A user request can continue in a degraded state with a fallback response from the failing dependency.

  31. Logic - validation, decoration, object model, caching,
    metrics, logging, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Serialization - URL and/or body generation
    Logic - argument validation, caches, metrics, logging,
    multivariate testing, routing, etc
    A bulkhead wraps the entire client behavior, not just the network portion.

  32. Tryable Semaphore
    Rejected
    Permitted
    Logic - validation, decoration, object model, caching,
    metrics, logging, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Serialization - URL and/or body generation
    Logic - argument validation, caches, metrics, logging,
    multivariate testing, routing, etc
    An effective form of bulkheading is a tryable semaphore that restricts concurrent execution. Read more at https://github.com/Netflix/Hystrix/wiki/How-it-Works#semaphores
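    The tryable-semaphore idea can be sketched with plain java.util.concurrent (an illustration of the pattern only, not Hystrix's internals; the SemaphoreBulkhead class and its limit are hypothetical):

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Bulkhead permitting at most N concurrent executions; callers over the
// limit are rejected immediately (tryAcquire) instead of queuing.
public class SemaphoreBulkhead {
    private final Semaphore permits;

    public SemaphoreBulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    public <T> T execute(Supplier<T> work, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // rejected: shed load rather than block
        }
        try {
            return work.get();
        } finally {
            permits.release();
        }
    }
}
```

    Because tryAcquire never blocks, the cost is a single atomic counter operation, but there is no way to time out work that is already running.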

  33. Thread-pool
    Rejected
    Permitted
    Logic - validation, decoration, object model, caching,
    metrics, logging, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Serialization - URL and/or body generation
    Logic - argument validation, caches, metrics, logging,
    multivariate testing, routing, etc
    Timeout
    A thread-pool also limits concurrent execution while offering the ability to time out and walk away from a latent thread. Read more at https://github.com/Netflix/Hystrix/wiki/How-it-Works#threads--thread-pools
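    The thread-pool variant can likewise be sketched with a bounded pool and a bounded wait on the Future, so the caller can give up on a latent execution (again a sketch of the pattern, not Hystrix's implementation; the ThreadPoolBulkhead class and its sizes are illustrative):

```java
import java.util.concurrent.*;

// Thread-pool bulkhead: concurrency is capped by the pool, excess
// submissions are rejected, and the caller bounds its wait on the
// Future so it can walk away from a latent execution.
public class ThreadPoolBulkhead {
    private final ExecutorService pool = new ThreadPoolExecutor(
            0, 10, 1L, TimeUnit.SECONDS,
            new SynchronousQueue<>(),              // no queuing, as with Hystrix's default
            new ThreadPoolExecutor.AbortPolicy()); // reject when saturated

    public <T> T execute(Callable<T> work, T fallback, long timeoutMs) {
        Future<T> future;
        try {
            future = pool.submit(work);
        } catch (RejectedExecutionException saturated) {
            return fallback;                       // all threads busy: shed load
        }
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            future.cancel(true);                   // give up and move on
            return fallback;
        }
    }
}
```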

  34. Netflix uses a combination of aggressive network timeouts, tryable semaphores and thread pools to isolate dependencies and limit the impact of both failure and latency.

  35. Tryable semaphores for “trusted” clients and fallbacks
    Separate threads for “untrusted” clients
    Aggressive timeouts on threads and network calls
    to “give up and move on”
    Circuit breakers as the “release valve”

  36. [Flow chart: Hystrix command execution, from construction through .execute()/.queue()/.observe(), circuit/rate-limit check, run(), the circuit-health feedback loop and getFallback()]
    Hystrix execution flow chart. Read more at https://github.com/Netflix/Hystrix/wiki/How-it-Works#flow-chart
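    The flow the chart describes (check the circuit, run the work, feed results back, route every failure through the fallback) can be condensed into a few lines. This is a simplified sketch, not the real Hystrix code; MiniCommand and its CircuitBreaker interface are hypothetical:

```java
import java.util.concurrent.Callable;
import java.util.function.Supplier;

// Simplified sketch of the Hystrix execution flow.
public class MiniCommand<T> {
    interface CircuitBreaker {
        boolean allowRequest();
        void recordSuccess();
        void recordFailure();
    }

    private final Callable<T> run;
    private final Supplier<T> fallback;
    private final CircuitBreaker circuit;

    MiniCommand(Callable<T> run, Supplier<T> fallback, CircuitBreaker circuit) {
        this.run = run;
        this.fallback = fallback;
        this.circuit = circuit;
    }

    public T execute() {
        if (!circuit.allowRequest()) {
            return fallback.get();   // short-circuited: skip run() entirely
        }
        try {
            T result = run.call();
            circuit.recordSuccess(); // feedback loop: successes close the circuit
            return result;
        } catch (Exception e) {
            circuit.recordFailure(); // feedback loop: failures may open the circuit
            return fallback.get();   // every failure path goes through the fallback
        }
    }
}
```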

  37. [Flow chart: Hystrix command execution]
    Execution can be synchronous or asynchronous (via a Future or Observable).

  38. [Flow chart: Hystrix command execution]
    Circuit state is queried before execution to determine whether the command is short-circuited or throttled and should be rejected.

  39. [Flow chart: Hystrix command execution]
    If not rejected, execution proceeds to the run() method, which performs the underlying work.

  40. [Flow chart: Hystrix command execution]
    Successful responses return.

  41. [Flow chart: Hystrix command execution]
    All requests, successful and failed, contribute to a feedback loop used to make decisions and publish metrics.

  42. [Flow chart: Hystrix command execution]
    All failure states are routed through the same path.

  43. [Flow chart: Hystrix command execution]
    Every failure is given the opportunity to retrieve a fallback, which can end in one of three outcomes: no fallback implemented, a successful fallback, or a failed fallback.

  44. [Flow chart: Hystrix command execution]
    Hystrix execution flow chart. Read more at https://github.com/Netflix/Hystrix/wiki/How-it-Works#flow-chart

  45. (If using LinkedBlockingQueue instead of SynchronousQueue)

  46. 30 rps x 0.2 seconds = 6 + breathing room = 10 threads
    If LinkedBlockingQueue is used (default uses SynchronousQueue)
    Thread-pool Queue size: 5-10 (0 doesn't work but get close to it)
    Thread-pool Size + Queue Size
    Queuing is Not Free
    If LinkedBlockingQueue is used instead of the default SynchronousQueue, requests in the queue block user threads and thus must be counted as part of the resources allocated to a dependency.
    Setting a queue size of 100 is equivalent to saying 100 incoming requests may block while waiting for this dependency. There is typically no good reason for a queue size higher than 5-10.
    Bursting should be handled through batching, and throughput should be accommodated by a large enough thread pool. It is better to increase the thread-pool size rather than the queue, as commands executing in the thread pool make forward progress whereas items in the queue do not.
    See https://github.com/Netflix/Hystrix/wiki/Configuration#maxqueuesize for more information.
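    The sizing rule from the slide (peak requests per second × p99 latency in seconds, plus breathing room) is easy to write down; poolSize here is a hypothetical helper, not a Hystrix API:

```java
public class PoolSizing {
    // Concurrent executions needed ≈ peak requests/sec * p99 latency (seconds),
    // plus breathing room, per the rule on the slide.
    static int poolSize(int peakRps, double p99LatencySeconds, int headroom) {
        return (int) Math.ceil(peakRps * p99LatencySeconds) + headroom;
    }

    public static void main(String[] args) {
        // 30 rps at a 0.2s p99 -> 6 concurrent, + 4 breathing room = 10 threads
        System.out.println(poolSize(30, 0.2, 4));
    }
}
```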

  47. Cost of Thread @ ~60rps
    mean - median - 90th - 99th (time in ms)
    Time for thread to execute Time user thread waited
    The Netflix API has ~40 thread pools with 5-20 threads in each. A common question and concern is what impact this has on performance.
    Here is a sample of a dependency circuit for 24 hours from the Netflix API production cluster with a rate of 60rps per server.
    Each execution occurs in a separate thread with mean, median, 90th and 99th percentile latencies shown in the first 4 legend values. The second group of 4 values is the user thread waiting on the dependency thread and shows the total time including
    queuing, scheduling, execution and waiting for the return value from the Future.
    The calling thread median, 90th and 99th percentiles are the last 3 legend values.
    This example was chosen since it is relatively high volume and low latency so the cost of a separate thread is potentially more of a concern than if the backend network latency was 100ms or higher.

  48. Cost: 0ms Time for thread to execute Time user thread waited
    Cost of Thread @ ~60rps
    mean - median - 90th - 99th (time in ms)
    At the median (and lower) there is no cost to having a separate thread.

  49. Cost: 3ms Time for thread to execute Time user thread waited
    Cost of Thread @ ~60rps
    mean - median - 90th - 99th (time in ms)
    At the 90th percentile there is a cost of 3ms for having a separate thread.

  50. Cost: 9ms Time for thread to execute Time user thread waited
    Cost of Thread @ ~60rps
    mean - median - 90th - 99th (time in ms)
    At the 99th percentile there is a cost of 9ms for having a separate thread. Note however that the increase in cost is far smaller than the increase in execution time of the separate thread which jumped from 2 to 28 whereas the cost jumped from 0 to 9.
    This overhead at the 90th percentile and higher for circuits such as these has been deemed acceptable for the benefits of resilience achieved.
    For circuits that wrap very low latency requests (such as those primarily hitting in-memory caches) the overhead can be too high and in those cases we choose to use tryable semaphores which do not allow for timeouts but provide most of the resilience
    benefits without the overhead. The overhead in general though is small enough that we prefer the isolation benefits of a separate thread.

  51. Cost of Thread @ ~75rps
    mean - median - 90th - 99th (time in ms)
    Time for thread to execute Time user thread waited
    This is a second sample of a dependency circuit for 24 hours from the Netflix API production cluster with a rate of 75rps per server.
    As with the first example this was chosen since it is relatively high volume and low latency so the cost of a separate thread is potentially more of a concern than if the backend network latency was 100ms or higher.
    Each execution occurs in a separate thread with mean, median, 90th and 99th percentile latencies shown in the first 4 legend values. The second group of 4 values is the user thread waiting on the dependency thread and shows the total time including
    queuing, scheduling, execution and waiting for the return value from the Future.
    The calling thread median, 90th and 99th percentiles are the last 3 legend values.

  52. Cost: 0ms Time for thread to execute Time user thread waited
    Cost of Thread @ ~75rps
    mean - median - 90th - 99th (time in ms)
    At the median (and lower) there is no cost to having a separate thread.

  53. Cost: 2ms Time for thread to execute Time user thread waited
    Cost of Thread @ ~75rps
    mean - median - 90th - 99th (time in ms)
    At the 90th percentile there is a cost of 2ms for having a separate thread.

  54. Cost: 2ms Time for thread to execute Time user thread waited
    Cost of Thread @ ~75rps
    mean - median - 90th - 99th (time in ms)
    At the 99th percentile there is a cost of 2ms for having a separate thread.

  55. Semaphores
    Effectively No Cost
    ~5000rps per instance
    Semaphore isolation, on the other hand, is used for dependencies that are very high-volume in-memory lookups and never result in a synchronous network request. The cost is practically zero (an atomic compare-and-set counter backs the semaphore).

  56. HystrixCommand run()
    public class CommandHelloWorld extends HystrixCommand<String> {
        ...
        protected String run() {
            return "Hello " + name + "!";
        }
    }
    Basic successful execution pattern and sample code.
    Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#wiki-Hello-World

  57. public class CommandHelloWorld extends HystrixCommand<String> {
        ...
        protected String run() {
            return "Hello " + name + "!";
        }
    }
    run() invokes
    “client” Logic
    HystrixCommand run()
    The run() method is where the wrapped logic goes.

  58. HystrixCommand run()
    throw Exception
    Fail Fast
    Failing fast is the default behavior if no fallback is implemented. Even without a fallback this is useful, as it prevents resource saturation beyond the bulkhead so the rest of the application can continue functioning, and it enables rapid recovery once the underlying problem is resolved.
    Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#fail-fast

  59. HystrixCommand run()
    getFallback()
    return null;
    return new Option();
    return Collections.emptyList();
    return Collections.emptyMap();
    Fail Silent
    Silent failure is an approach for removing non-essential functionality from the user experience by returning a value that equates to “no data”, “not available” or “don’t display”.
    Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#fail-silent

  60. HystrixCommand run()
    getFallback()
    return true;
    return DEFAULT_OBJECT;
    Static Fallback
    Static fallbacks can be used when default data or behavior can be returned to the user.
    Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#fallback-static

  61. HystrixCommand run()
    getFallback()
    return new UserAccount(customerId, "Unknown Name",
                countryCodeFromGeoLookup, true, true, false);
    return new VideoBookmark(movieId, 0);
    Stubbed Fallback
    Stubbed fallbacks are an extension of static fallbacks for when some data is available (such as from request arguments, authentication tokens or other functioning system calls) and is combined with default values for data that cannot be retrieved.
    Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#fallback-stubbed

  62. HystrixCommand run()
    getFallback()
    public class CommandHelloWorld extends HystrixCommand<String> {
        ...
        protected String run() {
            return "Hello " + name + "!";
        }
        protected String getFallback() {
            return "Hello Failure " + name + "!";
        }
    }
    Stubbed Fallback

  63. HystrixCommand run()
    getFallback()
    public class CommandHelloWorld extends HystrixCommand<String> {
        ...
        protected String run() {
            return "Hello " + name + "!";
        }
        protected String getFallback() {
            return "Hello Failure " + name + "!";
        }
    }
    Stubbed Fallback
    The getFallback() method is executed whenever failure occurs (after run() is invoked, or on rejection without run() ever being invoked) to provide an opportunity to fall back.

  64. HystrixCommand run()
    getFallback() HystrixCommand
    run()
    Fallback via network
    Fallback via network is a common approach for falling back to a stale cache (such as a memcache server) or less personalized value when not able to fetch from the primary source.
    Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#fallback-cache-via-network

  65. HystrixCommand run()
    getFallback() HystrixCommand
    run()
    getFallback()
    Fallback via network then Local
    When the fallback performs a network call it's preferable for it to also have a fallback that does not go over the network; otherwise, if both primary and secondary systems fail, it will fail by throwing an exception (similar to fail fast, except after two fallback attempts).
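The fallback-via-network-then-local pattern can be sketched in plain Java without Hystrix. All names here (fetchPrimary, fetchStaleCache, LOCAL_DEFAULT) are hypothetical stand-ins, with failures simulated by flags:

```java
// Sketch of "fallback via network, then local" -- illustrative only, not Hystrix.
// Try the primary source first, then a remote stale cache, then a local default.
public class FallbackChain {
    static String fetchPrimary(boolean fail) {
        if (fail) throw new RuntimeException("primary down");
        return "fresh-value";
    }
    static String fetchStaleCache(boolean fail) {
        if (fail) throw new RuntimeException("cache down");
        return "stale-value";
    }
    static final String LOCAL_DEFAULT = "default-value";

    static String resolve(boolean primaryFails, boolean cacheFails) {
        try {
            return fetchPrimary(primaryFails);        // primary (network)
        } catch (RuntimeException e1) {
            try {
                return fetchStaleCache(cacheFails);   // fallback via network
            } catch (RuntimeException e2) {
                return LOCAL_DEFAULT;                 // local fallback, no network
            }
        }
    }
}
```

The key point is that the final fallback never touches the network, so a total outage of both remote systems still degrades to a usable default instead of an exception.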

    View full-size slide

  66. BookmarkClient.getInstance().getBookmark(user, movie);
    A typical blocking client call ...


  67. public class GetBookmarkCommand extends HystrixCommand<Bookmark> {
        private final User user;
        private final Movie movie;
        public GetBookmarkCommand(User user, Movie movie) {
            super(HystrixCommandGroupKey.Factory.asKey("VideoHistory"));
            this.user = user;
            this.movie = movie;
        }
        @Override
        protected Bookmark run() throws Exception {
            return BookmarkClient.getInstance().getBookmark(user, movie);
        }
    }
    BookmarkClient.getInstance().getBookmark(user, movie);
    ... gets wrapped inside a HystrixCommand within the run() method.


  68. public class GetBookmarkCommand extends HystrixCommand<Bookmark> {
        private final User user;
        private final Movie movie;
        public GetBookmarkCommand(User user, Movie movie) {
            super(HystrixCommandGroupKey.Factory.asKey("VideoHistory"));
            this.user = user;
            this.movie = movie;
        }
        @Override
        protected Bookmark run() throws Exception {
            return BookmarkClient.getInstance().getBookmark(user, movie);
        }
    }
    Arguments are accepted by the constructor to make available to the run() method when invoked.


  69. public class GetBookmarkCommand extends HystrixCommand<Bookmark> {
        private final User user;
        private final Movie movie;
        public GetBookmarkCommand(User user, Movie movie) {
            super(HystrixCommandGroupKey.Factory.asKey("VideoHistory"));
            this.user = user;
            this.movie = movie;
        }
        @Override
        protected Bookmark run() throws Exception {
            return BookmarkClient.getInstance().getBookmark(user, movie);
        }
    }
    The minimal configuration of a command is providing a HystrixCommandGroupKey.


  70. public class GetBookmarkCommand extends HystrixCommand<Bookmark> {
        private final User user;
        private final Movie movie;
        public GetBookmarkCommand(User user, Movie movie) {
            super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("VideoHistory"))
                    .andCommandKey(HystrixCommandKey.Factory.asKey("GetBookmark"))
                    .andThreadPoolKey(HystrixThreadPoolKey.Factory.asKey("VideoHistoryRead"))
                    .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                            .withExecutionIsolationThreadTimeoutInMilliseconds(500)));
            this.user = user;
            this.movie = movie;
        }
        @Override
        protected Bookmark run() throws Exception {
            return BookmarkClient.getInstance().getBookmark(user, movie);
        }
    }
    Various other config options are also available ...


  71. public class GetBookmarkCommand extends HystrixCommand<Bookmark> {
        private final User user;
        private final Movie movie;
        public GetBookmarkCommand(User user, Movie movie) {
            super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("VideoHistory"))
                    .andCommandKey(HystrixCommandKey.Factory.asKey("GetBookmark"))
                    .andThreadPoolKey(HystrixThreadPoolKey.Factory.asKey("VideoHistoryRead"))
                    .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                            .withExecutionIsolationThreadTimeoutInMilliseconds(500)));
            this.user = user;
            this.movie = movie;
        }
        @Override
        protected Bookmark run() throws Exception {
            return BookmarkClient.getInstance().getBookmark(user, movie);
        }
    }
    ... the GroupKey ...


  72. public class GetBookmarkCommand extends HystrixCommand<Bookmark> {
        private final User user;
        private final Movie movie;
        public GetBookmarkCommand(User user, Movie movie) {
            super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("VideoHistory"))
                    .andCommandKey(HystrixCommandKey.Factory.asKey("GetBookmark"))
                    .andThreadPoolKey(HystrixThreadPoolKey.Factory.asKey("VideoHistoryRead"))
                    .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                            .withExecutionIsolationThreadTimeoutInMilliseconds(500)));
            this.user = user;
            this.movie = movie;
        }
        @Override
        protected Bookmark run() throws Exception {
            return BookmarkClient.getInstance().getBookmark(user, movie);
        }
    }
    ... the CommandKey (normally defaults to class name) ...


  73. public class GetBookmarkCommand extends HystrixCommand<Bookmark> {
        private final User user;
        private final Movie movie;
        public GetBookmarkCommand(User user, Movie movie) {
            super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("VideoHistory"))
                    .andCommandKey(HystrixCommandKey.Factory.asKey("GetBookmark"))
                    .andThreadPoolKey(HystrixThreadPoolKey.Factory.asKey("VideoHistoryRead"))
                    .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                            .withExecutionIsolationThreadTimeoutInMilliseconds(500)));
            this.user = user;
            this.movie = movie;
        }
        @Override
        protected Bookmark run() throws Exception {
            return BookmarkClient.getInstance().getBookmark(user, movie);
        }
    }
    ... a ThreadPoolKey (normally defaults to GroupKey) ...


  74. public class GetBookmarkCommand extends HystrixCommand<Bookmark> {
        private final User user;
        private final Movie movie;
        public GetBookmarkCommand(User user, Movie movie) {
            super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("VideoHistory"))
                    .andCommandKey(HystrixCommandKey.Factory.asKey("GetBookmark"))
                    .andThreadPoolKey(HystrixThreadPoolKey.Factory.asKey("VideoHistoryRead"))
                    .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                            .withExecutionIsolationThreadTimeoutInMilliseconds(500)));
            this.user = user;
            this.movie = movie;
        }
        @Override
        protected Bookmark run() throws Exception {
            return BookmarkClient.getInstance().getBookmark(user, movie);
        }
    }
    ... and various properties, the most common to change being the timeout (which defaults to 1000ms) ...


  75. public class GetBookmarkCommand extends HystrixCommand<Bookmark> {
        private final User user;
        private final Movie movie;
        public GetBookmarkCommand(User user, Movie movie) {
            super(HystrixCommandGroupKey.Factory.asKey("VideoHistory"));
            this.user = user;
            this.movie = movie;
        }
        @Override
        protected Bookmark run() throws Exception {
            return BookmarkClient.getInstance().getBookmark(user, movie);
        }
        @Override
        protected Bookmark getFallback() {
            return new Bookmark(0);
        }
        @Override
        protected String getCacheKey() {
            return movie.getId() + "_" + user.getId();
        }
    }
    Generally however this is all that is needed.
    More can be read on configuration options at https://github.com/Netflix/Hystrix/wiki/Configuration


  76.     public GetBookmarkCommand(User user, Movie movie) {
            super(HystrixCommandGroupKey.Factory.asKey("VideoHistory"));
            this.user = user;
            this.movie = movie;
        }
        @Override
        protected Bookmark run() throws Exception {
            return BookmarkClient.getInstance().getBookmark(user, movie);
        }
        @Override
        protected Bookmark getFallback() {
            return new Bookmark(0);
        }
        @Override
        protected String getCacheKey() {
            return movie.getId() + "_" + user.getId();
        }
    }
    The getFallback() method can be implemented for providing fallback responses when failure occurs.
    More information is available at https://github.com/Netflix/Hystrix/wiki/How-To-Use#wiki-Fallback


  77.     public GetBookmarkCommand(User user, Movie movie) {
            super(HystrixCommandGroupKey.Factory.asKey("VideoHistory"));
            this.user = user;
            this.movie = movie;
        }
        @Override
        protected Bookmark run() throws Exception {
            return BookmarkClient.getInstance().getBookmark(user, movie);
        }
        @Override
        protected Bookmark getFallback() {
            return new Bookmark(0);
        }
        @Override
        protected String getCacheKey() {
            return movie.getId() + "_" + user.getId();
        }
    }
    The getCacheKey() method allows de-duping calls within a single request context.
    More information is available at https://github.com/Netflix/Hystrix/wiki/How-To-Use#wiki-Caching
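What getCacheKey() enables can be sketched without Hystrix as a request-scoped memo map. RequestCache and its method names here are hypothetical; this only illustrates the de-duping behavior, not Hystrix's actual request-cache implementation:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Request-scoped de-duping sketch: within one request context, the first
// execution for a cache key runs the command; later executions with the
// same key return the cached result without re-running (no second network call).
public class RequestCache {
    private final Map<String, Object> cache = new HashMap<>();
    private int executions = 0;

    @SuppressWarnings("unchecked")
    public <T> T execute(String cacheKey, Supplier<T> command) {
        if (cache.containsKey(cacheKey)) {
            return (T) cache.get(cacheKey);   // de-duped
        }
        executions++;                         // only the first call runs
        T result = command.get();
        cache.put(cacheKey, result);
        return result;
    }

    public int executionCount() { return executions; }
}
```

In Hystrix the cache's lifetime is bounded by a HystrixRequestContext, so the de-duping applies only within a single user request, never across requests.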


  78. public class GetMoviesCommand extends HystrixCommand<Movies> {
        private final User user;
        public GetMoviesCommand(User user) {
            super(HystrixCommandGroupKey.Factory.asKey("MovieListsService"));
            this.user = user;
        }
        @Override
        protected Movies run() throws Exception {
            return MoviesClient.getInstance().getMovies(user);
        }
        @Override
        protected Movies getFallback() {
            return new GetDefaultMoviesCommand(user).execute();
        }
        @Override
        protected String getCacheKey() {
            return String.valueOf(user.getId());
        }
    }
    This example will demonstrate a different fallback strategy ...


  79. public class GetMoviesCommand extends HystrixCommand<Movies> {
        private final User user;
        public GetMoviesCommand(User user) {
            super(HystrixCommandGroupKey.Factory.asKey("MovieListsService"));
            this.user = user;
        }
        @Override
        protected Movies run() throws Exception {
            return MoviesClient.getInstance().getMovies(user);
        }
        @Override
        protected Movies getFallback() {
            return new GetDefaultMoviesCommand(user).execute();
        }
        @Override
        protected String getCacheKey() {
            return String.valueOf(user.getId());
        }
    }
    }
    ... it uses a network client that can fail ...


  80. public class GetMoviesCommand extends HystrixCommand<Movies> {
        private final User user;
        public GetMoviesCommand(User user) {
            super(HystrixCommandGroupKey.Factory.asKey("MovieListsService"));
            this.user = user;
        }
        @Override
        protected Movies run() throws Exception {
            return MoviesClient.getInstance().getMovies(user);
        }
        @Override
        protected Movies getFallback() {
            return new GetDefaultMoviesCommand(user).execute();
        }
        @Override
        protected String getCacheKey() {
            return String.valueOf(user.getId());
        }
    }
    }
    ... and a cache key for de-duping ...


  81. public class GetMoviesCommand extends HystrixCommand<Movies> {
        private final User user;
        public GetMoviesCommand(User user) {
            super(HystrixCommandGroupKey.Factory.asKey("MovieListsService"));
            this.user = user;
        }
        @Override
        protected Movies run() throws Exception {
            return MoviesClient.getInstance().getMovies(user);
        }
        @Override
        protected Movies getFallback() {
            return new GetDefaultMoviesCommand(user).execute();
        }
        @Override
        protected String getCacheKey() {
            return String.valueOf(user.getId());
        }
    }
    }
    ... but the fallback executes another HystrixCommand since the fallback is retrieved from over the network as well.


  82. public class GetDefaultMoviesCommand extends HystrixCommand<Movies> {
        private final User user;
        public GetDefaultMoviesCommand(User user) {
            this.user = user;
        }
        @Override
        protected Movies run() throws Exception {
            return MoviesClient.getInstance().getDefaultMovies(user);
        }
        protected Movies getFallback() {
            return MoviesClient.getInstance().getDefaultMoviesSnapshot();
        }
    }
    The fallback HystrixCommand will execute the network call within its run() method ...


  83. public class GetDefaultMoviesCommand extends HystrixCommand<Movies> {
        private final User user;
        public GetDefaultMoviesCommand(User user) {
            this.user = user;
        }
        @Override
        protected Movies run() throws Exception {
            return MoviesClient.getInstance().getDefaultMovies(user);
        }
        protected Movies getFallback() {
            return MoviesClient.getInstance().getDefaultMoviesSnapshot();
        }
    }
    }
    ... and if it also fails it too has a fallback, but this one executes locally.
    Thus, we first try to get personalized movies for a user, then a generic fallback from a remote cache, and if both of those fail, a global fallback from a local cache.


  84. public class GetUserCommand extends HystrixCommand<User> {
        private final int id;
        public GetUserCommand(int id) {
            super(HystrixCommandGroupKey.Factory.asKey("User"));
            this.id = id;
        }
        @Override
        protected User run() throws Exception {
            return UserClient.getInstance().getUser(id);
        }
    }
    This use case of fetching user data provides another variant of fallback behavior.


  85. public class GetUserCommand extends HystrixCommand<User> {
        private final int id;
        public GetUserCommand(int id) {
            super(HystrixCommandGroupKey.Factory.asKey("User"));
            this.id = id;
        }
        @Override
        protected User run() throws Exception {
            return UserClient.getInstance().getUser(id);
        }
    }
    At first it appears this does not have a valid fallback and will only be able to fail fast. What could we return when we can't fetch a user that would still allow things to work?


  86. public class GetUserCommand extends HystrixCommand<User> {
        private final int id;
        private final Cookie[] requestCookies;
        public GetUserCommand(int id, Cookie[] requestCookies)

            ... for brevity on slide ...

        @Override
        protected User getFallback() {
            if (... cookies valid ...) {
                User stubbedUser = new User(id);
                // logic for retrieving defaults from cookies
                return stubbedUser;
            } else {
                throw new RuntimeException("Unable to retrieve user from service or cookies");
            }
        }
    }
    To do so we change the input arguments to accept the stateful cookies sent with every authenticated request, and use those to retrieve key user data from which we can build a stubbed response.


  87. public class GetUserCommand extends HystrixCommand<User> {
        private final int id;
        private final Cookie[] requestCookies;
        public GetUserCommand(int id, Cookie[] requestCookies)

            ... for brevity on slide ...

        @Override
        protected User getFallback() {
            if (... cookies valid ...) {
                User stubbedUser = new User(id);
                // logic for retrieving defaults from cookies
                return stubbedUser;
            } else {
                throw new RuntimeException("Unable to retrieve user from service or cookies");
            }
        }
    }
    If the cookies have the necessary data (cookies for all authenticated users should) then we can fallback to a stubbed response.


  88. public class GetUserCommand extends HystrixCommand<User> {
        private final int id;
        private final Cookie[] requestCookies;
        public GetUserCommand(int id, Cookie[] requestCookies)

            ... for brevity on slide ...

        @Override
        protected User getFallback() {
            if (... cookies valid ...) {
                User stubbedUser = new User(id);
                // logic for retrieving defaults from cookies
                return stubbedUser;
            } else {
                throw new RuntimeException("Unable to retrieve user from service or cookies");
            }
        }
    }
    Most of the fields of the User object will be defaults and may affect user experience in certain ways, but the critical pieces are available from the cookies. This allows the system to degrade rather than failing.
    Since the User object is critical to almost every incoming request this is essential to application resiliency. We favor degrading the experience over outright failure.


  89. User user = new GetUserCommand(id, requestCookies).execute()
    Future<User> user = new GetUserCommand(id, requestCookies).queue()
    Observable<User> user = new GetUserCommand(id, requestCookies).observe()
    Once a command exists it can be executed in 3 ways: execute(), queue() and observe().
    More information can be found at https://github.com/Netflix/Hystrix/wiki/How-To-Use
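The three invocation styles differ only in how the result is delivered. A rough plain-Java analogy follows; Hystrix does considerably more per call (isolation, circuit breaking, metrics), and the class and method names here are hypothetical:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

public class InvocationStyles {
    static final ExecutorService pool = Executors.newFixedThreadPool(2);

    static String run() { return "Hello World!"; }

    // execute(): synchronous -- block until the result is ready
    static String execute() throws Exception {
        return queue().get();
    }

    // queue(): asynchronous -- return a Future immediately
    static Future<String> queue() {
        return pool.submit(InvocationStyles::run);
    }

    // observe(): reactive -- push the result to a callback when ready
    static void observe(Consumer<String> onNext) {
        pool.submit(() -> onNext.accept(run()));
    }
}
```

In Hystrix, execute() is in fact implemented as queue().get(), and observe() returns an RxJava Observable rather than taking a raw callback.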


  90. So now what?
    Code is only part of the solution. Operations is the other critical half.


  91. Historical metrics representing all possible states of success, failure, decision making and performance related to each bulkhead.


  92. >40,000 success
    1 timeout
    4 rejected
    Looking closely at high-volume systems, it is common to find constant failure.


  93. The rejection spikes on the left correlate with and do in fact represent the cause of the fallback spikes on the right.


  94. Latency percentiles are captured at every 5th percentile and a few extra such as 99.5th (though this graph is only showing 50th/99th/99.5th).


  95. >40,000 success
    0.10 exceptions
    Exceptions Thrown helps to identify whether a failure state is being handled successfully by a fallback. In this case we are seeing < 0.1 exceptions per second being thrown, but the previous set of metrics showed 5-40 fallbacks occurring each second; thus the fallbacks are doing their job, though we may want to look for the very small number of edge cases where fallbacks fail, resulting in an exception.


  96. We found that historical metrics with 1 datapoint per minute and 1-2 minutes latency were not sufficient during operational events such as deployments, rollbacks, production alerts and configuration changes so we
    built near realtime monitoring and data visualizations to help us consume large amounts of data easily. This dashboard is the aggregate view of a production cluster with ~1-2 second latency from the time an event
    occurs to being rendered in the browser.
    Read more at https://github.com/Netflix/Hystrix/wiki/Dashboard


  97. Each bulkhead is represented with a visualization like this.


  98. circle color and size represent
    health and traffic volume


  99. 2 minutes of request rate to
    show relative changes in traffic
    circle color and size represent
    health and traffic volume


  100. 2 minutes of request rate to
    show relative changes in traffic
    circle color and size represent
    health and traffic volume
    hosts reporting from cluster


  101. last minute latency percentiles
    2 minutes of request rate to
    show relative changes in traffic
    circle color and size represent
    health and traffic volume
    hosts reporting from cluster


  102. last minute latency percentiles
    2 minutes of request rate to
    show relative changes in traffic
    circle color and size represent
    health and traffic volume
    hosts reporting from cluster
    Circuit-breaker
    status


  103. last minute latency percentiles
    Request rate
    2 minutes of request rate to
    show relative changes in traffic
    circle color and size represent
    health and traffic volume
    hosts reporting from cluster
    Circuit-breaker
    status


  104. Error percentage of
    last 10 seconds
    last minute latency percentiles
    Request rate
    2 minutes of request rate to
    show relative changes in traffic
    circle color and size represent
    health and traffic volume
    hosts reporting from cluster
    Circuit-breaker
    status


  105. last minute latency percentiles
    Request rate
    2 minutes of request rate to
    show relative changes in traffic
    circle color and size represent
    health and traffic volume
    hosts reporting from cluster
    Error percentage of
    last 10 seconds
    Rolling 10 second counters
    with 1 second granularity
    Failures/Exceptions
    Thread-pool Rejections
    Thread timeouts
    Successes
    Short-circuited (rejected)
    Circuit-breaker
    status


  106. [Figure: 10 1-second "buckets", each tracking Success / Timeout / Failure / Rejection counts]
    Success     23 47 26 48 38 42 59 46 39 12
    Timeout      5  8  4  9  4  6 11  5  3  1
    Failure      2  1  0  4  2  7  5  2  5  0
    Rejection    0  0  0  0  0  0  1  0  0  0
    (A later snapshot shows the last bucket completed at 45/6/2/0 and a new bucket 1/0/0/0 started.)
    On "getLatestBucket" if the 1-second window is passed a new bucket is created, the rest slid over and the oldest one dropped.
    Low Latency Granular Metrics
    Rolling 10 second window
    1 second resolution
    All metrics are captured in both absolute cumulative counters and rolling windows with 1 second granularity.
    Read more at https://github.com/Netflix/Hystrix/wiki/Metrics-and-Monitoring
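A minimal sketch of the rolling-bucket idea (a deliberate simplification of Hystrix's HystrixRollingNumber; the class and method names here are hypothetical):

```java
// Rolling counter sketch: N one-second buckets in a ring. increment() targets
// the bucket for "now"; sum() adds up only buckets still inside the window.
// Buckets older than the window are zeroed lazily as the ring wraps around.
public class RollingCounter {
    private final long[] counts;
    private final long[] bucketStart;   // epoch-second each bucket represents
    private final int buckets;

    public RollingCounter(int buckets) {
        this.buckets = buckets;
        this.counts = new long[buckets];
        this.bucketStart = new long[buckets];
    }

    public synchronized void increment(long nowSeconds) {
        int i = (int) (nowSeconds % buckets);
        if (bucketStart[i] != nowSeconds) {   // stale bucket: reuse it
            bucketStart[i] = nowSeconds;
            counts[i] = 0;
        }
        counts[i]++;
    }

    public synchronized long sum(long nowSeconds) {
        long total = 0;
        for (int i = 0; i < buckets; i++) {
            if (nowSeconds - bucketStart[i] < buckets) total += counts[i];
        }
        return total;
    }
}
```

The ring buffer makes both operations O(1) memory regardless of traffic volume, which is why per-second granularity stays cheap even at tens of thousands of events per second.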


  107. [Figure: 10 1-second "buckets", each tracking Success / Timeout / Failure / Rejection counts]
    Success     23 47 26 48 38 42 59 46 39 12
    Timeout      5  8  4  9  4  6 11  5  3  1
    Failure      2  1  0  4  2  7  5  2  5  0
    Rejection    0  0  0  0  0  0  1  0  0  0
    On "getLatestBucket" if the 1-second window is passed a new bucket is created, the rest slid over and the oldest one dropped.
    Low Latency Granular Metrics
    Rolling 10 second window
    1 second resolution
    The rolling counters default to 10 second windows with 1 second buckets.


  108. [Figure: 10 1-second "buckets", each tracking Success / Timeout / Failure / Rejection counts]
    Success     23 47 26 48 38 42 59 46 39 12
    Timeout      5  8  4  9  4  6 11  5  3  1
    Failure      2  1  0  4  2  7  5  2  5  0
    Rejection    0  0  0  0  0  0  1  0  0  0
    On "getLatestBucket" if the 1-second window is passed a new bucket is created, the rest slid over and the oldest one dropped.
    Low Latency Granular Metrics
    Rolling 10 second window
    1 second resolution
    As each second passes the oldest bucket is dropped (to soon be overwritten since it is a ring buffer)...


  109. [Figure: 10 1-second "buckets", each tracking Success / Timeout / Failure / Rejection counts]
    Success     23 47 26 48 38 42 59 46 39 45  1
    Timeout      5  8  4  9  4  6 11  5  3  6  0
    Failure      2  1  0  4  2  7  5  2  5  2  0
    Rejection    0  0  0  0  0  0  1  0  0  0  0
    On "getLatestBucket" if the 1-second window is passed a new bucket is created, the rest slid over and the oldest one dropped.
    Low Latency Granular Metrics
    Rolling 10 second window
    1 second resolution
    ... and a new bucket is created.


  110. ~1 second latency aggregated stream
    Turbine
    stream aggregator
    Low Latency Granular Metrics
    Metrics are subscribed to from all servers in a cluster and aggregated with ~1 second latency from event to aggregation. This stream can then be consumed by the dashboard, an alerting system or anything else
    wanting low latency metrics.


  111. propagate across
    cluster in seconds
    Low Latency Configuration Changes
    The low latency loop is completed with the ability to propagate configuration changes across a cluster in seconds. This enables rapid iteration: seeing behavior in production, pushing config changes and then watching them take effect immediately as they roll across a cluster of servers. Low latency operations require both visibility into metrics and the ability to effect change within similar latency windows.


  112. Auditing via Simulation
    Simulating failure states in production has proven an effective approach for auditing our applications to either prove resilience or find weakness.


  113. Auditing via Simulation
    In this example failure was injected into a single dependency which caused the bulkhead to return fallbacks and trip all circuits since the failure rate was almost 100%, well above the threshold for circuits to trip.


  114. Auditing via Simulation
    When the 'TitleStatesGetAllRentStates' bulkhead began returning fallbacks the 'atv_mdp' endpoint shot to the top of the dashboard with a 99% error rate. There was a bug in how the fallback was handled, so we immediately stopped the simulation, fixed the bug over the coming days and repeated the simulation to prove it was fixed and that the rest of the system remained resilient. This was caught in a controlled simulation where we could catch it and act in less than a minute, rather than in a true production incident where we likely wouldn't have been able to do anything.


  115. This shows another simulation where latency was injected.
    Read more at http://techblog.netflix.com/2011/07/netflix-simian-army.html


  116. 125 → 1500+
    1000+ ms of latency was injected into a dependency that normally completes with a median latency of ~15-20ms and 99.5th of 120-130ms.


  117. ~5000
    The latency spike caused timeouts, short-circuiting and rejection, and up to ~5000 fallbacks per second as a result of these various failure states.


  118. ~1
    While delivering ~5000 fallbacks per second, the exceptions thrown didn't go beyond ~1 per second, demonstrating that user impact was negligible (as perceived from the server; client behavior must also be validated during a simulation but is not part of this dataset).


  119. Zuul Routing Layer
    Canary vs Baseline
    Squeeze
    Production
    "Coalmine"
    Other approaches to auditing take advantage of our routing layer to route traffic to different clusters.
    Read more at http://techblog.netflix.com/2013/06/announcing-zuul-edge-service-in-cloud.html


  120. Zuul Routing Layer
    Canary vs Baseline
    Squeeze
    Production
    "Coalmine"
    Every code deployment is preceded by a canary test where a small number of instances are launched to take production traffic, half with new code (canary), half with existing production code (baseline) and compared
    for differences. Thousands of system, application and bulkhead metrics are compared to make a decision on whether the new code should continue to full deployment. Many issues are found via production canaries
    that are not found in dev and test environments.
    Read more at http://techblog.netflix.com/2013/08/deploying-netflix-api.html


  121. Zuul Routing Layer
    Canary vs Baseline
    Squeeze
    Production
    "Coalmine"
    New instances are also put through a squeeze test before full rollout to find the point at which the performance degrades. This is used to identify performance and throughput changes of each deployment.


  122. Zuul Routing Layer
    Canary vs Baseline
    Squeeze
    Production
    "Coalmine"
    Long-term canaries are kept in a cluster we call "coalmine" with agents intercepting all network traffic. These run the same code as the production cluster and are used to identify network traffic that is not wrapped in a bulkhead and starts happening due to unknown code paths being enabled via configuration, A/B tests and other changes.
    Read more at https://github.com/Netflix/Hystrix/tree/master/hystrix-contrib/hystrix-network-auditor-agent


  123. User Request
    Dependency A
    Dependency D
    Dependency G
    Dependency J
    Dependency M
    Dependency P
    Dependency B
    Dependency E
    Dependency H
    Dependency K
    Dependency N
    Dependency Q
    Dependency C
    Dependency F
    Dependency I
    Dependency L
    Dependency O
    Dependency R
    System
    Relationship
    Over
    Network
    without
    Bulkhead
    For example, a network relationship could exist in production code but not be triggered in dev, test or production canaries, and then be enabled via a condition that changes days after deployment to production. This can be a vulnerability, and we use the "coalmine" to identify these situations and inform decisions.


  124. Zuul Routing Layer
    Canary vs Baseline
    Squeeze
    Production
    "Coalmine"


  125. Failure inevitably happens ...
    A good read on complex systems is Drift into Failure by Sidney Dekker: http://www.amazon.com/Drift-into-Failure-ebook/dp/B009KOKXKY/ref=tmm_kin_title_0


  126. Cluster adapts
    Failure Isolated
    When the backing system for the ‘SocialGetTitleContext’ bulkhead became latent the impact was contained and fallbacks returned.


  127. Cluster adapts
    Failure Isolated
    When the backing system for the ‘SocialGetTitleContext’ bulkhead became latent the impact was contained and fallbacks returned.

  128. Cluster adapts
    Failure Isolated
Since the failure rate was above the threshold, circuit breakers began tripping. As a portion of the cluster tripped circuits, it released pressure on the underlying system so it could successfully perform some work.

  129. Cluster adapts
    Failure Isolated
    The cluster naturally adapts as bulkheads constrain throughput and circuits open and close in a rolling manner across the instances in the cluster.
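The circuit-breaker behavior described above can be sketched in a few lines of Java. This is a minimal illustration, not Hystrix's actual implementation (which uses rolling-window metrics and more careful state transitions): the breaker opens when the observed failure percentage crosses a threshold, then permits a trial request after a sleep window and closes again on success.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal circuit-breaker sketch. Hystrix's real breaker tracks a rolling
// statistical window; this version trips open when the cumulative failure
// percentage crosses a threshold, then allows a trial call after a sleep
// window and resets on a successful trial.
public class SimpleCircuitBreaker {
    private final int errorThresholdPercent;
    private final long sleepWindowMs;
    private final AtomicInteger requests = new AtomicInteger();
    private final AtomicInteger failures = new AtomicInteger();
    private volatile boolean open = false;
    private volatile long openedAt = 0;

    public SimpleCircuitBreaker(int errorThresholdPercent, long sleepWindowMs) {
        this.errorThresholdPercent = errorThresholdPercent;
        this.sleepWindowMs = sleepWindowMs;
    }

    public boolean allowRequest() {
        if (!open) return true;
        // half-open: let a trial request through after the sleep window
        return System.currentTimeMillis() - openedAt > sleepWindowMs;
    }

    public void markSuccess() {
        if (open) {           // successful trial: close and reset counters
            open = false;
            requests.set(0);
            failures.set(0);
        } else {
            requests.incrementAndGet();
        }
    }

    public void markFailure() {
        int total = requests.incrementAndGet();
        int failed = failures.incrementAndGet();
        if (failed * 100 / total >= errorThresholdPercent) {
            open = true;
            openedAt = System.currentTimeMillis();
        }
    }

    public boolean isOpen() { return open; }
}
```

Because each instance trips and recovers on its own local metrics, circuits open and close in the rolling, uncoordinated manner seen across the cluster.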

  130. In this example the ‘CinematchGetPredictions’ functionality began failing.

  131. The red metric shows it was exceptions thrown by the client, not latency or concurrency constraints.

132. The 20% error rate from the realtime visualization is also seen in the historical metrics, with an accompanying drop in successes.

133. Matching the increase in failures is an increase in fallbacks, with one delivered for every failure.

  134. Distributed Systems are Complex
Distributed applications need to be treated as complex systems, and we must recognize that no machine or human can comprehend all of the state or interactions.

  135. Isolate Relationships
One way of dealing with a complex system is to isolate its relationships so each can fail independently of the others. Bulkheads have proven to be an effective approach for isolating and managing failure.
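The bulkhead pattern can be sketched with a bounded thread pool per dependency. This is a bare-bones stdlib illustration, not Hystrix's implementation (HystrixCommand wraps this pattern together with circuit breakers, timeouts, and metrics): when the pool is saturated or the call is latent, the caller fails fast to a fallback instead of queueing indefinitely.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Bulkhead sketch: give each dependency its own small, bounded thread pool
// so a slow or failing dependency can only exhaust its own pool, never the
// whole application's resources.
public class Bulkhead<T> {
    private final ExecutorService pool;
    private final long timeoutMs;

    public Bulkhead(int threads, int queueSize, long timeoutMs) {
        this.pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(queueSize));
        this.timeoutMs = timeoutMs;
    }

    // Run the dependency call inside the bulkhead; on rejection, timeout,
    // or failure, return the fallback instead of propagating the problem.
    public T execute(Callable<T> work, T fallback) {
        try {
            return pool.submit(work).get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (Exception e) {
            return fallback;
        }
    }

    public void shutdown() {
        pool.shutdownNow();
    }
}
```

With one such pool per dependency, a latent backend like ‘SocialGetTitleContext’ fills only its own bulkhead while every other relationship continues to operate normally.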

  136. Auditing & Operations are Essential
Resilient code is only part of the solution. Systems drift, carry latent bugs, and develop failure states that emerge from the complex interactions of their many relationships. Constant auditing can be part of the solution. Human operators must handle everything the system can't, which by definition is unknown, so the system must strive to expose clear insights and effective tooling that let humans make informed decisions.

  137. Hystrix
    https://github.com/Netflix/Hystrix
    Application Resilience in a Service-oriented Architecture
    http://programming.oreilly.com/2013/06/application-resilience-in-a-service-oriented-architecture.html
    Fault Tolerance in a High Volume, Distributed System
    http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
    Making the Netflix API More Resilient
    http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html
    Ben Christensen
    @benjchristensen
    http://www.linkedin.com/in/benjchristensen
    jobs.netflix.com
