Reactive Service Levels at React London 2014

How do we support the current always-on systems culture? Is 100% uptime really possible? Hot deployment of software upgrades. Multi-variant testing of new features for measured and informed business reaction. Monitoring and managing reactive systems to ensure they are meeting service levels.

Learn how Netflix achieves scalability, resilience and responsiveness through its cloud architecture and tools such as Ribbon, Eureka, Hystrix, RxJava, Zuul, Scryer and the Simian Army.

Presented at React 2014 in London: http://reactconf.com

Video: http://www.youtube.com/watch?v=Ftkn1OF895E&feature=share&list=PLSD48HvrE7-Z1stQ1vIIBumB0wK0s8llY&index=7

Ben Christensen

April 08, 2014

Transcript

  1. Reactive Service Levels
    Ben Christensen
    Software Engineer – Edge Engineering at Netflix
    @benjchristensen
    http://techblog.netflix.com/
    React London - April 2014

  2. “the explosive growth of software has added
    greatly to systems’ interactive complexity. With
    software, the possible states that a system can
    end up in become mind-boggling.”
    Dekker, Sidney (2012-10-01). Drift into Failure (Kindle Locations 3268-3270). Ashgate Publishing. Kindle Edition.
    – Sidney Dekker, Drift into Failure
    Drift into Failure: http://www.amazon.com/Drift-into-Failure-ebook/dp/B009KOKXKY/ref=tmm_kin_title_0

  3. “We can model and understand in isolation.
    But, when released into competitive, nominally
    regulated societies, their connections proliferate,
    their interactions and interdependencies multiply,
    their complexities mushroom.
    And we are caught short.”
    Dekker, Sidney (2012-10-01). Drift into Failure (Kindle Locations 290-292). Ashgate Publishing. Kindle Edition.
    – Sidney Dekker, Drift into Failure
    Drift into Failure: http://www.amazon.com/Drift-into-Failure-ebook/dp/B009KOKXKY/ref=tmm_kin_title_0

  4. Netflix is a subscription service for movies and TV shows for $7.99 USD/month (about the same converted price in
    each country's local currency).

  5. More than 44 million Subscribers
    in 41 Countries
    Netflix has over 44 million video streaming customers in 41 countries across North & South America, United
    Kingdom, Ireland, Netherlands and the Nordics.

  6. Netflix accounts for 31% of Peak Downstream
    Internet Traffic in North America
    Netflix subscribers are watching
    more than 1 billion hours a month
    Sandvine report available with free account at http://www.sandvine.com/news/global_broadband_trends.asp
    Image from report at https://www.sandvine.com/downloads/general/global-internet-phenomena/2013/2h-2013-global-internet-phenomena-report.pdf

  7. View Slide

  8. Prior to the current globally distributed cloud architecture a single data center served the US.

  9. Netflix expanded its service around the globe …

  10. … and migrated from the data center to a cloud architecture in multiple Amazon AWS regions. Geographic isolation
    and failover via active/active multi-region deployment were added in 2013
    (http://techblog.netflix.com/2013/05/denominating-multi-region-sites.html)

  11. [Diagram: three AWS Availability Zones]
    3 AWS Availability Zones (think of them as independent data centers right next to each other for low latency) operate
    in each region with deployments split across them for redundancy in event of losing an entire zone.

  12. Each zone is populated with application clusters (‘auto-scaling groups’ or ASGs) that make up the service oriented
    distributed system. Application clusters operate independently of each other with client-side software load balancing
    routing traffic between them.

  13. Application clusters are made up of 1 to 100s of machine instances per zone. Service registry and discovery work with
    software load balancing to allow machines to launch and disappear (for planned or unplanned reasons) at any time
    and become part of the distributed system and serve requests. Auto-scaling enables system-wide adaptation to
    demand as it launches instances to meet increasing traffic and load or handle instance failure.

  14. Failed instances are dropped from discovery so traffic stops routing to them. Software load balancers on client
    applications detect and skip them until discovery removes them.
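    The skipping behaviour can be sketched as follows, assuming a simple round-robin policy. SkippingLoadBalancer and its
    method names are hypothetical and only illustrate the idea; this is not the Ribbon API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical client-side balancer (illustration only, not Ribbon). */
final class SkippingLoadBalancer {
    private final List<String> instances;   // snapshot of instances from discovery
    private final Set<String> suspect = ConcurrentHashMap.newKeySet();
    private final AtomicInteger cursor = new AtomicInteger();

    SkippingLoadBalancer(List<String> instances) {
        this.instances = new ArrayList<>(instances);
    }

    /** Called by the client after a call to this instance failed. */
    void markFailed(String instance) {
        suspect.add(instance);
    }

    /** Called once the instance responds again or discovery refreshes it. */
    void markHealthy(String instance) {
        suspect.remove(instance);
    }

    /** Round-robin over the snapshot, skipping instances marked as failed. */
    Optional<String> next() {
        for (int i = 0; i < instances.size(); i++) {
            int index = Math.floorMod(cursor.getAndIncrement(), instances.size());
            String candidate = instances.get(index);
            if (!suspect.contains(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty(); // every known instance is currently suspect
    }
}
```

    The client keeps routing around a bad instance on its own, without waiting for the registry to drop it.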

  15. Auto-scale policies bring on new instances to replace failed ones or to adapt to increasing demand.

  16. A suite of tools called the “Simian Army” is employed to assert that the architecture and systems are in fact
    resilient, responsive and reactive to failure, demand and changing conditions. The tools are used to inject latency
    and failure, validate environments, clean up, or perform “Game Day” exercises. These are done in all environments,
    most importantly in production.
    More information available at http://techblog.netflix.com/2011/07/netflix-simian-army.html,
    https://github.com/Netflix/SimianArmy and http://queue.acm.org/detail.cfm?id=2499552

  17. Chaos Monkey constantly runs in the background randomly killing single instances in application clusters. Application
    owners will receive notification that an instance has been killed. The purpose is asserting that an application cluster
    can handle loss of instances without impact.

  18. [Diagram: three AWS Availability Zones]
    Chaos Gorilla is used in “Game Day” exercises to terminate an entire AWS Availability Zone. This is used to
    demonstrate that all systems behave correctly to migrate traffic and scale up to meet demand in the other 2 zones. It
    also serves as good practice for engineers to learn what happens and what to expect when it happens for real.

  19. Chaos Kong is another “Game Day” exercise where traffic is migrated away from an entire region.

  20. For example, all traffic from US-East could be rerouted to US-West so all North and South American traffic is going to
    a single region instead of split as it normally is. These exercises ensure that control systems reroute traffic via
    DNS changes and that client devices respect the changes (or reveal which ones don’t). They also give engineers
    experience of how the whole system behaves, so that when it needs to be done for a real reason it is a known practice.

  21. Eureka: Instance Discovery
    Karyon: Base Server with Instance Registration, Metrics, Heartbeat, etc
    Ribbon: RPC Client with load balancing
    Eureka, Karyon and Ribbon are core software that enables resilient registration, discovery and communication
    between applications in the service oriented architecture.
    More information can be found at http://netflix.github.io

  22. [Diagram: a user request fanning out to Dependencies A through R]
    Applications communicate with dozens of other applications in the service-oriented architecture. Each of these
    client/server dependencies represents a relationship within the complex distributed system.

  23. [Diagram: a user request fanning out to Dependencies A through R. Callout: "User request blocked by latency in
    single network call"]
    Any one of these relationships can fail at any time. Failures can be intermittent or cluster-wide. They can be
    immediate, with thrown exceptions and/or error codes, or they can show up as latency from various causes. Latency is
    particularly challenging for applications to deal with as it ties up resources in queues and pools and blocks user
    requests (even with async IO).

  24. [Diagram: many concurrent user requests fanning out to Dependencies A through R. Callout: "At high volume all
    request threads can block in seconds"]
    Latency at high volume can quickly saturate all application resources (queues, pools, sockets, etc) causing total
    application failure and the inability to serve user requests even if all other dependencies are healthy.

  25. Dozens of dependencies.
    One going bad takes everything down.
    99.99%^30 = 99.7% uptime
    0.3% of 1 billion = 3,000,000 failures
    2+ hours downtime/month
    Reality is generally worse.
    Large distributed systems are complex and failure will occur. If failure from every component is allowed to cascade
    across the system they will all affect the user.
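    The arithmetic on this slide can be reproduced with a short calculation. This is a sketch that assumes 30
    independent dependencies at 99.99% availability each, 1 billion requests and a 30-day month; the class and method
    names are made up for illustration.

```java
/** Illustrative calculation only; names are hypothetical. */
final class CompoundAvailability {

    /** Availability of a request that needs every one of n independent dependencies. */
    static double combined(double perDependency, int dependencies) {
        return Math.pow(perDependency, dependencies);
    }

    public static void main(String[] args) {
        double uptime = combined(0.9999, 30);                 // roughly 0.997
        double failures = (1 - uptime) * 1_000_000_000L;      // roughly 3 million of 1 billion requests
        double downtimeHours = (1 - uptime) * 30 * 24;        // a little over 2 hours in a 30-day month
        System.out.printf("uptime=%.4f failures=%.0f downtimeHours=%.2f%n",
                uptime, failures, downtimeHours);
    }
}
```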

  26. CONSTRAINTS
    Speed of Iteration
    Client Libraries
    Mixed Environment
    The solution for handling cascading latency and failure was designed within the constraints, context and priorities
    of the Netflix environment.

  27. CONSTRAINTS
    Speed of Iteration
    Client Libraries
    Mixed Environment
    Speed of iteration is optimized for and this leads to client/server relationships where client libraries are provided
    rather than each team writing their own client code against a server protocol. This means “3rd party” code from many
    developers and teams is constantly being deployed into applications across the system. Large applications such as
    the Netflix Edge API have dozens of client libraries.

  28. CONSTRAINTS
    Speed of Iteration
    Client Libraries
    Mixed Environment
    Speed of iteration is optimized for and this leads to client/server relationships where client libraries are provided
    rather than each team writing their own client code against a server protocol. This means “3rd party” code from many
    developers and teams is constantly being deployed into applications across the system. Large applications such as
    the Netflix Edge API have dozens of client libraries.

  29. CONSTRAINTS
    Speed of Iteration
    Client Libraries
    Mixed Environment
    The environment is also diverse with different types of client/server communications and protocols. This
    heterogeneous and always changing environment affects the approach for resilience engineering and is potentially
    very different than approaches taken for a tightly controlled codebase or homogeneous architecture.

  30. [Diagram: many concurrent user requests fanning out to Dependencies A through R]
    Each dependency - or distributed system relationship - must be isolated so its failure does not cascade or saturate all
    resources.

  31. [Diagram: cropped view of the dependency grid and user requests, annotated with the stages of a single client call]
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc
    Serialization - URL and/or body generation
    Logic - validation, decoration, object model, caching, metrics, logging, etc
    It is not just the network that can fail and needs isolation but the full request/response loop including business logic
    and serialization/deserialization.
    Protecting against a network failure only to return a response that causes application logic to fail elsewhere in the
    application only moves the problem.

  32. "Timeout guard" daemon prio=10 tid=0x00002aaacd5e5000 nid=0x3aac runnable [0x00002aaac388f000]
       java.lang.Thread.State: RUNNABLE
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
        - locked <0x000000055c7e8bd8> (a java.net.SocksSocketImpl)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
        at java.net.Socket.connect(Socket.java:579)
        at java.net.Socket.connect(Socket.java:528)
        at java.net.Socket.<init>(Socket.java:425)
        at java.net.Socket.<init>(Socket.java:280)
        at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
        at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$1.doit(ControllerThreadSocketFactory.java:91)
        at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$SocketTask.run(ControllerThreadSocketFactory.java:158)
        at java.lang.Thread.run(Thread.java:722)
    [Sat Jun 30 04:01:37 2012] [error] proxy: HTTP: disabled connection for (127.0.0.1)
    [Graph labels: "> 80% of requests rejected", "Median Latency"]
    This is an example of what a system looks like when high latency occurs without load shedding and isolation.
    Backend latency spiked (from <100ms to >1000ms at the median, >10,000ms at the 90th percentile) and saturated
    all available resources resulting in the HTTP layer rejecting over 80% of requests.

  33. [Diagram: a user request fanning out to Dependencies A through R]
    Bulkheading is an approach to isolating failure and latency. It can be used to compartmentalize each system
    relationship so their failure impact is limited and controllable.
    Also see http://www.reactivemanifesto.org/#resilient

  34. Bulkheading is an approach to isolating failure and latency. It can be used to compartmentalize each system
    relationship so their failure impact is limited and controllable.
    Also see http://www.reactivemanifesto.org/#resilient

  35. Bulkheading is an approach to isolating failure and latency. It can be used to compartmentalize each system
    relationship so their failure impact is limited and controllable.
    Also see http://www.reactivemanifesto.org/#resilient

  36. [Diagram: a user request fanning out to Dependencies A through R]
    Bulkheading is an approach to isolating failure and latency. It can be used to compartmentalize each system
    relationship so their failure impact is limited and controllable.
    Also see http://www.reactivemanifesto.org/#resilient

  37. [Diagram: a user request fanning out to Dependencies A through R]
    Responses can be intercepted and replaced with fallbacks.

  38. [Diagram: a user request fanning out to Dependencies A through R]
    A user request can continue in a degraded state with a fallback response from the failing dependency.

  39. “Overt catastrophic failure occurs when small,
    apparently innocuous failures join to create
    opportunity for a systemic accident.”
    – Richard Cook, How Complex Systems Fail
    Richard I. Cook - How Complex Systems Fail - http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf

  40. “System operations are dynamic, with
    components (organizational, human, technical)
    failing and being replaced continuously.”
    – Richard Cook, How Complex Systems Fail
    Richard I. Cook - How Complex Systems Fail - http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf

  41. Code was written to apply bulkheads, circuit breakers, time-outs, fallbacks and other practices … and later got a
    name and logo.
    See more at http://github.com/Netflix/Hystrix

  42. Logic - validation, decoration, object model, caching, metrics, logging, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Serialization - URL and/or body generation
    Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc
    A bulkhead wraps around the entire client behavior not just the network portion.

  43. [Diagram: a Tryable Semaphore in front of the client call; requests are either Permitted or Rejected]
    Logic - validation, decoration, object model, caching, metrics, logging, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Serialization - URL and/or body generation
    Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc
    An effective form of bulkheading is a tryable semaphore that restricts concurrent execution.
    Read more at https://github.com/Netflix/Hystrix/wiki/How-it-Works#semaphores
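    A minimal sketch of a tryable semaphore, built on the JDK's java.util.concurrent.Semaphore. The class
    TryableSemaphoreBulkhead and its execute method are hypothetical names for illustration, not Hystrix code; the point
    is that tryAcquire() never waits, so a saturated dependency rejects immediately instead of queueing callers.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical bulkhead for illustration; not the Hystrix implementation. */
final class TryableSemaphoreBulkhead {
    private final Semaphore permits;

    TryableSemaphoreBulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs the work if a permit is free; otherwise rejects at once and uses the fallback. */
    <T> T execute(Supplier<T> work, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {   // "try", not "wait": the caller is never blocked here
            return fallback.get();     // Rejected
        }
        try {
            return work.get();         // Permitted
        } finally {
            permits.release();         // always hand the permit back
        }
    }
}
```

    Note that the work still runs on the calling thread, so a semaphore limits concurrency but cannot walk away from a
    call that is already slow.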

  44. [Diagram: a Thread-pool in front of the client call; requests are either Permitted or Rejected, and a permitted call
    can Timeout]
    Logic - validation, decoration, object model, caching, metrics, logging, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Serialization - URL and/or body generation
    Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc
    A thread-pool also limits concurrent execution while also offering the ability to timeout and walk away from a latent
    thread.
    Read more at https://github.com/Netflix/Hystrix/wiki/How-it-Works#threads--thread-pools
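    A rough sketch of the thread-pool variant using only JDK classes. ThreadPoolBulkhead is a hypothetical name and the
    sizing and timeout values are left to the caller; this is an illustration of the idea, not how Hystrix is
    implemented internally.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical thread-pool bulkhead for illustration; not the Hystrix implementation. */
final class ThreadPoolBulkhead {
    private final ExecutorService pool;
    private final long timeoutMillis;

    ThreadPoolBulkhead(int threads, long timeoutMillis) {
        // SynchronousQueue holds no tasks: when every thread is busy, submit() is rejected.
        this.pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>(), r -> {
                    Thread t = new Thread(r, "bulkhead-worker");
                    t.setDaemon(true); // do not keep the JVM alive for latent calls
                    return t;
                });
        this.timeoutMillis = timeoutMillis;
    }

    <T> T execute(Callable<T> work, Supplier<T> fallback) {
        Future<T> future;
        try {
            future = pool.submit(work);
        } catch (RejectedExecutionException e) {
            return fallback.get();            // Rejected: the pool is saturated
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);              // give up and interrupt the latent thread
            return fallback.get();
        } catch (ExecutionException e) {
            return fallback.get();            // the work itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback.get();
        }
    }
}
```

    The calling thread waits at most timeoutMillis, whatever the worker thread is doing, which is what the semaphore
    approach cannot offer.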

  45. Tryable semaphores for non-blocking clients and fallbacks
    Separate threads for blocking clients
    Aggressive timeouts to “give up and move on”
    Circuit breakers as the “release valve”

  46. [Flow chart: Construct Hystrix Command Object -> .execute() (Synchronous), .queue() (Asynchronous) or .observe()
    (Asynchronous) -> Circuit Open? -> Rate Limit? -> run() -> Success? -> Return Successful Response. Short-circuit,
    Reject, Timeout and Exception Thrown lead to getFallback(), which ends in Successful Fallback (Return Fallback
    Response), Failed Fallback (Exception Thrown) or Not Implemented (Exception Thrown). Calculate Circuit Health forms
    the Feedback Loop.]
    Hystrix execution flow chart.
    Read more at https://github.com/Netflix/Hystrix/wiki/How-it-Works#flow-chart

  47. [Diagram: Hystrix execution flow chart]
    Execution can be synchronous or asynchronous (via a Future or Observable).

  48. [Diagram: Hystrix execution flow chart]
    Current state is queried before allowing execution to determine if it is short-circuited or throttled and should reject.

  49. [Diagram: Hystrix execution flow chart]
    If not rejected execution proceeds to the run() method which performs underlying work.

  50. [Diagram: Hystrix execution flow chart]
    Successful responses return.

  51. [Diagram: Hystrix execution flow chart]
    All requests, successful and failed, contribute to a feedback loop used to make decisions and publish metrics.
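    The feedback loop can be illustrated with a deliberately simplified circuit breaker. This sketch assumes plain
    cumulative counters and a single error-rate threshold, and it leaves out how a tripped circuit closes again.
    SimpleCircuitBreaker and its parameters are hypothetical names, not the Hystrix implementation.

```java
import java.util.function.Supplier;

/** Hypothetical count-based breaker for illustration; not the Hystrix implementation. */
final class SimpleCircuitBreaker {
    private final int minimumRequests;    // do not judge health on too few samples
    private final double errorThreshold;  // e.g. 0.5 trips the circuit at 50% errors
    private int successes;
    private int failures;
    private boolean open;

    SimpleCircuitBreaker(int minimumRequests, double errorThreshold) {
        this.minimumRequests = minimumRequests;
        this.errorThreshold = errorThreshold;
    }

    /** Queried before execution: false means short-circuit. */
    synchronized boolean allowRequest() {
        return !open;
    }

    /** Every outcome, successful or failed, feeds the health calculation. */
    synchronized void record(boolean success) {
        if (success) {
            successes++;
        } else {
            failures++;
        }
        int total = successes + failures;
        if (total >= minimumRequests && (double) failures / total >= errorThreshold) {
            open = true;
        }
    }

    <T> T execute(Supplier<T> work, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get();        // short-circuited: the work is not attempted
        }
        try {
            T result = work.get();
            record(true);
            return result;
        } catch (RuntimeException e) {
            record(false);
            return fallback.get();
        }
    }
}
```

    Once the circuit is open, the failing dependency stops receiving traffic and callers get the fallback without
    paying for another failed attempt.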

  52. [Diagram: Hystrix execution flow chart]
    All failure states are routed through the same path.

  53. [Diagram: Hystrix execution flow chart]
    Every failure is given the opportunity to retrieve a fallback which can result in one of three results.
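    The three outcomes (successful fallback, fallback not implemented, failed fallback) can be sketched with a small
    stand-in for the command pattern. MiniCommand is a hypothetical class written for this illustration, and
    IllegalStateException stands in for whatever exception type the real library throws.

```java
/** Hypothetical miniature of the command pattern; not the Hystrix classes. */
abstract class MiniCommand<T> {

    /** The wrapped work. */
    protected abstract T run() throws Exception;

    /** Default models "Not Implemented": there is no fallback to offer. */
    protected T getFallback() {
        throw new UnsupportedOperationException("No fallback available");
    }

    final T execute() {
        try {
            return run();
        } catch (Exception primaryFailure) {
            try {
                return getFallback();   // outcome 1: successful fallback
            } catch (UnsupportedOperationException notImplemented) {
                // outcome 2: no fallback implemented, so the original failure surfaces
                throw new IllegalStateException("failed, no fallback", primaryFailure);
            } catch (RuntimeException fallbackFailure) {
                // outcome 3: the fallback itself failed
                throw new IllegalStateException("failed, and fallback failed", fallbackFailure);
            }
        }
    }
}
```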

  54. [Diagram: Hystrix execution flow chart]
    Hystrix execution flow chart.
    Read more at https://github.com/Netflix/Hystrix/wiki/How-it-Works#flow-chart

  55. [Diagram: HystrixCommand run()]
    public class CommandHelloWorld extends HystrixCommand<String> {
        ...  // constructor and the 'name' field are elided on the slide
        @Override
        protected String run() {
            return "Hello " + name + "!";
        }
    }
    Basic successful execution pattern and sample code.
    Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#wiki-Hello-World

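  56. [Diagram: HystrixCommand run(), annotated "run() invokes 'client' Logic"]
    public class CommandHelloWorld extends HystrixCommand<String> {
        ...  // constructor and the 'name' field are elided on the slide
        @Override
        protected String run() {
            return "Hello " + name + "!";
        }
    }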
    The run() method is where the wrapped logic goes.

  57. Fail Fast
    [Diagram: HystrixCommand run() -> throw Exception]
    Failing fast is the default behavior if no fallback is implemented. Even without a fallback this is useful as it prevents
    resource saturation beyond the bulkhead so the rest of the application can continue functioning and enables rapid
    recovery once the underlying problem is resolved.
    Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#fail-fast

  58. Fail Silent
    [Diagram: HystrixCommand run() -> getFallback()]
    return null;
    return new Option();
    return Collections.emptyList();
    return Collections.emptyMap();
    Silent failure is an approach for removing non-essential functionality from the user experience by returning a value
    that equates to “no data”, “not available” or “don’t display”.
    Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#fail-silent

  59. Static Fallback
    [Diagram: HystrixCommand run() -> getFallback()]
    return true;
    return DEFAULT_OBJECT;
    Static fallbacks can be used when default data or behavior can be returned to the user.
    Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#fallback-static

  60. Stubbed Fallback
    [Diagram: HystrixCommand run() -> getFallback()]
    return new UserAccount(customerId, "Unknown Name",
                           countryCodeFromGeoLookup, true, true, false);
    return new VideoBookmark(movieId, 0);
    Stubbed fallbacks are an extension of static fallbacks for when some data is available (such as from request
    arguments, authentication tokens or other functioning system calls) and can be combined with default values for data
    that cannot be retrieved.
    Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#fallback-stubbed

  61. Stubbed Fallback
    [Diagram: HystrixCommand run() -> getFallback()]
    public class CommandHelloWorld extends HystrixCommand<String> {
        ...  // constructor and the 'name' field are elided on the slide
        @Override
        protected String run() {
            return "Hello " + name + "!";
        }
        @Override
        protected String getFallback() {
            return "Hello Failure " + name + "!";
        }
    }

  62. Stubbed Fallback
    [Diagram: HystrixCommand run() -> getFallback()]
    public class CommandHelloWorld extends HystrixCommand<String> {
        ...  // constructor and the 'name' field are elided on the slide
        @Override
        protected String run() {
            return "Hello " + name + "!";
        }
        @Override
        protected String getFallback() {
            return "Hello Failure " + name + "!";
        }
    }
    The getFallback() method is executed whenever failure occurs (after run() is invoked, or on rejection without run()
    ever being invoked) to provide an opportunity to fall back.

  63. Fallback via network
    [Diagram: HystrixCommand run() -> getFallback() -> HystrixCommand run()]
    Fallback via network is a common approach for falling back to a stale cache (such as a memcache server) or less
    personalized value when not able to fetch from the primary source.
    Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#fallback-cache-via-network

  64. Fallback via network then Local
    [Diagram: HystrixCommand run() -> getFallback() -> HystrixCommand run() -> getFallback()]
    When the fallback performs a network call it is preferable for it to also have a fallback that does not go over the
    network. Otherwise, if both primary and secondary systems fail, it will fail by throwing an exception (similar to fail
    fast except after two fallback attempts).
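    A small sketch of the "network, then local" chain. LayeredFallback and firstAvailable are hypothetical names, and
    the suppliers stand in for a primary service call and a stale-cache call; the local default is a plain value, so the
    last step involves no network at all.

```java
import java.util.function.Supplier;

/** Hypothetical helper for illustration; not part of Hystrix. */
final class LayeredFallback {

    /** Tries each network source in order and returns the first result that does not throw. */
    @SafeVarargs
    static <T> T firstAvailable(T localDefault, Supplier<T>... networkSources) {
        for (Supplier<T> source : networkSources) {
            try {
                return source.get();
            } catch (RuntimeException e) {
                // this source is unavailable; move on to the next one
            }
        }
        return localDefault; // no network involved, so this step cannot fail the same way
    }
}
```

    For example, firstAvailable(defaultValue, primaryCall, staleCacheCall) returns the stale cache entry when the primary
    is down and the local default when both are down.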

  65. “In complex systems, decision-makers are
    locally rather than globally rational. But that
    doesn’t mean that their decisions cannot lead
    to global, or system-wide events. In fact, that
    is one of the properties of complex systems:
    local actions can have global results.”
    Dekker, Sidney (2012-10-01). Drift into Failure (Kindle Locations 3268-3270). Ashgate Publishing. Kindle Edition.
    – Sidney Dekker, Drift into Failure
    Drift into Failure: http://www.amazon.com/Drift-into-Failure-ebook/dp/B009KOKXKY/ref=tmm_kin_title_0

  66. “In complex systems, decision-makers are
    locally rather than globally rational. But that
    doesn’t mean that their decisions cannot lead
    to global, or system-wide events. In fact, that
    is one of the properties of complex systems:
    local actions can have global results.”
    Dekker, Sidney (2012-10-01). Drift into Failure (Kindle Locations 3268-3270). Ashgate Publishing. Kindle Edition.
    – Sidney Dekker, Drift into Failure
    Local decisions are being made constantly and do affect the entire system. These include routing decisions (such as
    by Ribbon), marking an instance “up” or “down” (via Eureka), timing out, short-circuiting, rejecting and fallbacks (via
    Hystrix).
    Drift into Failure: http://www.amazon.com/Drift-into-Failure-ebook/dp/B009KOKXKY/ref=tmm_kin_title_0

  67. Auditing via Simulation
    Simulating failure states in production has proven an effective approach for auditing our applications to either prove
    resilience or find weakness and determine how local decisions affect the system as a whole.
    NOTE: This does not imply the ability to understand all impacts of local decisions. Since a distributed system is a
    “complex system” it is by definition impossible to simulate or even model all possible states. However, auditing via
    simulation does allow many states (including the most common) to be understood, validated and catered for.

  68. Auditing via Simulation
    In this example failure was injected into a single dependency which caused the bulkhead to return fallbacks and trip
    all circuits since the failure rate was almost 100%, well above the threshold for circuits to trip.

  69. Auditing via Simulation
    When the ‘TitleStatesGetAllRentStates’ bulkhead began returning fallbacks, the ‘atv_mdp’ endpoint shot to the top of
    the dashboard with a 99% error rate. There was a bug in how the fallback was handled on this device, so we immediately
    stopped the simulation, fixed the bug over the coming days and repeated the simulation to prove it was fixed and that
    the rest of the system remained resilient. This was caught in a controlled simulation where we could catch and act in
    less than a minute, rather than in a true production incident where we likely wouldn’t have been able to do anything.

  70. This shows another simulation where latency was injected.
    Read more at http://techblog.netflix.com/2011/07/netflix-simian-army.html

  71. 125 → 1500+
    1000+ ms of latency was injected into a dependency that normally completes with a median latency of ~15-20ms
    and a 99.5th percentile of 120-130ms.

  72. ~5000
    The latency spike caused timeouts, short-circuiting and rejecting and up to ~5000 fallbacks per second as a result of
    these various failure states.
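
    How a timeout turns into a fallback rather than an error can be sketched as below. This uses plain
    `java.util.concurrent`, not Hystrix; the helper `callWithFallback`, the pool size and the timeout values are
    hypothetical and only meant to show the shape of the idea.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Illustrative only: run a call on a worker thread and return a fallback
// if it fails or takes longer than the timeout.
final class TimeoutFallbackSketch {
    static <T> T callWithFallback(ExecutorService pool, Callable<T> call,
                                  long timeoutMillis, T fallback) {
        Future<T> future = pool.submit(call);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (Exception e) {
            // Timeout, interruption or failure of the call: give up on it
            // and degrade gracefully instead of propagating the error.
            future.cancel(true);
            return fallback;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            String fast = callWithFallback(pool, () -> "live data", 500, "fallback");
            String slow = callWithFallback(pool, () -> {
                Thread.sleep(2_000); // simulated injected latency
                return "live data";
            }, 100, "fallback");
            System.out.println(fast + " / " + slow);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

    The caller of the slow dependency receives the fallback value after roughly the timeout, instead of waiting for the
    full injected latency or seeing an exception.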

  73. ~1
    While delivering the ~5000 fallbacks per second, the exceptions thrown didn’t go beyond ~1 per second,
    demonstrating that user impact was negligible (as perceived from the server; the client behavior must also be
    validated during a simulation but is not part of this dataset).

  74. Constantly Changing
    Constantly building, testing and deploying new code. Dozens of application deployments every day. Typically 100s of
    customer A/B tests running.
    For more info on A/B testing:
    http://www.slideshare.net/xamat/cikm-2013-beyond-data-from-user-information-to-business-value

  75. Application clusters are replaced every time a new version of code is deployed. There are often several different
    clusters with slightly different versions or configurations of the same application.

  76. [Diagram: Zuul Routing Layer routing to clusters: Canary vs Baseline, Squeeze, Production, "Coalmine"]
    For example, the Edge API application has several clusters such as these used for validation, testing, debugging and
    production traffic.

  77. [Diagram: Zuul Routing Layer routing to clusters: Canary vs Baseline, Squeeze, Production, "Coalmine"]
    Zuul was built to allow dynamic routing of traffic to each cluster.
    Read more at http://techblog.netflix.com/2013/06/announcing-zuul-edge-service-in-cloud.html
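
    As a rough illustration of routing a share of traffic to a particular cluster, consider the sketch below. It is not
    a Zuul filter and does not use Zuul's API; the cluster names, the percentage split and the hash-based rule are all
    assumptions made for this example.

```java
// Illustrative only: not Zuul code. Routes a fixed share of requests to a canary cluster.
final class RoutingSketch {
    // Deterministically map a request id to a bucket in [0, 100).
    static int bucket(String requestId) {
        return Math.floorMod(requestId.hashCode(), 100);
    }

    // Requests whose bucket falls below canaryPercent go to the canary cluster.
    static String chooseCluster(String requestId, int canaryPercent) {
        return bucket(requestId) < canaryPercent ? "canary" : "production";
    }

    public static void main(String[] args) {
        int toCanary = 0;
        for (int i = 0; i < 10_000; i++) {
            if (chooseCluster("request-" + i, 5).equals("canary")) toCanary++;
        }
        System.out.println(toCanary + " of 10000 requests routed to canary");
    }
}
```

    Because the rule is a pure function of the request id and the percentage, changing the percentage at runtime shifts
    traffic without redeploying anything, which is the spirit of dynamic routing.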

  78. [Diagram: Zuul Routing Layer routing to clusters: Canary vs Baseline, Squeeze, Production, "Coalmine"]
    Every code deployment is preceded by a canary test in which a small number of instances are launched to take
    production traffic, half with new code (canary) and half with existing production code (baseline). The two groups
    are then compared for differences.
    Read more at http://techblog.netflix.com/2013/08/deploying-netflix-api.html

  79. This dashboard represents the continuous build and deployment pipeline for a single application. Each commit (or
    batch of commits) results in unit tests being executed, a new machine image being created, an instance launched in
    the test environment, smoke tests against this instance and optionally launching in production for canary testing.

  80. This build resulted in a poor canary score, so it was not promoted to production.

  81. The exact commit that represents the machine image running in production is seen along with which clusters are
    running that code.

  82. The production code passed the canary test with a score of 99%.

  83. There are many different environments where code can be deployed.

  84. History and metadata for every environment, commit and machine image are available.

  85. Code can be manually pushed to an environment by selecting a build and environment.

  86. Selection of a build shows its status and environments it is deployed to.

  87. Each cluster can be inspected to see the versions of code deployed, whether they are active or inactive, whether
    they are taking traffic and how many instances they have.

  88. Selecting an environment provides context for time zones and normal peak traffic patterns to assist in making
    decisions about when to deploy. The tooling allows ignoring the suggestions and pushing the code immediately, or
    scheduling it to be done during off-peak hours.

  89. The sequence of builds can be viewed to get an overview of where code is currently running.

  90. Canary tests in production result in reports that compare the new code against a “baseline” which is the same
    machine image as current production.

  91. Metrics from Ribbon, Hystrix, JVM and machine are analyzed to determine a score. They are marked as “hot” or “cold”.
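
    The talk does not describe how the score is computed, so the following is only one plausible reading. It assumes a
    metric is “hot” when the canary value is noticeably above the baseline, “cold” when it is noticeably below, and that
    the score is the percentage of metrics within a tolerance. All names, values and the 10% tolerance are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: the real canary analysis is not described in the talk.
final class CanaryScoreSketch {
    enum Verdict { OK, HOT, COLD }

    // HOT: canary noticeably above baseline; COLD: noticeably below (assumed definitions).
    static Verdict classify(double baseline, double canary, double tolerance) {
        if (canary > baseline * (1 + tolerance)) return Verdict.HOT;
        if (canary < baseline * (1 - tolerance)) return Verdict.COLD;
        return Verdict.OK;
    }

    // Score = percentage of metrics whose canary value stays within tolerance of the baseline.
    // Each map value is a pair {baseline, canary}.
    static int score(Map<String, double[]> metrics, double tolerance) {
        if (metrics.isEmpty()) return 100;
        int ok = 0;
        for (double[] pair : metrics.values()) {
            if (classify(pair[0], pair[1], tolerance) == Verdict.OK) ok++;
        }
        return ok * 100 / metrics.size();
    }

    public static void main(String[] args) {
        Map<String, double[]> metrics = new LinkedHashMap<>();
        metrics.put("latencyMs", new double[] {20.0, 21.0});
        metrics.put("errorsPerSec", new double[] {1.0, 4.0});
        metrics.put("heapUsedMb", new double[] {900.0, 910.0});
        metrics.put("cpuPercent", new double[] {40.0, 41.0});
        System.out.println("canary score: " + score(metrics, 0.10) + "%");
    }
}
```

    In this made-up data set only the error rate is “hot”, so three of four metrics pass.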

  93. This canary did not do so well. This results in review of what went wrong. Sometimes it will be judged okay to
    proceed and the problems get fixed in a later release. Other times the problems are too severe and must be fixed
    before release.

  94. This shows a build that went through the entire process and is now taking production traffic in all 3 regions.

  95. There are several clusters taking production traffic besides the primary one. They are used for debugging, squeeze
    testing, alerting, or long-lived (stable, meaning no autoscale policies) monitoring of application health (such as
    memory).

  96. [Diagram: Zuul Routing Layer routing to clusters: Canary vs Baseline, Squeeze, Production, "Coalmine"]
    New machine images are put through a squeeze test before full rollout to find the point at which the performance
    degrades. This is used to identify performance and throughput changes of each deployment and is fed as a datapoint
    into autoscaling algorithms.
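
    One simple way to think about “the point at which the performance degrades” is sketched below. The talk does not
    say how that point is determined; this example assumes load is raised in steps and that degradation means exceeding
    a latency limit. The class name, data and the 50ms limit are made up.

```java
// Illustrative only: pick the highest load step that still met a latency limit.
final class SqueezeTestSketch {
    // rps[i] is the load applied at step i (ascending), latencyMs[i] the latency observed there.
    // Returns the last load level before latency first exceeded the limit,
    // or 0 if even the first step was over the limit.
    static int sustainableRps(int[] rps, double[] latencyMs, double latencyLimitMs) {
        int best = 0;
        for (int i = 0; i < rps.length; i++) {
            if (latencyMs[i] > latencyLimitMs) break;
            best = rps[i];
        }
        return best;
    }

    public static void main(String[] args) {
        int[] rps = {500, 1000, 1500, 2000, 2500};
        double[] latencyMs = {18, 19, 22, 35, 140};
        System.out.println("sustainable load: " + sustainableRps(rps, latencyMs, 50) + " rps per instance");
    }
}
```

    A per-instance figure like this is the kind of datapoint that could then inform how many instances a given request
    rate needs.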

  97. [Diagram: Zuul Routing Layer routing to clusters: Canary vs Baseline, Squeeze, Production, "Coalmine"]
    Long-term canaries are kept in a cluster we call “coalmine” with agents intercepting all network traffic. These run the
    same code as the production cluster and are used to identify network traffic without a bulkhead that starts happening
    due to unknown code paths being enabled via configuration, A/B tests and other changes.
    Read more at https://github.com/Netflix/Hystrix/tree/master/hystrix-contrib/hystrix-network-auditor-agent

  98. [Diagram: a User Request fanning out to Dependencies A through R, with one system relationship over the network
    without a bulkhead]
    For example, a network relationship could exist in production code but not be triggered in dev, test or production
    canaries, and then be enabled by a condition that changes days after deployment to production. This can be a
    vulnerability, and we use the “coalmine” to identify these situations and inform decisions.

  99. [Diagram: Zuul Routing Layer routing to clusters: Canary vs Baseline, Squeeze, Production, "Coalmine"]

  100. [Diagram: Zuul Routing Layer routing to clusters: Canary vs Baseline, Squeeze, Production, "Coalmine"]
    The primary cluster takes the majority of production traffic and is managed by autoscale policies that add and remove
    instances to meet demand.

  101. The daily traffic patterns for a region typically look like this.

  102. The weekends have higher peaks …

  103. … and broader mid-day usage …

  104. … than weekdays.

  105. Usage in the morning (particularly on the weekend) climbs rapidly. These spikes cause problems with reactive
    autoscaling and require padding so that enough servers exist to handle the load before new instances can be brought
    online.

  106. Since the daily pattern is so predictable a new predictive autoscaling approach was built. It uses historical data to
    predict each 24 hour period and sets the scaling “floor” to the number of instances required. Reactive scaling policies
    continue to run and can raise the instance count higher if needed. When scaling down the predictive policy lowers the
    floor which allows the reactive policy to scale down as traffic reduces.
    More information available at: http://techblog.netflix.com/2013/11/scryer-netflixs-predictive-auto-scaling.html
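
    The interplay between the predictive floor and the reactive policy described above can be sketched as simple
    arithmetic. This is not Scryer's algorithm; the method names, the predicted request rate and the per-instance
    capacity are assumptions for illustration.

```java
// Illustrative only: the predictive policy sets a floor, the reactive policy can only go above it.
final class PredictiveFloorSketch {
    // Instances needed for a predicted request rate, given an assumed per-instance capacity (rounded up).
    static int floorFor(double predictedRps, double rpsPerInstance) {
        return (int) Math.ceil(predictedRps / rpsPerInstance);
    }

    // The reactive policy may ask for more instances than the floor, but never fewer.
    static int desiredInstances(int predictedFloor, int reactiveDesired) {
        return Math.max(predictedFloor, reactiveDesired);
    }

    public static void main(String[] args) {
        int floor = floorFor(42_000, 500);
        System.out.println("predicted floor: " + floor);
        System.out.println("during a traffic dip: " + desiredInstances(floor, 60));
        System.out.println("during a surge: " + desiredInstances(floor, 100));
    }
}
```

    During a dip the reactive policy would shrink the cluster to 60 instances, but the floor holds it at the predicted
    level; during a surge the reactive policy still wins and raises the count.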

  107. Reactive Only
    with Predictive
    This shows the number of instances needed with just reactive scaling (blue) and when predictive was added (red).
    Adding predictive policies allows far more efficient autoscaling because it doesn’t need to add as much overhead to
    handle spikes in the mornings and it can step up to peak more efficiently rather than reacting by adding percentages
    of an increasingly larger cluster.

  108. Predictive policies are also safer in the event of anomalies that cause a reduction in traffic, whether that be a
    production outage, a device problem or a widespread event like the Super Bowl.
    Using only the reactive policy the cluster starts to scale down when it sees traffic drop. The predictive policy keeps it
    set to where it needs to be so when the traffic returns (typically as a wave that spikes traffic as seen in this image) the
    required capacity is still there.

  109. This behavior (the blue line) makes the system far more resilient to anomalies and outages so that autoscaling
    doesn’t itself cause degradation or outages by removing capacity right before it’s needed again.

  110. Load Avg After Scryer
    Another positive side-effect of more efficiently scaling up via the prediction algorithms is that the instances end up
    with smoother load and fewer extremes since the cluster is already scaled up to handle the traffic rather than trying
    to react after the load comes.

  111. Dozens of UIs
    across 1000+ Devices
    Netflix supports > 1000 different devices with dozens of different UIs. Many of these are constantly innovating.

  112. 100s of Customer A/B Tests
    Typically 100s of customer A/B tests are running on many different UIs and devices.
    For more info: http://www.slideshare.net/xamat/cikm-2013-beyond-data-from-user-information-to-business-value

  113. The Netflix Edge API was re-architected to allow each UI team to develop, deploy and operate their own web service
    endpoints on top of the Edge Platform. This shifted away from having a single one-size-fits-all REST API that all UIs
    used.
    More information can be found at
    https://speakerdeck.com/benjchristensen/evolution-of-the-netflix-api-qcon-sf-2013

  114. This change allows each endpoint to be optimized to exactly the needs of the device or UI it is serving. It also allows
    each UI team to innovate independently of each other so as to distribute the pace of innovation.
    The endpoints can be deployed into production in < 2 minutes.

  115. Each endpoint is isolated from others within its own classloader. Groovy was chosen as the first language used for
    writing endpoints but any JVM language could be used.

  116. Code is uploaded into an external data store via RESTful administration APIs and then pulled into each application
    instance in the production clusters. The data store globally replicates the code across all AWS regions and zones.

  117. Endpoints are written in a fully asynchronous manner using RxJava against a Java API that is completely non-blocking.

  118. Iterable (pull) vs Observable (push):
    T next()          <->  onNext(T)
    throws Exception  <->  onError(Exception)
    returns;          <->  onCompleted()
    (Functional) Reactive Programming with RxJava
    A Java port of Rx (Reactive Extensions)
    https://rx.codeplex.com (.Net and Javascript by Microsoft)
    Rx is used for asynchronous programming. Observable/Observer is the asynchronous dual to the synchronous
    Iterable/Iterator.
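
    The duality can be made concrete with a small hand-rolled example. The `SketchObserver` interface below is a
    stand-in written for this illustration and is not the RxJava type; only the three callback names are taken from the
    slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustrative only: a hand-rolled observer, not the RxJava types.
final class PushPullSketch {
    interface SketchObserver<T> {
        void onNext(T value);
        void onError(Exception e);
        void onCompleted();
    }

    // Pull: the consumer asks for each value and waits until next() returns.
    static List<String> pull(Iterable<String> source) {
        List<String> out = new ArrayList<>();
        Iterator<String> it = source.iterator();
        while (it.hasNext()) out.add(it.next());
        return out;
    }

    // Push: the producer calls the observer as values become available.
    static void push(Iterable<String> source, SketchObserver<String> observer) {
        try {
            for (String value : source) observer.onNext(value);
        } catch (Exception e) {
            observer.onError(e); // terminal: no onCompleted after an error
            return;
        }
        observer.onCompleted();
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList("title-1", "title-2");
        System.out.println("pulled " + pull(titles));
        push(titles, new SketchObserver<String>() {
            public void onNext(String value) { System.out.println("pushed " + value); }
            public void onError(Exception e) { System.out.println("failed " + e); }
            public void onCompleted() { System.out.println("done"); }
        });
    }
}
```

    In this sketch the push side still runs on the caller's thread; the point is only that the three things an
    iterator can do (return a value, throw, finish) each have a callback counterpart.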
    More information on RxJava can be found at https://github.com/Netflix/RxJava and
    https://speakerdeck.com/benjchristensen/rxjava-goto-aarhus-2013

  119. instead of a blocking api ...
    class VideoService {
        def VideoList getPersonalizedListOfMovies(userId);
        def VideoBookmark getBookmark(userId, videoId);
        def VideoRating getRating(userId, videoId);
        def VideoMetadata getMetadata(videoId);
    }
    ... create an observable api:
    class VideoService {
        def Observable<VideoList> getPersonalizedListOfMovies(userId);
        def Observable<VideoBookmark> getBookmark(userId, videoId);
        def Observable<VideoRating> getRating(userId, videoId);
        def Observable<VideoMetadata> getMetadata(videoId);
    }
    With Rx, blocking APIs could be converted into Observable APIs, accomplishing our architecture goals, including
    abstracting away the control and implementation of concurrency and asynchronous execution.
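
    The same shift can be illustrated with the JDK alone. In the sketch below `CompletableFuture` stands in for
    `Observable` (it carries a single value rather than a stream), and the method names and return values are invented
    for the example.

```java
import java.util.concurrent.CompletableFuture;

// Illustrative only: CompletableFuture stands in for Observable; names and values are made up.
final class AsyncApiSketch {
    // Blocking style: the caller's thread waits for the value.
    static String getRatingBlocking(String userId, String videoId) {
        return "rating(" + userId + "," + videoId + ")";
    }

    // Asynchronous style: the caller gets a handle immediately. Whether the work runs on
    // another thread (as here) or completes synchronously is hidden behind the return type.
    static CompletableFuture<String> getRating(String userId, String videoId) {
        return CompletableFuture.supplyAsync(() -> getRatingBlocking(userId, videoId));
    }

    public static void main(String[] args) {
        getRating("user-1", "video-9")
                .thenAccept(rating -> System.out.println("received " + rating))
                .join(); // only so the demo does not exit before the callback runs
    }
}
```

    The caller's code is the same regardless of how the service chooses to produce the value, which is what allows the
    implementation of concurrency to change without touching the API.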

  120. +
    return new UserCommand(request.getQueryParameters().get("userId")).observe().flatMap(user -> {
        Observable<Map<String, Object>> catalog = new PersonalizedCatalogCommand(user).observe()
            .flatMap(catalogList -> {
                return catalogList.videos().flatMap(video -> {
                    Observable bookmark = new BookmarksCommand(video).observe();
                    Observable rating = new RatingsCommand(video).observe();
                    Observable metadata = new VideoMetadataCommand(video).observe();
                    return Observable.zip(bookmark, rating, metadata, (b, r, m) -> {
                        return combineVideoData(video, b, r, m);
                    });
                });
            });
        Observable<Map<String, Object>> social = new SocialCommand(user).observe().map(s -> {
            return s.getDataAsMap();
        });
        return Observable.merge(catalog, social);
    }).flatMap(data -> {
        return response.writeStringAndFlush(String.valueOf(data));
    });
    Hystrix and RxJava are used together for non-blocking composition of network calls (whether they be blocking or
    non-blocking under the Hystrix bulkheads).
    This code is representative of how Netflix Edge API endpoints are written.
    More information on Hystrix can be found at https://github.com/Netflix/Hystrix/wiki#what-is-hystrix

  121. A request for a `User` is kicked off and “observed”. This is non-blocking and will asynchronously receive back the
    `User` when the network response returns.

  122. The `User` object is returned and passed into a function that is given to the `flatMap` operator.
    Documentation can be found at https://github.com/Netflix/RxJava/wiki/Transforming-Observables#flatmap

  123. Once the `User` is returned, execute another network request to retrieve the personalized `catalog` (grid of videos)

  124. … and in parallel also fetch the `social` data for this user. Both the `catalog` and `social` calls are executed in
    parallel, asynchronously, and will call back when they receive their responses.

  125. When the `catalog` returns it will call back multiple times, once with each `catalogList`.

  126. Each `catalogList` then has a list of videos.

  127. For each `video` it will then request a bookmark, rating and the metadata. Each of these is also executed in parallel,
    asynchronously.

  128. As the bookmark, rating and metadata come back they are “zipped” together and passed to a function that combines
    them into a single response Map.
    Documentation can be found at https://github.com/Netflix/RxJava/wiki/Combining-Observables#zip
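
    The “wait for all three, then combine” step can be approximated with the JDK alone. In this sketch
    `CompletableFuture.allOf` stands in for `Observable.zip`; it is not RxJava, and the data values and the shape of
    `combineVideoData` are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Illustrative only: CompletableFuture.allOf stands in for Observable.zip; the data is made up.
final class ZipSketch {
    static Map<String, Object> combineVideoData(String video, String bookmark, int rating, String metadata) {
        Map<String, Object> combined = new LinkedHashMap<>();
        combined.put("video", video);
        combined.put("bookmark", bookmark);
        combined.put("rating", rating);
        combined.put("metadata", metadata);
        return combined;
    }

    static CompletableFuture<Map<String, Object>> videoData(String video) {
        // The three lookups start independently and may finish in any order.
        CompletableFuture<String> bookmark = CompletableFuture.supplyAsync(() -> "00:12:30");
        CompletableFuture<Integer> rating = CompletableFuture.supplyAsync(() -> 4);
        CompletableFuture<String> metadata = CompletableFuture.supplyAsync(() -> "title of " + video);
        // Once all three have completed, combine them into one Map.
        return CompletableFuture.allOf(bookmark, rating, metadata)
                .thenApply(ignored -> combineVideoData(video, bookmark.join(), rating.join(), metadata.join()));
    }

    public static void main(String[] args) {
        System.out.println(videoData("video-9").join());
    }
}
```

    The `join()` calls inside `thenApply` do not block, because `allOf` only completes after every input has completed.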

  129. The `social` and `catalog` responses (of type Map) are merged together …

  130. … and as received are written to the output stream. This will write each `list` and `social` response out as individual
    messages and not wait for everything. This can be done with chunked HTTP, ServerSentEvents or WebSockets.

  131. All of the endpoint logic sits on top of Hystrix which provides the bulkheading and fault tolerance.

  132. Netflix API
    Device
    Server
    Optimize for Each Device
    The Netflix Edge has become a platform that empowers UI teams to build their own API endpoints that are optimized
    to their client applications and devices.

  133. Tooling supports the dynamic deployment and operations of endpoints.

  134. Including activity logs …

  135. … and metrics around deprecated functionality to assist in lifecycle management.

  136. Failure inevitably happens ...
    Failure will happen.
    A good read on complex systems is Drift into Failure by Sidney Dekker:
    http://www.amazon.com/Drift-into-Failure-ebook/dp/B009KOKXKY/ref=tmm_kin_title_0

  137. Cluster adapts
    Failure Isolated
    When the backing system for the ‘SocialGetTitleContext’ bulkhead became latent the impact was contained and
    fallbacks returned.

  139. Cluster adapts
    Failure Isolated
    Since the failure rate was above the threshold, circuit breakers began tripping. As a portion of the cluster tripped
    circuits, it released pressure on the underlying system so it could successfully perform some work.

  140. Cluster adapts
    Failure Isolated
    The cluster naturally adapts as bulkheads constrain throughput and circuits open and close in a rolling manner
    across the instances in the cluster.
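
    A semaphore is one simple way to express a bulkhead that constrains throughput. The sketch below is plain Java
    rather than Hystrix code; the class name `BulkheadSketch`, the limit of one concurrent call and the fallback values
    are assumptions for illustration.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Illustrative only: a semaphore-based bulkhead, not the Hystrix implementation.
final class BulkheadSketch<T> {
    private final Semaphore permits;

    BulkheadSketch(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    T execute(Supplier<T> call, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // bulkhead is full: reject immediately instead of queueing
        }
        try {
            return call.get();
        } catch (RuntimeException e) {
            return fallback.get(); // the call failed: degrade to the fallback
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        BulkheadSketch<String> bulkhead = new BulkheadSketch<>(1);
        // The outer call holds the only permit, so the nested call is rejected.
        String result = bulkhead.execute(
                () -> "outer got: " + bulkhead.execute(() -> "live", () -> "fallback"),
                () -> "outer fallback");
        System.out.println(result);
    }
}
```

    Rejecting immediately when the limit is reached keeps a slow dependency from tying up more than its share of the
    instance, which is what lets the rest of the instance keep serving traffic.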

  141. In this example the ‘CinematchGetPredictions’ functionality began failing.

  142. The red metric shows it was exceptions thrown by the client, not latency or concurrency constraints.

  143. The 20% error rate from the realtime visualization is also seen in the historical metrics, with an accompanying drop
    in successes.

  144. Matching the increase in failures is the increase of fallbacks being delivered for every failure.

  145. We found that historical metrics with 1 datapoint per minute and 1-2 minutes latency were not sufficient during
    operational events such as deployments, rollbacks, production alerts and configuration changes so we built near
    realtime monitoring and data visualizations with ~1-2 second latency.
    Read more at https://github.com/Netflix/Hystrix/wiki/Dashboard

  146. Note: This is a mockup
    Data visualizations and real-time monitoring have proven useful enough that a more comprehensive suite of tools is
    being built.
    Read more at: http://techblog.netflix.com/2014/01/improving-netflixs-operational.html

  148. “…complex systems run as broken systems.
    The system continues to function because it
    contains so many redundancies and because
    people can make it function, despite the presence
    of many flaws.”
    – Richard Cook, How Complex Systems Fail
    Richard I. Cook - How Complex Systems Fail - http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf

  149. Netflix Tech Blog
    http://techblog.netflix.com
    Netflix Open Source
    http://netflix.github.io
    Functional Reactive in the Netflix API with RxJava
    http://techblog.netflix.com/2013/02/rxjava-netflix-api.html
    Application Resilience in a Service-oriented Architecture
    http://programming.oreilly.com/2013/06/application-resilience-in-a-service-oriented-architecture.html
    jobs.netflix.com
    Ben Christensen
    @benjchristensen
    http://www.linkedin.com/in/benjchristensen
