Reactive Service Levels at React London 2014

How do we support the current always-on systems culture? Is 100% uptime really possible? Hot deployment of software upgrades. Multivariate testing of new features for measured and informed business reaction. Monitoring and managing reactive systems to ensure they are meeting service levels.

Learn how Netflix achieves scalability, resilience and responsiveness through its cloud architecture and tools such as Ribbon, Eureka, Hystrix, RxJava, Zuul, Scryer and the Simian Army.

Presented at React 2014 in London: http://reactconf.com

Video: http://www.youtube.com/watch?v=Ftkn1OF895E&feature=share&list=PLSD48HvrE7-Z1stQ1vIIBumB0wK0s8llY&index=7

Ben Christensen

April 08, 2014

Transcript

  1. Reactive Service Levels
    Ben Christensen
    Software Engineer – Edge Engineering at Netflix
    @benjchristensen
    http://techblog.netflix.com/
    React London - April 2014

  2. “the explosive growth of software has added
    greatly to systems’ interactive complexity. With
    software, the possible states that a system can
    end up in become mind-boggling.”
    Dekker, Sidney (2012-10-01). Drift into Failure (Kindle Locations 3268-3270). Ashgate Publishing. Kindle Edition.
    – Sidney Dekker, Drift into Failure
    Drift into Failure: http://www.amazon.com/Drift-into-Failure-ebook/dp/B009KOKXKY/ref=tmm_kin_title_0

  3. “We can model and understand in isolation.
    But, when released into competitive, nominally
    regulated societies, their connections proliferate,
    their interactions and interdependencies multiply,
    their complexities mushroom.
    And we are caught short.”
    Dekker, Sidney (2012-10-01). Drift into Failure (Kindle Locations 290-292). Ashgate Publishing. Kindle Edition.
    – Sidney Dekker, Drift into Failure
    Drift into Failure: http://www.amazon.com/Drift-into-Failure-ebook/dp/B009KOKXKY/ref=tmm_kin_title_0

  4. Netflix is a subscription service for movies and TV shows for $7.99 USD/month (about the same converted price in
    each country's local currency).

  5. More than 44 million Subscribers
    in 41 Countries
    Netflix has over 44 million video streaming customers in 41 countries across North & South America, the United
    Kingdom, Ireland, the Netherlands and the Nordics.

  6. Netflix accounts for 31% of Peak Downstream
    Internet Traffic in North America
    Netflix subscribers are watching
    more than 1 billion hours a month
    Sandvine report available with free account at http://www.sandvine.com/news/global_broadband_trends.asp
    Image from report at https://www.sandvine.com/downloads/general/global-internet-phenomena/2013/2h-2013-global-internet-phenomena-report.pdf

  7. Prior to the current globally distributed cloud architecture a single data center served the US.

  8. Netflix expanded its service around the globe …

  9. … and migrated from the data center to a cloud architecture in multiple Amazon AWS regions. Geographic isolation
    and failover via active/active multi-region deployment was added in 2013
    (http://techblog.netflix.com/2013/05/denominating-multi-region-sites.html)

  10. [Diagram: three AWS Availability Zones side by side within a region.]
    3 AWS Availability Zones (think of them as independent data centers right next to each other for low latency) operate
    in each region, with deployments split across them for redundancy in the event of losing an entire zone.

  11. Each zone is populated with application clusters (‘auto-scaling groups’ or ASGs) that make up the service oriented
    distributed system. Application clusters operate independently of each other with client-side software load balancing
    routing traffic between them.

  12. Application clusters are made up of 1 to 100s of machine instances per zone. Service registry and discovery work with
    software load balancing to allow machines to launch and disappear (for planned or unplanned reasons) at any time
    and become part of the distributed system and serve requests. Auto-scaling enables system-wide adaptation to
    demand as it launches instances to meet increasing traffic and load or handle instance failure.

  13. Failed instances are dropped from discovery so traffic stops routing to them. Software load balancers on client
    applications detect and skip them until discovery removes them.

  14. Auto-scale policies bring on new instances to replace failed ones or to adapt to increasing demand.

  15. A suite of tools called the “Simian Army” is employed to assert the architecture and systems are in fact resilient,
    responsive and reactive to failure, demand and changing conditions. They are used to inject latency and failure,
    validate environments, cleanup or perform “Game Day” exercises. These are done in all environments, most
    importantly in production.
    More information available at http://techblog.netflix.com/2011/07/netflix-simian-army.html, https://github.com/Netflix/SimianArmy and http://queue.acm.org/detail.cfm?id=2499552

  16. Chaos Monkey constantly runs in the background, randomly killing single instances in application clusters. Application
    owners receive notification that an instance has been killed. The purpose is to assert that an application cluster
    can handle the loss of instances without impact.

  17. [Diagram: three AWS Availability Zones.]
    Chaos Gorilla is used in “Game Day” exercises to terminate an entire AWS Availability Zone. This is used to
    demonstrate that all systems behave correctly to migrate traffic and scale up to meet demand in the other 2 zones. It
    also serves as good practice for engineers to learn what happens and what to expect when it happens for real.

  18. Chaos Kong is another “Game Day” exercise where traffic is migrated away from an entire region.

  19. For example, all traffic from US-East could be rerouted to US-West so all North and South American traffic is going to
    a single region instead of split as it normally is. These exercises are done to ensure control systems reroute traffic via
    DNS changes, that client devices respect the changes (or learn what doesn’t) and gain experience in how the whole
    system behaves so when it needs to be done for a real reason it is a known practice.

  20. Eureka: instance discovery
    Karyon: base server with instance registration, metrics, heartbeat, etc.
    Ribbon: RPC client with load balancing
    Eureka, Karyon and Ribbon are core software that enables resilient registration, discovery and communication
    between applications in the service oriented architecture.
    More information can be found at http://netflix.github.io

  21. [Diagram: a user request fanning out to Dependencies A through R.]
    Applications communicate with dozens of other applications in the service-oriented architecture. Each of these
    client/server dependencies represents a relationship within the complex distributed system.

  22. [Diagram: a user request to Dependencies A through R, blocked by latency in a single network call.]
    Any one of these relationships can fail at any time. Failures can be intermittent or cluster-wide, immediate (thrown
    exceptions and/or error codes) or latent from various causes. Latency is particularly challenging for
    applications to deal with as it ties up resources in queues and pools and blocks user requests (even with
    async IO).

  23. [Diagram: many user requests fanning out to Dependencies A through R; at high volume all request threads can block in seconds behind one latent dependency.]
    Latency at high volume can quickly saturate all application resources (queues, pools, sockets, etc) causing total
    application failure and the inability to serve user requests even if all other dependencies are healthy.
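
    A minimal, illustrative Java sketch of this saturation effect (the pool size, request count and the 30-second latency are made-up numbers): a fixed pool of request threads is exhausted as soon as a handful of calls block on one latent dependency, and all further work piles up in the queue.

        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;

        public class SaturationDemo {
            public static void main(String[] args) {
                // 200 "request threads", as a typical servlet container might have
                ExecutorService requestThreads = Executors.newFixedThreadPool(200);

                // A single latent dependency (simulated with a 30-second sleep) blocks
                // every thread within seconds; remaining work queues up unboundedly.
                for (int i = 0; i < 10_000; i++) {
                    requestThreads.submit(() -> {
                        Thread.sleep(30_000); // blocking call to the latent dependency
                        return "response";
                    });
                }
            }
        }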

  24. Dozens of dependencies.
    One going bad takes everything down.
    99.99%^30 = 99.7% uptime
    0.3% of 1 billion = 3,000,000 failures
    2+ hours downtime/month
    Reality is generally worse.
    Large distributed systems are complex and failure will occur. If failure from every component is allowed to cascade
    across the system, it will affect the user.
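
    The arithmetic behind the slide's numbers, assuming 30 dependencies each at 99.99% uptime and roughly 1 billion requests per month:

        0.9999^{30} \approx 0.9970 \quad\Rightarrow\quad 99.7\%\ \text{uptime}
        0.3\% \times 10^{9}\ \text{requests} = 3{,}000{,}000\ \text{failed requests}
        0.003 \times 30\ \text{days} \times 24\ \text{h/day} \approx 2.2\ \text{hours of downtime per month}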

  25. CONSTRAINTS
    Speed of Iteration
    Client Libraries
    Mixed Environment
    Solution design for handling cascading latency and failure was done within the constraints, context and priorities of the
    Netflix environment.

  26. CONSTRAINTS
    Speed of Iteration
    Client Libraries
    Mixed Environment
    Speed of iteration is optimized for, and this leads to client/server relationships where client libraries are provided
    rather than each team writing its own client code against a server protocol. This means “3rd party” code from many
    developers and teams is constantly being deployed into applications across the system. Large applications such as
    the Netflix Edge API have dozens of client libraries.

  27. CONSTRAINTS
    Speed of Iteration
    Client Libraries
    Mixed Environment
    Speed of iteration is optimized for, and this leads to client/server relationships where client libraries are provided
    rather than each team writing its own client code against a server protocol. This means “3rd party” code from many
    developers and teams is constantly being deployed into applications across the system. Large applications such as
    the Netflix Edge API have dozens of client libraries.

  28. CONSTRAINTS
    Speed of Iteration
    Client Libraries
    Mixed Environment
    The environment is also diverse, with different types of client/server communications and protocols. This
    heterogeneous and always-changing environment affects the approach to resilience engineering and is potentially
    very different from approaches taken for a tightly controlled codebase or homogeneous architecture.

  29. [Diagram: user requests fanning out to Dependencies A through R.]
    Each dependency - or distributed system relationship - must be isolated so its failure does not cascade or saturate all
    resources.

  30. [Diagram: one client dependency expanded into the layers it comprises:]
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Logic - argument validation, caches, metrics, logging,
    multivariate testing, routing, etc
    Serialization - URL and/or body generation
    Logic - validation, decoration, object model, caching,
    metrics, logging, etc
    It is not just the network that can fail and needs isolation, but the full request/response loop, including business logic
    and serialization/deserialization.
    Protecting against a network failure only to return a response that causes application logic to fail elsewhere in the
    application merely moves the problem.

  31. "Timeout guard" daemon prio=10 tid=0x00002aaacd5e5000 nid=0x3aac runnable [0x00002aaac388f000] java.lang.Thread.State: RUNNABLE
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    - locked <0x000000055c7e8bd8> (a java.net.SocksSocketImpl)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
    at java.net.Socket.connect(Socket.java:579)
    at java.net.Socket.connect(Socket.java:528)
    at java.net.Socket.(Socket.java:425)
    at java.net.Socket.(Socket.java:280)
    at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
    at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$1.doit(ControllerThreadSocketFactory.java:91)
    at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$SocketTask.run(ControllerThreadSocketFactory.java:158) at java.lang.Thread.run(Thread.java:722)
    [Sat Jun 30 04:01:37 2012] [error] proxy: HTTP: disabled connection for (127.0.0.1)
    > 80% of requests rejected
    Median
    Latency
    This is an example of what a system looks like when high latency occurs without load shedding and isolation.
    Backend latency spiked (from <100ms to >1000ms at the median, >10,000ms at the 90th percentile) and saturated
    all available resources resulting in the HTTP layer rejecting over 80% of requests.

  32. [Diagram: Dependencies A through R, each compartmentalized behind a bulkhead.]
    Bulkheading is an approach to isolating failure and latency. It can be used to compartmentalize each system
    relationship so their failure impact is limited and controllable.
    Also see http://www.reactivemanifesto.org/#resilient

  33. Bulkheading is an approach to isolating failure and latency. It can be used to compartmentalize each system
    relationship so their failure impact is limited and controllable.
    Also see http://www.reactivemanifesto.org/#resilient

  34. Bulkheading is an approach to isolating failure and latency. It can be used to compartmentalize each system
    relationship so their failure impact is limited and controllable.
    Also see http://www.reactivemanifesto.org/#resilient

  35. [Diagram: Dependencies A through R, each compartmentalized behind a bulkhead.]
    Bulkheading is an approach to isolating failure and latency. It can be used to compartmentalize each system
    relationship so their failure impact is limited and controllable.
    Also see http://www.reactivemanifesto.org/#resilient

  36. [Diagram: Dependencies A through R; a failing dependency's responses are intercepted at the bulkhead.]
    Responses can be intercepted and replaced with fallbacks.

  37. [Diagram: Dependencies A through R; the user request continues with a fallback response from the failing dependency.]
    A user request can continue in a degraded state with a fallback response from the failing dependency.

  38. “Overt catastrophic failure occurs when small,
    apparently innocuous failures join to create
    opportunity for a systemic accident.”
    – Richard Cook, How Complex Systems Fail
    Richard I. Cook - How Complex Systems Fail - http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf

  39. “System operations are dynamic, with
    components (organizational, human, technical)
    failing and being replaced continuously.”
    – Richard Cook, How Complex Systems Fail
    Richard I. Cook - How Complex Systems Fail - http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf

  40. Code was written to apply bulkheads, circuit breakers, time-outs, fallbacks and other practices … and later got a
    name and logo.
    See more at http://github.com/Netflix/Hystrix

  41. [Diagram: a bulkhead wrapping the full client request/response loop: logic, serialization, network request, deserialization, logic.]
    A bulkhead wraps around the entire client behavior, not just the network portion.

  42. [Diagram: a tryable semaphore gating the client layers; executions are permitted or rejected.]
    An effective form of bulkheading is a tryable semaphore that restricts concurrent execution.
    Read more at https://github.com/Netflix/Hystrix/wiki/How-it-Works#semaphores
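
    A minimal sketch of semaphore isolation using Hystrix's public configuration API (the command itself, the `User` type and the `UserCache` lookup are hypothetical):

        import com.netflix.hystrix.HystrixCommand;
        import com.netflix.hystrix.HystrixCommandGroupKey;
        import com.netflix.hystrix.HystrixCommandProperties;
        import com.netflix.hystrix.HystrixCommandProperties.ExecutionIsolationStrategy;

        public class GetUserFromMemoryCommand extends HystrixCommand<User> {
            private final String userId;

            public GetUserFromMemoryCommand(String userId) {
                super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("UserCache"))
                        .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                                // tryable semaphore instead of a thread pool
                                .withExecutionIsolationStrategy(ExecutionIsolationStrategy.SEMAPHORE)
                                // permit 10 concurrent executions; the 11th is rejected, not queued
                                .withExecutionIsolationSemaphoreMaxConcurrentRequests(10)));
                this.userId = userId;
            }

            @Override
            protected User run() {
                // fast, non-blocking work only: a semaphore cannot time out a latent call
                return UserCache.lookup(userId); // hypothetical in-memory lookup
            }
        }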

  43. [Diagram: a thread pool gating the client layers; executions are permitted, rejected or timed out.]
    A thread pool also limits concurrent execution while offering the ability to time out and walk away from a latent
    thread.
    Read more at https://github.com/Netflix/Hystrix/wiki/How-it-Works#threads--thread-pools
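
    A comparable sketch of thread-pool isolation (the `Ratings` type and `RatingsClient` are hypothetical; the timeout property name is from the Hystrix 1.x API of this era):

        import com.netflix.hystrix.HystrixCommand;
        import com.netflix.hystrix.HystrixCommandGroupKey;
        import com.netflix.hystrix.HystrixCommandProperties;
        import com.netflix.hystrix.HystrixThreadPoolProperties;

        public class GetRatingsCommand extends HystrixCommand<Ratings> {
            private final String videoId;

            public GetRatingsCommand(String videoId) {
                super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("RatingsService"))
                        .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                                // give up and walk away if no answer within 500ms
                                .withExecutionIsolationThreadTimeoutInMilliseconds(500))
                        .andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter()
                                // at most 10 threads, so at most 10 concurrent executions
                                .withCoreSize(10)));
                this.videoId = videoId;
            }

            @Override
            protected Ratings run() {
                return RatingsClient.getRatings(videoId); // hypothetical blocking network call
            }
        }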

  44. Tryable semaphores for non-blocking clients and fallbacks
    Separate threads for blocking clients
    Aggressive timeouts to “give up and move on”
    Circuit breakers as the “release valve”

  45. [Flow chart: Hystrix command execution. A HystrixCommand object is constructed, then invoked via .execute() (synchronous), .queue() (asynchronous Future) or .observe() (asynchronous Observable). Before execution the circuit-open and rate-limit states are checked; a short-circuit or rejection returns immediately. Otherwise run() executes: a successful response is returned and feeds the feedback loop that calculates circuit health; an exception, timeout or rejection routes to getFallback(), which yields a successful fallback, a failed fallback (exception thrown) or, if not implemented, an exception.]
    Hystrix execution flow chart.
    Read more at https://github.com/Netflix/Hystrix/wiki/How-it-Works#flow-chart

  46. [Flow chart: Hystrix command execution, highlighting the .execute(), .queue() and .observe() entry points.]
    Execution can be synchronous or asynchronous (via a Future or Observable).

  47. [Flow chart: Hystrix command execution, highlighting the circuit-open and rate-limit checks before run().]
    Current state is queried before allowing execution to determine if it is short-circuited or throttled and should reject.

  48. [Flow chart: Hystrix command execution, highlighting the run() method.]
    If not rejected, execution proceeds to the run() method, which performs the underlying work.

  49. [Flow chart: Hystrix command execution, highlighting the successful-response path.]
    Successful responses return.

  50. [Flow chart: Hystrix command execution, highlighting the feedback loop that calculates circuit health.]
    All requests, successful and failed, contribute to a feedback loop used to make decisions and publish metrics.

  51. [Flow chart: Hystrix command execution, highlighting the failure paths (exception, timeout, short-circuit, rejection).]
    All failure states are routed through the same path.

  52. [Flow chart: Hystrix command execution, highlighting getFallback().]
    Every failure is given the opportunity to retrieve a fallback, which can end in one of three outcomes: a successful
    fallback, a failed fallback or no fallback implemented.

  53. [Flow chart: the complete Hystrix command execution flow.]
    Hystrix execution flow chart.
    Read more at https://github.com/Netflix/Hystrix/wiki/How-it-Works#flow-chart

  54. HystrixCommand run()
    public class CommandHelloWorld extends HystrixCommand<String> {
        ...
        protected String run() {
            return "Hello " + name + "!";
        }
    }
    Basic successful execution pattern and sample code.
    Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#wiki-Hello-World

  55. public class CommandHelloWorld extends HystrixCommand<String> {
        ...
        protected String run() {
            return "Hello " + name + "!";
        }
    }
    run() invokes
    “client” Logic
    HystrixCommand run()
    The run() method is where the wrapped logic goes.
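
    The same command supports all three invocation styles from the flow chart (execute(), queue() and observe() are the documented HystrixCommand methods; the elided constructor is assumed to take the name to greet, as in the Hystrix wiki's version):

        import java.util.concurrent.Future;
        import rx.Observable;

        public class InvocationStyles {
            public static void main(String[] args) throws Exception {
                // synchronous: blocks until run() (or a fallback) returns
                String s = new CommandHelloWorld("World").execute();

                // asynchronous: returns a Future immediately
                Future<String> f = new CommandHelloWorld("World").queue();
                String fromFuture = f.get();

                // reactive: returns an Observable that emits the result
                Observable<String> o = new CommandHelloWorld("World").observe();
                o.subscribe(value -> System.out.println(value));
            }
        }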

  56. HystrixCommand run()
    throw Exception
    Fail Fast
    Failing fast is the default behavior if no fallback is implemented. Even without a fallback this is useful as it prevents
    resource saturation beyond the bulkhead so the rest of the application can continue functioning and enables rapid
    recovery once the underlying problem is resolved.
    Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#fail-fast
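
    A sketch of what a caller sees when a command without a fallback fails fast (`CommandThatFailsFast` is hypothetical; HystrixRuntimeException and its failure type are the real Hystrix surface):

        import com.netflix.hystrix.exception.HystrixRuntimeException;

        public class FailFastCaller {
            public static void main(String[] args) {
                try {
                    String s = new CommandThatFailsFast().execute();
                } catch (HystrixRuntimeException e) {
                    // failure, timeout, short-circuit or rejection surfaces here
                    // immediately, instead of a request thread blocking on a sick dependency
                    System.out.println("failed fast: " + e.getFailureType());
                }
            }
        }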

  57. HystrixCommand run()
    getFallback()
    return null;
    return new Option();
    return Collections.emptyList();
    return Collections.emptyMap();
    Fail Silent
    Silent failure is an approach for removing non-essential functionality from the user experience by returning a value
    that equates to “no data”, “not available” or “don’t display”.
    Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#fail-silent
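
    A minimal fail-silent command, assuming a hypothetical non-essential title row (run() is stubbed to fail so the fallback path is exercised):

        import java.util.Collections;
        import java.util.List;
        import com.netflix.hystrix.HystrixCommand;
        import com.netflix.hystrix.HystrixCommandGroupKey;

        public class GetRecentTitlesCommand extends HystrixCommand<List<String>> {
            public GetRecentTitlesCommand() {
                super(HystrixCommandGroupKey.Factory.asKey("TitleService"));
            }

            @Override
            protected List<String> run() {
                throw new RuntimeException("backend unavailable");
            }

            @Override
            protected List<String> getFallback() {
                // "no data": the non-essential row simply does not render
                return Collections.emptyList();
            }
        }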

  58. HystrixCommand run()
    getFallback()
    return true;
    return DEFAULT_OBJECT;
    Static Fallback
    Static fallbacks can be used when default data or behavior can be returned to the user.
    Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#fallback-static

  59. HystrixCommand run()
    getFallback()
    return new UserAccount(customerId, "Unknown Name",
            countryCodeFromGeoLookup, true, true, false);
    return new VideoBookmark(movieId, 0);
    Stubbed Fallback
    Stubbed fallbacks are an extension of static fallbacks, used when some data is available (such as from request arguments,
    authentication tokens or other functioning system calls) and combined with default values for data that cannot be
    retrieved.
    Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#fallback-stubbed

  60. HystrixCommand run()
    getFallback()
    public class CommandHelloWorld extends HystrixCommand<String> {
        ...
        protected String run() {
            return "Hello " + name + "!";
        }
        protected String getFallback() {
            return "Hello Failure " + name + "!";
        }
    }
    Stubbed Fallback

  61. HystrixCommand run()
    getFallback()
    public class CommandHelloWorld extends HystrixCommand<String> {
        ...
        protected String run() {
            return "Hello " + name + "!";
        }
        protected String getFallback() {
            return "Hello Failure " + name + "!";
        }
    }
    Stubbed Fallback
    The getFallback() method is executed whenever a failure occurs (after run() is invoked, or on rejection without run() ever
    being invoked) to provide the opportunity for a fallback.

  62. [Diagram: a HystrixCommand whose getFallback() invokes a second HystrixCommand over the network.]
    Fallback via network
    Fallback via network is a common approach for falling back to a stale cache (such as a memcache server) or less
    personalized value when not able to fetch from the primary source.
    Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#fallback-cache-via-network

  63. [Diagram: a HystrixCommand whose getFallback() invokes a second HystrixCommand, which has its own local getFallback().]
    Fallback via network then Local
    When the fallback performs a network call, it is preferable for it to also have a fallback that does not go over the
    network; otherwise, if both primary and secondary systems fail, it will fail by throwing an exception (similar to fail fast,
    except after two fallback attempts).
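
    A sketch of both patterns together, adapted from the fallback-via-network example on the Hystrix wiki (the remote and memcache clients are hypothetical; the wiki version also gives the fallback command its own thread pool so the two do not compete):

        import com.netflix.hystrix.HystrixCommand;
        import com.netflix.hystrix.HystrixCommandGroupKey;

        public class CommandWithFallbackViaNetwork extends HystrixCommand<String> {
            private final int id;

            public CommandWithFallbackViaNetwork(int id) {
                super(HystrixCommandGroupKey.Factory.asKey("RemoteServiceX"));
                this.id = id;
            }

            @Override
            protected String run() {
                return RemoteServiceXClient.getValue(id); // hypothetical primary source
            }

            @Override
            protected String getFallback() {
                // fallback via network: read a stale copy from cache, itself
                // wrapped in a command so the fallback path is bulkheaded too
                return new FallbackViaNetwork(id).execute();
            }

            private static class FallbackViaNetwork extends HystrixCommand<String> {
                private final int id;

                FallbackViaNetwork(int id) {
                    super(HystrixCommandGroupKey.Factory.asKey("RemoteServiceX"));
                    this.id = id;
                }

                @Override
                protected String run() {
                    return MemCacheClient.getValue(id); // hypothetical stale cache copy
                }

                @Override
                protected String getFallback() {
                    return null; // local fallback if both primary and cache fail
                }
            }
        }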

  64. “In complex systems, decision-makers are
    locally rather than globally rational. But that
    doesn’t mean that their decisions cannot lead
    to global, or system-wide events. In fact, that
    is one of the properties of complex systems:
    local actions can have global results.”
    Dekker, Sidney (2012-10-01). Drift into Failure (Kindle Locations 3268-3270). Ashgate Publishing. Kindle Edition.
    – Sidney Dekker, Drift into Failure
    Drift into Failure: http://www.amazon.com/Drift-into-Failure-ebook/dp/B009KOKXKY/ref=tmm_kin_title_0

  65. “In complex systems, decision-makers are
    locally rather than globally rational. But that
    doesn’t mean that their decisions cannot lead
    to global, or system-wide events. In fact, that
    is one of the properties of complex systems:
    local actions can have global results.”
    Dekker, Sidney (2012-10-01). Drift into Failure (Kindle Locations 3268-3270). Ashgate Publishing. Kindle Edition.
    – Sidney Dekker, Drift into Failure
    Local decisions are being made constantly and do affect the entire system. These include routing decisions (such as
    by Ribbon), marking an instance “up” or “down” (via Eureka), timing out, short-circuiting, rejecting and fallbacks (via
    Hystrix).
    Drift into Failure: http://www.amazon.com/Drift-into-Failure-ebook/dp/B009KOKXKY/ref=tmm_kin_title_0

  66. Auditing via Simulation
    Simulating failure states in production has proven an effective approach for auditing our applications to either prove
    resilience or find weakness and determine how local decisions affect the system as a whole.
    NOTE: This does not imply the ability to understand all impacts of local decisions. Since a distributed system is a
    “complex system” it is by definition impossible to simulate or even model all possible states. However, auditing via
    simulation does allow many (and the most common) to be understood, validated and catered for.

  67. Auditing via Simulation
    In this example failure was injected into a single dependency which caused the bulkhead to return fallbacks and trip
    all circuits since the failure rate was almost 100%, well above the threshold for circuits to trip.

  68. Auditing via Simulation
    When the `TitleStatesGetAllRentStates` bulkhead began returning fallbacks, the `atv_mdp` endpoint shot to the top of
    the dashboard with a 99% error rate. There was a bug in how the fallback was handled on this device, so we immediately
    stopped the simulation, fixed the bug over the coming days and repeated the simulation to prove it was fixed and that the
    rest of the system remained resilient. This was caught in a controlled simulation where we could catch and act in less
    than a minute, rather than in a true production incident where we likely wouldn’t have been able to do anything.

  69. This shows another simulation where latency was injected.
    Read more at http://techblog.netflix.com/2011/07/netflix-simian-army.html

  70. 125 → 1500+
    1000+ ms of latency was injected into a dependency that normally completes with a median latency of ~15-20ms
    and a 99.5th percentile of 120-130ms.

  71. ~5000
    The latency spike caused timeouts, short-circuiting and rejection, producing up to ~5000 fallbacks per second across
    these various failure states.

  72. ~1
    While delivering the ~5000 fallbacks per second, the exceptions thrown didn’t go beyond ~1 per second,
    demonstrating that user impact was negligible (as perceived from the server; the client behavior must also be
    validated during a simulation but is not part of this dataset).

  73. Constantly Changing
    Constantly building, testing and deploying new code. Dozens of application deployments every day. Typically 100s of
    customer A/B tests running.
    For more info on A/B testing: http://www.slideshare.net/xamat/cikm-2013-beyond-data-from-user-information-to-business-value

  74. Application clusters are replaced every time a new version of code is deployed. There are often several different
    clusters with slightly different versions or configurations of the same application.

  75. [Diagram: Zuul routing layer in front of the Canary vs Baseline, Squeeze, Production and "Coalmine" clusters.]
    For example, the Edge API application has several clusters such as these used for validation, testing, debugging and
    production traffic.

  76. [Diagram: Zuul routing layer in front of the Canary vs Baseline, Squeeze, Production and "Coalmine" clusters.]
    Zuul was built to allow dynamic routing of traffic to each cluster.
    Read more at http://techblog.netflix.com/2013/06/announcing-zuul-edge-service-in-cloud.html

  77. [Diagram: Zuul routing layer in front of the Canary vs Baseline, Squeeze, Production and "Coalmine" clusters.]
    Every code deployment is preceded by a canary test where a small number of instances are launched to take
    production traffic, half with new code (canary), half with existing production code (baseline) and compared for
    differences.
    Read more at http://techblog.netflix.com/2013/08/deploying-netflix-api.html

  78. This dashboard represents the continuous build and deployment pipeline for a single application. Each commit (or
    batch of commits) results in unit tests being executed, a new machine image being created, an instance launched in
    the test environment, smoke tests against this instance and optionally launching in production for canary testing.

  79. This build resulted in a poor canary score so did not get promoted to production.

  80. The exact commit that represents the machine image running in production is seen along with which clusters are
    running that code.

  81. The production code passed the canary test with a score of 99%.

  82. There are many different environments where code can be deployed.

  83. History and metadata for every environment, commit and machine image is available.

  84. Code can be manually pushed to an environment by selecting a build and environment.

  85. Selection of a build shows its status and environments it is deployed to.

  86. Each cluster can be inspected to see versions of code deployed, whether they are active or inactive, taking traffic or
    not and how many instances.

  87. Selecting an environment provides context for time zones and normal peak traffic patterns to assist in making
    decisions about when to deploy. The tooling allows ignoring the suggestions and pushing the code immediately, or
    scheduling it to be done during off-peak hours.

  88. The sequence of builds can be viewed to get an overview of where code is currently running.

  89. Canary tests in production result in reports that compare the new code against a “baseline” which is the same
    machine image as current production.

  90. Metrics from Ribbon, Hystrix, JVM and machine are analyzed to determine a score. They are marked as “hot” or “cold”.

  91. Metrics from Ribbon, Hystrix, JVM and machine are analyzed to determine a score. They are marked as “hot” or “cold”.

  92. This canary did not do so well. This results in review of what went wrong. Sometimes it will be judged okay to
    proceed and the problems get fixed in a later release. Other times the problems are too severe and must be fixed
    before release.

  93. This shows a build that went through the entire process and is now taking production traffic in all 3 regions.

  94. There are several clusters taking production traffic besides the primary one. They are used for debugging, squeeze
    testing, alerting, or long-lived (stable, meaning no autoscale policies) monitoring of application health (such as
    memory).

  95. [Diagram: Zuul routing layer in front of the Canary vs Baseline, Squeeze, Production and "Coalmine" clusters.]
    New machine images are put through a squeeze test before full rollout to find the point at which the performance
    degrades. This is used to identify performance and throughput changes of each deployment and is fed as a datapoint
    into autoscaling algorithms.

  96. [Diagram: Zuul routing layer in front of the Canary vs Baseline, Squeeze, Production and "Coalmine" clusters.]
    Long-term canaries are kept in a cluster we call "coalmine" with agents intercepting all network traffic. These run the
    same code as the production cluster and are used to identify network traffic without a bulkhead that starts happening
    due to unknown code paths being enabled via configuration, A/B tests and other changes.
    Read more at https://github.com/Netflix/Hystrix/tree/master/hystrix-contrib/hystrix-network-auditor-agent

  97. [Diagram: Dependencies A through R; one system relationship goes over the network without a bulkhead.]
    For example, a network relationship could exist in production code but not be triggered in dev, test or production
    canaries, and then be enabled via a condition that changes days after deployment to production. This can be a
    vulnerability, and we use the "coalmine" to identify these situations and inform decisions.

  98. [Diagram: Zuul routing layer in front of the Canary vs Baseline, Squeeze, Production and "Coalmine" clusters.]

  99. [Diagram: Zuul routing layer in front of the Canary vs Baseline, Squeeze, Production and "Coalmine" clusters.]
    The primary clusters take the majority of production traffic and are managed by autoscale policies that add and remove
    instances to meet demand.

  100. The daily traffic patterns for a region typically look like this.

  101. The weekends have higher peaks …

  102. … and broader mid-day usage …

  103. … than weekdays.

  104. Usage in the morning (particularly on the weekend) climbs rapidly. These spikes cause problems with reactive
    autoscaling and require padding so enough servers exist to handle the load before new instances can be brought
    online.

  105. Since the daily pattern is so predictable a new predictive autoscaling approach was built. It uses historical data to
    predict each 24 hour period and sets the scaling “floor” to the number of instances required. Reactive scaling policies
    continue to run and can raise the instance count higher if needed. When scaling down the predictive policy lowers the
    floor which allows the reactive policy to scale down as traffic reduces.
    More information available at: http://techblog.netflix.com/2013/11/scryer-netflixs-predictive-auto-scaling.html

  106. Reactive Only
    with Predictive
    This shows the number of instances needed with just reactive scaling (blue) and when predictive was added (red).
    Adding predictive policies allows far more efficient autoscaling because it doesn’t need to add as much overhead to
    handle spikes in the mornings and it can step up to peak more efficiently rather than reacting by adding percentages
    of an increasingly larger cluster.

  107. Predictive policies are also safer in the event of anomalies that cause a reduction in traffic, whether that be a
    production outage, a device problem or a widespread event like the Super Bowl.
    Using only the reactive policy, the cluster starts to scale down when it sees traffic drop. The predictive policy keeps it
    set to where it needs to be, so when the traffic returns (typically as a wave that spikes traffic, as seen in this image) the
    required capacity is still there.

  108. This behavior (the blue line) makes the system far more resilient to anomalies and outages so that autoscaling
    doesn’t itself cause degradation or outages by removing capacity right before it’s needed again.

  109. Load Avg After Scryer
    Another positive side-effect of more efficiently scaling up via the prediction algorithms is that the instances end up
    with smoother load and fewer extremes since the cluster is already scaled up to handle the traffic rather than trying
    to react after the load comes.

  110. Dozens of UIs
    across 1000+ Devices
    Netflix supports > 1000 different devices with dozens of different UIs. Many of these are constantly innovating.

  111. 100s of Customer A/B Tests
    Typically 100s of customer A/B tests are running on many different UIs and devices.
    For more info: http://www.slideshare.net/xamat/cikm-2013-beyond-data-from-user-information-to-business-value

  112. The Netflix Edge API was re-architected to allow each UI team to develop, deploy and operate their own web service
    endpoints on top of the Edge Platform. This shifted away from having a single one-size-fits-all REST API that all UIs
    used.
    More information can be found at https://speakerdeck.com/benjchristensen/evolution-of-the-netflix-api-qcon-sf-2013

  113. This change allows each endpoint to be optimized to exactly the needs of the device or UI it is serving. It also allows
    each UI team to innovate independently of each other so as to distribute the pace of innovation.
    The endpoints can be deployed into production in < 2 minutes.

  114. Each endpoint is isolated from others within its own classloader. Groovy was chosen as the first language used for
    writing endpoints but any JVM language could be used.

  115. Code is uploaded into an external data store via RESTful administration APIs and then pulled into each application
    instance in the production clusters. The data store globally replicates the code across all AWS regions and zones.

  116. Endpoints are written in a fully asynchronous manner using RxJava against a Java API that is completely non-blocking.

  117. Iterable (pull)          Observable (push)
    T next()                    onNext(T)
    throws Exception            onError(Exception)
    returns                     onCompleted()
    (Functional) Reactive Programming with RxJava
    A Java port of Rx (Reactive Extensions)
    https://rx.codeplex.com (.Net and Javascript by Microsoft)
    Rx is used for asynchronous programming. Observable/Observer is the asynchronous dual to the synchronous
    Iterable/Iterator.
    More information on RxJava can be found at https://github.com/Netflix/RxJava and https://speakerdeck.com/benjchristensen/rxjava-goto-aarhus-2013
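
    A minimal sketch of this dual using the RxJava 1.x API of the era: the producer pushes through the three callbacks instead of the consumer pulling with next().

        import rx.Observable;

        public class PushExample {
            public static void main(String[] args) {
                Observable<String> values = Observable.create(subscriber -> {
                    subscriber.onNext("a");    // dual of T next()
                    subscriber.onNext("b");
                    subscriber.onCompleted();  // dual of next() having no more elements
                    // subscriber.onError(e)   // dual of next() throwing an exception
                });

                values.subscribe(
                        v -> System.out.println("onNext: " + v),
                        e -> System.err.println("onError: " + e),
                        () -> System.out.println("onCompleted"));
            }
        }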

  118. instead of a blocking api ...
    class VideoService {
        def VideoList getPersonalizedListOfMovies(userId);
        def VideoBookmark getBookmark(userId, videoId);
        def VideoRating getRating(userId, videoId);
        def VideoMetadata getMetadata(videoId);
    }
    ... create an observable api:
    class VideoService {
        def Observable<VideoList> getPersonalizedListOfMovies(userId);
        def Observable<VideoBookmark> getBookmark(userId, videoId);
        def Observable<VideoRating> getRating(userId, videoId);
        def Observable<VideoMetadata> getMetadata(videoId);
    }
    With Rx blocking APIs could be converted into Observable APIs and accomplish our architecture goals including
    abstracting away the control and implementation of concurrency and asynchronous execution.
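
    One hedged sketch of such a conversion, wrapping a hypothetical blocking client (`BlockingVideoService`, `VideoBookmark`) with RxJava 1.x operators so the caller no longer knows or controls how the work is executed:

        import rx.Observable;
        import rx.schedulers.Schedulers;

        public class VideoServiceAdapter {
            private final BlockingVideoService blocking; // hypothetical legacy client

            public VideoServiceAdapter(BlockingVideoService blocking) {
                this.blocking = blocking;
            }

            public Observable<VideoBookmark> getBookmark(String userId, String videoId) {
                // defer: run the blocking call lazily, per subscription,
                // on the IO scheduler rather than the caller's thread
                return Observable
                        .defer(() -> Observable.just(blocking.getBookmark(userId, videoId)))
                        .subscribeOn(Schedulers.io());
            }
        }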

  119. return new UserCommand(request.getQueryParameters().get("userId")).observe().flatMap(user -> {
        Observable<Map<String, Object>> catalog = new PersonalizedCatalogCommand(user).observe()
            .flatMap(catalogList -> {
                return catalogList.videos().flatMap(video -> {
                    Observable<Bookmark> bookmark = new BookmarksCommand(video).observe();
                    Observable<Rating> rating = new RatingsCommand(video).observe();
                    Observable<VideoMetadata> metadata = new VideoMetadataCommand(video).observe();
                    return Observable.zip(bookmark, rating, metadata, (b, r, m) -> {
                        return combineVideoData(video, b, r, m);
                    });
                });
            });
        Observable<Map<String, Object>> social = new SocialCommand(user).observe().map(s -> {
            return s.getDataAsMap();
        });
        return Observable.merge(catalog, social);
    }).flatMap(data -> {
        return response.writeStringAndFlush(String.valueOf(data));
    });
    Hystrix and RxJava are used together for non-blocking composition of network calls (whether they be blocking or
    non-blocking under the Hystrix bulkheads).
    This code is representative of how Netflix Edge API endpoints are written.
    More information on Hystrix can be found at https://github.com/Netflix/Hystrix/wiki#what-is-hystrix

  120. return new UserCommand(request.getQueryParameters().get("userId")).observe().flatMap(user -> {
        Observable<Map<String, Object>> catalog = new PersonalizedCatalogCommand(user).observe()
            .flatMap(catalogList -> {
                return catalogList.videos().flatMap(video -> {
                    Observable<Bookmark> bookmark = new BookmarksCommand(video).observe();
                    Observable<Rating> rating = new RatingsCommand(video).observe();
                    Observable<VideoMetadata> metadata = new VideoMetadataCommand(video).observe();
                    return Observable.zip(bookmark, rating, metadata, (b, r, m) -> {
                        return combineVideoData(video, b, r, m);
                    });
                });
            });
        Observable<Map<String, Object>> social = new SocialCommand(user).observe().map(s -> {
            return s.getDataAsMap();
        });
        return Observable.merge(catalog, social);
    }).flatMap(data -> {
        return response.writeStringAndFlush(String.valueOf(data));
    });
    A request for a `User` is kicked off and “observed”. This is non-blocking and will asynchronously receive back the
    `User` when the network response returns.

  121. return new UserCommand(request.getQueryParameters().get("userId")).observe().flatMap(user -> {
        Observable<Map<String, Object>> catalog = new PersonalizedCatalogCommand(user).observe()
            .flatMap(catalogList -> {
                return catalogList.videos().flatMap(video -> {
                    Observable<Bookmark> bookmark = new BookmarksCommand(video).observe();
                    Observable<Rating> rating = new RatingsCommand(video).observe();
                    Observable<VideoMetadata> metadata = new VideoMetadataCommand(video).observe();
                    return Observable.zip(bookmark, rating, metadata, (b, r, m) -> {
                        return combineVideoData(video, b, r, m);
                    });
                });
            });
        Observable<Map<String, Object>> social = new SocialCommand(user).observe().map(s -> {
            return s.getDataAsMap();
        });
        return Observable.merge(catalog, social);
    }).flatMap(data -> {
        return response.writeStringAndFlush(String.valueOf(data));
    });
    The `User` object is returned and passed into a function that is given to the `flatMap` operator.
    Documentation can be found at https://github.com/Netflix/RxJava/wiki/Transforming-Observables#flatmap

  122. return new UserCommand(request.getQueryParameters().get("userId")).observe().flatMap(user -> {
        Observable<Map<String, Object>> catalog = new PersonalizedCatalogCommand(user).observe()
            .flatMap(catalogList -> {
                return catalogList.videos().flatMap(video -> {
                    Observable<Bookmark> bookmark = new BookmarksCommand(video).observe();
                    Observable<Rating> rating = new RatingsCommand(video).observe();
                    Observable<VideoMetadata> metadata = new VideoMetadataCommand(video).observe();
                    return Observable.zip(bookmark, rating, metadata, (b, r, m) -> {
                        return combineVideoData(video, b, r, m);
                    });
                });
            });
        Observable<Map<String, Object>> social = new SocialCommand(user).observe().map(s -> {
            return s.getDataAsMap();
        });
        return Observable.merge(catalog, social);
    }).flatMap(data -> {
        return response.writeStringAndFlush(String.valueOf(data));
    });
    Once the `User` is returned, execute another network request to retrieve the personalized `catalog` (grid of videos)

  123. return new UserCommand(request.getQueryParameters().get("userId")).observe().flatMap(user -> {
        Observable<Map<String, Object>> catalog = new PersonalizedCatalogCommand(user).observe()
            .flatMap(catalogList -> {
                return catalogList.videos().flatMap(video -> {
                    Observable<Bookmark> bookmark = new BookmarksCommand(video).observe();
                    Observable<Rating> rating = new RatingsCommand(video).observe();
                    Observable<VideoMetadata> metadata = new VideoMetadataCommand(video).observe();
                    return Observable.zip(bookmark, rating, metadata, (b, r, m) -> {
                        return combineVideoData(video, b, r, m);
                    });
                });
            });
        Observable<Map<String, Object>> social = new SocialCommand(user).observe().map(s -> {
            return s.getDataAsMap();
        });
        return Observable.merge(catalog, social);
    }).flatMap(data -> {
        return response.writeStringAndFlush(String.valueOf(data));
    });
    … and in parallel also fetch the `social` data for this user. Both the `catalog` and `social` calls are executed in
    parallel, asynchronously and will callback when they receive their responses.

  124. return new UserCommand(request.getQueryParameters().get("userId")).observe().flatMap(user -> {
        Observable<Map<String, Object>> catalog = new PersonalizedCatalogCommand(user).observe()
            .flatMap(catalogList -> {
                return catalogList.videos().flatMap(video -> {
                    Observable<Bookmark> bookmark = new BookmarksCommand(video).observe();
                    Observable<Rating> rating = new RatingsCommand(video).observe();
                    Observable<VideoMetadata> metadata = new VideoMetadataCommand(video).observe();
                    return Observable.zip(bookmark, rating, metadata, (b, r, m) -> {
                        return combineVideoData(video, b, r, m);
                    });
                });
            });
        Observable<Map<String, Object>> social = new SocialCommand(user).observe().map(s -> {
            return s.getDataAsMap();
        });
        return Observable.merge(catalog, social);
    }).flatMap(data -> {
        return response.writeStringAndFlush(String.valueOf(data));
    });
    When the `catalog` returns it will callback multiple times with each `catalogList`.

  125. return new UserCommand(request.getQueryParameters().get("userId")).observe().flatMap(user -> {
        Observable<Map<String, Object>> catalog = new PersonalizedCatalogCommand(user).observe()
            .flatMap(catalogList -> {
                return catalogList.videos().flatMap(video -> {
                    Observable<Bookmark> bookmark = new BookmarksCommand(video).observe();
                    Observable<Rating> rating = new RatingsCommand(video).observe();
                    Observable<VideoMetadata> metadata = new VideoMetadataCommand(video).observe();
                    return Observable.zip(bookmark, rating, metadata, (b, r, m) -> {
                        return combineVideoData(video, b, r, m);
                    });
                });
            });
        Observable<Map<String, Object>> social = new SocialCommand(user).observe().map(s -> {
            return s.getDataAsMap();
        });
        return Observable.merge(catalog, social);
    }).flatMap(data -> {
        return response.writeStringAndFlush(String.valueOf(data));
    });
    Each `catalogList` then has a list of videos.

  126. return new UserCommand(request.getQueryParameters().get("userId")).observe().flatMap(user -> {
        Observable<Map<String, Object>> catalog = new PersonalizedCatalogCommand(user).observe()
            .flatMap(catalogList -> {
                return catalogList.videos().flatMap(video -> {
                    Observable<Bookmark> bookmark = new BookmarksCommand(video).observe();
                    Observable<Rating> rating = new RatingsCommand(video).observe();
                    Observable<VideoMetadata> metadata = new VideoMetadataCommand(video).observe();
                    return Observable.zip(bookmark, rating, metadata, (b, r, m) -> {
                        return combineVideoData(video, b, r, m);
                    });
                });
            });
        Observable<Map<String, Object>> social = new SocialCommand(user).observe().map(s -> {
            return s.getDataAsMap();
        });
        return Observable.merge(catalog, social);
    }).flatMap(data -> {
        return response.writeStringAndFlush(String.valueOf(data));
    });
    For each `video` it will then request a bookmark, rating and the metadata. Each of these are also executed in parallel,
    asynchronously.

  127. return new UserCommand(request.getQueryParameters().get("userId")).observe().flatMap(user -> {
        Observable<Map<String, Object>> catalog = new PersonalizedCatalogCommand(user).observe()
            .flatMap(catalogList -> {
                return catalogList.videos().flatMap(video -> {
                    Observable<Bookmark> bookmark = new BookmarksCommand(video).observe();
                    Observable<Rating> rating = new RatingsCommand(video).observe();
                    Observable<VideoMetadata> metadata = new VideoMetadataCommand(video).observe();
                    return Observable.zip(bookmark, rating, metadata, (b, r, m) -> {
                        return combineVideoData(video, b, r, m);
                    });
                });
            });
        Observable<Map<String, Object>> social = new SocialCommand(user).observe().map(s -> {
            return s.getDataAsMap();
        });
        return Observable.merge(catalog, social);
    }).flatMap(data -> {
        return response.writeStringAndFlush(String.valueOf(data));
    });
    As the bookmark, rating and metadata come back they are “zipped” together and passed to a function that combines
    them into a single response Map.
    Documentation can be found at https://github.com/Netflix/RxJava/wiki/Combining-Observables#zip

  128. return new UserCommand(request.getQueryParameters().get("userId")).observe().flatMap(user -> {
        Observable<Map<String, Object>> catalog = new PersonalizedCatalogCommand(user).observe()
            .flatMap(catalogList -> {
                return catalogList.videos().flatMap(video -> {
                    Observable<Bookmark> bookmark = new BookmarksCommand(video).observe();
                    Observable<Rating> rating = new RatingsCommand(video).observe();
                    Observable<VideoMetadata> metadata = new VideoMetadataCommand(video).observe();
                    return Observable.zip(bookmark, rating, metadata, (b, r, m) -> {
                        return combineVideoData(video, b, r, m);
                    });
                });
            });
        Observable<Map<String, Object>> social = new SocialCommand(user).observe().map(s -> {
            return s.getDataAsMap();
        });
        return Observable.merge(catalog, social);
    }).flatMap(data -> {
        return response.writeStringAndFlush(String.valueOf(data));
    });
    The `social` and `catalog` responses (of type Map) are merged together …

  129. return new UserCommand(request.getQueryParameters().get("userId")).observe().flatMap(user -> {
        Observable<Map<String, Object>> catalog = new PersonalizedCatalogCommand(user).observe()
            .flatMap(catalogList -> {
                return catalogList.videos().flatMap(video -> {
                    Observable<Bookmark> bookmark = new BookmarksCommand(video).observe();
                    Observable<Rating> rating = new RatingsCommand(video).observe();
                    Observable<VideoMetadata> metadata = new VideoMetadataCommand(video).observe();
                    return Observable.zip(bookmark, rating, metadata, (b, r, m) -> {
                        return combineVideoData(video, b, r, m);
                    });
                });
            });
        Observable<Map<String, Object>> social = new SocialCommand(user).observe().map(s -> {
            return s.getDataAsMap();
        });
        return Observable.merge(catalog, social);
    }).flatMap(data -> {
        return response.writeStringAndFlush(String.valueOf(data));
    });
    … and as received are written to the output stream. This will write each `list` and `social` response out as individual
    messages and not wait for everything. This can be done with chunked HTTP, ServerSentEvents or WebSockets.

  130. All of the endpoint logic sits on top of Hystrix which provides the bulkheading and fault tolerance.

  131. [Diagram: the Netflix API platform between devices and backend servers: optimize for each device.]
    The Netflix Edge has become a platform that empowers UI teams to build their own API endpoints that are optimized
    to their client applications and devices.

  132. Tooling supports the dynamic deployment and operations of endpoints.

  133. Including activity logs …

  134. … and metrics around deprecated functionality to assist in lifecycle management.

  135. Failure inevitably happens ...
    Failure will happen.
    A good read on complex systems is Drift into Failure by Sidney Dekker: http://www.amazon.com/Drift-into-Failure-ebook/dp/B009KOKXKY/ref=tmm_kin_title_0

  136. Cluster adapts
    Failure Isolated
    When the backing system for the ‘SocialGetTitleContext’ bulkhead became latent the impact was contained and
    fallbacks returned.

  137. Cluster adapts
    Failure Isolated
    When the backing system for the ‘SocialGetTitleContext’ bulkhead became latent the impact was contained and
    fallbacks returned.

  138. Cluster adapts
    Failure Isolated
    Since the failure rate was above the threshold, circuit breakers began tripping. As a portion of the cluster tripped
    circuits, pressure was released on the underlying system so it could successfully perform some work.

  139. Cluster adapts
    Failure Isolated
    The cluster naturally adapts as bulkheads constrain throughput and circuits open and close in a rolling manner
    across the instances in the cluster.

  140. In this example the ‘CinematchGetPredictions’ functionality began failing.

  141. The red metric shows it was exceptions thrown by the client, not latency or concurrency constraints.

  142. The 20% error rate from the realtime visualization is also seen in the historical metrics, with an accompanying drop in
    successes.

  143. Matching the increase in failures is the increase of fallbacks being delivered for every failure.

  144. We found that historical metrics with 1 datapoint per minute and 1-2 minutes of latency were not sufficient during
    operational events such as deployments, rollbacks, production alerts and configuration changes, so we built near-realtime
    monitoring and data visualizations with ~1-2 second latency. Read more at https://github.com/Netflix/Hystrix/wiki/Dashboard

  145. Note: This is a mockup
    Data visualizations and real-time monitoring have proven useful enough that a more comprehensive suite of tools is
    being built.
    Read more at: http://techblog.netflix.com/2014/01/improving-netflixs-operational.html

  146. Note: This is a mockup
    Data visualizations and real-time monitoring have proven useful enough that a more comprehensive suite of tools is
    being built.
    Read more at: http://techblog.netflix.com/2014/01/improving-netflixs-operational.html

  147. “…complex systems run as broken systems.
    The system continues to function because it
    contains so many redundancies and because
    people can make it function, despite the presence
    of many flaws.”
    – Richard Cook, How Complex Systems Fail
    Richard I. Cook - How Complex Systems Fail - http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf

  148. Netflix Tech Blog
    http://techblog.netflix.com
    Netflix Open Source
    http://netflix.github.io
    Functional Reactive in the Netflix API with RxJava
    http://techblog.netflix.com/2013/02/rxjava-netflix-api.html
    Application Resilience in a Service-oriented Architecture
    http://programming.oreilly.com/2013/06/application-resilience-in-a-service-oriented-architecture.html
    jobs.netflix.com
    Ben Christensen
    @benjchristensen
    http://www.linkedin.com/in/benjchristensen
