Reactive Service Levels at React London 2014

How do we support the current always-on systems culture? Is 100% uptime really possible? Hot deployment of software upgrades. Multi-variant testing of new features for measured and informed business reaction. Monitoring and managing reactive systems to ensure they are meeting service levels.

Learn how Netflix achieves scalability, resilience and responsiveness through its cloud architecture and tools such as Ribbon, Eureka, Hystrix, RxJava, Zuul, Scryer and the Simian Army.

Presented at React 2014 in London: http://reactconf.com

Video: http://www.youtube.com/watch?v=Ftkn1OF895E&feature=share&list=PLSD48HvrE7-Z1stQ1vIIBumB0wK0s8llY&index=7

Ben Christensen

April 08, 2014

Transcript

  1. Reactive Service Levels

    Ben Christensen, Software Engineer, Edge Engineering at Netflix. @benjchristensen http://techblog.netflix.com/ (React London, April 2014)
  2. “the explosive growth of software has added greatly to systems’ interactive complexity. With software, the possible states that a system can end up in become mind-boggling.”

    – Sidney Dekker, Drift into Failure (2012-10-01, Ashgate Publishing, Kindle Edition), Kindle Locations 3268-3270. http://www.amazon.com/Drift-into-Failure-ebook/dp/B009KOKXKY/ref=tmm_kin_title_0
  3. “We can model and understand in isolation. But, when released into competitive, nominally regulated societies, their connections proliferate, their interactions and interdependencies multiply, their complexities mushroom. And we are caught short.”

    – Sidney Dekker, Drift into Failure (2012-10-01, Ashgate Publishing, Kindle Edition), Kindle Locations 290-292. http://www.amazon.com/Drift-into-Failure-ebook/dp/B009KOKXKY/ref=tmm_kin_title_0
  4. Netflix is a subscription service for movies and TV shows

    for $7.99 USD/month (about the same converted price in each country’s local currency).
  5. More than 44 million Subscribers in 41 Countries

    Netflix has over 44 million video streaming customers in 41 countries across North and South America, the United Kingdom, Ireland, the Netherlands and the Nordics.
  6. Netflix accounts for 31% of Peak Downstream Internet Traffic in North America

    Netflix subscribers are watching more than 1 billion hours a month. Sandvine report available with free account at http://www.sandvine.com/news/global_broadband_trends.asp Image from report at https://www.sandvine.com/downloads/general/global-internet-phenomena/2013/2h-2013-global-internet-phenomena-report.pdf
  7. None
  8. Prior to the current globally distributed cloud architecture a single

    data center served the US.
  9. Netflix expanded its service around the globe …

  10. … and migrated from the data center to a cloud

    architecture in multiple Amazon AWS regions. Geographic isolation and failover via active/active multi-region deployment was added in 2013 (http://techblog.netflix.com/2013/05/denominating-multi-region-sites.html)
  11. [Diagram: three AWS Availability Zones]

    3 AWS Availability Zones (think of them as independent data centers right next to each other for low latency) operate in each region, with deployments split across them for redundancy in the event of losing an entire zone.
  12. Each zone is populated with application clusters (‘auto-scaling groups’ or

    ASGs) that make up the service oriented distributed system. Application clusters operate independently of each other with client-side software load balancing routing traffic between them.
  13. Application clusters are made up of 1 to 100s of

    machine instances per zone. Service registry and discovery work with software load balancing to allow machines to launch and disappear (for planned or unplanned reasons) at any time and become part of the distributed system and serve requests. Auto-scaling enables system-wide adaptation to demand as it launches instances to meet increasing traffic and load or handle instance failure.
  14. Failed instances are dropped from discovery so traffic stops routing

    to them. Software load balancers on client applications detect and skip them until discovery removes them.
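
The behavior described in the slide above can be illustrated with a minimal sketch. It is not the Ribbon or Eureka API: the class `SkippingRoundRobin` and its methods are hypothetical names, and plain round-robin is an assumed routing policy. The point is only that the client skips an instance it has seen fail before discovery catches up and removes it.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical illustration only; this is not the Ribbon or Eureka API.
class SkippingRoundRobin {
    private final List<String> instances = new ArrayList<>(); // local view of the registry
    private final Set<String> suspect = new HashSet<>();      // failures detected by this client
    private int next = 0;

    SkippingRoundRobin(List<String> fromRegistry) {
        instances.addAll(fromRegistry);
    }

    // The client saw a failure: stop routing here without waiting for discovery.
    synchronized void markFailed(String instance) {
        suspect.add(instance);
    }

    // Discovery caught up and dropped the instance from the registry.
    synchronized void removeFromRegistry(String instance) {
        instances.remove(instance);
        suspect.remove(instance);
    }

    // Round-robin over the registry view, skipping suspect instances.
    synchronized Optional<String> choose() {
        for (int i = 0; i < instances.size(); i++) {
            String candidate = instances.get(next % instances.size());
            next = (next + 1) % instances.size();
            if (!suspect.contains(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty(); // nothing healthy to route to
    }

    public static void main(String[] args) {
        SkippingRoundRobin lb = new SkippingRoundRobin(List.of("a", "b", "c"));
        lb.markFailed("b");              // "b" is skipped from now on
        System.out.println(lb.choose());
        System.out.println(lb.choose());
        lb.removeFromRegistry("b");      // discovery eventually removes it
        System.out.println(lb.choose());
    }
}
```

A real load balancer would also need a way to clear the suspect mark when an instance recovers; that is left out here to keep the sketch short.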
  15. Auto-scale policies bring on new instances to replace failed ones

    or to adapt to increasing demand.
  16. A suite of tools called the “Simian Army” is employed

    to assert the architecture and systems are in fact resilient, responsive and reactive to failure, demand and changing conditions. They are used to inject latency and failure, validate environments, clean up or perform “Game Day” exercises. These are done in all environments, most importantly in production. More information available at http://techblog.netflix.com/2011/07/netflix-simian-army.html, https://github.com/Netflix/SimianArmy and http://queue.acm.org/detail.cfm?id=2499552
  17. Chaos Monkey constantly runs in the background randomly killing single

    instances in application clusters. Application owners will receive notification that an instance has been killed. The purpose is asserting that an application cluster can handle loss of instances without impact.
  18. [Diagram: three AWS Availability Zones]

    Chaos Gorilla is used in “Game Day” exercises to terminate an entire AWS Availability Zone. This demonstrates that all systems behave correctly, migrating traffic and scaling up to meet demand in the other 2 zones. It also serves as good practice for engineers to learn what happens and what to expect when it happens for real.
  19. Chaos Kong is another “Game Day” exercise where traffic is

    migrated away from an entire region.
  20. For example, all traffic from US-East could be rerouted to

    US-West so all North and South American traffic is going to a single region instead of split as it normally is. These exercises ensure that control systems reroute traffic via DNS changes and that client devices respect the changes (or reveal which ones don’t). They also give engineers experience of how the whole system behaves, so that when a reroute is needed for a real reason it is a known practice.
  21. Eureka: Instance Discovery. Karyon: Base Server with Instance Registration, Metrics, Heartbeat, etc. Ribbon: RPC Client with load balancing.

    Eureka, Karyon and Ribbon are core software that enable resilient registration, discovery and communication between applications in the service-oriented architecture. More information can be found at http://netflix.github.io
  22. [Diagram: a user request fanning out to Dependencies A through R]

    Applications communicate with dozens of other applications in the service-oriented architecture. Each of these client/server dependencies represents a relationship within the complex distributed system.
  23. [Diagram: a user request fanning out to Dependencies A through R] User request blocked by latency in single network call

    Any one of these relationships can fail at any time. Failures can be intermittent or cluster-wide. They can be immediate, with thrown exceptions or error codes, or they can show up as latency from various causes. Latency is particularly challenging for applications to deal with because it ties up resources in queues and pools and blocks user requests (even with async IO).
  24. At high volume all request threads can block in seconds

    [Diagram: many user requests fanning out to Dependencies A through R]

    Latency at high volume can quickly saturate all application resources (queues, pools, sockets, etc), causing total application failure and the inability to serve user requests even if all other dependencies are healthy.
  25. Dozens of dependencies. One going bad takes everything down.

    99.99%^30 = 99.7% uptime. 0.3% of 1 billion = 3,000,000 failures. 2+ hours downtime/month. Reality is generally worse. Large distributed systems are complex and failure will occur. If failures from every component are allowed to cascade across the system, they will all affect the user.
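
The arithmetic on this slide can be worked through as follows. The reading assumed here (not stated on the slide) is 30 independent dependencies at 99.99% availability each, 1 billion requests, and a 30-day month of 720 hours.

```latex
\begin{align*}
0.9999^{30} &\approx 0.9970 && \text{(about 99.7\% uptime)} \\
(1 - 0.9970) \times 10^{9} &\approx 3{,}000{,}000 && \text{(failed requests out of 1 billion)} \\
(1 - 0.9970) \times 720\ \text{h} &\approx 2.2\ \text{h} && \text{(downtime per month)}
\end{align*}
```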
  26. CONSTRAINTS: Speed of Iteration, Client Libraries, Mixed Environment

    The solution for handling cascading latency and failure was designed within the constraints, context and priorities of the Netflix environment.
  27. CONSTRAINTS: Speed of Iteration, Client Libraries, Mixed Environment

    Netflix optimizes for speed of iteration, which leads to client/server relationships where client libraries are provided rather than each team writing its own client code against a server protocol. This means “3rd party” code from many developers and teams is constantly being deployed into applications across the system. Large applications such as the Netflix Edge API have dozens of client libraries.
  29. CONSTRAINTS: Speed of Iteration, Client Libraries, Mixed Environment

    The environment is also diverse, with different types of client/server communications and protocols. This heterogeneous and always changing environment affects the approach to resilience engineering, which is potentially very different from approaches taken for a tightly controlled codebase or homogeneous architecture.
  30. [Diagram: many user requests fanning out to Dependencies A through R]

    Each dependency - or distributed system relationship - must be isolated so its failure does not cascade or saturate all resources.
  31. [Diagram (cropped): user requests and dependencies, with one client call expanded into its stages]

    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc
    Serialization - URL and/or body generation
    Logic - validation, decoration, object model, caching, metrics, logging, etc

    It is not just the network that can fail and needs isolation but the full request/response loop, including business logic and serialization/deserialization. Protecting against a network failure only to return a response that causes application logic to fail elsewhere in the application only moves the problem.
  32. [Thread dump, HTTP error log and latency graph]

        "Timeout guard" daemon prio=10 tid=0x00002aaacd5e5000 nid=0x3aac runnable [0x00002aaac388f000]
           java.lang.Thread.State: RUNNABLE
            at java.net.PlainSocketImpl.socketConnect(Native Method)
            at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
            - locked <0x000000055c7e8bd8> (a java.net.SocksSocketImpl)
            at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
            at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
            at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
            at java.net.Socket.connect(Socket.java:579)
            at java.net.Socket.connect(Socket.java:528)
            at java.net.Socket.<init>(Socket.java:425)
            at java.net.Socket.<init>(Socket.java:280)
            at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
            at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$1.doit(ControllerThreadSocketFactory.java:91)
            at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$SocketTask.run(ControllerThreadSocketFactory.java:158)
            at java.lang.Thread.run(Thread.java:722)

        [Sat Jun 30 04:01:37 2012] [error] proxy: HTTP: disabled connection for (127.0.0.1)

    Graph labels: > 80% of requests rejected; Median Latency.

    This is an example of what a system looks like when high latency occurs without load shedding and isolation. Backend latency spiked (from <100ms to >1000ms at the median, >10,000ms at the 90th percentile) and saturated all available resources, resulting in the HTTP layer rejecting over 80% of requests.
  33. [Diagram: a user request fanning out to Dependencies A through R, each in its own compartment]

    Bulkheading is an approach to isolating failure and latency. It can be used to compartmentalize each system relationship so that the impact of its failure is limited and controllable. Also see http://www.reactivemanifesto.org/#resilient
  37. [Diagram: a user request fanning out to Dependencies A through R]

    Responses can be intercepted and replaced with fallbacks.
  38. [Diagram: a user request fanning out to Dependencies A through R]

    A user request can continue in a degraded state with a fallback response from the failing dependency.
  39. “Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident.”

    – Richard I. Cook, How Complex Systems Fail. http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf
  40. “System operations are dynamic, with components (organizational, human, technical) failing and being replaced continuously.”

    – Richard I. Cook, How Complex Systems Fail. http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf
  41. Code was written to apply bulkheads, circuit breakers, time-outs, fallbacks

    and other practices … and later got a name and logo. See more at http://github.com/Netflix/Hystrix
  42. [Diagram: the client call stages: logic (validation, decoration, object model, caching, metrics, logging, etc), deserialization (JSON/XML/Thrift/Protobuf/etc), network request (TCP/HTTP, latency, 4xx, 5xx, etc), serialization (URL and/or body generation) and logic (argument validation, caches, metrics, logging, multivariate testing, routing, etc)]

    A bulkhead wraps around the entire client behavior, not just the network portion.
  43. Tryable Semaphore [Diagram: a semaphore that either permits or rejects execution of the client call stages (logic, serialization, network request, deserialization, logic)]

    An effective form of bulkheading is a tryable semaphore that restricts concurrent execution. Read more at https://github.com/Netflix/Hystrix/wiki/How-it-Works#semaphores
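
A tryable semaphore can be sketched with the JDK's `java.util.concurrent.Semaphore`. This is a simplified illustration, not Hystrix's own implementation; `SemaphoreBulkhead` and `call` are hypothetical names. The key property is that `tryAcquire()` never blocks: a caller is either permitted or rejected immediately.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical sketch; Hystrix's own semaphore handling differs in detail.
class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Runs the work if a permit is free; otherwise rejects immediately
    // and returns the fallback instead of queueing the caller.
    <T> T call(Supplier<T> work, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get();   // rejected: the bulkhead is full
        }
        try {
            return work.get();       // permitted
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        SemaphoreBulkhead bulkhead = new SemaphoreBulkhead(1);
        // While the outer call holds the only permit, the nested call is rejected.
        String result = bulkhead.call(() -> {
            String inner = bulkhead.call(() -> "inner ran", () -> "inner rejected");
            return "outer saw: " + inner;
        }, () -> "outer rejected");
        System.out.println(result);
    }
}
```

Because the work runs on the calling thread, this form cannot walk away from a call that is already slow; it only caps how many callers can be stuck at once.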
  44. Thread-pool [Diagram: a thread-pool that either permits or rejects execution of the client call stages (logic, serialization, network request, deserialization, logic), plus a timeout]

    A thread-pool also limits concurrent execution while offering the ability to time out and walk away from a latent thread. Read more at https://github.com/Netflix/Hystrix/wiki/How-it-Works#threads--thread-pools
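
The thread-pool variant can be sketched with a JDK `ThreadPoolExecutor`. Again this is an illustration under assumptions rather than Hystrix's implementation: `ThreadPoolBulkhead` is a hypothetical name, and a `SynchronousQueue` is chosen here so that a full pool rejects work instead of queueing it.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; not how Hystrix is implemented internally.
class ThreadPoolBulkhead {
    private final ExecutorService pool;
    private final long timeoutMs;

    ThreadPoolBulkhead(int threads, long timeoutMs) {
        // No queue: when every thread is busy, submit() rejects instead of waiting.
        this.pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>());
        this.timeoutMs = timeoutMs;
    }

    <T> T call(Callable<T> work, T fallback) {
        Future<T> future;
        try {
            future = pool.submit(work);
        } catch (RejectedExecutionException e) {
            return fallback;                     // rejected: every thread is busy
        }
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);                 // interrupt the latent thread and walk away
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the caller's interrupt status
            return fallback;
        } catch (ExecutionException e) {
            return fallback;                     // the work itself failed
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }

    public static void main(String[] args) {
        ThreadPoolBulkhead bulkhead = new ThreadPoolBulkhead(2, 100);
        String fast = bulkhead.call(() -> "fresh", "fallback");
        String slow = bulkhead.call(() -> { Thread.sleep(2000); return "late"; }, "fallback");
        System.out.println(fast + " / " + slow);
        bulkhead.shutdown();
    }
}
```

The cost of this form is an extra thread hop per call; what it buys is that the caller waits at most `timeoutMs` no matter how slow the dependency is.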
  45. Tryable semaphores for non-blocking clients and fallbacks; separate threads for blocking clients; aggressive timeouts to “give up and move on”; circuit breakers as the “release valve”.
  46. [Hystrix execution flow chart. Boxes: Construct Hystrix Command Object; .execute() (synchronous); .queue() and .observe() (asynchronous); Circuit Open?; Rate Limit?; run(); Success?; getFallback(); Calculate Circuit Health (feedback loop). Edges: Short-circuit, Reject, Timeout, Exception Thrown, Successful Response, Not Implemented, Successful Fallback, Failed Fallback.]

    Hystrix execution flow chart. Read more at https://github.com/Netflix/Hystrix/wiki/How-it-Works#flow-chart
  47. [Hystrix execution flow chart]

    Execution can be synchronous or asynchronous (via a Future or Observable).
  48. [Hystrix execution flow chart]

    Current state is queried before allowing execution to determine if it is short-circuited or throttled and should reject.
  49. [Hystrix execution flow chart]

    If not rejected, execution proceeds to the run() method, which performs the underlying work.
  50. [Hystrix execution flow chart]

    Successful responses return.
  51. [Hystrix execution flow chart]

    All requests, successful and failed, contribute to a feedback loop used to make decisions and publish metrics.
  52. [Hystrix execution flow chart]

    All failure states are routed through the same path.
  53. [Hystrix execution flow chart]

    Every failure is given the opportunity to retrieve a fallback, with one of three outcomes: not implemented, successful fallback or failed fallback.
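
The flow walked through in the slides above (query state, run, record the result, fall back on failure) can be sketched as below. This is a deliberately simplified illustration, not Hystrix's circuit breaker: `SimpleCircuit`, its parameters and the injected clock are hypothetical, it counts all requests since the last reset rather than using rolling time windows, and it lets any number of trial calls through once the sleep window has elapsed.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical, simplified sketch of the flow; Hystrix's real circuit breaker differs.
class SimpleCircuit {
    private final int minRequests;       // volume needed before the circuit may trip
    private final double errorThreshold; // e.g. 0.5 trips at 50% failures
    private final long sleepMs;          // how long to stay open before a trial call
    private final LongSupplier clock;    // injected so the sketch can be tested
    private int total;
    private int failed;
    private long openedAt = -1;          // -1 means the circuit is closed

    SimpleCircuit(int minRequests, double errorThreshold, long sleepMs, LongSupplier clock) {
        this.minRequests = minRequests;
        this.errorThreshold = errorThreshold;
        this.sleepMs = sleepMs;
        this.clock = clock;
    }

    <T> T execute(Supplier<T> run, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get();       // short-circuited: run() is never invoked
        }
        try {
            T result = run.get();        // perform the underlying work
            record(true);
            return result;
        } catch (RuntimeException e) {
            record(false);
            return fallback.get();       // every failure gets a chance at a fallback
        }
    }

    // State is queried before execution is allowed.
    private synchronized boolean allowRequest() {
        return openedAt < 0 || clock.getAsLong() - openedAt >= sleepMs;
    }

    // Feedback loop: successes and failures both update circuit health.
    private synchronized void record(boolean success) {
        if (openedAt >= 0) {             // this was a trial call after the sleep window
            if (success) {
                openedAt = -1;
                total = 0;
                failed = 0;
            } else {
                openedAt = clock.getAsLong();
            }
            return;
        }
        total++;
        if (!success) {
            failed++;
        }
        if (total >= minRequests && failed >= errorThreshold * total) {
            openedAt = clock.getAsLong(); // trip the circuit
        }
    }

    synchronized boolean isOpen() {
        return openedAt >= 0;
    }

    public static void main(String[] args) {
        SimpleCircuit circuit = new SimpleCircuit(4, 0.5, 5000, System::currentTimeMillis);
        for (int i = 0; i < 5; i++) {
            String r = circuit.<String>execute(
                () -> { throw new IllegalStateException("dependency down"); },
                () -> "fallback");
            System.out.println(r + " (open=" + circuit.isOpen() + ")");
        }
    }
}
```

If the fallback itself throws, the exception propagates to the caller, which corresponds to the "failed fallback" outcome in the chart.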
  55. HystrixCommand run()

        public class CommandHelloWorld extends HystrixCommand<String> {
            ...
            protected String run() {
                return "Hello " + name + "!";
            }
        }

    Basic successful execution pattern and sample code. Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#wiki-Hello-World
  56. HystrixCommand run(): run() invokes “client” logic

        public class CommandHelloWorld extends HystrixCommand<String> {
            ...
            // The wrapped client logic goes here.
            protected String run() {
                return "Hello " + name + "!";
            }
        }

    The run() method is where the wrapped logic goes.
  57. Fail Fast [Diagram: HystrixCommand run() throwing an exception]

    Failing fast is the default behavior if no fallback is implemented. Even without a fallback this is useful: it prevents resource saturation beyond the bulkhead, so the rest of the application can continue functioning, and it enables rapid recovery once the underlying problem is resolved. Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#fail-fast
  58. Fail Silent [Diagram: HystrixCommand run() failing over to getFallback()]

    Example getFallback() return values:

        return null;
        return new Option<T>();
        return Collections.emptyList();
        return Collections.emptyMap();

    Silent failure is an approach for removing non-essential functionality from the user experience by returning a value that equates to “no data”, “not available” or “don’t display”. Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#fail-silent
  59. Static Fallback [Diagram: HystrixCommand run() failing over to getFallback()]

    Example getFallback() return values:

        return true;
        return DEFAULT_OBJECT;

    Static fallbacks can be used when default data or behavior can be returned to the user. Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#fallback-static
  60. Stubbed Fallback [Diagram: HystrixCommand run() failing over to getFallback()]

    Example getFallback() return values:

        return new UserAccount(customerId, "Unknown Name",
                countryCodeFromGeoLookup, true, true, false);
        return new VideoBookmark(movieId, 0);

    Stubbed fallbacks extend static fallbacks: data that is available (such as from request arguments, authentication tokens or other functioning system calls) is combined with default values for data that cannot be retrieved. Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#fallback-stubbed
  62. Stubbed Fallback [Diagram: HystrixCommand run() failing over to getFallback()]

        public class CommandHelloWorld extends HystrixCommand<String> {
            ...
            protected String run() {
                return "Hello " + name + "!";
            }

            // Invoked when run() fails or when execution is rejected.
            protected String getFallback() {
                return "Hello Failure " + name + "!";
            }
        }

    The getFallback() method is executed whenever failure occurs (after run() invocation or on rejection without run() ever being invoked) to provide an opportunity to fall back.
  63. Fallback via network [Diagram: a HystrixCommand whose getFallback() invokes the run() of a second HystrixCommand]

    Fallback via network is a common approach for falling back to a stale cache (such as a memcache server) or a less personalized value when not able to fetch from the primary source. Read more at https://github.com/Netflix/Hystrix/wiki/How-To-Use#fallback-cache-via-network
  64. Fallback via network then Local [Diagram: a HystrixCommand whose getFallback() invokes a second HystrixCommand that has its own getFallback()]

    When the fallback performs a network call, it is preferable for it to also have a fallback that does not go over the network. Otherwise, if both primary and secondary systems fail, the command fails by throwing an exception (similar to fail fast, except after two fallback attempts).
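
The "network, then local" ordering can be sketched in plain Java without Hystrix. `FallbackChain.fetch` is a hypothetical helper and the suppliers stand in for a primary call and a cache lookup; in Hystrix each network call would be its own command with its own bulkhead.

```java
import java.util.function.Supplier;

// Hypothetical sketch of "primary, then fallback via network, then local fallback".
class FallbackChain {
    static <T> T fetch(Supplier<T> primary, Supplier<T> networkFallback, T localFallback) {
        try {
            return primary.get();
        } catch (RuntimeException primaryFailure) {
            try {
                return networkFallback.get();  // e.g. a stale value from a cache server
            } catch (RuntimeException fallbackFailure) {
                return localFallback;          // no network involved at this point
            }
        }
    }

    public static void main(String[] args) {
        String value = FallbackChain.<String>fetch(
            () -> { throw new IllegalStateException("primary down"); },
            () -> { throw new IllegalStateException("cache down"); },
            "default");
        System.out.println(value);
    }
}
```

The last step takes a plain value rather than a supplier so that it cannot fail for the same reason as the two network calls before it.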
  66. “In complex systems, decision-makers are locally rather than globally rational. But that doesn’t mean that their decisions cannot lead to global, or system-wide events. In fact, that is one of the properties of complex systems: local actions can have global results.”

    – Sidney Dekker, Drift into Failure (2012-10-01, Ashgate Publishing, Kindle Edition), Kindle Locations 3268-3270. http://www.amazon.com/Drift-into-Failure-ebook/dp/B009KOKXKY/ref=tmm_kin_title_0

    Local decisions are being made constantly and do affect the entire system. These include routing decisions (such as by Ribbon), marking an instance “up” or “down” (via Eureka), timing out, short-circuiting, rejecting and falling back (via Hystrix).
  67. Auditing via Simulation

    Simulating failure states in production has proven an effective approach for auditing our applications, either to prove resilience or to find weaknesses, and for determining how local decisions affect the system as a whole. NOTE: This does not imply the ability to understand all impacts of local decisions. Since a distributed system is a “complex system” it is by definition impossible to simulate or even model all possible states. However, auditing via simulation does allow many states (including the most common) to be understood, validated and catered for.
  68. Auditing via Simulation

    In this example failure was injected into a single dependency, which caused the bulkhead to return fallbacks and trip all circuits since the failure rate was almost 100%, well above the threshold for circuits to trip.
  69. Auditing via Simulation

    When the ‘TitleStatesGetAllRentStates’ bulkhead began returning fallbacks, the ‘atv_mdp’ endpoint shot to the top of the dashboard with a 99% error rate. There was a bug in how the fallback was handled on this device, so we immediately stopped the simulation, fixed the bug over the coming days and repeated the simulation to prove it was fixed and the rest of the system remained resilient. This was caught in a controlled simulation where we could catch and act in less than a minute, rather than in a true production incident where we likely wouldn’t have been able to do anything.
  70. This shows another simulation where latency was injected. Read more

    at http://techblog.netflix.com/2011/07/netflix-simian-army.html
  71. 125 → 1500+

    1000+ ms of latency was injected into a dependency that normally completes with a median latency of ~15-20ms and a 99.5th percentile of 120-130ms.
  72. ~5000

    The latency spike caused timeouts, short-circuiting and rejections, and up to ~5000 fallbacks per second as a result of these various failure states.
  73. ~1

    While delivering the ~5000 fallbacks per second, the exceptions thrown didn’t go beyond ~1 per second, demonstrating that user impact was negligible (as perceived from the server; the client behavior must also be validated during a simulation but is not part of this dataset).
  74. Constantly Changing

    Constantly building, testing and deploying new code. Dozens of application deployments every day. Typically 100s of customer A/B tests running. For more info on A/B testing: http://www.slideshare.net/xamat/cikm-2013-beyond-data-from-user-information-to-business-value
  75. Application clusters are replaced every time a new version of

    code is deployed. There are often several different clusters with slightly different versions or configurations of the same application.
  76. [Diagram: Zuul routing layer in front of the Canary vs Baseline, Squeeze, Production and "Coalmine" clusters]

    For example, the Edge API application has several clusters such as these used for validation, testing, debugging and production traffic.
  77. [Diagram: Zuul routing layer in front of the Canary vs Baseline, Squeeze, Production and "Coalmine" clusters]

    Zuul was built to allow dynamic routing of traffic to each cluster. Read more at http://techblog.netflix.com/2013/06/announcing-zuul-edge-service-in-cloud.html
  78. [Diagram: Zuul routing layer in front of the Canary vs Baseline, Squeeze, Production and "Coalmine" clusters]

    Every code deployment is preceded by a canary test where a small number of instances are launched to take production traffic, half with new code (canary) and half with existing production code (baseline), and the two groups are compared for differences. Read more at http://techblog.netflix.com/2013/08/deploying-netflix-api.html
  79. This dashboard represents the continuous build and deployment pipeline for

    a single application. Each commit (or batch of commits) results in unit tests being executed, a new machine image being created, an instance launched in the test environment, smoke tests against this instance and optionally launching in production for canary testing.
  80. This build resulted in a poor canary score so did

    not get promoted to production.
  81. The exact commit that represents the machine image running in

    production is seen along with which clusters are running that code.
  82. The production code passed the canary test with a score

    of 99%.
  83. There are many different environments where code can be deployed.

  84. History and metadata for every environment, commit and machine image

    is available.
  85. Code can be manually pushed to an environment by selecting

    a build and environment.
  86. Selection of a build shows its status and environments it

    is deployed to.
  87. Each cluster can be inspected to see versions of code

    deployed, whether they are active or inactive, taking traffic or not and how many instances.
  88. Selecting an environment provides context for time zones and normal

    peak traffic patterns to assist in making decisions about when to deploy. The tooling allows ignoring the suggestions and pushing the code immediately, or scheduling it to be done during off-peak hours.
  89. The sequence of builds can be viewed to get an

    overview of where code is currently running.
  90. Canary tests in production result in reports that compare the

    new code against a “baseline” which is the same machine image as current production.
  91. Metrics from Ribbon, Hystrix, JVM and machine are analyzed to

    determine a score. They are marked as “hot” or “cold”.
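The slides do not say how the score is calculated. As a rough illustration only, the sketch below assumes each metric is compared against its baseline value, marked "hot" when the canary is more than a tolerance above baseline and "cold" when it is more than a tolerance below, and that the score is the percentage of metrics that stay within tolerance. The class name, method names and the 10% tolerance are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: names, the tolerance and the scoring rule are
// assumptions for illustration, not the analysis Netflix actually runs.
class CanaryScoreSketch {

    // Compare one canary metric value against its baseline value.
    static String classify(double baseline, double canary, double tolerance) {
        if (canary > baseline * (1 + tolerance)) return "hot";
        if (canary < baseline * (1 - tolerance)) return "cold";
        return "ok";
    }

    // Score = percentage of metrics whose canary value stays within tolerance.
    // Each map value is a pair {baseline, canary}.
    static double score(Map<String, double[]> metrics, double tolerance) {
        if (metrics.isEmpty()) return 100.0;
        int ok = 0;
        for (double[] pair : metrics.values()) {
            if (classify(pair[0], pair[1], tolerance).equals("ok")) ok++;
        }
        return 100.0 * ok / metrics.size();
    }

    public static void main(String[] args) {
        Map<String, double[]> metrics = new LinkedHashMap<>();
        metrics.put("latencyMs", new double[] {100.0, 104.0});
        metrics.put("errorRate", new double[] {0.010, 0.020});
        metrics.put("heapUsedMb", new double[] {512.0, 500.0});
        metrics.put("cpuPercent", new double[] {40.0, 30.0});
        System.out.println("canary score: " + score(metrics, 0.10));
    }
}
```

A real analysis would likely weight metrics and account for noise between instances; this sketch treats every metric equally to keep the idea visible.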
  93. This canary did not do so well. This results in

    review of what went wrong. Sometimes it will be judged okay to proceed and the problems get fixed in a later release. Other times the problems are too severe and must be fixed before release.
  94. This shows a build that went through the entire process

    and is now taking production traffic in all 3 regions.
  95. There are several clusters taking production traffic besides the primary

    one. They are used for debugging, squeeze testing, alerting, or long-lived (stable, meaning no autoscale policies) monitoring of application health (such as memory).
  96. New machine images are put through a squeeze test before

    full rollout to find the point at which the performance degrades. This is used to identify performance and throughput changes of each deployment and is fed as a datapoint into autoscaling algorithms.
  97. Long-term canaries are kept in a cluster we call “coalmine”

    with agents intercepting all network traffic. These run the same code as the production cluster and are used to identify network traffic without a bulkhead that starts happening when unknown code paths are enabled via configuration, A/B tests and other changes. Read more at https://github.com/Netflix/Hystrix/tree/master/hystrix-contrib/hystrix-network-auditor-agent
  98. [Diagram: a user request fanning out to Dependencies A through R. System Relationship Over Network without Bulkhead]

    For example, a network relationship could exist in production code without being triggered in dev, test or production canaries, and then be enabled by a condition that changes days after deployment to production. This can be a vulnerability, and we use the “coalmine” to identify these situations and inform decisions.
  99. Zuul Routing Layer Canary vs Baseline Squeeze Production "Coalmine"

  100. The primary clusters take the majority of production traffic and

    are managed by autoscale policies that add and remove instances to meet demand.
  101. The daily traffic patterns for a region typically look like

    this.
  102. The weekends have higher peaks …

  103. … and broader mid-day usage …

  104. … than weekdays.

  105. Usage in the morning (particularly on the weekend) climbs rapidly.

    These spikes cause problems with reactive autoscaling and require padding so that enough servers exist to handle the load before new instances can be brought online.
  106. Since the daily pattern is so predictable a new predictive

    autoscaling approach was built. It uses historical data to predict each 24-hour period and sets the scaling “floor” to the number of instances required. Reactive scaling policies continue to run and can raise the instance count higher if needed. When scaling down, the predictive policy lowers the floor, which allows the reactive policy to scale down as traffic reduces. More information is available at: http://techblog.netflix.com/2013/11/scryer-netflixs-predictive-auto-scaling.html
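The interaction between the predictive floor and the reactive policy can be sketched as taking the larger of the two instance counts. This is a minimal illustration under assumed names and numbers (`predictiveFloor`, `desiredInstances`, a fixed requests-per-second capacity per instance); it is not how Scryer itself is implemented.

```java
// Hypothetical sketch: method names and the capacity model are assumptions.
class PredictiveFloorSketch {

    // Instances needed to serve the predicted load, rounded up.
    static int predictiveFloor(double predictedRps, double rpsPerInstance) {
        return (int) Math.ceil(predictedRps / rpsPerInstance);
    }

    // The reactive policy may raise the count above the floor,
    // but it can never scale the cluster below the floor.
    static int desiredInstances(int predictiveFloor, int reactiveDesired) {
        return Math.max(predictiveFloor, reactiveDesired);
    }

    public static void main(String[] args) {
        int floor = predictiveFloor(10_000, 300);
        // Normal case: reactive scaling asks for more than the prediction.
        System.out.println(desiredInstances(floor, 40));
        // Anomaly: traffic drops and reactive scaling wants to shrink,
        // but the floor keeps capacity in place for when traffic returns.
        System.out.println(desiredInstances(floor, 10));
    }
}
```

Lowering the floor as the predicted load falls is what lets the reactive policy scale the cluster down again.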
  107. Reactive Only with Predictive This shows the number of instances

    needed with just reactive scaling (blue) and when predictive was added (red). Adding predictive policies allows far more efficient autoscaling because it doesn’t need to add as much overhead to handle spikes in the mornings and it can step up to peak more efficiently rather than reacting by adding percentages of an increasingly larger cluster.
  108. Predictive policies are also safer in the event of anomalies

    that cause a reduction in traffic, whether that be a production outage, a device problem or a widespread event like the Super Bowl. Using only the reactive policy, the cluster starts to scale down when it sees traffic drop. The predictive policy keeps it set to where it needs to be so when the traffic returns (typically as a wave that spikes traffic as seen in this image) the required capacity is still there.
  109. This behavior (the blue line) makes the system far more

    resilient to anomalies and outages so that autoscaling doesn’t itself cause degradation or outages by removing capacity right before it’s needed again.
  110. Load Avg After Scryer Another positive side-effect of more efficiently

    scaling up via the prediction algorithms is that the instances end up with smoother load and fewer extremes since the cluster is already scaled up to handle the traffic rather than trying to react after the load comes.
  111. Dozens of UIs across 1000+ Devices Netflix supports > 1000

    different devices with dozens of different UIs. Many of these are constantly innovating.
  112. 100s of Customer A/B Tests Typically 100s of customer A/B

    tests are running on many different UIs and devices. For more info: http://www.slideshare.net/xamat/cikm-2013-beyond-data-from-user-information-to-business-value
  113. The Netflix Edge API was re-architected to allow each UI

    team to develop, deploy and operate their own web service endpoints on top of the Edge Platform. This shifted away from having a single one-size-fits-all REST API that all UIs used. More information can be found at https://speakerdeck.com/benjchristensen/evolution-of-the-netflix-api-qcon-sf-2013
  114. This change allows each endpoint to be optimized to exactly

    the needs of the device or UI it is serving. It also allows each UI team to innovate independently of each other so as to distribute the pace of innovation. The endpoints can be deployed into production in < 2 minutes.
  115. Each endpoint is isolated from others within its own classloader.

    Groovy was chosen as the first language used for writing endpoints but any JVM language could be used.
  116. Code is uploaded into an external data store via RESTful

    administration APIs and then pulled into each application instance in the production clusters. The data store globally replicates the code across all AWS regions and zones.
  117. Endpoints are written in a fully asynchronous manner using RxJava

    against a Java API that is completely non-blocking.
  118. (Functional) Reactive Programming with RxJava

    Iterable (pull) vs. Observable (push): `T next()` vs. `onNext(T)`; `throws Exception` vs. `onError(Exception)`; `returns;` vs. `onCompleted()`.

    A Java port of Rx (Reactive Extensions) https://rx.codeplex.com (.Net and Javascript by Microsoft). Rx is used for asynchronous programming. Observable/Observer is the asynchronous dual to the synchronous Iterable/Iterator. More information on RxJava can be found at https://github.com/Netflix/RxJava and https://speakerdeck.com/benjchristensen/rxjava-goto-aarhus-2013
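The pull/push duality can be shown with the standard library alone. In the sketch below the `Observer` interface is a hand-written stand-in with the three callbacks named on the slide, not the RxJava type, and the producer pushes synchronously for simplicity, whereas an Rx Observable may deliver values asynchronously.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: Observer here is a minimal stand-in, not rx.Observer.
class PushPullSketch {

    interface Observer<T> {
        void onNext(T value);       // dual of: T next()
        void onError(Exception e);  // dual of: throws Exception
        void onCompleted();         // dual of: returning normally
    }

    // Pull: the consumer drives the loop and asks for each value.
    static List<String> pull(Iterable<String> source) {
        List<String> seen = new ArrayList<>();
        Iterator<String> it = source.iterator();
        while (it.hasNext()) {
            seen.add(it.next());
        }
        seen.add("done");
        return seen;
    }

    // Push: the producer drives the loop and calls the consumer back.
    static void push(Iterable<String> source, Observer<String> observer) {
        try {
            for (String value : source) {
                observer.onNext(value);
            }
        } catch (Exception e) {
            observer.onError(e);
            return;
        }
        observer.onCompleted();
    }

    // Records what an observer is told, to compare with the pull version.
    static List<String> collectPushed(Iterable<String> source) {
        List<String> seen = new ArrayList<>();
        push(source, new Observer<String>() {
            @Override public void onNext(String value) { seen.add(value); }
            @Override public void onError(Exception e) { seen.add("error"); }
            @Override public void onCompleted() { seen.add("done"); }
        });
        return seen;
    }

    public static void main(String[] args) {
        List<String> videos = Arrays.asList("video-1", "video-2");
        System.out.println(pull(videos));
        System.out.println(collectPushed(videos));
    }
}
```

Both versions observe the same sequence of events; the difference is only who is in control of when the next value arrives.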
  119. Instead of a blocking API ...

        class VideoService {
            def VideoList getPersonalizedListOfMovies(userId);
            def VideoBookmark getBookmark(userId, videoId);
            def VideoRating getRating(userId, videoId);
            def VideoMetadata getMetadata(videoId);
        }

    ... create an Observable API:

        class VideoService {
            def Observable<VideoList> getPersonalizedListOfMovies(userId);
            def Observable<VideoBookmark> getBookmark(userId, videoId);
            def Observable<VideoRating> getRating(userId, videoId);
            def Observable<VideoMetadata> getMetadata(videoId);
        }

    With Rx, blocking APIs could be converted into Observable APIs and accomplish our architecture goals, including abstracting away the control and implementation of concurrency and asynchronous execution.
  120. Hystrix and RxJava are used together for non-blocking composition of

    network calls (whether they be blocking or non-blocking under the Hystrix bulkheads). This code is representative of how Netflix Edge API endpoints are written. More information on Hystrix can be found at https://github.com/Netflix/Hystrix/wiki#what-is-hystrix

        // Fetch the user, then compose the catalog and social calls without blocking.
        return new UserCommand(request.getQueryParameters().get("userId")).observe().flatMap(user -> {
            Observable<Map<String, Object>> catalog = new PersonalizedCatalogCommand(user).observe()
                    .flatMap(catalogList -> {
                        return catalogList.videos().flatMap(video -> {
                            Observable<Bookmark> bookmark = new BookmarksCommand(video).observe();
                            Observable<Rating> rating = new RatingsCommand(video).observe();
                            Observable<Video> metadata = new VideoMetadataCommand(video).observe();
                            // Combine the three per-video responses into one map.
                            return Observable.zip(bookmark, rating, metadata, (b, r, m) -> {
                                return combineVideoData(video, b, r, m);
                            });
                        });
                    });
            Observable<Map<String, Object>> social = new SocialCommand(user).observe().map(s -> {
                return s.getDataAsMap();
            });
            return Observable.merge(catalog, social);
        }).flatMap(data -> {
            // Write each result to the response as it arrives.
            return response.writeStringAndFlush(String.valueOf(data));
        });
  121. A request for a `User` is kicked off and “observed”.

    This is non-blocking and will asynchronously receive back the `User` when the network response returns.
  122. The `User` object is returned and passed into a function

    that is given to the `flatMap` operator. Documentation can be found at https://github.com/Netflix/RxJava/wiki/Transforming-Observables#flatmap
  123. Once the `User` is returned, execute another network request to

    retrieve the personalized `catalog` (grid of videos) …
  124. … and in parallel also fetch the `social` data for

    this user. Both the `catalog` and `social` calls are executed in parallel, asynchronously, and will call back when they receive their responses.
  125. When the `catalog` returns it will call back multiple times,

    once with each `catalogList`.
  126. Each `catalogList` then has a list of videos.

  127. For each `video` it will then request a bookmark, rating

    and the metadata. Each of these is also executed in parallel, asynchronously.
  128. As the bookmark, rating and metadata come back they are

    “zipped” together and passed to a function that combines them into a single response Map<String, Object>. Documentation can be found at https://github.com/Netflix/RxJava/wiki/Combining-Observables#zip
  129. The `social` and `catalog` responses (of type Map<String, Object>) are

    merged together …
  130. … and as received are written to the output stream.

    This will write each `list` and `social` response out as individual messages and not wait for everything. This can be done with chunked HTTP, ServerSentEvents or WebSockets.
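For readers without RxJava at hand, the "zip" step of the walkthrough has a rough analogue in the standard library: three independent asynchronous calls are started and their results are combined only once all have arrived. The sketch below uses `CompletableFuture.thenCombine` rather than `Observable.zip`, handles single values rather than streams, and the `fetch*` methods are hypothetical stand-ins for the Hystrix commands.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch: a CompletableFuture analogy, not the RxJava code
// from the slides. The fetch methods stand in for network calls.
class ZipCompositionSketch {

    static CompletableFuture<String> fetchBookmark(String videoId) {
        return CompletableFuture.supplyAsync(() -> "bookmark-" + videoId);
    }

    static CompletableFuture<Integer> fetchRating(String videoId) {
        return CompletableFuture.supplyAsync(() -> 4);
    }

    static CompletableFuture<String> fetchMetadata(String videoId) {
        return CompletableFuture.supplyAsync(() -> "metadata-" + videoId);
    }

    // All three calls are started before any result is awaited, so they
    // can run in parallel; the combiner runs when the results are ready.
    static CompletableFuture<Map<String, Object>> videoData(String videoId) {
        CompletableFuture<String> bookmark = fetchBookmark(videoId);
        CompletableFuture<Integer> rating = fetchRating(videoId);
        CompletableFuture<String> metadata = fetchMetadata(videoId);
        return bookmark
                .thenCombine(rating, (b, r) -> {
                    Map<String, Object> combined = new LinkedHashMap<>();
                    combined.put("id", videoId);
                    combined.put("bookmark", b);
                    combined.put("rating", r);
                    return combined;
                })
                .thenCombine(metadata, (combined, m) -> {
                    combined.put("metadata", m);
                    return combined;
                });
    }

    public static void main(String[] args) {
        // join() blocks here only so the demo can print the result.
        System.out.println(videoData("video-1").join());
    }
}
```

Unlike an Observable, a `CompletableFuture` carries exactly one value, so this covers the per-video combination but not the streaming of multiple lists to the response.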
  131. All of the endpoint logic sits on top of Hystrix

    which provides the bulkheading and fault tolerance.
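The idea of a bulkhead with a fallback can be sketched with a semaphore: each dependency gets a fixed number of concurrent permits, and a call that cannot get a permit, or that fails, returns a fallback instead of queueing. This is a simplified illustration with hypothetical names, not Hystrix's implementation, which also offers thread-pool isolation, timeouts and metrics.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical sketch of a semaphore bulkhead; not the Hystrix API.
class BulkheadSketch<T> {

    private final Semaphore permits;

    BulkheadSketch(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    T execute(Supplier<T> call, Supplier<T> fallback) {
        // Reject immediately when the bulkhead is full rather than waiting,
        // so a latent dependency cannot tie up all caller threads.
        if (!permits.tryAcquire()) {
            return fallback.get();
        }
        try {
            return call.get();
        } catch (RuntimeException e) {
            // A failed call degrades to the fallback response.
            return fallback.get();
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        BulkheadSketch<String> social = new BulkheadSketch<>(10);
        System.out.println(social.execute(() -> "live social data", () -> "empty social data"));
    }
}
```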
  132. Netflix API Device Server Optimize for Each Device The Netflix

    Edge has become a platform that empowers UI teams to build their own API endpoints that are optimized to their client applications and devices.
  133. Tooling supports the dynamic deployment and operations of endpoints.

  134. Including activity logs …

  135. … and metrics around deprecated functionality to assist in lifecycle

    management.
  136. Failure inevitably happens ... Failure will happen. A good read

    on complex systems is Drift into Failure by Sidney Dekker: http://www.amazon.com/Drift-into-Failure-ebook/dp/B009KOKXKY/ref=tmm_kin_title_0
  137. Cluster adapts Failure Isolated When the backing system for the

    ‘SocialGetTitleContext’ bulkhead became latent the impact was contained and fallbacks returned.
  139. Cluster adapts Failure Isolated Since the failure rate was above

    the threshold circuit breakers began tripping. As a portion of the cluster tripped circuits it released pressure on the underlying system so it could successfully perform some work.
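The tripping behaviour can be sketched as a small state machine: once enough requests have been seen and the failure rate crosses a threshold, the circuit opens and calls go straight to the fallback; after a sleep window one trial call is let through, and success closes the circuit again. The class below is a simplified, hypothetical illustration (it counts since the last reset instead of using a rolling window, and takes the clock as a parameter), not Hystrix's circuit breaker.

```java
import java.util.function.Supplier;

// Hypothetical sketch of a circuit breaker; not the Hystrix implementation.
class CircuitBreakerSketch {

    private final int minimumRequests;
    private final double errorThreshold;
    private final long sleepWindowMs;
    private int requests;
    private int failures;
    private long openedAtMs = -1; // -1 means the circuit is closed

    CircuitBreakerSketch(int minimumRequests, double errorThreshold, long sleepWindowMs) {
        this.minimumRequests = minimumRequests;
        this.errorThreshold = errorThreshold;
        this.sleepWindowMs = sleepWindowMs;
    }

    // Closed circuits allow everything; open circuits allow a trial
    // request only after the sleep window has elapsed.
    synchronized boolean allowRequest(long nowMs) {
        return openedAtMs < 0 || nowMs - openedAtMs >= sleepWindowMs;
    }

    synchronized void record(boolean success, long nowMs) {
        if (openedAtMs >= 0) {
            if (success) {
                // The trial request worked: close the circuit and reset counts.
                openedAtMs = -1;
                requests = 0;
                failures = 0;
            } else {
                // The trial request failed: stay open for another sleep window.
                openedAtMs = nowMs;
            }
            return;
        }
        requests++;
        if (!success) failures++;
        if (requests >= minimumRequests && (double) failures / requests >= errorThreshold) {
            openedAtMs = nowMs;
        }
    }

    <T> T execute(Supplier<T> call, Supplier<T> fallback, long nowMs) {
        if (!allowRequest(nowMs)) {
            return fallback.get(); // short-circuit: no pressure on the dependency
        }
        try {
            T result = call.get();
            record(true, nowMs);
            return result;
        } catch (RuntimeException e) {
            record(false, nowMs);
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        CircuitBreakerSketch breaker = new CircuitBreakerSketch(4, 0.5, 1000);
        String result = breaker.execute(() -> "live", () -> "fallback", System.currentTimeMillis());
        System.out.println(result);
    }
}
```

Because each instance keeps its own state, circuits open and close independently across a cluster, which matches the rolling behaviour the slides describe.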
  140. Cluster adapts Failure Isolated The cluster naturally adapts as bulkheads

    constrain throughput and circuits open and close in a rolling manner across the instances in the cluster.
  141. In this example the ‘CinematchGetPredictions’ functionality began failing.

  142. The red metric shows it was exceptions thrown by the

    client, not latency or concurrency constraints.
  143. The 20% error rate from the realtime visualization is also

    seen in the historical metrics with accompanying drop in successes.
  144. Matching the increase in failures is the increase of fallbacks

    being delivered for every failure.
  145. We found that historical metrics with 1 datapoint per minute

    and 1-2 minutes of latency were not sufficient during operational events such as deployments, rollbacks, production alerts and configuration changes, so we built near-realtime monitoring and data visualizations with ~1-2 second latency. Read more at https://github.com/Netflix/Hystrix/wiki/Dashboard
  146. Note: This is a mockup Data visualizations and real-time monitoring

    have proven useful enough that a more comprehensive suite of tools is being built. Read more at: http://techblog.netflix.com/2014/01/improving-netflixs-operational.html
  148. “…complex systems run as broken systems. The system continues to

    function because it contains so many redundancies and because people can make it function, despite the presence of many flaws.” – Richard I. Cook, How Complex Systems Fail: http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf
  149. Netflix Tech Blog: http://techblog.netflix.com

    Netflix Open Source: http://netflix.github.io
    Functional Reactive in the Netflix API with RxJava: http://techblog.netflix.com/2013/02/rxjava-netflix-api.html
    Application Resilience in a Service-oriented Architecture: http://programming.oreilly.com/2013/06/application-resilience-in-a-service-oriented-architecture.html
    jobs.netflix.com
    Ben Christensen, @benjchristensen, http://www.linkedin.com/in/benjchristensen