Resilient by Design at React SF 2014

To operate 24/7, an application must embrace constant change and failure. This kind of resiliency is achievable through the application of reactive design principles. Learn the theory via real-world examples at Netflix, along with some lessons learned the hard way in production. Topics of interest include service-oriented architectures (microservices), cloud computing, where to put application state, hot deployments, bulkheading, circuit breakers, graceful degradation, operational tooling, and how application architecture affects resilience.

Presented at React Conf 2014 in San Francisco http://reactconf.com

Video: http://youtu.be/MEgyGamo79I

Ben Christensen

November 18, 2014

Transcript

  1. Ben Christensen Developer – Edge Engineering at Netflix @benjchristensen http://techblog.netflix.com/

    React San Francisco - November 2014 Resilient By Design
  2. “the explosive growth of software has added greatly to systems’

    interactive complexity. With software, the possible states that a system can end up in become mind-boggling.” Dekker, Sidney (2012-10-01). Drift into Failure (Kindle Locations 3268-3270). Ashgate Publishing. Kindle Edition. – Sidney Dekker, Drift into Failure
  3. “We can model and understand in isolation. But, when released

    into competitive, nominally regulated societies, their connections proliferate, their interactions and interdependencies multiply, their complexities mushroom. And we are caught short.” Dekker, Sidney (2012-10-01). Drift into Failure (Kindle Locations 290-292). Ashgate Publishing. Kindle Edition. – Sidney Dekker, Drift into Failure
  4. Cache Origin Servers Cache Cache Read-through Cache

  5. Cache Origin Servers Cache Cache low ~1% cache miss rate

  6. Cache Origin Servers Cache Cache reads through to origin

  7. Cache Origin Servers Cache Cache writes back to cache

  8. Cache Origin Servers Cache Cache lose a cache shard

  9. Cache Origin Servers Cache Cache normal 1% cache miss rate

    becomes 10% … 30% … origin is overwhelmed
  10. Cache for Performance Becomes an Availability Concern Cache Origin Servers

    Cache Cache
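
To make the cache slides concrete, here is a minimal read-through cache sketched in Java. This is an illustration, not Netflix code; the cache map and the origin function are invented stand-ins. The point is that the origin only ever sees cache misses, so its capacity is typically provisioned for roughly the 1% miss rate, and a lost cache shard multiplies traffic on exactly this path.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.function.Function;

    // Illustrative read-through cache: reads fall through to origin on a miss
    // and the result is written back so later reads hit the cache.
    public class ReadThroughCache<K, V> {
        private final ConcurrentMap<K, V> cache = new ConcurrentHashMap<>();
        private final Function<K, V> origin; // hypothetical loader calling the origin servers

        public ReadThroughCache(Function<K, V> origin) {
            this.origin = origin;
        }

        public V get(K key) {
            V value = cache.get(key);
            if (value != null) {
                return value;              // healthy case: ~99% of reads stop here
            }
            value = origin.apply(key);     // cache miss reads through to origin
            cache.put(key, value);         // write back to cache
            return value;
        }
    }
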
  11. Multiple Dependencies

  12. Allowing One To Break User Experience

  13. Transitive Failure

  14. Sticky Sessions

  15. Complicate Fault Tolerance & Scaling

  16. Feature Complete!

  17. … hmmm … resilience?

  18. We Must Design For Resilience

  19. Source: http://reich-chemistry.wikispaces.com/file/view/liquid_thorium_reactor_large.jpg/245978425/616x547/liquid_thorium_reactor_large.jpg

  20. Source: http://reich-chemistry.wikispaces.com/file/view/liquid_thorium_reactor_large.jpg/245978425/616x547/liquid_thorium_reactor_large.jpg

  21. "LFTRs (liquid fluoride thorium reactor) also have excellent safety features.

    My favorite is the use of a ‘plug’ which would melt if the molten mass got too hot for any reason, draining it away into a protected lower tank which would stop any fissioning and cool the whole lot down. It’s a clever idea: the plug is a frozen wedge of salt in a pipe at the bottom of the core tank, cooled by an external fan. If power is lost for some reason which might threaten to overheat the LFTR, the fan stops, the plug melts, and the salts all drain away. The fuel can’t melt down for the straightforward reason that it is already molten. No China Syndrome here." – Mark Lynas, Nuclear 2.0
  22. "LFTRs (liquid fluoride thorium reactor) also have excellent safety features.

    My favorite is the use of a ‘plug’ which would melt if the molten mass got too hot for any reason, draining it away into a protected lower tank which would stop any fissioning and cool the whole lot down. It’s a clever idea: the plug is a frozen wedge of salt in a pipe at the bottom of the core tank, cooled by an external fan. If power is lost for some reason which might threaten to overheat the LFTR, the fan stops, the plug melts, and the salts all drain away. The fuel can’t melt down for the straightforward reason that it is already molten. No China Syndrome here." – Mark Lynas, Nuclear 2.0
  23. "LFTRs (liquid fluoride thorium reactor) also have excellent safety features.

    My favorite is the use of a ‘plug’ which would melt if the molten mass got too hot for any reason, draining it away into a protected lower tank which would stop any fissioning and cool the whole lot down. It’s a clever idea: the plug is a frozen wedge of salt in a pipe at the bottom of the core tank, cooled by an external fan. If power is lost for some reason which might threaten to overheat the LFTR, the fan stops, the plug melts, and the salts all drain away. The fuel can’t melt down for the straightforward reason that it is already molten. No China Syndrome here." – Mark Lynas, Nuclear 2.0
  24. Source: http://reich-chemistry.wikispaces.com/file/view/liquid_thorium_reactor_large.jpg/245978425/616x547/liquid_thorium_reactor_large.jpg

  25. “System operations are dynamic, with components (organizational, human, technical) failing

    and being replaced continuously.” – Richard Cook, How Complex Systems Fail Richard I. Cook - How Complex Systems Fail - http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf
  26. None
  27. None
  28. None
  29. AWS Availability Zone AWS Availability Zone AWS Availability Zone

  30. None
  31. None
  32. None
  33. None
  34. None
  35. None
  36. User Request Dependency A Dependency D Dependency G Dependency J

    Dependency M Dependency P Dependency B Dependency E Dependency H Dependency K Dependency N Dependency Q Dependency C Dependency F Dependency I Dependency L Dependency O Dependency R
  37. “Overt catastrophic failure occurs when small, apparently innocuous failures join

    to create opportunity for a systemic accident.” – Richard Cook, How Complex Systems Fail Richard I. Cook - How Complex Systems Fail - http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf
  38. None
  39. None
  40. None
  41. None
  42. None
  43. User Request Dependency A Dependency D Dependency G Dependency J

    Dependency M Dependency P Dependency B Dependency E Dependency H Dependency K Dependency N Dependency Q Dependency C Dependency F Dependency I Dependency L Dependency O Dependency R User request blocked by latency in single network call
  44. At high volume all request threads can block in seconds

    User Request Dependency A Dependency D Dependency G Dependency J Dependency M Dependency P Dependency B Dependency E Dependency H Dependency K Dependency N Dependency Q Dependency C Dependency F Dependency I Dependency L Dependency O Dependency R User Request User Request User Request User Request User Request User Request …
  45. User Request Dependency A Dependency D Dependency G Dependency J

    Dependency M Dependency P Dependency B Dependency E Dependency H Dependency K Dependency N Dependency Q Dependency C Dependency F Dependency I Dependency L Dependency O Dependency R User Request User Request User Request User Request User Request User Request … At high volume all request threads can block in seconds
  46. Dependency D Dependency G Dependency J Dependency M Dependency B

    Dependency E Dependency H Dependency K Dependency N Dependency C Dependency F Dependency I Dependency L Dependency O User Request User Request User Request … Network Request - TCP/HTTP, latency, 4xx, 5xx, etc Deserialization - JSON/XML/Thrift/Protobuf/etc Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc Serialization - URL and/or body generation Logic - validation, decoration, object model, caching, metrics, logging, etc
  47. "Timeout guard" daemon prio=10 tid=0x00002aaacd5e5000 nid=0x3aac runnable [0x00002aaac388f000] java.lang.Thread.State: RUNNABLE

    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    - locked <0x000000055c7e8bd8> (a java.net.SocksSocketImpl)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
    at java.net.Socket.connect(Socket.java:579)
    at java.net.Socket.connect(Socket.java:528)
    at java.net.Socket.<init>(Socket.java:425)
    at java.net.Socket.<init>(Socket.java:280)
    at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
    at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$1.doit(ControllerThreadSocketFactory.java:91)
    at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$SocketTask.run(ControllerThreadSocketFactory.java:158)
    at java.lang.Thread.run(Thread.java:722)
    [Sat Jun 30 04:01:37 2012] [error] proxy: HTTP: disabled connection for (127.0.0.1)
    > 80% of requests rejected
    Median Latency
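
The blocked frame in that thread dump is a socket connect with no deadline. As a hedged illustration of the aggressive timeouts argued for later in the deck, the plain java.net sketch below bounds both the connect and the read; the host, port, and timeout values are made up for the example.

    import java.io.InputStream;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    // Illustrative only: a bounded network call so a slow dependency cannot pin
    // a request thread indefinitely, as happened in the thread dump above.
    public class BoundedSocketCall {
        public static byte[] call(String host, int port, byte[] request) throws Exception {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), 300); // connect timeout (ms)
                socket.setSoTimeout(1000);                              // per-read timeout (ms)
                socket.getOutputStream().write(request);
                InputStream in = socket.getInputStream();
                byte[] buffer = new byte[8192];
                int read = in.read(buffer);                             // SocketTimeoutException if it stalls
                byte[] response = new byte[Math.max(read, 0)];
                System.arraycopy(buffer, 0, response, 0, response.length);
                return response;
            }
        }
    }
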
  48. “Overt catastrophic failure occurs when small, apparently innocuous failures join

    to create opportunity for a systemic accident.” – Richard Cook, How Complex Systems Fail Richard I. Cook - How Complex Systems Fail - http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf
  49. User Request Dependency A Dependency D Dependency G Dependency J

    Dependency M Dependency P Dependency B Dependency E Dependency H Dependency K Dependency N Dependency Q Dependency C Dependency F Dependency I Dependency L Dependency O Dependency R
  50. None
  51. None
  52. User Request Dependency A Dependency D Dependency G Dependency J

    Dependency M Dependency P Dependency B Dependency E Dependency H Dependency K Dependency N Dependency Q Dependency C Dependency F Dependency I Dependency L Dependency O Dependency R
  53. User Request Dependency A Dependency D Dependency G Dependency J

    Dependency M Dependency P Dependency B Dependency E Dependency H Dependency K Dependency N Dependency Q Dependency C Dependency F Dependency I Dependency L Dependency O Dependency R
  54. User Request Dependency A Dependency D Dependency G Dependency J

    Dependency M Dependency P Dependency B Dependency E Dependency H Dependency K Dependency N Dependency Q Dependency C Dependency F Dependency I Dependency L Dependency O Dependency R
  55. None
  56. Logic - validation, decoration, object model, caching, metrics, logging, etc

    Deserialization - JSON/XML/Thrift/Protobuf/etc Network Request - TCP/HTTP, latency, 4xx, 5xx, etc Serialization - URL and/or body generation Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc
  57. Tryable Semaphore Rejected Permitted Logic - validation, decoration, object model,

    caching, metrics, logging, etc Deserialization - JSON/XML/Thrift/Protobuf/etc Network Request - TCP/HTTP, latency, 4xx, 5xx, etc Serialization - URL and/or body generation Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc
  58. Thread-pool Rejected Permitted Logic - validation, decoration, object model, caching,

    metrics, logging, etc Deserialization - JSON/XML/Thrift/Protobuf/etc Network Request - TCP/HTTP, latency, 4xx, 5xx, etc Serialization - URL and/or body generation Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc Timeout with non-blocking IO
  59. Thread-pool Rejected Permitted Logic - validation, decoration, object model, caching,

    metrics, logging, etc Deserialization - JSON/XML/Thrift/Protobuf/etc Network Request - TCP/HTTP, latency, 4xx, 5xx, etc Serialization - URL and/or body generation Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc Timeout with blocking IO
  60. Bulkhead – Limit Concurrency: tryable semaphores for non-blocking clients and fallbacks; separate threads for blocking clients. Release Pressure: aggressive timeouts to “give up and move on”; circuit breakers as the “release valve”.
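
A sketch of how those knobs map onto Hystrix configuration. The command name, group key, and numbers are invented for illustration; the setter methods are the ones documented on the Hystrix wiki, though exact names vary slightly between releases.

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;
    import com.netflix.hystrix.HystrixCommandProperties;
    import com.netflix.hystrix.HystrixThreadPoolProperties;

    // Illustrative command: thread-pool bulkhead for a blocking client, an aggressive
    // timeout to "give up and move on", and circuit-breaker thresholds as the release valve.
    public class GetAccountCommand extends HystrixCommand<String> {

        private final String accountId;

        public GetAccountCommand(String accountId) {
            super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("AccountClient"))
                    .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                            .withExecutionIsolationThreadTimeoutInMilliseconds(250) // give up and move on
                            .withCircuitBreakerRequestVolumeThreshold(20)           // minimum volume before tripping
                            .withCircuitBreakerErrorThresholdPercentage(50)         // trip at 50% errors
                            .withCircuitBreakerSleepWindowInMilliseconds(5000))     // let one request through after 5s
                    .andThreadPoolPropertiesDefaults(HystrixThreadPoolProperties.Setter()
                            .withCoreSize(10)));                                    // bulkhead: at most 10 concurrent calls
            this.accountId = accountId;
        }

        @Override
        protected String run() {
            return blockingAccountClient(accountId);  // stand-in for a real blocking client
        }

        @Override
        protected String getFallback() {
            return "anonymous";                       // degrade gracefully instead of failing the request
        }

        private String blockingAccountClient(String id) {
            return "account-" + id;                   // placeholder for the network call
        }
    }

For a non-blocking client the same properties object would instead select the semaphore bulkhead, e.g. withExecutionIsolationStrategy(ExecutionIsolationStrategy.SEMAPHORE) plus withExecutionIsolationSemaphoreMaxConcurrentRequests(…).
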
  61. HystrixCommand run()

    public class CommandHelloWorld extends HystrixCommand<String> {
        ...
        protected String run() {
            return "Hello " + name + "!";
        }
    }
  62. HystrixCommand run() – run() invokes “client” logic

    public class CommandHelloWorld extends HystrixCommand<String> {
        ...
        protected String run() {
            return "Hello " + name + "!";
        }
    }
  63. HystrixCommand run() throw Exception Fail Fast

  64. Fail Silent: HystrixCommand run() getFallback()

    return null;
    return new Option<T>();
    return Collections.emptyList();
    return Collections.emptyMap();
  65. Static Fallback: HystrixCommand run() getFallback()

    return true;
    return DEFAULT_OBJECT;

  66. Stubbed Fallback: HystrixCommand run() getFallback()

    return new UserAccount(customerId, "Unknown Name",
            countryCodeFromGeoLookup, true, true, false);
    return new VideoBookmark(movieId, 0);
  67. Stubbed Fallback: HystrixCommand run() getFallback()

    public class CommandHelloWorld extends HystrixCommand<String> {
        ...
        protected String run() {
            return "Hello " + name + "!";
        }
        protected String getFallback() {
            return "Hello Failure " + name + "!";
        }
    }
  68. Stubbed Fallback: HystrixCommand run() getFallback()

    public class CommandHelloWorld extends HystrixCommand<String> {
        ...
        protected String run() {
            return "Hello " + name + "!";
        }
        protected String getFallback() {
            return "Hello Failure " + name + "!";
        }
    }
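
Putting those fragments together, a complete runnable version looks roughly like the following; the constructor and the "ExampleGroup" key follow the Hystrix getting-started wiki rather than anything shown in the deck.

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    public class CommandHelloWorld extends HystrixCommand<String> {

        private final String name;

        public CommandHelloWorld(String name) {
            super(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"));
            this.name = name;
        }

        @Override
        protected String run() {
            return "Hello " + name + "!";             // the wrapped "client" logic
        }

        @Override
        protected String getFallback() {
            return "Hello Failure " + name + "!";     // stubbed fallback when run() fails, times out, or is rejected
        }

        public static void main(String[] args) {
            System.out.println(new CommandHelloWorld("World").execute()); // synchronous
            // new CommandHelloWorld("World").queue()   -> Future<String>
            // new CommandHelloWorld("World").observe() -> rx.Observable<String>
        }
    }
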
  69. HystrixCommand run() getFallback() HystrixCommand run() Fallback via network

  70. HystrixCommand run() getFallback() HystrixCommand run() getFallback() Fallback via network then

    Local
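
One way to express “fallback via network, then local”: the primary command’s fallback executes a second command against a different resource (for example a stale remote cache), and that command’s own fallback is a purely local stub. The service, cache, and command names below are hypothetical.

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    // Primary: read from the authoritative service over the network.
    public class GetPreferencesCommand extends HystrixCommand<String> {
        private final String userId;

        public GetPreferencesCommand(String userId) {
            super(HystrixCommandGroupKey.Factory.asKey("PreferencesService"));
            this.userId = userId;
        }

        @Override
        protected String run() {
            return callPreferencesService(userId);                 // hypothetical network call
        }

        @Override
        protected String getFallback() {
            // Fallback via network: a stale copy from a remote cache, itself wrapped in a command.
            return new GetPreferencesFromCacheCommand(userId).execute();
        }

        private String callPreferencesService(String id) { return "prefs-for-" + id; }
    }

    // Secondary: remote cache lookup, with a local static fallback ending the chain.
    class GetPreferencesFromCacheCommand extends HystrixCommand<String> {
        private final String userId;

        GetPreferencesFromCacheCommand(String userId) {
            super(HystrixCommandGroupKey.Factory.asKey("PreferencesCache"));
            this.userId = userId;
        }

        @Override
        protected String run() {
            return callRemoteCache(userId);                        // hypothetical cache call
        }

        @Override
        protected String getFallback() {
            return "default-preferences";                          // fallback via local stub
        }

        private String callRemoteCache(String id) { return "cached-prefs-for-" + id; }
    }
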
  71. Transitive Failure

  72. Transitive Failure with Bulkheads & Fallbacks

  73. All Relationships

  74. State State State State Application State?

  75. State State State State Cluster Replication (and similar approaches)

  76. State State State State State State All Instances Are Now

    Stateful
  77. State State State State State State This Can Be Done

  78. State State State State State State But Doesn’t Need To

    Be State State State State State State
  79. So Where To Put State?

  80. State State State State State Stateful Client

  81. State State State State State Cache Cache Ephemeral Cache (e.g.

    memcached, redis, etc)
  82. State State State State State Cache Cache Cache Database Database

    (SQL, key-value, etc)
  83. State State State State State Cache Cache Cache Database Database

    (generally ends up here anyway)
  84. State State State State State Cache Cache Cache Database Why?

    Isn’t this more complicated?
  85. Cache Cache Database Database Bounded Context

  86. Cache Cache Database Database Despite more parts, it simplifies ownership,

    operations, reasoning, deployments, and failure modes. Most systems focus on logic and behavior with simple operations.
  87. Cache Cache Database Database Few focus on durability and state

    and the increased operational challenges and costs. Despite more parts, it simplifies ownership, operations, reasoning, deployments, and failure modes. Most systems focus on logic and behavior with simple operations.
  88. State An example …

  89. State Cookie Identity is a critical service. Client state in the

    cookie allows a reasonable fallback even if the entire Identity service fails.
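
A hedged sketch of that idea; the Identity service, cookie fields, and class names below are invented to illustrate the pattern, not Netflix’s implementation. The authoritative identity comes from the service, but enough of it is mirrored into the (verified) cookie that the fallback can construct a usable identity without a network call.

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;
    import java.util.Map;

    // Illustrative only: identity normally comes from the Identity service; the cookie
    // carries a signed subset of the same state so the fallback never touches the network.
    public class GetIdentityCommand extends HystrixCommand<Identity> {

        private final Map<String, String> cookieValues; // parsed and signature-checked cookie

        public GetIdentityCommand(Map<String, String> cookieValues) {
            super(HystrixCommandGroupKey.Factory.asKey("IdentityService"));
            this.cookieValues = cookieValues;
        }

        @Override
        protected Identity run() {
            return callIdentityService(cookieValues.get("customerId")); // hypothetical network call
        }

        @Override
        protected Identity getFallback() {
            // Degrade to client-held state: keeps the user signed in even if the
            // entire Identity service is unavailable.
            return new Identity(cookieValues.get("customerId"),
                                cookieValues.get("countryCode"),
                                true /* fromFallback */);
        }

        private Identity callIdentityService(String customerId) {
            return new Identity(customerId, "US", false);              // placeholder response
        }
    }

    class Identity {
        final String customerId;
        final String countryCode;
        final boolean fromFallback;

        Identity(String customerId, String countryCode, boolean fromFallback) {
            this.customerId = customerId;
            this.countryCode = countryCode;
            this.fromFallback = fromFallback;
        }
    }
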
  90. “In complex systems, decision-makers are locally rather than globally rational.

    But that doesn’t mean that their decisions cannot lead to global, or system-wide events. In fact, that is one of the properties of complex systems: local actions can have global results.” Dekker, Sidney (2012-10-01). Drift into Failure (Kindle Locations 3268-3270). Ashgate Publishing. Kindle Edition. – Sidney Dekker, Drift into Failure
  91. “In complex systems, decision-makers are locally rather than globally rational.

    But that doesn’t mean that their decisions cannot lead to global, or system-wide events. In fact, that is one of the properties of complex systems: local actions can have global results.” Dekker, Sidney (2012-10-01). Drift into Failure (Kindle Locations 3268-3270). Ashgate Publishing. Kindle Edition. – Sidney Dekker, Drift into Failure
  92. Load Shedding → Retry Storms

  93. Cache Shard Failure → DDOS Origin

  94. Dynamic Property Change → Saturate All CPUs

  95. Reactive Scaling → Scale Down During Outage → Overwhelmed By

    Thundering Herd
  96. Reactive Scaling → Scale Down During Superbowl → Overwhelmed By

    Thundering Herd
  97. Achieve Resilience → Neglect → Drift → Vulnerability

  98. "Failure Recovery must be a very simple path and that

    path must be tested frequently" https://www.usenix.org/legacy/event/lisa07/tech/full_papers/hamilton/hamilton_html/ – James Hamilton
  99. None
  100. None
  101. AWS Availability Zone AWS Availability Zone AWS Availability Zone

  102. None
  103. None
  104. Auditing via Simulation

  105. Auditing via Simulation

  106. Auditing via Simulation

  107. None
  108. 125 → 1500+

  109. ~5000

  110. ~1

  111. None
  112. None
  113. None
  114. None
  115. None
  116. None
  117. Constantly Changing

  118. None
  119. Zuul Routing Layer Canary vs Baseline Squeeze Production "Coalmine"

  120. Zuul Routing Layer Canary vs Baseline Squeeze Production "Coalmine"

  121. Zuul Routing Layer Canary vs Baseline Squeeze Production "Coalmine"

  122. None
  123. None
  124. None
  125. None
  126. None
  127. None
  128. None
  129. Zuul Routing Layer Canary vs Baseline Squeeze Production "Coalmine"

  130. Zuul Routing Layer Canary vs Baseline Squeeze Production "Coalmine"

  131. User Request Dependency A Dependency D Dependency G Dependency J

    Dependency M Dependency P Dependency B Dependency E Dependency H Dependency K Dependency N Dependency Q Dependency C Dependency F Dependency I Dependency L Dependency O Dependency R System Relationship Over Network without Bulkhead
  132. Zuul Routing Layer Canary vs Baseline Squeeze Production "Coalmine"

  133. Failure inevitably happens ...

  134. Cluster adapts Failure Isolated

  135. Cluster adapts Failure Isolated

  136. Cluster adapts Failure Isolated

  137. Cluster adapts Failure Isolated

  138. None
  139. None
  140. None
  141. None
  142. None
  143. Note: This is a mockup

  144. Note: This is a mockup

  145. “…complex systems run as broken systems. The system continues to

    function because it contains so many redundancies and because people can make it function, despite the presence of many flaws.” Richard I. Cook - How Complex Systems Fail - http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf – Richard Cook, How Complex Systems Fail
  146. Where to next?

  147. Low Latency Anomaly Detection

  148. Automate Configuration?

  149. Global vs Regional Deployment

  150. Servers as Pets → Herds (Clusters) Clusters as Pets →

    Herds (Global Application)
  151. Human Involvement

  152. Assert Production Readiness?

  153. We have long believed that 80% of operations issues originate

    in design and development, so this section on overall service design is the largest and most important. When systems fail, there is a natural tendency to look first to operations since that is where the problem actually took place. Most operations issues, however, either have their genesis in design and development or are best solved there. https://www.usenix.org/legacy/event/lisa07/tech/full_papers/hamilton/hamilton_html/ – James Hamilton
  154. We have long believed that 80% of operations issues originate

    in design and development, so this section on overall service design is the largest and most important. When systems fail, there is a natural tendency to look first to operations since that is where the problem actually took place. Most operations issues, however, either have their genesis in design and development or are best solved there. https://www.usenix.org/legacy/event/lisa07/tech/full_papers/hamilton/hamilton_html/ – James Hamilton
  155. Resilience is by Design

  156. Ben Christensen @benjchristensen jobs.netflix.com Fault Tolerance in a High Volume,

    Distributed System http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html Hystrix https://github.com/Netflix/Hystrix/wiki Drift Into Failure http://www.amazon.com/Drift-into-Failure-Sidney-Dekker/dp/1409422216 Release It! http://www.amazon.com/Release-It-Production-Ready-Pragmatic-Programmers/dp/0978739213