Resilient by Design at React SF 2014

In order to operate 24/7, an application must embrace constant change and failure. This kind of resiliency is achievable through the application of reactive design principles. Learn the theory via real-world examples at Netflix, along with some lessons learned the hard way in production. Topics of interest will include service-oriented architectures (microservices), cloud computing, where to put application state, hot deployments, bulkheading, circuit breakers, degrading gracefully, operational tooling, and how application architecture affects resilience.

Presented at React Conf 2014 in San Francisco http://reactconf.com

Video: http://youtu.be/MEgyGamo79I

Ben Christensen

November 18, 2014

Transcript

  1. 2.

    “the explosive growth of software has added greatly to systems’ interactive complexity. With software, the possible states that a system can end up in become mind-boggling.”
    – Sidney Dekker, Drift into Failure (Ashgate Publishing, 2012), Kindle Locations 3268-3270
  2. 3.

    “We can model and understand in isolation. But, when released into competitive, nominally regulated societies, their connections proliferate, their interactions and interdependencies multiply, their complexities mushroom. And we are caught short.”
    – Sidney Dekker, Drift into Failure (Ashgate Publishing, 2012), Kindle Locations 290-292
  3. 9.

    [Diagram: three caches in front of the origin servers]
    Normal 1% cache miss rate becomes 10% … 30% … origin is overwhelmed
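The arithmetic behind this slide can be sketched in a few lines. Only the miss rates (1%, 10%, 30%) come from the slide; the total request rate of 100,000 requests per second and the class name `OriginLoad` are assumptions chosen for illustration.

```java
/** Sketch: how origin load scales with the cache miss rate. */
final class OriginLoad {
    /** Requests per second that fall through the cache and reach the origin. */
    static double originRps(double totalRps, double missRate) {
        if (missRate < 0.0 || missRate > 1.0) {
            throw new IllegalArgumentException("missRate must be in [0, 1]");
        }
        return totalRps * missRate;
    }

    public static void main(String[] args) {
        double totalRps = 100_000; // assumed edge traffic, not a figure from the talk
        for (double missRate : new double[] {0.01, 0.10, 0.30}) {
            System.out.printf("miss rate %.0f%% -> origin sees %.0f rps%n",
                    missRate * 100, originRps(totalRps, missRate));
        }
    }
}
```

The point of the sketch is that origin load is linear in the miss rate: an origin sized for a 1% miss rate faces 10 times its normal load at 10% and 30 times at 30%, even though user traffic has not changed.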
  4. 21.

    "LFTRs (liquid fluoride thorium reactor) also have excellent safety features. My favorite is the use of a ‘plug’ which would melt if the molten mass got too hot for any reason, draining it away into a protected lower tank which would stop any fissioning and cool the whole lot down. It’s a clever idea: the plug is a frozen wedge of salt in a pipe at the bottom of the core tank, cooled by an external fan. If power is lost for some reason which might threaten to overheat the LFTR, the fan stops, the plug melts, and the salts all drain away. The fuel can’t melt down for the straightforward reason that it is already molten. No China Syndrome here."
    – Mark Lynas, Nuclear 2.0
  7. 25.

    “System operations are dynamic, with components (organizational, human, technical) failing and being replaced continuously.”
    – Richard I. Cook, How Complex Systems Fail, http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf
  17. 36.

    [Diagram: a user request fanning out to Dependencies A through R]
  18. 37.

    “Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident.”
    – Richard I. Cook, How Complex Systems Fail, http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf
  24. 43.

    [Diagram: a user request fanning out to Dependencies A through R]
    User request blocked by latency in single network call
  25. 44.

    At high volume all request threads can block in seconds
    [Diagram: many concurrent user requests fanning out to Dependencies A through R]
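A back-of-the-envelope calculation shows why "seconds" is realistic. Every number below (200 request threads, 1,000 requests per second, 10% of requests touching the latent dependency) is a hypothetical chosen for illustration, not a figure from the talk. The sketch also assumes that calls to the latent dependency do not return before the pool is exhausted.

```java
/** Sketch: how quickly a latent dependency can consume a request thread pool. */
final class ThreadExhaustion {
    /**
     * Seconds until every request thread is blocked, assuming calls to the
     * latent dependency do not return within that window.
     */
    static double secondsToExhaust(int poolSize, double requestsPerSecond,
                                   double fractionHittingDependency) {
        double blockedPerSecond = requestsPerSecond * fractionHittingDependency;
        if (blockedPerSecond <= 0) {
            return Double.POSITIVE_INFINITY; // nothing ever blocks
        }
        return poolSize / blockedPerSecond;
    }

    public static void main(String[] args) {
        // Hypothetical server: 200 threads, 1,000 rps, 10% of requests hit the slow dependency.
        double seconds = secondsToExhaust(200, 1_000, 0.10);
        System.out.printf("all request threads blocked after about %.1f s%n", seconds);
    }
}
```

Under these assumptions 100 threads per second become stuck, so a 200-thread pool is gone in roughly two seconds, and requests that never touch the slow dependency are blocked along with the rest.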
  27. 46.

    [Diagram: user requests fanning out to dependencies (labels cropped in the source), annotated with the layers of a dependency client]
    Logic - validation, decoration, object model, caching, metrics, logging, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Serialization - URL and/or body generation
    Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc
  28. 47.

    "Timeout guard" daemon prio=10 tid=0x00002aaacd5e5000 nid=0x3aac runnable [0x00002aaac388f000]
       java.lang.Thread.State: RUNNABLE
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
        - locked <0x000000055c7e8bd8> (a java.net.SocksSocketImpl)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
        at java.net.Socket.connect(Socket.java:579)
        at java.net.Socket.connect(Socket.java:528)
        at java.net.Socket.<init>(Socket.java:425)
        at java.net.Socket.<init>(Socket.java:280)
        at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
        at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$1.doit(ControllerThreadSocketFactory.java:91)
        at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$SocketTask.run(ControllerThreadSocketFactory.java:158)
        at java.lang.Thread.run(Thread.java:722)

    [Sat Jun 30 04:01:37 2012] [error] proxy: HTTP: disabled connection for (127.0.0.1)

    [Chart labels: > 80% of requests rejected; Median Latency]
  29. 48.

    “Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident.”
    – Richard I. Cook, How Complex Systems Fail, http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf
  30. 49.

    [Diagram: a user request fanning out to Dependencies A through R]
  33. 52.

    [Diagram: a user request fanning out to Dependencies A through R]
  37. 56.

    Logic - validation, decoration, object model, caching, metrics, logging, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Serialization - URL and/or body generation
    Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc
  38. 57.

    Tryable Semaphore: Rejected / Permitted
    Logic - validation, decoration, object model, caching, metrics, logging, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Serialization - URL and/or body generation
    Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc
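The tryable-semaphore idea on this slide can be sketched with the JDK's `java.util.concurrent.Semaphore`. This is a minimal illustration under assumptions, not Hystrix's implementation: the class name `SemaphoreBulkhead` and the `execute(call, fallback)` shape are hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Sketch of a tryable-semaphore bulkhead; names are hypothetical, not Hystrix's API. */
final class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs the call if a permit is free; otherwise returns the fallback immediately. */
    <T> T execute(Supplier<T> call, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {   // "tryable": never waits for a permit
            return fallback.get();     // Rejected
        }
        try {
            return call.get();         // Permitted
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        SemaphoreBulkhead bulkhead = new SemaphoreBulkhead(1);
        // The outer call holds the only permit, so the nested call is rejected.
        Supplier<String> nested = () -> bulkhead.execute(() -> "permitted", () -> "rejected");
        System.out.println(bulkhead.execute(nested, () -> "outer rejected"));
    }
}
```

Because `tryAcquire()` returns at once instead of queueing, a saturated dependency sheds load on the caller's thread at almost no cost. The semaphore cannot interrupt a call that is already running, which is why the slides pair it with non-blocking clients.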
  39. 58.

    Thread-pool: Rejected / Permitted
    Logic - validation, decoration, object model, caching, metrics, logging, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Serialization - URL and/or body generation
    Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc
    Timeout with non-blocking IO
  40. 59.

    Thread-pool: Rejected / Permitted
    Logic - validation, decoration, object model, caching, metrics, logging, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Serialization - URL and/or body generation
    Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc
    Timeout with blocking IO
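A thread-pool bulkhead with a caller-side timeout might look like the following sketch, built only on JDK classes. The class name `ThreadPoolBulkhead`, the pool size, and the timeout values are assumptions for illustration; this is not how Hystrix is implemented internally.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Sketch of a thread-pool bulkhead for a blocking client; names are hypothetical. */
final class ThreadPoolBulkhead implements AutoCloseable {
    private final ExecutorService pool;
    private final long timeoutMillis;

    ThreadPoolBulkhead(int threads, long timeoutMillis) {
        // SynchronousQueue holds no tasks, so a busy pool rejects instead of queueing.
        this.pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>(), new ThreadPoolExecutor.AbortPolicy());
        this.timeoutMillis = timeoutMillis;
    }

    /** Runs the blocking call on the isolated pool; returns the fallback on rejection, timeout, or failure. */
    <T> T execute(Callable<T> blockingCall, T fallback) {
        Future<T> future;
        try {
            future = pool.submit(blockingCall);
        } catch (RejectedExecutionException e) {
            return fallback;                 // Rejected: pool is saturated
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);             // interrupt the worker; the caller moves on
            return fallback;
        } catch (ExecutionException e) {
            return fallback;                 // the call itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        }
    }

    @Override
    public void close() {
        pool.shutdownNow();
    }

    public static void main(String[] args) {
        try (ThreadPoolBulkhead bulkhead = new ThreadPoolBulkhead(2, 100)) {
            String result = bulkhead.execute(() -> {
                Thread.sleep(5_000);         // simulated latent dependency
                return "late response";
            }, "fallback");
            System.out.println(result);
        }
    }
}
```

The separate pool is what lets the calling thread walk away from blocking IO after the timeout: the worker may stay stuck in a socket call, but only that dependency's threads are consumed. With a zero-capacity queue, a submission made just as a worker finishes can occasionally be rejected, which is acceptable for a component whose job is to shed load.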
  41. 60.

    Bulkhead – Limit Concurrency:
      Tryable semaphores for non-blocking clients and fallbacks
      Separate threads for blocking clients
    Release Pressure:
      Aggressive timeouts to “give up and move on”
      Circuit breakers as the “release valve”
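To make the "release valve" concrete, here is a minimal circuit breaker sketch. It trips after a number of consecutive failures and allows trial requests again after a sleep window. The class name `SimpleCircuitBreaker`, the consecutive-failure rule, and both tuning parameters are assumptions for illustration; this is a deliberate simplification and should not be read as Hystrix's algorithm.

```java
import java.util.function.Supplier;

/** Minimal consecutive-failure circuit breaker; a sketch with hypothetical names. */
final class SimpleCircuitBreaker {
    private final int failureThreshold;   // hypothetical tuning knob
    private final long sleepWindowMillis; // hypothetical tuning knob
    private int consecutiveFailures;
    private boolean open;
    private long openedAtMillis;

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMillis) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
    }

    /** Closed: always allow. Open: allow trial requests only after the sleep window. */
    synchronized boolean allowRequest(long nowMillis) {
        return !open || nowMillis - openedAtMillis >= sleepWindowMillis;
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        open = false;
    }

    synchronized void recordFailure(long nowMillis) {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            open = true;                  // trip, or re-trip after a failed trial
            openedAtMillis = nowMillis;
        }
    }

    <T> T execute(Supplier<T> call, Supplier<T> fallback) {
        if (!allowRequest(System.currentTimeMillis())) {
            return fallback.get();        // short-circuit: the dependency is not called
        }
        try {
            T result = call.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure(System.currentTimeMillis());
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(2, 5_000);
        Supplier<String> failing = () -> { throw new IllegalStateException("dependency down"); };
        for (int i = 0; i < 3; i++) {
            // After two failures the third call is short-circuited to the fallback.
            System.out.println(breaker.execute(failing, () -> "fallback"));
        }
        System.out.println("allowing requests: " + breaker.allowRequest(System.currentTimeMillis()));
    }
}
```

While the breaker is open, callers get the fallback without touching the network, which relieves pressure on both the caller's threads and the struggling dependency. The sketch lets every request through once the sleep window has elapsed; a production breaker would normally limit that to a single trial request.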
  42. 61.

    HystrixCommand run()

        public class CommandHelloWorld extends HystrixCommand<String> {
            ...
            protected String run() {
                return "Hello " + name + "!";
            }
        }
  43. 62.

    HystrixCommand run()
    run() invokes “client” Logic

        public class CommandHelloWorld extends HystrixCommand<String> {
            ...
            protected String run() {
                return "Hello " + name + "!";
            }
        }
  44. 64.

    HystrixCommand run() / getFallback(): Fail Silent

        return null;
        return new Option<T>();
        return Collections.emptyList();
        return Collections.emptyMap();
  45. 66.

    HystrixCommand run() / getFallback(): Stubbed Fallback

        return new UserAccount(customerId, "Unknown Name",
                               countryCodeFromGeoLookup, true, true, false);
        return new VideoBookmark(movieId, 0);
  46. 67.

    HystrixCommand run() / getFallback(): Stubbed Fallback

        public class CommandHelloWorld extends HystrixCommand<String> {
            ...
            protected String run() {
                return "Hello " + name + "!";
            }
            protected String getFallback() {
                return "Hello Failure " + name + "!";
            }
        }
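The slide code elides the class body and depends on the Hystrix library, so it does not run on its own. The sketch below reproduces the run()/getFallback() shape with plain JDK classes to show the two fallback styles from the preceding slides. `SimpleCommand`, `GreetingCommand`, `RecommendationsCommand`, and the `failing` flag are hypothetical stand-ins, not Hystrix classes, and the real HystrixCommand adds isolation, timeouts, and circuit breaking that this sketch omits.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.List;

/** Stand-in for the run()/getFallback() shape; NOT the Hystrix class. */
abstract class SimpleCommand<T> {
    protected abstract T run() throws Exception;

    protected abstract T getFallback();

    /** Executes run(); if it throws, returns getFallback() instead. */
    final T execute() {
        try {
            return run();
        } catch (Exception e) {
            return getFallback();
        }
    }

    public static void main(String[] args) {
        System.out.println(new GreetingCommand("World", false).execute());
        System.out.println(new GreetingCommand("World", true).execute());
        System.out.println(new RecommendationsCommand().execute());
    }
}

/** Stubbed fallback: a degraded but well-formed response. */
final class GreetingCommand extends SimpleCommand<String> {
    private final String name;
    private final boolean failing; // hypothetical switch that simulates an outage

    GreetingCommand(String name, boolean failing) {
        this.name = name;
        this.failing = failing;
    }

    @Override
    protected String run() {
        if (failing) {
            throw new IllegalStateException("dependency unavailable");
        }
        return "Hello " + name + "!";
    }

    @Override
    protected String getFallback() {
        return "Hello Failure " + name + "!";
    }
}

/** Fail silent: the caller sees an empty result instead of an error. */
final class RecommendationsCommand extends SimpleCommand<List<String>> {
    @Override
    protected List<String> run() throws Exception {
        throw new IOException("simulated network failure");
    }

    @Override
    protected List<String> getFallback() {
        return Collections.emptyList();
    }
}
```

Which style fits depends on the caller: an empty list suits optional content such as recommendations, while a stubbed value suits data the rest of the request needs in some form.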
  48. 78.

    [Diagram: components each labeled "State"]
    But Doesn’t Need To Be
  49. 86.

    [Diagram: caches and databases]
    Most systems focus on logic and behavior with simple operations.
    Despite more parts it simplifies ownership, operations, reasoning, deployments, failure modes.
  50. 87.

    [Diagram: caches and databases]
    Most systems focus on logic and behavior with simple operations.
    Few focus on durability and state and increased operational challenges and costs.
    Despite more parts it simplifies ownership, operations, reasoning, deployments, failure modes.
  51. 89.

    [Diagram: client state held in a cookie]
    Identity is a critical service. Client state in a cookie allows a reasonable fallback even if the entire Identity service fails.
  52. 90.

    “In complex systems, decision-makers are locally rather than globally rational. But that doesn’t mean that their decisions cannot lead to global, or system-wide events. In fact, that is one of the properties of complex systems: local actions can have global results.”
    – Sidney Dekker, Drift into Failure (Ashgate Publishing, 2012), Kindle Locations 3268-3270
  54. 98.

    "Failure Recovery must be a very simple path and that path must be tested frequently"
    – James Hamilton, https://www.usenix.org/legacy/event/lisa07/tech/full_papers/hamilton/hamilton_html/
  61. 110.

    ~1

  76. 131.

    [Diagram: a user request fanning out to Dependencies A through R]
    System Relationship Over Network without Bulkhead
  82. 145.

    “…complex systems run as broken systems. The system continues to function because it contains so many redundancies and because people can make it function, despite the presence of many flaws.”
    – Richard I. Cook, How Complex Systems Fail, http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf
  83. 153.

    We have long believed that 80% of operations issues originate in design and development, so this section on overall service design is the largest and most important. When systems fail, there is a natural tendency to look first to operations since that is where the problem actually took place. Most operations issues, however, either have their genesis in design and development or are best solved there.
    – James Hamilton, https://www.usenix.org/legacy/event/lisa07/tech/full_papers/hamilton/hamilton_html/
  85. 156.

    Ben Christensen
    @benjchristensen
    jobs.netflix.com

    Fault Tolerance in a High Volume, Distributed System: http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
    Hystrix: https://github.com/Netflix/Hystrix/wiki
    Drift Into Failure: http://www.amazon.com/Drift-into-Failure-Sidney-Dekker/dp/1409422216
    Release It!: http://www.amazon.com/Release-It-Production-Ready-Pragmatic-Programmers/dp/0978739213