Resilient by Design at React SF 2014

In order to operate 24/7, an application must embrace constant change and failure. This kind of resiliency is achievable through the application of reactive design principles. Learn the theory via real-world examples at Netflix, along with some lessons learned the hard way in production. Topics of interest will include service-oriented architectures (microservices), cloud computing, where to put application state, hot deployments, bulkheading, circuit breakers, degrading gracefully, operational tooling, and how application architecture affects resilience.

Presented at React Conf 2014 in San Francisco http://reactconf.com

Video: http://youtu.be/MEgyGamo79I

Ben Christensen

November 18, 2014

Transcript

  1. “the explosive growth of software has added greatly to systems’ interactive complexity. With software, the possible states that a system can end up in become mind-boggling.” – Sidney Dekker, Drift into Failure (Ashgate Publishing, 2012-10-01, Kindle Edition, Kindle Locations 3268-3270)
  2. “We can model and understand in isolation. But, when released into competitive, nominally regulated societies, their connections proliferate, their interactions and interdependencies multiply, their complexities mushroom. And we are caught short.” – Sidney Dekker, Drift into Failure (Ashgate Publishing, 2012-10-01, Kindle Edition, Kindle Locations 290-292)
  3. [Diagram: caches in front of origin servers] Normal 1% cache miss rate becomes 10% … 30% … origin is overwhelmed
  4. "LFTRs (liquid fluoride thorium reactor) also have excellent safety features. My favorite is the use of a ‘plug’ which would melt if the molten mass got too hot for any reason, draining it away into a protected lower tank which would stop any fissioning and cool the whole lot down. It’s a clever idea: the plug is a frozen wedge of salt in a pipe at the bottom of the core tank, cooled by an external fan. If power is lost for some reason which might threaten to overheat the LFTR, the fan stops, the plug melts, and the salts all drain away. The fuel can’t melt down for the straightforward reason that it is already molten. No China Syndrome here." – Mark Lynas, Nuclear 2.0
  7. “System operations are dynamic, with components (organizational, human, technical) failing and being replaced continuously.” – Richard I. Cook, How Complex Systems Fail, http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf
  8. [Diagram: User Request calling Dependency A through Dependency R]
  9. “Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident.” – Richard I. Cook, How Complex Systems Fail, http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf
  10. [Diagram: User Request calling Dependency A through Dependency R] User request blocked by latency in single network call
  11. [Diagram: many User Requests calling Dependency A through Dependency R] At high volume all request threads can block in seconds
  13. [Diagram: zoomed view of the User Request and dependency graph, with a single dependency call broken into layers]

    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc
    Serialization - URL and/or body generation
    Logic - validation, decoration, object model, caching, metrics, logging, etc
  14. "Timeout guard" daemon prio=10 tid=0x00002aaacd5e5000 nid=0x3aac runnable [0x00002aaac388f000]

       java.lang.Thread.State: RUNNABLE
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
        - locked <0x000000055c7e8bd8> (a java.net.SocksSocketImpl)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
        at java.net.Socket.connect(Socket.java:579)
        at java.net.Socket.connect(Socket.java:528)
        at java.net.Socket.<init>(Socket.java:425)
        at java.net.Socket.<init>(Socket.java:280)
        at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
        at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$1.doit(ControllerThreadSocketFactory.java:91)
        at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$SocketTask.run(ControllerThreadSocketFactory.java:158)
        at java.lang.Thread.run(Thread.java:722)

    [Sat Jun 30 04:01:37 2012] [error] proxy: HTTP: disabled connection for (127.0.0.1)

    [Graph labels: > 80% of requests rejected; Median Latency]
  15. “Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident.” – Richard I. Cook, How Complex Systems Fail, http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf
  16. [Diagram: User Request calling Dependency A through Dependency R]
  20. [Diagram: layers of a single dependency call]

    Logic - validation, decoration, object model, caching, metrics, logging, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Serialization - URL and/or body generation
    Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc
  21. Tryable Semaphore (Rejected / Permitted)

    Logic - validation, decoration, object model, caching, metrics, logging, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Serialization - URL and/or body generation
    Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc
  22. Thread-pool (Rejected / Permitted), Timeout with non-blocking IO

    Logic - validation, decoration, object model, caching, metrics, logging, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Serialization - URL and/or body generation
    Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc
  23. Thread-pool (Rejected / Permitted), Timeout with blocking IO

    Logic - validation, decoration, object model, caching, metrics, logging, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Serialization - URL and/or body generation
    Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc
  24. Bulkhead – Limit Concurrency: Tryable semaphores for non-blocking clients and fallbacks. Separate threads for blocking clients.

    Release Pressure: Aggressive timeouts to “give up and move on”. Circuit breakers as the “release valve”.
  25. HystrixCommand run()

        public class CommandHelloWorld extends HystrixCommand<String> {
            ...
            protected String run() {
                return "Hello " + name + "!";
            }
        }
  26. HystrixCommand run(): run() invokes “client” Logic

        public class CommandHelloWorld extends HystrixCommand<String> {
            ...
            protected String run() {
                return "Hello " + name + "!";
            }
        }
  27. HystrixCommand run() getFallback(): Fail Silent

        return null;
        return new Option<T>();
        return Collections.emptyList();
        return Collections.emptyMap();
  28. HystrixCommand run() getFallback(): Stubbed Fallback

        return new UserAccount(customerId, "Unknown Name",
                               countryCodeFromGeoLookup, true, true, false);
        return new VideoBookmark(movieId, 0);
  29. HystrixCommand run() getFallback(): Stubbed Fallback

        public class CommandHelloWorld extends HystrixCommand<String> {
            ...
            protected String run() {
                return "Hello " + name + "!";
            }
            protected String getFallback() {
                return "Hello Failure " + name + "!";
            }
        }
  31. [Diagram: many boxes, each labeled State] But Doesn’t Need To Be
  32. [Diagram: Cache, Cache, Database, Database] Most systems focus on logic and behavior with simple operations. Despite more parts it simplifies ownership, operations, reasoning, deployments, failure modes.
  33. [Diagram: Cache, Cache, Database, Database] Most systems focus on logic and behavior with simple operations. Few focus on durability and state and increased operational challenges and costs. Despite more parts it simplifies ownership, operations, reasoning, deployments, failure modes.
  34. [Diagram: State, Cookie] Identity is a critical service. Client state in cookie allows a reasonable fallback even if entire Identity service fails.
  35. “In complex systems, decision-makers are locally rather than globally rational. But that doesn’t mean that their decisions cannot lead to global, or system-wide events. In fact, that is one of the properties of complex systems: local actions can have global results.” – Sidney Dekker, Drift into Failure (Ashgate Publishing, 2012-10-01, Kindle Edition, Kindle Locations 3268-3270)
  37. "Failure Recovery must be a very simple path and that path must be tested frequently" – James Hamilton, https://www.usenix.org/legacy/event/lisa07/tech/full_papers/hamilton/hamilton_html/
  38. ~1

  39. [Diagram: User Request calling Dependency A through Dependency R] System Relationship Over Network without Bulkhead
  40. “…complex systems run as broken systems. The system continues to function because it contains so many redundancies and because people can make it function, despite the presence of many flaws.” – Richard I. Cook, How Complex Systems Fail, http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf
  41. “We have long believed that 80% of operations issues originate in design and development, so this section on overall service design is the largest and most important. When systems fail, there is a natural tendency to look first to operations since that is where the problem actually took place. Most operations issues, however, either have their genesis in design and development or are best solved there.” – James Hamilton, https://www.usenix.org/legacy/event/lisa07/tech/full_papers/hamilton/hamilton_html/
  43. Ben Christensen, @benjchristensen, jobs.netflix.com

    Fault Tolerance in a High Volume, Distributed System: http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
    Hystrix: https://github.com/Netflix/Hystrix/wiki
    Drift Into Failure: http://www.amazon.com/Drift-into-Failure-Sidney-Dekker/dp/1409422216
    Release It!: http://www.amazon.com/Release-It-Production-Ready-Pragmatic-Programmers/dp/0978739213
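
Slides 21 and 24 describe a tryable semaphore that either permits or rejects a call, so that a slow dependency cannot absorb every request thread. A minimal sketch of that idea in plain Java follows, using only the JDK's `java.util.concurrent.Semaphore`. The `SemaphoreBulkhead` class and its `execute` method are hypothetical names chosen for illustration; they are not Hystrix's API, and Hystrix's own implementation may differ.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical helper (not part of Hystrix): caps concurrent calls with a tryable semaphore. */
final class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs primary only if a permit is free right now; otherwise returns the fallback without waiting. */
    <T> T execute(Supplier<T> primary, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get();      // "Rejected": tryAcquire() never blocks the caller
        }
        try {
            return primary.get();       // "Permitted": run the client logic
        } catch (RuntimeException e) {
            return fallback.get();      // a failed call also degrades to the fallback
        } finally {
            permits.release();          // always hand the permit back
        }
    }
}

final class SemaphoreBulkheadDemo {
    public static void main(String[] args) {
        SemaphoreBulkhead bulkhead = new SemaphoreBulkhead(1);
        // The outer call holds the only permit, so the nested call is rejected.
        String result = bulkhead.<String>execute(
                () -> "outer:" + bulkhead.<String>execute(() -> "inner", () -> "rejected"),
                () -> "rejected");
        System.out.println(result);     // prints outer:rejected
    }
}
```

The important property is that rejection is immediate: a caller that cannot get a permit takes the fallback path instead of queueing behind a latent dependency.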
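
Slides 22 to 24 pair a separate thread pool with an aggressive timeout for blocking clients. The sketch below shows one way this could look with the JDK's `ThreadPoolExecutor` and `Future.get(timeout)`. The `ThreadPoolBulkhead` name, the pool size, and the timeout values are assumptions made for the example, not values from the talk or from Hystrix.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not part of Hystrix): isolates a blocking client in its own small pool. */
final class ThreadPoolBulkhead implements AutoCloseable {
    private final ExecutorService pool;
    private final long timeoutMillis;

    ThreadPoolBulkhead(int threads, long timeoutMillis) {
        // SynchronousQueue holds no tasks, so a saturated pool rejects instead of queueing.
        this.pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>());
        this.timeoutMillis = timeoutMillis;
    }

    <T> T execute(Callable<T> blockingCall, T fallback) {
        Future<T> future;
        try {
            future = pool.submit(blockingCall);
        } catch (RejectedExecutionException e) {
            return fallback;                        // "Rejected": every pool thread is busy
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);                    // give up and move on; interrupt the worker
            return fallback;
        } catch (ExecutionException e) {
            return fallback;                        // the call itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();     // preserve the caller's interrupt status
            future.cancel(true);
            return fallback;
        }
    }

    @Override
    public void close() {
        pool.shutdownNow();
    }
}

final class ThreadPoolBulkheadDemo {
    public static void main(String[] args) {
        try (ThreadPoolBulkhead bulkhead = new ThreadPoolBulkhead(2, 500)) {
            System.out.println(bulkhead.execute(() -> "fast", "fallback"));
            System.out.println(bulkhead.execute(() -> {
                Thread.sleep(5_000);                // simulated latent dependency
                return "slow";
            }, "fallback"));
        }
    }
}
```

The calling thread waits at most the timeout, whatever the dependency does. The cost is an extra thread hop per call, which is why the slides reserve thread pools for blocking clients and use semaphores for non-blocking ones.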
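
Slide 24 names circuit breakers as the “release valve”. As a rough illustration only, the sketch below trips after a fixed number of consecutive failures and allows a trial call once a wait window has passed. This is a deliberately simplified stand-in and not how Hystrix decides to trip; the `SimpleCircuitBreaker` class, its thresholds, and the injected clock are all hypothetical.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical, simplified breaker (not Hystrix's algorithm): trips on consecutive failures. */
final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock;   // injected so the example can be driven by a fake clock
    private int consecutiveFailures = 0;
    private long openedAt = -1;         // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    // synchronized keeps the sketch simple; it serializes calls, which a real breaker would avoid.
    synchronized <T> T execute(Supplier<T> primary, Supplier<T> fallback) {
        if (openedAt >= 0 && clock.getAsLong() - openedAt < openMillis) {
            return fallback.get();      // open: short-circuit without touching the dependency
        }
        try {
            T result = primary.get();   // closed, or a trial call after the wait window
            consecutiveFailures = 0;
            openedAt = -1;              // success closes the circuit
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = clock.getAsLong();   // trip, or re-trip after a failed trial
            }
            return fallback.get();
        }
    }
}

final class SimpleCircuitBreakerDemo {
    public static void main(String[] args) {
        long[] now = {0};               // fake clock in milliseconds
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(2, 1_000, () -> now[0]);
        Supplier<String> failing = () -> { throw new IllegalStateException("dependency down"); };
        Supplier<String> fallback = () -> "fallback";

        breaker.execute(failing, fallback);                         // failure 1
        breaker.execute(failing, fallback);                         // failure 2 trips the breaker
        System.out.println(breaker.execute(() -> "ok", fallback));  // prints fallback (circuit open)
        now[0] = 1_000;                                             // wait window has elapsed
        System.out.println(breaker.execute(() -> "ok", fallback));  // prints ok (trial call succeeds)
    }
}
```

While the circuit is open, callers get the fallback immediately and the struggling dependency receives no traffic, which gives it room to recover.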