Resilient by Design at React SF 2014

To operate 24/7, an application must embrace constant change and failure. This kind of resiliency is achievable through the application of reactive design principles. Learn the theory via real-world examples from Netflix, along with some lessons learned the hard way in production. Topics of interest include service-oriented architectures (microservices), cloud computing, where to put application state, hot deployments, bulkheading, circuit breakers, degrading gracefully, operational tooling, and how application architecture affects resilience.

Presented at React Conf 2014 in San Francisco http://reactconf.com

Video: http://youtu.be/MEgyGamo79I

Ben Christensen

November 18, 2014


Transcript

  1. Ben Christensen
    Developer – Edge Engineering at Netflix
    @benjchristensen
    http://techblog.netflix.com/
    React San Francisco - November 2014
    Resilient By Design


  2. “the explosive growth of software has added
    greatly to systems’ interactive complexity. With
    software, the possible states that a system can
    end up in become mind-boggling.”
    Dekker, Sidney (2012-10-01). Drift into Failure (Kindle Locations 3268-3270). Ashgate Publishing. Kindle Edition.
    – Sidney Dekker, Drift into Failure


  3. “We can model and understand in isolation.
    But, when released into competitive, nominally
    regulated societies, their connections proliferate,
    their interactions and interdependencies multiply,
    their complexities mushroom.
    And we are caught short.”
    Dekker, Sidney (2012-10-01). Drift into Failure (Kindle Locations 290-292). Ashgate Publishing. Kindle Edition.
    – Sidney Dekker, Drift into Failure


  4. Read-through Cache (diagram: cache shards in front of the origin servers)

  5. Normally a low ~1% cache miss rate.

  6. A miss reads through to the origin.

  7. The result is written back to the cache.

  8. Now lose a cache shard.

  9. The normal 1% cache miss rate becomes 10% … 30% … and the origin is overwhelmed.

  10. Cache for Performance
    Becomes an Availability Concern

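    A minimal Java sketch of the read-through pattern from slides 4–9 (CacheClient and
    OriginClient are hypothetical stand-ins for the cache shards and origin servers, not
    code from the talk):

    import java.util.Optional;

    interface CacheClient {
        Optional<String> get(String key);
        void set(String key, String value);
    }

    interface OriginClient {
        String load(String key);
    }

    class ReadThroughCache {
        private final CacheClient cache;
        private final OriginClient origin;

        ReadThroughCache(CacheClient cache, OriginClient origin) {
            this.cache = cache;
            this.origin = origin;
        }

        String get(String key) {
            // normally ~99% of requests are answered here
            Optional<String> cached = cache.get(key);
            if (cached.isPresent()) {
                return cached.get();
            }
            // a miss reads through to the origin ...
            String value = origin.load(key);
            // ... and writes the result back to the cache; lose a shard and this miss
            // path jumps from ~1% to 10-30% of traffic, overwhelming the origin
            cache.set(key, value);
            return value;
        }
    }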

  11. Multiple Dependencies


  12. Allowing One To
    Break User Experience


  13. Transitive Failure


  14. Sticky Sessions


  15. Complicate Fault Tolerance
    & Scaling


  16. Feature Complete!


  17. … hmmm …
    resilience?


  18. We Must
    Design For Resilience


  19. Source: http://reich-chemistry.wikispaces.com/file/view/liquid_thorium_reactor_large.jpg/245978425/616x547/liquid_thorium_reactor_large.jpg


  20. (previous image repeated)

  21. "LFTRs (liquid fluoride thorium reactor) also have excellent safety
    features. My favorite is the use of a ‘plug’ which would melt if the
    molten mass got too hot for any reason, draining it away into a
    protected lower tank which would stop any fissioning and cool the
    whole lot down. It’s a clever idea: the plug is a frozen wedge of salt
    in a pipe at the bottom of the core tank, cooled by an external fan. If
    power is lost for some reason which might threaten to overheat the
    LFTR, the fan stops, the plug melts, and the salts all drain away. The
    fuel can’t melt down for the straightforward reason that it is already
    molten. No China Syndrome here."
    – Mark Lynas, Nuclear 2.0


  22. "LFTRs (liquid fluoride thorium reactor) also have excellent safety
    features. My favorite is the use of a ‘plug’ which would melt if the
    molten mass got too hot for any reason, draining it away into a
    protected lower tank which would stop any fissioning and cool the
    whole lot down. It’s a clever idea: the plug is a frozen wedge of salt
    in a pipe at the bottom of the core tank, cooled by an external fan. If
    power is lost for some reason which might threaten to overheat the
    LFTR, the fan stops, the plug melts, and the salts all drain away. The
    fuel can’t melt down for the straightforward reason that it is already
    molten. No China Syndrome here."
    – Mark Lynas, Nuclear 2.0

    View Slide

  23. "LFTRs (liquid fluoride thorium reactor) also have excellent safety
    features. My favorite is the use of a ‘plug’ which would melt if the
    molten mass got too hot for any reason, draining it away into a
    protected lower tank which would stop any fissioning and cool the
    whole lot down. It’s a clever idea: the plug is a frozen wedge of salt
    in a pipe at the bottom of the core tank, cooled by an external fan. If
    power is lost for some reason which might threaten to overheat the
    LFTR, the fan stops, the plug melts, and the salts all drain away. The
    fuel can’t melt down for the straightforward reason that it is already
    molten. No China Syndrome here."
    – Mark Lynas, Nuclear 2.0

    View Slide

  24. Source: http://reich-chemistry.wikispaces.com/file/view/liquid_thorium_reactor_large.jpg/245978425/616x547/liquid_thorium_reactor_large.jpg


  25. “System operations are dynamic, with
    components (organizational, human, technical)
    failing and being replaced continuously.”
    – Richard Cook, How Complex Systems Fail
    Richard I. Cook - How Complex Systems Fail - http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf


  26.–28. (image-only slides)

  29. Three AWS Availability Zones (diagram)

  30.–35. (image-only slides)

  36. (diagram: a user request fanning out to dependencies A through R)

  37. “Overt catastrophic failure occurs when small,
    apparently innocuous failures join to create
    opportunity for a systemic accident.”
    – Richard Cook, How Complex Systems Fail
    Richard I. Cook - How Complex Systems Fail - http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf


  38.–42. (image-only slides)

  43. User request blocked by latency in a single network call (dependency fan-out diagram)

  44.–45. At high volume all request threads can block in seconds (many concurrent user requests against the same dependency fan-out)

  46. Layers inside each dependency client call (diagram):
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc
    Serialization - URL and/or body generation
    Logic - validation, decoration, object model, caching, metrics, logging, etc

  47. "Timeout guard" daemon prio=10 tid=0x00002aaacd5e5000 nid=0x3aac runnable [0x00002aaac388f000] java.lang.Thread.State: RUNNABLE
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    - locked <0x000000055c7e8bd8> (a java.net.SocksSocketImpl)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
    at java.net.Socket.connect(Socket.java:579)
    at java.net.Socket.connect(Socket.java:528)
    at java.net.Socket.(Socket.java:425)
    at java.net.Socket.(Socket.java:280)
    at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
    at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$1.doit(ControllerThreadSocketFactory.java:91)
    at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$SocketTask.run(ControllerThreadSocketFactory.java:158) at java.lang.Thread.run(Thread.java:722)
    [Sat Jun 30 04:01:37 2012] [error] proxy: HTTP: disabled connection for (127.0.0.1)
    > 80% of requests rejected
    Median
    Latency

    View Slide

  48. “Overt catastrophic failure occurs when small,
    apparently innocuous failures join to create
    opportunity for a systemic accident.”
    – Richard Cook, How Complex Systems Fail
    Richard I. Cook - How Complex Systems Fail - http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf


  49. (dependency fan-out diagram)

  50.–51. (image-only slides)

  52.–54. (dependency fan-out diagram, repeated)

  55. (image-only slide)

  56. Logic - validation, decoration, object model, caching,
    metrics, logging, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Serialization - URL and/or body generation
    Logic - argument validation, caches, metrics, logging,
    multivariate testing, routing, etc


  57. Tryable Semaphore
    Rejected
    Permitted
    Logic - validation, decoration, object model, caching,
    metrics, logging, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Serialization - URL and/or body generation
    Logic - argument validation, caches, metrics, logging,
    multivariate testing, routing, etc

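    A rough sketch of the tryable-semaphore bulkhead from slide 57, using
    java.util.concurrent.Semaphore (MAX_CONCURRENT, callDependency and fallback are
    illustrative names, not Hystrix internals):

    import java.util.concurrent.Semaphore;

    class SemaphoreBulkhead {
        private static final int MAX_CONCURRENT = 10;   // illustrative limit per dependency
        private final Semaphore permits = new Semaphore(MAX_CONCURRENT);

        String fetch() {
            // tryAcquire never blocks: the call is either permitted or rejected immediately
            if (!permits.tryAcquire()) {
                return fallback();          // rejected: shed load instead of queueing
            }
            try {
                return callDependency();    // permitted: run the (non-blocking) client logic
            } finally {
                permits.release();
            }
        }

        String callDependency() { return "real response"; }     // placeholder client call
        String fallback()       { return "fallback response"; } // placeholder fallback
    }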

  58. Thread-pool
    Rejected
    Permitted
    Logic - validation, decoration, object model, caching,
    metrics, logging, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Serialization - URL and/or body generation
    Logic - argument validation, caches, metrics, logging,
    multivariate testing, routing, etc
    Timeout with non-blocking IO


  59. Thread-pool
    Rejected
    Permitted
    Logic - validation, decoration, object model, caching,
    metrics, logging, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Serialization - URL and/or body generation
    Logic - argument validation, caches, metrics, logging,
    multivariate testing, routing, etc
    Timeout with blocking IO

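    Slides 58–59 in sketch form: a small, dedicated thread pool isolates a blocking client
    and lets the caller time out and walk away (pool size, queue size and timeout values
    are illustrative, not Netflix's settings):

    import java.util.concurrent.*;

    class ThreadPoolBulkhead {
        // bounded pool + bounded queue: saturation rejects instead of piling up callers
        private final ExecutorService pool = new ThreadPoolExecutor(
                10, 10, 0, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<Runnable>(5));

        String fetch() {
            Future<String> future;
            try {
                future = pool.submit(this::callBlockingDependency);
            } catch (RejectedExecutionException rejected) {
                return fallback();   // bulkhead full: reject immediately
            }
            try {
                // aggressive timeout: give up and move on even though the client call blocks
                return future.get(250, TimeUnit.MILLISECONDS);
            } catch (TimeoutException | InterruptedException | ExecutionException failed) {
                future.cancel(true); // try to free the blocked worker thread
                return fallback();   // latency or error: degrade gracefully
            }
        }

        String callBlockingDependency() { return "real response"; }     // placeholder blocking IO
        String fallback()               { return "fallback response"; } // placeholder fallback
    }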

  60. Bulkhead – Limit Concurrency:
    Tryable semaphores for non-blocking clients and fallbacks
    Separate threads for blocking clients
    Release Pressure:
    Aggressive timeouts to “give up and move on”
    Circuit breakers as the “release valve”

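    The "release valve" from slide 60, as a deliberately crude circuit breaker (Hystrix's
    real breaker uses rolling-window error percentages and a single half-open trial request;
    the thresholds here are illustrative):

    import java.util.concurrent.atomic.AtomicInteger;

    class SimpleCircuitBreaker {
        private static final int FAILURE_THRESHOLD = 20;
        private static final long SLEEP_WINDOW_MS = 5000;

        private final AtomicInteger consecutiveFailures = new AtomicInteger();
        private volatile long openedAt = 0;

        boolean allowRequest() {
            if (consecutiveFailures.get() < FAILURE_THRESHOLD) {
                return true;                               // closed: traffic flows
            }
            // open: reject until the sleep window passes, then allow a trial request
            return System.currentTimeMillis() - openedAt > SLEEP_WINDOW_MS;
        }

        void markSuccess() {
            consecutiveFailures.set(0);                    // close the circuit again
        }

        void markFailure() {
            if (consecutiveFailures.incrementAndGet() >= FAILURE_THRESHOLD) {
                openedAt = System.currentTimeMillis();     // trip (or re-trip) open
            }
        }
    }

    Callers check allowRequest() before invoking the dependency, short-circuit to the
    fallback when it returns false, and report outcomes via markSuccess()/markFailure().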

  61. HystrixCommand run()
    public class CommandHelloWorld extends HystrixCommand<String> {
        ...
        protected String run() {
            return "Hello " + name + "!";
        }
    }

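    For context, the complete form of this example (per the Hystrix wiki) adds a constructor
    and command group, and callers can execute it synchronously, asynchronously, or reactively:

    import java.util.concurrent.Future;
    import rx.Observable;
    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    public class CommandHelloWorld extends HystrixCommand<String> {
        private final String name;

        public CommandHelloWorld(String name) {
            super(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"));
            this.name = name;
        }

        @Override
        protected String run() {
            return "Hello " + name + "!";
        }
    }

    // calling the command:
    String s = new CommandHelloWorld("World").execute();             // synchronous
    Future<String> f = new CommandHelloWorld("World").queue();       // asynchronous
    Observable<String> o = new CommandHelloWorld("World").observe(); // reactive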

  62. public class CommandHelloWorld extends HystrixCommand<String> {
        ...
        protected String run() {
            return "Hello " + name + "!";
        }
    }
    run() invokes “client” Logic
    HystrixCommand run()


  63. HystrixCommand run()
    throw Exception
    Fail Fast

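    A fail-fast command adapted from the Hystrix wiki: run() throws and no fallback is
    defined, so the caller gets an exception (wrapped in a HystrixRuntimeException)
    immediately instead of waiting on a broken dependency:

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    public class CommandThatFailsFast extends HystrixCommand<String> {
        private final boolean throwException;

        public CommandThatFailsFast(boolean throwException) {
            super(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"));
            this.throwException = throwException;
        }

        @Override
        protected String run() {
            if (throwException) {
                throw new RuntimeException("failure from CommandThatFailsFast");
            }
            return "success";
        }
        // no getFallback(): failures, timeouts and rejections propagate to the caller
    }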

  64. HystrixCommand run()
    getFallback()
    return null;
    return new Option();
    return Collections.emptyList();
    return Collections.emptyMap();
    Fail Silent

  65. HystrixCommand run()
    getFallback()
    return true;
    return DEFAULT_OBJECT;
    Static Fallback

  66. HystrixCommand run()
    getFallback()
    return new UserAccount(customerId, "Unknown Name",
        countryCodeFromGeoLookup, true, true, false);
    return new VideoBookmark(movieId, 0);
    Stubbed Fallback

  67. HystrixCommand run()
    getFallback()
    public class CommandHelloWorld extends HystrixCommand<String> {
        ...
        protected String run() {
            return "Hello " + name + "!";
        }
        protected String getFallback() {
            return "Hello Failure " + name + "!";
        }
    }
    Stubbed Fallback

  68. (previous slide repeated)

  69. HystrixCommand run()
    getFallback() HystrixCommand
    run()
    Fallback via network


  70. HystrixCommand run()
    getFallback() HystrixCommand
    run()
    getFallback()
    Fallback via network then Local

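    Slides 69–70 in sketch form: the primary command's fallback invokes a second command
    (for example reading a stale copy over the network), and that command has its own local
    static fallback. The class and method names are illustrative, not Netflix's actual code:

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    public class CommandWithNetworkFallback extends HystrixCommand<String> {
        private final String id;

        public CommandWithNetworkFallback(String id) {
            super(HystrixCommandGroupKey.Factory.asKey("PrimaryGroup"));
            this.id = id;
        }

        @Override
        protected String run() {
            return callPrimaryService(id);              // normal path
        }

        @Override
        protected String getFallback() {
            // fallback via network: read a (possibly stale) copy through another command,
            // which should run on its own thread pool so it is not starved with the primary
            return new FallbackViaCache(id).execute();
        }

        private String callPrimaryService(String id) { return "fresh value"; } // placeholder

        private static class FallbackViaCache extends HystrixCommand<String> {
            private final String id;

            FallbackViaCache(String id) {
                super(HystrixCommandGroupKey.Factory.asKey("FallbackGroup"));
                this.id = id;
            }

            @Override
            protected String run() {
                return readFromRemoteCache(id);         // still a network call, so still wrapped
            }

            @Override
            protected String getFallback() {
                return "default value";                 // local static fallback, last resort
            }

            private String readFromRemoteCache(String id) { return "stale value"; } // placeholder
        }
    }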

  71. Transitive Failure


  72. Transitive Failure
    with Bulkheads & Fallbacks


  73. All Relationships


  74. Application State? (diagram: state spread across instances)

  75. Cluster Replication (and similar approaches) (diagram)

  76. All Instances Are Now Stateful (diagram)

  77. This Can Be Done (diagram)

  78. But Doesn’t Need To Be (diagram)

  79. So Where To Put State?


  80. Stateful Client (diagram)

  81. Ephemeral Cache (e.g. memcached, Redis) (diagram)

  82. Database (SQL, key-value, etc) (diagram)

  83. State generally ends up in the database anyway (diagram)

  84. Why? Isn’t this more complicated? (diagram)

  85. Bounded Context (diagram)

  86. Despite more parts, this simplifies ownership, operations, reasoning, deployments,
    and failure modes. Most systems focus on logic and behavior with simple operations.

  87. Few focus on durability and state, with the increased operational challenges
    and costs they bring.

  88. An example …

  89. Identity is a critical service. Keeping client state in a cookie allows a
    reasonable fallback even if the entire Identity service fails.

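    The Identity example from slide 89 as a hedged sketch: the command normally calls the
    Identity service, and falls back to the trusted state already carried in the user's
    cookie. UserIdentity, IdentityServiceClient and the cookie format are hypothetical:

    import java.net.HttpCookie;
    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    interface IdentityServiceClient {                   // hypothetical service client
        UserIdentity lookup(HttpCookie cookie);
    }

    class UserIdentity {                                // hypothetical identity model
        final String customerId;
        UserIdentity(String customerId) { this.customerId = customerId; }

        static UserIdentity fromCookie(HttpCookie cookie) {
            // real code would verify a signed/encrypted payload before trusting it
            return new UserIdentity(cookie.getValue());
        }
    }

    class GetUserIdentityCommand extends HystrixCommand<UserIdentity> {
        private final HttpCookie identityCookie;        // client-held state sent with the request
        private final IdentityServiceClient identityService;

        GetUserIdentityCommand(HttpCookie identityCookie, IdentityServiceClient identityService) {
            super(HystrixCommandGroupKey.Factory.asKey("Identity"));
            this.identityCookie = identityCookie;
            this.identityService = identityService;
        }

        @Override
        protected UserIdentity run() {
            // normal path: authoritative lookup against the Identity service
            return identityService.lookup(identityCookie);
        }

        @Override
        protected UserIdentity getFallback() {
            // Identity service unavailable: degrade to the state in the cookie so the
            // user stays signed in rather than seeing an error
            return UserIdentity.fromCookie(identityCookie);
        }
    }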

  90. “In complex systems, decision-makers are
    locally rather than globally rational. But that
    doesn’t mean that their decisions cannot lead
    to global, or system-wide events. In fact, that
    is one of the properties of complex systems:
    local actions can have global results.”
    Dekker, Sidney (2012-10-01). Drift into Failure (Kindle Locations 3268-3270). Ashgate Publishing. Kindle Edition.
    – Sidney Dekker, Drift into Failure


  91. (previous quote repeated)

  92. Load Shedding → Retry Storms


  93. Cache Shard Failure → DDoS Origin

  94. Dynamic Property Change → Saturate All CPUs


  95. Reactive Scaling
    → Scale Down During Outage
    → Overwhelmed By Thundering Herd


  96. Reactive Scaling
    → Scale Down During Super Bowl
    → Overwhelmed By Thundering Herd

  97. Achieve Resilience → Neglect → Drift → Vulnerability


  98. "Failure Recovery must be a very
    simple path and that path must be
    tested frequently"
    https://www.usenix.org/legacy/event/lisa07/tech/full_papers/hamilton/hamilton_html/
    – James Hamilton


  99.–100. (image-only slides)

  101. Three AWS Availability Zones (diagram)

  102.–103. (image-only slides)

  104. Auditing via Simulation


  105.–106. (previous slide repeated)

  107. (image-only slide)

  108. 125 → 1500+


  109. ~5000


  110. ~1


  111.–116. (image-only slides)

  117. Constantly Changing


  118. (image-only slide)

  119. Zuul Routing Layer routing to: Canary vs Baseline, Squeeze, Production, "Coalmine" (diagram)

  120.–121. (previous slide repeated)

  122.–128. (image-only slides)

  129. Zuul Routing Layer
    Canary vs Baseline
    Squeeze
    Production
    "Coalmine"


  130. (previous slide repeated)

  131. System relationship over the network without a bulkhead (dependency fan-out diagram)

  132. Zuul Routing Layer
    Canary vs Baseline
    Squeeze
    Production
    "Coalmine"


  133. Failure inevitably happens ...


  134. Cluster adapts
    Failure Isolated


  135.–137. (previous slide repeated)

  138.–142. (image-only slides)

  143. Note: This is a mockup


  144. (previous slide repeated)

  145. “…complex systems run as broken systems.
    The system continues to function because it
    contains so many redundancies and because
    people can make it function, despite the
    presence of many flaws.”
    Richard I. Cook - How Complex Systems Fail - http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf
    – Richard Cook, How Complex Systems Fail


  146. Where to next?


  147. Low Latency Anomaly Detection


  148. Automate Configuration?


  149. Global vs Regional Deployment


  150. Servers as Pets → Herds (Clusters)
    Clusters as Pets → Herds (Global Application)


  151. Human Involvement


  152. Assert Production Readiness?


  153. We have long believed that 80% of operations issues originate
    in design and development, so this section on overall service
    design is the largest and most important. When systems fail,
    there is a natural tendency to look first to operations since that
    is where the problem actually took place. Most operations
    issues, however, either have their genesis in design and
    development or are best solved there.
    https://www.usenix.org/legacy/event/lisa07/tech/full_papers/hamilton/hamilton_html/
    – James Hamilton


  154. (previous quote repeated)

  155. Resilience is by Design


  156. Ben Christensen
    @benjchristensen
    jobs.netflix.com
    Fault Tolerance in a High Volume, Distributed System
    http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
    Hystrix
    https://github.com/Netflix/Hystrix/wiki
    Drift Into Failure
    http://www.amazon.com/Drift-into-Failure-Sidney-Dekker/dp/1409422216
    Release It!
    http://www.amazon.com/Release-It-Production-Ready-Pragmatic-Programmers/dp/0978739213
