Resilient by Design at React SF 2014

To operate 24/7, an application must embrace constant change and failure. This kind of resiliency is achievable through the application of reactive design principles. Learn the theory via real-world examples from Netflix, along with some lessons learned the hard way in production. Topics of interest include service-oriented architectures (microservices), cloud computing, where to put application state, hot deployments, bulkheading, circuit breakers, degrading gracefully, operational tooling, and how application architecture affects resilience.

Presented at React Conf 2014 in San Francisco http://reactconf.com

Video: http://youtu.be/MEgyGamo79I

Ben Christensen

November 18, 2014


Transcript

  1. Ben Christensen
    Developer – Edge Engineering at Netflix
    @benjchristensen
    http://techblog.netflix.com/
    React San Francisco - November 2014
    Resilient By Design


  2. “the explosive growth of software has added
    greatly to systems’ interactive complexity. With
    software, the possible states that a system can
    end up in become mind-boggling.”
    Dekker, Sidney (2012-10-01). Drift into Failure (Kindle Locations 3268-3270). Ashgate Publishing. Kindle Edition.
    – Sidney Dekker, Drift into Failure


  3. “We can model and understand in isolation.
    But, when released into competitive, nominally
    regulated societies, their connections proliferate,
    their interactions and interdependencies multiply,
    their complexities mushroom.
    And we are caught short.”
    Dekker, Sidney (2012-10-01). Drift into Failure (Kindle Locations 290-292). Ashgate Publishing. Kindle Edition.
    – Sidney Dekker, Drift into Failure


  4. Read-through Cache (diagram: cache shards in front of the origin servers)

  5. Normally a low ~1% cache miss rate.

  6. A miss reads through to the origin.

  7. The result is written back to the cache.

  8. Now lose a cache shard.

  9. The normal 1% cache miss rate becomes 10% … 30% … and the origin is overwhelmed.

  10. Cache for Performance
    Becomes an Availability Concern

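    A minimal Java sketch of the read-through pattern from slides 4–9 (CacheClient and
    OriginClient are hypothetical stand-ins for the cache shards and origin servers, not
    code from the talk):

    import java.util.Optional;

    interface CacheClient {
        Optional<String> get(String key);
        void set(String key, String value);
    }

    interface OriginClient {
        String load(String key);
    }

    class ReadThroughCache {
        private final CacheClient cache;
        private final OriginClient origin;

        ReadThroughCache(CacheClient cache, OriginClient origin) {
            this.cache = cache;
            this.origin = origin;
        }

        String get(String key) {
            // normally ~99% of requests are answered here
            Optional<String> cached = cache.get(key);
            if (cached.isPresent()) {
                return cached.get();
            }
            // a miss reads through to the origin ...
            String value = origin.load(key);
            // ... and writes the result back to the cache; lose a shard and this miss
            // path jumps from ~1% to 10-30% of traffic, overwhelming the origin
            cache.set(key, value);
            return value;
        }
    }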

  11. Multiple Dependencies


  12. Allowing One To
    Break User Experience


  13. Transitive Failure


  14. Sticky Sessions


  15. Complicate Fault Tolerance
    & Scaling


  16. Feature Complete!


  17. … hmmm …
    resilience?


  18. We Must
    Design For Resilience


  19. Source: http://reich-chemistry.wikispaces.com/file/view/liquid_thorium_reactor_large.jpg/245978425/616x547/liquid_thorium_reactor_large.jpg


  20. (previous image repeated)

  21. "LFTRs (liquid fluoride thorium reactor) also have excellent safety
    features. My favorite is the use of a ‘plug’ which would melt if the
    molten mass got too hot for any reason, draining it away into a
    protected lower tank which would stop any fissioning and cool the
    whole lot down. It’s a clever idea: the plug is a frozen wedge of salt
    in a pipe at the bottom of the core tank, cooled by an external fan. If
    power is lost for some reason which might threaten to overheat the
    LFTR, the fan stops, the plug melts, and the salts all drain away. The
    fuel can’t melt down for the straightforward reason that it is already
    molten. No China Syndrome here."
    – Mark Lynas, Nuclear 2.0


  22. "LFTRs (liquid fluoride thorium reactor) also have excellent safety
    features. My favorite is the use of a ‘plug’ which would melt if the
    molten mass got too hot for any reason, draining it away into a
    protected lower tank which would stop any fissioning and cool the
    whole lot down. It’s a clever idea: the plug is a frozen wedge of salt
    in a pipe at the bottom of the core tank, cooled by an external fan. If
    power is lost for some reason which might threaten to overheat the
    LFTR, the fan stops, the plug melts, and the salts all drain away. The
    fuel can’t melt down for the straightforward reason that it is already
    molten. No China Syndrome here."
    – Mark Lynas, Nuclear 2.0

    View Slide

  23. "LFTRs (liquid fluoride thorium reactor) also have excellent safety
    features. My favorite is the use of a ‘plug’ which would melt if the
    molten mass got too hot for any reason, draining it away into a
    protected lower tank which would stop any fissioning and cool the
    whole lot down. It’s a clever idea: the plug is a frozen wedge of salt
    in a pipe at the bottom of the core tank, cooled by an external fan. If
    power is lost for some reason which might threaten to overheat the
    LFTR, the fan stops, the plug melts, and the salts all drain away. The
    fuel can’t melt down for the straightforward reason that it is already
    molten. No China Syndrome here."
    – Mark Lynas, Nuclear 2.0

    View Slide

  24. Source: http://reich-chemistry.wikispaces.com/file/view/liquid_thorium_reactor_large.jpg/245978425/616x547/liquid_thorium_reactor_large.jpg


  25. “System operations are dynamic, with
    components (organizational, human, technical)
    failing and being replaced continuously.”
    – Richard Cook, How Complex Systems Fail
    Richard I. Cook - How Complex Systems Fail - http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf


  26.–28. (image-only slides)

  29. Three AWS Availability Zones (diagram)

  30.–35. (image-only slides)

  36. (diagram: a user request fanning out to dependencies A through R)

  37. “Overt catastrophic failure occurs when small,
    apparently innocuous failures join to create
    opportunity for a systemic accident.”
    – Richard Cook, How Complex Systems Fail
    Richard I. Cook - How Complex Systems Fail - http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf


  38.–42. (image-only slides)

  43. User request blocked by latency in a single network call (dependency fan-out diagram)

  44.–45. At high volume all request threads can block in seconds (many concurrent user requests against the same dependency fan-out)

  46. Layers inside each dependency client call (diagram):
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc
    Serialization - URL and/or body generation
    Logic - validation, decoration, object model, caching, metrics, logging, etc

  47. "Timeout guard" daemon prio=10 tid=0x00002aaacd5e5000 nid=0x3aac runnable [0x00002aaac388f000] java.lang.Thread.State: RUNNABLE
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    - locked <0x000000055c7e8bd8> (a java.net.SocksSocketImpl)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
    at java.net.Socket.connect(Socket.java:579)
    at java.net.Socket.connect(Socket.java:528)
    at java.net.Socket.(Socket.java:425)
    at java.net.Socket.(Socket.java:280)
    at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
    at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$1.doit(ControllerThreadSocketFactory.java:91)
    at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$SocketTask.run(ControllerThreadSocketFactory.java:158) at java.lang.Thread.run(Thread.java:722)
    [Sat Jun 30 04:01:37 2012] [error] proxy: HTTP: disabled connection for (127.0.0.1)
    > 80% of requests rejected
    Median
    Latency

    View Slide

  48. “Overt catastrophic failure occurs when small,
    apparently innocuous failures join to create
    opportunity for a systemic accident.”
    – Richard Cook, How Complex Systems Fail
    Richard I. Cook - How Complex Systems Fail - http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf


  49. (dependency fan-out diagram)

  50.–51. (image-only slides)

  52.–54. (dependency fan-out diagram, repeated)

  55. (image-only slide)

  56. Logic - validation, decoration, object model, caching,
    metrics, logging, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Serialization - URL and/or body generation
    Logic - argument validation, caches, metrics, logging,
    multivariate testing, routing, etc


  57. Tryable Semaphore
    Rejected
    Permitted
    Logic - validation, decoration, object model, caching,
    metrics, logging, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Serialization - URL and/or body generation
    Logic - argument validation, caches, metrics, logging,
    multivariate testing, routing, etc

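    A rough sketch of the tryable-semaphore bulkhead from slide 57, using
    java.util.concurrent.Semaphore (MAX_CONCURRENT, callDependency and fallback are
    illustrative names, not Hystrix internals):

    import java.util.concurrent.Semaphore;

    class SemaphoreBulkhead {
        private static final int MAX_CONCURRENT = 10;   // illustrative limit per dependency
        private final Semaphore permits = new Semaphore(MAX_CONCURRENT);

        String fetch() {
            // tryAcquire never blocks: the call is either permitted or rejected immediately
            if (!permits.tryAcquire()) {
                return fallback();          // rejected: shed load instead of queueing
            }
            try {
                return callDependency();    // permitted: run the (non-blocking) client logic
            } finally {
                permits.release();
            }
        }

        String callDependency() { return "real response"; }     // placeholder client call
        String fallback()       { return "fallback response"; } // placeholder fallback
    }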

  58. Thread-pool
    Rejected
    Permitted
    Logic - validation, decoration, object model, caching,
    metrics, logging, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Serialization - URL and/or body generation
    Logic - argument validation, caches, metrics, logging,
    multivariate testing, routing, etc
    Timeout with non-blocking IO


  59. Thread-pool
    Rejected
    Permitted
    Logic - validation, decoration, object model, caching,
    metrics, logging, etc
    Deserialization - JSON/XML/Thrift/Protobuf/etc
    Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
    Serialization - URL and/or body generation
    Logic - argument validation, caches, metrics, logging,
    multivariate testing, routing, etc
    Timeout with blocking IO

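    Slides 58–59 in sketch form: a small, dedicated thread pool isolates a blocking client
    and lets the caller time out and walk away (pool size, queue size and timeout values
    are illustrative, not Netflix's settings):

    import java.util.concurrent.*;

    class ThreadPoolBulkhead {
        // bounded pool + bounded queue: saturation rejects instead of piling up callers
        private final ExecutorService pool = new ThreadPoolExecutor(
                10, 10, 0, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<Runnable>(5));

        String fetch() {
            Future<String> future;
            try {
                future = pool.submit(this::callBlockingDependency);
            } catch (RejectedExecutionException rejected) {
                return fallback();   // bulkhead full: reject immediately
            }
            try {
                // aggressive timeout: give up and move on even though the client call blocks
                return future.get(250, TimeUnit.MILLISECONDS);
            } catch (TimeoutException | InterruptedException | ExecutionException failed) {
                future.cancel(true); // try to free the blocked worker thread
                return fallback();   // latency or error: degrade gracefully
            }
        }

        String callBlockingDependency() { return "real response"; }     // placeholder blocking IO
        String fallback()               { return "fallback response"; } // placeholder fallback
    }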

  60. Bulkhead – Limit Concurrency:
    Tryable semaphores for non-blocking clients and fallbacks
    Separate threads for blocking clients
    Release Pressure:
    Aggressive timeouts to “give up and move on”
    Circuit breakers as the “release valve”

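    The "release valve" from slide 60, as a deliberately crude circuit breaker (Hystrix's
    real breaker uses rolling-window error percentages and a single half-open trial request;
    the thresholds here are illustrative):

    import java.util.concurrent.atomic.AtomicInteger;

    class SimpleCircuitBreaker {
        private static final int FAILURE_THRESHOLD = 20;
        private static final long SLEEP_WINDOW_MS = 5000;

        private final AtomicInteger consecutiveFailures = new AtomicInteger();
        private volatile long openedAt = 0;

        boolean allowRequest() {
            if (consecutiveFailures.get() < FAILURE_THRESHOLD) {
                return true;                               // closed: traffic flows
            }
            // open: reject until the sleep window passes, then allow a trial request
            return System.currentTimeMillis() - openedAt > SLEEP_WINDOW_MS;
        }

        void markSuccess() {
            consecutiveFailures.set(0);                    // close the circuit again
        }

        void markFailure() {
            if (consecutiveFailures.incrementAndGet() >= FAILURE_THRESHOLD) {
                openedAt = System.currentTimeMillis();     // trip (or re-trip) open
            }
        }
    }

    Callers check allowRequest() before invoking the dependency, short-circuit to the
    fallback when it returns false, and report outcomes via markSuccess()/markFailure().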

  61. HystrixCommand run()
    public class CommandHelloWorld extends HystrixCommand<String> {
        ...
        protected String run() {
            return "Hello " + name + "!";
        }
    }

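    For context, the complete form of this example (per the Hystrix wiki) adds a constructor
    and command group, and callers can execute it synchronously, asynchronously, or reactively:

    import java.util.concurrent.Future;
    import rx.Observable;
    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    public class CommandHelloWorld extends HystrixCommand<String> {
        private final String name;

        public CommandHelloWorld(String name) {
            super(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"));
            this.name = name;
        }

        @Override
        protected String run() {
            return "Hello " + name + "!";
        }
    }

    // calling the command:
    String s = new CommandHelloWorld("World").execute();             // synchronous
    Future<String> f = new CommandHelloWorld("World").queue();       // asynchronous
    Observable<String> o = new CommandHelloWorld("World").observe(); // reactive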

  62. public class CommandHelloWorld extends HystrixCommand<String> {
        ...
        protected String run() {
            return "Hello " + name + "!";
        }
    }
    run() invokes “client” Logic
    HystrixCommand run()


  63. HystrixCommand run()
    throw Exception
    Fail Fast

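    A fail-fast command adapted from the Hystrix wiki: run() throws and no fallback is
    defined, so the caller gets an exception (wrapped in a HystrixRuntimeException)
    immediately instead of waiting on a broken dependency:

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    public class CommandThatFailsFast extends HystrixCommand<String> {
        private final boolean throwException;

        public CommandThatFailsFast(boolean throwException) {
            super(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"));
            this.throwException = throwException;
        }

        @Override
        protected String run() {
            if (throwException) {
                throw new RuntimeException("failure from CommandThatFailsFast");
            }
            return "success";
        }
        // no getFallback(): failures, timeouts and rejections propagate to the caller
    }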

  64. HystrixCommand run()
    getFallback()
    return null;
    return new Option();
    return Collections.emptyList();
    return Collections.emptyMap();
    Fail Silent

  65. HystrixCommand run()
    getFallback()
    return true;
    return DEFAULT_OBJECT;
    Static Fallback

  66. HystrixCommand run()
    getFallback()
    return new UserAccount(customerId, "Unknown Name",
        countryCodeFromGeoLookup, true, true, false);
    return new VideoBookmark(movieId, 0);
    Stubbed Fallback

  67. HystrixCommand run()
    getFallback()
    public class CommandHelloWorld extends HystrixCommand<String> {
        ...
        protected String run() {
            return "Hello " + name + "!";
        }
        protected String getFallback() {
            return "Hello Failure " + name + "!";
        }
    }
    Stubbed Fallback

  68. (previous slide repeated)

  69. HystrixCommand run()
    getFallback() HystrixCommand
    run()
    Fallback via network


  70. HystrixCommand run()
    getFallback() HystrixCommand
    run()
    getFallback()
    Fallback via network then Local

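    Slides 69–70 in sketch form: the primary command's fallback invokes a second command
    (for example reading a stale copy over the network), and that command has its own local
    static fallback. The class and method names are illustrative, not Netflix's actual code:

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    public class CommandWithNetworkFallback extends HystrixCommand<String> {
        private final String id;

        public CommandWithNetworkFallback(String id) {
            super(HystrixCommandGroupKey.Factory.asKey("PrimaryGroup"));
            this.id = id;
        }

        @Override
        protected String run() {
            return callPrimaryService(id);              // normal path
        }

        @Override
        protected String getFallback() {
            // fallback via network: read a (possibly stale) copy through another command,
            // which should run on its own thread pool so it is not starved with the primary
            return new FallbackViaCache(id).execute();
        }

        private String callPrimaryService(String id) { return "fresh value"; } // placeholder

        private static class FallbackViaCache extends HystrixCommand<String> {
            private final String id;

            FallbackViaCache(String id) {
                super(HystrixCommandGroupKey.Factory.asKey("FallbackGroup"));
                this.id = id;
            }

            @Override
            protected String run() {
                return readFromRemoteCache(id);         // still a network call, so still wrapped
            }

            @Override
            protected String getFallback() {
                return "default value";                 // local static fallback, last resort
            }

            private String readFromRemoteCache(String id) { return "stale value"; } // placeholder
        }
    }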

  71. Transitive Failure


  72. Transitive Failure
    with Bulkheads & Fallbacks


  73. All Relationships


  74. Application State? (diagram: state spread across instances)

  75. Cluster Replication (and similar approaches) (diagram)

  76. All Instances Are Now Stateful (diagram)

  77. This Can Be Done (diagram)

  78. But Doesn’t Need To Be (diagram)

  79. So Where To Put State?


  80. Stateful Client (diagram)

  81. Ephemeral Cache (e.g. memcached, Redis) (diagram)

  82. Database (SQL, key-value, etc) (diagram)

  83. State generally ends up in the database anyway (diagram)

  84. Why? Isn’t this more complicated? (diagram)

  85. Bounded Context (diagram)

  86. Despite more parts, this simplifies ownership, operations, reasoning, deployments,
    and failure modes. Most systems focus on logic and behavior with simple operations.

  87. Few focus on durability and state, with the increased operational challenges
    and costs they bring.

  88. An example …

  89. Identity is a critical service. Keeping client state in a cookie allows a
    reasonable fallback even if the entire Identity service fails.

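    The Identity example from slide 89 as a hedged sketch: the command normally calls the
    Identity service, and falls back to the trusted state already carried in the user's
    cookie. UserIdentity, IdentityServiceClient and the cookie format are hypothetical:

    import java.net.HttpCookie;
    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    interface IdentityServiceClient {                   // hypothetical service client
        UserIdentity lookup(HttpCookie cookie);
    }

    class UserIdentity {                                // hypothetical identity model
        final String customerId;
        UserIdentity(String customerId) { this.customerId = customerId; }

        static UserIdentity fromCookie(HttpCookie cookie) {
            // real code would verify a signed/encrypted payload before trusting it
            return new UserIdentity(cookie.getValue());
        }
    }

    class GetUserIdentityCommand extends HystrixCommand<UserIdentity> {
        private final HttpCookie identityCookie;        // client-held state sent with the request
        private final IdentityServiceClient identityService;

        GetUserIdentityCommand(HttpCookie identityCookie, IdentityServiceClient identityService) {
            super(HystrixCommandGroupKey.Factory.asKey("Identity"));
            this.identityCookie = identityCookie;
            this.identityService = identityService;
        }

        @Override
        protected UserIdentity run() {
            // normal path: authoritative lookup against the Identity service
            return identityService.lookup(identityCookie);
        }

        @Override
        protected UserIdentity getFallback() {
            // Identity service unavailable: degrade to the state in the cookie so the
            // user stays signed in rather than seeing an error
            return UserIdentity.fromCookie(identityCookie);
        }
    }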

  90. “In complex systems, decision-makers are
    locally rather than globally rational. But that
    doesn’t mean that their decisions cannot lead
    to global, or system-wide events. In fact, that
    is one of the properties of complex systems:
    local actions can have global results.”
    Dekker, Sidney (2012-10-01). Drift into Failure (Kindle Locations 3268-3270). Ashgate Publishing. Kindle Edition.
    – Sidney Dekker, Drift into Failure


  91. (previous quote repeated)

  92. Load Shedding → Retry Storms


  93. Cache Shard Failure → DDoS Origin

  94. Dynamic Property Change → Saturate All CPUs


  95. Reactive Scaling
    → Scale Down During Outage
    → Overwhelmed By Thundering Herd


  96. Reactive Scaling
    → Scale Down During Super Bowl
    → Overwhelmed By Thundering Herd

  97. Achieve Resilience → Neglect → Drift → Vulnerability


  98. "Failure Recovery must be a very
    simple path and that path must be
    tested frequently"
    https://www.usenix.org/legacy/event/lisa07/tech/full_papers/hamilton/hamilton_html/
    – James Hamilton


  99.–100. (image-only slides)

  101. Three AWS Availability Zones (diagram)

  102.–103. (image-only slides)

  104. Auditing via Simulation


  105.–106. (previous slide repeated)

  107. (image-only slide)

  108. 125 → 1500+


  109. ~5000


  110. ~1


  111.–116. (image-only slides)

  117. Constantly Changing


  118. (image-only slide)

  119. Zuul Routing Layer routing to: Canary vs Baseline, Squeeze, Production, "Coalmine" (diagram)

  120.–121. (previous slide repeated)

  122.–128. (image-only slides)

  129. Zuul Routing Layer
    Canary vs Baseline
    Squeeze
    Production
    "Coalmine"


  130. (previous slide repeated)

  131. System relationship over the network without a bulkhead (dependency fan-out diagram)

  132. Zuul Routing Layer
    Canary vs Baseline
    Squeeze
    Production
    "Coalmine"


  133. Failure inevitably happens ...


  134. Cluster adapts
    Failure Isolated


  135.–137. (previous slide repeated)

  138.–142. (image-only slides)

  143. Note: This is a mockup


  144. (previous slide repeated)

  145. “…complex systems run as broken systems.
    The system continues to function because it
    contains so many redundancies and because
    people can make it function, despite the
    presence of many flaws.”
    Richard I. Cook - How Complex Systems Fail - http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf
    – Richard Cook, How Complex Systems Fail


  146. Where to next?


  147. Low Latency Anomaly Detection


  148. Automate Configuration?


  149. Global vs Regional Deployment


  150. Servers as Pets → Herds (Clusters)
    Clusters as Pets → Herds (Global Application)


  151. Human Involvement


  152. Assert Production Readiness?


  153. We have long believed that 80% of operations issues originate
    in design and development, so this section on overall service
    design is the largest and most important. When systems fail,
    there is a natural tendency to look first to operations since that
    is where the problem actually took place. Most operations
    issues, however, either have their genesis in design and
    development or are best solved there.
    https://www.usenix.org/legacy/event/lisa07/tech/full_papers/hamilton/hamilton_html/
    – James Hamilton


  154. (previous quote repeated)

  155. Resilience is by Design


  156. Ben Christensen
    @benjchristensen
    jobs.netflix.com
    Fault Tolerance in a High Volume, Distributed System
    http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
    Hystrix
    https://github.com/Netflix/Hystrix/wiki
    Drift Into Failure
    http://www.amazon.com/Drift-into-Failure-Sidney-Dekker/dp/1409422216
    Release It!
    http://www.amazon.com/Release-It-Production-Ready-Pragmatic-Programmers/dp/0978739213
