Slide 1

Slide 1 text

Resilient By Design
Ben Christensen, Developer – Edge Engineering at Netflix
@benjchristensen
http://techblog.netflix.com/
React San Francisco - November 2014

Slide 2

Slide 2 text

“the explosive growth of software has added greatly to systems’ interactive complexity. With software, the possible states that a system can end up in become mind-boggling.” Dekker, Sidney (2012-10-01). Drift into Failure (Kindle Locations 3268-3270). Ashgate Publishing. Kindle Edition. – Sidney Dekker, Drift into Failure

Slide 3

Slide 3 text

“We can model and understand in isolation. But, when released into competitive, nominally regulated societies, their connections proliferate, their interactions and interdependencies multiply, their complexities mushroom. And we are caught short.” Dekker, Sidney (2012-10-01). Drift into Failure (Kindle Locations 290-292). Ashgate Publishing. Kindle Edition. – Sidney Dekker, Drift into Failure

Slide 4

Slide 4 text

Cache Origin Servers Cache Cache Read-through Cache

Slide 5

Slide 5 text

Cache Origin Servers Cache Cache low ~1% cache miss rate

Slide 6

Slide 6 text

Cache Origin Servers Cache Cache reads through to origin

Slide 7

Slide 7 text

Cache Origin Servers Cache Cache writes back to cache
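The read-through flow on these slides (cache miss, read through to origin, write back to cache) can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not Netflix's cache: the class name `ReadThroughCache`, the `origin` function standing in for the origin servers, and the `dropAll()` helper that simulates losing cached entries are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical read-through cache, for illustration only.
class ReadThroughCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> origin;                 // stands in for the origin servers
    private final AtomicInteger originReads = new AtomicInteger();

    ReadThroughCache(Function<K, V> origin) {
        this.origin = origin;
    }

    V get(K key) {
        V cached = cache.get(key);
        if (cached != null) {
            return cached;                               // cache hit
        }
        originReads.incrementAndGet();                   // cache miss: read through to origin
        V value = origin.apply(key);
        if (value != null) {
            cache.put(key, value);                       // write back to cache
        }
        return value;
    }

    // Simulates losing the cached entries (for example, a failed shard).
    void dropAll() {
        cache.clear();
    }

    int originReads() {
        return originReads.get();
    }
}
```

Concurrent misses for the same key may each reach the origin in this sketch; it makes no attempt to coalesce them.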

Slide 8

Slide 8 text

Cache Origin Servers Cache Cache lose a cache shard

Slide 9

Slide 9 text

Cache Origin Servers Cache Cache normal 1% cache miss rate becomes 10% … 30% … origin is overwhelmed
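The arithmetic behind this slide is worth spelling out: origin traffic is total traffic multiplied by the miss rate, so moving from 1% to 30% misses multiplies origin load by 30. The sketch below assumes an illustrative 100,000 requests per second; that figure and the `OriginLoad` helper are hypothetical, not numbers from the talk.

```java
// Hypothetical helper; all numbers are illustrative, not Netflix traffic figures.
class OriginLoad {
    // Requests per second that fall through to the origin at a given miss rate (whole percent).
    static long originRequestsPerSecond(long totalRequestsPerSecond, int missPercent) {
        return totalRequestsPerSecond * missPercent / 100;
    }

    public static void main(String[] args) {
        long total = 100_000; // assumed total request rate
        for (int miss : new int[] {1, 10, 30}) {
            System.out.println(miss + "% miss rate -> "
                    + originRequestsPerSecond(total, miss) + " origin req/s");
        }
    }
}
```

An origin provisioned for the 1% case has to absorb ten to thirty times its normal load, which is why a cache added for performance becomes an availability concern.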

Slide 10

Slide 10 text

Cache for Performance Becomes an Availability Concern Cache Origin Servers Cache Cache

Slide 11

Slide 11 text

Multiple Dependencies

Slide 12

Slide 12 text

Allowing One To Break User Experience

Slide 13

Slide 13 text

Transitive Failure

Slide 14

Slide 14 text

Sticky Sessions

Slide 15

Slide 15 text

Complicate Fault Tolerance & Scaling

Slide 16

Slide 16 text

Feature Complete!

Slide 17

Slide 17 text

… hmmm … resilience?

Slide 18

Slide 18 text

We Must Design For Resilience

Slide 19

Slide 19 text

Source: http://reich-chemistry.wikispaces.com/file/view/liquid_thorium_reactor_large.jpg/245978425/616x547/liquid_thorium_reactor_large.jpg

Slide 20

Slide 20 text

Source: http://reich-chemistry.wikispaces.com/file/view/liquid_thorium_reactor_large.jpg/245978425/616x547/liquid_thorium_reactor_large.jpg

Slide 21

Slide 21 text

"LFTRs (liquid fluoride thorium reactor) also have excellent safety features. My favorite is the use of a ‘plug’ which would melt if the molten mass got too hot for any reason, draining it away into a protected lower tank which would stop any fissioning and cool the whole lot down. It’s a clever idea: the plug is a frozen wedge of salt in a pipe at the bottom of the core tank, cooled by an external fan. If power is lost for some reason which might threaten to overheat the LFTR, the fan stops, the plug melts, and the salts all drain away. The fuel can’t melt down for the straightforward reason that it is already molten. No China Syndrome here." – Mark Lynas, Nuclear 2.0

Slide 24

Slide 24 text

Source: http://reich-chemistry.wikispaces.com/file/view/liquid_thorium_reactor_large.jpg/245978425/616x547/liquid_thorium_reactor_large.jpg

Slide 25

Slide 25 text

“System operations are dynamic, with components (organizational, human, technical) failing and being replaced continuously.” – Richard Cook, How Complex Systems Fail Richard I. Cook - How Complex Systems Fail - http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

AWS Availability Zone AWS Availability Zone AWS Availability Zone

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

User Request Dependency A Dependency D Dependency G Dependency J Dependency M Dependency P Dependency B Dependency E Dependency H Dependency K Dependency N Dependency Q Dependency C Dependency F Dependency I Dependency L Dependency O Dependency R

Slide 37

Slide 37 text

“Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident.” – Richard Cook, How Complex Systems Fail Richard I. Cook - How Complex Systems Fail - http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

No content

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

User Request Dependency A Dependency D Dependency G Dependency J Dependency M Dependency P Dependency B Dependency E Dependency H Dependency K Dependency N Dependency Q Dependency C Dependency F Dependency I Dependency L Dependency O Dependency R User request blocked by latency in single network call

Slide 44

Slide 44 text

At high volume all request threads can block in seconds User Request Dependency A Dependency D Dependency G Dependency J Dependency M Dependency P Dependency B Dependency E Dependency H Dependency K Dependency N Dependency Q Dependency C Dependency F Dependency I Dependency L Dependency O Dependency R User Request User Request User Request User Request User Request User Request

Slide 45

Slide 45 text

User Request Dependency A Dependency D Dependency G Dependency J Dependency M Dependency P Dependency B Dependency E Dependency H Dependency K Dependency N Dependency Q Dependency C Dependency F Dependency I Dependency L Dependency O Dependency R User Request User Request User Request User Request User Request User Request At high volume all request threads can block in seconds

Slide 46

Slide 46 text

Dependency D Dependency G Dependency J Dependency M Dependency B Dependency E Dependency H Dependency K Dependency N Dependency C Dependency F Dependency I Dependency L Dependency O User Request User Request User Request
Network Request - TCP/HTTP, latency, 4xx, 5xx, etc
Deserialization - JSON/XML/Thrift/Protobuf/etc
Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc
Serialization - URL and/or body generation
Logic - validation, decoration, object model, caching, metrics, logging, etc

Slide 47

Slide 47 text

"Timeout guard" daemon prio=10 tid=0x00002aaacd5e5000 nid=0x3aac runnable [0x00002aaac388f000] java.lang.Thread.State: RUNNABLE at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) - locked <0x000000055c7e8bd8> (a java.net.SocksSocketImpl) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391) at java.net.Socket.connect(Socket.java:579) at java.net.Socket.connect(Socket.java:528) at java.net.Socket.(Socket.java:425) at java.net.Socket.(Socket.java:280) at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80) at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$1.doit(ControllerThreadSocketFactory.java:91) at org.apache.commons.httpclient.protocol.ControllerThreadSocketFactory$SocketTask.run(ControllerThreadSocketFactory.java:158) at java.lang.Thread.run(Thread.java:722) [Sat Jun 30 04:01:37 2012] [error] proxy: HTTP: disabled connection for (127.0.0.1) > 80% of requests rejected Median Latency

Slide 48

Slide 48 text

“Overt catastrophic failure occurs when small, apparently innocuous failures join to create opportunity for a systemic accident.” – Richard Cook, How Complex Systems Fail Richard I. Cook - How Complex Systems Fail - http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf

Slide 49

Slide 49 text

User Request Dependency A Dependency D Dependency G Dependency J Dependency M Dependency P Dependency B Dependency E Dependency H Dependency K Dependency N Dependency Q Dependency C Dependency F Dependency I Dependency L Dependency O Dependency R

Slide 50

Slide 50 text

No content

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

User Request Dependency A Dependency D Dependency G Dependency J Dependency M Dependency P Dependency B Dependency E Dependency H Dependency K Dependency N Dependency Q Dependency C Dependency F Dependency I Dependency L Dependency O Dependency R

Slide 53

Slide 53 text

User Request Dependency A Dependency D Dependency G Dependency J Dependency M Dependency P Dependency B Dependency E Dependency H Dependency K Dependency N Dependency Q Dependency C Dependency F Dependency I Dependency L Dependency O Dependency R

Slide 54

Slide 54 text

User Request Dependency A Dependency D Dependency G Dependency J Dependency M Dependency P Dependency B Dependency E Dependency H Dependency K Dependency N Dependency Q Dependency C Dependency F Dependency I Dependency L Dependency O Dependency R

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

Logic - validation, decoration, object model, caching, metrics, logging, etc Deserialization - JSON/XML/Thrift/Protobuf/etc Network Request - TCP/HTTP, latency, 4xx, 5xx, etc Serialization - URL and/or body generation Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc

Slide 57

Slide 57 text

Tryable Semaphore Rejected Permitted Logic - validation, decoration, object model, caching, metrics, logging, etc Deserialization - JSON/XML/Thrift/Protobuf/etc Network Request - TCP/HTTP, latency, 4xx, 5xx, etc Serialization - URL and/or body generation Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc
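A "tryable" semaphore caps concurrency without ever making the caller wait: a request either gets a permit immediately or is rejected. The sketch below uses `java.util.concurrent.Semaphore.tryAcquire()` as a stand-in; the `SemaphoreBulkhead` class and its `execute` signature are hypothetical and are not Hystrix's own implementation.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical semaphore bulkhead: permitted calls run on the caller's thread,
// excess calls are rejected immediately instead of queueing.
class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    <T> T execute(Supplier<T> call, Supplier<T> rejected) {
        if (!permits.tryAcquire()) {       // non-blocking: returns false when no permit is free
            return rejected.get();         // "Rejected" path
        }
        try {
            return call.get();             // "Permitted" path
        } finally {
            permits.release();             // always return the permit
        }
    }
}
```

Because the call runs on the caller's thread, this approach cannot interrupt a call that hangs; it only bounds how many callers can be stuck at once, which is why the slides pair it with non-blocking clients.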

Slide 58

Slide 58 text

Thread-pool Rejected Permitted Logic - validation, decoration, object model, caching, metrics, logging, etc Deserialization - JSON/XML/Thrift/Protobuf/etc Network Request - TCP/HTTP, latency, 4xx, 5xx, etc Serialization - URL and/or body generation Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc Timeout with non-blocking IO

Slide 59

Slide 59 text

Thread-pool Rejected Permitted Logic - validation, decoration, object model, caching, metrics, logging, etc Deserialization - JSON/XML/Thrift/Protobuf/etc Network Request - TCP/HTTP, latency, 4xx, 5xx, etc Serialization - URL and/or body generation Logic - argument validation, caches, metrics, logging, multivariate testing, routing, etc Timeout with blocking IO
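For blocking clients, running the call on a separate bounded thread pool lets the caller give up after a timeout even though the worker thread is still stuck in I/O. The following is a rough sketch using only `java.util.concurrent`; the `ThreadPoolBulkhead` class and its string outcomes (`"REJECTED"`, `"TIMEOUT"`, `"FAILED"`) are assumptions made for illustration, not Hystrix's API.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical thread-pool bulkhead with a caller-side timeout.
class ThreadPoolBulkhead {
    private final ExecutorService pool;
    private final long timeoutMillis;

    ThreadPoolBulkhead(int threads, long timeoutMillis) {
        // SynchronousQueue means no queueing: when all threads are busy, submit() is rejected.
        this.pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>(), r -> {
                    Thread t = new Thread(r);
                    t.setDaemon(true);     // do not keep the JVM alive for stuck calls
                    return t;
                });
        this.timeoutMillis = timeoutMillis;
    }

    String execute(Callable<String> call) {
        Future<String> future;
        try {
            future = pool.submit(call);
        } catch (RejectedExecutionException e) {
            return "REJECTED";             // pool saturated: shed the request
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);           // give up and move on; interrupts the worker
            return "TIMEOUT";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "FAILED";
        } catch (ExecutionException e) {
            return "FAILED";
        }
    }
}
```

Note that interrupting the worker does not necessarily unblock a socket read, so the pool size is what ultimately bounds the damage from a slow dependency.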

Slide 60

Slide 60 text

Tryable semaphores for non-blocking clients and fallbacks Separate threads for blocking clients Bulkhead – Limit Concurrency Aggressive timeouts to “give up and move on” Circuit breakers as the “release valve” Release Pressure
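The circuit breaker acting as a "release valve" can be sketched as a small state machine: after enough failures it stops calling the dependency and serves the fallback, then lets a trial request through once a sleep window has passed. This is a simplified sketch that trips on consecutive failures; that trip rule is an assumption made here for brevity, and the class name, threshold, and injected `clock` are all hypothetical rather than taken from the talk or from Hystrix.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical circuit breaker that trips on consecutive failures.
class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long sleepWindowMillis;
    private final LongSupplier clock;      // injected so the behaviour is testable
    private int consecutiveFailures = 0;
    private long openedAt = -1;            // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    synchronized boolean allowRequest() {
        if (openedAt < 0) {
            return true;                   // closed: traffic flows normally
        }
        // Open: only allow a trial request once the sleep window has elapsed.
        return clock.getAsLong() - openedAt >= sleepWindowMillis;
    }

    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;                     // close the circuit again
    }

    synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = clock.getAsLong();  // open (or re-open) the circuit
        }
    }

    <T> T execute(Supplier<T> call, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get();         // short-circuit: do not touch the dependency
        }
        try {
            T result = call.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            return fallback.get();
        }
    }
}
```

While the circuit is open the struggling dependency receives no traffic from this caller, which relieves pressure on both sides.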

Slide 61

Slide 61 text

HystrixCommand run()
public class CommandHelloWorld extends HystrixCommand<String> {
    ...
    protected String run() {
        return "Hello " + name + "!";
    }
}

Slide 62

Slide 62 text

public class CommandHelloWorld extends HystrixCommand<String> {
    ...
    protected String run() {
        return "Hello " + name + "!";
    }
}
run() invokes “client” Logic
HystrixCommand run()

Slide 63

Slide 63 text

HystrixCommand run() throw Exception Fail Fast

Slide 64

Slide 64 text

HystrixCommand run() getFallback()
    return null;
    return new Option<T>();
    return Collections.emptyList();
    return Collections.emptyMap();
Fail Silent

Slide 65

Slide 65 text

HystrixCommand run() getFallback()
    return true;
    return DEFAULT_OBJECT;
Static Fallback

Slide 66

Slide 66 text

HystrixCommand run() getFallback()
    return new UserAccount(customerId, "Unknown Name",
            countryCodeFromGeoLookup, true, true, false);
    return new VideoBookmark(movieId, 0);
Stubbed Fallback

Slide 67

Slide 67 text

HystrixCommand run() getFallback()
public class CommandHelloWorld extends HystrixCommand<String> {
    ...
    protected String run() {
        return "Hello " + name + "!";
    }
    protected String getFallback() {
        return "Hello Failure " + name + "!";
    }
}
Stubbed Fallback

Slide 68

Slide 68 text

HystrixCommand run() getFallback()
public class CommandHelloWorld extends HystrixCommand<String> {
    ...
    protected String run() {
        return "Hello " + name + "!";
    }
    protected String getFallback() {
        return "Hello Failure " + name + "!";
    }
}
Stubbed Fallback
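To make the relationship between `run()` and `getFallback()` concrete without pulling in the library, here is a dependency-free stand-in. `SketchCommand` is a hypothetical class that models only the "try run(), on failure use the fallback" step; it has none of Hystrix's isolation, timeouts, or metrics, and the two example commands and their names are invented for illustration.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a command with an optional fallback.
abstract class SketchCommand<T> {
    protected abstract T run() throws Exception;

    // No fallback by default, which gives "fail fast" behaviour.
    protected T getFallback() {
        throw new UnsupportedOperationException("No fallback available");
    }

    public final T execute() {
        try {
            return run();
        } catch (Exception failure) {
            try {
                return getFallback();
            } catch (UnsupportedOperationException noFallback) {
                // Fail fast: surface the original failure to the caller.
                throw new IllegalStateException("Command failed with no fallback", failure);
            }
        }
    }
}

// Fail silent: degrade to an empty result.
class RecommendationsCommand extends SketchCommand<List<String>> {
    @Override
    protected List<String> run() throws Exception {
        throw new IOException("dependency unavailable"); // simulated failure
    }

    @Override
    protected List<String> getFallback() {
        return Collections.emptyList();
    }
}

// Fail fast: getFallback() is not overridden, so execute() throws.
class PaymentCommand extends SketchCommand<String> {
    @Override
    protected String run() throws Exception {
        throw new IOException("dependency unavailable"); // simulated failure
    }
}
```

Static and stubbed fallbacks follow the same shape as the fail-silent case; only the value returned from `getFallback()` changes.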

Slide 69

Slide 69 text

HystrixCommand run() getFallback() HystrixCommand run() Fallback via network

Slide 70

Slide 70 text

HystrixCommand run() getFallback() HystrixCommand run() getFallback() Fallback via network then Local
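The "fallback via network, then local" arrangement amounts to trying a sequence of sources in order and ending on a value that needs no network at all. A minimal sketch of that ordering, assuming a hypothetical `FallbackChain.firstSuccessful` helper (not a Hystrix API) and treating any `RuntimeException` as a failed attempt:

```java
import java.util.function.Supplier;

// Hypothetical helper: tries each attempt in order, then falls back to a local value.
class FallbackChain {
    @SafeVarargs
    static <T> T firstSuccessful(T localDefault, Supplier<T>... attempts) {
        for (Supplier<T> attempt : attempts) {
            try {
                return attempt.get();      // primary first, then the networked fallback
            } catch (RuntimeException e) {
                // This attempt failed; move on to the next one.
            }
        }
        return localDefault;               // last resort: a local value that cannot fail
    }
}
```

In the slides each networked step is its own command with its own bulkhead, so a slow secondary cannot exhaust the resources reserved for the primary; this sketch leaves that isolation out.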

Slide 71

Slide 71 text

Transitive Failure

Slide 72

Slide 72 text

Transitive Failure with Bulkheads & Fallbacks

Slide 73

Slide 73 text

All Relationships

Slide 74

Slide 74 text

State State State State Application State?

Slide 75

Slide 75 text

State State State State Cluster Replication (and similar approaches)

Slide 76

Slide 76 text

State State State State State State All Instances Are Now Stateful

Slide 77

Slide 77 text

State State State State State State This Can Be Done

Slide 78

Slide 78 text

State State State State State State But Doesn’t Need To Be State State State State State State

Slide 79

Slide 79 text

So Where To Put State?

Slide 80

Slide 80 text

State State State State State Stateful Client

Slide 81

Slide 81 text

State State State State State Cache Cache Ephemeral Cache (e.g. memcached, redis, etc.)

Slide 82

Slide 82 text

State State State State State Cache Cache Cache Database Database (SQL, key-value, etc)

Slide 83

Slide 83 text

State State State State State Cache Cache Cache Database Database (generally ends up here anyway)

Slide 84

Slide 84 text

State State State State State Cache Cache Cache Database Why? Isn’t this more complicated?

Slide 85

Slide 85 text

Cache Cache Database Database Bounded Context

Slide 86

Slide 86 text

Cache Cache Database Database Despite more parts, it simplifies ownership, operations, reasoning, deployments, and failure modes. Most systems focus on logic and behavior with simple operations.

Slide 87

Slide 87 text

Cache Cache Database Database Despite more parts, it simplifies ownership, operations, reasoning, deployments, and failure modes. Most systems focus on logic and behavior with simple operations. Few focus on durability and state, which bring increased operational challenges and costs.

Slide 88

Slide 88 text

State An example …

Slide 89

Slide 89 text

State Cookie Identity is a critical service. Client state in cookie allows a reasonable fallback even if entire Identity service fails.
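The cookie example can be sketched as follows: the request carries enough client-side state to build a degraded identity when the Identity service is unavailable. Everything here is a hypothetical illustration, including the `customerId=...; key=value` cookie layout, the `Identity` type, and the `resolve` helper; a real system would also need to verify that the cookie is authentic (for example, signed or encrypted) before trusting it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of falling back to client-held state.
class IdentityWithCookieFallback {
    static final class Identity {
        final String customerId;
        final boolean fromService;         // false when built from cookie state alone

        Identity(String customerId, boolean fromService) {
            this.customerId = customerId;
            this.fromService = fromService;
        }
    }

    // Parses an assumed "key=value; key=value" cookie layout.
    static Map<String, String> parseCookie(String cookie) {
        Map<String, String> state = new HashMap<>();
        for (String pair : cookie.split(";")) {
            String[] kv = pair.trim().split("=", 2);
            if (kv.length == 2) {
                state.put(kv[0], kv[1]);
            }
        }
        return state;
    }

    static Identity resolve(String cookie, Function<String, Identity> identityService) {
        String customerId = parseCookie(cookie).get("customerId");
        try {
            return identityService.apply(customerId);   // normal path: ask the service
        } catch (RuntimeException serviceDown) {
            return new Identity(customerId, false);     // fallback: trust the client state
        }
    }
}
```

Downstream code can inspect `fromService` to decide which features are safe to offer with a degraded identity.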

Slide 90

Slide 90 text

“In complex systems, decision-makers are locally rather than globally rational. But that doesn’t mean that their decisions cannot lead to global, or system-wide events. In fact, that is one of the properties of complex systems: local actions can have global results.” Dekker, Sidney (2012-10-01). Drift into Failure (Kindle Locations 3268-3270). Ashgate Publishing. Kindle Edition. – Sidney Dekker, Drift into Failure

Slide 92

Slide 92 text

Load Shedding → Retry Storms

Slide 93

Slide 93 text

Cache Shard Failure → DDoS Origin

Slide 94

Slide 94 text

Dynamic Property Change → Saturate All CPUs

Slide 95

Slide 95 text

Reactive Scaling → Scale Down During Outage → Overwhelmed By Thundering Herd

Slide 96

Slide 96 text

Reactive Scaling → Scale Down During Super Bowl → Overwhelmed By Thundering Herd

Slide 97

Slide 97 text

Achieve Resilience → Neglect → Drift → Vulnerability

Slide 98

Slide 98 text

"Failure Recovery must be a very simple path and that path must be tested frequently" https://www.usenix.org/legacy/event/lisa07/tech/full_papers/hamilton/hamilton_html/ – James Hamilton

Slide 99

Slide 99 text

No content

Slide 100

Slide 100 text

No content

Slide 101

Slide 101 text

AWS Availability Zone AWS Availability Zone AWS Availability Zone

Slide 102

Slide 102 text

No content

Slide 103

Slide 103 text

No content

Slide 104

Slide 104 text

Auditing via Simulation

Slide 105

Slide 105 text

Auditing via Simulation

Slide 106

Slide 106 text

Auditing via Simulation

Slide 107

Slide 107 text

No content

Slide 108

Slide 108 text

125 → 1500+

Slide 109

Slide 109 text

~5000

Slide 110

Slide 110 text

~1

Slide 111

Slide 111 text

No content

Slide 112

Slide 112 text

No content

Slide 113

Slide 113 text

No content

Slide 114

Slide 114 text

No content

Slide 115

Slide 115 text

No content

Slide 116

Slide 116 text

No content

Slide 117

Slide 117 text

Constantly Changing

Slide 118

Slide 118 text

No content

Slide 119

Slide 119 text

Zuul Routing Layer Canary vs Baseline Squeeze Production "Coalmine"

Slide 120

Slide 120 text

Zuul Routing Layer Canary vs Baseline Squeeze Production "Coalmine"

Slide 121

Slide 121 text

Zuul Routing Layer Canary vs Baseline Squeeze Production "Coalmine"

Slide 122

Slide 122 text

No content

Slide 123

Slide 123 text

No content

Slide 124

Slide 124 text

No content

Slide 125

Slide 125 text

No content

Slide 126

Slide 126 text

No content

Slide 127

Slide 127 text

No content

Slide 128

Slide 128 text

No content

Slide 129

Slide 129 text

Zuul Routing Layer Canary vs Baseline Squeeze Production "Coalmine"

Slide 130

Slide 130 text

Zuul Routing Layer Canary vs Baseline Squeeze Production "Coalmine"

Slide 131

Slide 131 text

User Request Dependency A Dependency D Dependency G Dependency J Dependency M Dependency P Dependency B Dependency E Dependency H Dependency K Dependency N Dependency Q Dependency C Dependency F Dependency I Dependency L Dependency O Dependency R System Relationship Over Network without Bulkhead

Slide 132

Slide 132 text

Zuul Routing Layer Canary vs Baseline Squeeze Production "Coalmine"

Slide 133

Slide 133 text

Failure inevitably happens ...

Slide 134

Slide 134 text

Cluster adapts Failure Isolated

Slide 135

Slide 135 text

Cluster adapts Failure Isolated

Slide 136

Slide 136 text

Cluster adapts Failure Isolated

Slide 137

Slide 137 text

Cluster adapts Failure Isolated

Slide 138

Slide 138 text

No content

Slide 139

Slide 139 text

No content

Slide 140

Slide 140 text

No content

Slide 141

Slide 141 text

No content

Slide 142

Slide 142 text

No content

Slide 143

Slide 143 text

Note: This is a mockup

Slide 144

Slide 144 text

Note: This is a mockup

Slide 145

Slide 145 text

“…complex systems run as broken systems. The system continues to function because it contains so many redundancies and because people can make it function, despite the presence of many flaws.” Richard I. Cook - How Complex Systems Fail - http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf – Richard Cook, How Complex Systems Fail

Slide 146

Slide 146 text

Where to next?

Slide 147

Slide 147 text

Low Latency Anomaly Detection

Slide 148

Slide 148 text

Automate Configuration?

Slide 149

Slide 149 text

Global vs Regional Deployment

Slide 150

Slide 150 text

Servers as Pets → Herds (Clusters) Clusters as Pets → Herds (Global Application)

Slide 151

Slide 151 text

Human Involvement

Slide 152

Slide 152 text

Assert Production Readiness?

Slide 153

Slide 153 text

We have long believed that 80% of operations issues originate in design and development, so this section on overall service design is the largest and most important. When systems fail, there is a natural tendency to look first to operations since that is where the problem actually took place. Most operations issues, however, either have their genesis in design and development or are best solved there. https://www.usenix.org/legacy/event/lisa07/tech/full_papers/hamilton/hamilton_html/ – James Hamilton

Slide 155

Slide 155 text

Resilience is by Design

Slide 156

Slide 156 text

Ben Christensen
@benjchristensen
jobs.netflix.com
Fault Tolerance in a High Volume, Distributed System: http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
Hystrix: https://github.com/Netflix/Hystrix/wiki
Drift Into Failure: http://www.amazon.com/Drift-into-Failure-Sidney-Dekker/dp/1409422216
Release It!: http://www.amazon.com/Release-It-Production-Ready-Pragmatic-Programmers/dp/0978739213