Chaos Engineering and design principles for building Highly Available services on the cloud - Speaker Deck

Slide 1

Slide 1 text

HASHICORP
Chaos Engineering and design principles for building Highly Available services on the cloud
Diptanu Gon Choudhury (@diptanu)
RootConf 2016

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Correlation of failures with scale and change
[Figure: axes labeled Rate of Change and Scale]

Slide 5

Slide 5 text

Evolution of hardware
Commodity hardware and networks are the new normal
Processors are not getting faster; we are running more of them

Slide 6

Slide 6 text

Evolution of application architecture
SOA and microservices are replacing monoliths
Distributed systems are the new normal

Slide 7

Slide 7 text

Fanout in microservices

Slide 8

Slide 8 text

A common architecture of the modern web

Slide 9

Slide 9 text

Steady state of the service

Slide 10

Slide 10 text

Steady state of the service

Slide 11

Slide 11 text

Steady state of the service

Slide 12

Slide 12 text

Failure in the caching sub-system

Slide 13

Slide 13 text

Failure in the caching sub-system

Slide 14

Slide 14 text

Failures cascade all the way to the edge

Slide 15

Slide 15 text

Steady state of an SOA-based application

Slide 16

Slide 16 text

Failure in a single service

Slide 17

Slide 17 text

Failure cascades to the edge

Slide 18

Slide 18 text

Drift into Failure - Sidney Dekker
"We can model and understand in isolation. But, when released into competitive, nominally regulated societies, their connections proliferate, their interactions and interdependencies multiply, their complexities mushroom. And we are caught short."

Slide 19

Slide 19 text

WE MUST DESIGN FOR FAILURE

Slide 20

Slide 20 text

Resilience By Design
Feature Complete is only 30% of the journey
Implications of dependencies failing
Implications of a surge in traffic
How quickly can the system recover
Which failures can be mitigated and how
How does the system grow
Implications of data centers failing

Slide 21

Slide 21 text

THE BEST WAY TO AVOID FAILURES IS TO FAIL CONSTANTLY

Slide 22

Slide 22 text

Chaos Engineering
Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production.

Slide 23

Slide 23 text

Failure Injection
Faults across network boundaries are blurry in distributed systems
It's hard to reason about why a service across the network has increased latency
Introducing failures by disrupting hardware uncovers a lot of issues
Introduce failures in dependencies slowly and gradually
Use auditing and monitoring while failures are being injected into a system
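
The points above about gradual injection and auditing can be illustrated with a small sketch. This is not a tool from the talk: `FaultInjector`, its parameters, and `ramp_up` are hypothetical names chosen for illustration. The sketch assumes calls to a dependency are wrapped in process, and that a pair of counters is enough to stand in for auditing.

```python
import random
import time


class FaultInjector:
    """Hypothetical wrapper that disturbs a fraction of calls to a dependency."""

    def __init__(self, rate=0.0, latency_s=0.0, fail=True, rng=None, sleep=time.sleep):
        self.rate = rate            # fraction of calls to disturb (0.0 to 1.0)
        self.latency_s = latency_s  # extra delay added to disturbed calls
        self.fail = fail            # whether disturbed calls also raise an error
        self.rng = rng or random.Random()
        self.sleep = sleep          # injectable so tests need not really sleep
        self.injected = 0           # audit counter: calls that were disturbed
        self.passed = 0             # audit counter: calls left untouched

    def ramp_up(self, step=0.05, ceiling=0.5):
        """Raise the injection rate by a small step, never above the ceiling."""
        self.rate = min(ceiling, self.rate + step)
        return self.rate

    def call(self, fn, *args, **kwargs):
        if self.rng.random() < self.rate:
            self.injected += 1
            if self.latency_s:
                self.sleep(self.latency_s)
            if self.fail:
                raise ConnectionError("injected fault")
        else:
            self.passed += 1
        return fn(*args, **kwargs)
```

Starting at `rate=0.0` and calling `ramp_up()` between observation windows keeps the blast radius small, while comparing `injected` and `passed` against dashboards shows whether the monitoring actually notices the injected faults.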

Slide 24

Slide 24 text

Simian Army from Netflix

Slide 25

Slide 25 text

Simian Army
Chaos Monkey
Chaos Gorilla
Chaos Kong
Latency Monkey
Monkey Commander

Slide 26

Slide 26 text

Simian Army
Chaos Monkey
Chaos Gorilla
Chaos Kong
Latency Monkey
Monkey Commander

Slide 27

Slide 27 text

Simian Army
Chaos Monkey
Chaos Gorilla
Chaos Kong
Latency Monkey
Monkey Commander

Slide 28

Slide 28 text

Simian Army
Chaos Monkey
Chaos Gorilla
Chaos Kong
Latency Monkey
Monkey Commander

Slide 29

Slide 29 text

Simian Army
Chaos Monkey
Chaos Gorilla
Chaos Kong
Latency Monkey
Monkey Commander

Slide 30

Slide 30 text

Failures revisited
Node Failures, Switch Failures, Datacenter interconnect Failures, Region Failures, DNS Failures
Deploy clusters, not nodes
State should be at a service level
Unleash Chaos Monkey

Slide 31

Slide 31 text

Failures revisited
Node Failures, Switch Failures, Datacenter interconnect Failures, Region Failures, DNS Failures
Spread clusters across racks
Use smart load balancers
Rely on service discovery
Unleash Latency Monkey

Slide 32

Slide 32 text

Failures revisited
Node Failures, Switch Failures, Datacenter interconnect Failures, Region Failures, DNS Failures
Spread clusters across data centers
Load balancer falls back to a healthy DC
Unleash the Chaos Gorilla

Slide 33

Slide 33 text

Failures revisited
Node Failures, Switch Failures, Datacenter interconnect Failures, Region Failures, DNS Failures
Run services Active-Active
Use Geo-DNS to divide traffic
Proxies at the edge to relieve pressure
Unleash the Chaos Kong

Slide 34

Slide 34 text

Failures revisited
Node Failures, Switch Failures, Datacenter interconnect Failures, Region Failures, DNS Failures
Know how to escalate
Increase TTL

Slide 35

Slide 35 text

Chaos Engineering applied to Human Resources
Be prepared to not have the full team to deal with outages
Send your team members on vacation
Prepare runbooks and dashboards to triage and avoid tribal knowledge

Slide 36

Slide 36 text

Tools for building resilient systems
Reactive Load Balancers
Circuit Breakers
Dynamic Cluster Schedulers
Dynamic Service Discovery
Reactive scaling
Immutable infrastructure

Slide 37

Slide 37 text

Reactive Load Balancers
Score nodes based on latencies
Parallel requests
Request cancellation
Shuffle and randomize balancing
Stream oriented protocols
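
A minimal sketch of scoring nodes by latency combined with randomized balancing follows. The slide does not name an algorithm, so this swaps in two common techniques as assumptions: an exponentially weighted moving average (EWMA) for the score, and "power of two choices" for the randomized pick. `LatencyAwareBalancer` and its methods are hypothetical names.

```python
import random


class LatencyAwareBalancer:
    """Hypothetical balancer: EWMA latency scores plus a randomized two-way pick."""

    def __init__(self, nodes, alpha=0.3, rng=None):
        self.alpha = alpha                      # weight given to the newest sample
        self.rng = rng or random.Random()
        self.score = {n: None for n in nodes}   # None means no latency seen yet

    def record(self, node, latency_s):
        """Fold an observed request latency into the node's moving average."""
        old = self.score[node]
        if old is None:
            self.score[node] = latency_s
        else:
            self.score[node] = (1 - self.alpha) * old + self.alpha * latency_s

    def _effective(self, node):
        # Unseen nodes count as 0.0 so they get probed at least once.
        s = self.score[node]
        return 0.0 if s is None else s

    def pick(self):
        """Sample two distinct nodes at random and return the lower-latency one."""
        nodes = list(self.score)
        if len(nodes) == 1:
            return nodes[0]
        a, b = self.rng.sample(nodes, 2)
        return a if self._effective(a) <= self._effective(b) else b
```

Because every pick compares two random candidates, the slowest node is never chosen, yet traffic is not all herded onto the single fastest node the way a strict "lowest latency wins" rule would do.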

Slide 38

Slide 38 text

Circuit Breakers
Guard all network calls
Return fallbacks
Bulkhead requests
Aggressive timeouts
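
The guard-and-fallback idea can be sketched as a small state machine. This is an illustrative sketch, not the API of any particular library; `CircuitBreaker`, `max_failures`, and `reset_after_s` are hypothetical names. It assumes the guarded function enforces its own aggressive timeout (for example a socket timeout) and raises when the timeout fires. Bulkheading is not shown.

```python
import time


class CircuitBreaker:
    """Hypothetical breaker: opens after repeated failures, retries after a pause."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so tests can control time
        self.failures = 0           # consecutive failures seen so far
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()   # open: fail fast without touching the network
            # Otherwise half-open: let this one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()   # open (or re-open) the circuit
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get the fallback immediately, which keeps a slow dependency from tying up threads upstream and cascading toward the edge.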

Slide 39

Slide 39 text

Cluster Schedulers
Restart services when they fail
Faster deployments
Higher utilization
Quality of Service

Slide 40

Slide 40 text

Dynamic Service Discovery
Registry of the data center
Remove and add services quickly
Be prone to failures
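
One way to add and remove services quickly is a registry where instances stay listed only while they keep sending heartbeats. The sketch below is an in-memory illustration under that assumption, not the interface of a real discovery system; `ServiceRegistry`, `heartbeat`, and `ttl_s` are hypothetical names.

```python
import time


class ServiceRegistry:
    """Hypothetical in-memory registry with heartbeat-based expiry."""

    def __init__(self, ttl_s=10.0, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock        # injectable so tests can control time
        self._last_seen = {}      # (service, address) -> time of last heartbeat

    def heartbeat(self, service, address):
        """Register an instance, or refresh it if it is already known."""
        self._last_seen[(service, address)] = self.clock()

    def deregister(self, service, address):
        """Remove an instance immediately, for example on clean shutdown."""
        self._last_seen.pop((service, address), None)

    def lookup(self, service):
        """Return addresses whose last heartbeat is within the TTL."""
        now = self.clock()
        return sorted(
            addr
            for (svc, addr), seen in self._last_seen.items()
            if svc == service and now - seen <= self.ttl_s
        )
```

An instance that crashes without deregistering simply stops appearing in `lookup` once its TTL lapses, so clients stop routing to it without any manual cleanup.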

Slide 41

Slide 41 text

Reactive Scaling
Scale out in reaction to latencies and QPS
Throttle when hit by thundering storms
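
Throttling under a sudden burst can be sketched with a token bucket. The slide does not say which throttling scheme is meant, so the token bucket is an assumption here, and `TokenBucket`, `rate_per_s`, and `burst` are hypothetical names.

```python
import time


class TokenBucket:
    """Hypothetical throttle: admits short bursts, then limits to a steady rate."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate = rate_per_s       # tokens added per second
        self.burst = burst           # maximum tokens the bucket can hold
        self.clock = clock           # injectable so tests can control time
        self.tokens = float(burst)   # start full so an initial burst is allowed
        self.updated = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should be shed."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Requests rejected by `allow()` can be answered with a cheap error or a fallback, which buys time for scale-out to catch up with the increase in latency or QPS.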

Slide 42

Slide 42 text

Immutable Infrastructure
Cattle, not pets
Infrastructure should be considered disposable

Slide 43

Slide 43 text

HOPE IS NOT A STRATEGY

Slide 44

Slide 44 text

Thanks!
Drift into Failure: From Hunting Broken Components to Understanding Complex Systems - Sidney Dekker
How Complex Systems Fail - Richard Cook
Notes on Distributed Systems for Young Bloods - Jeff Hodges