Slide 1

Slide 1 text

HASHICORP
 Chaos Engineering and design principles for building Highly Available services on the cloud
 Diptanu Gon Choudhury @diptanu
 RootConf 2016

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Correlation of failures with scale and change
 [Chart axes: Rate of Change, Scale]

Slide 5

Slide 5 text

Evolution of hardware
 Commodity hardware and network are the new normal
 Processors are not getting faster; we are running more of them

Slide 6

Slide 6 text

Evolution of application architecture
 SOA and Micro Services are replacing monoliths
 Distributed Systems are the new normal

Slide 7

Slide 7 text

Fanout in Micro Services

Slide 8

Slide 8 text

A common architecture of the modern web

Slide 9

Slide 9 text

Steady state of the service

Slide 10

Slide 10 text

Steady state of the service

Slide 11

Slide 11 text

Steady state of the service

Slide 12

Slide 12 text

Failure in the caching sub-system

Slide 13

Slide 13 text

Failure in the caching sub-system

Slide 14

Slide 14 text

Failures cascade all the way to the edge

Slide 15

Slide 15 text

Steady state of an SOA-based application

Slide 16

Slide 16 text

Failure in a single service

Slide 17

Slide 17 text

Failure cascades to the edge

Slide 18

Slide 18 text

Drift into Failure - Sidney Dekker
 We can model and understand in isolation. But, when released into competitive, nominally regulated societies, their connections proliferate, their interactions and interdependencies multiply, their complexities mushroom. And we are caught short.

Slide 19

Slide 19 text

WE MUST 
 DESIGN FOR FAILURE

Slide 20

Slide 20 text

Resilience By Design
 Feature Complete is only 30% of the journey
 Implications of dependencies failing
 Implications of a surge in traffic
 How quickly can the system recover
 Which failures can be mitigated and how
 How does the system grow
 Implications of Data Centers failing

Slide 21

Slide 21 text

THE BEST WAY 
 TO AVOID FAILURES
 IS TO FAIL CONSTANTLY

Slide 22

Slide 22 text

Chaos Engineering
 Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

Slide 23

Slide 23 text

Failure Injection
 Faults across network boundaries are blurry in distributed systems
 It’s hard to reason about why a service across the network has increased latency
 Introducing failures by disrupting hardware uncovers a lot of issues
 Introduce failures in dependencies in a slow, gradual manner
 Use auditing and monitoring while failures are being injected into a system
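
The gradual-injection and auditing points above could be sketched roughly as follows. This is only an illustration in Python: `FaultInjector`, `InjectedFault`, `ramp`, and all parameter names are hypothetical and do not come from the talk or from any Simian Army tool. The wrapper fails a configurable fraction of calls to a dependency, lets the rate be raised in small steps, and keeps counters that a monitoring system could scrape.

```python
import random
import time


class InjectedFault(Exception):
    """Raised instead of calling the dependency when a fault is injected."""


class FaultInjector:
    """Hypothetical wrapper that fails a fraction of calls to a dependency."""

    def __init__(self, failure_rate=0.0, added_latency_s=0.0, seed=None, sleep=time.sleep):
        self.failure_rate = failure_rate        # probability in [0, 1] of injecting a fault
        self.added_latency_s = added_latency_s  # delay applied before an injected failure
        self._rng = random.Random(seed)         # seeded so an experiment can be repeated
        self._sleep = sleep                     # injectable so tests do not have to wait
        self.calls = 0                          # audit counter: all calls seen
        self.injected = 0                       # audit counter: faults injected

    def ramp(self, step, ceiling):
        """Increase the failure rate by one step, never past the ceiling."""
        self.failure_rate = min(ceiling, self.failure_rate + step)

    def call(self, fn, *args, **kwargs):
        """Call fn, or simulate a slow failure with probability failure_rate."""
        self.calls += 1
        if self._rng.random() < self.failure_rate:
            self.injected += 1
            if self.added_latency_s > 0:
                self._sleep(self.added_latency_s)
            raise InjectedFault("fault injected by experiment")
        return fn(*args, **kwargs)
```

An experiment would typically start at a rate of zero, call `ramp` on a schedule, and watch `injected / calls` alongside the service's own dashboards before raising the ceiling.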

Slide 24

Slide 24 text

Simian Army from Netflix

Slide 25

Slide 25 text

Simian Army
 Chaos Monkey
 Chaos Gorilla
 Chaos Kong
 Latency Monkey
 Monkey Commander

Slide 30

Slide 30 text

Failures revisited
 Failure classes: Node Failures, Switch Failures, Datacenter interconnect Failures, Region Failures, DNS Failures
 Node Failures:
 Deploy clusters, not nodes
 State should be at a service level
 Unleash Chaos Monkey

Slide 31

Slide 31 text

Failures revisited: Switch Failures
 Spread clusters across racks
 Use smart load balancers
 Rely on service discovery
 Unleash Latency Monkey

Slide 32

Slide 32 text

Failures revisited: Datacenter interconnect Failures
 Spread clusters across data centers
 Load balancer to fall back on a healthy DC
 Unleash the Chaos Gorilla

Slide 33

Slide 33 text

Failures revisited: Region Failures
 Run services Active-Active
 Use Geo-DNS to divide traffic
 Proxies at the edge to relieve pressure
 Unleash the Chaos Kong

Slide 34

Slide 34 text

Failures revisited: DNS Failures
 Be prepared to know how to escalate
 Increase TTL

Slide 35

Slide 35 text

Chaos Engineering applied to Human Resources
 Be prepared to not have the full team to deal with outages
 Send your team members on vacation
 Prepare runbooks and dashboards to triage and avoid tribal knowledge

Slide 36

Slide 36 text

Tools for building resilient systems
 Reactive Load Balancers
 Circuit Breakers
 Dynamic Cluster Schedulers
 Dynamic Service Discovery
 Reactive scaling
 Immutable infrastructure

Slide 37

Slide 37 text

Reactive Load Balancers
 Score nodes based on latencies
 Parallel requests
 Request cancellation
 Shuffle and randomize balancing
 Stream oriented protocols
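
One way to combine latency scoring with randomized balancing is sketched below. It is an assumption-laden illustration, not the balancer described in the talk: `LatencyAwareBalancer`, `record`, `pick`, and the smoothing factor `alpha` are hypothetical names. Each node keeps an exponentially weighted moving average of observed latency, and each request picks two nodes at random and uses the one with the lower score. Parallel requests, cancellation, and streaming are not covered.

```python
import random


class LatencyAwareBalancer:
    """Hypothetical balancer: random two-node sample, lower latency score wins."""

    def __init__(self, nodes, alpha=0.3, seed=None):
        self._scores = {node: None for node in nodes}  # None means no sample yet
        self._alpha = alpha                            # weight given to the newest sample
        self._rng = random.Random(seed)

    def record(self, node, latency_s):
        """Fold an observed latency into the node's moving average."""
        old = self._scores[node]
        if old is None:
            self._scores[node] = latency_s
        else:
            self._scores[node] = (1 - self._alpha) * old + self._alpha * latency_s

    def score(self, node):
        """Current score; unmeasured nodes score 0.0 so they get tried early."""
        value = self._scores[node]
        return 0.0 if value is None else value

    def pick(self):
        """Sample up to two distinct nodes at random and return the faster one."""
        candidates = self._rng.sample(list(self._scores), k=min(2, len(self._scores)))
        return min(candidates, key=self.score)
```

Sampling two candidates instead of always choosing the global minimum keeps traffic spread out, so a node that briefly looks fastest is not hit by every client at once.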

Slide 38

Slide 38 text

Circuit Breakers
 Guard all network calls
 Return fallbacks
 Bulkhead requests
 Aggressive timeouts
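
A minimal sketch of the guard-and-fallback behaviour, written in Python for illustration. The class and its parameters (`max_failures`, `reset_after_s`, `clock`) are hypothetical and are not taken from the talk or from any particular library. The wrapped function is assumed to enforce its own aggressive timeout and raise when it expires; bulkheading is not shown.

```python
import time


class CircuitBreaker:
    """Hypothetical breaker: opens after repeated failures, retries after a cool-down."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_after_s = reset_after_s  # cool-down before a trial call is allowed
        self._clock = clock                 # injectable so tests can control time
        self._failures = 0
        self._opened_at = None              # None while the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_after_s:
            return "half-open"
        return "open"

    def call(self, fn, fallback):
        """Run fn behind the breaker; return fallback() when fn fails or is skipped."""
        if self.state == "open":
            return fallback()               # fail fast without touching the network
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open after a failed trial call.
            if self._failures >= self.max_failures or self._opened_at is not None:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0                  # any success closes the circuit
        self._opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_profile, lambda: cached_profile)`, where both names stand in for application code.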

Slide 39

Slide 39 text

Cluster Schedulers
 Restart services when they fail
 Faster deployments
 Higher utilization
 Quality of Service

Slide 40

Slide 40 text

Dynamic Service Discovery
 Registry of the data center
 Remove and add services quickly
 Be prone to failures

Slide 41

Slide 41 text

Reactive Scaling
 Scale out in reaction to latencies and QPS
 Throttle when hit by thundering storms
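
The throttling point can be illustrated with a token bucket, one common technique for shedding load during a sudden burst. This is a sketch under assumptions, not something the slides specify: `TokenBucket`, `rate_per_s`, `burst`, and `allow` are hypothetical names, and the scale-out decision itself is not shown.

```python
import time


class TokenBucket:
    """Hypothetical throttle: admit a short burst, then a steady rate."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate_per_s = rate_per_s  # tokens added per second
        self.burst = burst            # maximum tokens the bucket can hold
        self._clock = clock           # injectable so tests can control time
        self._tokens = float(burst)   # start full so an initial burst is admitted
        self._last = clock()

    def allow(self):
        """Return True and spend a token if the request may proceed."""
        now = self._clock()
        refill = (now - self._last) * self.rate_per_s
        self._tokens = min(self.burst, self._tokens + refill)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

Requests for which `allow()` returns False would be rejected or queued at the edge, which buys time for newly scaled-out capacity to come online.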


Slide 42

Slide 42 text

Immutable Infrastructure
 Cattle and not pets
 Infrastructure should be considered disposable

Slide 43

Slide 43 text

HOPE IS NOT A STRATEGY

Slide 44

Slide 44 text

Thanks!
 Drift into Failure: From Hunting Broken Components to Understanding Complex Systems - Sidney Dekker
 How Complex Systems Fail - Richard Cook
 Notes on Distributed Systems for Young Bloods - Jeff Hodges