Chaos Engineering and design principles for building Highly Available services on the cloud

HASHICORP
Diptanu Gon Choudhury (@diptanu)
RootConf 2016

Correlation of failures with scale and change
[Figure: plot with axes "Rate of Change" and "Scale"]

Evolution of hardware
- Commodity hardware and networks are the new normal
- Processors are not getting faster; we are running more of them

Evolution of application architecture
- SOA and microservices are replacing monoliths
- Distributed systems are the new normal

Fanout in Microservices

A common architecture of the modern web

Steady state of the service

Failure in the caching sub-system

Failures cascade all the way to the edge

Steady state of a SOA-based application

Failure in a single service

Failure cascades to the edge

Drift into Failure - Sidney Dekker
"We can model and understand in isolation. But when released into competitive, nominally regulated societies, their connections proliferate, their interactions and interdependencies multiply, their complexities mushroom. And we are caught short."

WE MUST DESIGN FOR FAILURE

Resilience By Design
- Feature complete is only 30% of the journey
- Implications of dependencies failing
- Implications of a surge in traffic
- How quickly can the system recover?
- Which failures can be mitigated, and how?
- How does the system grow?
- Implications of data centers failing

THE BEST WAY TO AVOID FAILURES IS TO FAIL CONSTANTLY

Chaos Engineering
Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production.

Failure Injection
- Faults across network boundaries are blurry in distributed systems
- It is hard to reason about why a service across the network has increased latency
- Introducing failures by disrupting hardware uncovers a lot of issues
- Introduce failures in dependencies slowly and gradually
- Use auditing and monitoring while failures are being injected into a system
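The gradual-injection and auditing points above can be sketched in a few lines. This is a minimal illustration, not a tool from the talk: the `FaultInjector` class, its `ramp` method, and the counters are hypothetical names chosen for the example, and the injected fault is a plain `ConnectionError`.

```python
import random
import time


class FaultInjector:
    """Hypothetical wrapper that injects faults into calls to a dependency."""

    def __init__(self, failure_rate=0.0, added_latency_s=0.0, rng=None, sleep=time.sleep):
        self.failure_rate = failure_rate        # fraction of calls that fail, 0.0 to 1.0
        self.added_latency_s = added_latency_s  # delay applied before an injected failure
        self.rng = rng or random.Random()
        self.sleep = sleep
        # Audit counters, so the experiment can be monitored while it runs.
        self.injected = 0
        self.passed = 0

    def ramp(self, step, ceiling):
        """Raise the failure rate by `step`, never exceeding `ceiling`."""
        self.failure_rate = min(ceiling, self.failure_rate + step)

    def call(self, fn, *args, **kwargs):
        """Call `fn`, or raise an injected ConnectionError instead."""
        if self.rng.random() < self.failure_rate:
            self.injected += 1
            if self.added_latency_s:
                self.sleep(self.added_latency_s)
            raise ConnectionError("injected fault")
        self.passed += 1
        return fn(*args, **kwargs)
```

Starting at a failure rate of zero and calling `ramp` in small steps keeps the blast radius small, and the two counters give a simple audit trail to compare against service dashboards.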

Simian Army from Netflix

Simian Army
- Chaos Monkey
- Chaos Gorilla
- Chaos Kong
- Latency Monkey
- Monkey Commander

Failures revisited
- Node Failures
- Switch Failures
- Datacenter interconnect Failures
- Region Failures
- DNS Failures

Node Failures
- Deploy clusters, not nodes
- State should be at a service level
- Unleash Chaos Monkey

Switch Failures
- Spread clusters across racks
- Use smart load balancers
- Rely on service discovery
- Unleash Latency Monkey

Datacenter interconnect Failures
- Spread clusters across data centers
- Load balancer falls back on a healthy DC
- Unleash the Chaos Gorilla

Region Failures
- Run services Active-Active
- Use Geo-DNS to divide traffic
- Proxies at the edge to relieve pressure
- Unleash the Chaos Kong

DNS Failures
- Know how to escalate
- Increase TTL

Chaos Engineering applied to Human Resources
- Be prepared to not have the full team available to deal with outages
- Send your team members on vacation
- Prepare runbooks and dashboards for triage, and avoid tribal knowledge

Tools for building resilient systems
- Reactive Load Balancers
- Circuit Breakers
- Dynamic Cluster Schedulers
- Dynamic Service Discovery
- Reactive scaling
- Immutable infrastructure

Reactive Load Balancers
- Score nodes based on latencies
- Parallel requests
- Request cancellation
- Shuffle and randomize balancing
- Stream oriented protocols
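As a rough illustration of latency scoring combined with randomized selection, the sketch below keeps an exponentially weighted moving average (EWMA) of latency per node and picks the faster of two randomly sampled nodes. The "two random choices" rule is one common way to randomize, swapped in here for the example; the slides do not name a specific algorithm, and `LatencyAwareBalancer` is a hypothetical class.

```python
import random


class LatencyAwareBalancer:
    """Hypothetical balancer: EWMA latency scores plus two random choices."""

    def __init__(self, nodes, alpha=0.3, rng=None):
        self.alpha = alpha                     # weight given to the newest sample
        self.rng = rng or random.Random()
        # A score of 0.0 means "no samples yet", so untested nodes get tried first.
        self.latency = {node: 0.0 for node in nodes}

    def record(self, node, observed_s):
        """Fold one observed latency (in seconds) into the node's score."""
        prev = self.latency[node]
        if prev == 0.0:
            self.latency[node] = observed_s
        else:
            self.latency[node] = (1 - self.alpha) * prev + self.alpha * observed_s

    def pick(self):
        """Sample two distinct nodes at random and return the lower-latency one."""
        nodes = list(self.latency)
        if len(nodes) == 1:
            return nodes[0]
        a, b = self.rng.sample(nodes, 2)
        return a if self.latency[a] <= self.latency[b] else b
```

Sampling two nodes instead of always choosing the global minimum avoids sending every client to the same "best" node at once, while still steering traffic away from slow ones.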

Circuit Breakers
- Guard all network calls
- Return fallbacks
- Bulkhead requests
- Aggressive timeouts
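A minimal circuit breaker sketch follows, covering the "guard network calls" and "return fallbacks" points. It assumes the guarded function enforces its own timeout and raises on failure; bulkheading is not shown. The `CircuitBreaker` class and its parameters are hypothetical names for this example, not the API of any particular library.

```python
import time


class CircuitBreaker:
    """Hypothetical breaker: opens after repeated failures, retries after a cool-down."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0        # consecutive failures seen so far
        self.opened_at = None    # time the circuit opened, or None when closed

    def call(self, fn, fallback):
        """Run `fn`; on failure, or while the circuit is open, return `fallback()`."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()  # open: fail fast without touching the network
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A typical use would wrap a remote call such as `breaker.call(fetch_profile, lambda: cached_profile)`, where both names are placeholders for the caller's own functions.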

Cluster Schedulers
- Restart services when they fail
- Faster deployments
- Higher utilization
- Quality of Service

Dynamic Service Discovery
- Registry of the data center
- Remove and add services quickly
- Be tolerant of failures
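One simple way to get quick add and remove behaviour is a registry in which instances stay listed only while they keep sending heartbeats. The sketch below is an in-memory illustration under that assumption; `ServiceRegistry`, its TTL parameter, and the addresses are hypothetical, and a real registry would also need replication so that the registry itself is not a single point of failure.

```python
import time


class ServiceRegistry:
    """Hypothetical in-memory registry with heartbeat-based expiry."""

    def __init__(self, ttl_s=10.0, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._last_seen = {}  # (service, address) -> time of last heartbeat

    def heartbeat(self, service, address):
        """Register an instance, or refresh it if it is already known."""
        self._last_seen[(service, address)] = self.clock()

    def deregister(self, service, address):
        """Remove an instance immediately, for example on clean shutdown."""
        self._last_seen.pop((service, address), None)

    def lookup(self, service):
        """Return sorted addresses whose last heartbeat is within the TTL."""
        now = self.clock()
        return sorted(
            addr
            for (svc, addr), seen in self._last_seen.items()
            if svc == service and now - seen <= self.ttl_s
        )
```

An instance that crashes simply stops heartbeating and drops out of `lookup` results once the TTL passes, so callers stop routing to it without any manual cleanup.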

Reactive Scaling
- Scale out in reaction to latencies and QPS
- Throttle when hit by thundering herds
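The two ideas on this slide can be sketched as a scale-out rule and a token-bucket throttle. Everything here is an assumption made for illustration: the function name `scale_out_target`, the per-replica capacity of 100 QPS, the 250 ms latency objective, and the `TokenBucket` class are not from the talk.

```python
import math
import time


def scale_out_target(current, qps, p99_latency_s, qps_per_replica=100.0,
                     latency_slo_s=0.25, max_replicas=50):
    """Return a replica count that covers QPS and reacts to high latency.

    This rule only scales out; scaling in is deliberately left out.
    """
    by_qps = math.ceil(qps / qps_per_replica)
    # Add one replica when observed latency exceeds the objective.
    by_latency = current + 1 if p99_latency_s > latency_slo_s else current
    return max(1, min(max_replicas, max(by_qps, by_latency)))


class TokenBucket:
    """Hypothetical throttle: admits `rate_per_s` requests per second, with bursts."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate_per_s = rate_per_s
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)
        self.updated = clock()

    def allow(self):
        """Refill tokens for the elapsed time, then spend one if available."""
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate_per_s)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Scaling out takes time, so the throttle covers the gap: requests beyond the bucket's capacity are rejected early instead of queuing up behind an overloaded service during a sudden surge.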

Immutable Infrastructure
- Cattle, not pets
- Infrastructure should be considered disposable

HOPE IS NOT A STRATEGY

Thanks!
- Drift into Failure: From Hunting Broken Components to Understanding Complex Systems - Sidney Dekker
- How Complex Systems Fail - Richard Cook
- Notes on Distributed Systems for Young Bloods - Jeff Hodges
