
Chaos Engineering and design principles for building 
Highly Available services on the cloud

Diptanu Choudhury
April 22, 2016

Complex distributed systems are hard to operate and have very complex failure modes. In this talk, we discuss how to build confidence in large-scale distributed systems by introducing random but controlled failures into them in production, understanding how services degrade, and working towards healing and recovering from failures automatically. We also discuss patterns and techniques for designing highly available and resilient distributed systems.

Transcript

  1. HASHICORP Chaos Engineering and design principles for building Highly Available services on the cloud
    Diptanu Gon Choudhury @diptanu
    RootConf 2016
  2. Evolution of hardware
    Commodity hardware and network are the new normal
    Processors are not getting faster, we are running more of them
  3. Evolution of application architecture
    SOA and Micro Services are replacing monoliths
    Distributed Systems are the new normal
  4. Drift into Failure - Sidney Dekker
    We can model and understand in isolation. But, when released into competitive, nominally regulated societies, their connections proliferate, their interactions and interdependencies multiply, their complexities mushroom. And we are caught short
  5. Resilience By Design
    Feature Complete is only 30% of the journey
    Implications of dependencies failing
    Implications of surge in traffic
    How quickly can the system recover
    Which failures can be mitigated and how
    How does the system grow
    Implications of Data Centers failing
  6. Chaos Engineering
    Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
  7. Failure Injection
    Faults across network boundaries are blurry in distributed systems
    It’s hard to reason about why a service across the network has increased latency
    Introducing failures by disrupting hardware uncovers a lot of issues
    Introduce failures in dependencies in a slow and gradual manner
    Use auditing and monitoring while failures are being injected into a system
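
The failure-injection advice on slide 7 (introduce failures in dependencies gradually, and audit while doing so) could be sketched roughly as follows. This is only an illustration, not the tooling used in the talk: `FaultInjector`, `InjectedFault`, `max_rate`, and `step` are hypothetical names, and the wrapped `call` stands in for any dependency client.

```python
import logging
import random

logger = logging.getLogger("fault_injection")


class InjectedFault(Exception):
    """Raised in place of a real dependency error (hypothetical name)."""


class FaultInjector:
    """Wraps a dependency call and fails a slowly growing fraction of requests."""

    def __init__(self, call, max_rate=0.2, step=0.05, rng=None):
        self.call = call          # the real dependency call being wrapped
        self.max_rate = max_rate  # upper bound on the injected failure rate
        self.step = step          # how much each ramp-up adds
        self.rate = 0.0           # start with no injected failures
        self.injected = 0         # audit counter
        self.rng = rng or random.Random()

    def ramp_up(self):
        """Raise the failure rate by one step, never beyond max_rate."""
        self.rate = min(self.max_rate, self.rate + self.step)
        logger.info("injected failure rate is now %.2f", self.rate)
        return self.rate

    def stop(self):
        """Abort the experiment and return to normal behaviour."""
        self.rate = 0.0

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.rate:
            self.injected += 1
            logger.warning("injecting fault (total so far: %d)", self.injected)
            raise InjectedFault("injected failure")
        return self.call(*args, **kwargs)
```

An operator would call `ramp_up()` in small increments while watching dashboards, and `stop()` as soon as the system degrades more than expected.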
  8. Failures revisited
    Node Failures, Switch Failures, Datacenter interconnect Failures, Region Failures, DNS Failures
    Deploy clusters not nodes
    State should be at a service level
    Unleash Chaos Monkey
  9. Failures revisited
    Node Failures, Switch Failures, Datacenter interconnect Failures, Region Failures, DNS Failures
    Spread clusters across racks
    Use smart load balancers
    Rely on service discovery
    Unleash Latency Monkey
  10. Failures revisited
    Node Failures, Switch Failures, Datacenter interconnect Failures, Region Failures, DNS Failures
    Spread clusters across data centers
    Load balancer should fall back to a healthy DC
    Unleash the Chaos Gorilla
  11. Failures revisited
    Node Failures, Switch Failures, Datacenter interconnect Failures, Region Failures, DNS Failures
    Run services Active-Active
    Use Geo-DNS to divide traffic
    Proxies at the edge to relieve pressure
    Unleash the Chaos Kong
  12. Failures revisited
    Node Failures, Switch Failures, Datacenter interconnect Failures, Region Failures, DNS Failures
    Be prepared to know how to escalate
    Increase TTL
  13. Chaos Engineering applied to Human Resources
    Be prepared to not have the full team to deal with outages
    Send your team members on vacation
    Prepare runbooks and dashboards to triage and avoid tribal knowledge
  14. Tools for building resilient systems
    Reactive Load Balancers
    Circuit Breakers
    Dynamic Cluster Schedulers
    Dynamic Service Discovery
    Reactive scaling
    Immutable infrastructure
  15. Reactive Load Balancers
    Reactive Load Balancers, Circuit Breakers, Dynamic Cluster Schedulers, Dynamic Service Discovery, Reactive scaling, Immutable infrastructure
    Score nodes based on latencies
    Parallel requests
    Request cancellation
    Shuffle and randomize balancing
    Stream oriented protocols
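
Two of the points on slide 15, scoring nodes based on latencies and randomizing the balancing, can be combined in a small sketch. The selection rule used here (sample two random nodes and keep the one with the lower score, often called "power of two choices") and the exponentially weighted moving average are assumptions chosen for illustration; the slide does not say which scoring or selection scheme the speaker had in mind. `LatencyAwareBalancer` and its parameters are hypothetical names.

```python
import random


class LatencyAwareBalancer:
    """Scores nodes by smoothed latency and picks between two random candidates."""

    def __init__(self, nodes, alpha=0.3, rng=None):
        if not nodes:
            raise ValueError("at least one node is required")
        self.alpha = alpha                 # weight given to the newest sample
        self.rng = rng or random.Random()
        # A score of 0.0 means "no samples yet", so new nodes get tried early.
        self.score = {node: 0.0 for node in nodes}

    def record(self, node, latency_ms):
        """Fold an observed latency into the node's moving average."""
        old = self.score[node]
        if old == 0.0:
            self.score[node] = float(latency_ms)
        else:
            self.score[node] = (1 - self.alpha) * old + self.alpha * latency_ms

    def pick(self):
        """Sample two distinct nodes at random and return the lower-scored one."""
        nodes = list(self.score)
        if len(nodes) == 1:
            return nodes[0]
        a, b = self.rng.sample(nodes, 2)
        return a if self.score[a] <= self.score[b] else b
```

Sampling two candidates instead of always choosing the global best keeps every client from stampeding onto the same momentarily fast node.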
  16. Circuit Breakers
    Reactive Load Balancers, Circuit Breakers, Dynamic Cluster Schedulers, Dynamic Service Discovery, Reactive scaling, Immutable infrastructure
    Guard all network calls
    Return fallbacks
    Bulkhead requests
    Aggressive timeouts
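
As a rough illustration of the circuit breaker pattern from slide 16 (guard the network call, return a fallback when it fails), a minimal sketch might look like the following. The class name, the `max_failures` and `reset_after` parameters, and the injectable `clock` are assumptions made for this example; production libraries typically add bulkheading, metrics, and per-call timeouts, and the guarded `func` is expected to enforce its own timeout.

```python
import time


class CircuitBreaker:
    """Fails fast with a fallback after repeated errors, then retries later."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to stay open before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()         # open: do not touch the dependency
            # Otherwise half-open: let one trial call through below.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_profile, lambda: DEFAULT_PROFILE)`, where both names are placeholders for the application's own function and fallback value.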
  17. Cluster Schedulers
    Reactive Load Balancers, Circuit Breakers, Dynamic Cluster Schedulers, Dynamic Service Discovery, Reactive scaling, Immutable infrastructure
    Restart services when they fail
    Faster deployments
    Higher utilization
    Quality of Service
  18. Dynamic Service Discovery
    Reactive Load Balancers, Circuit Breakers, Dynamic Cluster Schedulers, Dynamic Service Discovery, Reactive scaling, Immutable infrastructure
    Registry of the data center
    Remove and add services quickly
    Be prone to failures
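
The idea on slide 18 of a registry from which services can be added and removed quickly can be sketched with heartbeats and a time-to-live. This is a toy in-memory model under stated assumptions, not the API of any real discovery system: `ServiceRegistry`, `heartbeat`, `deregister`, `lookup`, and the `ttl` parameter are all hypothetical names.

```python
import time


class ServiceRegistry:
    """In-memory registry where instances expire unless they keep heartbeating."""

    def __init__(self, ttl=10.0, clock=time.monotonic):
        self.ttl = ttl        # seconds an instance stays listed without a heartbeat
        self.clock = clock
        self._entries = {}    # (service, address) -> time of last heartbeat

    def heartbeat(self, service, address):
        """Register an instance, or refresh it if already registered."""
        self._entries[(service, address)] = self.clock()

    def deregister(self, service, address):
        """Remove an instance immediately, e.g. on graceful shutdown."""
        self._entries.pop((service, address), None)

    def lookup(self, service):
        """Return the addresses whose last heartbeat is within the TTL."""
        now = self.clock()
        return sorted(
            addr
            for (name, addr), seen in self._entries.items()
            if name == service and now - seen <= self.ttl
        )
```

Because a crashed instance simply stops sending heartbeats, it drops out of `lookup` results after `ttl` seconds without anyone having to remove it explicitly.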
  19. Reactive Scaling
    Reactive Load Balancers, Circuit Breakers, Dynamic Cluster Schedulers, Dynamic Service Discovery, Reactive scaling, Immutable infrastructure
    Scale out in reaction to latencies and QPS
    Throttle when hit by thundering storms
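
Both points on slide 19 can be illustrated with a short sketch: a scale-out decision driven by QPS and latency, and a token bucket for throttling sudden bursts. The thresholds (`qps_per_replica`, `latency_slo_ms`, `max_replicas`) and the choice of a token bucket are assumptions made for this example rather than values or techniques named in the talk.

```python
import math
import time


def desired_replicas(current, qps, latency_ms, qps_per_replica=100.0,
                     latency_slo_ms=200.0, max_replicas=50):
    """Return a replica count that only ever scales out, capped at max_replicas."""
    # Enough replicas to serve the observed QPS, never fewer than we have now.
    target = max(current, math.ceil(qps / qps_per_replica))
    # If latency breaches the objective, add at least one more replica.
    if latency_ms > latency_slo_ms:
        target = max(target, current + 1)
    return min(max_replicas, target)


class TokenBucket:
    """Admits up to `burst` requests at once and `rate` requests per second."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)
        self.updated = clock()

    def allow(self):
        """Refill tokens for the elapsed time, then try to spend one."""
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Requests rejected by `allow()` would typically be answered with a fast error or a cached response while the additional replicas come online.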

  20. Immutable Infrastructure
    Reactive Load Balancers, Circuit Breakers, Dynamic Cluster Schedulers, Dynamic Service Discovery, Reactive scaling, Immutable infrastructure
    Cattle and not pets
    Infrastructure should be considered disposable
  21. Thanks!
    Drift into Failure: From Hunting Broken Components to Understanding Complex Systems - Sidney Dekker
    How Complex Systems Fail - Richard Cook
    Notes on Distributed Systems for Young Bloods - Jeff Hodges