Resilience and Chaos Engineering in the Cloud

Nikos Katirtzis

August 06, 2020

Transcript

  1. Facts (2019)

    • $100+ billion in gross bookings
    • $12+ billion revenue
    • 200+ travel booking sites

    Every incident impacts our customers.
  2. Common Practices - Traditional

    • Retries (Application)
    • Timeouts (Application)
    • Circuit Breaking (Application)
    • Redundancy (Multiple Instances)
    • Data Replication
    • Distributed Caching
    • Fallbacks (Application)
    • Load Balancing (Round-robin)
    • Rate Limiting (Application)
  3. Common Practices – Cloud Providers

    • Redundancy: Multi-AZ (Region 1: AZ1, AZ2, AZ3)
    • Redundancy: Multi-region (Region 1: AZ1, AZ2, AZ3; Region 2: AZ1, AZ2, AZ3)
    • Data Replication: Cross-region Replication (Region 1, Region 2)
  4. Common Practices – Cloud Providers

    • AWS EC2 Autoscaling
    • Serverless Aurora
    • VPC Failover
    • Load Balancing
    • AWS VPC Peering
    • AWS Elastic Load Balancing
  5. Common Practices – Kubernetes

    • Readiness/Liveness Probes
    • Multiple Replicas
    • Graceful Shutdowns
    • Horizontal Autoscaling
    • Pods spread across AZs (AZ1, AZ2)
    • Pod Disruption Budget
  6. Common Practices – Service Mesh

    • Retries (Mesh)
    • Deadline Propagation (Mesh)
    • Circuit Breaking (Mesh)
    • Fault Injection
    • Load Balancing (Latency-aware)
    • Rate Limiting (Mesh)
  7. Common Practices – Chaos Engineering

    The discipline of experimenting on a software system in production in order to build confidence in the system's capability to withstand turbulent and unexpected conditions.
  8. Why Chaos Engineering

    Chaos experiments are not expected to have catastrophic consequences. But many times things go wrong.

    J. Paul Reed - Senior Applied Resilience Engineer @Netflix

    Chernobyl
  9. Demo - Hypothesis

    1. When the cache becomes unavailable, the service falls back to fetching the reviews from the DB.
    2. Given (1), when the DB is slow but latency < circuit breaker's threshold, the service is still healthy.
    3. Given (1), when the DB is very slow and latency > circuit breaker's threshold, the circuit breaker opens and we don't see reviews.
  10. Continuous Verification

    Source: Casey Rosenthal & Nora Jones – Chaos Engineering, System Resiliency in Practice

    Figure labels: Proactive, Experimentation, Tool, Verification, System Behaviors, Reactive, Testing, Methodology, Validation, Known Properties.

    Practices shown: Continuous Verification, Disaster Recovery, Incident Response, Alerting, Property-based Testing, Load Testing, Squeeze Testing, Unit Testing, Functional Testing, Integration Testing, Version Control, Continuous Integration, Continuous Delivery, Agile, DevOps, SRE, Code Review, Static Analysis, Logging, Distributed Tracing, Observability, Monitoring, Synthetic Monitoring.
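
The three demo hypotheses on slide 9 can be illustrated with a minimal Python sketch. This is not the demo's actual code. Everything in it is an assumption made for illustration: the names `ReviewsService`, `CircuitBreaker`, `CacheUnavailable`, `make_db` and `run_experiment`, the 1-second threshold, the three-failure trip count, and the simulated latencies (no real waiting happens).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


class CacheUnavailable(Exception):
    """Raised by the hypothetical cache client when the cache is down."""


@dataclass
class CircuitBreaker:
    """Opens after `max_failures` consecutive DB calls slower than `threshold_s`."""
    threshold_s: float = 1.0   # assumed threshold, not taken from the deck
    max_failures: int = 3      # assumed trip count, not taken from the deck
    failures: int = 0
    is_open: bool = False

    def record(self, latency_s: float) -> bool:
        """Record one call; return True if it finished within the threshold."""
        if latency_s > self.threshold_s:
            self.failures += 1
            self.is_open = self.failures >= self.max_failures
            return False
        self.failures = 0
        return True


@dataclass
class ReviewsService:
    cache_get: Callable[[str], List[str]]             # may raise CacheUnavailable
    db_get: Callable[[str], Tuple[List[str], float]]  # (reviews, simulated latency in s)
    breaker: CircuitBreaker

    def reviews(self, item_id: str) -> List[str]:
        try:
            return self.cache_get(item_id)
        except CacheUnavailable:
            pass  # hypothesis 1: fall back to the DB
        if self.breaker.is_open:
            return []  # hypothesis 3: short-circuit, no reviews shown
        reviews, latency_s = self.db_get(item_id)
        if self.breaker.record(latency_s):
            return reviews  # hypothesis 2: slow, but under the threshold
        return []  # call treated as timed out


def broken_cache(item_id: str) -> List[str]:
    """Fault injection: the cache is always unavailable."""
    raise CacheUnavailable(item_id)


def make_db(latency_s: float) -> Callable[[str], Tuple[List[str], float]]:
    """Build a fake DB whose calls report a fixed, simulated latency."""
    def db_get(item_id: str) -> Tuple[List[str], float]:
        return ["great stay"], latency_s
    return db_get


def run_experiment(db_latency_s: float, calls: int = 5) -> Tuple[List[List[str]], bool]:
    """Inject a cache outage plus a DB latency; return responses and breaker state."""
    service = ReviewsService(broken_cache, make_db(db_latency_s), CircuitBreaker())
    responses = [service.reviews("hotel-1") for _ in range(calls)]
    return responses, service.breaker.is_open
```

Under these assumptions, `run_experiment(0.8)` returns the reviews on every call and leaves the breaker closed, while `run_experiment(2.0)` returns only empty lists and leaves the breaker open. The sketch omits a half-open state and real timeouts, both of which a production circuit breaker would normally have.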