[Velocity EU] Chaos Engineering: When The Network Breaks

Ho Ming Li
November 07, 2019

Outages like S3 (2017) and Dyn DNS (2016) will happen again.
Prepare for more of these incidents by practicing Chaos Engineering.
The Internet and your applications are complex distributed systems.
Downtime is incredibly costly because of both direct and indirect (hidden) costs.

Demo of running an experiment against a Non-Critical and a Critical service on Kubernetes.


  1. Chaos Engineering: When the Network Breaks

    Ho Ming Li (HML), Principal SA, Gremlin
    hml@gremlin.com @horeal
  2. @horeal #velocityconf February 28th 2017

  3. Simple Storage Service Outage

  4. S3 Outage

  5. None
  6. None
  7. None

  8. THE PROBLEM: Every system is becoming a distributed system.

  9. None
  10. Chaos Engineering: Thoughtful, planned experiments designed to reveal the weaknesses in our systems.

  11. build an immunity

  12. proactively

  13. None
  14. Measuring the Cost of Downtime

    Cost = R + E + C + (B + A)

    During the Outage: R = Revenue Lost, E = Employee Productivity
    After the Outage: C = Customer Chargebacks (SLA Breaches)
    Unquantifiable: B = Brand Defamation, A = Employee Attrition

    Amazon is estimated to lose $220,000/min
    The average e-commerce site loses $6,800/min
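The cost formula on this slide can be worked through with a short calculation. The sketch below is illustrative only: it assumes the $6,800/min figure can be treated as the revenue-loss rate R, that revenue is lost at a constant rate, and it uses made-up values for E and C. The names `DowntimeCost` and `revenue_lost` are hypothetical. B and A are left out because the slide lists them as unquantifiable.

```python
from dataclasses import dataclass


@dataclass
class DowntimeCost:
    """Quantifiable terms of Cost = R + E + C + (B + A), in dollars."""
    revenue_lost: float           # R, during the outage
    employee_productivity: float  # E, during the outage
    customer_chargebacks: float   # C, after the outage (SLA breaches)

    def quantifiable_total(self) -> float:
        # B (brand defamation) and A (employee attrition) are omitted:
        # the slide lists them as unquantifiable.
        return (self.revenue_lost
                + self.employee_productivity
                + self.customer_chargebacks)


def revenue_lost(rate_per_minute: float, outage_minutes: float) -> float:
    """R for one outage, assuming revenue is lost at a constant rate."""
    return rate_per_minute * outage_minutes


# Slide figure: the average e-commerce site loses $6,800/min.
# Assumption: treat that figure as the revenue-loss rate only.
r = revenue_lost(6_800, outage_minutes=30)

# Hypothetical E and C values, chosen only to complete the example.
cost = DowntimeCost(revenue_lost=r,
                    employee_productivity=15_000,
                    customer_chargebacks=25_000)

print(f"R = ${r:,.0f}")
print(f"Quantifiable cost = ${cost.quantifiable_total():,.0f}")
```

Even with B and A excluded, a 30-minute outage under these assumptions lands in the hundreds of thousands of dollars, which is the point the slide makes about hidden costs.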
  15. Network Chaos

  16. Network Chaos Engineering Demos

    01 Setup: Microservices on Kubernetes
    02 Experiment: Blackhole Non-Critical Service
    03 Experiment: Blackhole Critical Service
  17. Load Balancer (AWS ELB)

    Kubernetes (AWS EKS)
    Microservices Demo (GCP App)
  18. Hipster Shop Architecture https://github.com/GoogleCloudPlatform/microservices-demo

  19. $ kubectl get pods -n hipster-shop

    NAME                                     READY   STATUS    RESTARTS   AGE
    adservice-68646787f-gthm8                1/1     Running   0          33d
    cartservice-9565d66fb-79kbt              1/1     Running   0          33d
    checkoutservice-56b68ddf6c-bggx2         1/1     Running   0          33d
    currencyservice-dc765bd97-n9w2k          1/1     Running   0          33d
    emailservice-7b8f79cf59-jvtks            1/1     Running   0          33d
    frontend-765d87bdf6-jkwrz                1/1     Running   0          33d
    paymentservice-6bc589546c-x85hm          1/1     Running   0          33d
    productcatalogservice-688889974c-hg65j   1/1     Running   0          33d
    recommendationservice-7cdcdc8799-gd6k8   1/1     Running   0          33d
    redis-cart-5d9c69b749-j2dtp              1/1     Running   0          33d
    shippingservice-76dfb78d4c-lp6lp         1/1     Running   0          33d
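Before running an experiment it helps to confirm the steady state shown in this listing, where every pod is fully ready and Running. The sketch below parses the plain-text output of `kubectl get pods` and reports any pod that is not. It is an illustration, not part of the talk: the function name `not_running` is hypothetical, the first sample reuses two rows from the slide, and the second sample contains a made-up failing row.

```python
def not_running(kubectl_output: str) -> list:
    """Return names of pods that are not fully ready and Running.

    Expects the default column layout: NAME READY STATUS RESTARTS AGE.
    """
    unhealthy = []
    lines = kubectl_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore blank or malformed rows
        name, ready, status = fields[:3]
        ready_count, total = ready.split("/")
        if status != "Running" or ready_count != total:
            unhealthy.append(name)
    return unhealthy


# Two rows taken from the slide.
healthy_sample = """
NAME                              READY   STATUS    RESTARTS   AGE
adservice-68646787f-gthm8         1/1     Running   0          33d
paymentservice-6bc589546c-x85hm   1/1     Running   0          33d
"""

# Hypothetical failure, not from the talk.
degraded_sample = """
NAME                              READY   STATUS             RESTARTS   AGE
adservice-68646787f-gthm8         0/1     CrashLoopBackOff   7          33d
paymentservice-6bc589546c-x85hm   1/1     Running            0          33d
"""

print(not_running(healthy_sample))
print(not_running(degraded_sample))
```

An empty list means the listing matches the steady state on the slide; any names returned are a reason to postpone the experiment.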
  20. $ kubectl get nodes

    NAME                                           STATUS   ROLES    AGE    VERSION
    ip-192-168-10-180.us-west-2.compute.internal   Ready    <none>   104d   v1.12.7
    ip-192-168-16-15.us-west-2.compute.internal    Ready    <none>   104d   v1.12.7
    ip-192-168-64-15.us-west-2.compute.internal    Ready    <none>   104d   v1.12.7
    ip-192-168-66-173.us-west-2.compute.internal   Ready    <none>   104d   v1.12.7

    $ kubectl get svc -n hipster-shop | grep frontend-external
    frontend-external LoadBalancer a3ea30b0ce63111e9a91c06eb1d4a1ec-761119657.us-west-2.elb.amazonaws.com 80:31126/TCP 33d
  21. Non-Critical Service Demo

  22. Ad Service: Non-Critical (Tier 2)

  23. Ad Service

    Checkout Flow is impacted.
    Ad Impression drops below threshold of 10,000/s
    Blackhole
    120 seconds (2 minutes)
    owner=hml, app=AdService
  24. Demo

  25. Critical Service Demo

  26. Payment Service: Critical (Tier 1)

  27. Traffic not hitting Payment Service

    Network: Blackhole
    owner=hml, app=PaymentService
    20 seconds
    5 seconds
    all traffic (whitelist for control plane)

    owner=hml, app=PaymentService
    5 minutes (300 seconds)
    5 seconds
    all traffic (whitelist for control plane)

    The Payment Service is having intermittent network issues where traffic does not flow to the service. Refer to incident INC-3278.
    For brief interruptions, checkout should be delayed but should confirm once the network recovers. For longer interruptions, we expect a timeout.
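The hypothesis on this slide (a brief interruption delays checkout but it still confirms, a long one ends in a timeout) can be sketched as a small simulation. This is an illustration under assumptions, not the demo's code: the `Blackhole` class, the `checkout` function, the 60-second timeout, and the 5-second retry interval are all hypothetical. Only the 20-second and 300-second blackhole durations come from the slide. Time is simulated, so nothing sleeps.

```python
from dataclasses import dataclass


@dataclass
class Blackhole:
    """Drops all traffic to the service for `duration` seconds from `start`."""
    start: float
    duration: float

    def drops(self, now: float) -> bool:
        return self.start <= now < self.start + self.duration


def checkout(blackhole: Blackhole,
             timeout: float = 60.0,
             retry_interval: float = 5.0):
    """Simulate a checkout that retries the payment call until a deadline.

    Returns (outcome, seconds), where outcome is "confirmed" or "timeout".
    """
    elapsed = 0.0
    while elapsed <= timeout:
        if not blackhole.drops(elapsed):
            # The payment call got through on this attempt.
            return ("confirmed", elapsed)
        elapsed += retry_interval
    # Every attempt inside the deadline was dropped.
    return ("timeout", timeout)


# Durations from the slide: 20 seconds and 5 minutes (300 seconds).
brief = checkout(Blackhole(start=0, duration=20))
extended = checkout(Blackhole(start=0, duration=300))

print(brief)
print(extended)
```

In the simulation the 20-second blackhole produces a delayed confirmation and the 300-second one produces a timeout, matching the expectation written on the slide. The real experiment is what shows whether the service behaves this way.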
  28. Demo

  29. How to communicate results of your Chaos Engineering experiments? go.gremlin.com/10x

  30. Was it expected? Chaos Engineering uncovers unknown side effects.

    Was it detected? Ensuring that our monitoring is configured correctly is critical.
    Was it mitigated? When possible our systems should gracefully degrade.
  31. Fix the issues. Whether code, configuration or process: iterate and improve.

    Can you automate this? Regularly exercise past failures to prevent the drift into failure.
    Share your results! Prepare an Executive Summary of what you learned.
  32. CE Getting Started Cheatsheet

    If you have... | Properties | Experiment
    “Cloud native” app | 12-factor, stateless | Shutdown Host/Container
    3-tier web application | presentation, app-tier, data-tier | Faulty connection to the data store.
    Peak Season or Launch Event | auto-scaling: scale out, scale in | System under high traffic/load. Resource contention/starvation.
    Monitoring & Alerting | metrics, logging, tracing, alerting thresholds | Ensure alerts are fired upon signals. Ensure that engineers can find answers to operational questions.
    Past Incidents | incident root cause analysis | Recreate scenarios to validate fixes
  33. Network Breaks. Prepare for Failure. Get Started Now.

  34. Get Started for FREE gremlin.com/free

  35. Thank You

    Reliably Yours
    hml@gremlin.com @horeal
    Get this deck @ gremlin.com/hml