[Velocity EU] Chaos Engineering: When The Network Breaks

Ho Ming Li
November 07, 2019

Outages like S3 (2017) and Dyn DNS (2016) will happen again.
Prepare for more of these incidents by practicing Chaos Engineering.
The Internet, and your applications, are complex distributed systems.
Downtime is incredibly costly due to direct and indirect (hidden) costs.

Demo of running experiments against a non-critical and a critical service on Kubernetes.

Transcript

  1. Measuring the Cost of Downtime

    Cost = R + E + C + (B + A)

    During the Outage
      R = Revenue Lost
      E = Employee Productivity
    After the Outage
      C = Customer Chargebacks (SLA Breaches)
    Unquantifiable
      B = Brand Defamation
      A = Employee Attrition

    Amazon is estimated to lose $220,000/min.
    The average e-commerce site loses $6,800/min.
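As a rough worked example of the cost formula, the quantifiable part (R + E + C) can be added up for a hypothetical outage. The sketch below treats the slide's $6,800/min figure as revenue lost per minute, which is an assumption; the 30-minute duration and the productivity and chargeback amounts are invented purely for illustration. B and A are left out because the slide calls them unquantifiable.

```python
def revenue_lost(outage_minutes, loss_per_minute):
    """R: revenue lost while the outage lasts."""
    return outage_minutes * loss_per_minute


def quantifiable_downtime_cost(r, e, c):
    """R + E + C. B (brand) and A (attrition) are omitted as unquantifiable."""
    return r + e + c


# Hypothetical 30-minute outage at the slide's average e-commerce rate.
r = revenue_lost(outage_minutes=30, loss_per_minute=6_800)

# Illustrative placeholders, not figures from the talk.
e = 15_000  # employee productivity lost during the outage
c = 5_000   # customer chargebacks after the outage

total = quantifiable_downtime_cost(r, e, c)
print(f"R = ${r:,}; R + E + C = ${total:,}")
```

Even with small placeholder values for E and C, the revenue term dominates quickly as the outage gets longer.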
  2. Network Chaos Engineering Demos

    01 Setup: Microservices on Kubernetes
    02 Experiment: Blackhole Non-Critical Service
    03 Experiment: Blackhole Critical Service
  3. @horeal #velocityconf

    $ kubectl get pods -n hipster-shop
    NAME                                     READY   STATUS    RESTARTS   AGE
    adservice-68646787f-gthm8                1/1     Running   0          33d
    cartservice-9565d66fb-79kbt              1/1     Running   0          33d
    checkoutservice-56b68ddf6c-bggx2         1/1     Running   0          33d
    currencyservice-dc765bd97-n9w2k          1/1     Running   0          33d
    emailservice-7b8f79cf59-jvtks            1/1     Running   0          33d
    frontend-765d87bdf6-jkwrz                1/1     Running   0          33d
    paymentservice-6bc589546c-x85hm          1/1     Running   0          33d
    productcatalogservice-688889974c-hg65j   1/1     Running   0          33d
    recommendationservice-7cdcdc8799-gd6k8   1/1     Running   0          33d
    redis-cart-5d9c69b749-j2dtp              1/1     Running   0          33d
    shippingservice-76dfb78d4c-lp6lp         1/1     Running   0          33d
  4. $ kubectl get nodes

    NAME                                           STATUS   ROLES    AGE    VERSION
    ip-192-168-10-180.us-west-2.compute.internal   Ready    <none>   104d   v1.12.7
    ip-192-168-16-15.us-west-2.compute.internal    Ready    <none>   104d   v1.12.7
    ip-192-168-64-15.us-west-2.compute.internal    Ready    <none>   104d   v1.12.7
    ip-192-168-66-173.us-west-2.compute.internal   Ready    <none>   104d   v1.12.7

    $ kubectl get svc -n hipster-shop | grep frontend-external
    frontend-external   LoadBalancer   10.100.44.108   a3ea30b0ce63111e9a91c06eb1d4a1ec-761119657.us-west-2.elb.amazonaws.com   80:31126/TCP   33d
  5. Ad Service

    Checkout Flow is impacted.
    Ad Impression drops below threshold of 10,000/s.
    Blackhole, 120 seconds (2 minutes)
    owner=hml, app=AdService
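One way to read this slide is as an experiment definition: a blackhole attack scoped by tags, a fixed duration, and a condition under which the experiment should be halted. The sketch below records those values in a small data structure. Treating "Ad Impression drops below threshold of 10,000/s" as a halt condition is an interpretation, and the class and field names are hypothetical, not part of any chaos tooling.

```python
from dataclasses import dataclass


@dataclass
class BlackholeExperiment:
    """Hypothetical record of one blackhole experiment from the slide."""
    tags: dict
    duration_seconds: int
    halt_metric: str
    halt_below: float

    def should_halt(self, observed: float) -> bool:
        # Halt as soon as the watched metric falls under the threshold.
        return observed < self.halt_below


ad_experiment = BlackholeExperiment(
    tags={"owner": "hml", "app": "AdService"},
    duration_seconds=120,                      # 2 minutes, as on the slide
    halt_metric="ad_impressions_per_second",   # assumed metric name
    halt_below=10_000,
)

for observed in (12_500, 10_000, 9_999):
    print(observed, ad_experiment.should_halt(observed))
```

Writing the halt condition down before the attack starts keeps the blast radius decision out of the heat of the moment.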
  6. Traffic not hitting Payment Service

    Network: Blackhole
    owner=hml, app=PaymentService; 20 seconds; 5 seconds; all traffic (whitelist for control plane)
    owner=hml, app=PaymentService; 5 minutes (300 seconds); 5 seconds; all traffic (whitelist for control plane)

    Payment Service is having intermittent network issues where traffic does not flow to the service. Refer to incident INC-3278.
    For brief interruptions, checkout should be delayed but confirmed after the network recovers. We expect a timeout for longer interruptions.
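The hypothesis on this slide (a short blackhole delays checkout, a long one ends in a timeout) can be illustrated with a retry loop bounded by a deadline. This is a minimal simulation under stated assumptions, not the demo application's code: the 30-second deadline, the 1-second retry interval, and every name below are hypothetical, and a fake clock stands in for the real network so the sketch runs instantly.

```python
def call_with_retry(send, clock, sleep, deadline_seconds, retry_interval=1.0):
    """Retry send() until it succeeds or the deadline would be exceeded."""
    start = clock()
    while True:
        try:
            return send()
        except ConnectionError:
            if clock() - start + retry_interval > deadline_seconds:
                raise TimeoutError("payment service unreachable")
            sleep(retry_interval)


class FakeNetwork:
    """Simulated payment service that is blackholed for outage_seconds."""

    def __init__(self, outage_seconds):
        self.now = 0.0
        self.outage_seconds = outage_seconds

    def clock(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds  # advance simulated time instead of waiting

    def send(self):
        if self.now < self.outage_seconds:
            raise ConnectionError("blackholed")
        return "confirmed"


def checkout(outage_seconds, deadline_seconds=30.0):
    net = FakeNetwork(outage_seconds)
    try:
        return call_with_retry(net.send, net.clock, net.sleep, deadline_seconds)
    except TimeoutError:
        return "timeout"


print(checkout(20))    # the 20-second blackhole from the slide
print(checkout(300))   # the 5-minute blackhole from the slide
```

With these assumed settings, the 20-second attack ends in a delayed confirmation and the 300-second attack ends in a timeout, which matches the behaviour the slide expects.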
  7. Was it expected? Chaos Engineering uncovers unknown side effects.

    Was it detected? Ensuring that our monitoring is configured correctly is critical.
    Was it mitigated? When possible our systems should gracefully degrade.
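Graceful degradation for a non-critical dependency can be sketched as a fallback around the call: if the ad service is unreachable, the page still renders without ads. The function names and page structure below are hypothetical and only illustrate the idea; they are not taken from the demo application.

```python
def fetch_ads_blackholed():
    """Stand-in for an ad service whose traffic is being dropped."""
    raise ConnectionError("ad service unreachable")


def render_home_page(fetch_products, fetch_ads):
    # Products are critical: let a failure here propagate to the caller.
    products = fetch_products()
    # Ads are non-critical: fall back to an empty list on network errors.
    try:
        ads = fetch_ads()
    except (ConnectionError, TimeoutError):
        ads = []
    return {"products": products, "ads": ads}


page = render_home_page(lambda: ["mug", "camera"], fetch_ads_blackholed)
print(page)
```

Running the non-critical blackhole experiment against code shaped like this checks that the fallback path actually works, rather than assuming it does.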
  8. Fix the issues. Whether code, configuration or process - iterate and improve.

    Can you automate this? Regularly exercise past failures to prevent the drift into failure.
    Share your results! Prepare an Executive Summary of what you learned.
  9. CE Getting Started Cheatsheet

    | If you have... | Properties | Experiment |
    | --- | --- | --- |
    | “Cloud native” app | 12-factor, stateless | Shutdown Host/Container |
    | 3-tier web application | presentation, app-tier, data-tier | Faulty connection to the data store. |
    | Peak Season or Launch Event | auto-scaling; scale out, scale in | System under high traffic/load; resource contention/starvation |
    | Monitoring & Alerting | metrics, logging, tracing; alerting thresholds | Ensure alerts are fired upon signals. Ensure that engineers can find answers to operational questions. |
    | Past Incidents | incident root cause analysis | Recreate scenarios to validate fixes |