[Velocity EU] Chaos Engineering: When The Network Breaks

Chaos Engineering: When the Network Breaks
Ho Ming Li (HML)
Principal SA, Gremlin
[email protected] @horeal

@horeal #velocityconf February 28th 2017

Simple Storage Service Outage

S3 Outage

THE PROBLEM

Every system is becoming a distributed system.

Chaos Engineering

Thoughtful, planned experiments designed to reveal the weakness in our systems.

build an immunity

proactively

Measuring the Cost of Downtime

Cost = R + E + C + (B + A)

During the Outage
R = Revenue Lost
E = Employee Productivity

After the Outage
C = Customer Chargebacks (SLA Breaches)

Unquantifiable
B = Brand Defamation
A = Employee Attrition

Amazon is estimated to lose $220,000/min.
The average e-commerce site loses $6,800/min.
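As a rough worked example of the formula, the sketch below totals only the quantifiable terms (R, E and C) for a made-up outage. The $6,800/min rate is the slide's figure; the 30-minute duration and the productivity and chargeback amounts are hypothetical inputs chosen for illustration, and B and A are left out because the slide treats them as unquantifiable.

```python
def downtime_cost(minutes, revenue_per_min, employee_cost=0, chargebacks=0):
    """Quantifiable part of Cost = R + E + C + (B + A); B and A are omitted."""
    revenue_lost = minutes * revenue_per_min  # R
    return revenue_lost + employee_cost + chargebacks  # R + E + C


# Hypothetical 30-minute outage at the average e-commerce rate from the slide.
# The E and C values are invented for this example.
cost = downtime_cost(30, 6_800, employee_cost=15_000, chargebacks=5_000)
print(f"${cost:,.0f}")  # prints "$224,000"
```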

Network Chaos

Network Chaos Engineering Demos
01 Setup: Microservices on Kubernetes
02 Experiment: Blackhole Non-Critical Service
03 Experiment: Blackhole Critical Service

Load Balancer (AWS ELB)
Kubernetes (AWS EKS)
Microservices Demo (GCP App)

Hipster Shop Architecture https://github.com/GoogleCloudPlatform/microservices-demo

$ kubectl get pods -n hipster-shop
NAME                                     READY   STATUS    RESTARTS   AGE
adservice-68646787f-gthm8                1/1     Running   0          33d
cartservice-9565d66fb-79kbt              1/1     Running   0          33d
checkoutservice-56b68ddf6c-bggx2         1/1     Running   0          33d
currencyservice-dc765bd97-n9w2k          1/1     Running   0          33d
emailservice-7b8f79cf59-jvtks            1/1     Running   0          33d
frontend-765d87bdf6-jkwrz                1/1     Running   0          33d
paymentservice-6bc589546c-x85hm          1/1     Running   0          33d
productcatalogservice-688889974c-hg65j   1/1     Running   0          33d
recommendationservice-7cdcdc8799-gd6k8   1/1     Running   0          33d
redis-cart-5d9c69b749-j2dtp              1/1     Running   0          33d
shippingservice-76dfb78d4c-lp6lp         1/1     Running   0          33d
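Before injecting a failure it helps to confirm that the system is in a steady state. The sketch below is one possible way to check that from the text output of kubectl get pods; the helper name unhealthy_pods is hypothetical, and the failing paymentservice row in the sample is invented for illustration (in the output above every pod is Running).

```python
def unhealthy_pods(kubectl_output):
    """Return names of pods that are not Running or not fully ready."""
    bad = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip the header row
        name, ready, status = line.split()[:3]
        ready_count, total = ready.split("/")
        if status != "Running" or ready_count != total:
            bad.append(name)
    return bad


# Sample text in the shape of `kubectl get pods` output.
# The CrashLoopBackOff row is made up to show a failing case.
sample = """\
NAME                              READY   STATUS             RESTARTS   AGE
adservice-68646787f-gthm8         1/1     Running            0          33d
paymentservice-6bc589546c-x85hm   0/1     CrashLoopBackOff   3          33d
"""
print(unhealthy_pods(sample))  # prints ['paymentservice-6bc589546c-x85hm']
```

An empty list would mean every listed pod is ready, which is a reasonable precondition for starting an experiment.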

$ kubectl get nodes
NAME                                           STATUS   ROLES    AGE    VERSION
ip-192-168-10-180.us-west-2.compute.internal   Ready    <none>   104d   v1.12.7
ip-192-168-16-15.us-west-2.compute.internal    Ready    <none>   104d   v1.12.7
ip-192-168-64-15.us-west-2.compute.internal    Ready    <none>   104d   v1.12.7
ip-192-168-66-173.us-west-2.compute.internal   Ready    <none>   104d   v1.12.7

$ kubectl get svc -n hipster-shop | grep frontend-external
frontend-external   LoadBalancer   10.100.44.108   a3ea30b0ce63111e9a91c06eb1d4a1ec-761119657.us-west-2.elb.amazonaws.com   80:31126/TCP   33d

Non-Critical Service Demo

Ad Service : Non-Critical (Tier 2)

Ad Service
Checkout Flow is impacted.
Ad Impression drops below threshold of 10,000/s.
Blackhole
120 seconds (2 minutes)
owner=hml, app=AdService
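One way to read the two statements on this slide is as conditions for halting the experiment early; the slide does not label them, so that reading is an assumption. Under it, the check could be sketched as follows. The class and function names are hypothetical and are not part of any Gremlin API; only the 120-second duration, the labels, and the 10,000/s threshold come from the slide.

```python
from dataclasses import dataclass


@dataclass
class BlackholeExperiment:
    target_labels: dict            # e.g. owner=hml, app=AdService
    duration_s: int                # how long traffic is dropped
    min_ad_impressions_per_s: int  # halt if impressions fall below this


def should_halt(exp, ad_impressions_per_s, checkout_impacted):
    """Halt when checkout is affected or ad impressions drop too far."""
    return checkout_impacted or ad_impressions_per_s < exp.min_ad_impressions_per_s


ad_blackhole = BlackholeExperiment(
    target_labels={"owner": "hml", "app": "AdService"},
    duration_s=120,
    min_ad_impressions_per_s=10_000,
)

print(should_halt(ad_blackhole, ad_impressions_per_s=12_500, checkout_impacted=False))
print(should_halt(ad_blackhole, ad_impressions_per_s=8_000, checkout_impacted=False))
```

The first call stays within the threshold and returns False; the second falls below it and returns True.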

Critical Service Demo

Payment Service : Critical (Tier 1)

Traffic not hitting Payment Service
Network: Blackhole
owner=hml, app=PaymentService
20 seconds
5 seconds
all traffic (whitelist for control plane)
owner=hml, app=PaymentService
5 minutes (300 seconds)
5 seconds
all traffic (whitelist for control plane)

Payment Service is experiencing intermittent network issues where traffic does not flow to the service. Refer to incident INC-3278. For brief interruptions, checkout should be delayed but confirm after the network recovers. For longer interruptions, we expect a timeout.
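The hypothesis for the Payment Service (brief interruptions delay checkout, longer ones time out) can be illustrated with a small simulation. This is only a sketch on a simulated clock, not the demo application's code: the function name, the 30-second client timeout and the 1-second retry interval are all assumptions chosen for illustration. The 20-second and 300-second outage lengths are the two blackhole durations from the slide.

```python
def checkout(outage_s, timeout_s=30.0, retry_interval_s=1.0):
    """Simulate a checkout call while payment traffic is dropped for outage_s seconds.

    Returns ("confirmed", seconds_waited) or ("timeout", seconds_waited).
    """
    elapsed = 0.0
    while elapsed < timeout_s:
        if elapsed >= outage_s:      # network has recovered, so the call succeeds
            return "confirmed", elapsed
        elapsed += retry_interval_s  # wait, then retry
    return "timeout", elapsed


print(checkout(20))   # prints ('confirmed', 20.0)
print(checkout(300))  # prints ('timeout', 30.0)
```

With these assumed settings, the 20-second blackhole delays the checkout but it still confirms, while the 5-minute blackhole outlasts the client timeout.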

How to communicate results of your Chaos Engineering experiments? go.gremlin.com/10x

Was it expected? Chaos Engineering uncovers unknown side effects.
Was it detected? Ensuring that our monitoring is configured correctly is critical.
Was it mitigated? When possible our systems should gracefully degrade.

Fix the issues. Whether code, configuration or process, iterate and improve.
Can you automate this? Regularly exercise past failures to prevent the drift into failure.
Share your results! Prepare an Executive Summary of what you learned.

CE Getting Started Cheatsheet

If you have... | Properties | Experiment
"Cloud native" app | 12-factor, stateless | Shutdown Host/Container
3-tier web application | presentation, app-tier, data-tier | Faulty connection to the data store.
Peak Season or Launch Event | auto-scaling, scale out, scale in | System under high traffic/load. Resource contention/starvation.
Monitoring & Alerting | metrics, logging, tracing, alerting thresholds | Ensure alerts are fired upon signals. Ensure that engineers can find answers to operational questions.
Past Incidents | incident root cause analysis | Recreate scenarios to validate fixes

Network Breaks. Prepare for Failure. Get Started Now.

Get Started for FREE gremlin.com/free

Reliably Yours
[email protected] @horeal

Thank You
Get this deck @ gremlin.com/hml
