[PD Summit 2019] Best Practices to Kickstart your Chaos Engineering Journey

Best Practices to Kickstart your Chaos Engineering Journey September 2019
@horeal [email protected]

Modern Day Failures
Retail outages online leave shoppers frustrated on Black Friday (11.23.18)
Customers report difficulty accessing Chase Bank mobile and online (2.16.19)
Major US Airlines hit by delays after glitch at vendor (4.1.19)

The Golden Circle (Start with Why)
Why: Resilience. Increase Confidence.
How: Chaos Engineering. Break Things on Purpose.
What: Failure Injection Tool. Gremlin: Tooling Platform and Experts.

Getting Buy-In: Debunking Chaos Engineering Myths!

Myth #1: Chaos Engineering must be Chaotic
Random Failures. Random Breakage.
Chaos Engineering should be Controlled
Controlled Failures. Targeted Breakage.

Myth #2: Chaos Engineering is Dangerous
What about the customers? What about the co-tenants?
Chaos Engineering should be Safe
Thoughtfulness on Risk Mitigation. Tightly Scope to contain Blast Radius. Have an Abort Plan.
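The three safety points above (mitigate risk, contain the blast radius, have an abort plan) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Gremlin or any other tool works: the `Experiment` fields, the `inject`, `revert`, and `error_rate` callbacks, and the thresholds are all hypothetical names invented for this sketch.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experiment:
    """Hypothetical description of one tightly scoped experiment."""
    targets: List[str]        # every host that is eligible for the experiment
    blast_radius: float       # fraction of eligible hosts to affect (0 < x <= 1)
    abort_error_rate: float   # abort as soon as the error rate exceeds this


def pick_targets(exp: Experiment, rng: random.Random) -> List[str]:
    """Choose a bounded subset of hosts so the impact stays contained."""
    count = max(1, int(len(exp.targets) * exp.blast_radius))
    return rng.sample(exp.targets, count)


def run_experiment(
    exp: Experiment,
    inject: Callable[[str], None],    # assumed hook: start the failure on a host
    revert: Callable[[str], None],    # assumed hook: undo the failure on a host
    error_rate: Callable[[], float],  # assumed hook: current customer error rate
    steps: int,
    rng: random.Random,
) -> str:
    """Inject a failure on a few hosts, watch a metric, and always clean up."""
    chosen = pick_targets(exp, rng)
    for host in chosen:
        inject(host)
    try:
        for _ in range(steps):
            # Abort plan: stop early when customers start to feel the impact.
            if error_rate() > exp.abort_error_rate:
                return "aborted"
        return "completed"
    finally:
        # The failure is reverted whether the run completed or was aborted.
        for host in chosen:
            revert(host)
```

In practice the callbacks would wrap whatever failure injection tool and monitoring system are already in place; the point of the sketch is only that scope and abort conditions are decided before anything is broken.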

Myth #3: CE is only useful in Production
Because… Netflix says so? Staging is not Production? ¯\_(ツ)_/¯
Start CE Practice in Staging
Lots to learn from earlier environments. Try making staging more production-like. Catch issues earlier. Duh!

Myth #4: CE takes too much time & effort
Takes a village to get it done. Difficult to coordinate across teams. Takes weeks to plan and execute.
CE can be practiced with little time and effort
Never too early, only too late. Invest the time and effort now. Don’t start with Region/DC Failover. Start with a simple experiment...

“Cloud Native” App
12-Factor app
✔ Confirm Statelessness
✔ Confirm Automated Startup
✔ Confirm (N + 1) Serving Traffic
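Before shutting down a host to confirm the (N + 1) property, it can help to check on paper that the remaining hosts should carry peak traffic. The sketch below is a rough capacity calculation, assuming every host handles about the same request rate; the function names and the example numbers are illustrative and not taken from the talk.

```python
import math


def hosts_needed(peak_rps: float, rps_per_host: float) -> int:
    """Smallest number of hosts (N) that can serve peak traffic."""
    return math.ceil(peak_rps / rps_per_host)


def survives_single_host_loss(host_count: int, peak_rps: float, rps_per_host: float) -> bool:
    """True when the fleet still has at least N hosts after losing one (N + 1)."""
    return host_count - 1 >= hosts_needed(peak_rps, rps_per_host)


# Example with made-up numbers: 900 requests/s at peak, 300 requests/s per host.
print(hosts_needed(900, 300))                  # N
print(survives_single_host_loss(4, 900, 300))  # four hosts leave one spare
print(survives_single_host_loss(3, 900, 300))  # three hosts leave no spare
```

The shutdown experiment then confirms whether the real system behaves the way this arithmetic predicts.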

3-Tier Web Service
Back End Servers → Data Stores
✔ Confirm Retries and Timeout
✔ Confirm User Experience
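To make "Confirm Retries and Timeout" concrete, here is a minimal sketch of a retry wrapper together with a fake data store call that fails a set number of times, which stands in for an injected faulty connection. `DataStoreTimeout`, `call_with_retries`, and `flaky_operation` are hypothetical names for this illustration; a real service would use its own client library and that library's timeout settings.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class DataStoreTimeout(Exception):
    """Hypothetical error raised when a data store call times out."""


def call_with_retries(
    operation: Callable[[], T],
    attempts: int,
    backoff_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on timeout with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except DataStoreTimeout:
            if attempt == attempts - 1:
                raise  # retries exhausted: surface the failure to the caller
            sleep(backoff_seconds * (2 ** attempt))
    raise AssertionError("unreachable")


def flaky_operation(failures_before_success: int, result: T) -> Callable[[], T]:
    """Simulate an injected fault: time out N times, then succeed."""
    calls = {"count": 0}

    def operation() -> T:
        calls["count"] += 1
        if calls["count"] <= failures_before_success:
            raise DataStoreTimeout("injected fault")
        return result

    return operation
```

Passing a recording function as `sleep` lets a test confirm the backoff schedule without waiting, and raising the failure count above `attempts` shows what the user would experience once retries run out.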

High Traffic Event
Peak Season Prep, Launch Readiness
✔ Confirm Horizontal Scale Out
✔ Confirm Horizontal Scale In

Monitoring & Alerting
Tuning thresholds for Signal vs Noise
✔ Ensure Alerts are Fired
✔ Improve Signal to Noise Ratio
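One way to reason about signal versus noise is to replay metric samples from a quiet period and from an injected failure against a candidate alert rule. The sketch below assumes a simple rule of the form "fire when the metric stays above a threshold for several consecutive samples"; the rule shape, the function name, and the sample values are illustrative assumptions rather than any particular monitoring product's behavior.

```python
from typing import Sequence


def alert_fires(samples: Sequence[float], threshold: float, consecutive: int) -> bool:
    """Return True when `consecutive` samples in a row exceed `threshold`."""
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= consecutive:
            return True
    return False


# Made-up error rates: a quiet period with one blip, and an injected failure.
steady_state = [0.01, 0.09, 0.02, 0.01]
during_failure = [0.02, 0.12, 0.15, 0.20]

# Requiring a single sample pages on the blip (noise).
noisy_rule = alert_fires(steady_state, threshold=0.05, consecutive=1)
# Requiring two samples in a row ignores the blip but still catches the failure.
quiet_when_healthy = not alert_fires(steady_state, threshold=0.05, consecutive=2)
fires_on_failure = alert_fires(during_failure, threshold=0.05, consecutive=2)
```

A chaos experiment supplies the `during_failure` samples for real, which is what makes it possible to confirm that the alert fires on an actual signal.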

Past Incidents
Recreate scenarios
✔ Validate Fix
✔ Update Runbook
✔ Train New Hire

Chaos Engineering Myths and Truths

Myth: Chaos Engineering must be Chaotic. Truth: Controlled.
Reason: Random is not a requirement. CE can be practiced with control and precision.
Advice: Start controlled, with precision. Slowly add entropy.

Myth: Chaos Engineering is Dangerous. Truth: Safe.
Reason: There are risks, just like anything else that has risk. Risks can be mitigated.
Advice: Thoughtfulness is key. Properly scope to contain Blast Radius. Have an Abort Plan in mind.

Myth: Chaos Engineering is only useful in Production. Truth: useful in (Staging and) Production.
Reason: Not mutually exclusive. Why not practice Chaos Engineering in multiple environments?
Advice: Don’t start in Production. Start in Staging. Catch issues early. Move to production with confidence.

Myth: Chaos Engineering takes too much Time & Effort. Truth: little Time & Effort.
Reason: Coordinating a large-scale exercise across teams can be daunting. Don’t start with it.
Advice: Start with small wins. Start simple. Shut down a host. Delay a specific connection.

CE Getting Started Cheatsheet

If you have: “Cloud native” app
Properties: 12-factor, stateless
Experiment: Shutdown Host/Container

If you have: 3-tier web application
Properties: presentation, app-tier, data-tier
Experiment: Faulty connection to the data store.

If you have: Peak Season or Launch Event
Properties: auto-scaling, scale out, scale in
Experiment: System under high traffic/load. Resource contention/starvation.

If you have: Monitoring & Alerting
Properties: metrics, logging, tracing, alerting thresholds
Experiment: Ensure alerts are fired upon signals. Ensure that engineers can find answers to operational questions.

If you have: Past Incidents
Properties: incident root cause analysis
Experiment: Recreate scenarios to validate fixes

HUMAN FACTORS
OSI Model
“Layer 8” HUMAN
Layer 7 APPLICATION
Layer 3 NETWORK
Layer 1 PHYSICAL

Embrace Change, Complexity, and Failure
To be safe is not to avoid the issue. Can’t avoid it. Embrace it.

Socialize the Practice
Share wins. Make people comfortable. Debunk myths.

Show Kindness
Appreciate those on the front line battling downtime in technical operations.

Give yourself some love
Do what it takes, but have your own personal time too. Personal time is critically important.

(We are moving from reactive to proactive)
Let’s bring sexy back to “Work-Life Balance” (with Chaos Engineering)

Reliably Yours @horeal [email protected]

HML
