[PD Summit 2019] Best Practices to Kickstart your Chaos Engineering Journey

Ho Ming Li
September 25, 2019

Debunk Myths:
Chaos is in the name, but it can be controlled.
It doesn’t have to be dangerous; it can be thoughtful and safe.
It’s useful not only in production but also in staging and other earlier environments.
It doesn’t have to be a giant, cross-team, company-wide exercise.
You can start small.
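
The slides in the transcript recommend tightly scoping the blast radius and having an abort plan. A minimal sketch of what that could look like follows. All names here (`pick_targets`, `run_experiment`, `inject`, `rollback`, `error_rate`, `abort_threshold`) are hypothetical and not from the talk or from any particular tool; the "hosts" are plain strings so the sketch runs without any infrastructure.

```python
import random


def pick_targets(hosts, fraction, seed=0):
    """Select a small, reproducible subset of hosts (the blast radius)."""
    count = max(1, int(len(hosts) * fraction))
    return random.Random(seed).sample(hosts, count)


def run_experiment(targets, inject, rollback, error_rate, abort_threshold):
    """Inject a fault one target at a time; abort and roll back if errors climb.

    inject/rollback are caller-supplied callables (assumptions, not a real API).
    error_rate is a callable returning the currently observed error rate.
    """
    touched = []
    status = "completed"
    for host in targets:
        inject(host)
        touched.append(host)
        if error_rate() > abort_threshold:
            status = "aborted"  # abort plan: stop injecting immediately
            break
    # Always restore every host that was touched, newest first.
    for host in reversed(touched):
        rollback(host)
    return {"status": status, "touched": touched}


# Usage with an in-memory stand-in for real hosts.
hosts = [f"host-{i}" for i in range(20)]
down = set()
result = run_experiment(
    targets=pick_targets(hosts, fraction=0.1),
    inject=down.add,
    rollback=down.discard,
    error_rate=lambda: len(down) / len(hosts),
    abort_threshold=0.5,
)
print(result["status"], result["touched"])
```

Injecting one host at a time and checking the abort condition after each step keeps the experiment controlled rather than random, which is the point of the first two myths.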

Experiments:
Shut down your stateless hosts. See that they come back up healthy.
What happens when you drop the connection to your data store? Let’s verify the retry and timeout behaviors.
Auto-scaling may sound easy, but it has nuances that you only learn by experiencing them.
Ensure alerts are triggered appropriately, and that the receiver has sufficient information to work with. Take the signal-to-noise ratio into consideration.
Past incidents, whether within your team or with a third-party dependency, are good lessons to share with your team. Re-run those scenarios.
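
The data-store experiment above can be rehearsed in a unit test before touching any real environment. The sketch below is only an illustration under stated assumptions: `FlakyStore` is a hypothetical stand-in that drops its first few calls, and `get_with_retry` is one possible retry policy (exponential backoff with an overall deadline), not the behavior of any specific client library. A per-call socket timeout is not modeled here.

```python
import time


class FlakyStore:
    """Hypothetical data store that drops the first `failures` calls."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def get(self, key):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("injected: connection dropped")
        return f"value-for-{key}"


def get_with_retry(store, key, attempts=3, base_delay=0.01,
                   deadline=1.0, sleep=time.sleep):
    """Retry with exponential backoff; give up at the attempt or time limit."""
    start = time.monotonic()
    for attempt in range(attempts):
        try:
            return store.get(key)
        except ConnectionError:
            delay = base_delay * (2 ** attempt)
            out_of_attempts = attempt == attempts - 1
            out_of_time = time.monotonic() - start + delay > deadline
            if out_of_attempts or out_of_time:
                raise  # surface the failure so the caller can degrade gracefully
            sleep(delay)


# Usage: two injected drops are absorbed by three attempts.
store = FlakyStore(failures=2)
print(get_with_retry(store, "cart"), "after", store.calls, "calls")
```

Passing `sleep` as a parameter lets a test replace the real delay with a no-op, so the retry path can be checked quickly and repeatedly.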


Transcript

  1. 2.

    Modern Day Failures
    Retail outages online leave shoppers frustrated on Black Friday (11.23.18)
    Customers report difficulty accessing Chase Bank mobile and online (2.16.19)
    Major US Airlines hit by delays after glitch at vendor (4.1.19)
  2. 3.

    The Golden Circle (Start with Why)
    Why: Resilience. Increase Confidence.
    How: Chaos Engineering, Failure Injection. Break Things on Purpose.
    What: Tool. Gremlin: Tooling Platform and Experts.
  3. 7.

    Myth #1: Chaos Engineering must be Chaotic
    Random Failures. Random Breakage.
    Chaos Engineering should be Controlled
    Controlled Failures. Targeted Breakage.
  4. 9.

    Myth #2: Chaos Engineering is Dangerous
    What about the customers? What about the co-tenants?
    Chaos Engineering should be Safe
    Thoughtfulness on Risk Mitigation. Tightly Scope to contain Blast Radius. Have an Abort Plan.
  5. 10.
  6. 11.

    Myth #3: CE is only useful in Production
    Because… ¯\_(ツ)_/¯
    Start CE Practice in Staging
    Lots to learn from earlier environments. Try making staging more production-like. Catch issues earlier. Duh!
  7. 12.

    Myth #4: CE takes too much time & effort
    Takes a village to get it done. Difficult to coordinate across teams. Takes weeks to plan and execute.
  8. 13.

    Myth #4: CE takes too much time & effort
    Takes a village to get it done. Difficult to coordinate across teams. Takes weeks to plan and execute.
    CE can be practiced with little time and effort
    Never too early, only too late. Invest the time and effort now. Don’t start with Region/DC Failover. Start with a simple experiment...
  9. 14.

    “Cloud Native” App
    12-Factor app
    ✔ Confirm Statelessness
    ✔ Confirm Automated Startup
    ✔ Confirm (N + 1) Serving Traffic
  10. 15.

    3-Tier Web Service
    Back End Servers → Data Stores
    ✔ Confirm Retries and Timeout
    ✔ Confirm User Experience
  11. 16.

    High Traffic Event
    Peak Season Prep. Launch Readiness.
    ✔ Confirm Horizontal Scale Out
    ✔ Confirm Horizontal Scale In
  12. 17.

    Monitoring & Alerting
    Tuning thresholds for Signal vs Noise
    ✔ Ensure Alerts are Fired
    ✔ Improve Signal to Noise Ratio
  13. 19.

    Chaos Engineering Myths / Truths

    Chaos Engineering must be Chaotic → Controlled
      Reason: Random is not a requirement. CE can be practiced with control and precision.
      Advice: Start controlled, with precision. Slowly add entropy.

    Chaos Engineering is Dangerous → Safe
      Reason: There are risks. Just like anything else that has risk. Risks can be mitigated.
      Advice: Thoughtfulness is Key. Properly scope to contain Blast Radius. Have an Abort Plan in mind.

    Chaos Engineering is useful in (Staging and) Production
      Reason: Not mutually exclusive. Why not practice Chaos Engineering in multiple environments?
      Advice: Don’t start in Production. Start in Staging. Catch issues early. Move to production with confidence.

    Chaos Engineering takes too much → little Time & Effort
      Reason: Coordinating a large-scale exercise across teams can be daunting. Don’t start with it.
      Advice: Start with small wins. Start simple. Shut down a host. Delay a specific connection.
  14. 20.

    CE Getting Started Cheatsheet

    If you have a “Cloud native” app
      Properties: 12-factor, stateless
      Experiment: Shutdown Host/Container

    If you have a 3-tier web application
      Properties: presentation, app-tier, data-tier
      Experiment: Faulty connection to the data store.

    If you have a Peak Season or Launch Event
      Properties: auto-scaling, scale out, scale in
      Experiment: System under high traffic/load. Resource contention/starvation.

    If you have Monitoring & Alerting
      Properties: metrics, logging, tracing, alerting thresholds
      Experiment: Ensure alerts are fired upon signals. Ensure that engineers can find answers to operational questions.

    If you have Past Incidents
      Properties: incident root cause analysis
      Experiment: Recreate scenarios to validate fixes
  15. 21.
  16. 22.

    Embrace Change, Complexity, and Failure
    To be safe is not to avoid the issue. Can’t avoid it. Embrace it.
  17. 25.

    Give yourself some love
    Do what it takes, but have your own personal time too. Personal time is critically important.
  18. 26.

    Let’s bring sexy back to “Work-Life Balance” (with Chaos Engineering)
    (We are moving from reactive to proactive)