Chaos Engineering the 5 W's All Things K8S Meetup DC

During this talk, Gremlin's Jacob Plicque (Chaos Engineer & Resilience Coach, former Senior SRE @ Fanatics) will answer the 5 W's (Who, What, When, Where & Why) of Chaos Engineering.

You’ll Learn:
* The systematic way to begin Chaos Engineering
* The value of running chaos experiments to build more reliable systems and confidence in your remediation processes.
* How other companies are using Chaos Engineering—and the positive results they’ve seen creating reliable distributed systems with CE

Chaos Engineering is NOT:
* Applying failure modes randomly
* Applying failures to your entire infrastructure straight away
* Applying failure on systems without communication
* Creating a one-off fix to be run once and then abandoned

Chaos Engineering IS:
* Applying failures carefully, and with an explicit hypothesis
* Starting small and growing the blast radius
* Communicating plans clearly with all stakeholders
* Designing a well-defined practice that requires constant attention

Kubernetes:
* How to improve the availability and reliability of Kubernetes clusters using the discipline of Chaos Engineering
* How to use Chaos Engineering to safely inject failure into your applications and nodes in order to detect weaknesses
* Specific Chaos Experiments for you to run on Kubernetes to ensure you’ve designed a reliable system

Jacob Plicque

May 27, 2020

Transcript

  1. Chaos Engineering: The 5 W’s. May 27th, 2020, All Things Kubernetes Meetup. Jacob Plicque, Sr. Solutions Architect, Gremlin. [email protected] @DuvalKingJabub
  2. Chaos Engineering is NOT: • randomly applying failure modes • applying failures to your entire infrastructure straight away • applying failure on systems without communication • a one-off fix to be run once and then abandoned
  3. Chaos Engineering IS: • carefully applying failures with an explicit hypothesis • starting small and growing the blast radius • clearly communicating plans with all stakeholders • a well-defined practice that requires constant attention
  4. By 2023, 40% of organizations will implement chaos engineering practices as part of DevOps initiatives, reducing unplanned downtime by 20%. Chaos Engineering completes DevOps.
  5. “Our last assessment we performed was with a team that had been struggling with an issue that had caused incidents in production a number of times over the last few months. They had trouble reproducing it. They hadn’t been able to reproduce it reliably to be able to determine what was happening and get a good fix out. We just walked in carrying this new tool and were able to reproduce it.” - Matt Simons, Product Dev Manager, Workiva (Break Things On Purpose Podcast). @DuvalKingJabub gremlin.com/podcast twitter.com/btoppod
  7. Measuring the Cost of Downtime: Cost = R + E + C + (B + A). During the outage: R = Revenue Lost, E = Employee Productivity. After the outage: C = Customer Chargebacks (SLA breaches). Unquantifiable: B = Brand Defamation, A = Employee Attrition. Amazon is estimated to lose $220,000/min; the average e-commerce site loses $6,800/min. @DuvalKingJabub
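To make the cost formula concrete, here is a minimal worked sketch in Python. Only the $6,800/min e-commerce average quoted on the slide is from the talk; every other figure is a made-up placeholder.

    # Hypothetical worked example of Cost = R + E + C + (B + A).
    # Only the $6,800/min figure comes from the slide; everything else is a placeholder.
    revenue_lost = 6_800 * 30            # R: $6,800/min for an assumed 30-minute outage
    employee_productivity = 50 * 200     # E: assumed 50 employees idled at ~$200 each
    customer_chargebacks = 25_000        # C: assumed SLA-breach credits
    brand_defamation = 0                 # B: unquantifiable per the slide, left at zero
    employee_attrition = 0               # A: unquantifiable per the slide, left at zero

    cost = (revenue_lost + employee_productivity + customer_chargebacks
            + (brand_defamation + employee_attrition))
    print(f"Estimated cost of the outage: ${cost:,}")  # -> $239,000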
  8. Excuses: • “We don’t have time.” • “We don’t need to break things — they break on their own.” • “We don’t know how to get started.”
  9. Reactive vs. Proactive. Tim Armandpour, Vice President of Engineering, PagerDuty: “Operational Maturity means being part of a test-driven environment, where high-severity incidents … are very uncommon, and measured.” Sean Jacobs, Infrastructure & Datacenter Operations Lead, Splunk: “Operational Maturity ... is often measured by the effectiveness of our response during a crisis.” Joey Parsons, Head of Platform & Operations, Flipboard: “Operationally mature… is understanding the ramifications of incidents.”
  10. Table Stakes. Program Requirements: (01) Does your company measure downtime? (02) Can you quantify damage to the business? (03) Does someone own that number? Technical Requirements: (01) Logging, (02) Monitoring, (03) Alarming.
  11. Chaos Engineering: 01 Infrastructure Failures (01.01 Local Failures, 01.02 External Failures), 02 Application Failures, 03 Continuous Chaos.
  12. While the ultimate goal is to test in all of production, testing at all stages of development catches failures before they can affect customers.
  13. “Those unwilling to test in production aren't yet confident that the service will continue operating through failures. And, without production testing, recovery won't work when called upon.” - James Hamilton, Distinguished Engineer, AWS
  14. Black Friday Failures: “Technical Issues Likely Cost Retailers Billions” (12.01.16); “Macy’s, Lowe’s hit by Black Friday technical glitches” (11.27.17); “Retail outages online leave shoppers frustrated on Black Friday” (11.23.18, People.com).
  15. Breaking Banks: “Wells Fargo accidentally foreclosed hundreds of homeowners” (8.7.18); “Customers report difficulty accessing Chase Bank mobile and online” (2.16.19); “Citibank Website down, not working” (2.28.19, Investopedia).
  16. Airline Incidents: “Computer Problems Blamed For Flight Delays” (4.1.19); “Major US Airlines hit by delays after glitch at vendor” (4.1.19); “Pilots of doomed Boeing 737 MAX fought the plane’s software and lost” (4.4.19).
  17. Experiments: Verify monitoring with Chaos Engineering to avoid missed alerts or prolonged outages. • CPU spike on your service to simulate runaway processes • Service unreachable from the API server • Slow response from your database • An outage of a cloud notification service • Memory constraints that force your monitoring agent to stop
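A minimal sketch of the first experiment in the list above (a CPU spike to simulate runaway processes): the snippet pins a couple of worker processes in a busy loop for a bounded window so you can check whether monitoring detects and alerts on the load. The worker count and duration are arbitrary assumptions; a real run should keep the small blast radius and explicit hypothesis described earlier in the deck.

    # CPU-spike experiment sketch: hold WORKERS cores at ~100% for DURATION_SECONDS,
    # then stop. Hypothesis: monitoring detects the spike and an alert fires.
    import multiprocessing
    import time

    DURATION_SECONDS = 60   # assumed blast radius in time
    WORKERS = 2             # assumed blast radius in cores

    def burn(stop_at: float) -> None:
        while time.time() < stop_at:
            pass  # busy loop keeps one core saturated

    if __name__ == "__main__":
        stop_at = time.time() + DURATION_SECONDS
        procs = [multiprocessing.Process(target=burn, args=(stop_at,)) for _ in range(WORKERS)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print("CPU spike finished; confirm the alert fired and dashboards showed the load.")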
  18. Unspecified Security Company: • Due to a strict separation of duties, developers don’t have direct access to infrastructure; Gremlin allows them to run tests on shared infrastructure. • Introduced latency to ensure their dashboards were functioning properly; they were unable to determine the affected hosts prior to experimentation.
  19. Experiments: Prepare for dependency failure and reduce the time to resolve issues. • Database connection loss • DNS resolver connectivity issues • Load balancer failure • Non-critical service lost • SaaS API latency
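One low-risk way to stage the "SaaS API latency" experiment from the list above is to wrap the dependency client and inject a delay, so your service sees slow responses without degrading the real third party. The decorator below is a hedged sketch; the function name and delay bounds are hypothetical and not part of any real API.

    # Hypothetical latency injection around a dependency call, for testing
    # timeouts, retries, and fallbacks when a SaaS API slows down.
    import random
    import time
    from functools import wraps

    def with_injected_latency(min_s=0.5, max_s=2.0, enabled=True):
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                if enabled:
                    time.sleep(random.uniform(min_s, max_s))  # simulated SaaS slowness
                return func(*args, **kwargs)
            return wrapper
        return decorator

    @with_injected_latency(min_s=1.0, max_s=3.0)
    def call_payment_api(order_id):
        # Placeholder for the real third-party call; illustrative only.
        return {"order_id": order_id, "status": "ok"}

    print(call_payment_api(42))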
  20. Tested that backend requests to on-prem databases meet business requirements during an ERP migration. Introduced latency to their rootdb and rabbitmq instances, resulting in queued messages to their picking robots. • Blackholed a service NOT in their critical path, which resulted in all pages serving a 503 error page and ultimately rendered their entire app unusable.
  21. Experiments: Hone your incident response plans. • Recreate a past incident to compare your team’s recovery time • Lose connection to a single service, datacenter, and region • Run through a playbook with simulated scenarios • Add latency between your database replicas
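For the "lose connection to a single service" drill above, one common low-tech approach is to blackhole the dependency's address for a fixed window and time how long detection and recovery take. The sketch below assumes a Linux host with iptables, root privileges, and a hypothetical target IP; it is not the method used in the talk.

    # Time-boxed blackhole of one dependency via iptables (Linux, run as root).
    # TARGET_IP is a hypothetical address for the service being dropped.
    import subprocess
    import time

    TARGET_IP = "10.0.0.25"    # hypothetical dependency address
    DURATION_SECONDS = 300     # keep the drill time-boxed

    started = time.time()
    try:
        subprocess.run(["iptables", "-A", "OUTPUT", "-d", TARGET_IP, "-j", "DROP"], check=True)
        print(f"Blackholing {TARGET_IP}; run the incident playbook now.")
        time.sleep(DURATION_SECONDS)
    finally:
        # Always roll back, even if the drill is aborted early.
        subprocess.run(["iptables", "-D", "OUTPUT", "-d", TARGET_IP, "-j", "DROP"], check=False)
        print(f"Rule removed after {time.time() - started:.0f}s; record time-to-detect and time-to-recover.")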
  22. Backcountry: Following a 72-hour SLO breach on Black Friday in 2017, Backcountry introduced latency to their rootdb and rabbitmq instances, verifying the fix of an issue in message queuing to their picking robots.
  23. Experiments: Replicate the most common Kubernetes failures to ensure correct configurations and prepare your teams. • Push CPU and memory resource limits • Simulate slow or lost network connectivity between nodes • Service unable to reach DNS • Node, pod, or region loss • Out-of-memory conflicts
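As a minimal sketch of the "pod loss" experiment above, the snippet below deletes one randomly chosen pod and relies on its controller to replace it. It assumes the official kubernetes Python client, a kubeconfig with access to the cluster, and a hypothetical namespace and label selector; the hypothesis is that replicas recover with no user-visible errors.

    # Pod-loss experiment sketch: delete one pod matching a label selector and
    # watch the Deployment/ReplicaSet bring a replacement up.
    import random
    from kubernetes import client, config

    NAMESPACE = "checkout"             # hypothetical namespace
    LABEL_SELECTOR = "app=frontend"    # hypothetical label selector

    config.load_kube_config()          # or config.load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()

    pods = v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR).items
    if not pods:
        raise SystemExit("No matching pods; nothing to experiment on.")

    victim = random.choice(pods)
    print(f"Deleting pod {victim.metadata.name}; verify a replacement starts and no 5xx spike appears.")
    v1.delete_namespaced_pod(victim.metadata.name, NAMESPACE)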
  24. Unspecified Financial Company: Tested that backend requests to on-prem databases meet business requirements during an ERP migration. Introduced latency to their rootdb and rabbitmq instances, resulting in queued messages to their picking robots. • Consumed all CPU cores on particular instances to determine whether their dashboards would detect the load and their orchestrators would replace the unhealthy instance. They did not.