Chaos Engineering the 5 W's All Things K8S Meetup DC

During this talk, Gremlin's Jacob Plicque (Chaos Engineer & Resilience Coach, former Senior SRE @ Fanatics) will answer the 5 W's (Who, What, When, Where & Why) of Chaos Engineering.

You’ll Learn:
* The systematic way to begin Chaos Engineering
* The value of running chaos experiments to build more reliable systems and confidence in your remediation processes.
* How other companies are using Chaos Engineering—and the positive results they’ve seen creating reliable distributed systems with CE

Chaos Engineering is NOT:
* Applying failure modes randomly
* Applying failures to your entire infrastructure straight away
* Applying failure on systems without communication
* Creating a one-off fix to be run once and then abandoned

Chaos Engineering IS:
* Applying failures carefully, and with an explicit hypothesis
* Starting small and growing the blast radius
* Communicating plans clearly with all stakeholders
* Designing a well-defined practice that requires constant attention

Kubernetes:
* How to improve the availability and reliability of Kubernetes clusters using the discipline of Chaos Engineering
* How to use Chaos Engineering to safely inject failure into your applications and nodes in order to detect weaknesses
* Specific Chaos Experiments for you to run on Kubernetes to ensure you’ve designed a reliable system

Jacob Plicque

May 27, 2020

Transcript

  1. Chaos Engineering: The 5 W’s. May 27th, 2020, All Things Kubernetes Meetup. Jacob Plicque, Sr. Solutions Architect, Gremlin. [email protected] @DuvalKingJabub
  2. Chaos Engineering is NOT: • randomly applying failure modes • applying failures to your entire infrastructure straight away • applying failure on systems without communication • a one-off fix to be run once and then abandoned
  3. Chaos Engineering IS: • carefully applying failures with an explicit hypothesis • starting small and growing the blast radius • clearly communicating plans with all stakeholders • a well-defined practice that requires constant attention
  4. By 2023, 40% of organizations will implement chaos engineering practices as part of DevOps initiatives, reducing unplanned downtime by 20%. Chaos Engineering completes DevOps.
  5. “Our last assessment we performed was with a team that had been struggling with an issue that had caused incidents in production a number of times over the last few months. They had trouble reproducing it. They hadn’t been able to reproduce it reliably to be able to determine what was happening and get a good fix out. We just walked in carrying this new tool and were able to reproduce it.” - Matt Simons, Product Dev Manager, Workiva (Break Things On Purpose Podcast). @DuvalKingJabub gremlin.com/podcast twitter.com/btoppod
  7. Measuring the Cost of Downtime: Cost = R + E + C + (B + A). During the outage: R = Revenue Lost, E = Employee Productivity. After the outage: C = Customer Chargebacks (SLA breaches). Unquantifiable: B = Brand Defamation, A = Employee Attrition. Amazon is estimated to lose $220,000/min; the average e-commerce site loses $6,800/min. @DuvalKingJabub
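To make the cost formula concrete, here is a minimal worked sketch in Python. Only the $6,800/min e-commerce average quoted on the slide is from the talk; every other figure is a made-up placeholder.

    # Hypothetical worked example of Cost = R + E + C + (B + A).
    # Only the $6,800/min figure comes from the slide; everything else is a placeholder.
    revenue_lost = 6_800 * 30            # R: $6,800/min for an assumed 30-minute outage
    employee_productivity = 50 * 200     # E: assumed 50 employees idled at ~$200 each
    customer_chargebacks = 25_000        # C: assumed SLA-breach credits
    brand_defamation = 0                 # B: unquantifiable per the slide, left at zero
    employee_attrition = 0               # A: unquantifiable per the slide, left at zero

    cost = (revenue_lost + employee_productivity + customer_chargebacks
            + (brand_defamation + employee_attrition))
    print(f"Estimated cost of the outage: ${cost:,}")  # -> $239,000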
  8. Excuses: • “We don’t have time.” • “We don’t need to break things — they break on their own.” • “We don’t know how to get started.”
  9. Reactive vs. Proactive. Tim Armandpour, Vice President of Engineering, PagerDuty: “Operational Maturity means being part of a test-driven environment, where high-severity incidents … are very uncommon, and measured.” Sean Jacobs, Infrastructure & Datacenter Operations Lead, Splunk: “Operational Maturity ... is often measured by the effectiveness of our response during a crisis.” Joey Parsons, Head of Platform & Operations, Flipboard: “Operationally mature… is understanding the ramifications of incidents.”
  10. Table Stakes. Program Requirements: (01) Does your company measure downtime? (02) Can you quantify damage to the business? (03) Does someone own that number? Technical Requirements: (01) Logging, (02) Monitoring, (03) Alarming.
  11. Chaos Engineering: 01 Infrastructure Failures (01.01 Local Failures, 01.02 External Failures), 02 Application Failures, 03 Continuous Chaos.
  12. While the ultimate goal is to test in all of production, testing at all stages of development catches failures before they can affect customers.
  13. “Those unwilling to test in production aren't yet confident that the service will continue operating through failures. And, without production testing, recovery won't work when called upon.” - James Hamilton, Distinguished Engineer, AWS
  14. Black Friday Failures: “Technical Issues Likely Cost Retailers Billions” (12.01.16); “Macy’s, Lowe’s hit by Black Friday technical glitches” (11.27.17); “Retail outages online leave shoppers frustrated on Black Friday” (11.23.18, People.com).
  15. Breaking Banks: “Wells Fargo accidentally foreclosed hundreds of homeowners” (8.7.18); “Customers report difficulty accessing Chase Bank mobile and online” (2.16.19); “Citibank Website down, not working” (2.28.19, Investopedia).
  16. Airline Incidents: “Computer Problems Blamed For Flight Delays” (4.1.19); “Major US Airlines hit by delays after glitch at vendor” (4.1.19); “Pilots of doomed Boeing 737 MAX fought the plane’s software and lost” (4.4.19).
  17. Experiments: Verify monitoring with Chaos Engineering to avoid missed alerts or prolonged outages. • CPU spike on your service to simulate runaway processes • Service unreachable from the API server • Slow response from your database • An outage of a cloud notification service • Memory constraints that force your monitoring agent to stop
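A minimal sketch of the first experiment in the list above (a CPU spike to simulate runaway processes): the snippet pins a couple of worker processes in a busy loop for a bounded window so you can check whether monitoring detects and alerts on the load. The worker count and duration are arbitrary assumptions; a real run should keep the small blast radius and explicit hypothesis described earlier in the deck.

    # CPU-spike experiment sketch: hold WORKERS cores at ~100% for DURATION_SECONDS,
    # then stop. Hypothesis: monitoring detects the spike and an alert fires.
    import multiprocessing
    import time

    DURATION_SECONDS = 60   # assumed blast radius in time
    WORKERS = 2             # assumed blast radius in cores

    def burn(stop_at: float) -> None:
        while time.time() < stop_at:
            pass  # busy loop keeps one core saturated

    if __name__ == "__main__":
        stop_at = time.time() + DURATION_SECONDS
        procs = [multiprocessing.Process(target=burn, args=(stop_at,)) for _ in range(WORKERS)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print("CPU spike finished; confirm the alert fired and dashboards showed the load.")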
  18. Unspecified Security Company: • Due to a strict separation of duties, developers don’t have direct access to infrastructure; Gremlin allows them to run tests on shared infrastructure. • Introduced latency to ensure their dashboards were functioning properly; they were unable to determine the affected hosts prior to experimentation.
  19. Experiments: Prepare for dependency failure and reduce the time to resolve issues. • Database connection loss • DNS resolver connectivity issues • Load balancer failure • Non-critical service lost • SaaS API latency
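One low-risk way to stage the "SaaS API latency" experiment from the list above is to wrap the dependency client and inject a delay, so your service sees slow responses without degrading the real third party. The decorator below is a hedged sketch; the function name and delay bounds are hypothetical and not part of any real API.

    # Hypothetical latency injection around a dependency call, for testing
    # timeouts, retries, and fallbacks when a SaaS API slows down.
    import random
    import time
    from functools import wraps

    def with_injected_latency(min_s=0.5, max_s=2.0, enabled=True):
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                if enabled:
                    time.sleep(random.uniform(min_s, max_s))  # simulated SaaS slowness
                return func(*args, **kwargs)
            return wrapper
        return decorator

    @with_injected_latency(min_s=1.0, max_s=3.0)
    def call_payment_api(order_id):
        # Placeholder for the real third-party call; illustrative only.
        return {"order_id": order_id, "status": "ok"}

    print(call_payment_api(42))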
  20. Tested that backend requests to on-prem databases meet business requirements during an ERP migration. Introduced latency to their rootdb and rabbitmq instances, resulting in queued messages to their picking robots. • Blackholed a service NOT in their critical path, which resulted in all pages serving a 503 error page and ultimately rendered their entire app unusable.
  21. Experiments: Hone your incident response plans. • Recreate a past incident to compare your team’s recovery time • Lose connection to a single service, datacenter, and region • Run through a playbook with simulated scenarios • Add latency between your database replicas
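For the "lose connection to a single service" drill above, one common low-tech approach is to blackhole the dependency's address for a fixed window and time how long detection and recovery take. The sketch below assumes a Linux host with iptables, root privileges, and a hypothetical target IP; it is not the method used in the talk.

    # Time-boxed blackhole of one dependency via iptables (Linux, run as root).
    # TARGET_IP is a hypothetical address for the service being dropped.
    import subprocess
    import time

    TARGET_IP = "10.0.0.25"    # hypothetical dependency address
    DURATION_SECONDS = 300     # keep the drill time-boxed

    started = time.time()
    try:
        subprocess.run(["iptables", "-A", "OUTPUT", "-d", TARGET_IP, "-j", "DROP"], check=True)
        print(f"Blackholing {TARGET_IP}; run the incident playbook now.")
        time.sleep(DURATION_SECONDS)
    finally:
        # Always roll back, even if the drill is aborted early.
        subprocess.run(["iptables", "-D", "OUTPUT", "-d", TARGET_IP, "-j", "DROP"], check=False)
        print(f"Rule removed after {time.time() - started:.0f}s; record time-to-detect and time-to-recover.")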
  22. Backcountry: Following a 72-hour SLO breach on Black Friday in 2017, Backcountry introduced latency to their rootdb and rabbitmq instances, verifying the fix of an issue in message queuing to their picking robots.
  23. Experiments: Replicate the most common Kubernetes failures to ensure correct configurations and prepare your teams. • Push CPU and memory resource limits • Simulate slow or lost network connectivity between nodes • Service unable to reach DNS • Node, pod, or region loss • Out-of-memory conflicts
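As a minimal sketch of the "pod loss" experiment above, the snippet below deletes one randomly chosen pod and relies on its controller to replace it. It assumes the official kubernetes Python client, a kubeconfig with access to the cluster, and a hypothetical namespace and label selector; the hypothesis is that replicas recover with no user-visible errors.

    # Pod-loss experiment sketch: delete one pod matching a label selector and
    # watch the Deployment/ReplicaSet bring a replacement up.
    import random
    from kubernetes import client, config

    NAMESPACE = "checkout"             # hypothetical namespace
    LABEL_SELECTOR = "app=frontend"    # hypothetical label selector

    config.load_kube_config()          # or config.load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()

    pods = v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR).items
    if not pods:
        raise SystemExit("No matching pods; nothing to experiment on.")

    victim = random.choice(pods)
    print(f"Deleting pod {victim.metadata.name}; verify a replacement starts and no 5xx spike appears.")
    v1.delete_namespaced_pod(victim.metadata.name, NAMESPACE)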
  24. Unspecified Financial Company: Tested that backend requests to on-prem databases meet business requirements during an ERP migration. Introduced latency to their rootdb and rabbitmq instances, resulting in queued messages to their picking robots. • Consumed all CPU cores on particular instances to determine whether their dashboards would detect the load and their orchestrators would replace the unhealthy instance. They did not.