Embracing Disruption: Adding a Bit of Chaos To Help You Grow!

Paul Balogh
Developer Advocate, Grafana Labs
@javaducky

Overview
1. Why are we here?
2. A brief history of how we test
3. How fault injection can help us
4. Where do we go from here?

Application reliability is hard
• Complex architecture and infrastructure
• Many potential points of failure
• Inadequate tooling and practices

High demands on availability SLOs

UX High demands on usability

Distributed Ever increasing complexity (Example of Netflix services)

Fragility Potentially fateful interdependency https://xkcd.com/2347/

You are not alone

Overview
1. Why are we here?
2. A brief history of how we test
3. How fault injection can help us
4. Where do we go from here?

The way we test

Checklist / OLD WAY
• Testing frequency: Before releases
• Testing environment: Test and Production
• How is initiated: Manually
• Release frequency: Quarterly or biannually

• QA bottleneck
• Lower coverage
• Late in process

The way we test

Checklist / OLD WAY
• Testing frequency: Before releases
• Testing environment: Test and Production
• How is initiated: Manually
• Release frequency: Quarterly or biannually

DevOps / MODERN WAY
• Testing frequency: AND nightly, feature branches, continuous, synthetic monitoring
• Testing environment: AND Staging, long-lived and short-lived ephemeral environments
• How is initiated: Scheduled. Automatically as part of CI/CD
• Release frequency: Weekly

The way we test
• Unit testing
• Integration testing
• Contract testing
• Functional testing
• E2E testing
• Load testing

The way we test
• Applications instrumented
• Observability platform available

The way we test (over time)
• Start simple
• Test frequently
• Continually expand
• Evolve over time

Yet we learn from failure

Overview
1. Why are we here?
2. A brief history of how we test
3. How fault injection can help us
4. Where do we go from here?

Fault Injection
A software testing technique that introduces errors into a system to ensure it can withstand and recover from those conditions.

Failure happens
[Figure: DevOps loop (Build, Test, Release, Deploy, Operate, Monitor) with Production; after a failure: Resolve, Inform, Learn]

Build more confidence to withstand failures?
[Figure labels: Production systems, Learn from Incidents, Shift left, Development, Chaos Testing]

“From the distributed system perspective, almost all interesting availability experiments can be driven by affecting latency or response type.”
Nora Jones and Casey Rosenthal, Chaos Engineering (O’Reilly)
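As a rough illustration of the quote above, the sketch below wraps a service call and disturbs only its latency and its response type. It is plain Python rather than a k6 script, and `FaultInjector`, `error_rate`, `delay_s`, and `get_product` are hypothetical names chosen for this example; none of them belong to k6 or xk6-disruptor.

```python
import random
import time


class FaultInjector:
    """Wraps a callable and injects latency and error responses.

    Hypothetical helper for illustration; not part of k6 or xk6-disruptor.
    """

    def __init__(self, target, error_rate=0.0, error_code=500, delay_s=0.0, seed=None):
        self.target = target
        self.error_rate = error_rate    # fraction of calls answered with an error
        self.error_code = error_code    # status code used for injected errors
        self.delay_s = delay_s          # extra latency added to every call
        self.rng = random.Random(seed)  # seeded so that runs are repeatable

    def __call__(self, *args, **kwargs):
        if self.delay_s:
            time.sleep(self.delay_s)    # affect latency
        if self.rng.random() < self.error_rate:
            # Affect the response type: answer with an error instead of calling the target.
            return {"status": self.error_code, "body": None}
        return self.target(*args, **kwargs)


def get_product(product_id):
    """Stand-in for a healthy downstream service (an assumption of this sketch)."""
    return {"status": 200, "body": {"id": product_id}}


# Roughly 30% of calls fail and every call is 1 ms slower.
faulty = FaultInjector(get_product, error_rate=0.3, delay_s=0.001, seed=42)
responses = [faulty(i) for i in range(100)]
failed = sum(1 for r in responses if r["status"] != 200)
print(f"{failed} of {len(responses)} calls returned an injected error")
```

Seeding the random generator is a design choice made here so that two runs with the same settings inject the same faults.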

Introducing k6 and xk6-disruptor

k6, a reliability testing tool
• Formerly known as Load Impact
• Open Source since 2016
• ~21.8k GitHub stars (as of October 2023)
• Promotes “shift-left” testing
• Acquired by Grafana Labs in 2021
github.com/grafana/k6

xk6-disruptor for fault injection
• Became a project in August 2022, evolved from previous experiments
github.com/grafana/xk6-disruptor

k6: a reliability testing tool
• OpenSource: OSS is at the heart of what we do and helps leave the world a little better than we found it
• Scriptable: CLI and API designed for automating your tests with pass/fail criteria using JavaScript syntax
• Performant: A k6 engine written in Go, making it one of the best performing tools available
• Extensible: Use Go(lang) code to add support for new outputs, protocols, and products from within your test scripts

Testing with errors

How effective is testing known errors?
• 92% of catastrophic system failures are the result of incorrect handling of non-fatal errors
• In 58% of the cases, the resulting faults could have been detected through simple testing of error handling code
“Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems”, Yuan et al., USENIX OSDI 2014

How to improve error handling?
In 35% of the cases, error handling code falls into one of three patterns:
• Overreactive. Aborts the system under non-fatal errors
• Low Context. Was empty or only contained a log printing statement
• Incomplete. Related comments like “FIXME” or “TODO”
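To make the "overreactive" pattern concrete, here is a minimal Python sketch of a handler that aborts on a non-fatal error, next to one that degrades gracefully, plus the kind of simple error-handling test the study refers to. All names (`RecommendationUnavailable`, `render_overreactive`, `render_resilient`, `failing_recommendations`) are hypothetical and exist only for this illustration.

```python
class RecommendationUnavailable(Exception):
    """Non-fatal error: a hypothetical optional dependency is down."""


def failing_recommendations(product_id):
    # Simulates the known fault that the caller is expected to survive.
    raise RecommendationUnavailable("recommendation service timed out")


def render_overreactive(product_id, fetch):
    # Overreactive pattern: a non-fatal error aborts the whole process.
    try:
        recs = fetch(product_id)
    except RecommendationUnavailable:
        raise SystemExit("aborting")
    return {"product": product_id, "recommendations": recs}


def render_resilient(product_id, fetch):
    # Degrade gracefully: keep serving the page without recommendations.
    try:
        recs = fetch(product_id)
    except RecommendationUnavailable as err:
        # Keep the error context instead of swallowing it silently.
        return {"product": product_id, "recommendations": [], "degraded": str(err)}
    return {"product": product_id, "recommendations": recs}


def test_page_survives_recommendation_outage():
    # Simple test of error handling code: inject the known error, check the outcome.
    page = render_resilient("telescope-1", failing_recommendations)
    assert page["product"] == "telescope-1"
    assert page["recommendations"] == []
    assert "timed out" in page["degraded"]


test_page_survives_recommendation_outage()
```

Running the same test against `render_overreactive` would terminate the process, which is exactly the behavior such a test is meant to catch before production.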

Introduce chaos testing
• Incorporate chaos engineering principles early in the development process
• Emphasize verification over experimentation
• Change focus from uncovering unknown faults to ensuring proper handling of known faults

Continually improve reliability
[Figure labels: Development, Chaos Testing, Shift left, Progress towards, Chaos Experiments, Production systems, Incident, Enacting, Learn from Incidents, Improve]

Four tenets of Chaos Testing
• Incremental adoption
• Application Centric
• Controlled Chaos
• Chaos as Code

Chaos Testing in action

OpenTelemetry Demo - Astronomy Shop
• Microservices architecture
• HTTP, gRPC, Kafka between services
• Polyglot (Go, Java, JS, …)
• Kubernetes-ready
https://github.com/open-telemetry/opentelemetry-demo

OpenTelemetry Demo - Astronomy Shop
How would an incident affect our services?

Chaos testing principles in action
• Tests can be reused to validate the system under turbulent conditions
• Conditions are defined in familiar terms: latency and error rate
• Tests have a controlled effect on the target service
• Tests are repeatable with results that are predictable
• Fault injection is coordinated from the test code
• Fault injection should not add any operational complexity
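The principles above can be sketched in a few lines of Python: one test body is reused for a baseline run and for a run under injected faults, the faults are expressed as error rate and latency, and the injection is driven from the test code itself. This is an illustrative sketch, not the xk6-disruptor API; `checkout`, `with_faults`, `run_test`, and the thresholds are assumptions made for the example.

```python
import time


def checkout(order_id):
    """Stand-in for the service under test (an assumption of this sketch)."""
    return 200


def with_faults(service, error_every=0, delay_s=0.0, error_code=503):
    """Return `service` disturbed in two familiar terms: error rate and latency.

    error_every=10 turns every 10th call into an error (a 10% error rate),
    which keeps runs repeatable and results predictable.
    """
    calls = 0

    def disturbed(*args):
        nonlocal calls
        calls += 1
        time.sleep(delay_s)
        if error_every and calls % error_every == 0:
            return error_code
        return service(*args)

    return disturbed


def run_test(service, requests=50, max_error_rate=0.15, max_avg_latency_s=0.05):
    """The same test body serves both the baseline and the chaos run."""
    errors, elapsed = 0, 0.0
    for order_id in range(requests):
        start = time.perf_counter()
        status = service(order_id)
        elapsed += time.perf_counter() - start
        if status != 200:
            errors += 1
    error_rate = errors / requests
    passed = error_rate <= max_error_rate and elapsed / requests <= max_avg_latency_s
    return {"error_rate": error_rate, "passed": passed}


baseline = run_test(checkout)
turbulent = run_test(with_faults(checkout, error_every=10, delay_s=0.002))
broken = run_test(with_faults(checkout, error_every=2))
print(baseline, turbulent, broken)
```

Using a deterministic "every Nth call" fault instead of a random one is a deliberate simplification here: it makes the injected error rate exact, so the pass/fail outcome of each run is known in advance.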

Overview
1. Why are we here?
2. A brief history of how we test
3. How fault injection can help us
4. Where do we go from here?

Reliability testing strategy
• Integration Testing
• Contract Testing
• Functional Testing
• Browser Automation (E2E)
• Load Testing
• Chaos Testing

Proactively improve reliability
[Figure: PRE-PRODUCTION: Virtual User traffic against the SUT. PRODUCTION: Real User traffic and Virtual User traffic against the SUT.]

Final remarks
• The ability to operate reliably should not be a privilege of the technology elite
• Chaos Engineering can be democratized by promoting the adoption of Chaos Testing
• To be effective, Chaos Testing must be compatible with the existing testing practices used by development teams

Our Goal
Make Chaos Engineering practices accessible to a broad spectrum of organizations by building a solid foundation from which they can progress towards more reliable applications

Thanks for participating!
Connect with Paul as @javaducky or linkedin/in/pabalogh
k6.io/slack
grafana/xk6-disruptor
