Embracing Disruption: Adding a Bit of Chaos To Help You Grow!

Paul Balogh
November 01, 2023

As software systems have become more distributed and complex, the "shift-left" movement brings reliability testing to earlier stages of development. Ensuring reliability goes beyond simple end-to-end tests. 

To ensure the highest levels of reliability, you must perform a suite of testing types. Incorporate contract tests to validate APIs and load tests to make scaling predictable. But how do you test for those inevitable failures? Let's learn from Chaos Engineering principles by incorporating disruptive behavior into your system before production.

Join Paul as we learn ways to incorporate a variety of testing types into your software development pipeline. We'll discuss the pros and cons of each and what you can do to add them to your processes. By embracing a little disruption, you can significantly improve the reliability of your system.

Transcript

  1. Overview
    1. Why are we here?
    2. A brief history of how we test
    3. How fault injection can help us
    4. Where do we go from here?
  2. Application reliability is hard
    • Complex architecture and infrastructure
    • Many potential points of failure
    • Inadequate tooling and practices
  3. Overview
    1. Why are we here?
    2. A brief history of how we test
    3. How fault injection can help us
    4. Where do we go from here?
  4. The way we test: Checklist / OLD WAY
    • Release frequency: Quarterly or biannually
    • How is initiated: Manually
    • Testing environment: Test and Production
    • Testing frequency: Before releases
    Drawbacks:
    • QA bottleneck
    • Lower coverage
    • Late in process
  5. The way we test: DevOps / MODERN WAY compared with Checklist / OLD WAY
    DevOps / MODERN WAY:
    • Release frequency and testing frequency: Weekly and nightly, feature branches, continuous, synthetic monitoring
    • How is initiated: Scheduled. Automatically as part of CI/CD
    • Testing environment: Staging, long-lived and short-lived ephemeral environments
    Checklist / OLD WAY:
    • Release frequency: Quarterly or biannually
    • How is initiated: Manually
    • Testing environment: Test and Production
    • Testing frequency: Before releases
  6. The way we test
    • Unit testing
    • Integration testing
    • Contract testing
    • Functional testing
    • E2E testing
    • Load testing
  7. Overview
    1. Why are we here?
    2. A brief history of how we test
    3. How fault injection can help us
    4. Where do we go from here?
  8. Fault Injection
    A software testing technique that introduces errors to a system to ensure it can withstand and recover from those conditions.
  9. Build more confidence to withstand failures?
    [Diagram labels: Chaos Testing; Shift left; Learn from Incidents; Production systems; Development]
  10. "From the distributed system perspective, almost all interesting availability experiments can be driven by affecting latency or response type."
    Nora Jones, Casey Rosenthal - Chaos Engineering, O’Reilly
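The quote on slide 10 names two levers, latency and response type. As a rough illustration only, the plain JavaScript below wraps a stand-in service so that every call is slowed down and a share of calls return an error status. It is not the xk6-disruptor API: `withFaults`, `healthyService`, the `fault` fields, and the seeded random generator are all hypothetical names chosen for this sketch, and latency is modelled as a number on the response rather than as real waiting time.

```javascript
// Small seeded pseudo-random generator (mulberry32) so a run is repeatable.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical wrapper: adds latency to every response and replaces a
// fraction of responses (fault.errorRate) with an error status.
function withFaults(handler, fault, random) {
  return function (request) {
    const response = handler(request);
    const delayed = {
      ...response,
      durationMs: response.durationMs + fault.addedDelayMs,
    };
    if (random() < fault.errorRate) {
      return { ...delayed, status: fault.errorCode, body: 'injected fault' };
    }
    return delayed;
  };
}

// Hypothetical healthy service: always answers 200 in 20 ms.
function healthyService(request) {
  return { status: 200, body: `ok: ${request.path}`, durationMs: 20 };
}

const disrupted = withFaults(
  healthyService,
  { addedDelayMs: 100, errorRate: 0.1, errorCode: 500 },
  mulberry32(42)
);

const responses = Array.from({ length: 1000 }, () => disrupted({ path: '/cart' }));
const failures = responses.filter((r) => r.status === 500).length;
console.log(`injected ${failures} failures out of ${responses.length} calls`);
```

Seeding the generator keeps the set of failed calls identical from run to run, which is one simple way to get the repeatable, predictable results the deck asks of chaos tests.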
  11. Introducing k6 and xk6-disruptor
    k6, a reliability testing tool (github.com/grafana/k6)
    • Formerly known as Load Impact
    • Open Source since 2016
    • ~21.8k GitHub stars (as of October 2023)
    • Promotes “shift-left” testing
    • Acquired by Grafana Labs in 2021
    xk6-disruptor for fault injection (github.com/grafana/xk6-disruptor)
    • Became a project in August 2022, evolved from previous experiments
  12. k6: a reliability testing tool
    • Open Source: OSS is at the heart of what we do and helps leave the world a little better than we found it
    • Scriptable: CLI and API designed for automating your tests with pass/fail criteria using JavaScript syntax
    • Performant: A k6 engine written in Go, making it one of the best performing tools available
    • Extensible: Use Go(lang) code to add support for new outputs, protocols, and products from within your test scripts
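Slide 12 mentions automating tests with pass/fail criteria. k6 has its own mechanism for declaring such criteria; the plain JavaScript below does not use it and only sketches the idea of turning raw measurements into a pass or fail verdict. The function names, the `criteria` fields, and the sample data are assumptions made for this example.

```javascript
// Nearest-rank percentile over a sorted copy of the samples.
function percentile(values, p) {
  if (values.length === 0) throw new Error('no samples');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Hypothetical evaluator: compares the measured p95 latency and error rate
// against limits and reports which criteria were missed.
function evaluate(samples, criteria) {
  const p95 = percentile(samples.map((s) => s.durationMs), 95);
  const errors = samples.filter((s) => s.status >= 500).length;
  const errorRate = errors / samples.length;
  const missed = [];
  if (p95 >= criteria.maxP95Ms) {
    missed.push(`p95 latency ${p95} ms is not below ${criteria.maxP95Ms} ms`);
  }
  if (errorRate >= criteria.maxErrorRate) {
    missed.push(`error rate ${errorRate} is not below ${criteria.maxErrorRate}`);
  }
  return { passed: missed.length === 0, p95, errorRate, missed };
}

// Made-up measurements: 19 fast successes and one slow server error.
const samples = [
  ...Array.from({ length: 19 }, () => ({ status: 200, durationMs: 100 })),
  { status: 500, durationMs: 900 },
];

const verdict = evaluate(samples, { maxP95Ms: 500, maxErrorRate: 0.01 });
console.log(verdict.passed ? 'PASS' : 'FAIL', verdict.missed);
```

Keeping the limits in a plain object makes it easy to run the same test with relaxed limits while faults are being injected and stricter ones otherwise.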
  13. How effective is testing known errors?
    • 92% of the catastrophic system failures are the result of incorrect handling of non-fatal errors
    • In 58% of the cases the resulting faults could have been detected through simple testing of error handling code
    Source: “Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems”, Yuan et al., USENIX OSDI 2014
  14. How to improve error handling?
    In 35% of the cases, error handling code falls into one of three patterns:
    • Overreactive. Aborts the system under non-fatal errors
    • Low Context. Was empty or only contained a log printing statement
    • Incomplete. Related comments like “FIXME” or “TODO”
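To make the three patterns on slide 14 concrete, here is a minimal sketch of an error handler that tries to avoid them, together with a deliberately injected non-fatal error that exercises it. Everything here is illustrative: `TransientError`, `callWithRetry`, and `failingFirst` are hypothetical names, and retrying is only one of several reasonable reactions to a non-fatal error.

```javascript
// Marker type for errors the caller considers non-fatal (an assumption of this sketch).
class TransientError extends Error {}

// Retries non-fatal errors and records context for each failed attempt.
// Other errors are rethrown untouched, and exhausting the attempts produces
// a descriptive error instead of aborting the whole process.
function callWithRetry(operation, { maxAttempts, log }) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return operation();
    } catch (err) {
      if (!(err instanceof TransientError)) throw err;
      lastError = err;
      log.push({ attempt, message: err.message });
    }
  }
  throw new Error(`gave up after ${maxAttempts} attempts: ${lastError.message}`);
}

// Injected fault: the first `failures` calls throw a non-fatal error,
// later calls succeed with `value`.
function failingFirst(failures, value) {
  let calls = 0;
  return () => {
    calls += 1;
    if (calls <= failures) throw new TransientError(`timeout on call ${calls}`);
    return value;
  };
}

const log = [];
const result = callWithRetry(failingFirst(2, 'cart loaded'), { maxAttempts: 3, log });
console.log(result, log);
```

A test like this targets a known fault (a timeout) and verifies that it is handled, which is the kind of simple error-handling test the study cited on slide 13 refers to.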
  15. Introduce chaos testing
    • Incorporate chaos engineering principles early in the development process
    • Emphasize verification over experimentation
    • Change focus from uncovering unknown faults to ensuring proper handling of known faults
  16. Continually improve reliability
    [Diagram labels: Chaos Testing; Chaos Experiments; Incident Enacting; Shift left; Production systems; Development; Progress towards; Improve; Learn from Incidents]
  17. OpenTelemetry Demo - Astronomy Shop
    • Microservices architecture
    • HTTP, gRPC, Kafka between services
    • Polyglot (Go, Java, JS, …)
    • Kubernetes-ready
    https://github.com/open-telemetry/opentelemetry-demo
  18. OpenTelemetry Demo - Astronomy Shop: How would an incident affect our services?
    • Microservices architecture
    • HTTP, gRPC, Kafka between services
    • Polyglot (Go, Java, JS, …)
    • Kubernetes-ready
    https://github.com/open-telemetry/opentelemetry-demo
  19. Chaos testing principles in action
    • Tests can be reused to validate the system under turbulent conditions
    • Conditions are defined in familiar terms: latency and error rate
    • Tests have a controlled effect on the target service
    • Tests are repeatable with results that are predictable
    • Fault injection is coordinated from the test code
    • Fault injection should not add any operational complexity
  20. Overview
    1. Why are we here?
    2. A brief history of how we test
    3. How fault injection can help us
    4. Where do we go from here?
  21. Final remarks
    • The ability to operate reliably should not be a privilege of the technology elite
    • Chaos Engineering can be democratized by promoting the adoption of Chaos Testing
    • To be effective, Chaos Testing must be compatible with the existing testing practices used by development teams
  22. Our Goal
    Make Chaos Engineering practices accessible to a broad spectrum of organizations by building a solid foundation from which they can progress towards more reliable applications