
Chaos Engineering: When the network breaks @ NGINX Conf 2019


Chaos engineering is a disciplined approach to identifying failures before they become outages. By proactively testing how a system responds under stress, you can identify and fix failures before they end up in the news. Chaos engineering lets you compare what you think will happen to what actually happens in your systems. You literally break things on purpose to learn how to build more resilient systems.

Tammy Butow leads a walk-through of network chaos engineering, covering the tools and practices you need to implement chaos engineering in your organization. Even if you’re already using chaos engineering, you’ll learn to identify new ways to use it to improve the resilience of your network and services. You’ll also discover how other companies are using chaos engineering and the positive results they have achieved using chaos to create reliable distributed systems.

Tammy begins by explaining chaos engineering and its principles. She then asks why many engineering teams (including Netflix, Gremlin, Dropbox, National Australia Bank, Under Armour, Twilio, and more) use chaos engineering and how every engineering team can use it to create reliable systems. You’ll learn how to get started using chaos engineering with your own team as you explore the tools to measure success and the chaos tools and new chaos features built into cloud services. You’ll also discover how to use wargame environments to learn about chaos engineering and how to practice chaos engineering on AWS DocumentDB, AWS DynamoDB, AWS RDS, and AWS S3.

Other topics include how to use monitoring tools combined with chaos engineering to help you create reliable distributed systems, where you can learn more, and how to join the chaos community.


Tammy Bryant Butow

September 11, 2019


  1. Chaos Engineering: When The Network Breaks. Tammy Butow, Principal SRE, Gremlin. @tammybutow
  2. THE PROBLEM: Every system is becoming a distributed system.

  3. None
  4. Chaos Engineering: Thoughtful, planned experiments designed to reveal the weaknesses in our systems.
  5. Inject something harmful to build an immunity.

  6. We test proactively, instead of waiting for an outage.

  7. Define the Blast Radius

  8. What is the value of Chaos Engineering?

  9. Improved Incident Management

  10. Fire drills prepare us to respond quickly, calmly, and safely.

  11. Measuring the Cost of Downtime: Cost = R + E + C + (B + A).
      During the outage: R = Revenue Lost; E = Employee Productivity.
      After the outage: C = Customer Chargebacks (SLA Breaches).
      Unquantifiable: B = Brand Defamation; A = Employee Attrition.
      Amazon is estimated to lose $220,000/min; the average e-commerce site loses $6,800/min.
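The slide’s cost model can be sketched as a quick calculation. It sums only the measurable terms (the slide marks B and A as unquantifiable); the per-minute rate is the slide’s average e-commerce figure, while the outage length, productivity loss, and chargeback amounts below are hypothetical inputs.

```python
# Downtime cost model from the slide: Cost = R + E + C + (B + A).
# B (brand defamation) and A (employee attrition) are unquantifiable,
# so this sketch sums only R, E, and C. All inputs are hypothetical.

def downtime_cost(revenue_lost_per_min, outage_minutes,
                  employee_cost, sla_chargebacks):
    """Measurable outage cost: R + E + C (B and A are left out)."""
    r = revenue_lost_per_min * outage_minutes  # R: revenue lost
    e = employee_cost                          # E: lost employee productivity
    c = sla_chargebacks                        # C: SLA-breach chargebacks
    return r + e + c

# A hypothetical 30-minute outage at the slide's average rate ($6,800/min):
print(downtime_cost(6_800, 30, employee_cost=15_000, sla_chargebacks=5_000))
# 6800 * 30 = 204000; plus 15000 + 5000 gives 224000
```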
  12. Network Chaos

  13. None
  14. Network Chaos Engineering Demos: 01 Latency Injection, 02 Packet Loss, 03 Blackhole.
  15. Hipster Shop Architecture

  16. Hipster Shop Demo

  17. Latency Injection Demo

  18. Hipster Shop, Datadog: Latency Attack, 1 container (Experiment #4, payments, 200 ms). Result: HTTP 400/500 errors.
  19. Latency Attack on Payment Container on AWS EKS
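The talk’s latency demo uses Gremlin to add 200 ms of delay to the payments container; as a rough application-level sketch of why that produces 400/500 errors, here is a hypothetical caller whose timeout is shorter than the injected delay. All names and the 150 ms timeout are assumptions, not the demo’s actual configuration.

```python
import time

# Sketch of a latency-injection experiment (hypothetical names; the
# talk's demo injects latency with Gremlin on an AWS EKS container).
INJECTED_LATENCY_S = 0.2   # the slide's 200 ms latency attack
CLIENT_TIMEOUT_S = 0.15    # a caller that gives up before 200 ms

def call_payments(latency_s):
    """Pretend payments call: wait out the injected latency, then succeed."""
    time.sleep(latency_s)
    return 200  # HTTP OK

def call_with_timeout(latency_s, timeout_s):
    """Return HTTP 504 if the injected latency exceeds the caller's timeout."""
    start = time.monotonic()
    status = call_payments(latency_s)
    elapsed = time.monotonic() - start
    return 504 if elapsed > timeout_s else status

print(call_with_timeout(INJECTED_LATENCY_S, CLIENT_TIMEOUT_S))  # 504
print(call_with_timeout(0.0, CLIENT_TIMEOUT_S))                 # 200
```

The experiment’s question is exactly this comparison: does every downstream caller tolerate 200 ms of extra latency, or does some timeout fire first?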

  20. Hipster Shop, Datadog: Latency Attack, 1 instance (Experiment #5, 200 ms). Result: HTTP 400/500 errors.
  21. Latency Attack 1 instance on AWS EKS

  22. Packet Loss Demo

  23. Kubernetes Dashboard, Datadog, Gremlin: Packet Loss (Experiment #2, `kubernetes-dashboard`, 70% for 60 seconds). Rise in errors (400/500s); slower responses, but ultimately success.
  24. Packet Loss Attack 1 container on AWS EKS
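The packet-loss result on the slide, slower responses but ultimately success, is what retries over a lossy link look like. As a hypothetical sketch (not the demo’s Gremlin configuration), each send below is dropped with probability 0.7 and the client retries until one gets through:

```python
import random

# Sketch of the 70% packet-loss experiment: each send is dropped with
# probability 0.7 and the client retries, so responses are slower but
# ultimately succeed, matching the slide. Hypothetical code, not Gremlin's.
LOSS_RATE = 0.70

def send_with_retries(loss_rate, max_retries=50, rng=random.random):
    """Retry until a packet survives; return how many attempts it took."""
    for attempt in range(1, max_retries + 1):
        if rng() >= loss_rate:  # packet got through
            return attempt
    raise TimeoutError("all retries lost")

random.seed(42)
attempts = [send_with_retries(LOSS_RATE) for _ in range(1000)]
# Mean attempts is about 1 / (1 - 0.7) ~= 3.3: slower, but success.
print(round(sum(attempts) / len(attempts), 1))
```

At 70% loss each request needs roughly three times as many sends on average, which shows up in monitoring as higher latency rather than hard failures.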

  25. Blackhole Attack Demo

  26. Hipster Shop, Datadog: Blackhole Attack, 1 container (Experiment #3, payments, 120 seconds). Result: HTTP 400/500 errors.
  27. Blackhole Attack Payment Container on AWS EKS
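A blackhole attack drops all traffic to a dependency, so every call times out. As an application-level sketch of the failure mode (hypothetical names and behavior, not the Gremlin demo itself), here is a checkout path that surfaces HTTP 500s while its payments dependency is blackholed:

```python
# Sketch of a blackhole experiment: all traffic to the payments
# dependency is dropped, every call times out, and the service returns
# HTTP 500s (hypothetical names; the talk's demo uses Gremlin on EKS).

BLACKHOLED = {"payments"}  # dependencies currently under attack

def call_dependency(name, timeout_s=1.0):
    """Simulate a network call: blackholed dependencies never answer."""
    if name in BLACKHOLED:
        raise TimeoutError(f"no route to {name} after {timeout_s}s")
    return 200

def checkout():
    """Checkout path: degrade to HTTP 500 when payments is unreachable."""
    try:
        return call_dependency("payments")
    except TimeoutError:
        return 500

print(checkout())   # 500 while the blackhole attack is running
BLACKHOLED.clear()  # the attack window (120 s on the slide) ends
print(checkout())   # 200 once traffic flows again
```

The experiment verifies both halves: the service fails in a controlled way during the attack, and recovers on its own when the blackhole is lifted.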

  28. Hipster Shop, Datadog: Blackhole Attack, 1 container (Experiment #3, catalogue, 60 seconds). Result: HTTP 400/500 errors.
  29. Blackhole Attack Catalogue Container on AWS EKS

  30. How do you communicate the results of your Chaos Engineering experiments?

  31. Was it expected? Chaos Engineering uncovers unknown side effects. Was it detected? Ensuring that our monitoring is configured correctly is critical. Was it mitigated? When possible, our systems should gracefully degrade.
  32. Fix the issues. Whether code, configuration, or process: iterate and improve. Can you automate this? Regularly exercise past failures to prevent the drift into failure. Share your results! Prepare an executive summary of what you learned.
  33. Where can you get started?

  34. Join us @ Chaos Conf, San Francisco, September 26, 2019. chaosconf.io twitter.com/chaosconf Special code "insider" for $49 tickets. @tammybutow @gremlininc
  35. Thank You. Tammy Butow, Principal SRE, Gremlin. tammy@gremlin.com @tammybutow