
Chaos Engineering: When The Network Breaks - Tammy Bryant Butow (Gremlin) - ETE 2021

Chaos engineering is a disciplined approach to identifying failures before they become outages. By proactively testing how a system responds under stress, you can identify and fix failures before they end up in the news. Chaos engineering lets you compare what you think will happen to what actually happens in your systems. You literally break things on purpose to learn how to build more resilient systems.

In this session, Tammy leads a walk‑through of network chaos engineering, covering the tools and practices you need to implement chaos engineering in your organization. Even if you’re already using chaos engineering, she illustrates new ways to use it to improve the resilience of your network and services. She describes how other companies use chaos engineering and the positive results they have had in building reliable distributed systems.

Tammy begins by explaining chaos engineering and its principles. She then looks at why many engineering teams (including Netflix, Gremlin, Dropbox, National Australia Bank, Twilio, and more) use chaos engineering and how every engineering team can use it to create reliable systems. She shows how to get started with chaos engineering on your own team, covering the tools for measuring success, dedicated chaos tools, and the new chaos features built into cloud services. She explains how to use wargame environments to learn about chaos engineering and how to practice chaos engineering on Kubernetes, Redis, Kafka, and more.

Other topics include how to use monitoring tools combined with chaos engineering to help you create reliable distributed systems, where you can learn more, and how to join the chaos community.


Tammy Bryant Butow

May 05, 2021


  1. Chaos Engineering: When The Network Breaks. Tammy Bryant Butow, Principal SRE @ Gremlin, @tambryantbutow
  2. Hello new and old friends. You’ll find me on Twitter (@tambryantbutow) and on LinkedIn. Happy to answer questions :) Principal SRE @ Gremlin. Co-Founder @ Girl Geek Academy. Previously SRE Manager @ Dropbox, DigitalOcean, National Australia Bank, Queensland University of Technology… and more.
  3. Have you ever thought the network was causing an incident but struggled to prove it?
  4. You get out your network tools to try and collect evidence...
  5. But everyone still says... it’s not “the network”

  6. This is when you... Get started with Network Chaos Engineering

  7. Thoughtful, planned experiments designed to help you prove networking issues and get them resolved quickly.
  8. For example, say somebody is occasionally throttling your traffic. How do you discover and prove this? Answer: Network Chaos Engineering. Generate network traffic, observe network traffic, run a latency attack, review results.
  9. You realise this looks like misconfigured QoS, only throttling you during specific time windows.
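
One way to turn "only during specific time windows" into evidence is to group observed latencies by hour of day and flag the hours that stand out from the overall baseline. The following is a minimal sketch, not the method used in the talk: the function name, the 2x threshold, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def suspicious_hours(samples, factor=2.0):
    """Return hours of day whose median latency exceeds factor x the overall median.

    samples: list of (hour_of_day, latency_ms) pairs (hypothetical measurements).
    factor:  assumed threshold; tune it for your own traffic.
    """
    by_hour = defaultdict(list)
    for hour, latency_ms in samples:
        by_hour[hour].append(latency_ms)
    overall = median(latency_ms for _, latency_ms in samples)
    return sorted(
        hour for hour, values in by_hour.items() if median(values) > factor * overall
    )


# Made-up data: three probes per hour at 20 ms, except 09:00 and 10:00 at 120 ms.
samples = [(hour, 120.0 if hour in (9, 10) else 20.0) for hour in range(24) for _ in range(3)]
flagged = suspicious_hours(samples)
print(flagged)
```

If the flagged hours repeat day after day, that pattern is much easier to take to a network team than a single slow trace.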
  10. What is QoS? Quality of service (QoS) controls and manages network resources by setting priorities for specific types of data on the network. If QoS is misconfigured, it can add latency, as the wrong packets may be getting delayed in a buffer when other traffic is present.
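
To see how a misconfigured priority delays the wrong packets, here is a small sketch of a strict-priority scheduler. It is a simplified model, not a real QoS implementation; the traffic class names (`checkout`, `backup`) and the priority maps are hypothetical.

```python
import heapq


def drain_order(packets, priority_of):
    """Return packet names in the order a strict-priority scheduler would send them.

    packets:     list of (name, traffic_class) in arrival order.
    priority_of: dict mapping traffic_class -> priority (lower number sends first).
    """
    heap = []
    for arrival, (name, traffic_class) in enumerate(packets):
        # The arrival index breaks ties so packets of one class stay in order.
        heapq.heappush(heap, (priority_of[traffic_class], arrival, name))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


packets = [("bulk-1", "backup"), ("api-1", "checkout"),
           ("bulk-2", "backup"), ("api-2", "checkout")]

intended = drain_order(packets, {"checkout": 0, "backup": 1})
misconfigured = drain_order(packets, {"checkout": 1, "backup": 0})
print(intended)       # latency-sensitive traffic goes first
print(misconfigured)  # latency-sensitive traffic waits behind bulk traffic
```

With the priorities swapped, the latency-sensitive packets only wait when bulk traffic is present, which matches the intermittent symptom described on the slide.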
  11. OH: “a bad QoS policy is worse than no QoS policy”
  12. Now we can resolve the issue by fixing the misconfigured QoS policy and putting in place a notification system for service owners. We then re-run our Chaos Engineering scenario to ensure the fix works.
  13. This is when you... Continue advancing your practice of Network Chaos Engineering
  14. There are lots of other benefits you can get from practicing Chaos Engineering on the network. Find monitoring and observability gaps, validate dependencies, train teams for on-call & get more sleep.
  15. My favourite benefit is a reduction in MTTD (mean time to detection). I care about this so much, I wrote a book on it with friends! gremlin.com/oreilly-reducing-mttd-for-high-severity-incidents/
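
MTTD is the average gap between when an incident starts and when it is detected. A minimal sketch of the arithmetic, using made-up timestamps (the function name and the data are illustrative assumptions):

```python
from datetime import datetime, timedelta


def mean_time_to_detection(incidents):
    """Average of (detected - started) over a list of (started, detected) pairs."""
    gaps = [detected - started for started, detected in incidents]
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical incidents: detected after 12 minutes and after 4 minutes.
incidents = [
    (datetime(2021, 5, 1, 10, 0), datetime(2021, 5, 1, 10, 12)),
    (datetime(2021, 5, 2, 14, 0), datetime(2021, 5, 2, 14, 4)),
]
mttd = mean_time_to_detection(incidents)
print(mttd)
```

Tracking this number before and after a round of chaos experiments is one way to check whether monitoring gaps are actually being closed.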
  16. Now for some demos… Network Chaos Engineering
  17. Architecture: 12 services. What matters most to our customers?
  18. Architecture: What can we remove from the critical path?
  19. OH: “Getting out of the critical path is a good thing”
  20. Architecture: Service Not Found. Does blackholing a non-critical path service like the Recommendation Service cause unexpected failures for critical services like the Product Catalogue or Frontend?
  21. Blackhole → Ads

  22. @tambryantbutow

  23. @tambryantbutow

  24. Does blackholing a non-critical path service like the Ad Service result in graceful degradation of the customer experience?
  25. @tambryantbutow

  26. When the 60s blackhole attack ends, will everything return to “normal”?
  27. @tambryantbutow

  28. Graceful Degradation: Yes, our experiment was successful and our results were what we expected them to be.
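
Graceful degradation in this scenario means the page still renders when the Ad Service is unreachable, just without ads. The sketch below illustrates the idea only; it is not the demo application's code. The function names are hypothetical, and it assumes the client call has a timeout that raises `TimeoutError` when traffic is blackholed, since without a timeout a dropped connection can simply hang.

```python
def render_product_page(product, fetch_ads):
    """Build the page; ads are optional, so a failed ad lookup must not break it."""
    page = {"product": product, "ads": []}
    try:
        page["ads"] = fetch_ads(product)
    except (TimeoutError, ConnectionError):
        # Non-critical dependency unreachable: keep serving the core page.
        pass
    return page


def blackholed_ads(_product):
    """Stand-in for an ad client whose traffic is being dropped."""
    raise TimeoutError("ad service unreachable")


degraded = render_product_page("mug", blackholed_ads)
healthy = render_product_page("mug", lambda product: ["ad-1"])
print(degraded)
print(healthy)
```

The same pattern applies to any service that has been deliberately moved out of the critical path.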
  29. Blackhole → Recommendations

  30. @tambryantbutow

  31. @tambryantbutow

  32. Does blackholing a non-critical path service like the Recommendations Service result in graceful degradation of the customer experience?
  33. None
  34. Our requests for product pages are cancelled because the first product page request is stalled and unable to complete successfully. This continues for the duration of the recommendation catalogue outage due to a previously unknown dependency on product assets that are unavailable.
  35. Major Incident: No, our experiment was not successful and our results were not what we expected them to be. We’ll need to fix these dependency issues to ensure this doesn’t happen again.
  36. 📦 Packet Loss → Cart

  37. @tambryantbutow

  38. @tambryantbutow

  39. Do we experience data loss if there is packet loss impacting the Cart Service?
  40. Due to the packet loss attack on the shopping cart, when trying to add items to the cart the user will be given a 500 Internal Server Error: “Failed To Add To Cart”.
  41. Learning from our Chaos Engineering Experiments

  42. Was it expected? Was it detected? Was it mitigated? Can we fix the issues? Can we automate this? How can we best share our findings and results?
  43. Join the Chaos Engineering community: gremlin.com/slack. 8000+ engineers coming together to share knowledge.
  44. Know someone awesome @ Chaos Engineering? Nominate them for a Chaos Champion award: gremlin.com/champions
  45. Thanks new and old friends. You can find me on Twitter (@tambryantbutow) and on LinkedIn. Bonus sticker gift pack from Gremlin: gremlin.com/talk/ete