Chaos Engineering: When The Network Breaks - Tammy Bryant Butow (Gremlin) - ETE 2021

Chaos Engineering: When The Network Breaks
Tammy Bryant Butow, Principal SRE @ Gremlin, @tambryantbutow

Hello new and old friends
You’ll find me on Twitter: @tambryantbutow and on LinkedIn. Happy to answer questions :)
Principal SRE @ Gremlin. Co-Founder @ Girl Geek Academy. Previously: SRE Manager @ Dropbox, DigitalOcean, National Australia Bank, Queensland University of Technology… and more.

Have you ever thought the network was causing an incident but struggled to prove it?

You get out your network tools to try and collect evidence...

But everyone still says... it’s not “the network”

This is when you... get started with Network Chaos Engineering

Thoughtful, planned experiments designed to help you prove networking issues and get them resolved quickly.

For example, say somebody is occasionally throttling your traffic. How do you discover and prove this? Answer: Network Chaos Engineering.
Generate network traffic, observe network traffic, run a latency attack, review the results.

You realise this looks like misconfigured QoS: it is only throttling you during specific time windows.
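
One way to make a time-window pattern like this visible is to bucket latency measurements by hour and flag the hours that stand out from the rest. The sketch below is illustrative only: the function name, the 2x threshold and the synthetic samples are assumptions, not part of the talk or of any Gremlin tooling.

```python
from collections import defaultdict
from statistics import median


def find_throttled_hours(samples, factor=2.0):
    """Return the hours whose median latency exceeds factor x the typical hour.

    samples: iterable of (hour_of_day, latency_ms) pairs (hypothetical format).
    """
    by_hour = defaultdict(list)
    for hour, latency_ms in samples:
        by_hour[hour].append(latency_ms)

    # Median latency for each hour, then the median of those as the baseline.
    hourly = {hour: median(values) for hour, values in by_hour.items()}
    baseline = median(hourly.values())
    return sorted(hour for hour, m in hourly.items() if m > factor * baseline)


# Synthetic data: roughly 20 ms all day, with extra slow samples at 09:00 and 10:00.
samples = [(h, 20.0 + i) for h in range(24) for i in range(3)]
samples += [(h, 180.0) for h in (9, 10) for _ in range(5)]

throttled = find_throttled_hours(samples)
print(throttled)
```

Comparing each hour against the median of all hours keeps a short throttling window from dragging the baseline up with it.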

What is QoS? Quality of service (QoS) controls and manages network resources by setting priorities for specific types of data on the network. If QoS is misconfigured, it can add latency, because the wrong packets may be delayed in a buffer while other traffic is present.
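
To see how a wrong priority can add latency, here is a toy strict-priority scheduler. It is a deliberately simplified model, not how any particular router implements QoS: the traffic class names "api" and "bulk", the 1 ms service time and the assumption that all packets are already buffered are hypothetical.

```python
def queueing_delay(packets, priority, service_ms=1.0):
    """Mean wait in ms per traffic class under strict-priority scheduling.

    packets: list of traffic class names, all assumed buffered at t=0.
    priority: class name -> rank, where a lower rank is sent first.
    """
    # Send order: by rank first, then by arrival position within a rank.
    order = sorted(range(len(packets)), key=lambda i: (priority[packets[i]], i))
    waits = {}
    for position, idx in enumerate(order):
        waits.setdefault(packets[idx], []).append(position * service_ms)
    return {cls: sum(w) / len(w) for cls, w in waits.items()}


buffer = ["bulk"] * 8 + ["api"] * 2

good = queueing_delay(buffer, {"api": 0, "bulk": 1})  # latency-sensitive first
bad = queueing_delay(buffer, {"api": 1, "bulk": 0})   # priorities swapped

print(good["api"], bad["api"])
```

With the priorities swapped, the latency-sensitive packets wait behind the entire bulk backlog, which is the kind of delay a latency experiment can surface.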

OH: “A bad QoS policy is worse than no QoS policy”

Now we can resolve the issue by fixing the misconfigured QoS policy and putting in place a notification system for service owners. We then re-run our Chaos Engineering scenario to ensure the fix works.

This is when you... continue advancing your practice of Network Chaos Engineering

There are lots of other benefits you can get from practicing Chaos Engineering on the network: find monitoring and observability gaps, validate dependencies, train teams for on-call, and get more sleep.

My favourite benefit is a reduction in MTTD (mean time to detection). I care about this so much, I wrote a book on it with friends! gremlin.com/oreilly-reducing-mttd-for-high-severity-incidents/
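
As a worked example of the metric, MTTD can be computed as the average gap between when an incident started and when it was detected. The timestamps below are made up for illustration, and the function name is an assumption.

```python
from datetime import datetime


def mttd_minutes(incidents):
    """Mean time to detection in minutes.

    incidents: list of (started_at, detected_at) datetime pairs.
    """
    gaps = [
        (detected - started).total_seconds() / 60
        for started, detected in incidents
    ]
    return sum(gaps) / len(gaps)


# Hypothetical incidents detected after 12, 8 and 10 minutes.
incidents = [
    (datetime(2021, 1, 4, 9, 0), datetime(2021, 1, 4, 9, 12)),
    (datetime(2021, 1, 9, 22, 30), datetime(2021, 1, 9, 22, 38)),
    (datetime(2021, 2, 2, 3, 15), datetime(2021, 2, 2, 3, 25)),
]

mttd = mttd_minutes(incidents)
print(mttd)
```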

Now for some demos… Network Chaos Engineering

Architecture: 12 services. What matters most to our customers?

Architecture: What can we remove from the critical path?

OH: “Getting out of the critical path is a good thing”

Architecture: Service Not Found
Does blackholing a non-critical path service like the Recommendation Service cause unexpected failures for critical services like the Product Catalogue or Frontend?

Blackhole → Ads

Does blackholing a non-critical path service like the Ad Service result in graceful degradation of the customer experience?

When the 60s blackhole attack ends, will everything return to “normal”?

Graceful Degradation
Yes, our experiment was successful and our results were what we expected them to be.
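
Graceful degradation of this kind usually means the page treats the non-critical call as optional. The sketch below is not the demo application's code: the function names, the page structure and the product ID are hypothetical, and the blackhole is simulated by a function that raises a timeout.

```python
def render_product_page(product, fetch_ads):
    """Build the page from critical content and treat ads as optional."""
    page = {"product": product, "ads": []}
    try:
        page["ads"] = fetch_ads(product)
    except (TimeoutError, ConnectionError):
        # Ad service unreachable: serve the page without ads instead of failing.
        pass
    return page


def blackholed_ads(product):
    """Stand-in for an ad service whose traffic is being dropped."""
    raise TimeoutError("ad service unreachable")


def healthy_ads(product):
    """Stand-in for a working ad service."""
    return ["ad-for-" + product]


degraded = render_product_page("vintage-camera", blackholed_ads)
normal = render_product_page("vintage-camera", healthy_ads)
print(degraded)
print(normal)
```

In a real service the call would also need a short timeout, so a blackholed dependency fails fast rather than holding the request open.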

Blackhole → Recommendations

Does blackholing a non-critical path service like the Recommendations Service result in graceful degradation of the customer experience?

Our requests for product pages are cancelled because the first product page request is stalled and unable to complete successfully. This continues for the duration of the Recommendations Service outage, due to a previously unknown dependency on product assets that are unavailable.

Major Incident
No, our experiment was not successful and our results were not what we expected them to be. We’ll need to fix these dependency issues to ensure this doesn’t happen again.
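
Once a hidden dependency like this is found, it helps to write the hard dependencies down and check which services a given outage can reach. The sketch below assumes a hand-maintained map of synchronous calls. The edges shown are hypothetical and are not the demo application's actual call graph.

```python
def affected_by(blackholed, calls):
    """Return the services that reach `blackholed` through hard dependencies.

    calls: service name -> set of services it calls synchronously.
    """
    affected = set()
    changed = True
    while changed:
        changed = False
        for service, deps in calls.items():
            if service in affected:
                continue
            # Affected if it calls the blackholed service directly,
            # or calls something that is already affected.
            if blackholed in deps or deps & affected:
                affected.add(service)
                changed = True
    return affected


# Hypothetical dependency map for illustration only.
calls = {
    "frontend": {"product_catalogue"},
    "product_catalogue": {"recommendations"},
    "ads": set(),
    "cart": set(),
}

impacted = affected_by("recommendations", calls)
print(sorted(impacted))
```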

📦 Packet Loss → Cart

Do we experience data loss if there is packet loss impacting the Cart Service?

Due to the packet loss attack on the shopping cart, a user trying to add items to the cart is given a 500 Internal Server Error: “Failed To Add To Cart”.

Learning from our Chaos Engineering Experiments

Was it expected? Was it detected? Was it mitigated? Can we fix the issues? Can we automate this? How can we best share our findings and results?

Join the Chaos Engineering community: gremlin.com/slack. 8000+ engineers coming together to share knowledge.

Know someone awesome @ Chaos Engineering? Nominate them for a Chaos Champion award: gremlin.com/champions

Thanks new and old friends
You can find me on Twitter: @tambryantbutow and on LinkedIn. Bonus sticker gift pack from Gremlin: gremlin.com/talk/ete
