Slide 1

Slide 1 text

Chaos Engineering: When The Network Breaks. Tammy Bryant Butow, Principal SRE @ Gremlin. @tambryantbutow

Slide 2

Slide 2 text

Hello new and old friends. You’ll find me on Twitter: @tambryantbutow and on LinkedIn. Happy to answer questions :) Principal SRE @ Gremlin. Co-Founder @ Girl Geek Academy. Previously SRE Manager @ Dropbox; DigitalOcean, National Australia Bank, Queensland University of Technology… and more.

Slide 3

Slide 3 text

Have you ever thought the network was causing an incident but struggled to prove it?

Slide 4

Slide 4 text

You get out your network tools to try and collect evidence...

Slide 5

Slide 5 text

But everyone still says... it’s not “the network”

Slide 6

Slide 6 text

This is when you... get started with Network Chaos Engineering

Slide 7

Slide 7 text

Thoughtful, planned experiments designed to help you prove networking issues and get them resolved quickly.

Slide 8

Slide 8 text

For example, say somebody is occasionally throttling your traffic. How do you discover and prove this? Answer: Network Chaos Engineering. Generate network traffic → observe network traffic → run latency attack → review results.
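The "review results" step can be sketched in code. This is only an illustration, assuming you have already collected latency samples tagged with the hour of day; the sample data, the function names and the 2x threshold are all hypothetical and not from the talk.

```python
from statistics import median

def median_latency_by_hour(samples):
    """Group (hour_of_day, latency_ms) samples and return {hour: median latency}."""
    by_hour = {}
    for hour, latency_ms in samples:
        by_hour.setdefault(hour, []).append(latency_ms)
    return {hour: median(values) for hour, values in by_hour.items()}

def suspicious_hours(samples, factor=2.0):
    """Return hours whose median latency exceeds `factor` times the overall median."""
    overall = median(latency for _, latency in samples)
    per_hour = median_latency_by_hour(samples)
    return sorted(hour for hour, value in per_hour.items() if value > factor * overall)

# Hypothetical observations: (hour of day, round-trip latency in ms).
observed = [
    (9, 20), (9, 22), (10, 21), (10, 19),
    (14, 95), (14, 110), (15, 23), (15, 20),
]
print(suspicious_hours(observed))  # [14]
```

If a known, controlled latency attack produces the same shape in your graphs as the mystery slowdown, that is useful evidence to bring to the network owners.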

Slide 9

Slide 9 text

You realise this looks like misconfigured QoS: it only throttles you during specific time windows.

Slide 10

Slide 10 text

What is QoS? Quality of service (QoS) controls and manages network resources by setting priorities for specific types of data on the network. If QoS is misconfigured, it can add latency, because the wrong packets may be delayed in a buffer when other traffic is present.
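A toy model may help show why a wrong priority delays the wrong packets. This sketch assumes strict priority queueing with every packet already waiting in the buffer; the packet names and priority numbers are made up for illustration, and real QoS implementations are considerably more involved.

```python
import heapq

def drain_order(packets):
    """Serve packets in strict priority order (lower number = higher priority).

    `packets` is a list of (priority, name) in arrival order.
    Packets with equal priority keep their arrival order.
    """
    heap = [(priority, seq, name) for seq, (priority, name) in enumerate(packets)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, name = heapq.heappop(heap)
        order.append(name)
    return order

# Intended policy: voice (priority 0) goes ahead of bulk backup (priority 2).
good = [(2, "backup-1"), (0, "voice-1"), (2, "backup-2"), (0, "voice-2")]
# Misconfigured policy: voice is accidentally marked lowest priority (3).
bad = [(2, "backup-1"), (3, "voice-1"), (2, "backup-2"), (3, "voice-2")]

print(drain_order(good))  # ['voice-1', 'voice-2', 'backup-1', 'backup-2']
print(drain_order(bad))   # ['backup-1', 'backup-2', 'voice-1', 'voice-2']
```

With the bad policy the voice packets only leave the buffer after all the bulk traffic, which matches the symptom of extra latency appearing only when other traffic is present.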

Slide 11

Slide 11 text

OH: “a bad QoS policy is worse than no QoS policy”

Slide 12

Slide 12 text

Now we can resolve the issue by fixing the misconfigured QoS policy and putting in place a notification system for service owners. We then re-run our Chaos Engineering scenario to ensure the fix works.

Slide 13

Slide 13 text

This is when you... continue advancing your practice of Network Chaos Engineering

Slide 14

Slide 14 text

There are lots of other benefits you can get from practicing Chaos Engineering on the network: find monitoring and observability gaps, validate dependencies, train teams for on-call, and get more sleep.

Slide 15

Slide 15 text

My favourite benefit is a reduction in MTTD (mean time to detection). I care about this so much, I wrote a book on it with friends! gremlin.com/oreilly-reducing-mttd-for-high-severity-incidents/
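As a quick worked example, MTTD is the average gap between when an incident starts and when it is detected. The incident timestamps below are invented for illustration, and the function name is an assumption rather than anything from the book.

```python
from datetime import datetime

def mean_time_to_detection_minutes(incidents):
    """Average minutes between each incident's start time and its detection time."""
    gaps = [
        (detected - started).total_seconds() / 60
        for started, detected in incidents
    ]
    return sum(gaps) / len(gaps)

# Hypothetical incidents as (started, detected) pairs.
incidents = [
    (datetime(2020, 1, 6, 10, 0), datetime(2020, 1, 6, 10, 12)),   # 12 minutes
    (datetime(2020, 2, 3, 22, 30), datetime(2020, 2, 3, 22, 34)),  # 4 minutes
    (datetime(2020, 3, 9, 4, 15), datetime(2020, 3, 9, 4, 29)),    # 14 minutes
]
print(mean_time_to_detection_minutes(incidents))  # 10.0
```

Running chaos experiments gives you a known start time for each injected failure, so the detection gap can be measured directly.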

Slide 16

Slide 16 text

Now for some demos… Network Chaos Engineering

Slide 17

Slide 17 text

Architecture: 12 Services. What matters most to our customers? @tammyxbryant

Slide 18

Slide 18 text

Architecture: What can we remove from the critical path?

Slide 19

Slide 19 text

OH: “Getting out of the critical path is a good thing”

Slide 20

Slide 20 text

Architecture: Service Not Found. Does blackholing a non-critical path service like the Recommendation Service cause unexpected failures for critical services like the Product Catalogue or Frontend?

Slide 21

Slide 21 text

Blackhole → Ads

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

Does blackholing a non-critical path service like the Ad Service result in graceful degradation of the customer experience?

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

When the 60-second blackhole attack ends, will everything return to “normal”?

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

Graceful Degradation: Yes, our experiment was successful and the results were what we expected.
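One way graceful degradation can look in code is sketched below. It is not the demo application's implementation; `render_product_page`, `fetch_ads` and `blackholed_ads` are hypothetical names, and the blackhole is simulated by raising Python's built-in `TimeoutError`.

```python
def render_product_page(product, fetch_ads):
    """Build page data, treating the ad call as optional so its failure degrades gracefully."""
    page = {"product": product, "ads": []}
    try:
        page["ads"] = fetch_ads(product)
    except (ConnectionError, TimeoutError):
        # Non-critical dependency unreachable: keep the page, drop the ads.
        pass
    return page

def blackholed_ads(product):
    """Stand-in for an ad service whose traffic is being blackholed."""
    raise TimeoutError("ad service unreachable")

def healthy_ads(product):
    """Stand-in for a working ad service."""
    return ["ad for " + product]

print(render_product_page("mug", healthy_ads))     # {'product': 'mug', 'ads': ['ad for mug']}
print(render_product_page("mug", blackholed_ads))  # {'product': 'mug', 'ads': []}
```

The customer still gets the product page in both cases; only the optional content disappears during the attack.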

Slide 29

Slide 29 text

Blackhole → Recommendations

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

Does blackholing a non-critical path service like the Recommendations Service result in graceful degradation of the customer experience?

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

Our requests for product pages are cancelled because the first product page request is stalled and unable to complete successfully. This continues for the duration of the Recommendations Service outage, due to a previously unknown dependency on product assets that are unavailable.
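Once a hidden dependency like this has been found, it can be recorded in a dependency map and checked automatically. The sketch below is a hedged illustration: the service names and edges are hypothetical and do not describe the demo application's real call graph.

```python
def critical_services_at_risk(dependencies, critical, blackholed):
    """Return critical services that reach `blackholed` through hard dependencies."""
    def reaches(service):
        stack, seen = [service], set()
        while stack:
            current = stack.pop()
            if current == blackholed:
                return True
            if current in seen:
                continue
            seen.add(current)
            stack.extend(dependencies.get(current, []))
        return False

    return sorted(service for service in critical if reaches(service))

# Hypothetical hard dependencies discovered during the experiment.
deps = {
    "frontend": ["product-catalogue", "cart"],
    "product-catalogue": ["product-assets"],
    "product-assets": ["recommendations"],
    "cart": [],
}
at_risk = critical_services_at_risk(
    deps, ["frontend", "product-catalogue", "cart"], "recommendations"
)
print(at_risk)  # ['frontend', 'product-catalogue']
```

Any critical service in the output is a candidate for a follow-up experiment or a fix that removes the hard dependency.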

Slide 35

Slide 35 text

Major Incident: No, our experiment was not successful and the results were not what we expected. We’ll need to fix these dependency issues to ensure this doesn’t happen again.

Slide 36

Slide 36 text

📦 Packet Loss → Cart

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

Do we experience data loss if there is packet loss impacting the Cart Service?

Slide 40

Slide 40 text

Due to the packet loss attack on the shopping cart, a user trying to add items to the cart is given a 500 Internal Server Error: “Failed To Add To Cart”.
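One possible mitigation, which the talk does not prescribe, is to retry the add and make it idempotent so a retried request cannot duplicate or lose the item. Everything in this sketch is hypothetical: the function names, the retry count, and the in-memory sender that simulates packet loss.

```python
def add_to_cart_with_retry(send, item, attempts=3):
    """Retry a failed add so a transient network error does not silently drop the item."""
    last_error = None
    for _ in range(attempts):
        try:
            return send(item)
        except ConnectionError as error:
            last_error = error
    raise last_error

def make_lossy_sender(failures):
    """Return a hypothetical sender that fails `failures` times, then succeeds."""
    state = {"remaining": failures, "cart": []}

    def send(item):
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise ConnectionError("simulated packet loss")
        if item not in state["cart"]:  # idempotent add: a retry never duplicates
            state["cart"].append(item)
        return list(state["cart"])

    return send

sender = make_lossy_sender(failures=2)
print(add_to_cart_with_retry(sender, "mug"))  # ['mug']
```

If every attempt fails, the last error is raised so the caller can still show the user a clear failure instead of pretending the item was added.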

Slide 41

Slide 41 text

Learning from our Chaos Engineering Experiments

Slide 42

Slide 42 text

Was it expected? Was it detected? Was it mitigated? Can we fix the issues? Can we automate this? How can we best share our findings and results?
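These review questions can be captured in a small record so that results are easy to share and compare between runs. The class and field names below are assumptions made for illustration, not a Gremlin API.

```python
from dataclasses import dataclass

@dataclass
class ExperimentReview:
    """Answers to the review questions for one chaos experiment (hypothetical shape)."""
    name: str
    expected: bool
    detected: bool
    mitigated: bool

    def follow_ups(self):
        """List the questions that were answered 'no' and so need follow-up work."""
        items = []
        if not self.expected:
            items.append("result was not expected: revisit the hypothesis")
        if not self.detected:
            items.append("failure was not detected: close the monitoring gap")
        if not self.mitigated:
            items.append("impact was not mitigated: fix and re-run")
        return items

review = ExperimentReview(
    "blackhole-recommendations", expected=False, detected=True, mitigated=False
)
for item in review.follow_ups():
    print(item)
```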

Slide 43

Slide 43 text

Join the Chaos Engineering community at gremlin.com/slack: 8000+ engineers coming together to share knowledge.

Slide 44

Slide 44 text

Know someone awesome @ Chaos Engineering? Nominate them for a Chaos Champion award at gremlin.com/champions

Slide 45

Slide 45 text

Thanks, new and old friends. You can find me on Twitter: @tambryantbutow and on LinkedIn. Bonus sticker gift pack from Gremlin: gremlin.com/talk/ete