Slide 1

Slide 1 text

An exploration of black holes: strange failure modes
Tammy Bryant Butow, SRE @ Gremlin
@tambryantbutow

Slide 2

Slide 2 text


Slide 3

Slide 3 text

https://www.youtube.com/watch?v=7TIkE-9pZOk

Slide 4

Slide 4 text

Today we will explore strange failure modes that are unexpected and surprising

Slide 5

Slide 5 text

What is a black hole?

Slide 6

Slide 6 text

What is a black hole?
A region of a distributed system where gravity is so strong that nothing (no requests, no transactions) can escape from it. All IP packets in this region are unable to escape.

Slide 7

Slide 7 text

How can we create a black hole?
Capture IP packets at the transport layer, targeted by supplied port and host arguments. Use existing traffic policing features in the Linux kernel to drop the targeted IP packets.
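To make the idea concrete, here is a minimal sketch of dropping packets by host and port with the kernel's traffic control tool, tc. It is not Gremlin's implementation. The helper names are hypothetical, the interface, host, and port are placeholders, and the functions only print the commands so nothing changes on the machine.

```shell
# Print the tc commands that would drop outgoing IP packets to one host:port.
# Hypothetical helper: it only echoes, so it is safe to run anywhere.
# Arguments: <interface> <destination IPv4 address> <destination port>
blackhole_commands() {
  # A classful root qdisc gives the filter something to attach to.
  echo "tc qdisc add dev $1 root handle 1: prio"
  # u32 matches destination address and port; "action drop" discards matches.
  # "match ip dport" assumes an IP header without options.
  echo "tc filter add dev $1 parent 1: protocol ip prio 1 u32 match ip dst $2/32 match ip dport $3 0xffff action drop"
}

# Print the command that deletes the root qdisc and its filters,
# which ends the black hole.
blackhole_cleanup_command() {
  echo "tc qdisc del dev $1 root"
}
```

Review the output of blackhole_commands eth0 10.0.0.5 8080 before running it as root on a test machine, and keep the cleanup command at hand.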

Slide 8

Slide 8 text

https://www.youtube.com/watch?v=P5_Msrdg3Hk

Slide 9

Slide 9 text

What happens when your banking transactions are involved?

Slide 10

Slide 10 text

The architecture of our banking application
https://github.com/GoogleCloudPlatform/bank-of-anthos

Slide 11

Slide 11 text

Does blackholing a critical path service like the Balance Reader result in graceful degradation of the customer experience?
https://4503-f37e5de5-39bf-4406-acbc-9c7f2abb0d16.cs-us-east1-wzxb.cloudshell.dev/home

Slide 12

Slide 12 text

Does blackholing a critical path service like the Balance Reader result in graceful degradation of the customer experience?
https://app.gremlin.com/attacks/new/kubernetes

Slide 13

Slide 13 text

The balance appears as $---. This could make the user think they have no money in their account.

Slide 14

Slide 14 text

The user is still able to make a deposit of $1000 while the Balance Reader service is in a black hole.

Slide 15

Slide 15 text

The user is unable to send payments. They see an error saying that the payment failed because of the Balance Reader.

Slide 16

Slide 16 text

The user is unable to send payments. They see an error saying that the payment failed because of the Balance Reader.

Slide 17

Slide 17 text


Slide 18

Slide 18 text

Free demo environment to learn about black holes
1. Use this link to install with minikube on Google Cloud Shell: https://ssh.cloud.google.com/cloudshell/editor?show=ide&cloudshell_git_repo=https://github.com/GoogleCloudPlatform/bank-of-anthos&cloudshell_workspace=.&cloudshell_tutorial=extras/cloudshell/tutorial.md
2. Click minikube → start
3. In the Cloud Shell terminal, run kubectl apply -f extras/jwt/jwt-secret.yaml
4. Click <> Cloud Code → Run on Kubernetes
5. To create black holes, create a namespace for Gremlin and install Gremlin as a Helm chart: https://github.com/gremlin/helm
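Steps 3 and 5 can be sketched as one shell function. The namespace name gremlin, the chart values, and the GREMLIN_* environment variables are assumptions based on the chart's published instructions, so confirm them in the README at https://github.com/gremlin/helm before relying on them. install_demo_tooling is a hypothetical name.

```shell
# Steps 3 and 5 as one function. Defining it runs nothing; call
# install_demo_tooling from the bank-of-anthos checkout when ready.
install_demo_tooling() {
  # Step 3: the JWT secret the Bank of Anthos services expect.
  kubectl apply -f extras/jwt/jwt-secret.yaml || return 1

  # Step 5: a dedicated namespace for Gremlin, then the Helm chart.
  kubectl create namespace gremlin || return 1
  helm repo add gremlin https://helm.gremlin.com || return 1
  # Assumed chart values and variable names; check the chart README.
  helm install gremlin gremlin/gremlin \
    --namespace gremlin \
    --set gremlin.secret.managed=true \
    --set gremlin.secret.type=secret \
    --set gremlin.secret.teamID="$GREMLIN_TEAM_ID" \
    --set gremlin.secret.clusterID="$GREMLIN_CLUSTER_ID" \
    --set gremlin.secret.teamSecret="$GREMLIN_TEAM_SECRET"
}
```

Export GREMLIN_TEAM_ID, GREMLIN_CLUSTER_ID, and GREMLIN_TEAM_SECRET with the values from your own Gremlin account before calling the function.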

Slide 19

Slide 19 text

Now let’s see how a black hole impacts our Transaction History service
https://github.com/GoogleCloudPlatform/bank-of-anthos

Slide 20

Slide 20 text

Does blackholing Transaction History result in graceful degradation of the customer experience?
https://4503-f37e5de5-39bf-4406-acbc-9c7f2abb0d16.cs-us-east1-wzxb.cloudshell.dev/home

Slide 21

Slide 21 text

Does blackholing Transaction History result in graceful degradation of the customer experience?
https://4503-f37e5de5-39bf-4406-acbc-9c7f2abb0d16.cs-us-east1-wzxb.cloudshell.dev/home

Slide 22

Slide 22 text

What can we do to mitigate a black hole? Depending on the service, scaling replicas may work well.
kubectl scale deployment transactionhistory --replicas=2

Slide 23

Slide 23 text

kubectl get pods
Now we have 2 Transaction History pods
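The scaling step and the pod check can be combined. This is a minimal sketch: scale_and_wait is a hypothetical helper, the 120-second timeout is arbitrary, and the label selector assumes the deployment's pods carry the label app=<deployment name>.

```shell
# Scale a deployment, wait for the new replicas, then list its pods.
# Arguments: <deployment name> <replica count>
scale_and_wait() {
  kubectl scale deployment "$1" --replicas="$2" || return 1
  # Blocks until the rollout completes or the timeout passes.
  kubectl rollout status "deployment/$1" --timeout=120s || return 1
  # Assumes the pods are labelled app=<deployment name>.
  kubectl get pods -l "app=$1"
}
```

Running scale_and_wait transactionhistory 2 should end with two pods listed, matching the slide.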

Slide 24

Slide 24 text

Let’s send 50% of Transaction History pods into a black hole
https://app.gremlin.com/attacks/new/kubernetes

Slide 25

Slide 25 text

There will be a very short outage and then the other pod will take over
Deployment Set: Transaction History, replicas=2
Pod 1: Transaction History
Pod 2: Transaction History
https://github.com/GoogleCloudPlatform/bank-of-anthos
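One way to see how short the outage is: poll an endpoint while the attack runs and count the attempts that did not return HTTP 200. This is a sketch under assumptions. probe and count_failures are hypothetical helpers, and the URL to poll is yours to choose; pick one whose status reflects the Transaction History service.

```shell
# Poll a URL and print one line per attempt: "<attempt> <http_code>".
# curl prints 000 when no HTTP response arrives before the timeout,
# which is how a black-holed backend looks from the client side.
# Arguments: <url> <attempts> [seconds between attempts, default 1]
probe() {
  i=1
  while [ "$i" -le "$2" ]; do
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$1")"
    echo "$i $code"
    i=$((i + 1))
    sleep "${3:-1}"
  done
}

# Read probe output on stdin and print how many attempts were not HTTP 200.
count_failures() {
  awk '$2 != "200" { n++ } END { print n + 0 }'
}
```

For example, probe "$URL" 60 | count_failures gives a rough count of failed seconds during a one-minute attack.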

Slide 26

Slide 26 text

We are still able to see transaction history and no longer receive error messages.
https://4503-f37e5de5-39bf-4406-acbc-9c7f2abb0d16.cs-us-east1-wzxb.cloudshell.dev/home

Slide 27

Slide 27 text

What happens when your shopping is involved?

Slide 28

Slide 28 text

A distributed system: e-commerce example

Slide 29

Slide 29 text


Slide 30

Slide 30 text


Slide 31

Slide 31 text


Slide 32

Slide 32 text

Does blackholing a non-critical path service like the Ad Service result in graceful degradation of the customer experience?

Slide 33

Slide 33 text


Slide 34

Slide 34 text

Graceful Degradation
Yes, our experiment was successful and the results matched our expectations. The black hole did not negatively impact the customer experience or our overall SLOs.

Slide 35

Slide 35 text

What different sizes of black holes can we experience or create?

Slide 36

Slide 36 text

Micro, Stellar, Supermassive
We can experience and create black holes of all sizes. When creating black holes, start micro and gradually expand the blast radius.

Slide 37

Slide 37 text

How do black holes impact observability?

Slide 38

Slide 38 text

What can we learn from black hole-related failures?

Slide 39

Slide 39 text

How can we use black holes to learn how to make systems more reliable?

Slide 40

Slide 40 text


Slide 41

Slide 41 text

GoogleCloudPlatform/bank-of-anthos

Slide 42

Slide 42 text

Thank you
Get a copy of the O’Reilly ebook Reducing MTTD for High-Severity Incidents
gremlin.com/talk/black holes

Slide 43

Slide 43 text

Thank you! @tambryantbutow