Slide 1

Slide 1 text

Tammy Bryant Butow @tambryantbutow Site Reliability Engineering For Kubernetes

Slide 2

Slide 2 text

WHAT CHANGES WHEN YOU ARE RESPONSIBLE FOR KUBERNETES CLUSTERS?
WHAT METRICS MATTER FOR K8s?
WHAT KIND OF INCIDENTS ARE COMMON WITH K8s?
WHAT SHOULD I AVOID RUNNING ON K8s?

Slide 3

Slide 3 text

codeberg.org/hjacobs/kubernetes-failure-stories

Slide 4

Slide 4 text

COMMON FAILURE MODES FOR KUBERNETES IN PRODUCTION

Slide 5

Slide 5 text

Slide 6

Slide 6 text

Slide 7

Slide 7 text

COMMON FAILURE MODES FOR KUBERNETES ON AWS

Slide 8

Slide 8 text

COMMON CPU FAILURE MODES FOR KUBERNETES ON AWS

Slide 9

Slide 9 text

COMMON FAILURE MODES FOR KUBERNETES ON GCP

Slide 10

Slide 10 text

COMMON FAILURE MODES FOR KUBERNETES ON AZURE

Slide 11

Slide 11 text

Slide 12

Slide 12 text

CREATING A RELIABILITY PLAN

Slide 13

Slide 13 text

tammybutow.medium.com/site-reliability-engineering-for-kubernetes-b52877c70fb7 TS;WR

Slide 14

Slide 14 text

gremlin.com/bootcamp

Slide 15

Slide 15 text

KUBERNETES DASHBOARD - RESPONSE TIMES

Slide 16

Slide 16 text

gremlin.com/community/tutorials/how-to-create-a-kubernetes-cluster-on-ubuntu-16-04-with-kubeadm-and-weave-net/

Slide 17

Slide 17 text

GOT STICKERS? GREMLIN.COM/TALK/SRE

Slide 18

Slide 18 text

THANK YOU! @tambryantbutow