Slide 1

Slide 1 text

Anatomy of a Production Kubernetes Outage
Oliver Beattie, Head of Engineering, Monzo Bank

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

5:10, Wednesday, 2 May
MONZO (now)
£10 at Tiger
You've spent £35.50 today

Slide 7

Slide 7 text

> 500 microservices
Built on open source software

Slide 8

Slide 8 text

Story of an outage

Slide 9

Slide 9 text

CAST OF CHARACTERS
Kubernetes
etcd
Linkerd
Humans

Slide 10

Slide 10 text

⏰ 2 WEEKS BEFORE THE OUTAGE
etcd upgrade

Slide 11

Slide 11 text

⏰ 1 DAY BEFORE THE OUTAGE
Deployment of faulty service
Scaled to zero replicas

Slide 12

Slide 12 text

⏰ START OF PARTIAL OUTAGE
Ledger change deployed

Slide 13

Slide 13 text

⏰ 2 MINS INTO THE OUTAGE
Ledger change rolled back

Slide 14

Slide 14 text

⏰ 6 MINS INTO THE OUTAGE
Linkerd identified as unhealthy

Slide 15

Slide 15 text

⏰ 16 MINS INTO THE OUTAGE
Begin restarting Linkerd pods

Slide 16

Slide 16 text

⏰ 27 MINS INTO THE OUTAGE
New Linkerd pods cannot start
Kubernetes apiserver restarted

Slide 17

Slide 17 text

⏰ 1 HR 3 MINS INTO THE OUTAGE
ESCALATED TO TOTAL OUTAGE
Finish restarting Linkerd pods

Slide 18

Slide 18 text

⏰ 1 HR 17 MINS INTO THE OUTAGE
Linkerd NullPointerException observed on start up

Slide 19

Slide 19 text

⏰ END OF OUTAGE: 1 HR 21 MINS
Linkerd/k8s incompatibility found
Empty services deleted

Slide 20

Slide 20 text

IMPACT
1 hour, 21 mins of cluster downtime
Vast majority of payments succeeded throughout

Slide 21

Slide 21 text

ROOT CAUSES
Bug in gRPC client library affecting etcd
Incompatibility between Kubernetes + Linkerd

Slide 22

Slide 22 text

K8S < 1.6: "endpoints": []

Slide 23

Slide 23 text

K8S < 1.6: "endpoints": []
VS.
K8S 1.6+: "endpoints": null

Slide 24

Slide 24 text

ROOT CAUSES
Bug in gRPC client library affecting etcd
Incompatibility between Kubernetes + Linkerd
Human error
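
The Kubernetes + Linkerd incompatibility comes down to the `[]` vs. `null` difference shown on the earlier slides: a client that assumes "endpoints" is always a list breaks when it receives `null`. The sketch below illustrates that failure mode and one tolerant way to parse the field. It is written in Python purely for illustration and is not Linkerd's code; the function names and the "ip" field inside each endpoint are hypothetical.

```python
import json


def addresses_naive(payload: str) -> list:
    """Assumes "endpoints" is always a list; fails when the value is null."""
    doc = json.loads(payload)
    # Iterating over None raises TypeError, the Python analogue of a null dereference.
    return [e["ip"] for e in doc["endpoints"]]


def addresses_tolerant(payload: str) -> list:
    """Treats a missing or null "endpoints" field the same as an empty list."""
    doc = json.loads(payload)
    endpoints = doc.get("endpoints") or []
    return [e["ip"] for e in endpoints]  # "ip" is a hypothetical field name
```

With the tolerant version, `'{"endpoints": []}'` and `'{"endpoints": null}'` both yield an empty list, so a service scaled to zero replicas looks the same to the caller under either serialisation.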

Slide 25

Slide 25 text

LESSONS
Defence in depth

Slide 26

Slide 26 text

Mastercard Banknet
Mastercard processor
AWS
PHYSICAL DATA CENTRES
Mastercard proxy
Ledger

Slide 27

Slide 27 text

Mastercard Banknet
Mastercard processor
AWS
PHYSICAL DATA CENTRES
Mastercard proxy
Ledger

Slide 28

Slide 28 text

LESSONS
Chaos engineering

Slide 29

Slide 29 text

“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
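
As a rough illustration of the definition above, a chaos experiment can be sketched as: inject failures into a dependency, then check that the system still responds instead of crashing. Everything in this Python sketch is hypothetical (the `lookup_balance` dependency, the `handle_request` handler, the failure rates); it is a toy model of the idea, not a description of how any real experiment was run.

```python
import random


class DependencyDown(Exception):
    """Raised by the fault injector to simulate an unavailable dependency."""


def flaky(call, failure_rate, rng):
    """Wrap `call` so that a fraction of invocations fail (the injected turbulence)."""
    def wrapped(*args):
        if rng.random() < failure_rate:
            raise DependencyDown()
        return call(*args)
    return wrapped


def lookup_balance(account_id):
    # Hypothetical downstream call standing in for a real service.
    return 100


def handle_request(account_id, dependency):
    """Hypothetical handler that degrades gracefully instead of propagating the failure."""
    try:
        return {"status": "ok", "balance": dependency(account_id)}
    except DependencyDown:
        return {"status": "degraded", "balance": None}


def run_experiment(failure_rate, requests=1000, seed=0):
    """Return the fraction of requests fully served under the injected failure rate."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    dependency = flaky(lookup_balance, failure_rate, rng)
    responses = [handle_request(i, dependency) for i in range(requests)]
    return sum(r["status"] == "ok" for r in responses) / requests
```

Running `run_experiment` at increasing failure rates shows how the share of fully served requests falls while every request still receives a response, which is the kind of confidence the definition describes.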

Slide 30

Slide 30 text

LESSONS
More monitoring, more visible

Slide 31

Slide 31 text

LESSONS
Be transparent; embrace the community

Slide 32

Slide 32 text

No content

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

monzo.com/careers

Slide 35

Slide 35 text

@obeattie