Slide 1
Slide 1 text
Anatomy of a Production Kubernetes Outage. Oliver Beattie, Head of Engineering, Monzo Bank
Slide 2
Slide 2 text
No content
Slide 3
Slide 3 text
No content
Slide 4
Slide 4 text
No content
Slide 5
Slide 5 text
No content
Slide 6
Slide 6 text
Wednesday, 2 May, 5:10. MONZO (now): £10 at Tiger. You've spent £35.50 today.
Slide 7
Slide 7 text
> 500 microservices. Built on open source software.
Slide 8
Slide 8 text
Story of an outage
Slide 9
Slide 9 text
CAST OF CHARACTERS: Kubernetes, etcd, Linkerd, Humans
Slide 10
Slide 10 text
⏰ 2 WEEKS BEFORE THE OUTAGE: etcd upgrade
Slide 11
Slide 11 text
⏰ 1 DAY BEFORE THE OUTAGE: Deployment of faulty service; scaled to zero replicas
Slide 12
Slide 12 text
⏰ START OF PARTIAL OUTAGE: Ledger change deployed
Slide 13
Slide 13 text
⏰ 2 MINS INTO THE OUTAGE: Ledger change rolled back
Slide 14
Slide 14 text
⏰ 6 MINS INTO THE OUTAGE: Linkerd identified as unhealthy
Slide 15
Slide 15 text
⏰ 16 MINS INTO THE OUTAGE: Begin restarting Linkerd pods
Slide 16
Slide 16 text
⏰ 27 MINS INTO THE OUTAGE: New Linkerd pods cannot start; Kubernetes apiserver restarted
Slide 17
Slide 17 text
⏰ 1 HR 3 MINS INTO THE OUTAGE (ESCALATED TO TOTAL OUTAGE): Finish restarting Linkerd pods
Slide 18
Slide 18 text
⏰ 1 HR 17 MINS INTO THE OUTAGE: Linkerd NullPointerException observed on start up
Slide 19
Slide 19 text
⏰ 1 HR 21 MINS (END OF OUTAGE): Linkerd/k8s incompatibility found; empty services deleted
Slide 20
Slide 20 text
IMPACT: 1 hour, 21 mins of cluster downtime. Vast majority of payments succeeded throughout.
Slide 21
Slide 21 text
ROOT CAUSES: Bug in gRPC client library affecting etcd; incompatibility between Kubernetes + Linkerd
Slide 22
Slide 22 text
K8S < 1.6: "endpoints": []
Slide 23
Slide 23 text
K8S < 1.6: "endpoints": [] VS. K8S 1.6+: "endpoints": null
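
The difference between the two shapes can be handled defensively on the consuming side. The following is a minimal sketch, not Linkerd's actual code: the `endpoint_list` helper and the two sample payloads are hypothetical, and the `"endpoints"` key is taken from the slide as shown. It treats a missing key, an explicit `null`, and an empty list as the same "no endpoints" case, so the caller never dereferences a null value.

```python
import json


def endpoint_list(payload: str) -> list:
    """Return the endpoints in a JSON payload as a list, never None.

    Hypothetical helper for illustration only.
    """
    doc = json.loads(payload)
    # A missing key, an explicit null, and [] all mean "no endpoints".
    return doc.get("endpoints") or []


old_style = '{"endpoints": []}'    # shape shown on the slide for K8S < 1.6
new_style = '{"endpoints": null}'  # shape shown on the slide for K8S 1.6+

for payload in (old_style, new_style):
    # Both shapes normalise to an empty list, so len() is always safe.
    print(len(endpoint_list(payload)))
```

Normalising at the parsing boundary keeps the rest of the client free of null checks, which is one way to avoid the kind of start-up NullPointerException described in the timeline.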
Slide 24
Slide 24 text
ROOT CAUSES: Bug in gRPC client library affecting etcd; incompatibility between Kubernetes + Linkerd; human error
Slide 25
Slide 25 text
LESSONS: Defence in depth
Slide 26
Slide 26 text
Diagram labels: Mastercard Banknet; Mastercard processor; AWS; PHYSICAL DATA CENTRES; Mastercard proxy; Ledger
Slide 27
Slide 27 text
Diagram labels: Mastercard Banknet; Mastercard processor; AWS; PHYSICAL DATA CENTRES; Mastercard proxy; Ledger
Slide 28
Slide 28 text
LESSONS: Chaos engineering
Slide 29
Slide 29 text
“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
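
As a toy illustration of the definition above, a chaos experiment states a steady-state hypothesis, injects a fault, and checks whether the hypothesis still holds. The sketch below is a self-contained simulation under stated assumptions: `Replica`, `call_with_failover`, and `run_experiment` are hypothetical names, and nothing here reflects Monzo's actual tooling.

```python
import random


class Replica:
    """A simulated service instance that can be switched off."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def call_with_failover(replicas, request):
    """Try each replica in turn; raise only if every one fails."""
    last_error = None
    for replica in replicas:
        try:
            return replica.handle(request)
        except ConnectionError as err:
            last_error = err
    raise RuntimeError("all replicas failed") from last_error


def run_experiment(replica_count=3, requests=20, seed=0):
    """Return True if the steady-state hypothesis survives the fault."""
    rng = random.Random(seed)
    replicas = [Replica(f"replica-{i}") for i in range(replica_count)]
    # Inject the fault: stop one replica chosen at random.
    rng.choice(replicas).alive = False
    # Steady-state hypothesis: every request still succeeds.
    successes = 0
    for i in range(requests):
        try:
            call_with_failover(replicas, f"req-{i}")
            successes += 1
        except RuntimeError:
            pass
    return successes == requests


print(run_experiment())                  # three replicas, one stopped
print(run_experiment(replica_count=1))   # the only replica is stopped
```

With three replicas the failover path absorbs the fault; with a single replica the same fault breaks the hypothesis, which is the kind of weakness such an experiment is meant to surface before production does.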
Slide 30
Slide 30 text
LESSONS: More monitoring, more visible
Slide 31
Slide 31 text
LESSONS: Be transparent; embrace the community
Slide 32
Slide 32 text
No content
Slide 33
Slide 33 text
No content
Slide 34
Slide 34 text
monzo.com/careers
Slide 35
Slide 35 text
@obeattie