Slide 1

Slide 1 text

Keeping a modern bank online ⚡
Chris Evans, Platform / On-call Lead, @evnsio

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

So how do you keep a modern bank online?

Slide 8

Slide 8 text

Monitoring and Alerting
On-call
Incident Response

Slide 9

Slide 9 text

Monitoring and Alerting: Making sure we know what’s going on

Slide 10

Slide 10 text

Monzo Platform

Slide 11

Slide 11 text

Monitor liberally; alert judiciously
Monetary Costs
Human Costs

Slide 12

Slide 12 text

Monitoring liberally
Diagram labels: Physical DCs, Cloud Infrastructure, Kubernetes, Monzo Services, Social Media, Customer Queries, Alerts, Prometheus Monitoring

Slide 13

Slide 13 text

~8000 unique scrape targets

Slide 14

Slide 14 text

42 million active time series

Slide 15

Slide 15 text

1.7 million samples per second
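
The three figures on slides 13 to 15 can be cross-checked with some rough arithmetic. This is a sketch only: it assumes every active series is sampled at a steady, uniform rate and that series are spread evenly across targets, neither of which the slides state. The derived numbers are illustrative averages, not reported values.

```python
# Figures quoted on the slides.
SCRAPE_TARGETS = 8_000          # "~8000 unique scrape targets"
ACTIVE_SERIES = 42_000_000      # "42 million active time series"
SAMPLES_PER_SECOND = 1_700_000  # "1.7 million samples per second"


def implied_scrape_interval(active_series: int, samples_per_second: int) -> float:
    """Average seconds between two samples of one series (uniform-rate assumption)."""
    return active_series / samples_per_second


def series_per_target(active_series: int, targets: int) -> float:
    """Average number of series exposed by one scrape target (even-spread assumption)."""
    return active_series / targets


interval = implied_scrape_interval(ACTIVE_SERIES, SAMPLES_PER_SECOND)
per_target = series_per_target(ACTIVE_SERIES, SCRAPE_TARGETS)

print(f"implied average scrape interval: {interval:.1f} s")   # 24.7 s
print(f"average series per target: {per_target:.0f}")        # 5250
```

Under those assumptions the numbers are consistent with each series being sampled roughly every 25 seconds.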

Slide 16

Slide 16 text

Shared Dashboards

Slide 17

Slide 17 text

Integrated Metrics

Slide 18

Slide 18 text

Aim for alerts that are sensitive and specific

Slide 19

Slide 19 text

Alerts
Car Alarm: ✅ Sensitive, ❌ Specific. These go off all the time. When they go off, people typically ignore them.
Building Alarm: ✅ Sensitive, ✅ Specific. When there’s an issue, they go off quickly. When they go off, people pay attention and leave the building.
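
One way to make "sensitive" and "specific" measurable is to borrow the standard definitions from diagnostic testing: sensitivity is the share of real issues that triggered the alert, and specificity is the share of healthy periods in which the alert stayed quiet. The sketch below applies those definitions to two made-up alarms. The `AlertOutcomes` class and all counts are hypothetical and chosen only to mirror the car alarm and building alarm comparison; the talk does not give numbers.

```python
from dataclasses import dataclass


@dataclass
class AlertOutcomes:
    """Counts of evaluation periods for one alert (hypothetical data)."""
    true_positives: int   # alert fired and there was a real issue
    false_positives: int  # alert fired but nothing was wrong
    false_negatives: int  # real issue, alert stayed silent
    true_negatives: int   # nothing wrong, alert stayed silent

    def sensitivity(self) -> float:
        """Fraction of real issues that the alert caught."""
        return self.true_positives / (self.true_positives + self.false_negatives)

    def specificity(self) -> float:
        """Fraction of healthy periods in which the alert stayed silent."""
        return self.true_negatives / (self.true_negatives + self.false_positives)


# Both alarms catch 9 of 10 real issues; the car alarm also fires 400 times for nothing.
car_alarm = AlertOutcomes(true_positives=9, false_positives=400,
                          false_negatives=1, true_negatives=590)
building_alarm = AlertOutcomes(true_positives=9, false_positives=5,
                               false_negatives=1, true_negatives=985)

for name, alarm in [("car alarm", car_alarm), ("building alarm", building_alarm)]:
    print(f"{name}: sensitivity={alarm.sensitivity():.2f}, "
          f"specificity={alarm.specificity():.2f}")
```

With these invented counts both alarms are equally sensitive, but only the building alarm is specific enough that a page is worth acting on.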

Slide 20

Slide 20 text

(Some of!) Our Alert Issues
Decay
Ownership
Classification
Routing
Globals

Slide 21

Slide 21 text

Alerts in Version Control
Fetch alerts + reload
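
The slide does not say how the fetch and reload step is implemented, so the following is only a minimal sketch of the general pattern: copy rule files from a version-controlled checkout into the directory Prometheus reads, and ask Prometheus to reload only when something changed. The function names, directory layout, and default URL are assumptions. The `/-/reload` endpoint is Prometheus' own and requires the server to be started with `--web.enable-lifecycle`.

```python
import filecmp
import shutil
import tempfile
import urllib.request
from pathlib import Path


def sync_rules(checkout: Path, rules_dir: Path) -> list[str]:
    """Copy new or changed *.yml files from checkout into rules_dir.

    Returns the names of the files that were copied. Files deleted from the
    checkout are not removed from rules_dir in this sketch.
    """
    rules_dir.mkdir(parents=True, exist_ok=True)
    changed = []
    for src in sorted(checkout.glob("*.yml")):
        dst = rules_dir / src.name
        if not dst.exists() or not filecmp.cmp(src, dst, shallow=False):
            shutil.copyfile(src, dst)
            changed.append(src.name)
    return changed


def reload_prometheus(base_url: str) -> int:
    """POST to /-/reload (needs --web.enable-lifecycle); returns the HTTP status."""
    req = urllib.request.Request(f"{base_url}/-/reload", method="POST")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


def fetch_and_reload(checkout, rules_dir, reload=reload_prometheus,
                     base_url="http://localhost:9090"):
    """Sync rule files and trigger a reload only if at least one file changed."""
    changed = sync_rules(checkout, rules_dir)
    if changed:
        reload(base_url)
    return changed


# Demo with a stubbed reload, so no Prometheus server is needed.
with tempfile.TemporaryDirectory() as tmp:
    checkout, rules = Path(tmp, "checkout"), Path(tmp, "rules")
    checkout.mkdir()
    (checkout / "api.yml").write_text("groups: []\n")
    print(fetch_and_reload(checkout, rules, reload=lambda url: None))  # copies api.yml
    print(fetch_and_reload(checkout, rules, reload=lambda url: None))  # nothing new
```

Passing the reload function as a parameter keeps the file-sync logic testable without a running server.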

Slide 22

Slide 22 text

Goal: Alert Zero

Slide 23

Slide 23 text

On-call: Building the team to support our systems

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

Early Days

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

Expanding the Pool

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

Introducing Specialists

Slide 31

Slide 31 text

Shadow rotations to encourage learning
Runbooks to document the undocumented
People on-call when it makes sense for you

Slide 32

Slide 32 text

Incident Response: Restoring service as fast as possible

Slide 33

Slide 33 text

response.pagerduty.com

Slide 34

Slide 34 text

Incident Response Workflow

Slide 35

Slide 35 text

Monzo Incident ⚡

Slide 36

Slide 36 text

Bring as much as reasonably practical into the conversation

Slide 37

Slide 37 text

Make it so easy to do the right thing that nobody would have reason to do otherwise

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

No content

Slide 41

Slide 41 text

The Headline Post

Slide 42

Slide 42 text

The Incident Doc

Slide 43

Slide 43 text

The Comms Channel

Slide 44

Slide 44 text

The Comms Channel

Slide 45

Slide 45 text

@incident ...
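
The slide shows only the `@incident ...` mention, not the commands behind it. As an illustration of how a chat bot of this kind might route mentions to actions, here is a minimal dispatcher sketch. Everything in it is hypothetical: the command names (`lead`, `severity`), the `incident` dictionary, and the reply strings are invented for the example and are not taken from the talk or from the tool's real interface.

```python
import re

# Registry mapping a command name to its handler function (hypothetical commands).
HANDLERS = {}


def command(name):
    """Decorator that registers a handler under the given command name."""
    def register(fn):
        HANDLERS[name] = fn
        return fn
    return register


@command("lead")
def set_lead(incident, arg):
    incident["lead"] = arg
    return f"Lead set to {arg}"


@command("severity")
def set_severity(incident, arg):
    incident["severity"] = arg
    return f"Severity set to {arg}"


# "@incident <command> <free-text argument>"
MENTION = re.compile(r"^@incident\s+(\w+)\s*(.*)$")


def handle_message(incident, text):
    """Return the bot's reply, or None if the message is not addressed to the bot."""
    match = MENTION.match(text.strip())
    if not match:
        return None
    name, arg = match.group(1), match.group(2).strip()
    handler = HANDLERS.get(name)
    if handler is None:
        return f"Unknown command: {name}"
    return handler(incident, arg)


incident = {}
print(handle_message(incident, "@incident lead chris"))
print(handle_message(incident, "@incident severity major"))
print(incident)
```

Keeping the commands in a registry means a new action is one decorated function, which fits the stated aim of making the right thing easy to do.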

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

No content

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

Integrating with PagerDuty
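
The slide does not describe how the integration works. One common way to page someone programmatically is PagerDuty's Events API v2, sketched below; whether the talk's tooling uses this API is an assumption. The helper names, the routing key placeholder, and the example summary are hypothetical, while the endpoint URL and the `routing_key`, `event_action`, and `payload` fields follow the Events API v2 format.

```python
import json
import urllib.request

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source, severity="critical"):
    """Build an Events API v2 'trigger' event body as a plain dictionary."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(ALLOWED_SEVERITIES)}")
    return {
        "routing_key": routing_key,   # integration key of the target service
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }


def send_event(event):
    """POST the event to PagerDuty and return the HTTP status code."""
    req = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


# Build (but do not send) an example event; the key is a placeholder.
event = build_trigger_event(
    routing_key="YOUR_INTEGRATION_KEY",
    summary="Incident declared from chat",
    source="incident-bot",
)
print(json.dumps(event, indent=2))
```

Calling `send_event(event)` with a real integration key would open a PagerDuty incident and page whoever is on call for that service.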

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

...and that’s how we keep Monzo online
github.com/monzo/response