Slide 1

Slide 1 text

Doing things the hard way @ChrisSinjo

Slide 2

Slide 2 text

Hi

Slide 3

Slide 3 text

@ChrisSinjo

Slide 4

Slide 4 text

@ChrisSinjo

Slide 5

Slide 5 text

An SRE

Slide 6

Slide 6 text

GOCARDLESS

Slide 7

Slide 7 text

“Obvious” mistakes and why we make them

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Conference talks favour certain structures

Slide 10

Slide 10 text

Conference talks favour self-contained narratives

Slide 11

Slide 11 text

“How we fixed the unfixable” –Fixing Things Ltd

Slide 12

Slide 12 text

“How we scaled our system 100x” –ScaleCorp

Slide 13

Slide 13 text

These are great stories to tell!

Slide 14

Slide 14 text

But there’s more…

Slide 15

Slide 15 text

Mistakes

Slide 16

Slide 16 text

The ones that were “obvious”

Slide 17

Slide 17 text

The mistakes you never thought you’d make

Slide 18

Slide 18 text

Except you did

Slide 19

Slide 19 text

And I hope I can convince you

Slide 20

Slide 20 text

This is normal

Slide 21

Slide 21 text

The reasons are often reasonable

Slide 22

Slide 22 text

Talking openly is important

Slide 23

Slide 23 text

Context & biases

Slide 24

Slide 24 text

Size: 25 → 215 total (8 → 60 eng)

Slide 25

Slide 25 text

GOCARDLESS

Slide 26

Slide 26 text

Hindsight

Slide 27

Slide 27 text

Structure: 3 examples

Slide 28

Slide 28 text

foreach(example):

Slide 29

Slide 29 text

foreach(example): Define it

Slide 30

Slide 30 text

foreach(example): - Define it - What it looks like

Slide 31

Slide 31 text

foreach(example): - Define it - What it looks like - Problems caused

Slide 32

Slide 32 text

foreach(example): - Define it - What it looks like - Problems caused - Fixes

Slide 33

Slide 33 text

Common themes Q&A

Slide 34

Slide 34 text

Common themes Q&A

Slide 35

Slide 35 text

So let’s get to it

Slide 36

Slide 36 text

Failure mode 1: Early Infra/Product Divide

Slide 37

Slide 37 text

You’re a young company

Slide 38

Slide 38 text

You’ve built a product

Slide 39

Slide 39 text

Your userbase is growing

Slide 40

Slide 40 text

No content

Slide 41

Slide 41 text

You’ve also built this other thing

Slide 42

Slide 42 text

Your product needs it to work

Slide 43

Slide 43 text

It caught you by surprise

Slide 44

Slide 44 text

You have an infra!

Slide 45

Slide 45 text

It takes up dev time

Slide 46

Slide 46 text

No content

Slide 47

Slide 47 text

You weren’t ready for this

Slide 48

Slide 48 text

“Can’t someone make this go away?”

Slide 49

Slide 49 text

“We need to hire a DevOps”

Slide 50

Slide 50 text

It sounds silly

Slide 51

Slide 51 text

But it literally happens

Slide 52

Slide 52 text

“We have all this rubbish that’s distracting our devs.” –The least appealing job description ever

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

The phrasing was clunky

Slide 55

Slide 55 text

But the framing is common

Slide 56

Slide 56 text

Convenience

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

An understandable lever to pull

Slide 59

Slide 59 text

But…

Slide 60

Slide 60 text

Problems

Slide 61

Slide 61 text

Now you have organisational problems

Slide 62

Slide 62 text

Disconnect devs from production

Slide 63

Slide 63 text

A new bottleneck

Slide 64

Slide 64 text

Too much infra, too soon

Slide 65

Slide 65 text

Solutions

Slide 66

Slide 66 text

Assuming you can’t un-split

Slide 67

Slide 67 text

Make infra contributions easy

Slide 68

Slide 68 text

Make it obvious what needs changing

Slide 69

Slide 69 text

Make experimentation easy

Slide 70

Slide 70 text

Set aside time to coach

Slide 71

Slide 71 text

Breaking my own rules

Slide 72

Slide 72 text

Some up-front advice

Slide 73

Slide 73 text

First infra hire: dev background

Slide 74

Slide 74 text

Embed them in the existing team

Slide 75

Slide 75 text

Don’t give them sole ownership of the pager

Slide 76

Slide 76 text

Failure mode 2: Distracted by hard problems

Slide 77

Slide 77 text

We hear it so often

Slide 78

Slide 78 text

“Join us and solve hard problems” –Every job ad

Slide 79

Slide 79 text

We assume hard problems are most important

Slide 80

Slide 80 text

They frequently aren’t

Slide 81

Slide 81 text

Outcome: we neglect the basics

Slide 82

Slide 82 text

When I say “basics”…

Slide 83

Slide 83 text

Observability

Slide 84

Slide 84 text

Metrics Monitoring (Structured) Events/Logs

Slide 85

Slide 85 text

Metrics Monitoring SLOs (Structured) Events/Logs

Slide 86

Slide 86 text

Metrics Monitoring Goals (Structured) Events/Logs

Slide 87

Slide 87 text

Metrics Monitoring Uptime (Structured) Events/Logs

Slide 88

Slide 88 text

Metrics Monitoring Error rate (Structured) Events/Logs

Slide 89

Slide 89 text

Metrics Monitoring Latency (Structured) Events/Logs
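
As a rough illustration of two of the basics named on these slides (error rate and latency, each checked against a goal), here is a minimal Python sketch. The `RequestEvent` type, the function names and the thresholds are hypothetical and are not taken from the talk or from any GoCardless tooling. It assumes structured request events are already being collected, and it leaves uptime out.

```python
from dataclasses import dataclass


@dataclass
class RequestEvent:
    """One structured event per request (hypothetical shape)."""
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def error_rate(events):
    """Fraction of requests that ended in a server error (5xx)."""
    if not events:
        return 0.0
    errors = sum(1 for e in events if e.status >= 500)
    return errors / len(events)


def latency_percentile(events, pct):
    """Nearest-rank percentile of latency; pct is an integer from 1 to 100."""
    if not events:
        raise ValueError("no events to measure")
    ordered = sorted(e.latency_ms for e in events)
    # Ceiling of pct * n / 100 using integer arithmetic only.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def meets_slo(events, max_error_rate, max_p99_ms):
    """True when both the error-rate goal and the p99 latency goal are met."""
    return (
        error_rate(events) <= max_error_rate
        and latency_percentile(events, 99) <= max_p99_ms
    )
```

Nothing here is sophisticated, which is rather the point of the slides: a goal is only useful once the number behind it is being measured.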

Slide 90

Slide 90 text

Easy to defer

Slide 91

Slide 91 text

It feels mundane - As a project - As ongoing work

Slide 92

Slide 92 text

It feels mundane - As a project - As ongoing work

Slide 93

Slide 93 text

“So how does this improve the service?”

Slide 94

Slide 94 text

“So how does this improve the service?” “We can measure it better.”

Slide 95

Slide 95 text

“So how does this improve the service?” “We can measure it better.” “How does that improve it?”

Slide 96

Slide 96 text

“So how does this improve the service?” “We can measure it better.” “How does that improve it?”

Slide 97

Slide 97 text

No content

Slide 98

Slide 98 text

- Faster debugging - Shorter outages - Better project choice

Slide 99

Slide 99 text

- Faster debugging - Shorter outages - Better project choice

Slide 100

Slide 100 text

- Faster debugging - Shorter outages - Better project choice

Slide 101

Slide 101 text

It feels mundane - As a project - As ongoing work

Slide 102

Slide 102 text

It feels mundane - As a project - As ongoing work

Slide 103

Slide 103 text

Observability is ongoing work

Slide 104

Slide 104 text

Problems

Slide 105

Slide 105 text

Previously… https://www.youtube.com/watch?v=SAkNBiZzEX8

Slide 106

Slide 106 text

Was 10-15s of downtime okay?

Slide 107

Slide 107 text

Back to basics?

Slide 108

Slide 108 text

- Faster debugging - Shorter outages - Better project choice

Slide 109

Slide 109 text

- Slower debugging - Longer outages - Worse project choice

Slide 110

Slide 110 text

Lack of confidence

Slide 111

Slide 111 text

Solutions

Slide 112

Slide 112 text

Post-mortem meta-analysis

Slide 113

Slide 113 text

“It wasn’t clear where the problem was.” –Post-mortems 1, 2, 3

Slide 114

Slide 114 text

“We couldn’t break the errors down by user.” –Post-mortems 2, 3, 4

Slide 115

Slide 115 text

“It was a false alarm. Again.” –Post-mortems 3, 4, 5
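
One way to picture the meta-analysis is as a tally of themes that recur across post-mortems. The sketch below is a hypothetical Python example: the theme tags and the `recurring_themes` helper are invented for illustration, and it assumes someone has already tagged each post-mortem by hand.

```python
from collections import Counter


def recurring_themes(postmortems, min_count=2):
    """Return (theme, count) pairs for themes seen in at least min_count post-mortems.

    postmortems maps a post-mortem id to the list of theme tags assigned to it.
    """
    counts = Counter()
    for themes in postmortems.values():
        # Count each theme at most once per post-mortem.
        counts.update(set(themes))
    return [(theme, n) for theme, n in counts.most_common() if n >= min_count]


# Hypothetical tags mirroring the three quotes on the slides above.
postmortems = {
    1: ["unclear-problem-location"],
    2: ["unclear-problem-location", "no-per-user-breakdown"],
    3: ["unclear-problem-location", "no-per-user-breakdown", "false-alarm"],
    4: ["no-per-user-breakdown", "false-alarm", "expired-certificate"],
    5: ["false-alarm"],
}

for theme, n in recurring_themes(postmortems):
    print(f"{theme}: seen in {n} post-mortems")
```

Themes that show up once drop out of the list, leaving the repeated gaps that point at which basics to fix first.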

Slide 116

Slide 116 text

You can do better at the basics

Slide 117

Slide 117 text

A cultural shift

Slide 118

Slide 118 text

Definition of done

Slide 119

Slide 119 text

Done when it’s shipped ↓ Done when it’s measured

Slide 120

Slide 120 text

A huge shift

Slide 121

Slide 121 text

Cultural change takes time

Slide 122

Slide 122 text

Start somewhere

Slide 123

Slide 123 text

There are other basics

Slide 124

Slide 124 text

- Post-mortem analysis - Tracking toil - Tracking pages per shift

Slide 125

Slide 125 text

- Post-mortem analysis - Tracking toil - Tracking pages per shift

Slide 126

Slide 126 text

- Post-mortem analysis - Tracking toil - Tracking pages per shift

Slide 127

Slide 127 text

No content

Slide 128

Slide 128 text

Failure mode 3: The everything project

Slide 129

Slide 129 text

Story-based

Slide 130

Slide 130 text

Kinda painful to tell

Slide 131

Slide 131 text

The most immediate impact

Slide 132

Slide 132 text

You have an infra!

Slide 133

Slide 133 text

You’re not happy with it :(

Slide 134

Slide 134 text

It evolved haphazardly

Slide 135

Slide 135 text

You know where the problems are

Slide 136

Slide 136 text

You want to fix them

Slide 137

Slide 137 text

Reshaping the core

Slide 138

Slide 138 text

Previously… https://www.usenix.org/conference/srecon17americas/program/presentation/sinjakli

Slide 139

Slide 139 text

The precursor

Slide 140

Slide 140 text

Goal: Better deployment

Slide 141

Slide 141 text

- Containers - Orchestrator (Mesos) - Load balancing - Staging-per-developer - Developer UI

Slide 142

Slide 142 text

- Containers - Orchestrator (Mesos) - Load balancing - Staging-per-developer - Developer UI

Slide 143

Slide 143 text

- Containers - Orchestrator (Mesos) - Load balancing - Staging-per-developer - Self-serve developer UI

Slide 144

Slide 144 text

- Containers - Orchestrator (Mesos) - Load balancing - Staging-per-developer - Self-serve developer UI

Slide 145

Slide 145 text

- Containers - Orchestrator (Mesos) - Load balancing - Staging-per-developer - Self-serve developer UI

Slide 146

Slide 146 text

We were working on an Everything Project

Slide 147

Slide 147 text

Problems

Slide 148

Slide 148 text

Everything is seen in terms of The New World

Slide 149

Slide 149 text

So nothing happens back in The Old World

Slide 150

Slide 150 text

It feels efficient

Slide 151

Slide 151 text

But it’s not

Slide 152

Slide 152 text

Loss of impact

Slide 153

Slide 153 text

Loss of confidence

Slide 154

Slide 154 text

Loss of team morale

Slide 155

Slide 155 text

No content

Slide 156

Slide 156 text

Solutions

Slide 157

Slide 157 text

No content

Slide 158

Slide 158 text

Look for the smallest version

Slide 159

Slide 159 text

Look for the valuable part

Slide 160

Slide 160 text

For us: deployment

Slide 161

Slide 161 text

- Containers - Orchestrator (Mesos) - Load balancing - Staging-per-developer - Self-serve developer UI

Slide 162

Slide 162 text

- Containers - Orchestrator (Mesos) - Load balancing - Staging-per-developer - Developer UI

Slide 163

Slide 163 text

https://gocardless.com/blog/from-idea-to-reality-containers-in-production-at-gocardless/

Slide 164

Slide 164 text

Efficiency cannot come at the cost of everything else

Slide 165

Slide 165 text

No stopping the world

Slide 166

Slide 166 text

✅ Long-term goals Short-term reality

Slide 167

Slide 167 text

✅ Long-term goals % Short-term reality

Slide 168

Slide 168 text

No content

Slide 169

Slide 169 text

Mistakes

Slide 170

Slide 170 text

No content

Slide 171

Slide 171 text

I’ve presented 3 “obvious” mistakes

Slide 172

Slide 172 text

Not first, not last

Slide 173

Slide 173 text

Each has an internal logic

Slide 174

Slide 174 text

Conference talks favour self-contained narratives

Slide 175

Slide 175 text

Even when talking about mistakes

Slide 176

Slide 176 text

Technical mistakes are self-contained

Slide 177

Slide 177 text

Us vs Them

Slide 178

Slide 178 text

Us vs Them

Slide 179

Slide 179 text

And I hope I have convinced you

Slide 180

Slide 180 text

You won’t avoid every mistake

Slide 181

Slide 181 text

We certainly didn’t

Slide 182

Slide 182 text

It’s never perfect

Slide 183

Slide 183 text

It’s perfectly fine to correct course

Slide 184

Slide 184 text

Thank you ❤ @ChrisSinjo @GoCardlessEng

Slide 185

Slide 185 text

https://gocardless.com/schemes

Slide 186

Slide 186 text

We’re hiring ❤ @ChrisSinjo @GoCardlessEng

Slide 187

Slide 187 text

Image credits
• XOXO Festival Day 2 - CC-BY - https://www.flickr.com/photos/textfiles/15237123601/
• USS Barry conducts a practice pipe-patching drills during MultiSail 17 - CC-BY - https://www.flickr.com/photos/usnavy/32480491984/
• Train lever - CC-BY - https://www.flickr.com/photos/darkbuffet/2309897403/
• Calendar - CC-BY - https://www.flickr.com/photos/dafnecholet/5374200948/

Slide 188

Slide 188 text

Image credits
• Unhappy man - CC0 - https://pixabay.com/en/unhappy-man-mask-sad-face-sitting-389944/
• Stop sign - CC-BY - https://www.flickr.com/photos/wolfsavard/4812833180/
• Rope - CC-BY - https://www.flickr.com/photos/49140926@N07/6798304070/

Slide 189

Slide 189 text

Questions? ❤ @ChrisSinjo @GoCardlessEng