Slide 1

Slide 1 text

Making the Impossible Impossible: Improving Reliability by Preventing Classes of Problems @ChrisSinjo

Slide 2

Slide 2 text

Hi

Slide 3

Slide 3 text

Hi Greetings

Slide 4

Slide 4 text

@ChrisSinjo

Slide 5

Slide 5 text

@ChrisSinjo

Slide 6

Slide 6 text

Infra Engineer

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

Making the Impossible Impossible: Improving Reliability by Preventing Classes of Problems @ChrisSinjo

Slide 9

Slide 9 text

We are at SREcon

Slide 10

Slide 10 text

We likely share:
- Job titles
- Skills
- Ways of thinking

Slide 11

Slide 11 text

Common ground/ "Best practices"

Slide 12

Slide 12 text

Some ideas have outsized impact

Slide 13

Slide 13 text

In SRE: SLOs (Service Level Objectives)

Slide 14

Slide 14 text

A refresher: Measuring the performance of a service as a percentage of successful operations

Slide 15

Slide 15 text

Example: HTTP requests
(Successful requests / Total requests) x 100 ≥ 99.9%
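As a rough worked example of the formula above, here is a minimal sketch. The request counts and the helper names `success_percentage` and `slo_met?` are made up for illustration; only the 99.9% target comes from the slide.

```ruby
# Percentage of successful operations in one measurement window.
def success_percentage(successful, total)
  # Assumption: an empty window counts as fully successful (avoids dividing by zero).
  return 100.0 if total.zero?

  successful * 100.0 / total
end

# True when the window meets the target percentage.
def slo_met?(successful, total, target: 99.9)
  success_percentage(successful, total) >= target
end

puts success_percentage(999_500, 1_000_000) # 99.95
puts slo_met?(999_500, 1_000_000)           # true
puts slo_met?(998_000, 1_000_000)           # false, 99.8% is below the target
```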

Slide 16

Slide 16 text

So why am I here today?

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

The perils of success

Slide 19

Slide 19 text

The way we measure shapes the way we think

Slide 20

Slide 20 text

The way we think shapes the solutions we explore

Slide 21

Slide 21 text

SLOs encourage percentage thinking

Slide 22

Slide 22 text

Instances go unhealthy ‑ Add health checks & route traffic away

Slide 23

Slide 23 text

Instances go unhealthy ‑ Add health checks & route traffic away

Slide 24

Slide 24 text

Regional network issues ‑ Serve from multiple regions

Slide 25

Slide 25 text

Regional network issues ‑ Serve from multiple regions

Slide 26

Slide 26 text

Rare slow requests ‑ Add timeouts to protect majority of traffic

Slide 27

Slide 27 text

Rare slow requests ‑ Add timeouts to protect majority of traffic

Slide 28

Slide 28 text

Example: HTTP requests
(Successful requests / Total requests) x 100 ≥ 99.9%

Slide 29

Slide 29 text

Reliability is a percentage game

Slide 30

Slide 30 text

We can stack the odds in our favour

Slide 31

Slide 31 text

Not all solutions look the same

Slide 32

Slide 32 text

Not all solutions are about percentages

Slide 33

Slide 33 text

Some solutions prevent problems entirely

Slide 34

Slide 34 text

Today's talk:
- Another lens for reliability
- Examples in the wild
- How to spot problems of this shape

Slide 35

Slide 35 text

Today's talk:
- Another lens for reliability
- Examples in the wild
- How to spot problems of this shape

Slide 36

Slide 36 text

Today's talk:
- Another lens for reliability
- Examples in the wild
- How to spot problems of this shape

Slide 37

Slide 37 text

This is not:
- An attack on SLOs
- One-size-fits-all solution
- Possible if you can't edit software

Slide 38

Slide 38 text

This is not:
- An attack on SLOs
- One-size-fits-all solution
- Possible if you can't edit software

Slide 39

Slide 39 text

This is not:
- An attack on SLOs
- One-size-fits-all solution
- Possible if you can't edit software

Slide 40

Slide 40 text

Examples:
- State machines
- Type systems & memory safety
- Database migrations

Slide 41

Slide 41 text

Examples:
- State machines
- Memory safety
- Database migrations


Slide 42

Slide 42 text

Examples:
- State machines
- Memory safety
- Database migrations


Slide 43

Slide 43 text

Example 1 State machines

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

Collect from customer ‑ Pay out to merchant

Slide 46

Slide 46 text

Collect from customer ‑ Pay out to merchant

Slide 47

Slide 47 text

Payment 💸

Slide 48

Slide 48 text

Payment 💸 states: Created, Submitted, Collected, Paid out, Failed

Slide 49

Slide 49 text

Simple model:
    id  description            state
    1   Laptop                 submitted
    2   Phone                  collected
    3   Unused domain renewal  collected

Slide 50

Slide 50 text

Simple model:
    id  description            state
    1   Laptop                 submitted
    2   Phone                  collected
    3   Unused domain renewal  collected

Slide 51

Slide 51 text

Simple model:
    id  description            state
    1   Laptop                 collected
    2   Phone                  collected
    3   Unused domain renewal  collected

Slide 52

Slide 52 text

Simple model:
    id  description            state
    1   Laptop                 paid_out
    2   Phone                  collected
    3   Unused domain renewal  collected

Slide 53

Slide 53 text

Simple model:
    id  description            state
    1   Laptop                 submitted
    2   Phone                  collected
    3   Unused domain renewal  collected

Slide 54

Slide 54 text

Simple model:
    id  description            state
    1   Laptop                 failed
    2   Phone                  collected
    3   Unused domain renewal  collected

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

Submitted ➡ Failed
Collected ➡ Failed?

Slide 57

Slide 57 text

Submitted ➡ Failed
Collected ➡ Failed?

Slide 58

Slide 58 text

Submitted ➡ Failed
Paid out ➡ Failed?

Slide 59

Slide 59 text

We want some restrictions

Slide 60

Slide 60 text

State restriction pseudocode:
    class Payment
      def fail()
        state = "failed"

Slide 61

Slide 61 text

State restriction pseudocode:
    class Payment
      def fail()
        if state == "submitted"
          state = "failed"
        else
          raise "Cannot fail from state: #{state}"

Slide 62

Slide 62 text

State restriction pseudocode:
    class Payment
      def submit()
        if state == "created"
          state = "submitted"
        else
          raise "Cannot submit from state: #{state}"

Slide 63

Slide 63 text

Payment 💸 states: Created, Submitted, Collected, Paid out, Failed

Slide 64

Slide 64 text

Payment 💸 states: Created, Submitted, Collected, Payout submitted, Paid out, Failed

Slide 65

Slide 65 text

State restriction pseudocode:
    class Payment
      def fail()
        if state in ["submitted", "payout_submitted"]
          state = "failed"
        else
          raise "Cannot fail from state: #{state}"

Slide 66

Slide 66 text

An ad-hoc mess

Slide 67

Slide 67 text

Bugs 📈 Maintenance 📈

Slide 68

Slide 68 text

Computer Science has an answer

Slide 69

Slide 69 text

We can use a state machine

Slide 70

Slide 70 text

State machine: - A set of states - A set of allowed transitions between those states

Slide 71

Slide 71 text

State machine pseudocode:
    class Payment
      states(["created", "submitted", ...])
      allow_transition("created", "submitted")
      allow_transition("submitted", "collected")
      allow_transition("submitted", "failed")
      ...
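A runnable version of this idea might look like the sketch below. It is a minimal table-driven machine written for illustration, not the statesman API. The names `ALLOWED_TRANSITIONS`, `InvalidTransition` and `transition_to` are assumptions, and only the transitions spelled out on the slide are included.

```ruby
# Minimal table-driven state machine (illustrative; not the statesman gem).
class Payment
  class InvalidTransition < StandardError; end

  # Maps each state to the states it is allowed to move to.
  # Anything not listed here is rejected.
  ALLOWED_TRANSITIONS = {
    "created"   => ["submitted"],
    "submitted" => ["collected", "failed"],
  }.freeze

  attr_reader :state

  def initialize
    @state = "created"
  end

  # The single place where state changes, so every change is checked.
  def transition_to(new_state)
    allowed = ALLOWED_TRANSITIONS.fetch(state, [])
    unless allowed.include?(new_state)
      raise InvalidTransition, "cannot transition from #{state.inspect} to #{new_state.inspect}"
    end

    @state = new_state
  end
end

payment = Payment.new
payment.transition_to("submitted")
payment.transition_to("collected")

begin
  payment.transition_to("failed") # not in the table for "collected"
rescue Payment::InvalidTransition => e
  puts "Error: #{e.message}"
end
```

Because every change goes through one method and one table, adding a state means editing the table rather than hunting down scattered `if` checks.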

Slide 72

Slide 72 text

State diagram: Created, Collected, Paid out, Failed, Submitted

Slide 73

Slide 73 text

State diagram: Created, Collected, Paid out, Failed, Submitted

Slide 74

Slide 74 text

State machine pseudocode:
    class Payment
      states(["created", "submitted", ...])
      allow_transition("created", "submitted")
      allow_transition("submitted", "collected")
      allow_transition("submitted", "failed")
      ...

Slide 75

Slide 75 text

Error: cannot transition from "paid out" to "failed"

Slide 76

Slide 76 text

State machine pseudocode:
    class Payment
      states(["created", "submitted", ...])
      allow_transition("created", "submitted")
      allow_transition("submitted", "collected")
      allow_transition("submitted", "failed")
      ...

Slide 77

Slide 77 text

State machine pseudocode:
    class Payment
      states(["created", "submitted", ...])
      allow_transition("created", "submitted")
      allow_transition("submitted", "collected")
      allow_transition("submitted", "failed")
      allow_transition("failed", "submitted")
      ...

Slide 78

Slide 78 text

State diagram: Created, Collected, Paid out, Failed, Submitted

Slide 79

Slide 79 text

Often dismissed: "Too academic"

Slide 80

Slide 80 text

https://github.com/gocardless/statesman

Slide 81

Slide 81 text

Make the problem impossible

Slide 82

Slide 82 text

Example 2 Memory safety

Slide 83

Slide 83 text

Not here to sell you Rust

Slide 84

Slide 84 text

Something we often take for granted

Slide 85

Slide 85 text

But first, some C

Slide 86

Slide 86 text

Memory allocation in C:
    char *ptr = malloc(SIZE);
    do_stuff(ptr);
    free(ptr);

Slide 87

Slide 87 text

Use-after-free in C:
    char *ptr = malloc(SIZE);
    do_stuff(ptr);
    free(ptr);
    // Many lines more code
    do_other_stuff(ptr);

Slide 88

Slide 88 text

Undefined behaviour (You don't know what your program will do)

Slide 89

Slide 89 text

Undefined behaviour (An attacker might be able to abuse it)

Slide 90

Slide 90 text

A non-scientific study: https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=use+after+free+2022

Slide 91

Slide 91 text

A non-scientific study: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-41849

Slide 92

Slide 92 text

You don't know which one will be serious

Slide 93

Slide 93 text

The assertion that we can simply code better is nonsense

Slide 94

Slide 94 text

Something we often take for granted

Slide 95

Slide 95 text

Garbage collected languages

Slide 96

Slide 96 text

Garbage collection pseudocode:
    def main()
      name = "Chris"
      greet(name)
    def greet(name)
      puts("Hello #{name}")

Slide 97

Slide 97 text

Garbage collection pseudocode:
    def main()
      name = "Chris"
      greet(name)
    def greet(name)
      puts("Hello #{name}")
Annotation: Falls out of scope

Slide 98

Slide 98 text

The computer does it for you

Slide 99

Slide 99 text

Garbage collection is outrageously successful

Slide 100

Slide 100 text

Java Go Ruby Python JavaScript C# Haskell Lisp PHP Erlang

Slide 101

Slide 101 text

But what about...

Slide 102

Slide 102 text

You don't always want a runtime

Slide 103

Slide 103 text

No content

Slide 104

Slide 104 text

No content

Slide 105

Slide 105 text

Stuck with manual memory management

Slide 106

Slide 106 text

Until...

Slide 107

Slide 107 text

No content

Slide 108

Slide 108 text

No content

Slide 109

Slide 109 text

No content

Slide 110

Slide 110 text

No content

Slide 111

Slide 111 text

Okay so hear me out

Slide 112

Slide 112 text

Ownership & borrow-checking

Slide 113

Slide 113 text

Tl;dr: Every value in memory has at most one owner

Slide 114

Slide 114 text

Garbage collection pseudocode:
    def main()
      name = "Chris"
      greet(name)
    def greet(name)
      puts("Hello #{name}")

Slide 115

Slide 115 text

Rust greetings:
    fn main() {
        let name = String::from("Chris");
        greet(name);
    }
    fn greet(name: String) {
        println!("Hello {}", name);
    }

Slide 116

Slide 116 text

Rust greetings:
    fn main() {
        let name = String::from("Chris");
        greet(name);
    }
    fn greet(name: String) {
        println!("Hello {}", name);
    }
Annotation: Owner transferred

Slide 117

Slide 117 text

Rust greetings:
    fn main() {
        let name = String::from("Chris");
        greet(name);
    }
    fn greet(name: String) {
        println!("Hello {}", name);
    }
Annotations: Falls out of scope; Owner transferred

Slide 118

Slide 118 text

Owner out-of-scope ‑ Value dropped

Slide 119

Slide 119 text

Rust greetings:
    fn main() {
        let name = String::from("Chris");
        greet(name);
        say_goodbye(name);
    }
    fn greet(name: String) {
        println!("Hello {}", name);
    }
Annotation: Compiler error

Slide 120

Slide 120 text

Rust greetings:
    fn main() {
        let name = String::from("Chris");
        greet(&name);
        say_goodbye(name);
    }
    fn greet(name: &String) {
        println!("Hello {}", name);
    }
Annotation: Borrow

Slide 121

Slide 121 text

No manual memory management

Slide 122

Slide 122 text

The computer does it for you

Slide 123

Slide 123 text

No GC

Slide 124

Slide 124 text

No content

Slide 125

Slide 125 text

Make the problem impossible

Slide 126

Slide 126 text

Example 3 Database migrations

Slide 127

Slide 127 text

MySQL (but also true in Postgres)

Slide 128

Slide 128 text

-- Create a table
CREATE TABLE payments (
  id int NOT NULL,
  ...
)
-- Realise `int` isn't large enough (2^32)
-- You're going to run out of IDs
ALTER TABLE payments MODIFY id bigint;

Slide 129

Slide 129 text

-- Create a table
CREATE TABLE payments (
  id int NOT NULL,
  ...
)
-- Realise `int` isn't large enough (2^32)
-- You're going to run out of IDs
ALTER TABLE payments MODIFY id bigint;

Slide 130

Slide 130 text

-- Create a table
CREATE TABLE payments (
  id int NOT NULL,
  ...
)
-- Realise `int` isn't large enough (2^32)
-- You're going to run out of IDs
ALTER TABLE payments MODIFY id bigint;
Annotation: Blocks all other queries
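For a sense of scale, the arithmetic behind the comment can be sketched as follows. The constant and variable names are illustrative. One detail not on the slide: a signed 32-bit `int` column can only use half of those 2^32 values as positive IDs.

```ruby
INT_BITS = 32
BIGINT_BITS = 64

int_values = 2**INT_BITS                     # distinct values a 32-bit column can hold
int_signed_max = 2**(INT_BITS - 1) - 1       # largest ID when the column is signed
bigint_signed_max = 2**(BIGINT_BITS - 1) - 1 # largest ID in a signed 64-bit column

puts int_values        # 4294967296
puts int_signed_max    # 2147483647
puts bigint_signed_max # 9223372036854775807
```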

Slide 131

Slide 131 text

🕵 The migrations reviewer

Slide 132

Slide 132 text

Add a new column or Recreate the table

Slide 133

Slide 133 text

No content

Slide 134

Slide 134 text

🕵 The migrations reviewer

Slide 135

Slide 135 text

😰 The migrations reviewer

Slide 136

Slide 136 text

🕵 🕵 🕵 The migrations reviewers

Slide 137

Slide 137 text

😰 😰 😰 The migrations reviewers

Slide 138

Slide 138 text

It doesn't scale

Slide 139

Slide 139 text

and it's still not enough

Slide 140

Slide 140 text

Seemingly innocuous:
    ALTER TABLE payments ADD COLUMN refunded boolean;

Slide 141

Slide 141 text

But can still be dangerous

Slide 142

Slide 142 text

-- Slow transaction
START TRANSACTION;
SELECT * FROM payments;
-- Forces this to queue
ALTER TABLE payments ADD COLUMN refunded boolean;
-- Which blocks these
SELECT * FROM payments WHERE id = 123;

Slide 143

Slide 143 text

-- Slow transaction
START TRANSACTION;
SELECT * FROM payments;
-- Forces this to queue
ALTER TABLE payments ADD COLUMN refunded boolean;
-- Which blocks these
SELECT * FROM payments WHERE id = 123;

Slide 144

Slide 144 text

-- Slow transaction
START TRANSACTION;
SELECT * FROM payments;
-- Forces this to queue
ALTER TABLE payments ADD COLUMN refunded boolean;
-- Which blocks these
SELECT * FROM payments WHERE id = 123;
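The queueing effect above can be modelled with a toy lock queue. This is a deliberate simplification for illustration, not how MySQL's lock manager is implemented. The class `TableLockQueue`, its methods and the `:shared` / `:exclusive` modes are all assumptions: reads take a shared lock, the schema change needs an exclusive one, and requests are granted strictly in arrival order.

```ruby
# Toy FIFO lock queue for one table (a simplification, not MySQL's lock manager).
class TableLockQueue
  Request = Struct.new(:name, :mode)

  attr_reader :held, :waiting

  def initialize
    @held = []    # locks currently granted
    @waiting = [] # requests queued in arrival order
  end

  # Ask for a lock. Returns true if it was granted straight away.
  # Request names are assumed to be unique.
  def request(name, mode)
    @waiting << Request.new(name, mode)
    grant_from_front
    @held.any? { |r| r.name == name }
  end

  # Release a held lock, then let queued requests proceed in order.
  def release(name)
    @held.reject! { |r| r.name == name }
    grant_from_front
  end

  private

  # Exclusive needs the table to itself; shared only conflicts with exclusive.
  def compatible?(mode)
    return @held.empty? if mode == :exclusive

    @held.none? { |r| r.mode == :exclusive }
  end

  # Strict FIFO: stop at the first request that cannot be granted.
  def grant_from_front
    while !@waiting.empty? && compatible?(@waiting.first.mode)
      @held << @waiting.shift
    end
  end
end

locks = TableLockQueue.new
puts locks.request("slow transaction", :shared) # true
puts locks.request("ALTER TABLE", :exclusive)   # false, waits for the slow transaction
puts locks.request("SELECT id = 123", :shared)  # false, stuck behind the ALTER
locks.release("slow transaction")
puts locks.held.map(&:name).inspect             # ["ALTER TABLE"]
```

The fast `SELECT` never conflicts with the slow transaction directly. It is blocked only because it queues behind the waiting `ALTER TABLE`.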

Slide 145

Slide 145 text

No content

Slide 146

Slide 146 text

No content

Slide 147

Slide 147 text

No content

Slide 148

Slide 148 text

Tl;dr:
- MySQL-compatible
- Scalability (sharding)
- High-availability

Slide 149

Slide 149 text

Tl;dr:
- MySQL-compatible
- Scalability (sharding)
- High-availability

Slide 150

Slide 150 text

Tl;dr:
- MySQL-compatible
- Scalability (sharding)
- High-availability

Slide 151

Slide 151 text

No content

Slide 152

Slide 152 text

No content

Slide 153

Slide 153 text

VReplication: A stream of changes

Slide 154

Slide 154 text

Delete Insert Update

Slide 155

Slide 155 text

ALTER TABLE payments MODIFY id bigint;

Slide 156

Slide 156 text

ALTER TABLE payments MODIFY id bigint;
    id (int)  description
    1         Laptop
    2         Phone

Slide 157

Slide 157 text

    id (bigint)  description
ALTER TABLE payments MODIFY id bigint;
    id (int)  description
    1         Laptop
    2         Phone

Slide 158

Slide 158 text

    id (bigint)  description
    1            Laptop
ALTER TABLE payments MODIFY id bigint;
    id (int)  description
    1         Laptop
    2         Phone

Slide 159

Slide 159 text

    id (bigint)  description
    1            Laptop
ALTER TABLE payments MODIFY id bigint;
    id (int)  description
    1         Laptop
    2         Phone
    3         Unused domain renewal

Slide 160

Slide 160 text

    id (bigint)  description
    1            Laptop
    2            Phone
ALTER TABLE payments MODIFY id bigint;
    id (int)  description
    1         Laptop
    2         Phone
    3         Unused domain renewal

Slide 161

Slide 161 text

    id (bigint)  description
    1            Laptop
    2            Phone
    3            Unused domain renewal
ALTER TABLE payments MODIFY id bigint;
    id (int)  description
    1         Laptop
    2         Phone
    3         Unused domain renewal

Slide 162

Slide 162 text

    id (bigint)  description
    1            Laptop
    2            Phone
    3            Unused domain renewal
ALTER TABLE payments MODIFY id bigint;
    id (int)  description
    1         Laptop
    2         Phone
    3         Unused domain renewal

Slide 163

Slide 163 text

    id (bigint)  description
    1            Laptop
    2            Phone
    3            Unused domain renewal
ALTER TABLE payments MODIFY id bigint;
    id (int)  description
    1         Laptop
    2         Phone
    3         Unused domain renewal
Annotation: User queries (via proxy)

Slide 164

Slide 164 text

    id (bigint)  description
    1            Laptop
    2            Phone
    3            Unused domain renewal
ALTER TABLE payments MODIFY id bigint;
    id (int)  description
    1         Laptop
    2         Phone
    3         Unused domain renewal
Annotation: User queries (via proxy)

Slide 165

Slide 165 text

    id (bigint)  description
    1            Laptop
    2            Phone
    3            Unused domain renewal
ALTER TABLE payments MODIFY id bigint;
    id (int)  description
    1         Laptop
    2         Phone
    3         Unused domain renewal
Annotation: User queries (via proxy)

Slide 166

Slide 166 text

Fully-online schema migrations
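The copy-then-switch sequence in the preceding slides can be sketched as a toy model. This is illustrative only and is not Vitess or VReplication code. The class `OnlineMigration` and its methods are assumptions, and tables are plain hashes of id to description.

```ruby
# Toy model of an online migration: copy rows into a shadow table, replay
# the changes that arrived during the copy, then switch. Illustrative only.
class OnlineMigration
  attr_reader :live, :shadow

  def initialize(rows)
    @live = rows.dup # table currently serving user queries
    @shadow = {}     # new table with the new schema
    @changes = []    # stream of changes made while copying
  end

  # User writes keep hitting the live table and are recorded in the stream.
  def write(operation, id, description = nil)
    if operation == :delete
      @live.delete(id)
    else
      @live[id] = description
    end
    @changes << [operation, id, description]
  end

  # Copy one not-yet-copied row. Returns false once nothing is left to copy.
  def copy_next_row
    id = (@live.keys - @shadow.keys).min
    return false if id.nil?

    @shadow[id] = @live[id]
    true
  end

  # Replay recorded changes onto the shadow table, then make it the live table.
  def cut_over
    until @changes.empty?
      operation, id, description = @changes.shift
      if operation == :delete
        @shadow.delete(id)
      else
        @shadow[id] = description
      end
    end
    @live = @shadow
  end
end

migration = OnlineMigration.new({ 1 => "Laptop", 2 => "Phone" })
migration.copy_next_row                              # copies id 1
migration.write(:insert, 3, "Unused domain renewal") # arrives mid-copy
nil while migration.copy_next_row                    # copy the remaining rows
migration.cut_over
puts migration.live.keys.inspect                     # [1, 2, 3]
```

The live table keeps accepting reads and writes throughout. The only coordinated step is the final switch.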

Slide 167

Slide 167 text

😰 😰 😰 The migrations reviewers

Slide 168

Slide 168 text

People doing their actual job 😀 😀 😀

Slide 169

Slide 169 text

Make the problem impossible

Slide 170

Slide 170 text

Examples

Slide 171

Slide 171 text

No content

Slide 172

Slide 172 text

Takeaways:
- Complementary technique
- You have to write software
- It's not easy to spot

Slide 173

Slide 173 text

SLOs are alive and well

Slide 174

Slide 174 text

Percentage solutions are too

Slide 175

Slide 175 text

Percentage solutions

Slide 176

Slide 176 text

A complementary technique

Slide 177

Slide 177 text

No content

Slide 178

Slide 178 text

https://gocardless.com/blog/fear-free-postgresql-migrations-for-rails/

Slide 179

Slide 179 text

Takeaways:
- Complementary technique
- You have to write software
- It's not easy to spot

Slide 180

Slide 180 text

No code changes

Slide 181

Slide 181 text

This is not one of them

Slide 182

Slide 182 text

Sometimes BIG Sometimes small

Slide 183

Slide 183 text

Not everyone can build a database

Slide 184

Slide 184 text

https://github.com/gocardless/statesman

Slide 185

Slide 185 text

Maybe someone already solved it

Slide 186

Slide 186 text

Takeaways:
- Complementary technique
- You have to write software
- It's not easy to spot
- But there are some tells

Slide 187

Slide 187 text

Takeaways:
- Complementary technique
- You have to write software
- It's not easy to spot
- But there are some tells

Slide 188

Slide 188 text

🕵 The migrations reviewer

Slide 189

Slide 189 text

No content

Slide 190

Slide 190 text

🙄 Smug internet comments

Slide 191

Slide 191 text

No content

Slide 192

Slide 192 text

🙄 Smug internet comments

Slide 193

Slide 193 text

Examples: - State machines - Memory safety 
 - Database migrations 
 Add more unit tests Write better C Just hire

Slide 194

Slide 194 text

Smug comments: - State machines - Memory safety 
 - Database migrations 
 Write better C Just hire

Slide 195

Slide 195 text

Smug comments: - State machines - Memory safety 
 - Database migrations 
 Add more unit tests Write better C Just hire

Slide 196

Slide 196 text

Smug comments: - State machines - Memory safety 
 - Database migrations 
 Add more unit tests Write better C Just hire

Slide 197

Slide 197 text

Smug comments:
- State machines: Add more unit tests
- Memory safety: Write better C
- Database migrations: Just hire a DBA

Slide 198

Slide 198 text

There's probably more to it

Slide 199

Slide 199 text

The assertion that we can simply code better is nonsense

Slide 200

Slide 200 text

We can do better

Slide 201

Slide 201 text

Thank you ✌❤ @ChrisSinjo @planetscaledata

Slide 202

Slide 202 text

Image credits
• Poker Winnings - slgckgc - CC-BY - https://www.flickr.com/photos/slgc/42157896194/
• Thinking Face - Twemoji - CC-BY - https://github.com/twitter/twemoji
• Ferris (Extra-cute) - Unofficial Rust mascot - Copyright waived - https://rustacean.net/
• A350 Board - Mark Turnauckas - CC-BY - https://www.flickr.com/photos/marktee/17118767669/
• Play - Annie Roi - CC-BY - https://www.flickr.com/photos/annieroi/4421442720/

Slide 203

Slide 203 text

Image credits
• White jigsaw puzzle with missing piece - Marco Verch Professional Photographer - CC-BY - https://www.flickr.com/photos/30478819@N08/50605134766/
• Hedge maze - claumoho - CC-BY - https://flickr.com/photos/claudiah/3929921991/
• photo_1405_20060410 - Robo Android - CC-BY - https://www.flickr.com/photos/49140926@N07/6798304070/
• Gears - Mustang Joe - Public Domain - https://www.flickr.com/photos/mustangjoe/20437315996/

Slide 204

Slide 204 text

Questions? ✌❤ @ChrisSinjo @planetscaledata