Slide 1

Slide 1 text

Lessons learned the hard way Postgres in production at GoCardless @ChrisSinjo

Slide 2

Slide 2 text

GOCARDLESS

Slide 3

Slide 3 text

POST /cash/monies HTTP/1.1 { amount: 100 }

Slide 4

Slide 4 text

High per-request

Slide 5

Slide 5 text

Uptime is

Slide 6

Slide 6 text

You will have different concerns

Slide 7

Slide 7 text

The distributed system as a whole

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

Learning through the medium of post-mortems

Slide 13

Slide 13 text

What is a post-mortem? https://www.infoq.com/articles/postmortems-etsy

Slide 14

Slide 14 text

Failure can be a great teacher

Slide 15

Slide 15 text

5 stories

Slide 16

Slide 16 text

Resource utilisation ORMs Clustering

Slide 17

Slide 17 text

foreach(incident): What happened Something we learned Ideas you can use

Slide 18

Slide 18 text

3 things to do

Slide 19

Slide 19 text

Resource utilisation ORMs Clustering

Slide 20

Slide 20 text

Incident 1 The Fast Migration That Wasn’t

Slide 21

Slide 21 text

~10-20 schema changes per-month

Slide 22

Slide 22 text

Problem: locks

Slide 23

Slide 23 text

Many guides on safe schema changes

Slide 24

Slide 24 text

Don’t ADD COLUMN with DEFAULT VALIDATE constraints after adding Add indexes CONCURRENTLY

Slide 25

Slide 25 text

ALTER TABLE payments ADD COLUMN refunded boolean; No DEFAULT

Slide 26

Slide 26 text

The lock queue

Slide 27

Slide 27 text

-- Slow query (AccessShare) SELECT DISTINCT(customer_id) FROM payments;

Slide 28

Slide 28 text

-- Slow query (AccessShare) SELECT DISTINCT(customer_id) FROM payments; -- Forces this to queue (AccessExclusive) ALTER TABLE payments ADD COLUMN refunded boolean;

Slide 29

Slide 29 text

-- Slow query (AccessShare) SELECT DISTINCT(customer_id) FROM payments; -- Forces this to queue (AccessExclusive) ALTER TABLE payments ADD COLUMN refunded boolean; -- Which blocks these (AccessShare) SELECT * FROM payments WHERE id = 123;

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

15 seconds of API downtime

Slide 32

Slide 32 text

Online Analytical Processing

Slide 33

Slide 33 text

Online Transaction Processing

Slide 34

Slide 34 text

Don’t OLAP where you OLTP

Slide 35

Slide 35 text

Set appropriate bounds on the system

Slide 36

Slide 36 text

SET lock_timeout… SET statement_timeout…

Slide 37

Slide 37 text

https://gocardless.com/blog/zero-downtime-postgres-migrations-a-little-help/

Slide 38

Slide 38 text

-- Wishlist SET transaction_timeout… -- In 9.6 SET idle_in_transaction_session_timeout…

Slide 39

Slide 39 text

log_lock_waits = on

Slide 40

Slide 40 text

Incident 2 Did we just use 1.5TB of disk?

Slide 41

Slide 41 text

Not big data

Slide 42

Slide 42 text

2 hours chasing timeouts

Slide 43

Slide 43 text

Oops

Slide 44

Slide 44 text

… LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp23325.11166", size 1245184 LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp23325.11165", size 8675328 LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp10729.11129", size 47734784 …

Slide 45

Slide 45 text

A runaway analytics query

Slide 46

Slide 46 text

Don’t OLAP where you OLTP

Slide 47

Slide 47 text

Near miss

Slide 48

Slide 48 text

Set appropriate bounds on the system

Slide 49

Slide 49 text

temp_file_limit = … # size in kB

Slide 50

Slide 50 text

But…

Slide 51

Slide 51 text

Data growth

Slide 52

Slide 52 text

Measure as you approach the limit

Slide 53

Slide 53 text

But…

Slide 54

Slide 54 text

> CREATE INDEX pay_merch ON payments (merchant_id); ERROR: temporary file size exceeds temp_file_limit

Slide 55

Slide 55 text

SET temp_file_limit = -1; # disable CREATE INDEX pay_merch ON payments (merchant_id); RESET temp_file_limit; # re-enable

Slide 56

Slide 56 text

Resource utilisation ORMs Clustering

Slide 57

Slide 57 text

Incident 3 Revenge of the ORM

Slide 58

Slide 58 text

Postgres Client 1 Client 2

Slide 59

Slide 59 text

Postgres Client 1 Client 2 Client 3 Client 4 Client 5

Slide 60

Slide 60 text

Postgres Client 1 Client 2 Client 3 Client 4 Client 5 . . .

Slide 61

Slide 61 text

PgBouncer with transaction-pooling

Slide 62

Slide 62 text

Postgres Client 1 Client 2 Client 3 Client 4 Client 5 PgBouncer

Slide 63

Slide 63 text

No session-level features

Slide 64

Slide 64 text

SET Session-level advisory locks Prepared statements

Slide 65

Slide 65 text

We prepared our apps for this

Slide 66

Slide 66 text

NoMethodError: undefined method 'fields' for nil:NilClass

Slide 67

Slide 67 text

–Literally everyone on the internet “You should disable prepared statements.”

Slide 68

Slide 68 text

3 days 3 people

Slide 69

Slide 69 text

On deployment On violating constraints

Slide 70

Slide 70 text

tcpdump + Wireshark

Slide 71

Slide 71 text

A bunch of extra statements!!!

Slide 72

Slide 72 text

SET client_encoding TO "utf8"; SET client_min_messages TO "panic"; SET standard_conforming_strings = on; SET client_min_messages TO "warning"; SET time zone "UTC";

Slide 73

Slide 73 text

SET client_encoding TO "utf8"; SET client_min_messages TO "panic"; SET standard_conforming_strings = on; SET client_min_messages TO "warning"; SET time zone "UTC";

Slide 74

Slide 74 text

Postgres Client 1 Client 2 Client 3 Client 4 Client 5 PgBouncer

Slide 75

Slide 75 text

Postgres Client 1 PgBouncer

Slide 76

Slide 76 text

Postgres Client 1 PgBouncer SET … "panic"; SET … "warning";

Slide 77

Slide 77 text

Postgres Client 1 PgBouncer SET … "panic"; SET … "warning"; SET … "panic";

Slide 78

Slide 78 text

Postgres Client 1 PgBouncer SET … "panic"; SET … "warning"; SET … "panic"; SET … "warning";

Slide 79

Slide 79 text

So what if client_min_messages is set to "panic"?

Slide 80

Slide 80 text

ERROR: duplicate key value violates unique constraint "index_users_on_email"

Slide 81

Slide 81 text

$ git show a456acb2f2 commit a456acb2f2af8365eb9151c7cd2d5a10c189d191 Author: Harry Marr Date: Wed Oct 28 16:30:02 2015 +0000 Avoid disabling postgres errors # Enable standard-conforming strings if available. def set_standard_conforming_strings - old, self.client_min_messages = client_min_messages, 'panic' - execute('SET standard_conforming_strings = on', 'SCHEMA') rescue nil - ensure - self.client_min_messages = old + execute(<<-SQL, 'SCHEMA') + UPDATE pg_settings + SET setting = 'on' + WHERE name = 'standard_conforming_strings' + SQL end

Slide 82

Slide 82 text

https://gocardless.com/blog/the-troubleshooting-tales-issues-scaling-postgres- connections/

Slide 83

Slide 83 text

https://twitter.com/rbranson/status/675005104701870080

Slide 84

Slide 84 text

There’s a lot more than /usr/bin/postgresql

Slide 85

Slide 85 text

- PgBouncer - libpq - Ruby pg gem - ActiveRecord (ORM)

Slide 86

Slide 86 text

It’s just code

Slide 87

Slide 87 text

Don’t trust the ORM

Slide 88

Slide 88 text

Resource utilisation ORMs Clustering

Slide 89

Slide 89 text

How soon are we back online? Is all our data there?

Slide 90

Slide 90 text

What’s in a cluster?

Slide 91

Slide 91 text

Consensus State machine Health checks Actions

Slide 92

Slide 92 text

Pacemaker

Slide 93

Slide 93 text

Incident 4 Sorry, I cannot hear you I’m kinda busy

Slide 94

Slide 94 text

It starts with a page

Slide 95

Slide 95 text

crm mon status - demote event for Postgresql on : OK

Slide 96

Slide 96 text

psql: FATAL: sorry, too many clients already

Slide 97

Slide 97 text

30 seconds of API downtime

Slide 98

Slide 98 text

Roll back the last change

Slide 99

Slide 99 text

# Should only be set for schema migrations USE_DIRECT_PG_CONNECTION=true

Slide 100

Slide 100 text

What about superuser_reserved_connections?

Slide 101

Slide 101 text

70 Normal App

Slide 102

Slide 102 text

70 26 Normal App Unused

Slide 103

Slide 103 text

70 3 1 26 Normal App Unused Superuser (non-reserved) Superuser (reserved)

Slide 104

Slide 104 text

70 3 1 26 Normal Incident App Unused Superuser (non-reserved) Superuser (reserved) App 97

Slide 105

Slide 105 text

70 3 1 26 3 1 Normal Incident App Unused Superuser (non-reserved) Superuser (reserved) App Superuser (rejected) Superuser (reserved) 97

Slide 106

Slide 106 text

Set appropriate bounds on the system

Slide 107

Slide 107 text

No content

Slide 108

Slide 108 text

Incident 5 What’s in a health check?

Slide 109

Slide 109 text

su postgres -c 'psql -c "select now()"' Change to the Postgres user Run a simple query Connect to the database

Slide 110

Slide 110 text

Humans in LDAP Robots locally

Slide 111

Slide 111 text

su postgres -c 'psql -c "select now()"' Postgres is a local user

Slide 112

Slide 112 text

Safe to restart an LDAP node, right?

Slide 113

Slide 113 text

Wrong

Slide 114

Slide 114 text

Cluster transition due to health-check timeout

Slide 115

Slide 115 text

su postgres -c 'psql -c "select now()"'

Slide 116

Slide 116 text

su postgres -c 'echo "hello"' Command doesn’t matter

Slide 117

Slide 117 text

Enter strace, ltrace

Slide 118

Slide 118 text

strace - print every system call made by a process

Slide 119

Slide 119 text

ltrace - same, but for C library calls

Slide 120

Slide 120 text

Opinion: the most useful debugging tool you can learn http://jvns.ca/blog/2015/04/14/strace-zine/

Slide 121

Slide 121 text

# strace su postgres -c 'echo "hello"' getuid() = 0 # ltrace su postgres -c 'echo "hello"' getlogin() = "chris" getpwnam_r(…) # blocks

Slide 122

Slide 122 text

# strace su postgres -c 'echo "hello"' getuid() = 0 # ltrace su postgres -c 'echo "hello"' getlogin() = "chris" getpwnam_r(…) # blocks !?

Slide 123

Slide 123 text

getlogin

Slide 124

Slide 124 text

the user name associated by the login activity with the controlling terminal of the current process

Slide 125

Slide 125 text

the user name associated by the login activity with the controlling terminal of the current process

Slide 126

Slide 126 text

$ ./test_getlogin User is: chris $ sudo -i # ./test_getlogin User is: chris

Slide 127

Slide 127 text

/var/run/utmp /proc//loginuid

Slide 128

Slide 128 text

# ps aux | grep '[p]acemakerd' | awk '{print $2}' 9001

Slide 129

Slide 129 text

# ps aux | grep '[p]acemakerd' | awk '{print $2}' 9001 # cat /proc/9001/loginuid 1234

Slide 130

Slide 130 text

# ps aux | grep '[p]acemakerd' | awk '{print $2}' 9001 # cat /proc/9001/loginuid 1234 # getent passwd 1234 | cut -d':' -f1 chris

Slide 131

Slide 131 text

No content

Slide 132

Slide 132 text

No content

Slide 133

Slide 133 text

No content

Slide 134

Slide 134 text

service pacemaker restart changes the user

Slide 135

Slide 135 text

Hold up. What is su doing?

Slide 136

Slide 136 text

The code is subtle

Slide 137

Slide 137 text

Getting a passwd struct for the calling user

Slide 138

Slide 138 text

uid = getuid() 0 (root)

Slide 139

Slide 139 text

uid = getuid() login = getlogin() 0 (root) chris

Slide 140

Slide 140 text

uid = getuid() login = getlogin() passwd = passwd_from_name(login) 0 (root) chris {…}

Slide 141

Slide 141 text

uid = getuid() login = getlogin() passwd = passwd_from_name(login) if (passwd.uid == uid) return passwd 0 (root) chris {…} 0 == 1234

Slide 142

Slide 142 text

uid = getuid() login = getlogin() passwd = passwd_from_name(login) if (passwd.uid == uid) return passwd else return passwd_from_uid(uid) 0 (root) chris {…} 0 == 1234

Slide 143

Slide 143 text

Only does anything if you have duplicate UIDs

Slide 144

Slide 144 text

Why does the login activity matter?

Slide 145

Slide 145 text

There may be Reasons™

Slide 146

Slide 146 text

Set appropriate bounds on the system

Slide 147

Slide 147 text

bind_timelimit

Slide 148

Slide 148 text

There’s a lot more than /usr/bin/postgresql

Slide 149

Slide 149 text

Run game days https://stripe.com/blog/game-day-exercises-at-stripe

Slide 150

Slide 150 text

3 things to do

Slide 151

Slide 151 text

Watch the logs Set appropriate bounds Check what your ORM’s up to

Slide 152

Slide 152 text

And two thoughts…

Slide 153

Slide 153 text

Operating a database is about much more than the database itself

Slide 154

Slide 154 text

https://twitter.com/cscotta/status/535898821914419200

Slide 155

Slide 155 text

Thank you '❤ @ChrisSinjo @GoCardlessEng

Slide 156

Slide 156 text

Questions? '❤ @ChrisSinjo @GoCardlessEng

Slide 157

Slide 157 text

We’re hiring '❤ @ChrisSinjo @GoCardlessEng

Slide 158

Slide 158 text

Image credits • Heart emoji - https://github.com/mozilla/fxemoji/blob/ 9f68ca9c5bc51521f9ffe284e00ba8b7308e2c41/svgs/ FirefoxEmoji/u2764-redheart.svg • Elephants - https://www.flickr.com/photos/makeitkenya/ 22047623331/ • Kitten - https://www.flickr.com/photos/aigle_dore/7787096102

Slide 159

Slide 159 text

References (Incident 1) • Safe Operations For High Volume PostgreSQL (Braintree) - https://www.braintreepayments.com/blog/safe-operations-for- high-volume-postgresql/ • Zero-downtime Postgres migrations - the hard parts - https:// gocardless.com/blog/zero-downtime-postgres-migrations-the- hard-parts/ • Zero-downtime Postgres migrations - a little help - https:// gocardless.com/blog/zero-downtime-postgres-migrations-a-little- help/

Slide 160

Slide 160 text

References (Incident 1) • Postgres 9.5 Client Connection Defaults (lock_timeout, statement_timeout) - https://www.postgresql.org/docs/9.5/ static/runtime-config-client.html • Postgres 9.6 Client Connection Defaults (idle_in_transaction_session_timeout) - https:// www.postgresql.org/docs/9.5/static/runtime-config-client.html • Postgres 9.5 Error Reporting and Logging (log_lock_waits) - https://www.postgresql.org/docs/9.5/static/runtime-config- logging.html

Slide 161

Slide 161 text

References (Incident 2) • Postgres 9.5 Resource Consumption (temp_file_limit) - https:// www.postgresql.org/docs/current/static/runtime-config- resource.html

Slide 162

Slide 162 text

References (Incident 3) • PgBouncer feature matrix for pooling modes - https:// wiki.postgresql.org/wiki/ PgBouncer#Feature_matrix_for_pooling_modes • The Troubleshooting Tales: issues scaling Postgres connections - https:// gocardless.com/blog/the-troubleshooting-tales-issues-scaling-postgres- connections/ • Rails commit that fixed client_min_messages issue - https://github.com/ rails/rails/commit/a456acb2f2af8365eb9151c7cd2d5a10c189d191 • Rick Branson's tweet about the same problem in Django - https:// twitter.com/rbranson/status/675005104701870080

Slide 163

Slide 163 text

References (Incident 4) • Postgres 9.5 Connections and Authentication (superuser_reserved_connections) - https:// www.postgresql.org/docs/9.5/static/runtime-config- connection.html

Slide 164

Slide 164 text

References (Incident 5) • Julia Evans' strace zine - http://jvns.ca/blog/2015/04/14/ strace-zine/ • Game Day Exercises at Stripe: Learning from `kill -9` - https:// stripe.com/blog/game-day-exercises-at-stripe

Slide 165

Slide 165 text

References (Incident 5) • Current code in `su` to look up user - https://github.com/ shadow-maint/shadow/blob/ ef45bb2496182b5df90ad0323bef75d1a5d69887/contrib/ pwdauth.c#L127 • Much older similar code in `su` (1994 or earlier) - https:// github.com/freebsd/freebsd/blob/ b5c3fb9427806b740f7df3e43a1513e1f5fa840b/usr.bin/su/ su.c#L255

Slide 166

Slide 166 text

References (Incident 5) • getuid manual - http://pubs.opengroup.org/onlinepubs/ 009695399/functions/getuid.html • getlogin manual - http://pubs.opengroup.org/onlinepubs/ 009695399/functions/getlogin.html • getpwnam manual - http://pubs.opengroup.org/onlinepubs/ 9699919799/functions/getpwnam.html • getpwuid manual - http://pubs.opengroup.org/onlinepubs/ 009695399/functions/getpwuid.html

Slide 167

Slide 167 text

References (Misc) • Practical Postmortems at Etsy - https://www.infoq.com/articles/ postmortems-etsy • Scott Andreas' tweet about investigating things - https:// twitter.com/cscotta/status/535898821914419200