Slide 1

Slide 1 text

Downtimeless PostgreSQL server replacement Maciej Pasternacki Warsaw PostgreSQL User Group, 2019-02-07 @mpasternacki 3ofcoins

Slide 2

Slide 2 text

Maciej Pasternacki • Freelance Web infrastructure / DevOps engineer since 2010 • Automation, Configuration Management, Infrastructure As Code • PostgreSQL user & admin

Slide 3

Slide 3 text

Codility […] helps tech recruiters and hiring managers assess their candidates’ skills by testing their code online. — codility.com

Slide 4

Slide 4 text

No such thing as a “5–minute downtime” Scheduled downtime cannot happen during candidate’s test session. New test cannot be started even 2 hours before planned maintenance. Still need to do regular system patching, configuration changes, …

Slide 5

Slide 5 text

Most servers can be replaced without downtime Load balancer App App App Worker

Slide 6

Slide 6 text

Most servers can be replaced without downtime Load balancer App App App Worker What about database?

Slide 7

Slide 7 text

Database Setup client pgbouncer database host

Slide 8

Slide 8 text

PgBouncer Connection pooling PostgreSQL proxy • Reduces overhead of new client connection • Must–have for webapp workload https://pgbouncer.github.io/

Slide 9

Slide 9 text

PgBouncer Connection pooling PostgreSQL proxy • Reduces overhead of new client connection • Must–have for webapp workload • Can pause client connections https://pgbouncer.github.io/

Slide 10

Slide 10 text

PgBouncer Connection pooling PostgreSQL proxy • Reduces overhead of new client connection • Must–have for webapp workload • Can pause client connections • Can do live configuration reload https://pgbouncer.github.io/

Slide 11

Slide 11 text

PgBouncer The PAUSE command • Tries to disconnect from all servers, first waiting for all current queries (transactions / sessions) to complete • Returns after servers have been safely disconnected • New client connections will wait until RESUME is called https://pgbouncer.github.io/usage.html

Slide 12

Slide 12 text

The General Idea client pgbouncer master db pgbouncer standby replica replication

Slide 13

Slide 13 text

The General Idea client pgbouncer master db pgbouncer standby replica replication PAUSE

Slide 14

Slide 14 text

The General Idea client pgbouncer master db pgbouncer replication Promote

Slide 15

Slide 15 text

The General Idea client pgbouncer master db pgbouncer new master replication

Slide 16

Slide 16 text

The General Idea client pgbouncer master db pgbouncer new master replication reconfigure

Slide 17

Slide 17 text

The General Idea client pgbouncer master db pgbouncer new master replication RESUME

Slide 18

Slide 18 text

The General Idea client pgbouncer master db pgbouncer new master replication

Slide 19

Slide 19 text

The General Idea client pgbouncer master db pgbouncer new master replication

Slide 20

Slide 20 text

The General Idea 1. PAUSE pgbouncer 2. Promote standby replica to new master 3. Reconfigure pgbouncer for new master 4. RESUME pgbouncer 5. Reconfigure clients one by one to use new server’s pgbouncer

Slide 21

Slide 21 text

What Could Possibly Go Wrong? 1. PAUSE pgbouncer 2. Promote standby replica to new master 3. Reconfigure pgbouncer for new master 4. RESUME pgbouncer 5. Reconfigure clients one by one to use new server’s pgbouncer

Slide 22

Slide 22 text

What Could Possibly Go Wrong? • PAUSE hangs because some client holds transaction • Replica’s lagging, promoted too early • Replica’s not even a replica • New server broken or misconfigured • Failure leaves pgbouncer paused Too long PAUSE is as bad as “real” downtime!

Slide 23

Slide 23 text

The Revised Idea 1. Stop all non–essential db clients 2. With 5s timeout, catching errors, try: 2.1 PAUSE clients (2s timeout) 2.2 Wait for replica to match master WAL position 2.3 Promote replica, wait until writable 2.4 Reconfigure pgbouncer to use replica 3. Automatically RESUME clients on success or failure (worst case: they continue on the old server)

Slide 24

Slide 24 text

What Could Possibly Go Wrong? 1. Stop all non–essential db clients 2. With 5s timeout, catching errors, try: 2.1 PAUSE clients (2s timeout) 2.2 Wait for replica to match master WAL position 2.3 Promote replica, wait until writable 2.4 Reconfigure pgbouncer to use replica 3. Automatically RESUME clients on success or failure (worst case: they continue on the old server)

Slide 25

Slide 25 text

What Could Possibly Go Wrong? The procedure is safe – failure won’t affect user. But it will affect us! Failure during the procedure means we have to clean up. Refusing to start is cheaper than failing.

Slide 26

Slide 26 text

Early sanity checks Before doing anything: • Show list of database processes to review • Open and test SSH to replica • Open and test admin connection to pgbouncer and both PostgreSQLs • Test replica’s pgbouncer • Check that replica is connected to master and doesn’t lag

Slide 27

Slide 27 text

The Golden Rules • Fail safe. It will break. Plan for that, don’t let that affect the user. • Look before you leap. The earlier you fail, the easier it is to clean up. Don’t even start if you see you won’t finish.