Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Downtimeless PostgreSQL server replacement

Downtimeless PostgreSQL server replacement

Talk given at last Warsaw PostgreSQL Users Group meetup: https://www.meetup.com/Warsaw-PostgreSQL-Users-Group/events/257956915/

Maciej Pasternacki

February 07, 2019
Tweet

More Decks by Maciej Pasternacki

Other Decks in Programming

Transcript

  1. Maciej Pasternacki • Freelance Web infrastructure / DevOps engineer since

    2010 • Automation, Configuration Management, Infrastructure As Code • PostgreSQL user & admin
  2. Codility […] helps tech recruiters and hiring managers assess their

    candidates’ skills by testing their code online. — codility.com
  3. No such thing as a “5–minute downtime” Scheduled downtime cannot

    happen during candidate’s test session. New test cannot be started even 2 hours before planned maintenance. Still need to do regular system patching, configuration changes, …
  4. PgBouncer Connection pooling PostgreSQL proxy • Reduces overhead of new

    client connection • Must–have for webapp workload https://pgbouncer.github.io/
  5. PgBouncer Connection pooling PostgreSQL proxy • Reduces overhead of new

    client connection • Must–have for webapp workload • Can pause client connections https://pgbouncer.github.io/
  6. PgBouncer Connection pooling PostgreSQL proxy • Reduces overhead of new

    client connection • Must–have for webapp workload • Can pause client connections • Can do live configuration reload https://pgbouncer.github.io/
  7. PgBouncer The PAUSE command • Tries to disconnect from all

    servers, first waiting for all current queries (transactions / sessions) to complete • Returns after servers have been safely disconnected • New client connections will wait until RESUME is called https://pgbouncer.github.io/usage.html
  8. The General Idea 1. PAUSE pgbouncer 2. Promote standby replica

    to new master 3. Reconfigure pgbouncer for new master 4. RESUME pgbouncer 5. Reconfigure clients one by one to use new server’s pgbouncer
  9. What Could Possibly Go Wrong? 1. PAUSE pgbouncer 2. Promote

    standby replica to new master 3. Reconfigure pgbouncer for new master 4. RESUME pgbouncer 5. Reconfigure clients one by one to use new server’s pgbouncer
  10. What Could Possibly Go Wrong? • PAUSE hangs because some

    client holds transaction • Replica’s lagging, promoted too early • Replica’s not even a replica • New server broken or misconfigured • Failure leaves pgbouncer paused Too long PAUSE is as bad as “real” downtime!
  11. The Revised Idea 1. Stop all non–essential db clients 2.

    With 5s timeout, catching errors, try: 2.1 PAUSE clients (2s timeout) 2.2 Wait for replica to match master WAL position 2.3 Promote replica, wait until writable 2.4 Reconfigure pgbouncer to use replica 3. Automatically RESUME clients on success or failure (worst case: they continue on the old server)
  12. What Could Possibly Go Wrong? 1. Stop all non–essential db

    clients 2. With 5s timeout, catching errors, try: 2.1 PAUSE clients (2s timeout) 2.2 Wait for replica to match master WAL position 2.3 Promote replica, wait until writable 2.4 Reconfigure pgbouncer to use replica 3. Automatically RESUME clients on success or failure (worst case: they continue on the old server)
  13. What Could Possibly Go Wrong? The procedure is safe –

    failure won’t affect user. But it will affect us! Failure during the procedure means we have to clean up. Refusing to start is cheaper than failing.
  14. Early sanity checks Before doing anything: • Show list of

    database processes to review • Open and test SSH to replica • Open and test admin connection to pgbouncer and both PostgreSQLs • Test replica’s pgbouncer • Check that replica is connected to master and doesn’t lag
  15. The Golden Rules • Fail safe. It will break. Plan

    for that, don’t let that affect the user. • Look before you leap. The earlier you fail, the easier it is to clean up. Don’t even start if you see you won’t finish.