Zero-downtime Postgres upgrades (PGDay UK edition)

At GoCardless, we use Postgres as the primary store for data that matters - records of merchants, customers and payments.

As a payments API, it's important to our users that we maintain a high level of uptime. At the same time, we believe that performing upgrades is an important reality of running software in production - databases included. Even the most stable software has critical bugs from time to time, and you have to deploy patches.

When it came to Postgres, we found ourselves caught between our desire to minimise downtime and our need to keep our software stack up-to-date. Postgres doesn't ship with all the machinery you need to do zero-downtime upgrades, so we knew we had work to do.

In the talk, we'll look at the problems faced when trying to upgrade Postgres without downtime, and explore our approach to building automation to upgrade Postgres without the apps noticing.

Chris Sinjakli

July 04, 2017

Transcript

  1. Zero-downtime Postgres upgrades Restarting databases without the apps noticing @ChrisSinjo

  2. GOCARDLESS

  3. POST /cash/monies HTTP/1.1 { amount: 100 }

  4. High per-request

  5. Uptime is

  6. [no text on slide]

  7. Good durability guarantees

  8. Good durability guarantees Feature-cautious

  9. Good durability guarantees Feature-cautious Transactions are cool

  10. “Speak to this one node.” (Postgres)

  11. Client Postgres

  12. Client Postgres Postgres Replication

  13. Client Postgres Postgres Replication

  14. Wake a human up

  15. Client Postgres Postgres Replication

  16. Client Postgres Postgres

  17. Client Postgres Postgres

  18. Client Postgres Postgres Replication

  19. Awful time-to-recovery Error-prone

  20. You gotta perform: - Many steps - In the right order - Perfectly

  21. Don’t make a tired SRE think

  22. Add automation

  23. Pacemaker A clustering tool

  24. Client Postgres Postgres Replication

  25. How do we know a node has failed?

  26. B A

  27. B A

  28. B A ? ?

  29. B A

  30. B A

  31. B A

  32. The system cannot progress safely

  33. Quorum

  34. A majority of nodes must be available

  35. $(n+1)/2$, rounded up

  36. $\lceil (n+1)/2 \rceil$

  37. Some numbers

  38. Nodes → Quorum: 2 → 2

  39. Nodes → Quorum: 2 → 2, 3 → 2

  40. Nodes → Quorum: 2 → 2, 3 → 2, 4 → 3

  41. Nodes → Quorum: 2 → 2, 3 → 2, 4 → 3, 5 → 3
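
The quorum rule on the slides above can be checked with a few lines of Python. This is a minimal sketch rather than anything from the talk: the function names `quorum` and `can_progress` are made up for illustration, and the integer expression `nodes // 2 + 1` is used because it equals (n+1)/2 rounded up for every positive n.

```python
def quorum(nodes: int) -> int:
    """Smallest majority of a cluster: (n + 1) / 2 rounded up."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    # Integer form of ceil((n + 1) / 2); avoids floating point entirely.
    return nodes // 2 + 1


def can_progress(reachable: int, nodes: int) -> bool:
    """A group of nodes may act (e.g. promote a primary) only with quorum."""
    return reachable >= quorum(nodes)


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(f"{n} nodes -> quorum of {quorum(n)}")
```

With two nodes, losing either one leaves a single node without quorum, which is why the two-node picture cannot progress safely. With three nodes, any pair can still see a majority.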

  42. B A

  43. B A C

  44. B A C

  45. B A C

  46. But…

  47. B A C

  48. B A C

  49. It gets complicated

  50. Jepsen https://aphyr.com/tags/jepsen

  51. https://aphyr.com/posts/317-jepsen-elasticsearch

  52. Client Postgres Postgres Replication

  53. Client Postgres Postgres Postgres Repl Repl

  54. Client Postgres Postgres Postgres Repl Repl Pacemaker Pacemaker Pacemaker

  55. Client Postgres Postgres Postgres Repl Repl Pacemaker Pacemaker Pacemaker VIP

  56. Client Postgres Postgres Postgres Repl Repl Pacemaker Pacemaker Pacemaker VIP

  57. Client Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker VIP

  58. Postgres Postgres Postgres Repl Pacemaker Pacemaker Pacemaker Client VIP

  59. Postgres Postgres Postgres Repl Pacemaker Pacemaker Pacemaker Client VIP

  60. Postgres Postgres Postgres Repl Pacemaker Pacemaker Pacemaker Client VIP

  61. Client Postgres Postgres Postgres Repl Repl VIP Pacemaker Pacemaker Pacemaker

  62. $

  63. Seems hard, right?

  64. It kinda is

  65. You gotta know: - Postgres - Distributed systems - Pacemaker

  66. Get someone else to run it for you

  67. Client Postgres Postgres Postgres Repl Repl Pacemaker Pacemaker Pacemaker VIP

  68. Client Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker VIP

  69. Client Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker VIP

  70. Client Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker VIP

  71. Every move means a connection reset

  72. Every move means dropped requests

  73. POST /cash/monies HTTP/1.1 { amount: 100 }

  74. POST /cash/monies HTTP/1.1 { amount: 100 } 500 Internal Server Error

  75. What does this mean for upgrades?

  76. Client Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker VIP

  77. Client Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker 9.4.9 9.4.9 9.4.9 VIP

  78. Client Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker 9.4.9 9.4.9 9.4.9 Repl Repl VIP

  79. Client Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker 9.4.10 9.4.9 9.4.10 Repl Repl VIP

  80. Client Postgres Postgres Postgres Repl Repl VIP Pacemaker Pacemaker Pacemaker 9.4.10 9.4.9 9.4.10

  81. Every upgrade means a connection reset

  82. Every upgrade means dropped requests

  83. POST /cash/monies HTTP/1.1 { amount: 100 } 500 Internal Server Error

  84. Solution: never upgrade

  85. [no text on slide]

  86. Not upgrading is never an option

  87. Solution: never upgrade

  88. Solution: never upgrade

  89. Solution: ???

  90. 1 thing missing

  91. Client Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker VIP

  92. Client Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker PgBouncer PgBouncer PgBouncer VIP

  93. Client Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker PgBouncer PgBouncer PgBouncer VIP

  94. Client Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker PgBouncer PgBouncer PgBouncer VIP VIP

  95. PgBouncer has This One Weird Trick™

  96. PAUSE;

  97. Client Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker PgBouncer PgBouncer PgBouncer VIP VIP

  98. Client Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker PgBouncer PgBouncer PgBouncer VIP VIP PAUSE;

  99. Client Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker PgBouncer PgBouncer PgBouncer VIP PAUSE; VIP

  100. Client Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker PgBouncer PgBouncer PgBouncer VIP PAUSE; VIP

  101. So what does this mean for upgrades?

  102. Client Postgres Postgres Postgres Pacemaker Pacemaker Pacemaker PgBouncer PgBouncer PgBouncer VIP VIP

  103. Client Postgres Postgres Postgres PgBouncer PgBouncer PgBouncer VIP VIP

  104. Client Postgres Postgres Postgres PgBouncer PgBouncer PgBouncer VIP VIP 9.4.10 9.4.9 9.4.10

  105. Client Postgres Postgres Postgres PgBouncer PgBouncer PgBouncer VIP VIP 9.4.10 9.4.9 9.4.10 PAUSE;

  106. Client Postgres Postgres Postgres PgBouncer PgBouncer PgBouncer VIP 9.4.10 9.4.9 9.4.10 VIP PAUSE;

  107. Client Postgres Postgres Postgres PgBouncer PgBouncer PgBouncer VIP 9.4.10 9.4.9 9.4.10 VIP RESUME;

  108. Client Postgres Postgres Postgres PgBouncer PgBouncer PgBouncer VIP 9.4.10 9.4.10 9.4.10 VIP RESUME;

  109. $
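
The upgrade sequence on the preceding slides (upgrade the replicas, `PAUSE;`, move the primary, `RESUME;`, upgrade the old primary) could be sketched roughly as below. This is an illustration under stated assumptions, not the automation from the talk: all four callables are hypothetical hooks that you would supply, and only the `PAUSE;` and `RESUME;` strings are real PgBouncer admin console commands.

```python
from typing import Callable


def upgrade_with_pause(
    upgrade_replicas: Callable[[], None],      # hypothetical: patch the non-primary nodes
    pgbouncer_admin: Callable[[str], None],    # hypothetical: run one PgBouncer admin command
    move_primary: Callable[[], None],          # hypothetical: promote a patched node, move the VIP
    upgrade_old_primary: Callable[[], None],   # hypothetical: patch the node that was primary
) -> None:
    """Upgrade every node while clients only see a short pause."""
    upgrade_replicas()

    # PAUSE makes PgBouncer hold client queries instead of failing them.
    pgbouncer_admin("PAUSE;")
    try:
        move_primary()
    finally:
        # Assumption: always resume, even if the move failed, so clients
        # are never left queued behind a paused pooler.
        pgbouncer_admin("RESUME;")

    upgrade_old_primary()
```

Wrapping the move in `try`/`finally` is a design choice of this sketch: a failed move surfaces as an exception, but traffic is released either way.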

  110. Caveats

  111. Minor versions

  112. 9.4.9 → 9.4.10

  113. pglogical

  114. Minor versions Long-running transactions

  115. while(running_queries):
         if(now > timeout):
           abandon_migration
         else:
           sleep(0.1)
       promote_new_primary
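
A runnable take on the pseudocode from this slide might look like the following. It is a sketch with assumptions labelled: `count_running_queries` and `promote_new_primary` are hypothetical callables standing in for real checks against Postgres, the 5-second default timeout is invented for illustration, and promotion is assumed to happen only after the wait loop ends.

```python
import time
from typing import Callable


def promote_when_idle(
    count_running_queries: Callable[[], int],   # hypothetical: queries still in flight
    promote_new_primary: Callable[[], None],    # hypothetical: performs the promotion
    timeout_seconds: float = 5.0,               # assumed value, not from the talk
    poll_interval: float = 0.1,                 # matches sleep(0.1) on the slide
) -> bool:
    """Wait for in-flight queries to drain, then promote.

    Returns True if the promotion ran, False if the migration was abandoned
    because queries were still running when the timeout expired.
    """
    deadline = time.monotonic() + timeout_seconds
    while count_running_queries() > 0:
        if time.monotonic() > deadline:
            return False  # abandon_migration
        time.sleep(poll_interval)
    promote_new_primary()
    return True
```

Returning a boolean instead of raising keeps the caller in charge of what "abandon" means, for example resuming traffic and trying again later.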

  116. Minor versions Long-running transactions Pause length

  117. 7-10s total

  118. $

  119. One more thing… (#sorrynotsorry)

  120. github.com/gocardless/our-postgresql-setup

  121. We’re hiring ❤ @ChrisSinjo @GoCardlessEng

  122. Thank you ❤ @ChrisSinjo @GoCardlessEng

  123. Questions? ❤ @ChrisSinjo @GoCardlessEng