Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Zero-downtime Postgres upgrades (PGDay UK edition)

Zero-downtime Postgres upgrades (PGDay UK edition)

At GoCardless, we use Postgres as the primary store for data that matters - records of merchants, customers and payments.

As a payments API, it's important to our users that we maintain a high level of uptime. At the same time, we believe that performing upgrades is an important reality of running software in production - databases included. Even the most stable software has critical bugs from time to time, and you have to deploy patches.

When it came to Postgres, we found ourselves caught between our desire to minimise downtime and our need to keep our software stack up-to-date. Postgres doesn't ship with all the machinery you need to do zero-downtime upgrades, so we knew we had work to do.

In the talk, we'll look at the problems faced when trying to upgrade Postgres without downtime, and explore our approach to building automation to upgrade Postgres without the apps noticing.

Chris Sinjakli

July 04, 2017
Tweet

More Decks by Chris Sinjakli

Other Decks in Programming

Transcript

  1. Zero-downtime Postgres upgrades
    Restarting databases without the apps noticing
    @ChrisSinjo

    View Slide

  2. GOCARDLESS

    View Slide

  3. POST /cash/monies HTTP/1.1
    { amount: 100 }

    View Slide

  4. High per-request

    View Slide

  5. Uptime is

    View Slide

  6. View Slide

  7. Good durability guarantees

    View Slide

  8. Good durability guarantees
    Feature-cautious

    View Slide

  9. Good durability guarantees
    Feature-cautious
    Transactions are cool

    View Slide

  10. –Postgres
    “Speak to this one node.”

    View Slide

  11. Client
    Postgres

    View Slide

  12. Client
    Postgres
    Postgres
    Replication

    View Slide

  13. Client
    Postgres
    Postgres
    Replication

    View Slide

  14. Wake a human up

    View Slide

  15. Client
    Postgres
    Postgres
    Replication

    View Slide

  16. Client
    Postgres
    Postgres

    View Slide

  17. Client
    Postgres
    Postgres

    View Slide

  18. Client
    Postgres
    Postgres
    Replication

    View Slide

  19. Awful time-to-recovery
    Error-prone

    View Slide

  20. You gotta perform:
    - Many steps
    - In the right order
    - Perfectly

    View Slide

  21. Don’t make a
    tired
    SRE think

    View Slide

  22. Add automation

    View Slide

  23. Pacemaker
    A clustering tool

    View Slide

  24. Client
    Postgres
    Postgres
    Replication

    View Slide

  25. How do we know a
    node has failed?

    View Slide

  26. B
    A

    View Slide

  27. B
    A

    View Slide

  28. B
    A
    ? ?

    View Slide

  29. B
    A

    View Slide

  30. B
    A

    View Slide

  31. B
    A

    View Slide

  32. The system cannot
    progress safely

    View Slide

  33. Quorum

    View Slide

  34. A majority of nodes
    must be available

    View Slide

  35. n+1
    2
    ( )
    round
    up

    View Slide

  36. n+1
    2
    ⌈ ⌉

    View Slide

  37. Some numbers

    View Slide

  38. Nodes Quorum
    2 2

    View Slide

  39. Nodes Quorum
    2
    3
    2
    2

    View Slide

  40. Nodes Quorum
    2
    3
    4
    2
    2
    3

    View Slide

  41. Nodes Quorum
    2
    3
    4
    5
    2
    2
    3
    3

    View Slide

  42. B
    A

    View Slide

  43. B
    A
    C

    View Slide

  44. B
    A
    C

    View Slide

  45. B
    A
    C

    View Slide

  46. But…

    View Slide

  47. B
    A
    C

    View Slide

  48. B
    A
    C

    View Slide

  49. It gets
    complicated

    View Slide

  50. Jepsen
    https://aphyr.com/tags/jepsen

    View Slide

  51. https://aphyr.com/posts/317-jepsen-elasticsearch

    View Slide

  52. Client
    Postgres
    Postgres
    Replication

    View Slide

  53. Client
    Postgres
    Postgres
    Postgres
    Repl Repl

    View Slide

  54. Client
    Postgres
    Postgres
    Postgres Repl Repl
    Pacemaker Pacemaker Pacemaker

    View Slide

  55. Client
    Postgres
    Postgres
    Postgres Repl Repl
    Pacemaker Pacemaker Pacemaker
    VIP

    View Slide

  56. Client
    Postgres
    Postgres
    Postgres Repl Repl
    Pacemaker Pacemaker Pacemaker
    VIP

    View Slide

  57. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    VIP

    View Slide

  58. Postgres
    Postgres
    Postgres
    Repl
    Pacemaker Pacemaker Pacemaker
    Client
    VIP

    View Slide

  59. Postgres
    Postgres
    Postgres
    Repl
    Pacemaker Pacemaker Pacemaker
    Client
    VIP

    View Slide

  60. Postgres
    Postgres
    Postgres
    Repl
    Pacemaker Pacemaker Pacemaker
    Client
    VIP

    View Slide

  61. Client
    Postgres
    Postgres
    Postgres Repl
    Repl
    VIP
    Pacemaker Pacemaker Pacemaker

    View Slide

  62. $

    View Slide

  63. Seems hard,
    right?

    View Slide

  64. It kinda is

    View Slide

  65. You gotta know:
    - Postgres
    - Distributed systems
    - Pacemaker

    View Slide

  66. Get someone else
    to run it for you

    View Slide

  67. Client
    Postgres
    Postgres
    Postgres Repl Repl
    Pacemaker Pacemaker Pacemaker
    VIP

    View Slide

  68. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    VIP

    View Slide

  69. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    VIP

    View Slide

  70. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    VIP

    View Slide

  71. Every move means
    a connection reset

    View Slide

  72. Every move means
    dropped requests

    View Slide

  73. POST /cash/monies HTTP/1.1
    { amount: 100 }

    View Slide

  74. POST /cash/monies HTTP/1.1
    { amount: 100 }
    500 Internal Server Error

    View Slide

  75. What does this
    mean for upgrades?

    View Slide

  76. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    VIP

    View Slide

  77. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    9.4.9 9.4.9 9.4.9
    VIP

    View Slide

  78. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    9.4.9 9.4.9 9.4.9
    Repl Repl
    VIP

    View Slide

  79. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    9.4.10 9.4.9 9.4.10
    Repl Repl
    VIP

    View Slide

  80. Client
    Postgres
    Postgres
    Postgres Repl
    Repl
    VIP
    Pacemaker Pacemaker Pacemaker
    9.4.10 9.4.9 9.4.10

    View Slide

  81. Every upgrade means
    a connection reset

    View Slide

  82. Every upgrade means
    dropped requests

    View Slide

  83. POST /cash/monies HTTP/1.1
    { amount: 100 }
    500 Internal Server Error

    View Slide

  84. Solution:
    never upgrade

    View Slide

  85. View Slide

  86. Not upgrading is
    never
    an option

    View Slide

  87. Solution:
    never upgrade

    View Slide

  88. Solution:
    never upgrade

    View Slide

  89. Solution:
    ???

    View Slide

  90. 1thing
    missing

    View Slide

  91. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    VIP

    View Slide

  92. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    PgBouncer
    PgBouncer PgBouncer
    VIP

    View Slide

  93. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    PgBouncer
    PgBouncer PgBouncer
    VIP

    View Slide

  94. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    PgBouncer
    PgBouncer PgBouncer
    VIP
    VIP

    View Slide

  95. PgBouncer has
    This One Weird Trick™

    View Slide

  96. PAUSE;

    View Slide

  97. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    PgBouncer
    PgBouncer PgBouncer
    VIP
    VIP

    View Slide

  98. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    PgBouncer
    PgBouncer PgBouncer
    VIP
    VIP
    PAUSE;

    View Slide

  99. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    PgBouncer
    PgBouncer PgBouncer
    VIP
    PAUSE;
    VIP

    View Slide

  100. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    PgBouncer
    PgBouncer PgBouncer
    VIP
    PAUSE;
    VIP

    View Slide

  101. So what does this
    mean for upgrades?

    View Slide

  102. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    PgBouncer
    PgBouncer PgBouncer
    VIP
    VIP

    View Slide

  103. Client
    Postgres
    Postgres
    Postgres
    PgBouncer
    PgBouncer PgBouncer
    VIP
    VIP

    View Slide

  104. Client
    Postgres
    Postgres
    Postgres
    PgBouncer
    PgBouncer PgBouncer
    VIP
    VIP
    9.4.10 9.4.9 9.4.10

    View Slide

  105. Client
    Postgres
    Postgres
    Postgres
    PgBouncer
    PgBouncer PgBouncer
    VIP
    VIP
    9.4.10 9.4.9 9.4.10
    PAUSE;

    View Slide

  106. Client
    Postgres
    Postgres
    Postgres
    PgBouncer
    PgBouncer PgBouncer
    VIP
    9.4.10 9.4.9 9.4.10
    VIP
    PAUSE;

    View Slide

  107. Client
    Postgres
    Postgres
    Postgres
    PgBouncer
    PgBouncer PgBouncer
    VIP
    9.4.10 9.4.9 9.4.10
    VIP
    RESUME;

    View Slide

  108. Client
    Postgres
    Postgres
    Postgres
    PgBouncer
    PgBouncer PgBouncer
    VIP
    9.4.10 9.4.10 9.4.10
    VIP
    RESUME;

    View Slide

  109. $

    View Slide

  110. Caveats

    View Slide

  111. Minor versions

    View Slide

  112. 9.4.9 → 9.4.10

    View Slide

  113. pglogical

    View Slide

  114. Minor versions
    Long-running transactions

    View Slide

  115. while(running_queries):
    if(now > timeout):
    abandon_migration
    else:
    sleep(0.1)
    promote_new_primary

    View Slide

  116. Minor versions
    Long-running transactions
    Pause length

    View Slide

  117. 7-10s total

    View Slide

  118. $

    View Slide

  119. One more thing…
    (#sorrynotsorry)

    View Slide

  120. github.com/gocardless/our-postgresql-setup

    View Slide

  121. We’re hiring
    '❤
    @ChrisSinjo
    @GoCardlessEng

    View Slide

  122. Thank you
    '❤
    @ChrisSinjo
    @GoCardlessEng

    View Slide

  123. Questions?
    '❤
    @ChrisSinjo
    @GoCardlessEng

    View Slide