
Zero-downtime Postgres upgrades (DoxLon Edition)


At GoCardless, we use Postgres as the primary store for data that matters - records of merchants, customers and payments.

As a payments API, it's important to our users that we maintain a high level of uptime. At the same time, we believe that performing upgrades is an important reality of running software in production - databases included. Even the most stable software has critical bugs from time to time, and you have to deploy patches.

When it came to Postgres, we found ourselves caught between our desire to minimise downtime and our need to keep our software stack up-to-date. Postgres doesn't ship with all the machinery you need to do zero-downtime upgrades, so we knew we had work to do.

In this talk, we'll look at the problems you face when upgrading Postgres without downtime, and at the automation we built to upgrade Postgres without the apps noticing.

Chris Sinjakli

January 26, 2017

Transcript

  1. Zero-downtime Postgres upgrades
    Restarting databases without the apps noticing
    @ChrisSinjo


  2. GOCARDLESS


  3. POST /cash/monies HTTP/1.1
    { amount: 100 }


  4. High per-request


  5. Uptime is



  7. Good durability guarantees


  8. Good durability guarantees
    Feature-cautious


  9. Good durability guarantees
    Feature-cautious
    Transactions are cool


  10. “Speak to this one node.”
    –Postgres


  11. Client
    Postgres


  12. Client
    Postgres
    Postgres
    Replication


  13. Client
    Postgres
    Postgres
    Replication


  14. Wake a human up


  15. Client
    Postgres
    Postgres
    Replication


  16. Client
    Postgres
    Postgres


  17. Client
    Postgres
    Postgres


  18. Client
    Postgres
    Postgres
    Replication


  19. Awful time-to-recovery
    Error-prone


  20. You gotta perform:
    - Many steps
    - In the right order
    - Perfectly


  21. Don’t make a
    tired
    SRE think


  22. Add automation


  23. Pacemaker
    A clustering tool


  24. Client
    Postgres
    Postgres
    Replication


  25. How do we know a
    node has failed?


  26. Jepsen
    https://aphyr.com/tags/jepsen


  27. https://aphyr.com/posts/317-jepsen-elasticsearch


  28. Client
    Postgres
    Postgres
    Replication


  29. Client
    Postgres
    Postgres
    Postgres
    Repl Repl

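The step from two Postgres nodes to three on the slides above is consistent with majority-based failure detection, although the slide text itself does not spell out the reasoning. As a minimal sketch under that assumption, the hypothetical helper `has_quorum` below shows why a two-node cluster cannot safely decide anything after a network partition, while a three-node cluster can.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """Return True if this side of a partition can see a strict majority.

    `reachable_nodes` counts the node doing the check plus every peer it
    can still talk to. Both names are illustrative, not Pacemaker's API.
    """
    return reachable_nodes > cluster_size // 2


# Two nodes, partitioned: each side sees only itself, so neither may act.
two_node_split = [has_quorum(1, 2), has_quorum(1, 2)]

# Three nodes, one isolated: the pair keeps quorum, the loner does not.
three_node_split = [has_quorum(2, 3), has_quorum(1, 3)]
```

With two nodes, neither side can tell "my peer died" from "I am cut off", so automatic failover risks two primaries. With three, at most one side of any split holds a majority.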

  30. Client
    Postgres
    Postgres
    Postgres Repl Repl
    Pacemaker Pacemaker Pacemaker


  31. Client
    Postgres
    Postgres
    Postgres Repl Repl
    Pacemaker Pacemaker Pacemaker
    VIP


  32. Client
    Postgres
    Postgres
    Postgres Repl Repl
    Pacemaker Pacemaker Pacemaker
    VIP


  33. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    VIP


  34. Postgres
    Postgres
    Postgres
    Repl
    Pacemaker Pacemaker Pacemaker
    Client
    VIP


  35. Postgres
    Postgres
    Postgres
    Repl
    Pacemaker Pacemaker Pacemaker
    Client
    VIP


  36. Postgres
    Postgres
    Postgres
    Repl
    Pacemaker Pacemaker Pacemaker
    Client
    VIP


  37. Client
    Postgres
    Postgres
    Postgres Repl
    Repl
    VIP
    Pacemaker Pacemaker Pacemaker


  38. $


  39. Seems hard,
    right?


  40. It kinda is


  41. You gotta know:
    - Postgres
    - Distributed systems
    - Pacemaker


  42. Get someone else
    to run it for you


  43. Client
    Postgres
    Postgres
    Postgres Repl Repl
    Pacemaker Pacemaker Pacemaker
    VIP


  44. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    VIP


  45. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    VIP


  46. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    VIP


  47. Every move means
    a connection reset


  48. Every move means
    dropped requests


  49. POST /cash/monies HTTP/1.1
    { amount: 100 }


  50. POST /cash/monies HTTP/1.1
    { amount: 100 }
    500 Internal Server Error


  51. What does this
    mean for upgrades?


  52. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    VIP


  53. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    9.4.9 9.4.9 9.4.9
    VIP


  54. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    9.4.9 9.4.9 9.4.9
    Repl Repl
    VIP


  55. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    9.4.10 9.4.9 9.4.10
    Repl Repl
    VIP


  56. Client
    Postgres
    Postgres
    Postgres Repl
    Repl
    VIP
    Pacemaker Pacemaker Pacemaker
    9.4.10 9.4.9 9.4.10


  57. Every upgrade means
    a connection reset


  58. Every upgrade means
    dropped requests


  59. POST /cash/monies HTTP/1.1
    { amount: 100 }
    500 Internal Server Error


  60. Solution:
    never upgrade



  62. Not upgrading is
    never
    an option


  63. Solution:
    never upgrade


  64. Solution:
    never upgrade


  65. Solution:
    ???


  66. 1 thing
    missing


  67. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    VIP


  68. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    PgBouncer
    PgBouncer PgBouncer
    VIP


  69. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    PgBouncer
    PgBouncer PgBouncer
    VIP


  70. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    PgBouncer
    PgBouncer PgBouncer
    VIP
    VIP


  71. PgBouncer has
    This One Weird Trick™


  72. PAUSE;


  73. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    PgBouncer
    PgBouncer PgBouncer
    VIP
    VIP


  74. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    PgBouncer
    PgBouncer PgBouncer
    VIP
    VIP
    PAUSE;


  75. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    PgBouncer
    PgBouncer PgBouncer
    VIP
    PAUSE;
    VIP


  76. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    PgBouncer
    PgBouncer PgBouncer
    VIP
    PAUSE;
    VIP

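The PAUSE; trick on the slides above can be sketched roughly as follows. PAUSE and RESUME are real PgBouncer admin console commands: PAUSE waits for in-flight work to finish and then holds new client queries until RESUME. Everything else here is an assumption for illustration: `run_admin_command` is a hypothetical callable that sends one command to a PgBouncer admin console, and `move_primary` is a hypothetical callable that makes the cluster move the Postgres primary. This is a sketch of the idea, not the automation described in the talk.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def paused(run_admin_command: Callable[[str], None]) -> Iterator[None]:
    """Keep PgBouncer paused for the duration of the with-block."""
    run_admin_command("PAUSE;")
    try:
        yield
    finally:
        # Always resume, even if the failover step raised, so clients
        # are not left queued behind a paused pooler indefinitely.
        run_admin_command("RESUME;")


def move_primary_while_paused(
    run_admin_command: Callable[[str], None],
    move_primary: Callable[[], None],
) -> None:
    """Move the primary while clients wait in PgBouncer instead of erroring."""
    with paused(run_admin_command):
        move_primary()
```

From the client's point of view the move then looks like a slow query rather than a connection reset, which is why the request on the earlier slides would no longer end in a 500.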

  77. So what does this
    mean for upgrades?


  78. Client
    Postgres
    Postgres
    Postgres
    Pacemaker Pacemaker Pacemaker
    PgBouncer
    PgBouncer PgBouncer
    VIP
    VIP


  79. Client
    Postgres
    Postgres
    Postgres
    PgBouncer
    PgBouncer PgBouncer
    VIP
    VIP


  80. Client
    Postgres
    Postgres
    Postgres
    PgBouncer
    PgBouncer PgBouncer
    VIP
    VIP
    9.4.10 9.4.9 9.4.10


  81. Client
    Postgres
    Postgres
    Postgres
    PgBouncer
    PgBouncer PgBouncer
    VIP
    VIP
    9.4.10 9.4.9 9.4.10
    PAUSE;


  82. Client
    Postgres
    Postgres
    Postgres
    PgBouncer
    PgBouncer PgBouncer
    VIP
    9.4.10 9.4.9 9.4.10
    VIP
    PAUSE;


  83. Client
    Postgres
    Postgres
    Postgres
    PgBouncer
    PgBouncer PgBouncer
    VIP
    9.4.10 9.4.9 9.4.10
    VIP
    RESUME;


  84. Client
    Postgres
    Postgres
    Postgres
    PgBouncer
    PgBouncer PgBouncer
    VIP
    9.4.10 9.4.10 9.4.10
    VIP
    RESUME;

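Read in order, the diagrams above suggest this sequence: upgrade the replicas first, pause PgBouncer, move the primary to an already-upgraded node, resume, then upgrade the old primary. A minimal sketch of that ordering follows. The function `rolling_upgrade_plan`, the node names, and the step wording are all hypothetical; the function only builds a list of human-readable steps and does not touch any real system.

```python
from typing import List


def rolling_upgrade_plan(primary: str, replicas: List[str], target: str) -> List[str]:
    """Return the ordered steps for upgrading every node to `target`."""
    if not replicas:
        raise ValueError("need at least one replica to fail over to")

    # Assumption: any upgraded replica is an acceptable new primary,
    # so this sketch simply picks the first one.
    new_primary = replicas[0]

    plan = [f"upgrade replica {name} to {target}" for name in replicas]
    plan += [
        "PAUSE; on PgBouncer",
        f"promote {new_primary} and move the VIP to it",
        "RESUME; on PgBouncer",
        f"upgrade old primary {primary} to {target}",
    ]
    return plan


plan = rolling_upgrade_plan("pg02", ["pg01", "pg03"], "9.4.10")
```

The old primary is upgraded last, after clients are already talking to the new one, so its restart is invisible to them.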

  85. $


  86. Caveats


  87. Minor versions


  88. 9.4.9 → 9.4.10

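The "minor versions" caveat matters because Postgres streaming replication only works between servers of the same major version. An automation script could guard against this with a check like the one sketched below. The helper names `major_version` and `is_minor_upgrade` are hypothetical; the version-numbering rule in the comment is Postgres's own (two-part major versions before 10, one-part from 10 onwards).

```python
from typing import Tuple


def _parse(version: str) -> Tuple[int, ...]:
    """Turn '9.4.10' into (9, 4, 10) so comparisons are numeric, not textual."""
    return tuple(int(part) for part in version.split("."))


def major_version(version: str) -> Tuple[int, ...]:
    parts = _parse(version)
    # Before PostgreSQL 10 the major version is the first two components
    # (for example 9.4); from 10 onwards it is the first component alone.
    return parts[:2] if parts[0] < 10 else parts[:1]


def is_minor_upgrade(current: str, target: str) -> bool:
    """True if `target` is a newer release within the same major version."""
    same_major = major_version(current) == major_version(target)
    return same_major and _parse(target) > _parse(current)
```

Comparing integer tuples rather than strings matters for exactly the upgrade on the slide: as plain text, "9.4.10" sorts before "9.4.9".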

  89. pglogical


  90. Minor versions
    Long-running transactions


  91. while(running_queries):
        if(now > timeout):
          abandon_migration
        else:
          sleep(0.1)
      promote_new_primary

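The pseudocode on the slide above can be turned into a runnable sketch like this one. The three callables are hypothetical stand-ins for the slide's `running_queries`, `promote_new_primary` and `abandon_migration`. The default timeout of 5 seconds is an assumption for illustration, since the slide does not give a value. The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting.

```python
import time
from typing import Callable


def wait_then_promote(
    has_running_queries: Callable[[], bool],
    promote_new_primary: Callable[[], None],
    abandon_migration: Callable[[], None],
    timeout_s: float = 5.0,          # assumed value, not from the talk
    poll_interval_s: float = 0.1,    # matches sleep(0.1) on the slide
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Wait for in-flight queries to drain, then promote.

    Returns True if the new primary was promoted, False if the
    migration was abandoned because queries outlived the timeout.
    """
    deadline = clock() + timeout_s
    while has_running_queries():
        if clock() > deadline:
            abandon_migration()
            return False  # never promote after giving up
        sleep(poll_interval_s)
    promote_new_primary()
    return True
```

Returning early after `abandon_migration` makes explicit something the flattened slide text leaves ambiguous: a timed-out migration must not fall through to promotion.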

  92. Minor versions
    Long-running transactions
    Pause length


  93. 7-10s total


  94. $


  95. One more thing…
    (#sorrynotsorry)


  96. github.com/gocardless/our-postgresql-setup


  97. We’re hiring
    ❤
    @ChrisSinjo
    @GoCardlessEng


  98. Thank you
    ❤
    @ChrisSinjo
    @GoCardlessEng


  99. Questions?
    ❤
    @ChrisSinjo
    @GoCardlessEng
