Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Lessons Learned the Hard Way: Postgres in Production at GoCardless (PGConf US Edition)

Lessons Learned the Hard Way: Postgres in Production at GoCardless (PGConf US Edition)

**New content in this edition: better answers for incident 5**

From the beginning, GoCardless has used Postgres for the stuff that matters. It’s the primary store behind our APIs and dashboards.

After 5 years, it’s fair to say we’ve seen some things. From spikes in latency as resources run out, through the weird contortions ORMs put your database through, and on to the perils of running a high-availability cluster - this talk has it all.

In the talk, we’ll run through some war stories from the last two years at GoCardless. By gathering them together, we’ll spot some recurring themes, and hopefully avoid fighting the same battles again!

Chris Sinjakli

March 29, 2017
Tweet

More Decks by Chris Sinjakli

Other Decks in Programming

Transcript

  1. Lessons learned the hard way
    Postgres in production at GoCardless
    @ChrisSinjo

    View Slide

  2. GOCARDLESS

    View Slide

  3. POST /cash/monies HTTP/1.1
    { amount: 100 }

    View Slide

  4. High per-request

    View Slide

  5. Uptime is

    View Slide

  6. You will have
    different concerns

    View Slide

  7. The distributed system
    as a whole

    View Slide

  8. View Slide

  9. View Slide

  10. View Slide

  11. View Slide

  12. Learning through the
    medium of post-mortems

    View Slide

  13. What is a post-mortem?
    https://www.infoq.com/articles/postmortems-etsy

    View Slide

  14. Failure can be a
    great teacher

    View Slide

  15. 5 stories

    View Slide

  16. Resource utilisation

    View Slide

  17. Resource utilisation
    ORMs

    View Slide

  18. Resource utilisation
    ORMs
    Clustering

    View Slide

  19. foreach(incident):
    What happened
    Something we learned
    Ideas you can use

    View Slide

  20. 3 things to do

    View Slide

  21. Resource utilisation
    ORMs
    Clustering

    View Slide

  22. Incident 1
    The Fast Migration
    That Wasn’t

    View Slide

  23. ~10-20 schema changes
    per-month

    View Slide

  24. Problem: locks

    View Slide

  25. Many guides on safe
    schema changes

    View Slide

  26. Don’t ADD COLUMN with DEFAULT

    View Slide

  27. Don’t ADD COLUMN with DEFAULT
    VALIDATE constraints after adding

    View Slide

  28. Don’t ADD COLUMN with DEFAULT
    VALIDATE constraints after adding
    Add indexes CONCURRENTLY

    View Slide

  29. ALTER TABLE payments
    ADD COLUMN refunded boolean;
    No DEFAULT

    View Slide

  30. The lock queue

    View Slide

  31. -- Slow query (AccessShare)
    SELECT DISTINCT(customer_id) FROM payments;

    View Slide

  32. -- Slow query (AccessShare)
    SELECT DISTINCT(customer_id) FROM payments;
    -- Forces this to queue (AccessExclusive)
    ALTER TABLE payments ADD COLUMN refunded boolean;

    View Slide

  33. -- Slow query (AccessShare)
    SELECT DISTINCT(customer_id) FROM payments;
    -- Forces this to queue (AccessExclusive)
    ALTER TABLE payments ADD COLUMN refunded boolean;
    -- Which blocks these (AccessShare)
    SELECT * FROM payments WHERE id = 123;

    View Slide

  34. View Slide

  35. 15 seconds of API
    downtime

    View Slide

  36. So what happened?

    View Slide

  37. Online
    Analytical
    Processing

    View Slide

  38. Online
    Transaction
    Processing

    View Slide

  39. Don’t OLAP
    where you OLTP

    View Slide

  40. Set appropriate
    bounds on the system

    View Slide

  41. SET lock_timeout…
    SET statement_timeout…

    View Slide

  42. https://gocardless.com/blog/zero-downtime-postgres-migrations-a-little-help/

    View Slide

  43. -- Wishlist
    SET transaction_timeout…
    -- In 9.6
    SET idle_in_transaction_session_timeout…

    View Slide

  44. log_lock_waits = on

    View Slide

  45. Incident 2
    Did we just use 1.5TB of disk?

    View Slide

  46. Not big data

    View Slide

  47. 2 hours chasing timeouts

    View Slide

  48. Oops

    View Slide


  49. LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp23325.11166", size 1245184
    LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp23325.11165", size 8675328
    LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp10729.11129", size 47734784

    View Slide

  50. A runaway analytics
    query

    View Slide

  51. Don’t OLAP
    where you OLTP

    View Slide

  52. Near miss

    View Slide

  53. Set appropriate
    bounds on the system

    View Slide

  54. temp_file_limit = … # size in kB

    View Slide

  55. But…

    View Slide

  56. Data growth

    View Slide

  57. Measure as you
    approach the limit

    View Slide

  58. But…

    View Slide

  59. > CREATE INDEX pay_merch ON payments (merchant_id);
    ERROR: temporary file size exceeds temp_file_limit

    View Slide

  60. SET temp_file_limit = -1; # disable
    CREATE INDEX pay_merch ON payments (merchant_id);
    RESET temp_file_limit; # re-enable

    View Slide

  61. Resource utilisation
    ORMs
    Clustering

    View Slide

  62. Resource utilisation
    ORMs
    Clustering

    View Slide

  63. View Slide

  64. Incident 3
    Revenge of the ORM

    View Slide

  65. Postgres
    Client 1
    Client 2

    View Slide

  66. Postgres
    Client 1
    Client 2
    Client 3
    Client 4
    Client 5

    View Slide

  67. Postgres
    Client 1
    Client 2
    Client 3
    Client 4
    Client 5
    .
    .
    .

    View Slide

  68. PgBouncer with
    transaction-pooling

    View Slide

  69. Postgres
    Client 1
    Client 2
    Client 3
    Client 4
    Client 5
    PgBouncer

    View Slide

  70. No session-level
    features

    View Slide

  71. SET

    View Slide

  72. SET
    Session-level advisory locks

    View Slide

  73. SET
    Session-level advisory locks
    Prepared statements

    View Slide

  74. We prepared our
    apps for this

    View Slide

  75. NoMethodError: undefined method
    'fields' for nil:NilClass

    View Slide

  76. –Literally everyone on the internet
    “You should disable prepared statements.”

    View Slide

  77. 3 days
    3 people

    View Slide

  78. On deployment
    On violating constraints

    View Slide

  79. tcpdump + Wireshark

    View Slide

  80. A bunch of extra statements!!!

    View Slide

  81. SET client_encoding TO "utf8";
    SET client_min_messages TO "panic";
    SET standard_conforming_strings = on;
    SET client_min_messages TO "warning";
    SET time zone "UTC";

    View Slide

  82. SET client_encoding TO "utf8";
    SET client_min_messages TO "panic";
    SET standard_conforming_strings = on;
    SET client_min_messages TO "warning";
    SET time zone "UTC";

    View Slide

  83. Postgres
    Client 1
    Client 2
    Client 3
    Client 4
    Client 5
    PgBouncer

    View Slide

  84. Postgres
    Client 1 PgBouncer

    View Slide

  85. Postgres
    Client 1 PgBouncer
    SET … "panic";
    SET … "warning";

    View Slide

  86. Postgres
    Client 1 PgBouncer
    SET … "panic";
    SET … "warning";
    SET … "panic";

    View Slide

  87. Postgres
    Client 1 PgBouncer
    SET … "panic";
    SET … "warning";
    SET … "panic";
    SET … "warning";

    View Slide

  88. So what
    if client_min_messages
    is set to "panic"?

    View Slide

  89. ERROR: duplicate key value violates
    unique constraint "index_users_on_email"

    View Slide

  90. $ git show a456acb2f2
    commit a456acb2f2af8365eb9151c7cd2d5a10c189d191
    Author: Harry Marr
    Date: Wed Oct 28 16:30:02 2015 +0000
    Avoid disabling postgres errors
    # Enable standard-conforming strings if available.
    def set_standard_conforming_strings
    - old, self.client_min_messages = client_min_messages, 'panic'
    - execute('SET standard_conforming_strings = on', 'SCHEMA') rescue nil
    - ensure
    - self.client_min_messages = old
    + execute(<+ UPDATE pg_settings
    + SET setting = 'on'
    + WHERE name = 'standard_conforming_strings'
    + SQL
    end

    View Slide

  91. https://gocardless.com/blog/the-troubleshooting-tales-issues-scaling-postgres-
    connections/

    View Slide

  92. https://twitter.com/rbranson/status/675005104701870080

    View Slide

  93. There’s a lot
    more than
    /usr/bin/postgresql

    View Slide

  94. - PgBouncer
    - libpq
    - Ruby pg gem
    - ActiveRecord (ORM)

    View Slide

  95. It’s just code

    View Slide

  96. Don’t trust the ORM

    View Slide

  97. Resource utilisation
    ORMs
    Clustering

    View Slide

  98. Resource utilisation
    ORMs
    Clustering

    View Slide

  99. How soon are we back online?
    Is all our data there?

    View Slide

  100. What’s in a cluster?

    View Slide

  101. Consensus

    View Slide

  102. Consensus
    State machine

    View Slide

  103. Consensus
    State machine
    Health checks

    View Slide

  104. Consensus
    State machine
    Health checks
    Actions

    View Slide

  105. A mostly true
    simplification

    View Slide

  106. Pacemaker

    View Slide

  107. Incident 4
    Sorry, I cannot hear you
    I’m kinda busy

    View Slide

  108. It starts with a page

    View Slide

  109. crm mon status - demote event for
    Postgresql on : OK

    View Slide

  110. psql: FATAL: sorry, too many clients already

    View Slide

  111. 30 seconds of API
    downtime

    View Slide

  112. Roll back the last
    change

    View Slide

  113. # Should only be set for schema migrations
    USE_DIRECT_PG_CONNECTION=true

    View Slide

  114. What about
    superuser_reserved_connections?

    View Slide

  115. 70
    Normal
    App

    View Slide

  116. 70
    26
    Normal
    App
    Unused

    View Slide

  117. 70
    3
    1
    26
    Normal
    App
    Unused
    Superuser (non-reserved)
    Superuser (reserved)

    View Slide

  118. 70
    3
    1
    26
    Normal Incident
    App
    Unused
    Superuser (non-reserved)
    Superuser (reserved)
    App
    97

    View Slide

  119. 70
    3
    1
    26
    3
    1
    Normal Incident
    App
    Unused
    Superuser (non-reserved)
    Superuser (reserved)
    App
    Superuser (rejected)
    Superuser (reserved)
    97

    View Slide

  120. Set appropriate
    bounds on the system

    View Slide

  121. View Slide

  122. Incident 5
    What’s in a health check?

    View Slide

  123. su postgres -c 'psql -c "select now()"'
    Change to the
    Postgres user
    Run a simple query
    Connect to the
    database

    View Slide

  124. Humans in LDAP
    Robots locally

    View Slide

  125. su postgres -c 'psql -c "select now()"'
    Postgres is a
    local user

    View Slide

  126. Safe to
    restart
    an LDAP node,
    right?

    View Slide

  127. Wrong

    View Slide

  128. Health-check timeout

    Cluster transition

    View Slide

  129. su postgres -c 'psql -c "select now()"'

    View Slide

  130. su postgres -c 'echo "hello"'
    Command doesn’t
    matter

    View Slide

  131. Enter strace, ltrace

    View Slide

  132. strace - print every system
    call made by a process

    View Slide

  133. ltrace - same, but for
    C library calls

    View Slide

  134. Opinion: the most useful
    debugging tool you can learn
    http://jvns.ca/blog/2015/04/14/strace-zine/

    View Slide

  135. # strace su postgres -c 'echo "hello"'
    getuid() = 0
    # ltrace su postgres -c 'echo "hello"'
    getlogin() = "chris"
    getpwnam_r(…) # blocks

    View Slide

  136. # strace su postgres -c 'echo "hello"'
    getuid() = 0
    # ltrace su postgres -c 'echo "hello"'
    getlogin() = "chris"
    getpwnam_r(…) # blocks
    !?

    View Slide

  137. getlogin

    View Slide

  138. the user name associated by the
    login activity with the controlling
    terminal of the current process

    View Slide

  139. the user name associated by the
    login activity with the controlling
    terminal of the current process

    View Slide

  140. $ ./test_getlogin
    User is: chris
    $ sudo -i
    # ./test_getlogin
    User is: chris

    View Slide

  141. /var/run/utmp
    /proc//loginuid

    View Slide

  142. # ps aux | grep '[p]acemakerd' | awk '{print $2}'
    9001

    View Slide

  143. # ps aux | grep '[p]acemakerd' | awk '{print $2}'
    9001
    # cat /proc/9001/loginuid
    1234

    View Slide

  144. # ps aux | grep '[p]acemakerd' | awk '{print $2}'
    9001
    # cat /proc/9001/loginuid
    1234
    # getent passwd 1234 | cut -d':' -f1
    chris

    View Slide

  145. View Slide

  146. View Slide

  147. View Slide

  148. service pacemaker restart
    changes the user

    View Slide

  149. Hold up.
    What is su doing?

    View Slide

  150. The code is subtle

    View Slide

  151. Getting a passwd struct for
    the calling user

    View Slide

  152. uid = getuid() 0 (root)

    View Slide

  153. uid = getuid()
    login = getlogin()
    0 (root)
    chris

    View Slide

  154. uid = getuid()
    login = getlogin()
    passwd = passwd_from_name(login)
    0 (root)
    chris
    {…}

    View Slide

  155. uid = getuid()
    login = getlogin()
    passwd = passwd_from_name(login)
    if (passwd.uid == uid)
    return passwd
    0 (root)
    chris
    {…}
    0 == 1234

    View Slide

  156. uid = getuid()
    login = getlogin()
    passwd = passwd_from_name(login)
    if (passwd.uid == uid)
    return passwd
    else
    return passwd_from_uid(uid)
    0 (root)
    chris
    {…}
    0 == 1234

    View Slide

  157. Only does anything if you
    have duplicate UIDs

    View Slide

  158. Why does the
    login activity matter?

    View Slide

  159. There may be
    Reasons™

    View Slide

  160. View Slide

  161. View Slide

  162. freebsd-hackers
    https://lists.freebsd.org/pipermail/freebsd-hackers/2017-March/050757.html

    View Slide

  163. There are
    Reasons™
    https://lists.freebsd.org/pipermail/freebsd-hackers/2017-March/050757.html

    View Slide

  164. https://lists.freebsd.org/pipermail/freebsd-hackers/2017-March/050761.html

    View Slide

  165. https://lists.freebsd.org/pipermail/freebsd-hackers/2017-March/050765.html

    View Slide

  166. https://lists.freebsd.org/pipermail/freebsd-hackers/2017-March/050766.html

    View Slide

  167. https://github.com/dspinellis/unix-history-repo

    View Slide

  168. Code archeology

    View Slide

  169. Anyway…
    we need to fix
    our setup

    View Slide

  170. Set appropriate
    bounds on the system

    View Slide

  171. bind_timelimit

    View Slide

  172. There’s a lot
    more than
    /usr/bin/postgresql

    View Slide

  173. Run game days

    View Slide

  174. https://gocardless.com/blog/game-days-at-gc/

    View Slide

  175. Resource utilisation
    ORMs
    Clustering

    View Slide

  176. 3 things to do

    View Slide

  177. Check your logs

    View Slide

  178. Check your logs
    Set appropriate bounds

    View Slide

  179. Check your logs
    Set appropriate bounds
    Check what your ORM’s up to

    View Slide

  180. Check your logs
    Set appropriate bounds
    Check what your ORM’s up to

    View Slide

  181. And two thoughts…

    View Slide

  182. Operating a database is about
    much more
    than the database itself

    View Slide

  183. There’s a lot
    more than
    /usr/bin/postgresql

    View Slide

  184. https://twitter.com/cscotta/status/535898821914419200

    View Slide

  185. Thank you
    (❤
    @ChrisSinjo
    @GoCardlessEng

    View Slide

  186. We’re hiring
    (❤
    @ChrisSinjo
    @GoCardlessEng

    View Slide

  187. Questions?
    (❤
    @ChrisSinjo
    @GoCardlessEng

    View Slide

  188. Image credits
    • Heart emoji - https://github.com/mozilla/fxemoji/blob/
    9f68ca9c5bc51521f9ffe284e00ba8b7308e2c41/svgs/
    FirefoxEmoji/u2764-redheart.svg
    • Elephants - https://www.flickr.com/photos/makeitkenya/
    22047623331/
    • Detour - https://www.flickr.com/photos/
    wwward0/8257660963/
    • Kitten - https://www.flickr.com/photos/aigle_dore/7787096102

    View Slide

  189. References (Incident 1)
    • Safe Operations For High Volume PostgreSQL (Braintree) -
    https://www.braintreepayments.com/blog/safe-operations-for-
    high-volume-postgresql/
    • Zero-downtime Postgres migrations - the hard parts - https://
    gocardless.com/blog/zero-downtime-postgres-migrations-the-
    hard-parts/
    • Zero-downtime Postgres migrations - a little help - https://
    gocardless.com/blog/zero-downtime-postgres-migrations-a-little-
    help/

    View Slide

  190. References (Incident 1)
    • Postgres 9.5 Client Connection Defaults (lock_timeout,
    statement_timeout) - https://www.postgresql.org/docs/9.5/
    static/runtime-config-client.html
    • Postgres 9.6 Client Connection Defaults
    (idle_in_transaction_session_timeout) - https://
    www.postgresql.org/docs/9.5/static/runtime-config-client.html
    • Postgres 9.5 Error Reporting and Logging (log_lock_waits) -
    https://www.postgresql.org/docs/9.5/static/runtime-config-
    logging.html

    View Slide

  191. References (Incident 2)
    • Postgres 9.5 Resource Consumption (temp_file_limit) - https://
    www.postgresql.org/docs/current/static/runtime-config-
    resource.html

    View Slide

  192. References (Incident 3)
    • PgBouncer feature matrix for pooling modes - https://
    wiki.postgresql.org/wiki/
    PgBouncer#Feature_matrix_for_pooling_modes
    • The Troubleshooting Tales: issues scaling Postgres connections - https://
    gocardless.com/blog/the-troubleshooting-tales-issues-scaling-postgres-
    connections/
    • Rails commit that fixed client_min_messages issue - https://github.com/
    rails/rails/commit/a456acb2f2af8365eb9151c7cd2d5a10c189d191
    • Rick Branson's tweet about the same problem in Django - https://
    twitter.com/rbranson/status/675005104701870080

    View Slide

  193. References (Incident 4)
    • Postgres 9.5 Connections and Authentication
    (superuser_reserved_connections) - https://
    www.postgresql.org/docs/9.5/static/runtime-config-
    connection.html

    View Slide

  194. References (Incident 5)
    • Julia Evans' strace zine - http://jvns.ca/blog/2015/04/14/
    strace-zine/
    • All fun and games until you start with GameDays - https://
    gocardless.com/blog/game-days-at-gc/
    • freebsd-hackers thread - https://lists.freebsd.org/pipermail/
    freebsd-hackers/2017-March/050757.html

    View Slide

  195. References (Incident 5)
    • Current code in `su` to look up user - https://github.com/
    shadow-maint/shadow/blob/
    ef45bb2496182b5df90ad0323bef75d1a5d69887/contrib/
    pwdauth.c#L127
    • Much older similar code in `su` (1994 or earlier) - https://
    github.com/freebsd/freebsd/blob/
    b5c3fb9427806b740f7df3e43a1513e1f5fa840b/usr.bin/su/
    su.c#L255

    View Slide

  196. References (Incident 5)
    • getuid manual - http://pubs.opengroup.org/onlinepubs/
    009695399/functions/getuid.html
    • getlogin manual - http://pubs.opengroup.org/onlinepubs/
    009695399/functions/getlogin.html
    • getpwnam manual - http://pubs.opengroup.org/onlinepubs/
    9699919799/functions/getpwnam.html
    • getpwuid manual - http://pubs.opengroup.org/onlinepubs/
    009695399/functions/getpwuid.html

    View Slide

  197. References (Misc)
    • Practical Postmortems at Etsy - https://www.infoq.com/articles/
    postmortems-etsy
    • Scott Andreas' tweet about investigating things - https://
    twitter.com/cscotta/status/535898821914419200

    View Slide