Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Lessons Learned the Hard Way: Postgres in Production at GoCardless (PGConf US Edition)

Lessons Learned the Hard Way: Postgres in Production at GoCardless (PGConf US Edition)

**New content in this edition: better answers for incident 5**

From the beginning, GoCardless has used Postgres for the stuff that matters. It’s the primary store behind our APIs and dashboards.

After 5 years, it’s fair to say we’ve seen some things. From spikes in latency as resources run out, through the weird contortions ORMs put your database through, and on to the perils of running a high-availability cluster - this talk has it all.

In the talk, we’ll run through some war stories from the last two years at GoCardless. By gathering them together, we’ll spot some recurring themes, and hopefully avoid fighting the same battles again!

Chris Sinjakli

March 29, 2017
Tweet

More Decks by Chris Sinjakli

Other Decks in Programming

Transcript

  1. Lessons learned the hard way
    Postgres in production at GoCardless
    @ChrisSinjo

    View full-size slide

  2. POST /cash/monies HTTP/1.1
    { amount: 100 }

    View full-size slide

  3. High per-request

    View full-size slide

  4. You will have
    different concerns

    View full-size slide

  5. The distributed system
    as a whole

    View full-size slide

  6. Learning through the
    medium of post-mortems

    View full-size slide

  7. What is a post-mortem?
    https://www.infoq.com/articles/postmortems-etsy

    View full-size slide

  8. Failure can be a
    great teacher

    View full-size slide

  9. Resource utilisation

    View full-size slide

  10. Resource utilisation
    ORMs

    View full-size slide

  11. Resource utilisation
    ORMs
    Clustering

    View full-size slide

  12. foreach(incident):
    What happened
    Something we learned
    Ideas you can use

    View full-size slide

  13. 3 things to do

    View full-size slide

  14. Resource utilisation
    ORMs
    Clustering

    View full-size slide

  15. Incident 1
    The Fast Migration
    That Wasn’t

    View full-size slide

  16. ~10-20 schema changes
    per-month

    View full-size slide

  17. Problem: locks

    View full-size slide

  18. Many guides on safe
    schema changes

    View full-size slide

  19. Don’t ADD COLUMN with DEFAULT

    View full-size slide

  20. Don’t ADD COLUMN with DEFAULT
    VALIDATE constraints after adding

    View full-size slide

  21. Don’t ADD COLUMN with DEFAULT
    VALIDATE constraints after adding
    Add indexes CONCURRENTLY

    View full-size slide

  22. ALTER TABLE payments
    ADD COLUMN refunded boolean;
    No DEFAULT

    View full-size slide

  23. The lock queue

    View full-size slide

  24. -- Slow query (AccessShare)
    SELECT DISTINCT(customer_id) FROM payments;

    View full-size slide

  25. -- Slow query (AccessShare)
    SELECT DISTINCT(customer_id) FROM payments;
    -- Forces this to queue (AccessExclusive)
    ALTER TABLE payments ADD COLUMN refunded boolean;

    View full-size slide

  26. -- Slow query (AccessShare)
    SELECT DISTINCT(customer_id) FROM payments;
    -- Forces this to queue (AccessExclusive)
    ALTER TABLE payments ADD COLUMN refunded boolean;
    -- Which blocks these (AccessShare)
    SELECT * FROM payments WHERE id = 123;

    View full-size slide

  27. 15 seconds of API
    downtime

    View full-size slide

  28. So what happened?

    View full-size slide

  29. Online
    Analytical
    Processing

    View full-size slide

  30. Online
    Transaction
    Processing

    View full-size slide

  31. Don’t OLAP
    where you OLTP

    View full-size slide

  32. Set appropriate
    bounds on the system

    View full-size slide

  33. SET lock_timeout…
    SET statement_timeout…

    View full-size slide

  34. https://gocardless.com/blog/zero-downtime-postgres-migrations-a-little-help/

    View full-size slide

  35. -- Wishlist
    SET transaction_timeout…
    -- In 9.6
    SET idle_in_transaction_session_timeout…

    View full-size slide

  36. log_lock_waits = on

    View full-size slide

  37. Incident 2
    Did we just use 1.5TB of disk?

    View full-size slide

  38. Not big data

    View full-size slide

  39. 2 hours chasing timeouts

    View full-size slide


  40. LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp23325.11166", size 1245184
    LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp23325.11165", size 8675328
    LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp10729.11129", size 47734784

    View full-size slide

  41. A runaway analytics
    query

    View full-size slide

  42. Don’t OLAP
    where you OLTP

    View full-size slide

  43. Set appropriate
    bounds on the system

    View full-size slide

  44. temp_file_limit = … # size in kB

    View full-size slide

  45. Measure as you
    approach the limit

    View full-size slide

  46. > CREATE INDEX pay_merch ON payments (merchant_id);
    ERROR: temporary file size exceeds temp_file_limit

    View full-size slide

  47. SET temp_file_limit = -1; # disable
    CREATE INDEX pay_merch ON payments (merchant_id);
    RESET temp_file_limit; # re-enable

    View full-size slide

  48. Resource utilisation
    ORMs
    Clustering

    View full-size slide

  49. Resource utilisation
    ORMs
    Clustering

    View full-size slide

  50. Incident 3
    Revenge of the ORM

    View full-size slide

  51. Postgres
    Client 1
    Client 2

    View full-size slide

  52. Postgres
    Client 1
    Client 2
    Client 3
    Client 4
    Client 5

    View full-size slide

  53. Postgres
    Client 1
    Client 2
    Client 3
    Client 4
    Client 5
    .
    .
    .

    View full-size slide

  54. PgBouncer with
    transaction-pooling

    View full-size slide

  55. Postgres
    Client 1
    Client 2
    Client 3
    Client 4
    Client 5
    PgBouncer

    View full-size slide

  56. No session-level
    features

    View full-size slide

  57. SET
    Session-level advisory locks

    View full-size slide

  58. SET
    Session-level advisory locks
    Prepared statements

    View full-size slide

  59. We prepared our
    apps for this

    View full-size slide

  60. NoMethodError: undefined method
    'fields' for nil:NilClass

    View full-size slide

  61. –Literally everyone on the internet
    “You should disable prepared statements.”

    View full-size slide

  62. 3 days
    3 people

    View full-size slide

  63. On deployment
    On violating constraints

    View full-size slide

  64. tcpdump + Wireshark

    View full-size slide

  65. A bunch of extra statements!!!

    View full-size slide

  66. SET client_encoding TO "utf8";
    SET client_min_messages TO "panic";
    SET standard_conforming_strings = on;
    SET client_min_messages TO "warning";
    SET time zone "UTC";

    View full-size slide

  67. SET client_encoding TO "utf8";
    SET client_min_messages TO "panic";
    SET standard_conforming_strings = on;
    SET client_min_messages TO "warning";
    SET time zone "UTC";

    View full-size slide

  68. Postgres
    Client 1
    Client 2
    Client 3
    Client 4
    Client 5
    PgBouncer

    View full-size slide

  69. Postgres
    Client 1 PgBouncer

    View full-size slide

  70. Postgres
    Client 1 PgBouncer
    SET … "panic";
    SET … "warning";

    View full-size slide

  71. Postgres
    Client 1 PgBouncer
    SET … "panic";
    SET … "warning";
    SET … "panic";

    View full-size slide

  72. Postgres
    Client 1 PgBouncer
    SET … "panic";
    SET … "warning";
    SET … "panic";
    SET … "warning";

    View full-size slide

  73. So what
    if client_min_messages
    is set to "panic"?

    View full-size slide

  74. ERROR: duplicate key value violates
    unique constraint "index_users_on_email"

    View full-size slide

  75. $ git show a456acb2f2
    commit a456acb2f2af8365eb9151c7cd2d5a10c189d191
    Author: Harry Marr
    Date: Wed Oct 28 16:30:02 2015 +0000
    Avoid disabling postgres errors
    # Enable standard-conforming strings if available.
    def set_standard_conforming_strings
    - old, self.client_min_messages = client_min_messages, 'panic'
    - execute('SET standard_conforming_strings = on', 'SCHEMA') rescue nil
    - ensure
    - self.client_min_messages = old
    + execute(<<-SQL, 'SCHEMA')
    + UPDATE pg_settings
    + SET setting = 'on'
    + WHERE name = 'standard_conforming_strings'
    + SQL
    end

    View full-size slide

  76. https://gocardless.com/blog/the-troubleshooting-tales-issues-scaling-postgres-
    connections/

    View full-size slide

  77. https://twitter.com/rbranson/status/675005104701870080

    View full-size slide

  78. There’s a lot
    more than
    /usr/bin/postgresql

    View full-size slide

  79. - PgBouncer
    - libpq
    - Ruby pg gem
    - ActiveRecord (ORM)

    View full-size slide

  80. It’s just code

    View full-size slide

  81. Don’t trust the ORM

    View full-size slide

  82. Resource utilisation
    ORMs
    Clustering

    View full-size slide

  83. Resource utilisation
    ORMs
    Clustering

    View full-size slide

  84. How soon are we back online?
    Is all our data there?

    View full-size slide

  85. What’s in a cluster?

    View full-size slide

  86. Consensus
    State machine

    View full-size slide

  87. Consensus
    State machine
    Health checks

    View full-size slide

  88. Consensus
    State machine
    Health checks
    Actions

    View full-size slide

  89. A mostly true
    simplification

    View full-size slide

  90. Incident 4
    Sorry, I cannot hear you
    I’m kinda busy

    View full-size slide

  91. It starts with a page

    View full-size slide

  92. crm mon status - demote event for
    Postgresql on : OK

    View full-size slide

  93. psql: FATAL: sorry, too many clients already

    View full-size slide

  94. 30 seconds of API
    downtime

    View full-size slide

  95. Roll back the last
    change

    View full-size slide

  96. # Should only be set for schema migrations
    USE_DIRECT_PG_CONNECTION=true

    View full-size slide

  97. What about
    superuser_reserved_connections?

    View full-size slide

  98. 70
    26
    Normal
    App
    Unused

    View full-size slide

  99. 70
    3
    1
    26
    Normal
    App
    Unused
    Superuser (non-reserved)
    Superuser (reserved)

    View full-size slide

  100. 70
    3
    1
    26
    Normal Incident
    App
    Unused
    Superuser (non-reserved)
    Superuser (reserved)
    App
    97

    View full-size slide

  101. 70
    3
    1
    26
    3
    1
    Normal Incident
    App
    Unused
    Superuser (non-reserved)
    Superuser (reserved)
    App
    Superuser (rejected)
    Superuser (reserved)
    97

    View full-size slide

  102. Set appropriate
    bounds on the system

    View full-size slide

  103. Incident 5
    What’s in a health check?

    View full-size slide

  104. su postgres -c 'psql -c "select now()"'
    Change to the
    Postgres user
    Run a simple query
    Connect to the
    database

    View full-size slide

  105. Humans in LDAP
    Robots locally

    View full-size slide

  106. su postgres -c 'psql -c "select now()"'
    Postgres is a
    local user

    View full-size slide

  107. Safe to
    restart
    an LDAP node,
    right?

    View full-size slide

  108. Health-check timeout

    Cluster transition

    View full-size slide

  109. su postgres -c 'psql -c "select now()"'

    View full-size slide

  110. su postgres -c 'echo "hello"'
    Command doesn’t
    matter

    View full-size slide

  111. Enter strace, ltrace

    View full-size slide

  112. strace - print every system
    call made by a process

    View full-size slide

  113. ltrace - same, but for
    C library calls

    View full-size slide

  114. Opinion: the most useful
    debugging tool you can learn
    http://jvns.ca/blog/2015/04/14/strace-zine/

    View full-size slide

  115. # strace su postgres -c 'echo "hello"'
    getuid() = 0
    # ltrace su postgres -c 'echo "hello"'
    getlogin() = "chris"
    getpwnam_r(…) # blocks

    View full-size slide

  116. # strace su postgres -c 'echo "hello"'
    getuid() = 0
    # ltrace su postgres -c 'echo "hello"'
    getlogin() = "chris"
    getpwnam_r(…) # blocks
    !?

    View full-size slide

  117. the user name associated by the
    login activity with the controlling
    terminal of the current process

    View full-size slide

  118. the user name associated by the
    login activity with the controlling
    terminal of the current process

    View full-size slide

  119. $ ./test_getlogin
    User is: chris
    $ sudo -i
    # ./test_getlogin
    User is: chris

    View full-size slide

  120. /var/run/utmp
    /proc//loginuid

    View full-size slide

  121. # ps aux | grep '[p]acemakerd' | awk '{print $2}'
    9001

    View full-size slide

  122. # ps aux | grep '[p]acemakerd' | awk '{print $2}'
    9001
    # cat /proc/9001/loginuid
    1234

    View full-size slide

  123. # ps aux | grep '[p]acemakerd' | awk '{print $2}'
    9001
    # cat /proc/9001/loginuid
    1234
    # getent passwd 1234 | cut -d':' -f1
    chris

    View full-size slide

  124. service pacemaker restart
    changes the user

    View full-size slide

  125. Hold up.
    What is su doing?

    View full-size slide

  126. The code is subtle

    View full-size slide

  127. Getting a passwd struct for
    the calling user

    View full-size slide

  128. uid = getuid() 0 (root)

    View full-size slide

  129. uid = getuid()
    login = getlogin()
    0 (root)
    chris

    View full-size slide

  130. uid = getuid()
    login = getlogin()
    passwd = passwd_from_name(login)
    0 (root)
    chris
    {…}

    View full-size slide

  131. uid = getuid()
    login = getlogin()
    passwd = passwd_from_name(login)
    if (passwd.uid == uid)
    return passwd
    0 (root)
    chris
    {…}
    0 == 1234

    View full-size slide

  132. uid = getuid()
    login = getlogin()
    passwd = passwd_from_name(login)
    if (passwd.uid == uid)
    return passwd
    else
    return passwd_from_uid(uid)
    0 (root)
    chris
    {…}
    0 == 1234

    View full-size slide

  133. Only does anything if you
    have duplicate UIDs

    View full-size slide

  134. Why does the
    login activity matter?

    View full-size slide

  135. There may be
    Reasons™

    View full-size slide

  136. freebsd-hackers
    https://lists.freebsd.org/pipermail/freebsd-hackers/2017-March/050757.html

    View full-size slide

  137. There are
    Reasons™
    https://lists.freebsd.org/pipermail/freebsd-hackers/2017-March/050757.html

    View full-size slide

  138. https://lists.freebsd.org/pipermail/freebsd-hackers/2017-March/050761.html

    View full-size slide

  139. https://lists.freebsd.org/pipermail/freebsd-hackers/2017-March/050765.html

    View full-size slide

  140. https://lists.freebsd.org/pipermail/freebsd-hackers/2017-March/050766.html

    View full-size slide

  141. https://github.com/dspinellis/unix-history-repo

    View full-size slide

  142. Code archeology

    View full-size slide

  143. Anyway…
    we need to fix
    our setup

    View full-size slide

  144. Set appropriate
    bounds on the system

    View full-size slide

  145. bind_timelimit

    View full-size slide

  146. There’s a lot
    more than
    /usr/bin/postgresql

    View full-size slide

  147. Run game days

    View full-size slide

  148. https://gocardless.com/blog/game-days-at-gc/

    View full-size slide

  149. Resource utilisation
    ORMs
    Clustering

    View full-size slide

  150. 3 things to do

    View full-size slide

  151. Check your logs

    View full-size slide

  152. Check your logs
    Set appropriate bounds

    View full-size slide

  153. Check your logs
    Set appropriate bounds
    Check what your ORM’s up to

    View full-size slide

  154. Check your logs
    Set appropriate bounds
    Check what your ORM’s up to

    View full-size slide

  155. And two thoughts…

    View full-size slide

  156. Operating a database is about
    much more
    than the database itself

    View full-size slide

  157. There’s a lot
    more than
    /usr/bin/postgresql

    View full-size slide

  158. https://twitter.com/cscotta/status/535898821914419200

    View full-size slide

  159. Thank you
    (❤
    @ChrisSinjo
    @GoCardlessEng

    View full-size slide

  160. We’re hiring
    (❤
    @ChrisSinjo
    @GoCardlessEng

    View full-size slide

  161. Questions?
    (❤
    @ChrisSinjo
    @GoCardlessEng

    View full-size slide

  162. Image credits
    • Heart emoji - https://github.com/mozilla/fxemoji/blob/
    9f68ca9c5bc51521f9ffe284e00ba8b7308e2c41/svgs/
    FirefoxEmoji/u2764-redheart.svg
    • Elephants - https://www.flickr.com/photos/makeitkenya/
    22047623331/
    • Detour - https://www.flickr.com/photos/
    wwward0/8257660963/
    • Kitten - https://www.flickr.com/photos/aigle_dore/7787096102

    View full-size slide

  163. References (Incident 1)
    • Safe Operations For High Volume PostgreSQL (Braintree) -
    https://www.braintreepayments.com/blog/safe-operations-for-
    high-volume-postgresql/
    • Zero-downtime Postgres migrations - the hard parts - https://
    gocardless.com/blog/zero-downtime-postgres-migrations-the-
    hard-parts/
    • Zero-downtime Postgres migrations - a little help - https://
    gocardless.com/blog/zero-downtime-postgres-migrations-a-little-
    help/

    View full-size slide

  164. References (Incident 1)
    • Postgres 9.5 Client Connection Defaults (lock_timeout,
    statement_timeout) - https://www.postgresql.org/docs/9.5/
    static/runtime-config-client.html
    • Postgres 9.6 Client Connection Defaults
    (idle_in_transaction_session_timeout) - https://
    www.postgresql.org/docs/9.5/static/runtime-config-client.html
    • Postgres 9.5 Error Reporting and Logging (log_lock_waits) -
    https://www.postgresql.org/docs/9.5/static/runtime-config-
    logging.html

    View full-size slide

  165. References (Incident 2)
    • Postgres 9.5 Resource Consumption (temp_file_limit) - https://
    www.postgresql.org/docs/current/static/runtime-config-
    resource.html

    View full-size slide

  166. References (Incident 3)
    • PgBouncer feature matrix for pooling modes - https://
    wiki.postgresql.org/wiki/
    PgBouncer#Feature_matrix_for_pooling_modes
    • The Troubleshooting Tales: issues scaling Postgres connections - https://
    gocardless.com/blog/the-troubleshooting-tales-issues-scaling-postgres-
    connections/
    • Rails commit that fixed client_min_messages issue - https://github.com/
    rails/rails/commit/a456acb2f2af8365eb9151c7cd2d5a10c189d191
    • Rick Branson's tweet about the same problem in Django - https://
    twitter.com/rbranson/status/675005104701870080

    View full-size slide

  167. References (Incident 4)
    • Postgres 9.5 Connections and Authentication
    (superuser_reserved_connections) - https://
    www.postgresql.org/docs/9.5/static/runtime-config-
    connection.html

    View full-size slide

  168. References (Incident 5)
    • Julia Evans' strace zine - http://jvns.ca/blog/2015/04/14/
    strace-zine/
    • All fun and games until you start with GameDays - https://
    gocardless.com/blog/game-days-at-gc/
    • freebsd-hackers thread - https://lists.freebsd.org/pipermail/
    freebsd-hackers/2017-March/050757.html

    View full-size slide

  169. References (Incident 5)
    • Current code in `su` to look up user - https://github.com/
    shadow-maint/shadow/blob/
    ef45bb2496182b5df90ad0323bef75d1a5d69887/contrib/
    pwdauth.c#L127
    • Much older similar code in `su` (1994 or earlier) - https://
    github.com/freebsd/freebsd/blob/
    b5c3fb9427806b740f7df3e43a1513e1f5fa840b/usr.bin/su/
    su.c#L255

    View full-size slide

  170. References (Incident 5)
    • getuid manual - http://pubs.opengroup.org/onlinepubs/
    009695399/functions/getuid.html
    • getlogin manual - http://pubs.opengroup.org/onlinepubs/
    009695399/functions/getlogin.html
    • getpwnam manual - http://pubs.opengroup.org/onlinepubs/
    9699919799/functions/getpwnam.html
    • getpwuid manual - http://pubs.opengroup.org/onlinepubs/
    009695399/functions/getpwuid.html

    View full-size slide

  171. References (Misc)
    • Practical Postmortems at Etsy - https://www.infoq.com/articles/
    postmortems-etsy
    • Scott Andreas' tweet about investigating things - https://
    twitter.com/cscotta/status/535898821914419200

    View full-size slide