Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Lessons Learned the Hard Way: Postgres in Production at GoCardless (PGConf US Edition)

Lessons Learned the Hard Way: Postgres in Production at GoCardless (PGConf US Edition)

**New content in this edition: better answers for incident 5**

From the beginning, GoCardless has used Postgres for the stuff that matters. It’s the primary store behind our APIs and dashboards.

After 5 years, it’s fair to say we’ve seen some things. From spikes in latency as resources run out, through the weird contortions ORMs put your database through, and on to the perils of running a high-availability cluster - this talk has it all.

In the talk, we’ll run through some war stories from the last two years at GoCardless. By gathering them together, we’ll spot some recurring themes, and hopefully avoid fighting the same battles again!

Chris Sinjakli

March 29, 2017
Tweet

More Decks by Chris Sinjakli

Other Decks in Programming

Transcript

  1. Lessons learned the hard way Postgres in production at GoCardless

    @ChrisSinjo
  2. GOCARDLESS

  3. POST /cash/monies HTTP/1.1 { amount: 100 }

  4. High per-request

  5. Uptime is

  6. You will have different concerns

  7. The distributed system as a whole

  8. None
  9. None
  10. None
  11. None
  12. Learning through the medium of post-mortems

  13. What is a post-mortem? https://www.infoq.com/articles/postmortems-etsy

  14. Failure can be a great teacher

  15. 5 stories

  16. Resource utilisation

  17. Resource utilisation ORMs

  18. Resource utilisation ORMs Clustering

  19. foreach(incident): What happened Something we learned Ideas you can use

  20. 3 things to do

  21. Resource utilisation ORMs Clustering

  22. Incident 1 The Fast Migration That Wasn’t

  23. ~10-20 schema changes per-month

  24. Problem: locks

  25. Many guides on safe schema changes

  26. Don’t ADD COLUMN with DEFAULT

  27. Don’t ADD COLUMN with DEFAULT VALIDATE constraints after adding

  28. Don’t ADD COLUMN with DEFAULT VALIDATE constraints after adding Add

    indexes CONCURRENTLY
  29. ALTER TABLE payments ADD COLUMN refunded boolean; No DEFAULT

  30. The lock queue

  31. -- Slow query (AccessShare) SELECT DISTINCT(customer_id) FROM payments;

  32. -- Slow query (AccessShare) SELECT DISTINCT(customer_id) FROM payments; -- Forces

    this to queue (AccessExclusive) ALTER TABLE payments ADD COLUMN refunded boolean;
  33. -- Slow query (AccessShare) SELECT DISTINCT(customer_id) FROM payments; -- Forces

    this to queue (AccessExclusive) ALTER TABLE payments ADD COLUMN refunded boolean; -- Which blocks these (AccessShare) SELECT * FROM payments WHERE id = 123;
  34. None
  35. 15 seconds of API downtime

  36. So what happened?

  37. Online Analytical Processing

  38. Online Transaction Processing

  39. Don’t OLAP where you OLTP

  40. Set appropriate bounds on the system

  41. SET lock_timeout… SET statement_timeout…

  42. https://gocardless.com/blog/zero-downtime-postgres-migrations-a-little-help/

  43. -- Wishlist SET transaction_timeout… -- In 9.6 SET idle_in_transaction_session_timeout…

  44. log_lock_waits = on

  45. Incident 2 Did we just use 1.5TB of disk?

  46. Not big data

  47. 2 hours chasing timeouts

  48. Oops

  49. … LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp23325.11166", size 1245184 LOG: temporary

    file: path "base/pgsql_tmp/pgsql_tmp23325.11165", size 8675328 LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp10729.11129", size 47734784 …
  50. A runaway analytics query

  51. Don’t OLAP where you OLTP

  52. Near miss

  53. Set appropriate bounds on the system

  54. temp_file_limit = … # size in kB

  55. But…

  56. Data growth

  57. Measure as you approach the limit

  58. But…

  59. > CREATE INDEX pay_merch ON payments (merchant_id); ERROR: temporary file

    size exceeds temp_file_limit
  60. SET temp_file_limit = -1; # disable CREATE INDEX pay_merch ON

    payments (merchant_id); RESET temp_file_limit; # re-enable
  61. Resource utilisation ORMs Clustering

  62. Resource utilisation ORMs Clustering

  63. None
  64. Incident 3 Revenge of the ORM

  65. Postgres Client 1 Client 2

  66. Postgres Client 1 Client 2 Client 3 Client 4 Client

    5
  67. Postgres Client 1 Client 2 Client 3 Client 4 Client

    5 . . .
  68. PgBouncer with transaction-pooling

  69. Postgres Client 1 Client 2 Client 3 Client 4 Client

    5 PgBouncer
  70. No session-level features

  71. SET

  72. SET Session-level advisory locks

  73. SET Session-level advisory locks Prepared statements

  74. We prepared our apps for this

  75. NoMethodError: undefined method 'fields' for nil:NilClass

  76. –Literally everyone on the internet “You should disable prepared statements.”

  77. 3 days 3 people

  78. On deployment On violating constraints

  79. tcpdump + Wireshark

  80. A bunch of extra statements!!!

  81. SET client_encoding TO "utf8"; SET client_min_messages TO "panic"; SET standard_conforming_strings

    = on; SET client_min_messages TO "warning"; SET time zone "UTC";
  82. SET client_encoding TO "utf8"; SET client_min_messages TO "panic"; SET standard_conforming_strings

    = on; SET client_min_messages TO "warning"; SET time zone "UTC";
  83. Postgres Client 1 Client 2 Client 3 Client 4 Client

    5 PgBouncer
  84. Postgres Client 1 PgBouncer

  85. Postgres Client 1 PgBouncer SET … "panic"; SET … "warning";

  86. Postgres Client 1 PgBouncer SET … "panic"; SET … "warning";

    SET … "panic";
  87. Postgres Client 1 PgBouncer SET … "panic"; SET … "warning";

    SET … "panic"; SET … "warning";
  88. So what if client_min_messages is set to "panic"?

  89. ERROR: duplicate key value violates unique constraint "index_users_on_email"

  90. $ git show a456acb2f2 commit a456acb2f2af8365eb9151c7cd2d5a10c189d191 Author: Harry Marr <harry.marr@gmail.com>

    Date: Wed Oct 28 16:30:02 2015 +0000 Avoid disabling postgres errors # Enable standard-conforming strings if available. def set_standard_conforming_strings - old, self.client_min_messages = client_min_messages, 'panic' - execute('SET standard_conforming_strings = on', 'SCHEMA') rescue nil - ensure - self.client_min_messages = old + execute(<<-SQL, 'SCHEMA') + UPDATE pg_settings + SET setting = 'on' + WHERE name = 'standard_conforming_strings' + SQL end
  91. https://gocardless.com/blog/the-troubleshooting-tales-issues-scaling-postgres- connections/

  92. https://twitter.com/rbranson/status/675005104701870080

  93. There’s a lot more than /usr/bin/postgresql

  94. - PgBouncer - libpq - Ruby pg gem - ActiveRecord

    (ORM)
  95. It’s just code

  96. Don’t trust the ORM

  97. Resource utilisation ORMs Clustering

  98. Resource utilisation ORMs Clustering

  99. How soon are we back online? Is all our data

    there?
  100. What’s in a cluster?

  101. Consensus

  102. Consensus State machine

  103. Consensus State machine Health checks

  104. Consensus State machine Health checks Actions

  105. A mostly true simplification

  106. Pacemaker

  107. Incident 4 Sorry, I cannot hear you I’m kinda busy

  108. It starts with a page

  109. crm mon status - demote event for Postgresql on <node>:

    OK
  110. psql: FATAL: sorry, too many clients already

  111. 30 seconds of API downtime

  112. Roll back the last change

  113. # Should only be set for schema migrations USE_DIRECT_PG_CONNECTION=true

  114. What about superuser_reserved_connections?

  115. 70 Normal App

  116. 70 26 Normal App Unused

  117. 70 3 1 26 Normal App Unused Superuser (non-reserved) Superuser

    (reserved)
  118. 70 3 1 26 Normal Incident App Unused Superuser (non-reserved)

    Superuser (reserved) App 97
  119. 70 3 1 26 3 1 Normal Incident App Unused

    Superuser (non-reserved) Superuser (reserved) App Superuser (rejected) Superuser (reserved) 97
  120. Set appropriate bounds on the system

  121. None
  122. Incident 5 What’s in a health check?

  123. su postgres -c 'psql -c "select now()"' Change to the

    Postgres user Run a simple query Connect to the database
  124. Humans in LDAP Robots locally

  125. su postgres -c 'psql -c "select now()"' Postgres is a

    local user
  126. Safe to restart an LDAP node, right?

  127. Wrong

  128. Health-check timeout ↓ Cluster transition

  129. su postgres -c 'psql -c "select now()"'

  130. su postgres -c 'echo "hello"' Command doesn’t matter

  131. Enter strace, ltrace

  132. strace - print every system call made by a process

  133. ltrace - same, but for C library calls

  134. Opinion: the most useful debugging tool you can learn http://jvns.ca/blog/2015/04/14/strace-zine/

  135. # strace su postgres -c 'echo "hello"' getuid() = 0

    # ltrace su postgres -c 'echo "hello"' getlogin() = "chris" getpwnam_r(…) # blocks
  136. # strace su postgres -c 'echo "hello"' getuid() = 0

    # ltrace su postgres -c 'echo "hello"' getlogin() = "chris" getpwnam_r(…) # blocks !?
  137. getlogin

  138. the user name associated by the login activity with the

    controlling terminal of the current process
  139. the user name associated by the login activity with the

    controlling terminal of the current process
  140. $ ./test_getlogin User is: chris $ sudo -i # ./test_getlogin

    User is: chris
  141. /var/run/utmp /proc/<pid>/loginuid

  142. # ps aux | grep '[p]acemakerd' | awk '{print $2}'

    9001
  143. # ps aux | grep '[p]acemakerd' | awk '{print $2}'

    9001 # cat /proc/9001/loginuid 1234
  144. # ps aux | grep '[p]acemakerd' | awk '{print $2}'

    9001 # cat /proc/9001/loginuid 1234 # getent passwd 1234 | cut -d':' -f1 chris
  145. None
  146. None
  147. None
  148. service pacemaker restart changes the user

  149. Hold up. What is su doing?

  150. The code is subtle

  151. Getting a passwd struct for the calling user

  152. uid = getuid() 0 (root)

  153. uid = getuid() login = getlogin() 0 (root) chris

  154. uid = getuid() login = getlogin() passwd = passwd_from_name(login) 0

    (root) chris {…}
  155. uid = getuid() login = getlogin() passwd = passwd_from_name(login) if

    (passwd.uid == uid) return passwd 0 (root) chris {…} 0 == 1234
  156. uid = getuid() login = getlogin() passwd = passwd_from_name(login) if

    (passwd.uid == uid) return passwd else return passwd_from_uid(uid) 0 (root) chris {…} 0 == 1234
  157. Only does anything if you have duplicate UIDs

  158. Why does the login activity matter?

  159. There may be Reasons™

  160. None
  161. None
  162. freebsd-hackers https://lists.freebsd.org/pipermail/freebsd-hackers/2017-March/050757.html

  163. There are Reasons™ https://lists.freebsd.org/pipermail/freebsd-hackers/2017-March/050757.html

  164. https://lists.freebsd.org/pipermail/freebsd-hackers/2017-March/050761.html

  165. https://lists.freebsd.org/pipermail/freebsd-hackers/2017-March/050765.html

  166. https://lists.freebsd.org/pipermail/freebsd-hackers/2017-March/050766.html

  167. https://github.com/dspinellis/unix-history-repo

  168. Code archeology

  169. Anyway… we need to fix our setup

  170. Set appropriate bounds on the system

  171. bind_timelimit

  172. There’s a lot more than /usr/bin/postgresql

  173. Run game days

  174. https://gocardless.com/blog/game-days-at-gc/

  175. Resource utilisation ORMs Clustering

  176. 3 things to do

  177. Check your logs

  178. Check your logs Set appropriate bounds

  179. Check your logs Set appropriate bounds Check what your ORM’s

    up to
  180. Check your logs Set appropriate bounds Check what your ORM’s

    up to
  181. And two thoughts…

  182. Operating a database is about much more than the database

    itself
  183. There’s a lot more than /usr/bin/postgresql

  184. https://twitter.com/cscotta/status/535898821914419200

  185. Thank you (❤ @ChrisSinjo @GoCardlessEng

  186. We’re hiring (❤ @ChrisSinjo @GoCardlessEng

  187. Questions? (❤ @ChrisSinjo @GoCardlessEng

  188. Image credits • Heart emoji - https://github.com/mozilla/fxemoji/blob/ 9f68ca9c5bc51521f9ffe284e00ba8b7308e2c41/svgs/ FirefoxEmoji/u2764-redheart.svg •

    Elephants - https://www.flickr.com/photos/makeitkenya/ 22047623331/ • Detour - https://www.flickr.com/photos/ wwward0/8257660963/ • Kitten - https://www.flickr.com/photos/aigle_dore/7787096102
  189. References (Incident 1) • Safe Operations For High Volume PostgreSQL

    (Braintree) - https://www.braintreepayments.com/blog/safe-operations-for- high-volume-postgresql/ • Zero-downtime Postgres migrations - the hard parts - https:// gocardless.com/blog/zero-downtime-postgres-migrations-the- hard-parts/ • Zero-downtime Postgres migrations - a little help - https:// gocardless.com/blog/zero-downtime-postgres-migrations-a-little- help/
  190. References (Incident 1) • Postgres 9.5 Client Connection Defaults (lock_timeout,

    statement_timeout) - https://www.postgresql.org/docs/9.5/ static/runtime-config-client.html • Postgres 9.6 Client Connection Defaults (idle_in_transaction_session_timeout) - https:// www.postgresql.org/docs/9.5/static/runtime-config-client.html • Postgres 9.5 Error Reporting and Logging (log_lock_waits) - https://www.postgresql.org/docs/9.5/static/runtime-config- logging.html
  191. References (Incident 2) • Postgres 9.5 Resource Consumption (temp_file_limit) -

    https:// www.postgresql.org/docs/current/static/runtime-config- resource.html
  192. References (Incident 3) • PgBouncer feature matrix for pooling modes

    - https:// wiki.postgresql.org/wiki/ PgBouncer#Feature_matrix_for_pooling_modes • The Troubleshooting Tales: issues scaling Postgres connections - https:// gocardless.com/blog/the-troubleshooting-tales-issues-scaling-postgres- connections/ • Rails commit that fixed client_min_messages issue - https://github.com/ rails/rails/commit/a456acb2f2af8365eb9151c7cd2d5a10c189d191 • Rick Branson's tweet about the same problem in Django - https:// twitter.com/rbranson/status/675005104701870080
  193. References (Incident 4) • Postgres 9.5 Connections and Authentication (superuser_reserved_connections)

    - https:// www.postgresql.org/docs/9.5/static/runtime-config- connection.html
  194. References (Incident 5) • Julia Evans' strace zine - http://jvns.ca/blog/2015/04/14/

    strace-zine/ • All fun and games until you start with GameDays - https:// gocardless.com/blog/game-days-at-gc/ • freebsd-hackers thread - https://lists.freebsd.org/pipermail/ freebsd-hackers/2017-March/050757.html
  195. References (Incident 5) • Current code in `su` to look

    up user - https://github.com/ shadow-maint/shadow/blob/ ef45bb2496182b5df90ad0323bef75d1a5d69887/contrib/ pwdauth.c#L127 • Much older similar code in `su` (1994 or earlier) - https:// github.com/freebsd/freebsd/blob/ b5c3fb9427806b740f7df3e43a1513e1f5fa840b/usr.bin/su/ su.c#L255
  196. References (Incident 5) • getuid manual - http://pubs.opengroup.org/onlinepubs/ 009695399/functions/getuid.html •

    getlogin manual - http://pubs.opengroup.org/onlinepubs/ 009695399/functions/getlogin.html • getpwnam manual - http://pubs.opengroup.org/onlinepubs/ 9699919799/functions/getpwnam.html • getpwuid manual - http://pubs.opengroup.org/onlinepubs/ 009695399/functions/getpwuid.html
  197. References (Misc) • Practical Postmortems at Etsy - https://www.infoq.com/articles/ postmortems-etsy

    • Scott Andreas' tweet about investigating things - https:// twitter.com/cscotta/status/535898821914419200