Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Lessons Learned the Hard Way: Postgres in Production at GoCardless (GoCardless DevOps Meetup Edition)

Lessons Learned the Hard Way: Postgres in Production at GoCardless (GoCardless DevOps Meetup Edition)

From the beginning, GoCardless has used Postgres for the stuff that matters. It’s the primary store behind our APIs and dashboards.

After 5 years, it’s fair to say we’ve seen some things. From spikes in latency as resources run out, through the weird contortions ORMs put your database through, and on to the perils of running a high-availability cluster - this talk has it all.

In the talk, we’ll run through some war stories from the last two years at GoCardless. By gathering them together, we’ll spot some recurring themes, and hopefully avoid fighting the same battles again!

Chris Sinjakli

November 03, 2016
Tweet

More Decks by Chris Sinjakli

Other Decks in Programming

Transcript

  1. Lessons learned the hard way Postgres in production at GoCardless

    @ChrisSinjo
  2. GOCARDLESS

  3. POST /cash/monies HTTP/1.1 { amount: 100 }

  4. High per-request

  5. Uptime is

  6. You will have different concerns

  7. The distributed system as a whole

  8. None
  9. None
  10. None
  11. None
  12. Learning through the medium of post-mortems

  13. What is a post-mortem? https://www.infoq.com/articles/postmortems-etsy

  14. Failure can be a great teacher

  15. 5 stories

  16. Resource utilisation ORMs Clustering

  17. foreach(incident): What happened Something we learned Ideas you can use

  18. 3 things to do

  19. Resource utilisation ORMs Clustering

  20. Incident 1 The Fast Migration That Wasn’t

  21. ~10-20 schema changes per-month

  22. Problem: locks

  23. Many guides on safe schema changes

  24. Don’t ADD COLUMN with DEFAULT VALIDATE constraints after adding Add

    indexes CONCURRENTLY
  25. ALTER TABLE payments ADD COLUMN refunded boolean; No DEFAULT

  26. The lock queue

  27. -- Slow query (AccessShare) SELECT DISTINCT(customer_id) FROM payments;

  28. -- Slow query (AccessShare) SELECT DISTINCT(customer_id) FROM payments; -- Forces

    this to queue (AccessExclusive) ALTER TABLE payments ADD COLUMN refunded boolean;
  29. -- Slow query (AccessShare) SELECT DISTINCT(customer_id) FROM payments; -- Forces

    this to queue (AccessExclusive) ALTER TABLE payments ADD COLUMN refunded boolean; -- Which blocks these (AccessShare) SELECT * FROM payments WHERE id = 123;
  30. None
  31. 15 seconds of API downtime

  32. Online Analytical Processing

  33. Online Transaction Processing

  34. Don’t OLAP where you OLTP

  35. Set appropriate bounds on the system

  36. SET lock_timeout… SET statement_timeout…

  37. https://gocardless.com/blog/zero-downtime-postgres-migrations-a-little-help/

  38. -- Wishlist SET transaction_timeout… -- In 9.6 SET idle_in_transaction_session_timeout…

  39. log_lock_waits = on

  40. Incident 2 Did we just use 1.5TB of disk?

  41. Not big data

  42. 2 hours chasing timeouts

  43. Oops

  44. … LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp23325.11166", size 1245184 LOG: temporary

    file: path "base/pgsql_tmp/pgsql_tmp23325.11165", size 8675328 LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp10729.11129", size 47734784 …
  45. A runaway analytics query

  46. Don’t OLAP where you OLTP

  47. Near miss

  48. Set appropriate bounds on the system

  49. temp_file_limit = … # size in kB

  50. But…

  51. Data growth

  52. Measure as you approach the limit

  53. But…

  54. > CREATE INDEX pay_merch ON payments (merchant_id); ERROR: temporary file

    size exceeds temp_file_limit
  55. SET temp_file_limit = -1; # disable CREATE INDEX pay_merch ON

    payments (merchant_id); RESET temp_file_limit; # re-enable
  56. Resource utilisation ORMs Clustering

  57. Incident 3 Revenge of the ORM

  58. Postgres Client 1 Client 2

  59. Postgres Client 1 Client 2 Client 3 Client 4 Client

    5
  60. Postgres Client 1 Client 2 Client 3 Client 4 Client

    5 . . .
  61. PgBouncer with transaction-pooling

  62. Postgres Client 1 Client 2 Client 3 Client 4 Client

    5 PgBouncer
  63. No session-level features

  64. SET Session-level advisory locks Prepared statements

  65. We prepared our apps for this

  66. NoMethodError: undefined method 'fields' for nil:NilClass

  67. –Literally everyone on the internet “You should disable prepared statements.”

  68. 3 days 3 people

  69. On deployment On violating constraints

  70. tcpdump + Wireshark

  71. A bunch of extra statements!!!

  72. SET client_encoding TO "utf8"; SET client_min_messages TO "panic"; SET standard_conforming_strings

    = on; SET client_min_messages TO "warning"; SET time zone "UTC";
  73. SET client_encoding TO "utf8"; SET client_min_messages TO "panic"; SET standard_conforming_strings

    = on; SET client_min_messages TO "warning"; SET time zone "UTC";
  74. Postgres Client 1 Client 2 Client 3 Client 4 Client

    5 PgBouncer
  75. Postgres Client 1 PgBouncer

  76. Postgres Client 1 PgBouncer SET … "panic"; SET … "warning";

  77. Postgres Client 1 PgBouncer SET … "panic"; SET … "warning";

    SET … "panic";
  78. Postgres Client 1 PgBouncer SET … "panic"; SET … "warning";

    SET … "panic"; SET … "warning";
  79. So what if client_min_messages is set to "panic"?

  80. ERROR: duplicate key value violates unique constraint "index_users_on_email"

  81. $ git show a456acb2f2 commit a456acb2f2af8365eb9151c7cd2d5a10c189d191 Author: Harry Marr <harry.marr@gmail.com>

    Date: Wed Oct 28 16:30:02 2015 +0000 Avoid disabling postgres errors # Enable standard-conforming strings if available. def set_standard_conforming_strings - old, self.client_min_messages = client_min_messages, 'panic' - execute('SET standard_conforming_strings = on', 'SCHEMA') rescue nil - ensure - self.client_min_messages = old + execute(<<-SQL, 'SCHEMA') + UPDATE pg_settings + SET setting = 'on' + WHERE name = 'standard_conforming_strings' + SQL end
  82. https://gocardless.com/blog/the-troubleshooting-tales-issues-scaling-postgres- connections/

  83. https://twitter.com/rbranson/status/675005104701870080

  84. There’s a lot more than /usr/bin/postgresql

  85. - PgBouncer - libpq - Ruby pg gem - ActiveRecord

    (ORM)
  86. It’s just code

  87. Don’t trust the ORM

  88. Resource utilisation ORMs Clustering

  89. How soon are we back online? Is all our data

    there?
  90. What’s in a cluster?

  91. Consensus State machine Health checks Actions

  92. Pacemaker

  93. Incident 4 Sorry, I cannot hear you I’m kinda busy

  94. It starts with a page

  95. crm mon status - demote event for Postgresql on <node>:

    OK
  96. psql: FATAL: sorry, too many clients already

  97. 30 seconds of API downtime

  98. Roll back the last change

  99. # Should only be set for schema migrations USE_DIRECT_PG_CONNECTION=true

  100. What about superuser_reserved_connections?

  101. 70 Normal App

  102. 70 26 Normal App Unused

  103. 70 3 1 26 Normal App Unused Superuser (non-reserved) Superuser

    (reserved)
  104. 70 3 1 26 Normal Incident App Unused Superuser (non-reserved)

    Superuser (reserved) App 97
  105. 70 3 1 26 3 1 Normal Incident App Unused

    Superuser (non-reserved) Superuser (reserved) App Superuser (rejected) Superuser (reserved) 97
  106. Set appropriate bounds on the system

  107. None
  108. Incident 5 What’s in a health check?

  109. su postgres -c 'psql -c "select now()"' Change to the

    Postgres user Run a simple query Connect to the database
  110. Humans in LDAP Robots locally

  111. su postgres -c 'psql -c "select now()"' Postgres is a

    local user
  112. Safe to restart an LDAP node, right?

  113. Wrong

  114. Cluster transition due to health-check timeout

  115. su postgres -c 'psql -c "select now()"'

  116. su postgres -c 'echo "hello"' Command doesn’t matter

  117. Enter strace, ltrace

  118. strace - print every system call made by a process

  119. ltrace - same, but for C library calls

  120. Opinion: the most useful debugging tool you can learn http://jvns.ca/blog/2015/04/14/strace-zine/

  121. # strace su postgres -c 'echo "hello"' getuid() = 0

    # ltrace su postgres -c 'echo "hello"' getlogin() = "chris" getpwnam_r(…) # blocks
  122. # strace su postgres -c 'echo "hello"' getuid() = 0

    # ltrace su postgres -c 'echo "hello"' getlogin() = "chris" getpwnam_r(…) # blocks !?
  123. getlogin

  124. the user name associated by the login activity with the

    controlling terminal of the current process
  125. the user name associated by the login activity with the

    controlling terminal of the current process
  126. $ ./test_getlogin User is: chris $ sudo -i # ./test_getlogin

    User is: chris
  127. /var/run/utmp /proc/<pid>/loginuid

  128. # ps aux | grep '[p]acemakerd' | awk '{print $2}'

    9001
  129. # ps aux | grep '[p]acemakerd' | awk '{print $2}'

    9001 # cat /proc/9001/loginuid 1234
  130. # ps aux | grep '[p]acemakerd' | awk '{print $2}'

    9001 # cat /proc/9001/loginuid 1234 # getent passwd 1234 | cut -d':' -f1 chris
  131. None
  132. None
  133. None
  134. service pacemaker restart changes the user

  135. Hold up. What is su doing?

  136. The code is subtle

  137. Getting a passwd struct for the calling user

  138. uid = getuid() 0 (root)

  139. uid = getuid() login = getlogin() 0 (root) chris

  140. uid = getuid() login = getlogin() passwd = passwd_from_name(login) 0

    (root) chris {…}
  141. uid = getuid() login = getlogin() passwd = passwd_from_name(login) if

    (passwd.uid == uid) return passwd 0 (root) chris {…} 0 == 1234
  142. uid = getuid() login = getlogin() passwd = passwd_from_name(login) if

    (passwd.uid == uid) return passwd else return passwd_from_uid(uid) 0 (root) chris {…} 0 == 1234
  143. Only does anything if you have duplicate UIDs

  144. Why does the login activity matter?

  145. There may be Reasons™

  146. Set appropriate bounds on the system

  147. bind_timelimit

  148. There’s a lot more than /usr/bin/postgresql

  149. Run game days https://stripe.com/blog/game-day-exercises-at-stripe

  150. 3 things to do

  151. Watch the logs Set appropriate bounds Check what your ORM’s

    up to
  152. And two thoughts…

  153. Operating a database is about much more than the database

    itself
  154. https://twitter.com/cscotta/status/535898821914419200

  155. Thank you '❤ @ChrisSinjo @GoCardlessEng

  156. Questions? '❤ @ChrisSinjo @GoCardlessEng

  157. We’re hiring '❤ @ChrisSinjo @GoCardlessEng

  158. Image credits • Heart emoji - https://github.com/mozilla/fxemoji/blob/ 9f68ca9c5bc51521f9ffe284e00ba8b7308e2c41/svgs/ FirefoxEmoji/u2764-redheart.svg •

    Elephants - https://www.flickr.com/photos/makeitkenya/ 22047623331/ • Kitten - https://www.flickr.com/photos/aigle_dore/7787096102
  159. References (Incident 1) • Safe Operations For High Volume PostgreSQL

    (Braintree) - https://www.braintreepayments.com/blog/safe-operations-for- high-volume-postgresql/ • Zero-downtime Postgres migrations - the hard parts - https:// gocardless.com/blog/zero-downtime-postgres-migrations-the- hard-parts/ • Zero-downtime Postgres migrations - a little help - https:// gocardless.com/blog/zero-downtime-postgres-migrations-a-little- help/
  160. References (Incident 1) • Postgres 9.5 Client Connection Defaults (lock_timeout,

    statement_timeout) - https://www.postgresql.org/docs/9.5/ static/runtime-config-client.html • Postgres 9.6 Client Connection Defaults (idle_in_transaction_session_timeout) - https:// www.postgresql.org/docs/9.5/static/runtime-config-client.html • Postgres 9.5 Error Reporting and Logging (log_lock_waits) - https://www.postgresql.org/docs/9.5/static/runtime-config- logging.html
  161. References (Incident 2) • Postgres 9.5 Resource Consumption (temp_file_limit) -

    https:// www.postgresql.org/docs/current/static/runtime-config- resource.html
  162. References (Incident 3) • PgBouncer feature matrix for pooling modes

    - https:// wiki.postgresql.org/wiki/ PgBouncer#Feature_matrix_for_pooling_modes • The Troubleshooting Tales: issues scaling Postgres connections - https:// gocardless.com/blog/the-troubleshooting-tales-issues-scaling-postgres- connections/ • Rails commit that fixed client_min_messages issue - https://github.com/ rails/rails/commit/a456acb2f2af8365eb9151c7cd2d5a10c189d191 • Rick Branson's tweet about the same problem in Django - https:// twitter.com/rbranson/status/675005104701870080
  163. References (Incident 4) • Postgres 9.5 Connections and Authentication (superuser_reserved_connections)

    - https:// www.postgresql.org/docs/9.5/static/runtime-config- connection.html
  164. References (Incident 5) • Julia Evans' strace zine - http://jvns.ca/blog/2015/04/14/

    strace-zine/ • Game Day Exercises at Stripe: Learning from `kill -9` - https:// stripe.com/blog/game-day-exercises-at-stripe
  165. References (Incident 5) • Current code in `su` to look

    up user - https://github.com/ shadow-maint/shadow/blob/ ef45bb2496182b5df90ad0323bef75d1a5d69887/contrib/ pwdauth.c#L127 • Much older similar code in `su` (1994 or earlier) - https:// github.com/freebsd/freebsd/blob/ b5c3fb9427806b740f7df3e43a1513e1f5fa840b/usr.bin/su/ su.c#L255
  166. References (Incident 5) • getuid manual - http://pubs.opengroup.org/onlinepubs/ 009695399/functions/getuid.html •

    getlogin manual - http://pubs.opengroup.org/onlinepubs/ 009695399/functions/getlogin.html • getpwnam manual - http://pubs.opengroup.org/onlinepubs/ 9699919799/functions/getpwnam.html • getpwuid manual - http://pubs.opengroup.org/onlinepubs/ 009695399/functions/getpwuid.html
  167. References (Misc) • Practical Postmortems at Etsy - https://www.infoq.com/articles/ postmortems-etsy

    • Scott Andreas' tweet about investigating things - https:// twitter.com/cscotta/status/535898821914419200