When it all Goes Wrong (with Postgres) | Railsconf 2019 | Will Leinweber

When it all Goes Wrong (with Postgres) | Railsconf 2019 | Will Leinweber

You're woken up in the middle of the night to your phone. Your app is down and you're on call to fix it. Eventually you track it down to "something with the db," but what exactly is wrong? And of course, you're sure that nothing changed recently…

Knowing what to fix, and even where to start looking, is a skill that takes a long time to develop. Especially since Postgres normally works very well for months at a time, not letting you get practice!

In this talk, I'll share not only the more common failure cases and how to fix them, but also a general approach to efficiently figuring out what's wrong in the first place.

024d6a0dd14fb31c804969a57a06dfbe?s=128

Citus Data

May 01, 2019
Tweet

Transcript

  1. 4.

    @leinweber special thanks citus cloud
 — dan farina (@danfarina) heroku

    postgres
 — maciek sakrejda (@uhoh_itsmaciek)
  2. 12.

    @leinweber hard to convince all the graphs saying DB is

    slow and nothing has changed …must be the database!
  3. 21.
  4. 26.

    @leinweber cpu mem disk parallelism application submitting backlogged workload connection

    leak pool sizes set too large pg_lock issue + application backlog
  5. 27.

    @leinweber cpu mem disk parallelism workload skew causing thrashing unusual

    sequential scan workload failover or restart => no cache pg_prewarm
  6. 28.
  7. 29.

    @leinweber cpu mem disk parallelism large GROUP BYs high disk

    latency due to unusual page dispersion pattern in the workload
  8. 30.

    @leinweber cpu mem disk parallelism workload has high mem (GROUP

    BY)
 + app adding backlog lock contention slowing mem release
  9. 33.

    @leinweber cpu mem disk parallelism small, in-memory workload lots of

    seq scans on small table index scan w/ filter dropping lots
  10. 34.

    @leinweber cpu mem disk parallelism app backlog 
 + too

    much processing on small data simply a lot of work
  11. 53.

    @leinweber tools of the trade: backends pgrep -lf postgres +

    grep + wc select * from pg_stat_activity
  12. 55.

    @leinweber tools of the trade: summary cpu mem disk parallelism

    network perf x gdb x iostat x iotop x htop x x bwm x pgrep x
  13. 60.

    @leinweber flirting with disaster Velocity NY 2013: Richard Cook
 "Resilience

    In Complex Adaptive Systems” Jens Rasmussen:
 Risk management in a dynamic society: a modeling problem
  14. 68.

    @leinweber flirting with disaster Velocity NY 2013: Richard Cook
 "Resilience

    In Complex Adaptive Systems” Jens Rasmussen: 
 Risk management in a dynamic society: a modeling problem