When it all Goes Wrong | PGConf EU 2019 | Will Leinweber

024d6a0dd14fb31c804969a57a06dfbe?s=47 Citus Data
October 17, 2019

When it all Goes Wrong | PGConf EU 2019 | Will Leinweber

You're woken up in the middle of the night to your phone. Your app is down and you're on call to fix it. Eventually you track it down to "something with the db," but what exactly is wrong? And of course, you're sure that nothing changed recently…

Knowing what to fix, and even where to start looking, is a skill that takes a long time to develop. Especially since Postgres normally works very well for months at a time, not letting you get practice!

In this talk, I'll share not only the more common failure cases and how to fix them, but also a general approach to efficiently figuring out what's wrong in the first place.

024d6a0dd14fb31c804969a57a06dfbe?s=128

Citus Data

October 17, 2019
Tweet

Transcript

  1. When it all Goes Wrong

  2. @leinweber Will Leinweber @leinweber Citus Data (Microsoft) bitfission.com
 (warning autoplays

    midi)
  3. @leinweber coming from citus cloud heroku postgres

  4. @leinweber special thanks citus cloud
 — dan farina (@danfarina) heroku

    postgres
 — maciek sakrejda (@uhoh_itsmaciek)
  5. @leinweber same sorts of problems from pages & alerts from

    support tickets
  6. @leinweber this talk more app dev who uses postgres
 rather

    than dba
  7. @leinweber the problem with Postgres it’s pretty good you don’t

    get experience with how it breaks
  8. @leinweber what to do for a problem

  9. @leinweber what to do for a problem

  10. @leinweber complicated system network hardware o/s postgres

  11. @leinweber using the database (too much) 95% application 4% auto

    vacuum 1% everything else
  12. @leinweber hard to convince all the graphs saying DB is

    slow and nothing has changed …must be the database!
  13. @leinweber https://upload.wikimedia.org/wikipedia/commons/9/98/Survivorship-bias.png

  14. @leinweber “but I didn’t change anything” no deploys! no database

    migrations! no scaling!
  15. @leinweber “but I didn’t change anything” https://upload.wikimedia.org/wikipedia/commons/0/09/Redherring.gif

  16. @leinweber “but I didn’t change anything” more traffic? change in

    access patterns? one big user logged in?
  17. @leinweber run out of a resource

  18. @leinweber snowball

  19. @leinweber example manageable user 1s query => 2x expensive frequent,

    small queries 3ms => 12ms
  20. @leinweber assumptions hardware maintenance app

  21. @leinweber assumptions postgres should not crash …with overcommit off and

    no containers large extensions increase chance
  22. @leinweber if not postgres, then what

  23. @leinweber system resources cpu memory disk parallelism / backends locks

  24. @leinweber cpu mem disk parallelism cpu mem disk parallelism

  25. @leinweber cpu mem disk parallelism credentials wrong networking broken locking

    issue, check pg_locks idle in transaction
  26. @leinweber cpu mem disk parallelism application submitting backlogged workload connection

    leak pool sizes set too large pg_lock issue + application backlog
  27. @leinweber cpu mem disk parallelism workload skew causing thrashing unusual

    sequential scan workload failover or restart => no cache pg_prewarm
  28. @leinweber cpu mem disk parallelism same as just disk, but

    also the application is piling on
  29. @leinweber cpu mem disk parallelism large GROUP BYs high disk

    latency due to unusual page dispersion pattern in the workload
  30. @leinweber cpu mem disk parallelism workload has high mem (GROUP

    BY)
 + app adding backlog lock contention slowing mem release
  31. @leinweber cpu mem disk parallelism large GROUP BYs + paging

    in unusual data
  32. @leinweber cpu mem disk parallelism Look for what is causing

    disk access
  33. @leinweber cpu mem disk parallelism small, in-memory workload lots of

    seq scans on small table index scan w/ filter dropping lots
  34. @leinweber cpu mem disk parallelism app backlog 
 + too

    much processing on small data simply a lot of work
  35. @leinweber cpu mem disk parallelism large seq scans

  36. @leinweber cpu mem disk parallelism loading cold data + application

    backlog
  37. @leinweber cpu mem disk parallelism small # of backends doing

    a lot more work
  38. @leinweber cpu mem disk parallelism entity, workload, entity*workload soft deletes

    and non-conditional indexes
  39. @leinweber cpu mem disk parallelism reporting query

  40. @leinweber cpu mem disk parallelism app backlog, but with CPU/mem

    problems
  41. @leinweber tools of the trade

  42. @leinweber tools of the trade C symbols

  43. @leinweber tools of the trade: perf perf record -p <pid>

    && perf report
  44. @leinweber tools of the trade: perf perf top

  45. @leinweber tools of the trade: perf www.brendangregg.com/perf.html

  46. @leinweber tools of the trade: gdb gdb -batch -ex 'bt'

    -p <pid>
  47. @leinweber

  48. @leinweber

  49. @leinweber tools of the trade: iostat iostat -xm 10

  50. @leinweber tools of the trade: iotop

  51. @leinweber tools of the trade: htop

  52. @leinweber Tools of the trade: bwm-ng

  53. @leinweber tools of the trade: backends pgrep -lf postgres +

    grep + wc select * from pg_stat_activity
  54. @leinweber tools of the trade: pg_s_s select * from pg_stat_statements

  55. @leinweber tools of the trade: summary cpu mem disk parallelism

    network perf x gdb x iostat x iotop x htop x x bwm x pgrep x
  56. @leinweber what to do

  57. @leinweber what to do configuration change

  58. @leinweber what to do db change

  59. @leinweber what to do code change

  60. @leinweber flirting with disaster Velocity NY 2013: Richard Cook
 "Resilience

    In Complex Adaptive Systems” Jens Rasmussen:
 Risk management in a dynamic society: a modeling problem
  61. @leinweber flirting with disaster economic boundary

  62. @leinweber flirting with disaster economic boundary workload boundary

  63. @leinweber flirting with disaster economic boundary workload boundary performance boundary

  64. @leinweber flirting with disaster economic boundary workload boundary performance boundary

    error margin
  65. @leinweber flirting with disaster economic boundary workload boundary performance boundary

  66. @leinweber flirting with disaster economic boundary workload boundary performance boundary

    error margin
  67. @leinweber flirting with disaster economic boundary workload boundary performance boundary

    error margin
  68. @leinweber flirting with disaster Velocity NY 2013: Richard Cook
 "Resilience

    In Complex Adaptive Systems” Jens Rasmussen: 
 Risk management in a dynamic society: a modeling problem
  69. @leinweber thank you Will Leinweber @leinweber citusdata.com