
When it all Goes Wrong | Nordic PGDay 2019 | Will Leinweber


You're woken up in the middle of the night by your phone. Your app is down and you're on call to fix it. Eventually you track it down to "something with the db," but what exactly is wrong? And of course, you're sure that nothing changed recently…

Knowing what to fix, and even where to start looking, is a skill that takes a long time to develop, especially since Postgres normally works very well for months at a time and gives you little chance to practice!

In this talk, I'll share not only the more common failure cases and how to fix them, but also a general approach to efficiently figuring out what's wrong in the first place.

Citus Data

March 19, 2019



Transcript

  1. When it all Goes Wrong
    Nordic PGDay 2019 — March 19 — Copenhagen


  2. @leinweber
    Will Leinweber
    @leinweber

    Citus Data (Microsoft)

    bitfission.com

    (warning autoplays midi)


  3. @leinweber
    coming from
    citus cloud

    heroku postgres


  4. @leinweber
    special thanks
    citus cloud

    — dan farina (@danfarina)

    heroku postgres

    — maciek sakrejda (@uhoh_itsmaciek)


  5. @leinweber
    same sorts of problems
    from pages & alerts

    from support tickets


  6. @leinweber
    this talk
    more app dev who uses postgres

    rather than dba


  7. @leinweber
    the problem with Postgres
    it’s pretty good

    you don’t get experience with how it breaks


  8. @leinweber
    what to do for a problem


  9. @leinweber
    what to do for a problem


  10. @leinweber
    complicated system
    network

    hardware

    o/s

    postgres


  11. @leinweber
    using the database (too much)
    95% application

    4% auto vacuum

    1% everything else


  12. @leinweber
    hard to convince
    all the graphs saying DB is slow

    and nothing has changed

    …must be the database!


  13. @leinweber
    https://upload.wikimedia.org/wikipedia/commons/9/98/Survivorship-bias.png


  14. @leinweber
    “but I didn’t change anything”
    no deploys!

    no database migrations!

    no scaling!


  15. @leinweber
    “but I didn’t change anything”
    https://upload.wikimedia.org/wikipedia/commons/0/09/Redherring.gif


  16. @leinweber
    “but I didn’t change anything”
    more traffic?

    change in access patterns?

    one big user logged in?


  17. @leinweber
    run out of a resource


  18. @leinweber
    snowball


  19. @leinweber
    example
    manageable user 1s query => 2x expensive

    frequent, small queries 3ms => 12ms

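The snowball in the example slide can be made concrete with Little's law (average queries in flight = arrival rate × average latency). The sketch below is only an illustration: the 3 ms and 12 ms latencies come from the slide, while the arrival rate of 2,000 queries per second is a hypothetical number chosen for the arithmetic.

```python
def busy_backends(queries_per_second: float, latency_ms: float) -> float:
    """Little's law: average number of queries in flight (busy backends)."""
    return queries_per_second * (latency_ms / 1000.0)


# Hypothetical arrival rate; the slide only gives the two latencies.
RATE = 2000.0

before = busy_backends(RATE, 3.0)   # frequent, small queries at 3 ms
after = busy_backends(RATE, 12.0)   # the same queries once they take 12 ms

print(f"busy backends before: {before:.1f}, after: {after:.1f}")
```

With the traffic unchanged, a fourfold increase in latency means roughly four times as many backends are busy at any moment, which is how one slightly more expensive query can push the rest of the workload toward a resource limit.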

  20. @leinweber
    assumptions
    hardware
    maintenance
    app


  21. @leinweber
    assumptions
    postgres should not crash

    …with overcommit off and no containers

    large extensions increase chance


  22. @leinweber
    if not postgres, then what


  23. @leinweber
    system resources
    cpu

    memory

    disk

    parallelism / backends

    locks


  24. @leinweber
    cpu mem disk parallelism
    cpu mem disk parallelism


  25. @leinweber
    cpu mem disk parallelism
    credentials wrong

    networking broken

    locking issue, check pg_locks

    idle in transaction

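One way to look for the "idle in transaction" case is to read `pg_stat_activity` and flag sessions whose transaction has been open for a long time. This is a minimal sketch, not the speaker's tooling: the row shape (a list of dicts) and the five-minute threshold are assumptions, and the SQL is shown only as a string you might run by hand.

```python
from datetime import datetime, timedelta, timezone

# A query one might run by hand; the columns exist in pg_stat_activity.
IDLE_IN_TXN_SQL = """
SELECT pid, state, xact_start, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;
"""


def stale_transactions(rows, now, max_age=timedelta(minutes=5)):
    """Return pids of sessions idle inside a transaction open longer than max_age.

    rows: iterable of dicts with 'pid', 'state' and 'xact_start' keys
    (a hypothetical shape; adapt it to your driver's row type).
    """
    return [
        r["pid"]
        for r in rows
        if r["state"] == "idle in transaction"
        and r["xact_start"] is not None
        and now - r["xact_start"] > max_age
    ]


# Illustrative rows, not real output.
now = datetime(2019, 3, 19, 12, 0, tzinfo=timezone.utc)
sample = [
    {"pid": 101, "state": "idle in transaction", "xact_start": now - timedelta(minutes=42)},
    {"pid": 102, "state": "idle in transaction", "xact_start": now - timedelta(seconds=2)},
    {"pid": 103, "state": "active", "xact_start": now - timedelta(hours=1)},
    {"pid": 104, "state": "idle", "xact_start": None},
]
print(stale_transactions(sample, now))
```

A session that sits idle inside an open transaction can keep holding the locks it has taken, so the pids this returns are reasonable starting points when cross-checking `pg_locks`.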

  26. @leinweber
    cpu mem disk parallelism
    application submitting backlogged workload

    connection leak

    pool sizes set too large

    pg_lock issue + application backlog

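The "pool sizes set too large" case is mostly arithmetic: every application process can open up to its pool size, and the sum has to fit under what Postgres accepts. A rough sketch follows, with hypothetical instance counts; the defaults of 100 for `max_connections` and 3 for `superuser_reserved_connections` are the stock PostgreSQL values and may differ on your server.

```python
def connection_budget(app_instances, pool_size, max_connections=100, reserved=3):
    """Compare worst-case client connections with what Postgres will accept.

    reserved mirrors superuser_reserved_connections, which ordinary
    application roles cannot use.
    """
    demand = app_instances * pool_size
    available = max_connections - reserved
    return {
        "demand": demand,
        "available": available,
        "over_by": max(0, demand - available),
    }


# Hypothetical deployment: 12 app processes, each with a pool of 10.
print(connection_budget(app_instances=12, pool_size=10))
```

If `over_by` is above zero, a backlog that fills every pool at once will run into connection errors on top of whatever caused the backlog.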

  27. @leinweber
    cpu mem disk parallelism
    workload skew causing thrashing

    unusual sequential scan workload

    failover or restart => no cache

    pg_prewarm


  28. @leinweber
    cpu mem disk parallelism
    same as just disk,

    but also the application is piling on


  29. @leinweber
    cpu mem disk parallelism
    large GROUP BYs

    high disk latency due to unusual page dispersion pattern in the workload


  30. @leinweber
    cpu mem disk parallelism
    workload has high mem (GROUP BY)

    + app adding backlog

    lock contention slowing mem release


  31. @leinweber
    cpu mem disk parallelism
    large GROUP BYs + paging in unusual data


  32. @leinweber
    cpu mem disk parallelism
    Look for what is causing disk access


  33. @leinweber
    cpu mem disk parallelism
    small, in-memory workload

    lots of seq scans on small table

    index scan w/ filter dropping lots

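The "index scan w/ filter dropping lots" pattern shows up in `EXPLAIN ANALYZE` text output as a large `Rows Removed by Filter` count next to a small number of returned rows. The sketch below pulls those two numbers out of a plan; the sample plan, table and index names are made up for illustration, and the regular expressions assume the default text format.

```python
import re

# Matches the "(actual time=... rows=N loops=M)" part of a plan node line.
ACTUAL_RE = re.compile(r"\(actual time=\S+ rows=([\d.]+) loops=\d+\)")
REMOVED_RE = re.compile(r"Rows Removed by Filter: (\d+)")


def filter_waste(plan_text):
    """Return (node line, rows kept, rows removed) for each node reporting a filter."""
    results = []
    node, kept = None, 0.0
    for line in plan_text.splitlines():
        m = ACTUAL_RE.search(line)
        if m:
            # Remember the most recent plan node and how many rows it returned.
            node, kept = line.strip(), float(m.group(1))
            continue
        m = REMOVED_RE.search(line)
        if m and node is not None:
            results.append((node, kept, int(m.group(1))))
    return results


# Illustrative plan text, not captured from a real system.
SAMPLE_PLAN = """\
Index Scan using orders_customer_id_idx on orders  (cost=0.43..812.50 rows=5 width=64) (actual time=0.031..14.200 rows=4 loops=1)
  Index Cond: (customer_id = 42)
  Filter: (status = 'open'::text)
  Rows Removed by Filter: 19996
Planning Time: 0.110 ms
Execution Time: 14.250 ms
"""

for node, kept, removed in filter_waste(SAMPLE_PLAN):
    print(f"kept {kept:.0f}, discarded {removed}: {node[:40]}")
```

A node that discards thousands of rows to return a handful is doing mostly wasted CPU work, even when everything is already in memory.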

  34. @leinweber
    cpu mem disk parallelism
    app backlog 

    + too much processing on small data

    simply a lot of work


  35. @leinweber
    cpu mem disk parallelism
    large seq scans


  36. @leinweber
    cpu mem disk parallelism
    loading cold data + application backlog


  37. @leinweber
    cpu mem disk parallelism
    small # of backends doing a lot more work


  38. @leinweber
    cpu mem disk parallelism
    entity, workload, entity*workload

    soft deletes and non-conditional indexes


  39. @leinweber
    cpu mem disk parallelism
    reporting query


  40. @leinweber
    cpu mem disk parallelism
    app backlog, but with CPU/mem problems


  41. @leinweber
    tools of the trade


  42. @leinweber
    tools of the trade
    C symbols


  43. @leinweber
    tools of the trade: perf
    perf record -p <pid> && perf report


  44. @leinweber
    tools of the trade: perf
    perf top


  45. @leinweber
    tools of the trade: perf
    www.brendangregg.com/perf.html


  46. @leinweber
    tools of the trade: gdb
    gdb -batch -ex 'bt' -p <pid>


  47. @leinweber
    tools of the trade: iostat
    iostat -xm 10


  48. @leinweber
    tools of the trade: iotop


  49. @leinweber
    tools of the trade: htop


  50. @leinweber
    tools of the trade: bwm-ng


  51. @leinweber
    tools of the trade: backends
    pgrep -lf postgres + grep + wc
    select * from pg_stat_activity

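Rather than reading every row of `pg_stat_activity`, a quick first pass is to count backends per state, which is the SQL equivalent of the `pgrep` + `grep` + `wc` pipeline on the slide. A small sketch, assuming the states have already been fetched into a Python list; the sample values are invented.

```python
from collections import Counter

# A query one might run by hand to get the same summary on the server.
STATE_COUNT_SQL = """
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state
ORDER BY count(*) DESC;
"""


def summarize_backends(states):
    """Count backends per state; NULL states (background processes) get a label."""
    return Counter(s if s is not None else "(no state)" for s in states)


# Illustrative values, not real output.
sample_states = [
    "active", "active", "idle", "idle", "idle",
    "idle in transaction", None,
]
for state, count in summarize_backends(sample_states).most_common():
    print(f"{count:3d}  {state}")
```

A large "idle" count points toward pool sizing or a connection leak, while many "active" or "idle in transaction" backends point toward slow queries or locking.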

  52. @leinweber
    tools of the trade: pg_s_s
    select * from pg_stat_statements

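`select *` from `pg_stat_statements` returns a lot of columns, so it usually helps to rank statements by cumulative time. The sketch below does that ranking in Python over rows fetched as dicts (an assumed shape). The time column is `total_time` on older PostgreSQL releases and `total_exec_time` from version 13 on, so the helper accepts either; the sample rows are invented.

```python
# Roughly equivalent SQL on a pre-13 server:
#   SELECT query, calls, total_time FROM pg_stat_statements
#   ORDER BY total_time DESC LIMIT 5;


def top_statements(rows, limit=5):
    """Rank statements by cumulative execution time and report each one's share."""

    def total_ms(row):
        # total_exec_time on PostgreSQL 13+, total_time on earlier versions.
        return row.get("total_exec_time", row.get("total_time", 0.0))

    grand_total = sum(total_ms(r) for r in rows) or 1.0
    ranked = sorted(rows, key=total_ms, reverse=True)[:limit]
    return [
        (r["query"], r["calls"], round(100.0 * total_ms(r) / grand_total, 1))
        for r in ranked
    ]


# Illustrative rows, not real output.
sample_rows = [
    {"query": "SELECT * FROM users WHERE id = $1", "calls": 1000, "total_time": 3000.0},
    {"query": "SELECT ... GROUP BY account_id", "calls": 2, "total_time": 9000.0},
]
for query, calls, share in top_statements(sample_rows):
    print(f"{share:5.1f}%  calls={calls:<6d} {query}")
```

Sorting by total time rather than mean time surfaces both the rare, heavy reporting query and the cheap query that is simply called very often.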

  53. @leinweber
    tools of the trade: summary
    cpu mem disk parallelism network
    perf x
    gdb x
    iostat x
    iotop x
    htop x x
    bwm x
    pgrep x


  54. @leinweber
    what to do


  55. @leinweber
    what to do
    configuration change


  56. @leinweber
    what to do
    db change


  57. @leinweber
    what to do
    code change


  58. @leinweber
    flirting with disaster
    Velocity NY 2013: Richard Cook

    “Resilience In Complex Adaptive Systems”

    Jens Rasmussen:

    Risk management in a dynamic society: a modeling problem


  59. @leinweber
    flirting with disaster
    economic
    boundary


  60. @leinweber
    flirting with disaster
    economic
    boundary
    workload
    boundary


  61. @leinweber
    flirting with disaster
    economic
    boundary
    workload
    boundary
    performance
    boundary


  62. @leinweber
    flirting with disaster
    economic
    boundary
    workload
    boundary
    performance
    boundary
    error
    margin


  63. @leinweber
    flirting with disaster
    economic
    boundary
    workload
    boundary
    performance
    boundary


  64. @leinweber
    flirting with disaster
    economic
    boundary
    workload
    boundary
    performance
    boundary
    error
    margin


  65. @leinweber
    flirting with disaster
    economic
    boundary
    workload
    boundary
    performance
    boundary
    error
    margin


  66. @leinweber
    flirting with disaster
    Velocity NY 2013: Richard Cook

    “Resilience In Complex Adaptive Systems”

    Jens Rasmussen:

    Risk management in a dynamic society: a modeling problem


  67. @leinweber
    thank you
    Will Leinweber
    @leinweber
    citusdata.com
