Visualizing Postgres I/O Performance for Development

Melanie Plageman

June 12, 2023
Transcript

  1. Visualizing Postgres I/O
    Performance for
    Development
    Melanie Plageman

  2. Total TPS != User Experience
    Total TPS: 22,029 vs. 21,861

  3. View Performance Metrics over Time

  4. Use Multiple Systems and Tools
    to Gather Information

  5. Storage Stack Layers

  6. Metrics Sources
    - Postgres
    - pg_stat_io
    - pg_buffercache_summary
    - pg_stat_wal
    - pg_stat_activity waits
    - pg_total_relation_size()
    - Operating System
    - /proc/meminfo
    - pidstat
    - iostat
    - Benchmark
    - pgbench latency
    - pgbench TPS
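
    One way to sample several of these sources together is a small shell loop (a rough sketch; the database name, the 5-second interval, and the pg_buffercache extension being installed are assumptions):

      while true; do
        psql -d postgres -c "SELECT now(), * FROM pg_stat_io WHERE backend_type = 'client backend'"
        psql -d postgres -c "SELECT * FROM pg_buffercache_summary()"     # requires the pg_buffercache extension
        psql -d postgres -c "SELECT wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_sync FROM pg_stat_wal"
        psql -d postgres -c "SELECT wait_event_type, wait_event, count(*) FROM pg_stat_activity GROUP BY 1, 2"
        grep -E 'MemFree|Dirty' /proc/meminfo    # OS memory state
        iostat -x 1 1                            # per-device request size, queue depth, utilization
        pidstat -d 1 1                           # per-process I/O
        sleep 5
      done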

  7. Benchmark Setup For Scenarios
    • 16 core, 32 thread AMD CPU
    • Linux 5.19
    • Sabrent Rocket NVMe 4.0 2TB (seq r/w 5000/4400 MBps, random r/w
    750000 IOPS)
    • ext4 with noatime,data=writeback
    • 64 GB RAM
    • 2 MB huge pages
    • Postgres compiled from source with -O2
    • pgbench
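
    A sketch of how a machine and build like this might be prepared (the device, paths, and huge page count are illustrative placeholders):

      sudo sysctl vm.nr_hugepages=11000                            # 2 MB huge pages for shared buffers
      sudo mount -o noatime,data=writeback /dev/nvme0n1p1 /mnt/pg  # ext4 mount options from above
      ./configure --prefix=$HOME/pg && make -j32 && make install   # default optimization level is -O2
      $HOME/pg/bin/initdb -D /mnt/pg/data
      echo "huge_pages = on" >> /mnt/pg/data/postgresql.conf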

  8. Using Metrics Together to
    Understand the Why

  9. backend_flush_after

  10. backend_flush_after 1MB finishes faster
    pgbench, 10 MB file COPY
    16 clients
    700 transactions
    20 GB shared buffers
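
    A sketch of driving this scenario; the table name, CSV path, and script name are made up, and the flag values mirror the numbers above:

      psql -c "ALTER SYSTEM SET backend_flush_after = '1MB'"    # compare against the default of 0
      psql -c "SELECT pg_reload_conf()"
      echo "COPY copy_target FROM '/tmp/data_10mb.csv' (FORMAT csv);" > copy_10mb.sql   # copy_target must already exist
      pgbench -n -c 16 -j 16 -t 700 -f copy_10mb.sql -P 5 postgres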

  11. More backend writebacks

  12. Latency spikes without backend_flush_after
    as queue fills up

  13. Kernel writing out dirty data

  14. Initial TPS dip likely caused by memory
    pressure. Free memory hits 0

  15. Second dip coincides with checkpoint

  16. Using Metrics to Clarify other
    Metrics

  17. wal_compression

  18. Fewer Transactions without wal_compression
    pgbench, TPCB-like built-in, mode=prepared
    data scale 4000
    16 clients
    600 seconds
    20 GB shared buffers
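
    Roughly, the two runs could differ only in wal_compression (flag values mirror the slide); checking pg_stat_wal afterwards ties into the full-page-image and WAL-volume slides that follow:

      psql -c "ALTER SYSTEM SET wal_compression = on"      # compare against 'off'
      psql -c "SELECT pg_reload_conf()"
      pgbench -i -s 4000 postgres                          # data scale 4000
      pgbench -M prepared -c 16 -j 16 -T 600 -P 5 postgres
      psql -c "SELECT wal_fpi, wal_bytes, wal_sync FROM pg_stat_wal"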

  19. Higher latency and lower TPS without WAL
    compression

  20. Fewer Full Page Writes without
    wal_compression because there are fewer transactions

  21. Fewer writes/second without WAL
    compression

  22. Higher write throughput without WAL
    compression, so the writes are larger

  23. WAL bytes higher without WAL compression,
    so the increased writes were WAL I/O

  24. WAL syncs much higher without compression,
    so additional flush requests are WAL

  25. Backends doing fewer writes and reads without
    compression. Bottlenecked on WAL I/O

  26. Benchmark Setup For Correct
    Comparisons

  27. initdb before every benchmark

  28. Without doing initdb first, 2000 COPY FROMs
    complete sooner
    pgbench, 1 MB file COPY
    16 clients
    2000 transactions
    20 GB shared buffers
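
    One way to make runs comparable is to recreate the cluster before each one (paths and the setup/COPY scripts are placeholders):

      pg_ctl -D "$PGDATA" stop
      rm -rf "$PGDATA" && initdb -D "$PGDATA"
      pg_ctl -D "$PGDATA" start
      psql -f setup.sql postgres                         # recreate the schema / load fixtures
      pgbench -n -c 16 -t 2000 -f copy_1mb.sql postgres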

  29. More flush requests issued after just having
    done initdb

  30. Waiting for WAL Init Sync after having done
    initdb

  31. TPS dip at 40 seconds corresponds with
    running out of system memory

  32. Increase wal_segment_size to 1GB, COPY FROMs
    take much longer, TPS is very spiky after initdb
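
    wal_segment_size is set when the cluster is created, so this variant is configured through initdb (the size is in megabytes); the WAL-related waits can then be watched in pg_stat_activity (a sketch, $PGDATA is a placeholder):

      initdb -D "$PGDATA" --wal-segsize=1024
      # WALInitWrite / WALInitSync waits show up here while new segments are allocated
      psql -c "SELECT wait_event, count(*) FROM pg_stat_activity WHERE wait_event_type = 'IO' GROUP BY 1"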

  33. Fewer flush requests because each one takes
    longer and WAL file allocation takes longer with
    bigger WAL segment size

  34. With reduced min_wal_size and pause after
    loading data, performance without initdb is similar
    to with initdb

  35. The number of flush requests is the same as
    with initdb

  36. WAL Init Sync Waits with pause and
    decreased min_wal_size

  37. Benchmark Configuration Choices

  38. prepared vs simple
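
    The comparison is just the pgbench query-mode flag; in simple mode each statement is parsed and planned on every execution, which is where the extra CPU in the later slide comes from (other flag values are illustrative):

      pgbench -M prepared -c 16 -j 16 -T 600 -P 5 postgres
      pgbench -M simple   -c 16 -j 16 -T 600 -P 5 postgres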

  39. Higher TPS with prepared query mode vs
    simple

  40. Additional CPU usage with simple query
    mode

  41. Benchmark Choice and Reflecting
    Customer Workloads

  42. data access distribution

  43. Gaussian data access distribution often performs
    better than uniform random access and is similar
    to real workloads
    pgbench, TPCB-like built-in and custom, mode=prepared, sync commit = off
    data scale 4200
    16 clients
    500 seconds
    20 GB shared buffers
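
    A sketch of a custom script using pgbench's gaussian random function against a pgbench-initialized database (the shape parameter 2.5 and the read/write mix are illustrative; the built-in TPC-B-like script uses uniform random access):

      printf '%s\n' \
        '\set aid random_gaussian(1, 100000 * :scale, 2.5)' \
        '\set delta random(-5000, 5000)' \
        'BEGIN;' \
        'UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;' \
        'SELECT abalance FROM pgbench_accounts WHERE aid = :aid;' \
        'END;' > gauss.sql
      PGOPTIONS='-c synchronous_commit=off' pgbench -n -M prepared -c 16 -j 16 -T 500 -f gauss.sql postgres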

  44. Uniform random access does more reads and
    writes because working set doesn’t fit in memory

  45. Usage count is low for random data access
    distribution

  46. Backend cache hit ratio is worse for uniform
    random access
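
    A hit ratio and usage count distribution like these can be pulled from pg_stat_io and pg_buffercache (a sketch; shared-buffer I/O on ordinary relations by client backends only):

      psql -c "SELECT hits, reads, evictions, round(hits::numeric / nullif(hits + reads, 0), 4) AS hit_ratio
               FROM pg_stat_io
               WHERE backend_type = 'client backend' AND object = 'relation' AND context = 'normal'"
      psql -c "SELECT usagecount, count(*) FROM pg_buffercache GROUP BY 1 ORDER BY 1"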

  47. More evictions of shared buffers

  48. Backends are doing more reads and writes

  49. Determine when System
    Configurations Matter

  50. readahead

  51. read_ahead_kb
    target readahead = sequential BW * latency
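
    As a rough worked example with made-up numbers: 5000 MB/s of sequential bandwidth at ~0.1 ms of latency gives a target of about 500 KB, and with 1 ms of added latency it grows to about 5 MB. Readahead is set per block device (the device name is a placeholder):

      cat /sys/block/nvme0n1/queue/read_ahead_kb                 # current value, in KB
      echo 2048 | sudo tee /sys/block/nvme0n1/queue/read_ahead_kb
      sudo blockdev --setra 4096 /dev/nvme0n1                    # equivalent; --setra counts 512-byte sectors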

  52. Larger read_ahead_kb finishes slightly sooner
    pgbench, SELECT * FROM large_table
    5 GB table
    1 client
    3 transactions
    8 GB shared buffers

  53. Read request size is much larger

  54. With 1ms added latency via dmsetup delay, run
    with read_ahead_kb 2048 finishes in 30 seconds
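
    A sketch of adding the latency with the dm-delay target (the device name is a placeholder; the last field is the delay in milliseconds):

      SECTORS=$(sudo blockdev --getsz /dev/nvme0n1)
      echo "0 $SECTORS delay /dev/nvme0n1 0 1" | sudo dmsetup create delayed
      # the slowed device is then available as /dev/mapper/delayed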

  55. Large request size and large read throughput

  56. Questioning Your Assumptions

  57. autovacuum_vacuum_cost_delay

  58. TPS starts high and gradually goes down with
    autovacuum_vacuum_cost_delay > 0
    pgbench, TPCB-like@1 + INSERT/DELETE@9 , mode=prepared
    data scale 4300
    32 clients
    600 seconds
    16 GB shared buffers
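
    Roughly, the runs differ only in the cost delay, and the mixed workload can be expressed with pgbench script weights (the INSERT/DELETE script name is a placeholder):

      psql -c "ALTER SYSTEM SET autovacuum_vacuum_cost_delay = 0"   # compare against the 2ms default
      psql -c "SELECT pg_reload_conf()"
      pgbench -n -M prepared -c 32 -j 32 -T 600 -b tpcb-like@1 -f insert_delete.sql@9 postgres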

  59. Latency increases proportionally

  60. % time I/O requests being issued is much
    lower with higher cost delay

  61. Autovacuum mostly waiting

  62. Spike in reads not from autovacuum

  63. Size of the relations being thrashed is increasing
    and backend cache hit ratio is plummeting

  64. System CPU usage is increasing. Potentially
    caused by swapping

  65. Comparing only autovacuum_vacuum_cost_delay
    2ms (default) vs 0

  66. Relation size relatively constant for delay = 0

  67. More autovacuum cache hits and fewer reads
    with cost delay 0

  68. More shared buffer evictions by autovacuum
    with default cost delay

  69. Autovacuum cleaning buffers and putting them on
    the freelist, so there are more unused buffers

  70. No backend flushes required because there
    are clean buffers

  71. Finding the Real Root Cause

  72. wal_buffers

  73. COPY FROMs with larger wal_buffers finish
    faster
    pgbench, 20MB file, COPY FROM
    16 clients
    100 transactions
    10 GB shared buffers
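
    A sketch of the comparison (the value and script name are illustrative; wal_buffers can only change at server start, and pg_stat_wal shows how often it filled up):

      psql -c "ALTER SYSTEM SET wal_buffers = '256MB'"     # compare against a small value such as '1MB'
      pg_ctl -D "$PGDATA" restart
      pgbench -n -c 16 -j 16 -t 100 -f copy_20mb.sql -P 5 postgres
      psql -c "SELECT wal_buffers_full, wal_write, wal_sync FROM pg_stat_wal"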

  74. wal_buffers are full less often

  75. Smaller wal_buffers end up contending on the
    WALInsert lock, meaning backends are waiting much
    more often

  76. Smaller wal_buffers cause those runs to do
    less I/O overall

  77. Smaller wal_buffers fill up and then cause
    waiting for WAL Sync

  78. Much higher throughput with larger
    wal_buffers, but how can the dips be explained?

  79. At 20 seconds, the runs start doing more,
    smaller writes

  80. Fewer write merges and more requests in the
    queue

  81. Dirty data has built up, then it starts being
    flushed by kernel before second slowdown

  82. Shared buffers fills up around 20 seconds,
    faster with larger wal_buffers

  83. System memory fills up at 40 seconds
    explaining the second dip

  84. Needed pages are being swapped out and
    have to be read back in

  85. COPY FROM workload is impacted by wal_buffers,
    but a transactional workload would not be

  86. Benchmarking as a Developer
    • Not just configuring databases but identifying bottlenecks that can be
    addressed with code
    • Understanding system interactions when designing new features and
    performance enhancements
    • Designing scenarios that put the right things under test
