Slide 1

Slide 1 text

Visualizing Postgres I/O Performance for Development
Melanie Plageman

Slide 2

Slide 2 text

Total TPS != User Experience
Total TPS: 22,029 vs. 21,861

Slide 3

Slide 3 text

View Performance Metrics over Time

Slide 4

Slide 4 text

Use Multiple Systems and Tools to Gather Information

Slide 5

Slide 5 text

Storage Stack Layers

Slide 6

Slide 6 text

Metrics Sources
- Postgres: pg_stat_io, pg_buffercache_summary, pg_stat_wal, pg_stat_activity waits, pg_total_relation_size()
- Operating System: /proc/meminfo, pidstat, iostat
- Benchmark: pgbench latency, pgbench TPS
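The talk combines these sources by sampling them on a fixed interval and plotting them over time. A minimal sketch of that kind of sampler, assuming psycopg2 is installed, a local trusted connection, Postgres 16+ for pg_stat_io, and Postgres 14+ for pg_stat_wal (connection string, interval, and column choices are assumptions, not the talk's actual tooling):

```python
# Hypothetical periodic sampler: prints a timestamped snapshot of a few of
# the Postgres counters listed above.
import time
import psycopg2

QUERIES = {
    # pg_stat_io is available in Postgres 16+
    "io_writes": "SELECT sum(writes) FROM pg_stat_io",
    "io_evictions": "SELECT sum(evictions) FROM pg_stat_io",
    # pg_stat_wal is available in Postgres 14+
    "wal_bytes": "SELECT wal_bytes FROM pg_stat_wal",
    "wal_buffers_full": "SELECT wal_buffers_full FROM pg_stat_wal",
}

def sample(interval_seconds=1):
    conn = psycopg2.connect("dbname=postgres")   # assumed local connection
    conn.autocommit = True
    with conn.cursor() as cur:
        while True:
            row = {}
            for name, sql in QUERIES.items():
                cur.execute(sql)
                row[name] = cur.fetchone()[0]
            print(time.strftime("%H:%M:%S"), row)
            time.sleep(interval_seconds)

if __name__ == "__main__":
    sample()
```

The same loop can shell out to iostat and pidstat or read /proc/meminfo so that OS-level and Postgres-level series share one timeline.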

Slide 7

Slide 7 text

Benchmark Setup For Scenarios
• 16 core, 32 thread AMD CPU
• Linux 5.19
• Sabrent Rocket NVMe 4.0 2TB (seq r/w 5000/4400 MBps, random r/w 750,000 IOPS)
• ext4 with noatime,data=writeback
• 64 GB RAM
• 2 MB huge pages
• Postgres compiled from source with -O2
• pgbench
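The scenarios that follow were driven with pgbench. A rough sketch of how such a run might be invoked from a harness, with the script file, client and transaction counts, and database name as placeholders rather than the talk's exact commands:

```python
# Hypothetical wrapper around pgbench for one comparison run.
import subprocess

def run_pgbench(script="copy_10mb.sql", clients=16, transactions=700,
                dbname="bench"):
    cmd = [
        "pgbench",
        "-n",                      # skip vacuum; the tables are custom
        "-M", "prepared",          # query mode used in most scenarios
        "-c", str(clients),        # concurrent clients
        "-j", str(clients),        # worker threads
        "-t", str(transactions),   # transactions per client
        "-P", "1",                 # per-second progress (latency and TPS)
        "-f", script,              # custom transaction script
        dbname,
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_pgbench()
```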

Slide 8

Slide 8 text

Using Metrics Together to Understand the Why

Slide 9

Slide 9 text

backend_flush_after

Slide 10

Slide 10 text

backend_flush_after = 1MB finishes faster
pgbench, 10 MB file COPY, 16 clients, 700 transactions, 20 GB shared buffers
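One way to set up the two runs being compared here is to flip the GUC between runs and reload; a sketch assuming psycopg2 and a superuser connection (the talk's actual harness may differ):

```python
# Hypothetical helper: change backend_flush_after between benchmark runs.
# backend_flush_after is reloadable, so no server restart is needed.
import psycopg2

def set_backend_flush_after(value):
    conn = psycopg2.connect("dbname=postgres")  # assumed superuser connection
    conn.autocommit = True
    with conn.cursor() as cur:
        # value like '1MB', or '0' to disable forced writeback
        cur.execute("ALTER SYSTEM SET backend_flush_after = %s", (value,))
        cur.execute("SELECT pg_reload_conf()")
    conn.close()

set_backend_flush_after("1MB")   # run the COPY benchmark...
# set_backend_flush_after("0")   # ...then repeat with writeback disabled
```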

Slide 11

Slide 11 text

More backend writebacks

Slide 12

Slide 12 text

Latency spikes without backend_flush_after as queue fills up

Slide 13

Slide 13 text

Kernel writing out dirty data
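The kernel-side view on this slide and the next comes from /proc/meminfo. A small sketch of watching the relevant fields (the field names are standard Linux; the one-second interval is arbitrary):

```python
# Watch dirty/writeback/free memory from /proc/meminfo, roughly the signal
# behind the "kernel writing out dirty data" and "free memory hits 0" plots.
import time

FIELDS = ("MemFree", "Dirty", "Writeback")

def read_meminfo():
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            name, rest = line.split(":", 1)
            if name in FIELDS:
                values[name] = int(rest.split()[0])  # value in kB
    return values

while True:
    print(time.strftime("%H:%M:%S"), read_meminfo())
    time.sleep(1)
```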

Slide 14

Slide 14 text

Initial TPS dip likely caused by memory pressure. Free memory hits 0

Slide 15

Slide 15 text

Second dip coincides with checkpoint

Slide 16

Slide 16 text

Using Metrics to Clarify Other Metrics

Slide 17

Slide 17 text

wal_compression

Slide 18

Slide 18 text

Fewer Transactions without wal_compression
pgbench, TPCB-like built-in, mode=prepared, data scale 4000, 16 clients, 600 seconds, 20 GB shared buffers

Slide 19

Slide 19 text

Higher latency and lower TPS without WAL compression

Slide 20

Slide 20 text

Fewer Full Page Writes without wal_compression because there are fewer transactions
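The WAL numbers on this and the next few slides (full-page images, WAL bytes, WAL syncs, full WAL buffers) map onto pg_stat_wal counters. A sketch of reading and resetting them around a run, assuming psycopg2 and Postgres 14+ (database name assumed):

```python
# Hypothetical before/after read of pg_stat_wal for one benchmark run.
import psycopg2

COLUMNS = "wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_sync"

conn = psycopg2.connect("dbname=bench")              # assumed connection
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("SELECT pg_stat_reset_shared('wal')")  # zero the counters
    # ... run the pgbench scenario here ...
    cur.execute(f"SELECT {COLUMNS} FROM pg_stat_wal")
    print(dict(zip(COLUMNS.split(", "), cur.fetchone())))
conn.close()
```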

Slide 21

Slide 21 text

Fewer writes/second without WAL compression

Slide 22

Slide 22 text

Higher write throughput without WAL compression even though there are fewer writes/second, so the individual writes are larger

Slide 23

Slide 23 text

WAL bytes higher without WAL compression, so the increased writes were WAL I/O

Slide 24

Slide 24 text

WAL syncs much higher without compression, so the additional flush requests are for WAL

Slide 25

Slide 25 text

Backends doing fewer writes and reads without compression. Bottlenecked on WAL I/O

Slide 26

Slide 26 text

Benchmark Setup For Correct Comparisons

Slide 27

Slide 27 text

initdb before every benchmark
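A minimal sketch of the per-run reset this section argues for, assuming a throwaway data directory and that initdb and pg_ctl are on PATH (paths and log file are placeholders):

```python
# Hypothetical per-benchmark reset: a fresh cluster for every run so that
# leftover WAL segments and caches from a previous run cannot skew results.
import shutil
import subprocess

PGDATA = "/tmp/bench_pgdata"   # throwaway data directory (placeholder)

def fresh_cluster():
    subprocess.run(["pg_ctl", "-D", PGDATA, "stop", "-m", "fast"],
                   check=False)          # ignore failure if not running
    shutil.rmtree(PGDATA, ignore_errors=True)
    subprocess.run(["initdb", "-D", PGDATA], check=True)
    subprocess.run(["pg_ctl", "-D", PGDATA, "-l", "/tmp/bench.log", "start"],
                   check=True)

fresh_cluster()   # ...then load data and run pgbench
```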

Slide 28

Slide 28 text

Without doing initdb first, 2000 COPY FROMs complete sooner
pgbench, 1 MB file COPY, 16 clients, 2000 transactions, 20 GB shared buffers

Slide 29

Slide 29 text

More flush requests issued after just having done initdb

Slide 30

Slide 30 text

Waiting for WAL Init Sync after having done initdb
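The WAL Init Sync waits here come from pg_stat_activity's wait_event columns. A sketch of sampling them over time, assuming psycopg2 (connection string and interval assumed):

```python
# Hypothetical wait-event sampler: counts what active backends are waiting
# on, which is how waits such as IO / WALInitSync show up.
import time
import psycopg2

SQL = """
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active' AND wait_event IS NOT NULL
GROUP BY 1, 2
ORDER BY 3 DESC
"""

conn = psycopg2.connect("dbname=bench")   # assumed connection
conn.autocommit = True
with conn.cursor() as cur:
    while True:
        cur.execute(SQL)
        print(time.strftime("%H:%M:%S"), cur.fetchall())
        time.sleep(1)
```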

Slide 31

Slide 31 text

TPS dip at 40 seconds corresponds with running out of system memory

Slide 32

Slide 32 text

With wal_segment_size increased to 1GB, COPY FROMs take much longer and TPS is very spiky after initdb

Slide 33

Slide 33 text

Fewer flush requests because each one takes longer and WAL file allocation takes longer with bigger WAL segment size

Slide 34

Slide 34 text

With reduced min_wal_size and a pause after loading data, performance without initdb is similar to performance with initdb

Slide 35

Slide 35 text

The number of flush requests is the same as with initdb

Slide 36

Slide 36 text

WAL Init Sync Waits with pause and decreased min_wal_size

Slide 37

Slide 37 text

Benchmark Configuration Choices

Slide 38

Slide 38 text

prepared vs simple

Slide 39

Slide 39 text

Higher TPS with prepared query mode vs simple

Slide 40

Slide 40 text

Additional CPU usage with simple query mode
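The per-process CPU difference can be seen with pidstat, one of the OS sources listed earlier. A sketch of running it alongside the benchmark (the -C filter matches processes whose command name contains "postgres"; client count, duration, and database name are placeholders):

```python
# Hypothetical CPU sampling alongside a pgbench run: pidstat reports per-PID
# user/system CPU every second for processes whose command name contains
# "postgres". Requires the sysstat package.
import subprocess

monitor = subprocess.Popen(["pidstat", "-u", "-C", "postgres", "1"])
subprocess.run(["pgbench", "-n", "-M", "simple", "-c", "16", "-j", "16",
                "-T", "600", "bench"], check=True)   # repeat with -M prepared
monitor.terminate()
```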

Slide 41

Slide 41 text

Benchmark Choice and Reflecting Customer Workloads

Slide 42

Slide 42 text

data access distribution

Slide 43

Slide 43 text

Gaussian data access distribution often performs better than uniform random access and is similar to real workloads
pgbench, TPCB-like built-in and custom, mode=prepared, synchronous_commit = off, data scale 4200, 16 clients, 500 seconds, 20 GB shared buffers
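A sketch of the kind of custom pgbench script that produces a Gaussian access pattern; the table, row count, shape parameter, and database name are placeholders rather than the talk's actual script. pgbench's random_gaussian() takes a shape parameter >= 2, where larger values concentrate accesses more tightly, and swapping it for random() gives the uniform variant being compared:

```python
# Hypothetical custom pgbench script with a Gaussian key distribution over
# pgbench_accounts, written out and run via pgbench -f.
import pathlib
import subprocess

SCRIPT = r"""
\set aid random_gaussian(1, :nrows, 5.0)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
"""

pathlib.Path("gaussian.sql").write_text(SCRIPT.lstrip())
subprocess.run([
    "pgbench", "-n", "-M", "prepared",
    "-c", "16", "-j", "16", "-T", "500",
    "-D", "nrows=420000000",     # 100,000 rows per unit of scale at scale 4200
    "-f", "gaussian.sql",
    "bench",                     # assumed database name
], check=True)
```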

Slide 44

Slide 44 text

Uniform random access does more reads and writes because working set doesn’t fit in memory

Slide 45

Slide 45 text

Usage count is low for random data access distribution

Slide 46

Slide 46 text

Backend cache hit ratio is worse for uniform random access

Slide 47

Slide 47 text

More evictions of shared buffers

Slide 48

Slide 48 text

Backends are doing more reads and writes

Slide 49

Slide 49 text

Determine when System Configurations Matter

Slide 50

Slide 50 text

readahead

Slide 51

Slide 51 text

read_ahead_kb
target readahead = sequential BW * latency
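A quick back-of-envelope instance of that formula, using the benchmark drive's ~5000 MB/s rated sequential read bandwidth from the setup slide and an assumed 1 ms request latency (the latency figure is an assumption; the result would be applied via /sys/block/&lt;device&gt;/queue/read_ahead_kb):

```python
# Bandwidth-delay style sizing for readahead: target = sequential BW * latency.
# 5000 MB/s is the drive's rated sequential read speed; 1 ms latency is an
# assumed figure (compare the dmsetup-delay scenario later in this section).
sequential_bw_kb_per_s = 5000 * 1024      # ~5000 MB/s in kB/s
latency_s = 0.001                         # 1 ms

target_read_ahead_kb = sequential_bw_kb_per_s * latency_s
print(f"target read_ahead_kb ~ {target_read_ahead_kb:.0f} kB")
# ~5120 kB; applied with e.g.: echo 5120 > /sys/block/nvme0n1/queue/read_ahead_kb
```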

Slide 52

Slide 52 text

Larger read_ahead_kb finishes slightly sooner
pgbench, SELECT * FROM large_table, 5 GB table, 1 client, 3 transactions, 8 GB shared buffers

Slide 53

Slide 53 text

Read request size is much larger

Slide 54

Slide 54 text

With 1ms of added latency via dmsetup delay, the run with read_ahead_kb = 2048 finishes in 30 seconds

Slide 55

Slide 55 text

Large request size and large read throughput

Slide 56

Slide 56 text

Questioning Your Assumptions

Slide 57

Slide 57 text

autovacuum_vacuum_cost_delay

Slide 58

Slide 58 text

TPS starts high and gradually goes down with autovacuum_vacuum_cost_delay > 0
pgbench, TPCB-like@1 + INSERT/DELETE@9, mode=prepared, data scale 4300, 32 clients, 600 seconds, 16 GB shared buffers

Slide 59

Slide 59 text

Latency increases proportionally

Slide 60

Slide 60 text

The percentage of time I/O requests are being issued is much lower with the higher cost delay

Slide 61

Slide 61 text

Autovacuum mostly waiting
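"Mostly waiting" is visible directly in pg_stat_activity: autovacuum workers sleeping in the cost-based delay show the Timeout / VacuumDelay wait event. A sketch of checking that, assuming psycopg2 and an assumed database name:

```python
# Hypothetical check of what autovacuum workers are waiting on; with a large
# cost delay most samples show wait_event = 'VacuumDelay'.
import psycopg2

conn = psycopg2.connect("dbname=bench")   # assumed connection
with conn.cursor() as cur:
    cur.execute("""
        SELECT pid, wait_event_type, wait_event, query
        FROM pg_stat_activity
        WHERE backend_type = 'autovacuum worker'
    """)
    for row in cur.fetchall():
        print(row)
conn.close()
```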

Slide 62

Slide 62 text

Spike in reads not from autovacuum

Slide 63

Slide 63 text

The size of the relations being thrashed is increasing and the backend cache hit ratio is plummeting

Slide 64

Slide 64 text

System CPU usage is increasing. Potentially caused by swapping

Slide 65

Slide 65 text

Comparing only autovacuum_vacuum_cost_delay 2ms (default) vs 0

Slide 66

Slide 66 text

Relation size relatively constant for delay = 0

Slide 67

Slide 67 text

More autovacuum cache hits and fewer reads with cost delay 0

Slide 68

Slide 68 text

More shared buffer evictions by autovacuum with default cost delay

Slide 69

Slide 69 text

Autovacuum is cleaning buffers and putting them on the freelist, so there are more unused buffers

Slide 70

Slide 70 text

No backend flushes required because there are clean buffers

Slide 71

Slide 71 text

Finding the Real Root Cause

Slide 72

Slide 72 text

wal_buffers

Slide 73

Slide 73 text

COPY FROMs with larger wal_buffers finish faster
pgbench, 20 MB file COPY FROM, 16 clients, 100 transactions, 10 GB shared buffers
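Unlike the reloadable settings shown earlier, wal_buffers only takes effect at server start, so this comparison needs a restart between runs; a sketch, with the data directory path and the example value as assumptions:

```python
# Hypothetical wal_buffers change between runs. wal_buffers is a
# postmaster-level setting, so a restart is required for it to take effect.
import subprocess
import psycopg2

PGDATA = "/tmp/bench_pgdata"   # placeholder data directory

conn = psycopg2.connect("dbname=postgres")   # assumed superuser connection
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("ALTER SYSTEM SET wal_buffers = '128MB'")  # example larger value
conn.close()

subprocess.run(["pg_ctl", "-D", PGDATA, "restart", "-m", "fast"], check=True)
```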

Slide 74

Slide 74 text

wal_buffers are full less often

Slide 75

Slide 75 text

Smaller wal_buffers end up causing contention on the WALInsert lock, meaning backends are waiting much more often

Slide 76

Slide 76 text

Smaller wal_buffers causes those runs to do less I/O overall

Slide 77

Slide 77 text

Smaller wal_buffers fill up and then cause waiting for WAL Sync

Slide 78

Slide 78 text

Much higher throughput with larger wal_buffers, but how can the dips be explained?

Slide 79

Slide 79 text

At 20 seconds, more and smaller writes start being issued

Slide 80

Slide 80 text

Fewer write merges and more requests in the queue

Slide 81

Slide 81 text

Dirty data has built up, then it starts being flushed by the kernel before the second slowdown

Slide 82

Slide 82 text

Shared buffers fills up around 20 seconds, faster with larger wal_buffers

Slide 83

Slide 83 text

System memory fills up at 40 seconds explaining the second dip

Slide 84

Slide 84 text

Needed pages are being swapped out and have to be read back in

Slide 85

Slide 85 text

The COPY FROM workload is impacted by wal_buffers, but a transactional workload would not be

Slide 86

Slide 86 text

Benchmarking as a Developer
• Not just configuring databases but identifying bottlenecks that can be addressed with code
• Understanding system interactions when designing new features and performance enhancements
• Designing scenarios that put the right things under test