Slide 1


Optimizing Postgres for write-heavy workloads ft. checkpointer & WAL configs
Samay Sharma, Manager, PostgreSQL Open Source team @ Microsoft

Slide 2


About Me
• Manager in the Open Source PG team at Microsoft
• Working with PostgreSQL for > 10 years
• Previously a solutions & customer engineer at Microsoft and Citus
• Want to make Postgres performance and tuning easier to understand and do

Slide 3


Optimize Postgres, but before that…
• Ensure low latency between app and database
• Deal with connections effectively
• Make sure you have the right hardware
[Diagram: client ↔ PostgreSQL server]

Slide 4


Writing performant applications for PG
• Manage indexes carefully
• Parallelize your loading job with concurrent connections
• Use multi-value INSERTs or COPY to speed up throughput (see the sketch below)
• Consider partitioning to reduce index sizes
• Use unlogged tables for unimportant data
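A minimal sketch of the loading options above; the events table, its columns, and the file path are hypothetical and only for illustration:

-- Hypothetical table used only for illustration
CREATE TABLE events (id bigint, payload text);

-- Multi-value INSERT: many rows per statement and per round trip
INSERT INTO events (id, payload) VALUES (1, 'a'), (2, 'b'), (3, 'c');

-- COPY is usually faster still for bulk loads (the path is a placeholder)
COPY events (id, payload) FROM '/tmp/events.csv' WITH (FORMAT csv);

-- Unlogged table: skips WAL, but its data is lost after a crash
CREATE UNLOGGED TABLE events_staging (LIKE events);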

Slide 5


Moving on to the internals of Postgres and how to tune them

Slide 6


Journey of a Postgres write
• When a write comes in, it is first written to WAL for durability reasons.
• Then the change is made in shared buffers (not in the data file yet!).
• At some point, those pages are written to disk by the background writer or checkpointer.
[Diagram: INSERT INTO … → WAL (disk) → shared buffers → data files (disk), flushed by the BG writer / checkpointer]

Slide 7


Getting rid of WAL - Checkpoints
• After a crash, recovery requires you to replay WAL.
• To make recovery faster and avoid retaining TBs of WAL, Postgres has checkpoints.
• A checkpoint guarantees that all dirty buffers up to a point have been written to disk.
• Recovery can start from the last checkpoint.
• WAL before that point can be recycled.

Slide 8


For healthy long-term operation
• Autovacuum = well tuned
• Checkpointer, WAL parameters = well tuned
• Bgwriter = well tuned

Slide 9


Autovacuum
• PostgreSQL’s way of removing unneeded row versions
• Checks if tables have been significantly modified & if yes, runs VACUUM and ANALYZE
• Defaults are generally too low for large workloads
[Diagram: a heap page where Data1 was deleted and Data3 was updated to Data4; after VACUUM only the live rows Data2 and Data4 remain]

Slide 10


Common VACUUM problems and solutions (example settings below)
• VACUUM isn’t triggered often enough for large tables
  • Reduce autovacuum_vacuum_scale_factor, autovacuum_vacuum_insert_scale_factor
• VACUUM running slow
  • Adjust / disable cost limits
  • Increase maintenance_work_mem / autovacuum_work_mem
  • Adjust max_parallel_maintenance_workers for parallel index vacuuming
• Dead rows not reducing
  • Long-running queries
  • Unused replication slots
  • Uncommitted prepared transactions
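A sketch of the parameter changes above; the table name big_table and all values are illustrative, not recommendations:

-- Per-table override so a large table is vacuumed after ~1% of rows change
-- instead of the 20% default
ALTER TABLE big_table SET (
  autovacuum_vacuum_scale_factor = 0.01,
  autovacuum_vacuum_insert_scale_factor = 0.01
);

-- Let VACUUM do more work per cycle and give it more memory
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
ALTER SYSTEM SET autovacuum_work_mem = '1GB';
SELECT pg_reload_conf();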

Slide 11


Detailed autovacuum talk last year! aka.ms/cituscon-samaysharma-2022

Slide 12


Tuning Checkpointer

Slide 13


Healthy checkpointer operation
• Checkpoints are frequent enough to
  • Ensure a reasonable recovery time
  • Keep disk utilization from WAL accumulation low
• Checkpoints are spaced out enough to
  • Ensure that the I/O impact on the user workload is low
  • Generate less WAL and avoid flushing the same pages repeatedly

Slide 14


How checkpoints are triggered
• Manual actions
  • Running the CHECKPOINT command
  • Commands which run CHECKPOINT for you, e.g. pg_start_backup, pg_ctl stop, etc.
• When a particular time passes since the last checkpoint (checkpoint_timeout)
• When you hit the maximum limit of how much WAL you can accumulate (max_wal_size)

Slide 15


Best practices for tuning checkpoint triggering
• If checkpoints are triggered too often, you will get this message in your log:
  LOG: checkpoints are occurring too frequently (9 seconds apart)
  HINT: Consider increasing the configuration parameter "max_wal_size"
• Checkpoints should ideally be triggered by checkpoint_timeout
• max_wal_size should be set such that it is rarely reached (example below)
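For example (the max_wal_size value here is illustrative):

-- Log every checkpoint so you can see what triggered it
ALTER SYSTEM SET log_checkpoints = on;

-- Raise max_wal_size so that timed checkpoints dominate
ALTER SYSTEM SET max_wal_size = '8GB';
SELECT pg_reload_conf();

-- Healthy logs then mostly show "checkpoint starting: time"
-- rather than "checkpoint starting: wal"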

Slide 16


Setting checkpoint_timeout
• checkpoint_timeout depends on your RTO (recovery time objective)
• Hard to predict a specific recovery time
• 5 min (default) is generally too small – 15–30 min are common (example below)
• Low values may lead to write amplification due to full_page_writes (more later)
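For example, assuming a 30-minute interval fits your RTO (the value is illustrative, not a recommendation):

-- Check the current value, then raise it
SHOW checkpoint_timeout;
ALTER SYSTEM SET checkpoint_timeout = '30min';
SELECT pg_reload_conf();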

Slide 17


Setting max_wal_size
• Measure the WAL generated over a checkpoint_timeout period (one way is sketched below)
• Can use pg_stat_bgwriter or the output of log_checkpoints to determine this
• Set it to 2–3x the WAL you expect to generate
• For small workloads: bound it by the size of the disk holding WAL – note this is a soft limit
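One way to measure WAL volume over an interval, assuming you can sample the server twice; the LSN literals below are placeholders for your two samples:

-- Sample the current WAL insert position now, and again after one
-- checkpoint_timeout interval
SELECT pg_current_wal_lsn();

-- Then diff the two samples (later LSN first) to get the WAL generated in between
SELECT pg_size_pretty(
         pg_wal_lsn_diff('1/6E000000', '1/28000000'));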

Slide 18


Tuning the checkpointing rate
• Checkpointing can lead to huge I/O spikes if done all at once
• Spread checkpoints out by setting checkpoint_completion_target
• 0.9 is generally a good value (PG will aim to complete the checkpoint within 90% of the checkpoint interval; example below)
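For instance (0.9 is also the default from PostgreSQL 14 onward):

-- Spread each checkpoint's writes over ~90% of the checkpoint interval
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
SELECT pg_reload_conf();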

Slide 19


Impact of full_page_writes
• Setting full_page_writes = on prevents corruption when the OS crashes
• Writes the content of the entire page to WAL on the first modification after a checkpoint
• Greatly magnifies WAL volume with frequent checkpoints (see the query below)
[Diagram: Postgres writes 8 kB pages while the OS writes smaller (e.g. 4 kB) blocks, so a crash can leave a torn page]
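On PostgreSQL 14 and later you can see how much of your WAL consists of full-page images via pg_stat_wal; a quick check might look like this:

SELECT wal_records,
       wal_fpi,                              -- full-page images written after checkpoints
       pg_size_pretty(wal_bytes) AS wal_size -- total WAL generated since stats reset
FROM pg_stat_wal;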

Slide 20


Other WAL tuning parameters (illustrative settings below)
• wal_compression
  • Decreases WAL volume by compressing it
  • Useful to enable to reduce WAL I/O
• wal_buffers
  • Amount of shared memory used for WAL
  • Beneficial to increase (up to 128MB) if you have bulk writes or many concurrent writes
• min_wal_size
  • Recycle WAL files up to this size instead of removing them
  • Helpful to avoid new WAL file creation during spikes. Set based on log_checkpoints output.
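A sketch of these settings; the values are examples, not recommendations:

ALTER SYSTEM SET wal_compression = on;   -- trade some CPU for less WAL volume
ALTER SYSTEM SET min_wal_size = '2GB';   -- recycle WAL files instead of recreating them
SELECT pg_reload_conf();

-- wal_buffers is fixed at server start, so it needs a restart to take effect
ALTER SYSTEM SET wal_buffers = '64MB';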

Slide 21


Other WAL tuning parameters – contd. (sketch below)
• backend_flush_after
  • Attempts to force the OS to issue writes once this amount of data has been written by a backend
  • If you see I/O spikes and have a lot of concurrent writes, this is worth considering. See the caveats in the docs.
• wal_segment_size
  • Sets the size of individual WAL files
  • When you have high WAL volume, the number of WAL files per directory can become an issue. Increasing this helps.
• A bunch of other parameters in the docs – not as commonly used.
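A sketch for these two; note that backend_flush_after can be changed online, while wal_segment_size is fixed when the cluster is initialized (values are illustrative):

-- Ask the OS to start writeback after ~512kB of writes from one backend
ALTER SYSTEM SET backend_flush_after = '512kB';
SELECT pg_reload_conf();

-- wal_segment_size is chosen at initdb time, e.g. 64MB segments:
--   initdb --wal-segsize=64 ...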

Slide 22


Tuning Background Writer

Slide 23


My reads can do writes?
• Any pages needed for reads go into shared_buffers
• When there isn't space to put a buffer into shared_buffers, a page needs to be evicted
• If that page is "dirty", then it must be written to disk synchronously

  Bitmap Heap Scan on t0 (cost=38054.04..2162680.07 rows=2867546 width=36) (actual time=1310.569..2181906.171 rows=2859373 loops=1)
    Recheck Cond: ((col1 = 1) AND (col2 IS NULL))
    Rows Removed by Index Recheck: 3617136
    Heap Blocks: exact=38631 lossy=859715
    Buffers: shared hit=113729 read=792973 written=24959
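A plan like the one above can be captured with EXPLAIN's BUFFERS option; the query mirrors the slide's example (table t0) and is otherwise illustrative:

-- "written=..." in the Buffers line means this read query had to flush dirty pages
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM t0 WHERE col1 = 1 AND col2 IS NULL;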

Slide 24


In comes bgwriter
• Bgwriter issues writes of "dirty" shared buffers
• When clean shared buffers are insufficient, bgwriter writes dirty buffers and marks them as clean
• Increases total I/O
• Reduces the likelihood of user queries having to write buffers themselves

Slide 25


Bgwriter tuning parameters (illustrative settings below)
• bgwriter_delay: time interval at which bgwriter wakes up to write dirty buffers
• bgwriter_lru_multiplier: chooses how many buffers should be written per cycle
  • Number of buffers is average recent need * bgwriter_lru_multiplier
  • Default is 2 (giving some cushion for spikes)
  • Values < 1 mean some writes will be left to the server processes themselves
• bgwriter_lru_maxpages: max limit on how many pages bgwriter can write in each round
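Illustrative settings that make the background writer more aggressive (defaults shown in comments; the values are examples, not recommendations):

ALTER SYSTEM SET bgwriter_delay = '100ms';       -- default 200ms
ALTER SYSTEM SET bgwriter_lru_multiplier = 4.0;  -- default 2.0
ALTER SYSTEM SET bgwriter_lru_maxpages = 400;    -- default 100
SELECT pg_reload_conf();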

Slide 26


Concepts for bgwriter tuning
• Tradeoff between increased total I/O vs I/O done by user queries (see the query below)
• Goal: you see writes by backends and want lower latency
  • Solution: make bgwriter more aggressive
  • How: decrease bgwriter_delay and/or increase bgwriter_lru_multiplier
• Goal: less total I/O
  • Solution: make bgwriter less aggressive
  • How: increase bgwriter_delay and decrease bgwriter_lru_multiplier
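To see who is actually writing dirty buffers, pg_stat_bgwriter is the place to look (columns as of PostgreSQL 16; newer releases move some of these into pg_stat_checkpointer and pg_stat_io):

SELECT buffers_checkpoint,  -- written by the checkpointer
       buffers_clean,       -- written by the background writer
       maxwritten_clean,    -- times bgwriter stopped because it hit bgwriter_lru_maxpages
       buffers_backend      -- written by the backends themselves
FROM pg_stat_bgwriter;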

Slide 27


There’s more
• shared_buffers tuning – larger for write-heavy workloads
• OS parameters – data_writeback=on, vm.dirty_background_bytes, etc.
• Group commit settings
• And others…

Slide 28


Tuning write workloads is tricky
• Defaults are not good for large workloads
• EXPLAIN doesn't tell you as much
• Requires tuning background processes rather than individual statements
• Problems are more long-term

Slide 29


Optimize Postgres for write workloads
• Setup: make sure you are set up well and follow application best practices
• Monitor: understand the bottleneck by monitoring your database & logs
• Tune: tune Postgres processes like autovacuum, the checkpointer, and the background writer to improve performance
• Adapt: keep adapting based on changing workload demands

Slide 30


© Copyright Microsoft Corporation. All rights reserved.
Thank you (danke schön, dank u, merci, धन्यवाद, teşekkürler, grazie, gracias, tack)
Samay Sharma
@samay_sharma
[email protected]
aka.ms/open-source-discord, #cituscon channel