Slide 1

Slide 1 text

Thomas Munro, PostgreSQL hacker @ Microsoft
PGCon 2022, which should have been in Ottawa
/sləˈruː/
WIP: Improving PostgreSQL’s mysterious SLRU subsystem

Slide 2

Slide 2 text

• Short version
• Transam and CLOG history
• Let a thousand SLRUs bloom
• Ways the buffer pool is better
• Reunification
• Prototype

Slide 3

Slide 3 text

• The SLRUs (“Simple Least Recently Used”) provide storage management and buffered access for 7 different types of critical transaction-related data
• I would like to keep the file management part of the SLRU system unchanged, in this project
• I would like to replace the separate fixed-sized caches with ordinary buffers in the main buffer pool
• Status: work in progress!

Slide 4

Slide 4 text

Today: partitioned caches

Shared memory usage (information from the pg_shmem_allocations view), with shared_buffers=1GB:

┌─────────────────┬────────────────┐
│ name            │ pg_size_pretty │
├─────────────────┼────────────────┤
│ MultiXactOffset │ 65 kB          │
│ Notify          │ 65 kB          │
│ MultiXactMember │ 130 kB         │
│ Serial          │ 130 kB         │
│ Subtrans        │ 261 kB         │
│ Xact            │ 2067 kB        │
│ CommitTs        │ 2085 kB        │
│ Buffer Blocks   │ 1024 MB        │
└─────────────────┴────────────────┘

Corresponding SLRU files on disk: pg_xact/0000, 0001; pg_subtrans/0006, 0007, 0008, …; alongside the ordinary relation files under base/ (112, 113, 1247, 1249, 1255, 1259, …).

Slide 5

Slide 5 text

Concept: unified buffer pool

All cached data has to compete for space according to the replacement policy. No change to files on disk.

┌─────────────────┬────────────────┐
│ name            │ pg_size_pretty │
├─────────────────┼────────────────┤
│ Buffer Blocks   │ 1024 MB        │
└─────────────────┴────────────────┘

The SLRU files on disk (pg_xact, pg_subtrans, …) and the relation files under base/ are unchanged.

Slide 6

Slide 6 text

Sizing isn’t the only problem

• We already know that some specific workloads produce a lot of SLRU cache misses and can be improved by resizing. So far we have resisted the urge to add 7 hard-to-tune settings for manual size control
• The main buffer pool also has better scalability, asynchronous write-back, checksums, …, and plans for more features like encryption, smart block storage and many more potential innovations

Slide 7

Slide 7 text

Can we unify buffers without slowing anything down?
Can we speed anything up?
Can we future-proof core transactional data?

Slide 8

Slide 8 text

• Short version
• Transam and CLOG history
• Let a thousand SLRUs bloom
• Ways the buffer pool is better
• Reunification
• Prototype

Slide 9

Slide 9 text

• University POSTGRES had no traditional transaction log (WAL)
• Instead it had a minimalist log to record which transactions committed and aborted, with two bits per transaction
• Most other systems use locking and undo-on-rollback to make uncommitted data invisible, or use undo-based MVCC

Slide 10

Slide 10 text

• Before 7.2, commit/abort flags were accessed via the main buffer pool using a special pseudo-relation called “pg_log”. Problems:
  • It kept growing and was kept forever on disk
  • When 32 bit transaction IDs ran out you had to dump and restore
• In 7.2, it was kicked out into its own mini buffer pool to support changes:
  • Data was cut into 256K files so that old files could be unlinked
  • The xid was allowed to wrap around
  • Renamed to pg_clog (and later in 10, renamed again to pg_xact)

Slide 11

Slide 11 text

• Short version
• Transam and CLOG history
• Let a thousand SLRUs bloom
• Ways the buffer pool is better
• Reunification
• Prototype

Slide 12

Slide 12 text

CLOG was generalised to support re-use

• Subtransactions arrived in 8.0 to implement SQL SAVEPOINT, and needed a spillable place to store the parent for each transaction ID
• Multixacts arrived in 8.1 to implement shared locks, so that foreign key checks didn’t block other sessions. This required storage for the sets of transactions that hold a FOR SHARE lock on a tuple, and uses two SLRUs
• NOTIFY (since its reimplementation) and SERIALIZABLE also added SLRUs
• Special mention for commit timestamps, which are much like “pg_time” from university POSTGRES; like pg_log, that was buffered as a relation

Slide 13

Slide 13 text

• Short version
• Transam and CLOG history
• Let a thousand SLRUs bloom
• Ways the buffer pool is better
• Reunification
• Prototype

Slide 14

Slide 14 text

Lookup: SLRU vs buffer pool

• SLRUs use linear search to find pages. This made sense for small numbers of buffers on small machines, but begins to burn too much CPU above hundreds of buffers.
• One big LWLock for lookup and access.
• The buffer pool has a partitioned hash table for fast, scalable buffer lookup.
• Buffers have pins, various atomic flags and content locks. (Which of these do we need?)
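
To make the contrast concrete, here is a minimal, self-contained sketch of the kind of linear slot search the SLRU code performs under a single lock. The names (SlruCacheSketch, slru_find_slot) are illustrative only, not PostgreSQL’s actual slru.c identifiers.

/* A minimal sketch of SLRU-style page lookup: a linear scan over all
 * buffer slots while holding one big lock.  Names and structure are
 * illustrative only; PostgreSQL's slru.c differs in detail. */
#include <stdint.h>
#include <stdio.h>

#define SLRU_NUM_SLOTS 128
#define INVALID_PAGE   (-1)

typedef struct SlruCacheSketch
{
    int64_t page_number[SLRU_NUM_SLOTS]; /* which page each slot holds */
    /* ... the one big lock and the 8kB pages themselves would live here ... */
} SlruCacheSketch;

/* Return the slot holding 'pageno', or -1 on a cache miss.  O(nslots). */
static int
slru_find_slot(const SlruCacheSketch *cache, int64_t pageno)
{
    for (int slot = 0; slot < SLRU_NUM_SLOTS; slot++)
        if (cache->page_number[slot] == pageno)
            return slot;
    return -1;
}

int
main(void)
{
    SlruCacheSketch cache;

    for (int slot = 0; slot < SLRU_NUM_SLOTS; slot++)
        cache.page_number[slot] = INVALID_PAGE;
    cache.page_number[42] = 1234;   /* pretend page 1234 is cached in slot 42 */

    printf("page 1234 -> slot %d\n", slru_find_slot(&cache, 1234));
    printf("page 9999 -> slot %d\n", slru_find_slot(&cache, 9999));
    return 0;
}

The cost of this scan is proportional to the number of slots, which is why it stops scaling once an SLRU cache grows to hundreds of buffers.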

Slide 15

Slide 15 text

Replacement: SLRU vs buffer pool

• SLRUs use an approximation of LRU to decide which buffer to replace. Replacement has to perform a linear scan to find the least recently used buffer, while locking the whole SLRU.
• The buffer pool uses a generalisation of the CLOCK algorithm from traditional Unix (originally Multics, 1969) buffer caches. It tries to account for recency and frequency of use.
• It is not the state of the art, and does degrade to scanning linearly for replaceable buffers in the worst case, but it is likely to be improved over time. (CAR?)
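
As an illustration of the clock-sweep idea, here is a simplified stand-in for PostgreSQL’s buffer replacement behaviour, with invented names and without pins, locking or dirty-buffer handling.

/* Simplified clock-sweep victim selection, in the spirit of PostgreSQL's
 * buffer replacement policy: each buffer has a small usage count that the
 * sweeping hand decrements; a buffer found with usage count zero becomes
 * the victim.  Pinning, locking and dirty-buffer handling are omitted. */
#include <stdio.h>

#define NBUFFERS  8
#define MAX_USAGE 5

static int usage_count[NBUFFERS];
static int clock_hand;

/* Called when a buffer is accessed: bump its usage count (bounded). */
static void
buffer_accessed(int buf)
{
    if (usage_count[buf] < MAX_USAGE)
        usage_count[buf]++;
}

/* Sweep until we find a buffer with usage count zero, decrementing as we go,
 * so recently or frequently used buffers survive more laps. */
static int
clock_sweep_victim(void)
{
    for (;;)
    {
        int buf = clock_hand;

        clock_hand = (clock_hand + 1) % NBUFFERS;
        if (usage_count[buf] == 0)
            return buf;
        usage_count[buf]--;
    }
}

int
main(void)
{
    for (int buf = 0; buf < NBUFFERS; buf++)
        usage_count[buf] = 1;           /* everyone starts mildly "used" */
    buffer_accessed(2);
    buffer_accessed(5);

    /* Buffer 0 reaches usage count zero first and is chosen as the victim. */
    printf("victim: %d\n", clock_sweep_victim());
    return 0;
}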

Slide 16

Slide 16 text

Write-back: SLRU vs buffer pool

• If a dirty SLRU buffer must be evicted to make room to read in a new buffer, it must be done synchronously.
• Before 13, we used to fsync synchronously too when that happened. Handing that part off was a stepping stone for this work (collaboration with Shawn Debnath).
• The main buffer pool has a “background writer” process that tries to make sure that clean buffers are always available.
• sync_file_range() is used to start write-back before fsync(), for extra I/O concurrency
• The background writer may not always succeed in that goal, but improvements are possible.
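
A minimal, Linux-only illustration of the sync_file_range() pattern mentioned above: start kernel write-back for a range early, then fsync() later. The file name and sizes are arbitrary.

/* Start asynchronous kernel write-back for a file range, then fsync later.
 * Linux-specific (sync_file_range); error handling kept minimal. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
    const char buf[8192] = "pretend this is a dirty 8kB page";
    int fd = open("demo_segment", O_CREAT | O_RDWR, 0600);

    if (fd < 0)
    {
        perror("open");
        return EXIT_FAILURE;
    }

    /* Write a "page", as a checkpointer or background writer might. */
    if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t) sizeof(buf))
    {
        perror("pwrite");
        return EXIT_FAILURE;
    }

    /* Ask the kernel to begin writing these blocks out now, without
     * blocking for completion ... */
    if (sync_file_range(fd, 0, sizeof(buf), SYNC_FILE_RANGE_WRITE) != 0)
        perror("sync_file_range");

    /* ... so that the eventual durability point has less work to do. */
    if (fsync(fd) != 0)
        perror("fsync");

    close(fd);
    unlink("demo_segment");
    return EXIT_SUCCESS;
}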

Slide 17

Slide 17 text

Corruption detection: SLRU vs buffer pool

• If SLRU files are corrupted outside PostgreSQL, there are currently no checksums to detect that
• The main buffer pool supports optional per-page checksums that are computed at write time and verified at read time

Note: current prototype is not attempting to implement this part for SLRU data (more on that soon). One step at a time…

Slide 18

Slide 18 text

Future buffer pool innovations

• Open source projects such as Neon (“open source alternative to AWS Aurora Postgres”, see nearby talk) are working on distributed WAL-aware block storage systems that integrate at the buffer pool level. It must eventually help if all relevant data for a cluster is managed through one buffer pool with standardised LSN placement…?
• The AIO proposal for PostgreSQL adds optional direct I/O, asynchronous I/O, I/O merging, scatter/gather, and is integrated with the buffer pool.
• In-progress work to add TDE (transparent data encryption) probably integrates with the buffer pool a bit like checksums.
• A better replacement policy?

This slide is pure conjecture!

Slide 19

Slide 19 text

• Short version
• Transam and CLOG history
• Let a thousand SLRUs bloom
• Ways the buffer pool is better
• Reunification
• Prototype

Slide 20

Slide 20 text

Buffer tags
Hash table keys for finding buffers: (spc, db, rel, fork, block)

• The current prototype introduces a special database OID to represent SLRU data, and then the relation ID selects the particular SLRU
• Other schemes are possible! People have suggested stealing fork number bits or special table space IDs.
• For example, a CLOG page would be tagged like this (see the sketch below):
    spc   = DEFAULTTABLESPACE_OID
    db    = SLRU_DB_ID
    rel   = SLRU_CLOG_REL_ID
    fork  = MAIN_FORKNUM
    block = ?
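
A minimal sketch of how such a tag might be assembled. The struct below is a simplified stand-in for PostgreSQL’s BufferTag/RelFileNode, and SLRU_DB_ID / SLRU_CLOG_REL_ID are placeholder values taken from the slide (the later pg_buffercache query filters on reldatabase = 9), not committed constants.

/* Sketch: building a buffer-mapping key for an SLRU (CLOG) page using a
 * reserved "database" OID.  Simplified stand-ins for PostgreSQL's
 * BufferTag/RelFileNode; SLRU_DB_ID and SLRU_CLOG_REL_ID are placeholder
 * values, not committed constants. */
#include <stdint.h>
#include <stdio.h>

typedef uint32_t Oid;

#define DEFAULTTABLESPACE_OID 1663   /* pg_default, as in PostgreSQL */
#define SLRU_DB_ID            9      /* hypothetical reserved database OID */
#define SLRU_CLOG_REL_ID      0      /* hypothetical: which SLRU this is */
#define MAIN_FORKNUM          0

typedef struct BufferTagSketch
{
    Oid      spc;       /* tablespace */
    Oid      db;        /* database, or SLRU_DB_ID for SLRU data */
    Oid      rel;       /* relation, or which SLRU */
    int      fork;      /* fork number */
    uint32_t block;     /* block number within the "relation" */
} BufferTagSketch;

/* Each 8kB CLOG page holds 8192 * 4 = 32768 transactions (2 bits each). */
#define CLOG_XACTS_PER_PAGE 32768

static BufferTagSketch
clog_buffer_tag(uint64_t xid)
{
    BufferTagSketch tag = {
        .spc = DEFAULTTABLESPACE_OID,
        .db = SLRU_DB_ID,
        .rel = SLRU_CLOG_REL_ID,
        .fork = MAIN_FORKNUM,
        .block = (uint32_t) (xid / CLOG_XACTS_PER_PAGE),
    };

    return tag;
}

int
main(void)
{
    BufferTagSketch tag = clog_buffer_tag(1000000);

    printf("xid 1000000 -> block %u of SLRU rel %u in db %u\n",
           tag.block, tag.rel, tag.db);
    return 0;
}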

Slide 21

Slide 21 text

Rough edge: raw pages

• For now, I keep file and page layout unchanged, that is, without any header:
  • No checksums: for those to be powerloss crash safe, we’d need to implement full page writes and register all modified pages in the WAL. Out of scope.
  • No LSN: file format change vs pg_upgrade. Out of scope.
• External LSNs: we still need a place to track the last LSN to modify each page, so we need a new array to hold them (may be able to chisel some space out of padding in BufferDescriptor); see the sketch below
• These could all be fixed, removing the need to handle raw pages
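
A minimal sketch of the “external LSN” idea: a per-buffer array, outside the page itself, recording the latest WAL LSN to touch each raw page, consulted at write-back time to preserve WAL-before-data. Names are illustrative; a real implementation would live in shared memory and use atomics.

/* Sketch: tracking page LSNs outside the page, since raw SLRU pages have no
 * header to store one.  One slot per shared buffer; a real implementation
 * would live in shared memory (or in BufferDescriptor padding) and would
 * need atomics.  All names here are illustrative. */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

#define NBUFFERS 16

/* Highest WAL LSN known to have modified the page held in each buffer. */
static XLogRecPtr external_page_lsn[NBUFFERS];

/* Record that 'buf' was modified by a WAL record ending at 'lsn'. */
static void
remember_page_lsn(int buf, XLogRecPtr lsn)
{
    if (lsn > external_page_lsn[buf])
        external_page_lsn[buf] = lsn;
}

/* At write-back time: the WAL must be flushed at least this far first,
 * to preserve the usual "WAL before data" rule. */
static XLogRecPtr
required_wal_flush_lsn(int buf)
{
    return external_page_lsn[buf];
}

int
main(void)
{
    remember_page_lsn(3, 0x10000);
    remember_page_lsn(3, 0x0F000);   /* older record: must not move backwards */
    printf("flush WAL to %#lx before writing buffer 3\n",
           (unsigned long) required_wal_flush_lsn(3));
    return 0;
}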

Slide 22

Slide 22 text

Storage manager

• The buffer manager interacts with files by using smgrread() etc.
• smgr.c historically had a place to dispatch to different implementations through a function table, though only md.c remains
• This provides an easy way to teach smgr.c to recognise the special database at open time and select the function table for slru.c (see the sketch below)

Diagram: bufmgr.c calls smgr.c, which dispatches either to md.c (relation files under base/, e.g. 112, 113, 1247, …) or to slru.c (pg_xact/0000, 0001, …)
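
A sketch of function-table dispatch chosen at open time based on the database OID in the buffer tag. The table layout and names below are simplified stand-ins; PostgreSQL’s real f_smgr table has more callbacks and different signatures, and SLRU_DB_ID is a hypothetical constant.

/* Sketch: choosing a storage-manager implementation at "open" time based on
 * the database OID in the buffer tag.  The callback table is a simplified
 * stand-in for smgr.c's dispatch; names and signatures are illustrative. */
#include <stdint.h>
#include <stdio.h>

typedef uint32_t Oid;

#define SLRU_DB_ID 9    /* hypothetical reserved database OID for SLRU data */

typedef struct SmgrOpsSketch
{
    const char *name;
    void      (*read_block)(Oid rel, uint32_t block, void *buffer);
    void      (*write_block)(Oid rel, uint32_t block, const void *buffer);
} SmgrOpsSketch;

/* "md": ordinary relation segment files under base/<db>/<relfilenode> */
static void md_read(Oid rel, uint32_t block, void *buffer)
{
    (void) buffer;
    printf("md: read rel %u block %u\n", rel, block);
}
static void md_write(Oid rel, uint32_t block, const void *buffer)
{
    (void) buffer;
    printf("md: write rel %u block %u\n", rel, block);
}

/* "slru": fixed-name directories such as pg_xact, pg_subtrans, ... */
static void slru_read(Oid rel, uint32_t block, void *buffer)
{
    (void) buffer;
    printf("slru: read SLRU %u block %u\n", rel, block);
}
static void slru_write(Oid rel, uint32_t block, const void *buffer)
{
    (void) buffer;
    printf("slru: write SLRU %u block %u\n", rel, block);
}

static const SmgrOpsSketch md_ops = {"md", md_read, md_write};
static const SmgrOpsSketch slru_ops = {"slru", slru_read, slru_write};

/* The one place that knows about the special database OID. */
static const SmgrOpsSketch *
smgr_open(Oid db)
{
    return (db == SLRU_DB_ID) ? &slru_ops : &md_ops;
}

int
main(void)
{
    char page[8192];

    smgr_open(1)->read_block(1259, 0, page);           /* ordinary relation */
    smgr_open(SLRU_DB_ID)->read_block(0, 42, page);    /* CLOG via slru.c */
    return 0;
}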

Slide 23

Slide 23 text

Lookup, pinning and locking overheads

• The existing SLRU system acquires a single LWLock to perform a linear search for a page number and then access the data, which is pretty fast if the cache is small and it’s a cache hit
• A naive implementation using the buffer pool would build a buffer tag, compute a hash, acquire/release an LWLock for a partition of the lookup table to find a buffer, pin the buffer, and then share lock the page while reading the bits it needs. Ouch.
• Current theory: we can avoid the buffer mapping table in common cases with a small cache of recent buffers and BufferReadRecent() (see the sketch below)
• Current theory: in many interesting cases we only need a pin to read
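
A rough sketch of the “small cache of recent buffers” idea: each backend remembers which buffer recently held a given SLRU page and revalidates that guess before falling back to the full mapping-table lookup. All names are hypothetical; the revalidation step stands in for a BufferReadRecent()-style recheck-and-pin.

/* Sketch: a tiny per-backend cache mapping SLRU page numbers to the buffer
 * that recently held them, so the common path can skip the shared
 * buffer-mapping hash table.  All names are hypothetical placeholders. */
#include <stdint.h>
#include <stdio.h>

#define RECENT_SLOTS 4

typedef struct RecentBufferSketch
{
    int64_t pageno;     /* which SLRU page we think is cached (-1 = none) */
    int     buffer;     /* which shared buffer we think holds it */
} RecentBufferSketch;

static RecentBufferSketch recent[RECENT_SLOTS];

/* Stand-ins: the expensive shared lookup, and the cheap revalidate-and-pin. */
static int lookup_via_mapping_table(int64_t pageno) { return (int) (pageno & 0xff); }
static int try_pin_if_still_holds(int buffer, int64_t pageno) { (void) buffer; (void) pageno; return 1; }

static int
read_slru_buffer(int64_t pageno)
{
    int slot = (int) (pageno % RECENT_SLOTS);

    /* Fast path: our remembered buffer may still hold this page. */
    if (recent[slot].pageno == pageno &&
        try_pin_if_still_holds(recent[slot].buffer, pageno))
        return recent[slot].buffer;

    /* Slow path: full mapping-table lookup; remember the answer. */
    recent[slot].buffer = lookup_via_mapping_table(pageno);
    recent[slot].pageno = pageno;
    return recent[slot].buffer;
}

int
main(void)
{
    for (int i = 0; i < RECENT_SLOTS; i++)
        recent[i].pageno = -1;

    printf("buffer for page 1234: %d\n", read_slru_buffer(1234));
    printf("buffer for page 1234 again (fast path): %d\n", read_slru_buffer(1234));
    return 0;
}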

Slide 24

Slide 24 text

Discarding/truncating

• The existing system scans the whole dedicated cache looking for blocks to discard, when truncating old data
• In the main buffer cache, we must loop over the block range probing the buffer mapping table looking for blocks to discard — a bit more work! (See the sketch below.)
• Hopefully this’ll be OK because it all happens in background jobs: pg_xact and pg_multixact are trimmed during vacuuming, and pg_subtrans during checkpoints
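
A sketch of the difference: instead of scanning every slot of a small dedicated cache, the unified version probes the mapping table once per block in the discarded range. Function names are hypothetical placeholders.

/* Sketch: discarding cached pages when an SLRU is truncated.  The dedicated
 * cache scans all of its slots; a unified buffer pool instead probes the
 * mapping table for each block in the discarded range. */
#include <stdint.h>
#include <stdio.h>

/* Placeholder: look up (and invalidate) one cached block; returns 1 if found. */
static int
probe_and_invalidate(uint32_t block)
{
    printf("probe mapping table for block %u\n", block);
    return 0;
}

/* Discard every cached page in [first_block, end_block): O(range), but this
 * runs in background jobs (vacuum, checkpoint), not on the commit path. */
static void
discard_slru_range(uint32_t first_block, uint32_t end_block)
{
    for (uint32_t block = first_block; block < end_block; block++)
        probe_and_invalidate(block);
}

int
main(void)
{
    discard_slru_range(0, 4);   /* e.g. pg_xact blocks behind the frozen xid */
    return 0;
}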

Slide 25

Slide 25 text

Rough edge: grouped CLOG updates

• The current code for trying to coordinate faster CLOG updates works with extra state per CLOG buffer. If the CLOG page can’t be locked immediately, the update joins a queue and sleeps, and then updates are consolidated.
• But… going off CPU when you’re trying to commit is bad?
• Idea: Abandon the whole schmozzle. Invent a lock-free update protocol. We only need a pinned page, and we can use pg_atomic_fetch_or_u8() to set the bits without clobbering concurrent writes. Use CAS to update the page LSN while making sure it doesn’t go backwards. (See the sketch below.)
• Is LSN contention and false sharing traffic worse than the sleeping?
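
A self-contained sketch of the proposed lock-free protocol, using C11 atomics in place of PostgreSQL’s pg_atomic API (pg_atomic_fetch_or_u8() would be a new addition, as the slide implies): fetch-or sets the two status bits for one xid without disturbing neighbours in the same byte, and a compare-and-swap loop advances the page LSN monotonically.

/* Sketch of a lock-free CLOG update: set a transaction's 2 status bits with
 * an atomic fetch-or (neighbouring xids share the byte, so a plain store
 * would clobber them), and advance the page's LSN with a CAS loop that never
 * lets it move backwards.  C11 atomics stand in for PostgreSQL's pg_atomic
 * API; layout constants follow CLOG's 2-bits-per-xid format, but everything
 * is squeezed onto one demo page for simplicity. */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define CLOG_BITS_PER_XACT   2
#define CLOG_XACTS_PER_BYTE  4
#define TRANSACTION_STATUS_COMMITTED 0x01

static _Atomic uint8_t clog_page[8192];      /* one raw CLOG page */
static _Atomic uint64_t page_lsn;            /* external LSN for that page */

static void
set_xid_status_committed(uint64_t xid, uint64_t commit_lsn)
{
    size_t  byteno = (size_t) (xid / CLOG_XACTS_PER_BYTE) % sizeof(clog_page);
    int     bshift = (int) (xid % CLOG_XACTS_PER_BYTE) * CLOG_BITS_PER_XACT;
    uint8_t mask = (uint8_t) (TRANSACTION_STATUS_COMMITTED << bshift);
    uint64_t seen;

    /* Set our 2 bits; other backends updating neighbouring xids in the same
     * byte are not disturbed. */
    atomic_fetch_or(&clog_page[byteno], mask);

    /* Advance the page LSN, but never backwards. */
    seen = atomic_load(&page_lsn);
    while (seen < commit_lsn &&
           !atomic_compare_exchange_weak(&page_lsn, &seen, commit_lsn))
        ;   /* 'seen' is reloaded by the failed CAS; retry if still behind */
}

int
main(void)
{
    set_xid_status_committed(1000001, 0x2000);
    set_xid_status_committed(1000002, 0x1000);  /* older LSN: must not regress */
    printf("page LSN: %#lx\n", (unsigned long) atomic_load(&page_lsn));
    return 0;
}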

Slide 26

Slide 26 text

• Short version
• Transam and CLOG history
• Let a thousand SLRUs bloom
• Ways the buffer pool is better
• Reunification
• Prototype

Slide 27

Slide 27 text

Early experiment stage patch

• See pgsql-hackers@ mailing list or commitfest entry #3514 for code
• Many open questions around locking and other details!
• Runs well enough to begin exploring basic ideas…

Slide 28

Slide 28 text

postgres=# WITH slru(relfilenode, path) AS
             (VALUES (0, 'pg_xact'),
                     (1, 'pg_multixact/offsets'),
                     (2, 'pg_multixact/members'),
                     (3, 'pg_subtrans'),
                     (4, 'pg_serial'),
                     (5, 'pg_commit_ts'),
                     (6, 'pg_notify'))
           SELECT path, pg_size_pretty(COUNT(*) * 8192)
           FROM pg_buffercache NATURAL JOIN slru
           WHERE reldatabase = 9
           GROUP BY 1 ORDER BY 1;

         path         | pg_size_pretty
----------------------+----------------
 pg_multixact/offsets | 8192 bytes
 pg_subtrans          | 31 MB
 pg_xact              | 3560 kB
(3 rows)

Slide 29

Slide 29 text

postgres=# select test_clog_fetch('0'::xid, '8388608'::xid, 1);
NOTICE:  xid range 0, 8388608; loop = 1
 test_clog_fetch
-----------------

(1 row)

Time: 222.341 ms
Time: 360.045 ms

Slide 30

Slide 30 text

postgres=# select * from pg_stat_slru; name | blks_zeroed | blks_hit | blks_read | blks_written | blks_exists | flu -----------------+-------------+-----------+-----------+--------------+-------------+---- Xact | 1166 | 197753416 | 721 | 661 | 0 | MultiXactOffset | 0 | 0 | 7 | 6 | 0 | MultiXactMember | 0 | 0 | 0 | 0 | 0 | Subtrans | 18667 | 0 | 0 | 10432 | 0 | Serial | 0 | 0 | 0 | 0 | 0 | CommitTs | 0 | 0 | 0 | 0 | 0 | Notify | 0 | 0 | 0 | 0 | 0 | (7 rows)

Slide 31

Slide 31 text

fin