Improving the SLRU subsystem

A talk I gave at PGCon 2022 about work in progress.


Here is a temporary video link: http://cfbot.cputube.org/tmp/pgcon2022-tmunro-slru.m4v

I will add the final video link to the PGCon channel when it is available.

Thomas Munro

May 28, 2022

  1. Thomas Munro, PostgreSQL hacker @ Microsoft. PGCon 2022, which should have been in Ottawa. /sləˈruː/ WIP: Improving PostgreSQL’s mysterious SLRU subsystem
  2. Short version · Transam and CLOG history · Let a thousand SLRUs bloom · Ways the buffer pool is better · Reunification · Prototype
  3. • The SLRUs (“Simple Least Recently Used”) provide storage management and buffered access for 7 different types of critical transaction-related data
     • I would like to keep the file management part of the SLRU system unchanged in this project
     • I would like to replace the separate fixed-size caches with ordinary buffers in the main buffer pool
     • Status: work in progress!
  4. Today: partitioned caches
     Corresponding SLRU files on disk:
         base/
         ├── 1
         │   ├── 112
         │   ├── 113
         │   ├── 1247
         │   ├── 1247_fsm
         │   ├── 1247_vm
         │   ├── 1249
         │   ├── 1249_fsm
         │   ├── 1249_vm
         │   ├── 1255
         │   ├── 1255_fsm
         │   ├── 1255_vm
         │   ├── 1259
         │   ├── …
         pg_xact
         ├── 0000
         └── 0001
         pg_subtrans
         ├── 0006
         ├── 0007
         ├── 0008
         ├── …
     Shared memory usage (information from the pg_shmem_allocations view):
         ┌─────────────────┬────────────────┐
         │ name            │ pg_size_pretty │
         ├─────────────────┼────────────────┤
         │ MultiXactOffset │ 65 kB          │
         │ Notify          │ 65 kB          │
         │ MultiXactMember │ 130 kB         │
         │ Serial          │ 130 kB         │
         │ Subtrans        │ 261 kB         │
         │ Xact            │ 2067 kB        │
         │ CommitTs        │ 2085 kB        │
         │ Buffer Blocks   │ 1024 MB        │
         └─────────────────┴────────────────┘
     “Buffer Blocks” is the main buffer pool, here shared_buffers=1GB
  5. Concept: unified buffer pool
     No change to files on disk:
         base/
         ├── 1
         │   ├── 112
         │   ├── 113
         │   ├── 1247
         │   ├── 1247_fsm
         │   ├── 1247_vm
         │   ├── 1249
         │   ├── 1249_fsm
         │   ├── 1249_vm
         │   ├── 1255
         │   ├── 1255_fsm
         │   ├── 1255_vm
         │   ├── 1259
         │   ├── …
         pg_xact
         ├── 0000
         └── 0001
         pg_subtrans
         ├── 0006
         ├── 0007
         ├── 0008
         ├── …
     Shared memory usage:
         ┌─────────────────┬────────────────┐
         │ name            │ pg_size_pretty │
         ├─────────────────┼────────────────┤
         │ Buffer Blocks   │ 1024 MB        │
         └─────────────────┴────────────────┘
     All cached data has to compete for space according to the replacement policy
  6. Sizing isn’t the only problem
     • We already know that some specific workloads produce a lot of SLRU cache misses and can be improved by resizing. So far we have resisted the urge to add 7 hard-to-tune settings for manual size control
     • The main buffer pool also has better scalability, asynchronous write-back, checksums, …, and plans for more features such as encryption, smart block storage and many more potential innovations
  7. Can we unify buffers without slowing anything down? Can we speed anything up? Can we future-proof core transactional data?
  8. Short version · Transam and CLOG history · Let a thousand SLRUs bloom · Ways the buffer pool is better · Reunification · Prototype
  9. • University POSTGRES had no traditional transaction log (WAL)
     • Instead it had a minimalist log to record which transactions committed and aborted, with two bits per transaction
     • Most other systems use locking and undo-on-rollback to make uncommitted data invisible, or they use undo-based MVCC
  10. • Before 7.2, commit/abort flags were accessed via the main buffer pool using a special pseudo-relation called “pg_log”. Problems:
          • It kept growing and was kept forever on disk
          • When 32-bit transaction IDs ran out you had to dump and restore
      • In 7.2, it was kicked out into its own mini buffer pool to support changes:
          • Data was cut into 256K files so that old files could be unlinked
          • The xid was allowed to wrap around
          • Renamed to pg_clog (and later in 10, renamed again to pg_xact)
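
The "two bits per transaction" and "256K files" figures above can be turned into a small address calculation. The following is a minimal sketch, assuming the default 8 kB block size and the 32-pages-per-segment layout used by PostgreSQL's SLRU code; the function name `clog_location` is hypothetical and exists only for illustration.

```python
# Layout constants matching a default PostgreSQL build (assumed here).
BLCKSZ = 8192                                         # bytes per page
CLOG_BITS_PER_XACT = 2                                # two status bits per transaction
CLOG_XACTS_PER_BYTE = 8 // CLOG_BITS_PER_XACT         # 4 transactions per byte
CLOG_XACTS_PER_PAGE = BLCKSZ * CLOG_XACTS_PER_BYTE    # 32768 transactions per page
SLRU_PAGES_PER_SEGMENT = 32                           # 32 pages * 8 kB = 256 kB per file


def clog_location(xid: int):
    """Hypothetical helper: map an xid to (segment file name, page within
    the segment, byte within the page, bit shift within the byte)."""
    page = xid // CLOG_XACTS_PER_PAGE
    segment = page // SLRU_PAGES_PER_SEGMENT
    byte = (xid % CLOG_XACTS_PER_PAGE) // CLOG_XACTS_PER_BYTE
    shift = (xid % CLOG_XACTS_PER_BYTE) * CLOG_BITS_PER_XACT
    return f"{segment:04X}", page % SLRU_PAGES_PER_SEGMENT, byte, shift


print(clog_location(5))        # status bits for xid 5
print(clog_location(8388608))  # first xid of the ninth segment file
```

Under these assumptions each segment file covers 1,048,576 transaction IDs, which is why whole files can be unlinked once the IDs they cover are no longer needed.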
  11. Short version · Transam and CLOG history · Let a thousand SLRUs bloom · Ways the buffer pool is better · Reunification · Prototype
  12. CLOG was generalised to support re-use
      • Subtransactions arrived in 8.0 to implement SQL SAVEPOINT, and needed a spillable place to store the parent for each transaction ID
      • Multixacts arrived in 8.1 to implement shared locks, so that foreign key checks didn’t block other sessions. This required storage for sets of transactions that hold a FOR SHARE lock on a tuple, and uses two SLRUs
      • NOTIFY (since reimplementation) and SERIALIZABLE also added SLRUs
      • Special mention for commit timestamps, which are much like “pg_time” from university POSTGRES; similar to pg_log, this was buffered as a relation
  13. Short version · Transam and CLOG history · Let a thousand SLRUs bloom · Ways the buffer pool is better · Reunification · Prototype
  14. Lookup: SLRU vs buffer pool
      • SLRUs use linear search to find pages. This made sense for small numbers of buffers on small machines, but begins to burn too much CPU above hundreds of buffers.
      • One big LWLock for lookup and access.
      • The buffer pool has a partitioned hash table for fast, scalable buffer lookup.
      • Buffers have pins, various atomic flags and content locks. (Which of these do we need?)
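
The contrast between the two lookup strategies can be sketched in a few lines. This is a toy model, not PostgreSQL code: the class names `SlruCache` and `PartitionedTable` are hypothetical, and Python's `threading.Lock` stands in for an LWLock.

```python
import threading


class SlruCache:
    """Toy SLRU-style cache: one lock, and a linear scan over every slot."""

    def __init__(self, nslots):
        self.lock = threading.Lock()            # the single big lock
        self.page_numbers = [None] * nslots     # page held by each slot

    def lookup(self, pageno):
        with self.lock:
            for slot, cached in enumerate(self.page_numbers):  # O(nslots)
                if cached == pageno:
                    return slot
            return None


class PartitionedTable:
    """Toy buffer mapping table: the hash of the key selects a partition,
    and each partition has its own lock and its own dictionary."""

    def __init__(self, npartitions):
        self.locks = [threading.Lock() for _ in range(npartitions)]
        self.maps = [dict() for _ in range(npartitions)]

    def _index(self, tag):
        return hash(tag) % len(self.maps)

    def insert(self, tag, buffer_id):
        i = self._index(tag)
        with self.locks[i]:
            self.maps[i][tag] = buffer_id

    def lookup(self, tag):
        i = self._index(tag)
        with self.locks[i]:                     # only one partition is locked
            return self.maps[i].get(tag)
```

In the toy model, lookups for keys that hash to different partitions never contend on the same lock, while every `SlruCache.lookup` call serialises on one lock and pays for the scan.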
  15. Replacement: SLRU vs buffer pool
      • SLRUs use an approximation of LRU to decide which buffer to replace. Replacement has to perform a linear scan to find the least recently used buffer, while locking the whole SLRU.
      • The buffer pool uses a generalisation of the CLOCK algorithm from traditional Unix (originally Multics, 1969) buffer caches. It tries to account for recency and frequency of use.
      • It is not the state of the art, and does degrade to scanning linearly for replaceable buffers in the worst case, but it is likely to be improved over time. (CAR?)
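
A generalised CLOCK sweep of the kind described above can be sketched as follows. This is a simplified illustration, not the buffer manager's actual code: the class `ClockSweep` is hypothetical, and the usage-count cap of 5 is an assumption based on PostgreSQL's default.

```python
MAX_USAGE_COUNT = 5   # assumed cap on the per-buffer usage counter


class ClockSweep:
    """Toy clock sweep: each access bumps a small counter (frequency), and
    the rotating hand decrements counters as it passes (recency)."""

    def __init__(self, nbuffers):
        self.usage = [0] * nbuffers
        self.pinned = [False] * nbuffers
        self.hand = 0

    def touch(self, buf):
        # Called when a buffer is accessed.
        self.usage[buf] = min(self.usage[buf] + 1, MAX_USAGE_COUNT)

    def find_victim(self):
        n = len(self.usage)
        # Worst case: every counter must be swept down to zero first, which
        # is the linear-scan degradation mentioned on the slide.
        for _ in range(n * (MAX_USAGE_COUNT + 1)):
            buf = self.hand
            self.hand = (self.hand + 1) % n
            if self.pinned[buf]:
                continue                 # pinned buffers cannot be evicted
            if self.usage[buf] == 0:
                return buf               # not used since the hand last passed
            self.usage[buf] -= 1         # give it another lap
        raise RuntimeError("no unpinned buffers available")


sweep = ClockSweep(3)
sweep.touch(0)
sweep.touch(0)
sweep.touch(1)
print(sweep.find_victim())   # the untouched buffer is chosen first
```

A frequently touched buffer survives several laps of the hand, whereas a strict LRU would evict it as soon as it became the oldest entry.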
  16. Write-back: SLRU vs buffer pool
      • If a dirty SLRU buffer must be evicted to make room to read in a new buffer, it must be done synchronously.
      • Before 13, we used to fsync synchronously too when that happened. Handing that part off was a stepping stone for this work (collaboration with Shawn Debnath).
      • The main buffer pool has a “background writer” process that tries to make sure that clean buffers are always available.
      • sync_file_range() is used to start write-back before fsync(), for extra I/O concurrency
      • The background writer may not always succeed in that goal, but improvements are possible.
  17. Corruption detection: SLRU vs buffer pool
      • If SLRU files are corrupted outside PostgreSQL, there are currently no checksums to detect that
      • The main buffer pool supports optional per-page checksums that are computed at write time and verified at read time
      Note: the current prototype is not attempting to implement this part for SLRU data (more on that soon). One step at a time…
  18. Future buffer pool innovations (this slide is pure conjecture!)
      • Open source projects such as Neon (“open source alternative to AWS Aurora Postgres”, see nearby talk) are working on distributed WAL-aware block storage systems that integrate at the buffer pool level. It must eventually help if all relevant data for a cluster is managed through one buffer pool with standardised LSN placement…?
      • The AIO proposal for PostgreSQL adds optional direct IO, asynchronous IO, IO merging, scatter/gather, and is integrated with the buffer pool.
      • In-progress work to add TDE (transparent data encryption) probably integrates with the buffer pool a bit like checksums.
      • A better replacement policy?
  19. Short version · Transam and CLOG history · Let a thousand SLRUs bloom · Ways the buffer pool is better · Reunification · Prototype
  20. Buffer tags: hash table keys for finding buffers
      [Diagram: tag fields spc, db, rel, fork, block]
      • The current prototype introduces a special database OID to represent SLRU data, and then the relation ID selects the particular SLRU
      • Other schemes are possible! People have suggested stealing fork number bits or special tablespace IDs.
      • spc = DEFAULTTABLESPACE_OID, db = SLRU_DB_ID, block = ?
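
The tagging scheme above can be sketched as a hashable tuple. This is an illustration under stated assumptions, not the prototype's code: the database ID 9 and the per-SLRU relation numbers are taken from the pg_buffercache query later in this deck, 1663 is the OID of PostgreSQL's default tablespace, and the choices of fork 0 and of using the SLRU page number as the block number are guesses made for this example (the slide leaves the block open). The helper `slru_buffer_tag` is hypothetical.

```python
from typing import NamedTuple

DEFAULTTABLESPACE_OID = 1663   # OID of the pg_default tablespace
SLRU_DB_ID = 9                 # value used in the deck's pg_buffercache query
MAIN_FORKNUM = 0               # assumption: SLRU pages live in the main fork

# Relation numbers per SLRU, as listed in the deck's pg_buffercache query.
SLRU_REL_IDS = {
    "pg_xact": 0,
    "pg_multixact/offsets": 1,
    "pg_multixact/members": 2,
    "pg_subtrans": 3,
    "pg_serial": 4,
    "pg_commit_ts": 5,
    "pg_notify": 6,
}


class BufferTag(NamedTuple):
    """Toy buffer tag: immutable and hashable, so usable as a lookup key."""
    spc: int
    db: int
    rel: int
    fork: int
    block: int


def slru_buffer_tag(slru: str, pageno: int) -> BufferTag:
    # Assumption: the block number is simply the SLRU page number.
    return BufferTag(DEFAULTTABLESPACE_OID, SLRU_DB_ID,
                     SLRU_REL_IDS[slru], MAIN_FORKNUM, pageno)


mapping = {slru_buffer_tag("pg_xact", 3): 42}    # tag -> buffer id
print(mapping[slru_buffer_tag("pg_xact", 3)])
```

Because the tag has the same shape as the tag of an ordinary relation page, SLRU pages and table pages can share one lookup table without special cases.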
  21. Rough edge: raw pages
      • For now, I keep file and page layout unchanged, that is, without any header:
          • No checksums: for those to be power-loss crash safe, we’d need to implement full page writes and register all modified pages in the WAL. Out of scope.
          • No LSN: file format change vs pg_upgrade. Out of scope.
          • External LSNs: we still need a place to track the last LSN to modify each page, so we need a new array to hold them (may be able to chisel some space out of padding in BufferDescriptor)
      • These could all be fixed, removing the need to handle raw pages
  22. Storage manager
      • The buffer manager interacts with files by using smgrread() etc.
      • smgr.c historically had a place to dispatch to different implementations through a function table, though only md.c remains
      • This provides an easy way to teach smgr.c to recognise the special database at open time and select the function table for slru.c
      [Diagram: bufmgr.c, smgr.c, md.c, slru.c; files shown: base/ (1/112, 1/113, 1/1247, …) and pg_xact (0000, 0001)]
  23. Lookup, pinning and locking overheads
      • The existing SLRU system acquires a single LWLock to perform a linear search for a page number and then access the data, which is pretty fast if the cache is small and it’s a cache hit
      • A naive implementation using the buffer pool would build a buffer tag, compute a hash, acquire/release an LWLock for a partition of the lookup table to find a buffer, pin the buffer, and then share-lock the page while reading the bits it needs. Ouch.
      • Current theory: we can avoid the buffer mapping table in common cases with a small cache of recent buffers and BufferReadRecent()
      • Current theory: in many interesting cases we only need a pin to read
  24. Discarding/truncating
      • The existing system scans the whole dedicated cache looking for blocks to discard, when truncating old data
      • In the main buffer cache, we must loop over the block range probing the buffer mapping table looking for blocks to discard: a bit more work!
      • Hopefully this’ll be OK because it all happens in background jobs: pg_xact and pg_multixact are trimmed during vacuuming, and pg_subtrans during checkpoints
  25. Rough edge: grouped CLOG updates
      • The current code for trying to coordinate faster CLOG updates works with extra state per CLOG buffer. If the CLOG page can’t be locked immediately, the update joins a queue and sleeps, and then updates are consolidated.
      • But… going off CPU when you’re trying to commit is bad?
      • Idea: Abandon the whole schmozzle. Invent a lock-free update protocol. We only need a pinned page, and we can use pg_atomic_fetch_or_u8() to set the bits without clobbering concurrent writes. Use CAS to update the page LSN while making sure it doesn’t go backwards.
      • Is LSN contention and false sharing traffic worse than the sleeping?
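
The lock-free idea above can be modelled in miniature. Python has no hardware atomics, so the hypothetical `AtomicInt` class below uses an internal lock purely to emulate an atomic fetch-or and compare-and-swap; `set_status` and `advance_lsn` are illustrative names, not functions from the patch. The status value 0x01 for "committed" follows PostgreSQL's CLOG encoding.

```python
import threading


class AtomicInt:
    """Stand-in for a hardware atomic variable. The internal lock only
    emulates atomicity in Python; real code would use CPU instructions."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def fetch_or(self, mask):
        with self._lock:
            old = self._value
            self._value |= mask
            return old

    def compare_exchange(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


COMMITTED = 0x01   # CLOG status value for a committed transaction


def set_status(status_byte, xid, status):
    # Four transactions share a byte; OR-ing touches only this xid's bits.
    shift = (xid % 4) * 2
    status_byte.fetch_or(status << shift)


def advance_lsn(page_lsn, lsn):
    # CAS loop: install lsn only if it moves the page LSN forwards.
    while True:
        current = page_lsn.load()
        if current >= lsn:
            return current           # someone already recorded a later LSN
        if page_lsn.compare_exchange(current, lsn):
            return lsn
```

Since OR can only set bits, this sketch covers transitions out of the all-zero "in progress" state; any transition that must clear a bit would need a different primitive.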
  26. Short version · Transam and CLOG history · Let a thousand SLRUs bloom · Ways the buffer pool is better · Reunification · Prototype
  27. Early experiment stage patch
      • See pgsql-hackers@ mailing list or commitfest entry #3514 for code
      • Many open questions around locking and other details!
      • Runs well enough to begin exploring basic ideas…
  28. postgres=# WITH slru(relfilenode, path) AS
                   (VALUES (0, 'pg_xact'),
                           (1, 'pg_multixact/offsets'),
                           (2, 'pg_multixact/members'),
                           (3, 'pg_subtrans'),
                           (4, 'pg_serial'),
                           (5, 'pg_commit_ts'),
                           (6, 'pg_notify'))
                 SELECT path, pg_size_pretty(COUNT(*) * 8192)
                   FROM pg_buffercache NATURAL JOIN slru
                  WHERE reldatabase = 9
                  GROUP BY 1
                  ORDER BY 1;
               path         | pg_size_pretty
      ----------------------+----------------
       pg_multixact/offsets | 8192 bytes
       pg_subtrans          | 31 MB
       pg_xact              | 3560 kB
      (3 rows)
  29. postgres=# select test_clog_fetch('0'::xid, '8388608'::xid, 1);
      NOTICE:  xid range 0, 8388608; loop = 1
       test_clog_fetch
      -----------------

      (1 row)

      Time: 222.341 ms
      Time: 360.045 ms
  30. postgres=# select * from pg_stat_slru;
            name       | blks_zeroed | blks_hit  | blks_read | blks_written | blks_exists | flu
      -----------------+-------------+-----------+-----------+--------------+-------------+----
       Xact            |        1166 | 197753416 |       721 |          661 |           0 |
       MultiXactOffset |           0 |         0 |         7 |            6 |           0 |
       MultiXactMember |           0 |         0 |         0 |            0 |           0 |
       Subtrans        |       18667 |         0 |         0 |        10432 |           0 |
       Serial          |           0 |         0 |         0 |            0 |           0 |
       CommitTs        |           0 |         0 |         0 |            0 |           0 |
       Notify          |           0 |         0 |         0 |            0 |           0 |
      (7 rows)
  31. fin