and buffered access for 7 different types of critical transaction-related data
• I would like to keep the file management part of the SLRU system unchanged in this project
• I would like to replace the separate fixed-size caches with ordinary buffers in the main buffer pool
• Status: work in progress!
Concept: unified buffer pool
• All cached data has to compete for space according to the replacement policy
• No change to files on disk

│   ├── 1247
│   ├── 1247_fsm
│   ├── 1247_vm
│   ├── 1249
│   ├── 1249_fsm
│   ├── 1249_vm
│   ├── 1255
│   ├── 1255_fsm
│   ├── 1255_vm
│   ├── 1259
│   ├── …

pg_xact
├── 0000
└── 0001

pg_subtrans
├── 0006
├── 0007
├── 0008
├── …

┌─────────────────┬────────────────┐
│ name            │ pg_size_pretty │
├─────────────────┼────────────────┤
│ Buffer Blocks   │ 1024 MB        │
└─────────────────┴────────────────┘
some specific workloads produce a lot of SLRU cache misses and can be improved by resizing. So far we have resisted the urge to add 7 hard-to-tune settings for manual size control
• The main buffer pool also has better scalability, asynchronous write-back, checksums, …, and plans for more features like encryption, smart block storage and other potential innovations
Instead it had a minimalist log to record which transactions committed and aborted, with two bits per transaction
• Most other systems use locking and undo-on-rollback to make uncommitted data invisible, or they use undo-based MVCC
main buffer pool using a special pseudo-relation called “pg_log”. Problems:
• It kept growing and was kept forever on disk
• When 32 bit transaction IDs ran out you had to dump and restore
• In 7.2, it was kicked out into its own mini buffer pool to support changes:
  • Data was cut into 256kB files so that old files could be unlinked
  • The xid was allowed to wrap around
  • Renamed to pg_clog (and later in 10, renamed again to pg_xact)
8.0 to implement SQL SAVEPOINT, and needed a spillable place to store the parent for each transaction ID
• Multixacts arrived in 8.1 to implement shared locks, so that foreign key checks didn’t block other sessions. This required storage for sets of transactions that hold a FOR SHARE lock on a tuple, and uses two SLRUs
• NOTIFY (since its reimplementation) and SERIALIZABLE also added SLRUs
• Special mention for commit timestamps, which are much like “pg_time” from university POSTGRES; similar to pg_log, it was buffered as a relation
Lookup: SLRU vs buffer pool
made sense for small numbers of buffers on small machines, but begins to burn too much CPU above hundreds of buffers.
• One big LWLock for lookup and access.
• The buffer pool has a partitioned hash table for fast, scalable buffer lookup.
• Buffers have pins, various atomic flags and content locks. (Which of these do we need?)
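A standalone sketch of why the SLRU-style lookup stops scaling: every lookup is a linear scan over all slots while a single lock is held, so the cost grows with the number of buffers, whereas the buffer pool's hash-partitioned lookup stays constant-time and only locks one partition. The names and sizes below are illustrative, not PostgreSQL's.

#include <stdint.h>
#include <stddef.h>

#define SLRU_NSLOTS 128             /* assumed small, fixed cache size */

typedef struct SlruCacheModel
{
    int64_t page_number[SLRU_NSLOTS];   /* which page each slot holds */
    char   *page_data[SLRU_NSLOTS];
} SlruCacheModel;

/* Linear search: fine for a handful of slots, but O(n) per lookup. */
static char *
slru_lookup(SlruCacheModel *cache, int64_t pageno)
{
    for (int slot = 0; slot < SLRU_NSLOTS; slot++)
    {
        if (cache->page_number[slot] == pageno)
            return cache->page_data[slot];
    }
    return NULL;                    /* cache miss: caller must read the page */
}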
Replacement: SLRU vs buffer pool
buffer to replace. Replacement has to perform a linear scan to find the least recently used buffer, while locking the whole SLRU.
• The buffer pool uses a generalisation of the CLOCK algorithm from traditional Unix (originally Multics, 1969) buffer caches. It tries to account for recency and frequency of use.
• It is not the state of the art, and does degrade to scanning linearly for replaceable buffers in the worst case, but it is likely to be improved over time. (CAR?)
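A standalone sketch of a clock-sweep victim search in the spirit of the main buffer pool's generalised CLOCK: each buffer keeps a small usage count that is decremented as the hand passes, and pinned or recently used buffers are skipped, so both recency and frequency matter. Field and function names are illustrative rather than the actual buffer-manager definitions.

#include <stdint.h>

#define NBUFFERS   16384
#define MAX_USAGE  5

typedef struct BufferDescModel
{
    uint32_t refcount;       /* pins held by backends */
    uint32_t usage_count;    /* bumped on access, capped at MAX_USAGE */
} BufferDescModel;

static BufferDescModel buffers[NBUFFERS];
static uint32_t clock_hand;

/* Returns the index of a buffer that may be evicted. */
static int
clock_sweep_victim(void)
{
    for (;;)
    {
        BufferDescModel *buf = &buffers[clock_hand];

        clock_hand = (clock_hand + 1) % NBUFFERS;

        if (buf->refcount > 0)
            continue;                 /* pinned: cannot evict */
        if (buf->usage_count > 0)
        {
            buf->usage_count--;       /* give it another trip around */
            continue;
        }
        return (int) (buf - buffers); /* unused and unpinned: victim */
    }
}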
Write-back: SLRU vs buffer pool
evicted to make room to read in a new buffer, it must be done synchronously.
• Before 13, we used to fsync synchronously too when that happened. Handing that part off was a stepping stone for this work (collaboration with Shawn Debnath).
• The main buffer pool has a “background writer” process that tries to make sure that clean buffers are always available.
• sync_file_range() is used to start write-back before fsync(), for extra I/O concurrency
• The background writer may not always succeed in that goal, but improvements are possible.
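A minimal Linux-only sketch of the “start write-back early, fsync later” pattern mentioned above: sync_file_range() asks the kernel to begin writing a dirty byte range without waiting for completion, so a later fsync() finds less work queued. Error handling is deliberately minimal.

#define _GNU_SOURCE
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

static void
start_writeback(int fd, off_t offset, off_t nbytes)
{
    /* Kick off asynchronous write-back for this byte range. */
    if (sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE) != 0)
        perror("sync_file_range");
}

static void
finish_writeback(int fd)
{
    /* Later, make it durable; ideally most data is already on its way out. */
    if (fsync(fd) != 0)
        perror("fsync");
}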
Corruption detection: SLRU vs buffer pool
are currently no checksums to detect that
• The main buffer pool supports optional per-page checksums that are computed at write time and verified at read time
Note: the current prototype is not attempting to implement this part for SLRU data (more on that soon). One step at a time…
Future buffer pool innovations
This slide is pure conjecture!
to AWS Aurora Postgres”, see nearby talk) are working on distributed WAL-aware block storage systems that integrate at the buffer pool level
• It must eventually help if all relevant data for a cluster is managed through one buffer pool with standardised LSN placement…?
• The AIO proposal for PostgreSQL adds optional direct I/O, asynchronous I/O, I/O merging, scatter/gather, and is integrated with the buffer pool.
• In-progress work to add TDE (transparent data encryption) probably integrates with the buffer pool a bit like checksums.
• A better replacement policy?
Buffer tags
Hash table keys for finding buffers: spc | db | rel | fork | block
represent SLRU data, and then the relation ID selects the particular SLRU
• Other schemes are possible! People have suggested stealing fork number bits or special tablespace IDs.
• spc = DEFAULTTABLESPACE_OID, db = SLRU_DB_ID, rel = SLRU_CLOG_REL_ID, fork = MAIN_FORKNUM, block = ?
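A standalone sketch of the tagging scheme above: an SLRU page is addressed with an ordinary buffer tag whose database OID is a reserved pseudo-database and whose relation OID selects the particular SLRU. The struct layout and the SLRU_* values are illustrative assumptions, not the actual PostgreSQL definitions (DEFAULTTABLESPACE_OID and MAIN_FORKNUM do match the real constants).

#include <stdint.h>

typedef uint32_t Oid;
typedef uint32_t BlockNumber;

typedef struct BufferTagModel
{
    Oid         spcOid;     /* tablespace */
    Oid         dbOid;      /* database (here: reserved SLRU pseudo-db) */
    Oid         relOid;     /* relation (here: which SLRU) */
    int         forkNum;    /* fork; MAIN_FORKNUM for SLRU data */
    BlockNumber blockNum;   /* page number within the SLRU */
} BufferTagModel;

#define DEFAULTTABLESPACE_OID 1663
#define SLRU_DB_ID            9       /* assumed reserved pseudo-database */
#define SLRU_CLOG_REL_ID      0       /* assumed ID selecting pg_xact */
#define MAIN_FORKNUM          0

static BufferTagModel
make_clog_buffer_tag(BlockNumber pageno)
{
    BufferTagModel tag = {
        .spcOid   = DEFAULTTABLESPACE_OID,
        .dbOid    = SLRU_DB_ID,
        .relOid   = SLRU_CLOG_REL_ID,
        .forkNum  = MAIN_FORKNUM,
        .blockNum = pageno
    };
    return tag;
}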
Rough edge: raw pages
unchanged, that is, without any header:
• No checksums: for those to be power-loss crash safe, we’d need to implement full page writes and register all modified pages in the WAL. Out of scope.
• No LSN: file format change vs pg_upgrade. Out of scope.
• External LSNs: we still need a place to track the last LSN to modify each page, so we need a new array to hold them (we may be able to chisel some space out of padding in BufferDescriptor)
• These could all be fixed, removing the need to handle raw pages
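A standalone sketch of the “external LSN” idea above: because raw SLRU pages carry no page header, the last LSN to modify each buffered page is tracked in a separate shared array indexed by buffer number. The names and the use of a plain array (rather than space chiselled out of the buffer descriptor) are assumptions for illustration.

#include <stdint.h>

typedef uint64_t XLogRecPtr;

#define NBUFFERS 16384

/* One slot per shared buffer; only meaningful for header-less SLRU pages. */
static XLogRecPtr slru_buffer_lsn[NBUFFERS];

/* Remember the latest WAL position that dirtied this buffer. */
static void
slru_set_page_lsn(int buffer_id, XLogRecPtr lsn)
{
    if (lsn > slru_buffer_lsn[buffer_id])
        slru_buffer_lsn[buffer_id] = lsn;
}

/* Write-back must not happen before WAL up to this LSN has been flushed. */
static XLogRecPtr
slru_get_page_lsn(int buffer_id)
{
    return slru_buffer_lsn[buffer_id];
}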
Storage manager
by using smgrread() etc.
• smgr.c historically had a place to dispatch to different implementations through a function table, though only md.c remains
• This provides an easy way to teach smgr.c to recognise the special database at open time and select the function table for slru.c

[diagram: bufmgr.c → smgr.c → md.c / slru.c]

pg_xact
├── 0000
└── 0001

base/
├── 1
│   ├── 112
│   ├── 113
│   ├── 1247
│   ├── …
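A standalone sketch of the dispatch idea above: the storage manager keeps a table of function pointers per implementation, and the open path picks the slru.c table when it sees the reserved SLRU pseudo-database. The struct members, signatures and SLRU_DB_ID value are simplified assumptions, not the real smgr API.

#include <stdint.h>
#include <stddef.h>

typedef uint32_t Oid;
typedef uint32_t BlockNumber;

#define SLRU_DB_ID 9    /* assumed reserved pseudo-database OID */

typedef struct SmgrOpsModel
{
    void (*read)(Oid rel, BlockNumber blocknum, void *buffer);
    void (*write)(Oid rel, BlockNumber blocknum, const void *buffer);
} SmgrOpsModel;

/* Stubs standing in for md.c and slru.c implementations. */
static void md_read(Oid rel, BlockNumber b, void *buf)         { (void) rel; (void) b; (void) buf; /* base/<db>/<rel> segments */ }
static void md_write(Oid rel, BlockNumber b, const void *buf)  { (void) rel; (void) b; (void) buf; }
static void slru_read(Oid rel, BlockNumber b, void *buf)       { (void) rel; (void) b; (void) buf; /* pg_xact/, pg_subtrans/, … */ }
static void slru_write(Oid rel, BlockNumber b, const void *buf){ (void) rel; (void) b; (void) buf; }

static const SmgrOpsModel md_ops   = { md_read, md_write };
static const SmgrOpsModel slru_ops = { slru_read, slru_write };

/* At open time, the database OID selects the function table. */
static const SmgrOpsModel *
smgr_select_ops(Oid dbOid)
{
    return (dbOid == SLRU_DB_ID) ? &slru_ops : &md_ops;
}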
Lookup, pinning and locking overheads
perform a linear search for a page number and then access the data, which is pretty fast if the cache is small and it’s a cache hit
• A naive implementation using the buffer pool would build a buffer tag, compute a hash, acquire/release an LWLock for a partition of the lookup table to find a buffer, pin the buffer, and then share-lock the page while reading the bits it needs. Ouch.
• Current theory: we can avoid the buffer mapping table in common cases with a small cache of recent buffers and BufferReadRecent()
• Current theory: in many interesting cases we only need a pin to read
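A standalone sketch of the “recent buffer” fast path above: each backend remembers which buffer held the last CLOG page it looked at, and on the next access tries to revalidate and pin that buffer directly, skipping the buffer-mapping hash table on a hit. The helper names below are assumptions standing in for something like the proposed BufferReadRecent(); the stubs exist only so the sketch compiles.

#include <stdbool.h>
#include <stdint.h>

typedef int      Buffer;
typedef int64_t  PageNumber;
#define InvalidBuffer (-1)

/* Per-backend memory of the last CLOG page we touched. */
static PageNumber cached_pageno = -1;
static Buffer     cached_buffer = InvalidBuffer;

/* Stub: try to pin a previously seen buffer iff it still holds the page. */
static bool
TryPinRecentBuffer(Buffer buffer, PageNumber pageno)
{
    (void) buffer; (void) pageno;
    return false;                    /* stand-in: pretend it was recycled */
}

/* Stub: the normal tag/hash/partition-lock/pin path. */
static Buffer
ReadSlruPage(PageNumber pageno)
{
    (void) pageno;
    return 0;
}

static Buffer
read_clog_page(PageNumber pageno)
{
    if (pageno == cached_pageno &&
        TryPinRecentBuffer(cached_buffer, pageno))
        return cached_buffer;        /* fast path: no hash table lookup */

    cached_buffer = ReadSlruPage(pageno);   /* slow path: normal lookup */
    cached_pageno = pageno;
    return cached_buffer;
}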
Discarding/truncating
for blocks to discard, when truncating old data
• In the main buffer cache, we must loop over the block range probing the buffer mapping table looking for blocks to discard — a bit more work!
• Hopefully this’ll be OK because it all happens in background jobs: pg_xact and pg_multixact are trimmed during vacuuming, and pg_subtrans during checkpoints
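A standalone sketch of the truncation path described above: instead of the SLRU's linear scan over its few slots, the unified design walks the block range being discarded and probes the buffer mapping table once per block, invalidating any buffer it finds. The helper names are assumptions; the stubs keep the sketch self-contained.

#include <stdint.h>

typedef uint32_t BlockNumber;
typedef int      Buffer;
#define InvalidBuffer (-1)

/* Stubs standing in for buffer-mapping lookup and eviction. */
static Buffer LookupSlruBuffer(BlockNumber blocknum) { (void) blocknum; return InvalidBuffer; }
static void   InvalidateSlruBuffer(Buffer buffer)    { (void) buffer; }

/* Discard all cached pages in [first, last], e.g. when pg_xact is trimmed. */
static void
discard_slru_range(BlockNumber first, BlockNumber last)
{
    for (BlockNumber blocknum = first; blocknum <= last; blocknum++)
    {
        Buffer buffer = LookupSlruBuffer(blocknum);  /* one hash probe per block */

        if (buffer != InvalidBuffer)
            InvalidateSlruBuffer(buffer);
    }
}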
Rough edge: grouped CLOG updates
updates works with extra state per CLOG buffer. If the CLOG page can’t be locked immediately, the update joins a queue and sleeps, and then updates are consolidated.
• But… going off CPU when you’re trying to commit is bad?
• Idea: abandon the whole schmozzle. Invent a lock-free update protocol. We only need a pinned page, and we can use pg_atomic_fetch_or_u8() to set the bits without clobbering concurrent writes. Use CAS to update the page LSN while making sure it doesn’t go backwards.
• Is LSN contention and false-sharing traffic worse than the sleeping?
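A standalone C11 sketch of that lock-free idea: with the page merely pinned, atomically OR the transaction's two status bits into its byte of the CLOG page, then advance the externally tracked page LSN with a CAS loop that never lets it move backwards. Standard atomics stand in for the pg_atomic_fetch_or_u8()/CAS calls named on the slide; CLOG_BITS_PER_XACT and CLOG_XACTS_PER_BYTE match the real CLOG layout.

#include <stdatomic.h>
#include <stdint.h>

#define CLOG_BITS_PER_XACT   2
#define CLOG_XACTS_PER_BYTE  4

typedef uint64_t XLogRecPtr;

static void
set_xact_status_lockfree(_Atomic uint8_t *page_bytes,   /* pinned CLOG page */
                         _Atomic XLogRecPtr *page_lsn,  /* external LSN slot */
                         uint64_t xid_offset_in_page,   /* xid - first xid on page */
                         uint8_t status,                /* 2-bit status value */
                         XLogRecPtr commit_lsn)
{
    uint64_t byteno = xid_offset_in_page / CLOG_XACTS_PER_BYTE;
    int      bshift = (xid_offset_in_page % CLOG_XACTS_PER_BYTE) * CLOG_BITS_PER_XACT;

    /* Set our two bits without clobbering concurrent writers of other xids. */
    atomic_fetch_or(&page_bytes[byteno], (uint8_t) (status << bshift));

    /* Advance the page LSN monotonically so write-back obeys WAL-before-data. */
    XLogRecPtr old = atomic_load(page_lsn);
    while (old < commit_lsn &&
           !atomic_compare_exchange_weak(page_lsn, &old, commit_lsn))
        ;                        /* 'old' is refreshed on each CAS failure */
}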