
Improving the SLRU subsystem


A talk I gave at PGCon 2022 about work in progress.

https://www.pgcon.org/events/pgcon_2022/schedule/session/287-improving-postgresqls-mysterious-slru-subsystem/

Here is a temporary video link: http://cfbot.cputube.org/tmp/pgcon2022-tmunro-slru.m4v

I will add the final video link to the PGCon channel when it is available.

Thomas Munro

May 28, 2022



Transcript

  1. Thomas Munro, PostgreSQL hacker @ Microsoft
    PGCon 2022, which should have been in Ottawa
    /sləˈruː/
    WIP: Improving PostgreSQL’s mysterious SLRU subsystem


  2. Short version
    Transam and CLOG history
    Let a thousand SLRUs bloom
    Ways the buffer pool is better
    Reunification
    Prototype


3. • The SLRUs (“Simple Least Recently Used”) provide storage management and
    buffered access for 7 different types of critical transaction-related data

    • I would like to keep the file management part of the SLRU system
    unchanged in this project

    • I would like to replace the separate fixed-size caches with ordinary
    buffers in the main buffer pool

    • Status: work in progress!


4. base/
    ├── 1
    │   ├── 112
    │   ├── 113
    │   ├── 1247
    │   ├── 1247_fsm
    │   ├── 1247_vm
    │   ├── 1249
    │   ├── 1249_fsm
    │   ├── 1249_vm
    │   ├── 1255
    │   ├── 1255_fsm
    │   ├── 1255_vm
    │   ├── 1259
    │   ├── …

    pg_xact
    ├── 0000
    └── 0001

    pg_subtrans
    ├── 0006
    ├── 0007
    ├── 0008
    ├── …

    ┌─────────────────┬────────────────┐
    │ name            │ pg_size_pretty │
    ├─────────────────┼────────────────┤
    │ MultiXactOffset │ 65 kB          │
    │ Notify          │ 65 kB          │
    │ MultiXactMember │ 130 kB         │
    │ Serial          │ 130 kB         │
    │ Subtrans        │ 261 kB         │
    │ Xact            │ 2067 kB        │
    │ CommitTs        │ 2085 kB        │
    │ Buffer Blocks   │ 1024 MB        │
    └─────────────────┴────────────────┘

    Today: partitioned caches. The main buffer pool is here, with
    shared_buffers=1GB. Shared memory usage information from the
    pg_shmem_allocations view; corresponding SLRU files on disk.


5. base/
    ├── 1
    │   ├── 112
    │   ├── 113
    │   ├── 1247
    │   ├── 1247_fsm
    │   ├── 1247_vm
    │   ├── 1249
    │   ├── 1249_fsm
    │   ├── 1249_vm
    │   ├── 1255
    │   ├── 1255_fsm
    │   ├── 1255_vm
    │   ├── 1259
    │   ├── …

    pg_xact
    ├── 0000
    └── 0001

    pg_subtrans
    ├── 0006
    ├── 0007
    ├── 0008
    ├── …

    ┌─────────────────┬────────────────┐
    │ name            │ pg_size_pretty │
    ├─────────────────┼────────────────┤
    │ Buffer Blocks   │ 1024 MB        │
    └─────────────────┴────────────────┘

    Concept: unified buffer pool. All cached data has to compete for space
    according to the replacement policy. No change to files on disk.


6. Sizing isn’t the only problem

    • We already know that some specific workloads produce a lot of SLRU
    cache misses and can be improved by resizing. So far we have resisted
    the urge to add 7 hard-to-tune settings for manual size control

    • The main buffer pool also has better scalability, asynchronous
    write-back, checksums, …, and plans for more features like encryption,
    smart block storage and many more potential innovations


7. Can we unify buffers without slowing anything down?

    Can we speed anything up?

    Can we future-proof core transactional data?


  8. Short version
    Transam and CLOG history

    Let a thousand SLRUs bloom
    Ways the buffer pool is better
    Reunification
    Prototype


  9. • University POSTGRES had no
    traditional transaction log (WAL)

    • Instead it had a minimalist log to
    record which transactions
    committed and aborted, with two
    bits per transaction

• Most other systems use locking
    and undo-on-rollback to make
    uncommitted data invisible, or
    use undo-based MVCC
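That two-bits-per-transaction layout survives in today’s pg_xact. As a minimal illustration of the idea (constant names are modeled on clog.c; treat this as a sketch, not the exact source), the status bits for a transaction ID are found with simple arithmetic:

```c
#include <stdint.h>

/* Constants modeled on PostgreSQL's clog.c (illustrative). */
#define BLCKSZ 8192
#define CLOG_BITS_PER_XACT 2    /* in-progress/committed/aborted/sub-committed */
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)

typedef uint32_t TransactionId;

/* Which page holds the status bits for this xid? */
static inline uint32_t clog_xid_to_page(TransactionId xid)
{
    return xid / CLOG_XACTS_PER_PAGE;
}

/* Byte offset within that page. */
static inline uint32_t clog_xid_to_byte(TransactionId xid)
{
    return (xid % CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_BYTE;
}

/* Bit shift within that byte. */
static inline uint32_t clog_xid_to_shift(TransactionId xid)
{
    return (xid % CLOG_XACTS_PER_BYTE) * CLOG_BITS_PER_XACT;
}

/* Extract the 2-bit status for xid from a raw page image. */
static inline int clog_get_status(const uint8_t *page, TransactionId xid)
{
    return (page[clog_xid_to_byte(xid)] >> clog_xid_to_shift(xid)) & 0x3;
}
```

At 8 kB per page that packs 32,768 transaction statuses into one block, which is why such a small cache can cover so many transactions.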


10. • Before 7.2, commit/abort flags were accessed via the main buffer
    pool using a special pseudo-relation called “pg_log”. Problems:

      • It kept growing and was kept forever on disk

      • When 32-bit transaction IDs ran out you had to dump and restore

    • In 7.2, it was kicked out into its own mini buffer pool to support
    changes:

      • Data was cut into 256K files so that old files could be unlinked

      • The xid was allowed to wrap around

      • Renamed to pg_clog (and later, in 10, renamed again to pg_xact)


  11. Short version
    Transam and CLOG history
    Let a thousand SLRUs bloom
    Ways the buffer pool is better
    Reunification
    Prototype


12. CLOG was generalised to support re-use

    • Subtransactions arrived in 8.0 to implement SQL SAVEPOINT, and
    needed a spillable place to store the parent for each transaction ID

    • Multixacts arrived in 8.1 to implement shared locks, so that foreign
    key checks didn’t block other sessions. This required storage for sets
    of transactions that hold a FOR SHARE lock on a tuple, and uses two
    SLRUs

    • NOTIFY (since its reimplementation) and SERIALIZABLE also added SLRUs

    • Special mention for commit timestamps, which are much like “pg_time”
    from university POSTGRES; similar to pg_log, this was buffered as a
    relation


  13. Short version
    Transam and CLOG history

    Let a thousand SLRUs bloom
    Ways the buffer pool is better
    Reunification
    Prototype


14. Lookup: SLRU vs buffer pool

    • SLRUs use linear search to find pages. This made sense for small
    numbers of buffers on small machines, but begins to burn too much CPU
    above hundreds of buffers.

    • One big LWLock for lookup and access.

    • The buffer pool has a partitioned hash table for fast, scalable
    buffer lookup.

    • Buffers have pins, various atomic flags and content locks. (Which of
    these do we need?)
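The linear search can be pictured like this; a simplified sketch of what slru.c does while holding its control lock, not the actual code:

```c
/* Hypothetical, simplified model of SLRU page lookup: every access scans
 * the whole buffer array, so cost grows linearly with the buffer count. */
typedef struct SlruCache
{
    int  nbuffers;       /* number of buffer slots */
    int *page_numbers;   /* page held by each slot, -1 if empty */
} SlruCache;

/* Returns the slot holding 'pageno', or -1 on cache miss.  The real code
 * does this under the SLRU's single control LWLock. */
static int slru_lookup(const SlruCache *cache, int pageno)
{
    for (int slot = 0; slot < cache->nbuffers; slot++)
        if (cache->page_numbers[slot] == pageno)
            return slot;
    return -1;
}
```

With a handful of slots this is effectively free; with hundreds of slots every hit still pays the full O(n) scan, which is the scaling problem the slide describes.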


15. Replacement: SLRU vs buffer pool

    • SLRUs use an approximation of LRU to decide which buffer to replace.
    Replacement has to perform a linear scan to find the least recently
    used buffer, while locking the whole SLRU.

    • The buffer pool uses a generalisation of the CLOCK algorithm from
    traditional Unix (originally Multics, 1969) buffer caches. It tries to
    account for recency and frequency of use.

    • It is not the state of the art, and does degrade to scanning
    linearly for replaceable buffers in the worst case, but it is likely
    to be improved over time. (CAR?)
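For contrast, here is a toy version of the clock-sweep idea (names are illustrative; the real policy lives in the buffer manager and caps usage counts at a small maximum):

```c
#include <stdint.h>

#define NBUFFERS 8
#define MAX_USAGE 5   /* saturation point, like the buffer pool's usage cap */

typedef struct ClockPool
{
    uint8_t usage[NBUFFERS];  /* per-buffer usage count */
    int     hand;             /* clock hand position */
} ClockPool;

/* Bump usage on access (saturating), giving recently/frequently used
 * buffers credit against eviction. */
static void clock_touch(ClockPool *pool, int buf)
{
    if (pool->usage[buf] < MAX_USAGE)
        pool->usage[buf]++;
}

/* Pick a victim: sweep forward, decrementing usage counts, and evict the
 * first buffer found at zero.  Terminates within MAX_USAGE rotations. */
static int clock_evict(ClockPool *pool)
{
    for (;;)
    {
        int buf = pool->hand;
        pool->hand = (pool->hand + 1) % NBUFFERS;
        if (pool->usage[buf] == 0)
            return buf;
        pool->usage[buf]--;
    }
}
```

Unlike the SLRU scan, the sweep does not need a whole-pool lock to make progress, and a hot buffer survives several rotations before becoming a victim.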


16. Write-back: SLRU vs buffer pool

    • If a dirty SLRU buffer must be evicted to make room to read in a new
    buffer, it must be done synchronously.

    • Before 13, we used to fsync synchronously too when that happened.
    Handing that part off was a stepping stone for this work
    (collaboration with Shawn Debnath).

    • The main buffer pool has a “background writer” process that tries to
    make sure that clean buffers are always available.

    • sync_file_range() is used to start write-back before fsync(), for
    extra I/O concurrency.

    • The background writer may not always succeed in that goal, but
    improvements are possible.


17. Corruption detection: SLRU vs buffer pool

    • If SLRU files are corrupted outside PostgreSQL, there are currently
    no checksums to detect that

    • The main buffer pool supports optional per-page checksums that are
    computed at write time and verified at read time

    Note: the current prototype is not attempting to implement this part
    for SLRU data (more on that soon). One step at a time…


18. Future buffer pool innovations

    This slide is pure conjecture!

    • Open source projects such as Neon (“open source alternative to AWS
    Aurora Postgres”, see nearby talk) are working on distributed
    WAL-aware block storage systems that integrate at the buffer pool
    level. It must eventually help if all relevant data for a cluster is
    managed through one buffer pool with standardised LSN placement…?

    • The AIO proposal for PostgreSQL adds optional direct I/O,
    asynchronous I/O, I/O merging, scatter/gather, and is integrated with
    the buffer pool.

    • In-progress work to add TDE (transparent data encryption) probably
    integrates with the buffer pool a bit like checksums.

    • A better replacement policy?


  19. Short version
    Transam and CLOG history
    Let a thousand SLRUs bloom
    Ways the buffer pool is better
    Reunification
    Prototype


20. Buffer tags

    Hash table keys for finding buffers:

    spc | db | rel | fork | block

    • The current prototype introduces a special database OID to represent
    SLRU data, and then the relation ID selects the particular SLRU

    • Other schemes are possible! People have suggested stealing fork
    number bits or special tablespace IDs.

    • spc = DEFAULTTABLESPC_OID
      db = SLRU_DB_ID
      rel = SLRU_CLOG_REL_ID
      fork = MAIN_FORKNUM
      block = ?
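In code, the scheme might look like the sketch below. Field names follow PostgreSQL’s BufferTag loosely; SLRU_DB_ID and SLRU_CLOG_REL_ID are assumptions about the prototype, not committed code (the pg_buffercache query later in this deck suggests the reserved database OID is 9):

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t Oid;
typedef uint32_t BlockNumber;

#define DEFAULTTABLESPACE_OID 1663   /* pg_default */
#define SLRU_DB_ID            9      /* pseudo-database reserved for SLRU data */
#define SLRU_CLOG_REL_ID      0      /* "relation" number selecting pg_xact */
#define MAIN_FORKNUM          0

/* Loosely modeled on PostgreSQL's BufferTag: the hash table key that
 * identifies one disk block in the shared buffer pool. */
typedef struct BufferTag
{
    Oid         spc;     /* tablespace */
    Oid         db;      /* database, or SLRU_DB_ID for SLRU data */
    Oid         rel;     /* relation, or which SLRU */
    int32_t     fork;
    BlockNumber block;
} BufferTag;

/* Build the tag that would identify one pg_xact block in the unified pool. */
static BufferTag make_clog_tag(BlockNumber block)
{
    BufferTag tag;
    tag.spc = DEFAULTTABLESPACE_OID;
    tag.db = SLRU_DB_ID;
    tag.rel = SLRU_CLOG_REL_ID;
    tag.fork = MAIN_FORKNUM;
    tag.block = block;
    return tag;
}

/* Does this tag refer to SLRU data rather than an ordinary relation? */
static bool tag_is_slru(const BufferTag *tag)
{
    return tag->db == SLRU_DB_ID;
}
```

The appeal of the special-database scheme is that no tag fields change shape: existing hashing, partitioning and eviction code works on SLRU blocks unmodified.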


21. Rough edge: raw pages

    • For now, I keep file and page layout unchanged, that is, without any
    header:

      • No checksums: for those to be power-loss crash safe, we’d need to
      implement full-page writes and register all modified pages in the
      WAL. Out of scope.

      • No LSN: file format change vs pg_upgrade. Out of scope.

    • External LSNs: we still need a place to track the last LSN to modify
    each page, so we need a new array to hold them (may be able to chisel
    some space out of padding in BufferDescriptor)

    • These could all be fixed, removing the need to handle raw pages


22. Storage manager

    • The buffer manager interacts with files by using smgrread() etc.

    • smgr.c historically had a place to dispatch to different
    implementations through a function table, though only md.c remains

    • This provides an easy way to teach smgr.c to recognise the special
    database at open time and select the function table for slru.c

    bufmgr.c
       ↓
    smgr.c
     ↙    ↘
    md.c   slru.c

    md.c → base/
           ├── 1
           │   ├── 112
           │   ├── 113
           │   ├── 1247
           │   ├── …

    slru.c → pg_xact
             ├── 0000
             └── 0001
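The dispatch idea can be sketched like this. The f_smgr shape is modeled loosely on smgr.c’s real function table (which has many more entries); the slru_smgr table and SLRU_DB_ID are assumptions about the prototype:

```c
#include <stdint.h>

typedef uint32_t Oid;
typedef uint32_t BlockNumber;

#define SLRU_DB_ID 9   /* assumed pseudo-database OID for SLRU data */

/* A cut-down storage manager function table, in the spirit of f_smgr. */
typedef struct f_smgr
{
    const char *name;
    void (*smgr_read)(Oid rel, BlockNumber block, void *buffer);
    void (*smgr_write)(Oid rel, BlockNumber block, const void *buffer);
} f_smgr;

/* md.c-style implementation: would read/write base/<db>/<rel> segments. */
static void md_read(Oid rel, BlockNumber block, void *buffer)
{ (void) rel; (void) block; (void) buffer; }
static void md_write(Oid rel, BlockNumber block, const void *buffer)
{ (void) rel; (void) block; (void) buffer; }

/* slru.c-style implementation: would read/write pg_xact/NNNN etc. */
static void slru_read(Oid rel, BlockNumber block, void *buffer)
{ (void) rel; (void) block; (void) buffer; }
static void slru_write(Oid rel, BlockNumber block, const void *buffer)
{ (void) rel; (void) block; (void) buffer; }

static const f_smgr md_smgr   = { "md",   md_read,   md_write };
static const f_smgr slru_smgr = { "slru", slru_read, slru_write };

/* At open time, the special database OID selects the SLRU implementation;
 * everything else continues to go through md.c. */
static const f_smgr *smgr_open(Oid db)
{
    return (db == SLRU_DB_ID) ? &slru_smgr : &md_smgr;
}
```

Because bufmgr.c only ever calls through the table, it needs no knowledge of which files back a given buffer; the branch happens once, at open time.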


23. Lookup, pinning and locking overheads

    • The existing SLRU system acquires a single LWLock to perform a
    linear search for a page number and then access the data, which is
    pretty fast if the cache is small and it’s a cache hit

    • A naive implementation using the buffer pool would build a buffer
    tag, compute a hash, acquire/release an LWLock for a partition of the
    lookup table to find a buffer, pin the buffer, and then share-lock the
    page while reading the bits it needs. Ouch.

    • Current theory: we can avoid the buffer mapping table in common
    cases with a small cache of recent buffers and BufferReadRecent()

    • Current theory: in many interesting cases we only need a pin to read
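The “small cache of recent buffers” idea might look like this per-backend sketch (purely illustrative; the real recent-buffer mechanism keys on the full buffer tag and must re-verify it after pinning, since the remembered buffer may have been evicted in the meantime):

```c
typedef int Buffer;

#define RECENT_SLOTS 8
#define InvalidBuffer (-1)

/* A tiny direct-mapped, backend-local cache from page number to the
 * buffer that held it last time we looked. */
typedef struct RecentBufferCache
{
    int    pageno[RECENT_SLOTS];
    Buffer buffer[RECENT_SLOTS];
} RecentBufferCache;

static void recent_init(RecentBufferCache *c)
{
    for (int i = 0; i < RECENT_SLOTS; i++)
    {
        c->pageno[i] = -1;
        c->buffer[i] = InvalidBuffer;
    }
}

/* Remember where a page was found after a full mapping-table lookup. */
static void recent_remember(RecentBufferCache *c, int pageno, Buffer buf)
{
    int slot = pageno % RECENT_SLOTS;
    c->pageno[slot] = pageno;
    c->buffer[slot] = buf;
}

/* Fast path: returns the remembered buffer, or InvalidBuffer, in which
 * case the caller falls back to the partition-locked mapping table. */
static Buffer recent_lookup(const RecentBufferCache *c, int pageno)
{
    int slot = pageno % RECENT_SLOTS;
    return (c->pageno[slot] == pageno) ? c->buffer[slot] : InvalidBuffer;
}
```

On a hit, the backend skips the tag hash and the mapping-table partition lock entirely, recovering most of the cheapness of the old single-lock SLRU hit path.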


24. Discarding/truncating

    • The existing system scans the whole dedicated cache looking for
    blocks to discard, when truncating old data

    • In the main buffer cache, we must loop over the block range, probing
    the buffer mapping table looking for blocks to discard — a bit more
    work!

    • Hopefully this’ll be OK because it all happens in background jobs:
    pg_xact and pg_multixact are trimmed during vacuuming, and pg_subtrans
    during checkpoints


25. Rough edge: grouped CLOG updates

    • The current code for trying to coordinate faster CLOG updates works
    with extra state per CLOG buffer. If the CLOG page can’t be locked
    immediately, the update joins a queue and sleeps, and then updates are
    consolidated.

    • But… going off CPU when you’re trying to commit is bad?

    • Idea: abandon the whole schmozzle. Invent a lock-free update
    protocol. We only need a pinned page, and we can use
    pg_atomic_fetch_or_u8() to set the bits without clobbering concurrent
    writes. Use CAS to update the page LSN while making sure it doesn’t go
    backwards.

    • Is LSN contention and false-sharing traffic worse than the sleeping?
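The lock-free idea can be sketched with C11 atomics standing in for pg_atomic_fetch_or_u8() and PostgreSQL’s compare-exchange wrappers; this is an illustration of the protocol, not prototype code:

```c
#include <stdatomic.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Set a transaction's 2-bit status with an atomic OR.  Concurrent writers
 * setting other xids' bits in the same byte cannot be clobbered, because
 * OR never clears bits already set by someone else. */
static void clog_set_status_atomic(_Atomic uint8_t *page_bytes,
                                   uint32_t byteno, uint32_t shift,
                                   uint8_t status /* 2 bits */)
{
    atomic_fetch_or(&page_bytes[byteno], (uint8_t) (status << shift));
}

/* Advance the page LSN with CAS, making sure it never goes backwards
 * even when commits race. */
static void page_lsn_advance(_Atomic XLogRecPtr *page_lsn, XLogRecPtr lsn)
{
    XLogRecPtr old = atomic_load(page_lsn);

    while (old < lsn &&
           !atomic_compare_exchange_weak(page_lsn, &old, lsn))
    {
        /* A failed CAS reloads 'old'; retry while we are still ahead. */
    }
}
```

Both operations complete without the updater ever sleeping, which is the point: the cost moves from scheduling latency to cache-line contention, and the open question on the slide is which is worse in practice.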


  26. Short version
    Transam and CLOG history
    Let a thousand SLRUs bloom
    Ways the buffer pool is better
    Reunification
    Prototype


27. Early experiment stage patch

    • See pgsql-hackers@ mailing list or commitfest entry #3514 for code

    • Many open questions around locking and other details!

    • Runs well enough to begin exploring basic ideas…


28. postgres=# WITH slru(relfilenode, path) AS (VALUES (0, 'pg_xact'),
                                                       (1, 'pg_multixact/offsets'),
                                                       (2, 'pg_multixact/members'),
                                                       (3, 'pg_subtrans'),
                                                       (4, 'pg_serial'),
                                                       (5, 'pg_commit_ts'),
                                                       (6, 'pg_notify'))
        SELECT path, pg_size_pretty(COUNT(*) * 8192)
        FROM pg_buffercache NATURAL JOIN slru
        WHERE reldatabase = 9
        GROUP BY 1
        ORDER BY 1;

             path         | pg_size_pretty
    ----------------------+----------------
     pg_multixact/offsets | 8192 bytes
     pg_subtrans          | 31 MB
     pg_xact              | 3560 kB
    (3 rows)



29. postgres=# select test_clog_fetch('0'::xid, '8388608'::xid, 1);
    NOTICE: xid range 0, 8388608; loop = 1
     test_clog_fetch
    -----------------

    (1 row)

    Time: 222.341 ms
    Time: 360.045 ms


30. postgres=# select * from pg_stat_slru;

          name       | blks_zeroed | blks_hit  | blks_read | blks_written | blks_exists | flu
    -----------------+-------------+-----------+-----------+--------------+-------------+----
     Xact            |        1166 | 197753416 |       721 |          661 |           0 |
     MultiXactOffset |           0 |         0 |         7 |            6 |           0 |
     MultiXactMember |           0 |         0 |         0 |            0 |           0 |
     Subtrans        |       18667 |         0 |         0 |        10432 |           0 |
     Serial          |           0 |         0 |         0 |            0 |           0 |
     CommitTs        |           0 |         0 |         0 |            0 |           0 |
     Notify          |           0 |         0 |         0 |            0 |           0 |
    (7 rows)
