Streaming I/O

PGConf.dev 2024 | Vancouver, Canada | Thomas Munro | Open
source database hacker at Microsoft Streaming I/O New abstractions for e ffi cient fi le I/O

Talk • Part I: Review OS I/O facilities • Part
II: Read streams • Part III: Write streams • Part IV: Looking ahead

Part I: Review of OS facilities

• bu ff ered I/O vs  direct I/O • vectored
I/O AKA scatter/gather I/O • synchronous I/O vs asynchronous I/O Database I/O Programming

read, write, worksync, iowait MULTICS (’65) UNIX (’69) read, write
POSIX (’93) aio_read, aio_write, aio_suspend, RT signals ’80s, ’90s: p… = with position  …v = vectored  p…v = both Linux (’19) io_uring O_DIRECT Linux (’03?) libaio + kernel support POSIX giveth, but nobody noticeth MULTICS giveth UNIX taketh away Linux giveth (again), but this time everybody liketh! IBM S/360 (’65) DEC RX11 (’71) VMS (’77) NT (’93) AIO!

What are the p… functions for? Primary motivation was threads
(not relevant to us, yet) ssize_t read(int filedes, void *buf, size_t nbyte) ssize_t pread(int filedes, void *buf, size_t nbyte, off_t offset) • The concept of “current file position” doesn’t fit well with multi-threaded applications sharing “seekable” file descriptors • PostgreSQL 12 switched from lseek + read/write to pread/pwrite, removing extra system calls and book-keeping in vfd layer • Required wrapper function for Windows; unfortunately the wrappers are imperfect and change the current position, so we use the names pg_pread and pg_pwrite to remember that!

Who needs vectored I/O? Systems that manage their own bu
ff er  pool (basically, databases) • We want to read contiguous blocks of a file into memory in one op • Replacement algorithm doesn’t try to find contiguous memory blocks (and shouldn’t!) • Can also map to single physical I/O ssize_t pread (int filedes, void *buf, size_t nbytes, off_t offset) ssize_t preadv(int filedes, struct iovec *iov, int iovcnt, off_t offset) struct iovec { void *iov_base; size_t iov_len; };

• POSIX hasn’t standardised p…v yet (despite having p… and
…v!), but every Unix now has them and they are proposed for POSIX issue 8 at https://austingroupbugs.net/view.php?id=1832 • Windows doesn’t have a general enough* equivalent operation, so we have pg_preadv and pg_pwritev macros that just fall back to looping over pg_pread and pg_pwrite on that OS.  *The native scatter/gather operations work only on non-buffered, overlapped (asynchronous) file handles. We might be able to use them in future work… they were made for databases! De facto Unix portability (aliquando de jure)

Who wants direct I/O? Systems that manage their own bu
ff er pool (basically, databases*) • Buffering costs CPU (extra copying) • Buffering costs RAM (extra copy) • Read-ahead and write-behind heuristics not good enough • Buffered I/O is still good for many uses cases though! • Direct I/O requires tuning (*Also big data systems with extremely low access frequency) 4k 3k 2k 1G 2G 0G TPS

- https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO%27s_Semantics “The exact semantics of Direct I/O (O_DIRECT) are
not well specified. It is not a part of POSIX, or SUS, or any other formal standards specification. The exact meaning of O_DIRECT has historically been negotiated in non-public discussions between powerful enterprise database companies and proprietary Unix systems, and its behaviour has generally been passed down as oral lore rather than as a formal set of requirements and specifications.”

• On OSes that support O_DIRECT (most!), it doesn’t work
on all file systems • PostgreSQL 16 added debug_io_direct=data,wal for testing • Currently meeting all known alignment requirements • Early testing already revealed some interesting problems (btrfs checksums can corrupt themselves when we set hint bits!)

Who wants asynchronous I/O? People using direct I/O! • Wtihout
buffering, you lose read-ahead heuristics • We ought to be able to do better read-ahead and write-behind than the kernel anyway, we have much more information (cf. “advice”-based prefetching + sync_file_range()) • Minimum solution is to have I/O worker threads/processes running preadv/pwritev (at application level, or inside libc/librt) • Modern (and ancient) OSes offer ways to skip the scheduling and IPC overheads

submission queue entries completion queue entries • io_uring_enter(): initiate and/or
wait for many operations • Start multiple operations at once by writing them into a submission queue in user space memory and then telling the kernel • Consume completion notifications, either directly from userspace memory if possible, or by waiting if not • Directly modelled on kernel/device communication on the “other side” in hardware

pread(8KB) pread(8KB) pread(8KB) pread(8KB) pread(8KB) preadv(128KB) preadv(128KB) preadv(128KB) preadv(128KB) preadv(128KB)
preadv(128KB) preadv(128KB) preadv(128KB) io_uring_enter() io_uring_enter() [v15] Block-at-a-time, synchronous, bu ff ered [v17] Larger vectored I/O, optional direct I/O [WIP] I/O calls running ahead of time in worker processes [WIP] Kernel managed asynchronous I/O, no extra process overheads (O ffl oaded to worker)

storage DMA engine SQ CQ NVMe io_uring SQ CQ direct
I/O DMA SELECT COUNT(*) Lofty ideal Zero copy, zero wait, 100% on CPU bu ff er pool Video credit: Wallace & Gromit “The Wrong Trousers”, 1993, by Aardman  Cycle loop from giphy.com

Part II: Read streams

“Reading” blocks of relation data A very common operation •
PostgreSQL works in terms of 8KB blocks, calling ReadBuffer(relation identifier, block_number) to access each one • If the buffer is already in the buffer pool, it is pinned • If the buffer is not already in the buffer pool, it must be loaded first, possibly after evicting something else • In order to build larger I/Os and start the physical I/O asynchronously, we need to find all the places that do that, and somehow convince them to participate in a new grouping scheme

Ad hoc grouping and read-ahead at every call site Re-usable
stream component

for (i = 0; i < nblocks; ++i) { buf
= ReadBuffer(…, i); ReleaseBuffer(buf); } static BlockNumber my_blocknum_callback(void *private_data); stream = read_stream_begin_relation(…, my_blocknum_callback, &my_callback_state, …); for (i = 0; i < nblocks; ++i) { buf = read_stream_next(stream); ReleaseBuffer(buf); } read_stream_end(stream); } io_combine_limit block numbers are pulled in here pinned buffers are pulled out here combined blocks read in with preadv()

static BlockNumber my_blocknum_callback(void *private_data); stream = read_stream_begin_relation(…, my_blocknum_callback, &my_callback_state, …);
for (i = 0; i < nblocks; ++i) { buf = read_stream_next(stream); ReleaseBuffer(buf); } read_stream_end(stream); } e ff ective_io_concurrency non-sequential block numbers hinted to kernel with posix_fadvise() preadv() deferred until absolutely necessary, so the hint as a good chance of working! By issuing POSIX_FADV_WILLNEED as soon as possible and preadv() as late as possible, we get a sort of poor man’s asynchronous I/O.

- Danish proverb about look-ahead callback functions Prediction is difficult,
especially about the future

0 1 2 3 4 5 Arithmetic-driven:  seq scan (17
beta)  ANALYZE sampling (17 beta) Data-driven:  bitmap heapscan (WIP)  recovery (WIP) 0 1 2 3 4 5

Sequential heuristic (for DIO only)?  Like kernel read-ahead 0 1
2 3 4 5 Just an idea

Ramping up and down • A stream doesn’t generally know
if a SELECT … LIMIT 1 might not want more than one block, so it starts out reading just a single block and increases the look ahead distance only while that seems to be useful. • When regions of already-cached blocks are encountered, the look ahead distances slowly decreases. • In this way we don’t pay extra overheads such as extra pins and book keeping unless there is some bene fi t to it.

A B C 1 io_combine_limit K * e ff ective_io_concurrency
Tuning the look-ahead distance All cached Sequential I/O pattern detected: currently no point in look ahead further than io_combine_limit Random I/O pattern detected: currently fadvise used to control concurrency Distance moves up and down in response to randomness, hits and misses

Fully cached data Even without I/O, there may be some
gains available • If you can predict future, you can also prefetch memory. Early experiments show sign fi cant gains. • In the future, could we perform more e ffi cient bu ff er mapping table lookups (combining neighbouring blocks in tree-based lookup? batched partition locking? SIMD hash computation? …) Experim ental

Out-of-order streams • The current behaviour of a read stream
is to emit bu ff ers in the order that the callback emitted block numbers • Not all users require ordered data, and then cached blocks might as well be reordered before blocks that require I/O, to give the I/O the greatest chance of completing. This speeds up partially cached scans in early experiments. Experim ental

Outer types of read streams • Raw fi les? •
Relations but reading to temporarty bu ff ers? • Sockets? Future w ork

Part III: Write streams

Where can we write? • While reading happens all over
the tree, the places that write data are few, and contained in just a few places: • Checkpointer: bu ff ers already sorted, read to be combined by write stream • Background writer • Regular backend, when evicting dirty bu ff ers • Non-shared temporary bu ff ers for CREATE INDEX and other bulk writes W IP

pwrite(8KB) pwrite(8KB) pwrite(8KB) pwrite(8KB) pwrite(8KB) pwritev(128KB) pwritev(128KB) sync_file_range() pwritev(128KB) io_uring_enter()
io_uring_enter() (O ffl oaded to worker) CHECKPOINT COPY CREATE INDEX W IP pwritev(128KB) pwritev(128KB) sync_file_range() pwritev(128KB)

pwrite(8KB) pread(8KB) pwrite(8KB) pread(8KB) pwrite(8KB) pwritev(128KB) preadv(128KB) pwritev(128KB) preadv(128KB) pwritev(128KB)
preadv(128KB) pwritev(128KB) preadv(128KB) io_uring_enter() io_uring_enter() (O ffl oaded to worker) VACUUM W IP

ReadStream used by a consumer like VACUUM that dirities the
buffers WriteStream inside BufferStrategy, combining writes buffers that are often sequential block numbers. Strategy write-behind Experim ental

Controlling OS buffered writeback rate • When using bu ff
ered I/O on Linux, PostgreSQL calls sync_ fi le_range() after writing {backend,checkpoint,bgwriter}_ fl ush_after bytes • This is necessary because Linux is very lazy about performing writes, so we have to insist; this is sort of the opposite of POSIX_FADV_WILLNEED! • We can subsume the existing “WritebackContext” mechanism into WriteStream W IP

Part IV: Looking ahead

• v17 BETA 1 has basic synchronous ReadStream support (by
me), currently used only by ANALYZE (Nazir Bilal Yavuz) and heapam Sequential Scan (Melanie Plageman) • Several more ReadStream user patches will be proposed for v18 • Can you see more places that can be streamified? (Access methods, …) • Current streaming benefits only uses synchronous I/O; in future proposals we might call that mode io_method=synchronous (?) • Andres Freund is working on true asynchronous subsystem (where all these ideas started); little/no changes should be needed to users of ReadStream to benefit from io_method=worker,io_uring

Streaming I/O

Streaming I/O

More Decks by Thomas Munro

Other Decks in Programming

Featured

Transcript