POSIX (’93) aio_read, aio_write, aio_suspend, RT signals ’80s, ’90s: p… = with position …v = vectored p…v = both Linux (’19) io_uring O_DIRECT Linux (’03?) libaio + kernel support POSIX giveth, but nobody noticeth MULTICS giveth UNIX taketh away Linux giveth (again), but this time everybody liketh! IBM S/360 (’65) DEC RX11 (’71) VMS (’77) NT (’93) AIO!
(not relevant to us, yet) ssize_t read(int filedes, void *buf, size_t nbyte) ssize_t pread(int filedes, void *buf, size_t nbyte, off_t offset) • The concept of “current file position” doesn’t fit well with multi-threaded applications sharing “seekable” file descriptors • PostgreSQL 12 switched from lseek + read/write to pread/pwrite, removing extra system calls and book-keeping in vfd layer • Required wrapper function for Windows; unfortunately the wrappers are imperfect and change the current position, so we use the names pg_pread and pg_pwrite to remember that!
ff er pool (basically, databases) • We want to read contiguous blocks of a file into memory in one op • Replacement algorithm doesn’t try to find contiguous memory blocks (and shouldn’t!) • Can also map to single physical I/O ssize_t pread (int filedes, void *buf, size_t nbytes, off_t offset) ssize_t preadv(int filedes, struct iovec *iov, int iovcnt, off_t offset) struct iovec { void *iov_base; size_t iov_len; };
…v!), but every Unix now has them and they are proposed for POSIX issue 8 at https://austingroupbugs.net/view.php?id=1832 • Windows doesn’t have a general enough* equivalent operation, so we have pg_preadv and pg_pwritev macros that just fall back to looping over pg_pread and pg_pwrite on that OS. *The native scatter/gather operations work only on non-buffered, overlapped (asynchronous) file handles. We might be able to use them in future work… they were made for databases! De facto Unix portability (aliquando de jure)
ff er pool (basically, databases*) • Buffering costs CPU (extra copying) • Buffering costs RAM (extra copy) • Read-ahead and write-behind heuristics not good enough • Buffered I/O is still good for many uses cases though! • Direct I/O requires tuning (*Also big data systems with extremely low access frequency) 4k 3k 2k 1G 2G 0G TPS
not well specified. It is not a part of POSIX, or SUS, or any other formal standards specification. The exact meaning of O_DIRECT has historically been negotiated in non-public discussions between powerful enterprise database companies and proprietary Unix systems, and its behaviour has generally been passed down as oral lore rather than as a formal set of requirements and specifications.”
on all file systems • PostgreSQL 16 added debug_io_direct=data,wal for testing • Currently meeting all known alignment requirements • Early testing already revealed some interesting problems (btrfs checksums can corrupt themselves when we set hint bits!)
buffering, you lose read-ahead heuristics • We ought to be able to do better read-ahead and write-behind than the kernel anyway, we have much more information (cf. “advice”-based prefetching + sync_file_range()) • Minimum solution is to have I/O worker threads/processes running preadv/pwritev (at application level, or inside libc/librt) • Modern (and ancient) OSes offer ways to skip the scheduling and IPC overheads
wait for many operations • Start multiple operations at once by writing them into a submission queue in user space memory and then telling the kernel • Consume completion notifications, either directly from userspace memory if possible, or by waiting if not • Directly modelled on kernel/device communication on the “other side” in hardware
preadv(128KB) preadv(128KB) preadv(128KB) io_uring_enter() io_uring_enter() [v15] Block-at-a-time, synchronous, bu ff ered [v17] Larger vectored I/O, optional direct I/O [WIP] I/O calls running ahead of time in worker processes [WIP] Kernel managed asynchronous I/O, no extra process overheads (O ffl oaded to worker)
I/O DMA SELECT COUNT(*) Lofty ideal Zero copy, zero wait, 100% on CPU bu ff er pool Video credit: Wallace & Gromit “The Wrong Trousers”, 1993, by Aardman Cycle loop from giphy.com
PostgreSQL works in terms of 8KB blocks, calling ReadBuffer(relation identifier, block_number) to access each one • If the buffer is already in the buffer pool, it is pinned • If the buffer is not already in the buffer pool, it must be loaded first, possibly after evicting something else • In order to build larger I/Os and start the physical I/O asynchronously, we need to find all the places that do that, and somehow convince them to participate in a new grouping scheme
= ReadBuffer(…, i); ReleaseBuffer(buf); } static BlockNumber my_blocknum_callback(void *private_data); stream = read_stream_begin_relation(…, my_blocknum_callback, &my_callback_state, …); for (i = 0; i < nblocks; ++i) { buf = read_stream_next(stream); ReleaseBuffer(buf); } read_stream_end(stream); } io_combine_limit block numbers are pulled in here pinned buffers are pulled out here combined blocks read in with preadv()
for (i = 0; i < nblocks; ++i) { buf = read_stream_next(stream); ReleaseBuffer(buf); } read_stream_end(stream); } e ff ective_io_concurrency non-sequential block numbers hinted to kernel with posix_fadvise() preadv() deferred until absolutely necessary, so the hint as a good chance of working! By issuing POSIX_FADV_WILLNEED as soon as possible and preadv() as late as possible, we get a sort of poor man’s asynchronous I/O.
if a SELECT … LIMIT 1 might not want more than one block, so it starts out reading just a single block and increases the look ahead distance only while that seems to be useful. • When regions of already-cached blocks are encountered, the look ahead distances slowly decreases. • In this way we don’t pay extra overheads such as extra pins and book keeping unless there is some bene fi t to it.
Tuning the look-ahead distance All cached Sequential I/O pattern detected: currently no point in look ahead further than io_combine_limit Random I/O pattern detected: currently fadvise used to control concurrency Distance moves up and down in response to randomness, hits and misses
gains available • If you can predict future, you can also prefetch memory. Early experiments show sign fi cant gains. • In the future, could we perform more e ffi cient bu ff er mapping table lookups (combining neighbouring blocks in tree-based lookup? batched partition locking? SIMD hash computation? …) Experim ental
is to emit bu ff ers in the order that the callback emitted block numbers • Not all users require ordered data, and then cached blocks might as well be reordered before blocks that require I/O, to give the I/O the greatest chance of completing. This speeds up partially cached scans in early experiments. Experim ental
the tree, the places that write data are few, and contained in just a few places: • Checkpointer: bu ff ers already sorted, read to be combined by write stream • Background writer • Regular backend, when evicting dirty bu ff ers • Non-shared temporary bu ff ers for CREATE INDEX and other bulk writes W IP
ered I/O on Linux, PostgreSQL calls sync_ fi le_range() after writing {backend,checkpoint,bgwriter}_ fl ush_after bytes • This is necessary because Linux is very lazy about performing writes, so we have to insist; this is sort of the opposite of POSIX_FADV_WILLNEED! • We can subsume the existing “WritebackContext” mechanism into WriteStream W IP
me), currently used only by ANALYZE (Nazir Bilal Yavuz) and heapam Sequential Scan (Melanie Plageman) • Several more ReadStream user patches will be proposed for v18 • Can you see more places that can be streamified? (Access methods, …) • Current streaming benefits only uses synchronous I/O; in future proposals we might call that mode io_method=synchronous (?) • Andres Freund is working on true asynchronous subsystem (where all these ideas started); little/no changes should be needed to users of ReadStream to benefit from io_method=worker,io_uring