Slide 1

Streaming I/O: New abstractions for efficient file I/O
PGConf.dev 2024 | Vancouver, Canada | Thomas Munro | Open source database hacker at Microsoft

Slide 2

Talk • Part I: Review OS I/O facilities • Part II: Read streams • Part III: Write streams • Part IV: Looking ahead

Slide 3

Part I: Review of OS facilities

Slide 4

Database I/O Programming • buffered I/O vs direct I/O • vectored I/O AKA scatter/gather I/O • synchronous I/O vs asynchronous I/O

Slide 5

• MULTICS (’65): read, write, worksync, iowait (MULTICS giveth)
• UNIX (’69): read, write (UNIX taketh away)
• ’80s, ’90s: p… = with position, …v = vectored, p…v = both
• POSIX (’93): aio_read, aio_write, aio_suspend, RT signals (POSIX giveth, but nobody noticeth)
• Linux (’03?): O_DIRECT, libaio + kernel support
• Linux (’19): io_uring (Linux giveth (again), but this time everybody liketh!)
• Meanwhile: IBM S/360 (’65), DEC RX11 (’71), VMS (’77), NT (’93): AIO!

Slide 6

What are the p… functions for? Primary motivation was threads (not relevant to us, yet)
ssize_t read(int filedes, void *buf, size_t nbyte)
ssize_t pread(int filedes, void *buf, size_t nbyte, off_t offset)
• The concept of “current file position” doesn’t fit well with multi-threaded applications sharing “seekable” file descriptors
• PostgreSQL 12 switched from lseek + read/write to pread/pwrite, removing extra system calls and book-keeping in the vfd layer
• Required wrapper functions for Windows; unfortunately the wrappers are imperfect and change the current position, so we use the names pg_pread and pg_pwrite to remember that!
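A minimal sketch (not PostgreSQL source; BLCKSZ is used here only as an illustrative 8KB block size) of the difference described above: the lseek + read form costs two system calls and mutates the shared file position, while pread is a single call and leaves the position untouched.

#include <unistd.h>

#define BLCKSZ 8192

/* Old style: seek, then read; the file position is a shared, mutable resource. */
ssize_t
read_block_seek(int fd, void *buf, off_t blockno)
{
    if (lseek(fd, blockno * BLCKSZ, SEEK_SET) < 0)
        return -1;
    return read(fd, buf, BLCKSZ);       /* position has now advanced */
}

/* With pread: one system call, position untouched, safe to share the fd. */
ssize_t
read_block_pread(int fd, void *buf, off_t blockno)
{
    return pread(fd, buf, BLCKSZ, blockno * BLCKSZ);
}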

Slide 7

Who needs vectored I/O? Systems that manage their own buffer pool (basically, databases)
• We want to read contiguous blocks of a file into memory in one op
• Replacement algorithm doesn’t try to find contiguous memory blocks (and shouldn’t!)
• Can also map to single physical I/O
ssize_t pread (int filedes, void *buf, size_t nbytes, off_t offset)
ssize_t preadv(int filedes, struct iovec *iov, int iovcnt, off_t offset)
struct iovec { void *iov_base; size_t iov_len; };
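A small sketch (not PostgreSQL source; BLCKSZ is just an illustrative 8KB block size) of the situation described above: two buffers that are scattered in memory but contiguous on disk, read with one preadv() call.

#include <sys/uio.h>

#define BLCKSZ 8192

/* Read two file blocks that are contiguous on disk into two buffers that
 * are not contiguous in memory, using a single system call. */
ssize_t
read_two_blocks(int fd, void *buf_a, void *buf_b, off_t first_block)
{
    struct iovec iov[2];

    iov[0].iov_base = buf_a;
    iov[0].iov_len = BLCKSZ;
    iov[1].iov_base = buf_b;
    iov[1].iov_len = BLCKSZ;

    return preadv(fd, iov, 2, first_block * BLCKSZ);
}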

Slide 8

De facto Unix portability (aliquando de jure)
• POSIX hasn’t standardised p…v yet (despite having p… and …v!), but every Unix now has them and they are proposed for POSIX issue 8 at https://austingroupbugs.net/view.php?id=1832
• Windows doesn’t have a general enough* equivalent operation, so we have pg_preadv and pg_pwritev macros that just fall back to looping over pg_pread and pg_pwrite on that OS.
*The native scatter/gather operations work only on non-buffered, overlapped (asynchronous) file handles. We might be able to use them in future work… they were made for databases!
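Roughly the shape such a fallback can take, as a sketch only: the function name and details here are illustrative, not PostgreSQL’s actual pg_preadv definition (assumes <unistd.h> and <sys/uio.h>).

/* Emulate preadv() by looping over pread(), stopping early on a short read
 * or error, as a platform that lacks the real system call would have to do. */
static ssize_t
fallback_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset)
{
    ssize_t total = 0;

    for (int i = 0; i < iovcnt; i++)
    {
        ssize_t nread = pread(fd, iov[i].iov_base, iov[i].iov_len, offset);

        if (nread < 0)
            return total > 0 ? total : -1;
        total += nread;
        offset += nread;
        if ((size_t) nread < iov[i].iov_len)
            break;              /* short read: report what we got */
    }
    return total;
}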

Slide 9

Who wants direct I/O? Systems that manage their own buffer pool (basically, databases*)
• Buffering costs CPU (extra copying)
• Buffering costs RAM (extra copy)
• Read-ahead and write-behind heuristics are not good enough
• Buffered I/O is still good for many use cases though!
• Direct I/O requires tuning
(*Also big data systems with extremely low access frequency)
[Chart: TPS]

Slide 10

“The exact semantics of Direct I/O (O_DIRECT) are not well specified. It is not a part of POSIX, or SUS, or any other formal standards specification. The exact meaning of O_DIRECT has historically been negotiated in non-public discussions between powerful enterprise database companies and proprietary Unix systems, and its behaviour has generally been passed down as oral lore rather than as a formal set of requirements and specifications.”
- https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO%27s_Semantics

Slide 11

• On OSes that support O_DIRECT (most!), it doesn’t work on all file systems • PostgreSQL 16 added debug_io_direct=data,wal for testing • Currently meeting all known alignment requirements • Early testing already revealed some interesting problems (btrfs checksums can corrupt themselves when we set hint bits!)
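A sketch (not PostgreSQL source) of what meeting those alignment requirements typically looks like: the buffer address, transfer size and file offset all need to be suitably aligned. The 4096-byte alignment used here is illustrative, not a universal rule.

#define _GNU_SOURCE             /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Read one aligned range of a file with O_DIRECT.  Returns 0 on success. */
int
read_range_direct(const char *path, off_t offset, size_t len)
{
    void *buf;
    int fd = open(path, O_RDONLY | O_DIRECT);

    if (fd < 0)
        return -1;              /* not supported on all filesystems */
    if (posix_memalign(&buf, 4096, len) != 0)
    {
        close(fd);
        return -1;
    }
    ssize_t n = pread(fd, buf, len, offset);
    free(buf);
    close(fd);
    return n < 0 ? -1 : 0;
}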

Slide 12

Who wants asynchronous I/O? People using direct I/O!
• Without buffering, you lose read-ahead heuristics
• We ought to be able to do better read-ahead and write-behind than the kernel anyway, since we have much more information (cf. “advice”-based prefetching + sync_file_range())
• The minimum solution is to have I/O worker threads/processes running preadv/pwritev (at application level, or inside libc/librt)
• Modern (and ancient) OSes offer ways to skip the scheduling and IPC overheads

Slide 13

io_uring: submission queue entries, completion queue entries
• io_uring_enter(): initiate and/or wait for many operations
• Start multiple operations at once by writing them into a submission queue in user space memory and then telling the kernel
• Consume completion notifications, either directly from userspace memory if possible, or by waiting if not
• Directly modelled on kernel/device communication on the “other side” in hardware
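A minimal liburing sketch of that submit/complete cycle (illustrative only, not how PostgreSQL uses the interface): queue a single 8KB read, submit it, then wait for its completion entry.

#include <liburing.h>

/* Read 8KB at the given offset via io_uring; returns bytes read or -errno. */
int
read_one_block_uring(int fd, void *buf, off_t offset)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    int res;

    if (io_uring_queue_init(8, &ring, 0) < 0)
        return -1;

    sqe = io_uring_get_sqe(&ring);              /* submission queue entry */
    io_uring_prep_read(sqe, fd, buf, 8192, offset);
    io_uring_submit(&ring);                     /* one enter for N entries */

    io_uring_wait_cqe(&ring, &cqe);             /* completion queue entry */
    res = cqe->res;                             /* bytes read, or -errno */
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return res;
}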

Slide 14

[v15] Block-at-a-time, synchronous, buffered: pread(8KB), pread(8KB), pread(8KB), …
[v17] Larger vectored I/O, optional direct I/O: preadv(128KB), preadv(128KB), preadv(128KB), …
[WIP] I/O calls running ahead of time in worker processes: preadv(128KB), … (offloaded to worker)
[WIP] Kernel managed asynchronous I/O, no extra process overheads: io_uring_enter(), io_uring_enter()

Slide 15

Lofty ideal: zero copy, zero wait, 100% on CPU (SELECT COUNT(*), buffer pool, io_uring SQ/CQ, direct I/O DMA, NVMe SQ/CQ, storage DMA engine)
Video credit: Wallace & Gromit “The Wrong Trousers”, 1993, by Aardman. Cycle loop from giphy.com

Slide 16

Part II: Read streams

Slide 17

“Reading” blocks of relation data A very common operation • PostgreSQL works in terms of 8KB blocks, calling ReadBuffer(relation identifier, block_number) to access each one • If the buffer is already in the buffer pool, it is pinned • If the buffer is not already in the buffer pool, it must be loaded first, possibly after evicting something else • In order to build larger I/Os and start the physical I/O asynchronously, we need to find all the places that do that, and somehow convince them to participate in a new grouping scheme

Slide 18

Ad hoc grouping and read-ahead at every call site → re-usable stream component

Slide 19

Before (one block at a time):
for (i = 0; i < nblocks; ++i) { buf = ReadBuffer(…, i); ReleaseBuffer(buf); }
After (streamed):
static BlockNumber my_blocknum_callback(void *private_data);
stream = read_stream_begin_relation(…, my_blocknum_callback, &my_callback_state, …);
for (i = 0; i < nblocks; ++i) { buf = read_stream_next(stream); ReleaseBuffer(buf); }
read_stream_end(stream);
Block numbers are pulled in via the callback, up to io_combine_limit blocks are combined and read in with preadv(), and pinned buffers are pulled out by read_stream_next().
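To make the callback side concrete, here is a sketch of a look-ahead callback that feeds a simple 0..nblocks-1 sequence into the stream. The state struct and its field names are illustrative, and the real callback in the PostgreSQL tree may carry extra parameters; only BlockNumber and InvalidBlockNumber are assumed from PostgreSQL headers.

typedef struct MyStreamState
{
    BlockNumber next;       /* next block number to hand out */
    BlockNumber nblocks;    /* one past the last block */
} MyStreamState;

/* Return the next block number to read, or InvalidBlockNumber when the
 * stream should end.  The stream calls this as far ahead of consumption
 * as its current look-ahead distance allows. */
static BlockNumber
my_blocknum_callback(void *private_data)
{
    MyStreamState *state = (MyStreamState *) private_data;

    if (state->next >= state->nblocks)
        return InvalidBlockNumber;
    return state->next++;
}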

Slide 20

static BlockNumber my_blocknum_callback(void *private_data);
stream = read_stream_begin_relation(…, my_blocknum_callback, &my_callback_state, …);
for (i = 0; i < nblocks; ++i) { buf = read_stream_next(stream); ReleaseBuffer(buf); }
read_stream_end(stream);
Up to effective_io_concurrency non-sequential block numbers are hinted to the kernel with posix_fadvise(), and the preadv() is deferred until absolutely necessary, so the hint has a good chance of working! By issuing POSIX_FADV_WILLNEED as soon as possible and preadv() as late as possible, we get a sort of poor man’s asynchronous I/O.
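As a concrete illustration of the hinting step (plain POSIX, not PostgreSQL’s wrapper; BLCKSZ is just the 8KB block size):

#include <fcntl.h>

#define BLCKSZ 8192

/* Tell the kernel we will soon read this block, long before the preadv(). */
void
hint_block(int fd, off_t blockno)
{
    (void) posix_fadvise(fd, blockno * BLCKSZ, BLCKSZ, POSIX_FADV_WILLNEED);
}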

Slide 21

“Prediction is difficult, especially about the future”
- Danish proverb about look-ahead callback functions

Slide 22

Arithmetic-driven: seq scan (17 beta), ANALYZE sampling (17 beta)
Data-driven: bitmap heapscan (WIP), recovery (WIP)
[Diagrams: block sequences 0 1 2 3 4 5]

Slide 23

Sequential heuristic (for direct I/O only)? Like kernel read-ahead. Just an idea.

Slide 24

Ramping up and down
• A stream doesn’t generally know how much will be consumed; a SELECT … LIMIT 1 might not want more than one block, so the stream starts out reading just a single block and increases the look-ahead distance only while that seems to be useful.
• When regions of already-cached blocks are encountered, the look-ahead distance slowly decreases.
• In this way we don’t pay extra overheads such as extra pins and book-keeping unless there is some benefit to it.
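A toy sketch of that ramping behaviour, under loose assumptions (this is not the actual read_stream.c logic, just an illustration of growing on misses and decaying over cached regions):

#include <stdbool.h>

/* Grow the look-ahead distance while reads miss the buffer pool, and let it
 * decay while everything is already cached, capped at max_distance. */
static void
adjust_distance(int *distance, bool cache_hit, int max_distance)
{
    if (cache_hit)
    {
        if (*distance > 1)
            (*distance)--;              /* decay slowly over cached regions */
    }
    else
    {
        *distance *= 2;                 /* ramp up when real I/O is needed */
        if (*distance > max_distance)
            *distance = max_distance;
    }
}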

Slide 25

Tuning the look-ahead distance: zones A (all cached, distance 1), B (sequential I/O pattern detected: currently no point in looking ahead further than io_combine_limit), C (random I/O pattern detected: currently fadvise used to control concurrency, up to K * effective_io_concurrency). Distance moves up and down in response to randomness, hits and misses.

Slide 26

Fully cached data: even without I/O, there may be some gains available (Experimental)
• If you can predict the future, you can also prefetch memory. Early experiments show significant gains.
• In the future, could we perform more efficient buffer mapping table lookups (combining neighbouring blocks in tree-based lookup? batched partition locking? SIMD hash computation? …)
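A sketch of what memory prefetching for predicted buffers can look like, using the GCC/Clang __builtin_prefetch() intrinsic; this is purely illustrative and not the experimental PostgreSQL patch.

/* Issue prefetch hints for pages we expect to touch soon, so the cache
 * lines are (hopefully) warm by the time the consumer reads them. */
void
prefetch_pages(char **pages, int npages)
{
    for (int i = 0; i < npages; i++)
        __builtin_prefetch(pages[i], 0 /* read */, 3 /* high locality */);
}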

Slide 27

Out-of-order streams (Experimental)
• The current behaviour of a read stream is to emit buffers in the order in which the callback emitted block numbers
• Not all users require ordered data, and then cached blocks might as well be reordered ahead of blocks that require I/O, to give the I/O the greatest chance of completing. This speeds up partially cached scans in early experiments.

Slide 28

Other types of read streams (Future work)
• Raw files?
• Relations, but reading into temporary buffers?
• Sockets?

Slide 29

Part III: Write streams

Slide 30

Where can we write? (WIP)
• While reading happens all over the tree, the places that write data are few:
• Checkpointer: buffers already sorted, ready to be combined by the write stream
• Background writer
• Regular backend, when evicting dirty buffers
• Non-shared temporary buffers for CREATE INDEX and other bulk writes

Slide 31

CHECKPOINT / COPY / CREATE INDEX (WIP)
Block-at-a-time: pwrite(8KB), pwrite(8KB), pwrite(8KB), …
Larger vectored I/O: pwritev(128KB), pwritev(128KB), sync_file_range(), pwritev(128KB), …
Offloaded to worker: pwritev(128KB), pwritev(128KB), sync_file_range(), pwritev(128KB), …
Kernel asynchronous I/O: io_uring_enter(), io_uring_enter()

Slide 32

VACUUM (WIP)
Block-at-a-time: pread(8KB), pwrite(8KB), pread(8KB), pwrite(8KB), …
Larger vectored I/O: preadv(128KB), pwritev(128KB), preadv(128KB), pwritev(128KB), …
Offloaded to worker / kernel asynchronous I/O: io_uring_enter(), io_uring_enter()

Slide 33

Strategy write-behind (Experimental)
• ReadStream used by a consumer like VACUUM that dirties the buffers
• WriteStream inside BufferStrategy, combining writes of buffers that often have sequential block numbers

Slide 34

Controlling OS buffered writeback rate (WIP)
• When using buffered I/O on Linux, PostgreSQL calls sync_file_range() after writing {backend,checkpoint,bgwriter}_flush_after bytes
• This is necessary because Linux is very lazy about performing writes, so we have to insist; this is sort of the opposite of POSIX_FADV_WILLNEED!
• We can subsume the existing “WritebackContext” mechanism into WriteStream
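A sketch of that write-behind nudge (Linux-specific, not PostgreSQL’s WritebackContext code; the flag choice here is illustrative): after writing a batch of blocks, ask the kernel to start writeback of that file range now instead of letting dirty data pile up.

#define _GNU_SOURCE             /* for sync_file_range() on Linux */
#include <fcntl.h>

/* Kick off asynchronous writeback of a recently written file range. */
void
start_writeback(int fd, off_t offset, off_t nbytes)
{
    (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}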

Slide 35

Part IV: Looking ahead

Slide 36

• v17 BETA 1 has basic synchronous ReadStream support (by me), currently used only by ANALYZE (Nazir Bilal Yavuz) and heapam Sequential Scan (Melanie Plageman)
• Several more ReadStream user patches will be proposed for v18
• Can you see more places that can be streamified? (Access methods, …)
• The current streaming work uses only synchronous I/O; in future proposals we might call that mode io_method=synchronous (?)
• Andres Freund is working on a true asynchronous I/O subsystem (where all these ideas started); little or no change should be needed in users of ReadStream to benefit from io_method=worker,io_uring

Slide 37

fin