Slide 1

Slide 1 text

AIO & DIO for PostgreSQL on FreeBSD*
Work in progress report
*And other animals
Thomas Munro, PostgreSQL hacker @ Microsoft
PGCon 2022, which should have been in Ottawa

Slide 2

Slide 2 text

Modernising PostgreSQL’s disk IO
And some things I learned about FreeBSD along the journey
• PostgreSQL disk AIO project started by Andres Freund, later joined by Thomas Munro, Melanie Plageman, David Rowley, being proposed for PostgreSQL 16
io_method={worker,io_uring,posix_aio,windows_iocp}
  • worker: all supported OSes
  • io_uring: Linux
  • posix_aio: POSIX, especially FreeBSD
  • windows_iocp: Windows
io_data_direct={on,off}
io_wal_direct={on,off}

Slide 3

Slide 3 text

Problems we want to solve
Direct I/O sounds easy
AIO is a portability nightmare
FreeBSD thoughts
Q&A

Slide 4

Slide 4 text

PostgreSQL’s traditional disk I/O
Diagram labels: PostgreSQL (buffer pool, WAL buffers), Kernel (buffer cache/ARC), Storage
• RAM is wasted by having two levels of buffer
• Data is transferred between caches 8KB at a time with a lot of synchronous pread/pwrite
• Relying on kernel heuristics for readahead and writeback
• We give limited hints about future reads and fsyncs on some OSes for IO concurrency (but not FreeBSD 😔)

Slide 5

Slide 5 text

AIO/DIO
Diagram labels: PostgreSQL (buffer pool, WAL buffers), Kernel, Storage
• Optionally skip the kernel buffer cache, no double buffering, no kernel copying
• Submit and complete batches of IOs with minimal system calls via modern* asynchronous interfaces
• Combine IOs (adjacent data, scatter/gather)
• Make our own specialised IO streaming predictors (not just “detect sequential”, but data-dependent)
• Model and control IO with certain goals (latency, resource usage, …)
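To make the “combine IOs” bullet concrete, here is a minimal Python sketch that merges requests for adjacent 8KB blocks into larger reads. The function name `combine_blocks` and the 128KB cap are assumptions chosen for illustration; this is not the PostgreSQL implementation, only one simple way to express the idea.

```python
BLOCK_SIZE = 8192          # PostgreSQL's default block size
MAX_COMBINED = 128 * 1024  # illustrative cap on one combined IO (assumption)


def combine_blocks(block_numbers, block_size=BLOCK_SIZE, max_bytes=MAX_COMBINED):
    """Merge runs of consecutive block numbers into (offset, length) reads.

    Hypothetical helper: the input is sorted and de-duplicated first, which a
    real streaming reader would not necessarily do.
    """
    max_blocks = max_bytes // block_size
    reads = []
    run_start = None
    run_len = 0
    for blk in sorted(set(block_numbers)):
        if run_start is not None and blk == run_start + run_len and run_len < max_blocks:
            run_len += 1  # adjacent to the current run: extend it
            continue
        if run_start is not None:
            reads.append((run_start * block_size, run_len * block_size))
        run_start, run_len = blk, 1
    if run_start is not None:
        reads.append((run_start * block_size, run_len * block_size))
    return reads


if __name__ == "__main__":
    # Blocks 0-3 are adjacent, 10 is isolated, 20-21 are adjacent.
    for offset, length in combine_blocks([0, 1, 2, 3, 10, 20, 21]):
        print(f"read offset={offset} length={length}")
```

Seven single-block requests collapse into three reads here; a long sequential run is split every 16 blocks because of the assumed 128KB cap.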

Slide 6

Slide 6 text

Disable query executor parallelism, to make this example simple:
postgres=# set max_parallel_workers_per_gather = 0;
SET

We’ll scan a table that isn’t in PostgreSQL’s buffer pool:
postgres=# select count(*) from t;
  count
---------
 1000000
(1 row)

Slide 7

Slide 7 text

Old way
kevent(4,0x0,0,{ 8,EVFILT_READ,0x0,0,0x1d,0x8016fc610 },1,0x0) = 1 (0x1)
recvfrom(8,"Q\0\0\0\^\select count(*) from t"...,8192,0,NULL,0x0) = 29 (0x1d)
pread(62,"\0\0\0\08\^]M\^A\0\0\^D\0\240\^C"...,8192,0x0) = 8192 (0x2000)
pread(62,"\0\0\0\0\M-hUM\^A\0\0\^D\0\240"...,8192,0x2000) = 8192 (0x2000)
pread(62,"\0\0\0\0\M^X\M^NM\^A\0\0\^D\0"...,8192,0x4000) = 8192 (0x2000)
pread(62,"\0\0\0\0H\M-GM\^A\0\0\^D\0\240"...,8192,0x6000) = 8192 (0x2000)
pread(62,"\0\0\0\0\M-`\M^?M\^A\0\0\^D\0"...,8192,0x8000) = 8192 (0x2000)
pread(62,"\0\0\0\0\M^P8N\^A\0\0\^D\0\240"...,8192,0xa000) = 8192 (0x2000)
pread(62,"\0\0\0\0@qN\^A\0\0\^D\0\240\^C"...,8192,0xc000) = 8192 (0x2000)
pread(62,"\0\0\0\0\M-p\M-)N\^A\0\0\^D\0"...,8192,0xe000) = 8192 (0x2000)
pread(62,"\0\0\0\0\240\M-bN\^A\0\0\^D\0"...,8192,0x10000) = 8192 (0x2000)
pread(62,"\0\0\0\08\^[O\^A\0\0\^D\0\240\^C"...,8192,0x12000) = 8192 (0x2000)
pread(62,"\0\0\0\0\M-hSO\^A\0\0\^D\0\240"...,8192,0x14000) = 8192 (0x2000)
pread(62,"\0\0\0\0\M^X\M^LO\^A\0\0\^D\0"...,8192,0x16000) = 8192 (0x2000)
pread(62,"\0\0\0\0H\M-EO\^A\0\0\^D\0\240"...,8192,0x18000) = 8192 (0x2000)
…
pread(62,"\0\0\0\0HpL\^C\0\0\^D\0\240\^C"...,8192,0x120c000) = 8192 (0x2000)
pread(62,"\0\0\0\0\M-x\M-(L\^C\0\0\^D\0"...,8192,0x120e000) = 8192 (0x2000)
pread(62,"\0\0\0\0\M-(\M-aL\^C\0\0\^D\0"...,8192,0x1210000) = 8192 (0x2000)
sendto(8,"T\0\0\0\^^\0\^Acount\0\0\0\0\0\0"...,69,0,NULL,0) = 69 (0x45)
recvfrom(8,0xe19cd0,8192,0,0x0,0x0) ERR#35 'Resource temporarily unavailable'

Slide 8

Slide 8 text

io_method=worker
kevent(4,0x0,0,{ 8,EVFILT_READ,0x0,0,0x1d,0x8016ee790 },1,0x0) = 1 (0x1)
recvfrom(8,"Q\0\0\0\^\select count(*) from t"...,8192,0,NULL,0x0) = 29 (0x1d)
kill(52213,SIGURG) = 0 (0x0)
kill(52213,SIGURG) = 0 (0x0)
kill(52213,SIGURG) = 0 (0x0)
kill(52213,SIGURG) = 0 (0x0)
kill(52213,SIGURG) = 0 (0x0)
kill(52213,SIGURG) = 0 (0x0)
kill(52213,SIGURG) = 0 (0x0)
kill(52213,SIGURG) = 0 (0x0)
kill(52213,SIGURG) = 0 (0x0)
kill(52213,SIGURG) = 0 (0x0)
kill(52213,SIGURG) = 0 (0x0)
kill(52213,SIGURG) = 0 (0x0)
kill(52213,SIGURG) = 0 (0x0)
kill(52213,SIGURG) = 0 (0x0)
kill(52213,SIGURG) = 0 (0x0)
kill(52213,SIGURG) = 0 (0x0)
kill(52213,SIGURG) = 0 (0x0)
kill(52213,SIGURG) = 0 (0x0)
sendto(8,"T\0\0\0\^^\0\^Acount\0\0\0\0\0\0"...,69,0,NULL,0) = 69 (0x45)
recvfrom(8,0xe46480,8192,0,0x0,0x0) ERR#35 'Resource temporarily unavailable'

Slide 9

Slide 9 text

IO worker process
kevent(8,0x0,0,{ SIGURG,EVFILT_SIGNAL,EV_CLEAR,0,0x1,0x80160a580 },1,0x0) = 1 (0x1)
preadv(0xd,0x7fffffffa3d0,0x1,0x1da000) = 131072 (0x20000)
preadv(0xd,0x7fffffffa3d0,0x1,0x1fa000) = 131072 (0x20000)
preadv(0xd,0x7fffffffa3d0,0x1,0x21a000) = 131072 (0x20000)
preadv(0xd,0x7fffffffa3d0,0x1,0x23a000) = 131072 (0x20000)
preadv(0xd,0x7fffffffa3d0,0x1,0x25a000) = 131072 (0x20000)
preadv(0xd,0x7fffffffa3d0,0x1,0x27a000) = 131072 (0x20000)
preadv(0xd,0x7fffffffa3d0,0x1,0x29a000) = 131072 (0x20000)
preadv(0xd,0x7fffffffa3d0,0x2,0x2ba000) = 131072 (0x20000)
kevent(8,0x0,0,{ SIGURG,EVFILT_SIGNAL,EV_CLEAR,0,0x1,0x80160a580 },1,0x0) = 1 (0x1)
preadv(0xd,0x7fffffffa3d0,0x1,0x2da000) = 131072 (0x20000)
preadv(0xd,0x7fffffffa3d0,0x1,0x2fa000) = 131072 (0x20000)
kevent(8,0x0,0,{ SIGURG,EVFILT_SIGNAL,EV_CLEAR,0,0x1,0x80160a580 },1,0x0) = 1 (0x1)
preadv(0xd,0x7fffffffa3d0,0x1,0x31a000) = 131072 (0x20000)
preadv(0xd,0x7fffffffa3d0,0x1,0x33a000) = 131072 (0x20000)
kevent(8,0x0,0,{ SIGURG,EVFILT_SIGNAL,EV_CLEAR,0,0x1,0x80160a580 },1,0x0) = 1 (0x1)
preadv(0xd,0x7fffffffa3d0,0x1,0x35a000) = 131072 (0x20000)
preadv(0xd,0x7fffffffa3d0,0x1,0x37a000) = 131072 (0x20000)
kevent(8,0x0,0,{ SIGURG,EVFILT_SIGNAL,EV_CLEAR,0,0x1,0x80160a580 },1,0x0) = 1 (0x1)
preadv(0xd,0x7fffffffa3d0,0x1,0x39a000) = 131072 (0x20000)
preadv(0xd,0x7fffffffa3d0,0x1,0x3ba000) = 131072 (0x20000)
kevent(8,0x0,0,{ SIGURG,EVFILT_SIGNAL,EV_CLEAR,0,0x1,0x80160a580 },1,0x0) = 1 (0x1)
preadv(0xd,0x7fffffffa3d0,0x1,0x3da000) = 131072 (0x20000)
preadv(0xd,0x7fffffffa3d0,0x1,0x3fa000) = 131072 (0x20000)
kevent(8,0x0,0,{ SIGURG,EVFILT_SIGNAL,EV_CLEAR,0,0x1,0x80160a580 },1,0x0) = 1 (0x1)
preadv(0xd,0x7fffffffa3d0,0x1,0x41a000) = 131072 (0x20000)
preadv(0xd,0x7fffffffa3d0,0x1,0x43a000) = 131072 (0x20000)

Slide 10

Slide 10 text

io_method=posix_aio (FreeBSD)
kevent(4,0x0,0,{ 8,EVFILT_READ,0x0,0,0x1d,0x8016ee790 },1,0x0) = 1 (0x1)
recvfrom(8,"Q\0\0\0\^\select count(*) from t"...,8192,0,NULL,0x0) = 29 (0x1d)
lio_listio(LIO_NOWAIT,[{ 32,1941504,0x80372a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE }
  { 32,2072576,0x80374a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE }
lio_listio(LIO_NOWAIT,[{ 32,2203648,0x80376a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE }
  { 32,2334720,0x80378a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE }
lio_listio(LIO_NOWAIT,[{ 32,2465792,0x8037aa000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE }
  { 32,2596864,0x8037ca000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE }
lio_listio(LIO_NOWAIT,[{ 32,2727936,0x8037ea000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE }
  { 32,2859008,0x80380a000,16384,LIO_READ,{ sigev_notify=SIGEV_NONE }
  { 32,2875392,0x80470e000,114688,LIO_READ,{ sigev_notify=SIGEV_NONE }
aio_waitcomplete({ 32,2072576,0x80374a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE } },{ 0
aio_waitcomplete({ 32,1941504,0x80372a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE } },{ 0
aio_waitcomplete({ 32,2203648,0x80376a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE } },{ 0
aio_waitcomplete({ 32,2334720,0x80378a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE } },{ 0
aio_waitcomplete({ 32,2465792,0x8037aa000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE } },{ 0
aio_waitcomplete({ 32,2596864,0x8037ca000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE } },{ 0
aio_waitcomplete({ 32,2859008,0x80380a000,16384,LIO_READ,{ sigev_notify=SIGEV_NONE } },{ 0.
aio_waitcomplete({ 32,2727936,0x8037ea000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE } },{ 0
aio_waitcomplete({ 32,2875392,0x80470e000,114688,LIO_READ,{ sigev_notify=SIGEV_NONE } },{ 0
lio_listio(LIO_NOWAIT,[{ 32,2990080,0x80472a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE }
  { 32,3121152,0x80474a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE }
…
• lio_listio(): the POSIX way to start multiple IOs with one system call
• aio_waitcomplete(): a FreeBSD-specific way to consume completion events (one of many supported ways, see next section “Portability nightmares”)

Slide 11

Slide 11 text

io_method=io_uring (Linux)
epoll_wait(12, [{EPOLLIN, {u32=2507243840, u64=94075175928128}}], 1, -1) = 1
recvfrom(16, "Q\0\0\0\34select count(*) from t;\0", 8192, 0, NULL, NULL) = 29
io_uring_enter(9, 1, 0, 0, NULL, 8) = 1
io_uring_enter(9, 1, 0, 0, NULL, 8) = 1
io_uring_enter(9, 1, 0, 0, NULL, 8) = 1
io_uring_enter(9, 1, 0, 0, NULL, 8) = 1
io_uring_enter(9, 1, 0, 0, NULL, 8) = 1
io_uring_enter(9, 1, 0, 0, NULL, 8) = 1
io_uring_enter(9, 1, 0, 0, NULL, 8) = 1
io_uring_enter(9, 1, 0, 0, NULL, 8) = 1
io_uring_enter(9, 1, 0, 0, NULL, 8) = 1
…
io_uring_enter(9, 1, 0, 0, NULL, 8) = 1
io_uring_enter(9, 1, 0, 0, NULL, 8) = 1
io_uring_enter(9, 1, 0, 0, NULL, 8) = 1
sendto(16, "T\0\0\0\36\0\1count\0\0\0\0\0\0\0\0\0\0\24\0\10\377\377\377\377\0\0D"..., 69, 0, NULL, 0) = 69
recvfrom(16, 0x558f937066a0, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavaila
epoll_wait(12, 0x558f95718188, 1, -1) = -1 EINTR (Interrupted system call)
Simple tracing tools can’t see submissions and completions, as they’re in user space queues, not in system call arguments/results

Slide 12

Slide 12 text

Diagram labels: data consumer/producer, callback, AIO streaming controller, PG AIO API
• AIO streaming controller: logic for controlling IO depth is common infrastructure, working in batches, to allow for IO combining. Research topic: control theory algorithms
• Callback: logic for how to initiate one more IO is provided by the consumer/producer
• Data consumers/producers: index scans, recovery/replication system, checkpointer, …
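The split between common depth-control logic and a consumer-supplied callback can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the name `stream_blocks`, the fixed `depth` (the slide suggests something adaptive), and the use of a thread pool calling `os.pread()` as a stand-in for whatever mechanism actually runs the IO.

```python
import os
import tempfile
from collections import deque
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 8192


def stream_blocks(fd, next_block, depth=4):
    """Yield (block_number, data) in request order, keeping up to `depth` reads in flight.

    `next_block` is the consumer-supplied callback: it returns the next block
    number to read, or None when there is nothing more to fetch.
    """
    in_flight = deque()
    with ThreadPoolExecutor(max_workers=depth) as pool:
        exhausted = False
        while True:
            # Top up the queue until the target IO depth is reached.
            while not exhausted and len(in_flight) < depth:
                blk = next_block()
                if blk is None:
                    exhausted = True
                    break
                fut = pool.submit(os.pread, fd, BLOCK_SIZE, blk * BLOCK_SIZE)
                in_flight.append((blk, fut))
            if not in_flight:
                return
            blk, fut = in_flight.popleft()
            yield blk, fut.result()  # wait only for the oldest outstanding read


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        for i in range(8):
            f.write(bytes([i]) * BLOCK_SIZE)  # block i is filled with byte value i
        f.flush()
        wanted = iter([7, 0, 3])  # e.g. block numbers discovered by an index scan
        for blk, data in stream_blocks(f.fileno(), lambda: next(wanted, None), depth=2):
            print(blk, len(data), data[0])
```

The consumer only decides which block comes next; how many reads are outstanding at once is decided in one place, which is where batching and IO combining could later be added.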

Slide 13

Slide 13 text

Diagram labels: Primary node, Replica node
• A system with many clients can be faulting in pages concurrently
• Replicas would synchronously fault in one page at a time; fixed with AIO-based prefetching
postgres=# select wal_distance, io_depth from pg_stat_recovery_prefetch;
 wal_distance | io_depth
--------------+----------
         3272 |        7

Slide 14

Slide 14 text

postgres=# select op, scb, owner_pid, "desc" from pg_stat_aios where flags != 'PGAIOIP_IDLE';
┌───────┬─────┬───────────┬───────────────────────────────────────────────────────────────────────────────────────────┐
│  op   │ scb │ owner_pid │                                           desc                                            │
├───────┼─────┼───────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
│ read  │ sb  │     67223 │ fd: 13, mode: 0, offset: 28188672, nbytes: 8192, already_done: 0, buffid: 10797           │
│ read  │ sb  │     67222 │ fd: 13, mode: 0, offset: 8757248, nbytes: 8192, already_done: 0, buffid: 10463            │
│ read  │ sb  │     67221 │ fd: 8, mode: 0, offset: 289153024, nbytes: 8192, already_done: 0, buffid: 11007           │
│ write │ wal │     67227 │ write_no: 23, fd: 7, offset: 3244032, nbytes: 8192, already_done: 0, bufdata: 0x103582000 │
│ read  │ sb  │     67220 │ fd: 12, mode: 0, offset: 84320256, nbytes: 8192, already_done: 0, buffid: 9771            │
│ read  │ sb  │     67225 │ fd: 12, mode: 0, offset: 29343744, nbytes: 8192, already_done: 0, buffid: 10102           │
│ read  │ sb  │     67224 │ fd: 12, mode: 0, offset: 175702016, nbytes: 8192, already_done: 0, buffid: 9862           │
└───────┴─────┴───────────┴───────────────────────────────────────────────────────────────────────────────────────────┘

Slide 15

Slide 15 text

Problems we want to solve
Direct I/O sounds easy
AIO is a portability nightmare
FreeBSD thoughts
Q&A

Slide 16

Slide 16 text

O_DIRECT
• IRIX gave us open(O_DIRECT), widely adopted, but not covered by POSIX so no formal meaning; general idea “don’t buffer my data”
• Systems differ in details (block alignment requirements, effects on descriptors with different flags, support on different file systems)
• There were also some other ideas: Sun’s directio() and mount options, Apple’s F_NOCACHE
• AFAIK, mostly of interest to database hackers and big data graphics/science
• Speculation: one of the motivations for 90s databases to use raw disk partitions was perhaps to skip the kernel’s buffer cache, IO scheduler and concurrency/locking problems; O_DIRECT and highly concurrent file systems address those issues
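The block alignment requirement is the detail most likely to bite first, so here is a small Python sketch of the arithmetic. The helper name `align_range` and the 4096-byte default are assumptions for illustration; the real requirement varies by OS, file system and device, and many systems also require the memory buffer itself to be aligned.

```python
def align_range(offset, length, alignment=4096):
    """Expand [offset, offset + length) outwards to multiples of `alignment`.

    Hypothetical helper: returns the (offset, length) of the smallest aligned
    range that covers the requested bytes.
    """
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    start = offset & ~(alignment - 1)                           # round down
    end = (offset + length + alignment - 1) & ~(alignment - 1)  # round up
    return start, end - start


if __name__ == "__main__":
    # A 100-byte read at offset 5000 has to become one whole 4096-byte block.
    print(align_range(5000, 100))
    # An 8KB database block at an 8KB offset is already aligned.
    print(align_range(8192, 8192))
```

PostgreSQL's 8KB blocks at 8KB-multiple offsets are naturally aligned for such a requirement; smaller or oddly placed IO is where the expansion matters.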

Slide 17

Slide 17 text

O_DIRECT gives you more problems
• Our buffer replacement logic and buffer pool size had better be good, because the kernel’s buffering won’t be cushioning our misses
• We’ll need to supply readahead, writeback, clustering and concurrency logic, even for the simplest straight-line cases
• → AIO and DIO go together like peanut butter and jelly
• Slower unbuffered reads and writes may reveal locking problems in file system implementations (gulp)

Slide 18

Slide 18 text

Problems we want to solve
Direct I/O sounds easy
AIO is a portability nightmare
FreeBSD thoughts
Q&A

Slide 19

Slide 19 text

Nearly three decades of AIO
• 1993: Windows NT
  • Brings ideas from VMS, RSX-11D, RSX-11M
  • Asynchronous interfaces everywhere (“overlapped”)
  • Event-based polling and callbacks
• 1994: Windows NT 3.5
  • Adds IOCP for queue-like event processing
• 1993: POSIX AIO, RT signals
  • IOs can now be started asynchronously
  • RT signals can carry a payload, for example an IO number/pointer, and be consumed synchronously from a per-process queue or call handlers
  • Alternative polling scheme

Slide 20

Slide 20 text

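 io_method=          | worker | posix_aio | io_uring | windows_iocp
---------------------+--------+-----------+----------+--------------
 AIX                 | ✓      | ✓         |          |
 DragonflyBSD        | ✓      | N/A       |          |
 FreeBSD             | ✓      | ✓         |          |
 HP-UX               | ✓      | ❌        |          |
 Linux (glibc, musl) | ✓      | ✓🧵       | ✓        |
 macOS               | ✓      | ✓*        |          |
 NetBSD              | ✓      | 💥        |          |
 OpenBSD             | ✓      | N/A       |          |
 Solaris             | ✓      | ✓🧵       |          |
 Windows             | ✓      | N/A       |          | ✓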

Slide 21

Slide 21 text

POSIX AIO: submission APIs
• aio_read(), aio_write(), aio_fsync()
• lio_listio() for batched submission of LIO_READ, LIO_WRITE
• FreeBSD 13 added non-standard extensions aio_readv(), aio_writev(), LIO_READV, LIO_WRITEV for vector IO (scatter/gather)
• Along the same lines, preadv()/pwritev() are also curiously absent from POSIX, but are obvious combinations of pread()/pwrite() and readv()/writev()
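Scatter/gather is easiest to see with the synchronous preadv(): one system call, one contiguous file range, several separate destination buffers. The sketch below uses Python's `os.preadv()`, which exists only on platforms where Python exposes it (Linux and FreeBSD among them); the helper name `read_blocks_vectored` is an assumption, and a real caller would also have to handle short reads.

```python
import os
import tempfile

BLOCK_SIZE = 8192


def read_blocks_vectored(fd, first_block, nblocks):
    """Read `nblocks` consecutive blocks with one preadv() call into separate buffers."""
    # Each block lands in its own buffer, as it would in non-adjacent buffer pool slots.
    buffers = [bytearray(BLOCK_SIZE) for _ in range(nblocks)]
    nread = os.preadv(fd, buffers, first_block * BLOCK_SIZE)
    return nread, buffers


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        for i in range(16):
            f.write(bytes([i]) * BLOCK_SIZE)  # block i is filled with byte value i
        f.flush()
        nread, buffers = read_blocks_vectored(f.fileno(), 2, 4)
        print(nread, [buf[0] for buf in buffers])
```

The asynchronous aio_readv() and LIO_READV extensions mentioned above apply the same idea to POSIX AIO submissions; they are not reachable from Python's standard library, so they are not shown here.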

Slide 22

Slide 22 text

POSIX AIO: completion APIs
• aio_error() to find out if it’s finished, failed or EINPROGRESS
• aio_return() to get the result
• aio_suspend() to wait for a list of IOs to complete, then see above to find out which ones (reads and writes only, no fsyncs, though almost all implementations allow sync to be waited for too)

Slide 23

Slide 23 text

POSIX AIO: standard notification mechanisms
• SIGEV_SIGNAL
  • sigwaitinfo() can give you a read-from-a-queue style interface, at a cost of three syscalls per IO (sigwaitinfo(), aio_error(), aio_return()), but macOS didn’t implement it! You could do sigwait() and then aio_error() for all outstanding IOs, for O(n^2) syscalls…
  • Using signal handlers would be unpleasant, and glibc has some questionable async signal safety
• SIGEV_THREAD
  • Call a function pointer in some unspecified thread. As a matter of policy we probably don’t want threads for this, but in any case macOS didn’t implement it…
• SIGEV_NONE
  • You’ll have to use aio_suspend() to wait, and then O(n^2). Only HP-UX doesn’t allow aio_fsync() to be waited for with aio_suspend() (which is allowed by POSIX, why?!)
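The O(n^2) figure can be checked with a back-of-the-envelope model in Python. The model is an assumption, not a measurement: n IOs are outstanding, exactly one completes per wake-up, and after each wake-up the program calls aio_error() on every IO still outstanding to find the finished one. The function names are hypothetical.

```python
def polling_syscalls(n):
    """Count syscalls to reap n IOs when each wake-up rescans all outstanding IOs."""
    waits = errors = returns = 0
    outstanding = n
    while outstanding:
        waits += 1             # sigwait() or aio_suspend()
        errors += outstanding  # aio_error() on every IO still outstanding
        returns += 1           # aio_return() on the one that finished
        outstanding -= 1
    return waits + errors + returns


def queue_syscalls(n):
    """Count syscalls when the kernel says which IO finished (sigwaitinfo() style)."""
    return 3 * n  # sigwaitinfo(), aio_error(), aio_return() per IO


if __name__ == "__main__":
    for n in (4, 100):
        print(n, polling_syscalls(n), queue_syscalls(n))
```

Under this model the polling cost is n + n(n+1)/2 + n, so it grows quadratically, while the queue-style interface stays at three system calls per IO.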

Slide 24

Slide 24 text

POSIX AIO: extended completion APIs
Almost every OS invented a much nicer queue-like system
• SIGEV_KEVENT (FreeBSD): drain N from a kqueue
  • You still have to call aio_return() on each IO, and possibly aio_error() too*, so you cannot get below 1.x syscalls per IO!
  • macOS and NetBSD have both kqueue and AIO, but they are not connected; Dragonfly removed AIO.
• aio_waitcomplete() (FreeBSD): drain one like a queue; no aio_error() or aio_return()
• AIX’s aio_nwait() (but doesn’t work for aio_fsync()!), Solaris’s aio_waitn(), HP-UX’s aio_reap() could all read batches of completions, some without even entering the kernel
• AIX and Solaris also added Windows IOCP-style APIs (not explored by me; SIGEV_PORT?)

Slide 25

Slide 25 text

POSIX AIO: Assorted surprises
• What should happen if you close a descriptor while an IO is pending?
  • The IO should run to completion
  • ECANCELED
  • The system should go bananas and run the IO against some later user of the same file descriptor number
• What should happen if you perform multiple IOs on the same descriptor?
  • They should run in parallel to the extent possible
  • Serialise them, surely the calling program jests
• POSIX says that aio_suspend() can’t wait for aio_fsync() (only HP-UX doesn’t support fsync here)
• You can’t complete IOs started by another process. OK, maybe not so surprising. PostgreSQL really needs to go multi-threaded…
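The “go bananas” outcome is plausible because descriptor numbers are recycled immediately: POSIX open() returns the lowest-numbered descriptor that is not currently open. The Python sketch below shows only that recycling, with no AIO involved; the function name `reopen_after_close` is hypothetical, and the result assumes a single-threaded process where nothing else opens a file in between.

```python
import os
import tempfile


def reopen_after_close():
    """Close one descriptor, open a different file, and return both descriptor numbers."""
    with tempfile.TemporaryDirectory() as d:
        fd_a = os.open(os.path.join(d, "a"), os.O_CREAT | os.O_RDWR, 0o600)
        os.close(fd_a)
        # The lowest free number is handed out again, now referring to another file.
        fd_b = os.open(os.path.join(d, "b"), os.O_CREAT | os.O_RDWR, 0o600)
        os.close(fd_b)
        return fd_a, fd_b


if __name__ == "__main__":
    fd_a, fd_b = reopen_after_close()
    print(f"first open: fd {fd_a}, second open: fd {fd_b}")
```

An IO that is tracked only by descriptor number and is still queued when the close happens would have nothing to tell file "a" from file "b".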

Slide 26

Slide 26 text

Problems we want to solve
Direct I/O sounds easy
AIO is a portability nightmare
FreeBSD thoughts
Q&A

Slide 27

Slide 27 text

OpenZFS will hopefully support O_DIRECT
• Direct IO is coming!
• https://openzfs.org/wiki/OpenZFS_Developer_Summit_2021
• I haven’t tried it out yet, but this looks interesting for databases.

Slide 28

Slide 28 text

Assorted tricky problems
• UFS O_SYNC/O_DSYNC waits *for each block* instead of whole writes
• UFS has a fast bufferless path for O_DIRECT reads, but not vector reads (which should ideally produce a single multi-segment read at the device level), and no equivalent for writes
• UFS fsync()/fdatasync() doesn’t flush device caches or use FUA
• UFS uses VFS-level write locks for write() and fsync(), preventing concurrency
• VFS doesn’t allow for true asynchronous IO; instead, kernel threads call the synchronous entry points and sleep

Slide 29

Slide 29 text

Improve AIO completions via kqueue?
• I would like to be able to use SIGEV_KEVENT to consume completion events from a kqueue without having to call aio_return(). Then we could get below 1 system call per IO, like on other OSes.
• I have some prototype code for this: see D33271, D33144. It’s tricky, there are many places in the kernel that assume that IO is synchronous…
• I have heard suggestions that kevent() should be able to start IOs too

Slide 30

Slide 30 text

Problems we want to solve
Direct I/O sounds easy
AIO is a portability nightmare
FreeBSD thoughts
Q&A

Slide 31

Slide 31 text

https://wiki.postgresql.org/wiki/AIO
https://wiki.postgresql.org/wiki/FreeBSD/AIO