
AIO & DIO for PostgreSQL on FreeBSD

A talk I gave at BSDCan 2022 about porting asynchronous & direct I/O to FreeBSD (especially) and lots of other OSes.

https://www.bsdcan.org/events/bsdcan_2022/schedule/session/90-asynchronous-and-direct-io-for-postgresql-on-freebsd/

Temporary video link, just until the BSDCan channel has the video: http://cfbot.cputube.org/tmp/bsdcan2022-aio-postgres-tmunro-v2.m4v

Thomas Munro

June 03, 2022


Transcript

  1. AIO & DIO for PostgreSQL on FreeBSD*
     Work in progress report
     *And other animals
     Thomas Munro <tmunro@{postgresql,freebsd}.org>
     PostgreSQL hacker @ Microsoft
     PGCon 2022, which should have been in Ottawa
  2. Modernising PostgreSQL’s disk IO, and some things I learned about FreeBSD along the journey
     • PostgreSQL disk AIO project started by Andres Freund, later joined by Thomas Munro, Melanie Plageman, David Rowley; being proposed for PostgreSQL 16
     • io_method={worker,io_uring,posix_aio,windows_iocp}: worker on all supported OSes, io_uring on Linux, posix_aio on POSIX systems (especially FreeBSD), windows_iocp on Windows
     • io_data_direct={on,off}, io_wal_direct={on,off}
  3. Problems we want to solve
     Direct I/O sounds easy
     AIO is a portability nightmare
     FreeBSD thoughts
     Q&A
  4. PostgreSQL’s traditional disk I/O
     [diagram: PostgreSQL (buffer pool, WAL buffers) ⟷ kernel (buffer cache/ARC) ⟷ storage]
     • RAM is wasted by having two levels of buffer
     • Data is transferred between caches 8KB at a time with a lot of synchronous pread/pwrite
     • Relying on kernel heuristics for readahead and writeback
     • We give limited hints about future reads and fsyncs on some OSes for IO concurrency (but not FreeBSD 😔)
  5. AIO/DIO
     [diagram: PostgreSQL (buffer pool, WAL buffers) ⟷ kernel ⟷ storage]
     • Optionally skip the kernel buffer cache: no double buffering, no kernel copying
     • Submit and complete batches of IOs with minimal system calls via modern* asynchronous interfaces
     • Combine IOs (adjacent data, scatter/gather)
     • Make our own specialised IO streaming predictors (not just “detect sequential”, but data-dependent)
     • Model and control IO with certain goals (latency, resource usage, …)
  6. postgres=# set max_parallel_workers_per_gather = 0;
     SET
     postgres=# select count(*) from t;
       count
     ---------
      1000000
     (1 row)
     Disable query executor parallelism, to make this example simple. We’ll scan a table that isn’t in PostgreSQL’s buffer pool.
  7. Old way:

     kevent(4,0x0,0,{ 8,EVFILT_READ,0x0,0,0x1d,0x8016fc610 },1,0x0) = 1 (0x1)
     recvfrom(8,"Q\0\0\0\^\select count(*) from t"...,8192,0,NULL,0x0) = 29 (0x1d)
     pread(62,"\0\0\0\08\^]M\^A\0\0\^D\0\240\^C"...,8192,0x0) = 8192 (0x2000)
     pread(62,"\0\0\0\0\M-hUM\^A\0\0\^D\0\240"...,8192,0x2000) = 8192 (0x2000)
     pread(62,"\0\0\0\0\M^X\M^NM\^A\0\0\^D\0"...,8192,0x4000) = 8192 (0x2000)
     pread(62,"\0\0\0\0H\M-GM\^A\0\0\^D\0\240"...,8192,0x6000) = 8192 (0x2000)
     pread(62,"\0\0\0\0\M-`\M^?M\^A\0\0\^D\0"...,8192,0x8000) = 8192 (0x2000)
     pread(62,"\0\0\0\0\M^P8N\^A\0\0\^D\0\240"...,8192,0xa000) = 8192 (0x2000)
     pread(62,"\0\0\0\0@qN\^A\0\0\^D\0\240\^C"...,8192,0xc000) = 8192 (0x2000)
     pread(62,"\0\0\0\0\M-p\M-)N\^A\0\0\^D\0"...,8192,0xe000) = 8192 (0x2000)
     pread(62,"\0\0\0\0\240\M-bN\^A\0\0\^D\0"...,8192,0x10000) = 8192 (0x2000)
     pread(62,"\0\0\0\08\^[O\^A\0\0\^D\0\240\^C"...,8192,0x12000) = 8192 (0x2000)
     pread(62,"\0\0\0\0\M-hSO\^A\0\0\^D\0\240"...,8192,0x14000) = 8192 (0x2000)
     pread(62,"\0\0\0\0\M^X\M^LO\^A\0\0\^D\0"...,8192,0x16000) = 8192 (0x2000)
     pread(62,"\0\0\0\0H\M-EO\^A\0\0\^D\0\240"...,8192,0x18000) = 8192 (0x2000)
     …
     pread(62,"\0\0\0\0HpL\^C\0\0\^D\0\240\^C"...,8192,0x120c000) = 8192 (0x2000)
     pread(62,"\0\0\0\0\M-x\M-(L\^C\0\0\^D\0"...,8192,0x120e000) = 8192 (0x2000)
     pread(62,"\0\0\0\0\M-(\M-aL\^C\0\0\^D\0"...,8192,0x1210000) = 8192 (0x2000)
     sendto(8,"T\0\0\0\^^\0\^Acount\0\0\0\0\0\0"...,69,0,NULL,0) = 69 (0x45)
     recvfrom(8,0xe19cd0,8192,0,0x0,0x0) ERR#35 'Resource temporarily unavailable'
  8. io_method=worker:

     kevent(4,0x0,0,{ 8,EVFILT_READ,0x0,0,0x1d,0x8016ee790 },1,0x0) = 1 (0x1)
     recvfrom(8,"Q\0\0\0\^\select count(*) from t"...,8192,0,NULL,0x0) = 29 (0x1d)
     kill(52213,SIGURG) = 0 (0x0)
     kill(52213,SIGURG) = 0 (0x0)
     kill(52213,SIGURG) = 0 (0x0)
     kill(52213,SIGURG) = 0 (0x0)
     kill(52213,SIGURG) = 0 (0x0)
     kill(52213,SIGURG) = 0 (0x0)
     kill(52213,SIGURG) = 0 (0x0)
     kill(52213,SIGURG) = 0 (0x0)
     kill(52213,SIGURG) = 0 (0x0)
     kill(52213,SIGURG) = 0 (0x0)
     kill(52213,SIGURG) = 0 (0x0)
     kill(52213,SIGURG) = 0 (0x0)
     kill(52213,SIGURG) = 0 (0x0)
     kill(52213,SIGURG) = 0 (0x0)
     kill(52213,SIGURG) = 0 (0x0)
     kill(52213,SIGURG) = 0 (0x0)
     kill(52213,SIGURG) = 0 (0x0)
     kill(52213,SIGURG) = 0 (0x0)
     sendto(8,"T\0\0\0\^^\0\^Acount\0\0\0\0\0\0"...,69,0,NULL,0) = 69 (0x45)
     recvfrom(8,0xe46480,8192,0,0x0,0x0) ERR#35 'Resource temporarily unavailable'
  9. IO worker process:

     kevent(8,0x0,0,{ SIGURG,EVFILT_SIGNAL,EV_CLEAR,0,0x1,0x80160a580 },1,0x0) = 1 (0x1)
     preadv(0xd,0x7fffffffa3d0,0x1,0x1da000) = 131072 (0x20000)
     preadv(0xd,0x7fffffffa3d0,0x1,0x1fa000) = 131072 (0x20000)
     preadv(0xd,0x7fffffffa3d0,0x1,0x21a000) = 131072 (0x20000)
     preadv(0xd,0x7fffffffa3d0,0x1,0x23a000) = 131072 (0x20000)
     preadv(0xd,0x7fffffffa3d0,0x1,0x25a000) = 131072 (0x20000)
     preadv(0xd,0x7fffffffa3d0,0x1,0x27a000) = 131072 (0x20000)
     preadv(0xd,0x7fffffffa3d0,0x1,0x29a000) = 131072 (0x20000)
     preadv(0xd,0x7fffffffa3d0,0x2,0x2ba000) = 131072 (0x20000)
     kevent(8,0x0,0,{ SIGURG,EVFILT_SIGNAL,EV_CLEAR,0,0x1,0x80160a580 },1,0x0) = 1 (0x1)
     preadv(0xd,0x7fffffffa3d0,0x1,0x2da000) = 131072 (0x20000)
     preadv(0xd,0x7fffffffa3d0,0x1,0x2fa000) = 131072 (0x20000)
     kevent(8,0x0,0,{ SIGURG,EVFILT_SIGNAL,EV_CLEAR,0,0x1,0x80160a580 },1,0x0) = 1 (0x1)
     preadv(0xd,0x7fffffffa3d0,0x1,0x31a000) = 131072 (0x20000)
     preadv(0xd,0x7fffffffa3d0,0x1,0x33a000) = 131072 (0x20000)
     kevent(8,0x0,0,{ SIGURG,EVFILT_SIGNAL,EV_CLEAR,0,0x1,0x80160a580 },1,0x0) = 1 (0x1)
     preadv(0xd,0x7fffffffa3d0,0x1,0x35a000) = 131072 (0x20000)
     preadv(0xd,0x7fffffffa3d0,0x1,0x37a000) = 131072 (0x20000)
     kevent(8,0x0,0,{ SIGURG,EVFILT_SIGNAL,EV_CLEAR,0,0x1,0x80160a580 },1,0x0) = 1 (0x1)
     preadv(0xd,0x7fffffffa3d0,0x1,0x39a000) = 131072 (0x20000)
     preadv(0xd,0x7fffffffa3d0,0x1,0x3ba000) = 131072 (0x20000)
     kevent(8,0x0,0,{ SIGURG,EVFILT_SIGNAL,EV_CLEAR,0,0x1,0x80160a580 },1,0x0) = 1 (0x1)
     preadv(0xd,0x7fffffffa3d0,0x1,0x3da000) = 131072 (0x20000)
     preadv(0xd,0x7fffffffa3d0,0x1,0x3fa000) = 131072 (0x20000)
     kevent(8,0x0,0,{ SIGURG,EVFILT_SIGNAL,EV_CLEAR,0,0x1,0x80160a580 },1,0x0) = 1 (0x1)
     preadv(0xd,0x7fffffffa3d0,0x1,0x41a000) = 131072 (0x20000)
     preadv(0xd,0x7fffffffa3d0,0x1,0x43a000) = 131072 (0x20000)
  10. io_method=posix_aio (FreeBSD):

      kevent(4,0x0,0,{ 8,EVFILT_READ,0x0,0,0x1d,0x8016ee790 },1,0x0) = 1 (0x1)
      recvfrom(8,"Q\0\0\0\^\select count(*) from t"...,8192,0,NULL,0x0) = 29 (0x1d)
      lio_listio(LIO_NOWAIT,[{ 32,1941504,0x80372a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE }
        { 32,2072576,0x80374a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE }
      lio_listio(LIO_NOWAIT,[{ 32,2203648,0x80376a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE }
        { 32,2334720,0x80378a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE }
      lio_listio(LIO_NOWAIT,[{ 32,2465792,0x8037aa000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE }
        { 32,2596864,0x8037ca000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE }
      lio_listio(LIO_NOWAIT,[{ 32,2727936,0x8037ea000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE }
        { 32,2859008,0x80380a000,16384,LIO_READ,{ sigev_notify=SIGEV_NONE }
        { 32,2875392,0x80470e000,114688,LIO_READ,{ sigev_notify=SIGEV_NONE }
      aio_waitcomplete({ 32,2072576,0x80374a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE } },{ 0
      aio_waitcomplete({ 32,1941504,0x80372a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE } },{ 0
      aio_waitcomplete({ 32,2203648,0x80376a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE } },{ 0
      aio_waitcomplete({ 32,2334720,0x80378a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE } },{ 0
      aio_waitcomplete({ 32,2465792,0x8037aa000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE } },{ 0
      aio_waitcomplete({ 32,2596864,0x8037ca000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE } },{ 0
      aio_waitcomplete({ 32,2859008,0x80380a000,16384,LIO_READ,{ sigev_notify=SIGEV_NONE } },{ 0.
      aio_waitcomplete({ 32,2727936,0x8037ea000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE } },{ 0
      aio_waitcomplete({ 32,2875392,0x80470e000,114688,LIO_READ,{ sigev_notify=SIGEV_NONE } },{ 0
      lio_listio(LIO_NOWAIT,[{ 32,2990080,0x80472a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE }
        { 32,3121152,0x80474a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE }
      …

      lio_listio() is the POSIX way to start multiple IOs with one system call; aio_waitcomplete() is a FreeBSD-specific way to consume completion events (one of many supported ways, see next section “Portability nightmares”)
  11. io_method=io_uring (Linux):

      epoll_wait(12, [{EPOLLIN, {u32=2507243840, u64=94075175928128}}], 1, -1) = 1
      recvfrom(16, "Q\0\0\0\34select count(*) from t;\0", 8192, 0, NULL, NULL) = 29
      io_uring_enter(9, 1, 0, 0, NULL, 8) = 1
      io_uring_enter(9, 1, 0, 0, NULL, 8) = 1
      io_uring_enter(9, 1, 0, 0, NULL, 8) = 1
      io_uring_enter(9, 1, 0, 0, NULL, 8) = 1
      io_uring_enter(9, 1, 0, 0, NULL, 8) = 1
      io_uring_enter(9, 1, 0, 0, NULL, 8) = 1
      io_uring_enter(9, 1, 0, 0, NULL, 8) = 1
      io_uring_enter(9, 1, 0, 0, NULL, 8) = 1
      io_uring_enter(9, 1, 0, 0, NULL, 8) = 1
      …
      io_uring_enter(9, 1, 0, 0, NULL, 8) = 1
      io_uring_enter(9, 1, 0, 0, NULL, 8) = 1
      io_uring_enter(9, 1, 0, 0, NULL, 8) = 1
      sendto(16, "T\0\0\0\36\0\1count\0\0\0\0\0\0\0\0\0\0\24\0\10\377\377\377\377\0\0D"..., 69, 0, 0) = 69
      recvfrom(16, 0x558f937066a0, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavaila
      epoll_wait(12, 0x558f95718188, 1, -1) = -1 EINTR (Interrupted system call)

      Simple tracing tools can’t see submissions and completions, as they’re in user-space queues, not in system call arguments/results
  12. AIO streaming controller
      [diagram: data consumer/producer ⟷ <next> callback ⟷ PG AIO API ⟷ <io_method>]
      • Logic for controlling IO depth is common infrastructure, working in batches to allow for IO combining; research topic: control theory algorithms
      • Logic for how to initiate one more IO is provided by the consumer/producer: index scans, recovery/replication system, checkpointer, …
  13. Primary node / replica node: a system with many clients can be faulting in pages concurrently, but replicas would synchronously fault in one page at a time; fixed with AIO-based prefetching
      postgres=# select wal_distance, io_depth from pg_stat_recovery_prefetch;
       wal_distance | io_depth
      --------------+----------
               3272 |        7
  14. postgres=# select op, scb, owner_pid, "desc" from pg_stat_aios where flags != 'PGAIOIP_IDLE';
      ┌───────┬─────┬───────────┬───────────────────────────────────────────────────────────────────────────────────────────┐
      │  op   │ scb │ owner_pid │ desc                                                                                      │
      ├───────┼─────┼───────────┼───────────────────────────────────────────────────────────────────────────────────────────┤
      │ read  │ sb  │     67223 │ fd: 13, mode: 0, offset: 28188672, nbytes: 8192, already_done: 0, buffid: 10797           │
      │ read  │ sb  │     67222 │ fd: 13, mode: 0, offset: 8757248, nbytes: 8192, already_done: 0, buffid: 10463            │
      │ read  │ sb  │     67221 │ fd: 8, mode: 0, offset: 289153024, nbytes: 8192, already_done: 0, buffid: 11007           │
      │ write │ wal │     67227 │ write_no: 23, fd: 7, offset: 3244032, nbytes: 8192, already_done: 0, bufdata: 0x103582000 │
      │ read  │ sb  │     67220 │ fd: 12, mode: 0, offset: 84320256, nbytes: 8192, already_done: 0, buffid: 9771            │
      │ read  │ sb  │     67225 │ fd: 12, mode: 0, offset: 29343744, nbytes: 8192, already_done: 0, buffid: 10102           │
      │ read  │ sb  │     67224 │ fd: 12, mode: 0, offset: 175702016, nbytes: 8192, already_done: 0, buffid: 9862           │
      └───────┴─────┴───────────┴───────────────────────────────────────────────────────────────────────────────────────────┘
  15. Problems we want to solve
      Direct I/O sounds easy
      AIO is a portability nightmare
      FreeBSD thoughts
      Q&A
  16. O_DIRECT
      • IRIX gave us open(O_DIRECT), widely adopted, but not covered by POSIX so no formal meaning; general idea “don’t buffer my data”
      • Systems differ in details (block alignment requirements, effects on descriptors with different flags, support on different file systems)
      • There were also some other ideas: Sun’s directio() and mount options, Apple’s F_NOCACHE
      • AFAIK, mostly of interest to database hackers and big data graphics/science
      • Speculation: one of the motivations for 90s databases to use raw disk partitions was perhaps to skip the kernel’s buffer cache, IO scheduler and concurrency/locking problems; O_DIRECT and highly concurrent file systems address those issues
  17. O_DIRECT gives you more problems
      • Our buffer replacement logic and buffer pool size had better be good, because the kernel’s buffering won’t be cushioning our misses
      • We’ll need to supply readahead, writeback, clustering and concurrency logic, even for the simplest straight-line cases
      • → AIO and DIO go together like peanut butter and jelly
      • Slower unbuffered reads and writes may reveal locking problems in filesystem implementations (gulp)
  18. Problems we want to solve
      Direct I/O sounds easy
      AIO is a portability nightmare
      FreeBSD thoughts
      Q&A
  19. Nearly three decades of AIO
      • 1993: Windows NT
        • Brings ideas from VMS, RSX-11D, RSX-11M
        • Asynchronous interfaces everywhere (“overlapped”)
        • Event-based polling and callbacks
      • 1994: Windows NT 3.5
        • Adds IOCP for queue-like event processing
      • 1993: POSIX AIO, RT signals
        • IOs can now be started asynchronously
        • RT signals can carry a payload, for example an IO number/pointer, and be consumed synchronously from a per-process queue, or call handlers
        • Alternative polling scheme
  20. io_method=             worker   posix_aio   io_uring   windows_iocp
      AIX                    ✓        ✓
      DragonflyBSD           ✓        N/A
      FreeBSD                ✓        ✓
      HP-UX                  ✓        ❌
      Linux (glibc, musl)    ✓        ✓🧵          ✓
      macOS                  ✓        ✓*
      NetBSD                 ✓        💥
      OpenBSD                ✓        N/A
      Solaris                ✓        ✓🧵
      Windows                ✓        N/A                    ✓
  21. POSIX AIO: submission APIs
      • aio_read(), aio_write(), aio_fsync()
      • lio_listio() for batched submission of LIO_READ, LIO_WRITE
      • FreeBSD 13 added non-standard extensions aio_readv(), aio_writev(), LIO_READV, LIO_WRITEV for vector IO (scatter/gather)
      • Along the same lines, preadv()/pwritev() are also curiously absent from POSIX, but are obvious combinations of pread()/pwrite() and readv()/writev()
  22. POSIX AIO: completion APIs
      • aio_error() to find out if it’s finished, failed or EINPROGRESS
      • aio_return() to get the result
      • aio_suspend() to wait for a list of IOs to complete, then see above to find out which ones (reads and writes only, no fsyncs, though almost all implementations allow fsync to be waited for too)
  23. POSIX AIO: standard notification mechanisms
      • SIGEV_SIGNAL
        • sigwaitinfo() can give you a read-from-a-queue style interface, at a cost of three syscalls per IO (sigwaitinfo(), aio_error(), aio_return()), but macOS didn’t implement it! You could do sigwait() and then aio_error() for all outstanding IOs, for O(n^2) syscalls…
        • Using signal handlers would be unpleasant, and glibc has some questionable async signal safety
      • SIGEV_THREAD
        • Call a function pointer in some unspecified thread. As a matter of policy we probably don’t want threads for this, but in any case macOS didn’t implement it…
      • SIGEV_NONE
        • You’ll have to use aio_suspend() to wait, and then O(n^2). Only HP-UX doesn’t allow aio_fsync() to be waited for with aio_suspend() (which is allowed by POSIX, why?!)
  24. POSIX AIO: extended completion APIs
      Almost every OS invented a much nicer queue-like system
      • SIGEV_KEVENT (FreeBSD): drain N from a kqueue
        • You still have to call aio_return() on each IO, and possibly aio_error() too*, so you cannot get below 1.x syscalls per IO!
        • macOS and NetBSD have both kqueue and AIO, but they are not connected; Dragonfly removed AIO.
      • aio_waitcomplete() (FreeBSD): drain one like a queue; no aio_error() or aio_return()
      • AIX’s aio_nwait() (but doesn’t work for aio_fsync()!), Solaris’s aio_waitn(), HP-UX’s aio_reap() could all read batches of completions, some without even entering the kernel
      • AIX and Solaris also added Windows IOCP-style APIs (not explored by me; SIGEV_PORT?)
  25. POSIX AIO: assorted surprises
      • What should happen if you close a descriptor while an IO is pending?
        • The IO should run to completion
        • ECANCELED
        • The system should go bananas and run the IO against some later user of the same file descriptor number
      • What should happen if you perform multiple IOs on the same descriptor?
        • They should run in parallel to the extent possible
        • Serialise them, surely the calling program jests
      • POSIX says that aio_suspend() can’t wait for aio_fsync() (only HP-UX doesn’t support fsync here)
      • You can’t complete IOs started by another process. OK, maybe not so surprising. PostgreSQL really needs to go multi-threaded…
  26. Problems we want to solve
      Direct I/O sounds easy
      AIO is a portability nightmare
      FreeBSD thoughts
      Q&A
  27. OpenZFS will hopefully support O_DIRECT
      • Direct IO is coming!
      • https://openzfs.org/wiki/OpenZFS_Developer_Summit_2021
      • I haven’t tried it out yet, but this looks interesting for databases.
  28. Assorted tricky problems
      • UFS O_SYNC/O_DSYNC waits *for each block* instead of whole writes
      • UFS has a fast bufferless path for O_DIRECT reads, but not vector reads (which should ideally produce a single multi-segment read at the device level), and no equivalent for writes
      • UFS fsync()/fdatasync() doesn’t flush device caches or use FUA
      • UFS uses VFS-level write locks for write() and fsync(), preventing concurrency
      • VFS doesn’t allow for true asynchronous IO; instead kernel threads call the synchronous entry points and sleep
  29. Improve AIO completions via kqueue?
      • I would like to be able to use SIGEV_KEVENT to consume completion events from a kqueue without having to call aio_return(). Then we could get below 1 system call per IO, like on other OSes.
      • I have some prototype code for this: see D33271, D33144. It’s tricky; there are many places in the kernel that assume that IO is synchronous…
      • I have heard suggestions that kevent() should be able to start IOs too
  30. Problems we want to solve
      Direct I/O sounds easy
      AIO is a portability nightmare
      FreeBSD thoughts
      Q&A
  31. https://wiki.postgresql.org/wiki/AIO https://wiki.postgresql.org/wiki/FreeBSD/AIO