
AIO & DIO for PostgreSQL on FreeBSD


A talk I gave at BSDCan 2022 about porting PostgreSQL's asynchronous & direct I/O support to FreeBSD (especially) and lots of other OSes.

https://www.bsdcan.org/events/bsdcan_2022/schedule/session/90-asynchronous-and-direct-io-for-postgresql-on-freebsd/

Temporary video link, just until the BSDCan channel has the video: http://cfbot.cputube.org/tmp/bsdcan2022-aio-postgres-tmunro-v2.m4v

Thomas Munro

June 03, 2022



Transcript

  1. AIO & DIO
    for PostgreSQL on FreeBSD*
    Work in progress report
    *And other animals
    Thomas Munro
    PostgreSQL hacker @ Microsoft
    PGCon 2022, which should have been in Ottawa


  2. Modernising PostgreSQL’s disk IO
    And some things I learned about FreeBSD along the journey
    • PostgreSQL disk AIO project started by Andres Freund, later joined by
    Thomas Munro, Melanie Plageman, David Rowley, being proposed for
    PostgreSQL 16
    io_method={worker,io_uring,posix_aio,windows_iocp}
    (worker: all supported OSes; io_uring: Linux; posix_aio: POSIX, especially FreeBSD; windows_iocp: Windows)

    io_data_direct={on,off}
    io_wal_direct={on,off}


  3. Problems we want to solve


    Direct I/O sounds easy

    AIO is a portability nightmare


    FreeBSD thoughts


    Q&A


  4. PostgreSQL’s traditional disk I/O
    [Diagram: PostgreSQL’s buffer pool and WAL buffers sit above the kernel’s buffer cache/ARC, which sits above storage]
    • RAM is wasted by having two levels of buffering
    • Data is transferred between caches 8KB at a time with a lot of synchronous pread/pwrite (sketched below)
    • Relying on kernel heuristics for readahead and writeback
    • We give limited hints about future reads and fsyncs on some OSes for IO concurrency (but not FreeBSD 😔)
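    To make those bullets concrete, here is a minimal C sketch of the traditional pattern, assuming an 8KB block size; it is illustrative only (not PostgreSQL source), with posix_fadvise() standing in for the kind of limited hint mentioned above.

    #include <fcntl.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    static ssize_t
    read_block(int fd, off_t blockno, char *buf)
    {
    #ifdef POSIX_FADV_WILLNEED
        /* Hint that we'll soon want the next block; a no-op on some systems. */
        (void) posix_fadvise(fd, (blockno + 1) * BLCKSZ, BLCKSZ, POSIX_FADV_WILLNEED);
    #endif
        /* The calling backend blocks here until the kernel has the data. */
        return pread(fd, buf, BLCKSZ, blockno * BLCKSZ);
    }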


  5. AIO/DIO
    [Diagram: PostgreSQL’s buffer pool and WAL buffers talk to storage via AIO/DIO, bypassing the kernel’s buffer cache]
    • Optionally skip the kernel buffer cache, no double buffering, no kernel copying
    • Submit and complete batches of IOs with minimal system calls via modern* asynchronous interfaces
    • Combine IOs (adjacent data, scatter/gather); see the sketch below
    • Make our own specialised IO streaming predictors (not just “detect sequential”, but data-dependent)
    • Model and control IO with certain goals (latency, resource usage, …)
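    A hypothetical sketch of the “combine IOs” idea, not PostgreSQL code: two adjacent 8KB block reads merged into one vectored system call (scatter/gather), so the kernel sees a single larger request.

    #include <sys/uio.h>    /* preadv(); on Linux, build with _GNU_SOURCE */
    #include <unistd.h>

    #define BLCKSZ 8192

    /* Read two adjacent on-disk blocks into two separate buffers at once. */
    static ssize_t
    read_two_adjacent_blocks(int fd, off_t first_blockno, char *buf1, char *buf2)
    {
        struct iovec iov[2] = {
            { .iov_base = buf1, .iov_len = BLCKSZ },
            { .iov_base = buf2, .iov_len = BLCKSZ },
        };

        /* One preadv() instead of two pread() calls. */
        return preadv(fd, iov, 2, first_blockno * BLCKSZ);
    }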


  6. postgres=# set max_parallel_workers_per_gather = 0;


    SET



    postgres=# select count(*) from t;


    count


    ---------


    1000000


    (1 row)


    Disable query executor parallelism, to make this example simple
    We’ll scan a table that isn’t in PostgreSQL’s buffer pool


  7. kevent(4,0x0,0,{ 8,EVFILT_READ,0x0,0,0x1d,0x8016fc610 },1,0x0) = 1 (0x1)


    recvfrom(8,"Q\0\0\0\^\select count(*) from t"...,8192,0,NULL,0x0) = 29 (0x1d)


    pread(62,"\0\0\0\08\^]M\^A\0\0\^D\0\240\^C"...,8192,0x0) = 8192 (0x2000)


    pread(62,"\0\0\0\0\M-hUM\^A\0\0\^D\0\240"...,8192,0x2000) = 8192 (0x2000)


    pread(62,"\0\0\0\0\M^X\M^NM\^A\0\0\^D\0"...,8192,0x4000) = 8192 (0x2000)


    pread(62,"\0\0\0\0H\M-GM\^A\0\0\^D\0\240"...,8192,0x6000) = 8192 (0x2000)


    pread(62,"\0\0\0\0\M-`\M^?M\^A\0\0\^D\0"...,8192,0x8000) = 8192 (0x2000)


    pread(62,"\0\0\0\0\M^P8N\^A\0\0\^D\0\240"...,8192,0xa000) = 8192 (0x2000)


    pread(62,"\0\0\0\[email protected]\^A\0\0\^D\0\240\^C"...,8192,0xc000) = 8192 (0x2000)


    pread(62,"\0\0\0\0\M-p\M-)N\^A\0\0\^D\0"...,8192,0xe000) = 8192 (0x2000)


    pread(62,"\0\0\0\0\240\M-bN\^A\0\0\^D\0"...,8192,0x10000) = 8192 (0x2000)


    pread(62,"\0\0\0\08\^[O\^A\0\0\^D\0\240\^C"...,8192,0x12000) = 8192 (0x2000)


    pread(62,"\0\0\0\0\M-hSO\^A\0\0\^D\0\240"...,8192,0x14000) = 8192 (0x2000)


    pread(62,"\0\0\0\0\M^X\M^LO\^A\0\0\^D\0"...,8192,0x16000) = 8192 (0x2000)


    pread(62,"\0\0\0\0H\M-EO\^A\0\0\^D\0\240"...,8192,0x18000) = 8192 (0x2000)





    pread(62,"\0\0\0\0HpL\^C\0\0\^D\0\240\^C"...,8192,0x120c000) = 8192 (0x2000)


    pread(62,"\0\0\0\0\M-x\M-(L\^C\0\0\^D\0"...,8192,0x120e000) = 8192 (0x2000)


    pread(62,"\0\0\0\0\M-(\M-aL\^C\0\0\^D\0"...,8192,0x1210000) = 8192 (0x2000)


    sendto(8,"T\0\0\0\^^\0\^Acount\0\0\0\0\0\0"...,69,0,NULL,0) = 69 (0x45)


    recvfrom(8,0xe19cd0,8192,0,0x0,0x0) ERR#35 'Resource temporarily unavailable'
    Old way


  8. kevent(4,0x0,0,{ 8,EVFILT_READ,0x0,0,0x1d,0x8016ee790 },1,0x0) = 1 (0x1)


    recvfrom(8,"Q\0\0\0\^\select count(*) from t"...,8192,0,NULL,0x0) = 29 (0x1d)


    kill(52213,SIGURG) = 0 (0x0)


    kill(52213,SIGURG) = 0 (0x0)


    kill(52213,SIGURG) = 0 (0x0)


    kill(52213,SIGURG) = 0 (0x0)


    kill(52213,SIGURG) = 0 (0x0)


    kill(52213,SIGURG) = 0 (0x0)


    kill(52213,SIGURG) = 0 (0x0)


    kill(52213,SIGURG) = 0 (0x0)


    kill(52213,SIGURG) = 0 (0x0)


    kill(52213,SIGURG) = 0 (0x0)


    kill(52213,SIGURG) = 0 (0x0)


    kill(52213,SIGURG) = 0 (0x0)


    kill(52213,SIGURG) = 0 (0x0)


    kill(52213,SIGURG) = 0 (0x0)


    kill(52213,SIGURG) = 0 (0x0)


    kill(52213,SIGURG) = 0 (0x0)


    kill(52213,SIGURG) = 0 (0x0)


    kill(52213,SIGURG) = 0 (0x0)


    sendto(8,"T\0\0\0\^^\0\^Acount\0\0\0\0\0\0"...,69,0,NULL,0) = 69 (0x45)


    recvfrom(8,0xe46480,8192,0,0x0,0x0) ERR#35 'Resource temporarily unavailable'
    io_method=worker


  9. kevent(8,0x0,0,{ SIGURG,EVFILT_SIGNAL,EV_CLEAR,0,0x1,0x80160a580 },1,0x0) = 1 (0x1)


    preadv(0xd,0x7fffffffa3d0,0x1,0x1da000) = 131072 (0x20000)


    preadv(0xd,0x7fffffffa3d0,0x1,0x1fa000) = 131072 (0x20000)


    preadv(0xd,0x7fffffffa3d0,0x1,0x21a000) = 131072 (0x20000)


    preadv(0xd,0x7fffffffa3d0,0x1,0x23a000) = 131072 (0x20000)


    preadv(0xd,0x7fffffffa3d0,0x1,0x25a000) = 131072 (0x20000)


    preadv(0xd,0x7fffffffa3d0,0x1,0x27a000) = 131072 (0x20000)


    preadv(0xd,0x7fffffffa3d0,0x1,0x29a000) = 131072 (0x20000)


    preadv(0xd,0x7fffffffa3d0,0x2,0x2ba000) = 131072 (0x20000)


    kevent(8,0x0,0,{ SIGURG,EVFILT_SIGNAL,EV_CLEAR,0,0x1,0x80160a580 },1,0x0) = 1 (0x1)


    preadv(0xd,0x7fffffffa3d0,0x1,0x2da000) = 131072 (0x20000)


    preadv(0xd,0x7fffffffa3d0,0x1,0x2fa000) = 131072 (0x20000)


    kevent(8,0x0,0,{ SIGURG,EVFILT_SIGNAL,EV_CLEAR,0,0x1,0x80160a580 },1,0x0) = 1 (0x1)


    preadv(0xd,0x7fffffffa3d0,0x1,0x31a000) = 131072 (0x20000)


    preadv(0xd,0x7fffffffa3d0,0x1,0x33a000) = 131072 (0x20000)


    kevent(8,0x0,0,{ SIGURG,EVFILT_SIGNAL,EV_CLEAR,0,0x1,0x80160a580 },1,0x0) = 1 (0x1)


    preadv(0xd,0x7fffffffa3d0,0x1,0x35a000) = 131072 (0x20000)


    preadv(0xd,0x7fffffffa3d0,0x1,0x37a000) = 131072 (0x20000)


    kevent(8,0x0,0,{ SIGURG,EVFILT_SIGNAL,EV_CLEAR,0,0x1,0x80160a580 },1,0x0) = 1 (0x1)


    preadv(0xd,0x7fffffffa3d0,0x1,0x39a000) = 131072 (0x20000)


    preadv(0xd,0x7fffffffa3d0,0x1,0x3ba000) = 131072 (0x20000)


    kevent(8,0x0,0,{ SIGURG,EVFILT_SIGNAL,EV_CLEAR,0,0x1,0x80160a580 },1,0x0) = 1 (0x1)


    preadv(0xd,0x7fffffffa3d0,0x1,0x3da000) = 131072 (0x20000)


    preadv(0xd,0x7fffffffa3d0,0x1,0x3fa000) = 131072 (0x20000)


    kevent(8,0x0,0,{ SIGURG,EVFILT_SIGNAL,EV_CLEAR,0,0x1,0x80160a580 },1,0x0) = 1 (0x1)


    preadv(0xd,0x7fffffffa3d0,0x1,0x41a000) = 131072 (0x20000)


    preadv(0xd,0x7fffffffa3d0,0x1,0x43a000) = 131072 (0x20000)
    IO worker process


  10. kevent(4,0x0,0,{ 8,EVFILT_READ,0x0,0,0x1d,0x8016ee790 },1,0x0) = 1 (0x1)


    recvfrom(8,"Q\0\0\0\^\select count(*) from t"...,8192,0,NULL,0x0) = 29 (0x1d)


    lio_listio(LIO_NOWAIT,[{ 32,1941504,0x80372a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE }

    { 32,2072576,0x80374a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE }
    lio_listio(LIO_NOWAIT,[{ 32,2203648,0x80376a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE }

    { 32,2334720,0x80378a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE }
    lio_listio(LIO_NOWAIT,[{ 32,2465792,0x8037aa000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE }

    { 32,2596864,0x8037ca000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE }
    lio_listio(LIO_NOWAIT,[{ 32,2727936,0x8037ea000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE }

    { 32,2859008,0x80380a000,16384,LIO_READ,{ sigev_notify=SIGEV_NONE }


    { 32,2875392,0x80470e000,114688,LIO_READ,{ sigev_notify=SIGEV_NONE }
    aio_waitcomplete({ 32,2072576,0x80374a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE } },{ 0
    aio_waitcomplete({ 32,1941504,0x80372a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE } },{ 0
    aio_waitcomplete({ 32,2203648,0x80376a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE } },{ 0
    aio_waitcomplete({ 32,2334720,0x80378a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE } },{ 0
    aio_waitcomplete({ 32,2465792,0x8037aa000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE } },{ 0
    aio_waitcomplete({ 32,2596864,0x8037ca000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE } },{ 0
    aio_waitcomplete({ 32,2859008,0x80380a000,16384,LIO_READ,{ sigev_notify=SIGEV_NONE } },{ 0.
    aio_waitcomplete({ 32,2727936,0x8037ea000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE } },{ 0
    aio_waitcomplete({ 32,2875392,0x80470e000,114688,LIO_READ,{ sigev_notify=SIGEV_NONE } },{ 0
    lio_listio(LIO_NOWAIT,[{ 32,2990080,0x80472a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE }

    { 32,3121152,0x80474a000,131072,LIO_READ,{ sigev_notify=SIGEV_NONE }

    io_method=posix_aio (FreeBSD)

    aio_waitcomplete(): a FreeBSD-specific way to consume completion events
    (one of many supported ways, see next section “Portability nightmares”)

    lio_listio(): the POSIX way to start multiple IOs with one system call
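    As a reading aid for the trace above, here is a minimal C sketch of the lio_listio()/aio_waitcomplete() pattern on FreeBSD; error handling and buffer cleanup are omitted, and the sizes are only examples.

    #include <aio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NREADS 2
    #define IOSIZE (128 * 1024)

    static void
    batch_read(int fd, off_t start_offset)
    {
        struct aiocb iocbs[NREADS];
        struct aiocb *list[NREADS];

        for (int i = 0; i < NREADS; i++)
        {
            memset(&iocbs[i], 0, sizeof(iocbs[i]));
            iocbs[i].aio_fildes = fd;
            iocbs[i].aio_offset = start_offset + (off_t) i * IOSIZE;
            iocbs[i].aio_buf = malloc(IOSIZE);
            iocbs[i].aio_nbytes = IOSIZE;
            iocbs[i].aio_lio_opcode = LIO_READ;
            iocbs[i].aio_sigevent.sigev_notify = SIGEV_NONE;
            list[i] = &iocbs[i];
        }

        /* Start the whole batch without waiting (the lio_listio lines above). */
        (void) lio_listio(LIO_NOWAIT, list, NREADS, NULL);

        /* Drain completions in whatever order they finish (FreeBSD-specific). */
        for (int i = 0; i < NREADS; i++)
        {
            struct aiocb *done;
            ssize_t result = aio_waitcomplete(&done, NULL);

            (void) result;      /* byte count (or -1); 'done' says which IO */
            (void) done;
        }
    }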


  11. epoll_wait(12, [{EPOLLIN, {u32=2507243840, u64=94075175928128}}], 1, -1) = 1


    recvfrom(16, "Q\0\0\0\34select count(*) from t;\0", 8192, 0, NULL, NULL) = 29


    io_uring_enter(9, 1, 0, 0, NULL, 8) = 1


    io_uring_enter(9, 1, 0, 0, NULL, 8) = 1


    io_uring_enter(9, 1, 0, 0, NULL, 8) = 1


    io_uring_enter(9, 1, 0, 0, NULL, 8) = 1


    io_uring_enter(9, 1, 0, 0, NULL, 8) = 1


    io_uring_enter(9, 1, 0, 0, NULL, 8) = 1


    io_uring_enter(9, 1, 0, 0, NULL, 8) = 1


    io_uring_enter(9, 1, 0, 0, NULL, 8) = 1


    io_uring_enter(9, 1, 0, 0, NULL, 8) = 1





    io_uring_enter(9, 1, 0, 0, NULL, 8) = 1


    io_uring_enter(9, 1, 0, 0, NULL, 8) = 1


    io_uring_enter(9, 1, 0, 0, NULL, 8) = 1


    sendto(16, "T\0\0\0\36\0\1count\0\0\0\0\0\0\0\0\0\0\24\0\10\377\377\377\377\0\0D"..., 69, 0,
    0) = 69


    recvfrom(16, 0x558f937066a0, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavaila


    epoll_wait(12, 0x558f95718188, 1, -1) = -1 EINTR (Interrupted system call)
    io_method=io_uring (Linux)

    Simple tracing tools can’t see submissions and completions as they’re in
    user-space queues, not in syscall arguments/results
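    For comparison, an illustrative C sketch using liburing (an assumption; the slide only shows the raw syscall trace): the request and its completion live in shared-memory rings, which is why strace only sees io_uring_enter().

    #include <liburing.h>

    static int
    read_with_io_uring(int fd, void *buf, unsigned len, off_t offset)
    {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        int res;

        if (io_uring_queue_init(8, &ring, 0) < 0)
            return -1;

        /* Fill a submission queue entry in user-space memory... */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, len, offset);

        /* ...then one io_uring_enter() submits it (the syscall in the trace). */
        io_uring_submit(&ring);

        /* The completion is read from a shared ring (waiting may also enter). */
        io_uring_wait_cqe(&ring, &cqe);
        res = cqe->res;
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return res;
    }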


  12. AIO streaming
    [Diagram: data consumer/producer, connected by a callback to the streaming controller, which drives the PG AIO API]
    Logic for controlling IO depth is common infrastructure, working in batches to allow for IO combining;
    research topic: control theory algorithms
    Logic for how to initiate one more IO is provided by the consumer/producer:
    index scans, recovery/replication system, checkpointer, …


  13. Primary node vs. replica node
    A system with many clients can be faulting in pages concurrently
    Replicas would synchronously fault in one page at a time; fixed with AIO-based prefetching
    postgres=# select wal_distance, io_depth from pg_stat_recovery_prefetch;
     wal_distance | io_depth
    --------------+----------
             3272 |        7


  14. postgres=# select op, scb, owner_pid, "desc" from pg_stat_aios where flags != 'PGAIOIP_IDLE';
    ┌───────┬─────┬───────────┬─────────────────────────────────────────────────────────────────────────────────────────────┐
    │ op    │ scb │ owner_pid │ desc                                                                                        │
    ├───────┼─────┼───────────┼─────────────────────────────────────────────────────────────────────────────────────────────┤
    │ read  │ sb  │     67223 │ fd: 13, mode: 0, offset: 28188672, nbytes: 8192, already_done: 0, buffid: 10797            │
    │ read  │ sb  │     67222 │ fd: 13, mode: 0, offset: 8757248, nbytes: 8192, already_done: 0, buffid: 10463             │
    │ read  │ sb  │     67221 │ fd: 8, mode: 0, offset: 289153024, nbytes: 8192, already_done: 0, buffid: 11007            │
    │ write │ wal │     67227 │ write_no: 23, fd: 7, offset: 3244032, nbytes: 8192, already_done: 0, bufdata: 0x103582000  │
    │ read  │ sb  │     67220 │ fd: 12, mode: 0, offset: 84320256, nbytes: 8192, already_done: 0, buffid: 9771             │
    │ read  │ sb  │     67225 │ fd: 12, mode: 0, offset: 29343744, nbytes: 8192, already_done: 0, buffid: 10102            │
    │ read  │ sb  │     67224 │ fd: 12, mode: 0, offset: 175702016, nbytes: 8192, already_done: 0, buffid: 9862            │
    └───────┴─────┴───────────┴─────────────────────────────────────────────────────────────────────────────────────────────┘



  15. Problems we want to solve


    Direct I/O sounds easy

    AIO is a portability nightmare


    FreeBSD thoughts


    Q&A


  16. O_DIRECT
    • IRIX gave us open(O_DIRECT), widely adopted, but not covered by POSIX so no formal meaning; general idea “don’t buffer my data”
    • Systems differ in details (block alignment requirements, effects on descriptors with different flags, support on different file systems); see the sketch after this list
    • There were also some other ideas: Sun’s directio() and mount options, Apple’s F_NOCACHE
    • AFAIK, mostly of interest to database hackers and big data graphics/science
    • Speculation: one of the motivations for 90s databases to use raw disk partitions was perhaps to skip the kernel’s buffer cache, IO scheduler and concurrency/locking problems; O_DIRECT and highly concurrent file systems address those issues
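    A minimal sketch of the general idea, assuming a platform that accepts O_DIRECT (FreeBSD and Linux do; macOS uses fcntl(F_NOCACHE) instead): buffers, offsets and lengths typically need block alignment.

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define ALIGNMENT 4096          /* a common requirement; it varies by system */

    static ssize_t
    direct_read(const char *path, off_t offset, size_t len)
    {
        void *buf;
        int fd;
        ssize_t n;

        /* A misaligned buffer, offset or length may fail with EINVAL. */
        if (posix_memalign(&buf, ALIGNMENT, len) != 0)
            return -1;

        fd = open(path, O_RDONLY | O_DIRECT);
        if (fd < 0)
        {
            free(buf);
            return -1;
        }

        n = pread(fd, buf, len, offset);    /* bypasses the kernel buffer cache */

        close(fd);
        free(buf);
        return n;
    }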


  17. O_DIRECT gives you more problems
    • Our buffer replacement logic and buffer pool size had better be good, because the kernel’s buffering won’t be cushioning our misses
    • We’ll need to supply readahead, writeback, clustering and concurrency logic, even for the simplest straight-line cases
    • -> AIO and DIO go together like peanut butter and jelly
    • Slower unbuffered reads and writes may reveal locking problems in filesystem implementation (gulp)


  18. Problems we want to solve


    Direct I/O sounds easy

    AIO is a portability nightmare


    FreeBSD thoughts


    Q&A


  19. • 1993: Windows NT

    • Brings ideas from VMS, RSX-11D,
    RSX-11M

    • Asynchronous interfaces everywhere
    (“overlapped”)

    • Event-based polling and callbacks

    • 1994: Windows NT 3.5

    • Adds IOCP for queue-like event
    processing

    • 1993: POSIX AIO, RT signals

    • IOs can now be started
    asynchronously

    • RT signals can carry a payload (for example an IO number/pointer) and be consumed synchronously from a per-process queue, or delivered to signal handlers

    • Alternative polling scheme
    Nearly three decades of AIO


  20. io_method=            worker    posix_aio    io_uring    windows_iocp
    AIX                       ✓          ✓
    DragonflyBSD              ✓          N/A
    FreeBSD                   ✓          ✓
    HP-UX                     ✓          ❌
    Linux (glibc, musl)       ✓          ✓🧵           ✓
    macOS                     ✓          ✓*
    NetBSD                    ✓          💥
    OpenBSD                   ✓          N/A
    Solaris                   ✓          ✓🧵
    Windows                   ✓          N/A                       ✓


  21. POSIX AIO: submission APIs
    • aio_read(), aio_write(), aio_fsync()

    • lio_listio() for batched submission of LIO_READ, LIO_WRITE

    • FreeBSD 13 added non-standard extensions aio_readv(), aio_writev(),
    LIO_READV, LIO_WRITEV for vector IO (scatter/gather)

    • Along the same lines, preadv()/pwritev() are also curiously absent from POSIX,
    but are obvious combinations of pread()/pwrite() and readv()/writev()


  22. POSIX AIO: completion APIs
    • aio_error() to find out if it’s finished, failed or EINPROGRESS
    • aio_return() to get the result
    • aio_suspend() to wait for a list of IOs to complete, then see above to find out which ones (reads and writes only, no fsyncs, though almost all implementations allow fsync to be waited for too); see the sketch below
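    A minimal sketch of that standard dance (error handling omitted): start one read, wait with aio_suspend(), then check aio_error() and collect the result with aio_return().

    #include <aio.h>
    #include <errno.h>
    #include <string.h>

    static ssize_t
    read_one_async(int fd, void *buf, size_t len, off_t offset)
    {
        struct aiocb cb;
        const struct aiocb *list[1];

        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf = buf;
        cb.aio_nbytes = len;
        cb.aio_offset = offset;
        cb.aio_sigevent.sigev_notify = SIGEV_NONE;

        if (aio_read(&cb) != 0)
            return -1;

        /* Block until at least one IO in the list has completed. */
        list[0] = &cb;
        while (aio_error(&cb) == EINPROGRESS)
            (void) aio_suspend(list, 1, NULL);

        /* aio_return() may be called exactly once per completed IO. */
        return aio_return(&cb);
    }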


  23. POSIX AIO: standard notification mechanisms
    • SIGEV_SIGNAL
    • sigwaitinfo() can give you a read-from-a-queue style interface (sketched below), at a cost of three syscalls per IO (sigwaitinfo(), aio_error(), aio_return()), but macOS didn’t implement it! You could do sigwait() and then aio_error() for all outstanding IOs, for O(n^2) syscalls…
    • Using signal handlers would be unpleasant, and glibc has some questionable async signal safety
    • SIGEV_THREAD
    • Call a function pointer in some unspecified thread. As a matter of policy we probably don’t want threads for this, but in any case macOS didn’t implement it…
    • SIGEV_NONE
    • You’ll have to use aio_suspend() to wait, and then O(n^2). Only HP-UX doesn’t allow aio_fsync() to be waited for with aio_suspend() (which is allowed by POSIX, why?!)
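    A sketch of the SIGEV_SIGNAL + sigwaitinfo() pattern, assuming a platform that implements queued realtime signals for AIO (macOS doesn’t, per the slide); each completion costs sigwaitinfo() + aio_error() + aio_return().

    #include <aio.h>
    #include <errno.h>
    #include <signal.h>

    static void
    start_and_reap_one(struct aiocb *cb)
    {
        sigset_t set;
        siginfo_t info;
        int signo = SIGRTMIN;

        /* Block the signal so completions queue up instead of running handlers. */
        sigemptyset(&set);
        sigaddset(&set, signo);
        sigprocmask(SIG_BLOCK, &set, NULL);

        /* Ask for a queued signal carrying a pointer back to this aiocb. */
        cb->aio_sigevent.sigev_notify = SIGEV_SIGNAL;
        cb->aio_sigevent.sigev_signo = signo;
        cb->aio_sigevent.sigev_value.sival_ptr = cb;
        (void) aio_read(cb);

        /* Syscall 1: dequeue one completion notification. */
        (void) sigwaitinfo(&set, &info);
        struct aiocb *done = info.si_value.sival_ptr;

        /* Syscalls 2 and 3: confirm it finished and fetch the result. */
        if (aio_error(done) != EINPROGRESS)
            (void) aio_return(done);
    }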


  24. POSIX AIO: extended completion APIs
    Almost every OS invented a much nicer queue-like system
    • SIGEV_KEVENT (FreeBSD): drain N from a kqueue (sketched below)
    • You still have to call aio_return() on each IO, and possibly aio_error() too*, so you cannot get below 1.x syscalls per IO!
    • macOS and NetBSD have both kqueue and AIO, but they are not connected; Dragonfly removed AIO.
    • aio_waitcomplete() (FreeBSD): drain one like a queue; no aio_error() or aio_return()
    • AIX’s aio_nwait() (but it doesn’t work for aio_fsync()!), Solaris’s aio_waitn(), HP-UX’s aio_reap() could all read batches of completions, some without even entering the kernel
    • AIX and Solaris also added Windows IOCP-style APIs (not explored by me; SIGEV_PORT?)
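    A FreeBSD-specific sketch of the SIGEV_KEVENT route: the completion arrives as an EVFILT_AIO kevent whose udata points back at the aiocb, but aio_return() still has to be called per IO.

    #include <sys/types.h>
    #include <sys/event.h>
    #include <aio.h>
    #include <string.h>
    #include <unistd.h>

    static ssize_t
    kqueue_aio_read(int fd, void *buf, size_t len, off_t offset)
    {
        int kq = kqueue();
        struct aiocb cb;
        struct kevent ev;

        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf = buf;
        cb.aio_nbytes = len;
        cb.aio_offset = offset;

        /* Route the completion notification to our kqueue. */
        cb.aio_sigevent.sigev_notify = SIGEV_KEVENT;
        cb.aio_sigevent.sigev_notify_kqueue = kq;
        cb.aio_sigevent.sigev_value.sival_ptr = &cb;

        if (aio_read(&cb) != 0)
        {
            close(kq);
            return -1;
        }

        /* Wait for one EVFILT_AIO event; ev.udata points back at our aiocb. */
        (void) kevent(kq, NULL, 0, &ev, 1, NULL);
        struct aiocb *done = (struct aiocb *) ev.udata;

        close(kq);
        return aio_return(done);    /* the extra syscall the slide laments */
    }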


  25. POSIX AIO: Assorted surprises
    • What should happen if you close a descriptor while an IO is pending?
    • The IO should run to completion
    • ECANCELED
    • The system should go bananas and run the IO against some later user of the same file descriptor number
    • What should happen if you perform multiple IOs on the same descriptor?
    • They should run in parallel to the extent possible
    • Serialise them, surely the calling program jests
    • POSIX says that aio_suspend() can’t wait for aio_fsync() (only HP-UX doesn’t support fsync here)
    • You can’t complete IOs started by another process. OK, maybe not so surprising. PostgreSQL really needs to go multi-threaded…


  26. Problems we want to solve


    Direct I/O sounds easy

    AIO is a portability nightmare


    FreeBSD thoughts


    Q&A


  27. OpenZFS will hopefully support O_DIRECT
    • Direct IO is coming!

    • https://openzfs.org/wiki/OpenZFS_Developer_Summit_2021

    • I haven’t tried it out yet, but this looks interesting for databases.


  28. Assorted tricky problems
    • UFS O_SYNC/O_DSYNC waits *for each block* instead of whole writes
    • UFS has a fast bufferless path for O_DIRECT reads, but not vector reads (which should ideally produce a single multi-segment read at the device level), and no equivalent for writes
    • UFS fsync()/fdatasync() doesn’t flush device caches or use FUA
    • UFS uses VFS-level write locks for write() and fsync(), preventing concurrency
    • The VFS doesn’t allow for true asynchronous IO; instead, kernel threads call the synchronous entry points and sleep


  29. Improve AIO completions via kqueue?
    • I would like to be able to use SIGEV_KEVENT to consume completion events from a kqueue without having to call aio_return(). Then we could get below 1 system call per IO, like on other OSes.
    • I have some prototype code for this: see D33271, D33144. It’s tricky; there are many places in the kernel that assume that IO is synchronous…
    • I have heard suggestions that kevent() should be able to start IOs too


  30. Problems we want to solve


    Direct I/O sounds easy

    AIO is a portability nightmare


    FreeBSD thoughts


    Q&A


  31. https://wiki.postgresql.org/wiki/AIO


    https://wiki.postgresql.org/wiki/FreeBSD/AIO
