$30 off During Our Annual Pro Sale. View Details »

PostgreSQL Portability Tales

PostgreSQL Portability Tales


This is an informal talk I gave to the Warsaw PostgreSQL User Group via video link. If I sounded a bit spaced out, that might be because it was 4:30am my time and I hadn't had anywhere near enough coffee!

Description: "PostgreSQL runs on many operating systems. I'm going to talk about a selection of interesting differences between them, and things we have done or could to in the future to make things better. Mostly focusing on Linux and FreeBSD, with some discussion of macOS and Windows, and some passing mentions of the other BSDs, illumos and some ancient commercial Unix history. Topics include memory, I/O, IPC and management."

Here are some links to more information about things discussed or other talks/blogs mentioned in the talk or the Q&A session:

* Work-in-progress patch to get posix_fadvise(POSIX_FADV_WILLNEED) to work on OpenZFS (for Linux and FreeBSD at least): https://github.com/openzfs/zfs/pull/9807

* Async and direct I/O have been researched quite a bit by Andres Freund who has spoken about it at a couple of conferences: https://anarazel.de/talks/2020-01-31-fosdem-aio/aio.pdf

* I mentioned the paper "On the Impact of Instruction Address Translation Overhead", about the performance impact of page size for executable code on PostgreSQL performance: https://www.cs.rochester.edu/u/xdong/ispass-19-final.pdf

* I've also heard anecdotes reporting similar benefits for text segments on huge pages on Linux, for PostgreSQL, Oracle and other software. The instructions for getting this working on Linux are at: https://github.com/libhugetlbfs/libhugetlbfs/blob/master/HOWTO

* Some more information about the differences in write-back error handling on different operating systems can be found here: https://wiki.postgresql.org/wiki/Fsync_Errors

* A description of ancient Berkeley POSTGRES's Lisp origins, and also the famous promise to move from processes to threads, as mentioned by Adam in the Q&A: https://dsf.berkeley.edu/papers/ERL-M90-34.pdf

* Some information about PostgreSQL on ZFS from Sean Chittenden, that came up in the Q&A: https://www.slideshare.net/SeanChittenden/postgresql-zfs-best-practices ; Sean also presented on Joyent's pg_prefaulter at pgcon: https://www.pgcon.org/2018/schedule/track/Case%20Studies/1204.en.html

* My attempt to get WAL prefetching into core PostgreSQL (work in progress): https://commitfest.postgresql.org/28/2410/

* A post mortem of an outage published by Joyent, that came up in the Q&A: https://www.joyent.com/blog/manta-postmortem-7-27-2015

* In the Q&A, Adam mentioned multithreading designs that involve event driven programming with one thread per core. I mentioned that one subproject involved in that problem is converting queries from a tree walker format that uses the C stack to represent state, to something more amenable to processing as 'coroutines', 'green threads' or 'continuation capture' (cf. the Scheme programming language). Andres Freund has been exploring that topic and has written about his designs and prototyping here: https://www.postgresql.org/message-id/flat/20180525033538.6ypfwcqcxce6zkjj%40alap3.anarazel.de

* As mentioned in the Q&A, some notes on some small projects to improve PostgreSQL on FreeBSD can be found here: https://wiki.postgresql.org/wiki/FreeBSD

* In the Q&A, Alicja mentioned that she'd seen some evidence that the traditional advice to turn Linux's THP off (which I repeated in this talk) might not be the whole story, and that she'd found cases here it helped a lot and had presented on that topic: https://www.postgresql.eu/events/pgconfeu2018/sessions/session/2125/slides/103/VMandFS_tuning.pdf

Thomas Munro

April 16, 2020

More Decks by Thomas Munro

Other Decks in Programming


  1. PostgreSQL
    Portability Tales
    Warsaw PostgreSQL Users Group via video conference
    Thomas Munro, 2020-04-16
    [email protected]
    [email protected]
    [email protected]

    View Slide

  2. Talk topics


    Who cares about portability?

    Instruction set architectures


    Simulating fork()

    Page size effects and control

    Unexpected allocator effects

    Buffered I/O

    Error reporting variations

    Prefetching heuristics and control

    Writeback heuristics and control

    Towards async & direct I/O

    IPC and networking


    I/O multiplexing


    View Slide

  3. Introduction: Who cares about portability?

    Original POSTGRES development platforms at UC Berkeley seem to have been:
    SunOS, Sequent Dynix (= a BSD fork with large scale SMP), BSD

    Today, developers mostly use Linux or macOS, a few use FreeBSD or Windows

    According to buildfarm.postgresql.org, our community currently cares about:

    { Debian, Ubuntu, SUSE, RHEL, CentOS, Fedora, Photon, Raspbian, Arch, Amazon } Linux

    { Free, Open, Net, DragonFly } BSD


    Solaris, Illumos




    Over the years we’ve dropped support for many others (RIP):

    IRIX, SCO, DGUX, Ultrix, NeXTSTEP, SunOS4, SVR4, UnixWare, QNX, ...

    View Slide

  4. Introduction: Instruction set architectures

    Modern and ancient ISAs we have in the build farm:









    Modern ISAs that we don’t explicitly support yet -- build farm machine wanted!


    There’s explicit code to support various other long dead systems, but no build
    farm so status unknown to me (M68K, M88K, VAX)

    View Slide

  5. Memory: Simulating fork()

    PostgreSQL is still process based after all these decades (one day someone will
    attack that…).

    On Unix, the main shared memory area is inherited by children of the postmaster;
    we count on address stability in that region.

    Windows has no fork(). Child processes have to recreate all required state from a
    file, and try to attach the main shared memory area at the same address.

    For those of us working on Unix and wanting to test or extend that mechanism,
    it’s possible to build PostgreSQL that way on Unix too, with -DEXEC_BACKEND.

    Windows support must have taken Herculean effort. Added in 8.0 (2005).

    View Slide

  6. Memory: Page size effects and control


    The “transparent huge pages” support is currently discouraged by PostgreSQL docs; one day?.

    We try to map the main shared memory segment using huge pages (if configured), for performance

    Putting the text (executable) into huge pages is known to speed up benchmarks by several percent,
    but requires some special magic not done out of the box (see libhugetlbfs linker scripts)

    Putting DSM segments into huge pages is known to speed up some parallel queries, especially
    parallel hash joins, significantly, but requires some extra configuration (/dev/shm mount options)

    FreeBSD (and illumos too?)

    Most of the above happens transparently out of the box (but further improvements are possible*)


    We try to map the main shared memory segment only using huge pages (requires privileges)

    Future research

    Different page size? AIX support proposed. Reuse DSMs better, probably preallocated.
    *See “On the Impact of Instruction Address Translation Overhead”, Zhou, Dong, Cox, Dwarkadas

    View Slide

  7. Memory: Unexpected allocator effects
    PostgreSQL allocates memory for hash joins 32kB + a tiny header at a time. What does
    libc malloc() do with this stream of requests?

    Glibc (usually used on Linux): efficiently packs memory

    Jemalloc (FreeBSD): requests 32kb < x < 40kB are rounded up to 40kB, so hash
    joins eat 25% more memory on this platform! Oops.

    macOS: requests are rounded up to nearest 256 bytes

    Other OSs: who knows?
    We don’t have to use 32kB + a tiny header, but it’s hard to second-guess every system
    out there. It would probably be wise to use exact powers of two! Not done yet...

    View Slide

  8. Buffered I/O: Error reporting variations
    (1) What is the status of buffered data after fsync()
    returns EIO or ENOSPC (NFS)?

    Linux: dirty buffers may be marked clean
    and dropped immediately or eventually
    (depending on filesystem), so that they will
    never be written back; errors are “consumed”
    when they are reported, so a new call to
    fsync() might report success and a call to
    read() might read and earlier state of the

    FreeBSD (and maybe illumos?): dirty buffers
    remain dirty and can’t be replaced (until the
    device goes away, or writeback succeeds)
    (2) What if a writeback error occurs while no
    process holds the file open, and you later open the
    same file and call fsync()?

    Linux: an error flag persists so that at least
    one later fsync() call will report the error,
    but only as long as the inode remains in
    kernel memory

    FreeBSD (and maybe illumos?): see previous
    The PostgreSQL project had assumed FreeBSD
    behaviour until 2018. Now fsync() errors cause a
    panic, to fix (1). More work is needed for (2).

    View Slide

  9. Buffered I/O: Prefetching heuristics and control
    What happens if you hand out sequential block number to
    worker processes to read, like in Parallel Sequential Scan?

    Linux: The kernel detects sequential access from all
    the workers, as long as their step size is < window
    size, default 128kB, and generates large I/Os.

    FreeBSD UFS: It doesn’t detect sequential access,
    and performance is terrible.

    FreeBSD ZFS: Seems to be much smarter (but I
    don’t understand exactly why).
    What happens if one process interleaves two sequential
    streams of blocks, one reading and one writing?

    Linux and FreeBSD UFS: It doesn’t detect sequential
    access, and performance is terrible.

    ZFS doesn’t have this problem (?)
    What happens if we use
    posix_fadvise(POSIX_FADV_WILLNEED) to tell the
    kernel about our future random read plans?

    Linux: It works nicely, to start asynchronous reads.,
    on in-tree filesystems (not ZFS).

    FreeBSD: Ignored (though other advice works).

    Illumos/Solaris: It exists but is ignored (no syscall).

    macOS, Windows: Not provided at all.
    Currently this improves the performance of bitmap heap
    scans and VACUUM. In future we might use it for more
    things (index scans, recovery/replication speed
    improvements). This is a type of “poor man’s async I/O”.
    I created WIP pull request #9807 for openzfs to make this
    work on ZFS on Linux and FreeBSD -- but it needs more

    View Slide

  10. Buffered I/O: Writeback heuristics and control
    On Linux, we use sync_file_range() without waiting, to ask the kernel to initiate
    writeback of specific ranges of dirty files. This is a hint about which files we’ll be
    calling fsync() on, to spread the writing out over a longer time, without having other
    files also written back (temporary files which we’ll never fsync(), files belonging to
    unrelated software on the same server).
    No other system has a similar system call, and there are no great ways to emulate it.
    Perhaps there should be POSIX_FADV_WILLSYNC?

    View Slide

  11. Buffered I/O: Towards async and direct I/O

    POSTGRES originated at a time when there was no direct or async I/O; its authors wrote papers
    complaining about double buffering and other problems

    It’s possible that its authors even requested that fsync() be added to 4.2BSD (pure speculation on my part,
    given they were developed around the same time and place… how did anyone make reliable databases on
    Unix before that?)

    Async I/O will require a lot of new portability work: Linux’s POSIX and non-POSIX aio implementations
    are not usable by us for various reasons, but its brand new io_uring looks intriguing; Windows has a
    completely different model; for FreeBSD and friends we’d have to use POSIX aio.

    Direct I/O is not standardised, and will require a lot of new machinery inside PostgreSQL to replace the
    read-ahead, write-behind, buffering heuristics of the kernel.

    We’d still need to support buffered I/O for filesystems like ZFS, and synchronous I/O for OSes that we
    don’t have an async implementation.

    All of this will be a lot of work!

    View Slide

  12. IPC and networking: Signals
    Linux and FreeBSD have a “parent death signal” facility (inspired by an IRIX feature).
    Why do other Unices not have a thing like this? Without it, we expend a lot of system
    time polling a pipe so that processes can exit quickly if the postmaster process goes
    For Windows we have a complicated arrangement to simulate Unix signals.

    View Slide

  13. IPC and networking: I/O multiplexing
    Frequently PostgreSQL waits on a set of file descriptors: a socket, a self-pipe used by
    signal handlers and a pipe connected to the postmaster. Switching from a poll()-based
    implementation dramatically reduced system time for larger servers with many
    connections, due to contention on the pipe.

    Linux: epoll

    *BSD, macOS: kqueue (since PG13, not yet released)

    Other Un*x: poll

    Windows: win32 primitives

    View Slide

  14. IPC and networking: TCP/IP
    One obscure difference in TCP implementations: Windows doesn’t allow you to read
    from a socket that has already failed to write. This means that if a remote host sends a
    message like “authentication failed, goodbye” and then closes the socket, and the
    protocol is one that requires the local host to begin sending data before receiving any
    message from the remote host, then the local host risks an ECONNRESET error when
    it tries to send, but can no longer read the “authentication failed, goodbye” message.
    Concretely, this causes us to lose some kinds of helpful context from logs when things
    go wrong on Windows, in protocols that support unsoliticed messages from both sides.
    (It may be that such a message is not guaranteed to be available on other systems
    either, but it usually is at least.)

    View Slide

  15. EOF

    View Slide