Upgrade to Pro — share decks privately, control downloads, hide ads and more …

PostgreSQL Portability Tales

PostgreSQL Portability Tales

https://www.youtube.com/watch?v=17XYIJbkGz4
https://www.meetup.com/pl-PL/Warsaw-PostgreSQL-Users-Group/events/269906534/

This is an informal talk I gave to the Warsaw PostgreSQL User Group via video link. If I sounded a bit spaced out, that might be because it was 4:30am my time and I hadn't had anywhere near enough coffee!

Description: "PostgreSQL runs on many operating systems. I'm going to talk about a selection of interesting differences between them, and things we have done or could to in the future to make things better. Mostly focusing on Linux and FreeBSD, with some discussion of macOS and Windows, and some passing mentions of the other BSDs, illumos and some ancient commercial Unix history. Topics include memory, I/O, IPC and management."

Here are some links to more information about things discussed or other talks/blogs mentioned in the talk or the Q&A session:

* Work-in-progress patch to get posix_fadvise(POSIX_FADV_WILLNEED) to work on OpenZFS (for Linux and FreeBSD at least): https://github.com/openzfs/zfs/pull/9807

* Async and direct I/O have been researched quite a bit by Andres Freund who has spoken about it at a couple of conferences: https://anarazel.de/talks/2020-01-31-fosdem-aio/aio.pdf

* I mentioned the paper "On the Impact of Instruction Address Translation Overhead", about the performance impact of page size for executable code on PostgreSQL performance: https://www.cs.rochester.edu/u/xdong/ispass-19-final.pdf

* I've also heard anecdotes reporting similar benefits for text segments on huge pages on Linux, for PostgreSQL, Oracle and other software. The instructions for getting this working on Linux are at: https://github.com/libhugetlbfs/libhugetlbfs/blob/master/HOWTO

* Some more information about the differences in write-back error handling on different operating systems can be found here: https://wiki.postgresql.org/wiki/Fsync_Errors

* A description of ancient Berkeley POSTGRES's Lisp origins, and also the famous promise to move from processes to threads, as mentioned by Adam in the Q&A: https://dsf.berkeley.edu/papers/ERL-M90-34.pdf

* Some information about PostgreSQL on ZFS from Sean Chittenden, that came up in the Q&A: https://www.slideshare.net/SeanChittenden/postgresql-zfs-best-practices ; Sean also presented on Joyent's pg_prefaulter at pgcon: https://www.pgcon.org/2018/schedule/track/Case%20Studies/1204.en.html

* My attempt to get WAL prefetching into core PostgreSQL (work in progress): https://commitfest.postgresql.org/28/2410/

* A post mortem of an outage published by Joyent, that came up in the Q&A: https://www.joyent.com/blog/manta-postmortem-7-27-2015

* In the Q&A, Adam mentioned multithreading designs that involve event driven programming with one thread per core. I mentioned that one subproject involved in that problem is converting queries from a tree walker format that uses the C stack to represent state, to something more amenable to processing as 'coroutines', 'green threads' or 'continuation capture' (cf. the Scheme programming language). Andres Freund has been exploring that topic and has written about his designs and prototyping here: https://www.postgresql.org/message-id/flat/20180525033538.6ypfwcqcxce6zkjj%40alap3.anarazel.de

* As mentioned in the Q&A, some notes on some small projects to improve PostgreSQL on FreeBSD can be found here: https://wiki.postgresql.org/wiki/FreeBSD

* In the Q&A, Alicja mentioned that she'd seen some evidence that the traditional advice to turn Linux's THP off (which I repeated in this talk) might not be the whole story, and that she'd found cases here it helped a lot and had presented on that topic: https://www.postgresql.eu/events/pgconfeu2018/sessions/session/2125/slides/103/VMandFS_tuning.pdf

Thomas Munro

April 16, 2020
Tweet

More Decks by Thomas Munro

Other Decks in Programming

Transcript

  1. Talk topics • Introduction ◦ Who cares about portability? ◦

    Instruction set architectures • Memory ◦ Simulating fork() ◦ Page size effects and control ◦ Unexpected allocator effects • Buffered I/O ◦ Error reporting variations ◦ Prefetching heuristics and control ◦ Writeback heuristics and control ◦ Towards async & direct I/O • IPC and networking ◦ Signals ◦ I/O multiplexing ◦ TCP/IP
  2. Introduction: Who cares about portability? • Original POSTGRES development platforms

    at UC Berkeley seem to have been: SunOS, Sequent Dynix (= a BSD fork with large scale SMP), BSD • Today, developers mostly use Linux or macOS, a few use FreeBSD or Windows • According to buildfarm.postgresql.org, our community currently cares about: ◦ { Debian, Ubuntu, SUSE, RHEL, CentOS, Fedora, Photon, Raspbian, Arch, Amazon } Linux ◦ { Free, Open, Net, DragonFly } BSD ◦ macOS ◦ Solaris, Illumos ◦ AIX ◦ HPUX ◦ Windows • Over the years we’ve dropped support for many others (RIP): ◦ IRIX, SCO, DGUX, Ultrix, NeXTSTEP, SunOS4, SVR4, UnixWare, QNX, ...
  3. Introduction: Instruction set architectures • Modern and ancient ISAs we

    have in the build farm: ◦ X86 ◦ ARM ◦ POWER ◦ S390 ◦ IA64 ◦ SPARC ◦ MIPS ◦ PA-RISC • Modern ISAs that we don’t explicitly support yet -- build farm machine wanted! ◦ RISC-V • There’s explicit code to support various other long dead systems, but no build farm so status unknown to me (M68K, M88K, VAX)
  4. Memory: Simulating fork() • PostgreSQL is still process based after

    all these decades (one day someone will attack that…). • On Unix, the main shared memory area is inherited by children of the postmaster; we count on address stability in that region. • Windows has no fork(). Child processes have to recreate all required state from a file, and try to attach the main shared memory area at the same address. • For those of us working on Unix and wanting to test or extend that mechanism, it’s possible to build PostgreSQL that way on Unix too, with -DEXEC_BACKEND. • Windows support must have taken Herculean effort. Added in 8.0 (2005).
  5. Memory: Page size effects and control • Linux ◦ The

    “transparent huge pages” support is currently discouraged by PostgreSQL docs; one day?. ◦ We try to map the main shared memory segment using huge pages (if configured), for performance ◦ Putting the text (executable) into huge pages is known to speed up benchmarks by several percent, but requires some special magic not done out of the box (see libhugetlbfs linker scripts) ◦ Putting DSM segments into huge pages is known to speed up some parallel queries, especially parallel hash joins, significantly, but requires some extra configuration (/dev/shm mount options) • FreeBSD (and illumos too?) ◦ Most of the above happens transparently out of the box (but further improvements are possible*) • Windows ◦ We try to map the main shared memory segment only using huge pages (requires privileges) • Future research ◦ Different page size? AIX support proposed. Reuse DSMs better, probably preallocated. *See “On the Impact of Instruction Address Translation Overhead”, Zhou, Dong, Cox, Dwarkadas
  6. Memory: Unexpected allocator effects PostgreSQL allocates memory for hash joins

    32kB + a tiny header at a time. What does libc malloc() do with this stream of requests? • Glibc (usually used on Linux): efficiently packs memory • Jemalloc (FreeBSD): requests 32kb < x < 40kB are rounded up to 40kB, so hash joins eat 25% more memory on this platform! Oops. • macOS: requests are rounded up to nearest 256 bytes • Other OSs: who knows? We don’t have to use 32kB + a tiny header, but it’s hard to second-guess every system out there. It would probably be wise to use exact powers of two! Not done yet...
  7. Buffered I/O: Error reporting variations (1) What is the status

    of buffered data after fsync() returns EIO or ENOSPC (NFS)? • Linux: dirty buffers may be marked clean and dropped immediately or eventually (depending on filesystem), so that they will never be written back; errors are “consumed” when they are reported, so a new call to fsync() might report success and a call to read() might read and earlier state of the data • FreeBSD (and maybe illumos?): dirty buffers remain dirty and can’t be replaced (until the device goes away, or writeback succeeds) (2) What if a writeback error occurs while no process holds the file open, and you later open the same file and call fsync()? • Linux: an error flag persists so that at least one later fsync() call will report the error, but only as long as the inode remains in kernel memory • FreeBSD (and maybe illumos?): see previous answer The PostgreSQL project had assumed FreeBSD behaviour until 2018. Now fsync() errors cause a panic, to fix (1). More work is needed for (2).
  8. Buffered I/O: Prefetching heuristics and control What happens if you

    hand out sequential block number to worker processes to read, like in Parallel Sequential Scan? • Linux: The kernel detects sequential access from all the workers, as long as their step size is < window size, default 128kB, and generates large I/Os. • FreeBSD UFS: It doesn’t detect sequential access, and performance is terrible. • FreeBSD ZFS: Seems to be much smarter (but I don’t understand exactly why). What happens if one process interleaves two sequential streams of blocks, one reading and one writing? • Linux and FreeBSD UFS: It doesn’t detect sequential access, and performance is terrible. • ZFS doesn’t have this problem (?) What happens if we use posix_fadvise(POSIX_FADV_WILLNEED) to tell the kernel about our future random read plans? • Linux: It works nicely, to start asynchronous reads., on in-tree filesystems (not ZFS). • FreeBSD: Ignored (though other advice works). • Illumos/Solaris: It exists but is ignored (no syscall). • macOS, Windows: Not provided at all. Currently this improves the performance of bitmap heap scans and VACUUM. In future we might use it for more things (index scans, recovery/replication speed improvements). This is a type of “poor man’s async I/O”. I created WIP pull request #9807 for openzfs to make this work on ZFS on Linux and FreeBSD -- but it needs more work!
  9. Buffered I/O: Writeback heuristics and control On Linux, we use

    sync_file_range() without waiting, to ask the kernel to initiate writeback of specific ranges of dirty files. This is a hint about which files we’ll be calling fsync() on, to spread the writing out over a longer time, without having other files also written back (temporary files which we’ll never fsync(), files belonging to unrelated software on the same server). No other system has a similar system call, and there are no great ways to emulate it. Perhaps there should be POSIX_FADV_WILLSYNC?
  10. Buffered I/O: Towards async and direct I/O • POSTGRES originated

    at a time when there was no direct or async I/O; its authors wrote papers complaining about double buffering and other problems • It’s possible that its authors even requested that fsync() be added to 4.2BSD (pure speculation on my part, given they were developed around the same time and place… how did anyone make reliable databases on Unix before that?) • Async I/O will require a lot of new portability work: Linux’s POSIX and non-POSIX aio implementations are not usable by us for various reasons, but its brand new io_uring looks intriguing; Windows has a completely different model; for FreeBSD and friends we’d have to use POSIX aio. • Direct I/O is not standardised, and will require a lot of new machinery inside PostgreSQL to replace the read-ahead, write-behind, buffering heuristics of the kernel. • We’d still need to support buffered I/O for filesystems like ZFS, and synchronous I/O for OSes that we don’t have an async implementation. • All of this will be a lot of work!
  11. IPC and networking: Signals Linux and FreeBSD have a “parent

    death signal” facility (inspired by an IRIX feature). Why do other Unices not have a thing like this? Without it, we expend a lot of system time polling a pipe so that processes can exit quickly if the postmaster process goes away. For Windows we have a complicated arrangement to simulate Unix signals.
  12. IPC and networking: I/O multiplexing Frequently PostgreSQL waits on a

    set of file descriptors: a socket, a self-pipe used by signal handlers and a pipe connected to the postmaster. Switching from a poll()-based implementation dramatically reduced system time for larger servers with many connections, due to contention on the pipe. • Linux: epoll • *BSD, macOS: kqueue (since PG13, not yet released) • Other Un*x: poll • Windows: win32 primitives
  13. IPC and networking: TCP/IP One obscure difference in TCP implementations:

    Windows doesn’t allow you to read from a socket that has already failed to write. This means that if a remote host sends a message like “authentication failed, goodbye” and then closes the socket, and the protocol is one that requires the local host to begin sending data before receiving any message from the remote host, then the local host risks an ECONNRESET error when it tries to send, but can no longer read the “authentication failed, goodbye” message. Concretely, this causes us to lose some kinds of helpful context from logs when things go wrong on Windows, in protocols that support unsoliticed messages from both sides. (It may be that such a message is not guaranteed to be available on other systems either, but it usually is at least.)