PostgreSQL Portability Tales

Slide 1

Slide 1 text

PostgreSQL Portability Tales Warsaw PostgreSQL Users Group via video conference Thomas Munro, 2020-04-16 [email protected] [email protected] [email protected]

Slide 2

Slide 2 text

Talk topics ● Introduction ○ Who cares about portability? ○ Instruction set architectures ● Memory ○ Simulating fork() ○ Page size effects and control ○ Unexpected allocator effects ● Buffered I/O ○ Error reporting variations ○ Prefetching heuristics and control ○ Writeback heuristics and control ○ Towards async & direct I/O ● IPC and networking ○ Signals ○ I/O multiplexing ○ TCP/IP

Slide 3

Slide 3 text

Introduction: Who cares about portability? ● Original POSTGRES development platforms at UC Berkeley seem to have been: SunOS, Sequent Dynix (= a BSD fork with large scale SMP), BSD ● Today, developers mostly use Linux or macOS, a few use FreeBSD or Windows ● According to buildfarm.postgresql.org, our community currently cares about: ○ { Debian, Ubuntu, SUSE, RHEL, CentOS, Fedora, Photon, Raspbian, Arch, Amazon } Linux ○ { Free, Open, Net, DragonFly } BSD ○ macOS ○ Solaris, Illumos ○ AIX ○ HPUX ○ Windows ● Over the years we’ve dropped support for many others (RIP): ○ IRIX, SCO, DGUX, Ultrix, NeXTSTEP, SunOS4, SVR4, UnixWare, QNX, ...

Slide 4

Slide 4 text

Introduction: Instruction set architectures ● Modern and ancient ISAs we have in the build farm: ○ X86 ○ ARM ○ POWER ○ S390 ○ IA64 ○ SPARC ○ MIPS ○ PA-RISC ● Modern ISAs that we don’t explicitly support yet -- build farm machine wanted! ○ RISC-V ● There’s explicit code to support various other long dead systems, but no build farm so status unknown to me (M68K, M88K, VAX)

Slide 5

Slide 5 text

Memory: Simulating fork() ● PostgreSQL is still process based after all these decades (one day someone will attack that…). ● On Unix, the main shared memory area is inherited by children of the postmaster; we count on address stability in that region. ● Windows has no fork(). Child processes have to recreate all required state from a ﬁle, and try to attach the main shared memory area at the same address. ● For those of us working on Unix and wanting to test or extend that mechanism, it’s possible to build PostgreSQL that way on Unix too, with -DEXEC_BACKEND. ● Windows support must have taken Herculean eﬀort. Added in 8.0 (2005).

Slide 6

Slide 6 text

Memory: Page size effects and control ● Linux ○ The “transparent huge pages” support is currently discouraged by PostgreSQL docs; one day?. ○ We try to map the main shared memory segment using huge pages (if configured), for performance ○ Putting the text (executable) into huge pages is known to speed up benchmarks by several percent, but requires some special magic not done out of the box (see libhugetlbfs linker scripts) ○ Putting DSM segments into huge pages is known to speed up some parallel queries, especially parallel hash joins, significantly, but requires some extra configuration (/dev/shm mount options) ● FreeBSD (and illumos too?) ○ Most of the above happens transparently out of the box (but further improvements are possible*) ● Windows ○ We try to map the main shared memory segment only using huge pages (requires privileges) ● Future research ○ Different page size? AIX support proposed. Reuse DSMs better, probably preallocated. *See “On the Impact of Instruction Address Translation Overhead”, Zhou, Dong, Cox, Dwarkadas

Slide 7

Slide 7 text

Memory: Unexpected allocator effects PostgreSQL allocates memory for hash joins 32kB + a tiny header at a time. What does libc malloc() do with this stream of requests? ● Glibc (usually used on Linux): eﬃciently packs memory ● Jemalloc (FreeBSD): requests 32kb < x < 40kB are rounded up to 40kB, so hash joins eat 25% more memory on this platform! Oops. ● macOS: requests are rounded up to nearest 256 bytes ● Other OSs: who knows? We don’t have to use 32kB + a tiny header, but it’s hard to second-guess every system out there. It would probably be wise to use exact powers of two! Not done yet...

Slide 8

Slide 8 text

Buffered I/O: Error reporting variations (1) What is the status of buffered data after fsync() returns EIO or ENOSPC (NFS)? ● Linux: dirty buffers may be marked clean and dropped immediately or eventually (depending on filesystem), so that they will never be written back; errors are “consumed” when they are reported, so a new call to fsync() might report success and a call to read() might read and earlier state of the data ● FreeBSD (and maybe illumos?): dirty buffers remain dirty and can’t be replaced (until the device goes away, or writeback succeeds) (2) What if a writeback error occurs while no process holds the file open, and you later open the same file and call fsync()? ● Linux: an error flag persists so that at least one later fsync() call will report the error, but only as long as the inode remains in kernel memory ● FreeBSD (and maybe illumos?): see previous answer The PostgreSQL project had assumed FreeBSD behaviour until 2018. Now fsync() errors cause a panic, to fix (1). More work is needed for (2).

Slide 9

Slide 9 text

Buffered I/O: Prefetching heuristics and control What happens if you hand out sequential block number to worker processes to read, like in Parallel Sequential Scan? ● Linux: The kernel detects sequential access from all the workers, as long as their step size is < window size, default 128kB, and generates large I/Os. ● FreeBSD UFS: It doesn’t detect sequential access, and performance is terrible. ● FreeBSD ZFS: Seems to be much smarter (but I don’t understand exactly why). What happens if one process interleaves two sequential streams of blocks, one reading and one writing? ● Linux and FreeBSD UFS: It doesn’t detect sequential access, and performance is terrible. ● ZFS doesn’t have this problem (?) What happens if we use posix_fadvise(POSIX_FADV_WILLNEED) to tell the kernel about our future random read plans? ● Linux: It works nicely, to start asynchronous reads., on in-tree ﬁlesystems (not ZFS). ● FreeBSD: Ignored (though other advice works). ● Illumos/Solaris: It exists but is ignored (no syscall). ● macOS, Windows: Not provided at all. Currently this improves the performance of bitmap heap scans and VACUUM. In future we might use it for more things (index scans, recovery/replication speed improvements). This is a type of “poor man’s async I/O”. I created WIP pull request #9807 for openzfs to make this work on ZFS on Linux and FreeBSD -- but it needs more work!

Slide 10

Slide 10 text

Buffered I/O: Writeback heuristics and control On Linux, we use sync_file_range() without waiting, to ask the kernel to initiate writeback of specific ranges of dirty files. This is a hint about which files we’ll be calling fsync() on, to spread the writing out over a longer time, without having other files also written back (temporary files which we’ll never fsync(), files belonging to unrelated software on the same server). No other system has a similar system call, and there are no great ways to emulate it. Perhaps there should be POSIX_FADV_WILLSYNC?

Slide 11

Slide 11 text

Buffered I/O: Towards async and direct I/O ● POSTGRES originated at a time when there was no direct or async I/O; its authors wrote papers complaining about double buffering and other problems ● It’s possible that its authors even requested that fsync() be added to 4.2BSD (pure speculation on my part, given they were developed around the same time and place… how did anyone make reliable databases on Unix before that?) ● Async I/O will require a lot of new portability work: Linux’s POSIX and non-POSIX aio implementations are not usable by us for various reasons, but its brand new io_uring looks intriguing; Windows has a completely different model; for FreeBSD and friends we’d have to use POSIX aio. ● Direct I/O is not standardised, and will require a lot of new machinery inside PostgreSQL to replace the read-ahead, write-behind, buffering heuristics of the kernel. ● We’d still need to support buffered I/O for filesystems like ZFS, and synchronous I/O for OSes that we don’t have an async implementation. ● All of this will be a lot of work!

Slide 12

Slide 12 text

IPC and networking: Signals Linux and FreeBSD have a “parent death signal” facility (inspired by an IRIX feature). Why do other Unices not have a thing like this? Without it, we expend a lot of system time polling a pipe so that processes can exit quickly if the postmaster process goes away. For Windows we have a complicated arrangement to simulate Unix signals.

Slide 13

Slide 13 text

IPC and networking: I/O multiplexing Frequently PostgreSQL waits on a set of ﬁle descriptors: a socket, a self-pipe used by signal handlers and a pipe connected to the postmaster. Switching from a poll()-based implementation dramatically reduced system time for larger servers with many connections, due to contention on the pipe. ● Linux: epoll ● *BSD, macOS: kqueue (since PG13, not yet released) ● Other Un*x: poll ● Windows: win32 primitives

Slide 14

Slide 14 text

IPC and networking: TCP/IP One obscure diﬀerence in TCP implementations: Windows doesn’t allow you to read from a socket that has already failed to write. This means that if a remote host sends a message like “authentication failed, goodbye” and then closes the socket, and the protocol is one that requires the local host to begin sending data before receiving any message from the remote host, then the local host risks an ECONNRESET error when it tries to send, but can no longer read the “authentication failed, goodbye” message. Concretely, this causes us to lose some kinds of helpful context from logs when things go wrong on Windows, in protocols that support unsoliticed messages from both sides. (It may be that such a message is not guaranteed to be available on other systems either, but it usually is at least.)

Slide 15

Slide 15 text

EOF Questions?