Undo log hacking report

Undo log storage subsystem: prototype/proposal overview
PGCon unconference 2018
Thomas Munro

[Figure: Traditional heap. Labels: row, old version, free space]

[Figure: zheap. Labels: row, free space, old version]

[Figure: layer diagram. Labels: undo log storage; undo record API; zheap, ???, ??? (undo-aware access managers). Annotation: "<- I’m talking about this"]

Goals
To support zheap, we need undo logs that provide:
• efficient write access, optimised for many concurrent writers without contention; like logs
• efficient discarding: the usual outcome is that data is discarded without ever being written to disk; like a queue
• efficient read access through shared buffers, because the data they hold is needed for MVCC; older snapshots need to read it quickly; like a relation

Undo space allocation
[Figure: address range from 0000000000000000 to ffffffffffffffff, 18 exabytes]
• 64 bit address space modelled by type UndoRecPtr
• Only small fragments used at a time, and most data has a short life time
• How to keep track of live data?

Slightly too simple
[Figure: a single address range from 0000000000000000 to ffffffffffffffff (18 exabytes). Labels: discard, insert, live undo data]
• contention among inserting sessions due to overlapping buffer use at the insertion point

Solution: cut address space into arbitrary regions assigned to sessions that write
[Figure: an UndoRecPtr is split into a logno field and an offset field. log 0 spans 0000000000000000 to 000000ffffffffff, log 1 spans 0000010000000000 to 000001ffffffffff, log 2 spans 0000020000000000 to 000002ffffffffff; each log covers 1 terabyte]
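
The address ranges on this slide imply a layout in which the upper 24 bits of an UndoRecPtr hold the undo log number and the lower 40 bits hold the byte offset within that log. A minimal sketch of that split, assuming exactly that layout; the helper names below are made up for illustration and are not taken from the prototype:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, read off the example addresses on the slide:
 * upper 24 bits = undo log number, lower 40 bits = offset (1 TB per log). */
typedef uint64_t UndoRecPtr;

#define SKETCH_OFFSET_BITS 40
#define SKETCH_LOG_SIZE    (UINT64_C(1) << SKETCH_OFFSET_BITS)

/* Hypothetical helper: combine a log number and an offset into one address. */
static inline UndoRecPtr
sketch_make_undo_rec_ptr(uint32_t logno, uint64_t offset)
{
    return ((UndoRecPtr) logno << SKETCH_OFFSET_BITS) |
           (offset & (SKETCH_LOG_SIZE - 1));
}

/* Hypothetical helper: extract the undo log number. */
static inline uint32_t
sketch_undo_rec_ptr_logno(UndoRecPtr ptr)
{
    return (uint32_t) (ptr >> SKETCH_OFFSET_BITS);
}

/* Hypothetical helper: extract the offset within the undo log. */
static inline uint64_t
sketch_undo_rec_ptr_offset(UndoRecPtr ptr)
{
    return ptr & (SKETCH_LOG_SIZE - 1);
}
```

Because each session inserts into its own log, two sessions never share an insertion point, which is the contention the previous slide describes.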

Meta-data
postgres=# select * from pg_stat_undo_logs;
 log_number | persistence | tablespace |     discard      |      insert      |       end        | xid |  pid
------------+-------------+------------+------------------+------------------+------------------+-----+-------
          0 | permanent   | pg_default | 000000000000004A | 000000000000004A | 0000000000400000 | 559 | 56156
          1 | permanent   | pg_default | 00000100009C1908 | 00000100009C1908 | 0000010001000000 | 562 | 56163
          2 | permanent   | pg_default | 000002000000004A | 000002000000004A | 0000020000400000 | 563 | 56174
(3 rows)
• The meta-data used for space management within each undo log is: discard <= insert <= end. Discard and insert we have met; end shows unused space that has been allocated on disk.
• We also track the currently attached backend and xid, if there is one. These are visible in the pg_stat_undo_logs view.
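
The invariant discard <= insert <= end can be written down as a small check, together with the two quantities it implies: the live undo data between discard and insert, and the allocated but unused space between insert and end. This is only a sketch; the struct and function names are hypothetical, chosen to mirror the view's column names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t UndoRecPtr;

/* Hypothetical struct; fields mirror the discard/insert/end columns. */
typedef struct UndoLogMetaSketch
{
    UndoRecPtr discard;   /* everything before this has been discarded */
    UndoRecPtr insert;    /* next byte to be written */
    UndoRecPtr end;       /* end of space already allocated on disk */
} UndoLogMetaSketch;

/* The space-management invariant from the slide. */
static bool
sketch_meta_is_valid(const UndoLogMetaSketch *m)
{
    return m->discard <= m->insert && m->insert <= m->end;
}

/* Bytes of undo data that may still be needed for rollback or MVCC. */
static uint64_t
sketch_meta_live_bytes(const UndoLogMetaSketch *m)
{
    return m->insert - m->discard;
}

/* Bytes allocated on disk but not yet written. */
static uint64_t
sketch_meta_unused_bytes(const UndoLogMetaSketch *m)
{
    return m->end - m->insert;
}
```

For the first row of the view above, discard and insert are equal, so the log holds no live data even though 4MB of address space has been allocated.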

API
• Allocating and discarding undo data:
   UndoRecPtr UndoLogAllocate(size_t size, UndoPersistence persistence);
   void UndoLogDiscard(UndoRecPtr discard_point);
• Finding in shared buffers:
   UndoRecPtrAssignRelFileNode(relfilenode, undo_record_pointer)
   UndoRecPtrGetPageOffset(undo_record_pointer)
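
To make the allocate/discard pair concrete, here is a toy in-memory model of a single undo log. It is not the prototype's implementation: the real UndoLogAllocate also takes a persistence level and works on the session's attached log, and none of the names below come from the patch. The toy only shows how the two calls move the insert and discard pointers, and how end grows in whole 1MB segments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t UndoRecPtr;

#define SKETCH_SEGMENT_SIZE (UINT64_C(1) << 20)   /* 1MB segment files */

/* Hypothetical toy model of one undo log's pointers. */
typedef struct ToyUndoLog
{
    UndoRecPtr discard;
    UndoRecPtr insert;
    UndoRecPtr end;
} ToyUndoLog;

/* Reserve 'size' bytes at the insertion point and return their address.
 * Growing 'end' stands in for creating or recycling a segment file. */
static UndoRecPtr
toy_undo_allocate(ToyUndoLog *log, size_t size)
{
    UndoRecPtr start = log->insert;

    while (log->end < start + size)
        log->end += SKETCH_SEGMENT_SIZE;
    log->insert = start + size;
    return start;
}

/* Advance the discard pointer; data before it is no longer needed. */
static void
toy_undo_discard(ToyUndoLog *log, UndoRecPtr discard_point)
{
    assert(discard_point >= log->discard && discard_point <= log->insert);
    log->discard = discard_point;
}
```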

Persistence levels
• Each session can be attached to up to three undo logs at a given time, where it will write new data:
• A “permanent” one for undo data from persistent relations; discarded only when no longer needed for rollback or MVCC
• An “unlogged” one for undo data from unlogged relations; as above but also deleted on startup after crash
• A “temporary” one for undo data from temporary relations; temporary buffers, deleted at startup
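
One way to picture the three levels is as a mapping from a relation's persistence to the undo log a session would write to. The sketch below assumes the relation's persistence is given as the pg_class.relpersistence character ('p', 'u' or 't'); the enum and function names are invented for illustration and are not the prototype's UndoPersistence definition.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the UndoPersistence type named on the API slide. */
typedef enum UndoPersistenceSketch
{
    UNDO_SKETCH_PERMANENT,
    UNDO_SKETCH_UNLOGGED,
    UNDO_SKETCH_TEMP
} UndoPersistenceSketch;

/* Choose the undo log level from pg_class.relpersistence:
 * 'p' = permanent, 'u' = unlogged, 't' = temporary. */
static UndoPersistenceSketch
sketch_undo_persistence_for(char relpersistence)
{
    switch (relpersistence)
    {
        case 'u':
            return UNDO_SKETCH_UNLOGGED;
        case 't':
            return UNDO_SKETCH_TEMP;
        default:
            return UNDO_SKETCH_PERMANENT;
    }
}

/* Only permanent undo data is kept across a crash; the other two
 * levels are deleted at startup, as described on the slide. */
static bool
sketch_undo_survives_crash(UndoPersistenceSketch level)
{
    return level == UNDO_SKETCH_PERMANENT;
}
```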

Files
• The name of each 1MB file is the UndoRecPtr address of the first byte in the file, with a dot inserted to separate the undo log number from the rest
• When discarding files, we usually just rename them into position, so that they become new space (similar to what we do for WAL segments); this usually happens in the undo worker
• This means that foreground processes usually avoid having to do slow filesystem operations
$ ls -slaph base/undo/ | head -7
total 139264
   0 drwx------  70 munro  staff   2.2K 26 Mar 09:35 ./
   0 drwx------   7 munro  staff   224B 26 Mar 09:33 ../
2048 -rw-------   1 munro  staff   1.0M 26 Mar 09:38 000000.0000600000
2048 -rw-------   1 munro  staff   1.0M 26 Mar 09:33 000000.0000700000
2048 -rw-------   1 munro  staff   1.0M 26 Mar 09:38 000001.0000600000
2048 -rw-------   1 munro  staff   1.0M 26 Mar 09:33 000001.0000700000
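
The naming rule can be sketched as a formatting function: take any address, round its offset down to the start of its 1MB segment, and print six hex digits of log number, a dot, and ten hex digits of offset. The digit counts are read off the listing above and the 24/40-bit split is an assumption based on the address-space slide; the function name is hypothetical.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint64_t UndoRecPtr;

#define SKETCH_OFFSET_BITS  40                     /* assumed: 1 TB per log */
#define SKETCH_SEGMENT_SIZE (UINT64_C(1) << 20)    /* 1MB segment files */

/* Hypothetical helper: write the name of the segment file that holds 'ptr'. */
static void
sketch_undo_segment_name(char *buf, size_t buflen, UndoRecPtr ptr)
{
    uint32_t logno = (uint32_t) (ptr >> SKETCH_OFFSET_BITS);
    uint64_t offset = ptr & ((UINT64_C(1) << SKETCH_OFFSET_BITS) - 1);
    uint64_t seg_start = offset - (offset % SKETCH_SEGMENT_SIZE);

    snprintf(buf, buflen, "%06" PRIX32 ".%010" PRIX64, logno, seg_start);
}
```

With this naming, recycling a segment is a rename from one segment's starting address to another's, with no data copied.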

File operations
• WAL records generated when filesystem operations happen (creating, unlinking, renaming segment files)
• Filesystem operations are synchronous and must be fsync()ed, but they usually happen in the background
• Note that changes to insert pointers are not WAL logged explicitly!

Segment recycling
• We currently try to recycle one spare segment in the background whenever discarding
• If transactions generate less than 1MB of undo log each and there are no long running snapshots, we can continually discard data, rename files and otherwise mostly avoid touching the filesystem
• Projected result: about 2MB of disk footprint per active backend on OLTP workload, very little IO if shared buffers big enough and checkpoints infrequent
2018-04-11 15:53:54.602 NZST [58692] LOG: recycled undo segment "base/undo/000004.0000000000" -> "base/undo/000004.0000200000"
2018-04-11 15:54:04.245 NZST [58692] LOG: recycled undo segment "base/undo/000000.0000100000" -> "base/undo/000000.0000300000"
2018-04-11 15:54:05.249 NZST [58692] LOG: recycled undo segment "base/undo/000003.0000100000" -> "base/undo/000003.0000300000"
2018-04-11 15:54:05.250 NZST [58692] LOG: recycled undo segment "base/undo/000005.0000100000" -> "base/undo/000005.0000300000"
2018-04-11 15:54:05.451 NZST [58692] LOG: recycled undo segment "base/undo/000001.0000100000" -> "base/undo/000001.0000300000"
2018-04-11 15:54:05.552 NZST [58692] LOG: recycled undo segment "base/undo/000006.0000100000" -> "base/undo/000006.0000300000"
2018-04-11 15:54:05.652 NZST [58692] LOG: recycled undo segment "base/undo/000002.0000100000" -> "base/undo/000002.0000300000"
2018-04-11 15:54:05.854 NZST [58692] LOG: recycled undo segment "base/undo/000007.0000100000" -> "base/undo/000007.0000300000"
2018-04-11 15:54:06.256 NZST [58692] LOG: recycled undo segment "base/undo/000004.0000100000" -> "base/undo/000004.0000300000"
2018-04-11 15:54:16.805 NZST [58692] LOG: recycled undo segment "base/undo/000000.0000200000" -> "base/undo/000000.0000400000"
2018-04-11 15:54:17.307 NZST [58692] LOG: recycled undo segment "base/undo/000003.0000200000" -> "base/undo/000003.0000400000"
2018-04-11 15:54:17.709 NZST [58692] LOG: recycled undo segment "base/undo/000005.0000200000" -> "base/undo/000005.0000400000"
2018-04-11 15:54:17.811 NZST [58692] LOG: recycled undo segment "base/undo/000001.0000200000" -> "base/undo/000001.0000400000"
2018-04-11 15:54:17.811 NZST [58692] LOG: recycled undo segment "base/undo/000006.0000200000" -> "base/undo/000006.0000400000"
2018-04-11 15:54:17.812 NZST [58692] LOG: recycled undo segment "base/undo/000007.0000200000" -> "base/undo/000007.0000400000"
2018-04-11 15:54:18.515 NZST [58692] LOG: recycled undo segment "base/undo/000002.0000200000" -> "base/undo/000002.0000400000"
2018-04-11 15:54:18.917 NZST [58692] LOG: recycled undo segment "base/undo/000004.0000200000" -> "base/undo/000004.0000400000"
2018-04-11 15:54:29.463 NZST [58692] LOG: recycled undo segment "base/undo/000000.0000300000" -> "base/undo/000000.0000500000"
[Figure: segment files labelled …200000, …100000, …00000]

Tablespaces
postgres=# create tablespace ts1 location '/tmp/ts1';
CREATE TABLESPACE
postgres=# set undo_tablespaces = ts1;
SET
postgres=# insert into foo values (42);
INSERT 0 1
postgres=# select * from pg_stat_undo_logs where tablespace = 'ts1';
 log_number | persistence | tablespace |     discard      |      insert      |       end        |  xid   |  pid
------------+-------------+------------+------------------+------------------+------------------+--------+-------
         60 | permanent   | ts1        | 00003C0000000018 | 00003C0000000018 | 00003C0000100000 | 189257 | 46137
(1 row)
postgres=# drop tablespace ts1;
DROP TABLESPACE
2018-03-28 15:44:50.265 NZDT [46137] LOG: created undo segment "pg_tblspc/16416/PG_11_201802061/undo/00003C.0000000000"
• GUC “undo_tablespaces” controls where your session writes undo data (similar to “temp_tablespaces”)
• A tablespace can only be dropped when the undo logs it contains are empty (no attached transactions in progress, fully discarded); attached sessions will be forcibly detached

Buffers
Recap of “steal, no force” buffering as used in PostgreSQL:
• “steal”: if you need a buffer and none are free, you steal one (write it out to disk if dirty); “no force”: committing doesn’t require writing out dirty buffers
• buffers are written out by checkpoints and by “stealing” (= memory pressure); otherwise they don’t have to be written to disk
• the checkpointer calls fsync() at appropriate times
• We want all this existing machinery for free for our undo logs!
• We also need a new way to “forget” buffers holding discarded data, to avoid all IO completely if we’re lucky
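
The last bullet asks for a way to forget buffers whose contents have been discarded. As a rough sketch of what such a step would have to compute (not the prototype's code), the function below works out which whole pages fall below a new discard point, assuming PostgreSQL's default 8192-byte block size and offsets within a single undo log. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BLCKSZ 8192   /* PostgreSQL's default block size */

/* Hypothetical result type: a run of page numbers within one undo log. */
typedef struct SketchPageRange
{
    uint64_t first;   /* first page that can be forgotten */
    uint64_t count;   /* number of pages that can be forgotten */
} SketchPageRange;

/* Pages that lie entirely below new_discard can be dropped from the
 * buffer pool without being written out, even if they are dirty. */
static SketchPageRange
sketch_pages_to_forget(uint64_t old_discard, uint64_t new_discard)
{
    SketchPageRange range;
    uint64_t first_live_page = new_discard / SKETCH_BLCKSZ;

    range.first = old_discard / SKETCH_BLCKSZ;
    range.count = first_live_page > range.first
        ? first_live_page - range.first
        : 0;
    return range;
}
```

The page that contains the new discard point is kept, because it may still hold live data after that point.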

POSTGRES 4.2 (1994)
[Figure: bufmgr.c (buffer pool) sits on top of smgr.c, which dispatches to md.c (magnetic disk), mm.c (main memory) and sj.c (Sony “Jukebox”). Example relations: customers, current_totals, sales]
Each relation was associated with a “storage manager”.

PostgreSQL 11
[Figure: bufmgr.c (buffer pool) sits on top of smgr.c, which dispatches to md.c only (relation files). Example relations: customers, current_totals, sales]
The “storage manager” API layer remains, but there is only a single implementation.

Enter undo logs
[Figure: bufmgr.c (buffer pool) sits on top of smgr.c, which dispatches to md.c (relation files: customers, current_totals, sales) and to undofile.c (undo files: undo log 0, undo log 1)]
We can use this API to provide a new storage manager to the buffer manager!

Mapping buffer pages to storage managers
if (rnode.dbNode == 9)
    reln->smgr_which = 1; /* use undofile.c implementation */
else
    reln->smgr_which = 0; /* use md.c implementation */
Yeah, we could probably do better than this…

Buffer life cycle
• If discarding at same rate as inserting (pgbench):
   • rarely write undo data to disk (only at checkpoints)
   • recycle same 1-2 buffers constantly
• If not able to discard (long lived snapshot):
   • compete with other buffer pool contents
   • … need ring? … different page reclamation?

[Figure: shared memory and DSM segments holding bank 0, bank 1 and bank 2. Each entry (undo log 0, undo log 1, undo log 2, …) shows insert = 000000000069BF50, discard = 000000000069BF50, end = 0000000000800000]
• Conceptually we need an array of UndoLogControl objects in shared memory, for fast access to undo log meta-data by undo log number. An array would be too big; a dynamic hash table might work, but instead we cut the array into many “banks”, since active undo log numbers are clustered together; only map in the banks you need
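
The bank scheme amounts to a two-level lookup: the undo log number selects a bank, and a slot within that bank. A minimal single-process sketch follows, in which calloc() stands in for mapping a DSM segment; the bank size, bank count and all names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_LOGS_PER_BANK 1024   /* assumed bank size */
#define SKETCH_MAX_BANKS     16     /* assumed limit, kept small for the sketch */

/* Hypothetical stand-in for the UndoLogControl objects on the slide. */
typedef struct UndoLogControlSketch
{
    uint64_t discard;
    uint64_t insert;
    uint64_t end;
} UndoLogControlSketch;

/* One pointer per bank; NULL until that bank has been mapped in. */
static UndoLogControlSketch *sketch_banks[SKETCH_MAX_BANKS];

/* Find the control object for a log number, mapping its bank on first use.
 * Returns NULL if the log number is out of range or memory runs out. */
static UndoLogControlSketch *
sketch_undo_log_control(uint32_t logno)
{
    uint32_t bankno = logno / SKETCH_LOGS_PER_BANK;
    uint32_t slot = logno % SKETCH_LOGS_PER_BANK;

    if (bankno >= SKETCH_MAX_BANKS)
        return NULL;
    if (sketch_banks[bankno] == NULL)
    {
        /* calloc() stands in for attaching the bank's DSM segment. */
        sketch_banks[bankno] = calloc(SKETCH_LOGS_PER_BANK,
                                      sizeof(UndoLogControlSketch));
        if (sketch_banks[bankno] == NULL)
            return NULL;
    }
    return &sketch_banks[bankno][slot];
}
```

Since active log numbers cluster together, a backend typically touches only one or two banks and never pays for the rest of the array.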

Checkpoints
• Whenever a checkpoint occurs, we dump the contents of all meta-data from shared memory into an undo log checkpoint file under pg_undo.
• For shutdown checkpoints, these are by definition consistent (no concurrent activity is allowed)
• For online checkpoints, these contain a snapshot of each undo log’s meta-data from some arbitrary moment after the redo point, which is a problem…

Meta-data in the WAL
• The first time any zheap WAL record is written to each undo log after a checkpoint (i.e. after the redo point of a checkpoint), we first write an undo log ‘meta data’ record, which will compensate for the inconsistencies in the undo checkpoint file. All writes after that can omit the location because it’s implied, reducing WAL size.
• When I showed this slide at the PGCon unconference, I had something here about why we don’t need full_page_writes, but that was wrong: we do need them, but we can probably use REGBUF_WILL_INIT or something similar to avoid them almost always; I’m looking into that…
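
The "first write after a checkpoint" rule can be modelled with one flag per undo log that is set at the redo point and cleared by the first write. The sketch below is a single-process toy with invented names and a fixed number of logs; it is not how the prototype tracks this state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_LOGS 8   /* assumed small fixed number of undo logs */

/* true = the next WAL record for this log must be preceded by a
 * meta-data record (hypothetical bookkeeping). */
static bool sketch_need_meta_record[SKETCH_MAX_LOGS];

/* Called when a checkpoint establishes its redo point. */
static void
sketch_on_checkpoint_redo_point(void)
{
    for (int i = 0; i < SKETCH_MAX_LOGS; i++)
        sketch_need_meta_record[i] = true;
}

/* Called before writing a WAL record for 'logno'. Returns true exactly
 * once per log per checkpoint cycle, which is when the meta-data record
 * should be emitted; later writes can omit the location. */
static bool
sketch_must_log_meta_data(uint32_t logno)
{
    if (logno >= SKETCH_MAX_LOGS)
        return false;
    if (sketch_need_meta_record[logno])
    {
        sketch_need_meta_record[logno] = false;
        return true;
    }
    return false;
}
```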

Sessions -> transactions
• At “DO” time, there is an association between sessions (= backends) and undo logs (i.e. currently attached)
• At “REDO” time, sessions are gone: everything will be replayed by the start-up process. So we maintain an xid -> undo log mapping during recovery
• The first time any transaction writes to any undo log, it writes an “attach” record in the WAL
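
During recovery the attachment has to be looked up by transaction rather than by session. A minimal sketch of such an xid -> undo log table follows: replaying an “attach” record adds an entry, later records for that xid look it up, and the entry is dropped when the transaction ends. The fixed-size linear table and all function names are assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define SKETCH_MAX_XACTS 64   /* assumed capacity of the toy table */

typedef struct SketchXidLogEntry
{
    bool          used;
    TransactionId xid;
    uint32_t      logno;
} SketchXidLogEntry;

static SketchXidLogEntry sketch_xid_map[SKETCH_MAX_XACTS];

/* Replay of an "attach" record: remember which log this xid writes to.
 * Returns false if the table is full. */
static bool
sketch_redo_attach(TransactionId xid, uint32_t logno)
{
    int free_slot = -1;

    for (int i = 0; i < SKETCH_MAX_XACTS; i++)
    {
        if (sketch_xid_map[i].used && sketch_xid_map[i].xid == xid)
        {
            sketch_xid_map[i].logno = logno;
            return true;
        }
        if (!sketch_xid_map[i].used && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return false;
    sketch_xid_map[free_slot].used = true;
    sketch_xid_map[free_slot].xid = xid;
    sketch_xid_map[free_slot].logno = logno;
    return true;
}

/* Find the log an xid is attached to; returns false if there is none. */
static bool
sketch_redo_lookup(TransactionId xid, uint32_t *logno)
{
    for (int i = 0; i < SKETCH_MAX_XACTS; i++)
    {
        if (sketch_xid_map[i].used && sketch_xid_map[i].xid == xid)
        {
            *logno = sketch_xid_map[i].logno;
            return true;
        }
    }
    return false;
}

/* Drop the mapping once the transaction's end has been replayed. */
static void
sketch_redo_forget(TransactionId xid)
{
    for (int i = 0; i < SKETCH_MAX_XACTS; i++)
    {
        if (sketch_xid_map[i].used && sketch_xid_map[i].xid == xid)
            sketch_xid_map[i].used = false;
    }
}
```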
