Undo log hacking report

An "unconference" talk I gave at PGCon 2018 in Ottawa. This is about an ongoing project to support ARIES-plus-MVCC in future PostgreSQL releases.

Thomas Munro

May 30, 2018

Transcript

  1. Undo log storage subsystem prototype/proposal overview, PGCon unconference 2018, Thomas Munro
  2. Traditional heap
    [Diagram: rows, old versions and free space interleaved in the same heap.]
  3. zheap
    [Diagram: rows and free space grouped together, with old versions shown separately.]
  4. [Diagram labels: undo log storage; undo record API; zheap, ???, ??? (undo-aware access managers). An arrow captioned “I’m talking about this” marks the subject of the talk.]
  5. Goals
    To support zheap, we need undo logs that provide:
    • efficient write access, optimised for many concurrent writers without contention; like logs
    • efficient discarding: the usual outcome is that data is discarded without ever being written to disk; like a queue
    • efficient read access through shared buffers, because the data they hold is needed for MVCC; older snapshots need to read it quickly; like a relation
  6. Undo space allocation
    [Diagram: address line from 0000000000000000 to ffffffffffffffff, 18 exabytes]
    • 64-bit address space modelled by type UndoRecPtr
    • Only small fragments are used at a time, and most data has a short lifetime
    • How to keep track of live data?
  7. Slightly too simple
    [Diagram: address line from 0000000000000000 to ffffffffffffffff (18 exabytes), with one region of live undo data marked by discard and insert pointers]
    • contention among inserting sessions due to overlapping buffer use at the insertion point
  8. Solution: cut address space into arbitrary regions assigned to sessions that write
    [Diagram: each address is split into logno and offset fields; each undo log covers 1 terabyte]
    log 0: 0000000000000000 to 000000ffffffffff
    log 1: 0000010000000000 to 000001ffffffffff
    log 2: 0000020000000000 to 000002ffffffffff
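The address ranges on this slide suggest that the top 24 bits of an UndoRecPtr hold the undo log number and the low 40 bits hold the offset within that log (2^40 bytes is 1 terabyte). A minimal sketch of that split follows. The 24/40 split is inferred from the slide, and the function and macro names are hypothetical, not the names used in the actual patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 24-bit log number, 40-bit offset, as the slide's ranges imply. */
typedef uint64_t UndoRecPtr;

#define UNDO_OFFSET_BITS 40
#define UNDO_OFFSET_MASK ((UINT64_C(1) << UNDO_OFFSET_BITS) - 1)
#define UNDO_MAX_LOGNO   ((UINT64_C(1) << 24) - 1)

/* Combine a log number and an offset into one 64-bit address (hypothetical name). */
static inline UndoRecPtr
make_undo_rec_ptr(uint32_t logno, uint64_t offset)
{
    assert(logno <= UNDO_MAX_LOGNO);
    assert(offset <= UNDO_OFFSET_MASK);
    return ((UndoRecPtr) logno << UNDO_OFFSET_BITS) | offset;
}

/* Extract the undo log number from the high bits (hypothetical name). */
static inline uint32_t
undo_rec_ptr_logno(UndoRecPtr ptr)
{
    return (uint32_t) (ptr >> UNDO_OFFSET_BITS);
}

/* Extract the byte offset within the undo log (hypothetical name). */
static inline uint64_t
undo_rec_ptr_offset(UndoRecPtr ptr)
{
    return ptr & UNDO_OFFSET_MASK;
}
```

Because each session writes only into the log it is attached to, two sessions never compete for the same insertion point, which is the contention problem the previous slide describes.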
  9. Meta-data
    postgres=# select * from pg_stat_undo_logs;
     log_number | persistence | tablespace |     discard      |      insert      |       end        | xid |  pid
    ------------+-------------+------------+------------------+------------------+------------------+-----+-------
              0 | permanent   | pg_default | 000000000000004A | 000000000000004A | 0000000000400000 | 559 | 56156
              1 | permanent   | pg_default | 00000100009C1908 | 00000100009C1908 | 0000010001000000 | 562 | 56163
              2 | permanent   | pg_default | 000002000000004A | 000002000000004A | 0000020000400000 | 563 | 56174
    (3 rows)
    • The meta-data used for space management within each undo log is: discard <= insert <= end. Discard and insert we have met; end shows unused space that has been allocated on disk.
    • We also track the currently attached backend and xid, if there is one. These are visible in the pg_stat_undo_logs view.
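The ordering discard <= insert <= end can be read as two byte counts: live undo data lies between discard and insert, and space allocated on disk but not yet used lies between insert and end. The sketch below is only an illustration. The struct and function names are hypothetical, and the field names are borrowed from the pg_stat_undo_logs columns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical struct mirroring three columns of pg_stat_undo_logs. */
typedef struct UndoLogMeta
{
    uint64_t discard;   /* everything before this has been discarded */
    uint64_t insert;    /* where the next undo data will be written */
    uint64_t end;       /* end of the space allocated on disk */
} UndoLogMeta;

/* Check the invariant stated on the slide. */
static bool
undo_log_meta_is_valid(const UndoLogMeta *m)
{
    return m->discard <= m->insert && m->insert <= m->end;
}

/* Bytes of undo data that have been written and not yet discarded. */
static uint64_t
undo_log_live_bytes(const UndoLogMeta *m)
{
    assert(undo_log_meta_is_valid(m));
    return m->insert - m->discard;
}

/* Bytes allocated on disk that no record occupies yet. */
static uint64_t
undo_log_unused_bytes(const UndoLogMeta *m)
{
    assert(undo_log_meta_is_valid(m));
    return m->end - m->insert;
}
```

For log 1 in the output above, discard equals insert, so there is no live undo data, and the remaining 0x63E6F8 bytes up to end are allocated but unused.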
  10. API
    • Allocating and discarding undo data:
        UndoRecPtr UndoLogAllocate(size_t size, UndoPersistence persistence);
        void UndoLogDiscard(UndoRecPtr discard_point);
    • Finding in shared buffers:
        UndoRecPtrAssignRelFileNode(relfilenode, undo_record_pointer)
        UndoRecPtrGetPageOffset(undo_record_pointer)
  11. Persistence levels
    • Each session can be attached to up to three undo logs at a given time, where it will write new data:
        • A “permanent” one for undo data from persistent relations; discarded only when no longer needed for rollback or MVCC
        • An “unlogged” one for undo data from unlogged relations; as above, but also deleted on startup after a crash
        • A “temporary” one for undo data from temporary relations; temporary buffers, deleted at startup
  12. Files
    • The name of each 1MB file is the UndoRecPtr address of the first byte in the file, with a dot inserted to separate the undo log number from the rest
    • When discarding files, we usually just rename them into position, so that they become new space (similar to what we do for WAL segments); this usually happens in the undo worker
    • This means that foreground processes usually avoid having to do slow filesystem operations
    $ ls -slaph base/undo/ | head -7
    total 139264
       0 drwx------  70 munro  staff   2.2K 26 Mar 09:35 ./
       0 drwx------   7 munro  staff   224B 26 Mar 09:33 ../
    2048 -rw-------   1 munro  staff   1.0M 26 Mar 09:38 000000.0000600000
    2048 -rw-------   1 munro  staff   1.0M 26 Mar 09:33 000000.0000700000
    2048 -rw-------   1 munro  staff   1.0M 26 Mar 09:38 000001.0000600000
    2048 -rw-------   1 munro  staff   1.0M 26 Mar 09:33 000001.0000700000
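The naming rule can be sketched as a small formatting helper. The file names in the listing have six hex digits for the log number and ten for the offset, which matches a 24-bit log number and a 40-bit offset. That split is an inference from the listing, and the helper name is hypothetical.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define UNDO_OFFSET_BITS  40                    /* assumption, see lead-in */
#define UNDO_SEGMENT_SIZE (UINT64_C(1) << 20)   /* 1MB per file, per the slide */

/*
 * Write the name of the segment file containing the given undo address
 * into buf. Returns the snprintf() result. Hypothetical helper.
 */
static int
undo_segment_file_name(char *buf, size_t len, uint64_t ptr)
{
    uint64_t logno = ptr >> UNDO_OFFSET_BITS;
    uint64_t offset = ptr & ((UINT64_C(1) << UNDO_OFFSET_BITS) - 1);
    /* Round down to the first byte of the 1MB segment. */
    uint64_t segment_start = offset - offset % UNDO_SEGMENT_SIZE;

    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%06" PRIX64 ".%010" PRIX64,
                    logno, segment_start);
}
```

Under these assumptions, any address in log 0 between offsets 0x600000 and 0x6FFFFF maps to the file 000000.0000600000 shown in the listing.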
  13. File operations
    • WAL records generated when filesystem operations happen (creating, unlinking, renaming segment files)
    • Filesystem operations are synchronous and must be fsync()ed, but they usually happen in the background
    • Note that changes to insert pointers are not WAL logged explicitly!
  14. Segment recycling
    • We currently try to recycle one spare segment in the background whenever discarding
    • If transactions generate less than 1MB of undo log each and there are no long running snapshots, we can continually discard data, rename files and otherwise mostly avoid touching the filesystem
    • Projected result: about 2MB of disk footprint per active backend on OLTP workload, very little IO if shared buffers big enough and checkpoints infrequent
    2018-04-11 15:53:54.602 NZST [58692] LOG: recycled undo segment "base/undo/000004.0000000000" -> "base/undo/000004.0000200000"
    2018-04-11 15:54:04.245 NZST [58692] LOG: recycled undo segment "base/undo/000000.0000100000" -> "base/undo/000000.0000300000"
    2018-04-11 15:54:05.249 NZST [58692] LOG: recycled undo segment "base/undo/000003.0000100000" -> "base/undo/000003.0000300000"
    2018-04-11 15:54:05.250 NZST [58692] LOG: recycled undo segment "base/undo/000005.0000100000" -> "base/undo/000005.0000300000"
    2018-04-11 15:54:05.451 NZST [58692] LOG: recycled undo segment "base/undo/000001.0000100000" -> "base/undo/000001.0000300000"
    2018-04-11 15:54:05.552 NZST [58692] LOG: recycled undo segment "base/undo/000006.0000100000" -> "base/undo/000006.0000300000"
    2018-04-11 15:54:05.652 NZST [58692] LOG: recycled undo segment "base/undo/000002.0000100000" -> "base/undo/000002.0000300000"
    2018-04-11 15:54:05.854 NZST [58692] LOG: recycled undo segment "base/undo/000007.0000100000" -> "base/undo/000007.0000300000"
    2018-04-11 15:54:06.256 NZST [58692] LOG: recycled undo segment "base/undo/000004.0000100000" -> "base/undo/000004.0000300000"
    2018-04-11 15:54:16.805 NZST [58692] LOG: recycled undo segment "base/undo/000000.0000200000" -> "base/undo/000000.0000400000"
    2018-04-11 15:54:17.307 NZST [58692] LOG: recycled undo segment "base/undo/000003.0000200000" -> "base/undo/000003.0000400000"
    2018-04-11 15:54:17.709 NZST [58692] LOG: recycled undo segment "base/undo/000005.0000200000" -> "base/undo/000005.0000400000"
    2018-04-11 15:54:17.811 NZST [58692] LOG: recycled undo segment "base/undo/000001.0000200000" -> "base/undo/000001.0000400000"
    2018-04-11 15:54:17.811 NZST [58692] LOG: recycled undo segment "base/undo/000006.0000200000" -> "base/undo/000006.0000400000"
    2018-04-11 15:54:17.812 NZST [58692] LOG: recycled undo segment "base/undo/000007.0000200000" -> "base/undo/000007.0000400000"
    2018-04-11 15:54:18.515 NZST [58692] LOG: recycled undo segment "base/undo/000002.0000200000" -> "base/undo/000002.0000400000"
    2018-04-11 15:54:18.917 NZST [58692] LOG: recycled undo segment "base/undo/000004.0000200000" -> "base/undo/000004.0000400000"
    2018-04-11 15:54:29.463 NZST [58692] LOG: recycled undo segment "base/undo/000000.0000300000" -> "base/undo/000000.0000500000"
    [Diagram labels: …200000, …100000, …00000]
  15. Tablespaces
    postgres=# create tablespace ts1 location '/tmp/ts1';
    CREATE TABLESPACE
    postgres=# set undo_tablespaces = ts1;
    SET
    postgres=# insert into foo values (42);
    INSERT 0 1
    postgres=# select * from pg_stat_undo_logs where tablespace = 'ts1';
     log_number | persistence | tablespace |     discard      |      insert      |       end        |  xid   |  pid
    ------------+-------------+------------+------------------+------------------+------------------+--------+-------
             60 | permanent   | ts1        | 00003C0000000018 | 00003C0000000018 | 00003C0000100000 | 189257 | 46137
    (1 row)
    postgres=# drop tablespace ts1;
    DROP TABLESPACE

    2018-03-28 15:44:50.265 NZDT [46137] LOG: created undo segment "pg_tblspc/16416/PG_11_201802061/undo/00003C.0000000000"

    • GUC “undo_tablespaces” controls where your session writes undo data (similar to “temp_tablespaces”)
    • Tablespace can only be dropped when contained undo logs are empty (no attached transactions in progress, fully discarded); attached sessions will be forcibly detached
  16. Buffers
    Recap of “steal, no force” buffering as used in PostgreSQL:
    • “steal”: if you need a buffer and none are free, you steal one (write it out to disk if dirty); “no force”: committing doesn’t require writing out dirty buffers
    • buffers are written out by checkpoints and by “stealing” (= memory pressure); otherwise they don’t have to be written to disk
    • the checkpointer calls fsync() at appropriate times
    • We want all this existing machinery for free for our undo logs!
    • We also need a new way to “forget” buffers holding discarded data, to avoid all IO completely if we’re lucky
  17. POSTGRES 4.2 (1994)
    [Diagram: relations customers, current_totals and sales; bufmgr.c (buffer pool); smgr.c above md.c (magnetic disk), mm.c (main memory) and sj.c (Sony “Jukebox”)]
    Each relation was associated with a “storage manager”.
  18. PostgreSQL 11
    [Diagram: relations customers, current_totals and sales; bufmgr.c (buffer pool); smgr.c above md.c (relation files)]
    The “storage manager” API layer remains, but there is only a single implementation.
  19. Enter undo logs
    [Diagram: relations customers, current_totals and sales alongside undo log 0 and undo log 1; bufmgr.c (buffer pool); smgr.c above md.c (relation files) and undofile.c (undo files)]
    We can use this API to provide a new storage manager to the buffer manager!
  20. Mapping buffer pages to storage managers
    if (rnode.dbNode == 9)
        reln->smgr_which = 1; /* use undofile.c implementation */
    else
        reln->smgr_which = 0; /* use md.c implementation */
    Yeah, we could probably do better than this…
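To show what selecting an implementation by index amounts to, here is a much reduced, self-contained sketch of a storage manager dispatch table. It is not PostgreSQL's smgr.c: the struct, the single callback, and every name in it are hypothetical stand-ins, and only the rule "database number 9 selects index 1" comes from the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified storage manager interface. */
typedef struct StorageManager
{
    const char *name;                 /* source file named on the slides */
    const char *(*describe)(void);    /* stand-in for the real callbacks */
} StorageManager;

static const char *md_describe(void)       { return "relation files"; }
static const char *undofile_describe(void) { return "undo files"; }

/* Index into this table plays the role of smgr_which. */
static const StorageManager storage_managers[] = {
    {"md.c", md_describe},              /* 0: ordinary relation files */
    {"undofile.c", undofile_describe},  /* 1: undo log files */
};

#define UNDO_PSEUDO_DB 9   /* the magic database number from the slide */

/* Pick the implementation index the way the slide's snippet does. */
static int
smgr_which_for_db(unsigned db_node)
{
    return db_node == UNDO_PSEUDO_DB ? 1 : 0;
}

/* Resolve a database number to its storage manager entry. */
static const StorageManager *
smgr_for_db(unsigned db_node)
{
    size_t which = (size_t) smgr_which_for_db(db_node);

    assert(which < sizeof(storage_managers) / sizeof(storage_managers[0]));
    return &storage_managers[which];
}
```

The buffer manager only ever goes through the table, so pages of undo logs and pages of ordinary relations can share one buffer pool while being read from and written to different kinds of files.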
  21. Buffer life cycle
    • If discarding at same rate as inserting (pgbench):
        • rarely write undo data to disk (only at checkpoints)
        • recycle same 1-2 buffers constantly
    • If not able to discard (long lived snapshot):
        • compete with other buffer pool contents
        • … need ring? … different page reclamation?
  22. [Diagram labels: DSM segment; shared memory; bank 0, bank 1, bank 2]
    undo log 0: insert = 000000000069BF50, discard = 000000000069BF50, end = 0000000000800000, …
    undo log 1: insert = 000000000069BF50, discard = 000000000069BF50, end = 0000000000800000, …
    undo log 2: insert = 000000000069BF50, discard = 000000000069BF50, end = 0000000000800000, …
    • Conceptually we need an array of UndoLogControl objects in shared memory, for fast access to undo log meta-data by undo log number. An array would be too big; a dynamic hash table might work, but instead we cut the array into many “banks”, since active undo log numbers are clustered together; only map in the banks you need
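The bank idea can be sketched as a two-level lookup: the undo log number is divided into a bank number and a slot within that bank, and a bank that has not been mapped is simply a NULL entry. The bank size, the struct layout and all names below are hypothetical. The sketch uses plain pointers where the real design would map shared memory segments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNDO_LOGS_PER_BANK 1024   /* hypothetical bank size */

/* Hypothetical, reduced control object for one undo log. */
typedef struct UndoLogControl
{
    uint64_t discard;
    uint64_t insert;
    uint64_t end;
} UndoLogControl;

/*
 * Find the control object for an undo log number. banks[i] is NULL when
 * bank i has not been mapped into this process. Returns NULL if the bank
 * is out of range or unmapped.
 */
static UndoLogControl *
undo_log_lookup(UndoLogControl *const banks[], size_t nbanks, uint32_t logno)
{
    size_t bank = logno / UNDO_LOGS_PER_BANK;
    size_t slot = logno % UNDO_LOGS_PER_BANK;

    assert(banks != NULL);
    if (bank >= nbanks || banks[bank] == NULL)
        return NULL;
    return &banks[bank][slot];
}
```

Because active undo log numbers are clustered, a process typically needs only one or two banks mapped, so the full array never has to exist in one piece.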
  23. Checkpoints
    • Whenever a checkpoint occurs, we dump the contents of all meta-data from shared memory into an undo log checkpoint file under pg_undo.
    • For shutdown checkpoints, these are by definition consistent (no concurrent activity is allowed)
    • For online checkpoints, these contain a snapshot of each undo log’s meta-data from some arbitrary moment after the redo point, which is a problem…
  24. Meta-data in the WAL
    • The first time any zheap WAL record is written to each undo log after a checkpoint (i.e. after the redo point of a checkpoint), we first write an undo log “meta data” record, which will compensate for the inconsistencies in the undo checkpoint file. All writes after that can omit the location because it’s implied, reducing WAL size.
    • When I showed this slide at the PGCon unconference, I had something here about why we don’t need full_page_writes, but that was wrong: we do need them, but we can probably use REGBUF_WILL_INIT or something similar to avoid them almost always; I’m looking into that…
  25. Sessions -> transactions
    • At “DO” time, there is an association between sessions (= backends) and undo logs (i.e. currently attached)
    • At “REDO” time, sessions are gone: everything will be replayed by the start-up process. So we maintain an xid -> undo log mapping during recovery
    • The first time any transaction writes to any undo log, it writes an “attach” record in the WAL
  26. eof