Slide 1

Slide 1 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen Bloat in PostgreSQL: A Taxonomy Postgres Open — September 6, 2018

Slide 2

Slide 2 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen Peter 
 Geoghegan @petervgeoghegan

Slide 3

Slide 3 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen Outline Most discussions of VACUUM/bloat take a pragmatic, top-down approach. I’m going to take a bottom-up approach, though. This isn’t necessarily better. It might help you to develop better intuitions about VACUUM’s performance, though, especially if you’re familiar with the standard explanation already. 3

Slide 4

Slide 4 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen Overview 1. Structure Logical vs. Physical, page model, layout of data structures. 2. A bottom-up take on bloat VACUUM, HOT, and the relationship between table bloat and index bloat. 3. VACUUM’s priorities What VACUUM prioritizes. 4. Future work Making garbage collection more efficient in future releases. 4

Slide 5

Slide 5 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen Structure Pictured: The basic scheme of modern classification (Wikipedia)

Slide 6

Slide 6 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen Logical vs. Physical Database systems use access methods to abstract away physical representation. MVCC more or less versions entries in objects (relations). Heap relations (tables) store rows more or less in arbitrary order. Index relations may also have multiple versions, though this can be avoided by HOT optimization. 6

Slide 7

Slide 7 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen Heap (table) lookups Heap structure is optimized for sequential access, and access by index scans. Records are identified by TID (e.g., '(2,32)', '(43,89)')— must be stable for index scans. TID is a “physiological” identifier. - Physical across pages/blocks — block number. - Logical within pages/blocks — item pointer offset. 7

Slide 8

Slide 8 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen pg_hexedit tool, with pg_index catalog relation (table)

Slide 9

Slide 9 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen Index Lookups Indexes are concerned with making access to specific records efficient. B-Trees mostly consist of all items that are indexed in natural sort order. (99%+ of blocks are leaf blocks.) B-Tree indexes are particular about which key values go on which page, complicating matters for MVCC garbage collection. Index pages will regularly need to “fit a new entry between existing entries”. Physiological-ness helps with that (index TIDs exist, but are not stable over time). 9

Slide 10

Slide 10 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen Image generated using experimental Graphiz Postgres B-Tree visualization tool — Heikki Linnakangas 1 L 11 61 61 61 31 33 32 2 L 11 61 61 61 31 36 35 4 L 11 61 61 61 31 39 38 5 L 11 61 61 61 32 33 31 3 6 L 11 61 61 61 32 36 34 7 L 11 61 61 61 32 39 37 8 L 11 61 61 61 33 33 30 9 L 11 61 61 61 33 36 33 10 L 11 61 61 61 33 39 36 11 L 11 61 61 61 34 32 39 12 L 11 61 61 61 34 36 32 13 L 11 61 61 61 34 39 35 14 L 0f 61 61 61 35 32 38 15 L 11 61 61 61 35 36 30 16 L 11 61 61 61 35 39 33 17 L 11 61 61 61 36 32 36 18 L 11 61 61 61 36 35 39 19 L 11 61 61 61 36 39 32 20 L 11 61 61 61 37 32 35 21 L 11 61 61 61 37 35 38 22 L 11 61 61 61 37 39 31 23 L 11 61 61 61 38 32 34 24 L 11 61 61 61 38 35 37 25 L 11 61 61 61 38 39 30 26 L 11 61 61 61 39 32 33 27 L 11 61 61 61 39 35 36 28 L 11 61 61 61 39 38 39 29 L 3 1 61 35 32 38 15 L 11 61 61 61 35 36 30 16 L 11 61 61 61 35 39 33 17 L B-Tree Index Structure

Slide 11

Slide 11 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen Image generated using experimental Graphiz Postgres B-Tree visualization tool — Heikki Linnakangas 1 L 11 61 61 61 31 33 32 2 L 11 61 61 61 31 36 35 4 L 11 61 61 61 31 39 38 5 L 11 61 61 61 32 33 31 3 6 L 11 61 61 61 32 36 34 7 L 11 61 61 61 32 39 37 8 L 11 61 61 61 33 33 30 9 L 11 61 61 61 33 36 33 10 L 11 61 61 61 33 39 36 11 L 11 61 61 61 34 32 39 12 L 11 61 61 61 34 36 32 13 L 11 61 61 61 34 39 35 14 L 0f 61 61 61 35 32 38 15 L 11 61 61 61 35 36 30 16 L 11 61 61 61 35 39 33 17 L 11 61 61 61 36 32 36 18 L 11 61 61 61 36 35 39 19 L 11 61 61 61 36 39 32 20 L 11 61 61 61 37 32 35 21 L 11 61 61 61 37 35 38 22 L 11 61 61 61 37 39 31 23 L 11 61 61 61 38 32 34 24 L 11 61 61 61 38 35 37 25 L 11 61 61 61 38 39 30 26 L 11 61 61 61 39 32 33 27 L 11 61 61 61 39 35 36 28 L 11 61 61 61 39 38 39 29 L 3 1 61 35 32 38 15 L 11 61 61 61 35 36 30 16 L 11 61 61 61 35 39 33 17 L B-Tree Index Structure

Slide 12

Slide 12 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen Image generated using experimental Graphiz Postgres B-Tree visualization tool — Heikki Linnakangas 1 L 11 61 61 61 31 33 32 2 L 11 61 61 61 31 36 35 4 L 11 61 61 61 31 39 38 5 L 11 61 61 61 32 33 31 3 6 L 11 61 61 61 32 36 34 7 L 11 61 61 61 32 39 37 8 L 11 61 61 61 33 33 30 9 L 11 61 61 61 33 36 33 10 L 11 61 61 61 33 39 36 11 L 11 61 61 61 34 32 39 12 L 11 61 61 61 34 36 32 13 L 11 61 61 61 34 39 35 14 L 0f 61 61 61 35 32 38 15 L 11 61 61 61 35 36 30 16 L 11 61 61 61 35 39 33 17 L 11 61 61 61 36 32 36 18 L 11 61 61 61 36 35 39 19 L 11 61 61 61 36 39 32 20 L 11 61 61 61 37 32 35 21 L 11 61 61 61 37 35 38 22 L 11 61 61 61 37 39 31 23 L 11 61 61 61 38 32 34 24 L 11 61 61 61 38 35 37 25 L 11 61 61 61 38 39 30 26 L 11 61 61 61 39 32 33 27 L 11 61 61 61 39 35 36 28 L 11 61 61 61 39 38 39 29 L 3 1 61 35 32 38 15 L 11 61 61 61 35 36 30 16 L 11 61 61 61 35 39 33 17 L B-Tree Index Structure

Slide 13

Slide 13 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen pg_hexedit tool, with pg_index_indexrelid_index catalog relation (index)

Slide 14

Slide 14 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen Transaction status data (“SLRUs”) Subsidiary structures for transaction status information. VACUUM “freezes” tuples to truncate structures. Commit log/pg_xact, pg_multixact. No opportunistic (non-VACUUM) processing. Also page-based (8KiB blocks). Bitmap-like representations. 14

Slide 15

Slide 15 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen Free space map and Visibility map Relation “forks” that maintain summarizing information. Only heap has visibility map. Maintained by VACUUM. No opportunistic processing. Also comprised of 8KiB blocks. Visibility map allows avoidance of redundant work by VACUUM. As of 9.6, visibility map also tracks all-frozen pages. 15

Slide 16

Slide 16 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen Overview 1. Structure Logical vs. Physical, page model, layout of data structures. 2. A bottom-up take on bloat VACUUM, HOT, and the relationship between table bloat and index bloat. 3. VACUUM’s priorities What VACUUM prioritizes. 4. Future work Making garbage collection more efficient in future releases. 16

Slide 17

Slide 17 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen A bottom-up take on bloat Pictured: Animal Cell (Wikipedia)

Slide 18

Slide 18 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen Database pages as “cells” PostgreSQL storage consists of 8KiB pages. “Page model”. - Complex atomic operations (transactions) from simpler ones (WAL-logged atomic page ops). - Individual page modifications can be made atomic with low-level techniques. 18

Slide 19

Slide 19 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen “Linux is evolution, not intelligent design” “Bad programmers worry about the code. Good programmers worry about data structures and their relationships.” — Linus Torvalds

Slide 20

Slide 20 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen Evolutionary pressure Like cells in living organisms, structure of pages has a lot in common across disparate access methods (e.g., heap, B-Tree, …). Too much complexity to manage without breaking down into manageable pieces with commonality. - What else could possibly work? Indexes are very different to tables, but their page layout has quite a lot in common. 20

Slide 21

Slide 21 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen Bloat at the page level Bloat is often directly observable on pages. - “Hint bits” note that transaction that updated subsequently committed. - Opportunistic cleanup reclaims space at the page level, often due to observing a set hint bit. VACUUM does work that significantly overlaps with opportunistic cleanup work of pages — pruning. 21

Slide 22

Slide 22 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen pg_hexedit tool, with LP_DEAD item pointers (table) following pruning

Slide 23

Slide 23 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen VACUUM Bulk garbage collection. - Autovacuum is typically triggered by threshold having been exceeded. - VACUUM FULL is quite different. It rebuilds table from scratch. Avoid confusing them. - Batch operation. 23

Slide 24

Slide 24 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen VACUUM (Cont.) Shares a fair amount of logic with opportunistic garbage collection. - Runs reliably against whole table and all indexes, though. - Less important than opportunistic garbage collection for some workloads. - Sometimes skips pages that it cannot do cleanup on immediately, so is itself somewhat opportunistic. 24

Slide 25

Slide 25 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen VACUUM - processing order 1. Visits heap, performing pruning. Same as opportunistic pruning by HOT. Mark item pointers as dead, which index scans will need until after step 2. Free space for heap tuple structure. 2. Visits indexes, killing index tuples + item pointers together. 3. Second pass over heap, to mark dead item pointers LP_UNUSED/reusable. Frees less space than step 1. 25

Slide 26

Slide 26 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen VACUUM - processing order (Cont.) 26 Processing order makes sense when you think about page-level constraints. Indexes resolve key values to heap TIDs. Heap TIDs must be stable over time (within a VACUUM cycle). Index scans must always land on the right heap tuple. Heap cleanup is constrained by this. VACUUM remembers a list of heap TIDs to go kill in indexes.

Slide 27

Slide 27 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen HOT (Heap-only tuple) optimization 27 Avoids creating new index entries. Dependent on no indexed columns being modified by UPDATEs. Chain together heap tuples in heap page. Mostly valuable because it avoids bloat in the first place. Also enables pruning heap pages opportunistically.

Slide 28

Slide 28 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen HOT pruning “Prune away” heap tuples. Also removes item pointers for tuples that cannot be in any index due to being heap-only (HOT) tuples. Important for many workloads. - Can even remove dead tuple space when there are zero HOT UPDATEs. “Hot pruning” is a bit of a misnomer. Cannot reclaim dead item pointers, though. 28

Slide 29

Slide 29 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen LP_DEAD/kill_prior_tuple 29 B-Tree optimization has index scans represent that certain index tuples aren’t useful to anybody now. Sets LP_DEAD hint bit when the heap visibility info indicates that the tuple is dead to all possible MVCC snapshots. Simple read queries do this. The only case where visibility information is stored in indexes directly. Subsequent scans won’t visit the heap when LP_DEAD is found set.

Slide 30

Slide 30 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen LP_DEAD/kill_prior_tuple (Cont.) 30 Technique also enables space reclaim/garbage collection. Index scans can get away with setting LP_DEAD bit without exclusive buffer lock, but actual restructuring needs this. LP_DEAD hint bit index tuples are reclaimed during subsequent inserts into page that might need a page split. Prevents would-be page splits from actually splitting.

Slide 31

Slide 31 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen Relationship between heap and index bloat An index scan always expects to be able to find the heap entry it points to, if only to discover that it’s logically dead. (Item pointers stick around as stubs if we can only prune.) Heap TIDs are supposed to be stable, at least within a VACUUM cycle. 31

Slide 32

Slide 32 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen Overview 1. Structure Logical vs. Physical, page model, layout of data structures. 2. A bottom-up take on bloat VACUUM, HOT, and the relationship between table bloat and index bloat. 3. VACUUM’s priorities What VACUUM prioritizes. 4. Future work Making garbage collection more efficient in future releases. 32

Slide 33

Slide 33 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen VACUUM’s Priorities

Slide 34

Slide 34 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen VACUUM + Snapshots Garbage collection is blocked by active MVCC snapshots. The burden is not placed on old snapshots to do extra work by examining a subsidiary structure, such as UNDO. “Recently dead” tuples may be dead to most snapshots, but still cannot be garbage collected. “Horizon” before which dead tuples can be killed is conservative. 34

Slide 35

Slide 35 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen VACUUM design goals VACUUM is designed to be non-disruptive. Heavyweight lock strength doesn’t block user queries, including INSERTs, UPDATEs, and DELETEs. Indexes are scanned in physical order, not logical order (B-Tree indexes only). Query performance is the major consideration. - Reclaiming space not necessarily given that much weight. 35

Slide 36

Slide 36 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen Space reclamation Eager vs. batch. Assumes that modest amounts of free space will get reclaimed eventually, when it’s convenient. Indexes have inherent restrictions, too. Only whole index pages can be reclaimed by the free space map — undersized pages cannot be merged together. 36

Slide 37

Slide 37 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen Overview 1. Structure Logical vs. Physical, page model, layout of data structures. 2. A bottom-up take on bloat VACUUM, HOT, and the relationship between table bloat and index bloat. 3. VACUUM’s priorities What VACUUM prioritizes. 4. Future work Making garbage collection more efficient in future releases. 37

Slide 38

Slide 38 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen Future work

Slide 39

Slide 39 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen Coming in v11 Avoid vacuuming B-Tree indexes that don’t really need it. Can be thought of as finishing off freeze map work in 9.6. Append-only tables won’t have to have indexes VACUUM’d, even when anti-wraparound VACUUM is run. 39

Slide 40

Slide 40 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen Cooperative cleanup Making all B-Tree entries unique. VACUUM can thereby cleanup large contiguous ranges of duplicates more efficiently. Needed for “retail index tuple deletion”. - Background worker to kill entries. - Eager. - Zheap project — “deletion marking”. 40

Slide 41

Slide 41 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen Conclusions about bloat in PostgreSQL VACUUM is tasked with removing old garbage tuples that are obsolete to all possible transactions. Opportunistic techniques (HOT Pruning, LP_DEAD cleanup in B-Tree indexes) also exist. Garbage collection usually happens both ways. VACUUM is exclusively responsible for cleanup of transaction management status (e.g., Freezing tuples, truncating away commit status information). 41

Slide 42

Slide 42 text

https://speakerdeck.com/peterg/bloat-postgresql-pgopen Conclusions about bloat in PostgreSQL (cont.) Opportunistic techniques are restricted by the same ordering requirements that dictate high-level steps VACUUM performs. - Index scans cannot be allowed to land on an unrelated tuple due to heap TID recycling. Some amount of VACUUMing is almost inevitable, even without freezing, because dead item pointers must eventually be marked reusable. (See lazy_vacuum_heap() and lazy_vacuum_page().) 42

Slide 43

Slide 43 text

speakerdeck.com/peterg/bloat-postgresql-pgopen Thanks!