bloat-postgresql-pgconfeu

 bloat-postgresql-pgconfeu

F9a2dba12b94d0c204d846a29da56bf5?s=128

Peter Geoghegan

October 26, 2018
Tweet

Transcript

  1. 3.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu Outline Most discussions of VACUUM/bloat take a pragmatic, top-down

    approach. I’m going to take a bottom-up approach, though. This isn’t necessarily better. It might help you to develop better intuitions about VACUUM’s performance, though, especially if you’re familiar with the standard explanation already. 3
  2. 4.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu Overview 1. Structure Logical vs. Physical, page model, layout

    of data structures. 2. A bottom-up take on bloat VACUUM, HOT, and the relationship between table bloat and index bloat. 3. VACUUM’s priorities What VACUUM prioritizes. 4. Future work Making garbage collection more efficient in future releases. 4
  3. 6.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu Logical vs. Physical Database systems use access methods to

    abstract away physical representation. MVCC more or less versions entries in objects (relations). Heap relations (tables) store rows more or less in arbitrary order. Index relations may also have multiple versions, though this can be avoided by HOT optimization. 6
  4. 7.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu Heap (table) lookups Heap structure is optimized for sequential

    access, and access by index scans. Records are identified by TID (e.g., '(2,32)', '(43,89)')— must be stable for index scans. TID is a “physiological” identifier. - Physical across pages/blocks — block number. - Logical within pages/blocks — item pointer offset. 7
  5. 9.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu Index Lookups Indexes are concerned with making access to

    specific records efficient. B-Trees mostly consist of all items that are indexed in natural sort order. (99%+ of blocks are leaf blocks.) B-Tree indexes are particular about which key values go on which page, complicating matters for MVCC garbage collection. Index pages will regularly need to “fit a new entry between existing entries”. Physiological-ness helps with that (index TIDs exist, but are not stable over time). 9
  6. 10.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu Image generated using experimental Graphiz Postgres B-Tree visualization tool

    — Heikki Linnakangas 1 L 11 61 61 61 31 33 32 2 L 11 61 61 61 31 36 35 4 L 11 61 61 61 31 39 38 5 L 11 61 61 61 32 33 31 3 6 L 11 61 61 61 32 36 34 7 L 11 61 61 61 32 39 37 8 L 11 61 61 61 33 33 30 9 L 11 61 61 61 33 36 33 10 L 11 61 61 61 33 39 36 11 L 11 61 61 61 34 32 39 12 L 11 61 61 61 34 36 32 13 L 11 61 61 61 34 39 35 14 L 0f 61 61 61 35 32 38 15 L 11 61 61 61 35 36 30 16 L 11 61 61 61 35 39 33 17 L 11 61 61 61 36 32 36 18 L 11 61 61 61 36 35 39 19 L 11 61 61 61 36 39 32 20 L 11 61 61 61 37 32 35 21 L 11 61 61 61 37 35 38 22 L 11 61 61 61 37 39 31 23 L 11 61 61 61 38 32 34 24 L 11 61 61 61 38 35 37 25 L 11 61 61 61 38 39 30 26 L 11 61 61 61 39 32 33 27 L 11 61 61 61 39 35 36 28 L 11 61 61 61 39 38 39 29 L 3 1 61 35 32 38 15 L 11 61 61 61 35 36 30 16 L 11 61 61 61 35 39 33 17 L B-Tree Index Structure
  7. 11.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu Image generated using experimental Graphiz Postgres B-Tree visualization tool

    — Heikki Linnakangas 1 L 11 61 61 61 31 33 32 2 L 11 61 61 61 31 36 35 4 L 11 61 61 61 31 39 38 5 L 11 61 61 61 32 33 31 3 6 L 11 61 61 61 32 36 34 7 L 11 61 61 61 32 39 37 8 L 11 61 61 61 33 33 30 9 L 11 61 61 61 33 36 33 10 L 11 61 61 61 33 39 36 11 L 11 61 61 61 34 32 39 12 L 11 61 61 61 34 36 32 13 L 11 61 61 61 34 39 35 14 L 0f 61 61 61 35 32 38 15 L 11 61 61 61 35 36 30 16 L 11 61 61 61 35 39 33 17 L 11 61 61 61 36 32 36 18 L 11 61 61 61 36 35 39 19 L 11 61 61 61 36 39 32 20 L 11 61 61 61 37 32 35 21 L 11 61 61 61 37 35 38 22 L 11 61 61 61 37 39 31 23 L 11 61 61 61 38 32 34 24 L 11 61 61 61 38 35 37 25 L 11 61 61 61 38 39 30 26 L 11 61 61 61 39 32 33 27 L 11 61 61 61 39 35 36 28 L 11 61 61 61 39 38 39 29 L 3 1 61 35 32 38 15 L 11 61 61 61 35 36 30 16 L 11 61 61 61 35 39 33 17 L B-Tree Index Structure
  8. 12.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu Image generated using experimental Graphiz Postgres B-Tree visualization tool

    — Heikki Linnakangas 1 L 11 61 61 61 31 33 32 2 L 11 61 61 61 31 36 35 4 L 11 61 61 61 31 39 38 5 L 11 61 61 61 32 33 31 3 6 L 11 61 61 61 32 36 34 7 L 11 61 61 61 32 39 37 8 L 11 61 61 61 33 33 30 9 L 11 61 61 61 33 36 33 10 L 11 61 61 61 33 39 36 11 L 11 61 61 61 34 32 39 12 L 11 61 61 61 34 36 32 13 L 11 61 61 61 34 39 35 14 L 0f 61 61 61 35 32 38 15 L 11 61 61 61 35 36 30 16 L 11 61 61 61 35 39 33 17 L 11 61 61 61 36 32 36 18 L 11 61 61 61 36 35 39 19 L 11 61 61 61 36 39 32 20 L 11 61 61 61 37 32 35 21 L 11 61 61 61 37 35 38 22 L 11 61 61 61 37 39 31 23 L 11 61 61 61 38 32 34 24 L 11 61 61 61 38 35 37 25 L 11 61 61 61 38 39 30 26 L 11 61 61 61 39 32 33 27 L 11 61 61 61 39 35 36 28 L 11 61 61 61 39 38 39 29 L 3 1 61 35 32 38 15 L 11 61 61 61 35 36 30 16 L 11 61 61 61 35 39 33 17 L B-Tree Index Structure
  9. 14.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu Transaction status data (“SLRUs”) Subsidiary structures for transaction status

    information. VACUUM “freezes” tuples to truncate structures. Commit log/pg_xact, pg_multixact. No opportunistic (non-VACUUM) processing. Also page-based (8KiB blocks). Bitmap-like representations. 14
  10. 15.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu Free space map and Visibility map Relation “forks” that

    maintain summarizing information. Only heap has visibility map. Maintained by VACUUM. No opportunistic processing. Also comprised of 8KiB blocks. Visibility map allows avoidance of redundant work by VACUUM. As of 9.6, visibility map also tracks all-frozen pages. 15
  11. 16.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu Overview 1. Structure Logical vs. Physical, page model, layout

    of data structures. 2. A bottom-up take on bloat VACUUM, HOT, and the relationship between table bloat and index bloat. 3. VACUUM’s priorities What VACUUM prioritizes. 4. Future work Making garbage collection more efficient in future releases. 16
  12. 18.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu Database pages as “cells” PostgreSQL storage consists of 8KiB

    pages. “Page model”. - Complex atomic operations (transactions) from simpler ones (WAL-logged atomic page ops). - Individual page modifications can be made atomic with low-level techniques. 18
  13. 19.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu “Linux is evolution, not intelligent design” “Bad programmers worry

    about the code. Good programmers worry about data structures and their relationships.” — Linus Torvalds
  14. 20.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu Evolutionary pressure Like cells in living organisms, structure of

    pages has a lot in common across disparate access methods (e.g., heap, B-Tree, …). Too much complexity to manage without breaking down into manageable pieces with commonality. - What else could possibly work? Indexes are very different to tables, but their page layout has quite a lot in common. 20
  15. 21.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu Bloat at the page level Bloat is often directly

    observable on pages. - “Hint bits” note that transaction that updated subsequently committed. - Opportunistic cleanup reclaims space at the page level, often due to observing a set hint bit. VACUUM does work that significantly overlaps with opportunistic cleanup work of pages — pruning. 21
  16. 23.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu VACUUM Bulk garbage collection. - Autovacuum is typically triggered

    by threshold having been exceeded. - VACUUM FULL is quite different. It rebuilds table from scratch. Avoid confusing them. - Batch operation. 23
  17. 24.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu VACUUM (Cont.) Shares a fair amount of logic with

    opportunistic garbage collection. - Runs reliably against whole table and all indexes, though. - Less important than opportunistic garbage collection for some workloads. - Sometimes skips pages that it cannot do cleanup on immediately, so is itself somewhat opportunistic. 24
  18. 25.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu VACUUM - processing order 1. Visits heap, performing pruning.

    Same as opportunistic pruning by HOT. Mark item pointers as dead, which index scans will need until after step 2. Free space for heap tuple structure. 2. Visits indexes, killing index tuples + item pointers together. 3. Second pass over heap, to mark dead item pointers LP_UNUSED/reusable. Frees less space than step 1. 25
  19. 26.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu VACUUM - processing order (Cont.) 26 Processing order makes

    sense when you think about page-level constraints. Indexes resolve key values to heap TIDs. Heap TIDs must be stable over time (within a VACUUM cycle). Index scans must always land on the right heap tuple. Heap cleanup is constrained by this. VACUUM remembers a list of heap TIDs to go kill in indexes.
  20. 27.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu HOT (Heap-only tuple) optimization 27 Avoids creating new index

    entries. Dependent on no indexed columns being modified by UPDATEs. Chain together heap tuples in heap page. Mostly valuable because it avoids bloat in the first place. Also enables pruning heap pages opportunistically.
  21. 28.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu HOT pruning “Prune away” heap tuples. Also removes item

    pointers for tuples that cannot be in any index due to being heap-only (HOT) tuples. Important for many workloads. - Can even remove dead tuple space when there are zero HOT UPDATEs. “Hot pruning” is a bit of a misnomer. Cannot reclaim dead item pointers, though. 28
  22. 29.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu LP_DEAD/kill_prior_tuple 29 B-Tree optimization has index scans represent that

    certain index tuples aren’t useful to anybody now. Sets LP_DEAD hint bit when the heap visibility info indicates that the tuple is dead to all possible MVCC snapshots. Simple read queries do this. The only case where visibility information is stored in indexes directly. Subsequent scans won’t visit the heap when LP_DEAD is found set.
  23. 30.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu LP_DEAD/kill_prior_tuple (Cont.) 30 Technique also enables space reclaim/garbage collection.

    Index scans can get away with setting LP_DEAD bit without exclusive buffer lock, but actual restructuring needs this. LP_DEAD hint bit index tuples are reclaimed during subsequent inserts into page that might need a page split. Prevents would-be page splits from actually splitting.
  24. 31.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu Relationship between heap and index bloat An index scan

    always expects to be able to find the heap entry it points to, if only to discover that it’s logically dead. (Item pointers stick around as stubs if we can only prune.) Heap TIDs are supposed to be stable, at least within a VACUUM cycle. 31
  25. 32.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu Overview 1. Structure Logical vs. Physical, page model, layout

    of data structures. 2. A bottom-up take on bloat VACUUM, HOT, and the relationship between table bloat and index bloat. 3. VACUUM’s priorities What VACUUM prioritizes. 4. Future work Making garbage collection more efficient in future releases. 32
  26. 34.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu VACUUM + Snapshots Garbage collection is blocked by active

    MVCC snapshots. The burden is not placed on old snapshots to do extra work by examining a subsidiary structure, such as UNDO. “Recently dead” tuples may be dead to most snapshots, but still cannot be garbage collected. “Horizon” before which dead tuples can be killed is conservative. 34
  27. 35.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu VACUUM design goals VACUUM is designed to be non-disruptive.

    Heavyweight lock strength doesn’t block user queries, including INSERTs, UPDATEs, and DELETEs. Indexes are scanned in physical order, not logical order (B-Tree indexes only). Query performance is the major consideration. - Reclaiming space not necessarily given that much weight. 35
  28. 36.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu Space reclamation Eager vs. batch. Assumes that modest amounts

    of free space will get reclaimed eventually, when it’s convenient. Indexes have inherent restrictions, too. Only whole index pages can be reclaimed by the free space map — undersized pages cannot be merged together. 36
  29. 37.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu Overview 1. Structure Logical vs. Physical, page model, layout

    of data structures. 2. A bottom-up take on bloat VACUUM, HOT, and the relationship between table bloat and index bloat. 3. VACUUM’s priorities What VACUUM prioritizes. 4. Future work Making garbage collection more efficient in future releases. 37
  30. 39.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu Coming in v11 Avoid vacuuming B-Tree indexes that don’t

    really need it. Can be thought of as complementing freeze map work in 9.6. Append-only tables won’t have to have indexes VACUUM’d, even when anti-wraparound VACUUM is run. 39
  31. 40.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu Cooperative cleanup Making all B-Tree entries unique. VACUUM can

    thereby cleanup large contiguous ranges of duplicates more efficiently. Needed for “retail index tuple deletion”. - Background worker to kill entries. - Eager. - Zheap project — “deletion marking”. 40
  32. 41.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu Conclusions about bloat in PostgreSQL VACUUM is tasked with

    removing old garbage tuples that are obsolete to all possible transactions. Opportunistic techniques (HOT Pruning, LP_DEAD cleanup in B-Tree indexes) also exist. Garbage collection usually happens both ways. VACUUM is exclusively responsible for cleanup of transaction management status (e.g., Freezing tuples, truncating away commit status information). 41
  33. 42.

    https://speakerdeck.com/peterg/bloat-postgresql-pgconfeu Conclusions about bloat in PostgreSQL (cont.) Opportunistic techniques are

    restricted by the same ordering requirements that dictate high-level steps VACUUM performs. - Index scans cannot be allowed to land on an unrelated tuple due to heap TID recycling. Some amount of VACUUMing is almost inevitable, even without freezing, because dead item pointers must eventually be marked reusable. (See lazy_vacuum_heap() and lazy_vacuum_page().) 42