Radix Tree
• Shape depends on the key space and key lengths.
  § It does not depend on insertion order or on the existing keys.
• No rebalancing operations.
• The keys are sorted in lexicographic order.
• Operations such as insertion are O(k), where k is the length of the key.
• The keys are implicitly stored in the tree structure.
  § A key can be reconstructed from the path to its leaf node.

[Figure: a radix tree storing the keys MEDAL, MEDIA, MOUNT, and MOUSE; the shared prefixes (M, MED, MOU) are stored once on internal paths.]
Adaptive Radix Tree
• “The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases” by Viktor Leis, Alfons Kemper, and Thomas Neumann.
  § https://db.in.tum.de/~leis/papers/ART.pdf
• Adaptive node types, each with a different algorithm for accessing it (see the sketch below).
  § Node4, Node16, Node48, and Node256.
  § Use SIMD instructions (for searching Node16).
• Lazy expansion.
• Path compression.
• The key span is 8 bits (= 1 byte).
  § The maximum fanout is 256.
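As a rough illustration of the adaptive-node idea, here is a hypothetical C sketch (not PostgreSQL's or the paper's actual code) of two of the four node kinds: Node4 searches a tiny sorted key array linearly, while Node256 indexes children directly by the key byte; Node16 would use SIMD comparisons instead.

    #include <stdint.h>
    #include <stddef.h>

    /* Smallest node kind: up to 4 children, keys searched linearly. */
    typedef struct art_node4
    {
        uint8_t     nchildren;
        uint8_t     keys[4];        /* one key byte per child */
        void       *children[4];    /* child for keys[i] */
    } art_node4;

    /* Largest node kind: children indexed directly by key byte. */
    typedef struct art_node256
    {
        void       *children[256];
    } art_node256;

    static void *
    node4_find_child(art_node4 *node, uint8_t keybyte)
    {
        for (int i = 0; i < node->nchildren; i++)
        {
            if (node->keys[i] == keybyte)
                return node->children[i];
        }
        return NULL;
    }

    static void *
    node256_find_child(art_node256 *node, uint8_t keybyte)
    {
        return node->children[keybyte];     /* O(1) direct indexing */
    }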
Leaf Nodes
• Leaf nodes have values.
• The ART paper describes three ways to implement leaf nodes (sketched below):
  § Single-value leaves
  § Multi-value leaves
  § Combined pointer/value leaves
• ART in PostgreSQL uses all of them, depending on the value size and type (fixed-length or variable-length).
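As an illustration, a hypothetical C sketch of the three leaf representations (the layouts here are mine, not the paper's or PostgreSQL's):

    #include <stdint.h>

    /* Single-value leaves: each value lives in its own leaf node. */
    typedef struct leaf_single
    {
        uint64_t    key;        /* full key, to verify the match */
        uint64_t    value;
    } leaf_single;

    /* Multi-value leaves: shaped like an inner node, but the slots
     * hold values instead of child pointers. */
    typedef struct leaf_multi4
    {
        uint8_t     nentries;
        uint8_t     keys[4];
        uint64_t    values[4];
    } leaf_multi4;

    /* Combined pointer/value slots: a child slot in an inner node holds
     * either a child pointer or a value, distinguished by one tag bit. */
    typedef uintptr_t art_slot;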
Development Timeline
• Started the thread “[PoC] Improve dead tuple storage for lazy vacuum” on pgsql-hackers in 2021.
  § Started by looking for better data structures.
  § Candidates: integerset.c, tidbitmap.c, Roaring Bitmap, radix tree, etc.
  § Wrote PoC code and evaluated the performance of these data structures.
• Started implementing the radix tree in 2022.
  § The first radix tree implementation was based on a prototype by Andres Freund.
• Committed on 7 March 2024 (ee1b30f128d8f6) by John Naylor.
ART in PostgreSQL – Overview –
• 64-bit keys (8-bit key span)
• User-defined value type
• Shared memory (DSA) support
• Macro template (see the sketch below)
• Decoupling node “size class” from node “kind”
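For example, an instantiation of the macro template looks roughly like this (modeled on src/test/modules/test_radixtree; see src/include/lib/radixtree.h for the full set of options):

    /* Generates a local-memory radix tree with int64 values; all generated
     * names get the prefix "rt_" (rt_create, rt_set, rt_find, ...). */
    #define RT_PREFIX rt
    #define RT_SCOPE static
    #define RT_DECLARE
    #define RT_DEFINE
    #define RT_VALUE_TYPE int64
    /* define RT_SHMEM for the shared-memory (DSA-backed) variant */
    #include "lib/radixtree.h"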
Example
• <RT_PREFIX>_radix_tree is declared and defined.
• More examples in src/test/modules/test_radixtree.

    rt_radix_tree *tree;
    rt_iter    *iter;
    uint64      key;
    int64       value = 999;
    int64      *value_p;

    tree = rt_create(CurrentMemoryContext);
    rt_set(tree, 100, &value);
    value_p = rt_find(tree, 100);   /* pointer to the value, or NULL */

    /* iteration */
    iter = rt_begin_iterate(tree);
    while ((value_p = rt_iterate_next(iter, &key)) != NULL)
    {
        ...
    }
    rt_end_iterate(iter);

    rt_delete(tree, 100);
    rt_free(tree);
Value Types (#define RT_VALUE_TYPE)
• #define RT_VALUE_TYPE is mandatory.
• Fixed-length values
  § Single-value leaves or multi-value leaves, decided at compile time.
• Variable-length values
  § Specify “#define RT_VARLEN_VALUE_SIZE()”.
  § Single-value leaves.
  § If RT_RUNTIME_EMBEDDABLE_VALUE is additionally specified, the lowest bit of the value is reserved by the radix tree for pointer tagging, and the combined pointer/value leaves approach is used (see the sketch below).
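A minimal sketch of the pointer-tagging idea behind combined pointer/value leaves, using hypothetical helper names (the real logic lives in radixtree.h): because allocations are pointer-aligned, the lowest bit of a real leaf pointer is always zero, so that bit can mark a slot that embeds the value directly.

    #include <stdint.h>
    #include <stdbool.h>

    typedef uintptr_t rt_slot;      /* holds a leaf pointer or a value */

    /* True if the slot embeds the value rather than pointing to a leaf. */
    static inline bool
    slot_is_embedded_value(rt_slot slot)
    {
        return (slot & 1) != 0;
    }

    /* Embed a value whose lowest bit the caller has left free (that is
     * what RT_RUNTIME_EMBEDDABLE_VALUE reserves). */
    static inline rt_slot
    slot_embed_value(uintptr_t value)
    {
        return value | 1;
    }

    /* Otherwise the slot is an ordinary, untagged leaf pointer. */
    static inline void *
    slot_get_pointer(rt_slot slot)
    {
        return (void *) slot;
    }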
Decoupling node “size class” from “kind”
• There are 4 node “kinds”: node-4, node-16, node-48, and node-256.
• There can be multiple “size classes” per node kind (see the sketch below).
  § The size classes within a node kind share the same underlying type.
  § A node can grow and shrink between size classes within the same node kind with just a repalloc().
• node-16 has two size classes (low and high).
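A sketch of the decoupling, with illustrative fanout numbers (the real table and values are in radixtree.h):

    typedef enum rt_node_kind
    {
        RT_KIND_4,
        RT_KIND_16,
        RT_KIND_48,
        RT_KIND_256,
    } rt_node_kind;

    /* Each size class maps to a kind plus a capacity.  Two classes that
     * share a kind also share the struct layout and search algorithm,
     * so growing between them is just a repalloc() to the larger size. */
    typedef struct rt_size_class_info
    {
        rt_node_kind kind;
        int          fanout;    /* max children at this size class */
    } rt_size_class_info;

    static const rt_size_class_info size_classes[] = {
        {RT_KIND_4, 4},
        {RT_KIND_16, 16},       /* node-16, low */
        {RT_KIND_16, 32},       /* node-16, high: same kind, more slots */
        {RT_KIND_48, 48},
        {RT_KIND_256, 256},
    };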
TidStore (src/backend/access/common/tidstore.c)
• New data structure for PostgreSQL 17 (30e144287).
• Stores a large set of TIDs efficiently.
• Uses a radix tree internally.
  § The key is a BlockNumber and the value is a bitmap of offsets.
  § Uses combined pointer/value slots.
• Supports a shared TidStore.
• Supports the basic operations (usage sketch below):
  § TidStoreSetBlockOffsets() and TidStoreIsMember()
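A usage sketch of those operations (the creation function and the exact signatures here are assumptions; see src/include/access/tidstore.h for the real API):

    #include "postgres.h"
    #include "access/tidstore.h"

    BlockNumber blkno = 42;
    OffsetNumber offsets[] = {1, 2, 5};
    ItemPointerData tid;
    TidStore   *ts;

    /* assumed signature: maximum bytes, insert_only flag */
    ts = TidStoreCreateLocal(64 * 1024 * 1024, true);

    /* record the dead-item offsets of one heap block */
    TidStoreSetBlockOffsets(ts, blkno, offsets, lengthof(offsets));

    /* probe a single TID */
    ItemPointerSet(&tid, blkno, 2);
    if (TidStoreIsMember(ts, &tid))
    {
        /* this TID was recorded as dead */
    }

    TidStoreDestroy(ts);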
Optimizations in TidStore
• Embed up to 3 offsets in the padding space of the value header.
• Use a bump memory context as storage for the insert-only workload.

    typedef struct BlocktableEntry
    {
        struct
        {
            uint8       flags;      /* used for pointer tagging */
            int8        nwords;

            /* At most NUM_FULL_OFFSETS (3) offsets can be embedded here */
            OffsetNumber full_offsets[NUM_FULL_OFFSETS];
        } header;

        /* bitmap of offsets */
        bitmapword  words[FLEXIBLE_ARRAY_MEMBER];
    } BlocktableEntry;
Background – How Lazy Vacuum Works –
• At the beginning of lazy vacuum, we allocate maintenance_work_mem bytes for VacDeadItems at once.
• 1st phase: scan the heap table and store dead TIDs in VacDeadItems.
• 2nd phase: vacuum all indexes and then the table.
  § Each index scan looks up TIDs in the sorted TID array with bsearch() (sketch below).

    typedef struct VacDeadItems
    {
        int         max_items;  /* # slots allocated in array */
        int         num_items;  /* current # of entries */

        /* Sorted array of TIDs to delete from indexes */
        ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
    } VacDeadItems;
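The per-TID lookup is a binary search over that sorted array, roughly like the following simplified sketch (modeled on lazy vacuum's TID-reaped check in vacuumlazy.c):

    #include <stdlib.h>
    #include "storage/itemptr.h"

    /* bsearch() comparator over ItemPointerData */
    static int
    vac_cmp_itemptr(const void *left, const void *right)
    {
        return ItemPointerCompare((ItemPointer) left, (ItemPointer) right);
    }

    /* Is this TID in the sorted dead-items array?  O(log N) per probe,
     * called once per index tuple during index vacuuming. */
    static bool
    tid_is_dead(ItemPointer tid, VacDeadItems *dead_items)
    {
        return bsearch(tid, dead_items->items, dead_items->num_items,
                       sizeof(ItemPointerData), vac_cmp_itemptr) != NULL;
    }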
Problems and Limitations
• Inefficient TID storage space
  § Uses sizeof(ItemPointerData) (6 bytes) per TID.
• 1GB limitation
  § “Note that for the collection of dead tuple identifiers, VACUUM is only able to utilize up to a maximum of 1GB of memory.”
  § Allocates all maintenance_work_mem bytes at once.
• Inefficient TID lookup performance
  § O(log N) complexity, where N is the number of TIDs.
Using TidStore in Lazy Vacuum
• Committed on 2 April 2024 (667e65aac).
• Replaces the TID array with a TidStore.
• Uses the bump memory context.
• Renames columns of the pg_stat_progress_vacuum view:
  § num_dead_tuples -> dead_tuple_bytes
  § max_dead_tuples -> max_dead_tuple_bytes
Performance Tests
• PG16 (TID array) vs. PG17-Beta1 (TidStore).
• A table with one btree index (100M rows, 3GB ~ 4GB).
• Delete some tuples and run VACUUM (VERBOSE).
  § SET maintenance_work_mem TO ‘1GB’;
• Measure the execution time and memory consumption.
  § Note that for PG16, the memory usage is (the number of deleted tuples) * 6 bytes (= sizeof(ItemPointerData)).
• Three cases: sparse, dense, and UUID.
• Note that other VACUUM improvements in PG17 may contribute to the performance differences.
Summary (With TidStore)
• Inefficient TID storage space (sizeof(ItemPointerData) per TID)
  -> Space-efficient TID storage backed by a radix tree; up to 20x memory reduction.
• Inefficient TID lookup performance (O(log N))
  -> Lookup in O(k), where k is fairly small (4); 2x ~ 3x faster.
• Bulk allocation
  -> Memory is incrementally allocated.
• 1GB limitation
  -> Vacuum no longer has a 1GB limit.
Future Plans
• Use TidStore for BitmapScan instead of tidbitmap.
• Use the radix tree in the buffer manager for faster block lookups.
• Better concurrency support in the radix tree (cf. ROWEX).
• Parallel vacuum scan.
More Tests!
• While using TidStore in lazy vacuum greatly improves performance and memory usage, these changes touch a very critical part of PostgreSQL.
  § Both the radix tree and TidStore are new.
  § There is no on/off switch.
• Do we need some safeguard, just in case?