Slide 1

PostgreSQL Meets ART - Using Adaptive Radix Tree To Speed Up Vacuuming
Masahiko Sawada, Senior Software Engineer, AWS
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Slide 2

Index
• Radix tree and Adaptive Radix Tree (ART)
• ART in PostgreSQL
• TidStore
• Vacuum improvements
• Future Plans

Slide 3

Radix Tree
• The shape depends on the key space and key lengths.
  § It does not depend on the insertion order or on the other existing keys.
• No rebalancing operations.
• The keys are sorted in lexicographic order.
• Operations such as insertion are O(k), where k is the length of the key.
• The keys are implicitly stored in the tree structure.
  § A key can be reconstructed from the path to its leaf node.
[Figure: a radix tree storing the keys MEDAL, MEDIA, MOUNT, MOUSE]
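To make the O(k) claim concrete, here is a minimal lookup sketch over byte-wise keys; the node layout and function are purely illustrative (not PostgreSQL code) and assume a plain, non-adaptive 256-way child array per node.

#include <stddef.h>

/*
 * Minimal byte-wise radix tree lookup (hypothetical types, not PostgreSQL
 * code): each inner node fans out on one key byte, so a lookup touches at
 * most k nodes for a k-byte key and never needs rebalancing.
 */
typedef struct rnode
{
    struct rnode *children[256];    /* one child slot per possible key byte */
    void         *value;            /* non-NULL when a key ends at this node */
} rnode;

static void *
radix_lookup(rnode *root, const unsigned char *key, int keylen)
{
    rnode *cur = root;

    for (int i = 0; i < keylen && cur != NULL; i++)
        cur = cur->children[key[i]];        /* descend on the i-th key byte */

    return cur ? cur->value : NULL;
}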

Slide 4

Adaptive Radix Tree
• “The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases”, Viktor Leis, Alfons Kemper, Thomas Neumann.
• Adaptive nodes, with a different algorithm for accessing each of them.
  § Node4, Node16, Node48, and Node256
  § Uses SIMD instructions
• Lazy expansion
• Path compression
• The key span is 8 bits (= 1 byte).
  § The maximum fanout is 256.
https://db.in.tum.de/~leis/papers/ART.pdf

Slide 5

(Traditional) Radix Tree vs. Adaptive Radix Tree

Slide 6

Adaptive Nodes (Node4 and Node16)
https://db.in.tum.de/~leis/papers/ART.pdf
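The figure referenced above is from the ART paper; as a hedged sketch of the two small node kinds it describes (field names here are mine, not PostgreSQL's structs), Node4 keeps up to four sorted key bytes with parallel child pointers and is searched linearly, while Node16 extends that to sixteen, where the key search is a good fit for one SIMD compare.

#include <stddef.h>
#include <stdint.h>

/* Illustrative layouts following the ART paper; not PostgreSQL's structs.
 * Key bytes and child pointers are kept in parallel arrays, keys[] sorted. */
typedef struct Node4
{
    uint8_t count;          /* number of used slots (<= 4) */
    uint8_t keys[4];
    void   *children[4];
} Node4;

typedef struct Node16
{
    uint8_t count;          /* number of used slots (<= 16) */
    uint8_t keys[16];       /* candidate for a single 16-byte SIMD compare */
    void   *children[16];
} Node16;

/* A plain linear scan is enough for Node4; Node16 can do the same search
 * with SSE/NEON compare instructions instead of a loop. */
static void *
node4_find_child(const Node4 *n, uint8_t keybyte)
{
    for (int i = 0; i < n->count; i++)
    {
        if (n->keys[i] == keybyte)
            return n->children[i];
    }
    return NULL;
}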

Slide 7

Adaptive Nodes (Node48 and Node256)
https://db.in.tum.de/~leis/papers/ART.pdf
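And a matching sketch for the two large node kinds (again illustrative, not PostgreSQL's structs): Node48 pays one extra indirection through a 256-entry index array into at most 48 child slots, while Node256 is just a full child array indexed directly by the key byte.

#include <stddef.h>
#include <stdint.h>

/* Illustrative layouts following the ART paper; not PostgreSQL's structs. */
#define INVALID_SLOT 0xFF

typedef struct Node48
{
    uint8_t child_index[256];   /* key byte -> slot in children[], or INVALID_SLOT */
    void   *children[48];
} Node48;

typedef struct Node256
{
    void *children[256];        /* NULL where no child exists */
} Node256;

static void *
node48_find_child(const Node48 *n, uint8_t keybyte)
{
    uint8_t slot = n->child_index[keybyte];

    return (slot == INVALID_SLOT) ? NULL : n->children[slot];
}

static void *
node256_find_child(const Node256 *n, uint8_t keybyte)
{
    return n->children[keybyte];    /* a single array access, no search */
}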

Slide 8

Leaf Nodes
• Leaf nodes hold the values.
• The ART paper describes three ways to implement leaf nodes:
  • Single-value leaves
  • Multi-value leaves
  • Combined pointer/value slots
• ART in PostgreSQL uses all of them, depending on the value size and type (fixed-length or variable-length).

Slide 9

How to Store Values in Leaf Nodes (e.g., in Node-4)
[Figure: a Node-4 with chunks 0x11, 0x2B, 0x4F, 0xE0 and values val1..val4, shown in the three layouts: single-value leaves, multi-value leaves, and combined pointer/value slots]
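A rough way to picture the three layouts for a 4-entry node, with illustrative (non-PostgreSQL) types; the trade-off is between per-value allocations and slot size.

#include <stdint.h>

/* Single-value leaves: each slot points to a separately allocated value. */
typedef struct Node4SingleValue
{
    uint8_t   chunks[4];
    uint64_t *slots[4];         /* one allocation per value */
} Node4SingleValue;

/* Multi-value leaves: the values live directly in the node's slot array. */
typedef struct Node4MultiValue
{
    uint8_t  chunks[4];
    uint64_t slots[4];          /* values stored inline, no extra allocations */
} Node4MultiValue;

/* Combined pointer/value slots: a slot holds either an embedded value or a
 * pointer to an out-of-line value, distinguished by a tag bit (see the
 * pointer-tagging notes later in the deck). */
typedef union Node4CombinedSlot
{
    uintptr_t embedded_value;
    void     *out_of_line_value;
} Node4CombinedSlot;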

Slide 10

ART in PostgreSQL

Slide 11

Development Timeline
• Started a thread “[PoC] Improve dead tuple storage for lazy vacuum” on pgsql-hackers in 2021.
  • Started by looking for better data structures.
  • Candidates: integerset.c, tidbitmap.c, Roaring Bitmap, radix tree, etc.
  • Wrote PoC code and evaluated the performance of these data structures.
• Started implementing the radix tree in 2022.
  • The first radix tree implementation was based on the prototype implemented by Andres Freund.
• Committed on 7 March 2024 (ee1b30f128d8f6) by John Naylor.

Slide 12

ART in PostgreSQL – Overview –
• 64-bit key (8-bit key span)
• User-defined value type
• Shared memory (DSA) support
• Macro template
• Decoupling node size class and node kind

Slide 13

Macro Template

#define RT_PREFIX rt
#define RT_SCOPE
#define RT_DECLARE
#define RT_DEFINE
#define RT_USE_DELETE
#define RT_VALUE_TYPE int64
#include "lib/radixtree.h"

Generates:

/* create and free */
rt_radix_tree *rt_create(MemoryContext cxt);
void rt_free(rt_radix_tree *tree);

/* find, set, and delete */
int64 *rt_find(rt_radix_tree *tree, uint64 key);
bool rt_set(rt_radix_tree *tree, uint64 key, int64 *value_p);
bool rt_delete(rt_radix_tree *tree, uint64 key);

/* iteration */
rt_iter *rt_begin_iterate(rt_radix_tree *tree);
int64 *rt_iterate_next(rt_iter *iter, uint64 *key_p);
void rt_end_iterate(rt_iter *iter);

uint64 rt_memory_usage(rt_radix_tree *tree);

Slide 14

Example
• rt_radix_tree and the rt_* functions are declared and defined.
• More examples are in src/test/modules/test_radixtree.

rt_radix_tree *tree;
rt_iter    *iter;
uint64      key;
int64       value = 999;
int64      *value_p;
bool        found;

tree = rt_create(CurrentMemoryContext);

rt_set(tree, 100, &value);
value_p = rt_find(tree, 100);
found = (value_p != NULL);

/* iteration */
iter = rt_begin_iterate(tree);
while ((value_p = rt_iterate_next(iter, &key)) != NULL)
{
    ...
}
rt_end_iterate(iter);

rt_delete(tree, 100);
rt_free(tree);

Slide 15

Value Types (#define RT_VALUE_TYPE)
• #define RT_VALUE_TYPE is mandatory.
• Fixed-length values
  • Single-value leaves or multi-value leaves, decided at compile time.
• Variable-length values
  • Specify “#define RT_VARLEN_VALUE_SIZE()”.
  • Single-value leaves are used.
  • If RT_RUNTIME_EMBEDDABLE_VALUE is additionally specified, the lowest bit of the value is reserved by the radix tree for pointer tagging, and the combined pointer/value slots approach is used.
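As a generic illustration of that pointer tagging (my own convention, not the radixtree.h code): allocator-returned pointers are aligned, so their lowest bit is always zero, which frees that bit to mark a slot that embeds the value itself rather than pointing to one.

#include <stdbool.h>
#include <stdint.h>

/* The caller keeps the value's lowest bit clear; the tree sets it to mark
 * an embedded value.  A pointer to an out-of-line value keeps its natural
 * zero low bit, so the two cases can be told apart from the slot alone. */
#define SLOT_EMBEDDED_BIT  ((uintptr_t) 1)

static inline bool
slot_is_embedded(uintptr_t slot)
{
    return (slot & SLOT_EMBEDDED_BIT) != 0;
}

static inline uintptr_t
slot_make_embedded(uintptr_t value_with_low_bit_clear)
{
    return value_with_low_bit_clear | SLOT_EMBEDDED_BIT;
}

static inline uintptr_t
slot_get_embedded(uintptr_t slot)
{
    return slot & ~SLOT_EMBEDDED_BIT;
}

static inline void *
slot_get_pointer(uintptr_t slot)
{
    return (void *) slot;       /* low bit is already 0 for pointers */
}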

Slide 16

Decoupling node “size class” from “kind”
• We have 4 node “kinds”: node-4, node-16, node-48, and node-256.
• We can have multiple “size classes” per node kind.
  • The size classes within a node kind share the same underlying type.
  • A node can grow and shrink within the same node kind with just a repalloc().
  • node-16 has two size classes (lo and hi).
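A hedged sketch of what that decoupling can look like; the four kinds are real, but the struct, field names, fanouts, and sizes below are invented for illustration only. The point is that two size classes of the same kind share one layout, so moving between them is a repalloc() rather than a node-kind conversion.

#include <stddef.h>

typedef enum RTNodeKind
{
    NODE_KIND_4,
    NODE_KIND_16,
    NODE_KIND_48,
    NODE_KIND_256
} RTNodeKind;

typedef struct RTSizeClassInfo
{
    RTNodeKind kind;        /* which struct layout / search algorithm to use */
    int        fanout;      /* how many children this class can hold */
    size_t     allocsize;   /* bytes to allocate for a node of this class */
} RTSizeClassInfo;

/* Hypothetical numbers, for illustration only. */
static const RTSizeClassInfo size_class_info[] = {
    {NODE_KIND_4,    4,   48},
    {NODE_KIND_16,   8,  152},  /* "node-16 lo" */
    {NODE_KIND_16,  16,  296},  /* "node-16 hi": same layout, grown via repalloc() */
    {NODE_KIND_48,  48, 1024},
    {NODE_KIND_256, 256, 2088},
};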

Slide 17

TidStore

Slide 18

TidStore (src/backend/access/common/tidstore.c)
• A new data structure in PostgreSQL 17 (30e144287).
• Stores a large set of TIDs efficiently.
• Uses a radix tree internally.
  • The key is the BlockNumber and the value is a bitmap of offsets.
  • Uses combined pointer/value slots.
• Supports a shared TidStore.
• Supports the basic operations:
  • TidStoreSetBlockOffsets() and TidStoreIsMember()
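To see how the key/value split works, here is a rough sketch (not the tidstore.c code; the bitmap type is a simplified fixed-size stand-in for the real BlocktableEntry) of turning the TIDs of one heap block into the bitmap that would be stored under that block number.

#include "postgres.h"

#include "storage/itemptr.h"

/* Simplified stand-in for the per-block value: a fixed-size offset bitmap. */
typedef struct OffsetBitmap
{
    uint64 words[8];            /* enough bits for a heap page's offsets */
} OffsetBitmap;

/*
 * Collect the offsets of all TIDs that belong to "block" into one bitmap.
 * In TidStore terms, "block" is the radix tree key and the bitmap is the
 * value stored under it.
 */
static void
offsets_for_block(ItemPointerData *tids, int ntids,
                  BlockNumber block, OffsetBitmap *bitmap)
{
    memset(bitmap, 0, sizeof(*bitmap));

    for (int i = 0; i < ntids; i++)
    {
        OffsetNumber off;

        if (ItemPointerGetBlockNumber(&tids[i]) != block)
            continue;

        off = ItemPointerGetOffsetNumber(&tids[i]);
        bitmap->words[off / 64] |= UINT64CONST(1) << (off % 64);
    }
}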

Slide 19

Optimizations in TidStore
• Embed up to 3 offsets in the padding space of the value header.
• Use the bump memory context for storage, which suits this insert-only workload.

typedef struct BlocktableEntry
{
    struct
    {
        uint8   flags;      /* used for pointer tagging */
        int8    nwords;

        /* at most 3 offsets can be embedded in the header */
        OffsetNumber full_offsets[NUM_FULL_OFFSETS];
    } header;

    /* bitmap of offsets */
    bitmapword words[FLEXIBLE_ARRAY_MEMBER];
} BlocktableEntry;
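A hedged sketch of how those two ideas fit together (simplified; this is not the actual tidstore.c logic, and it assumes the BlocktableEntry definition above plus a sorted offsets[] array): when a block has at most NUM_FULL_OFFSETS dead items they go straight into the header and nwords stays 0, otherwise the usual offset bitmap is built in words[].

#include "postgres.h"

#include "nodes/bitmapset.h"        /* bitmapword, BITS_PER_BITMAPWORD */
#include "storage/off.h"            /* OffsetNumber */

static void
fill_block_entry(BlocktableEntry *entry,
                 const OffsetNumber *offsets, int noffsets)
{
    if (noffsets <= NUM_FULL_OFFSETS)
    {
        /* few offsets: store them in the header padding, no bitmap words */
        entry->header.nwords = 0;
        for (int i = 0; i < noffsets; i++)
            entry->header.full_offsets[i] = offsets[i];
    }
    else
    {
        /* otherwise build the offset bitmap */
        OffsetNumber maxoff = offsets[noffsets - 1];    /* input is sorted */

        entry->header.nwords = maxoff / BITS_PER_BITMAPWORD + 1;
        memset(entry->words, 0, entry->header.nwords * sizeof(bitmapword));

        for (int i = 0; i < noffsets; i++)
            entry->words[offsets[i] / BITS_PER_BITMAPWORD] |=
                ((bitmapword) 1) << (offsets[i] % BITS_PER_BITMAPWORD);
    }
}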

Slide 20

Using TidStore in Lazy Vacuum

Slide 21

Background – How Lazy Vacuum Works –
• At the beginning of lazy vacuum, we allocate maintenance_work_mem bytes for VacDeadItems all at once.
• 1st phase: scan the heap table and store the dead TIDs into VacDeadItems.
• 2nd phase: vacuum all indexes and then the table.
  • Each TID is looked up in the TID array with bsearch().

typedef struct VacDeadItems
{
    int     max_items;      /* # slots allocated in array */
    int     num_items;      /* current # of entries */

    /* Sorted array of TIDs to delete from indexes */
    ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
} VacDeadItems;
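For comparison with what follows, a minimal sketch of that pre-17 lookup path (simplified; the function and comparator names here are illustrative, not the exact vacuumlazy.c names): each TID fetched from an index is probed against the sorted dead-TID array with bsearch(), giving O(log N) work per index tuple.

#include "postgres.h"

#include "storage/itemptr.h"

/* qsort/bsearch comparator over ItemPointerData. */
static int
cmp_itemptr(const void *a, const void *b)
{
    return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
}

/* Was this index tuple's TID collected as dead during the first phase? */
static bool
tid_is_dead(ItemPointer tid, VacDeadItems *dead_items)
{
    return bsearch(tid, dead_items->items, dead_items->num_items,
                   sizeof(ItemPointerData), cmp_itemptr) != NULL;
}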

Slide 22

Problems and Limitations
• Inefficient TID storage space
  • Uses sizeof(ItemPointerData) (6 bytes) per TID.
• 1GB limitation
  • “Note that for the collection of dead tuple identifiers, VACUUM is only able to utilize up to a maximum of 1GB of memory.”
  • Allocates all maintenance_work_mem bytes at once.
• Inefficient TID lookup performance
  • O(log N) complexity, where N is the number of TIDs.

Slide 23

Using TidStore in Lazy Vacuum
• Committed on 2 April 2024 (667e65aac).
• Replaces the TID array with a TidStore.
• Uses the bump memory context.
• Renames columns of the pg_stat_progress_vacuum view:
  • num_dead_tuples -> dead_tuple_bytes
  • max_dead_tuples -> max_dead_tuple_bytes

Slide 24

Performance Tests
• PG16 (TID array) vs. PG17-Beta1 (TidStore).
• A table with one btree index (100M rows, 3GB ~ 4GB).
• Delete some tuples and run VACUUM (VERBOSE).
• SET maintenance_work_mem TO '1GB';
• Measure the execution time and the memory consumption.
  • Note that for PG16, the memory consumption is (the number of deleted tuples) * 6 bytes (= sizeof(ItemPointerData)).
• Three cases: sparse, dense, and UUID.
• Note that other VACUUM improvements in PG17 may also contribute to the performance differences.

Slide 25

Performance Tests (Dense Case)
• Data loaded by: INSERT INTO tbl SELECT generate_series(1, 100 * 1000 * 1000);
• The first 50% of the table is garbage (50M deleted tuples in total).
• Memory usage: 286MB (PG16) vs. 17MB (PG17)
[Chart: VACUUM execution time in seconds, logged and unlogged tables, PG16 vs. PG17]

Slide 26

Performance Tests (Sparse Case)
• Data loaded by: INSERT INTO tbl SELECT generate_series(1, 100 * 1000 * 1000);
• Delete one tuple in every 4 blocks (about 100k deleted tuples in total).
• Memory usage: 586kB (PG16) vs. 15MB (PG17)
[Chart: VACUUM execution time in seconds, logged and unlogged tables, PG16 vs. PG17]

Slide 27

Performance Tests (UUID Case)
• Data loaded by: INSERT INTO tbl SELECT gen_random_uuid() FROM generate_series(1, 100 * 1000 * 1000);
• Delete a random 25% of the tuples (25M deleted tuples in total).
• Memory usage: 143MB (PG16) vs. 28MB (PG17)
[Chart: VACUUM execution time in seconds, logged and unlogged tables, PG16 vs. PG17]

Slide 28

Summary (With TidStore)
• Inefficient TID storage space (sizeof(ItemPointerData) per TID)
  -> Space-efficient TID storage backed by a radix tree; up to 20x memory reduction.
• Inefficient TID lookup performance (O(log N))
  -> Lookup in O(k), where k is fairly small (4); 2x ~ 3x faster.
• Bulk allocation
  -> Memory is incrementally allocated.
• 1GB limitation
  -> Vacuum no longer has a 1GB limit.

Slide 29

Future Plans
• Use TidStore for BitmapScan instead of tidbitmap.
• Use radix tree in buffer manager for faster block lookups.
• Better concurrency support in the radix tree (c.f. ROWEX).
• Parallel Vacuum Scan

Slide 30

More Tests!
• While using TidStore in lazy vacuum greatly improves performance and memory usage, these changes touch a very critical part of PostgreSQL.
• Both the radix tree and TidStore are new.
• There is no on/off switch.
• Do we need some kind of safeguard, just in case?

Slide 31

Thank you