Upgrade to Pro — share decks privately, control downloads, hide ads and more …

PostgreSQL Meets ART - Using Adaptive Radix Tre...

PostgreSQL Meets ART - Using Adaptive Radix Tree To Speed Up Vacuuming -

A talk at PGConf.dev 2024.

Masahiko Sawada

June 04, 2024
Tweet

Other Decks in Technology

Transcript

  1. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved. PostgreSQL Meets ART - Using Adaptive Radix Tree To Speed Up Vacuuming - Masahiko Sawada Senior Software Engineer AWS
  2. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Index • Radix tree and Adaptive Radix Tree (ART) • ART in PostgreSQL • TidStore • Vacuum improvements • Future Plans 2
  3. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Radix Tree • Shape depends on key space and lengths. § Not depend on insertion order or existing keys. • No rebalancing operations. • The keys are sorted in lexicographic order. • Operations such as insertion are O(k) where k is the length of the key. • The keys are implicitly stored in the tree structure. § Can be constructed from the path to the leaf nodes. 3 M O U N S E D A I Keys: MEDAL, MEDIA, MOUNT, MOUSE T E A L
  4. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Adaptive Radix Tree • “The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases” Viktor Leis, Alfons Kemper, Thomas Neumann • Adaptive nodes with different algorithm for accessing them. § Node4, Node16, Node48, and Node256 § Use SIMD instructions • Lazy Expansion • Path compression • Key span is 8 (=1 byte). § The maximum fanout is 256. 4 https://db.in.tum.de/~leis/papers/ART.pdf
  5. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. (Traditional) Radix Tree vs. Adaptive Radix Tree 5
  6. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Adaptive Nodes (Node4 and Node16) 6 https://db.in.tum.de/~leis/papers/ART.pdf
  7. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Adaptive Nodes (Node48 and Node 256) 7 https://db.in.tum.de/~leis/papers/ART.pdf
  8. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Leaf Nodes • Leaf nodes have values. • The ART paper describes three ways to implement leaf nodes • Single-value leaves • Multi-value leaves • Combined pointer/value leaves • ART in PostgreSQL uses all of them depending on value size and type (fixed-length or variable-length). 8
  9. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. How to Store Values in Leaf Nodes (e.g., in Node-4) 9 0x11 0x2B 0x4F 0xE0 chunks slots val1 val2 val3 val4 0x11 0x2B 0x4F 0xE0 val1 val2 val3 val4 chunks slots 0x11 0x2B 0x4F 0xE0 val2 val4 chunks slots val1 val3 Single-value leaves Multi-value leaves Combined pointer/value slots
  10. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved. ART in PostgreSQL 10
  11. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Development Timeline • Started a thread “[PoC] Improve dead tuple storage for lazy vacuum” on pgsql-hackers in 2021. • Started with looking for better data structures. • candidates: integerset.c, tidbitmap.c, Roaring Bitmap, radix tree etc. • Wrote PoC codes, evaluated these data structures performance. • Started to implement radix tree in 2022. • The first radix tree implementation is based on the prototype implemented by Andres Freund. • Committed on 7 March 2024 (ee1b30f128d8f6) by John Naylor. 11
  12. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. ART in PostgreSQL – Overview – • 64-bit key (8-bit key span) • User defined value type • Shared memory (DSA) support • Macro teplate • Decoupling node class and node kind 12
  13. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Macro Template 13 #define RT_PREFIX rt #define RT_SCOPE #define RT_DECLARE #define RT_DEFINE #define RT_USE_DELETE #define RT_VALUE_TYPE int64 #include "lib/radixtree.h" /* create and free */ rt_radix_tree *rt_create(MemoryContext *cxt); void rt_free(rt_radix_tree *tree); /* find, set, and delete */ int64 rt_find(rt_radix_tree *tree, uint64 key); bool rt_set(rt_radix_tree *tree, uint64 key, int64 *value_p); bool rt_delete(rt_radix_tree *tree, uint64 key); /* iteration */ rt_iter *rt_begin_iterate(rt_radix_tree *tree); int64 rt_iterate_next(rt_iter *iter, uint64 *key_p); void rt_end_itarage(rt_iter *iter); uint64 rt_memory_usage(rt_radix_tree *tree); Generate
  14. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Example • <RT_PREFIX>_radix_tree is declared and defined. • More example in src/test/modules/test_radixtree 14 rt_radix_tree *tree; rt_iter *iter; int64 value = 999; bool found; tree = rt_create(CurrentMemoryContext); rt_set(tree, 100, &value); found = rt_find(tree, 100); /* iteration */ iter = rt_begin_iterage(tree); while ((value = rt_iterage_next(tree, &key)) != NULL) { ... } rt_end_iterate(iter); rt_delete(tree, 100); rt_free(tree);
  15. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Value Types (#define RT_VALUE_TYPE) • #define RT_VALUE_TYPE is mandatory. • Fixed length values • Single-value leaves or Multi-value leaves decided at compile time. • Variable length values • Specify “#define RT_VARLEN_VALUE_SIZE()” • Single-value leaves • If RT_RUNTIME_EMBEDDABLE_VALUE is specified additionaly, the lowest bit of the value is reserved by radix tree for pointer tagging, but combined pointer/value leaves approach is used. 15
  16. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Decoupling node “size class” from “kind” • We have 4 node “kinds”: node-4, node-16, node-48, and node-256 • We can have multiple “size classes” per node kind. • The size classes with a node kind have the same underlying type. • A node can grow and shrink within the same node kind only by repalloc(). • node-16 has two size classes (hi and low). 16
  17. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved. TidStore 17
  18. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. TidStore (src/backend/access/common/tidstore.c) • New data structure for PostgreSQL 17 (30e144287). • Store a large set of TIDs efficiently. • Use a radix tree internally. • Key is BlockNumber and value is a bitmap of offsets. • Use combined pointer/value slots. • Support shared TidStore • Support basic operations: • TidStoreSetBlockOffset() and TidStoreIsMember() 18
  19. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Optimizations in TidStore • Embed 3 offsets in padding space in the value header. • Use bump memory context for storage for insert-only workload. 19 typedef struct BlocktableEntry { struct { uint8 flags; int8 nwords; OffsetNumber full_offsets[NUM_FULL_OFFSETS]; } header; bitmapword words[FLEXIBLE_ARRAY_MEMBER]; } BlocktableEntry; - At most 3 offsets can be embedded in the header. - flags is used for pointer tagging. - bitmap of offsets
  20. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved. Using TidStore in Lazy Vacuum 20
  21. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Background – how lazy vacuum works? – • At the beginning of lazy vacuum, we allocate the maintenance_work_mem bytes for VacDeadItems at once. • 1st phase: scan a heap table and store dead TIDs into VacDeadItems. • 2nd phase: vacuuming all indexes and table. • Look for a TID from the TID array by bsearch(). 21 typedef struct VacDeadItems { int max_items; /* # slots allocated in array */ int num_items; /* current # of entries */ /* Sorted array of TIDs to delete from indexes */ ItemPointerData items[FLEXIBLE_ARRAY_MEMBER]; } VacDeadItems;
  22. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Problems and Limitation • Inefficient TID storage space • Use sizeof(ItemPointerData) per TID. • 1GB limitation • “Note that for the collection of dead tuple identifiers, VACUUM is only able to utilize up to a maximum of 1GB of memory”. • Allocate all maintenance_work_mem bytes at once. • Inefficient TID lookup performance • O(log N) complexity where N is the number of TIDs. 22
  23. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Using TidStore in Lazy Vacuum • Committed on April 2 (667e65aac) • Replace the TID array with TidStore • Use the bump memory context • Change column names of pg_stat_progress_vacuum view • num_dead_tuples -> dead_tuple_bytes • max_dead_tuples -> max_dead_tuple_bytes 23
  24. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Performance Tests • PG16 (TID array) and PG17-Beta1 (TidStore) • Table with one btree index (100M rows, 3GB ~ 4GB) • Delete some tuples and do VACUUM (VERBOSE) • SET maintenance_work_mem TO ‘1GB’; • Measure the execution time and memory comsumption. • Note that for PG16, itʼs (the number of deleted tuples) * 6 bytes (= sizeof(ItemPointerData)). • Three cases: sparse, dense, and UUID. • Note that other improvements in VACUUM in PG17 may contribute performance differences. 24
  25. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Performance Tests (Dense Case) • Data load by INSERT INTO tbl SELECT generate_series(1, 100 * 1000 * 1000); • First 50% of the table is garbage (making 50M deleted tuples in total). • Memory usage: 286MB (PG16) vs. 17MB (PG17) 25 0 2 4 6 8 10 12 logged unlogged logged unlogged PG16 PG17 Execution Time (sec)
  26. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Performance Tests (Sparse Case) • Data load by INSERT INTO tbl SELECT generate_series(1, 100 * 1000 * 1000); • Delete one tuple on every 4 blocks (making about 100k deleted tuples in total). • Memory usage: 586kB (PG16) vs. 15MB (PG17) 26 0 2 4 6 8 10 12 14 16 logged unlogged logged unlogged PG16 PG17 Execution Time (sec)
  27. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Performance Tests (UUID Case) • Data load by: INSERT INTO tbl SELECT gen_random_uuid() FROM generate_series(1, 100 * 1000 * 1000); • Delete random 25% of total tuples (making 25M deleted tuples in total). • Memory usage: 143MB (PG16) vs. 28MB (PG17) 27 0 5 10 15 20 25 30 35 40 45 logged unlogged logged unlogged PG16 PG17 Execution Time (sec)
  28. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Summary (With TidStore) • Inefficient TID storage space (using sizeof(ItemPointerData) per TID) -> Space Efficient TID storage backed by radix tree, up to 20x memory reduction. • Inefficient TID lookup performance (O(log N)) -> Lookup in O(k), where k is fairly small, 4. 2x ~ 3x faster. • Bulk allocation -> Memory is increnetally allocated. • 1GB limitation -> Vacuum no longer has a 1GB limit. 28
  29. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. Future Plans • Use TidStore for BitmapScan instead of tidbitmap. • Use radix tree in buffer manager for faster block lookups. • Better concurrency support in the radix tree (c.f. ROWEX). • Parallel Vacuum Scan 29
  30. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. More Tests! • While using TidStore in lazy vacuum improves performance and memory usage much, these changes touch very critical part in PostgreSQL. • Both radix tree and TidStore are new. • There is no on/off switch. • Do we need something safeguard just in case? 30
  31. © 2024, Amazon Web Services, Inc. or its affiliates. All

    rights reserved. © 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved. Thank you 31