Performance Improvements of Partitioning: Past and Future - PGConf.dev 2024

Performance Improvements of Partitioning: Past and Future Yuya Watari PGConf.dev
2024

2 © 2024 NTT CORPORATION About me Yuya Watari 
From Tokyo, Japan Work  For NTT Open Source Software Center/NTT Software Innovation Center  Developing distributed databases Interests  Planning  Optimization  Distributed transaction  Photography

3 © 2024 NTT CORPORATION Agenda 1. Introduction – Partitioning
and Its Performance Problem 2. A Deep Dive into the Planner – Why Does Planning Have 𝑶𝑶 𝒏𝒏𝟐𝟐 ? 3. Hacking the Planner – How to Make It 𝑶𝑶 𝒏𝒏 ? 4. Experiments 5. Tips 6. Open Issues 7. Conclusion

4 © 2024 NTT CORPORATION Introduction – Partitioning and Its
Performance Problem

5 © 2024 NTT CORPORATION Partitioning is essential and important
feature in big-data era  Important to realize high scalability and manageability  User demand for partitioning is increasing year by year  Use case example: management of log data  Splitting log data per date Partitioning Partitioned Table logs Table logs_20220101 Table logs_20220102 Table logs_20221231

6 © 2024 NTT CORPORATION History of partitioning PostgreSQL 9.6
and before  Table inheritance and CHECK constraint PostgreSQL 10  Declarative partitioning (including syntax support)  Partition pruning CREATE TABLE t1(id INT, name TEXT) PARTITION BY RANGE (id); CREATE TABLE t1_1 PARTITION OF t1 FOR VALUES FROM (0) TO (100); CREATE TABLE t1_2 PARTITION OF t1 FOR VALUES FROM (100) TO (200); SELECT * FROM t1 WHERE id < 100 Unrelated so eliminated from query plan Partitioned Table Table Table t1 t1_1 t1_2 0 <= id < 100 100 <= id < 200

7 © 2024 NTT CORPORATION History of partitioning PostgreSQL 11
 Partition-wise join/aggregation  Run-time partition pruning  Allows us to prune partitions based on values that will not be determined until execution time  Expands scope of partition pruning and speeds up various workloads  HASH partitioning  Distributes accesses to child partitions Partitioned Table Table a1 Table a2 id = 1 id = 2 Partitioned Table Table b1 Table b2 id = 1 id = 2 ⋈ Table a1 Table b1 id = 1 id = 1 ⋈ Table a2 Table b2 id = 2 id = 2 ⋈ Append

8 © 2024 NTT CORPORATION History of partitioning PostgreSQL 11
 Indexes for partitioned tables  Faster partition pruning  Partition pruning previously involved linear search over partition’s metadata  Faster algorithms, such as binary search or hashing function, have been utilized for partition pruning since PostgreSQL 11 PostgreSQL 12  Faster COPY for partitioned tables ⋮ What’s next?  Today’s topic – Planning for highly partitioned tables proposed for PostgreSQL 18

9 © 2024 NTT CORPORATION Highly partitioned tables Handling highly
partitioned tables is challenging  In example of log data management, we need to handle about 1000 partitions for 3 years of data  Number of child partitions affects partitioning performance Partitioned Table logs Table logs_20220101 Table logs_20220102 Table logs_20220103 Table logs_20220104 Table logs_20241231

10 © 2024 NTT CORPORATION Problem with planning Planning for
highly partitioned tables is slow in current implementation  Problem happens when queries have join operations on partitioned tables with many non-pruned partitions Partitioned Table Table Table Table Partitioned Table Table Table Table ⋈ Most of child partitions are relevant Planning is slow!

11 © 2024 NTT CORPORATION Inefficient handling of equivalences Causes
𝑶𝑶 𝒏𝒏𝟐𝟐 time complexity when planning join operations  𝑛𝑛 is number of child partitions Proposed patch reduces time complexity to 𝑶𝑶 𝒏𝒏  Improves architecture of equivalences  Under discussion on pgsql-hackers 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0 200 400 600 800 1000 Planning Time (seconds) # of partitions (n) PostgreSQL 17devel Patched Experiment #1: Planning time when joining three partitioned tables

12 © 2024 NTT CORPORATION A Deep Dive into the
Planner – Why Does Planning Have 𝑶𝑶 𝒏𝒏𝟐𝟐 ?

13 © 2024 NTT CORPORATION Deep dive into planner We
will deep dive into planner to see why it has 𝑶𝑶 𝒏𝒏𝟐𝟐 time complexity  Let’s see how next SQL is planned  Table t1 has two child partitions, namely t1_1 and t1_2  Same for t2 and t3  Query simply joins these three tables SELECT * FROM t1, t2, t3 WHERE t1.id = t2.id AND t1.id = t3.id Partitioned Table Table Table t1 t1_1 t1_2 Partitioned Table Table Table t2 t2_1 t2_2 Partitioned Table Table Table t3 t3_1 t3_2

14 © 2024 NTT CORPORATION Join orders and equivalences Planner
searches for join orders  Query has three possible join orders  Here, we ignore swapping of inner and outer tables Question  How does planner know that t2.id is equal to t3.id even though clause does not appear in given query? SELECT * FROM t1, t2, t3 WHERE t1.id = t2.id AND t1.id = t3.id; t1 t2 t3 ⋈ ⋈ t1 t3 t2 ⋈ ⋈ t2 t3 t1 ⋈ ⋈ t1.id = t2.id t1.id = t3.id t1.id = t2.id t1.id = t2.id t1.id = t3.id t2.id = t3.id

15 © 2024 NTT CORPORATION EquivalenceClass and EquivalenceMember Mechanism for
finding join clauses that do not explicitly appear in original query  If clause (t1.id = t2.id or t1.id = t3.id in this example) is found, we split it into EquivalenceMembers, and EquivalenceClass keeps them in list (ec_members)  EquivalenceMember stands for variable and has Relids related to member  List will have variables that are known to be equal to each other  We can find join clauses that do not explicitly appear in given query like t2.id = t3.id EquivalenceClass EM t1.id (Relids: {1}) EM t2.id (Relids: {2}) EM t3.id (Relids: {3}) t1.id = t2.id t1.id = t3.id ec_members EM: EquivalenceMember

16 © 2024 NTT CORPORATION Child EquivalenceMembers Planner makes child
EquivalenceMembers to handle partitioned tables EquivalenceClass EM t1.id (Relids: {1}) EM t2.id (Relids: {2}) EM t3.id (Relids: {3}) EM t1_1.id (Relids: {4}) EM t2_1.id (Relids: {5}) EM t3_1.id (Relids: {6}) EM t1_2.id (Relids: {7}) EM t2_2.id (Relids: {8}) EM t3_2.id (Relids: {9}) Parent members Child members

17 © 2024 NTT CORPORATION Why does planning have 𝑶𝑶
𝒏𝒏𝟐𝟐 time complexity? Lookups of EquivalenceMembers when building paths are time-consuming  There are 𝑛𝑛 paths to be built (𝑛𝑛 is number of child partitions)  Building each path involves EquivalenceMember lookups with 𝑂𝑂(𝑛𝑛) time complexity  Total time complexity becomes 𝑂𝑂 𝑛𝑛2 Partitioned Table 𝑛𝑛 paths 𝑶𝑶 𝒏𝒏𝟐𝟐 time complexity! Path Path Path Path Path Table Table Table Table Table Involves EquivalenceMember lookups with 𝑶𝑶(𝒏𝒏) complexity

18 © 2024 NTT CORPORATION Simplified flow of planning Everything
about planning starts with planner() function To build paths for partitioned tables, planner calls set_append_rel_pathlist()  We will look into source code of set_append_rel_pathlist() function set_append_rel_pathlist() planner() standard_planner() grouping_planner() query_planner() subquery_planner()

19 © 2024 NTT CORPORATION set_append_rel_pathlist() Builds paths for given
partitioned table static void set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, Index rti, RangeTblEntry *rte) { … foreach(l, root->append_rel_list) { AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(l); … /* * Compute the child's access paths. */ set_rel_pathlist(root, childrel, childRTindex, childRTE); … } … } Arguments other than PlannerInfo stand for partitioned table for which we want to build paths

partitioned table static void set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, Index rti, RangeTblEntry *rte) { … foreach(l, root->append_rel_list) { AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(l); … /* * Compute the child's access paths. */ set_rel_pathlist(root, childrel, childRTindex, childRTE); … } … } Contains 𝒏𝒏 AppendRelInfos  AppendRelInfo: Parent-child relationship AppendRelInfo Parent: t1 Child: t1_1 AppendRelInfo Parent: t2 Child: t2_1 AppendRelInfo Parent: t3 Child: t3_1 AppendRelInfo Parent: t1 Child: t1_2 AppendRelInfo Parent: t2 Child: t2_2 AppendRelInfo Parent: t3 Child: t3_2

partitioned table static void set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, Index rti, RangeTblEntry *rte) { … foreach(l, root->append_rel_list) { AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(l); … /* * Compute the child's access paths. */ set_rel_pathlist(root, childrel, childRTindex, childRTE); … } … } Calls this function to build paths for each partition  Function takes child relation as its arguments  There are 𝒏𝒏 child partitions, so this is called 𝒏𝒏 times

22 © 2024 NTT CORPORATION Function call tree so far
can be illustrated as below Call tree set_append_rel_pathlist() set_rel_pathlist() set_rel_pathlist() set_rel_pathlist() Calls 𝑛𝑛 times

23 © 2024 NTT CORPORATION Function call tree so far
can be illustrated as below Call tree We need to handle equivalences of join operations when building paths set_append_rel_pathlist() set_rel_pathlist() generate_join_implied_equalities() set_rel_pathlist() set_rel_pathlist() Calls 𝑛𝑛 times create_index_path()

24 © 2024 NTT CORPORATION generate_join_implied_equalities() Generates any join clauses
that we can deduce from equivalence static List * generate_join_implied_equalities_normal(…, EquivalenceClass *ec, Relids join_relids, …) { … foreach(lc1, ec->ec_members) { EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1); … if (!bms_is_subset(cur_em->em_relids, join_relids)) continue; /* not computable yet, or wrong child */ … } … }

that we can deduce from equivalence static List * generate_join_implied_equalities_normal(…, EquivalenceClass *ec, Relids join_relids, …) { … foreach(lc1, ec->ec_members) { EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1); … if (!bms_is_subset(cur_em->em_relids, join_relids)) continue; /* not computable yet, or wrong child */ … } … } {1, 6} = {t1, t3_1} Contains relations involved in this join operation (In this example, we try to join t1 and t3_1)

that we can deduce from equivalence static List * generate_join_implied_equalities_normal(…, EquivalenceClass *ec, Relids join_relids, …) { … foreach(lc1, ec->ec_members) { EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1); … if (!bms_is_subset(cur_em->em_relids, join_relids)) continue; /* not computable yet, or wrong child */ … } … } {1, 6} = {t1, t3_1} Performs simple linear search over EquivalenceMembers

that we can deduce from equivalence static List * generate_join_implied_equalities_normal(…, EquivalenceClass *ec, Relids join_relids, …) { … foreach(lc1, ec->ec_members) { EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1); … if (!bms_is_subset(cur_em->em_relids, join_relids)) continue; /* not computable yet, or wrong child */ … } … } Checks if current EquivalenceMember is related to given join_relids {1, 6} = {t1, t3_1}

28 © 2024 NTT CORPORATION Problem with highly partitioned tables
Search for ec_members is linear search with 𝑶𝑶 𝒏𝒏 time complexity  Slow for highly partitioned cases because EquivalenceClasses have many child members EquivalenceClass EM t1.id (Relids: {1}) EM t2.id (Relids: {2}) EM t3.id (Relids: {3}) EM t1_1.id (Relids: {4}) EM t2_1.id (Relids: {5}) EM t3_1.id (Relids: {6}) EM t1_2.id (Relids: {7}) EM t2_2.id (Relids: {8}) EM t3_2.id (Relids: {9}) Relids {1, 6} bms_is_subset()? Match!

29 © 2024 NTT CORPORATION Total time complexity is 𝑶𝑶
𝒏𝒏𝟐𝟐  Very slow! Call tree set_append_rel_pathlist() set_rel_pathlist() generate_join_implied_equalities() set_rel_pathlist() set_rel_pathlist() Calls 𝑛𝑛 times Equivalence Member Equivalence Member Equivalence Member Equivalence Member 𝑂𝑂 𝑛𝑛 members Linear search Equivalence Member create_index_path()

30 © 2024 NTT CORPORATION Hacking the Planner – How
to Make It 𝑶𝑶(𝒏𝒏)?

31 © 2024 NTT CORPORATION Hacking planner Fundamental problem is
that linear search over so many child members in EquivalenceClass is time-consuming We propose introducing child members when needed, without need to do time-consuming search  EquivalenceClasses no longer have child members in proposed patch  Since there are a few parent members, search for EquivalenceMembers is fast  Patch does not change plans  Patch is under discussion, so there may be (large) changes in future

32 © 2024 NTT CORPORATION Algorithm – How proposed patch
works What we want to do  Find EquivalenceMembers whose Relids is subset of given one  Given Relids: How to do this in proposed patch 1. Get parent representation of given Relids  Parent representation: 2. If EquivalenceMember matches parent representation, its child members may also match. So, we introduce and iterate over them  In this example, EquivalenceMember for t3 matches parent representation, so we introduce its child members (t3_1 and t3_2) {1, 6} = {t1, t3_1} {1, 3} = {t1, t3}

33 © 2024 NTT CORPORATION static List * generate_join_implied_equalities_normal(…, EquivalenceClass
*ec, Relids join_relids, …) { … foreach(lc1, ec->ec_members) { EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1); … if (!bms_is_subset(cur_em->em_relids, join_relids)) continue; /* not computable yet, or wrong child */ … } } generate_join_implied_equalities() Original implementation {1, 6} = {t1, t3_1}

34 © 2024 NTT CORPORATION With proposed patch (pseudocode) generate_join_implied_equalities()
static List * generate_join_implied_equalities_normal(…, EquivalenceClass *ec, Relids join_relids, …) { Relids top_parent_join_relids = find_relids_top_parents(root, join_relids); … foreach(lc1, ec->ec_members) { EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1); … if (bms_is_subset(cur_em->em_relids, top_parent_join_relids)) ec->ec_members = list_concat(ec->ec_members, get_child_members(ec, cur_em, join_relids)); if (!bms_is_subset(cur_em->em_relids, join_relids)) continue; /* not computable yet, or wrong child */ … } } {1, 6} = {t1, t3_1}

static List * generate_join_implied_equalities_normal(…, EquivalenceClass *ec, Relids join_relids, …) { Relids top_parent_join_relids = find_relids_top_parents(root, join_relids); … foreach(lc1, ec->ec_members) { EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1); … if (bms_is_subset(cur_em->em_relids, top_parent_join_relids)) ec->ec_members = list_concat(ec->ec_members, get_child_members(ec, cur_em, join_relids)); if (!bms_is_subset(cur_em->em_relids, join_relids)) continue; /* not computable yet, or wrong child */ … } } {1, 6} = {t1, t3_1} EM t1.id (Relids: {1}) EM t2.id (Relids: {2}) EM t3.id (Relids: {3}) ec_members only contains parent members ec_members

static List * generate_join_implied_equalities_normal(…, EquivalenceClass *ec, Relids join_relids, …) { Relids top_parent_join_relids = find_relids_top_parents(root, join_relids); … foreach(lc1, ec->ec_members) { EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1); … if (bms_is_subset(cur_em->em_relids, top_parent_join_relids)) ec->ec_members = list_concat(ec->ec_members, get_child_members(ec, cur_em, join_relids)); if (!bms_is_subset(cur_em->em_relids, join_relids)) continue; /* not computable yet, or wrong child */ … } } {1, 6} = {t1, t3_1} EM t1.id (Relids: {1}) EM t2.id (Relids: {2}) EM t3.id (Relids: {3}) ec_members Get parent representation of given Relids {1, 3} = {t1, t3}

static List * generate_join_implied_equalities_normal(…, EquivalenceClass *ec, Relids join_relids, …) { Relids top_parent_join_relids = find_relids_top_parents(root, join_relids); … foreach(lc1, ec->ec_members) { EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1); … if (bms_is_subset(cur_em->em_relids, top_parent_join_relids)) ec->ec_members = list_concat(ec->ec_members, get_child_members(ec, cur_em, join_relids)); if (!bms_is_subset(cur_em->em_relids, join_relids)) continue; /* not computable yet, or wrong child */ … } } {1, 6} = {t1, t3_1} EM t1.id (Relids: {1}) EM t2.id (Relids: {2}) EM t3.id (Relids: {3}) ec_members {1, 3} = {t1, t3}

static List * generate_join_implied_equalities_normal(…, EquivalenceClass *ec, Relids join_relids, …) { Relids top_parent_join_relids = find_relids_top_parents(root, join_relids); … foreach(lc1, ec->ec_members) { EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1); … if (bms_is_subset(cur_em->em_relids, top_parent_join_relids)) ec->ec_members = list_concat(ec->ec_members, get_child_members(ec, cur_em, join_relids)); if (!bms_is_subset(cur_em->em_relids, join_relids)) continue; /* not computable yet, or wrong child */ … } } {1, 6} = {t1, t3_1} EM t1.id (Relids: {1}) EM t2.id (Relids: {2}) EM t3.id (Relids: {3}) ec_members {1, 3} = {t1, t3} If member matches parent representation, its child members may also match. We iterate over child members. (Note: actual code is more complicated so as not to break ec_members) {1, 3} = {t1, t3} EM t3_1.id (Relids: {6})

static List * generate_join_implied_equalities_normal(…, EquivalenceClass *ec, Relids join_relids, …) { Relids top_parent_join_relids = find_relids_top_parents(root, join_relids); … foreach(lc1, ec->ec_members) { EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1); … if (bms_is_subset(cur_em->em_relids, top_parent_join_relids)) ec->ec_members = list_concat(ec->ec_members, get_child_members(ec, cur_em, join_relids)); if (!bms_is_subset(cur_em->em_relids, join_relids)) continue; /* not computable yet, or wrong child */ … } } {1, 6} = {t1, t3_1} EM t1.id (Relids: {1}) EM t2.id (Relids: {2}) EM t3.id (Relids: {3}) ec_members {1, 3} = {t1, t3} {1, 3} = {t1, t3} EM t3_1.id (Relids: {6}) In this example, EquivalenceMember for t3_2 is also child but non-relevant. To eliminate such members, proposed patch utilizes Bitmapset-based indexes

40 © 2024 NTT CORPORATION Bitmapset-based indexes Key technique to
get child members efficiently  Parent EquivalenceMember keeps its child members’ Relids as Bitmapset  Intersecting this index with given Relids gives us EquivalenceMembers that we actually want  Bitmapset operations are fast, so we can introduce child members efficiently EM t3.id (Relids: {3}) (Child members’ Relids: {6, 9}) Relids: {1, 6} join_relids EM t3_1.id (Relids: {6}) Intersect

41 © 2024 NTT CORPORATION Total time complexity is reduced
to almost 𝑶𝑶 𝒏𝒏  Very fast! Call tree – With proposed patch Equivalence Member Equivalence Member Only a few parent members Almost 𝑂𝑂(1) set_append_rel_pathlist() set_rel_pathlist() generate_join_implied_equalities() set_rel_pathlist() set_rel_pathlist() Calls 𝑛𝑛 times create_index_path()

42 © 2024 NTT CORPORATION Pros and cons of proposed
patch Pros  Drastically reduces time complexity of planning Cons  May have some negative impacts for smaller sizes due to additional operations  We will investigate this at end of this talk

44 © 2024 NTT CORPORATION Experiment #1 Conducted simple experiment
to test proposed patch  Three partitioned tables with 𝑛𝑛 partitions (t1, t2, and t3)  Example when 𝑛𝑛 was 2  Query joining these three tables  Partition-wise join was off in this experiment  Varied 𝑛𝑛 and measured planning time SELECT * FROM t1, t2, t3 WHERE t1.id = t2.id AND t1.id = t3.id CREATE TABLE t1(id INT PRIMARY KEY, name TEXT) PARTITION BY RANGE (id); CREATE TABLE t1_1 PARTITION OF t1 FOR VALUES FROM (0) TO (100); CREATE TABLE t1_2 PARTITION OF t1 FOR VALUES FROM (100) TO (200); -- Same for t2 and t3

45 © 2024 NTT CORPORATION 0 0.05 0.1 0.15 0.2
0.25 0.3 0.35 0 200 400 600 800 1000 Planning Time (seconds) # of partitions (n) PostgreSQL 17devel Patched Experiment #1 – Result Patch obtained significant speedup of 6.7 times 6.7 times speed up! better

46 © 2024 NTT CORPORATION Conducted same experiment with and
without partition-wise join  Number of partitions was set to 1024 Result  Proposed patch worked effectively both with and without partition-wise join 0 0.1 0.2 0.3 0.4 0.5 0.6 PostgreSQL 17devel Patched PostgreSQL 17devel Patched Partition-wise join is off Partition-wise join is on Planning Time (seconds) 1024 partitions case Experiment #2 6.7 times speedup 2.6 times speedup better

47 © 2024 NTT CORPORATION Experiment #3 Conducted experiment by
varying number of tables being joined  Schema was same as in experiments #1 and #2  Query  Number of partitions was set to 1024  Partition-wise join was off SELECT * FROM t1, t2, …, tm WHERE t1.id = t2.id … AND t1.id = tm.id Varied number of tables being joined

48 © 2024 NTT CORPORATION Proposed patch worked very effectively
when joining many tables  Speedup was up to 23.5 when joining 6 tables 2 3 4 5 6 PostgreSQL 17devel 0.06 0.31 1.14 2.90 6.18 Patched 0.02 0.05 0.08 0.15 0.26 0 1 2 3 4 5 6 7 Planning Time (seconds) # of tables being joined 1024 partitions case Experiment #3 – Result better 23.5x 19.4x 13.4x 6.7x 2.4x

49 © 2024 NTT CORPORATION Experiment #4 – Join Order
Benchmark Speedup of patch was demonstrated using Join Order Benchmark  Join Order Benchmark: famous benchmark for testing OLAP workloads  Experimental results for 1024 partitions case 0 2 4 6 8 10 12 14 16 18 20 Speedup of planning time Query better 1 10.7 times speedup on average Up to 17.5 times speedup

50 © 2024 NTT CORPORATION Various workloads benefit Following workloads
will benefit from proposed patch  Tables have many partitions  Partition pruning does not work effectively  Queries contain joins involving many tables  Queries are complicated (having sorts, index accesses, etc.)

52 © 2024 NTT CORPORATION Profiling Experiments showed significant effects
of proposed patch Let’s confirm bottleneck was actually reduced by profiling PostgreSQL  perf  Famous profiler available on Linux  Usable for finding bottlenecks  Profilers are powerful when developing PostgreSQL Here, we profile 1024 partitions case in experiment #1  In experiment, proposed patch obtained 6.7 times speedup

53 © 2024 NTT CORPORATION How to profile Steps to
profile backend process of PostgreSQL 1. Run psql command to establish connection 2. Confirm process ID of backend process 3. Run perf to profile $ ps x PID TTY STAT TIME COMMAND … 99 ? Ss 0:00 postgres: ubuntu postgres [local] idle $ psql psql (17devel) Type "help" for help. postgres=# $ sudo perf record -ag -p 99

54 © 2024 NTT CORPORATION How to profile 4. Result
will be shown on console Samples: 1K of event 'cpu-clock:pppH', Event count (approx.): 327250000 Children Self Command Shared Object Symbol … + 91.90% 0.00% postgres postgres [.] planner + 91.90% 0.00% postgres postgres [.] standard_planner + 91.52% 0.00% postgres postgres [.] subquery_planner + 91.52% 0.00% postgres postgres [.] grouping_planner + 91.52% 0.00% postgres postgres [.] query_planner … + 91.29% 0.15% postgres postgres [.] set_append_rel_pathlist + 90.60% 0.00% postgres postgres [.] set_plain_rel_pathlist + 85.94% 0.08% postgres postgres [.] create_index_paths … + 40.95% 0.23% postgres postgres [.] generate_join_implied_equalities + 25.44% 25.44% postgres postgres [.] bms_is_subset + 23.99% 0.08% postgres postgres [.] build_index_pathkeys + 23.91% 0.00% postgres postgres [.] make_pathkey_from_sortinfo + 22.77% 21.47% postgres postgres [.] create_join_clause + 20.32% 20.32% postgres postgres [.] bms_equal … Output illustrates functions that take much time Such functions may be bottlenecks

55 © 2024 NTT CORPORATION We can confirm that searches
for EquivalenceClasses were bottlenecks Flame graph – Unpatched Bottlenecks

56 © 2024 NTT CORPORATION Bottlenecks were drastically reduced by
patch  We can now see collapsed functions on last flame graph Flame graph – Patched

57 © 2024 NTT CORPORATION printf debug Next, we check
that original implementation has 𝑶𝑶 𝒏𝒏𝟐𝟐 time complexity by printf debug  Inserts logging codes in functions to see how they are called  In PostgreSQL hacking, elog and ereport are helpful for printf debug  Modifying PostgreSQL source code as follows tells us value of ‘count’  LOG, NOTICE, and DEBUG are available as log level  For more details, see PostgreSQL official document  https://www.postgresql.org/docs/16/error-message-reporting.html elog(LOG, "%d", count);

58 © 2024 NTT CORPORATION printf debug – generate_join_implied_equalities() Adds
following code to count bms_is_subset() comparisons static List * generate_join_implied_equalities_normal(…, EquivalenceClass *ec, Relids join_relids, …) { … foreach(lc1, ec->ec_members) { EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1); … elog(LOG, "[printfdebug] Checking an EquivalenceMember (ec_members has %d members)", list_length(ec->ec_members)); if (!bms_is_subset(cur_em->em_relids, join_relids)) continue; /* not computable yet, or wrong child */ … } … }

59 © 2024 NTT CORPORATION printf debug – Result (Unpatched)
Ran experiment #1 for 𝒏𝒏 = 𝟏𝟏𝟏𝟏𝟏𝟏𝟏𝟏 partitions case joining three tables We found following from this output  EquivalenceClass had 3075 = 3 𝑛𝑛 + 1 members  Parent and its children for each table  Comparison in linear search was done 18,929,700 times  This is approximately 18𝑛𝑛2 Original implementation actually has 𝑶𝑶 𝒏𝒏𝟐𝟐 time complexity $ grep "printfdebug" logfile | sed "s/.*LOG: //g" | sort | uniq -c 18929700 [printfdebug] Checking an EquivalenceMember (ec_members has 3075 members)

60 © 2024 NTT CORPORATION printf debug – Result (Patched)
Ran same with proposed patch applied Changes from unpatched result  Number of EquivalenceMembers: 3075 = 3 𝑛𝑛 + 1  3  Number of comparisons: 18,929,700 ≈ 18𝑛𝑛2  24,612 ≈ 24𝑛𝑛 We confirmed proposed patch greatly reduced time complexity $ grep "printfdebug" logfile | sed "s/.*LOG: //g" | sort | uniq -c 24612 [printfdebug] Checking an EquivalenceMember (ec_members has 3 members)

61 © 2024 NTT CORPORATION Experiments are key to addressing
problems When we suffer some performance problem, what should we do to solve it?  Clarifying bottleneck is first step  Profilers are helpful to do it  I profiled PostgreSQL many times to address this planning performance problem and found bms_is_subset() operation was heavy in query planner  Finding fundamental cause is then required  Bottleneck that profilers show is not always fundamental cause  We need to investigate why bottleneck is actually caused  In today’s talk, problem was not that bms_is_subset() was heavy but that linear search over so many EquivalenceMembers was time-consuming Trial and errors are particularly important to address performance issues  I did experiments more than 100 times to write first PoC patch

62 © 2024 NTT CORPORATION When proposing patches After finding
fundamental cause, we will then improve PostgreSQL source code to address performance problem and propose patches When proposing patches, following should be with patches  Queries or steps to reproduce the performance problem  Important to make sure that other developers can easily investigate problem on their own machines  Reproduction queries or steps should be as simple as possible  Detailed experimental results  Support effectiveness of patches  As results are more detailed, patches will be more attractive  When are patches effective? How large is performance improvement?

64 © 2024 NTT CORPORATION Performance issue Proposed patch has
slightly negative impacts on smaller sizes  Experiment shows regression is on microseconds order  18 microseconds for 2 partitions case  This is much smaller than variance of planning times  Avoiding regression is key to committing this patch  We need to conduct as many experiments as possible to examine effects of proposed patch better

65 © 2024 NTT CORPORATION Other issues Memory consumption and
assertion are also discussed  Memory consumption  Proposed patch is known to increase memory consumption  This seems to be due to Bitmapset-based indexes for eliminating non-relevant child EquivalenceMembers  Further investigation is needed to clarify cause  Assertion  It is very important to avoid adding bugs, and assertions are helpful  However, overly excessive assertions slow down regression test time  Speeding up debug builds is also necessary

67 © 2024 NTT CORPORATION Summary  Partitioning is essential
and important feature of PostgreSQL  PostgreSQL has improved partitioning performance since its introduction  Planning of highly partitioned tables gets slow in current PostgreSQL implementation  Handling equivalences of join operations causes this problem  Proposed patch improves architecture of equivalences and reduces time complexity from 𝑂𝑂 𝑛𝑛2 to 𝑂𝑂 𝑛𝑛  Experimental results show proposed patch obtains 10x or more speedup on Join Order Benchmark

68 © 2024 NTT CORPORATION References Discussion and commitfest entry
are on following pages  https://commitfest.postgresql.org/48/3701/  https://www.postgresql.org/message-id/flat/CAJ2pMkZNCgoUKSE+_5LthD+Kb XKvq6h2hQN8Esxpxd+cxmgomg@mail.gmail.com Any comments are welcome!

69 © 2024 NTT CORPORATION Acknowledgements I would like to
thank everyone who has contributed to this patch Especially,  David Rowley  He has worked with me extensively and deeply at the source code level since I first submitted the patch. He has made numerous contributions and given a lot of kind advice

Performance Improvements of Partitioning: Past ...

Performance Improvements of Partitioning: Past and Future - PGConf.dev 2024

More Decks by Yuya Watari

Other Decks in Technology

Featured

Transcript