Slide 1

Slide 1 text

Performance Improvements of Partitioning: Past and Future Yuya Watari PGConf.dev 2024

Slide 2

Slide 2 text

2 © 2024 NTT CORPORATION About me Yuya Watari  From Tokyo, Japan Work  For NTT Open Source Software Center/NTT Software Innovation Center  Developing distributed databases Interests  Planning  Optimization  Distributed transaction  Photography

Slide 3

Slide 3 text

3 © 2024 NTT CORPORATION Agenda 1. Introduction – Partitioning and Its Performance Problem 2. A Deep Dive into the Planner – Why Does Planning Have 𝑶𝑶 𝒏𝒏𝟐𝟐 ? 3. Hacking the Planner – How to Make It 𝑶𝑶 𝒏𝒏 ? 4. Experiments 5. Tips 6. Open Issues 7. Conclusion

Slide 4

Slide 4 text

4 © 2024 NTT CORPORATION Introduction – Partitioning and Its Performance Problem

Slide 5

Slide 5 text

5 © 2024 NTT CORPORATION Partitioning is essential and important feature in big-data era  Important to realize high scalability and manageability  User demand for partitioning is increasing year by year  Use case example: management of log data  Splitting log data per date Partitioning Partitioned Table logs Table logs_20220101 Table logs_20220102 Table logs_20221231

Slide 6

Slide 6 text

6 © 2024 NTT CORPORATION History of partitioning PostgreSQL 9.6 and before  Table inheritance and CHECK constraint PostgreSQL 10  Declarative partitioning (including syntax support)  Partition pruning CREATE TABLE t1(id INT, name TEXT) PARTITION BY RANGE (id); CREATE TABLE t1_1 PARTITION OF t1 FOR VALUES FROM (0) TO (100); CREATE TABLE t1_2 PARTITION OF t1 FOR VALUES FROM (100) TO (200); SELECT * FROM t1 WHERE id < 100 Unrelated so eliminated from query plan Partitioned Table Table Table t1 t1_1 t1_2 0 <= id < 100 100 <= id < 200

Slide 7

Slide 7 text

7 © 2024 NTT CORPORATION History of partitioning PostgreSQL 11  Partition-wise join/aggregation  Run-time partition pruning  Allows us to prune partitions based on values that will not be determined until execution time  Expands scope of partition pruning and speeds up various workloads  HASH partitioning  Distributes accesses to child partitions Partitioned Table Table a1 Table a2 id = 1 id = 2 Partitioned Table Table b1 Table b2 id = 1 id = 2 ⋈ Table a1 Table b1 id = 1 id = 1 ⋈ Table a2 Table b2 id = 2 id = 2 ⋈ Append

Slide 8

Slide 8 text

8 © 2024 NTT CORPORATION History of partitioning PostgreSQL 11  Indexes for partitioned tables  Faster partition pruning  Partition pruning previously involved linear search over partition’s metadata  Faster algorithms, such as binary search or hashing function, have been utilized for partition pruning since PostgreSQL 11 PostgreSQL 12  Faster COPY for partitioned tables ⋮ What’s next?  Today’s topic – Planning for highly partitioned tables proposed for PostgreSQL 18

Slide 9

Slide 9 text

9 © 2024 NTT CORPORATION Highly partitioned tables Handling highly partitioned tables is challenging  In example of log data management, we need to handle about 1000 partitions for 3 years of data  Number of child partitions affects partitioning performance Partitioned Table logs Table logs_20220101 Table logs_20220102 Table logs_20220103 Table logs_20220104 Table logs_20241231

Slide 10

Slide 10 text

10 © 2024 NTT CORPORATION Problem with planning Planning for highly partitioned tables is slow in current implementation  Problem happens when queries have join operations on partitioned tables with many non-pruned partitions Partitioned Table Table Table Table Partitioned Table Table Table Table ⋈ Most of child partitions are relevant Planning is slow!

Slide 11

Slide 11 text

11 © 2024 NTT CORPORATION Inefficient handling of equivalences Causes 𝑶𝑶 𝒏𝒏𝟐𝟐 time complexity when planning join operations  𝑛𝑛 is number of child partitions Proposed patch reduces time complexity to 𝑶𝑶 𝒏𝒏  Improves architecture of equivalences  Under discussion on pgsql-hackers 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0 200 400 600 800 1000 Planning Time (seconds) # of partitions (n) PostgreSQL 17devel Patched Experiment #1: Planning time when joining three partitioned tables

Slide 12

Slide 12 text

12 © 2024 NTT CORPORATION A Deep Dive into the Planner – Why Does Planning Have 𝑶𝑶 𝒏𝒏𝟐𝟐 ?

Slide 13

Slide 13 text

13 © 2024 NTT CORPORATION Deep dive into planner We will deep dive into planner to see why it has 𝑶𝑶 𝒏𝒏𝟐𝟐 time complexity  Let’s see how next SQL is planned  Table t1 has two child partitions, namely t1_1 and t1_2  Same for t2 and t3  Query simply joins these three tables SELECT * FROM t1, t2, t3 WHERE t1.id = t2.id AND t1.id = t3.id Partitioned Table Table Table t1 t1_1 t1_2 Partitioned Table Table Table t2 t2_1 t2_2 Partitioned Table Table Table t3 t3_1 t3_2

Slide 14

Slide 14 text

14 © 2024 NTT CORPORATION Join orders and equivalences Planner searches for join orders  Query has three possible join orders  Here, we ignore swapping of inner and outer tables Question  How does planner know that t2.id is equal to t3.id even though clause does not appear in given query? SELECT * FROM t1, t2, t3 WHERE t1.id = t2.id AND t1.id = t3.id; t1 t2 t3 ⋈ ⋈ t1 t3 t2 ⋈ ⋈ t2 t3 t1 ⋈ ⋈ t1.id = t2.id t1.id = t3.id t1.id = t2.id t1.id = t2.id t1.id = t3.id t2.id = t3.id

Slide 15

Slide 15 text

15 © 2024 NTT CORPORATION EquivalenceClass and EquivalenceMember Mechanism for finding join clauses that do not explicitly appear in original query  If clause (t1.id = t2.id or t1.id = t3.id in this example) is found, we split it into EquivalenceMembers, and EquivalenceClass keeps them in list (ec_members)  EquivalenceMember stands for variable and has Relids related to member  List will have variables that are known to be equal to each other  We can find join clauses that do not explicitly appear in given query like t2.id = t3.id EquivalenceClass EM t1.id (Relids: {1}) EM t2.id (Relids: {2}) EM t3.id (Relids: {3}) t1.id = t2.id t1.id = t3.id ec_members EM: EquivalenceMember

Slide 16

Slide 16 text

16 © 2024 NTT CORPORATION Child EquivalenceMembers Planner makes child EquivalenceMembers to handle partitioned tables EquivalenceClass EM t1.id (Relids: {1}) EM t2.id (Relids: {2}) EM t3.id (Relids: {3}) EM t1_1.id (Relids: {4}) EM t2_1.id (Relids: {5}) EM t3_1.id (Relids: {6}) EM t1_2.id (Relids: {7}) EM t2_2.id (Relids: {8}) EM t3_2.id (Relids: {9}) Parent members Child members

Slide 17

Slide 17 text

17 © 2024 NTT CORPORATION Why does planning have 𝑶𝑶 𝒏𝒏𝟐𝟐 time complexity? Lookups of EquivalenceMembers when building paths are time-consuming  There are 𝑛𝑛 paths to be built (𝑛𝑛 is number of child partitions)  Building each path involves EquivalenceMember lookups with 𝑂𝑂(𝑛𝑛) time complexity  Total time complexity becomes 𝑂𝑂 𝑛𝑛2 Partitioned Table 𝑛𝑛 paths 𝑶𝑶 𝒏𝒏𝟐𝟐 time complexity! Path Path Path Path Path Table Table Table Table Table Involves EquivalenceMember lookups with 𝑶𝑶(𝒏𝒏) complexity

Slide 18

Slide 18 text

18 © 2024 NTT CORPORATION Simplified flow of planning Everything about planning starts with planner() function To build paths for partitioned tables, planner calls set_append_rel_pathlist()  We will look into source code of set_append_rel_pathlist() function set_append_rel_pathlist() planner() standard_planner() grouping_planner() query_planner() subquery_planner()

Slide 19

Slide 19 text

19 © 2024 NTT CORPORATION set_append_rel_pathlist() Builds paths for given partitioned table static void set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, Index rti, RangeTblEntry *rte) { … foreach(l, root->append_rel_list) { AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(l); … /* * Compute the child's access paths. */ set_rel_pathlist(root, childrel, childRTindex, childRTE); … } … } Arguments other than PlannerInfo stand for partitioned table for which we want to build paths

Slide 20

Slide 20 text

20 © 2024 NTT CORPORATION set_append_rel_pathlist() Builds paths for given partitioned table static void set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, Index rti, RangeTblEntry *rte) { … foreach(l, root->append_rel_list) { AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(l); … /* * Compute the child's access paths. */ set_rel_pathlist(root, childrel, childRTindex, childRTE); … } … } Contains 𝒏𝒏 AppendRelInfos  AppendRelInfo: Parent-child relationship AppendRelInfo Parent: t1 Child: t1_1 AppendRelInfo Parent: t2 Child: t2_1 AppendRelInfo Parent: t3 Child: t3_1 AppendRelInfo Parent: t1 Child: t1_2 AppendRelInfo Parent: t2 Child: t2_2 AppendRelInfo Parent: t3 Child: t3_2

Slide 21

Slide 21 text

21 © 2024 NTT CORPORATION set_append_rel_pathlist() Builds paths for given partitioned table static void set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, Index rti, RangeTblEntry *rte) { … foreach(l, root->append_rel_list) { AppendRelInfo *appinfo = (AppendRelInfo *) lfirst(l); … /* * Compute the child's access paths. */ set_rel_pathlist(root, childrel, childRTindex, childRTE); … } … } Calls this function to build paths for each partition  Function takes child relation as its arguments  There are 𝒏𝒏 child partitions, so this is called 𝒏𝒏 times

Slide 22

Slide 22 text

22 © 2024 NTT CORPORATION Function call tree so far can be illustrated as below Call tree set_append_rel_pathlist() set_rel_pathlist() set_rel_pathlist() set_rel_pathlist() Calls 𝑛𝑛 times

Slide 23

Slide 23 text

23 © 2024 NTT CORPORATION Function call tree so far can be illustrated as below Call tree We need to handle equivalences of join operations when building paths set_append_rel_pathlist() set_rel_pathlist() generate_join_implied_equalities() set_rel_pathlist() set_rel_pathlist() Calls 𝑛𝑛 times create_index_path()

Slide 24

Slide 24 text

24 © 2024 NTT CORPORATION generate_join_implied_equalities() Generates any join clauses that we can deduce from equivalence static List * generate_join_implied_equalities_normal(…, EquivalenceClass *ec, Relids join_relids, …) { … foreach(lc1, ec->ec_members) { EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1); … if (!bms_is_subset(cur_em->em_relids, join_relids)) continue; /* not computable yet, or wrong child */ … } … }

Slide 25

Slide 25 text

25 © 2024 NTT CORPORATION generate_join_implied_equalities() Generates any join clauses that we can deduce from equivalence static List * generate_join_implied_equalities_normal(…, EquivalenceClass *ec, Relids join_relids, …) { … foreach(lc1, ec->ec_members) { EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1); … if (!bms_is_subset(cur_em->em_relids, join_relids)) continue; /* not computable yet, or wrong child */ … } … } {1, 6} = {t1, t3_1} Contains relations involved in this join operation (In this example, we try to join t1 and t3_1)

Slide 26

Slide 26 text

26 © 2024 NTT CORPORATION generate_join_implied_equalities() Generates any join clauses that we can deduce from equivalence static List * generate_join_implied_equalities_normal(…, EquivalenceClass *ec, Relids join_relids, …) { … foreach(lc1, ec->ec_members) { EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1); … if (!bms_is_subset(cur_em->em_relids, join_relids)) continue; /* not computable yet, or wrong child */ … } … } {1, 6} = {t1, t3_1} Performs simple linear search over EquivalenceMembers

Slide 27

Slide 27 text

27 © 2024 NTT CORPORATION generate_join_implied_equalities() Generates any join clauses that we can deduce from equivalence static List * generate_join_implied_equalities_normal(…, EquivalenceClass *ec, Relids join_relids, …) { … foreach(lc1, ec->ec_members) { EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1); … if (!bms_is_subset(cur_em->em_relids, join_relids)) continue; /* not computable yet, or wrong child */ … } … } Checks if current EquivalenceMember is related to given join_relids {1, 6} = {t1, t3_1}

Slide 28

Slide 28 text

28 © 2024 NTT CORPORATION Problem with highly partitioned tables Search for ec_members is linear search with 𝑶𝑶 𝒏𝒏 time complexity  Slow for highly partitioned cases because EquivalenceClasses have many child members EquivalenceClass EM t1.id (Relids: {1}) EM t2.id (Relids: {2}) EM t3.id (Relids: {3}) EM t1_1.id (Relids: {4}) EM t2_1.id (Relids: {5}) EM t3_1.id (Relids: {6}) EM t1_2.id (Relids: {7}) EM t2_2.id (Relids: {8}) EM t3_2.id (Relids: {9}) Relids {1, 6} bms_is_subset()? Match!

Slide 29

Slide 29 text

29 © 2024 NTT CORPORATION Total time complexity is 𝑶𝑶 𝒏𝒏𝟐𝟐  Very slow! Call tree set_append_rel_pathlist() set_rel_pathlist() generate_join_implied_equalities() set_rel_pathlist() set_rel_pathlist() Calls 𝑛𝑛 times Equivalence Member Equivalence Member Equivalence Member Equivalence Member 𝑂𝑂 𝑛𝑛 members Linear search Equivalence Member create_index_path()

Slide 30

Slide 30 text

30 © 2024 NTT CORPORATION Hacking the Planner – How to Make It 𝑶𝑶(𝒏𝒏)?

Slide 31

Slide 31 text

31 © 2024 NTT CORPORATION Hacking planner Fundamental problem is that linear search over so many child members in EquivalenceClass is time-consuming We propose introducing child members when needed, without need to do time-consuming search  EquivalenceClasses no longer have child members in proposed patch  Since there are a few parent members, search for EquivalenceMembers is fast  Patch does not change plans  Patch is under discussion, so there may be (large) changes in future

Slide 32

Slide 32 text

32 © 2024 NTT CORPORATION Algorithm – How proposed patch works What we want to do  Find EquivalenceMembers whose Relids is subset of given one  Given Relids: How to do this in proposed patch 1. Get parent representation of given Relids  Parent representation: 2. If EquivalenceMember matches parent representation, its child members may also match. So, we introduce and iterate over them  In this example, EquivalenceMember for t3 matches parent representation, so we introduce its child members (t3_1 and t3_2) {1, 6} = {t1, t3_1} {1, 3} = {t1, t3}

Slide 33

Slide 33 text

33 © 2024 NTT CORPORATION static List * generate_join_implied_equalities_normal(…, EquivalenceClass *ec, Relids join_relids, …) { … foreach(lc1, ec->ec_members) { EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1); … if (!bms_is_subset(cur_em->em_relids, join_relids)) continue; /* not computable yet, or wrong child */ … } } generate_join_implied_equalities() Original implementation {1, 6} = {t1, t3_1}

Slide 34

Slide 34 text

34 © 2024 NTT CORPORATION With proposed patch (pseudocode) generate_join_implied_equalities() static List * generate_join_implied_equalities_normal(…, EquivalenceClass *ec, Relids join_relids, …) { Relids top_parent_join_relids = find_relids_top_parents(root, join_relids); … foreach(lc1, ec->ec_members) { EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1); … if (bms_is_subset(cur_em->em_relids, top_parent_join_relids)) ec->ec_members = list_concat(ec->ec_members, get_child_members(ec, cur_em, join_relids)); if (!bms_is_subset(cur_em->em_relids, join_relids)) continue; /* not computable yet, or wrong child */ … } } {1, 6} = {t1, t3_1}

Slide 35

Slide 35 text

35 © 2024 NTT CORPORATION With proposed patch (pseudocode) generate_join_implied_equalities() static List * generate_join_implied_equalities_normal(…, EquivalenceClass *ec, Relids join_relids, …) { Relids top_parent_join_relids = find_relids_top_parents(root, join_relids); … foreach(lc1, ec->ec_members) { EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1); … if (bms_is_subset(cur_em->em_relids, top_parent_join_relids)) ec->ec_members = list_concat(ec->ec_members, get_child_members(ec, cur_em, join_relids)); if (!bms_is_subset(cur_em->em_relids, join_relids)) continue; /* not computable yet, or wrong child */ … } } {1, 6} = {t1, t3_1} EM t1.id (Relids: {1}) EM t2.id (Relids: {2}) EM t3.id (Relids: {3}) ec_members only contains parent members ec_members

Slide 36

Slide 36 text

36 © 2024 NTT CORPORATION With proposed patch (pseudocode) generate_join_implied_equalities() static List * generate_join_implied_equalities_normal(…, EquivalenceClass *ec, Relids join_relids, …) { Relids top_parent_join_relids = find_relids_top_parents(root, join_relids); … foreach(lc1, ec->ec_members) { EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1); … if (bms_is_subset(cur_em->em_relids, top_parent_join_relids)) ec->ec_members = list_concat(ec->ec_members, get_child_members(ec, cur_em, join_relids)); if (!bms_is_subset(cur_em->em_relids, join_relids)) continue; /* not computable yet, or wrong child */ … } } {1, 6} = {t1, t3_1} EM t1.id (Relids: {1}) EM t2.id (Relids: {2}) EM t3.id (Relids: {3}) ec_members Get parent representation of given Relids {1, 3} = {t1, t3}

Slide 37

Slide 37 text

37 © 2024 NTT CORPORATION With proposed patch (pseudocode) generate_join_implied_equalities() static List * generate_join_implied_equalities_normal(…, EquivalenceClass *ec, Relids join_relids, …) { Relids top_parent_join_relids = find_relids_top_parents(root, join_relids); … foreach(lc1, ec->ec_members) { EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1); … if (bms_is_subset(cur_em->em_relids, top_parent_join_relids)) ec->ec_members = list_concat(ec->ec_members, get_child_members(ec, cur_em, join_relids)); if (!bms_is_subset(cur_em->em_relids, join_relids)) continue; /* not computable yet, or wrong child */ … } } {1, 6} = {t1, t3_1} EM t1.id (Relids: {1}) EM t2.id (Relids: {2}) EM t3.id (Relids: {3}) ec_members {1, 3} = {t1, t3}

Slide 38

Slide 38 text

38 © 2024 NTT CORPORATION With proposed patch (pseudocode) generate_join_implied_equalities() static List * generate_join_implied_equalities_normal(…, EquivalenceClass *ec, Relids join_relids, …) { Relids top_parent_join_relids = find_relids_top_parents(root, join_relids); … foreach(lc1, ec->ec_members) { EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1); … if (bms_is_subset(cur_em->em_relids, top_parent_join_relids)) ec->ec_members = list_concat(ec->ec_members, get_child_members(ec, cur_em, join_relids)); if (!bms_is_subset(cur_em->em_relids, join_relids)) continue; /* not computable yet, or wrong child */ … } } {1, 6} = {t1, t3_1} EM t1.id (Relids: {1}) EM t2.id (Relids: {2}) EM t3.id (Relids: {3}) ec_members {1, 3} = {t1, t3} If member matches parent representation, its child members may also match. We iterate over child members. (Note: actual code is more complicated so as not to break ec_members) {1, 3} = {t1, t3} EM t3_1.id (Relids: {6})

Slide 39

Slide 39 text

39 © 2024 NTT CORPORATION With proposed patch (pseudocode) generate_join_implied_equalities() static List * generate_join_implied_equalities_normal(…, EquivalenceClass *ec, Relids join_relids, …) { Relids top_parent_join_relids = find_relids_top_parents(root, join_relids); … foreach(lc1, ec->ec_members) { EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1); … if (bms_is_subset(cur_em->em_relids, top_parent_join_relids)) ec->ec_members = list_concat(ec->ec_members, get_child_members(ec, cur_em, join_relids)); if (!bms_is_subset(cur_em->em_relids, join_relids)) continue; /* not computable yet, or wrong child */ … } } {1, 6} = {t1, t3_1} EM t1.id (Relids: {1}) EM t2.id (Relids: {2}) EM t3.id (Relids: {3}) ec_members {1, 3} = {t1, t3} {1, 3} = {t1, t3} EM t3_1.id (Relids: {6}) In this example, EquivalenceMember for t3_2 is also child but non-relevant. To eliminate such members, proposed patch utilizes Bitmapset-based indexes

Slide 40

Slide 40 text

40 © 2024 NTT CORPORATION Bitmapset-based indexes Key technique to get child members efficiently  Parent EquivalenceMember keeps its child members’ Relids as Bitmapset  Intersecting this index with given Relids gives us EquivalenceMembers that we actually want  Bitmapset operations are fast, so we can introduce child members efficiently EM t3.id (Relids: {3}) (Child members’ Relids: {6, 9}) Relids: {1, 6} join_relids EM t3_1.id (Relids: {6}) Intersect

Slide 41

Slide 41 text

41 © 2024 NTT CORPORATION Total time complexity is reduced to almost 𝑶𝑶 𝒏𝒏  Very fast! Call tree – With proposed patch Equivalence Member Equivalence Member Only a few parent members Almost 𝑂𝑂(1) set_append_rel_pathlist() set_rel_pathlist() generate_join_implied_equalities() set_rel_pathlist() set_rel_pathlist() Calls 𝑛𝑛 times create_index_path()

Slide 42

Slide 42 text

42 © 2024 NTT CORPORATION Pros and cons of proposed patch Pros  Drastically reduces time complexity of planning Cons  May have some negative impacts for smaller sizes due to additional operations  We will investigate this at end of this talk

Slide 43

Slide 43 text

43 © 2024 NTT CORPORATION Experiments

Slide 44

Slide 44 text

44 © 2024 NTT CORPORATION Experiment #1 Conducted simple experiment to test proposed patch  Three partitioned tables with 𝑛𝑛 partitions (t1, t2, and t3)  Example when 𝑛𝑛 was 2  Query joining these three tables  Partition-wise join was off in this experiment  Varied 𝑛𝑛 and measured planning time SELECT * FROM t1, t2, t3 WHERE t1.id = t2.id AND t1.id = t3.id CREATE TABLE t1(id INT PRIMARY KEY, name TEXT) PARTITION BY RANGE (id); CREATE TABLE t1_1 PARTITION OF t1 FOR VALUES FROM (0) TO (100); CREATE TABLE t1_2 PARTITION OF t1 FOR VALUES FROM (100) TO (200); -- Same for t2 and t3

Slide 45

Slide 45 text

45 © 2024 NTT CORPORATION 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0 200 400 600 800 1000 Planning Time (seconds) # of partitions (n) PostgreSQL 17devel Patched Experiment #1 – Result Patch obtained significant speedup of 6.7 times 6.7 times speed up! better

Slide 46

Slide 46 text

46 © 2024 NTT CORPORATION Conducted same experiment with and without partition-wise join  Number of partitions was set to 1024 Result  Proposed patch worked effectively both with and without partition-wise join 0 0.1 0.2 0.3 0.4 0.5 0.6 PostgreSQL 17devel Patched PostgreSQL 17devel Patched Partition-wise join is off Partition-wise join is on Planning Time (seconds) 1024 partitions case Experiment #2 6.7 times speedup 2.6 times speedup better

Slide 47

Slide 47 text

47 © 2024 NTT CORPORATION Experiment #3 Conducted experiment by varying number of tables being joined  Schema was same as in experiments #1 and #2  Query  Number of partitions was set to 1024  Partition-wise join was off SELECT * FROM t1, t2, …, tm WHERE t1.id = t2.id … AND t1.id = tm.id Varied number of tables being joined

Slide 48

Slide 48 text

48 © 2024 NTT CORPORATION Proposed patch worked very effectively when joining many tables  Speedup was up to 23.5 when joining 6 tables 2 3 4 5 6 PostgreSQL 17devel 0.06 0.31 1.14 2.90 6.18 Patched 0.02 0.05 0.08 0.15 0.26 0 1 2 3 4 5 6 7 Planning Time (seconds) # of tables being joined 1024 partitions case Experiment #3 – Result better 23.5x 19.4x 13.4x 6.7x 2.4x

Slide 49

Slide 49 text

49 © 2024 NTT CORPORATION Experiment #4 – Join Order Benchmark Speedup of patch was demonstrated using Join Order Benchmark  Join Order Benchmark: famous benchmark for testing OLAP workloads  Experimental results for 1024 partitions case 0 2 4 6 8 10 12 14 16 18 20 Speedup of planning time Query better 1 10.7 times speedup on average Up to 17.5 times speedup

Slide 50

Slide 50 text

50 © 2024 NTT CORPORATION Various workloads benefit Following workloads will benefit from proposed patch  Tables have many partitions  Partition pruning does not work effectively  Queries contain joins involving many tables  Queries are complicated (having sorts, index accesses, etc.)

Slide 51

Slide 51 text

51 © 2024 NTT CORPORATION Tips

Slide 52

Slide 52 text

52 © 2024 NTT CORPORATION Profiling Experiments showed significant effects of proposed patch Let’s confirm bottleneck was actually reduced by profiling PostgreSQL  perf  Famous profiler available on Linux  Usable for finding bottlenecks  Profilers are powerful when developing PostgreSQL Here, we profile 1024 partitions case in experiment #1  In experiment, proposed patch obtained 6.7 times speedup

Slide 53

Slide 53 text

53 © 2024 NTT CORPORATION How to profile Steps to profile backend process of PostgreSQL 1. Run psql command to establish connection 2. Confirm process ID of backend process 3. Run perf to profile $ ps x PID TTY STAT TIME COMMAND … 99 ? Ss 0:00 postgres: ubuntu postgres [local] idle $ psql psql (17devel) Type "help" for help. postgres=# $ sudo perf record -ag -p 99

Slide 54

Slide 54 text

54 © 2024 NTT CORPORATION How to profile 4. Result will be shown on console Samples: 1K of event 'cpu-clock:pppH', Event count (approx.): 327250000 Children Self Command Shared Object Symbol … + 91.90% 0.00% postgres postgres [.] planner + 91.90% 0.00% postgres postgres [.] standard_planner + 91.52% 0.00% postgres postgres [.] subquery_planner + 91.52% 0.00% postgres postgres [.] grouping_planner + 91.52% 0.00% postgres postgres [.] query_planner … + 91.29% 0.15% postgres postgres [.] set_append_rel_pathlist + 90.60% 0.00% postgres postgres [.] set_plain_rel_pathlist + 85.94% 0.08% postgres postgres [.] create_index_paths … + 40.95% 0.23% postgres postgres [.] generate_join_implied_equalities + 25.44% 25.44% postgres postgres [.] bms_is_subset + 23.99% 0.08% postgres postgres [.] build_index_pathkeys + 23.91% 0.00% postgres postgres [.] make_pathkey_from_sortinfo + 22.77% 21.47% postgres postgres [.] create_join_clause + 20.32% 20.32% postgres postgres [.] bms_equal … Output illustrates functions that take much time Such functions may be bottlenecks

Slide 55

Slide 55 text

55 © 2024 NTT CORPORATION We can confirm that searches for EquivalenceClasses were bottlenecks Flame graph – Unpatched Bottlenecks

Slide 56

Slide 56 text

56 © 2024 NTT CORPORATION Bottlenecks were drastically reduced by patch  We can now see collapsed functions on last flame graph Flame graph – Patched

Slide 57

Slide 57 text

57 © 2024 NTT CORPORATION printf debug Next, we check that original implementation has 𝑶𝑶 𝒏𝒏𝟐𝟐 time complexity by printf debug  Inserts logging codes in functions to see how they are called  In PostgreSQL hacking, elog and ereport are helpful for printf debug  Modifying PostgreSQL source code as follows tells us value of ‘count’  LOG, NOTICE, and DEBUG are available as log level  For more details, see PostgreSQL official document  https://www.postgresql.org/docs/16/error-message-reporting.html elog(LOG, "%d", count);

Slide 58

Slide 58 text

58 © 2024 NTT CORPORATION printf debug – generate_join_implied_equalities() Adds following code to count bms_is_subset() comparisons static List * generate_join_implied_equalities_normal(…, EquivalenceClass *ec, Relids join_relids, …) { … foreach(lc1, ec->ec_members) { EquivalenceMember *cur_em = (EquivalenceMember *) lfirst(lc1); … elog(LOG, "[printfdebug] Checking an EquivalenceMember (ec_members has %d members)", list_length(ec->ec_members)); if (!bms_is_subset(cur_em->em_relids, join_relids)) continue; /* not computable yet, or wrong child */ … } … }

Slide 59

Slide 59 text

59 © 2024 NTT CORPORATION printf debug – Result (Unpatched) Ran experiment #1 for 𝒏𝒏 = 𝟏𝟏𝟏𝟏𝟏𝟏𝟏𝟏 partitions case joining three tables We found following from this output  EquivalenceClass had 3075 = 3 𝑛𝑛 + 1 members  Parent and its children for each table  Comparison in linear search was done 18,929,700 times  This is approximately 18𝑛𝑛2 Original implementation actually has 𝑶𝑶 𝒏𝒏𝟐𝟐 time complexity $ grep "printfdebug" logfile | sed "s/.*LOG: //g" | sort | uniq -c 18929700 [printfdebug] Checking an EquivalenceMember (ec_members has 3075 members)

Slide 60

Slide 60 text

60 © 2024 NTT CORPORATION printf debug – Result (Patched) Ran same with proposed patch applied Changes from unpatched result  Number of EquivalenceMembers: 3075 = 3 𝑛𝑛 + 1  3  Number of comparisons: 18,929,700 ≈ 18𝑛𝑛2  24,612 ≈ 24𝑛𝑛 We confirmed proposed patch greatly reduced time complexity $ grep "printfdebug" logfile | sed "s/.*LOG: //g" | sort | uniq -c 24612 [printfdebug] Checking an EquivalenceMember (ec_members has 3 members)

Slide 61

Slide 61 text

61 © 2024 NTT CORPORATION Experiments are key to addressing problems When we suffer some performance problem, what should we do to solve it?  Clarifying bottleneck is first step  Profilers are helpful to do it  I profiled PostgreSQL many times to address this planning performance problem and found bms_is_subset() operation was heavy in query planner  Finding fundamental cause is then required  Bottleneck that profilers show is not always fundamental cause  We need to investigate why bottleneck is actually caused  In today’s talk, problem was not that bms_is_subset() was heavy but that linear search over so many EquivalenceMembers was time-consuming Trial and errors are particularly important to address performance issues  I did experiments more than 100 times to write first PoC patch

Slide 62

Slide 62 text

62 © 2024 NTT CORPORATION When proposing patches After finding fundamental cause, we will then improve PostgreSQL source code to address performance problem and propose patches When proposing patches, following should be with patches  Queries or steps to reproduce the performance problem  Important to make sure that other developers can easily investigate problem on their own machines  Reproduction queries or steps should be as simple as possible  Detailed experimental results  Support effectiveness of patches  As results are more detailed, patches will be more attractive  When are patches effective? How large is performance improvement?

Slide 63

Slide 63 text

63 © 2024 NTT CORPORATION Open Issues

Slide 64

Slide 64 text

64 © 2024 NTT CORPORATION Performance issue Proposed patch has slightly negative impacts on smaller sizes  Experiment shows regression is on microseconds order  18 microseconds for 2 partitions case  This is much smaller than variance of planning times  Avoiding regression is key to committing this patch  We need to conduct as many experiments as possible to examine effects of proposed patch better

Slide 65

Slide 65 text

65 © 2024 NTT CORPORATION Other issues Memory consumption and assertion are also discussed  Memory consumption  Proposed patch is known to increase memory consumption  This seems to be due to Bitmapset-based indexes for eliminating non-relevant child EquivalenceMembers  Further investigation is needed to clarify cause  Assertion  It is very important to avoid adding bugs, and assertions are helpful  However, overly excessive assertions slow down regression test time  Speeding up debug builds is also necessary

Slide 66

Slide 66 text

66 © 2024 NTT CORPORATION Conclusion

Slide 67

Slide 67 text

67 © 2024 NTT CORPORATION Summary  Partitioning is essential and important feature of PostgreSQL  PostgreSQL has improved partitioning performance since its introduction  Planning of highly partitioned tables gets slow in current PostgreSQL implementation  Handling equivalences of join operations causes this problem  Proposed patch improves architecture of equivalences and reduces time complexity from 𝑂𝑂 𝑛𝑛2 to 𝑂𝑂 𝑛𝑛  Experimental results show proposed patch obtains 10x or more speedup on Join Order Benchmark

Slide 68

Slide 68 text

68 © 2024 NTT CORPORATION References Discussion and commitfest entry are on following pages  https://commitfest.postgresql.org/48/3701/  https://www.postgresql.org/message-id/flat/CAJ2pMkZNCgoUKSE+_5LthD+Kb XKvq6h2hQN8Esxpxd+cxmgomg@mail.gmail.com Any comments are welcome!

Slide 69

Slide 69 text

69 © 2024 NTT CORPORATION Acknowledgements I would like to thank everyone who has contributed to this patch Especially,  David Rowley  He has worked with me extensively and deeply at the source code level since I first submitted the patch. He has made numerous contributions and given a lot of kind advice

Slide 70

Slide 70 text

70 © 2024 NTT CORPORATION Thank you!