Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Speeding Up Analysis Queries by Sharing Commona...

Yuya Watari
November 20, 2020

Speeding Up Analysis Queries by Sharing Commonalities - PGConf.Asia 2020

Yuya Watari

November 20, 2020
Tweet

More Decks by Yuya Watari

Other Decks in Technology

Transcript

  1. Keywords and Our Goal Keywords • Analysis query • Important

    in many situations such as Business Intelligence (BI) • Commonalities among queries • Queries have a lot of common operations like same join Our goal • Improve the query performance by sharing commonalities among queries • Do not execute the same operation more than once • Store the intermediate results and reuse them in other queries
  2. Why are Analysis Queries Important? • Analyzing data leads to

    new insights and findings • The market size of DWH is increasing year by year • PostgreSQL is required to have the ability to be a backend for BI tools 0 10 20 30 40 2025 2018 Market size (USD) Year 30 Billion 13 Billion https://www.gminsights.com/ industry-analysis/data- warehousing-market
  3. Commonalities among Queries • Users tend to issue similar queries

    • A series of similar queries lead to a lot of common operations in their query plans Simple example • Assume a user operates the following two queries • Query 1: 𝐴 ⋈ 𝐵 ⋈ 𝐶 • Query 2: 𝐴 ⋈ 𝐵 ⋈ 𝐷 • 𝐴 ⋈ 𝐵 appears in both of the queries and this join operation is a commonality ⋈ ⋈ 𝐶 𝐴 𝐵 Query 1: 𝐴 ⋈ 𝐵 ⋈ 𝐶 ⋈ ⋈ 𝐷 𝐴 𝐵 Query 2: 𝐴 ⋈ 𝐵 ⋈ 𝐷 Commonality
  4. Sharing Commonalities Sharing commonalities among queries improves performance • Does

    not execute the same operation more than once • Stores the intermediate results • Reuses them in other queries Previous example • Query 1: 𝐴 ⋈ 𝐵 ⋈ 𝐶 • Query 2: 𝐴 ⋈ 𝐵 ⋈ 𝐷 • We do not need to execute 𝐴 ⋈ 𝐵 twice • Store the result of 𝐴 ⋈ 𝐵 and reuse it in the other query ⋈ ⋈ 𝐶 𝐴 𝐵 Query 1: 𝐴 ⋈ 𝐵 ⋈ 𝐶 ⋈ ⋈ 𝐷 𝐴 𝐵 Query 2: 𝐴 ⋈ 𝐵 ⋈ 𝐷 Commonality
  5. Existing Methods PostgreSQL offers methods that share commonalities • E.g.

    common table expression (CTE), materialized views These methods require users to rewrite queries by hand We propose methods that automatically share commonalities SELECT * … Rewrite SELECT * … Rewrite SELECT * … Rewrite Rewriting many queries manually is unrealistic!
  6. Keywords and Our Goal Keywords • Analysis query • Important

    in many situations such as Business Intelligence (BI) • Commonalities among queries • Queries have a lot of common operations like same join Our goal • Improve the query performance by sharing commonalities among queries • Do not execute the same operation more than once • Store the intermediate results and reuse them in other queries
  7. Commonalities Exist in Many Cases Multiple queries • Users tend

    to issue several queries repeatedly when refining them • Solution: Caching result Single query • Commonalities sometimes appear in a single query • Solution: Shared execution DB Commonalities Query Query DB Commonalities Query
  8. Use Case – Multiple Queries • Business Intelligence (BI) is

    one of the use cases • According to a BI survey, users tend to repeatedly issue similar queries in the refining process [1] • Many users do not submit a single complete query • The users need to refine the queries, so they issue several incomplete queries repeatedly • Example of the refining process • Adding missing WHERE clauses • Commonalities exist in a series of these queries DB Query Query Refine the query and issue it again [1] Vogelsgesang, Adrian, et al. "Get real: How benchmarks fail to represent the real world." Proceedings of the Workshop on Testing Database Systems. ACM, 2018.
  9. Use Case – Single Query • Commonalities sometimes exist in

    a single query • TPC-DS query 88 select * from (select count(*) h8_30_to_9 from store_sales, household_demographics , time_dim, store where ss_sold_time_sk = time_dim.t_time_sk and ss_hdemo_sk = household_demographics.hd_demo_sk and ss_store_sk = s_store_sk and time_dim.t_hour = 8 and time_dim.t_minute >= 30 and ((household_demographics.hd_dep_count = 3 and household_demographics.hd_vehicle_count<=3+2) or (household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or (household_demographics.hd_dep_count = 1 and household_demographics.hd_vehicle_count<=1+2)) and store.s_store_name = 'ese') s1, (select count(*) h9_to_9_30 from store_sales, household_demographics , time_dim, store where ss_sold_time_sk = time_dim.t_time_sk and ss_hdemo_sk = household_demographics.hd_demo_sk and ss_store_sk = s_store_sk and time_dim.t_hour = 9 and time_dim.t_minute < 30 and ((household_demographics.hd_dep_count = 3 and household_demographics.hd_vehicle_count<=3+2) or (household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or (household_demographics.hd_dep_count = 1 and household_demographics.hd_vehicle_count<=1+2)) and store.s_store_name = 'ese') s2, (select count(*) h9_30_to_10 from store_sales, household_demographics , time_dim, store
  10. Use Case – Single Query • Commonalities sometimes exist in

    a single query • TPC-DS query 88 select * from (select count(*) h8_30_to_9 from store_sales, household_demographics , time_dim, store where ss_sold_time_sk = time_dim.t_time_sk and ss_hdemo_sk = household_demographics.hd_demo_sk and ss_store_sk = s_store_sk and time_dim.t_hour = 8 and time_dim.t_minute >= 30 and ((household_demographics.hd_dep_count = 3 and household_demographics.hd_vehicle_count<=3+2) or (household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or (household_demographics.hd_dep_count = 1 and household_demographics.hd_vehicle_count<=1+2)) and store.s_store_name = 'ese') s1, (select count(*) h9_to_9_30 from store_sales, household_demographics , time_dim, store where ss_sold_time_sk = time_dim.t_time_sk and ss_hdemo_sk = household_demographics.hd_demo_sk and ss_store_sk = s_store_sk and time_dim.t_hour = 9 and time_dim.t_minute < 30 and ((household_demographics.hd_dep_count = 3 and household_demographics.hd_vehicle_count<=3+2) or (household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or (household_demographics.hd_dep_count = 1 and household_demographics.hd_vehicle_count<=1+2)) and store.s_store_name = 'ese') s2, (select count(*) h9_30_to_10 from store_sales, household_demographics , time_dim, store Similar join operation appears 8 times in a single query There is no need to execute this join operation 8 times This query lists the number of sold item in the specific time periods
  11. Sharing Methods that We Propose We propose two sharing methods

    • One for multiple queries • One for a single query These two methods assume read-only workloads • Current methods cannot handle write queries • Handling write queries is one of the major future works of our technology
  12. Commonalities Exist in Many Cases Multiple queries • Users tend

    to issue several queries repeatedly when refining them • Solution: Caching result Single query • Commonalities sometimes appear in a single query • Solution: Shared execution DB Commonalities Query Query DB Commonalities Query Caching intermediate results works for multiple queries
  13. Caching Intermediate Results Caching the intermediate results enables us to

    share commonalities among queries Our caching method materializes intermediates results in memory and reuses them in subsequent queries DB Commonalities Memory Cache Reuse
  14. The Flow of The Caching Method 1. Record history •

    Record the number of occurrences of query plan trees 2. Cache results • If a plan tree appears more than some threshold number of times, the caching method materializes the execution result of it into memory 3. Reuse cached results • If the arrived query (partially) matches previous ones and their results are cached, the caching method reuses them instead of executing the query again
  15. Simple Example of The Caching Method Assume the following three

    queries successively arrive In this example, 𝐴 ⋈ 𝐵 is a commonality ⋈ ⋈ 𝐶 𝐴 𝐵 Query 1: 𝐴 ⋈ 𝐵 ⋈ 𝐶 ⋈ ⋈ 𝐷 𝐴 𝐵 Query 2: 𝐴 ⋈ 𝐵 ⋈ 𝐷 ⋈ ⋈ 𝐸 𝐴 𝐵 Query 3: 𝐴 ⋈ 𝐵 ⋈ 𝐸
  16. Simple Example of The Caching Method The caching method records

    the number of occurrences of each node • Here, all numbers are 1 Node 𝐴 𝐵 𝐶 𝐴 ⋈ 𝐵 𝐴 ⋈ 𝐵 ⋈ 𝐶 # of occurrences 1 1 1 1 1 ⋈ ⋈ 𝐶 𝐴 𝐵 Query 1: 𝐴 ⋈ 𝐵 ⋈ 𝐶 Memory
  17. Simple Example of The Caching Method Node 𝐴 𝐵 𝐶

    𝐴 ⋈ 𝐵 𝐴 ⋈ 𝐵 ⋈ 𝐶 # of occurrences 2 2 1 2 1 ⋈ ⋈ 𝐷 𝐴 𝐵 Query 2: 𝐴 ⋈ 𝐵 ⋈ 𝐷 Memory Node 𝐷 𝐴 ⋈ 𝐵 ⋈ 𝐷 # of occurrences 1 1
  18. Simple Example of The Caching Method 𝐴 ⋈ 𝐵 appeared

    twice, so we cache the result of it • Table 𝐴 and 𝐵 will not be cached because they are child nodes of 𝐴 ⋈ 𝐵 Node 𝐴 𝐵 𝐶 𝐴 ⋈ 𝐵 𝐴 ⋈ 𝐵 ⋈ 𝐶 # of occurrences 2 2 1 2 1 ⋈ ⋈ 𝐷 𝐴 𝐵 Query 2: 𝐴 ⋈ 𝐵 ⋈ 𝐷 Memory Node 𝐷 𝐴 ⋈ 𝐵 ⋈ 𝐷 # of occurrences 1 1 The result of 𝑨 ⋈ 𝑩 Cache
  19. Simple Example of The Caching Method The result of 𝐴

    ⋈ 𝐵 is already cached, so we reuse it instead of executing 𝐴 ⋈ 𝐵 again Node 𝐴 𝐵 𝐶 𝐴 ⋈ 𝐵 𝐴 ⋈ 𝐵 ⋈ 𝐶 # of occurrences 3 3 1 3 1 ⋈ ⋈ 𝐸 𝐴 𝐵 Query 3: 𝐴 ⋈ 𝐵 ⋈ 𝐸 Memory Node 𝐷 𝐴 ⋈ 𝐵 ⋈ 𝐷 𝐸 𝐴 ⋈ 𝐵 ⋈ 𝐸 # of occurrences 1 1 1 1 The result of 𝑨 ⋈ 𝑩 Reuse
  20. Cache Replacement Algorithm We adopt Least Frequently Used (LFU) policy

    • The method keeps more frequently referenced caches in memory When trying to cache some results, the caching method releases data whose frequency is less
  21. Our Method is Different from Others Our method can accelerate

    queries that partially match past ones, while result caching can work only when the same queries arrive Our method is transparent, so users do not need to explicitly manage the cache, while materialize views require users to maintain them
  22. Demo We use the following queries as a demo ⋈

    ⋈ 𝐶 𝐴 𝐵 Query 1: 𝐴 ⋈ 𝐵 ⋈ 𝐶 ⋈ ⋈ 𝐷 𝐴 𝐵 Query 2: 𝐴 ⋈ 𝐵 ⋈ 𝐷 ⋈ ⋈ 𝐸 𝐴 𝐵 Query 3: 𝐴 ⋈ 𝐵 ⋈ 𝐸
  23. Demo 1. Create the following tables • CREATE TABLE A(x

    int, y int) • CREATE TABLE B(x int, y int) • CREATE TABLE C(x int, y int) • CREATE TABLE D(x int, y int) • CREATE TABLE E(x int, y int) 2. Execute the next queries successively • SELECT * FROM A JOIN B ON A.x = B.x JOIN C ON A.x = C.x LIMIT 1 • SELECT * FROM A JOIN B ON A.x = B.x JOIN D ON A.x = D.x LIMIT 1 • SELECT * FROM A JOIN B ON A.x = B.x JOIN E ON A.x = E.x LIMIT 1
  24. Without the caching method With the caching method 𝐴 ⋈

    𝐵 was cached during the execution of the second query The execution time was reduced because the cached result was reused
  25. Without the caching method With the caching method 𝐴 ⋈

    𝐵 was cached during the execution of the second query The execution time was reduced because the cached result was reused • Three queries are different from each other but the caching method worked • Since the caching method is transparent, there is nothing to do to use it
  26. Experiment • We conducted an experiment to test the caching

    method • We used the next benchmark • Public BI benchmark • Environment • OS: CentOS 7.3 • PostgreSQL: 13 • CPU: Xeon E5-2640 (6 cores / 12 threads) • Memory: 16GB • HDD: 500GB
  27. Experiment – Public BI benchmark • Public BI benchmark •

    Contains real data and queries from workbooks in Tableau Public • Represents queries that BI dashboard (Tableau) generates • We used HashTag benchmark in Public BI benchmark • HashTag benchmark analyzes tweets • We sequentially executed four queries in this benchmark and measured their execution time • Four queries are different from each other
  28. Experiment – Public BI benchmark The query in this benchmark

    SELECT (NOT("t0"."_Tableau_join_flag" IS NULL)) AS "io:1 time:nk", (NOT("t1"."_Tableau_join_flag" IS NULL)) AS "io:2 times:nk", (NOT("t2"."_Tableau_join_flag" IS NULL)) AS "io:20+ times:nk", (NOT("t3"."_Tableau_join_flag" IS NULL)) AS "io:3 times:nk", (NOT("t4"."_Tableau_join_flag" IS NULL)) AS "io:4 times:nk", (NOT("t5"."_Tableau_join_flag" IS NULL)) AS "io:5 times:nk", COUNT(DISTINCT "HashTags_1"."twitter#user#screen_name") AS "usr:Calculation_6330207195516273:ok" FROM "HashTags_1" LEFT JOIN ( SELECT "HashTags_1"."twitter#user#screen_name" AS "twitter#user#screen_name", MIN(1) AS "_Tableau_join_flag" FROM "HashTags_1" Aggregation Join operation
  29. Experiment – Public BI benchmark The caching method reduced total

    execution time by 11% 0 5 10 15 20 25 30 With caching Without caching Execution time (s) better +3% ±0% -28% -28% Totally -11% Query 1 Query 2 Query 3 Query 4
  30. Experiment – Public BI benchmark The caching method reduced total

    execution time by 11% 0 5 10 15 20 25 30 With caching Without caching Execution time (s) better +3% ±0% -28% -28% Totally -11% Query 1 Query 2 Query 3 Query 4 Executed query 1, 2, 3, and 4 successively (Each query is distinct)
  31. Experiment – Public BI benchmark The caching method reduced total

    execution time by 11% 0 5 10 15 20 25 30 With caching Without caching Execution time (s) better +3% ±0% -28% -28% Totally -11% Query 1 Query 2 Query 3 Query 4 Query 1 took almost the same time in both methods
  32. Experiment – Public BI benchmark The caching method reduced total

    execution time by 11% 0 5 10 15 20 25 30 With caching Without caching Execution time (s) better +3% ±0% -28% -28% Totally -11% Query 1 Query 2 Query 3 Query 4 Since the same operation appeared twice, the caching method cached its result. Caching took some costs, so the execution time increased.
  33. Experiment – Public BI benchmark The caching method reduced total

    execution time by 11% 0 5 10 15 20 25 30 With caching Without caching Execution time (s) better +3% ±0% -28% -28% Totally -11% Query 1 Query 2 Query 3 Query 4 In queries 3 and 4, the caching method could reuse the cached results, so we obtained a 28% improvement
  34. Experiment – Public BI benchmark The caching method reduced total

    execution time by 11% 0 5 10 15 20 25 30 With caching Without caching Execution time (s) better +3% ±0% -28% -28% Totally -11% Query 1 Query 2 Query 3 Query 4 Totally, the execution time was reduced by 11%
  35. Experiment – Public BI benchmark Similar join operation appeared in

    the four queries, so we could obtain performance improvements In this benchmark, a user repeatedly issue similar queries while changing aggregated columns or conditional clauses The caching method is effective for this kind of workload DB Similar aggregations with different conditions Query
  36. Commonalities Exist in Many Cases Multiple queries • Users tend

    to issue several queries repeatedly when refining them • Solution: Caching result Single query • Commonalities sometimes appear in a single query • Solution: Shared execution DB Commonalities Query Query DB Commonalities Query We propose a shared execution method to speed up a single query
  37. Use Case – Single Query • Commonalities sometimes exist in

    a single query • TPC-DS query 88 select * from (select count(*) h8_30_to_9 from store_sales, household_demographics , time_dim, store where ss_sold_time_sk = time_dim.t_time_sk and ss_hdemo_sk = household_demographics.hd_demo_sk and ss_store_sk = s_store_sk and time_dim.t_hour = 8 and time_dim.t_minute >= 30 and ((household_demographics.hd_dep_count = 3 and household_demographics.hd_vehicle_count<=3+2) or (household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or (household_demographics.hd_dep_count = 1 and household_demographics.hd_vehicle_count<=1+2)) and store.s_store_name = 'ese') s1, (select count(*) h9_to_9_30 from store_sales, household_demographics , time_dim, store where ss_sold_time_sk = time_dim.t_time_sk and ss_hdemo_sk = household_demographics.hd_demo_sk and ss_store_sk = s_store_sk and time_dim.t_hour = 9 and time_dim.t_minute < 30 and ((household_demographics.hd_dep_count = 3 and household_demographics.hd_vehicle_count<=3+2) or (household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or (household_demographics.hd_dep_count = 1 and household_demographics.hd_vehicle_count<=1+2)) and store.s_store_name = 'ese') s2, (select count(*) h9_30_to_10 from store_sales, household_demographics , time_dim, store
  38. Question • TPC-DS Query 88 is too complex, so we

    use a simple example in this presentation • Assume the following four tables exist in the database • Each table has integer columns x and y x y 10 20 20 30 𝐴 x y 10 20 20 30 𝐵 x y 10 20 20 30 𝐶 x y 10 20 20 30 𝐷
  39. Question How do you optimize the following query? ⋈ ⋈

    𝐶 𝐴 𝐵 ⋈ ⋈ 𝐷 𝐴 𝐵 UNION ALL (y = 10) (y = 15) Corresponding plan tree SELECT * FROM A JOIN B ON A.x = b.x JOIN C ON A.x = C.x WHERE A.y = 10 UNION ALL SELECT * FROM A JOIN B ON A.x = b.x JOIN D ON A.x = D.x WHERE A.y = 15
  40. Answer to the Question In this example, the join operation

    between A and B is a commonality There are a lot of ways to optimize this query, and one solution is Common Table Expression (CTE) • (Not-inlined) CTE materializes the execution result and reuses it in all occurrences Example of introduction of CTE WITH cte as ( SELECT * FROM A JOIN B ON A.x = B.x WHERE A.y = 10 OR A.y = 15 ) SELECT * FROM cte JOIN C ON cte.x = C.x WHERE cte.y = 10 UNION ALL SELECT * FROM cte JOIN D ON cte.x = D.x WHERE cte.y = 15
  41. Another Question Question • Is this optimization good? Will it

    really reduce the execution time? Answer • It depends on the workload • Materializing takes some cost • If the cost is high, the performance will not increase • We have to determine whether the optimization is appropriate or not • We prevent bad optimizations by cost evaluation
  42. The Flow of the Shared Execution Our method automatically performs

    the following steps 1. Find commonalities from the given query 2. Cost evaluation • To prevent bad optimization 3. Introduce CTEs which stand for the commonalities 4. Execute
  43. The Flow of the Shared Execution Our method automatically performs

    the following steps 1. Find commonalities from the given query 2. Cost evaluation • To prevent bad optimization 3. Introduce CTEs which stand for the commonalities 4. Execute
  44. Step 1 – Find Commonalities The shared execution method extracts

    commonalities from the query plan of the given query ⋈ ⋈ 𝐶 𝐴 𝐵 ⋈ ⋈ 𝐷 𝐴 𝐵 UNION ALL (y = 10) (y = 15) Commonality
  45. Step 1 – Find Commonalities The shared execution method extracts

    commonalities from the query plan of the given query ⋈ ⋈ 𝐶 𝐴 𝐵 ⋈ ⋈ 𝐷 𝐴 𝐵 UNION ALL (y = 10) (y = 15) Commonality These structures are regarded as a commonality even if their conditional clauses differ
  46. Conditional Clauses In the previous example, the conditional clauses of

    each occurrence differ The caching method merges them into ORed conditions • Original occurrences filter the CTE with their original conditions SELECT * FROM A JOIN B ON A.x = b.x JOIN C ON A.x = C.x WHERE A.y = 10 UNION ALL SELECT * FROM A JOIN B ON A.x = b.x JOIN D ON A.x = D.x WHERE A.y = 15 WITH cte as ( SELECT * FROM A JOIN B ON A.x = B.x WHERE A.y = 10 OR A.y = 15 ) SELECT * FROM cte JOIN C ON cte.x = C.x WHERE cte.y = 10 UNION ALL SELECT * FROM cte JOIN D ON cte.x = D.x WHERE cte.y = 15
  47. The Flow of the Shared Execution Our method automatically performs

    the following steps 1. Find commonalities from the given query 2. Cost evaluation • To prevent bad optimization 3. Introduce CTEs which stand for the commonalities 4. Execute
  48. Step 2 – Cost Evaluation Sharing commonalities does not always

    improve performance • Materializing the result is a little time-consuming task • If reusing does not reduce the execution cost drastically, the total performance might decrease We perform a cost evaluation to overcome this problem
  49. Good optimization (We perform optimization) Bad optimization (We do not

    modify the query) Step 2 – Cost Evaluation If the cost with shared execution (𝐶𝑠ℎ𝑎𝑟𝑒) is less than that without shared execution (𝐶𝑛𝑜𝑛𝑠ℎ𝑎𝑟𝑒), it is reasonable to perform shared execution 𝐶𝑛𝑜𝑛𝑠ℎ𝑎𝑟𝑒 𝐶𝑠ℎ𝑎𝑟𝑒 𝐶𝑛𝑜𝑛𝑠ℎ𝑎𝑟𝑒 𝐶𝑠ℎ𝑎𝑟𝑒
  50. Step 2 – Cost without the Shared Execution Cost without

    shared execution is simply the sum of all execution costs 𝐶𝑛𝑜𝑛𝑠ℎ𝑎𝑟𝑒 = ෍ 𝑖 𝑐𝑜𝑠𝑡 𝑒𝑖 (= 𝑐𝑜𝑠𝑡 𝑒1 + 𝑐𝑜𝑠𝑡 𝑒2 ) ⋈ ⋈ 𝐶 𝐴 𝐵 ⋈ ⋈ 𝐷 𝐴 𝐵 UNION ALL (y = 10) (y = 15) 𝑒1 𝑒2 We call occurrences of the commonality 𝒆𝟏, 𝒆𝟐
  51. Step 2 – Cost with the Shared Execution 1. The

    cost of executing a commonality only once 2. The cost of storing the result into memory 3. The cost of reusing the result 𝐶𝑠ℎ𝑎𝑟𝑒 = 𝑐𝑜𝑠𝑡 𝑒+ + 𝑚𝑎𝑡𝑒𝑟𝑖𝑎𝑙𝑖𝑧𝑒 𝑒+ + 𝑁 × 𝑟𝑒𝑢𝑠𝑒 𝑒+ 1 2 3 Estimated by the planner Proportional to the number of result rows N is the number of occurrences of the commonalities
  52. Step 2 – Cost Evaluation If 𝐶𝑠ℎ𝑎𝑟𝑒 < 𝐶𝑛𝑜𝑛𝑠ℎ𝑎𝑟𝑒 satisfies,

    we perform the shared execution 𝐶𝑠ℎ𝑎𝑟𝑒 = 𝑐𝑜𝑠𝑡 𝑒+ + 𝑚𝑎𝑡𝑒𝑟𝑖𝑎𝑙𝑖𝑧𝑒 𝑒+ + 𝑁 × 𝑟𝑒𝑢𝑠𝑒 𝑒+ 𝐶𝑛𝑜𝑛𝑠ℎ𝑎𝑟𝑒 = ෍ 𝑖 𝑐𝑜𝑠𝑡 𝑒𝑖
  53. The Flow of the Shared Execution Our method automatically performs

    the following steps 1. Find commonalities from the given query 2. Cost evaluation • To prevent bad optimization 3. Introduce CTEs which stand for the commonalities 4. Execute
  54. Step 3 – Introduce CTEs The caching method introduces CTEs

    while merging conditional clauses ⋈ 𝐴 𝐵 (y = 10 OR y = 15) CTE ⋈ ⋈ 𝐶 𝐴 𝐵 ⋈ ⋈ 𝐷 𝐴 𝐵 UNION ALL (y = 10) (y = 15) Introduce CTEs
  55. The Flow of the Shared Execution Our method automatically performs

    the following steps 1. Find commonalities from the given query 2. Cost evaluation • To prevent bad optimization 3. Introduce CTEs which stand for the commonalities 4. Execute
  56. Step 4 – Execute ORed condition ⋈ 𝐴 𝐵 (y

    = 10 OR y = 15) CTE Execute commonalities only once 1 Memory Intermediate results Materialize the results into memory 2 Reuse the materialized results 3 ⋈ ⋈ 𝐶 𝐴 𝐵 ⋈ ⋈ 𝐷 𝐴 𝐵 UNION ALL (y = 10) (y = 15)
  57. The Difference between Two Methods Caching method (for multiple queries)

    • Commonalities are not known in advance • We need to predict future queries • Cost evaluation based on the predication is a future work Shared execution (for a single query) • Commonalities are known in advance • We can easily perform cost evaluation DB Commonalities Query 1 Query 2 DB Commonalities Query The commonality is not known when query 1 arrives We can detect all commonalities from the given query
  58. Demo – Commonalities in a Single Query 1. Create the

    following four tables (the same ones in the example of the caching method) • CREATE TABLE A(x int, y int) • CREATE TABLE B(x int, y int) • CREATE TABLE C(x int, y int) • CREATE TABLE D(x int, y int) 2. Execute the following query ⋈ ⋈ 𝐶 𝐴 𝐵 ⋈ ⋈ 𝐷 𝐴 𝐵 UNION ALL (y = 10) (y = 15) SELECT * FROM A JOIN B ON A.x = b.x JOIN C ON A.x = C.x WHERE A.y = 10 UNION ALL SELECT * FROM A JOIN B ON A.x = b.x JOIN D ON A.x = D.x WHERE A.y = 15
  59. Experiment • We conducted an experiment by using TPC-DS •

    Environment • OS: CentOS 7.3 • PostgreSQL: 13 • CPU: Xeon E5-2640 (6 cores / 12 threads) • Memory: 16GB • HDD: 500GB
  60. Experiment – Single Query We executed TPC-DS query 88 •

    This query has same join operations select * from (select count(*) h8_30_to_9 from store_sales, household_demographics , time_dim, store where ss_sold_time_sk = time_dim.t_time_sk and ss_hdemo_sk = household_demographics.hd_demo_sk and ss_store_sk = s_store_sk and time_dim.t_hour = 8 and time_dim.t_minute >= 30 and ((household_demographics.hd_dep_count = 3 and household_demographics.hd_vehicle_count<=3+2) or (household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or (household_demographics.hd_dep_count = 1 and household_demographics.hd_vehicle_count<=1+2)) and store.s_store_name = 'ese') s1, (select count(*) h9_to_9_30 from store_sales, household_demographics , time_dim, store where ss_sold_time_sk = time_dim.t_time_sk and ss_hdemo_sk = household_demographics.hd_demo_sk and ss_store_sk = s_store_sk and time_dim.t_hour = 9 and time_dim.t_minute < 30 and ((household_demographics.hd_dep_count = 3 and household_demographics.hd_vehicle_count<=3+2) or (household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or (household_demographics.hd_dep_count = 1 and household_demographics.hd_vehicle_count<=1+2)) and store.s_store_name = 'ese') s2, (select count(*) h9_30_to_10 from store_sales, household_demographics , time_dim, store
  61. Result – Experiment by using TPC-DS The execution time with

    shared execution is 7.8 times faster than that without shared execution Sharing commonalities in a single query greatly improved query performance 0 5 10 15 20 With shared execution Without shared execution Execution time (s) better
  62. Future works • An application of cost evaluation to the

    caching method • Caching method (for multiple queries) lacks cost evaluation • Prediction of future queries and cost evaluation based on it are major future works • Handling write queries • Current methods will collapse when tables change • We have to implement invalidation of the cache whose corresponding data is updated • Managing join order • Breaking the optimal join order is very effective to utilize the caching method • Cache replacement algorithms
  63. Conclusion We implemented caching and shared execution methods on PostgreSQL

    to share commonalities Experiments by Public BI Benchmark and TPC-DS revealed that our methods obtained high performance improvements Sharing commonalities among queries is very effective at reducing the query execution time Handling write queries and managing join order are future works Thank you!
  64. Managing Join Order Commonalities depend on the join order of

    queries ⋈ ⋈ 𝐶 𝐴 𝐵 ⋈ ⋈ 𝐷 𝐴 𝐵 Our methods work for this plan If the queries are planned as the following join order, 𝐴 ⋈ 𝐵 becomes a commonality However, the commonality does not appear in the following query plan ⋈ ⋈ 𝐵 𝐴 𝐶 ⋈ ⋈ 𝐵 𝐴 𝐷 We cannot apply our methods to 𝑨 ⋈ 𝑩
  65. Managing Join Order – Simple Experiment We conducted a simple

    experiment to test the effect of join order We issued one TPC-DS query repeatedly while varying its “date” column ⋈ ⋈ 𝐴 𝐶 𝐵 date =1999/1/1 Query 1’ ⋈ ⋈ 𝐴 𝐶 𝐵 date =2000/1/1 Query 1 DB Since table A and B do not change, 𝐴 ⋈ 𝐵 is a commonality. However, it does not appear in the optimal plan, so we cannot cache its result
  66. Managing Join Order – Simple Experiment We modified the join

    order by using pg_hint_plan extension ⋈ ⋈ 𝐴 𝐶 𝐵 date ⋈ ⋈ 𝐶 𝐴 𝐵 date Optimal plan (We cannot cache 𝐀 ⋈B) Plan whose join order is manually changed (We can cache 𝐀 ⋈ 𝑩)
  67. Result of the Experiment 0 5 10 15 20 Without

    caching With caching Without caching With caching Execution time (s) ⋈ ⋈ 𝐴 𝐶 𝐵 date ⋈ ⋈ 𝐶 𝐴 𝐵 date Optimal plan Modified plan
  68. Result of the Experiment 0 5 10 15 20 Without

    caching With caching Without caching With caching Execution time (s) ⋈ ⋈ 𝐴 𝐶 𝐵 date ⋈ ⋈ 𝐶 𝐴 𝐵 date Optimal plan Modified plan The performance did not improve because we could not cache 𝑨 ⋈ 𝑩
  69. Result of the Experiment 0 5 10 15 20 Without

    caching With caching Without caching With caching Execution time (s) ⋈ ⋈ 𝐴 𝐶 𝐵 date ⋈ ⋈ 𝐶 𝐴 𝐵 date Optimal plan Modified plan Execution time increased because the modified join order is not optimal
  70. Result of the Experiment 0 5 10 15 20 Without

    caching With caching Without caching With caching Execution time (s) ⋈ ⋈ 𝐴 𝐶 𝐵 date ⋈ ⋈ 𝐶 𝐴 𝐵 date Optimal plan Modified plan We obtained a 47% improvement because we cached 𝑨 ⋈ 𝑩
  71. Result of the Experiment 0 5 10 15 20 Without

    caching With caching Without caching With caching Execution time (s) ⋈ ⋈ 𝐴 𝐶 𝐵 date ⋈ ⋈ 𝐶 𝐴 𝐵 date Optimal plan Modified plan • It is effective to break the optimal join order in order to utilize our caching method • Managing join order automatically is future work