Slide 1

Slide 1 text

Speeding Up Analysis Queries by Sharing Commonalities Yu ya Wa ta r i

Slide 2

Slide 2 text

Introduction

Slide 3

Slide 3 text

Keywords and Our Goal Keywords • Analysis query • Important in many situations such as Business Intelligence (BI) • Commonalities among queries • Queries have a lot of common operations like same join Our goal • Improve the query performance by sharing commonalities among queries • Do not execute the same operation more than once • Store the intermediate results and reuse them in other queries

Slide 4

Slide 4 text

Why are Analysis Queries Important? • Analyzing data leads to new insights and findings • The market size of DWH is increasing year by year • PostgreSQL is required to have the ability to be a backend for BI tools 0 10 20 30 40 2025 2018 Market size (USD) Year 30 Billion 13 Billion https://www.gminsights.com/ industry-analysis/data- warehousing-market

Slide 5

Slide 5 text

Commonalities among Queries • Users tend to issue similar queries • A series of similar queries lead to a lot of common operations in their query plans Simple example • Assume a user operates the following two queries • Query 1: 𝐴 ⋈ 𝐵 ⋈ 𝐶 • Query 2: 𝐴 ⋈ 𝐵 ⋈ 𝐷 • 𝐴 ⋈ 𝐵 appears in both of the queries and this join operation is a commonality ⋈ ⋈ 𝐶 𝐴 𝐵 Query 1: 𝐴 ⋈ 𝐵 ⋈ 𝐶 ⋈ ⋈ 𝐷 𝐴 𝐵 Query 2: 𝐴 ⋈ 𝐵 ⋈ 𝐷 Commonality

Slide 6

Slide 6 text

Sharing Commonalities Sharing commonalities among queries improves performance • Does not execute the same operation more than once • Stores the intermediate results • Reuses them in other queries Previous example • Query 1: 𝐴 ⋈ 𝐵 ⋈ 𝐶 • Query 2: 𝐴 ⋈ 𝐵 ⋈ 𝐷 • We do not need to execute 𝐴 ⋈ 𝐵 twice • Store the result of 𝐴 ⋈ 𝐵 and reuse it in the other query ⋈ ⋈ 𝐶 𝐴 𝐵 Query 1: 𝐴 ⋈ 𝐵 ⋈ 𝐶 ⋈ ⋈ 𝐷 𝐴 𝐵 Query 2: 𝐴 ⋈ 𝐵 ⋈ 𝐷 Commonality

Slide 7

Slide 7 text

Existing Methods PostgreSQL offers methods that share commonalities • E.g. common table expression (CTE), materialized views These methods require users to rewrite queries by hand We propose methods that automatically share commonalities SELECT * … Rewrite SELECT * … Rewrite SELECT * … Rewrite Rewriting many queries manually is unrealistic!

Slide 8

Slide 8 text

Keywords and Our Goal Keywords • Analysis query • Important in many situations such as Business Intelligence (BI) • Commonalities among queries • Queries have a lot of common operations like same join Our goal • Improve the query performance by sharing commonalities among queries • Do not execute the same operation more than once • Store the intermediate results and reuse them in other queries

Slide 9

Slide 9 text

Examples of Commonalities

Slide 10

Slide 10 text

Commonalities Exist in Many Cases Multiple queries • Users tend to issue several queries repeatedly when refining them • Solution: Caching result Single query • Commonalities sometimes appear in a single query • Solution: Shared execution DB Commonalities Query Query DB Commonalities Query

Slide 11

Slide 11 text

Use Case – Multiple Queries • Business Intelligence (BI) is one of the use cases • According to a BI survey, users tend to repeatedly issue similar queries in the refining process [1] • Many users do not submit a single complete query • The users need to refine the queries, so they issue several incomplete queries repeatedly • Example of the refining process • Adding missing WHERE clauses • Commonalities exist in a series of these queries DB Query Query Refine the query and issue it again [1] Vogelsgesang, Adrian, et al. "Get real: How benchmarks fail to represent the real world." Proceedings of the Workshop on Testing Database Systems. ACM, 2018.

Slide 12

Slide 12 text

Use Case – Single Query • Commonalities sometimes exist in a single query • TPC-DS query 88 select * from (select count(*) h8_30_to_9 from store_sales, household_demographics , time_dim, store where ss_sold_time_sk = time_dim.t_time_sk and ss_hdemo_sk = household_demographics.hd_demo_sk and ss_store_sk = s_store_sk and time_dim.t_hour = 8 and time_dim.t_minute >= 30 and ((household_demographics.hd_dep_count = 3 and household_demographics.hd_vehicle_count<=3+2) or (household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or (household_demographics.hd_dep_count = 1 and household_demographics.hd_vehicle_count<=1+2)) and store.s_store_name = 'ese') s1, (select count(*) h9_to_9_30 from store_sales, household_demographics , time_dim, store where ss_sold_time_sk = time_dim.t_time_sk and ss_hdemo_sk = household_demographics.hd_demo_sk and ss_store_sk = s_store_sk and time_dim.t_hour = 9 and time_dim.t_minute < 30 and ((household_demographics.hd_dep_count = 3 and household_demographics.hd_vehicle_count<=3+2) or (household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or (household_demographics.hd_dep_count = 1 and household_demographics.hd_vehicle_count<=1+2)) and store.s_store_name = 'ese') s2, (select count(*) h9_30_to_10 from store_sales, household_demographics , time_dim, store

Slide 13

Slide 13 text

Use Case – Single Query • Commonalities sometimes exist in a single query • TPC-DS query 88 select * from (select count(*) h8_30_to_9 from store_sales, household_demographics , time_dim, store where ss_sold_time_sk = time_dim.t_time_sk and ss_hdemo_sk = household_demographics.hd_demo_sk and ss_store_sk = s_store_sk and time_dim.t_hour = 8 and time_dim.t_minute >= 30 and ((household_demographics.hd_dep_count = 3 and household_demographics.hd_vehicle_count<=3+2) or (household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or (household_demographics.hd_dep_count = 1 and household_demographics.hd_vehicle_count<=1+2)) and store.s_store_name = 'ese') s1, (select count(*) h9_to_9_30 from store_sales, household_demographics , time_dim, store where ss_sold_time_sk = time_dim.t_time_sk and ss_hdemo_sk = household_demographics.hd_demo_sk and ss_store_sk = s_store_sk and time_dim.t_hour = 9 and time_dim.t_minute < 30 and ((household_demographics.hd_dep_count = 3 and household_demographics.hd_vehicle_count<=3+2) or (household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or (household_demographics.hd_dep_count = 1 and household_demographics.hd_vehicle_count<=1+2)) and store.s_store_name = 'ese') s2, (select count(*) h9_30_to_10 from store_sales, household_demographics , time_dim, store Similar join operation appears 8 times in a single query There is no need to execute this join operation 8 times This query lists the number of sold item in the specific time periods

Slide 14

Slide 14 text

Sharing Methods that We Propose We propose two sharing methods • One for multiple queries • One for a single query These two methods assume read-only workloads • Current methods cannot handle write queries • Handling write queries is one of the major future works of our technology

Slide 15

Slide 15 text

Caching Method for Multiple Queries

Slide 16

Slide 16 text

Commonalities Exist in Many Cases Multiple queries • Users tend to issue several queries repeatedly when refining them • Solution: Caching result Single query • Commonalities sometimes appear in a single query • Solution: Shared execution DB Commonalities Query Query DB Commonalities Query Caching intermediate results works for multiple queries

Slide 17

Slide 17 text

Caching Intermediate Results Caching the intermediate results enables us to share commonalities among queries Our caching method materializes intermediates results in memory and reuses them in subsequent queries DB Commonalities Memory Cache Reuse

Slide 18

Slide 18 text

The Flow of The Caching Method 1. Record history • Record the number of occurrences of query plan trees 2. Cache results • If a plan tree appears more than some threshold number of times, the caching method materializes the execution result of it into memory 3. Reuse cached results • If the arrived query (partially) matches previous ones and their results are cached, the caching method reuses them instead of executing the query again

Slide 19

Slide 19 text

Simple Example of The Caching Method Assume the following three queries successively arrive In this example, 𝐴 ⋈ 𝐵 is a commonality ⋈ ⋈ 𝐶 𝐴 𝐵 Query 1: 𝐴 ⋈ 𝐵 ⋈ 𝐶 ⋈ ⋈ 𝐷 𝐴 𝐵 Query 2: 𝐴 ⋈ 𝐵 ⋈ 𝐷 ⋈ ⋈ 𝐸 𝐴 𝐵 Query 3: 𝐴 ⋈ 𝐵 ⋈ 𝐸

Slide 20

Slide 20 text

Simple Example of The Caching Method The caching method records the number of occurrences of each node • Here, all numbers are 1 Node 𝐴 𝐵 𝐶 𝐴 ⋈ 𝐵 𝐴 ⋈ 𝐵 ⋈ 𝐶 # of occurrences 1 1 1 1 1 ⋈ ⋈ 𝐶 𝐴 𝐵 Query 1: 𝐴 ⋈ 𝐵 ⋈ 𝐶 Memory

Slide 21

Slide 21 text

Simple Example of The Caching Method Node 𝐴 𝐵 𝐶 𝐴 ⋈ 𝐵 𝐴 ⋈ 𝐵 ⋈ 𝐶 # of occurrences 2 2 1 2 1 ⋈ ⋈ 𝐷 𝐴 𝐵 Query 2: 𝐴 ⋈ 𝐵 ⋈ 𝐷 Memory Node 𝐷 𝐴 ⋈ 𝐵 ⋈ 𝐷 # of occurrences 1 1

Slide 22

Slide 22 text

Simple Example of The Caching Method 𝐴 ⋈ 𝐵 appeared twice, so we cache the result of it • Table 𝐴 and 𝐵 will not be cached because they are child nodes of 𝐴 ⋈ 𝐵 Node 𝐴 𝐵 𝐶 𝐴 ⋈ 𝐵 𝐴 ⋈ 𝐵 ⋈ 𝐶 # of occurrences 2 2 1 2 1 ⋈ ⋈ 𝐷 𝐴 𝐵 Query 2: 𝐴 ⋈ 𝐵 ⋈ 𝐷 Memory Node 𝐷 𝐴 ⋈ 𝐵 ⋈ 𝐷 # of occurrences 1 1 The result of 𝑨 ⋈ 𝑩 Cache

Slide 23

Slide 23 text

Simple Example of The Caching Method The result of 𝐴 ⋈ 𝐵 is already cached, so we reuse it instead of executing 𝐴 ⋈ 𝐵 again Node 𝐴 𝐵 𝐶 𝐴 ⋈ 𝐵 𝐴 ⋈ 𝐵 ⋈ 𝐶 # of occurrences 3 3 1 3 1 ⋈ ⋈ 𝐸 𝐴 𝐵 Query 3: 𝐴 ⋈ 𝐵 ⋈ 𝐸 Memory Node 𝐷 𝐴 ⋈ 𝐵 ⋈ 𝐷 𝐸 𝐴 ⋈ 𝐵 ⋈ 𝐸 # of occurrences 1 1 1 1 The result of 𝑨 ⋈ 𝑩 Reuse

Slide 24

Slide 24 text

Cache Replacement Algorithm We adopt Least Frequently Used (LFU) policy • The method keeps more frequently referenced caches in memory When trying to cache some results, the caching method releases data whose frequency is less

Slide 25

Slide 25 text

Our Method is Different from Others Our method can accelerate queries that partially match past ones, while result caching can work only when the same queries arrive Our method is transparent, so users do not need to explicitly manage the cache, while materialize views require users to maintain them

Slide 26

Slide 26 text

Demo We use the following queries as a demo ⋈ ⋈ 𝐶 𝐴 𝐵 Query 1: 𝐴 ⋈ 𝐵 ⋈ 𝐶 ⋈ ⋈ 𝐷 𝐴 𝐵 Query 2: 𝐴 ⋈ 𝐵 ⋈ 𝐷 ⋈ ⋈ 𝐸 𝐴 𝐵 Query 3: 𝐴 ⋈ 𝐵 ⋈ 𝐸

Slide 27

Slide 27 text

Demo 1. Create the following tables • CREATE TABLE A(x int, y int) • CREATE TABLE B(x int, y int) • CREATE TABLE C(x int, y int) • CREATE TABLE D(x int, y int) • CREATE TABLE E(x int, y int) 2. Execute the next queries successively • SELECT * FROM A JOIN B ON A.x = B.x JOIN C ON A.x = C.x LIMIT 1 • SELECT * FROM A JOIN B ON A.x = B.x JOIN D ON A.x = D.x LIMIT 1 • SELECT * FROM A JOIN B ON A.x = B.x JOIN E ON A.x = E.x LIMIT 1

Slide 28

Slide 28 text

Demo Video

Slide 29

Slide 29 text

Without the caching method With the caching method 𝐴 ⋈ 𝐵 was cached during the execution of the second query The execution time was reduced because the cached result was reused

Slide 30

Slide 30 text

Without the caching method With the caching method 𝐴 ⋈ 𝐵 was cached during the execution of the second query The execution time was reduced because the cached result was reused • Three queries are different from each other but the caching method worked • Since the caching method is transparent, there is nothing to do to use it

Slide 31

Slide 31 text

Experiment • We conducted an experiment to test the caching method • We used the next benchmark • Public BI benchmark • Environment • OS: CentOS 7.3 • PostgreSQL: 13 • CPU: Xeon E5-2640 (6 cores / 12 threads) • Memory: 16GB • HDD: 500GB

Slide 32

Slide 32 text

Experiment – Public BI benchmark • Public BI benchmark • Contains real data and queries from workbooks in Tableau Public • Represents queries that BI dashboard (Tableau) generates • We used HashTag benchmark in Public BI benchmark • HashTag benchmark analyzes tweets • We sequentially executed four queries in this benchmark and measured their execution time • Four queries are different from each other

Slide 33

Slide 33 text

Experiment – Public BI benchmark The query in this benchmark SELECT (NOT("t0"."_Tableau_join_flag" IS NULL)) AS "io:1 time:nk", (NOT("t1"."_Tableau_join_flag" IS NULL)) AS "io:2 times:nk", (NOT("t2"."_Tableau_join_flag" IS NULL)) AS "io:20+ times:nk", (NOT("t3"."_Tableau_join_flag" IS NULL)) AS "io:3 times:nk", (NOT("t4"."_Tableau_join_flag" IS NULL)) AS "io:4 times:nk", (NOT("t5"."_Tableau_join_flag" IS NULL)) AS "io:5 times:nk", COUNT(DISTINCT "HashTags_1"."twitter#user#screen_name") AS "usr:Calculation_6330207195516273:ok" FROM "HashTags_1" LEFT JOIN ( SELECT "HashTags_1"."twitter#user#screen_name" AS "twitter#user#screen_name", MIN(1) AS "_Tableau_join_flag" FROM "HashTags_1" Aggregation Join operation

Slide 34

Slide 34 text

Experiment – Public BI benchmark The caching method reduced total execution time by 11% 0 5 10 15 20 25 30 With caching Without caching Execution time (s) better +3% ±0% -28% -28% Totally -11% Query 1 Query 2 Query 3 Query 4

Slide 35

Slide 35 text

Experiment – Public BI benchmark The caching method reduced total execution time by 11% 0 5 10 15 20 25 30 With caching Without caching Execution time (s) better +3% ±0% -28% -28% Totally -11% Query 1 Query 2 Query 3 Query 4 Executed query 1, 2, 3, and 4 successively (Each query is distinct)

Slide 36

Slide 36 text

Experiment – Public BI benchmark The caching method reduced total execution time by 11% 0 5 10 15 20 25 30 With caching Without caching Execution time (s) better +3% ±0% -28% -28% Totally -11% Query 1 Query 2 Query 3 Query 4 Query 1 took almost the same time in both methods

Slide 37

Slide 37 text

Experiment – Public BI benchmark The caching method reduced total execution time by 11% 0 5 10 15 20 25 30 With caching Without caching Execution time (s) better +3% ±0% -28% -28% Totally -11% Query 1 Query 2 Query 3 Query 4 Since the same operation appeared twice, the caching method cached its result. Caching took some costs, so the execution time increased.

Slide 38

Slide 38 text

Experiment – Public BI benchmark The caching method reduced total execution time by 11% 0 5 10 15 20 25 30 With caching Without caching Execution time (s) better +3% ±0% -28% -28% Totally -11% Query 1 Query 2 Query 3 Query 4 In queries 3 and 4, the caching method could reuse the cached results, so we obtained a 28% improvement

Slide 39

Slide 39 text

Experiment – Public BI benchmark The caching method reduced total execution time by 11% 0 5 10 15 20 25 30 With caching Without caching Execution time (s) better +3% ±0% -28% -28% Totally -11% Query 1 Query 2 Query 3 Query 4 Totally, the execution time was reduced by 11%

Slide 40

Slide 40 text

Experiment – Public BI benchmark Similar join operation appeared in the four queries, so we could obtain performance improvements In this benchmark, a user repeatedly issue similar queries while changing aggregated columns or conditional clauses The caching method is effective for this kind of workload DB Similar aggregations with different conditions Query

Slide 41

Slide 41 text

Shared Execution Method for a Single Query

Slide 42

Slide 42 text

Commonalities Exist in Many Cases Multiple queries • Users tend to issue several queries repeatedly when refining them • Solution: Caching result Single query • Commonalities sometimes appear in a single query • Solution: Shared execution DB Commonalities Query Query DB Commonalities Query We propose a shared execution method to speed up a single query

Slide 43

Slide 43 text

Use Case – Single Query • Commonalities sometimes exist in a single query • TPC-DS query 88 select * from (select count(*) h8_30_to_9 from store_sales, household_demographics , time_dim, store where ss_sold_time_sk = time_dim.t_time_sk and ss_hdemo_sk = household_demographics.hd_demo_sk and ss_store_sk = s_store_sk and time_dim.t_hour = 8 and time_dim.t_minute >= 30 and ((household_demographics.hd_dep_count = 3 and household_demographics.hd_vehicle_count<=3+2) or (household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or (household_demographics.hd_dep_count = 1 and household_demographics.hd_vehicle_count<=1+2)) and store.s_store_name = 'ese') s1, (select count(*) h9_to_9_30 from store_sales, household_demographics , time_dim, store where ss_sold_time_sk = time_dim.t_time_sk and ss_hdemo_sk = household_demographics.hd_demo_sk and ss_store_sk = s_store_sk and time_dim.t_hour = 9 and time_dim.t_minute < 30 and ((household_demographics.hd_dep_count = 3 and household_demographics.hd_vehicle_count<=3+2) or (household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or (household_demographics.hd_dep_count = 1 and household_demographics.hd_vehicle_count<=1+2)) and store.s_store_name = 'ese') s2, (select count(*) h9_30_to_10 from store_sales, household_demographics , time_dim, store

Slide 44

Slide 44 text

Question • TPC-DS Query 88 is too complex, so we use a simple example in this presentation • Assume the following four tables exist in the database • Each table has integer columns x and y x y 10 20 20 30 𝐴 x y 10 20 20 30 𝐵 x y 10 20 20 30 𝐶 x y 10 20 20 30 𝐷

Slide 45

Slide 45 text

Question How do you optimize the following query? ⋈ ⋈ 𝐶 𝐴 𝐵 ⋈ ⋈ 𝐷 𝐴 𝐵 UNION ALL (y = 10) (y = 15) Corresponding plan tree SELECT * FROM A JOIN B ON A.x = b.x JOIN C ON A.x = C.x WHERE A.y = 10 UNION ALL SELECT * FROM A JOIN B ON A.x = b.x JOIN D ON A.x = D.x WHERE A.y = 15

Slide 46

Slide 46 text

Answer to the Question In this example, the join operation between A and B is a commonality There are a lot of ways to optimize this query, and one solution is Common Table Expression (CTE) • (Not-inlined) CTE materializes the execution result and reuses it in all occurrences Example of introduction of CTE WITH cte as ( SELECT * FROM A JOIN B ON A.x = B.x WHERE A.y = 10 OR A.y = 15 ) SELECT * FROM cte JOIN C ON cte.x = C.x WHERE cte.y = 10 UNION ALL SELECT * FROM cte JOIN D ON cte.x = D.x WHERE cte.y = 15

Slide 47

Slide 47 text

Another Question Question • Is this optimization good? Will it really reduce the execution time? Answer • It depends on the workload • Materializing takes some cost • If the cost is high, the performance will not increase • We have to determine whether the optimization is appropriate or not • We prevent bad optimizations by cost evaluation

Slide 48

Slide 48 text

The Flow of the Shared Execution Our method automatically performs the following steps 1. Find commonalities from the given query 2. Cost evaluation • To prevent bad optimization 3. Introduce CTEs which stand for the commonalities 4. Execute

Slide 49

Slide 49 text

The Flow of the Shared Execution Our method automatically performs the following steps 1. Find commonalities from the given query 2. Cost evaluation • To prevent bad optimization 3. Introduce CTEs which stand for the commonalities 4. Execute

Slide 50

Slide 50 text

Step 1 – Find Commonalities The shared execution method extracts commonalities from the query plan of the given query ⋈ ⋈ 𝐶 𝐴 𝐵 ⋈ ⋈ 𝐷 𝐴 𝐵 UNION ALL (y = 10) (y = 15) Commonality

Slide 51

Slide 51 text

Step 1 – Find Commonalities The shared execution method extracts commonalities from the query plan of the given query ⋈ ⋈ 𝐶 𝐴 𝐵 ⋈ ⋈ 𝐷 𝐴 𝐵 UNION ALL (y = 10) (y = 15) Commonality These structures are regarded as a commonality even if their conditional clauses differ

Slide 52

Slide 52 text

Conditional Clauses In the previous example, the conditional clauses of each occurrence differ The caching method merges them into ORed conditions • Original occurrences filter the CTE with their original conditions SELECT * FROM A JOIN B ON A.x = b.x JOIN C ON A.x = C.x WHERE A.y = 10 UNION ALL SELECT * FROM A JOIN B ON A.x = b.x JOIN D ON A.x = D.x WHERE A.y = 15 WITH cte as ( SELECT * FROM A JOIN B ON A.x = B.x WHERE A.y = 10 OR A.y = 15 ) SELECT * FROM cte JOIN C ON cte.x = C.x WHERE cte.y = 10 UNION ALL SELECT * FROM cte JOIN D ON cte.x = D.x WHERE cte.y = 15

Slide 53

Slide 53 text

The Flow of the Shared Execution Our method automatically performs the following steps 1. Find commonalities from the given query 2. Cost evaluation • To prevent bad optimization 3. Introduce CTEs which stand for the commonalities 4. Execute

Slide 54

Slide 54 text

Step 2 – Cost Evaluation Sharing commonalities does not always improve performance • Materializing the result is a little time-consuming task • If reusing does not reduce the execution cost drastically, the total performance might decrease We perform a cost evaluation to overcome this problem

Slide 55

Slide 55 text

Good optimization (We perform optimization) Bad optimization (We do not modify the query) Step 2 – Cost Evaluation If the cost with shared execution (𝐶𝑠ℎ𝑎𝑟𝑒) is less than that without shared execution (𝐶𝑛𝑜𝑛𝑠ℎ𝑎𝑟𝑒), it is reasonable to perform shared execution 𝐶𝑛𝑜𝑛𝑠ℎ𝑎𝑟𝑒 𝐶𝑠ℎ𝑎𝑟𝑒 𝐶𝑛𝑜𝑛𝑠ℎ𝑎𝑟𝑒 𝐶𝑠ℎ𝑎𝑟𝑒

Slide 56

Slide 56 text

Step 2 – Cost without the Shared Execution Cost without shared execution is simply the sum of all execution costs 𝐶𝑛𝑜𝑛𝑠ℎ𝑎𝑟𝑒 = ෍ 𝑖 𝑐𝑜𝑠𝑡 𝑒𝑖 (= 𝑐𝑜𝑠𝑡 𝑒1 + 𝑐𝑜𝑠𝑡 𝑒2 ) ⋈ ⋈ 𝐶 𝐴 𝐵 ⋈ ⋈ 𝐷 𝐴 𝐵 UNION ALL (y = 10) (y = 15) 𝑒1 𝑒2 We call occurrences of the commonality 𝒆𝟏, 𝒆𝟐

Slide 57

Slide 57 text

Step 2 – Cost with the Shared Execution 1. The cost of executing a commonality only once 2. The cost of storing the result into memory 3. The cost of reusing the result 𝐶𝑠ℎ𝑎𝑟𝑒 = 𝑐𝑜𝑠𝑡 𝑒+ + 𝑚𝑎𝑡𝑒𝑟𝑖𝑎𝑙𝑖𝑧𝑒 𝑒+ + 𝑁 × 𝑟𝑒𝑢𝑠𝑒 𝑒+ 1 2 3 Estimated by the planner Proportional to the number of result rows N is the number of occurrences of the commonalities

Slide 58

Slide 58 text

Step 2 – Cost Evaluation If 𝐶𝑠ℎ𝑎𝑟𝑒 < 𝐶𝑛𝑜𝑛𝑠ℎ𝑎𝑟𝑒 satisfies, we perform the shared execution 𝐶𝑠ℎ𝑎𝑟𝑒 = 𝑐𝑜𝑠𝑡 𝑒+ + 𝑚𝑎𝑡𝑒𝑟𝑖𝑎𝑙𝑖𝑧𝑒 𝑒+ + 𝑁 × 𝑟𝑒𝑢𝑠𝑒 𝑒+ 𝐶𝑛𝑜𝑛𝑠ℎ𝑎𝑟𝑒 = ෍ 𝑖 𝑐𝑜𝑠𝑡 𝑒𝑖

Slide 59

Slide 59 text

The Flow of the Shared Execution Our method automatically performs the following steps 1. Find commonalities from the given query 2. Cost evaluation • To prevent bad optimization 3. Introduce CTEs which stand for the commonalities 4. Execute

Slide 60

Slide 60 text

Step 3 – Introduce CTEs The caching method introduces CTEs while merging conditional clauses ⋈ 𝐴 𝐵 (y = 10 OR y = 15) CTE ⋈ ⋈ 𝐶 𝐴 𝐵 ⋈ ⋈ 𝐷 𝐴 𝐵 UNION ALL (y = 10) (y = 15) Introduce CTEs

Slide 61

Slide 61 text

The Flow of the Shared Execution Our method automatically performs the following steps 1. Find commonalities from the given query 2. Cost evaluation • To prevent bad optimization 3. Introduce CTEs which stand for the commonalities 4. Execute

Slide 62

Slide 62 text

Step 4 – Execute ORed condition ⋈ 𝐴 𝐵 (y = 10 OR y = 15) CTE Execute commonalities only once 1 Memory Intermediate results Materialize the results into memory 2 Reuse the materialized results 3 ⋈ ⋈ 𝐶 𝐴 𝐵 ⋈ ⋈ 𝐷 𝐴 𝐵 UNION ALL (y = 10) (y = 15)

Slide 63

Slide 63 text

The Difference between Two Methods Caching method (for multiple queries) • Commonalities are not known in advance • We need to predict future queries • Cost evaluation based on the predication is a future work Shared execution (for a single query) • Commonalities are known in advance • We can easily perform cost evaluation DB Commonalities Query 1 Query 2 DB Commonalities Query The commonality is not known when query 1 arrives We can detect all commonalities from the given query

Slide 64

Slide 64 text

Demo – Commonalities in a Single Query 1. Create the following four tables (the same ones in the example of the caching method) • CREATE TABLE A(x int, y int) • CREATE TABLE B(x int, y int) • CREATE TABLE C(x int, y int) • CREATE TABLE D(x int, y int) 2. Execute the following query ⋈ ⋈ 𝐶 𝐴 𝐵 ⋈ ⋈ 𝐷 𝐴 𝐵 UNION ALL (y = 10) (y = 15) SELECT * FROM A JOIN B ON A.x = b.x JOIN C ON A.x = C.x WHERE A.y = 10 UNION ALL SELECT * FROM A JOIN B ON A.x = b.x JOIN D ON A.x = D.x WHERE A.y = 15

Slide 65

Slide 65 text

Demo Video

Slide 66

Slide 66 text

Experiment • We conducted an experiment by using TPC-DS • Environment • OS: CentOS 7.3 • PostgreSQL: 13 • CPU: Xeon E5-2640 (6 cores / 12 threads) • Memory: 16GB • HDD: 500GB

Slide 67

Slide 67 text

Experiment – Single Query We executed TPC-DS query 88 • This query has same join operations select * from (select count(*) h8_30_to_9 from store_sales, household_demographics , time_dim, store where ss_sold_time_sk = time_dim.t_time_sk and ss_hdemo_sk = household_demographics.hd_demo_sk and ss_store_sk = s_store_sk and time_dim.t_hour = 8 and time_dim.t_minute >= 30 and ((household_demographics.hd_dep_count = 3 and household_demographics.hd_vehicle_count<=3+2) or (household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or (household_demographics.hd_dep_count = 1 and household_demographics.hd_vehicle_count<=1+2)) and store.s_store_name = 'ese') s1, (select count(*) h9_to_9_30 from store_sales, household_demographics , time_dim, store where ss_sold_time_sk = time_dim.t_time_sk and ss_hdemo_sk = household_demographics.hd_demo_sk and ss_store_sk = s_store_sk and time_dim.t_hour = 9 and time_dim.t_minute < 30 and ((household_demographics.hd_dep_count = 3 and household_demographics.hd_vehicle_count<=3+2) or (household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or (household_demographics.hd_dep_count = 1 and household_demographics.hd_vehicle_count<=1+2)) and store.s_store_name = 'ese') s2, (select count(*) h9_30_to_10 from store_sales, household_demographics , time_dim, store

Slide 68

Slide 68 text

Result – Experiment by using TPC-DS The execution time with shared execution is 7.8 times faster than that without shared execution Sharing commonalities in a single query greatly improved query performance 0 5 10 15 20 With shared execution Without shared execution Execution time (s) better

Slide 69

Slide 69 text

Future Works

Slide 70

Slide 70 text

Future works • An application of cost evaluation to the caching method • Caching method (for multiple queries) lacks cost evaluation • Prediction of future queries and cost evaluation based on it are major future works • Handling write queries • Current methods will collapse when tables change • We have to implement invalidation of the cache whose corresponding data is updated • Managing join order • Breaking the optimal join order is very effective to utilize the caching method • Cache replacement algorithms

Slide 71

Slide 71 text

Conclusion

Slide 72

Slide 72 text

Conclusion We implemented caching and shared execution methods on PostgreSQL to share commonalities Experiments by Public BI Benchmark and TPC-DS revealed that our methods obtained high performance improvements Sharing commonalities among queries is very effective at reducing the query execution time Handling write queries and managing join order are future works Thank you!

Slide 73

Slide 73 text

Appendix

Slide 74

Slide 74 text

Managing Join Order Commonalities depend on the join order of queries ⋈ ⋈ 𝐶 𝐴 𝐵 ⋈ ⋈ 𝐷 𝐴 𝐵 Our methods work for this plan If the queries are planned as the following join order, 𝐴 ⋈ 𝐵 becomes a commonality However, the commonality does not appear in the following query plan ⋈ ⋈ 𝐵 𝐴 𝐶 ⋈ ⋈ 𝐵 𝐴 𝐷 We cannot apply our methods to 𝑨 ⋈ 𝑩

Slide 75

Slide 75 text

Managing Join Order – Simple Experiment We conducted a simple experiment to test the effect of join order We issued one TPC-DS query repeatedly while varying its “date” column ⋈ ⋈ 𝐴 𝐶 𝐵 date =1999/1/1 Query 1’ ⋈ ⋈ 𝐴 𝐶 𝐵 date =2000/1/1 Query 1 DB Since table A and B do not change, 𝐴 ⋈ 𝐵 is a commonality. However, it does not appear in the optimal plan, so we cannot cache its result

Slide 76

Slide 76 text

Managing Join Order – Simple Experiment We modified the join order by using pg_hint_plan extension ⋈ ⋈ 𝐴 𝐶 𝐵 date ⋈ ⋈ 𝐶 𝐴 𝐵 date Optimal plan (We cannot cache 𝐀 ⋈B) Plan whose join order is manually changed (We can cache 𝐀 ⋈ 𝑩)

Slide 77

Slide 77 text

Result of the Experiment 0 5 10 15 20 Without caching With caching Without caching With caching Execution time (s) ⋈ ⋈ 𝐴 𝐶 𝐵 date ⋈ ⋈ 𝐶 𝐴 𝐵 date Optimal plan Modified plan

Slide 78

Slide 78 text

Result of the Experiment 0 5 10 15 20 Without caching With caching Without caching With caching Execution time (s) ⋈ ⋈ 𝐴 𝐶 𝐵 date ⋈ ⋈ 𝐶 𝐴 𝐵 date Optimal plan Modified plan The performance did not improve because we could not cache 𝑨 ⋈ 𝑩

Slide 79

Slide 79 text

Result of the Experiment 0 5 10 15 20 Without caching With caching Without caching With caching Execution time (s) ⋈ ⋈ 𝐴 𝐶 𝐵 date ⋈ ⋈ 𝐶 𝐴 𝐵 date Optimal plan Modified plan Execution time increased because the modified join order is not optimal

Slide 80

Slide 80 text

Result of the Experiment 0 5 10 15 20 Without caching With caching Without caching With caching Execution time (s) ⋈ ⋈ 𝐴 𝐶 𝐵 date ⋈ ⋈ 𝐶 𝐴 𝐵 date Optimal plan Modified plan We obtained a 47% improvement because we cached 𝑨 ⋈ 𝑩

Slide 81

Slide 81 text

Result of the Experiment 0 5 10 15 20 Without caching With caching Without caching With caching Execution time (s) ⋈ ⋈ 𝐴 𝐶 𝐵 date ⋈ ⋈ 𝐶 𝐴 𝐵 date Optimal plan Modified plan • It is effective to break the optimal join order in order to utilize our caching method • Managing join order automatically is future work