Slide 1

Slide 1 text

DEX Query Execution in a Delta- based Storage System Amit Chavan Amol Deshpande Department of Computer Science University of Maryland

Slide 2

Slide 2 text

Immutability is gaining traction on several fronts… ▷ co-ordination in distributed systems ▷ design of micro-services & infrastructure ▷ immutable files, datasets, etc. Immutability & Dataset Versioning Immutability Changes Everything, Pat Helland, ACM Queue

Slide 3

Slide 3 text

Immutability is gaining traction on several fronts… ▷ co-ordination in distributed systems ▷ design of micro-services & infrastructure ▷ immutable files, datasets, etc. Dataset Versioning can help… ▷ Auditability & forensics ▷ Reproducibility in data science ▷ Version management in collaborative data analysis settings Immutability & Dataset Versioning Immutability Changes Everything, Pat Helland, ACM Queue

Slide 4

Slide 4 text

Dataset Version Control Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later. https://git-scm.com/ A Dataset Version Control System (DVCS) is designed to store large files, which traditional VCS (e.g., git, SVN) are not good at.

Slide 5

Slide 5 text

Dataset Version Control Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later. https://git-scm.com/ A Dataset Version Control System (DVCS) is designed to store large files, which traditional VCS (e.g., git, SVN) are not good at. Various solutions for different types of data and environment ▷ Git – Large File Storage [https://git-lfs.github.com/] ▷ Unstructured files and use of delta encoding [VLDB’15] ▷ Relational data [Decibel, VLDB’16; OrpheusDB, VLDB’17] ▷ Tensors (e.g., in Deep Learning) [ModelHub, ICDE’17] ▷ …

Slide 6

Slide 6 text

Delta encoding Version Graph Logical immutable copy of set of files V 2 V 3 V 4 V 5 V 6 V 7 V 1 1 2 1 3 4 5 6 5 1 7 4 7 “merge”

Slide 7

Slide 7 text

Delta encoding Version Graph Logical immutable copy of set of files V 2 V 3 V 4 V 5 V 6 V 7 V 1 1 2 1 3 4 5 6 5 1 7 4 7 “merge” 0 3 1100 1 4 5 7 6 2 20 1000 30 50 30 100 delta size materialized files Storage Graph Physical organization of files and deltas Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff, VLDB’15 $ git repack https://git-scm.com/docs/git-repack

Slide 8

Slide 8 text

Delta encoding Version Graph Logical immutable copy of set of files V 2 V 3 V 4 V 5 V 6 V 7 V 1 1 2 1 3 4 5 6 5 1 7 4 7 “merge” 0 3 1100 1 4 5 7 6 2 20 1000 30 50 80 30 100 delta size materialized files Storage Graph Physical organization of files and deltas Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff, VLDB’15 $ git repack https://git-scm.com/docs/git-repack

Slide 9

Slide 9 text

Opportunities & Challenges A DVCS contains vast trove of information ▷ how datasets evolved over time? ▷ what analyses were performed over them? ▷ how do the datasets differ?

Slide 10

Slide 10 text

Opportunities & Challenges A DVCS contains vast trove of information ▷ how datasets evolved over time? ▷ what analyses were performed over them? ▷ how do the datasets differ? Primary query interface in existing delta-based storage engines is “checkout” Q: Common records in result-1.csv & result-2.csv? Checkout[result-1.csv] Checkout[result-2.csv] Query processing

Slide 11

Slide 11 text

DEX Architecture Git Storage Version Graph Storage Query Processor Storage Graph Builder Git Translator Version Graph DEX CLI Backend Data Store Delta Storage Δ Metadata re: derivations, provenance, etc. git config, code, scripts, etc.

Slide 12

Slide 12 text

Data Model datafile set of records ex: CSV/TSV Delta Storage Δ

Slide 13

Slide 13 text

Data Model datafile set of records ex: CSV/TSV a b c y z a b c y z Δ− Δ+ deletions datafile insertions datafile Delta Storage Δ A 1 A 2

Slide 14

Slide 14 text

Data Model datafile set of records ex: CSV/TSV a b c y z a b c y z Δ− Δ+ deletions datafile insertions datafile Delta Storage Δ This Paper Transformation Rules + Cost-based Optimization Framework for effective query execution algorithms Queries 1. Checkout 2. Intersection 3. Union 4. t-Threshold A 1 A 2

Slide 15

Slide 15 text

Delta Properties & Operations Consistency Δ− ∩ Δ+ = Patch A2 = A1 ⊕ Δ ➢2 = 1 − Δ− ∪ Δ+ = 1 ∪ Δ+ − Δ− Δ A 1 A 2

Slide 16

Slide 16 text

Delta Properties & Operations Consistency Δ− ∩ Δ+ = Patch A2 = A1 ⊕ Δ ➢2 = 1 − Δ− ∪ Δ+ = 1 ∪ Δ+ − Δ− Contraction Δ = Δ1 ⊕ Δ2 ➢Δ− = Δ1 − − Δ2 + ∪ Δ2 − ➢Δ+ = Δ1 + − Δ2 − ∪ Δ2 + Δ A 1 A 3 Δ1 A 1 A 2 A 3 Δ2 Δ A 1 A 2

Slide 17

Slide 17 text

Delta Properties & Operations Consistency Δ− ∩ Δ+ = Patch A2 = A1 ⊕ Δ ➢2 = 1 − Δ− ∪ Δ+ = 1 ∪ Δ+ − Δ− Contraction Δ = Δ1 ⊕ Δ2 ➢Δ− = Δ1 − − Δ2 + ∪ Δ2 − ➢Δ+ = Δ1 + − Δ2 − ∪ Δ2 + Associativity Δ1 ⊕ Δ2 ⊕ Δ3 = Δ1 ⊕ (Δ2 ⊕ Δ3 ) Δ A 1 A 3 Δ1 A 1 A 2 A 3 Δ2 Δ A 1 A 2

Slide 18

Slide 18 text

Single datafile Checkout 1100 1000 150 A 0 A 1 A 2 A 3 A 5 A 4 A 6 A 7 A 8 A 9 A 11 A 10 A 12 20 30 10 10 80 50 30 50 50 50 100 50 Storage Graph Checkout A 12

Slide 19

Slide 19 text

Single datafile Checkout 1100 1000 150 A 0 A 1 A 2 A 3 A 5 A 4 A 6 A 7 A 8 A 9 A 11 A 10 A 12 20 30 10 10 80 50 30 50 50 50 100 50 Storage Graph 1100 A 3 A 7 A 11 A 12 50 50 50 A 0 Checkout A 12 Access Tree Shortest path between A 0 and A 12

Slide 20

Slide 20 text

Single datafile Checkout Δ1 Δ2 Δ3 Δk M A 1 A 2 A 3 A k ⊕ is associative Expression: ⊕ Δ1 ⊕ Δ2 ⊕ Δ3 ⊕ ⋯ ⊕ Δ Key Question: In which order to “apply” (i.e., combine/patch) the deltas?

Slide 21

Slide 21 text

Single datafile Checkout Δ1 Δ2 Δ3 Δk M A 1 A 2 A 3 A k ⊕ is associative Expression: ⊕ Δ1 ⊕ Δ2 ⊕ Δ3 ⊕ ⋯ ⊕ Δ Key Question: In which order to “apply” (i.e., combine/patch) the deltas? Algorithms ▷ Left-to-Right (LR) ▷ Greedy ▷ Path Contraction evaluate⊕ having the least cost; iterate find the best parenthesizing of ⊕ using dynamic programming evaluate⊕ from left to right (in the expression)

Slide 22

Slide 22 text

Single datafile Checkout Δ1 Δ2 Δ3 Δk M A 1 A 2 A 3 A k ⊕ is associative Expression: ⊕ Δ1 ⊕ Δ2 ⊕ Δ3 ⊕ ⋯ ⊕ Δ Key Question: In which order to “apply” (i.e., combine/patch) the deltas? Algorithms ▷ Left-to-Right (LR) ▷ Greedy ▷ Path Contraction evaluate⊕ having the least cost; iterate find the best parenthesizing of ⊕ using dynamic programming evaluate⊕ from left to right (in the expression) See paper for details about cost model and cardinality estimates

Slide 23

Slide 23 text

Single datafile Checkout 0 20 40 60 80 100 120 140 160 25 50 75 100 150 200 300 Time (Seconds) Number of deltas LR Greedy PC Dataset properties: #records/version = 3M, delta size = 5% 7.0-8.8X improvement wrt. LR up to 14% improvement wrt. Greedy

Slide 24

Slide 24 text

Multiple datafile Checkout 1100 1000 150 A 0 A 1 A 2 A 3 A 5 A 4 A 6 A 7 A 8 A 9 A 11 A 10 A 12 20 30 10 10 80 50 30 50 50 50 100 50 Storage Graph Access Tree Checkout A 6 , A 8 , A 9 , A 12 1000 150 A 1 A 2 A 5 A 4 A 6 A 8 A 9 A 10 A 12 20 30 10 10 50 30 100 50 A 3 Steiner tree connecting A0, A6, A8, A9, A12 A 0

Slide 25

Slide 25 text

Multiple datafile Checkout A 1 Δ1 A 2 A 5 A 4 A 6 A 8 A 9 A 10 A 12 A 3 Δ2 Δ3 Δ4 Δ5 Δ7 Δ8 Δ9 Δ6 Expressions: 1 ⊕ Δ1 ⊕ Δ2 ⊕ Δ3 ; 1 ⊕ Δ4 ⊕ Δ5 ⊕ Δ6 ; 1 ⊕ Δ4 ⊕ Δ7 ; 1 ⊕ Δ4 ⊕ Δ7 ⊕ Δ8 ⊕ Δ9 . Key Question: In which order to “apply” the deltas & share intermediate results?

Slide 26

Slide 26 text

Multiple datafile Checkout A 1 Δ1 A 2 A 5 A 4 A 6 A 8 A 9 A 10 A 12 A 3 Δ2 Δ3 Δ4 Δ5 Δ7 Δ8 Δ9 Δ6 Expressions: 1 ⊕ Δ1 ⊕ Δ2 ⊕ Δ3 ; 1 ⊕ Δ4 ⊕ Δ5 ⊕ Δ6 ; 1 ⊕ Δ4 ⊕ Δ7 ; 1 ⊕ Δ4 ⊕ Δ7 ⊕ Δ8 ⊕ Δ9 . Key Question: In which order to “apply” the deltas & share intermediate results? Algorithms ▷ Naïve ▷ Path Contraction (PC) ▷ Greedy ▷ Tree Contraction (TC) share the “largest” sub-expression, use PC to find best order find best sharing and evaluation order using dynamic programming

Slide 27

Slide 27 text

Multiple datafile Checkout Dataset properties: #records/version = 1M, delta size = 5% 0 10 20 30 40 50 60 25 50 75 100 150 200 300 Time (Seconds) Number of deltas Naïve PC Greedy TC 5-6.8X improvement wrt. Naïve up to 10% improvement wrt. Greedy

Slide 28

Slide 28 text

Intersection Key Question: How to push down the query to the deltas? A 1 Δ1 A 2 A 5 A 4 A 6 A 8 A 9 A 10 A 12 A 3 Δ2 Δ3 Δ4 Δ5 Δ7 Δ8 Δ9 Δ6 Query: 6 ∩ 8 ∩ 9 ∩ 12

Slide 29

Slide 29 text

Intersection Line Δ1 A 2 A 1 1 ∩ A2 = A1 − Δ1 −

Slide 30

Slide 30 text

Intersection Line Δ1 A 2 A 1 1 ∩ A2 = A1 − Δ1 − 1 ∩ A2 ∩ ⋯ ∩ +1 = A1 ⊕ Δ , where Δ − = ∪=1 Δ −, Δ + = Δ1 A 2 A 1 A k+1 Δk

Slide 31

Slide 31 text

Intersection Line Δ1 A 2 A 1 1 ∩ A2 = A1 − Δ1 − 1 ∩ A2 ∩ ⋯ ∩ +1 = A1 ⊕ Δ , where Δ − = ∪=1 Δ −, Δ + = Δ1 A 2 A 1 A k+1 Δk Star M Δ1 Δ2 Δk A 1 A 2 A k 1 ∩ A2 ∩ ⋯ ∩ = M ⊕ Δ , where Δ − = ∪=1 Δ −, Δ + = ∩=1 Δ +

Slide 32

Slide 32 text

Intersection Key Question: In which order to “apply” the deltas & the transformation rules? Contract & Reduce (C&R): Heuristically “break down” the access tree into lines & stars… …However, this is only a subspace of the potential plans. A 1 Δ1 A 2 A 5 A 4 A 6 A 8 A 9 A 10 A 12 A 3 Δ2 Δ3 Δ4 Δ5 Δ7 Δ8 Δ9 Δ6

Slide 33

Slide 33 text

Intersection Key Question: In which order to “apply” the deltas & the transformation rules? Contract & Reduce (C&R): Heuristically “break down” the access tree into lines & stars… …However, this is only a subspace of the potential plans. A 1 Δ1 A 2 A 5 A 4 A 6 A 8 A 9 A 10 A 12 A 3 Δ2 Δ3 Δ4 Δ5 Δ7 Δ8 Δ9 Δ6 A 1 Δ1 A 6 A 8 A 9 A 12 A 3 Δ4 Δ7 Δ3 Δ2

Slide 34

Slide 34 text

Intersection Key Question: In which order to “apply” the deltas & the transformation rules? Contract & Reduce (C&R): Heuristically “break down” the access tree into lines & stars… …However, this is only a subspace of the potential plans. A 1 Δ1 A 2 A 5 A 4 A 6 A 8 A 9 A 10 A 12 A 3 Δ2 Δ3 Δ4 Δ5 Δ7 Δ8 Δ9 Δ6 A 1 Δ1 A 6 A 8 A 9 A 12 A 3 Δ4 Δ7 Δ3 Δ2 Δ1 A 8 A 9 Q(A 6 ,A 12 ) A 3 Δ4 Δ1 Δ2 A 1

Slide 35

Slide 35 text

Intersection Key Question: In which order to “apply” the deltas & the transformation rules? Contract & Reduce (C&R): Heuristically “break down” the access tree into lines & stars… …However, this is only a subspace of the potential plans. A 1 Δ1 A 2 A 5 A 4 A 6 A 8 A 9 A 10 A 12 A 3 Δ2 Δ3 Δ4 Δ5 Δ7 Δ8 Δ9 Δ6 A 1 Δ1 A 6 A 8 A 9 A 12 A 3 Δ4 Δ7 Δ3 Δ2 Δ1 A 8 A 9 Q(A 6 ,A 12 ) A 3 Δ4 Δ1 Δ2 A 1 Δ1 A 8 A 3 Δ4 A 1 Q(A 6 ,A 9 ,A 12 ) Δ1

Slide 36

Slide 36 text

Intersection 16.6 13.2 6.7 5.3 4 2.9 2.3 1.9 25.1 23.6 16.7 10.4 7.7 6.9 4.4 2.4 0 2 4 6 8 10 12 14 16 18 Time (Seconds) Number of deltas Checkout Intersect C&R (a) k=4 (b) k=8 Up to 25Xspeedup when compared to baseline Dataset: records/version = 3M, delta size = 1%

Slide 37

Slide 37 text

Intersection 16.6 13.2 6.7 5.3 4 2.9 2.3 1.9 25.1 23.6 16.7 10.4 7.7 6.9 4.4 2.4 0 2 4 6 8 10 12 14 16 18 Time (Seconds) Number of deltas Checkout Intersect C&R (a) k=4 (b) k=8 0 2 4 6 8 10 12 14 16 18 2 4 6 8 10 Speedup Query size (k) 1% 2% 3% 4% 5% Up to 25Xspeedup when compared to baseline Gains largely depend on the size of the deltas Dataset: records/version = 3M, delta size = 1% Dataset: #records/version = 3M, number of deltas = 50

Slide 38

Slide 38 text

Summary First systematic study of how to optimize execution of different types of queries in a delta-based storage engine Ongoing work ▷ Transformation rules for additional query classes (e.g., joins, aggregations, etc.) ▷ Other delta formats, e.g., column-deltas ▷ Indexes on deltas, offline & adaptive For set-oriented files: ▷ Expand search space of possible equivalent plans Exploiting properties of deltas Transformation rules for queries ▷ Cost-based optimization to find good plans

Slide 39

Slide 39 text

Summary First systematic study of how to optimize execution of different types of queries in a delta-based storage engine Ongoing work ▷ Transformation rules for additional query classes (e.g., joins, aggregations, etc.) ▷ Other delta formats, e.g., column-deltas ▷ Indexes on deltas, offline & adaptive For set-oriented files: ▷ Expand search space of possible equivalent plans Exploiting properties of deltas Transformation rules for queries ▷ Cost-based optimization to find good plans

Slide 40

Slide 40 text

Thank You! Questions? Presentation template by SlidesCarnival [email protected]

Slide 41

Slide 41 text

Union 0 1 2 3 4 5 6 7 8 9 2 4 6 8 10 Speedup Query size (k) 1% 2% 3% 4% 5% Dataset properties: #records/version = 3M, #deltas = 50

Slide 42

Slide 42 text

t-Threshold 0 1 2 3 4 5 6 4,3 6,4 8,6 10,8 Speedup Query size, Threshold (k,t) 1% 2% 3% 4% 5% Dataset properties: #records/version = 3M, #deltas = 50