Slide 1

Slide 1 text

Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff
Souvik Bhattacherjee, Amit Chavan, Silu Huang, Amol Deshpande, Aditya Parameswaran

Slide 2

Slide 2 text

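A typical data analysis workflow
[Figure: a graph of dataset versions 1 to 5, starting from a CSV from data.gov, with each step labeled by the operation that produced it]
• EDIT: Correct “addresses”
• EDIT: Append column
• NEW: Add file
• EDIT: Project columns
• EDIT: Partition rows
…is akin to people collaborating on source code. However…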

Slide 3

Slide 3 text

Collaborative data science projects end up in dataset version management hell
1) Many private copies of the datasets lead to massive redundancy in storage (this talk)
2) No easy way to keep track of dependencies between input and derived datasets and versions
3) No mechanisms to support and record manual conflict resolution
4) No way to analyze/compare/query versions [USENIX TaPP’15]
Example version query:
range of V is Version
range of E is V.Relations(name=“Employee”).Tuples
retrieve V.id, count(E.id where E.age > 50) as n_senior_emp

Slide 4

Slide 4 text

DATAHUB: A collaborative hosted data science platform
• A dataset management system: import, search, query, analyze a large number of (public) datasets
• A dataset version control system: branch, update, merge, transform large structured or unstructured datasets
• An app ecosystem and hooks for external applications (Matlab, R, iPython, etc.)
[Figure: DATAHUB architecture]
See: DataHub, CIDR’15; DataHub Demo, VLDB’15

Slide 5

Slide 5 text

Can we use Version Control Systems like Git/SVN/…?
No, because they typically use fairly simple algorithms and are optimized to work for code-like data.
LF Dataset (real world): #Versions = 100, avg. version size = 423 MB
Storage used for the 100 versions:
• gzip = 10.2 GB
• svn = 8.5 GB
• git = 202 MB
• *this = 159 MB

Slide 6

Slide 6 text

Can we use Version Control Systems like Git/SVN/…?
No, because they typically use fairly simple algorithms and are optimized to work for code-like data.
Git ends up using large amounts of RAM for large files.
DON’T! Use extensions*

Slide 7

Slide 7 text

Can we use Version Control Systems like Git/SVN/…?
No, because they typically use fairly simple algorithms and are optimized to work for code-like data. Git ends up using large amounts of RAM for large files.
Can we use temporal databases?
No, because they are restricted to managing a linear chain of versions.
What about “deduplication” strategies in storage systems?
• Chunking files into blocks and storing unique ones works well with the assumption of localized changes
• Focus on minimizing storage costs in an online setting, and typically ignore recreation costs

Slide 8

Slide 8 text

Storage cost is the space required to store a set of versions.
Recreation cost is the time* required to access a version.
Example: three versions of 100 MB, 101 MB and 102 MB, each stored in full (send entire version)
• Storage cost = (100 + 101 + 102) = 303 MB
• Recreation cost = IO cost; total = (100 + 101 + 102) = 303 MB
A delta between versions is a file which allows constructing one version given the other.
• Directed delta (1 → 2): delete, add
• Undirected delta (1 ↔ 2): delete, add in both directions
Examples: Unix diff, xdelta, XOR, etc.
A delta has its own storage cost and recreation cost, which, in general, are independent of each other.
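
To make the undirected case concrete, here is a minimal sketch of an XOR delta in Python. The helper names `xor_delta` and `apply_xor` and the two sample CSV byte strings are assumptions made for illustration; they are not from the talk. The same delta recreates either version from the other, which is what makes it undirected.

```python
def xor_delta(a: bytes, b: bytes) -> bytes:
    """Undirected delta: byte-wise XOR of two versions, zero-padded to equal length."""
    n = max(len(a), len(b))
    return bytes(x ^ y for x, y in zip(a.ljust(n, b"\0"), b.ljust(n, b"\0")))


def apply_xor(version: bytes, delta: bytes, target_len: int) -> bytes:
    """Recreate the other version from one version plus the XOR delta."""
    padded = version.ljust(len(delta), b"\0")
    return bytes(x ^ y for x, y in zip(padded, delta))[:target_len]


# Hypothetical sample data: two small versions of a CSV file.
v1 = b"id,age\n1,40\n"
v2 = b"id,age\n1,41\n2,55\n"

delta = xor_delta(v1, v2)
recreated_v2 = apply_xor(v1, delta, len(v2))  # direction 1 -> 2
recreated_v1 = apply_xor(v2, delta, len(v1))  # direction 2 -> 1, same delta
```

A raw XOR delta is as long as the larger version, so it only reduces storage once it is compressed; the sketch stores the target length separately for the same reason.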

Slide 9

Slide 9 text

Storage/Recreation Tradeoff with delta encoding
Scenario 1
• Stored: 100 MB, 30 MB, 10 MB; storage cost = (100 + 30 + 10) = 140 MB
• Recreation costs: 100 MB, 130 MB, 140 MB; sum = 370 MB
Scenario 2
• Stored: 100 MB, 30 MB, 11 MB; storage cost = (100 + 30 + 11) = 141 MB
• Recreation costs: 100 MB, 130 MB, 111 MB; sum = 341 MB

Slide 10

Slide 10 text

Storage/Recreation Tradeoff with delta encoding
Scenario 1
• Stored: 100 MB, 30 MB, 10 MB; storage cost = (100 + 30 + 10) = 140 MB
• Recreation costs: 100 MB, 130 MB, 140 MB; sum = 370 MB
Scenario 2
• Stored: 100 MB, 30 MB, 11 MB; storage cost = (100 + 30 + 11) = 141 MB
• Recreation costs: 100 MB, 130 MB, 111 MB; sum = 341 MB
Scenario 3
• Stored: 110 MB, 5 MB, 10 MB; storage cost = (110 + 5 + 10) = 125 MB
• Recreation costs: 115 MB, 110 MB, 120 MB; sum = 345 MB
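
The arithmetic behind the three scenarios can be checked with a short Python sketch. It assumes, as this talk does, that a delta's recreation cost equals its storage cost, so a version's recreation cost is the sum of everything stored along its chain back to a fully stored version. The version numbering (1, 2, 3), the parent assignments and the function name are assumptions chosen to match the totals on the slide. In scenario 2 the third version costs 100 + 11 = 111 MB to recreate, which is what gives the sum of 341 MB.

```python
def storage_and_recreation(parent, stored_mb):
    """parent[v] is the version v is stored as a delta from (None = stored in full).
    stored_mb[v] is what is actually kept on disk for v: full size or delta size."""
    def recreation(v):
        total = 0
        while v is not None:          # walk the delta chain back to a full copy
            total += stored_mb[v]
            v = parent[v]
        return total

    storage = sum(stored_mb.values())
    return storage, {v: recreation(v) for v in parent}


# Scenario 1: chain 1 <- 2 <- 3
scenario1 = storage_and_recreation({1: None, 2: 1, 3: 2}, {1: 100, 2: 30, 3: 10})
# Scenario 2: versions 2 and 3 are both deltas from version 1
scenario2 = storage_and_recreation({1: None, 2: 1, 3: 1}, {1: 100, 2: 30, 3: 11})
# Scenario 3: version 2 is stored in full, versions 1 and 3 are deltas from it
scenario3 = storage_and_recreation({2: None, 1: 2, 3: 2}, {2: 110, 1: 5, 3: 10})

for name, (storage, rec) in [("1", scenario1), ("2", scenario2), ("3", scenario3)]:
    print(f"Scenario {name}: storage={storage} MB, recreation sum={sum(rec.values())} MB")
```

Scenario 3 has the smallest storage cost and scenario 2 the smallest recreation sum, so neither objective determines the other.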

Slide 11

Slide 11 text

GIVEN
1) set of versions
2) partial information about deltas between versions
3) deltas can be directed/undirected
4) deltas can have different storage and recreation costs
FIND A STORAGE SOLUTION THAT
• minimizes total recreation cost within a storage budget
OR
• minimizes maximum recreation cost within a storage budget

Slide 12

Slide 12 text

GIVEN
1) set of versions
2) partial information about deltas between versions
3) deltas can be directed/undirected
4) deltas can have different storage and recreation costs
FIND A STORAGE SOLUTION THAT
• minimizes total recreation cost within a storage budget
OR
• minimizes maximum recreation cost within a storage budget
See paper for all problem variations and complexity results

Slide 13

Slide 13 text

In this talk
GIVEN
1) set of versions
2) partial information about deltas between versions
3) deltas can be directed/undirected
4) deltas have identical storage and recreation costs (the general problem allows them to differ)
FIND A STORAGE SOLUTION THAT
• minimizes total recreation cost within a storage budget
OR
• minimizes maximum recreation cost within a storage budget
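
One way to write the two objectives down is sketched below. The notation is an assumption made for this note and may differ from the paper: a storage solution is a tree $T$ rooted at a null version $V_0$ that spans all versions, an edge from $V_0$ to $V_i$ means storing $V_i$ in full, $\Delta_e$ and $\Phi_e$ are the storage and recreation costs of edge $e$, and $\beta$ is the storage budget.

```latex
% Recreation cost of a version: sum of recreation costs along its path from the null version
R_T(V_i) = \sum_{e \,\in\, \mathrm{path}_T(V_0 \to V_i)} \Phi_e

% Objective 1: minimize total recreation cost within a storage budget
\min_{T} \; \sum_{i} R_T(V_i)
\quad \text{subject to} \quad \sum_{e \in T} \Delta_e \le \beta

% Objective 2: minimize maximum recreation cost within a storage budget
\min_{T} \; \max_{i} R_T(V_i)
\quad \text{subject to} \quad \sum_{e \in T} \Delta_e \le \beta

% Restriction used in this talk: identical storage and recreation costs
\Delta_e = \Phi_e \quad \text{for every edge } e
```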

Slide 14

Slide 14 text

Baselines
Minimum Cost Arborescence (MCA)
• Minimize storage cost; recreation cost: no constraint
• Edmonds’ algorithm; time complexity = O(E + V log V)
Shortest Path Tree (SPT)
• Minimize recreation cost; storage cost: no constraint
• Dijkstra’s algorithm; time complexity = O(E log V)
[Figure: example version graph rooted at a “null” version, shown with its MCA and its SPT]
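
The two baselines can be sketched in a few lines of Python on a toy graph. Two simplifications are assumed here and are not from the talk: deltas are undirected with identical storage and recreation costs, so Prim's minimum spanning tree is swapped in for Edmonds' arborescence algorithm (for directed deltas Edmonds' algorithm would be needed), and the edge weights are made up. Node 0 plays the role of the null version, and an edge from 0 to a version means storing that version in full.

```python
import heapq


def undirected(edges):
    """Build a symmetric adjacency map {u: {v: weight}} from (u, v, weight) triples."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj


def shortest_path_tree(adj, root=0):
    """Dijkstra: parent pointers that minimize every version's recreation cost."""
    dist, parent, done = {root: 0}, {root: None}, set()
    heap = [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, w in adj[u].items():
            if v not in done and d + w < dist.get(v, float("inf")):
                dist[v], parent[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    return parent


def minimum_spanning_tree(adj, root=0):
    """Prim: parent pointers that minimize total storage (undirected stand-in for MCA)."""
    parent = {root: None}
    heap = [(w, root, v) for v, w in adj[root].items()]
    heapq.heapify(heap)
    while heap:
        w, u, v = heapq.heappop(heap)
        if v in parent:
            continue
        parent[v] = u
        for x, wx in adj[v].items():
            if x not in parent:
                heapq.heappush(heap, (wx, v, x))
    return parent


def tree_costs(adj, parent):
    """Return (storage cost, sum of recreation costs) of a storage tree."""
    def rec(v):
        total = 0
        while parent[v] is not None:
            total += adj[parent[v]][v]
            v = parent[v]
        return total
    storage = sum(adj[p][v] for v, p in parent.items() if p is not None)
    return storage, sum(rec(v) for v in parent)


# Hypothetical graph: full sizes 100/101/102 MB, plus three deltas between versions.
adj = undirected([(0, 1, 100), (0, 2, 101), (0, 3, 102), (1, 2, 30), (2, 3, 10), (1, 3, 11)])
mst_costs = tree_costs(adj, minimum_spanning_tree(adj))
spt_costs = tree_costs(adj, shortest_path_tree(adj))
print("MST (storage, recreation sum):", mst_costs)
print("SPT (storage, recreation sum):", spt_costs)
```

On this toy input the spanning tree stores far less but has the higher recreation sum, while the shortest path tree stores every version in full.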

Slide 15

Slide 15 text

Local Move Greedy (LMG): heuristic to minimize total recreation cost
1) Start with MCA and a set of candidate deltas (e.g., all SPT edges not in MCA)
2) Iterate until storage budget reached:
   1) For each delta, compute ρ = (reduction in sum of recreation costs) / (increase in storage cost)
   2) Choose the delta with the highest ρ value
Example, iteration (i) to (i+1): e(0,4) from SPT is chosen to replace e(1,4), because the ρ value of e(0,4) was the maximum among the candidates
Time complexity: O(V²)
[Figure: version graph over nodes 0 to 6 at iteration (i) and iteration (i+1)]
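
The LMG steps can be sketched as follows. This is a deliberately unoptimized illustration, not the authors' implementation: it recomputes all recreation costs for every candidate, so it does not achieve the stated time complexity. The function names, the cycle check, the rule of skipping moves that would exceed the budget, and the toy weights are all assumptions. Trees are given as parent pointers from the null version 0, and `w[(p, v)]` is the cost of storing v as a delta from p.

```python
def recreation_costs(parent, w):
    """Recreation cost of each version: sum of delta costs on its path from the root."""
    cost = {}

    def rec(v):
        if v not in cost:
            p = parent[v]
            cost[v] = 0 if p is None else rec(p) + w[(p, v)]
        return cost[v]

    for v in parent:
        rec(v)
    return cost


def in_subtree(parent, node, top):
    """True if node lies in the subtree rooted at top (re-parenting top under it would cycle)."""
    while node is not None:
        if node == top:
            return True
        node = parent[node]
    return False


def local_move_greedy(parent, w, candidates, budget):
    parent, candidates = dict(parent), set(candidates)
    storage = sum(w[(p, v)] for v, p in parent.items() if p is not None)
    while candidates:
        base = sum(recreation_costs(parent, w).values())
        best, best_rho = None, 0.0
        for u, v in candidates:
            extra = w[(u, v)] - w[(parent[v], v)]       # increase in storage cost
            if storage + extra > budget or in_subtree(parent, u, v):
                continue
            trial = dict(parent)
            trial[v] = u
            gain = base - sum(recreation_costs(trial, w).values())
            if gain <= 0:
                continue
            rho = gain / extra if extra > 0 else float("inf")
            if rho > best_rho:
                best, best_rho = (u, v), rho
        if best is None:                                 # nothing fits the budget
            break
        u, v = best
        storage += w[(u, v)] - w[(parent[v], v)]
        parent[v] = u
        candidates.discard(best)
    return parent, storage


# Hypothetical input: a minimum-storage tree plus two shortest-path edges as candidates.
w = {(0, 1): 100, (0, 2): 101, (0, 3): 102, (1, 3): 11, (3, 2): 10}
start = {0: None, 1: 0, 3: 1, 2: 3}
lmg_parent, lmg_storage = local_move_greedy(start, w, {(0, 2), (0, 3)}, budget=215)
lmg_recreation = sum(recreation_costs(lmg_parent, w).values())
```

With a budget of 215 only one move fits: e(0,2) has ρ = 20/91 against 18/91 for e(0,3), so version 2 is re-parented to the null version.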

Slide 16

Slide 16 text

Modified Prim’s (MP): heuristic to minimize maximum recreation cost
1) Start with V0; initially all other recreation costs = ∞
2) Grow tree one edge at a time
   • Like Prim’s, choose the smallest weight edge
   • Unlike Prim’s, OK to re-visit a version if its new recreation cost < old recreation cost (AND new recreation cost < θ)
Example, iteration (i) to (i+1): e(2,3) is chosen to replace e(1,3); the recreation cost of version 3 is reduced from 25 to 20
Time complexity: O(E log V)
[Figure: version graph over nodes 0 to 5 at iteration (i) and iteration (i+1)]
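
A rough Python sketch of the MP idea follows. Several details are assumptions of this sketch rather than statements from the slide: θ is treated as an upper bound on any version's recreation cost, a re-visit is accepted only if the new delta is also no larger than the current one (so storage never grows on a re-visit), deltas are undirected with identical storage and recreation costs, and the toy graph is invented. Recreation costs are recomputed from the final tree at the end because a re-visit can leave descendants with stale, overestimated values.

```python
import heapq
import math


def modified_prim(adj, theta, root=0):
    """adj: {u: {v: weight}}; returns (parent pointers, recreation cost per version)."""
    parent, rec, delta = {root: None}, {root: 0}, {root: 0}
    in_tree = set()
    heap = [(0, root)]                       # ordered by delta (storage) weight, as in Prim's
    while heap:
        _, u = heapq.heappop(heap)
        if u in in_tree:
            continue                         # stale heap entry
        in_tree.add(u)
        for v, w in adj[u].items():
            new_rec = rec[u] + w
            if v in in_tree:
                # Re-visit: switch parent if recreation improves and storage does not grow.
                if new_rec < rec[v] and w <= delta[v]:
                    parent[v], rec[v], delta[v] = u, new_rec, w
            elif new_rec <= theta and w < delta.get(v, math.inf):
                parent[v], rec[v], delta[v] = u, new_rec, w
                heapq.heappush(heap, (w, v))

    def cost(v):                             # exact recreation cost in the final tree
        return 0 if parent[v] is None else cost(parent[v]) + adj[parent[v]][v]

    return parent, {v: cost(v) for v in in_tree}


# Hypothetical graph: a chain 0-1-2-3 of small deltas plus a shortcut 0-4-3.
edges = [(0, 1, 5), (1, 2, 5), (2, 3, 5), (0, 4, 6), (4, 3, 5)]
adj = {}
for a, b, wt in edges:
    adj.setdefault(a, {})[b] = wt
    adj.setdefault(b, {})[a] = wt

mp_parent, mp_rec = modified_prim(adj, theta=18)
```

In this run version 3 is first attached to version 2 with recreation cost 15, then re-visited from version 4 and re-parented, which lowers its recreation cost to 11 without changing total storage.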

Slide 17

Slide 17 text

Git Heuristic (GitH): heuristic to minimize storage cost
1) Order versions by size
2) Maintain a sliding window of size w; for version Vi and for each Vl ∈ window, compute: Δ′(l,i) = Δ(l,i) / (max_depth − d), where d is the depth of Vl
3) Choose Vl with the least Δ′ to be the parent of Vi
Example (w = 4), iteration (i) to (i+1): e(2,6) has the least Δ′ value (d = 5), although e(5,6) has the smallest Δ when the depth bias is ignored
Time complexity: O(V log V + wV)
[Figure: versions 1 to 6 with the sliding window at iteration (i) and iteration (i+1)]
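
The windowed, depth-biased parent choice can be sketched in Python as below. This is an illustration under assumptions, not Git's or the authors' code: versions are ordered largest first, the score is the delta size divided by (max_depth minus the candidate parent's depth), a version is stored in full when the window is empty or when its best delta is no smaller than the version itself, and the sizes and delta values are invented.

```python
from collections import deque


def git_heuristic(sizes, delta, window=4, max_depth=10):
    """sizes: {version: size}; delta(a, b): size of the delta between a and b.
    Returns parent pointers (None = stored in full)."""
    order = sorted(sizes, key=sizes.get, reverse=True)   # assumed: largest first
    recent = deque(maxlen=window)                         # sliding window of candidates
    parent, depth = {}, {}
    for v in order:
        best, best_score = None, float("inf")
        for cand in recent:
            if depth[cand] >= max_depth:
                continue                                  # delta chain already too deep
            score = delta(cand, v) / (max_depth - depth[cand])
            if score < best_score:
                best, best_score = cand, score
        if best is None or sizes[v] <= delta(best, v):
            parent[v], depth[v] = None, 0                 # store in full
        else:
            parent[v], depth[v] = best, depth[best] + 1
        recent.append(v)
    return parent


# Hypothetical sizes and pairwise delta sizes.
sizes = {"A": 100, "B": 90, "C": 80}
pair_delta = {frozenset("AB"): 10, frozenset("AC"): 12, frozenset("BC"): 11}


def delta(a, b):
    return pair_delta[frozenset((a, b))]


shallow = git_heuristic(sizes, delta, max_depth=2)    # strong depth bias
deep = git_heuristic(sizes, delta, max_depth=100)     # depth bias almost irrelevant
```

With `max_depth=2`, version C picks A as its parent even though its delta to B is smaller, because B already sits at depth 1; with `max_depth=100` the bias nearly vanishes and C picks B.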

Slide 18

Slide 18 text

Evaluation
[Figure: storage cost (GB) against sum of recreation costs (GB) for LMG, MP, LAST* and GitH, with reference lines for the MCA storage cost and the SPT recreation cost]
Dataset: type = CSV files; #Versions = 100010; #Deltas = 18086876; average version size = 347.65 MB
MCA recreation cost = 1.15 × 10^7 GB
SPT storage cost = 3.4 × 10^4 GB
*LAST: Khuller et al. Balancing minimum spanning trees and shortest-path trees. Algorithmica, 1995.

Slide 19

Slide 19 text

Evaluation
[Figure: storage cost (GB) against sum of recreation costs (GB) for LMG, MP, LAST* and GitH, with reference lines for the MCA storage cost and the SPT recreation cost]
A storage budget of 1.1× the MCA storage cost reduces total recreation cost by 1000×
Dataset: type = CSV files; #Versions = 100010; #Deltas = 18086876; average version size = 347.65 MB
MCA recreation cost = 1.15 × 10^7 GB
SPT storage cost = 3.4 × 10^4 GB
*LAST: Khuller et al. Balancing minimum spanning trees and shortest-path trees. Algorithmica, 1995.

Slide 20

Slide 20 text

Evaluation
[Figure: storage cost (GB) against sum of recreation costs (GB) for LMG, MP, LAST and GitH, with reference lines for the MCA storage cost and the SPT recreation cost]
About the LF dataset: #Versions = 100; #Deltas = 3562; average version size = 422.79 MB
MCA recreation cost = 47 GB
SPT storage cost = 41 GB

Slide 21

Slide 21 text

More in paper
1) Detailed discussion of problem variants
2) Complexity results
3) Additional heuristics
4) Workload aware optimization
5) Comparisons with optimal solution (when feasible)

Slide 22

Slide 22 text

QUESTIONS?