Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff

Amit Chavan
August 28, 2015

Introduction to our VLDB'15 paper on understanding the problems in data versioning.
Paper link (pdf): http://arxiv.org/abs/1505.05211

Transcript

  1. A typical data analysis workflow

    [Figure: workflow over versions 1 to 5, starting from a CSV from data.gov]
    EDIT: Correct “addresses”
    EDIT: Append Column
    NEW: Add file
    EDIT: Project columns
    EDIT: Partition rows
    …is akin to people collaborating on source code. However…
  2. Collaborative data science projects end up in dataset version management hell

    1) Many private copies of the datasets lead to massive redundancy in storage
    2) No easy way to keep track of dependencies between input and derived datasets and versions
    3) No mechanisms to support and record manual conflict resolution
    4) No way to analyze/compare/query versions [USENIX TaPP’15]
    This talk!
    range of V is Version
    range of E is V.Relations(name=“Employee”).Tuples
    retrieve V.id, count(E.id where E.age > 50) as n_senior_emp
  3. DATAHUB: A collaborative hosted data science platform

    A dataset management system – import, search, query, analyze a large number of (public) datasets
    A dataset version control system – branch, update, merge, transform large structured or unstructured datasets
    An app ecosystem and hooks for external applications (Matlab, R, iPython etc.)
    [Figure: DATAHUB Architecture]
    See: DataHub, CIDR’15; DataHub Demo, VLDB’15
  4. Can we use Version Control Systems like Git/SVN/…?

    No, because they typically use fairly simple algorithms and are optimized to work for code-like data
    LF Dataset (Real World), 100 versions: #Versions = 100, Avg. version size = 423 MB
    gzip = 10.2 GB
    svn = 8.5 GB
    git = 202 MB
    *this = 159 MB
  5. Can we use Version Control Systems like Git/SVN/…?

    No, because they typically use fairly simple algorithms and are optimized to work for code-like data
    Git ends up using large amounts of RAM for large files
    DON’T! Use extensions*
  6. Can we use Version Control Systems like Git/SVN/…?

    No, because they typically use fairly simple algorithms and are optimized to work for code-like data
    Git ends up using large amounts of RAM for large files
    Can we use temporal databases?
    No, because they are restricted to managing a linear chain of versions
    What about “deduplication” strategies in storage systems?
    Chunking files into blocks and storing unique ones works well with the assumption of localized changes
    Focus on minimizing storage costs in an online setting, and typically ignore recreation costs
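
The point about chunk-based deduplication and localized changes can be illustrated with a small sketch. It assumes fixed-size chunks and SHA-256 chunk identifiers; the function names `dedup_store` and `recreate` and the toy byte strings are hypothetical, and real storage systems may chunk differently.

```python
import hashlib

def dedup_store(versions, chunk_size=4):
    """Store each distinct fixed-size chunk once; return (store, recipes)."""
    store = {}     # chunk hash -> chunk bytes
    recipes = []   # per version: ordered list of chunk hashes
    for data in versions:
        recipe = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            key = hashlib.sha256(chunk).hexdigest()
            store.setdefault(key, chunk)   # keep only the first copy
            recipe.append(key)
        recipes.append(recipe)
    return store, recipes

def recreate(store, recipe):
    """Rebuild a version by concatenating its chunks in order."""
    return b"".join(store[key] for key in recipe)

v1 = b"aaaabbbbccccdddd"
v2 = b"aaaabbbbXXXXdddd"    # localized change: one chunk differs
v3 = b"Zaaaabbbbccccdddd"   # one-byte insert at the front shifts every chunk

store, recipes = dedup_store([v1, v2])
stored_local = sum(len(c) for c in store.values())

store_shift, _ = dedup_store([v1, v3])
stored_shift = sum(len(c) for c in store_shift.values())

print(stored_local, stored_shift)
```

With the localized edit only one new chunk is stored. With the shifted version no chunk is shared, so both versions are stored almost in full.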
  7. Storage cost is the space required to store a set of versions

    Recreation cost is the time* required to access a version
    [Figure: three versions of 100 MB, 101 MB and 102 MB stored in full] Storage cost = (100 + 101 + 102) = 303 MB
    [Figure: send entire version, 100 MB, 101 MB, 102 MB] Recreation cost = IO cost = (100 + 101 + 102) = 303 MB
    A delta between versions is a file which allows constructing one version given the other
    [Figure: directed delta from 1 to 2 (delete, add); undirected delta between 1 and 2 (delete, add in both directions)]
    Example: Unix diff, xdelta, XOR, etc.
    A delta has its own storage cost and recreation cost, which, in general, are independent of each other
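
A directed delta can be sketched with Python's standard `difflib`. This is only an illustration over lists of rows; the op format (`copy` / `add`) and the helper names `make_delta` and `apply_delta` are assumptions, not the encoding used by diff or xdelta.

```python
from difflib import SequenceMatcher

def make_delta(src, dst):
    """Directed delta src -> dst: copy ranges of src, or add literal rows."""
    ops = []
    matcher = SequenceMatcher(a=src, b=dst, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))       # reuse rows already in src
        elif tag in ("replace", "insert"):
            ops.append(("add", dst[j1:j2]))    # rows that only exist in dst
        # "delete": rows dropped from src need no entry in a directed delta
    return ops

def apply_delta(src, ops):
    """Construct the target version from the source version and the delta."""
    out = []
    for op in ops:
        if op[0] == "copy":
            out.extend(src[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out

v1 = ["id,age", "1,34", "2,51", "3,47"]
v2 = ["id,age", "1,34", "2,52", "3,47", "4,29"]

delta = make_delta(v1, v2)
literal_rows = sum(len(op[1]) for op in delta if op[0] == "add")
rebuilt = apply_delta(v1, delta)
```

The delta is directed because deleted rows are not recorded, so it cannot rebuild `v1` from `v2`. An undirected delta would have to keep both the added and the deleted rows.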
  8. Storage/Recreation Tradeoff with delta encoding

    Scenario 1: stored sizes 100 MB, 30 MB, 10 MB; storage cost = (100 + 30 + 10) = 140 MB; recreation costs 100 MB, 130 MB, 140 MB; sum = 370 MB
    Scenario 2: stored sizes 100 MB, 30 MB, 11 MB; storage cost = (100 + 30 + 11) = 141 MB; recreation costs 100 MB, 130 MB, 111 MB; sum = 341 MB
  9. Storage/Recreation Tradeoff with delta encoding

    Scenario 1: stored sizes 100 MB, 30 MB, 10 MB; storage cost = (100 + 30 + 10) = 140 MB; recreation costs 100 MB, 130 MB, 140 MB; sum = 370 MB
    Scenario 2: stored sizes 100 MB, 30 MB, 11 MB; storage cost = (100 + 30 + 11) = 141 MB; recreation costs 100 MB, 130 MB, 111 MB; sum = 341 MB
    Scenario 3: stored sizes 110 MB, 5 MB, 10 MB; storage cost = (110 + 5 + 10) = 125 MB; recreation costs 115 MB, 110 MB, 120 MB; sum = 345 MB
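
The arithmetic behind the three scenarios can be checked with a short sketch. The tree shapes below are inferred from the numbers (which version is stored in full and which delta hangs off which version is an assumption), and the names `costs`, `V1`, `V2`, `V3` are hypothetical. A version's recreation cost is taken to be the sum of stored sizes on its path from a fully stored version.

```python
def costs(plan):
    """plan maps version -> (parent or None, stored size in MB).

    Returns (total storage cost, {version: recreation cost}).
    """
    def recreation(v):
        parent, size = plan[v]
        # A version stored in full costs its own size; a delta also
        # needs everything required to recreate its parent.
        return size if parent is None else size + recreation(parent)

    storage = sum(size for _, size in plan.values())
    return storage, {v: recreation(v) for v in plan}

# Scenario 1: chain V1 -> V2 -> V3
scenario1 = {"V1": (None, 100), "V2": ("V1", 30), "V3": ("V2", 10)}
# Scenario 2: V3 is stored as an 11 MB delta against V1 instead
scenario2 = {"V1": (None, 100), "V2": ("V1", 30), "V3": ("V1", 11)}
# Scenario 3: V2 is stored in full, V1 and V3 are deltas against it
scenario3 = {"V1": ("V2", 5), "V2": (None, 110), "V3": ("V2", 10)}

for name, plan in [("1", scenario1), ("2", scenario2), ("3", scenario3)]:
    storage, rec = costs(plan)
    print(name, storage, rec, sum(rec.values()))
```

In Scenario 2 the recreation cost of V3 comes out as 100 + 11 = 111 MB, which is what yields the 341 MB total: one extra MB of storage buys a 29 MB drop in total recreation cost.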
  10. GIVEN

    1) set of versions
    2) partial information about deltas between versions
    3) deltas can be directed/undirected
    4) deltas can have different storage and recreation costs
    FIND A STORAGE SOLUTION THAT
    • minimizes total recreation cost within a storage budget
    OR
    • minimizes maximum recreation cost within a storage budget
  11. GIVEN

    1) set of versions
    2) partial information about deltas between versions
    3) deltas can be directed/undirected
    4) deltas can have different storage and recreation costs
    FIND A STORAGE SOLUTION THAT
    • minimizes total recreation cost within a storage budget
    OR
    • minimizes maximum recreation cost within a storage budget
    See paper for all problem variations and complexity results
  12. GIVEN

    1) set of versions
    2) partial information about deltas between versions
    3) deltas can be directed/undirected
    4) deltas have identical storage and recreation costs (in this talk; the general problem allows different costs)
    FIND A STORAGE SOLUTION THAT
    • minimizes total recreation cost within a storage budget
    OR
    • minimizes maximum recreation cost within a storage budget
  13. Baselines

    Minimum Cost Arborescence (MCA): Edmonds’ algorithm, time complexity = O(E + V log V). Minimize storage cost; recreation cost: no constraint
    Shortest Path Tree (SPT): Dijkstra’s algorithm, time complexity = O(E log V). Minimize recreation cost; storage cost: no constraint
    [Figure: example version graph rooted at a “Null” version, with edge weights]
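
The SPT baseline can be sketched with a textbook Dijkstra over a version graph whose root (node 0) stands for the “Null” version. The graph below is made up for illustration: an edge from 0 means storing that version in full, any other edge is a delta, and storage and recreation costs are assumed identical. The MCA baseline would need Edmonds’ algorithm, which is not shown here.

```python
import heapq

def shortest_path_tree(graph, root=0):
    """Dijkstra: returns (parent, dist) minimizing each recreation cost."""
    dist = {root: 0}
    parent = {}
    heap = [(0, root)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue          # stale heap entry
        done.add(u)
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return parent, dist

# Hypothetical graph: graph[u][v] = size of the delta (or full version) u -> v
graph = {
    0: {1: 20, 2: 25, 3: 26},   # store versions 1, 2, 3 in full
    1: {2: 7, 3: 9},            # deltas from version 1
    2: {3: 3},                  # delta from version 2
}

parent, dist = shortest_path_tree(graph)
spt_storage = sum(graph[u][v] for v, u in parent.items())
print(parent, dist, spt_storage)
```

Here the SPT stores every version in full (storage 71), while the chain 0 → 1 → 2 → 3 would need only 30 units of storage at the price of recreation costs 20, 27 and 30.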
  14. Local Move Greedy (LMG) heuristic to minimize total recreation cost

    1) Start with MCA and a set of candidate deltas (e.g. all SPT edges not in MCA)
    2) Iterate until storage budget reached:
        1) For each delta, compute $\rho = \frac{\text{reduction in sum of recreation costs}}{\text{increase in storage cost}}$
        2) Choose the delta with the highest $\rho$ value
    Time Complexity: O(V^2)
    [Figure: tree at iteration (i) and iteration (i+1), with edge weights] e(0,4) from SPT is chosen to replace e(1,4): ρ value of e(0,4) was maximum (among other candidates)
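
The LMG steps above can be sketched as follows. This is a simplified reading of the slide, not the paper's implementation: it recomputes costs naively, assumes identical storage and recreation costs, and treats a swap that lowers recreation cost without extra storage as having infinite ρ. The function name `lmg` and the toy graph are hypothetical.

```python
def lmg(parent, weight, candidates, budget):
    """parent: v -> u for the starting tree (root 0 has no entry).
    weight: (u, v) -> delta size. candidates: edges (u, v) not in the tree.
    budget: maximum total storage. Returns (new parent map, storage used)."""
    parent = dict(parent)
    candidates = list(candidates)

    def rec(v):
        # Recreation cost: sum of edge weights on the path from root 0.
        return 0 if v == 0 else weight[(parent[v], v)] + rec(parent[v])

    def subtree(v):
        # v plus all of its descendants in the current tree.
        nodes, changed = {v}, True
        while changed:
            changed = False
            for child, par in parent.items():
                if par in nodes and child not in nodes:
                    nodes.add(child)
                    changed = True
        return nodes

    storage = sum(weight[(p, c)] for c, p in parent.items())
    while True:
        best = None
        for u, v in candidates:
            sub = subtree(v)
            if parent[v] == u or u in sub:
                continue                      # no-op, or would form a cycle
            gain = (rec(v) - rec(u) - weight[(u, v)]) * len(sub)
            extra = weight[(u, v)] - weight[(parent[v], v)]
            if gain <= 0 or storage + extra > budget:
                continue
            rho = gain / extra if extra > 0 else float("inf")
            if best is None or rho > best[0]:
                best = (rho, u, v, extra)
        if best is None:
            return parent, storage
        _, u, v, extra = best
        parent[v] = u                         # local move: re-parent v under u
        storage += extra
        candidates.remove((u, v))

weight = {(0, 1): 20, (1, 2): 7, (2, 3): 3, (0, 2): 25, (0, 3): 26}
mca = {1: 0, 2: 1, 3: 2}                      # minimum-storage chain, storage 30
tree, used = lmg(mca, weight, [(0, 2), (0, 3)], budget=50)
print(tree, used)
```

With a budget of 50, the edge (0, 2) has the higher ρ (4/18 against 4/23) and is swapped in; (0, 3) then no longer fits the budget.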
  15. Modified Prim’s (MP) heuristic to minimize maximum recreation cost

    1) Start with V0, initially all other recreation costs = ∞
    2) Grow tree one edge at a time
        • Like Prim’s, choose the smallest weight edge
        • Unlike Prim’s, OK to re-visit a version if its new recreation cost < old recreation cost (AND new recreation cost < θ)
    Time Complexity: O(E log V)
    [Figure: tree at iteration (i) and iteration (i+1), with edge weights] e(2,3) is chosen to replace e(1,3): recreation cost of 3 is reduced from 25 to 20
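
A minimal sketch of the MP idea, taking the two conditions on the slide literally: pop the smallest-weight edge, and attach or re-attach a version only if that lowers its recreation cost and keeps it below θ. The paper may apply further checks. The name `modified_prim`, the graph and the threshold are hypothetical.

```python
import heapq

def modified_prim(graph, theta, root=0):
    """graph[u][v] = weight of edge u -> v. Returns parent map of the tree."""
    parent = {}

    def rec(v):
        # Recreation cost in the current tree (sum of weights from root).
        return 0 if v == root else graph[parent[v]][v] + rec(parent[v])

    heap = [(w, root, v) for v, w in graph.get(root, {}).items()]
    heapq.heapify(heap)
    while heap:
        w, u, v = heapq.heappop(heap)         # smallest-weight edge, as in Prim's
        if v == root:
            continue
        new_cost = rec(u) + w
        old_cost = rec(v) if v in parent else float("inf")
        if new_cost < old_cost and new_cost < theta:
            parent[v] = u                     # first visit or re-visit
            for x, wx in graph.get(v, {}).items():
                heapq.heappush(heap, (wx, v, x))
    return parent

graph = {
    0: {1: 10, 2: 16, 3: 30},
    1: {3: 15},
    2: {3: 5},
}
tree = modified_prim(graph, theta=28)
print(tree)
```

In this run version 3 is first attached under version 1 (recreation cost 25) and later re-visited through e(2,3), which lowers its recreation cost to 21.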
  16. Git Heuristic (GitH) heuristic to minimize storage cost

    1) Order versions by size
    2) Maintain a sliding window of size w; for version Vi and for each Vl ∈ window, compute: $\Delta'_{l,i} = \frac{\Delta_{l,i}}{d - d_l}$ (d is the maximum depth, d_l the depth of Vl)
    3) Choose Vl with the least Δ′ to be the parent of Vi
    Time Complexity: O(V log V + wV)
    [Figure: iteration (i) and iteration (i+1), w = 4] e(2,6) has the least Δ′ value (d = 5), although e(5,6) is the smallest without the depth bias
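
The sliding-window selection can be sketched as below. Several details are assumptions made for illustration: versions are ordered largest first, the depth-biased score is the delta size divided by (d − depth of the candidate parent), candidates already at maximum depth are skipped, and a version with no usable candidate is stored in full. The name `git_heuristic` and the delta table are hypothetical.

```python
def git_heuristic(sizes, delta, w, d):
    """sizes: version -> full size. delta(l, i): size of the delta l -> i.
    w: window size. d: maximum allowed delta-chain depth."""
    order = sorted(sizes, key=sizes.get, reverse=True)  # assumed: largest first
    parent, depth = {}, {}
    for pos, vi in enumerate(order):
        best = None
        for vl in order[max(0, pos - w):pos]:           # the last w versions
            if depth[vl] >= d:
                continue                                # chain is already too deep
            biased = delta(vl, vi) / (d - depth[vl])    # depth-biased delta size
            if best is None or biased < best[0]:
                best = (biased, vl)
        if best is None:
            parent[vi], depth[vi] = None, 0             # store in full
        else:
            parent[vi], depth[vi] = best[1], depth[best[1]] + 1
    return parent, depth

sizes = {"A": 100, "B": 90, "C": 80}
table = {("A", "B"): 10, ("A", "C"): 12, ("B", "C"): 8}
parent, depth = git_heuristic(sizes, lambda l, i: table[(l, i)], w=2, d=2)
print(parent, depth)
```

For version C the raw delta from B (8) is smaller than the one from A (12), but B already sits at depth 1, so the biased scores are 8 and 6 and A is chosen as the parent.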
  17. Evaluation

    [Plot: Storage Cost (GB) and Sum of Recreation Costs (GB) for LMG, MP, LAST* and GitH, with MCA Storage Cost and SPT Recreation Cost marked]
    Type = CSV files
    #Versions = 100010
    #Deltas = 18086876
    Average version size = 347.65 MB
    MCA Recreation Cost = 1.15 × 10^7 GB
    SPT Storage Cost = 3.4 × 10^4 GB
    *LAST: Khuller et al. Balancing minimum spanning trees and shortest-path trees. Algorithmica, 1995.
  18. Evaluation

    [Plot: Storage Cost (GB) and Sum of Recreation Costs (GB) for LMG, MP, LAST* and GitH, with MCA Storage Cost and SPT Recreation Cost marked]
    A storage budget of 1.1× the MCA storage cost reduces total recreation cost by 1000×
    Type = CSV files
    #Versions = 100010
    #Deltas = 18086876
    Average version size = 347.65 MB
    MCA Recreation Cost = 1.15 × 10^7 GB
    SPT Storage Cost = 3.4 × 10^4 GB
    *LAST: Khuller et al. Balancing minimum spanning trees and shortest-path trees. Algorithmica, 1995.
  19. Evaluation

    [Plot: Storage Cost (GB) and Sum of Recreation Costs (GB) for LMG, MP, LAST and GitH, with MCA Storage Cost and SPT Recreation Cost marked]
    ABOUT LF DATASET:
    #Versions = 100
    #Deltas = 3562
    Average version size = 422.79 MB
    MCA Recreation Cost = 47 GB
    SPT Storage Cost = 41 GB
  20. More in paper

    1) Detailed discussion of problem variants
    2) Complexity results
    3) Additional heuristics
    4) Workload aware optimization
    5) Comparisons with optimal solution (when feasible)