Amit Chavan
September 02, 2015

VLDB, 2015: Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff

This talk was given at the 41st International Conference on Very Large Data Bases (VLDB), Kohala Coast, Hawai'i, 2015.

Abstract:
The relative ease of collaborative data science and analysis has led to a proliferation of many thousands or millions of versions of the same datasets in many scientific and commercial domains, acquired or constructed at various stages of data analysis across many users, and often over long periods of time. Managing, storing, and recreating these dataset versions is a non-trivial task. The fundamental challenge here is the storage-recreation trade-off: the more storage we use, the faster it is to recreate or retrieve versions, while the less storage we use, the slower it is to recreate or retrieve versions. Despite the fundamental nature of this problem, there has been surprisingly little work on it. In this paper, we study this trade-off in a principled manner: we formulate six problems under various settings, trading off these quantities in various ways, demonstrate that most of the problems are intractable, and propose a suite of inexpensive heuristics, drawing from techniques in the delay-constrained scheduling and spanning tree literature, to solve these problems. We have built a prototype version management system that aims to serve as a foundation for our DataHub system for facilitating collaborative data science. We demonstrate, via extensive experiments, that our proposed heuristics provide efficient solutions in practical dataset versioning scenarios.

Paper:
http://arxiv.org/abs/1505.05211

Transcript

  1. A typical data analysis workflow

    Start: CSV from data.gov
    EDIT: Correct “addresses”
    EDIT: Append column
    NEW: Add file
    EDIT: Project columns
    EDIT: Partition rows
    …is akin to people collaborating on source code. However… 1000s of versions
  2. Collaborative data science projects end up in dataset version management hell

    1) Many private copies of the datasets lead to massive redundancy in storage
    2) No easy way to keep track of dependencies between input and derived datasets and versions
    3) No mechanisms to support and record manual conflict resolution
    4) No way to analyze/compare/query versions [USENIX TaPP’15] … This talk!
  3. DATAHUB: A collaborative hosted data science platform

    A dataset management system: import, search, query, analyze a large number of (public) datasets
    A dataset version control system: branch, update, merge, transform large structured or unstructured datasets
    An app ecosystem and hooks for external applications (Matlab, R, iPython etc.)
    DATAHUB Architecture
    See: DataHub, CIDR’15; DataHub Demo, VLDB’15 (Thu, Sep 3, 10:30am-12:00pm)
  4. How can we store thousands of versions of datasets compactly?

    …and still be able to access any version, on demand, efficiently?
  5. Can we use Version Control Systems like Git/SVN/…?

    No, because they typically use fairly simple algorithms and are optimized to work for code-like data.
    LF Dataset (Real World): #Versions = 100, Avg. version size = 423 MB
    Storage used: gzip = 10.2 GB, svn = 8.5 GB, git = 202 MB, *this = 159 MB
  6. Can we use Version Control Systems like Git/SVN/…?

    No, because they typically use fairly simple algorithms and are optimized to work for code-like data.
    Git ends up using large amounts of RAM for large files.
    DON’T! Use extensions*
  7. Storage cost is the space required to store a set of versions

    Recreation cost is the time* required to access a version.
    Example: three versions of 100 MB, 101 MB and 102 MB, each stored in full.
    Storage cost = (100 + 101 + 102) = 303 MB
    Send entire version: Recreation cost = IO cost = (100 + 101 + 102) = 303 MB
  8. Storage cost is the space required to store a set of versions

    Recreation cost is the time* required to access a version.
    Example: three versions of 100 MB, 101 MB and 102 MB, each stored in full.
    Storage cost = (100 + 101 + 102) = 303 MB
    Send entire version: Recreation cost = IO cost = (100 + 101 + 102) = 303 MB
    A delta between versions is a file which allows constructing one version given the other.
    Directed delta (1 to 2): delete, add
    Undirected delta (between 1 and 2): delete, add in each direction
    Examples: Unix diff, xdelta, XOR, etc.
    A delta has its own storage cost and recreation cost, which, in general, are independent of each other.
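To make the idea of an undirected delta concrete, here is a minimal Python sketch of the XOR example named on the slide. The function names `xor_delta` and `apply_delta`, the sample data, and the restriction to equal-length versions are assumptions made for illustration; they are not taken from the talk or the paper.

```python
import zlib

def xor_delta(a: bytes, b: bytes) -> bytes:
    """Undirected delta between two equal-length versions (byte-wise XOR)."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-length versions")
    return bytes(x ^ y for x, y in zip(a, b))

def apply_delta(version: bytes, delta: bytes) -> bytes:
    """XOR is its own inverse, so one delta recreates either version from the other."""
    return bytes(x ^ y for x, y in zip(version, delta))

# Hypothetical data: two versions that differ only in the last row.
v1 = b"id,city\n1,Kohala\n2,Boston\n" * 1000
v2 = v1[:-7] + b"Austin\n"  # "Boston\n" and "Austin\n" have the same length

delta = xor_delta(v1, v2)
stored = zlib.compress(delta)  # the delta is mostly zero bytes, so it compresses well

print(len(v2), len(stored))
print(apply_delta(v1, delta) == v2, apply_delta(v2, delta) == v1)
```

In this sketch the storage cost of the delta is the size of `stored`, while recreating `v2` still requires reading `v1` and the delta, which is one way to see why the two costs need not move together.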
  9. Storage/Recreation Tradeoff with delta encoding

    Scenario 1: stored sizes 100 MB, 30 MB, 10 MB; Storage cost = (100+30+10) = 140 MB
    Recreation costs 100 MB, 130 MB, 140 MB; Total Access Cost = 370 MB
  10. Storage/Recreation Tradeoff with delta encoding

    Scenario 1: stored sizes 100 MB, 30 MB, 10 MB; Storage cost = (100+30+10) = 140 MB
    Recreation costs 100 MB, 130 MB, 140 MB; Total Access Cost = 370 MB
    Scenario 2: stored sizes 100 MB, 30 MB, 11 MB; Storage cost = (100+30+11) = 141 MB
    Recreation costs 100 MB, 130 MB, 111 MB; Total Access Cost = 341 MB
  11. Storage/Recreation Tradeoff with delta encoding

    Scenario 1: stored sizes 100 MB, 30 MB, 10 MB; Storage cost = (100+30+10) = 140 MB
    Recreation costs 100 MB, 130 MB, 140 MB; Total Access Cost = 370 MB
    Scenario 2: stored sizes 100 MB, 30 MB, 11 MB; Storage cost = (100+30+11) = 141 MB
    Recreation costs 100 MB, 130 MB, 111 MB; Total Access Cost = 341 MB
    Scenario 3: stored sizes 110 MB, 5 MB, 10 MB; Storage cost = (110+5+10) = 125 MB
    Recreation costs 115 MB, 110 MB, 120 MB; Total Access Cost = 345 MB
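The arithmetic behind the three scenarios can be reproduced with a short Python sketch. The tree shapes below are an assumed reading of the slide figures (which version each delta hangs off is not stated in the transcript), and `ROOT`, `costs`, and `scenarios` are hypothetical names. As in the talk's examples, each delta's recreation cost is taken to equal its storage cost.

```python
ROOT = 0  # hypothetical "null" version: an edge from ROOT means "stored in full"

def costs(plan):
    """plan maps version -> (parent, size_mb).

    Returns (total storage, {version: recreation cost}), where the recreation
    cost of a version is the sum of sizes along its path from ROOT.
    """
    storage = sum(size for _, size in plan.values())

    def recreate(v):
        if v == ROOT:
            return 0
        parent, size = plan[v]
        return recreate(parent) + size

    return storage, {v: recreate(v) for v in sorted(plan)}

scenarios = {
    1: {1: (ROOT, 100), 2: (1, 30), 3: (2, 10)},  # chain: V1 -> V2 -> V3
    2: {1: (ROOT, 100), 2: (1, 30), 3: (1, 11)},  # both deltas against V1
    3: {1: (2, 5), 2: (ROOT, 110), 3: (2, 10)},   # V2 stored in full
}

for name, plan in scenarios.items():
    storage, recreation = costs(plan)
    print(f"Scenario {name}: storage={storage} MB, "
          f"recreation={recreation}, total={sum(recreation.values())} MB")
```

Under this reading, Scenario 2 recreates the third version at 100 + 11 = 111 MB, which is what the 341 MB total implies, and Scenario 3 shows that a smaller storage footprint does not have to mean the worst total access cost.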
  12. GIVEN 1) set of versions 2) partial information about deltas between versions

    WHERE
    • deltas can be directed/undirected
    • deltas can have different storage and recreation costs
    FIND A STORAGE SOLUTION THAT
    • minimizes total recreation cost within a storage budget, OR
    • minimizes maximum recreation cost within a storage budget
  13. GIVEN 1) set of versions 2) partial information about deltas between versions

    WHERE
    • deltas can be directed/undirected
    • deltas can have different storage and recreation costs
    FIND A STORAGE SOLUTION THAT
    • minimizes total recreation cost within a storage budget, OR
    • minimizes maximum recreation cost within a storage budget
    See paper for all problem variations and complexity results
  14. GIVEN 1) set of versions 2) partial information about deltas between versions

    WHERE
    • deltas can be directed/undirected
    • deltas have identical storage and recreation costs (replacing “different” from the general problem)
    FIND A STORAGE SOLUTION THAT
    • minimizes total recreation cost within a storage budget, OR
    • minimizes maximum recreation cost within a storage budget
    In this talk
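The two objectives on this slide can be written compactly as follows. The notation is assumed for illustration and may differ from the paper's: $T$ is a storage solution viewed as a spanning tree rooted at a null version $V_0$, $\Delta_e$ and $\Phi_e$ are the storage and recreation costs of a stored edge $e$ (a delta, or a full version when $e$ leaves $V_0$), and $\beta$ is the storage budget.

```latex
% Assumed notation, not taken from the slides:
%   T        storage solution (spanning tree rooted at the null version V_0)
%   \Delta_e storage cost of edge e,  \Phi_e recreation cost of edge e
%   \beta    storage budget,          n number of versions
\begin{gather*}
C(T) = \sum_{e \in T} \Delta_e, \qquad
R_i(T) = \sum_{e \,\in\, \mathrm{path}_T(V_0 \to V_i)} \Phi_e \\[4pt]
\min_{T} \; \sum_{i=1}^{n} R_i(T) \quad \text{s.t.} \quad C(T) \le \beta \\[4pt]
\min_{T} \; \max_{1 \le i \le n} R_i(T) \quad \text{s.t.} \quad C(T) \le \beta
\end{gather*}
```

In the setting with identical storage and recreation costs, $\Delta_e = \Phi_e$ for every edge.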
  15. Baselines

    Minimum Cost Arborescence (MCA): Edmonds’ algorithm, time complexity = O(E + V log V)
    Minimize Storage Cost; Recreation Cost: No constraint
    [Figure: example version graph rooted at a “Null” version, edge weights 20, 25, 26, 28, 7, 9, 2, 3, with the resulting MCA]
  16. Baselines

    Minimum Cost Arborescence (MCA): Edmonds’ algorithm, time complexity = O(E + V log V)
    Minimize Storage Cost; Recreation Cost: No constraint
    Shortest Path Tree (SPT): Dijkstra’s algorithm, time complexity = O(E log V)
    Minimize Recreation Cost; Storage Cost: No constraint
    [Figure: example version graph rooted at a “Null” version, edge weights 20, 25, 26, 28, 7, 9, 2, 3, with the resulting MCA and SPT]
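The SPT baseline is plain Dijkstra from the null version, which a short Python sketch can show. The graph below is hypothetical (its topology is not recoverable from the transcript), edge weights stand for recreation costs, and `shortest_path_tree` is an illustrative name. The MCA baseline would need Edmonds' algorithm and is not shown here.

```python
import heapq

def shortest_path_tree(edges, root=0):
    """Dijkstra with a binary heap.

    edges maps (u, v) -> recreation cost of building v from u; node `root`
    is the "null" version, so (root, v) means v is stored in full.
    Returns (parent, dist): the chosen incoming edge and the minimum
    recreation cost for every reachable version.
    """
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, []).append((v, w))

    dist = {root: 0}
    parent = {}
    heap = [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return parent, dist

# Hypothetical graph: three versions plus the null version 0.
edges = {
    (0, 1): 20, (0, 2): 25, (0, 3): 28,  # full versions
    (1, 2): 3, (1, 3): 9, (2, 3): 7,     # deltas
}
parent, dist = shortest_path_tree(edges)
print(parent, dist)
```

In this example version 2 is cheapest to recreate through version 1, while version 3 is cheapest when stored in full.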
  17. Local Move Greedy (LMG) heuristic to minimize total recreation cost

    1) Start with MCA
    2) Iterate until storage budget reached:
       1) For each new delta, compute the ratio (reduction in sum of recreation costs) / (increase in storage cost)
       2) Choose the delta with the highest value
  18. Local Move Greedy (LMG) heuristic to minimize total recreation cost

    1) Start with MCA
    2) Iterate until storage budget reached:
       1) For each new delta, compute the ratio (reduction in sum of recreation costs) / (increase in storage cost)
       2) Choose the delta with the highest value
    [Figure: Iteration (i), version tree over nodes 0 to 6 with edge weights, and candidate deltas of weight 10, 25, 15]
  19. Local Move Greedy (LMG) heuristic to minimize total recreation cost

    1) Start with MCA
    2) Iterate until storage budget reached:
       1) For each new delta, compute the ratio (reduction in sum of recreation costs) / (increase in storage cost)
       2) Choose the delta with the highest value
    [Figure: Iteration (i), version tree over nodes 0 to 6 with edge weights, and candidate deltas of weight 10, 25, 15 with ratios (5/5), (15/5), (10/10)]
  20. Local Move Greedy (LMG) heuristic to minimize total recreation cost

    1) Start with MCA
    2) Iterate until storage budget reached:
       1) For each new delta, compute the ratio (reduction in sum of recreation costs) / (increase in storage cost)
       2) Choose the delta with the highest value
    [Figure: Iteration (i), version tree over nodes 0 to 6 with edge weights, and candidate deltas of weight 10, 25, 15 with ratios (5/5), (15/5), (10/10); Iteration (i+1), the tree after applying the chosen delta]
    Time Complexity: O(V²)
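The greedy loop above can be sketched in Python under simplifying assumptions. Here the only candidate "new deltas" are edges from a hypothetical null version `ROOT` (that is, storing a version in full), each edge's recreation cost equals its storage cost, and recreation costs are recomputed from scratch on every trial, so this does not reflect the complexity quoted on the slide. The names `lmg`, `recreation_costs`, and `storage_cost` are illustrative, and the full heuristic in the paper may consider other candidate edges.

```python
ROOT = 0  # hypothetical "null" version; edge (ROOT, v) stores v in full

def recreation_costs(parent, cost):
    """Recreation cost of v = sum of edge costs on the path ROOT -> v."""
    memo = {ROOT: 0}

    def walk(v):
        if v not in memo:
            memo[v] = walk(parent[v]) + cost[(parent[v], v)]
        return memo[v]

    return {v: walk(v) for v in parent}

def storage_cost(parent, cost):
    """Total size of everything stored: one incoming edge per version."""
    return sum(cost[(p, v)] for v, p in parent.items())

def lmg(mca_parent, cost, budget):
    """Greedy local moves starting from a minimum-storage tree (assumed given)."""
    parent = dict(mca_parent)
    while True:
        total = sum(recreation_costs(parent, cost).values())
        used = storage_cost(parent, cost)
        best_ratio, best_v = 0.0, None
        for v, p in parent.items():
            if p == ROOT:
                continue  # already stored in full
            extra = cost[(ROOT, v)] - cost[(p, v)]  # increase in storage cost
            if used + extra > budget:
                continue  # this move would exceed the storage budget
            trial = {**parent, v: ROOT}
            gain = total - sum(recreation_costs(trial, cost).values())
            ratio = gain / extra if extra > 0 else float("inf")
            if gain > 0 and ratio > best_ratio:
                best_ratio, best_v = ratio, v
        if best_v is None:
            return parent  # no affordable move improves recreation cost
        parent[best_v] = ROOT

# Hypothetical instance: a chain V1 -> V2 -> V3 as the starting tree.
cost = {(0, 1): 100, (0, 2): 102, (0, 3): 101, (1, 2): 30, (2, 3): 10}
mca = {1: 0, 2: 1, 3: 2}
plan = lmg(mca, cost, budget=215)
print(plan, storage_cost(plan, cost), sum(recreation_costs(plan, cost).values()))
```

With a budget of 215 MB, this sketch stores V2 in full (a ratio of 56/72) and then stops, because storing V3 in full as well would exceed the budget.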
  21. Other heuristics

    1) Local Move Greedy (LMG)
    2) Modified Prim’s (MP): Incrementally build a tree by adapting Prim’s algorithm
    3) Light Approximate Shortest path Tree (LAST*): Balance minimum spanning tree and shortest path tree
    4) Git Heuristic (GitH): Our understanding of the heuristic used by git repack
    *LAST: Khuller et al. Balancing minimum spanning trees and shortest-path trees. Algorithmica, 1995.
  22. Evaluation

    Dataset: Type = CSV files, #Versions = 100010, #Deltas = 18086876, Average version size = 347.65 MB
    Baselines: MCA Recreation Cost = 11.5 PB, SPT Storage Cost = 34 TB
    [Figure: Sum of Recreation Costs (TB) vs. Storage Cost (TB) for LMG, MP, LAST and GitH, with MCA Storage Cost and SPT Recreation Cost marked]
  23. Evaluation

    Dataset: Type = CSV files, #Versions = 100010, #Deltas = 18086876, Average version size = 347.65 MB
    Baselines: MCA Recreation Cost = 11.5 PB, SPT Storage Cost = 34 TB
    [Figure: Sum of Recreation Costs (TB) vs. Storage Cost (TB) for LMG, MP, LAST and GitH, with MCA Storage Cost and SPT Recreation Cost marked]
    A storage budget of 1.1X the MCA storage cost reduces total recreation cost by 1000X
  24. More in paper

    1) Detailed discussion of problem variants
    2) Complexity results
    3) Additional heuristics
    4) Workload aware optimization
    5) Comparisons with optimal solution (when feasible)