hell
1) Many private copies of the datasets lead to massive redundancy in storage
2) No easy way to keep track of dependencies between input and derived datasets and versions
3) No mechanisms to support and record manual conflict resolution
4) No way to analyze/compare/query versions [USENIX TaPP’15]
This talk!
range of V is Version
range of E is V.Relations(name="Employee").Tuples
retrieve V.id, count(E.id where E.age > 50) as n_senior_emp
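To make the version query above concrete, here is a rough Python equivalent over a toy in-memory model. The `versions` layout and the name `senior_counts` are assumptions made for illustration; they are not DataHub's API, and the reading of the query (one count per version) is an interpretation.

```python
# Toy model (assumed): version id -> relation name -> list of tuples (dicts).
versions = {
    "v1": {"Employee": [{"id": 1, "age": 52}, {"id": 2, "age": 34}]},
    "v2": {"Employee": [{"id": 1, "age": 53}, {"id": 2, "age": 35},
                        {"id": 3, "age": 61}]},
}


def senior_counts(versions, min_age=50):
    """For each version, count the Employee tuples with age > min_age."""
    result = {}
    for version_id, relations in versions.items():
        employees = relations.get("Employee", [])
        result[version_id] = sum(1 for e in employees if e["age"] > min_age)
    return result


print(senior_counts(versions))  # -> {'v1': 1, 'v2': 2}
```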
system – import, search, query, analyze a large number of (public) datasets
A dataset version control system – branch, update, merge, transform large structured or unstructured datasets
An app ecosystem and hooks for external applications (Matlab, R, iPython etc.)
DATAHUB Architecture
See: DataHub, CIDR’15; DataHub Demo, VLDB’15
they typically use fairly simple algorithms and are optimized to work for code-like data
100 versions, LF Dataset (Real World)
#Versions = 100
Avg. version size = 423 MB
gzip = 10.2 GB
svn = 8.5 GB
git = 202 MB
*this = 159 MB
Git ends up using large amounts of RAM for large files
DON’T! Use extensions*
Can we use temporal databases?
No, because they are restricted to managing a linear chain of versions
What about “deduplication” strategies in storage systems?
Chunking files into blocks and storing only the unique ones works well under the assumption of localized changes
They focus on minimizing storage costs in an online setting, and typically ignore recreation costs
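A minimal sketch of block-level deduplication, to show why it depends on localized changes. Fixed-size blocks, the tiny `block_size`, and the names `dedup_store` and `rebuild` are assumptions for illustration; real storage systems differ in how they choose block boundaries.

```python
import hashlib


def dedup_store(files, block_size=4):
    """Split each file into fixed-size blocks and keep one copy per unique block."""
    store = {}    # block hash -> block bytes (each unique block stored once)
    recipes = {}  # file name -> ordered list of block hashes
    for name, data in files.items():
        hashes = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            digest = hashlib.sha256(block).hexdigest()
            store.setdefault(digest, block)
            hashes.append(digest)
        recipes[name] = hashes
    return store, recipes


def rebuild(name, store, recipes):
    """Recreate a file by concatenating its blocks in order."""
    return b"".join(store[h] for h in recipes[name])


files = {
    "v1": b"aaaabbbbccccdddd",
    "v2": b"aaaabbbbXXXXdddd",   # localized change: only one block differs
    "v3": b"Zaaaabbbbccccdddd",  # one inserted byte shifts every block boundary
}
store, recipes = dedup_store(files)
print(len(store))  # -> 10 (v1 adds 4 blocks, v2 adds 1, v3 adds 5)
```

In this toy run the localized edit in `v2` costs one extra block, while the single inserted byte in `v3` shares nothing with `v1`.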
of versions
Recreation cost is the time* required to access a version
[Figure: versions of 100 MB, 101 MB and 102 MB; (100 + 101 + 102) = 303 MB]
Send entire version: recreation cost = IO cost, (100 + 101 + 102) = 303 MB
A delta between versions is a file which allows constructing one version given the other
[Figure: directed delta from version 1 to version 2 (delete, add); undirected delta between versions 1 and 2 (delete, add in both directions)]
Example: Unix diff, xdelta, XOR, etc.
A delta has its own storage cost and recreation cost, which, in general, are independent of each other
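The trade-off between the two costs can be sketched with a small calculation. The 100/101/102 MB sizes come from the slide; the delta sizes (2 MB and 3 MB), the function name, and the choice to set each recreation cost equal to the stored size are assumptions for illustration.

```python
def storage_and_recreation(parent, storage_cost, recreation_cost):
    """parent[v] is None if v is stored in full, else the version its delta is against.

    storage_cost[v] and recreation_cost[v] describe the object stored for v
    (a full copy or a delta). Recreating v applies every object on its chain.
    """
    total_storage = sum(storage_cost.values())

    def recreate(v):
        total = 0
        while v is not None:
            total += recreation_cost[v]
            v = parent[v]
        return total

    return total_storage, {v: recreate(v) for v in parent}


# Option 1: every version stored in full.
sizes = {"V1": 100, "V2": 101, "V3": 102}
all_full = {"V1": None, "V2": None, "V3": None}
full_storage, full_recreation = storage_and_recreation(all_full, sizes, sizes)
# full_storage == 303, full_recreation == {"V1": 100, "V2": 101, "V3": 102}

# Option 2: V1 in full, then a chain of deltas (hypothetical delta sizes).
chain = {"V1": None, "V2": "V1", "V3": "V2"}
chain_cost = {"V1": 100, "V2": 2, "V3": 3}
chain_storage, chain_recreation = storage_and_recreation(chain, chain_cost, chain_cost)
# chain_storage == 105, chain_recreation == {"V1": 100, "V2": 102, "V3": 105}
```

Under these assumed numbers the delta chain cuts storage from 303 MB to 105 MB, while the recreation cost of the last version grows from 102 to 105.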
between versions
3) deltas can be directed/undirected
4) deltas can have different storage and recreation costs
FIND A STORAGE SOLUTION THAT
• minimizes total recreation cost within a storage budget
OR
• minimizes maximum recreation cost within a storage budget
See paper for all problem variations and complexity results
In this talk: 4) deltas have identical storage and recreation costs
e(0,4) from SPT is chosen to replace e(1,4): ρ value of e(0,4) was maximum (among other candidates)
Time Complexity: O(V²)
[Figure: storage graph at Iteration (i) and Iteration (i+1)]
1) Start with MCA and a set of candidate deltas (ex. all SPT edges not in MCA)
2) Iterate until storage budget reached:
   1) For each delta, compute ρ = (reduction in sum of recreation costs) / (increase in storage cost)
   2) Choose the delta with the highest ρ value
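The greedy steps above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the minimum-cost arborescence (MCA) and shortest-path tree (SPT) are passed in as precomputed parent maps, storage and recreation costs of an edge are assumed identical, node 0 is an assumed dummy root whose outgoing edges stand for full copies, and the function name `lmg` and all numbers are made up. It does not attempt to reach the O(V²) bound.

```python
def lmg(weights, mca_parent, spt_parent, budget, root=0):
    """weights[(u, v)]: cost of storing v as a delta from u (u == root: full copy)."""
    parent = dict(mca_parent)

    def recreation(v):
        # Sum of edge costs on the path from the root down to v.
        total = 0
        while v != root:
            total += weights[(parent[v], v)]
            v = parent[v]
        return total

    def subtree(v):
        # v plus every version currently recreated through v.
        nodes, stack = {v}, [v]
        while stack:
            x = stack.pop()
            for child, p in parent.items():
                if p == x and child not in nodes:
                    nodes.add(child)
                    stack.append(child)
        return nodes

    storage = sum(weights[(p, v)] for v, p in parent.items())
    while True:
        best = None
        for v, u in spt_parent.items():  # candidate: SPT edge (u, v)
            if parent[v] == u:
                continue
            below = subtree(v)
            if u in below:
                continue  # swapping would create a cycle
            extra = weights[(u, v)] - weights[(parent[v], v)]
            new_cost = recreation(u) + weights[(u, v)]
            gain = (recreation(v) - new_cost) * len(below)
            if gain <= 0 or storage + extra > budget:
                continue
            rho = gain / extra if extra > 0 else float("inf")
            if best is None or rho > best[0]:
                best = (rho, v, u, extra)
        if best is None:
            return parent, storage
        _, v, u, extra = best
        parent[v] = u
        storage += extra


weights = {(0, 1): 100, (0, 2): 101, (0, 3): 102, (1, 2): 10, (2, 3): 10}
mca = {1: 0, 2: 1, 3: 2}   # storage 120, recreation costs 100, 110, 120
spt = {1: 0, 2: 0, 3: 0}   # every version stored in full
tree, used = lmg(weights, mca, spt, budget=225)
# tree == {1: 0, 2: 0, 3: 2}, used == 211
```

In this toy run the edge (0, 2) wins the first round because it lowers the recreation cost of both versions 2 and 3; the second candidate no longer fits within the budget.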
is chosen to replace e(1,3): recreation cost of 3 is reduced from 25 to 20
Time Complexity: O(E log V)
[Figure: storage graph at Iteration (i) and Iteration (i+1)]
1) Start with V0; initially all other recreation costs = ∞
2) Grow tree one edge at a time
   • Like Prim’s, choose the smallest weight edge
   • Unlike Prim’s, OK to re-visit a version if its new recreation cost < old recreation cost (AND new recreation cost < θ)
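A minimal sketch of the steps above, under stated assumptions: node 0 plays the role of V0 and its outgoing edges stand for full copies, storage and recreation costs of an edge are identical and positive, and recreation costs are recomputed by walking the parent chain rather than cached (so the sketch does not meet the O(E log V) bound). The extra check that a re-visit must not increase the storage cost is an assumption added here so the sketch does not trade storage for recreation cost; the function name and numbers are made up.

```python
import heapq


def modified_prim(weights, theta, root=0):
    """weights[(u, v)]: cost of storing v as a delta from u (u == root: full copy)."""
    adj = {}
    for (u, v), w in weights.items():
        adj.setdefault(u, []).append((v, w))
    parent = {}

    def recreation(v):
        # Sum of edge costs on the path from the root down to v.
        total = 0
        while v != root:
            total += weights[(parent[v], v)]
            v = parent[v]
        return total

    heap = [(w, root, v) for v, w in adj.get(root, [])]
    heapq.heapify(heap)
    while heap:
        w, u, v = heapq.heappop(heap)  # like Prim's: smallest-weight edge first
        if v == root:
            continue
        new_cost = recreation(u) + w
        if not new_cost < theta:
            continue  # would violate the recreation threshold
        if v in parent:
            # Unlike Prim's: re-visit v only if its recreation cost drops
            # (assumed extra guard: the stored edge must not get heavier).
            if new_cost >= recreation(v) or w > weights[(parent[v], v)]:
                continue
        parent[v] = u
        for x, wx in adj.get(v, []):
            heapq.heappush(heap, (wx, v, x))
    return parent


weights = {(0, 1): 100, (0, 2): 110, (0, 3): 120,
           (1, 2): 40, (1, 3): 30, (2, 3): 5}
tree = modified_prim(weights, theta=135)
# tree == {1: 0, 2: 0, 3: 2}
```

In this toy run version 3 is first attached to version 1 (recreation cost 130), then re-visited through the lighter edge from version 2, which lowers its recreation cost to 115.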
has the least Δ′ value (d = 5), although e(5,6) is the smallest, without the depth bias
Time Complexity: O(V log V + wV)
[Figure: sliding window (w = 4) over the ordered versions at Iteration (i) and Iteration (i+1)]
1) Order versions by size
2) Maintain a sliding window of size w; for version Vi and for each Vl ∈ window, compute: Δ′ = Δ / (d − d_l), where Δ is the delta between Vl and Vi and d_l is the depth of Vl
3) Choose Vl with the least Δ′ to be the parent of Vi
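The sliding-window steps above can be sketched as below. Several points are assumptions rather than facts from the slide: the depth bias is taken to be Δ′ = Δ / (d − depth of the candidate) with d the maximum allowed depth, versions are ordered largest first, a candidate already at maximum depth is skipped, and the function name, the `delta` callback, and the numbers are made up.

```python
def window_heuristic(sizes, delta, w=4, max_depth=5):
    """sizes[v]: size of version v; delta(u, v): size of the delta building v from u."""
    order = sorted(sizes, key=sizes.get, reverse=True)  # assumed: largest first
    parent, depth = {}, {}
    window = []
    for v in order:
        best = None
        for u in window:
            if depth[u] >= max_depth:
                continue  # chain through u is already at the maximum depth
            biased = delta(u, v) / (max_depth - depth[u])  # assumed depth bias
            if best is None or biased < best[0]:
                best = (biased, u)
        if best is None:
            parent[v], depth[v] = None, 0  # no candidate: store in full
        else:
            parent[v], depth[v] = best[1], depth[best[1]] + 1
        window = (window + [v])[-w:]  # keep only the w most recent versions
    return parent, depth


sizes = {"A": 100, "B": 90, "C": 80}
table = {("A", "B"): 20, ("A", "C"): 12, ("B", "C"): 10}
parent, depth = window_heuristic(sizes, lambda u, v: table[(u, v)], w=4, max_depth=2)
# parent == {"A": None, "B": "A", "C": "A"}
```

Here C picks A as its parent (12 / 2 = 6) over B (10 / 1 = 10), even though the raw delta from B is smaller, which mirrors the effect of the depth bias described on the slide.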