VLDB, 2015: Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff

Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff
Souvik Bhattacherjee, Amit Chavan, Silu Huang, Amol Deshpande, Aditya Parameswaran

A typical data analysis workflow
• CSV from data.gov
• EDIT: Correct “addresses”
• EDIT: Append column
• NEW: Add file
• EDIT: Project columns
• EDIT: Partition rows
…is akin to people collaborating on source code. However… 1000s of versions

Collaborative data science projects end up in dataset version management hell
1) Many private copies of the datasets lead to massive redundancy in storage
2) No easy way to keep track of dependencies between input and derived datasets and versions
3) No mechanisms to support and record manual conflict resolution
4) No way to analyze/compare/query versions [USENIX TaPP’15]
… This talk!

DATAHUB: A collaborative hosted data science platform
• A dataset management system – import, search, query, analyze a large number of (public) datasets
• A dataset version control system – branch, update, merge, transform large structured or unstructured datasets
• An app ecosystem and hooks for external applications (Matlab, R, iPython, etc.)
[Figure: DATAHUB architecture]
See: DataHub, CIDR’15; DataHub Demo, VLDB’15 (Thu, Sep 3, 10:30am-12:00pm)

How can we store thousands of versions of datasets compactly?
…And still be able to access any version, on-demand, efficiently?

Can we use Version Control Systems like Git/SVN/…?
No, because they typically use fairly simple algorithms and are optimized to work for code-like data.
LF Dataset (Real World): #Versions = 100, Avg. version size = 423 MB
• gzip = 10.2 GB
• svn = 8.5 GB
• git = 202 MB
• *this = 159 MB

Git ends up using large amounts of RAM for large files.
DON’T! Use extensions*

Storage cost is the space required to store a set of versions.
Recreation cost is the time* required to access a version.
*Recreation cost = IO cost
Example: three versions of 100 MB, 101 MB, and 102 MB, each stored in full.
• Storage cost = (100 + 101 + 102) = 303 MB
• Recreation: send entire version, so total recreation cost = (100 + 101 + 102) = 303 MB

A delta between versions is a file which allows constructing one version given the other.
• Directed delta (1 → 2): delete, add
• Undirected delta (1 ↔ 2): delete, add in both directions
Example: Unix diff, xdelta, XOR, etc.
A delta has its own storage cost and recreation cost, which, in general, are independent of each other.
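The XOR example above can be sketched in a few lines. This is only an illustration of an undirected delta, assuming both versions are byte strings of equal length; the function name `xor_delta` and the sample data are hypothetical.

```python
def xor_delta(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings (assumption: lengths match)."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-length versions")
    return bytes(x ^ y for x, y in zip(a, b))


# Two hypothetical versions of a tiny CSV file.
v1 = b"id,city\n1,Paris\n"
v2 = b"id,city\n1,Tokyo\n"

# One delta serves both directions, which is what makes it undirected.
delta = xor_delta(v1, v2)
v2_from_v1 = xor_delta(v1, delta)
v1_from_v2 = xor_delta(v2, delta)
```

A directed delta such as a Unix diff would only support one of the two reconstructions without further work.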

Storage/Recreation Tradeoff with delta encoding

Scenario 1
• Stored: 100 MB, 30 MB, 10 MB; storage cost = (100+30+10) = 140 MB
• Recreation costs: 100 MB, 130 MB, 140 MB; total access cost = 370 MB

Scenario 2
• Stored: 100 MB, 30 MB, 11 MB; storage cost = (100+30+11) = 141 MB
• Recreation costs: 100 MB, 130 MB, 111 MB; total access cost = 341 MB

Scenario 3
• Stored: 110 MB, 5 MB, 10 MB; storage cost = (110+5+10) = 125 MB
• Recreation costs: 115 MB, 110 MB, 120 MB; total access cost = 345 MB
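The arithmetic behind the three scenarios can be checked with a small sketch. It assumes, as in this talk, that the recreation cost of a delta equals its storage cost. The `plan` layout (version mapped to its parent and stored size) and the version labels `v1`, `v2`, `v3` are hypothetical; in particular, which version is materialized in Scenario 3 is an assumption.

```python
def storage_cost(plan):
    """Sum of everything kept on disk: full versions plus deltas."""
    return sum(size for _, size in plan.values())


def recreation_cost(plan, version):
    """Walk back to a fully stored version, adding up what must be read."""
    total = 0
    while version is not None:
        parent, size = plan[version]
        total += size
        version = parent
    return total


def total_access_cost(plan):
    return sum(recreation_cost(plan, v) for v in plan)


# plan[version] = (parent, stored size in MB); parent None means stored in full.
scenario_1 = {"v1": (None, 100), "v2": ("v1", 30), "v3": ("v2", 10)}
scenario_2 = {"v1": (None, 100), "v2": ("v1", 30), "v3": ("v1", 11)}
scenario_3 = {"v2": (None, 110), "v1": ("v2", 5), "v3": ("v2", 10)}

summary = {
    name: (storage_cost(plan), total_access_cost(plan))
    for name, plan in [("1", scenario_1), ("2", scenario_2), ("3", scenario_3)]
}
```

Scenario 2 spends 1 MB more storage than Scenario 1 and saves 29 MB of total access cost, because the third version no longer sits at the end of a chain.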

GIVEN
1) set of versions
2) partial information about deltas between versions
WHERE
• deltas can be directed/undirected
• deltas can have different storage and recreation costs
FIND A STORAGE SOLUTION THAT
• minimizes total recreation cost within a storage budget, OR
• minimizes maximum recreation cost within a storage budget
See paper for all problem variations and complexity results.
In this talk: deltas have identical storage and recreation costs.
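One way to write the two objectives down, as a sketch only: the notation is assumed for illustration and is not taken from the slides. Here T is a storage tree rooted at a null version v_0, Δ is the storage cost of a delta, Φ is its recreation cost, and β is the storage budget.

```latex
\begin{align*}
C(T) &= \sum_{(u,v) \in T} \Delta_{u,v} \\
R_T(v) &= \sum_{(a,b) \in \mathrm{path}_T(v_0, v)} \Phi_{a,b} \\
\text{total:}\quad &\min_T \sum_{v \in V} R_T(v) \quad \text{subject to } C(T) \le \beta \\
\text{maximum:}\quad &\min_T \max_{v \in V} R_T(v) \quad \text{subject to } C(T) \le \beta
\end{align*}
```

An edge from v_0 to a version stands for storing that version in full, so its Δ is the full version size.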

Baselines
• Minimum Cost Arborescence (MCA): Edmonds’ algorithm; time complexity = O(E + V log V). Minimize storage cost; recreation cost: no constraint.
• Shortest Path Tree (SPT): Dijkstra’s algorithm; time complexity = O(E log V). Minimize recreation cost; storage cost: no constraint.
[Figure: version graph rooted at a “Null” version, with edge weights 20, 25, 26, 28, 7, 9, 2, 3; the edges chosen by MCA and by SPT are highlighted]
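The two baselines can be contrasted on a toy graph. This is a minimal sketch under a simplifying assumption: all deltas are undirected with identical storage and recreation cost, so Prim's minimum spanning tree stands in for Edmonds' algorithm, which the directed case would need. The graph, the names `grow_tree` and `undirected`, and all sizes are hypothetical.

```python
import heapq
from itertools import count


def undirected(edges):
    """Build an adjacency map from {(u, v): cost} with symmetric costs."""
    graph = {}
    for (u, v), cost in edges.items():
        graph.setdefault(u, {})[v] = cost
        graph.setdefault(v, {})[u] = cost
    return graph


def grow_tree(graph, root, shortest_path):
    """Prim (shortest_path=False) or Dijkstra (shortest_path=True).

    Returns (plan, dist): plan[v] = (parent, cost), dist[v] = recreation cost.
    """
    tie = count()  # tie-breaker so the heap never compares node labels
    plan, dist = {}, {}
    heap = [(0, next(tie), root, None, 0)]
    while heap:
        _, _, node, parent, cost = heapq.heappop(heap)
        if node in dist:
            continue
        dist[node] = 0 if parent is None else dist[parent] + cost
        if parent is not None:
            plan[node] = (parent, cost)
        for nbr, w in graph[node].items():
            if nbr not in dist:
                # Prim ranks by edge cost, Dijkstra by distance from the root.
                key = dist[node] + w if shortest_path else w
                heapq.heappush(heap, (key, next(tie), nbr, node, w))
    return plan, dist


# Edges from "null" carry full version sizes; the rest are delta sizes (MB).
graph = undirected({
    ("null", "v1"): 100, ("null", "v2"): 101, ("null", "v3"): 102,
    ("v1", "v2"): 30, ("v2", "v3"): 10, ("v1", "v3"): 11,
})

mst_plan, mst_dist = grow_tree(graph, "null", shortest_path=False)
spt_plan, spt_dist = grow_tree(graph, "null", shortest_path=True)

mst_storage = sum(cost for _, cost in mst_plan.values())
mst_recreation = sum(mst_dist.values())
spt_storage = sum(cost for _, cost in spt_plan.values())
spt_recreation = sum(spt_dist.values())
```

On this toy graph the minimum-storage tree keeps one full version and two deltas, while the shortest path tree stores every version in full.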

Local Move Greedy (LMG) heuristic to minimize total recreation cost
1) Start with MCA
2) Iterate until storage budget reached:
   1) For each new delta, compute the ratio (reduction in sum of recreation costs) / (increase in storage cost)
   2) Choose the delta with the highest value
Time complexity: O(V^2)
[Figure: iterations (i) and (i+1) on an example graph with nodes 0 to 6; candidate deltas are annotated with the ratios (5/5), (15/5), (10/10)]
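The steps above can be sketched as follows. This is a simplified reading of the heuristic, not the authors' implementation: it assumes directed deltas with identical storage and recreation cost, treats every available delta as a candidate, and recomputes all costs from scratch in each iteration. The names `chain`, `recreation`, `local_move_greedy` and the toy data are hypothetical.

```python
def chain(plan, version):
    """The version followed by its ancestors, excluding the null root."""
    path = []
    while version in plan:
        path.append(version)
        version = plan[version][0]
    return path


def recreation(plan, version):
    return sum(plan[v][1] for v in chain(plan, version))


def local_move_greedy(plan, deltas, budget):
    """plan[v] = (parent, cost); deltas[(u, v)] = cost of building v from u."""
    plan = dict(plan)
    while True:
        storage = sum(cost for _, cost in plan.values())
        rec = {v: recreation(plan, v) for v in plan}
        # Number of versions whose recreation path passes through v.
        subtree = {v: sum(v in chain(plan, x) for x in plan) for v in plan}
        best = None
        for (u, v), cost in deltas.items():
            # Skip the current edge and any move that would create a cycle.
            if v not in plan or plan[v][0] == u or v in chain(plan, u):
                continue
            gain = (rec[v] - (rec.get(u, 0) + cost)) * subtree[v]
            extra = cost - plan[v][1]
            if gain <= 0 or storage + extra > budget:
                continue
            ratio = gain / extra if extra > 0 else float("inf")
            if best is None or ratio > best[0]:
                best = (ratio, u, v, cost)
        if best is None:
            return plan
        _, u, v, cost = best
        plan[v] = (u, cost)  # re-parent v onto the chosen delta


deltas = {
    ("null", "v1"): 100, ("null", "v2"): 101, ("null", "v3"): 102,
    ("v1", "v2"): 30, ("v2", "v3"): 10, ("v1", "v3"): 11,
}
# Minimum-storage starting point for this toy graph (storage 140 MB).
mca = {"v1": ("null", 100), "v2": ("v1", 30), "v3": ("v2", 10)}

tight = local_move_greedy(mca, deltas, budget=141)
loose = local_move_greedy(mca, deltas, budget=303)
tight_total = sum(recreation(tight, v) for v in tight)
loose_total = sum(recreation(loose, v) for v in loose)
```

With a budget of 141 MB the sketch makes a single move and stops; with a budget of 303 MB it keeps moving until every version is stored in full.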

Other heuristics
1) Local Move Greedy (LMG)
2) Modified Prim’s (MP): Incrementally build a tree by adapting Prim’s algorithm
3) Light Approximate Shortest path Tree (LAST*): Balance minimum spanning tree and shortest path tree
4) Git Heuristic (GitH): Our understanding of the heuristic used by git repack
*LAST: Khuller et al. Balancing minimum spanning trees and shortest-path trees. Algorithmica, 1995.

Evaluation
Dataset: Type = CSV files; #Versions = 100010; #Deltas = 18086876; Average version size = 347.65 MB
MCA Recreation Cost = 11.5 PB; SPT Storage Cost = 34 TB
[Figure: Sum of Recreation Costs (TB) vs. Storage Cost (TB) for LMG, MP, LAST, and GitH, with MCA Storage Cost and SPT Recreation Cost marked]
Storage budget of 1.1X the MCA reduces total recreation cost by 1000X

More in paper
1) Detailed discussion of problem variants
2) Complexity results
3) Additional heuristics
4) Workload aware optimization
5) Comparisons with optimal solution (when feasible)

QUESTIONS?
