Data Version Control:
Tool for Iterative Machine Learning
Dmitry Petrov, PhD
Conference http://data-intelligence.ai
June 2017. McLean, VA
Slide 2
Open source:
1) Created
github.com/dataversioncontrol/dvc
2) Implemented Wavelet image hash (wHash) for
github.com/JohannesBuchner/imagehash
Blog: fullstackml.com
Dmitry Petrov
@FullStackML
Data Scientist.
Ex-Data Scientist @Microsoft.
PhD in CS. Ex-Researcher.
Slide 3
Agenda
1. Dependencies in data science projects.
2. DVC introduction.
3. Tutorial: NLP with DVC.
4. Beyond the Horizon.
5. Q&A.
Slide 4
I. Dependencies in machine learning projects
Slide 5
Problem 1: ML is slow
[Figure: job status RUNNING]
Slide 6
Problem 2: ML is slow and iterative
Possible solution: Makefile
Data preparation -> Feature extraction -> Model training -> Model evaluation
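The rule a Makefile applies to such a pipeline can be sketched in a few lines of Python. This is only an illustration of the timestamp idea, not code from make or DVC; the function name `is_stale` is hypothetical.

```python
import os


def is_stale(target, dependencies):
    """Make-style rule: rebuild if the target is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # Any dependency modified after the target means the target is out of date.
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)
```

With this rule, a slow step such as model training is skipped whenever its output is newer than all of its inputs.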
Slide 7
Problem 3: Makefile is not enough
Makefile is based on timestamps and keeps only the last version.
Possible solution: replace the timestamp-based model with versioning (Git hashes).
Data preparation -> Feature extraction -> Model training -> Model evaluation
Versions: 0267f11, 33e292f, 6e9aa50
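A hash-based check can be sketched as follows. Note the assumptions: the slide refers to Git hashes, while this sketch swaps in a SHA-1 of each file's content and keeps the recorded hashes in a JSON file. The names `file_hash`, `needs_rerun`, `record_run`, and the state file are hypothetical and are not DVC's implementation.

```python
import hashlib
import json


def file_hash(path):
    """SHA-1 of the file content, read in chunks so large files fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_rerun(dependencies, state_path):
    """Compare current dependency hashes with those recorded at the last run."""
    current = {dep: file_hash(dep) for dep in dependencies}
    try:
        with open(state_path) as f:
            recorded = json.load(f)
    except FileNotFoundError:
        recorded = None  # the step has never been run
    return current != recorded, current


def record_run(current, state_path):
    """Store the hashes the step was run against."""
    with open(state_path, "w") as f:
        json.dump(current, f)
```

Unlike a timestamp, a content hash does not change when a file is merely touched or checked out again, so a step is rerun only when its inputs actually differ.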
Slide 8
Problem 4: Data and code are not connected
Git is not good for large files.
      | Data scientist 1 | Services            | Data scientist 2
Code: | Local repository | Git server (GitHub) | Local repository
Data: | Local files      | Cloud (S3, GCP)     | Local files
Slide 9
II. DVC introduction
Slide 10
DVC basics
Home page: http://dataversioncontrol.com
Command: $ dvc init
DVC directories:
1) data/ - data files (actually symlinks to content)
2) .cache/ - data files content
3) .state/ - dependency graph (DAG)
4) * .git/ - standard Git directory
Slide 11
Running your code
Any data manipulation command has to be run through DVC.
Import data
Commands: $ dvc run / dvc import / dvc remove
Slide 12
Data and .cache directories
DVC separates data files from their content. Content is never stored in Git.
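The separation of a data file from its content can be illustrated with a small sketch: the content moves into a cache directory and a symlink stays in its place. This assumes a POSIX file system with symlink support. Naming the cached file by its SHA-1 is an assumption made for the example, and `move_to_cache` is a hypothetical helper, not a DVC function.

```python
import hashlib
import os
import shutil


def move_to_cache(data_path, cache_dir):
    """Move file content into cache_dir and leave a symlink at data_path."""
    with open(data_path, "rb") as f:
        digest = hashlib.sha1(f.read()).hexdigest()
    os.makedirs(cache_dir, exist_ok=True)
    cached = os.path.join(cache_dir, digest)  # assumed naming scheme
    if os.path.exists(cached):
        os.remove(data_path)  # identical content is already cached
    else:
        shutil.move(data_path, cached)
    # The lightweight symlink can be tracked by Git; the content cannot bloat the repo.
    os.symlink(os.path.abspath(cached), data_path)
    return cached
```

Only the small symlink would be committed; the cache directory itself would be excluded from Git and synchronized through cloud storage instead.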
Slide 13
Dependency tracking - .state directory
A dependency graph (DAG) is created automatically.
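What a dependency graph makes possible can be sketched as follows: given (input, output) edges recorded for each command, find every file affected by a change and an order in which to rebuild them. The edge list below is hypothetical (the file names are borrowed from the examples in this deck, and the edges from data/input.tsv are assumed), and `affected_in_order` is not a DVC function.

```python
from collections import defaultdict, deque


def affected_in_order(edges, changed):
    """Return files downstream of `changed`, in a valid rebuild order."""
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)

    # Breadth-first search: collect everything reachable from the changed file.
    affected = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in affected:
                affected.add(child)
                queue.append(child)

    # Kahn's algorithm restricted to the affected files.
    indegree = {node: 0 for node in affected}
    for src, dst in edges:
        if src in affected and dst in affected:
            indegree[dst] += 1
    ready = deque(sorted(n for n, d in indegree.items() if d == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in children[node]:
            if child in indegree:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
    return order


# Hypothetical edges: each pair is (input file, output file) of one command.
EDGES = [
    ("data/input.tsv", "data/matrix-train.p"),
    ("data/input.tsv", "data/matrix-test.p"),
    ("data/matrix-train.p", "data/model.p"),
    ("data/model.p", "data/prc.jpg"),
    ("data/matrix-test.p", "data/prc.jpg"),
]
```

For example, `affected_in_order(EDGES, "data/matrix-train.p")` lists the model before the plot that depends on it, and leaves the test matrix untouched.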
Slide 14
DVC data flows
Slide 15
III. Tutorial: NLP with DVC
Link to the tutorial
Slide 16
IV. Beyond the Horizon
Slide 17
Important!
DVC is a tool for modeling, not data engineering!
DVC workflow != Airflow workflow
DVC: optimizes model creation time, or data scientists' agility.
Airflow/Luigi: optimize execution time/resources and reliability.
Slide 18
Languages
DVC is written in Python, but the tool supports any language:
$ dvc run python code/train_model.py data/matrix-train.p 20170426 data/model.p
$ dvc run Rscript prc_plot.R data/model.p data/matrix-test.p data/prc.jpg
$ dvc run MyApp data/input.tsv data/output.tsv