Data Version Control: Tool for Iterative Machine Learning

Dmitry Petrov, PhD
Conference: http://data-intelligence.ai, June 2017, McLean, VA

Dmitry Petrov (@FullStackML)
Data Scientist. Ex-Data Scientist @Microsoft. PhD in CS. Ex-Researcher.
Blog: fullstackml.com
Open source:
1) Created github.com/dataversioncontrol/dvc
2) Implemented Wavelet image hash (wHash) for github.com/JohannesBuchner/imagehash

Agenda
1. Dependencies in data science projects.
2. DVC introduction.
3. Tutorial: NLP with DVC.
4. Beyond the Horizon.
5. Q&A.

I. Dependencies in machine learning projects

Problem 1: ML is slow
[Slide image: jobs shown as RUNNING]

Problem 2: ML is slow and iterative
Possible solution: Makefile
Pipeline stages: Data preparation -> Feature extraction -> Model training -> Model evaluation

Problem 3: Makefile is not enough
Makefile is based on timestamps and keeps only the last version.
Possible solution: replace the timestamp-based model with versioning (Git hashes).
Pipeline stages: Data preparation -> Feature extraction -> Model training -> Model evaluation
Versions: 0267f11, 33e292f, 6e9aa50
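The difference between the two staleness rules can be sketched in a few lines of Python. This is an illustration only, not DVC's or Make's code: the function names are hypothetical, and the hash is a plain SHA-1 of the file content rather than Git's own object hash.

```python
import hashlib
import os


def file_hash(path):
    """Return the SHA-1 hex digest of a file's content, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def stale_by_timestamp(source, target):
    """Make-style rule: rebuild if target is missing or older than source."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(source) > os.path.getmtime(target)


def stale_by_hash(source, recorded_hash):
    """Version-style rule: rebuild only if the content differs from the
    hash recorded when the target was last built."""
    return file_hash(source) != recorded_hash
```

Touching a file without changing it triggers a rebuild under the timestamp rule but not under the hash rule, and a recorded hash can also identify an older version, which a timestamp cannot.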

Problem 4: Data and code are not connected
Git is not good for large files.
        Data scientist 1    Services             Data scientist 2
Code:   Local repository    Git server (GitHub)  Local repository
Data:   Local files         Cloud (S3, GCP)      Local files

II. DVC introduction

DVC basics
Home page: http://dataversioncontrol.com
Command: $ dvc init
DVC directories:
1) data/ - data files (actually symlinks to content)
2) .cache/ - data file content
3) .state/ - dependency graph (DAG)
4) * .git/ - standard Git directory

Running your code
Any data manipulation command has to be run through DVC.
Import data
Command: $ dvc run / import / remove

Data and .cache directories
DVC separates data files from their content. Content is never stored in Git.
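The split between a data file and its content can be illustrated with a small content-addressed cache. This is a minimal sketch under assumptions, not DVC's implementation: the function name add_to_cache and the choice of SHA-1 as the cache key are hypothetical, and symlink creation may need extra privileges on some platforms.

```python
import hashlib
import os
import shutil


def add_to_cache(data_path, cache_dir):
    """Move a file's content into cache_dir, keyed by its SHA-1 digest,
    and leave a symlink to the cached copy at data_path."""
    # Reads the whole file at once for brevity; large files would be
    # hashed in chunks.
    with open(data_path, "rb") as f:
        digest = hashlib.sha1(f.read()).hexdigest()
    os.makedirs(cache_dir, exist_ok=True)
    cached = os.path.join(cache_dir, digest)
    if os.path.exists(cached):
        os.remove(data_path)  # identical content is already cached
    else:
        shutil.move(data_path, cached)
    os.symlink(os.path.abspath(cached), data_path)
    return digest
```

In a layout like this, only the small symlink under data/ would need to be tracked by Git, while the cache directory is kept out of the repository.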

Dependency tracking - .state directory
A dependency graph (DAG) is created automatically.
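What a dependency graph buys you can be shown with a toy example: given the stage that changed, find everything downstream and the order in which to re-run it. The graph below is a hypothetical stand-in built from the stage names on the earlier slides, and stages_to_rerun is an illustrative name, not a DVC function.

```python
from collections import deque

# Hypothetical stage graph: each stage maps to the stages that consume its output.
PIPELINE = {
    "data_preparation": ["feature_extraction"],
    "feature_extraction": ["model_training"],
    "model_training": ["model_evaluation"],
    "model_evaluation": [],
}


def stages_to_rerun(graph, changed):
    """Return the changed stage and all of its descendants in an order
    where every stage comes after the stages it depends on."""
    # Step 1: collect the changed stage and everything reachable from it.
    affected = {changed}
    queue = deque([changed])
    while queue:
        for nxt in graph.get(queue.popleft(), []):
            if nxt not in affected:
                affected.add(nxt)
                queue.append(nxt)
    # Step 2: topological sort (Kahn's algorithm) restricted to that set.
    indegree = {stage: 0 for stage in affected}
    for stage in affected:
        for nxt in graph.get(stage, []):
            indegree[nxt] += 1
    ready = deque(sorted(s for s in affected if indegree[s] == 0))
    order = []
    while ready:
        stage = ready.popleft()
        order.append(stage)
        for nxt in graph.get(stage, []):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return order
```

Calling stages_to_rerun(PIPELINE, "feature_extraction") skips data preparation and returns only feature extraction, model training, and model evaluation, in that order.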

DVC data flows

III. Tutorial: NLP with DVC
Link to the tutorial

IV. Beyond the Horizon

Important! DVC is a tool for modeling, not data engineering!
DVC workflow != Airflow workflow
DVC optimizes: model creation time, or data scientists' agility
Airflow/Luigi optimizes: execution time/resources and reliability

Languages
DVC is written in Python, but the tool supports any language:
$ dvc run python code/train_model.py data/matrix-train.p 20170426 data/model.p
$ dvc run Rscript prc_plot.R data/model.p data/matrix-test.p data/prc.jpg
$ dvc run MyApp data/input.tsv data/output.tsv
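One general way for a Python tool to stay language-agnostic is to treat the user's command as an opaque argument list and run it as a child process. The sketch below illustrates that idea only; run_stage and the record it returns are hypothetical and say nothing about how dvc run is actually implemented.

```python
import subprocess
import sys


def run_stage(argv):
    """Run argv as a child process and return a record of what was executed.
    The wrapper never inspects the language of the program it launches."""
    completed = subprocess.run(argv, capture_output=True, text=True)
    return {
        "command": list(argv),
        "returncode": completed.returncode,
        "stdout": completed.stdout,
    }


if __name__ == "__main__":
    # The child happens to be Python here; any executable on PATH works the same way.
    record = run_stage([sys.executable, "-c", "print('trained')"])
    print(record["returncode"], record["stdout"].strip())
```

Because the command is just a list of strings, the same wrapper would work for an Rscript call or a compiled binary.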

Q&A
Dmitry Petrov
Twitter: @FullStackML
Email: [email protected]
