Slide 1

Slide 1 text

Data Version Control: Tool for Iterative Machine Learning
Dmitry Petrov, PhD
Conference: http://data-intelligence.ai
June 2017, McLean, VA

Slide 2

Slide 2 text

Dmitry Petrov (@FullStackML)
Data Scientist. Ex-Data Scientist @Microsoft. PhD in CS. Ex-Researcher.
Blog: fullstackml.com
Open source:
1) Created github.com/dataversioncontrol/dvc
2) Implemented Wavelet image hash (wHash) for github.com/JohannesBuchner/imagehash

Slide 3

Slide 3 text

Agenda
1. Dependencies in data science projects.
2. DVC introduction.
3. Tutorial: NLP with DVC.
4. Beyond the Horizon.
5. Q&A.

Slide 4

Slide 4 text

I. Dependencies in machine learning projects

Slide 5

Slide 5 text

Problem 1: ML is slow
[Figure labels: RUNNING, RUNNING]

Slide 6

Slide 6 text

Problem 2: ML is slow and iterative
Possible solution: Makefile
Pipeline: Data preparation -> Feature extraction -> Model training -> Model evaluation

Slide 7

Slide 7 text

Problem 3: Makefile is not enough
Makefile is based on timestamps and keeps only the last version.
Possible solution: replace the timestamp-based model with versioning (Git hashes).
Pipeline: Data preparation -> Feature extraction -> Model training -> Model evaluation
Versions: 0267f11, 33e292f, 6e9aa50
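The difference between timestamp-based and version-based change detection can be sketched in a few lines of Python. This is only an illustration, not how DVC is implemented: the slide proposes Git hashes, while the sketch below uses a SHA-1 digest of the file content as a stand-in, and the file name and the `content_version` helper are hypothetical.

```python
import hashlib
import os
import tempfile
import time


def content_version(path, length=7):
    """Return a short hex digest of the file's bytes (an illustrative version id)."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()[:length]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "matrix-train.tsv")  # hypothetical data file
    with open(path, "w") as f:
        f.write("id\tlabel\n1\t0\n")

    mtime_before = os.stat(path).st_mtime_ns
    version_before = content_version(path)

    # Bump the modification time without touching the content.
    future = time.time() + 60
    os.utime(path, (future, future))
    mtime_changed = os.stat(path).st_mtime_ns != mtime_before
    version_changed = content_version(path) != version_before

    # Now really change the content.
    with open(path, "a") as f:
        f.write("2\t1\n")
    version_after_edit = content_version(path)

print("timestamp changed:", mtime_changed)
print("version changed after touch:", version_changed)
print("version changed after edit:", version_after_edit != version_before)
```

A timestamp-based tool would rebuild after the `os.utime` call even though nothing changed, whereas a content-derived version only changes when the data does, and older versions remain addressable by their identifiers.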

Slide 8

Slide 8 text

Problem 4: Data and code are not connected
Git is not good for large files.
[Diagram]
        Data scientist 1    Services             Data scientist 2
Code:   Local repository    Git server (GitHub)  Local repository
Data:   Local files         Cloud (S3, GCP)      Local files

Slide 9

Slide 9 text

II. DVC introduction

Slide 10

Slide 10 text

DVC basics
Home page: http://dataversioncontrol.com
Command: $ dvc init
DVC directories:
1) data/ - data files (actually symlinks to content)
2) .cache/ - data file content
3) .state/ - dependency graph (DAG)
4) * .git/ - standard Git directory

Slide 11

Slide 11 text

Running your code
Any data manipulation command has to be run through DVC.
Import data
Command: $ dvc run / import / remove

Slide 12

Slide 12 text

Data and .cache directories
DVC separates data files from their content. Content is never stored in Git.
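The separation of a data file from its content can be illustrated with a small Python sketch. It is a guess at the general idea only: the cache file naming scheme (base name plus a short SHA-1 digest), the `add_to_cache` helper, and the example file are assumptions made for illustration and are not taken from DVC itself.

```python
import hashlib
import os
import shutil
import tempfile


def add_to_cache(workspace, rel_path):
    """Move a file's content into .cache/ and leave a symlink at the original path."""
    src = os.path.join(workspace, rel_path)
    with open(src, "rb") as f:
        digest = hashlib.sha1(f.read()).hexdigest()[:7]
    cache_dir = os.path.join(workspace, ".cache")
    os.makedirs(cache_dir, exist_ok=True)
    # Hypothetical naming scheme: <file name>_<short content digest>
    cached = os.path.join(cache_dir, f"{os.path.basename(rel_path)}_{digest}")
    shutil.move(src, cached)
    os.symlink(cached, src)  # data/ now holds only a lightweight pointer
    return cached


workspace = tempfile.mkdtemp()
os.makedirs(os.path.join(workspace, "data"))
data_file = os.path.join(workspace, "data", "input.tsv")
with open(data_file, "w") as f:
    f.write("a\tb\n")

cached_path = add_to_cache(workspace, os.path.join("data", "input.tsv"))
is_link = os.path.islink(data_file)
with open(data_file) as f:
    content_via_link = f.read()  # reads through the symlink into .cache/

shutil.rmtree(workspace)
print("data file is a symlink:", is_link)
```

With this layout the small pointer in data/ can be tracked alongside the code, while the large content in .cache/ stays out of the Git repository.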

Slide 13

Slide 13 text

Dependency tracking: .state directory
A dependency graph (DAG) is created automatically.
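To show what a dependency graph makes possible, here is a minimal Python sketch that records the inputs and outputs of each step and works out which steps become stale when a file changes. The step names, the file names, and the `affected_steps` function are hypothetical, and the format DVC actually stores in .state/ is not shown in the slides.

```python
from collections import deque

# Each step maps to (input files, output files), listed in pipeline order.
# All names are illustrative.
STEPS = {
    "prepare": (["data/raw.tsv"], ["data/clean.tsv"]),
    "featurize": (["data/clean.tsv"], ["data/matrix-train.p", "data/matrix-test.p"]),
    "train": (["data/matrix-train.p"], ["data/model.p"]),
    "evaluate": (["data/model.p", "data/matrix-test.p"], ["data/prc.jpg"]),
}


def affected_steps(steps, changed_file):
    """Return the steps that must be re-run after changed_file is modified."""
    # Index: file -> steps that read it.
    consumers = {}
    for name, (inputs, _outputs) in steps.items():
        for path in inputs:
            consumers.setdefault(path, []).append(name)

    # Walk downstream from the changed file through the DAG.
    stale = set()
    queue = deque([changed_file])
    while queue:
        path = queue.popleft()
        for name in consumers.get(path, []):
            if name not in stale:
                stale.add(name)
                queue.extend(steps[name][1])

    # Report in the declared pipeline order.
    return [name for name in steps if name in stale]


print(affected_steps(STEPS, "data/matrix-train.p"))
```

Only the steps downstream of the changed file are returned, which is what lets an iterative workflow skip the expensive stages that are still up to date.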

Slide 14

Slide 14 text

DVC data flows

Slide 15

Slide 15 text

III. Tutorial: NLP with DVC
Link to the tutorial

Slide 16

Slide 16 text

IV. Beyond the Horizon

Slide 17

Slide 17 text

Important! DVC is a tool for modeling, not data engineering!
DVC workflow != Airflow workflow
DVC optimizes: model creation time, or data scientists' agility
Airflow/Luigi optimize: execution time/resources and reliability

Slide 18

Slide 18 text

Languages
DVC is written in Python, but the tool supports any language:
$ dvc run python code/train_model.py data/matrix-train.p 20170426 data/model.p
$ dvc run Rscript prc_plot.R data/model.p data/matrix-test.p data/prc.jpg
$ dvc run MyApp data/input.tsv data/output.tsv
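One way a wrapper can stay language-agnostic is to treat the wrapped program as an opaque command line and look only at its arguments. The Python sketch below illustrates that idea under stated assumptions: `run_tracked` is a hypothetical helper, and picking out arguments that start with `data/` is a simplification for illustration, not a description of how `dvc run` identifies data files.

```python
import subprocess
import sys


def run_tracked(argv, data_prefix="data/"):
    """Run any command and return a record of it plus its data-looking arguments."""
    completed = subprocess.run(argv, check=True)
    # Assumption for this sketch: data files are the arguments under data/.
    data_args = [arg for arg in argv if arg.startswith(data_prefix)]
    return {
        "command": list(argv),
        "data_args": data_args,
        "returncode": completed.returncode,
    }


# The wrapped program could be Python, Rscript, or any executable;
# the current Python interpreter is used here so the sketch is self-contained.
record = run_tracked(
    [sys.executable, "-c", "print('training...')",
     "data/matrix-train.p", "data/model.p"]
)
print(record["data_args"])
```

Because the wrapper never inspects the program itself, the same record can be produced for a training script, an R plotting script, or a compiled application.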

Slide 19

Slide 19 text

Q&A
Dmitry Petrov
Twitter: @FullStackML
Email: [email protected]