Data Version Control: Tool for Iterative Machine Learning

Data Intelligence
June 28, 2017

Dmitry Petrov
Audience level: Intermediate
Topic area: Modeling
Data Version Control (DVC) is a new open source tool designed to help data scientists keep track of their ML processes and file dependencies through simple, Git-like commands. This presentation walks you through an iterative process of building a machine learning model with DVC.



  1. Data Version Control: Tool for Iterative Machine Learning

    Dmitry Petrov, PhD
    Conference June 2017. McLean, VA
  2. Open source: 1) Created 2) Implemented Wavelet image hash (wHash) for gehash

    Blog:
    Dmitry Petrov @FullStackML
    Data Scientist. Ex-Data Scientist @Microsoft. PhD in CS. Ex-Researcher.
  3. Agenda

    1. Dependencies in data science projects.
    2. DVC introduction.
    3. Tutorial: NLP with DVC.
    4. Beyond the Horizon.
    5. Q&A.
  4. I. Dependencies in machine learning projects

  5. Problem 1: ML is slow

    [Figure labels: RUNNING, RUNNING]

  6. Problem 2: ML is slow and iterative

    Possible solution: Makefile
    Data preparation -> Feature extraction -> Model training -> Model evaluation
  7. Problem 3: Makefile is not enough

    Possible solution: replace the timestamp-based model with versioning (Git hashes).
    Makefile is based on timestamps and keeps only the last version.
    Data preparation -> Feature extraction -> Model training -> Model evaluation
    Version: 0267f11, Version: 33e292f, Version: 6e9aa50
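
The difference between the two models on this slide can be illustrated with a minimal Python sketch. It is not DVC's or Make's actual code: the function names, the file names, and the choice of a 7-character SHA-1 prefix are all assumptions made for illustration. The source file is rewritten with identical content and given a newer timestamp, so a timestamp rule asks for a rebuild while a content-version rule does not.

```python
import hashlib
import os
import tempfile


def is_stale_by_mtime(source: str, target: str) -> bool:
    """Make-style rule: rebuild if the target is missing or older than the source."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(source) > os.path.getmtime(target)


def content_hash(path: str) -> str:
    """Short content hash of a file (SHA-1 prefix, chosen only for illustration)."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()[:7]


def is_stale_by_hash(source: str, recorded_hash: str) -> bool:
    """Version-style rule: rebuild only if the content differs from the recorded version."""
    return content_hash(source) != recorded_hash


with tempfile.TemporaryDirectory() as workdir:
    # Hypothetical file names, used only for this demonstration
    source = os.path.join(workdir, "features.tsv")
    target = os.path.join(workdir, "model.p")
    with open(source, "w") as f:
        f.write("a\t1\n")
    with open(target, "w") as f:
        f.write("model")
    recorded = content_hash(source)

    # Rewrite identical content and make the source look newer than the target
    with open(source, "w") as f:
        f.write("a\t1\n")
    newer = os.path.getmtime(target) + 10
    os.utime(source, (newer, newer))

    mtime_says_rebuild = is_stale_by_mtime(source, target)
    hash_says_rebuild = is_stale_by_hash(source, recorded)
    print(mtime_says_rebuild, hash_says_rebuild)
```

With a slow training step, avoiding that unnecessary rebuild is the practical benefit; a recorded version also identifies which earlier result a file corresponds to, which a timestamp cannot do.
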
  8. Problem 4: Data and code are not connected

    Git is not good for large files.
    [Diagram] Data scientist 1 | Services | Data scientist 2
    Code: Local repository | Git server (GitHub) | Local repository
    Data: Local files | Cloud (S3, GCP) | Local files
  9. II. DVC introduction

  10. DVC basic

    Home page:
    Command: $ dvc init
    DVC directories:
    1) data/ - data files (actually symlinks to content)
    2) .cache/ - data files content
    3) .state/ - dependency graph (DAG)
    4) * .git/ - standard Git directory
  11. Running your code

    Any data manipulation command has to be run through DVC.
    Import data
    Command: $ dvc run / import / remove
  12. Data and .cache directories

    DVC separates data files from their content. Content is never stored in Git.
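
The separation of a data file from its content can be sketched in Python as follows. This is an illustration of the idea only, not DVC's implementation: the helper `move_to_cache`, the cache file naming scheme, and the hash choice are hypothetical. The content is moved into a `.cache/` directory and a symlink is left in `data/`.

```python
import hashlib
import os
import shutil
import tempfile


def move_to_cache(data_path: str, cache_dir: str) -> str:
    """Move the file content into cache_dir and leave a symlink at data_path.

    Hypothetical helper: the cache naming scheme is an assumption.
    """
    with open(data_path, "rb") as f:
        digest = hashlib.sha1(f.read()).hexdigest()[:7]
    os.makedirs(cache_dir, exist_ok=True)
    cached_path = os.path.join(cache_dir, f"{os.path.basename(data_path)}_{digest}")
    shutil.move(data_path, cached_path)
    # The path in data/ is now only a pointer to the cached content
    os.symlink(os.path.abspath(cached_path), data_path)
    return cached_path


with tempfile.TemporaryDirectory() as repo:
    data_dir = os.path.join(repo, "data")
    cache_dir = os.path.join(repo, ".cache")
    os.makedirs(data_dir)
    data_file = os.path.join(data_dir, "input.tsv")
    with open(data_file, "w") as f:
        f.write("id\ttext\n")

    cached = move_to_cache(data_file, cache_dir)
    is_link = os.path.islink(data_file)
    with open(data_file) as f:
        content_via_link = f.read()  # reading through the symlink still works
    print(is_link, os.path.basename(cached))
```

Code that reads `data/input.tsv` keeps working unchanged, while the large content lives in a directory that can be kept out of Git. Creating symlinks may require extra privileges on some platforms.
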
  13. Dependency tracking - .state directory

    A dependency graph (DAG) is created automatically.
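
How such a graph might be built up from recorded steps, and then queried, can be sketched in Python. The `DependencyGraph` class and its methods are hypothetical and say nothing about the real format of `.state/`; the file names are borrowed from the command examples later in the deck, and the way they are wired together here is an assumption.

```python
from collections import defaultdict, deque


class DependencyGraph:
    """Toy DAG: each recorded step links its input files to its output files."""

    def __init__(self):
        self.consumers = defaultdict(set)  # file -> files derived directly from it

    def record_step(self, inputs, outputs):
        """Register one executed step by its input and output files."""
        for src in inputs:
            for out in outputs:
                self.consumers[src].add(out)

    def affected_by(self, changed_file):
        """Return every file downstream of changed_file, in breadth-first order."""
        seen, order = set(), []
        queue = deque([changed_file])
        while queue:
            current = queue.popleft()
            for out in sorted(self.consumers.get(current, ())):
                if out not in seen:
                    seen.add(out)
                    order.append(out)
                    queue.append(out)
        return order


graph = DependencyGraph()
# Assumed pipeline wiring, for illustration only
graph.record_step(["data/input.tsv"], ["data/matrix-train.p", "data/matrix-test.p"])
graph.record_step(["data/matrix-train.p"], ["data/model.p"])
graph.record_step(["data/model.p", "data/matrix-test.p"], ["data/prc.jpg"])

print(graph.affected_by("data/matrix-train.p"))
# ['data/model.p', 'data/prc.jpg']
```

Because every step is recorded as it runs, nobody has to write the graph by hand, and a change to one file identifies exactly which downstream results need to be regenerated.
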
  14. DVC data flows

  15. III. Tutorial: NLP with DVC Link to the tutorial

  16. IV. Beyond the Horizon

  17. Important! DVC is a tool for modeling, not data engineering!

    DVC workflow != Airflow workflow
    DVC optimizes: model creation time, or data scientists' agility
    Airflow/Luigi optimizes: execution time/resources and reliability
  18. Languages

    DVC is written in Python, but the tool supports any language:
    $ dvc run python code/ data/matrix-train.p 20170426 data/model.p
    $ dvc run Rscript prc_plot.R data/model.p data/matrix-test.p data/prc.jpg
    $ dvc run MyApp data/input.tsv data/output.tsv
  19. Q&A Dmitry Petrov Twitter: @FullStackML Email: