Data Version Control: Tool for Iterative Machine Learning

Data Intelligence
June 28, 2017

Dmitry Petrov
Audience level: Intermediate
Topic area: Modeling
Data Version Control (DVC) is a new open source tool designed to help data scientists keep track of their ML processes and file dependencies through simple Git-like commands. This presentation walks through an iterative process of building a machine learning model with DVC.

Transcript

  1. Data Version Control: Tool for Iterative Machine Learning. Dmitry Petrov, PhD. Conference: http://data-intelligence.ai, June 2017, McLean, VA.
  2. Open source: 1) Created github.com/dataversioncontrol/dvc 2) Implemented Wavelet image hash (wHash) for github.com/JohannesBuchner/imagehash. Blog: fullstackml.com. Dmitry Petrov, @FullStackML. Data Scientist. Ex-Data Scientist @Microsoft. PhD in CS. Ex-Researcher.
  3. Agenda: 1. Dependencies in data science projects. 2. DVC introduction. 3. Tutorial: NLP with DVC. 4. Beyond the Horizon. 5. Q&A.
  4. Problem 2: ML is slow and iterative. Possible solution: Makefile. Pipeline: data preparation, feature extraction, model training, model evaluation.
  5. Problem 3: Makefile is not enough. Possible solution: replace the timestamp-based model with versioning (Git hashes). A Makefile is based on timestamps and keeps only the last version. Pipeline: data preparation, feature extraction, model training, model evaluation. Versions: 0267f11, 33e292f, 6e9aa50.
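
The difference between the two rebuild rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not DVC's implementation: the helper names (`content_version`, `is_stale_by_mtime`, `is_stale_by_hash`, `demo`) and the file names are hypothetical, and the short SHA-1 of the raw file bytes only imitates the look of a Git short hash (Git itself hashes objects differently).

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Tuple


def content_version(path: Path, length: int = 7) -> str:
    """Return a short hex digest of the file's bytes (hypothetical version id)."""
    return hashlib.sha1(path.read_bytes()).hexdigest()[:length]


def is_stale_by_mtime(source: Path, target: Path) -> bool:
    """Make-style rule: rebuild when the source is newer than the target."""
    return source.stat().st_mtime > target.stat().st_mtime


def is_stale_by_hash(source: Path, recorded_version: str) -> bool:
    """Hash-style rule: rebuild only when the source content has changed."""
    return content_version(source) != recorded_version


def demo() -> Tuple[bool, bool]:
    """Touch a source file without editing it and compare the two rules."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "features.tsv"
        target = Path(tmp) / "model.p"
        source.write_text("id\tvalue\n1\t0.5\n")
        target.write_text("model built from features.tsv")
        recorded = content_version(source)

        # Move the source timestamp past the target's; the content is unchanged.
        later = target.stat().st_mtime + 10
        os.utime(source, (later, later))

        return is_stale_by_mtime(source, target), is_stale_by_hash(source, recorded)
```

In this sketch the timestamp rule asks for a rebuild after a mere touch, while the hash rule does not. Recording a hash per step output is also what makes it possible to keep more than the last version.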
  6. Problem 4: Data and code are not connected. Git is not good for large files. Diagram labels: Code: local repository, Git server (GitHub), local repository. Data: local files, cloud (S3, GCP), local files. Actors: data scientist 1, data scientist 2, services.
  7. DVC basic. Home page: http://dataversioncontrol.com. Command: $ dvc init. DVC directories: 1) data/ - data files (actually symlinks to content) 2) .cache/ - data file content 3) .state/ - dependency graph (DAG) 4) * .git/ - standard Git directory
  8. Running your code. Any data manipulation command has to be run through DVC. Import data. Command: $ dvc run / import / remove
  9. Data and .cache directories. DVC separates data files from their content. Content is never stored in Git.
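
The separation of a data file from its content can be pictured as a content-addressed cache plus a symlink. The sketch below is an assumption-laden illustration rather than DVC's actual code: `add_to_cache`, `demo`, and the `<name>_<short hash>` cache naming scheme are hypothetical, and creating symlinks may require extra privileges on Windows.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import Tuple


def add_to_cache(data_file: Path, cache_dir: Path) -> Path:
    """Move a file's content into cache_dir and leave a symlink in its place."""
    digest = hashlib.sha1(data_file.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: original file name plus a short content hash.
    cached = cache_dir / f"{data_file.name}_{digest[:7]}"
    if cached.exists():
        data_file.unlink()  # identical content is already in the cache
    else:
        shutil.move(str(data_file), str(cached))
    # The path under data/ becomes a small pointer to the real content.
    data_file.symlink_to(cached.resolve())
    return cached


def demo() -> Tuple[bool, bytes]:
    """Cache one file and confirm the symlink still reads the same bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data_file = root / "data" / "matrix-train.p"
        data_file.parent.mkdir()
        data_file.write_bytes(b"large binary content")
        add_to_cache(data_file, root / ".cache")
        return data_file.is_symlink(), data_file.read_bytes()
```

With a layout like this, a Git repository would only need to track the lightweight symlink, while the cache directory holding the large content stays out of Git.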
  10. Important! DVC is a tool for modeling, not data engineering! DVC workflow != Airflow workflow. DVC optimizes: model creation time, or data scientists' agility. Airflow/Luigi optimizes: execution time/resources and reliability.
  11. Languages. DVC is written in Python, but the tool supports any language:

    $ dvc run python code/train_model.py data/matrix-train.p 20170426 data/model.p
    $ dvc run Rscript prc_plot.R data/model.p data/matrix-test.p data/prc.jpg
    $ dvc run MyApp data/input.tsv data/output.tsv
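
One way to see why a wrapper command can be language-agnostic: it only needs to launch a process and look at the files before and after. The following is a rough sketch under that assumption and is not how DVC is implemented; `snapshot`, `run_tracked`, and `demo` are hypothetical names, and the child process here happens to be Python only so the example is self-contained.

```python
import hashlib
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Dict, List


def snapshot(directory: Path) -> Dict[str, str]:
    """Map each file under directory to a SHA-1 digest of its content."""
    return {
        str(p.relative_to(directory)): hashlib.sha1(p.read_bytes()).hexdigest()
        for p in sorted(directory.rglob("*"))
        if p.is_file()
    }


def run_tracked(command: List[str], workdir: Path) -> List[str]:
    """Run any executable and return the files it created or modified."""
    before = snapshot(workdir)
    subprocess.run(command, cwd=workdir, check=True)
    after = snapshot(workdir)
    return sorted(
        name for name, digest in after.items() if before.get(name) != digest
    )


def demo() -> List[str]:
    """Run a tiny child process that derives output.tsv from input.tsv."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "input.tsv").write_text("a\tb\n")
        script = "open('output.tsv', 'w').write(open('input.tsv').read().upper())"
        return run_tracked([sys.executable, "-c", script], workdir)
```

Because the wrapper never inspects the program it runs, the same function would work for an R script or a compiled binary, which mirrors the three commands on the slide.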