Data Version Control: Tool for Iterative Machine Learning

Data Version Control, or DVC (http://dataversioncontrol.com), is an open source project that makes data science and machine learning projects reproducible and shareable. It automatically builds a data dependency graph (DAG) and shares code through Git and data through cloud storage (AWS S3, GCP) in a single DVC environment.

Dmitry Petrov

June 25, 2017

Transcript

  1. Data Version Control: Tool for Iterative Machine Learning

    Dmitry Petrov, PhD. Conference: http://data-intelligence.ai, June 2017, McLean, VA.
  2. Open source:

    1) Created github.com/dataversioncontrol/dvc
    2) Implemented Wavelet image hash (wHash) for github.com/JohannesBuchner/imagehash
    Blog: fullstackml.com
    Dmitry Petrov, @FullStackML. Data Scientist. Ex-Data Scientist @Microsoft. PhD in CS. Ex-Researcher.
  3. Agenda

    1. Dependencies in data science projects.
    2. DVC introduction.
    3. Tutorial: NLP with DVC.
    4. Beyond the Horizon.
    5. Q&A.
  4. Problem 2: ML is slow and iterative. Possible solution: Makefile.

    Data preparation, feature extraction, model training, model evaluation.
  5. Problem 3: Makefile is not enough. Possible solution: replace the timestamp-based model with versioning (Git hashes).

    Makefile is based on timestamps and keeps only the last version.
    Data preparation, feature extraction, model training, model evaluation.
    Version: 0267f11. Version: 33e292f. Version: 6e9aa50.
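The contrast on this slide can be illustrated with a minimal Python sketch. It is not DVC's implementation and it does not reproduce Git's object hashing: it assumes a plain SHA-1 over the file bytes, and the names `content_version`, `stale_by_timestamp`, `features.tsv`, and `model.p` are hypothetical. It shows a file whose timestamp changes while its content, and therefore its hash-based version, stays the same.

```python
import hashlib
import os
import tempfile
import time


def content_version(path, length=7):
    """Short hex digest of the file bytes (plain SHA-1; an assumption, not Git's scheme)."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()[:length]


def stale_by_timestamp(target, dependency):
    """Make-style rule: the target is stale when the dependency is newer."""
    return os.path.getmtime(dependency) > os.path.getmtime(target)


def demo():
    """Touch a dependency without changing it and compare both rules."""
    with tempfile.TemporaryDirectory() as workdir:
        dependency = os.path.join(workdir, "features.tsv")  # hypothetical file name
        target = os.path.join(workdir, "model.p")           # hypothetical file name
        with open(dependency, "w") as handle:
            handle.write("a\t1\n")
        with open(target, "w") as handle:
            handle.write("model")

        before = content_version(dependency)
        # Move the dependency's modification time forward; the bytes are untouched.
        future = time.time() + 60
        os.utime(dependency, (future, future))
        after = content_version(dependency)

        return stale_by_timestamp(target, dependency), before == after


if __name__ == "__main__":
    timestamp_says_stale, same_version = demo()
    print("timestamp rule wants a rebuild:", timestamp_says_stale)
    print("content version unchanged:", same_version)
```

Under these assumptions the timestamp rule asks for a rebuild even though nothing changed, while the content-based version is identical before and after.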
  6. Problem 4: Data and code are not connected. Git is not good for large files.

    Code: local repository (data scientist 1), Git server (GitHub), local repository (data scientist 2).
    Data: local files (data scientist 1), cloud (S3, GCP), local files (data scientist 2).
  7. DVC basic

    Home page: http://dataversioncontrol.com
    Command: $ dvc init
    DVC directories:
    1) data/ - data files (actually symlinks to content)
    2) .cache/ - data files content
    3) .state/ - dependency graph (DAG)
    4) * .git/ - standard Git directory
  8. Running your code

    Any data manipulation command has to be run through DVC.
    Import data. Command: $ dvc run / import / remove
  9. Data and .cache directories

    DVC separates data files from their content. Content is never stored in Git.
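One way to picture the split between data files and their content is a content-addressed cache with symlinks. The sketch below is an illustration only, not DVC's code: the helper `add_to_cache`, the SHA-1 naming of cached files, and the file `input.tsv` are assumptions. It also assumes a platform where symlinks can be created without extra privileges.

```python
import hashlib
import os
import tempfile


def add_to_cache(data_path, cache_dir):
    """Move the file content into cache_dir under its hash and leave a symlink behind."""
    with open(data_path, "rb") as handle:
        digest = hashlib.sha1(handle.read()).hexdigest()
    os.makedirs(cache_dir, exist_ok=True)
    cached_path = os.path.join(cache_dir, digest)
    # The real bytes now live in the cache directory only.
    os.replace(data_path, cached_path)
    # The original path becomes a lightweight pointer to the cached content.
    os.symlink(os.path.abspath(cached_path), data_path)
    return cached_path


def demo():
    """Create data/input.tsv, cache it, and check that reads still work."""
    with tempfile.TemporaryDirectory() as repo:
        data_dir = os.path.join(repo, "data")
        os.makedirs(data_dir)
        path = os.path.join(data_dir, "input.tsv")  # hypothetical file name
        with open(path, "w") as handle:
            handle.write("id\tvalue\n")

        cached = add_to_cache(path, os.path.join(repo, ".cache"))

        with open(path) as handle:
            content = handle.read()
        points_to_cache = os.path.realpath(path) == os.path.realpath(cached)
        return os.path.islink(path), points_to_cache, content


if __name__ == "__main__":
    print(demo())
```

In a layout like this, only the small symlinks under data/ would need to be tracked, while the cache directory holding the content stays out of the Git repository.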
  10. Important! DVC is a tool for modeling, not data engineering!

    DVC workflow != Airflow workflow
    DVC optimizes: model creation time or data scientists' agility.
    Airflow/Luigi optimizes: execution time/resources and reliability.
  11. Languages

    DVC is written in Python, but the tool supports any language:
    $ dvc run python code/train_model.py data/matrix-train.p 20170426 data/model.p
    $ dvc run Rscript prc_plot.R data/model.p data/matrix-test.p data/prc.jpg
    $ dvc run MyApp data/input.tsv data/output.tsv