Data Version Control: Tool for Iterative Machine Learning

Data Intelligence
June 28, 2017

Dmitry Petrov
Audience level: Intermediate
Topic area: Modeling
Data Version Control (DVC) is a new open source tool designed to help data scientists keep track of their ML processes and file dependencies through simple, Git-like commands. This presentation walks you through an iterative process of building a machine learning model with DVC.

Transcript

  1. Data Version Control:
    Tool for Iterative Machine Learning
    Dmitry Petrov, PhD
    Conference http://data-intelligence.ai
    June 2017. McLean, VA

  2. Open source:
    1) Created
    github.com/dataversioncontrol/dvc
    2) Implemented Wavelet image hash (wHash) for
    github.com/JohannesBuchner/imagehash
    Blog: fullstackml.com
    Dmitry Petrov
    @FullStackML
    Data Scientist.
    Ex-Data Scientist @Microsoft.
    PhD in CS. Ex-Researcher.

  3. Agenda
    1. Dependencies in data science projects.
    2. DVC introduction.
    3. Tutorial: NLP with DVC.
    4. Beyond the Horizon.
    5. Q&A.

  4. I. Dependencies in machine learning projects

  5. Problem 1: ML is slow
    [Figure: processes labeled RUNNING]

  6. Problem 2: ML is slow and iterative
    Possible solution: Makefile
    Pipeline: data preparation, feature extraction, model training, model evaluation

  7. Problem 3: Makefile is not enough
    Possible solution: replace the timestamp-based model with versioning (Git hashes).
    Makefile is based on timestamps and keeps only the last version.
    Pipeline: data preparation, feature extraction, model training, model evaluation
    Versions: 0267f11, 33e292f, 6e9aa50
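
The difference between the two rebuild rules can be sketched in a few lines of Python. This is only an illustration, not how Make or DVC are implemented: the file names `features.tsv` and `model.bin` and both helper functions are hypothetical, and a plain SHA-1 of the file bytes stands in for a Git hash.

```python
import hashlib
import os
import tempfile


def content_hash(path):
    """Return the SHA-1 hex digest of a file's bytes."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_stale_by_timestamp(source, target):
    """Make-style rule: rebuild when the source is newer than the target."""
    return os.path.getmtime(source) > os.path.getmtime(target)


def is_stale_by_hash(source, recorded_hash):
    """Version-style rule: rebuild only when the source content changed."""
    return content_hash(source) != recorded_hash


with tempfile.TemporaryDirectory() as workdir:
    # Hypothetical pipeline files: one input and the output built from it.
    source = os.path.join(workdir, "features.tsv")
    target = os.path.join(workdir, "model.bin")
    with open(source, "w") as handle:
        handle.write("a\t1\n")
    with open(target, "w") as handle:
        handle.write("model")
    recorded = content_hash(source)

    # "Touch" the source: newer modification time, identical content.
    future = os.path.getmtime(target) + 10
    os.utime(source, (future, future))

    timestamp_says_rebuild = is_stale_by_timestamp(source, target)
    hash_says_rebuild = is_stale_by_hash(source, recorded)
    print(timestamp_says_rebuild, hash_says_rebuild)
```

In this sketch the timestamp rule asks for a rebuild after a mere touch, while the hash rule does not, and a recorded hash can also identify an older version of a file.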

  8. Problem 4: Data and code are not connected
    Git is not good for large files.
    Code: local repository | Git server (GitHub) | local repository
    Data: local files | Cloud (S3, GCP) | local files
    Columns: data scientist 1 | services | data scientist 2

  9. II. DVC introduction

  10. DVC basic
    Home page: http://dataversioncontrol.com
    Command: $ dvc init
    DVC directories:
    1) data/ - data files (actually symlinks to content)
    2) .cache/ - data files content
    3) .state/ - dependency graph (DAG)
    4) * .git/ - standard Git directory

  11. Running your code
    Any data manipulation command has to be run through DVC.
    Import data
    Command: $ dvc run / import / remove

  12. Data and .cache directories
    DVC separates data files from their content. Content is never stored in Git.
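
One way to picture this separation is the following Python sketch. It is not DVC's actual code: the `move_to_cache` helper, the hash-based cache file name, and the `input.tsv` file are assumptions made for illustration. The content moves into a `.cache` directory and the path under `data/` becomes a symlink to it.

```python
import hashlib
import os
import shutil
import tempfile


def move_to_cache(data_path, cache_dir):
    """Move a file's content into cache_dir and leave a symlink in its place."""
    with open(data_path, "rb") as handle:
        digest = hashlib.sha1(handle.read()).hexdigest()
    os.makedirs(cache_dir, exist_ok=True)
    # Assumption: the cached copy is named after the hash of its content.
    cached_path = os.path.join(cache_dir, digest)
    shutil.move(data_path, cached_path)
    os.symlink(cached_path, data_path)
    return cached_path


with tempfile.TemporaryDirectory() as project:
    data_dir = os.path.join(project, "data")
    cache_dir = os.path.join(project, ".cache")
    os.makedirs(data_dir)
    data_file = os.path.join(data_dir, "input.tsv")
    with open(data_file, "w") as handle:
        handle.write("id\ttext\n")

    cached = move_to_cache(data_file, cache_dir)
    is_link = os.path.islink(data_file)
    # Reading through the symlink still returns the original content.
    with open(data_file) as handle:
        content_via_link = handle.read()
    print(is_link, os.path.basename(cached))
```

Under this layout only the lightweight link would need to be tracked alongside the code, while the large content stays outside Git.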

  13. Dependencies tracking - .state directory
    A dependency graph (DAG) is created automatically.
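
To show what such a graph makes possible, here is a minimal Python sketch (Python 3.9+ for `graphlib`). The file names other than `data/model.p`, the `dependencies` mapping, and the `affected_by` helper are hypothetical, and the format DVC stores in `.state/` is not shown here. Given one changed file, the graph yields the outputs to reproduce, in a valid order.

```python
from graphlib import TopologicalSorter

# Hypothetical outputs mapped to the files they are built from.
dependencies = {
    "data/prepared.tsv": {"data/raw.tsv"},
    "data/features.p": {"data/prepared.tsv"},
    "data/model.p": {"data/features.p"},
    "data/eval.txt": {"data/model.p", "data/features.p"},
}


def affected_by(changed, graph):
    """Return the outputs downstream of `changed`, dependencies first."""
    order = list(TopologicalSorter(graph).static_order())
    stale = {changed}
    rebuild = []
    for node in order:
        # A node is stale if any of its direct inputs is stale.
        if graph.get(node, set()) & stale:
            stale.add(node)
            rebuild.append(node)
    return rebuild


rebuild_order = affected_by("data/features.p", dependencies)
print(rebuild_order)
```

For this example graph, a change to `data/features.p` marks only the model and the evaluation for rebuilding; the data preparation step is left alone.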

  14. DVC data flows

  15. III. Tutorial: NLP with DVC
    Link to the tutorial

  16. IV. Beyond the Horizon

  17. Important!
    DVC is a tool for modeling, not data engineering!
    DVC workflow != Airflow workflow
    DVC
    Optimizes: model creation time or data scientists' agility
    Airflow/Luigi
    Optimizes: execution time/resources and reliability

  18. Languages
    DVC is written in Python, but the tool supports any language:
    $ dvc run python code/train_model.py data/matrix-train.p 20170426 data/model.p
    $ dvc run Rscript prc_plot.R data/model.p data/matrix-test.p data/prc.jpg
    $ dvc run MyApp data/input.tsv data/output.tsv

  19. Q&A
    Dmitry Petrov
    Twitter: @FullStackML
    Email: [email protected]
