Dmitry Petrov - Machine learning model and dataset versioning practices

Python is a prevalent programming language in the machine learning (ML) community. Many Python engineers and data scientists feel the lack of engineering practices such as versioning of large datasets and ML models, and the lack of reproducibility. This gap is particularly acute for engineers who have just moved into the ML space.

We will discuss current practices for organizing ML projects with traditional open-source tools such as Git and Git-LFS, as well as the limitations of this toolset. This will explain the motivation for developing new ML-specific version control systems.

Data Version Control, or DVC.ORG, is an open-source command-line tool written in Python. We will show how to version datasets that are dozens of gigabytes in size, how to version ML models, how to use your favorite storage (S3, GCS, or a bare-metal SSH server) as a data file backend, and how to adopt sound engineering practices in your ML projects.

https://us.pycon.org/2019/schedule/presentation/176/

PyCon 2019

May 04, 2019

Transcript

  1. |HELLO| Dmitry Petrov

    PhD in Computer Science
    Twitter: @FullStackML
    Creator of DVC.org project
    Co-Founder & CEO > Iterative.AI > San Francisco, USA
    ex-Data Scientist > Microsoft (BingAds) > Seattle, USA
    ex-Head of Lab > St. Petersburg Electrotechnical University > Russia
  2. |AGENDA|

    > Why ML is special?
    > MLFlow
    > Git-LFS
    > DVC
    > Conclusion
  3. |DIFFERENCE 1: ML IS METRICS DRIVEN - EXPERIMENT|

    >> EXPERIMENT = CODE + OUTPUTS
    Outputs include metrics and graphs: AUC, etc.
  4. |DIFFERENCE 1: ML IS METRICS DRIVEN - TRACKING|

    >> EXPERIMENT = CODE + OUTPUTS
    Outputs include metrics and graphs: AUC, etc.
    Solution: metrics tracking
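
Slides 3 and 4 treat an experiment as code plus outputs and name metrics tracking as the remedy. The following is a minimal sketch of that idea, not a tool shown in the talk: the helpers `record_experiment` and `best_experiment` and the JSON Lines log file are hypothetical names chosen for illustration. Each record ties a code version to the metrics it produced.

```python
import json
import os
import tempfile


def record_experiment(log_path, code_version, metrics):
    """Append one experiment record (code version + metrics) as a JSON line."""
    record = {"code_version": code_version, "metrics": metrics}
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def best_experiment(log_path, metric):
    """Return the logged record with the highest value of `metric`."""
    with open(log_path, encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh if line.strip()]
    return max(records, key=lambda r: r["metrics"][metric])


if __name__ == "__main__":
    # Hypothetical commit identifiers and AUC values, for illustration only.
    log = os.path.join(tempfile.mkdtemp(), "experiments.jsonl")
    record_experiment(log, "a1b2c3", {"auc": 0.81})
    record_experiment(log, "d4e5f6", {"auc": 0.86})
    print(best_experiment(log, "auc")["code_version"])
```

Keeping the log append-only means earlier results are never overwritten, so any two experiments can be compared later.
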
  5. |DIFFERENCE 2: ML MODELS CENTRIC|

    >> EXPERIMENT = CODE + OUTPUTS
    ML model is an output
  6. |DIFFERENCE 2: ML MODELS CENTRIC - TRACKING|

    >> EXPERIMENT = CODE + OUTPUTS
    ML model is an output
    Solution: ML model versioning
  7. |DIFFERENCE 3: DATASETS MANAGEMENT|

    >> EXPERIMENT = CODE + OUTPUTS + DATASET
    Source code, Datasets, ML models
    > Size is usually large > Gb
    > Moving datasets around
    > Datasets evolve. So, versioning is needed
  8. |DIFFERENCE 3: DATASETS|

    >> EXPERIMENT = CODE + OUTPUTS + DATASET
    Solution:
    > Custom data syncing scripts
    > Connect data to code
  9. |DIFFERENCE 4: ML PIPELINES|

    Why do we need ML Pipelines?
    > Manage complexity - separate steps
    > Optimize execution - cache steps
    > Reusability of code
    > Scale team - steps ownership
  10. |DIFFERENCE 4: ML PIPELINES - TYPES|

    ML Pipeline types:
    > Data engineering (AirFlow) for reliability
    > ML pipelines for fast iterations
  12. |WHEN THE BEST PRACTICES ARE NEEDED|

    To be more efficient:
    > Team work
    > Reuse previous results
    > Predictable process
  13. |MLFLOW INTRO|

    Platform for the machine learning lifecycle
    > Tracking
    > Project
    > Models
    $ pip install mlflow
  14. |MLFLOW SUMMARY|

    Best practice             Result
    Tracking metrics          +
    Versioning ML models      +
    Versioning datasets       -/+
    Versioning ML pipelines   -
  15. |03|

    > Why ML is special?
    > MLFlow
    > Git-LFS: Git Large File Storage
    > DVC
    > Conclusion
  16. |GIT-LFS INTRO|

    > Install
    $ brew install git-lfs
    $ git lfs install
    > Specify data-files type in a Git repository
    $ git lfs track '*.p'
    $ git add .gitattributes
  17. |GIT-LFS ADD DATA FILES|

    $ python mytrain.py # your code generates mymodel.p
    $ git add mytrain.py mymodel.p
    $ git commit -m 'Decay was added'
    $ git push
    Uploading LFS objects: 100% (1/1), 56 MB | 3.2 MB/s, done
  18. |GIT-LFS RETRIEVE DATA FILES|

    $ git clone https://github.com/dmpetrov/my-lfs-repo
    $ cd my-lfs-repo
    $ du -sh mymodel.p # data file does not contain data yet
    4.0K mymodel.p
    $ git pull
    Downloading LFS objects: 75% (3/4), 44 MB | 4.5 MB/s
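
Slide 18 shows `mymodel.p` taking only 4.0K before the LFS objects are downloaded. With Git LFS, the Git repository stores a small text pointer in place of the large file, and the pointer names the real content by its SHA-256 hash and size. The sketch below builds such a pointer following the published pointer-file layout (version line, `oid`, `size`); the function name `lfs_pointer` and the sample bytes are illustrative assumptions, not part of the talk.

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Build Git LFS pointer text (spec v1 layout) for the given file content."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


if __name__ == "__main__":
    model_bytes = b"\x00" * 1024  # stand-in for a serialized model file
    print(lfs_pointer(model_bytes), end="")
```

The pointer stays a few hundred bytes no matter how large the model is, which is why the checked-out file is tiny until the real content arrives.
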
  19. |GIT-LFS PROS/CONS|

    > PROS
      > Simple, like Git
    > CONS
      > Limited by data size: <2 GB, <500 MB even better
      > Not every Git server supports Git-LFS
      > Nothing specific to ML / Data Science
  20. |GIT-LFS SUMMARY|

    Best practice             Result
    Tracking metrics          -
    Versioning ML models      +
    Versioning datasets       -/+
    Versioning ML pipelines   -
  21. |04|

    > Why ML is special?
    > MLFlow
    > Git-LFS
    > DVC: Data Version Control
    > Conclusion
  22. |DVC PRINCIPLES|

    > DATASETS VERSIONING
      > Including large ones: 10-100 GB
    > ML PIPELINE VERSIONING
      > All intermediate datasets / featuresets
    > MULTIPLE CLOUD SUPPORT
      > AWS S3, Google cloud, Azure, bare metal SSH server
  23. |DVC INTRO|

    Website: http://DVC.org
    > Install
    $ pip install dvc
    $ dvc init
    > Git-like tool: no infrastructure is required
  24. |DVC DATASETS VERSIONING|

    > Push data to storage
    $ dvc add data.xml
    $ dvc push
    > Push meta information to Git server
    $ git add .gitignore data.xml.dvc
    $ git commit -m "add source data to DVC"
    $ git push
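
Slide 24 separates the data itself (pushed to storage) from a small meta file (committed to Git). One common way to get that split is content-addressed storage, sketched below. This is a conceptual illustration and not DVC's code: the function names, the SHA-256 hash, the cache directory layout, and the `.meta.json` file format are all assumptions made for the example.

```python
import hashlib
import json
import os
import shutil
import tempfile


def file_checksum(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_to_cache(data_path, cache_dir):
    """Copy a data file into a content-addressed cache and write a meta file.

    Hypothetical layout: <cache_dir>/<first 2 hex chars>/<remaining chars>.
    The small '<data_path>.meta.json' file is what would be committed to Git.
    """
    checksum = file_checksum(data_path)
    cached = os.path.join(cache_dir, checksum[:2], checksum[2:])
    os.makedirs(os.path.dirname(cached), exist_ok=True)
    if not os.path.exists(cached):  # identical content is stored only once
        shutil.copyfile(data_path, cached)
    meta_path = data_path + ".meta.json"
    with open(meta_path, "w", encoding="utf-8") as fh:
        json.dump({"checksum": checksum, "path": os.path.basename(data_path)}, fh)
    return checksum, cached, meta_path


if __name__ == "__main__":
    work = tempfile.mkdtemp()
    data = os.path.join(work, "data.xml")
    with open(data, "wb") as fh:
        fh.write(b"<rows><row id='1'/></rows>")
    checksum, cached, meta = add_to_cache(data, os.path.join(work, "cache"))
    print(checksum[:8], os.path.basename(meta))
```

Because the cache key is derived from the content, a dataset version is fully identified by the checksum recorded in the meta file, and Git only ever sees that small file.
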
  25. |DVC RETRIEVE DATASETS|

    $ git clone https://github.com/dmpetrov/my-dvc-repo
    $ cd my-dvc-repo
    $ dvc pull
    ...
    $ du -sh data.xml
    7G data.xml
  26. |DVC ML PIPELINES VERSIONING|

    $ dvc add data/data.xml
    $ dvc run -d src/prepare.py -d data/data.xml -o data/prepared \
          python src/prepare.py data/data.xml
    $ dvc run -d src/featurization.py -d data/prepared -o data/features \
          python src/featurization.py data/prepared data/features
    $ dvc run -d src/train.py -d data/features -o model.pkl \
          python src/train.py data/features model.pkl
  27. |DVC PIPELINES REPRODUCIBILITY|

    > Reproduce your project
    $ dvc repro
    > Reproduce
    $ dvc repro train.dvc
    > Version DVC pipeline
    $ git add train.dvc
    $ git commit -m 'Reproduce with dataset update 2019-05-02'
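
Slides 26 and 27 declare each stage with its dependencies (`-d`) and output (`-o`), then reproduce the pipeline. The sketch below illustrates the general idea behind that kind of reproduction: re-run a stage only when the content of a dependency has changed or its output is missing. It is a simplified illustration under stated assumptions, not DVC's implementation; the `Stage` class, the `repro` function, and the toy `upper` and `count` commands are hypothetical.

```python
import hashlib
import os
import tempfile


def checksum(path):
    """Return the SHA-256 hex digest of a file's content."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


class Stage:
    """One pipeline step: dependencies, one output, and a command to run."""

    def __init__(self, name, deps, out, command):
        self.name = name
        self.deps = deps        # input file paths
        self.out = out          # output file path
        self.command = command  # callable(deps, out) that writes `out`
        self.seen = None        # dependency checksums from the last run

    def is_stale(self):
        current = {d: checksum(d) for d in self.deps}
        return self.seen != current or not os.path.exists(self.out)

    def run(self):
        self.command(self.deps, self.out)
        self.seen = {d: checksum(d) for d in self.deps}


def repro(stages):
    """Run stale stages in the given (already topologically sorted) order."""
    executed = []
    for stage in stages:
        if stage.is_stale():
            stage.run()
            executed.append(stage.name)
    return executed


def upper(deps, out):
    """Toy 'prepare' step: upper-case the input text."""
    with open(deps[0]) as src, open(out, "w") as dst:
        dst.write(src.read().upper())


def count(deps, out):
    """Toy 'train' step: write the number of characters in the input."""
    with open(deps[0]) as src, open(out, "w") as dst:
        dst.write(str(len(src.read())))


if __name__ == "__main__":
    work = tempfile.mkdtemp()
    raw, prepared, model = (
        os.path.join(work, n) for n in ("data.txt", "prepared.txt", "model.txt")
    )
    with open(raw, "w") as fh:
        fh.write("abc")
    pipeline = [
        Stage("prepare", [raw], prepared, upper),
        Stage("train", [prepared], model, count),
    ]
    print(repro(pipeline))  # first call: every stage is stale
    print(repro(pipeline))  # second call: nothing changed
```

Comparing content checksums rather than timestamps means a stage is skipped when an upstream re-run produces byte-identical output.
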
  28. |DVC OPTIMIZATION - CHECKOUT SPEED|

    > Copy 50G directory ~1-2 min
    What about DVC?
    $ git checkout update_20190310
    $ time dvc checkout
    real 0m1.596s
    user 0m1.776s
    sys 0m0.528s
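
Slide 28 contrasts a copy that takes one to two minutes with a checkout that finishes in under two seconds, but does not say how. One plausible mechanism, shown here purely as an assumption, is to link workspace files to already-cached content instead of copying bytes. The `checkout` helper below is hypothetical and falls back to a real copy when the filesystem refuses a hard link.

```python
import os
import shutil
import tempfile


def checkout(cached_path, workspace_path):
    """Place cached content in the workspace, linking when possible."""
    if os.path.lexists(workspace_path):
        os.remove(workspace_path)
    try:
        os.link(cached_path, workspace_path)  # hard link: no bytes are copied
        return "hardlink"
    except OSError:
        shutil.copyfile(cached_path, workspace_path)  # fallback: real copy
        return "copy"


if __name__ == "__main__":
    work = tempfile.mkdtemp()
    cached = os.path.join(work, "cache_entry")
    with open(cached, "wb") as fh:
        fh.write(b"model weights")
    print(checkout(cached, os.path.join(work, "model.pkl")))
```

A hard link makes checkout cost independent of file size, at the price that editing the workspace file in place would also change the cached copy, so linked files are best treated as read-only.
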
  29. |DVC OPTIMIZATION - PARTIAL DATA RETRIEVING|

    $ git clone https://github.com/dmpetrov/my-dvc-repo
    $ cd my-dvc-repo
    $ dvc pull train.dvc
    ...
    $ du -sh cnn_model.p
    54M cnn_model.p
  30. |DVC SUMMARY|

    Best practice             Result
    Tracking metrics          -/+
    Versioning ML models      +
    Versioning datasets       +
    Versioning ML pipelines   +
  31. |THE TOOLS SUMMARY|

    Best practice             MLFLOW   Git-LFS   DVC
    Tracking metrics          +        -         -/+
    Versioning ML models      +        +         +
    Versioning datasets       -/+      -/+       +
    Versioning ML pipelines   -        -         +
  32. |THE WORLD IS CHANGING|

    "Data science as different from software as software was different from hardware"
    Nick Elprin, Domino Data Lab

    Hardware    Software    DS/ML
    Waterfall   Agile       ¯\_(ツ)_/¯
  33. |HOW TO DESIGN OUR FUTURE|

    > Think about processes
    > Try / develop new ML tools
  34. |HOW TO DESIGN OUR FUTURE|

    > Think about processes
    > Try / develop new ML tools
    > Share your knowledge
  35. |HOW TO DESIGN OUR FUTURE|

    > Think about processes
    > Try / develop new ML tools
    > Share your knowledge
    > Support open source
  36. |DVC UPCOMING FEATURES|

    > DVC PACKAGES - multi repository projects
    > A datasets repo + ML projects repos