Dmitry Petrov - Machine learning model and dataset versioning practices

Python is a prevalent programming language in the machine learning (ML) community. Many Python engineers and data scientists miss established engineering practices such as versioning large datasets and ML models, and they struggle with reproducibility. The gap is particularly acute for engineers who have just moved into the ML space.

We will discuss current practices for organizing ML projects with traditional open-source tools such as Git and Git-LFS, as well as the limitations of this toolset. These limitations motivate the development of new ML-specific version control systems.

Data Version Control (DVC, dvc.org) is an open-source command-line tool written in Python. We will show how to version datasets containing dozens of gigabytes of data, how to version ML models, how to use your favorite cloud storage (S3, GCS, or a bare-metal SSH server) as a data file backend, and how to embrace the best engineering practices in your ML projects.
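The general idea of versioning a large data file next to code can be sketched as follows: the data itself goes into a content-addressed cache, and only a small pointer file (which is what would be committed to Git) records its checksum. This is a simplified illustration, not DVC's actual implementation; the `.ptr` suffix, the cache layout, the SHA-256 hash, and all function names are assumptions made for the example.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_to_cache(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into the cache under its checksum; write a pointer file."""
    checksum = file_checksum(data_file)
    cached = cache_dir / checksum[:2] / checksum[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        shutil.copy2(data_file, cached)
    # Hypothetical pointer format: small enough to commit to Git.
    pointer = data_file.with_name(data_file.name + ".ptr")
    pointer.write_text(json.dumps({"checksum": checksum, "path": data_file.name}))
    return pointer


def checkout(pointer: Path, cache_dir: Path) -> Path:
    """Restore the data file described by a pointer file from the cache."""
    meta = json.loads(pointer.read_text())
    checksum = meta["checksum"]
    target = pointer.with_name(meta["path"])
    shutil.copy2(cache_dir / checksum[:2] / checksum[2:], target)
    return target


def demo() -> bool:
    """Add a file, delete the working copy, and restore it from the cache."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache = root / "cache"
        data = root / "data.xml"
        data.write_bytes(b"<rows/>" * 1000)
        original = data.read_bytes()
        pointer = add_to_cache(data, cache)
        data.unlink()
        restored = checkout(pointer, cache)
        return restored.read_bytes() == original


if __name__ == "__main__":
    print(demo())
```

Because the pointer file is tiny, switching between dataset versions reduces to checking out a different pointer and restoring the matching cache entry.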

https://us.pycon.org/2019/schedule/presentation/176/

PyCon 2019

May 04, 2019

Transcript

  1. > Machine learning model and dataset versioning practices Dmitry Petrov

    iterative.AI |00|
  2. Co-Founder & CEO > Iterative.AI > San Francisco, USA ex-Data

    Scientist > Microsoft (BingAds) > Seattle, USA ex-Head of Lab > St. Petersburg Electrotechnical University > Russia |HELLO| Dmitry Petrov PhD in Computer Science Twitter: @FullStackML Creator of DVC.org project
  3. > Why ML is special? > MLFlow > Git-LFS >

    DVC > Conclusion |AGENDA|
  4. > Why ML is special? > MLFlow > Git-LFS >

    DVC > Conclusion |01|
  5. None
  6. |DIFFERENCE 1: ML IS METRICS DRIVEN|

  7. >> EXPERIMENT = CODE + OUTPUTS Outputs include metrics and

    graphs AUC, etc. |DIFFERENCE 1: ML IS METRICS DRIVEN - EXPERIMENT|
  8. >> EXPERIMENT = CODE + OUTPUTS Outputs include metrics and

    graphs AUC, etc. |DIFFERENCE 1: ML IS METRICS DRIVEN - TRACKING| Solution: metrics tracking
  9. >> EXPERIMENT = CODE + OUTPUTS ML model is an

    output |DIFFERENCE 2: ML MODELS CENTRIC|
  10. |DIFFERENCE 2: ML MODELS CENTRIC - TRACKING| Solution: ML model

    versioning >> EXPERIMENT = CODE + OUTPUTS ML model is an output
  11. which version I used LAST WEEK? |DIFFERENCE 3: DATASETS MANAGEMENT|

  12. |DIFFERENCE 3: DATASETS MANAGEMENT| > Size is usually large >

    Gb > Moving datasets around > Datasets evolve. So, versioning is needed >> EXPERIMENT = CODE + OUTPUTS + DATASET Source code, Datasets, ML models
  13. |DIFFERENCE 3: DATASETS| >> EXPERIMENT = CODE + OUTPUTS +

    DATASET Solution: > Custom data syncing scripts > Connect data to code
  14. |DIFFERENCE 4: ML PIPELINES| Why do we need ML Pipelines?

    > Manage complexity - separate steps > Optimize execution - cache steps > Reusability of code > Scale team - steps ownership
  15. |DIFFERENCE 4: ML PIPELINES - AN EXAMPLE|

  16. |DIFFERENCE 4: ML PIPELINES - TYPES| ML Pipeline types: >

    Data engineering (AirFlow) for reliability > ML pipelines for fast iterations
  17. |DIFFERENCE 4: ML PIPELINES - TYPES| ML Pipeline types: >

    Data engineering (AirFlow) for reliability > ML pipelines for fast iterations
  18. |BEST PRACTICES SUMMARY| Best practice Tracking metrics Versioning ML models

    Versioning datasets Versioning ML pipelines
  19. |WHEN THE BEST PRACTICES ARE NEEDED| To be more efficient:

    > Team work > Reuse previous results > Predictable process
  20. > Why ML is special? > MLFlow > Git-LFS >

    DVC > Conclusion |02|
  21. Platform for the machine learning lifecycle > Tracking > Project

    > Models $ pip install mlflow |MLFLOW INTRO|
  22. |MLFLOW TRACKING|

    from mlflow import log_metric, log_param, log_artifact
    log_param("lr", 0.03)
    log_metric("loss", curr_loss)    # curr_loss comes from your training loop
    log_artifact("model.p")

    $ mlflow ui
  23. |MLFLOW TRACKING UI| from: mlflow.org

  24. |MLFLOW SUMMARY|

    Best practice             Result
    Tracking metrics          +
    Versioning ML models      +
    Versioning datasets       -/+
    Versioning ML pipelines   -
  25. > Why ML is special? > MLFlow > Git-LFS Git

    Large File Storage > DVC > Conclusion |03|
  26. |GIT-LFS INTRO|

    > Install
    $ brew install git-lfs
    $ git lfs install

    > Specify which data file types to track in a Git repository
    $ git lfs track '*.p'
    $ git add .gitattributes
  27. |GIT-LFS ADD DATA FILES|

    $ python mytrain.py    # your code generates mymodel.p
    $ git add mytrain.py mymodel.p
    $ git commit -m 'Decay was added'
    $ git push
    Uploading LFS objects: 100% (1/1), 56 MB | 3.2 MB/s, done
  28. |GIT-LFS RETRIEVE DATA FILES|

    $ git clone https://github.com/dmpetrov/my-lfs-repo
    $ cd my-lfs-repo
    $ du -sh mymodel.p    # data file does not contain data yet
    4.0K mymodel.p
    $ git pull
    Downloading LFS objects: 75% (3/4), 44 MB | 4.5 MB/s
  29. |GIT-LFS PROS/CONS|

    > PROS
      > Simple, like Git
    > CONS
      > Limited by data size: <2 GB, <500 MB even better
      > Not every Git server supports Git-LFS
      > Nothing specific to ML / data science
  30. |GIT-LFS SUMMARY|

    Best practice             Result
    Tracking metrics          -
    Versioning ML models      +
    Versioning datasets       -/+
    Versioning ML pipelines   -
  31. > Why ML is special? > MLFlow > Git-LFS >

    DVC Data Version Control > Conclusion |04|
  32. |DVC PRINCIPLES|

    > DATASETS VERSIONING
      > Including large ones: 10-100 GB
    > ML PIPELINE VERSIONING
      > All intermediate datasets / feature sets
    > MULTIPLE CLOUD SUPPORT
      > AWS S3, Google Cloud, Azure, bare-metal SSH server
  33. |DVC INTRO| Website: http://DVC.org

    > Install
    $ pip install dvc
    $ dvc init

    > Git-like tool: no infrastructure is required
  34. None
  35. |DVC DATASETS VERSIONING|

    > Push data to storage
    $ dvc add data.xml
    $ dvc push

    > Push meta information to Git server
    $ git add .gitignore data.xml.dvc
    $ git commit -m "add source data to DVC"
    $ git push
  36. |DVC RETRIEVE DATASETS|

    $ git clone https://github.com/dmpetrov/my-dvc-repo
    $ cd my-dvc-repo
    $ dvc pull
    ...
    $ du -sh data.xml
    7G data.xml
  37. |DVC ML PIPELINES VERSIONING|

    $ dvc add data/data.xml
    $ dvc run -d src/prepare.py -d data/data.xml -o data/prepared \
          python src/prepare.py data/data.xml
    $ dvc run -d src/featurization.py -d data/prepared -o data/features \
          python src/featurization.py data/prepared data/features
    $ dvc run -d src/train.py -d data/features -o model.pkl \
          python src/train.py data/features model.pkl
  38. |DVC PIPELINES REPRODUCIBILITY|

    > Reproduce your project
    $ dvc repro

    > Reproduce a specific stage
    $ dvc repro train.dvc

    > Version DVC pipeline
    $ git add train.dvc
    $ git commit -m 'Reproduce with dataset update 2019-05-02'
  39. |DVC OPTIMIZATION - CHECKOUT|

    > Checkout data
    $ git checkout vgg16_exp2
    $ dvc checkout
  40. > Copy 50G directory ~1-2 min What about DVC? |DVC

    OPTIMIZATION - CHECKOUT SPEED|
  41. |DVC OPTIMIZATION - CHECKOUT SPEED|

    > Copy 50G directory: ~1-2 min. What about DVC?
    $ git checkout update_20190310
    $ time dvc checkout
    real 0m1.596s
    user 0m1.776s
    sys 0m0.528s
  42. |DVC OPTIMIZATION - PARTIAL DATA RETRIEVING|

    $ git clone https://github.com/dmpetrov/my-dvc-repo
    $ cd my-dvc-repo
    $ dvc pull train.dvc
    ...
    $ du -sh cnn_model.p
    54M cnn_model.p
  43. |DVC SUMMARY|

    Best practice             Result
    Tracking metrics          -/+
    Versioning ML models      +
    Versioning datasets       +
    Versioning ML pipelines   +
  44. > Best practices for ML > Git-LFS > MLFlow >

    DVC > Conclusion |05|
  45. |THE TOOLS SUMMARY|

    Best practice             MLFLOW   Git-LFS   DVC
    Tracking metrics          +        -         -/+
    Versioning ML models      +        +         +
    Versioning datasets       -/+      -/+       +
    Versioning ML pipelines   -        -         +
  46. |THE WORLD IS CHANGING|

    "Data science as different from software as software was different from hardware"
    Nick Elprin, Domino Data Lab

    Hardware    Software    DS/ML
    Waterfall   Agile       ¯\_(ツ)_/¯
  47. |HOW TO DESIGN OUR FUTURE|

  48. > Think about processes |HOW TO DESIGN OUR FUTURE|

  49. > Think about processes > Try / develop new ML

    tools |HOW TO DESIGN OUR FUTURE|
  50. > Think about processes > Try / develop new ML

    tools > Share your knowledge |HOW TO DESIGN OUR FUTURE|
  51. > Think about processes > Try / develop new ML

    tools > Share your knowledge > Support open source |HOW TO DESIGN OUR FUTURE|
  52. > Questions Twitter @FullStackML Email dmitry@iterative.ai > Actions Visit dvc.org

    Star github.com/iterative/dvc |THANK YOU|
  53. None
  54. > DVC PACKAGES - multi repository projects > A datasets

    repo + ML projects repos |DVC UPCOMING FEATURES|
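The pipeline slides above (stages declared with dependencies and outputs, then `dvc repro`) rely on one core idea: a stage is re-run only when one of its dependencies has changed. The following is a minimal sketch of that idea under stated assumptions, not DVC's implementation; the `Stage` class, the `.state.json` files, and the toy "prepare" and "train" commands are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def checksum(path: Path) -> str:
    """Return a content hash used to detect changed dependencies."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


class Stage:
    """One pipeline step: dependencies in, a single output file out."""

    def __init__(self, name: str, deps: List[Path], out: Path,
                 command: Callable[[], object], state_dir: Path) -> None:
        self.name = name
        self.deps = deps
        self.out = out
        self.command = command
        # Hypothetical state file remembering the dependency hashes of the last run.
        self.state_file = state_dir / f"{name}.state.json"

    def _dep_checksums(self) -> Dict[str, str]:
        return {str(dep): checksum(dep) for dep in self.deps}

    def is_stale(self) -> bool:
        """A stage is stale if it never ran or any dependency changed."""
        if not self.out.exists() or not self.state_file.exists():
            return True
        return json.loads(self.state_file.read_text()) != self._dep_checksums()

    def reproduce(self) -> bool:
        """Run the stage if stale; return True when it actually ran."""
        if not self.is_stale():
            return False
        self.command()
        self.state_file.write_text(json.dumps(self._dep_checksums()))
        return True


def run_demo() -> List[List[bool]]:
    """Run a two-stage pipeline three times and record which stages ran."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        raw = root / "raw.txt"
        prepared = root / "prepared.txt"
        model = root / "model.txt"
        raw.write_text("a b c")
        pipeline = [
            Stage("prepare", [raw], prepared,
                  lambda: prepared.write_text(raw.read_text().upper()), root),
            Stage("train", [prepared], model,
                  lambda: model.write_text(str(len(prepared.read_text()))), root),
        ]
        history = [[stage.reproduce() for stage in pipeline]]      # first run
        history.append([stage.reproduce() for stage in pipeline])  # nothing changed
        raw.write_text("a b c d")                                  # dataset update
        history.append([stage.reproduce() for stage in pipeline])
        return history


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the stages are listed in dependency order by hand; a real tool would derive the order from the declared dependencies and outputs.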