Dmitry Petrov - Machine learning model and dataset versioning practices

Python is a prevalent programming language in the machine learning (ML) community. Many Python engineers and data scientists feel the lack of engineering practices such as versioning of large datasets and ML models, and the lack of reproducibility. This gap is particularly acute for engineers who have just moved into the ML space.

We will discuss current practices for organizing ML projects with a traditional open-source toolset such as Git and Git-LFS, as well as the limitations of this toolset. This will motivate the development of new, ML-specific version control systems.

Data Version Control (DVC.org) is an open-source command-line tool written in Python. We will show how to version datasets of dozens of gigabytes and ML models, how to use your favorite cloud storage (S3, GCS, or a bare-metal SSH server) as a data file backend, and how to embrace the best engineering practices in your ML projects.

https://us.pycon.org/2019/schedule/presentation/176/

PyCon 2019

May 04, 2019

Transcript

  1. > Machine learning model and
    dataset versioning practices
    Dmitry Petrov iterative.AI
    |00|

  2. Co-Founder & CEO > Iterative.AI > San Francisco, USA
    ex-Data Scientist > Microsoft (BingAds) > Seattle, USA
    ex-Head of Lab > St. Petersburg Electrotechnical University > Russia
    |HELLO|
    Dmitry Petrov
    PhD in Computer Science
    Twitter: @FullStackML
    Creator of
    DVC.org project

  3. > Why is ML special?
    > MLFlow
    > Git-LFS
    > DVC
    > Conclusion
    |AGENDA|

  4. > Why is ML special?
    > MLFlow
    > Git-LFS
    > DVC
    > Conclusion
    |01|

  6. |DIFFERENCE 1: ML IS METRICS DRIVEN|

  7. >> EXPERIMENT = CODE + OUTPUTS
    Outputs include metrics and graphs (AUC, etc.)
    |DIFFERENCE 1: ML IS METRICS DRIVEN - EXPERIMENT|

  8. >> EXPERIMENT = CODE + OUTPUTS
    Outputs include metrics and graphs (AUC, etc.)
    |DIFFERENCE 1: ML IS METRICS DRIVEN - TRACKING|
    Solution: metrics tracking

  9. >> EXPERIMENT = CODE + OUTPUTS
    ML model is an output
    |DIFFERENCE 2: ML MODELS CENTRIC|

  10. |DIFFERENCE 2: ML MODELS CENTRIC - TRACKING|
    Solution: ML model versioning
    >> EXPERIMENT = CODE + OUTPUTS
    ML model is an output

  11. Which version did I use
    LAST WEEK?
    |DIFFERENCE 3: DATASETS MANAGEMENT|

  12. |DIFFERENCE 3: DATASETS MANAGEMENT|
    > Size is usually large: gigabytes or more
    > Moving datasets around
    > Datasets evolve. So, versioning is needed
    >> EXPERIMENT = CODE + OUTPUTS + DATASET
    Source code, Datasets, ML models

  13. |DIFFERENCE 3: DATASETS|
    >> EXPERIMENT = CODE + OUTPUTS + DATASET
    Solution:
    > Custom data syncing scripts (see the sketch below)
    > Connect data to code
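
    A minimal sketch of such a custom syncing script, assuming the AWS CLI
    and a hypothetical S3 bucket named my-datasets (bucket, paths, and file
    names are placeholders, not part of the talk):

    $ aws s3 sync s3://my-datasets/images/ data/images/  # pull the current dataset
    $ python train.py data/images/ model.p               # connect data to code
    $ aws s3 cp model.p s3://my-datasets/models/model_2019-05-04.p  # ad-hoc "versioning" by file name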

  14. |DIFFERENCE 4: ML PIPELINES|
    Why do we need ML Pipelines?
    > Manage complexity - separate steps
    > Optimize execution - cache steps
    > Reusability of code
    > Scale team - steps ownership

  15. |DIFFERENCE 4: ML PIPELINES - AN EXAMPLE|

  16. |DIFFERENCE 4: ML PIPELINES - TYPES|
    ML Pipeline types:
    > Data engineering (Airflow) for reliability
    > ML pipelines for fast iterations

  17. |DIFFERENCE 4: ML PIPELINES - TYPES|
    ML Pipeline types:
    > Data engineering (Airflow) for reliability
    > ML pipelines for fast iterations

  18. |BEST PRACTICES SUMMARY|
    Best practice
    Tracking metrics
    Versioning ML models
    Versioning datasets
    Versioning ML pipelines

  19. |WHEN THE BEST PRACTICES ARE NEEDED|
    To be more efficient:
    > Team work
    > Reuse previous results
    > Predictable process

  20. > Why is ML special?
    > MLFlow
    > Git-LFS
    > DVC
    > Conclusion
    |02|

  21. Platform for the machine learning lifecycle
    > Tracking
    > Project
    > Models
    $ pip install mlflow
    |MLFLOW INTRO|

  22. from mlflow import log_metric, log_param, log_artifact
    log_param("lr", 0.03)          # hyperparameter of this run
    log_metric("loss", curr_loss)  # curr_loss comes from the training loop
    log_artifact("model.p")        # serialized model file
    |MLFLOW TRACKING|
    $ mlflow ui

  23. |MLFLOW TRACKING UI|
    from: mlflow.org

  24. |MLFLOW SUMMARY|
    Best practice Result
    Tracking metrics +
    Versioning ML models +
    Versioning datasets -/+
    Versioning ML pipelines -

  25. > Why is ML special?
    > MLFlow
    > Git-LFS Git Large File Storage
    > DVC
    > Conclusion
    |03|

  26. > Install
    $ brew install git-lfs
    $ git lfs install
    > Specify data-files type in a Git repository
    $ git lfs track '*.p'
    $ git add .gitattributes
    |GIT-LFS INTRO|

  27. $ python mytrain.py # your code generates mymodel.p
    $ git add mytrain.py mymodel.p
    $ git commit -m 'Decay was added'
    $ git push
    Uploading LFS objects: 100% (1/1), 56 MB | 3.2 MB/s, done
    |GIT-LFS ADD DATA FILES|

  28. $ git clone https://github.com/dmpetrov/my-lfs-repo
    $ cd my-lfs-repo
    $ du -sh mymodel.p # data file does not contain data yet
    4.0K mymodel.p
    $ git pull
    Downloading LFS objects: 75% (3/4), 44 MB | 4.5 MB/s
    |GIT-LFS RETRIEVE DATA FILES|
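
    The 4.0K placeholder above is a Git-LFS pointer file. Its content looks
    roughly like this (hash and size are made up for illustration):

    $ cat mymodel.p
    version https://git-lfs.github.com/spec/v1
    oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
    size 58720256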

  29. > PROS
    > Simple, like Git
    > CONS
    > Limited by data size: < 2 GB (< 500 MB even better)
    > Not every Git server supports Git-LFS
    > Not ML / data science specific
    |GIT-LFS PROS/CONS|

  30. |GIT-LFS SUMMARY|
    Best practice Result
    Tracking metrics -
    Versioning ML models +
    Versioning datasets -/+
    Versioning ML pipelines -

  31. > Why is ML special?
    > MLFlow
    > Git-LFS
    > DVC Data Version Control
    > Conclusion
    |04|

  32. > DATASETS VERSIONING
    > Including large ones: 10-100 GB
    > ML PIPELINE VERSIONING
    > All intermediate datasets / feature sets
    > MULTIPLE CLOUD SUPPORT
    > AWS S3, Google Cloud, Azure, bare-metal SSH server
    |DVC PRINCIPLES|

  33. |DVC INTRO|
    Website: http://DVC.org
    > Install
    $ pip install dvc
    $ dvc init
    > Git-like tool: no infrastructure is required
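
    Before dvc push (next slides) can upload anything, a storage backend has
    to be configured. A minimal sketch, assuming a hypothetical S3 bucket
    named mybucket; any backend from the DVC PRINCIPLES slide works the same way:

    $ dvc remote add -d myremote s3://mybucket/dvcstore  # -d makes it the default remote
    $ git add .dvc/config                                # the remote lives in the repo config
    $ git commit -m 'Configure DVC remote storage'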

  35. |DVC DATASETS VERSIONING|
    > Push data to storage
    $ dvc add data.xml
    $ dvc push
    > Push meta information to Git server
    $ git add .gitignore data.xml.dvc
    $ git commit -m "add source data to DVC"
    $ git push

  36. $ git clone https://github.com/dmpetrov/my-dvc-repo
    $ cd my-dvc-repo
    $ dvc pull
    ...
    $ du -sh data.xml
    7G data.xml
    |DVC RETRIEVE DATASETS|

  37. |DVC ML PIPELINES VERSIONING|
    $ dvc add data/data.xml
    $ dvc run -d src/prepare.py -d data/data.xml -o data/prepared \
    python src/prepare.py data/data.xml
    $ dvc run -d src/featurization.py -d data/prepared -o data/features \
    python src/featurization.py data/prepared data/features
    $ dvc run -d src/train.py -d data/features -o model.pkl \
    python src/train.py data/features model.pkl
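
    Each dvc run call writes a small stage file that Git can version. The next
    slide refers to a stage named train.dvc; one way to get that name is the -f
    option of dvc run (a sketch of the last stage above, same scripts and paths):

    $ dvc run -f train.dvc -d src/train.py -d data/features -o model.pkl \
        python src/train.py data/features model.pkl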

  38. |DVC PIPELINES REPRODUCIBILITY|
    > Reproduce your project
    $ dvc repro
    > Reproduce a specific stage
    $ dvc repro train.dvc
    > Version DVC pipeline
    $ git add train.dvc
    $ git commit -m 'Reproduce with dataset update 2019-05-02'
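
    A typical iteration with this setup might look as follows (the code change
    is hypothetical; the tag vgg16_exp2 is the one checked out on the next slide):

    # edit src/train.py, then:
    $ dvc repro train.dvc          # only the affected stages are re-executed
    $ git add train.dvc
    $ git commit -m 'Switch to VGG16'
    $ git tag vgg16_exp2           # label the experiment
    $ dvc push                     # upload the new outputs to remote storage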

  39. > Checkout data
    $ git checkout vgg16_exp2
    $ dvc checkout
    |DVC OPTIMIZATION - CHECKOUT|

  40. > Copying a 50 GB directory takes ~1-2 min
    What about DVC?
    |DVC OPTIMIZATION - CHECKOUT SPEED|

  41. > Copying a 50 GB directory takes ~1-2 min
    What about DVC?
    |DVC OPTIMIZATION - CHECKOUT SPEED|
    $ git checkout update_20190310
    $ time dvc checkout
    real 0m1.596s
    user 0m1.776s
    sys 0m0.528s
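
    The checkout is fast because DVC links files from its local cache instead
    of copying them. The link strategy is configurable; a sketch, assuming a
    file system where hard links are available:

    $ dvc config cache.type hardlink  # reflink, symlink, or copy are also possible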

  42. $ git clone https://github.com/dmpetrov/my-dvc-repo
    $ cd my-dvc-repo
    $ dvc pull train.dvc
    ...
    $ du -sh cnn_model.p
    54M cnn_model.p
    |DVC OPTIMIZATION - PARTIAL DATA RETRIEVING|

  43. |DVC SUMMARY|
    Best practice Result
    Tracking metrics -/+
    Versioning ML models +
    Versioning datasets +
    Versioning ML pipelines +

  44. > Best practices for ML
    > Git-LFS
    > MLFlow
    > DVC
    > Conclusion
    |05|

  45. Best practice            MLflow   Git-LFS   DVC
    Tracking metrics             +         -      -/+
    Versioning ML models         +         +       +
    Versioning datasets         -/+       -/+      +
    Versioning ML pipelines      -         -       +
    |THE TOOLS SUMMARY|

  46. Data science is as different from software
    as software was different from hardware
    Nick Elprin, Domino Data Lab
    |THE WORLD IS CHANGING|
    Hardware  -> Waterfall
    Software  -> Agile
    DS/ML     -> ¯\_(ツ)_/¯

  47. |HOW TO DESIGN OUR FUTURE|

  48. > Think about processes
    |HOW TO DESIGN OUR FUTURE|

  49. > Think about processes
    > Try / develop new ML tools
    |HOW TO DESIGN OUR FUTURE|

  50. > Think about processes
    > Try / develop new ML tools
    > Share your knowledge
    |HOW TO DESIGN OUR FUTURE|

  51. > Think about processes
    > Try / develop new ML tools
    > Share your knowledge
    > Support open source
    |HOW TO DESIGN OUR FUTURE|

  52. > Questions
    Twitter @FullStackML
    Email [email protected]
    > Actions
    Visit dvc.org
    Star github.com/iterative/dvc
    |THANK YOU|

  54. > DVC PACKAGES - multi-repository projects
    > A datasets repo + ML project repos
    |DVC UPCOMING FEATURES|
