Dmitry Petrov - Machine learning model and dataset versioning practices

|00| Machine learning model and dataset versioning practices
Dmitry Petrov, Iterative.AI

|HELLO| Dmitry Petrov
> PhD in Computer Science
> Twitter: @FullStackML
> Creator of the DVC.org project
> Co-Founder & CEO, Iterative.AI (San Francisco, USA)
> ex-Data Scientist, Microsoft (BingAds) (Seattle, USA)
> ex-Head of Lab, St. Petersburg Electrotechnical University (Russia)

|AGENDA|
> Why is ML special?
> MLFlow
> Git-LFS
> DVC
> Conclusion

|01|
> Why is ML special?
> MLFlow
> Git-LFS
> DVC
> Conclusion

|DIFFERENCE 1: ML IS METRICS DRIVEN|

|DIFFERENCE 1: ML IS METRICS DRIVEN - EXPERIMENT|
>> EXPERIMENT = CODE + OUTPUTS
> Outputs include metrics and graphs: AUC, etc.

|DIFFERENCE 1: ML IS METRICS DRIVEN - TRACKING|
>> EXPERIMENT = CODE + OUTPUTS
> Outputs include metrics and graphs: AUC, etc.
> Solution: metrics tracking

|DIFFERENCE 2: ML MODELS CENTRIC|
>> EXPERIMENT = CODE + OUTPUTS
> The ML model is an output

|DIFFERENCE 2: ML MODELS CENTRIC - TRACKING|
>> EXPERIMENT = CODE + OUTPUTS
> The ML model is an output
> Solution: ML model versioning

|DIFFERENCE 3: DATASETS MANAGEMENT|
> Which version did I use LAST WEEK?

|DIFFERENCE 3: DATASETS MANAGEMENT|
>> EXPERIMENT = CODE + OUTPUTS + DATASET
> Source code, datasets, ML models
> Size is usually large (GB)
> Moving datasets around
> Datasets evolve, so versioning is needed

|DIFFERENCE 3: DATASETS|
>> EXPERIMENT = CODE + OUTPUTS + DATASET
Solution:
> Custom data syncing scripts
> Connect data to code
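
One way to read "connect data to code" is to commit a small pointer file alongside the source, so that each code revision records exactly which dataset bytes it was run against. The sketch below is only an illustration under that assumption: the `.ptr` suffix, the JSON layout, and the function names `write_pointer` and `is_current` are hypothetical and do not reproduce the format of any specific tool.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path: Path) -> Path:
    """Write a small JSON pointer next to the dataset (hypothetical format).

    The pointer, not the dataset itself, is what would be committed with the code.
    """
    pointer = {
        "path": data_path.name,
        "sha256": file_sha256(data_path),
        "size": data_path.stat().st_size,
    }
    pointer_path = data_path.with_name(data_path.name + ".ptr")
    pointer_path.write_text(json.dumps(pointer, indent=2) + "\n")
    return pointer_path


def is_current(pointer_path: Path) -> bool:
    """Check whether the dataset on disk still matches the recorded pointer."""
    pointer = json.loads(pointer_path.read_text())
    data_path = pointer_path.with_name(pointer["path"])
    return data_path.exists() and file_sha256(data_path) == pointer["sha256"]
```

Because the pointer is a few hundred bytes regardless of dataset size, it can live in Git while the data itself is synced to separate storage.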

|DIFFERENCE 4: ML PIPELINES|
Why do we need ML pipelines?
> Manage complexity - separate steps
> Optimize execution - cache steps
> Reuse code
> Scale the team - step ownership
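
The "cache steps" point can be sketched as follows: a step is skipped when a fingerprint of its inputs matches the one recorded on its previous run. This is a minimal illustration, not any tool's implementation; `fingerprint`, `run_step`, and the JSON state file are hypothetical names, and a real system would also check that the step's outputs still exist.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Sequence


def fingerprint(deps: Sequence[Path]) -> str:
    """Combine the names and contents of all dependency files into one hash."""
    digest = hashlib.sha256()
    for dep in sorted(deps):
        digest.update(dep.name.encode())
        digest.update(dep.read_bytes())
    return digest.hexdigest()


def run_step(name: str, deps: Sequence[Path], action: Callable[[], None],
             state_file: Path) -> bool:
    """Run `action` only if the step's dependencies changed since its last run.

    Returns True if the step ran and False if the previous result was reused.
    """
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    current = fingerprint(deps)
    if state.get(name) == current:
        return False  # inputs unchanged: skip the step
    action()
    state[name] = current
    state_file.write_text(json.dumps(state))
    return True
```

Calling `run_step` twice with unchanged inputs runs the action once; editing any dependency file makes the next call run it again.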

|DIFFERENCE 4: ML PIPELINES - AN EXAMPLE|

|DIFFERENCE 4: ML PIPELINES - TYPES|
ML pipeline types:
> Data engineering pipelines (AirFlow) for reliability
> ML pipelines for fast iterations

|BEST PRACTICES SUMMARY|
Best practice
> Tracking metrics
> Versioning ML models
> Versioning datasets
> Versioning ML pipelines

|WHEN THE BEST PRACTICES ARE NEEDED|
To be more efficient:
> Team work
> Reuse previous results
> Predictable process

|MLFLOW INTRO|
Platform for the machine learning lifecycle
> Tracking
> Projects
> Models
$ pip install mlflow

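|MLFLOW TRACKING|
from mlflow import log_metric, log_param, log_artifact
log_param("lr", 0.03)            # hyperparameter
log_metric("loss", curr_loss)    # curr_loss: loss value from your training loop
log_artifact("model.p")          # model.p: model file saved by your training code
$ mlflow ui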

|MLFLOW TRACKING UI| from: mlflow.org

|MLFLOW SUMMARY|
Best practice            Result
Tracking metrics         +
Versioning ML models     +
Versioning datasets      -/+
Versioning ML pipelines  -

|03|
> Why is ML special?
> MLFlow
> Git-LFS (Git Large File Storage)
> DVC
> Conclusion

|GIT-LFS INTRO|
> Install
$ brew install git-lfs
$ git lfs install
> Specify data file types in a Git repository
$ git lfs track '*.p'
$ git add .gitattributes

|GIT-LFS ADD DATA FILES|
$ python mytrain.py    # your code generates mymodel.p
$ git add mytrain.py mymodel.p
$ git commit -m 'Decay was added'
$ git push
Uploading LFS objects: 100% (1/1), 56 MB | 3.2 MB/s, done

|GIT-LFS RETRIEVE DATA FILES|
$ git clone https://github.com/dmpetrov/my-lfs-repo
$ cd my-lfs-repo
$ du -sh mymodel.p    # data file does not contain data yet
4.0K mymodel.p
$ git pull
Downloading LFS objects: 75% (3/4), 44 MB | 4.5 MB/s
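
The 4.0K file in the listing above is consistent with Git LFS keeping only a small text pointer in the repository until the real object is downloaded. The sketch below builds the text of such a pointer following the published Git LFS pointer layout (`version`, `oid sha256:`, `size`); the function name `lfs_pointer` and the stand-in bytes are illustrative assumptions, not part of the talk.

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Build the text of a Git LFS pointer file for the given file content."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


# Stand-in bytes for a model file; a real model would be tens of megabytes.
model_bytes = b"\x00" * 1024
pointer_text = lfs_pointer(model_bytes)
print(pointer_text, end="")
print(f"pointer is {len(pointer_text)} bytes for {len(model_bytes)} bytes of data")
```

The pointer stays the same size no matter how large the tracked file is, which is why the checked-out placeholder occupies a single filesystem block.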

|GIT-LFS PROS/CONS|
> PROS
  > Simple, like Git
> CONS
  > Limited by data size: <2 GB, <500 MB even better
  > Not every Git server supports Git-LFS
  > Not specific to ML/Data Science

|GIT-LFS SUMMARY|
Best practice            Result
Tracking metrics         -
Versioning ML models     +
Versioning datasets      -/+
Versioning ML pipelines  -

|04|
> DVC (Data Version Control)
> Conclusion

|DVC PRINCIPLES|
> DATASETS VERSIONING
  > Including large ones: 10-100 GB
> ML PIPELINE VERSIONING
  > All intermediate datasets/feature sets
> MULTIPLE CLOUD SUPPORT
  > AWS S3, Google Cloud, Azure, bare metal SSH server

|DVC INTRO|
Website: http://DVC.org
> Install
$ pip install dvc
$ dvc init
> Git-like tool: no infrastructure is required

|DVC DATASETS VERSIONING|
> Push data to storage
$ dvc add data.xml
$ dvc push
> Push meta information to Git server
$ git add .gitignore data.xml.dvc
$ git commit -m "add source data to DVC"
$ git push

|DVC RETRIEVE DATASETS|
$ git clone https://github.com/dmpetrov/my-dvc-repo
$ cd my-dvc-repo
$ dvc pull
...
$ du -sh data.xml
7G data.xml

|DVC ML PIPELINES VERSIONING|
$ dvc add data/data.xml
$ dvc run -d src/prepare.py -d data/data.xml -o data/prepared \
      python src/prepare.py data/data.xml
$ dvc run -d src/featurization.py -d data/prepared -o data/features \
      python src/featurization.py data/prepared data/features
$ dvc run -d src/train.py -d data/features -o model.pkl \
      python src/train.py data/features model.pkl
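
The `-d` (dependency) and `-o` (output) flags in the three commands above implicitly define a dependency graph: a stage depends on whichever stage produces one of its inputs. The sketch below derives that graph and an execution order from the same paths. It is an illustration of the idea only, not how DVC is implemented; the stage names `prepare`, `featurize`, and `train` and the helper `downstream` are hypothetical. It assumes Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter

# Dependencies (-d) and outputs (-o) copied from the three commands above.
# The stage names are hypothetical labels chosen for this sketch.
stages = {
    "prepare": {"deps": ["src/prepare.py", "data/data.xml"],
                "outs": ["data/prepared"]},
    "featurize": {"deps": ["src/featurization.py", "data/prepared"],
                  "outs": ["data/features"]},
    "train": {"deps": ["src/train.py", "data/features"],
              "outs": ["model.pkl"]},
}

# Map every output path to the stage that produces it.
producer = {out: name for name, stage in stages.items() for out in stage["outs"]}

# A stage depends on another stage when it consumes one of that stage's outputs.
graph = {
    name: {producer[dep] for dep in stage["deps"] if dep in producer}
    for name, stage in stages.items()
}

order = list(TopologicalSorter(graph).static_order())
print(order)  # ['prepare', 'featurize', 'train']


def downstream(changed_path: str) -> list[str]:
    """Return the stages, in execution order, that must rerun if a path changes."""
    stale: set[str] = set()
    for name in order:
        if changed_path in stages[name]["deps"] or graph[name] & stale:
            stale.add(name)
    return [name for name in order if name in stale]
```

For example, `downstream("src/featurization.py")` returns `['featurize', 'train']`: the preparation stage is untouched, so its cached output can be reused.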

|DVC PIPELINES REPRODUCIBILITY|
> Reproduce your project
$ dvc repro
> Reproduce a specific stage
$ dvc repro train.dvc
> Version DVC pipeline
$ git add train.dvc
$ git commit -m 'Reproduce with dataset update 2019-05-02'

|DVC OPTIMIZATION - CHECKOUT|
> Check out data
$ git checkout vgg16_exp2
$ dvc checkout

|DVC OPTIMIZATION - CHECKOUT SPEED|
> Copy 50G directory: ~1-2 min. What about DVC?
$ git checkout update_20190310
$ time dvc checkout
real 0m1.596s
user 0m1.776s
sys 0m0.528s
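
The slide does not state why the checkout takes seconds rather than minutes. One plausible explanation, offered here as an assumption, is that file versions sit in a content-addressed cache and are linked into the workspace instead of being copied. The sketch below shows that idea with hard links; `add_to_cache` and `checkout` are hypothetical helpers, and hard links require the cache and workspace to be on the same filesystem.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path


def add_to_cache(src: Path, cache_dir: Path) -> str:
    """Store a file in the cache under its content hash and return the hash."""
    digest = hashlib.sha256(src.read_bytes()).hexdigest()
    cached = cache_dir / digest
    if not cached.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, cached)
    return digest


def checkout(digest: str, cache_dir: Path, dest: Path) -> None:
    """Materialize a cached version at `dest` with a hard link, not a copy."""
    if dest.exists():
        dest.unlink()
    os.link(cache_dir / digest, dest)  # no data is copied; both names share the file


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cache = root / "cache"
    data = root / "data.bin"

    data.write_bytes(b"v1" * 1000)
    v1 = add_to_cache(data, cache)
    data.write_bytes(b"v2" * 1000)
    v2 = add_to_cache(data, cache)

    checkout(v1, cache, data)  # switch the workspace back to version 1
    restored = data.read_bytes()
    same_file = os.path.samefile(data, cache / v1)
```

Creating a link costs the same whether the file is 2 KB or 50 GB. The trade-off is that a hard-linked workspace file shares its bytes with the cache, so it should be treated as read-only.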

|DVC OPTIMIZATION - PARTIAL DATA RETRIEVING|
$ git clone https://github.com/dmpetrov/my-dvc-repo
$ cd my-dvc-repo
$ dvc pull train.dvc
...
$ du -sh cnn_model.p
54M cnn_model.p

|DVC SUMMARY|
Best practice            Result
Tracking metrics         -/+
Versioning ML models     +
Versioning datasets      +
Versioning ML pipelines  +

> Best practices for ML
> Git-LFS
> MLFlow

|THE TOOLS SUMMARY|
Best practice            MLFLOW  Git-LFS  DVC
Tracking metrics         +       -        -/+
Versioning ML models     +       +        +
Versioning datasets      -/+     -/+      +
Versioning ML pipelines  -       -        +

|THE WORLD IS CHANGING|
"Data science as different from software as software was different from hardware"
Nick Elprin, Domino Data Lab
Hardware   Software  DS/ML
Waterfall  Agile     ¯\_(ツ)_/¯

|HOW TO DESIGN OUR FUTURE|
> Think about processes
> Try / develop new ML tools
> Share your knowledge
> Support open source

|THANK YOU|
> Questions
  > Twitter: @FullStackML
  > Email: dmitry@iterative.ai
> Actions
  > Visit dvc.org
  > Star github.com/iterative/dvc

|DVC UPCOMING FEATURES|
> DVC PACKAGES - multi-repository projects
> A datasets repo + ML project repos


More Decks by PyCon 2019
