> Machine learning model and dataset versioning practices
Dmitry Petrov, Iterative.AI
|00|
Slide 2
Slide 2 text
Co-Founder & CEO > Iterative.AI > San Francisco, USA
ex-Data Scientist > Microsoft (BingAds) > Seattle, USA
ex-Head of Lab > St. Petersburg Electrotechnical University > Russia
|HELLO|
Dmitry Petrov
PhD in Computer Science
Twitter: @FullStackML
Creator of the DVC.org project
Slide 3
Slide 3 text
> Why is ML special?
> MLflow
> Git-LFS
> DVC
> Conclusion
|AGENDA|
Slide 4
Slide 4 text
> Why is ML special?
> MLflow
> Git-LFS
> DVC
> Conclusion
|01|
Slide 6
Slide 6 text
|DIFFERENCE 1: ML IS METRICS DRIVEN|
Slide 7
Slide 7 text
>> EXPERIMENT = CODE + OUTPUTS
Outputs include metrics and graphs (AUC, etc.)
|DIFFERENCE 1: ML IS METRICS DRIVEN - EXPERIMENT|
Slide 8
Slide 8 text
>> EXPERIMENT = CODE + OUTPUTS
Outputs include metrics and graphs (AUC, etc.)
|DIFFERENCE 1: ML IS METRICS DRIVEN - TRACKING|
Solution: metrics tracking
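A minimal sketch of what metrics tracking can look like, assuming a plain JSON-lines file as the store. The names `log_run` and `best_run`, the commit ids, and the AUC values are hypothetical and only illustrate the idea of tying each metric to the code version that produced it; this is not how any particular tool implements it.

```python
import json
import tempfile
from pathlib import Path


def log_run(log_path, code_version, params, metrics):
    """Append one experiment record (code version, parameters, metrics) as a JSON line."""
    record = {"code_version": code_version, "params": params, "metrics": metrics}
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def best_run(log_path, metric):
    """Return the logged record with the highest value of `metric`."""
    with open(log_path, encoding="utf-8") as f:
        runs = [json.loads(line) for line in f if line.strip()]
    return max(runs, key=lambda run: run["metrics"][metric])


# Hypothetical commit ids and AUC values, for illustration only.
demo_log = Path(tempfile.mkdtemp()) / "runs.jsonl"
log_run(demo_log, "a1b2c3d", {"lr": 0.1}, {"auc": 0.81})
log_run(demo_log, "e4f5a6b", {"lr": 0.01}, {"auc": 0.86})

best = best_run(demo_log, "auc")
print(best["code_version"], best["metrics"]["auc"])
```

Because every record carries the code version, the best metric can be traced back to the commit that produced it.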
Slide 9
Slide 9 text
>> EXPERIMENT = CODE + OUTPUTS
The ML model is an output
|DIFFERENCE 2: ML IS MODEL-CENTRIC|
Slide 10
Slide 10 text
|DIFFERENCE 2: ML IS MODEL-CENTRIC - TRACKING|
Solution: ML model versioning
>> EXPERIMENT = CODE + OUTPUTS
The ML model is an output
Slide 11
Slide 11 text
Which version did I use
LAST WEEK?
|DIFFERENCE 3: DATASETS MANAGEMENT|
Slide 12
Slide 12 text
|DIFFERENCE 3: DATASETS MANAGEMENT|
> Size is usually large (> 1 GB)
> Moving datasets around
> Datasets evolve, so versioning is needed
>> EXPERIMENT = CODE + OUTPUTS + DATASET
Source code, Datasets, ML models
Slide 13
Slide 13 text
|DIFFERENCE 3: DATASETS|
>> EXPERIMENT = CODE + OUTPUTS + DATASET
Solution:
> Custom data syncing scripts
> Connect data to code
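One way to picture "connect data to code" is a small metadata file that records the dataset's hash and lives next to the code in Git. The sketch below assumes SHA-256 and a hypothetical `.meta.json` sidecar format; the helper names are illustrative and not taken from any specific tool.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_meta(data_path):
    """Write a small sidecar file (hypothetical format) that can be committed with the code."""
    data_path = Path(data_path)
    meta = {
        "path": data_path.name,
        "sha256": file_sha256(data_path),
        "size": data_path.stat().st_size,
    }
    meta_path = data_path.with_name(data_path.name + ".meta.json")
    meta_path.write_text(json.dumps(meta, indent=2))
    return meta_path


def data_matches_meta(meta_path):
    """Check that the dataset on disk is the version the committed metadata expects."""
    meta_path = Path(meta_path)
    meta = json.loads(meta_path.read_text())
    return file_sha256(meta_path.with_name(meta["path"])) == meta["sha256"]


workdir = Path(tempfile.mkdtemp())
dataset = workdir / "data.xml"
dataset.write_text("<rows><row id='1'/></rows>")
meta_file = write_meta(dataset)
unchanged = data_matches_meta(meta_file)

# The dataset evolves, but the committed metadata still describes the old version.
dataset.write_text("<rows><row id='1'/><row id='2'/></rows>")
change_detected = not data_matches_meta(meta_file)
print(unchanged, change_detected)
```

Only the small metadata file goes into Git; the dataset itself stays in whatever storage the team uses.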
Slide 14
Slide 14 text
|DIFFERENCE 4: ML PIPELINES|
Why do we need ML Pipelines?
> Manage complexity - separate steps
> Optimize execution - cache steps
> Reusability of code
> Scale the team - step ownership
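The "cache steps" point can be sketched as follows: a step is skipped when the same step has already run with the same inputs. This is a simplified illustration that assumes JSON-serializable inputs and results; `run_step` and `featurize` are hypothetical names, and real pipeline tools also track code and file dependencies.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def run_step(name, func, inputs, cache_dir):
    """Run func(**inputs) unless this step already ran with identical inputs.

    Returns (result, was_cached).
    """
    key_source = json.dumps({"step": name, "inputs": inputs}, sort_keys=True)
    key = hashlib.sha256(key_source.encode("utf-8")).hexdigest()
    cache_file = Path(cache_dir) / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text()), True
    result = func(**inputs)
    cache_file.write_text(json.dumps(result))
    return result, False


calls = []


def featurize(rows, scale):
    """Hypothetical pipeline step; records each real execution in `calls`."""
    calls.append("featurize")
    return [row * scale for row in rows]


cache = tempfile.mkdtemp()
first, first_cached = run_step("featurize", featurize, {"rows": [1, 2, 3], "scale": 2}, cache)
second, second_cached = run_step("featurize", featurize, {"rows": [1, 2, 3], "scale": 2}, cache)
third, third_cached = run_step("featurize", featurize, {"rows": [1, 2, 3], "scale": 3}, cache)
print(first_cached, second_cached, third_cached, len(calls))
```

The second call reuses the stored result; changing an input produces a new cache key, so the step runs again.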
Slide 15
Slide 15 text
|DIFFERENCE 4: ML PIPELINES - AN EXAMPLE|
Slide 16
Slide 16 text
|DIFFERENCE 4: ML PIPELINES - TYPES|
ML Pipeline types:
> Data engineering (Airflow) for reliability
> ML pipelines for fast iterations
Slide 18
Slide 18 text
|BEST PRACTICES SUMMARY|
Best practice
Tracking metrics
Versioning ML models
Versioning datasets
Versioning ML pipelines
Slide 19
Slide 19 text
|WHEN THE BEST PRACTICES ARE NEEDED|
To be more efficient:
> Teamwork
> Reuse previous results
> Predictable process
Slide 20
Slide 20 text
> Why is ML special?
> MLflow
> Git-LFS
> DVC
> Conclusion
|02|
Slide 21
Slide 21 text
Platform for the machine learning lifecycle
> Tracking
> Project
> Models
$ pip install mlflow
|MLFLOW INTRO|
$ git clone https://github.com/dmpetrov/my-lfs-repo
$ cd my-lfs-repo
$ du -sh mymodel.p # data file does not contain data yet
4.0K mymodel.p
$ git pull
Downloading LFS objects: 75% (3/4), 44 MB | 4.5 MB/s
|GIT-LFS RETRIEVE DATA FILES|
Slide 29
Slide 29 text
> PROS
> Simple, like Git
> CONS
> Limited by data size: < 2 GB, ideally < 500 MB
> Not every Git server supports Git-LFS
> Not specific to ML / data science
|GIT-LFS PROS/CONS|
Slide 30
Slide 30 text
|GIT-LFS SUMMARY|
Best practice              Result
Tracking metrics           -
Versioning ML models       +
Versioning datasets        -/+
Versioning ML pipelines    -
Slide 31
Slide 31 text
> Why is ML special?
> MLflow
> Git-LFS
> DVC Data Version Control
> Conclusion
|04|
Slide 32
Slide 32 text
> DATASETS VERSIONING
> Including large ones: 10-100 GB
> ML PIPELINE VERSIONING
> All intermediate datasets / feature sets
> MULTIPLE CLOUD SUPPORT
> AWS S3, Google Cloud, Azure, bare-metal SSH server
|DVC PRINCIPLES|
Slide 33
Slide 33 text
|DVC INTRO|
Website: http://DVC.org
> Install
$ pip install dvc
$ dvc init
> Git-like tool: no infrastructure is required
Slide 35
Slide 35 text
|DVC DATASETS VERSIONING|
> Push data to storage
$ dvc add data.xml
$ dvc push
> Push meta information to Git server
$ git add .gitignore data.xml.dvc
$ git commit -m "add source data to DVC"
$ git push
Slide 36
Slide 36 text
$ git clone https://github.com/dmpetrov/my-dvc-repo
$ cd my-dvc-repo
$ dvc pull
...
$ du -sh data.xml
7G data.xml
|DVC RETRIEVE DATASETS|
Slide 41
Slide 41 text
> Copying a 50 GB directory takes ~1-2 min
What about DVC?
|DVC OPTIMIZATION - CHECKOUT SPEED|
$ git checkout update_20190310
$ time dvc checkout
real 0m1.596s
user 0m1.776s
sys 0m0.528s
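A checkout this fast is plausible when files are linked from a local cache instead of copied; DVC documents several link types for its cache (reflink, hardlink, symlink, copy). The sketch below is not DVC's implementation: it assumes a SHA-256 content-addressed cache and hard links with a copy fallback, and `add_to_cache` and `checkout` are hypothetical helpers.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path


def add_to_cache(src, cache_dir):
    """Copy src into a content-addressed cache and return its content hash."""
    data = Path(src).read_bytes()  # fine for a demo; stream large files in practice
    digest = hashlib.sha256(data).hexdigest()
    target = Path(cache_dir) / digest[:2] / digest[2:]
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, target)
    return digest


def checkout(digest, cache_dir, dest):
    """Materialize a cached version at dest, preferring a hard link over a copy."""
    cached = Path(cache_dir) / digest[:2] / digest[2:]
    dest = Path(dest)
    if dest.exists():
        dest.unlink()
    try:
        os.link(cached, dest)  # no data is copied; only a directory entry is created
        return "hardlink"
    except OSError:
        shutil.copyfile(cached, dest)  # fallback when hard links are unavailable
        return "copy"


root = Path(tempfile.mkdtemp())
cache_dir = root / "cache"
data_file = root / "data.xml"

data_file.write_text("version 1")
v1 = add_to_cache(data_file, cache_dir)
data_file.write_text("version 2")
v2 = add_to_cache(data_file, cache_dir)

mode = checkout(v1, cache_dir, data_file)
restored_v1 = data_file.read_text()
checkout(v2, cache_dir, data_file)
restored_v2 = data_file.read_text()
print(mode, restored_v1, restored_v2)
```

A hard-linked file shares storage with the cached copy, so editing it in place would also change the cache; a real tool has to guard against that.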
Slide 42
Slide 42 text
$ git clone https://github.com/dmpetrov/my-dvc-repo
$ cd my-dvc-repo
$ dvc pull train.dvc
...
$ du -sh cnn_model.p
54M cnn_model.p
|DVC OPTIMIZATION - PARTIAL DATA RETRIEVING|
Slide 43
Slide 43 text
|DVC SUMMARY|
Best practice              Result
Tracking metrics           -/+
Versioning ML models       +
Versioning datasets        +
Versioning ML pipelines    +
Slide 44
Slide 44 text
> Best practices for ML
> Git-LFS
> MLflow
> DVC
> Conclusion
|05|
Slide 45
Slide 45 text
Best practice              MLflow    Git-LFS    DVC
Tracking metrics           +         -          -/+
Versioning ML models       +         +          +
Versioning datasets        -/+       -/+        +
Versioning ML pipelines    -         -          +
|THE TOOLS SUMMARY|
Slide 46
Slide 46 text
Data science is as different from software
as software was different from hardware
Nick Elprin, Domino Data Lab
|THE WORLD IS CHANGING|
Hardware     Software     DS/ML
Waterfall    Agile        ¯\_(ツ)_/¯
Slide 47
Slide 47 text
|HOW TO DESIGN OUR FUTURE|
Slide 48
Slide 48 text
> Think about processes
|HOW TO DESIGN OUR FUTURE|
Slide 49
Slide 49 text
> Think about processes
> Try / develop new ML tools
|HOW TO DESIGN OUR FUTURE|
Slide 50
Slide 50 text
> Think about processes
> Try / develop new ML tools
> Share your knowledge
|HOW TO DESIGN OUR FUTURE|
Slide 51
Slide 51 text
> Think about processes
> Try / develop new ML tools
> Share your knowledge
> Support open source
|HOW TO DESIGN OUR FUTURE|