$ WHOAMI
Open source contributor
FOSSASIA OpenTechNights Winner
Kaggle Expert in Kernels
Final Year BTech student @MEC
Slide 4
OUTLINE
Startup Adventures
Challenges
Model and Dataset versioning
How I discovered DVC
Use case: Versioning cats and dogs
Conclusion
Slide 5
Startup Adventures
Slide 6
CHALLENGE 1:
ML IS SLOW
Slide 7
CHALLENGE 2:
WORKING WITH ML PROJECTS
Most software products take a few seconds to execute.
$ git clone project-repo
$ pip install -r requirements.txt
Slide 9
CHALLENGE 3:
METRIC DRIVEN
Slide 10
CHALLENGE 4:
NOT ABLE TO USE GIT
Git is not suitable for projects > 1 GB
git clone becomes slow
Slide 11
MODEL VERSIONING
Slide 12
TRACKING EXPERIMENTS
TRACKING METRICS
Slide 13
Why Model Versioning?
> To keep track of experiments
> To choose the best ideas
>> EXPERIMENTS = CODE + OUTPUTS
Models are outputs
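The equation above can be sketched in code. This is a minimal illustration, not the tooling used in the talk: each experiment is logged as a code version plus the hash of the model file it produced, together with its metrics, so the best run can be looked up later. The function names, the JSON-lines log format, and the `metrics` dictionary layout are all assumptions made for the example.

```python
import hashlib
import json


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_experiment(log_path, code_version, model_path, metrics):
    """Append one experiment (code + outputs) to a JSON-lines log.

    Hypothetical helper: `code_version` is any string that identifies the
    code, for example a Git commit hash. The model file is the output.
    """
    entry = {
        "code_version": code_version,
        "model_sha256": file_sha256(model_path),
        "metrics": metrics,
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry


def best_experiment(log_path, metric):
    """Return the logged experiment with the highest value of `metric`."""
    with open(log_path, encoding="utf-8") as log:
        entries = [json.loads(line) for line in log if line.strip()]
    return max(entries, key=lambda entry: entry["metrics"][metric])
```

Storing a hash instead of the model itself keeps the log small; the model file still has to be kept somewhere it can be retrieved by that hash.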
Slide 14
DATASET VERSIONING
Slide 16
4 TB/day
Slide 18
Why Dataset management?
> Moving datasets around
> Datasets evolve, so versioning is required
>> EXPERIMENTS = CODE + DATA + OUTPUTS
Source code, Datasets
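To make "datasets evolve" concrete, here is a small sketch of giving a dataset directory a version identifier. It hashes every file's relative path and contents into a single fingerprint, so any added, removed, or changed file yields a new identifier. `dataset_fingerprint` is a hypothetical helper written for this example, not part of any tool named in these slides.

```python
import hashlib
from pathlib import Path


def dataset_fingerprint(root):
    """Hash every file under `root` (relative path + bytes) into one digest.

    Hypothetical helper for illustration. Files are visited in sorted order
    so the result does not depend on directory listing order. Each file is
    read fully into memory, which is fine for a sketch but not for very
    large files.
    """
    root = Path(root)
    files = sorted(
        (p for p in root.rglob("*") if p.is_file()),
        key=lambda p: p.relative_to(root).as_posix(),
    )
    digest = hashlib.sha256()
    for path in files:
        name = path.relative_to(root).as_posix().encode("utf-8")
        data = path.read_bytes()
        # Length prefixes keep "name + data" boundaries unambiguous.
        digest.update(len(name).to_bytes(8, "big"))
        digest.update(name)
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()
```

An experiment record can then store this fingerprint next to the code version and the output hashes, which covers all three terms of the equation.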
Slide 19
HOW I DISCOVERED DVC
Slide 20
DATA VERSION CONTROL (DVC)
Slide 21
> Experiment and dataset tracking
> Open source (3500+ stars)
> Built to adopt the best practices of ML
> Works well with Git
> Language and framework agnostic
Slide 22
VERSIONING CATS & DOGS
Slide 23
DEMO TIME
Slide 24
DVC WORKFLOW
Slide 25
Tracking data
1. Track 1000 images of cats and dogs
2. Add 1000 more labelled images of cats and dogs
Slide 26
SWITCHING VERSIONS
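The idea behind tracking a dataset and then switching between its versions can be shown with a toy content-addressed store. This is a simplified illustration of the general technique, not DVC's implementation: file contents go into a cache keyed by their hash, and a small manifest (light enough to commit to Git) records which hashes make up each version. The names `snapshot` and `checkout`, and the cache layout, are assumptions made for the example. With DVC itself, the corresponding steps are `dvc add` to track the data, a Git commit of the generated `.dvc` file, and `git checkout` followed by `dvc checkout` to switch.

```python
import hashlib
import shutil
from pathlib import Path


def snapshot(data_dir, cache_dir):
    """Copy each file into a content-addressed cache and return a manifest.

    Hypothetical helper. The manifest maps relative file paths to content
    hashes. `cache_dir` is assumed to live outside `data_dir`.
    """
    data_dir, cache_dir = Path(data_dir), Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        content_hash = hashlib.sha256(path.read_bytes()).hexdigest()
        cached = cache_dir / content_hash
        if not cached.exists():  # identical content is stored only once
            shutil.copyfile(path, cached)
        manifest[path.relative_to(data_dir).as_posix()] = content_hash
    return manifest


def checkout(manifest, data_dir, cache_dir):
    """Rebuild `data_dir` from the cache so it matches `manifest` exactly."""
    data_dir, cache_dir = Path(data_dir), Path(cache_dir)
    if data_dir.exists():
        shutil.rmtree(data_dir)  # drop files that are not in this version
    data_dir.mkdir(parents=True, exist_ok=True)
    for relative_path, content_hash in manifest.items():
        target = data_dir / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(cache_dir / content_hash, target)
```

Taking a snapshot after step 1 and another after step 2 gives two manifests; calling `checkout` with either one restores the matching set of images, and files shared by both versions occupy cache space only once.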
Slide 27
CONCLUSION
Slide 28
"Data science is as different from software
as software was different from hardware."
Nick Elprin,
CEO, Domino Data Lab.
Slide 29
Think about your processes (ML projects)
Try version control for your projects
Other Tools for versioning
MLflow - tracking models and metrics
Git LFS - tracking large files
Jovian - Jupyter notebook based tracking
Neptune.ml
Hangar (hangar-py) - versioning tensor data