Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
ML Models and Dataset Versioning
Search
Kurian Benoy
October 13, 2019
Programming
0
480
ML Models and Dataset Versioning
Kurian Benoy
October 13, 2019
Tweet
Share
More Decks by Kurian Benoy
See All by Kurian Benoy
How I ended up maintaining a python package with 1M+ downloads so far?
kurianbenoy
0
6
MTech Final Project - Presentation Slides
kurianbenoy
0
13
Project Review Report 5 - MTech Project
kurianbenoy
1
38
Joy of Programming
kurianbenoy
0
44
Expert Interaction on ML
kurianbenoy
0
85
Project Review Report 4 - Robust Speech Recognition in Malayalam
kurianbenoy
0
93
Final project report - Phase 1
kurianbenoy
0
65
Project Review Slides
kurianbenoy
0
24
Demysitfying Async&Await in Python and JavaScript
kurianbenoy
0
170
Other Decks in Programming
See All in Programming
その面倒な作業、「Dart」にやらせませんか? Flutter開発者のための業務効率化
yordgenome03
1
110
Pythonスレッドとは結局何なのか? CPython実装から見るNoGIL時代の変化
curekoshimizu
5
1.7k
私達はmodernize packageに夢を見るか feat. go/analysis, go/ast / Go Conference 2025
kaorumuta
2
520
Go Conference 2025: Goで体感するMultipath TCP ― Go 1.24 時代の MPTCP Listener を理解する
takehaya
8
1.6k
(Extension DC 2025) Actor境界を越える技術
teamhimeh
1
250
Serena MCPのすすめ
wadakatu
4
960
The Flutter Journey of Building a Live Streaming App — With a Side of Performance Tuning
u503
1
110
開発生産性を上げるための生成AI活用術
starfish719
3
420
XP, Testing and ninja testing ZOZ5
m_seki
3
600
GraphQL×Railsアプリのデータベース負荷分散 - 月間3,000万人利用サービスを無停止で
koxya
1
1.2k
Signals & Resource API in Angular: 3 Effective Rules for Your Architecture @BASTA 2025 in Mainz
manfredsteyer
PRO
0
120
Advance Your Career with Open Source
ivargrimstad
0
460
Featured
See All Featured
A Modern Web Designer's Workflow
chriscoyier
697
190k
Practical Tips for Bootstrapping Information Extraction Pipelines
honnibal
PRO
23
1.5k
XXLCSS - How to scale CSS and keep your sanity
sugarenia
248
1.3M
実際に使うSQLの書き方 徹底解説 / pgcon21j-tutorial
soudai
PRO
189
55k
We Have a Design System, Now What?
morganepeng
53
7.8k
Building a Modern Day E-commerce SEO Strategy
aleyda
43
7.7k
What's in a price? How to price your products and services
michaelherold
246
12k
Faster Mobile Websites
deanohume
310
31k
The Straight Up "How To Draw Better" Workshop
denniskardys
237
140k
The Web Performance Landscape in 2024 [PerfNow 2024]
tammyeverts
9
860
JavaScript: Past, Present, and Future - NDC Porto 2020
reverentgeek
52
5.6k
Typedesign – Prime Four
hannesfritz
42
2.8k
Transcript
ML MODELS AND DATASET VERSIONING Kurian Benoy
$ WHOAMI Open source contributor FOSSASIA OpenTechNights Winner Kaggle Expert
in Kernels
$ WHOAMI Open source contributor FOSSASIA OpenTechNights Winner Kaggle Expert
Final Year BTech student @MEC
OUTLINE Start up Adventures Challenges Model and Dataset versioning How
I discovered DVC? Use case: Versioning dogs and Cats Conclusion
Startup Adventures
CHALLENGE 1: ML IS SLOW
CHALLENGE 2: WORKING WITH ML PROJECTS Most software products take
a few seconds to execute. $ git clone project-repo $ pip install -r requirements.txt
None
CHALLENGE 3: METRIC DRIVEN
CHALLENGE 4: NOT ABLE TO USE GIT git not suitable
for projects > 1GB git clone becomes slow
MODEL VERSIONING
TRACKING EXPERIMENTS TRACKING METRICS
Why Model Versioning? > To keep track of experiments >
Choose the best ideas >> EXPERIMENTS = CODE + OUTPUTS Models are outputs
DATASET VERSIONING
None
4 TB/day
None
Why Dataset management? > Moving Datasets around > Datasets evolve,
so versioning required >> EXPERIMENTS = CODE + DATA + OUTPUTS Source code, Datasets
HOW I DISCOVERED DVC
DATA VERSION CONTROL(DVC)
> Experiment and Dataset tracking > Open-source(3500+ stars) > Build
to adopt the best practises of ML > Works well with git > Language and framework agnostic
VERSIONING CATS & DOGS
DEMO TIME
DVC WORKFLOW
Tracking data 1 Tracking 1000 cats and dogs 2 Add
1000 more labelled images of cats & dogs
SWITCHING VERSIONS
CONCLUSION
"Data science as different from software as software was different
from hardware." Nick Elprin, CEO, DominoLabs.
Think about your processes(ML projects)
Think about your processes Try to version control for your
projects
Try it out in your ML project!
THANK YOU Twitter: kurianbenoy2 Email :
[email protected]
Speaker Deck: bit.ly/mlversion19
APPENDIX
Other Tools for versioning ML Flow - Tracking Models, Metrics
Git-LFS - Tracking Large files Jovian - JupyterNB based tracking Neptune.Ml Hangar Py - Versioning Tensor Data