Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
ML Models and Dataset Versioning
Search
Sponsored
·
Your Podcast. Everywhere. Effortlessly.
Share. Educate. Inspire. Entertain. You do you. We'll handle the rest.
→
Kurian Benoy
October 13, 2019
Programming
0
490
ML Models and Dataset Versioning
Kurian Benoy
October 13, 2019
Tweet
Share
More Decks by Kurian Benoy
See All by Kurian Benoy
How I ended up maintaining a python package with 1M+ downloads so far?
kurianbenoy
0
9
MTech Final Project - Presentation Slides
kurianbenoy
0
37
Project Review Report 5 - MTech Project
kurianbenoy
1
42
Joy of Programming
kurianbenoy
0
49
Expert Interaction on ML
kurianbenoy
0
91
Project Review Report 4 - Robust Speech Recognition in Malayalam
kurianbenoy
0
120
Final project report - Phase 1
kurianbenoy
0
80
Project Review Slides
kurianbenoy
0
35
Demysitfying Async&Await in Python and JavaScript
kurianbenoy
0
180
Other Decks in Programming
See All in Programming
AI時代でも変わらない技術コミュニティの力~10年続く“ゆるい”つながりが生み出す価値
n_takehata
2
710
CSC307 Lecture 13
javiergs
PRO
0
320
LangChain4jとは一味違うLangChain4j-CDI
kazumura
1
170
Claude Codeログ基盤の構築
giginet
PRO
7
2.5k
受け入れテスト駆動開発(ATDD)×AI駆動開発 AI時代のATDDの取り組み方を考える
kztakasaki
2
550
守る「だけ」の優しいEMを抜けて、 事業とチームを両方見る視点を身につけた話
maroon8021
3
730
new(1.26) ← これすき / kamakura.go #8
utgwkk
0
2.2k
AIコーディングの理想と現実 2026 | AI Coding: Expectations vs. Reality 2026
tomohisa
0
1.2k
2026年は Rust 置き換えが流行る! / 20260220-niigata-5min-tech
girigiribauer
0
230
エージェント開発初心者の僕がエージェントを作った話と今後やりたいこと
thasu0123
0
240
株式会社 Sun terras カンパニーデック
sunterras
0
2.1k
Windows on Ryzen and I
seosoft
0
250
Featured
See All Featured
Hiding What from Whom? A Critical Review of the History of Programming languages for Music
tomoyanonymous
2
530
No one is an island. Learnings from fostering a developers community.
thoeni
21
3.6k
Bridging the Design Gap: How Collaborative Modelling removes blockers to flow between stakeholders and teams @FastFlow conf
baasie
0
470
State of Search Keynote: SEO is Dead Long Live SEO
ryanjones
0
150
Gemini Prompt Engineering: Practical Techniques for Tangible AI Outcomes
mfonobong
2
310
How People are Using Generative and Agentic AI to Supercharge Their Products, Projects, Services and Value Streams Today
helenjbeal
1
140
svc-hook: hooking system calls on ARM64 by binary rewriting
retrage
2
160
The Impact of AI in SEO - AI Overviews June 2024 Edition
aleyda
5
760
HU Berlin: Industrial-Strength Natural Language Processing with spaCy and Prodigy
inesmontani
PRO
0
260
What’s in a name? Adding method to the madness
productmarketing
PRO
24
4k
Put a Button on it: Removing Barriers to Going Fast.
kastner
60
4.2k
Evolving SEO for Evolving Search Engines
ryanjones
0
150
Transcript
ML MODELS AND DATASET VERSIONING Kurian Benoy
$ WHOAMI Open source contributor FOSSASIA OpenTechNights Winner Kaggle Expert
in Kernels
$ WHOAMI Open source contributor FOSSASIA OpenTechNights Winner Kaggle Expert
Final Year BTech student @MEC
OUTLINE Start up Adventures Challenges Model and Dataset versioning How
I discovered DVC? Use case: Versioning dogs and Cats Conclusion
Startup Adventures
CHALLENGE 1: ML IS SLOW
CHALLENGE 2: WORKING WITH ML PROJECTS Most software products take
a few seconds to execute. $ git clone project-repo $ pip install -r requirements.txt
None
CHALLENGE 3: METRIC DRIVEN
CHALLENGE 4: NOT ABLE TO USE GIT git not suitable
for projects > 1GB git clone becomes slow
MODEL VERSIONING
TRACKING EXPERIMENTS TRACKING METRICS
Why Model Versioning? > To keep track of experiments >
Choose the best ideas >> EXPERIMENTS = CODE + OUTPUTS Models are outputs
DATASET VERSIONING
None
4 TB/day
None
Why Dataset management? > Moving Datasets around > Datasets evolve,
so versioning required >> EXPERIMENTS = CODE + DATA + OUTPUTS Source code, Datasets
HOW I DISCOVERED DVC
DATA VERSION CONTROL(DVC)
> Experiment and Dataset tracking > Open-source(3500+ stars) > Build
to adopt the best practises of ML > Works well with git > Language and framework agnostic
VERSIONING CATS & DOGS
DEMO TIME
DVC WORKFLOW
Tracking data 1 Tracking 1000 cats and dogs 2 Add
1000 more labelled images of cats & dogs
SWITCHING VERSIONS
CONCLUSION
"Data science as different from software as software was different
from hardware." Nick Elprin, CEO, DominoLabs.
Think about your processes(ML projects)
Think about your processes Try to version control for your
projects
Try it out in your ML project!
THANK YOU Twitter: kurianbenoy2 Email :
[email protected]
Speaker Deck: bit.ly/mlversion19
APPENDIX
Other Tools for versioning ML Flow - Tracking Models, Metrics
Git-LFS - Tracking Large files Jovian - JupyterNB based tracking Neptune.Ml Hangar Py - Versioning Tensor Data