ML Models and Dataset Versioning
Kurian Benoy
October 13, 2019
Programming
Transcript
ML MODELS AND DATASET VERSIONING Kurian Benoy
$ WHOAMI
> Open source contributor
> FOSSASIA OpenTechNights Winner
> Kaggle Expert in Kernels
> Final Year BTech student @MEC
OUTLINE
> Startup Adventures
> Challenges
> Model and Dataset versioning
> How I discovered DVC?
> Use case: Versioning dogs and Cats
> Conclusion
Startup Adventures
CHALLENGE 1: ML IS SLOW
CHALLENGE 2: WORKING WITH ML PROJECTS
Most software products take a few seconds to execute.
$ git clone project-repo
$ pip install -r requirements.txt
CHALLENGE 3: METRIC DRIVEN
CHALLENGE 4: NOT ABLE TO USE GIT
> git is not suitable for projects larger than 1 GB
> git clone becomes slow
MODEL VERSIONING
TRACKING EXPERIMENTS TRACKING METRICS
Why Model Versioning?
> To keep track of experiments
> Choose the best ideas
>> EXPERIMENTS = CODE + OUTPUTS
Models are outputs
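The idea that an experiment is code plus outputs, tracked so the best one can be chosen, can be sketched as a small record type. This is only an illustration and not how any particular tool stores experiments: the `Experiment` class, the `best_experiment` helper, the commit hashes, model paths, and accuracy values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Experiment:
    """One experiment: the code version plus the outputs it produced."""
    code_version: str   # e.g. a git commit hash (made-up values below)
    model_path: str     # where the trained model, an output, was saved
    metrics: Dict[str, float] = field(default_factory=dict)


def best_experiment(experiments: List[Experiment], metric: str) -> Experiment:
    """Return the experiment with the highest value for the given metric."""
    scored = [e for e in experiments if metric in e.metrics]
    if not scored:
        raise ValueError(f"no experiment reports metric {metric!r}")
    return max(scored, key=lambda e: e.metrics[metric])


# Hypothetical experiment history, for illustration only.
experiments = [
    Experiment("a1b2c3d", "models/baseline.pkl", {"accuracy": 0.81}),
    Experiment("e4f5a6b", "models/augmented.pkl", {"accuracy": 0.87}),
    Experiment("c7d8e9f", "models/deeper.pkl", {"accuracy": 0.84}),
]

best = best_experiment(experiments, "accuracy")
print(best.code_version, best.model_path)
```

Because each record ties a metric to a code version and a model file, picking the best idea also tells you which commit and which model to go back to.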
DATASET VERSIONING
4 TB/day
Why Dataset management?
> Moving Datasets around
> Datasets evolve, so versioning is required
>> EXPERIMENTS = CODE + DATA + OUTPUTS
Source code, Datasets
HOW I DISCOVERED DVC
DATA VERSION CONTROL(DVC)
> Experiment and Dataset tracking
> Open-source (3500+ stars)
> Built to adopt the best practices of ML
> Works well with git
> Language and framework agnostic
VERSIONING CATS & DOGS
DEMO TIME
DVC WORKFLOW
Tracking data
1 Tracking 1000 cats and dogs
2 Add 1000 more labelled images of cats & dogs
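To make the two steps concrete, here is a minimal sketch of why adding images produces a new dataset version: a fingerprint computed over every file changes as soon as the directory contents change. This illustrates the general idea only and is not DVC's actual implementation. The `dataset_fingerprint` and `add_images` helpers, the placeholder file contents, and the 500/500 split between cats and dogs are assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def dataset_fingerprint(root: Path) -> str:
    """Hash every file's relative path and contents into one digest."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
    return digest.hexdigest()


def add_images(root: Path, label: str, start: int, count: int) -> None:
    """Write tiny placeholder files standing in for labelled images."""
    folder = root / label
    folder.mkdir(parents=True, exist_ok=True)
    for i in range(start, start + count):
        (folder / f"{label}_{i}.jpg").write_bytes(f"{label}-{i}".encode("utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"

    # Step 1: 1000 images in total (assumed split: 500 cats, 500 dogs).
    add_images(data, "cat", 0, 500)
    add_images(data, "dog", 0, 500)
    v1 = dataset_fingerprint(data)

    # Step 2: 1000 more labelled images.
    add_images(data, "cat", 500, 500)
    add_images(data, "dog", 500, 500)
    v2 = dataset_fingerprint(data)

print("dataset changed:", v1 != v2)
```

Only the short fingerprint needs to live in git. The images themselves can stay outside the repository, which avoids the large-repository problem described under Challenge 4.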
SWITCHING VERSIONS
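Switching versions can be sketched with a content-addressed cache: each version of a data file is stored under its hash, and a small pointer records which hash belongs to which version. In a DVC project this role is played by `git checkout` followed by `dvc checkout`. The code below is a simplified stand-in rather than DVC's implementation, and the `track` and `checkout` helpers, the `.ptr` extension, the `.cache` directory, and `labels.csv` are hypothetical names.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def track(data_file: Path, cache: Path) -> Path:
    """Copy data_file into the cache under its hash and write a pointer file."""
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    cache.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_file, cache / digest)
    pointer = data_file.with_name(data_file.name + ".ptr")  # hypothetical extension
    pointer.write_text(json.dumps({"sha256": digest}))
    return pointer


def checkout(pointer_text: str, data_file: Path, cache: Path) -> None:
    """Restore the data file that a pointer refers to from the cache."""
    digest = json.loads(pointer_text)["sha256"]
    shutil.copyfile(cache / digest, data_file)


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    cache = workspace / ".cache"
    labels = workspace / "labels.csv"

    # Version 1 of the data; the pointer text is what would be committed to git.
    labels.write_text("cat_0.jpg,cat\ndog_0.jpg,dog\n")
    pointer_v1 = track(labels, cache).read_text()

    # Version 2 adds one more labelled image.
    labels.write_text("cat_0.jpg,cat\ndog_0.jpg,dog\ncat_1.jpg,cat\n")
    pointer_v2 = track(labels, cache).read_text()

    # Switch back to version 1, then forward to version 2 again.
    checkout(pointer_v1, labels, cache)
    restored_v1 = labels.read_text()
    checkout(pointer_v2, labels, cache)
    restored_v2 = labels.read_text()

print(len(restored_v1.splitlines()), len(restored_v2.splitlines()))
```

The design choice worth noting is that git only ever sees the pointer, so switching data versions reduces to switching a tiny text file and copying the matching content out of the cache.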
CONCLUSION
"Data science as different from software as software was different
from hardware." Nick Elprin, CEO, Domino Data Lab.
Think about your processes (ML projects)
Try version control for your projects
Try it out in your ML project!
THANK YOU
Twitter: kurianbenoy2
Email: kurian.bkk@gmail.com
Speaker Deck: bit.ly/mlversion19
APPENDIX
Other Tools for versioning
> MLflow - Tracking Models, Metrics
> Git-LFS - Tracking Large files
> Jovian - Jupyter Notebook based tracking
> Neptune.ml
> Hangar Py - Versioning Tensor Data