ML Models and Dataset Versioning
Kurian Benoy
October 13, 2019
Programming
Transcript
ML MODELS AND DATASET VERSIONING
Kurian Benoy
$ WHOAMI
> Open source contributor
> FOSSASIA OpenTechNights Winner
> Kaggle Expert in Kernels
> Final Year BTech student @MEC
OUTLINE
> Startup Adventures
> Challenges
> Model and Dataset versioning
> How I discovered DVC?
> Use case: Versioning dogs and Cats
> Conclusion
Startup Adventures
CHALLENGE 1: ML IS SLOW
CHALLENGE 2: WORKING WITH ML PROJECTS
Most software products take a few seconds to execute.
$ git clone project-repo
$ pip install -r requirements.txt
CHALLENGE 3: METRIC DRIVEN
CHALLENGE 4: NOT ABLE TO USE GIT
> git is not suitable for projects larger than 1 GB
> git clone becomes slow
MODEL VERSIONING
TRACKING EXPERIMENTS
TRACKING METRICS
Why Model Versioning?
> To keep track of experiments
> Choose the best ideas
>> EXPERIMENTS = CODE + OUTPUTS
Models are outputs
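A minimal sketch of "EXPERIMENTS = CODE + OUTPUTS", assuming each experiment is recorded as a code version plus the outputs it produced (a model file and its metrics). The `Experiment` class, the `best_experiment` helper, the commit hashes, paths, and accuracy values are all hypothetical placeholders for illustration, not results from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    """One experiment: the code that ran plus the outputs it produced."""
    code_version: str   # hypothetical, e.g. a git commit hash
    model_path: str     # hypothetical location of the trained model
    metrics: dict = field(default_factory=dict)


def best_experiment(experiments, metric):
    """Return the experiment with the highest value for `metric`."""
    scored = [e for e in experiments if metric in e.metrics]
    if not scored:
        raise ValueError(f"no experiment reports {metric!r}")
    return max(scored, key=lambda e: e.metrics[metric])


# Made-up runs, only to show how tracked metrics let you pick the best idea.
runs = [
    Experiment("a1b2c3d", "models/baseline.pkl", {"accuracy": 0.81}),
    Experiment("e4f5a6b", "models/augmented.pkl", {"accuracy": 0.88}),
]
best = best_experiment(runs, "accuracy")
print(best.code_version, best.model_path, best.metrics["accuracy"])
```

Keeping the code version next to the metrics is what makes a result reproducible: the best score points back to the exact code that produced it.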
DATASET VERSIONING
4 TB/day
Why Dataset management?
> Moving Datasets around
> Datasets evolve, so versioning required
>> EXPERIMENTS = CODE + DATA + OUTPUTS
Source code, Datasets
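One way to see why evolving datasets need versions is to give each state of the data a fingerprint. The sketch below is an assumption-laden illustration, not how any particular tool does it: `dataset_fingerprint` is a hypothetical helper that hashes every file path and its contents into a single digest, so any added or changed file yields a new identifier. The file names and bytes are stand-ins.

```python
import hashlib
import tempfile
from pathlib import Path


def dataset_fingerprint(root):
    """Combine every file's relative path and content hash into one digest."""
    root = Path(root)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Include the relative path so renames also change the fingerprint.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(hashlib.sha256(path.read_bytes()).digest())
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp)
    (data / "cat.1.jpg").write_bytes(b"stand-in cat image")
    v1 = dataset_fingerprint(data)
    v1_again = dataset_fingerprint(data)      # unchanged data, same fingerprint
    (data / "dog.1.jpg").write_bytes(b"stand-in dog image")
    v2 = dataset_fingerprint(data)            # dataset evolved, new fingerprint

print(v1 == v1_again, v1 != v2)
```

Recording such a fingerprint alongside the code version and the outputs gives the three parts of the equation above a concrete, comparable form.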
HOW I DISCOVERED DVC
DATA VERSION CONTROL (DVC)
> Experiment and Dataset tracking
> Open-source (3500+ stars)
> Built to adopt the best practices of ML
> Works well with git
> Language and framework agnostic
VERSIONING CATS & DOGS
DEMO TIME
DVC WORKFLOW
Tracking data
1. Tracking 1000 cats and dogs
2. Add 1000 more labelled images of cats & dogs
SWITCHING VERSIONS
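The demo itself is not captured in this transcript. As a rough illustration of the general idea behind switching dataset versions, the sketch below keeps large files in a content-addressed cache and describes each version with a small manifest that could be committed to git. It is a simplified stand-in under stated assumptions, not DVC's implementation or CLI: `snapshot`, `restore`, the manifest format, and the file names are all hypothetical.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def snapshot(data_dir, cache_dir, pointer_file):
    """Copy each file into a content-addressed cache and write a small manifest."""
    data_dir, cache_dir = Path(data_dir), Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    manifest = {}
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        file_hash = hashlib.sha256(path.read_bytes()).hexdigest()
        cached = cache_dir / file_hash
        if not cached.exists():  # identical content is stored only once
            shutil.copyfile(path, cached)
        manifest[path.relative_to(data_dir).as_posix()] = file_hash
    Path(pointer_file).write_text(json.dumps(manifest, indent=2, sort_keys=True))


def restore(data_dir, cache_dir, pointer_file):
    """Rebuild `data_dir` so it matches the manifest in `pointer_file`."""
    data_dir, cache_dir = Path(data_dir), Path(cache_dir)
    manifest = json.loads(Path(pointer_file).read_text())
    if data_dir.exists():
        shutil.rmtree(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    for rel_path, file_hash in manifest.items():
        target = data_dir / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(cache_dir / file_hash, target)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    data, cache = root / "data", root / "cache"
    data.mkdir()
    (data / "cat.1.jpg").write_bytes(b"stand-in cat image")
    snapshot(data, cache, root / "v1.json")   # version 1 of the dataset
    (data / "dog.1.jpg").write_bytes(b"stand-in dog image")
    snapshot(data, cache, root / "v2.json")   # version 2 adds more images
    restore(data, cache, root / "v1.json")    # switch back to version 1
    files_at_v1 = sorted(p.name for p in data.iterdir())
    restore(data, cache, root / "v2.json")    # and forward to version 2
    files_at_v2 = sorted(p.name for p in data.iterdir())
    cached_objects = len(list(cache.iterdir()))

print(files_at_v1, files_at_v2, cached_objects)
```

Because only the small manifest changes between versions, git stays fast while the heavy files live outside the repository, which is the property that makes this approach pair well with git.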
CONCLUSION
"Data science as different from software as software was different
from hardware."
Nick Elprin, CEO, Domino Data Lab
Think about your processes (ML projects)
Try version control for your projects
Try it out in your ML project!
THANK YOU
Twitter: kurianbenoy2
Email: [email protected]
Speaker Deck: bit.ly/mlversion19
APPENDIX
Other Tools for versioning
> MLflow - Tracking Models, Metrics
> Git-LFS - Tracking Large files
> Jovian - Jupyter Notebook based tracking
> Neptune.ml
> Hangar Py - Versioning Tensor Data