Slide 1

Slide 1 text

ML MODELS AND DATASET VERSIONING
Kurian Benoy

Slide 2

Slide 2 text

$ WHOAMI
> Open source contributor
> FOSSASIA OpenTechNights Winner
> Kaggle Expert in Kernels

Slide 3

Slide 3 text

$ WHOAMI
> Open source contributor
> FOSSASIA OpenTechNights Winner
> Kaggle Expert
> Final Year BTech student @MEC

Slide 4

Slide 4 text

OUTLINE
> Startup Adventures
> Challenges
> Model and Dataset versioning
> How I discovered DVC
> Use case: Versioning Cats and Dogs
> Conclusion

Slide 5

Slide 5 text

Startup Adventures

Slide 6

Slide 6 text

CHALLENGE 1: ML IS SLOW

Slide 7

Slide 7 text

CHALLENGE 2: WORKING WITH ML PROJECTS
Most software products take a few seconds to execute.
$ git clone project-repo
$ pip install -r requirements.txt

Slide 9

Slide 9 text

CHALLENGE 3: METRIC DRIVEN

Slide 10

Slide 10 text

CHALLENGE 4: NOT ABLE TO USE GIT
> Git is not suitable for projects > 1 GB
> git clone becomes slow

Slide 11

Slide 11 text

MODEL VERSIONING

Slide 12

Slide 12 text

TRACKING EXPERIMENTS TRACKING METRICS

Slide 13

Slide 13 text

Why Model Versioning?
> To keep track of experiments
> Choose the best ideas
>> EXPERIMENTS = CODE + OUTPUTS
Models are outputs
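
The relation EXPERIMENTS = CODE + OUTPUTS can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular tool stores experiments: the `Experiment` record, its field names, the commit hashes, the file paths, and the accuracy numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Experiment:
    """One experiment = the code that ran + the outputs it produced."""
    code_version: str   # e.g. a git commit hash (made-up values below)
    model_file: str     # output: path of the trained model
    accuracy: float     # output: the metric used to compare ideas


def best_experiment(experiments):
    """Return the experiment with the highest accuracy."""
    return max(experiments, key=lambda e: e.accuracy)


# Made-up history, purely for illustration.
history = [
    Experiment("a1b2c3d", "models/baseline.h5", 0.81),
    Experiment("e4f5a6b", "models/augmented.h5", 0.88),
    Experiment("c7d8e9f", "models/higher_lr.h5", 0.79),
]

best = best_experiment(history)
print(best.code_version, best.model_file)
```

Because each record ties a model file to the code version that produced it, choosing the best idea also tells you which commit to go back to.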

Slide 14

Slide 14 text

DATASET VERSIONING

Slide 16

Slide 16 text

4 TB/day

Slide 18

Slide 18 text

Why Dataset management?
> Moving datasets around
> Datasets evolve, so versioning is required
>> EXPERIMENTS = CODE + DATA + OUTPUTS
Source code, Datasets
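
One way to see why an evolving dataset needs a version is to reduce it to a single content hash: when any file is added or changed, the hash changes, and that short string is small enough to commit to git next to the code. The sketch below is an assumption-laden illustration of that idea, not the implementation of any specific tool; the function name `dataset_fingerprint`, the file names, and the choice of SHA-256 are mine.

```python
import hashlib
import tempfile
from pathlib import Path


def dataset_fingerprint(root):
    """Hash every file under root (relative path + contents) into one hex digest."""
    digest = hashlib.sha256()
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest.update(path.relative_to(root).as_posix().encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()


# Build a throwaway dataset with placeholder bytes instead of real images.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp)
    (data / "cat.1.jpg").write_bytes(b"fake cat image")
    (data / "dog.1.jpg").write_bytes(b"fake dog image")
    v1 = dataset_fingerprint(data)

    # The dataset evolves: one more labelled image arrives.
    (data / "cat.2.jpg").write_bytes(b"another fake cat")
    v2 = dataset_fingerprint(data)

print(v1 != v2)
```

Recording the fingerprint together with the code version and the resulting metrics gives one concrete reading of CODE + DATA + OUTPUTS.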

Slide 19

Slide 19 text

HOW I DISCOVERED DVC

Slide 20

Slide 20 text

DATA VERSION CONTROL (DVC)

Slide 21

Slide 21 text

> Experiment and dataset tracking
> Open source (3500+ stars)
> Built to adopt the best practices of ML
> Works well with git
> Language and framework agnostic

Slide 22

Slide 22 text

VERSIONING CATS & DOGS

Slide 23

Slide 23 text

DEMO TIME

Slide 24

Slide 24 text

DVC WORKFLOW

Slide 25

Slide 25 text

Tracking data
1 Tracking 1000 cats and dogs
2 Add 1000 more labelled images of cats & dogs
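
The two tracking steps above, and the ability to switch back to the first version afterwards, can be mimicked with a toy content-addressed store. This is a conceptual sketch under my own assumptions, not DVC's code or API: the class `TinyDataStore`, the helper `fake_images`, the tags `v1.0` and `v2.0`, and the placeholder "images" are all hypothetical.

```python
import hashlib


class TinyDataStore:
    """In-memory toy: content-addressed cache plus one manifest per version."""

    def __init__(self):
        self.cache = {}      # content hash -> file bytes (each content stored once)
        self.versions = {}   # version tag -> {file name: content hash}

    def commit(self, tag, files):
        """Record a snapshot of `files` (name -> bytes) under `tag`."""
        manifest = {}
        for name, content in files.items():
            key = hashlib.sha256(content).hexdigest()
            self.cache.setdefault(key, content)  # unchanged files are not stored again
            manifest[name] = key
        self.versions[tag] = manifest

    def checkout(self, tag):
        """Rebuild the workspace (name -> bytes) for a given version tag."""
        return {name: self.cache[key] for name, key in self.versions[tag].items()}


def fake_images(start, stop):
    """Generate placeholder 'images'; real ones would be JPEG files on disk."""
    images = {}
    for i in range(start, stop):
        label = "cat" if i % 2 == 0 else "dog"
        images[f"{label}.{i}.jpg"] = f"pixels of {label} {i}".encode()
    return images


store = TinyDataStore()
workspace = fake_images(0, 1000)
store.commit("v1.0", workspace)            # step 1: track 1000 cats and dogs

workspace.update(fake_images(1000, 2000))
store.commit("v2.0", workspace)            # step 2: 1000 more labelled images

print(len(store.checkout("v1.0")), len(store.checkout("v2.0")), len(store.cache))
```

The second commit stores only the 1000 new files, since the first 1000 are already in the cache, and checking out either tag rebuilds exactly the files that belonged to that version.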

Slide 26

Slide 26 text

SWITCHING VERSIONS

Slide 27

Slide 27 text

CONCLUSION

Slide 28

Slide 28 text

"Data science is as different from software as software was different from hardware."
Nick Elprin, CEO, Domino Data Lab

Slide 29

Slide 29 text

Think about your processes (ML projects)

Slide 30

Slide 30 text

Think about your processes
Try version control for your projects

Slide 31

Slide 31 text

Try it out in your ML project!

Slide 32

Slide 32 text

THANK YOU
Twitter: kurianbenoy2
Email: [email protected]
Speaker Deck: bit.ly/mlversion19

Slide 33

Slide 33 text

APPENDIX

Slide 34

Slide 34 text

Other Tools for versioning
> MLflow - Tracking models, metrics
> Git LFS - Tracking large files
> Jovian - Jupyter Notebook based tracking
> Neptune.ml
> Hangar Py - Versioning tensor data