ML Models and Dataset Versioning

Kurian Benoy

October 13, 2019

Transcript

  1. ML MODELS AND
    DATASET
    VERSIONING
    Kurian Benoy

  2. $ WHOAMI
    Open source contributor
    FOSSASIA OpenTechNights Winner
    Kaggle Expert in Kernels

  3. $ WHOAMI
    Open source contributor
    FOSSASIA OpenTechNights Winner
    Kaggle Expert
    Final Year BTech student @MEC

  4. OUTLINE
    Startup Adventures
    Challenges
    Model and Dataset versioning
    How I discovered DVC?
    Use case: Versioning cats and dogs
    Conclusion

  5. Startup Adventures

  6. CHALLENGE 1:
    ML IS SLOW

  7. CHALLENGE 2:
    WORKING WITH ML PROJECTS
    Most software projects are ready to run
    in a few seconds:
    $ git clone project-repo
    $ pip install -r requirements.txt

  9. CHALLENGE 3:
    METRIC DRIVEN

  10. CHALLENGE 4:
    NOT ABLE TO USE GIT
    Git is not suitable for projects > 1 GB
    git clone becomes slow

  11. MODEL
    VERSIONING

  12. TRACKING EXPERIMENTS
    TRACKING
    METRICS

  13. Why Model Versioning?
    > To keep track of experiments
    > Choose the best ideas
    >> EXPERIMENTS = CODE + OUTPUTS
    Models are outputs
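
The idea that an experiment is code plus outputs can be illustrated with a small record type. This is only a sketch, not a tool from the talk: the `Experiment` class, the `best_experiment` helper, the commit hashes, the model paths, and the accuracy values are all hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    """One run: the code version that produced it and the outputs it yielded."""
    code_version: str  # e.g. a Git commit hash (hypothetical values below)
    model_path: str    # the trained model is an output, not source code
    metrics: dict = field(default_factory=dict)


def best_experiment(experiments, metric, higher_is_better=True):
    """Return the experiment with the best value for the given metric."""
    scored = [e for e in experiments if metric in e.metrics]
    if not scored:
        raise ValueError(f"no experiment reports metric {metric!r}")
    pick = max if higher_is_better else min
    return pick(scored, key=lambda e: e.metrics[metric])


# Made-up runs, purely for illustration.
experiments = [
    Experiment("a1b2c3d", "models/run1.h5", {"accuracy": 0.81}),
    Experiment("e4f5a6b", "models/run2.h5", {"accuracy": 0.88}),
    Experiment("c7d8e9f", "models/run3.h5", {"accuracy": 0.85}),
]

best = best_experiment(experiments, "accuracy")
print(best.code_version, best.model_path)
```

Keeping the code version next to the metrics is what makes it possible to return to the best idea later and rebuild the model it produced.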

  14. DATASET
    VERSIONING

  16. 4 TB/day

  18. Why Dataset management?
    > Moving Datasets around
    > Datasets evolve, so versioning is required
    >> EXPERIMENTS = CODE + DATA + OUTPUTS
    Source code, Datasets
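
One way to make the data part of an experiment explicit is to fingerprint the dataset directory, so that any change to the data yields a new identifier. The sketch below is an assumption-laden illustration using only the Python standard library; `dataset_fingerprint` is a hypothetical helper and the file names and contents are stand-ins, not real images.

```python
import hashlib
import tempfile
from pathlib import Path


def dataset_fingerprint(root):
    """Hash every file's relative path and contents into one hex digest."""
    root = Path(root)
    digest = hashlib.sha256()
    files = [p for p in root.rglob("*") if p.is_file()]
    # Sort by relative path so the result does not depend on listing order.
    for path in sorted(files, key=lambda p: p.relative_to(root).as_posix()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp)
    (data / "cat.1.jpg").write_bytes(b"stand-in for cat pixels")
    v1 = dataset_fingerprint(data)

    # The dataset evolves: a new labelled image arrives.
    (data / "dog.1.jpg").write_bytes(b"stand-in for dog pixels")
    v2 = dataset_fingerprint(data)

print("dataset changed:", v1 != v2)
```

Storing such a fingerprint alongside the code version and the metrics ties each result to the exact data it was trained on.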

  19. HOW I DISCOVERED
    DVC

  20. DATA VERSION
    CONTROL (DVC)

  21. > Experiment and Dataset tracking
    > Open source (3500+ stars)
    > Built to adopt the best practices of ML
    > Works well with Git
    > Language and framework agnostic

  22. VERSIONING CATS &
    DOGS

  23. DEMO TIME

  24. DVC WORKFLOW

  25. Tracking data
    1. Track 1000 images of cats and dogs
    2. Add 1000 more labelled images of cats and dogs

  26. SWITCHING VERSIONS
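
The demo itself is not captured in this transcript. With DVC, data is typically tracked with `dvc add`, and versions are switched with `git checkout` followed by `dvc checkout`. The sketch below does not call DVC and is not its implementation; it only illustrates, under simplifying assumptions, the general idea of a content-addressed cache plus a small manifest per version. The `snapshot` and `checkout` functions, the directory names, and the six tiny stand-in files are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def snapshot(workspace, cache):
    """Copy each file into the cache under its content hash; return a manifest."""
    manifest = {}
    for path in sorted(workspace.rglob("*")):
        if path.is_file():
            content_hash = hashlib.sha256(path.read_bytes()).hexdigest()
            cached = cache / content_hash
            if not cached.exists():  # identical content is stored only once
                shutil.copyfile(path, cached)
            manifest[path.relative_to(workspace).as_posix()] = content_hash
    return manifest


def checkout(manifest, workspace, cache):
    """Make the workspace match a manifest, restoring files from the cache."""
    for path in sorted(workspace.rglob("*")):
        if path.is_file() and path.relative_to(workspace).as_posix() not in manifest:
            path.unlink()  # file does not belong to the requested version
    for rel, content_hash in manifest.items():
        target = workspace / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(cache / content_hash, target)


with tempfile.TemporaryDirectory() as tmp:
    workspace, cache = Path(tmp, "data"), Path(tmp, "cache")
    workspace.mkdir()
    cache.mkdir()

    # Version 1: a tiny stand-in for the first batch of labelled images.
    for i in range(3):
        (workspace / f"cat.{i}.jpg").write_bytes(f"cat {i}".encode())
    v1 = snapshot(workspace, cache)

    # Version 2: more labelled images are added to the same directory.
    for i in range(3):
        (workspace / f"dog.{i}.jpg").write_bytes(f"dog {i}".encode())
    v2 = snapshot(workspace, cache)

    checkout(v1, workspace, cache)
    files_at_v1 = sorted(p.name for p in workspace.iterdir())
    checkout(v2, workspace, cache)
    files_at_v2 = sorted(p.name for p in workspace.iterdir())

print(len(files_at_v1), len(files_at_v2))
```

Only the manifests need to live in Git; the large files stay in the cache, which is why this style of tracking avoids the slow clones mentioned earlier in the talk.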

  27. CONCLUSION

  28. "Data science is as different from software
    as software was different from hardware."
    Nick Elprin,
    CEO, Domino Data Lab

  29. Think about your processes (ML projects)

  30. Think about your processes
    Try version control for your projects

  31. Try it out in your ML project!

  32. THANK YOU
    Twitter: kurianbenoy2
    Email: [email protected]
    Speaker Deck: bit.ly/mlversion19

  33. APPENDIX

  34. Other tools for versioning
    MLflow - Tracking models and metrics
    Git LFS - Tracking large files
    Jovian - Jupyter Notebook based tracking
    Neptune.ml
    Hangar Py - Versioning tensor data
