DataTalk#40 3/3 -- MLflow & DVC for experiment management in data science

Speaker: Hacene Karrad, Delair

The democratization of machine learning and deep learning has made experiment management tools indispensable. These tools help manage, organize, track and record machine learning experiments. At Delair, we frequently train and deploy machine learning and deep learning solutions applied to drone imagery. We use a range of state-of-the-art tools to train these models, and MLflow is an essential tool in how we manage our experiments. In this talk we explain our motivation for using it and present a use case that illustrates its features.

Toulouse Data Science

November 19, 2019

Transcript

  1. 1.
  2. 3.

    We will cover:
    • What do we do at Delair?
    • ML as an iterative process
    • Iterative process management
    • What do we need from an experiment manager?
    • Introduction to MLflow
    • Dataset tracking with DVC
  3. 5.

  4. 7.

    The Machine Learning Iterative Process
    • Machine learning experiments are iterative scientific projects.
    • It’s about finding the right model for the data.
    • It’s often hard to track and to reproduce.
    Different hyperparameters, different datasets, different code and different architectures ⇒ different model performance (metrics…)
    Figure credit: Kenneth Jensen [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)]
  5. 8.

    The Machine Learning Iterative Process
    In the model exploration and model refinement phases, we generally iterate many times to find a suitable model. Many experiments are run, and most are lost and wasted even though they could provide great insight.
    Figure label: Today’s scope
  6. 9.

    Managing Iterative Processes
    • The need to keep track of machine learning experiments.
    • Resume suspended research.
    • Revisit models, choices, notes…
    • Reproduce experimental results.
    • Draw and revisit conclusions.
    Figure credit: XKCD, Creative Commons Attribution-NonCommercial 2.5 License.
  7. 10.

    We were looking for a tool that:
    • Is framework agnostic (pytorch, tensorflow, keras …).
    • Centralizes machine learning experiment tracking (collaboration).
    • Is not code intrusive.
    • Integrates seamlessly with our workflow.
    • Is not complicated to back up.
    • Does not require an expert to maintain.
    Figure: alternative experiment tracking tools
  8. 11.

    We found MLflow
    What is included:
    ⇡ Logging params and metrics, and plotting them in a frontend.
    ⇡ Centralized (or distributed) storage of artifacts and experiment results.
    ⇡ Saving notes.
    ⇡ Code tracking (git hash).
    What is not included:
    ⇣ Hyperparameter optimization.
    ⇣ Scheduling.
    ⇣ Data tracking.
    ⇣ Replaying training from the interface.
    ⇒ That’s exactly what we were expecting.
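
The logging features listed on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code from the talk: the names `flatten_params`, `log_run` and the experiment name `meetup-demo` are hypothetical, and the sketch assumes MLflow is installed (`pip install mlflow`). Only standard `mlflow` tracking calls are used.

```python
def flatten_params(config, prefix=""):
    """Flatten a nested config dict into dotted keys.

    MLflow params are flat key/value pairs, so nested settings such as
    {"model": {"depth": 18}} become {"model.depth": 18}.
    """
    flat = {}
    for key, value in config.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_params(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def log_run(config, metrics_per_epoch, artifact_path=None,
            experiment_name="meetup-demo"):
    """Log one training run: params, per-epoch metrics, optional artifact.

    `experiment_name` is a hypothetical name chosen for this sketch.
    """
    # Imported lazily so flatten_params stays usable without MLflow installed.
    import mlflow

    mlflow.set_experiment(experiment_name)
    with mlflow.start_run():
        # Hyperparameters are logged once per run.
        mlflow.log_params(flatten_params(config))
        # Metrics are logged with a step so the frontend can plot curves.
        for epoch, metrics in enumerate(metrics_per_epoch):
            mlflow.log_metrics(metrics, step=epoch)
        # Any file (model weights, plots, reports) can be stored as an artifact.
        if artifact_path is not None:
            mlflow.log_artifact(artifact_path)
```

A call such as `log_run({"lr": 0.01, "model": {"arch": "resnet"}}, [{"loss": 0.9}, {"loss": 0.5}])` would record one run. With the default local tracking store, running `mlflow ui` from the same directory should then show the run and its metric curves in the frontend.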
  9. 12.

    What about dataset tracking?
    • Datasets are the most important “hyperparameter”.
    • Dataset versioning is important, yet somewhat tricky.
    ⇒ We use DVC to track datasets.
    Source: https://in.pycon.org/cfp/2019/proposals/model-and-dataset-versioning-practices-using-dvc-tool~ej1zd/
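
To make the idea of dataset versioning concrete, here is a standard-library sketch of the principle tools like DVC build on: keep the large data file out of git and commit only a small pointer file containing its content hash. This is not DVC's implementation or file format; the `.ptr.json` suffix, the function names and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def file_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pointer_path_for(data_path):
    """Hypothetical naming scheme: train.csv -> train.csv.ptr.json."""
    data_path = Path(data_path)
    return data_path.with_name(data_path.name + ".ptr.json")


def write_pointer(data_path):
    """Write a small pointer file recording the dataset's hash and size.

    The pointer file is what would be committed to git; the data itself
    would live in separate storage.
    """
    data_path = Path(data_path)
    pointer = {
        "path": data_path.name,
        "sha256": file_hash(data_path),
        "size": data_path.stat().st_size,
    }
    target = pointer_path_for(data_path)
    target.write_text(json.dumps(pointer, indent=2))
    return target


def is_unchanged(data_path):
    """Check whether the dataset still matches its recorded hash."""
    pointer = json.loads(pointer_path_for(data_path).read_text())
    return pointer["sha256"] == file_hash(data_path)
```

With DVC itself, the equivalent step is `dvc add <path>`, which creates a `<path>.dvc` pointer file to commit with git, so that each git commit identifies the exact dataset version an experiment used.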
  10. 14.

    Sources
    • Meetup MLflow example: https://github.com/dltkhacene/mnist_mlflow
    • DVC GitHub example: https://github.com/dltkhacene/meetup-demo-dvc
    • Git tutorial: https://www.youtube.com/watch?v=41tsyReTloA
    • Delair website: delair.aero
    • MLflow official website: http://mlflow.org/docs/
    • DVC official website: https://dvc.org/
  11. 16.