From Chaos to Order: How MLFlow Transforms Experiment Tracking and Reproducibility

Vikas Shetty
August 30, 2023

Transcript

  1. Agenda: 1) Understanding why ML is different 2) Pain points in the traditional approach 3) Enter MLFlow 4) Using the OSS version 5) Problems with it 6) Ways to solve them
  2. Why is ML different from traditional software • Once you have the code, you can recreate/build the same software again (as long as the code remains the same)
  3. So what’s different • More variables that can change • A non-reproducible process • Keeping track of data vs. keeping track of code
  4. What really drives ML • Metrics!!! • How accurate are your predictions? • Tracking these metrics and improving them incrementally • Testing the model on data not used for training (a quick sketch follows)
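
    A minimal sketch of that last point, scoring on held-out data; the dataset and model (scikit-learn’s bundled iris set, a logistic regression) are illustrative stand-ins:

    ```python
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Hold out 20% of the data that the model never trains on.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Fit on the training split only, then score on the held-out split:
    # this accuracy is the metric you track and try to improve.
    model = LogisticRegression(max_iter=200).fit(X_train, y_train)
    print(accuracy_score(y_test, model.predict(X_test)))
    ```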
  5. What is messy about tracking metrics • Need to manage associated files • For example, TensorBoard stores the files required for graphs and other details in a ‘log’ folder • Things can get out of hand quickly once you have more iterations
  6. The lesson we learnt from the whales • In January 2016, deepsense.ai won the Right Whale Recognition contest on Kaggle • To collect the prize, the winning team had to provide the source code and a description of how to recreate the winning solution • The team took 3 weeks to recreate it • Funnily enough, that trauma led deepsense.ai to build Neptune.ai (a tool similar to MLFlow)
  7. Introducing MLFlow • Open-source project developed by Databricks to address pain points in the ML lifecycle • ML projects can get complex and hard to reproduce or track • Framework-independent (a minimal logging sketch follows)
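
    To make that concrete, a minimal sketch of logging a run with the MLFlow Python API; the run name, parameter, metric value, and file name are all illustrative:

    ```python
    import mlflow

    with mlflow.start_run(run_name="baseline"):
        mlflow.log_param("learning_rate", 0.01)      # hyperparameters
        mlflow.log_metric("accuracy", 0.87)          # metrics, loggable per step
        mlflow.log_artifact("confusion_matrix.png")  # any local file (must exist)
    ```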
  8. Making it more friendly • Experiment runs are listed by run date • It’s nice but not easy on the eyes • Tag runs with a version number for reference (sketched below) • Multiple teams can run different experiments on the same server
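
    Tagging and per-team separation might look like this; the experiment name and tag keys are our own convention, not something MLFlow prescribes:

    ```python
    import mlflow

    mlflow.set_experiment("team-a-churn-model")  # one experiment per team/project
    with mlflow.start_run():
        mlflow.set_tag("version", "v1.4.2")  # version tag for easy reference
        mlflow.set_tag("team", "team-a")
    ```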
  9. Our initial setup • Well, we can track experiments with ease now • How do I show the experiments to my coworker? • “Can you see my screen real quick?”
  10. Let’s deploy it on an instance • Run MLFlow on a server (sketch below) • Anyone can access the URL :| • Why is the UI slow? • Maybe we have too many experiments • Graphs have a limit on how many values you can log
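
    Deploying boils down to starting the tracking server on the instance and pointing every client at it; the hostname below is a placeholder:

    ```python
    # On the instance (shell): mlflow server --host 0.0.0.0 --port 5000
    import mlflow

    # Clients then log to the shared server instead of a local ./mlruns folder.
    mlflow.set_tracking_uri("http://mlflow.internal.example:5000")
    ```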
  11. A few moments later • Store experiments in a DB • Store artifacts on cloud storage • Both options are supported by MLFlow (sketch below) • Still don’t have auth
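
    Both options map to flags on the server command; the connection string and bucket are placeholders, and the subprocess wrapper is only there to keep the sketch self-contained (in practice this would live in a shell script or service unit):

    ```python
    import subprocess

    subprocess.run([
        "mlflow", "server",
        "--backend-store-uri", "postgresql://mlflow:PASSWORD@db.internal:5432/mlflow",
        "--default-artifact-root", "s3://example-mlflow-bucket/artifacts",
        "--host", "0.0.0.0",
        "--port", "5000",
    ])
    ```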
  12. Let’s just use Nginx? • At least the URL isn’t open anymore • Basic username and password authentication • Low-effort solution (a sample config follows)
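
    A rough sketch of that low-effort setup as an nginx reverse proxy with HTTP Basic auth; the server name and htpasswd path are placeholders:

    ```nginx
    server {
        listen 80;
        server_name mlflow.internal.example;  # placeholder

        auth_basic           "MLFlow";
        auth_basic_user_file /etc/nginx/.htpasswd;  # created with the htpasswd tool

        location / {
            proxy_pass http://127.0.0.1:5000;  # MLFlow server behind the proxy
        }
    }
    ```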
  13. MLFlow Auth • Relatively new • Still in an experimental phase • Permissions and roles for managing users (which we were missing); usage sketched below
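
    Enabling it is a flag on the server, and clients authenticate through environment variables; the credentials below are placeholders:

    ```python
    # On the server (shell): mlflow server --app-name basic-auth ...
    import os
    import mlflow

    os.environ["MLFLOW_TRACKING_USERNAME"] = "alice"      # placeholder
    os.environ["MLFLOW_TRACKING_PASSWORD"] = "change-me"  # placeholder
    mlflow.set_tracking_uri("http://mlflow.internal.example:5000")
    ```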
  14. How MLFlow changed our work • One place to find everything related to an experiment • Metrics, parameters, model artifacts • Easy access to details (RIP Google Sheets) • The number of iterations didn’t matter anymore • A proper goodbye to debugging nightmares