
MLOps Pipeline for Distributed Model Training with Kubeflow


Mahdi Khashan

January 13, 2025

Transcript

  1. Problem Definition

     - Machine learning project results come from experimenting with ideas
     - In a work setup, a team collaborates on AI model design and development
     - Training models usually involves long-running tasks
     - Model training consists of several steps (preprocessing, training, …)
     - Provisioning hardware for a few hours is far more affordable than owning it
     - We need to monitor the process, dig into each step, and be able to debug
  2. Proposed Solution

     1. Develop a solution deployable on the cloud (DigitalOcean)
     2. Track experiments (MLflow)
     3. Separate steps (Kubeflow Pipeline components)
     4. Register models (MLflow Model Registry)
     5. Use notebooks with predefined images (Jupyter Notebook)
     6. Serve models from the model registry (MLflow/PyTorch)
     7. Evaluate models with a UI (Streamlit)
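The step separation in point 3 can be sketched framework-free as plain chained functions; each function maps onto one Kubeflow Pipeline component (e.g. a function decorated with kfp's `@dsl.component`). All names and paths here are illustrative, not from the deck.

```python
def preprocess(raw_path: str) -> str:
    """Preprocessing step: in Kubeflow this would be one pipeline component
    that reads raw data and writes a processed dataset."""
    return raw_path.rstrip("/") + "/processed"


def train(data_path: str) -> str:
    """Training step: a separate component, typically the long-running
    (possibly distributed) part of the pipeline. Returns a model URI."""
    return "models/model-from-" + data_path.split("/")[-1]


def pipeline(raw_path: str) -> str:
    """Chains the steps; kfp's @dsl.pipeline plays this role in Kubeflow,
    wiring one component's output into the next component's input."""
    return train(preprocess(raw_path))


model_uri = pipeline("s3://bucket/raw")
```

Keeping each step a separate component is what makes it possible to monitor and debug steps individually, as the problem definition asks for.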
  3. Challenges

     - Kubeflow deployment
       - Ordering of object syncing when there are dependencies between them
       - Monitoring pod/service behaviour during installation (K9s)
     - MLflow deployment
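The syncing-order challenge is commonly handled in ArgoCD with sync-wave annotations: objects in lower waves are synced first, so prerequisites (namespaces, CRDs) land before the resources that depend on them. A minimal illustration (resource names are hypothetical):

```yaml
# Synced in wave -1, i.e. before the default wave 0, so dependent
# Kubeflow resources find the namespace already in place.
apiVersion: v1
kind: Namespace
metadata:
  name: kubeflow
  annotations:
    argocd.argoproj.io/sync-wave: "-1"
```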
  4. Learnings

     - Packaging Kubeflow with Helm
     - ArgoCD (app-of-apps pattern)
     - Resource syncing (GitOps)
     - Importance of accessing pod logs (K9s)
     - Unlinted, malformed YAML causes problems when syncing with ArgoCD
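The app-of-apps pattern mentioned above uses one root ArgoCD `Application` that points at a directory of child `Application` manifests, so the whole stack is bootstrapped from a single object. A sketch with a hypothetical repository:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root            # root app; its children are the real workloads
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deployments.git  # hypothetical repo
    path: apps          # directory containing child Application manifests
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated: {}       # child apps are created/pruned automatically
```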