
MLOps Pipeline for Distributed Model Training with Kubeflow


Mahdi Khashan

January 13, 2025

Transcript

  1. Problem Definition

     - Machine learning project results come from experimenting with ideas
     - In a work setup, a team collaborates on AI model design and development
     - Training models usually involves long-running tasks
     - Model training consists of several steps (preprocessing, training, …)
     - Provisioning hardware for a few hours is far more affordable than owning it
     - We need to monitor the process, dig into each step, and be able to debug
  2. Proposed Solution

     1. Develop a solution deployable on the cloud (DigitalOcean)
     2. Track experiments (MLflow)
     3. Separate steps (Kubeflow Pipeline components)
     4. Register models (MLflow Model Registry)
     5. Use notebooks with predefined images (Jupyter Notebook)
     6. Serve models from the model registry (MLflow/PyTorch)
     7. Evaluate models with a UI (Streamlit)
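The step separation in point 3 can be sketched framework-free as plain chained functions; each function maps onto one Kubeflow Pipeline component (e.g. a function decorated with kfp's `@dsl.component`). All names and paths here are illustrative, not from the deck.

```python
def preprocess(raw_path: str) -> str:
    """Preprocessing step: in Kubeflow this would be one pipeline component
    that reads raw data and writes a processed dataset."""
    return raw_path.rstrip("/") + "/processed"


def train(data_path: str) -> str:
    """Training step: a separate component, typically the long-running
    (possibly distributed) part of the pipeline. Returns a model URI."""
    return "models/model-from-" + data_path.split("/")[-1]


def pipeline(raw_path: str) -> str:
    """Chains the steps; kfp's @dsl.pipeline plays this role in Kubeflow,
    wiring one component's output into the next component's input."""
    return train(preprocess(raw_path))


model_uri = pipeline("s3://bucket/raw")
```

Keeping each step a separate component is what makes it possible to monitor and debug steps individually, as the problem definition asks for.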
  3. Challenges

     - Kubeflow deployment
       - Ordering of object syncing when there are dependencies between them
       - Monitoring pod/service behaviour during installation (K9s)
     - MLflow deployment
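The syncing-order challenge is commonly handled in ArgoCD with sync-wave annotations: objects in lower waves are synced first, so prerequisites (namespaces, CRDs) land before the resources that depend on them. A minimal illustration (resource names are hypothetical):

```yaml
# Synced in wave -1, i.e. before the default wave 0, so dependent
# Kubeflow resources find the namespace already in place.
apiVersion: v1
kind: Namespace
metadata:
  name: kubeflow
  annotations:
    argocd.argoproj.io/sync-wave: "-1"
```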
  4. Learnings

     - Packaging Kubeflow with Helm
     - ArgoCD (app-of-apps pattern)
     - Resource syncing (GitOps)
     - Importance of accessing pod logs (K9s)
     - Unlinted, malformed YAML causes problems when syncing with ArgoCD
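The app-of-apps pattern mentioned above uses one root ArgoCD `Application` that points at a directory of child `Application` manifests, so the whole stack is bootstrapped from a single object. A sketch with a hypothetical repository:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root            # root app; its children are the real workloads
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deployments.git  # hypothetical repo
    path: apps          # directory containing child Application manifests
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated: {}       # child apps are created/pruned automatically
```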