
Standardizing Cloud Native ML Computing Platforms

Yuki Iwai
May 23, 2025

Transcript

  1. 1. What Are ML Computing Platforms?
    2. Kubernetes Community Effort for AI/ML
    3. Cloud Native ML Computing Platform
    4. Collaboration in OSS Communities
  2. ML Lifecycle
    [Diagram] Stages: Data Preparation, Model Development, Hyperparameter Optimization, Model Optimization, Model Training, Model Fine-Tuning, Model Serving (loop).
    Other labels: Application Server; Model Registry (Metadata, Model Artifacts); Feature Store (Offline Feature Store, Online Feature Store); Data Producers; Data Products; Feature Generation; Feature Extraction; Pipelines.
  3. ML Lifecycle
    [Diagram] Same lifecycle as the previous slide, with two regions labeled: Computing Platform and Data Platform.
    Stages: Data Preparation, Model Development, Hyperparameter Optimization, Model Optimization, Model Training, Model Fine-Tuning, Model Serving (loop).
    Other labels: Application Server; Model Registry (Metadata, Model Artifacts); Feature Store (Offline Feature Store, Online Feature Store); Data Producers; Data Products; Feature Generation; Feature Extraction; Pipelines.
  4. ML Lifecycle
    [Diagram] Stages: Data Preparation, Model Development, Hyperparameter Optimization, Model Optimization, Model Training, Model Fine-Tuning, Model Serving (loop).
    Other labels: Pipelines; CUT; Computing Platform.
  5. Kubernetes Community for AI/ML
    • Committees (Steering Committee, etc.)
    • Special Interest Groups (SIG Autoscaling, etc.)
    • Working Groups (WG Batch, etc.)
  6. Kubernetes Community for AI/ML
    • Committees (Steering Committee, etc.)
    • Special Interest Groups (SIG Autoscaling, etc.)
    • Working Groups (WG Batch, etc.)
    [Timeline: 2021, 2022, 2023, 2024] WG Batch (with Device Management)
  7. Kubernetes Community for AI/ML
    • Committees (Steering Committee, etc.)
    • Special Interest Groups (SIG Autoscaling, etc.)
    • Working Groups (WG Batch, etc.)
    [Timeline: 2021, 2022, 2023, 2024] WG Batch (with Device Management); WG Device Management; WG Serving
  8. Cloud Native ML Computing Platform
    • Managing devices in Kubernetes-native and declarative ways
      ◦ Device sharing between multiple Pods
      ◦ Dynamic virtual device creation / deletion
      ◦ Fine-grained device-aware Pod scheduling
      ◦ etc.
    • Device Management (DRA)
  9. Cloud Native ML Computing Platform
    [Diagram] Pod, Pod, Pod; Pod Group
    • Pod Scheduling
    • Container Lifecycle Management
    • Pod Group as a new Primitive
    • Enhanced Container Restarting Strategies
    • Device Management (DRA)
  10. Cloud Native ML Computing Platform
    [Diagram] Pod, Pod, Pod; Pod Group; Workload α, Workload β, Workload γ; Quota A, Quota B, Quota C
    • Pod Scheduling
    • Container Lifecycle Management
    • Workload Orchestration
    • Workload Scheduling
    • Quota Management
    • Device Management (DRA)
    • Enhanced Pod Restarting Strategies
    • Resource Fairness across the cluster
    • Pre-Scheduling for Workload
  11. Cloud Native ML Computing Platform
    [Diagram] Load Balancer; Pod, Pod, Pod; Pod Group; model; Workload α, Workload β, Workload γ; Quota A, Quota B, Quota C
    • Traffic Scheduling
    • Pod Scheduling
    • Container Lifecycle Management
    • Workload Orchestration
    • Workload Scheduling
    • Quota Management
    • Device Management (DRA)
    • ML Model / cache Placement-Aware Routing
    • LLM token-size based Load Balancing
  12. Cloud Native ML Computing Platform
    [Diagram] Load Balancer; Pod, Pod, Pod; Pod Group; model; Workload α, Workload β, Workload γ; Quota A, Quota B, Quota C
    [Working group labels on the diagram] WG Device Management; WG Serving; WG Batch
    • Traffic Scheduling
    • Pod Scheduling
    • Container Lifecycle Management
    • Workload Orchestration
    • Workload Scheduling
    • Quota Management
    • Device Management (DRA)
  13. Cloud Native ML Computing Platform
    [Diagram] Load Balancer; Pod, Pod, Pod; Pod Group; model; Workload α, Workload β, Workload γ; Quota A, Quota B, Quota C
    • Device Management (DRA)
    • Traffic Scheduling
    • Pod Scheduling
    • Container Lifecycle Management
    • Workload Orchestration
    • Workload Scheduling
    • Quota Management
  14. Collaboration in OSS communities
    • OSS as a business
    • Collaborate with professional contributors and maintainers
    • Handling conflicting user stories
  15. Collaboration in OSS communities
    • Going far together with the OSS community
    • Standardization of technology