Slide 1

Slide 1 text

Standardizing Cloud Native ML Computing Platforms
Yuki Iwai / Member of Cycloud ML Platform Team

Slide 2

Slide 2 text

1. What Are ML Computing Platforms
2. Kubernetes Community Effort for AI/ML
3. Cloud Native ML Computing Platform
4. Collaboration in OSS communities

Slide 3

Slide 3 text

ML Lifecycle
[Diagram labels: Data Preparation, Model Development, Hyperparameter Optimization, Model Optimization, Model Training, Model Fine-Tuning, Model Serving, Application Server, Model Registry, Metadata, Model Artifacts, Offline Feature Store, Feature Store, Online Feature Store, Data Producers, Data Products, Feature Generation, Feature Extraction, loop, Pipelines]

Slide 4

Slide 4 text

ML Lifecycle
[Diagram labels: Data Preparation, Model Development, Hyperparameter Optimization, Model Optimization, Model Training, Model Fine-Tuning, Model Serving, Application Server, Model Registry, Metadata, Model Artifacts, Offline Feature Store, Feature Store, Online Feature Store, Data Producers, Data Products, Feature Generation, Feature Extraction, loop, Pipelines, Computing Platform, Data Platform]

Slide 5

Slide 5 text

ML Lifecycle
[Diagram labels: Data Preparation, Model Development, Hyperparameter Optimization, Model Optimization, Model Training, Model Fine-Tuning, Model Serving, loop, Pipelines, CUT, Computing Platform]

Slide 6

Slide 6 text

Kubernetes Community for AI/ML
• Committees (Steering Committee, etc.)
• Special Interest Groups (SIG Autoscaling, etc.)
• Working Groups (WG Batch, etc.)

Slide 7

Slide 7 text

Kubernetes Community for AI/ML
• Committees (Steering Committee, etc.)
• Special Interest Groups (SIG Autoscaling, etc.)
• Working Groups (WG Batch, etc.)
[Timeline labels: WG Batch (with Device Management); axis: 2021, 2022, 2023, 2024]

Slide 8

Slide 8 text

Kubernetes Community for AI/ML
• Committees (Steering Committee, etc.)
• Special Interest Groups (SIG Autoscaling, etc.)
• Working Groups (WG Batch, etc.)
[Timeline labels: WG Batch (with Device Management), WG Device Management, WG Serving; axis: 2021, 2022, 2023, 2024]

Slide 9

Slide 9 text

Cloud Native ML Computing Platform
● Managing Devices in Kubernetes-Native and Declarative Ways
  ○ Device Sharing between multiple Pods
  ○ Dynamic Virtual Device creation / deletion
  ○ Fine-grained Device-Aware Pod scheduling
  ○ etc.
● Device Management (DRA)

Slide 10

Slide 10 text

Cloud Native ML Computing Platform
[Diagram labels: Pod, Pod, Pod, Pod Group]
● Pod Scheduling
● Container Lifecycle Management
● Pod Group as a new Primitive
● Enhanced Container Restarting Strategies
● Device Management (DRA)
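The "Pod Group as a new Primitive" bullet can be illustrated with a small sketch. The slide does not define the semantics, so this assumes the common motivation for grouping Pods: all-or-nothing (gang) placement, where a distributed training job starts only if every worker fits. The function name, the GPU-count model, and the first-fit-decreasing heuristic are all hypothetical and are not taken from any Kubernetes API.

```python
from typing import Dict, List, Optional


def schedule_pod_group(
    pod_gpu_requests: List[int],
    free_gpus_per_node: Dict[str, int],
) -> Optional[List[str]]:
    """Place every Pod in the group, or none of them (hypothetical sketch).

    Returns one node name per Pod, in the same order as the requests,
    or None when the whole group cannot be placed. The input dict is
    never mutated, so a rejected group leaves capacity untouched.
    """
    remaining = dict(free_gpus_per_node)  # work on a copy
    placement: List[Optional[str]] = [None] * len(pod_gpu_requests)

    # Place the largest requests first (first-fit-decreasing heuristic).
    order = sorted(range(len(pod_gpu_requests)), key=lambda i: -pod_gpu_requests[i])
    for i in order:
        need = pod_gpu_requests[i]
        # Pick the first node (by name) with enough free GPUs.
        node = next((n for n in sorted(remaining) if remaining[n] >= need), None)
        if node is None:
            return None  # one Pod does not fit, so reject the whole group
        remaining[node] -= need
        placement[i] = node

    return [n for n in placement if n is not None]
```

With two nodes holding two free GPUs each, a group of two 2-GPU Pods is placed, while a group of three is rejected as a whole rather than started partially. The heuristic is greedy, so it can reject a group that a smarter packing would fit.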

Slide 11

Slide 11 text

Cloud Native ML Computing Platform
[Diagram labels: Pod, Pod, Pod, Pod Group]
● Pod Scheduling
● Container Lifecycle Management
[Diagram labels: Workload α, Workload β, Workload γ]
● Workload Orchestration
● Workload Scheduling
● Quota Management
[Diagram labels: Quota A, Quota B, Quota C]
● Device Management (DRA)
● Enhanced Pod Restarting Strategies
● Resource Fairness across cluster
● Pre-Scheduling for Workload
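The "Quota Management" and "Pre-Scheduling for Workload" bullets suggest that a workload is checked against a quota before any of its Pods are scheduled. A minimal sketch of that idea follows, assuming each workload names one quota and requests a fixed number of GPUs. The `Quota` and `Workload` classes and the `admit` function are hypothetical illustrations, not the API of any specific project.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Quota:
    limit: int      # total GPUs this quota may use
    used: int = 0   # GPUs held by already admitted workloads


@dataclass
class Workload:
    name: str
    quota: str      # name of the quota this workload is charged to
    gpus: int


def admit(
    workloads: List[Workload], quotas: Dict[str, Quota]
) -> Tuple[List[str], List[str]]:
    """Admit workloads in order while their quota has room.

    Returns (admitted names, pending names). Admitted workloads are
    charged to their quota by updating `used` in place.
    """
    admitted: List[str] = []
    pending: List[str] = []
    for w in workloads:
        q = quotas[w.quota]
        if q.used + w.gpus <= q.limit:
            q.used += w.gpus
            admitted.append(w.name)
        else:
            pending.append(w.name)  # stays queued; no Pods are created
    return admitted, pending
```

This sketch deliberately leaves out borrowing between quotas and preemption, which a real fairness mechanism across the cluster would likely need.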

Slide 12

Slide 12 text

Cloud Native ML Computing Platform
[Diagram labels: Load Balancer]
● Traffic Scheduling
[Diagram labels: Pod, Pod, Pod, Pod Group, model]
● Pod Scheduling
● Container Lifecycle Management
[Diagram labels: Workload α, Workload β, Workload γ]
● Workload Orchestration
● Workload Scheduling
● Quota Management
[Diagram labels: Quota A, Quota B, Quota C]
● Device Management (DRA)
● ML Model / cache Placement-Aware Routing
● LLM token-size based Load Balancing
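The "LLM token-size based Load Balancing" bullet can be sketched as follows. The slide does not name an algorithm, so this assumes one simple reading: instead of counting requests, the balancer tracks the number of outstanding tokens per replica and sends each new request to the replica with the fewest. The class and method names are hypothetical.

```python
from typing import Dict, Iterable


class TokenAwareBalancer:
    """Route each request to the replica with the fewest outstanding tokens."""

    def __init__(self, replicas: Iterable[str]) -> None:
        self.outstanding: Dict[str, int] = {r: 0 for r in replicas}

    def route(self, token_count: int) -> str:
        # Least outstanding tokens wins; ties are broken by replica name.
        replica = min(self.outstanding, key=lambda r: (self.outstanding[r], r))
        self.outstanding[replica] += token_count
        return replica

    def complete(self, replica: str, token_count: int) -> None:
        # Release the tokens once the request has finished.
        self.outstanding[replica] = max(0, self.outstanding[replica] - token_count)
```

Under this assumption, one 1000-token request occupies a replica for as long as many short requests would, so several small requests are steered to the other replica until the large one completes. Request-count balancing would not see that difference.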

Slide 13

Slide 13 text

Cloud Native ML Computing Platform
[Diagram labels: Load Balancer, WG Device Management, WG Serving]
● Traffic Scheduling
[Diagram labels: Pod, Pod, Pod, Pod Group, model]
● Pod Scheduling
● Container Lifecycle Management
[Diagram labels: WG Batch, Workload α, Workload β, Workload γ]
● Workload Orchestration
● Workload Scheduling
● Quota Management
[Diagram labels: WG Serving, WG Batch, Quota A, Quota B, Quota C]
● Device Management (DRA)

Slide 14

Slide 14 text

Cloud Native ML Computing Platform
[Diagram labels: Load Balancer]
● Device Management (DRA)
● Traffic Scheduling
[Diagram labels: Pod, Pod, Pod, Pod Group, model]
● Pod Scheduling
● Container Lifecycle Management
[Diagram labels: Workload α, Workload β, Workload γ]
● Workload Orchestration
● Workload Scheduling
● Quota Management
[Diagram labels: Quota A, Quota B, Quota C]

Slide 15

Slide 15 text

Work together with the ML Ecosystem
[Diagram labels: Training, Inference]

Slide 16

Slide 16 text

Collaboration in OSS communities
• OSS as a business
• Collaborate with professional contributors and maintainers
• Handling conflicts between user stories

Slide 17

Slide 17 text

Collaboration in OSS communities
• Going far together with the OSS community
• Standardization of technology

Slide 18

Slide 18 text

Thank you