Standardizing Cloud Native ML Computing Platforms

by Yuki Iwai

Slide 1

Slide 1 text

Standardizing Cloud Native ML Computing Platforms
Yuki Iwai / Member of Cycloud ML Platform Team

Slide 2

Slide 2 text

1. What Are ML Computing Platforms?
2. Kubernetes Community Effort for AI/ML
3. Cloud Native ML Computing Platform
4. Collaboration in OSS Communities

Slide 3

Slide 3 text

ML Lifecycle
Diagram labels: Data Preparation, Model Development, Hyperparameter Optimization, Model Optimization, Model Training, Model Fine-Tuning, Model Serving, Application Server, Model Registry, Metadata, Model Artifacts, Offline Feature Store, Feature Store, Online Feature Store, Data Producers, Data Products, Feature Generation, Feature Extraction, loop, Pipelines

Slide 4

Slide 4 text

Slide 5

Slide 5 text

ML Lifecycle
Diagram labels: Data Preparation, Model Development, Hyperparameter Optimization, Model Optimization, Model Training, Model Fine-Tuning, Model Serving, loop, Pipelines, CUT, Computing Platform

Slide 6

Slide 6 text

Kubernetes Community for AI/ML
• Committees (Steering Committee, etc.)
• Special Interest Groups (SIG Autoscaling, etc.)
• Working Groups (WG Batch, etc.)

Slide 7

Slide 7 text

Kubernetes Community for AI/ML
• Committees (Steering Committee, etc.)
• Special Interest Groups (SIG Autoscaling, etc.)
• Working Groups (WG Batch, etc.)
WG Batch (with Device Management)
Timeline: 2021, 2022, 2023, 2024

Slide 8

Slide 8 text

Slide 9

Slide 9 text

Cloud Native ML Computing Platform
● Managing devices in Kubernetes-native, declarative ways
  ○ Device sharing between multiple Pods
  ○ Dynamic virtual device creation / deletion
  ○ Fine-grained, device-aware Pod scheduling
  ○ etc.
● Device Management (DRA)
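The device sharing and device-aware scheduling bullets above can be illustrated with a toy allocator. This is a minimal sketch, not the DRA API: `Device`, `Claim`, `max_sharers`, and `min_memory_gib` are hypothetical names chosen for illustration. The point is only that a Pod declares what it needs rather than naming a device, and that one device may serve several Pods.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str
    attributes: dict          # hypothetical attributes, e.g. {"memory_gib": 80}
    max_sharers: int = 1      # how many Pods may use this device at once
    users: list = field(default_factory=list)


@dataclass
class Claim:
    pod: str
    min_memory_gib: int       # a declarative requirement, not a device name


def allocate(claim, devices):
    """Return the first device that satisfies the claim, or None."""
    for dev in devices:
        fits = dev.attributes.get("memory_gib", 0) >= claim.min_memory_gib
        has_room = len(dev.users) < dev.max_sharers
        if fits and has_room:
            dev.users.append(claim.pod)
            return dev
    return None


devices = [
    Device("gpu-0", {"memory_gib": 40}),
    Device("gpu-1", {"memory_gib": 80}, max_sharers=2),
]

a = allocate(Claim("pod-a", 64), devices)   # gpu-1
b = allocate(Claim("pod-b", 64), devices)   # gpu-1 again (shared)
c = allocate(Claim("pod-c", 64), devices)   # None: gpu-1 is full, gpu-0 is too small
```

In this sketch the selection logic lives in one function; in a real cluster the equivalent decision is made by the scheduler together with device drivers.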

Slide 10

Slide 10 text

Cloud Native ML Computing Platform
Pod, Pod, Pod, Pod Group
● Pod Scheduling
● Container Lifecycle Management
● Pod Group as a new Primitive
● Enhanced Container Restarting Strategies
● Device Management (DRA)
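One common reason to treat a Pod Group as a primitive is all-or-nothing (gang) placement: a distributed training job is useless if only some of its workers start. The following is a minimal sketch under that assumption; `schedule_pod_group` and its arguments are hypothetical and do not correspond to any Kubernetes API.

```python
def schedule_pod_group(pod_requests, node_free):
    """Place every Pod in the group, or none of them.

    pod_requests: dict of Pod name -> GPUs requested
    node_free:    dict of node name -> free GPUs (updated only on success)
    Returns a dict of Pod name -> node name, or None if the group does not fit.
    """
    remaining = dict(node_free)   # work on a copy so a failure has no side effects
    placement = {}
    # First-fit, largest Pods first. This is a heuristic and can miss
    # some feasible packings.
    for pod, gpus in sorted(pod_requests.items(), key=lambda kv: -kv[1]):
        node = next((n for n, free in remaining.items() if free >= gpus), None)
        if node is None:
            return None           # one Pod does not fit, so reject the whole group
        remaining[node] -= gpus
        placement[pod] = node
    node_free.update(remaining)   # commit only once the whole group fits
    return placement


nodes = {"node-1": 4, "node-2": 2}
group = {"worker-0": 2, "worker-1": 2, "worker-2": 2}

placed = schedule_pod_group(group, nodes)          # all three workers fit
rejected = schedule_pod_group({"extra-0": 1}, nodes)  # None: no GPUs left
```

Scheduling Pods one by one could instead leave two workers running and one pending, holding GPUs without making progress.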

Slide 11

Slide 11 text

Cloud Native ML Computing Platform
Pod, Pod, Pod, Pod Group
● Pod Scheduling
● Container Lifecycle Management
Workload α, Workload β, Workload γ
● Workload Orchestration
● Workload Scheduling
● Quota Management
Quota A, Quota B, Quota C
● Device Management (DRA)
● Enhanced Pod Restarting Strategies
● Resource Fairness across cluster
● Pre-Scheduling for Workload
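Quota management and cluster-wide fairness can be sketched as an admission check that runs before any Pod is scheduled. The model below is an assumption made for illustration, not the behavior of a specific project: each quota has a nominal GPU share plus a hypothetical `borrow_limit`, and a workload is admitted only if the quota's own cap and the cluster total both hold.

```python
def admit(queue, gpus, usage, nominal, borrow_limit):
    """Admit a workload if its quota and the cluster total both allow it.

    usage, nominal, borrow_limit: dicts keyed by quota name.
    usage is updated in place when the workload is admitted.
    """
    own_cap = nominal[queue] + borrow_limit[queue]
    within_own_cap = usage[queue] + gpus <= own_cap
    within_cluster = sum(usage.values()) + gpus <= sum(nominal.values())
    if within_own_cap and within_cluster:
        usage[queue] += gpus
        return True
    return False


nominal = {"quota-a": 4, "quota-b": 4, "quota-c": 4}
borrow_limit = {"quota-a": 2, "quota-b": 2, "quota-c": 2}
usage = {"quota-a": 0, "quota-b": 0, "quota-c": 0}

r1 = admit("quota-a", 4, usage, nominal, borrow_limit)  # True: within nominal share
r2 = admit("quota-a", 2, usage, nominal, borrow_limit)  # True: borrows 2 idle GPUs
r3 = admit("quota-a", 1, usage, nominal, borrow_limit)  # False: borrow limit reached
r4 = admit("quota-b", 4, usage, nominal, borrow_limit)  # True
r5 = admit("quota-c", 4, usage, nominal, borrow_limit)  # False: cluster total would be exceeded
```

The last call shows the limit of this sketch: quota C is denied part of its nominal share because quota A borrowed it. A fuller design would reclaim borrowed resources, for example through preemption.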

Slide 12

Slide 12 text

Cloud Native ML Computing Platform
Load Balancer
● Traffic Scheduling
Pod, Pod, Pod, Pod Group, model
● Pod Scheduling
● Container Lifecycle Management
Workload α, Workload β, Workload γ
● Workload Orchestration
● Workload Scheduling
● Quota Management
Quota A, Quota B, Quota C
● Device Management (DRA)
● ML Model / cache Placement-Aware Routing
● LLM token-size based Load Balancing
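The two routing bullets above can be combined in a small sketch: prefer backends that already hold the requested model, then pick the one with the fewest pending tokens instead of the fewest pending requests. The function name, the backend fields, and the model names are hypothetical, and using prompt token count as the load estimate is an assumption.

```python
def pick_backend(model, prompt_tokens, backends):
    """Route a request to a backend and record its token load.

    backends: list of dicts with "name", "loaded_models" (set), "pending_tokens".
    """
    # Placement-aware: prefer Pods that already have the model loaded.
    warm = [b for b in backends if model in b["loaded_models"]]
    candidates = warm or backends   # fall back to any Pod if none is warm
    # Token-size based: the least loaded Pod is the one with the fewest pending tokens.
    chosen = min(candidates, key=lambda b: b["pending_tokens"])
    chosen["pending_tokens"] += prompt_tokens
    return chosen["name"]


backends = [
    {"name": "pod-a", "loaded_models": {"model-x"}, "pending_tokens": 900},
    {"name": "pod-b", "loaded_models": {"model-x"}, "pending_tokens": 200},
    {"name": "pod-c", "loaded_models": {"model-y"}, "pending_tokens": 0},
]

first = pick_backend("model-x", 1000, backends)   # pod-b: warm and least loaded
second = pick_backend("model-x", 100, backends)   # pod-a: pod-b now has 1200 pending
third = pick_backend("model-z", 50, backends)     # pod-c: no warm Pod, so least loaded overall
```

Counting requests alone would treat a 1000-token prompt and a 100-token prompt as equal load, which is the imbalance that token-size based balancing is meant to avoid.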

Slide 13

Slide 13 text

Cloud Native ML Computing Platform
Load Balancer
WG Device Management
WG Serving
● Traffic Scheduling
Pod, Pod, Pod, Pod Group, model
● Pod Scheduling
● Container Lifecycle Management
WG Batch
Workload α, Workload β, Workload γ
● Workload Orchestration
● Workload Scheduling
● Quota Management
WG Serving
WG Batch
Quota A, Quota B, Quota C
● Device Management (DRA)

Slide 14

Slide 14 text

Cloud Native ML Computing Platform
Load Balancer
● Device Management (DRA)
● Traffic Scheduling
Pod, Pod, Pod, Pod Group, model
● Pod Scheduling
● Container Lifecycle Management
Workload α, Workload β, Workload γ
● Workload Orchestration
● Workload Scheduling
● Quota Management
Quota A, Quota B, Quota C

Slide 15

Slide 15 text

Work together with the ML Ecosystem
Training
Inference

Slide 16

Slide 16 text

Collaboration in OSS communities
• OSS as a business
• Collaborate with professional contributors and maintainers
• Handling conflicting user stories

Slide 17

Slide 17 text

Collaboration in OSS communities
• Going far together with the OSS community
• Standardization of technology

Slide 18

Slide 18 text

Thank you