Slide 8
Key innovations in training and validation
Diagram components:
TGIS: Optimized Text Generation Inference Server
Caikit: Dev APIs, Prompt Tuning, Inference
KServe
InstaScale: Cluster Scaling
MCAD: Job dispatching, queuing and packing
KubeRay, TorchX
Multi-NIC CNI
Training and validation
Workflows, domain-specific APIs
Tuning and serving
Simplified user experience with CodeFlare SDK
Intuitive, easy-to-use Python interface for batch resource requesting, access and job submission
Enhanced interactivity, logging and observability for AI/ML jobs on OpenShift
Advanced Kubernetes-native Resource Management
Multi-Cluster App Dispatcher (MCAD) enabling job queueing, meta-scheduling, prioritization and quota management
InstaScale providing on-demand cluster scaling
Integrated support for TorchX and KubeRay
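
To make the queueing, prioritization and quota ideas above concrete, here is a minimal toy sketch in Python. It is not the MCAD API or its scheduling algorithm: the class `ToyDispatcher`, its methods, and the single GPU quota are all hypothetical names and simplifications chosen for illustration. It assumes each job asks for a fixed number of GPUs, that higher priority numbers are served first, and that a smaller job may be dispatched ahead of a larger one that does not currently fit.

```python
import heapq
from itertools import count


class ToyDispatcher:
    """Hypothetical toy model of a queueing dispatcher (not the MCAD API)."""

    def __init__(self, gpu_quota):
        self.gpu_quota = gpu_quota   # assumed single quota: total GPUs available
        self.gpus_in_use = 0
        self._heap = []              # entries: (-priority, arrival_order, name, gpus)
        self._arrival = count()      # tie-breaker so equal priorities stay FIFO

    def submit(self, name, priority, gpus):
        """Queue a job; a higher priority value is served earlier."""
        heapq.heappush(self._heap, (-priority, next(self._arrival), name, gpus))

    def dispatch(self):
        """Release queued jobs in priority order while they fit in the quota."""
        dispatched, held = [], []
        while self._heap:
            entry = heapq.heappop(self._heap)
            _, _, name, gpus = entry
            if self.gpus_in_use + gpus <= self.gpu_quota:
                self.gpus_in_use += gpus
                dispatched.append(name)
            else:
                held.append(entry)   # does not fit yet; keep it queued
        for entry in held:
            heapq.heappush(self._heap, entry)
        return dispatched

    def release(self, gpus):
        """Return GPUs to the quota when a job finishes."""
        self.gpus_in_use = max(0, self.gpus_in_use - gpus)

    def pending(self):
        """Names of jobs still waiting, in the order they would be considered."""
        return [name for _, _, name, _ in sorted(self._heap)]


if __name__ == "__main__":
    d = ToyDispatcher(gpu_quota=8)
    d.submit("preprocess", priority=1, gpus=2)
    d.submit("finetune", priority=5, gpus=4)
    d.submit("pretrain", priority=3, gpus=8)
    print(d.dispatch())   # jobs that fit within the quota right now
    print(d.pending())    # jobs held back until capacity is released
```

In this sketch a held job simply waits until `release` frees capacity. An on-demand scaler such as InstaScale would instead respond to the held job by adding capacity, which this toy model does not attempt to show.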
Scalable, efficient pre-processing, training and validation
Scale-out, distributed GPU-based training and fine-tuning with PyTorch and Ray