Slide 1

Slide 1 text

Building an open source, cloud-native platform for foundation models: OpenShift AI and NASA use case
Carlos Costa (IBM), Alex Corvin (Red Hat), Hongchao Deng (Anyscale)
Featuring open source collaboration with Anyscale

Slide 2

Slide 2 text

Emerging AI workflows impose new challenges on the platform

Slide 3

Slide 3 text

Foundation models are becoming an essential ingredient of a new AI workflow

Slide 4

Slide 4 text

An end-to-end life cycle, from model creation to deployment:
- Data preparation: workflow of steps (e.g. remove hate and profanity, deduplicate, etc.), on public clouds or on-prem
- Distributed training: long-running job on massive infrastructure, on public clouds or on-prem
- Model adaptation: model tuning with a custom data set for downstream tasks, on public clouds or on-prem
- Inference: may have sensitivity to latency, throughput and power; on public clouds, on-prem or at the edge

Slide 5

Slide 5 text

Emerging new challenges
Today: making it easier to scale and orchestrate today's core AI building blocks (data processing, training, validation, fine tuning, serving, optimization, AI/ML pipelines)…
Tomorrow: …making possible what is next: scaling to larger models (billions of parameters), more complex adaptation pipelines, and moving from narrow to broad, more reusable AI.

Slide 6

Slide 6 text

Building a platform for E2E life cycle of foundation models

Slide 7

Slide 7 text

A cloud-native stack for foundation models
- Training and validation: MCAD (job dispatching, queuing and packing), InstaScale (cluster scaling), KubeRay, TorchX, Multi-NIC CNI
- Tuning and serving: KServe, TGIS (optimized Text Generation Inference Server), Caikit (dev APIs, prompt tuning, inference)
- Workflows and domain-specific APIs on top
- Delivered as OpenShift AI, with midstream delivery in Open Data Hub (self-managed, self-deploy platform)

Slide 8

Slide 8 text

Key innovations in training and validation
- Simplified user experience with the CodeFlare SDK: an intuitive, easy-to-use Python interface for batch resource requesting, access and job submission, with enhanced interactivity, logging and observability for AI/ML jobs on OpenShift (see the sketch below)
- Advanced Kubernetes-native resource management: the Multi-Cluster App Dispatcher (MCAD) enabling job queueing, meta-scheduling, prioritization and quota management; InstaScale providing on-demand cluster scaling; integrated support for TorchX and KubeRay
- Scalable, efficient pre-processing, training and validation: scale-out, distributed GPU-based training and fine tuning with PyTorch and Ray
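To make the "batch resource requesting, access and job submission" flow concrete, here is a minimal sketch using the CodeFlare SDK. It assumes the codeflare_sdk package is installed and an OpenShift login token is available; class and parameter names have changed across SDK releases, so treat the names here as indicative rather than authoritative.

```python
# Minimal sketch: request a Ray cluster via the CodeFlare SDK (MCAD + InstaScale).
# Class/parameter names may differ between SDK versions; verify before use.
from codeflare_sdk.cluster.auth import TokenAuthentication
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration

# Authenticate against OpenShift (token and server are placeholders).
auth = TokenAuthentication(token="sha256~...", server="https://api.cluster.example.com:6443")
auth.login()

# Describe the Ray cluster for MCAD to queue and InstaScale to provision.
cluster = Cluster(ClusterConfiguration(
    name="fm-tuning",
    namespace="default",
    num_workers=2,
    min_cpus=8, max_cpus=8,
    min_memory=64, max_memory=64,   # GiB per worker
    num_gpus=1,
    instascale=True,                # ask InstaScale for on-demand machine scaling
))

cluster.up()          # submit the request as an MCAD AppWrapper
cluster.wait_ready()  # block until the Ray cluster is running
print(cluster.details())

# Submit work to the cluster (e.g. a Ray or TorchX job), then tear it down.
cluster.down()
```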

Slide 9

Slide 9 text

Key innovations in inference
- User experience and performance: performant and efficient inference; SDK and developer experience; tuning (vanilla prompt tuning and multi-prompt tuning); model scaling, GPU sharing and model placement (see the client sketch below)
- Coming: LoRA and emerging variants; output filtering (HAP, PII); model chaining/composition
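As a rough illustration of the serving path, the sketch below shows a client calling a Caikit + TGIS text generation endpoint over HTTP. The endpoint path, payload fields and model id are assumptions made for illustration, not the documented Caikit API; consult the deployed runtime's API definitions for the actual schema.

```python
# Hypothetical client for a Caikit + TGIS text generation service deployed via KServe.
# The URL, path and JSON fields below are illustrative assumptions, not the
# authoritative Caikit API; check the runtime's OpenAPI/gRPC definitions.
import requests

CAIKIT_URL = "https://caikit-tgis.example.com"   # assumed KServe route
MODEL_ID = "flan-t5-xl-prompt-tuned"             # assumed prompt-tuned model artifact

payload = {
    "model_id": MODEL_ID,
    "inputs": "Summarize: foundation models are ...",
    "parameters": {"max_new_tokens": 128, "temperature": 0.7},
}

resp = requests.post(f"{CAIKIT_URL}/api/v1/task/text-generation",
                     json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())
```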

Slide 10

Slide 10 text

Building an end-to-end platform, from model creation to business value
- watsonx.ai: Models (suite of IBM-trained foundation models), Studio (tune and infer), model serving
- Hybrid cloud platform (OpenShift AI):
  - Train and validate: pre-processing, validation, model training; MCAD (job dispatching, queuing and packing), InstaScale (cluster scaling), KubeRay, TorchX
  - Tune and serve: KServe, TGIS (optimized Text Generation Inference Server), Caikit runtime (dev APIs, prompt tuning, inference)

Slide 11

Slide 11 text

Working with the open source community

Slide 12

Slide 12 text

KubeRay: the best solution for Ray on Kubernetes
KubeRay enables data/ML scientists to focus on computation while infra engineers concentrate on Kubernetes.
- Data/ML scientists: develop Python scripts.
- Kubernetes infra engineers: integrate KubeRay with Kubernetes ecosystem tools, e.g. Prometheus, Grafana, and Nginx.
KubeRay sits between Ray and Kubernetes: it creates, updates and deletes Ray pods, health-checks and monitors them, and acts on scaling requests for tasks/actors.
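A small example of that division of labor: a data scientist can stay entirely in Python and submit work through Ray's job submission API, while KubeRay manages the pods behind the cluster. The dashboard address below is an assumption; in practice it is exposed via a Kubernetes Service, an OpenShift Route, or a port-forward.

```python
# Submit a Python script to a KubeRay-managed Ray cluster using Ray's job API.
# The dashboard address is an assumption; expose it with a Service/Route or
# `kubectl port-forward svc/<raycluster>-head-svc 8265:8265`.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")

job_id = client.submit_job(
    entrypoint="python train.py",                       # the data scientist's script
    runtime_env={"working_dir": "./", "pip": ["torch"]},
)
print("submitted:", job_id)
print(client.get_job_status(job_id))
```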

Slide 13

Slide 13 text

IBM and Ray community collaboration
- Simplified user experience to deploy Ray: CodeFlare SDK and CodeFlare Project
- Contributed workflow DAG generation under ray-project (Ray Workflows; a minimal DAG sketch follows below)
- Evolving API server
- Integration with a Kubernetes-native job scheduler: MCAD integration with KubeRay (current focus)
- Use cases in new domains (model training, quantum, chip design) leading to contributions across the stack and increasing mind share
Open source collaboration with the Ray community
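Since the contributed DAG generation builds on Ray Workflows, a minimal sketch of a Ray Workflows DAG may help. It uses Ray's public (alpha) workflow API with made-up task names; it is not the contributed generator itself.

```python
# Minimal Ray Workflows DAG: two data-preparation tasks feeding a tuning step.
# Task names and logic are made up for illustration; the API is Ray's (alpha)
# workflow module, not the IBM-contributed DAG generator.
import ray
from ray import workflow

ray.init(storage="/tmp/ray-workflow-data")  # workflows need a storage location

@ray.remote
def deduplicate(docs):
    return list(dict.fromkeys(docs))  # order-preserving dedup

@ray.remote
def filter_profanity(docs):
    return [d for d in docs if "badword" not in d]

@ray.remote
def tune(docs):
    return f"tuned on {len(docs)} documents"

dag = tune.bind(filter_profanity.bind(deduplicate.bind(["a", "a", "b"])))
print(workflow.run(dag, workflow_id="prep-and-tune"))  # durable DAG execution
```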

Slide 14

Slide 14 text

Deploying in the field and enabling new domains: NASA use case

Slide 15

Slide 15 text

Foundation models @ NASA
NASA and IBM have teamed up to create an AI foundation model for Earth observations, using large-scale satellite and remote sensing data, including the Harmonized Landsat and Sentinel-2 (HLS) data.
Leveraging a common platform (OpenShift AI) across multiple environments, including on-prem, to enable the full life cycle of foundation model training: pre-train, fine tune, inference.

Slide 16

Slide 16 text

Building and inferencing a geospatial foundation model
- FM workflows (pre-processing, training, fine tuning) on OpenShift AI: MCAD (job dispatching, queuing and packing), InstaScale (cluster scaling), KubeRay, TorchX
- Pre-train on IBM Cloud (Vela): produces the geospatial model
- Fine tune on IBM Cloud: produces the fine-tuned model
- Inference on AWS

Slide 17

Slide 17 text

First-of-a-kind foundation model from IBM + NASA, available on Hugging Face

Slide 18

Slide 18 text

What is next…

Slide 19

Slide 19 text

What’s next…
- Fully automated pre-processing, validation and fine-tuning pipelines
- KubeRay MCAD integration and hardened OpenShift support
- Advanced job and configuration templates for foundation model jobs
- Automated deployment, job launching, and enhanced observability
- Advanced fault recovery and requeuing

Slide 20

Slide 20 text

Thank you!