
OpenShift Commons Gathering Chicago 2023 - Case Study: OpenShift AI at NASA


Carlos Costa (IBM), Hongchao Deng (Anyscale), and Alex Corvin (Red Hat) present at the OpenShift Commons Gathering Co-Located with KubeCon + CloudNativeCon North America 2023.

OpenShift Commons

November 17, 2023


Transcript

  1. Building an open source, cloud-native platform for foundation models:

    OpenShift AI and NASA use case. Carlos Costa (IBM), Alex Corvin (Red Hat), Hongchao Deng (Anyscale). Featuring open source collaboration with
  2. An end-to-end life cycle: from model creation to deployment. Data

    preparation: a workflow of steps (e.g. remove hate and profanity, deduplicate). Distributed training: long-running jobs on massive infrastructure. Model adaptation: model tuning with custom data sets for downstream tasks. Inference: may be sensitive to latency, throughput, and power. Each stage can run on public clouds or on-prem; inference also extends to the edge.
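The data-preparation workflow described above can be pictured as a chain of filter/transform steps over text records. The sketch below is illustrative only: the tiny `BLOCKLIST`, the step functions, and the sample corpus are invented stand-ins, not part of the actual stack.

```python
# Toy data-preparation pipeline: each step is a plain function over a
# list of text records, applied in sequence (hypothetical stand-in for
# the real hate/profanity and deduplication steps).

BLOCKLIST = {"hateword"}  # stand-in for a real hate/profanity filter

def remove_blocked(records):
    """Drop records containing any blocklisted token."""
    return [r for r in records if not (set(r.lower().split()) & BLOCKLIST)]

def deduplicate(records):
    """Drop exact duplicates, keeping the first occurrence."""
    seen, out = set(), []
    for r in records:
        if r not in seen:
            seen.add(r)
            out.append(r)
    return out

def run_pipeline(records, steps):
    """Apply each preparation step in order."""
    for step in steps:
        records = step(records)
    return records

corpus = ["good text", "good text", "some hateword here"]
clean = run_pipeline(corpus, [remove_blocked, deduplicate])
print(clean)  # ['good text']
```

In a production stack each step would be a distributed job rather than an in-memory function, but the composition pattern is the same.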
  3. Emerging new challenges. Today: making it easier to scale and

    orchestrate today's core AI building blocks: data processing, training, validation, fine tuning, serving, optimization, and AI/ML pipelines. Tomorrow: making possible what is next: scale to larger models (billions of parameters), more complex adaptation pipelines, and a shift from narrow to broad, more reusable AI.
  4. A cloud-native stack for foundation models. Training and validation:

    MCAD (job dispatching, queuing, and packing), InstaScale (cluster scaling), KubeRay, TorchX, and Multi-NIC CNI. Tuning and serving: KServe, TGIS (optimized Text Generation Inference Server), and Caikit (dev APIs, prompt tuning, inference) with domain-specific APIs. Workflows tie the stages together. Delivered in OpenShift AI, with midstream delivery in Open Data Hub as a self-managed, self-deploy platform.
  5. Key innovations in training and validation. Simplified user experience:

    the CodeFlare SDK offers an intuitive, easy-to-use Python interface for batch resource requesting, access, and job submission, with enhanced interactivity, logging, and observability for AI/ML jobs on OpenShift. Advanced Kubernetes-native resource management: Multi-Cluster App Dispatcher (MCAD) enables job queueing, meta-scheduling, prioritization, and quota management; InstaScale provides on-demand cluster scaling; integrated support for TorchX and KubeRay. Scalable, efficient pre-processing, training, and validation: scale-out, distributed GPU-based training and fine tuning with PyTorch and Ray.
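The queueing, prioritization, and quota behavior MCAD provides can be illustrated with a toy dispatcher. This is a stdlib-only sketch, not MCAD's actual API: the `Job` class, the `gpu_quota` parameter, and the sample jobs are all invented for the example.

```python
# Toy job dispatcher sketch: priority-ordered admission under a GPU quota,
# loosely mirroring MCAD-style queueing (illustrative names throughout).
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    priority: int                      # lower number = dispatched first
    name: str = field(compare=False)
    gpus: int = field(compare=False)   # resources the job requests

def dispatch(jobs, gpu_quota):
    """Pop jobs in priority order; admit those that fit the remaining
    quota and leave the rest queued for a later pass."""
    heap = list(jobs)
    heapq.heapify(heap)
    admitted, queued = [], []
    while heap:
        job = heapq.heappop(heap)
        if job.gpus <= gpu_quota:
            gpu_quota -= job.gpus
            admitted.append(job.name)
        else:
            queued.append(job.name)
    return admitted, queued

jobs = [Job(2, "fine-tune", 4), Job(1, "pre-train", 8), Job(3, "eval", 2)]
admitted, queued = dispatch(jobs, gpu_quota=10)
print(admitted, queued)  # ['pre-train', 'eval'] ['fine-tune']
```

Note how the mid-priority job is held back because the highest-priority job consumed most of the quota, while a smaller low-priority job still fits — the kind of packing decision a real meta-scheduler makes continuously.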
  6. Key innovations in inference. User experience and performance: performant

    and efficient inference, SDK and developer experience, tuning (vanilla prompt tuning and multi-prompt tuning), model scaling, GPU sharing, and model placement. Coming: LoRA and emerging variants, output filtering (HAP, PII), and model chaining/composition.
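Model chaining/composition, listed above as upcoming, amounts to feeding one model's output into the next stage. A minimal stand-in, where the "models" are plain functions rather than Caikit or KServe endpoints, and the generator/PII-scrub stages are hypothetical:

```python
# Toy model-chaining sketch: compose inference stages left-to-right.
def chain(*models):
    """Return a callable that pipes a prompt through each stage in order."""
    def composed(prompt):
        out = prompt
        for model in models:
            out = model(out)
        return out
    return composed

# Hypothetical stages: a generator followed by an output filter (PII scrub).
def generate(prompt):
    return prompt + " -> generated answer with id 12345"

def scrub_pii(text):
    return text.replace("12345", "[REDACTED]")

pipeline = chain(generate, scrub_pii)
print(pipeline("question"))  # question -> generated answer with id [REDACTED]
```

In a serving stack each stage would be a network call to a deployed model or filter, but the composition contract — output of one stage is input to the next — is the same.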
  7. Building an end-to-end platform: from model creation to business value.

    watsonx.ai provides a suite of IBM-trained foundation models, a studio to tune and infer, model serving, and training and validation (pre-processing, model training, validation). It runs on the hybrid cloud platform OpenShift AI: MCAD (job dispatching, queuing, and packing), InstaScale (cluster scaling), KubeRay, TorchX, KServe, TGIS (optimized Text Generation Inference Server), and the Caikit runtime (dev APIs, prompt tuning, inference).
  8. KubeRay: the best solution for Ray on Kubernetes. KubeRay enables

    data/ML scientists to focus on computation while infra engineers concentrate on Kubernetes. Data/ML scientists develop Python scripts; Kubernetes infra engineers integrate KubeRay with Kubernetes ecosystem tools, e.g. Prometheus, Grafana, and NGINX. The KubeRay operator reads and updates the Ray custom resources, creates and deletes pods, updates status for observability, and handles health checks, monitoring, and scaling requests for tasks/actors.
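The operator behavior the slide's diagram shows — read the desired state from a custom resource, then create or delete pods to match — is the standard Kubernetes reconcile pattern. A toy single-pass sketch, where `desired_workers` and the pod-name list stand in for KubeRay's RayCluster resource and the Kubernetes API:

```python
# Toy reconcile pass: compare desired worker count against current pods
# and return the create/delete actions needed to converge (stand-in for
# what a real operator would do via the Kubernetes API).
def reconcile(desired_workers, current_pods):
    """Return (pods_to_create, pods_to_delete) for one reconcile pass."""
    actual = len(current_pods)
    if actual < desired_workers:
        create = [f"ray-worker-{i}" for i in range(actual, desired_workers)]
        return create, []
    # Scale down: delete the surplus pods beyond the desired count.
    return [], current_pods[desired_workers:]

# Scale up: 1 running worker, 3 desired.
print(reconcile(3, ["ray-worker-0"]))
# Scale down: 3 running workers, 1 desired.
print(reconcile(1, ["ray-worker-0", "ray-worker-1", "ray-worker-2"]))
```

A real operator runs this loop continuously, triggered by watches on the custom resource and on pod events, which is why users only ever declare the desired state.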
  9. IBM and Ray community collaboration: simplified user experience to deploy

    Ray, workflow DAG generation contributed under ray-project, an evolving API server, and integration with the Kubernetes-native job scheduler (MCAD integration with KubeRay). Current focus: workflow DAG generation (Ray Workflows), the CodeFlare SDK (CodeFlare Project), and KubeRay. Use cases in new domains (model training, quantum chip design) are leading to contributions across the stack and increasing mind share. Open source collaboration with
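Workflow DAG generation can be illustrated with a toy step graph executed in dependency order. This stdlib-only sketch is not Ray Workflows' API; the pipeline step names are invented for the example.

```python
# Toy workflow DAG: each step maps to the set of steps it depends on,
# and graphlib yields an order that respects those dependencies.
from graphlib import TopologicalSorter

# Hypothetical foundation-model pipeline: step -> prerequisites.
dag = {
    "preprocess": set(),
    "train": {"preprocess"},
    "validate": {"train"},
    "fine_tune": {"train"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # e.g. ['preprocess', 'train', 'validate', 'fine_tune']
```

A workflow engine adds persistence, retries, and distributed execution on top, but the core contract is the same: steps run only after their prerequisites complete, and independent steps (here `validate` and `fine_tune`) may run in parallel.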
  10. Foundation Models @ NASA: leveraging a common platform across multiple

    environments to enable the full life cycle of foundation model training: pre-train, fine tune, and inference, on-prem and in the cloud, with OpenShift AI. NASA and IBM have teamed up to create an AI foundation model for Earth observations, using large-scale satellite and remote sensing data, including the Harmonized Landsat and Sentinel-2 (HLS) data.
  11. Building and inferencing a geospatial foundation model. FM workflows

    cover pre-processing, training, and fine tuning. Pre-train the geospatial model on IBM Cloud (Vela); fine tune it with OpenShift AI on IBM Cloud; run inference on the fine-tuned model on AWS. Underpinned by MCAD (job dispatching, queuing, and packing), InstaScale (cluster scaling), KubeRay, and TorchX.
  12. What’s next… Fully automated pre-processing, validation, and fine-tuning

    pipelines; KubeRay-MCAD integration and hardened OpenShift support; advanced job and configuration templates for foundation model jobs; automated deployment, job launching, and enhanced observability; advanced fault recovery and requeuing.
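The fault-recovery and requeuing item on the roadmap can be sketched as retry-with-requeue logic. Toy code only: the queue structure, `max_retries`, and the flaky sample job are illustrative, not the planned implementation.

```python
# Toy requeue-on-failure sketch: failed jobs go back on the queue until
# their retry budget is exhausted (illustrative, stdlib only).
from collections import deque

def run_with_requeue(jobs, max_retries=2):
    """Run jobs from a queue; on failure, requeue until retries run out."""
    queue = deque((job, 0) for job in jobs)
    completed, failed = [], []
    while queue:
        job, attempts = queue.popleft()
        try:
            job["run"]()
            completed.append(job["name"])
        except Exception:
            if attempts < max_retries:
                queue.append((job, attempts + 1))  # requeue for another try
            else:
                failed.append(job["name"])
    return completed, failed

# A job that fails once (transient error), then succeeds on retry.
state = {"tries": 0}
def flaky():
    state["tries"] += 1
    if state["tries"] == 1:
        raise RuntimeError("transient failure")

completed, failed = run_with_requeue([{"name": "fine-tune", "run": flaky}])
print(completed, failed)  # ['fine-tune'] []
```

Production requeuing additionally distinguishes transient from permanent failures and applies backoff, but the bookkeeping — attempts counted per job, a bounded retry budget — follows this shape.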