
OpenShift Commons Gathering Chicago 2023 - Case Study: OpenShift AI at NASA


Carlos Costa (IBM), Hongchao Deng (Anyscale), and Alex Corvin (Red Hat) present at the OpenShift Commons Gathering Co-Located with KubeCon + CloudNativeCon North America 2023.

OpenShift Commons

November 17, 2023


Transcript

  1. Building an open source, cloud-native platform for foundation models:

    OpenShift AI and NASA use case. Carlos Costa (IBM), Alex Corvin (Red Hat), Hongchao Deng (Anyscale). Featuring open source collaboration with
  2. An end-to-end life cycle: from model creation to deployment. Data

    preparation: a workflow of steps (e.g. remove hate and profanity, deduplicate). Distributed training: long-running jobs on massive infrastructure. Model adaptation: model tuning with custom data sets for downstream tasks. Inference: may be sensitive to latency, throughput, and power. Each stage can run on public clouds or on-prem; inference also extends to the edge.
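The data-preparation workflow described above can be pictured as a chain of filter/transform steps over text records. The sketch below is illustrative only: the tiny `BLOCKLIST`, the step functions, and the sample corpus are invented stand-ins, not part of the actual stack.

```python
# Toy data-preparation pipeline: each step is a plain function over a
# list of text records, applied in sequence (hypothetical stand-in for
# the real hate/profanity and deduplication steps).

BLOCKLIST = {"hateword"}  # stand-in for a real hate/profanity filter

def remove_blocked(records):
    """Drop records containing any blocklisted token."""
    return [r for r in records if not (set(r.lower().split()) & BLOCKLIST)]

def deduplicate(records):
    """Drop exact duplicates, keeping the first occurrence."""
    seen, out = set(), []
    for r in records:
        if r not in seen:
            seen.add(r)
            out.append(r)
    return out

def run_pipeline(records, steps):
    """Apply each preparation step in order."""
    for step in steps:
        records = step(records)
    return records

corpus = ["good text", "good text", "some hateword here"]
clean = run_pipeline(corpus, [remove_blocked, deduplicate])
print(clean)  # ['good text']
```

In a production stack each step would be a distributed job rather than an in-memory function, but the composition pattern is the same.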
  3. Emerging new challenges. Today: making it easier to scale and

    orchestrate today's core AI building blocks: data processing, training, validation, fine tuning, serving, optimization, and AI/ML pipelines. Tomorrow: making possible what is next: scale to larger models (billions of parameters), more complex adaptation pipelines, and a shift from narrow to broad, more reusable AI.
  4. A cloud-native stack for foundation models. Training and validation:

    MCAD (job dispatching, queuing, and packing), InstaScale (cluster scaling), KubeRay, TorchX, and Multi-NIC CNI. Tuning and serving: KServe, TGIS (optimized Text Generation Inference Server), and Caikit (dev APIs, prompt tuning, inference) with domain-specific APIs. Workflows tie the stages together. Delivered in OpenShift AI, with midstream delivery in Open Data Hub as a self-managed, self-deploy platform.
  5. Key innovations in training and validation. Simplified user experience:

    the CodeFlare SDK offers an intuitive, easy-to-use Python interface for batch resource requesting, access, and job submission, with enhanced interactivity, logging, and observability for AI/ML jobs on OpenShift. Advanced Kubernetes-native resource management: Multi-Cluster App Dispatcher (MCAD) enables job queueing, meta-scheduling, prioritization, and quota management; InstaScale provides on-demand cluster scaling; integrated support for TorchX and KubeRay. Scalable, efficient pre-processing, training, and validation: scale-out, distributed GPU-based training and fine tuning with PyTorch and Ray.
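The queueing, prioritization, and quota behavior MCAD provides can be illustrated with a toy dispatcher. This is a stdlib-only sketch, not MCAD's actual API: the `Job` class, the `gpu_quota` parameter, and the sample jobs are all invented for the example.

```python
# Toy job dispatcher sketch: priority-ordered admission under a GPU quota,
# loosely mirroring MCAD-style queueing (illustrative names throughout).
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    priority: int                      # lower number = dispatched first
    name: str = field(compare=False)
    gpus: int = field(compare=False)   # resources the job requests

def dispatch(jobs, gpu_quota):
    """Pop jobs in priority order; admit those that fit the remaining
    quota and leave the rest queued for a later pass."""
    heap = list(jobs)
    heapq.heapify(heap)
    admitted, queued = [], []
    while heap:
        job = heapq.heappop(heap)
        if job.gpus <= gpu_quota:
            gpu_quota -= job.gpus
            admitted.append(job.name)
        else:
            queued.append(job.name)
    return admitted, queued

jobs = [Job(2, "fine-tune", 4), Job(1, "pre-train", 8), Job(3, "eval", 2)]
admitted, queued = dispatch(jobs, gpu_quota=10)
print(admitted, queued)  # ['pre-train', 'eval'] ['fine-tune']
```

Note how the mid-priority job is held back because the highest-priority job consumed most of the quota, while a smaller low-priority job still fits — the kind of packing decision a real meta-scheduler makes continuously.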
  6. Key innovations in inference. User experience and performance: performant

    and efficient inference, SDK and developer experience, tuning (vanilla prompt tuning and multi-prompt tuning), model scaling, GPU sharing, and model placement. Coming: LoRA and emerging variants, output filtering (HAP, PII), and model chaining/composition.
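Model chaining/composition, listed above as upcoming, amounts to feeding one model's output into the next stage. A minimal stand-in, where the "models" are plain functions rather than Caikit or KServe endpoints, and the generator/PII-scrub stages are hypothetical:

```python
# Toy model-chaining sketch: compose inference stages left-to-right.
def chain(*models):
    """Return a callable that pipes a prompt through each stage in order."""
    def composed(prompt):
        out = prompt
        for model in models:
            out = model(out)
        return out
    return composed

# Hypothetical stages: a generator followed by an output filter (PII scrub).
def generate(prompt):
    return prompt + " -> generated answer with id 12345"

def scrub_pii(text):
    return text.replace("12345", "[REDACTED]")

pipeline = chain(generate, scrub_pii)
print(pipeline("question"))  # question -> generated answer with id [REDACTED]
```

In a serving stack each stage would be a network call to a deployed model or filter, but the composition contract — output of one stage is input to the next — is the same.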
  7. Building an end-to-end platform: from model creation to business value.

    watsonx.ai provides a suite of IBM-trained foundation models, a studio to tune and infer, model serving, and training and validation (pre-processing, model training, validation). It runs on the hybrid cloud platform OpenShift AI: MCAD (job dispatching, queuing, and packing), InstaScale (cluster scaling), KubeRay, TorchX, KServe, TGIS (optimized Text Generation Inference Server), and the Caikit runtime (dev APIs, prompt tuning, inference).
  8. KubeRay: the best solution for Ray on Kubernetes. KubeRay enables

    data/ML scientists to focus on computation while infra engineers concentrate on Kubernetes. Data/ML scientists develop Python scripts; Kubernetes infra engineers integrate KubeRay with Kubernetes ecosystem tools, e.g. Prometheus, Grafana, and NGINX. The KubeRay operator reads and updates the Ray custom resources, creates and deletes pods, updates status for observability, and handles health checks, monitoring, and scaling requests for tasks/actors.
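The operator behavior the slide's diagram shows — read the desired state from a custom resource, then create or delete pods to match — is the standard Kubernetes reconcile pattern. A toy single-pass sketch, where `desired_workers` and the pod-name list stand in for KubeRay's RayCluster resource and the Kubernetes API:

```python
# Toy reconcile pass: compare desired worker count against current pods
# and return the create/delete actions needed to converge (stand-in for
# what a real operator would do via the Kubernetes API).
def reconcile(desired_workers, current_pods):
    """Return (pods_to_create, pods_to_delete) for one reconcile pass."""
    actual = len(current_pods)
    if actual < desired_workers:
        create = [f"ray-worker-{i}" for i in range(actual, desired_workers)]
        return create, []
    # Scale down: delete the surplus pods beyond the desired count.
    return [], current_pods[desired_workers:]

# Scale up: 1 running worker, 3 desired.
print(reconcile(3, ["ray-worker-0"]))
# Scale down: 3 running workers, 1 desired.
print(reconcile(1, ["ray-worker-0", "ray-worker-1", "ray-worker-2"]))
```

A real operator runs this loop continuously, triggered by watches on the custom resource and on pod events, which is why users only ever declare the desired state.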
  9. IBM and Ray community collaboration: simplified user experience to deploy

    Ray, workflow DAG generation contributed under ray-project, an evolving API server, and integration with the Kubernetes-native job scheduler (MCAD integration with KubeRay). Current focus: workflow DAG generation (Ray Workflows), the CodeFlare SDK (CodeFlare Project), and KubeRay. Use cases in new domains (model training, quantum chip design) are leading to contributions across the stack and increasing mind share. Open source collaboration with
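Workflow DAG generation can be illustrated with a toy step graph executed in dependency order. This stdlib-only sketch is not Ray Workflows' API; the pipeline step names are invented for the example.

```python
# Toy workflow DAG: each step maps to the set of steps it depends on,
# and graphlib yields an order that respects those dependencies.
from graphlib import TopologicalSorter

# Hypothetical foundation-model pipeline: step -> prerequisites.
dag = {
    "preprocess": set(),
    "train": {"preprocess"},
    "validate": {"train"},
    "fine_tune": {"train"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # e.g. ['preprocess', 'train', 'validate', 'fine_tune']
```

A workflow engine adds persistence, retries, and distributed execution on top, but the core contract is the same: steps run only after their prerequisites complete, and independent steps (here `validate` and `fine_tune`) may run in parallel.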
  10. Foundation Models @ NASA: leveraging a common platform across multiple

    environments to enable the full life cycle of foundation model training: pre-train, fine tune, and inference, on-prem and in the cloud, with OpenShift AI. NASA and IBM have teamed up to create an AI foundation model for Earth observations, using large-scale satellite and remote sensing data, including the Harmonized Landsat and Sentinel-2 (HLS) data.
  11. Building and inferencing a geospatial foundation model. FM workflows

    cover pre-processing, training, and fine tuning. Pre-train the geospatial model on IBM Cloud (Vela); fine tune it with OpenShift AI on IBM Cloud; run inference on the fine-tuned model on AWS. Underpinned by MCAD (job dispatching, queuing, and packing), InstaScale (cluster scaling), KubeRay, and TorchX.
  12. What’s next… Fully automated pre-processing, validation, and fine-tuning

    pipelines; KubeRay-MCAD integration and hardened OpenShift support; advanced job and configuration templates for foundation model jobs; automated deployment, job launching, and enhanced observability; advanced fault recovery and requeuing.
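The fault-recovery and requeuing item on the roadmap can be sketched as retry-with-requeue logic. Toy code only: the queue structure, `max_retries`, and the flaky sample job are illustrative, not the planned implementation.

```python
# Toy requeue-on-failure sketch: failed jobs go back on the queue until
# their retry budget is exhausted (illustrative, stdlib only).
from collections import deque

def run_with_requeue(jobs, max_retries=2):
    """Run jobs from a queue; on failure, requeue until retries run out."""
    queue = deque((job, 0) for job in jobs)
    completed, failed = [], []
    while queue:
        job, attempts = queue.popleft()
        try:
            job["run"]()
            completed.append(job["name"])
        except Exception:
            if attempts < max_retries:
                queue.append((job, attempts + 1))  # requeue for another try
            else:
                failed.append(job["name"])
    return completed, failed

# A job that fails once (transient error), then succeeds on retry.
state = {"tries": 0}
def flaky():
    state["tries"] += 1
    if state["tries"] == 1:
        raise RuntimeError("transient failure")

completed, failed = run_with_requeue([{"name": "fine-tune", "run": flaky}])
print(completed, failed)  # ['fine-tune'] []
```

Production requeuing additionally distinguishes transient from permanent failures and applies backoff, but the bookkeeping — attempts counted per job, a bounded retry budget — follows this shape.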