NVIDIA AI Enterprise for Red Hat OpenShift

An AI-Ready Platform from NVIDIA and Red Hat
Shingo Kitayama, Solution Architect, Red Hat K.K.
Copyright © 2022 Red Hat, Inc. Red Hat and the Red Hat logo are trademarks or registered trademarks of Red Hat, Inc.

NVIDIA AI Enterprise with OpenShift
- NVIDIA AI Enterprise: A comprehensive cloud-native AI software suite that NVIDIA certifies, licenses, and supports. It is certified to run on Red Hat OpenShift and NVIDIA-Certified Systems.
- Red Hat OpenShift: An enterprise-ready container platform from Red Hat. It provides full-stack automated operations and self-service provisioning for developers, and it runs in a range of environments, including cloud and on-premises.
ref. https://resources.nvidia.com/en-us-nvidia-ai-enterprise/nvaie-red-hat-overview

What Makes Red Hat OpenShift Attractive for AI
- Self-Service Access to Infrastructure Resources: Self-service and a consistent cloud experience for using containerized AI tools and infrastructure resources.
- Securely Automate MLOps Pipelines: Extended automation of the training and inference phases of machine learning pipelines, using OpenShift GitOps and OpenShift Pipelines.
- Kubernetes-Powered Application Development: An environment where ML/DL models can be developed, deployed, and scaled easily using GPU-optimized container images.

Self-Service Access to Infrastructure Resources

NVIDIA AI Enterprise Software Suite
Fully automated management of the infrastructure resources that cloud-native AI deployment depends on.
- NVIDIA Network Operator: Enables RDMA and GPUDirect RDMA workloads on OpenShift and automates the management of networking-related NVIDIA components.
- NVIDIA GPU Operator: Automates the management of the NVIDIA software components needed to provision GPUs, such as the NVIDIA driver and the Kubernetes device plugin.

Handling GPU Resources
[Figure: two software stacks on NVIDIA-Certified Systems with NVIDIA GPUs]
- GPU resources under virtualization: a hypervisor with NVIDIA Virtualization Software exposes vGPUs to guest VMs, and each guest VM carries its own NVIDIA driver and applications. Maintaining compatibility becomes more complex.
- GPU resources on OpenShift: the GPU Operator manages the NVIDIA Driver, NVIDIA Container Runtime, NVIDIA Kubernetes Device Plugin, and NVIDIA GPU Monitoring underneath the container applications. GPU resource management is fully automated.

NVIDIA GPU Operator
Runs on NVIDIA-Certified Systems and manages the following components:
- NVIDIA Driver: Works with the NVIDIA container runtime to deliver the NVIDIA driver through a container.
- NVIDIA Kubernetes Device Plugin: Tracks the state of the GPUs on each node of the Kubernetes cluster and allocates GPUs to containers when they start.
- NVIDIA GPU Monitoring: Uses the DCGM (Data Center GPU Manager) exporter to monitor NVIDIA GPU devices (health monitoring, policies, group management, and so on).
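The allocation performed by the device plugin can be illustrated from the workload side. The following is a minimal sketch, not taken from the deck, that builds a Pod manifest requesting GPUs through the `nvidia.com/gpu` extended resource that the NVIDIA device plugin advertises. The helper name `gpu_pod_manifest`, the Pod name, and the image reference are hypothetical placeholders.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a Pod manifest that asks the scheduler for `gpus` NVIDIA GPUs."""
    if gpus < 1:
        raise ValueError("request at least one GPU")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resource advertised by the NVIDIA device plugin.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Pod name and image are hypothetical placeholders, not real artifacts.
    manifest = gpu_pod_manifest(
        "gpu-smoke-test", "registry.example.com/cuda-sample:latest"
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with the cluster's usual tooling. The Pod stays pending unless some node reports a free `nvidia.com/gpu` resource.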

Operator Framework
- One-click install: the GPU Operator can be installed in a few clicks from OperatorHub in the OpenShift web UI.
- GPU resources are visualized in the OpenShift web UI.
ref. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/install-gpu-ocp.html#installing-the-nvidia-gpu-operator-by-using-the-web-console

Securely Automate MLOps Pipelines

Red Hat Services That Support MLOps
- OpenShift Pipelines: A Kubernetes-native framework for building CI/CD pipelines.
- Red Hat AMQ Streams: A highly scalable distributed message queue.
- Red Hat OpenShift Data Science: A set of AI/ML tools whose operation is automated and offered as self-service.
- Red Hat Quay: A distributed container image registry.
- OpenShift GitOps: A Kubernetes-native, GitOps-based CD tool.
- Red Hat OpenShift Data Foundation: Software-defined storage (SDS) built for containers.

The MLOps Pipeline
Whatever machine learning model is being built, the cycle is the same: prepare and collect data, develop the model (training), then deploy the model and monitor it for warning signs (inference).
Training phase:
- Step.1 | Model Development
- Step.2 | ML model training pipelines
Inference phase (Serving):
- Step.3 | ML models serving pipelines
- Step.4 | Monitoring Validation

Training (Model Training) Phase
Step.1 | Model Development: Build machine learning models using Jupyter notebooks on Red Hat OpenShift.
Step.2 | ML model training pipelines: [OpenShift Pipelines] Event-driven continuous integration packages the machine learning model as a container image.
▶ Saving: Save the deployment-ready model to the Model Store
▶ Converting: Convert the model into a container image
▶ Testing: Test the model image to confirm that it works
▶ Storing: Store the verified container image in the container registry
[Figure: Data Store, Model Development, ML Model, Model Store, Model Image, Test, Image Registry]

Inference (Model Serving) Phase
Step.3 | ML models serving pipelines: [OpenShift GitOps] Watch the manifests and deploy machine learning models safely.
▶ Configuring: Define the configuration through a Git repository
▶ Monitoring: Watch for changes (diffs) in the configuration manifest files
▶ Triggering: Update the model on the ML service
▶ Deploying: Roll out to edge, datacenter, cloud, and other environments
Step.4 | Monitoring validation: Use Prometheus, Grafana, and third-party tools to monitor the inference performance of the trained model, and retrain or redeploy as needed.
[Figure: Configuration Repository (Manifests), Trigger, Deploy, ML Service, Intelligent App, Monitor drift]
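The Monitoring and Triggering steps amount to comparing the desired state held in Git with what is currently deployed. Below is a minimal sketch of that comparison, assuming manifests are plain dictionaries. It does not use the OpenShift GitOps or Argo CD API, and the function names and manifest keys are hypothetical.

```python
import copy
import hashlib
import json
from typing import Tuple


def manifest_digest(manifest: dict) -> str:
    """Hash a manifest in canonical form so that key order does not matter."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(desired: dict, deployed: dict) -> Tuple[dict, bool]:
    """Return the new deployed state and whether a rollout was triggered."""
    if manifest_digest(desired) == manifest_digest(deployed):
        return deployed, False  # no diff, nothing to do
    return copy.deepcopy(desired), True  # diff found, update the ML service


if __name__ == "__main__":
    # Hypothetical manifests: the Git repository now points at a newer model image.
    deployed = {"service": "ml-service", "modelImage": "model:v1"}
    desired = {"service": "ml-service", "modelImage": "model:v2"}
    state, triggered = reconcile(desired, deployed)
    print(triggered, state["modelImage"])
```

Running `reconcile` again with the same desired state triggers nothing. That idempotence is what makes a Git-driven deployment safe to repeat.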

Kubernetes-Powered Application Development

OpenShift Install Types
[Figure: layered stacks on NVIDIA-Certified Systems with NVIDIA GPUs, comparing install types 1. Virtualization and 2. Bare-Metal; some cells are marked N/A]
- Physical machine: NVIDIA-Certified Systems with NVIDIA GPU
- OS/Hypervisor: RHEL, RHEL / RHCOS, or Hypervisor (VMware)
- Virtual machine: RHEL / RHCOS
- Container orchestrator: OpenShift
- Container runtime: Podman
- Container: NVIDIA AI Enterprise (TensorFlow, PyTorch)

Container Support for AI Development
[Figure: containers running on OpenShift, RHEL / RHCOS, and NVIDIA GPUs on NVIDIA-Certified Systems]
- Runtimes for service development (Application Runtimes): Python, Java
- Frameworks dedicated to AI development: NVIDIA AI Enterprise (TensorFlow, PyTorch)
Support included with NVIDIA AI Enterprise, AI and data science frameworks:
- TensorFlow
- PyTorch
- NVIDIA TAO Toolkit
- NVIDIA Triton Inference Server
- NVIDIA TensorRT
- NVIDIA RAPIDS
Support included with Red Hat OpenShift, Application Streams:
- PHP
- Python
- Perl
- Node.js
- Ruby
- OpenJDK
- Quarkus
- MySQL / MariaDB, etc.

Operating Systems Supported by NVIDIA AI Enterprise 2.0
ref. https://docs.nvidia.com/ai-enterprise/latest/product-support-matrix/index.html
Install Type / Hypervisor or Bare-Metal OS / Guest OS Support:
- Virtualization / VMware vSphere Hypervisor (ESXi) Enterprise Plus Edition 7.0 Update 2 or 3 / Ubuntu 20.04 LTS, Red Hat Enterprise Linux 8.4, Red Hat OpenShift 4.9
- Virtualization / VMware vSphere 6.7
- Bare-Metal / Ubuntu 20.04 LTS
- Bare-Metal / Red Hat Enterprise Linux 8.4
- Bare-Metal / Red Hat OpenShift 4.9 with Red Hat Enterprise Linux CoreOS (RHCOS)
Whether the install type is Virtualization or Bare-Metal, the support requirement is that the containers themselves run on Ubuntu (20.04 LTS) or RHEL (8.4 or RHCOS).
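The closing requirement can be expressed as a small lookup. This is a sketch that encodes only the operating systems named on this slide for NVIDIA AI Enterprise 2.0. The function name and the name-only matching for RHCOS are assumptions, and the linked support matrix remains the authoritative source.

```python
from typing import Dict, Optional, Set

# Container host operating systems named in the slide's closing requirement.
# None means the slide names the OS without giving it a version of its own.
SUPPORTED_CONTAINER_HOSTS: Dict[str, Optional[Set[str]]] = {
    "Ubuntu": {"20.04 LTS"},
    "RHEL": {"8.4"},
    "RHCOS": None,
}


def is_supported_container_host(os_name: str, version: Optional[str] = None) -> bool:
    """Check a host OS against the slide's list (not the full support matrix)."""
    if os_name not in SUPPORTED_CONTAINER_HOSTS:
        return False
    allowed = SUPPORTED_CONTAINER_HOSTS[os_name]
    # RHCOS is matched by name only, because the slide lists no RHCOS version.
    return allowed is None or version in allowed


if __name__ == "__main__":
    for host in [("Ubuntu", "20.04 LTS"), ("Ubuntu", "18.04 LTS"), ("RHCOS", None)]:
        print(host, is_supported_container_host(*host))
```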

End-to-End Support Backed by RHEL
[Figure: stack layers, from bottom to top: Operating System (RHEL / RHCOS), Container Platform / Container Orchestration (OpenShift), Container Base Image (UBI), Applications]
UBI, which is deployed as a container image, is supported according to the RHEL (Red Hat Enterprise Linux) life cycle.
- Universal Base Image: Use of UBI is fully supported when it runs on a Red Hat container execution environment.
- OpenShift: Includes support for the host OS, RHEL / RHCOS.
- RHEL: Supported for use as a container runtime.
ref. https://access.redhat.com/articles/2726611

Conclusion: NVIDIA AI Enterprise with OpenShift

An AI-Ready Platform from NVIDIA and Red Hat
NVIDIA AI Enterprise with Red Hat OpenShift on NVIDIA-Certified Systems provides:
- Self-Service Access to Infrastructure Resources
- Securely Automate MLOps Pipelines
- Kubernetes-Powered Application Development

Thank you
Red Hat is the world’s leading provider of enterprise open source software solutions. Award-winning support, training, and consulting services make Red Hat a trusted adviser to the Fortune 500.
- linkedin.com/company/red-hat
- youtube.com/user/RedHatVideos
- facebook.com/redhatinc
- twitter.com/RedHat

https://www.redhat.com/en/partners/nvidia
