Slide 1


DeepOps: An Efficient Way to Deploy GPU Clusters for Computing
Frank Lin (@yylin1)
GARAOTUS HPC Summit 2021 | NOVEMBER 18, 2021

Slide 2


About Me: Frank Lin (林義洋)
• Co-organizer of Cloud Native Taiwan User Group
• Interested in emerging technologies
GitHub: yylin1 ([email protected])
Blog: https://blog.yylin.io

Slide 3


Cloud Native Taiwan User Group Facebook

Slide 4


Agenda
• Manage resources more effectively
• What is DeepOps?
• How to choose a job scheduler?

Slide 5


Manage Resources More Effectively
How can we use GPU environment resources more effectively?

Slide 6


Why Clusters?
[Diagram: pools of CPU and GPU nodes consolidated from ad-hoc tracking (Excel, internal pages) into a managed platform]
Ref: GTC 2020 - AI as a Service (AIaaS): Zero to Kubeflow: Supporting Your Data Science Teams with the Most Common Use Cases
https://developer.nvidia.com/gtc/2020/video/s22040-vid

Slide 7


IT operations teams / data scientists: IT Admins, Data Engineers, Data Scientists
How do we solve environment problems effectively?

Slide 8


AI Platform Considerations
Ref: GTC Silicon Valley 2019: Building and Managing Scalable AI Infrastructure with NVIDIA DGX POD and DGX POD Management Software
https://developer.nvidia.com/gtc/2019/video/s9334

Slide 9


The power of one platform
• Easy to deploy and run workflows from development through production
• A single management platform can support multiple teams or projects
• Effectively manage and monitor cluster resources (GPUs), billing, and usage
• Let each team focus on what it does best and eliminate extra overhead:
  • Data Scientists -> Data Science
  • App Developers -> Implementing Apps
  • DevOps Engineers -> DevOps / AI Workflow

Slide 10


What is DeepOps?
Tools for building GPU clusters

Slide 11


What is DeepOps?
DeepOps is an open-source project for rapidly deploying Kubernetes and Slurm. It is an ideal choice for large-scale GPU server clusters as well as shared single-node deployments (e.g., NVIDIA DGX systems).
• Highly flexible: can be adapted, or used in a modular fashion, to match the cluster requirements of a specific use case
• Uses Ansible to set up the entire cluster-management stack end to end
• Can deploy Slurm, Kubernetes, or a hybrid of both
• Provides scripts to quickly deploy Kubeflow and attach NFS storage on an existing cluster
• GitHub: https://github.com/NVIDIA/deepops

Slide 12


Building out your GPU cluster
Ref: GTC Silicon Valley 2019: Building and Managing Scalable AI Infrastructure with NVIDIA DGX POD and DGX POD Management Software
https://developer.nvidia.com/gtc/2019/video/s9334

Slide 13


Installing & Managing an AI-as-a-Service Cluster
DeepOps Components / Scheduler
Ref: GTC Digital October 2020: GPU-Accelerated Labs on Demand Using Virtual Compute Server
https://www.nvidia.com/en-us/on-demand/session/gtcfall20-a21371/

Slide 14


Automation: Ansible
• Open-source automation and configuration-management tool
• Agentless (nothing to install on target nodes)
• Easier to maintain and scale than custom scripts
• Playbooks use YAML: easy to learn and read
Ref: https://www.ansible.com/
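As a taste of that YAML syntax, a minimal playbook might look like the sketch below. This is purely illustrative and not part of DeepOps itself; the host group name and driver package are assumptions.

```yaml
# site.yml -- illustrative playbook, not taken from DeepOps.
# The "gpu-nodes" group and the driver package name are assumptions.
- hosts: gpu-nodes
  become: true          # run tasks with sudo on each target node
  tasks:
    - name: Install the NVIDIA driver
      ansible.builtin.apt:
        name: nvidia-driver-470
        state: present
        update_cache: true
```

Because Ansible is agentless, running `ansible-playbook site.yml` from a single provisioning node is enough to apply this to every host in the group over SSH.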

Slide 15


Deploying DeepOps
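In outline, a deployment follows the steps in the upstream README; the commands below are a sketch based on that flow (playbook names and scripts are as of the 2021 releases and may differ in newer versions).

```shell
# Clone DeepOps and bootstrap the provisioning node.
git clone https://github.com/NVIDIA/deepops.git
cd deepops
./scripts/setup.sh        # installs Ansible and creates the config/ directory

# Edit config/inventory with the hostnames/IPs of your cluster nodes,
# then deploy a Kubernetes cluster:
ansible-playbook -l k8s-cluster playbooks/k8s-cluster.yml

# For a Slurm cluster instead, the equivalent playbook is:
# ansible-playbook -l slurm-cluster playbooks/slurm-cluster.yml
```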

Slide 16


Building Multi-node GPU Clusters with DeepOps

Slide 17


Architecture
Source: https://developer.nvidia.com/blog/deploying-rich-cluster-api-on-dgx-for-multi-user-sharing/

Slide 18


How to Choose a Job Scheduler?
Different problems yield different solutions

Slide 19


Schedulers Comparison
• Basic scheduling features (kube-batch)
• Share nodes, schedule jobs for GPUs on a node (Excel spreadsheet)
• Advanced container and workflow orchestration (e.g., with Kubeflow)
• Covers data permissions and security (LDAP, file permissions)
• Adds analytics and monitoring (important also for justification of purchase)
• Advanced scheduling features (batch / gang scheduling)
• Works also without containers (no runtime overhead)
• Multi-node jobs
• Job dependencies, workflows, DAGs
• Advanced reservations
• Intelligent scheduling (not just FIFO)
• User accounting
• Other HPC-like scheduling functionality
Ref: GTC Silicon Valley 2019: Building and Managing Scalable AI Infrastructure with NVIDIA DGX POD and DGX POD Management Software
https://developer.nvidia.com/gtc/2019/video/s9334
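The "gang scheduling" item above is what add-on schedulers such as Volcano (the successor to kube-batch) provide on Kubernetes. A hedged sketch of a Volcano job, where `minAvailable` gives gang semantics (all replicas are scheduled together or not at all); the container image is a placeholder:

```yaml
# Illustrative Volcano job; API group and fields follow Volcano's docs.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gang-demo
spec:
  minAvailable: 4          # schedule all 4 pods together, or none
  schedulerName: volcano
  tasks:
    - replicas: 4
      name: worker
      template:
        spec:
          containers:
            - name: worker
              image: nvcr.io/nvidia/pytorch:21.10-py3   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1
          restartPolicy: Never
```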

Slide 20


AI Workflow: Choosing a Suitable Platform
Free data scientists and developers from platform setup and a steep learning curve as quickly as possible, so teams spend more of their time using GPU resources.
HPC (Slurm + containers)
• Consistent with the developer's environment; lower technical barrier, no differences from normal development
• Batch-job queues arbitrate resource allocation
• Node compute resources run on bare metal or a container runtime (e.g., Singularity, NVIDIA ENROOT)
Application Containerization (Kubernetes)
• High availability for long-running AI services
• The platform offers more module integrations to support AI-workflow tooling, job monitoring, and GPU resources
• Job scheduling must be tuned for the specific application scenario

Slide 21


Slurm Example: Interactive one-shot job
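A typical form of this demo is sketched below. The flags are standard Slurm; the `--container-image` flag assumes the enroot/pyxis plugin mentioned earlier in the talk, and the image name is illustrative.

```shell
# One-shot: allocate one GPU, run a single command, release the allocation.
srun --nodes=1 --gres=gpu:1 nvidia-smi

# Interactive: open a shell on a GPU node (--pty attaches a pseudo-terminal).
srun --nodes=1 --gres=gpu:1 --pty bash

# With enroot/pyxis, the same one-shot job inside a container image:
srun --nodes=1 --gres=gpu:1 \
     --container-image=nvcr.io/nvidia/pytorch:21.10-py3 \
     nvidia-smi
```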

Slide 22


Summary
• DeepOps simplifies many tedious deployment problems; it is an NVIDIA open-source project that accelerates GPU cluster builds
• Slurm offers more scheduling algorithms, letting HPC users and data scientists run their jobs more effectively
  • Needs a module load system to configure GPU environment dependencies
• Kubernetes suits batch-system / AI scenarios, e.g., TensorFlow, Spark, PyTorch, MPI
  • Needs an additional job scheduler (kube-batch, Volcano) to improve scheduling policy
• Choose the platform based on the existing development environment; the one most suitable for developers is best
  • Mind the technical barriers (e.g., learning to write YAML, disk-mount issues)

Slide 23


THANK YOU!
GARAOTUS HPC Summit 2021 | NOVEMBER 18, 2021