Deploying multi-GPU workloads on Kubernetes in Python

Slide 1

Slide 1 text

Deploying Multi-GPU Workloads on Kubernetes in Python
PyData DC - Feb 2023
Jacob Tomlinson, Software Engineering Lead, NVIDIA

Slide 2

Slide 2 text

Jake VanderPlas - PyCon 2017
Jacob Tomlinson, Former Research Software Engineer, UK Met Office

Slide 3

Slide 3 text

RAPIDS
https://github.com/rapidsai
Jacob Tomlinson, Cloud Lead, RAPIDS

Slide 4

Slide 4 text

Minor Code Changes for Major Benefits
Abstracting Accelerated Compute through Familiar Interfaces
CPU, pandas:
    In [1]: import pandas as pd
    In [2]: df = pd.read_csv('filepath')
GPU, cuDF:
    In [1]: import cudf
    In [2]: df = cudf.read_csv('filepath')
CPU, scikit-learn:
    In [1]: from sklearn.ensemble import RandomForestClassifier
    In [2]: clf = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=0)
    In [3]: clf.fit(x, y)
GPU, cuML:
    In [1]: from cuml.ensemble import RandomForestClassifier
    In [2]: cuclf = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=0)
    In [3]: cuclf.fit(x, y)
CPU, NetworkX:
    In [1]: import networkx as nx
    In [2]: page_rank = nx.pagerank(graph)
GPU, cuGraph:
    In [1]: import cugraph
    In [2]: page_rank = cugraph.pagerank(graph)
Average speed-ups quoted on the slide: 150x, 250x, 50x
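Because the GPU libraries mirror the CPU APIs, the choice of backend can be deferred to import time. The following is a minimal sketch of that idea, not something shown in the talk: the helper names `first_available` and `load_dataframe_backend` are hypothetical, and it assumes a failed `cudf` import surfaces as an `ImportError`.

```python
import importlib


def first_available(module_names):
    """Return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Not installed in this environment, try the next candidate.
            continue
    raise ImportError(f"none of {list(module_names)} could be imported")


def load_dataframe_backend():
    # Prefer the GPU library and fall back to the CPU one.
    # Assumption: a missing or unusable cudf raises ImportError; other
    # failure modes (for example driver errors) are not handled here.
    return first_available(["cudf", "pandas"])


# Usage (requires pandas or cudf to be installed):
# xdf = load_dataframe_backend()
# df = xdf.read_csv("filepath")
```

The rest of the script then calls `read_csv` on whichever module was returned, so the same code can run on a laptop and on a GPU node.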

Slide 5

Slide 5 text

Lightning-Fast End-to-End Performance
Reducing Data Science Processes from Hours to Seconds
● 16 A100s Provide More Power than 100 CPU Nodes
● 20x More Cost-Effective than Similar CPU Configuration
● 70x Faster Performance than Similar CPU Configuration
*CPU approximate to n1-highmem-8 (8 vCPUs, 52GB memory) on Google Cloud Platform. TCO calculations based on Cloud instance costs.

Slide 6

Slide 6 text

● General purpose Python library for parallelism
● Scales existing libraries, like NumPy, pandas, and scikit-learn
● Flexible enough to build complex and custom systems
● Accessible for beginners, secure and trusted for institutions
Jacob Tomlinson, Core Developer, Dask

Slide 7

Slide 7 text

Dask accelerates the existing Python ecosystem
Built alongside the current community
NumPy:
    import numpy as np
    x = np.ones((1000, 1000))
    x + x.T - x.mean(axis=0)
pandas:
    import pandas as pd
    df = pd.read_csv("file.csv")
    df.groupby("x").y.mean()
scikit-learn:
    from sklearn.linear_model import LogisticRegression
    lr = LogisticRegression()
    lr.fit(data, labels)

Slide 8

Slide 8 text

Open Source Software Has Democratized Data Science
Highly Accessible, Easy to Use Tools Abstract Complexity
[Diagram: the CPU data science stack, running in CPU memory on Apache Spark / Dask]
● Data Preparation: Pre-Processing (pandas)
● Model Training: Machine Learning (scikit-learn), Graph Analytics (NetworkX), Deep Learning (TensorFlow, PyTorch, MXNet)
● Visualization: matplotlib

Slide 9

Slide 9 text

Accelerated Data Science with RAPIDS
Powering Popular Data Science Ecosystems with NVIDIA GPUs
[Diagram: the GPU data science stack, running in GPU memory on Spark / Dask]
● Data Preparation: Pre-Processing (cuIO & cuDF)
● Model Training: Machine Learning (cuML, XGBoost), Graph Analytics (cuGraph), Deep Learning (TensorFlow, PyTorch, MXNet)
● Visualization: cuXfilter, pyViz, Plotly

Slide 10

Slide 10 text

XGBoost + RAPIDS: Better Together
● RAPIDS comes paired with XGBoost 1.6.0
● XGBoost provides zero-copy data import from cuDF, CuPy, Numba, PyTorch and more
● Official Dask API makes it easy to scale to multiple nodes or multiple GPUs
● GPU tree builder delivers huge perf gains
● Now supports Learning to Rank, categorical variables, and SHAP Explainability
● Use models directly in Triton for high-performance inference
"XGBoost is All You Need" – Bojan Tunguz, 4x Kaggle Grandmaster
All RAPIDS changes are integrated upstream and provided to all XGBoost users – via PyPI or RAPIDS conda
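As an illustration of the Dask API and GPU tree builder mentioned above, here is a hedged sketch that is not taken from the talk. It assumes an already connected `dask.distributed.Client` and Dask collections `X` and `y`; the function names, the objective, and the hyperparameter values are hypothetical. `tree_method="gpu_hist"` matches the XGBoost 1.6 era named on the slide, while newer releases select the GPU with `device="cuda"` instead.

```python
def gpu_training_params(max_depth=8, learning_rate=0.1):
    """Booster parameters that select the GPU tree builder (XGBoost 1.6 style)."""
    return {
        "tree_method": "gpu_hist",       # GPU histogram tree builder
        "objective": "binary:logistic",  # assumption: binary classification
        "max_depth": max_depth,
        "learning_rate": learning_rate,
    }


def train_on_cluster(client, X, y, num_boost_round=100):
    """Train across the workers of an existing Dask cluster.

    client is a connected dask.distributed.Client; X and y are Dask
    collections such as dask_cudf DataFrames or Dask arrays.
    """
    # Imported lazily so the parameter helper works without xgboost installed.
    from xgboost import dask as dxgb

    dtrain = dxgb.DaskDMatrix(client, X, y)
    result = dxgb.train(
        client, gpu_training_params(), dtrain, num_boost_round=num_boost_round
    )
    # xgboost.dask.train returns a dict holding the model and the eval history.
    return result["booster"], result["history"]
```

The same function works against a single multi-GPU machine or a multi-node cluster, since only the `client` changes.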

Slide 11

Slide 11 text

Deploying RAPIDS on the cloud

Slide 12

Slide 12 text

RAPIDS in the Cloud
Current Focus Areas
• Kubernetes
  • Helm Charts
  • Operator
  • Kubeflow
• Cloud ML Platforms
  • Amazon SageMaker Studio
  • Google Vertex AI
• Cloud Compute
  • Amazon EC2, ECS, Fargate, EKS
  • Google Compute Engine, Dataproc, GKE
• Cloud ML examples gallery
New Deployment documentation website
Deployment Documentation: docs.rapids.ai/deployment/stable
Kubernetes Deployment: docs.rapids.ai/deployment/stable/platforms/kubernetes.html
Dask Kubernetes: kubernetes.dask.org

Slide 13

Slide 13 text

RAPIDS on Kubernetes
Unified Cloud Deployments
[Diagram: GPU Operator and Kubernetes layered over eight GPUs]

Slide 14

Slide 14 text

Live Demo
Murphy's First Law: Anything that can go wrong will go wrong.
Murphy's Second Law: Nothing is as easy as it looks.
Murphy's Third Law: Everything takes longer than you think it will.

Slide 15

Slide 15 text

Launch a Kubernetes Cluster
    # Launch a Kubernetes Cluster with GPUs
    $ gcloud container clusters create jtomlinson-rapids-demo \
        --accelerator type=nvidia-tesla-a100,count=2 \
        --machine-type a2-highgpu-2g \
        --zone us-central1-c
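If the cluster is created from a script rather than typed by hand, the same command can be assembled from parameters. This is only a sketch: the helper `gke_gpu_cluster_command` is hypothetical, it only builds and prints the command shown on the slide, and it runs nothing.

```python
import shlex


def gke_gpu_cluster_command(name, gpu_type, gpu_count, machine_type, zone):
    """Build the gcloud argument list for a GPU-enabled GKE cluster."""
    return [
        "gcloud", "container", "clusters", "create", name,
        "--accelerator", f"type={gpu_type},count={gpu_count}",
        "--machine-type", machine_type,
        "--zone", zone,
    ]


cmd = gke_gpu_cluster_command(
    name="jtomlinson-rapids-demo",
    gpu_type="nvidia-tesla-a100",
    gpu_count=2,
    machine_type="a2-highgpu-2g",
    zone="us-central1-c",
)

# Print a copy-pasteable command line. Passing the list itself to
# subprocess.run(cmd, check=True) would execute it without shell quoting issues.
print(shlex.join(cmd))
```

Keeping the arguments as a list makes it easy to vary the GPU type or count per environment.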

Slide 16

Slide 16 text

Install NVIDIA Drivers
    # Install the NVIDIA Drivers
    $ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml

Slide 17

Slide 17 text

Install the Dask operator
    # Install the Dask Operator
    $ helm install --repo https://helm.dask.org \
        --create-namespace -n dask-operator \
        --generate-name dask-kubernetes-operator

Slide 18

Slide 18 text

Installing the operator
    # Check that we can list daskcluster resources
    $ kubectl get daskclusters
    No resources found in default namespace.
    # Check that the operator pod is running
    $ kubectl get pods -A -l application=dask-kubernetes-operator
    NAMESPACE       NAME                                        READY   STATUS    RESTARTS   AGE
    dask-operator   dask-kubernetes-operator-775b8bbbd5-zdrf7   1/1     Running   0          74s
    # 🚀 done!
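The readiness check above can also be automated. The sketch below is an illustration rather than part of the demo: it parses the whitespace-separated table that `kubectl get pods -A` prints and reports whether every listed pod is running and fully ready. The helper names are hypothetical, and the sample text is the output shown on the slide.

```python
def parse_pod_table(text):
    """Turn `kubectl get pods -A` table output into a list of dicts."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    header = lines[0].split()
    # Limitation: a RESTARTS value such as "1 (5m ago)" contains spaces and
    # would shift the AGE column. READY and STATUS come earlier, so the
    # check below is unaffected.
    return [dict(zip(header, line.split())) for line in lines[1:]]


def operator_ready(pods):
    """True when at least one pod is listed and all are Running and ready."""
    def is_ready(pod):
        ready, total = pod["READY"].split("/")
        return pod["STATUS"] == "Running" and ready == total

    return bool(pods) and all(is_ready(pod) for pod in pods)


sample = """
NAMESPACE       NAME                                        READY   STATUS    RESTARTS   AGE
dask-operator   dask-kubernetes-operator-775b8bbbd5-zdrf7   1/1     Running   0          74s
"""

pods = parse_pod_table(sample)
print(operator_ready(pods))
```

For anything beyond a quick check, asking kubectl for structured output with `-o json` is more robust than parsing the table.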

Slide 19

Slide 19 text

Get a Jupyter notebook
    # Create a notebook Pod for us to drive the workload from
    $ kubectl apply -f notebook.yaml
Source for notebook.yaml: https://gist.github.com/jacobtomlinson/397b277e6cc4b717d9ff04759f350b4a#file-notebook-yaml

Slide 20

Slide 20 text

Create RAPIDS Clusters within Notebooks
With on-prem or cloud-managed Kubernetes
    # Install dask-kubernetes
    $ pip install dask-kubernetes
    # Launch a cluster
    >>> from dask_kubernetes.operator import KubeCluster
    >>> cluster = KubeCluster(name="demo")
    # List the DaskCluster custom resource that was created for us under the hood
    $ kubectl get daskclusters
    NAME           AGE
    demo-cluster   6m3s
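The cluster above uses the default image and CPU workers. For GPU workers, `KubeCluster` accepts `image`, `n_workers`, `resources`, and `worker_command` keyword arguments. The sketch below shows one way to put them together; it is an assumption-laden illustration, not the configuration used in the demo. The helper names are hypothetical and the image string is a placeholder to replace with a real RAPIDS container image.

```python
def gpu_cluster_kwargs(name, image, n_workers, gpus_per_worker=1):
    """Keyword arguments for a KubeCluster whose workers each request GPUs."""
    return {
        "name": name,
        "image": image,
        "n_workers": n_workers,
        # Ask Kubernetes to schedule each worker pod onto a node with free GPUs.
        "resources": {"limits": {"nvidia.com/gpu": str(gpus_per_worker)}},
        # Start dask-cuda-worker (from dask-cuda) instead of the default worker.
        "worker_command": "dask-cuda-worker",
    }


def launch_gpu_cluster(**kwargs):
    """Create the cluster and connect a client (needs a live Kubernetes cluster)."""
    # Imported lazily so the kwargs helper works without these packages.
    from dask.distributed import Client
    from dask_kubernetes.operator import KubeCluster

    cluster = KubeCluster(**kwargs)
    client = Client(cluster)
    return cluster, client


kwargs = gpu_cluster_kwargs(
    name="demo",
    image="<a RAPIDS container image>",  # placeholder, not a real tag
    n_workers=2,
)
# cluster, client = launch_gpu_cluster(**kwargs)
```

With two A100s per node, as in the cluster created earlier, `n_workers=2` with one GPU each would fill a single node.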

Slide 21

Slide 21 text

Create RAPIDS Clusters with kubectl
The Dask Operator has three custom resource types that you can create via kubectl.
● DaskCluster to create whole clusters.
● DaskWorkerGroup to create additional groups of workers with various configurations (high memory, GPUs, etc).
● DaskJob to run end-to-end tasks like a Kubernetes Job but with an adjacent Dask Cluster.
    # cluster.yaml
    apiVersion: kubernetes.dask.org/v1
    kind: DaskCluster
    metadata:
      name: simple-cluster
    spec:
      worker:
        replicas: 3
        spec:
          containers:
          - name: worker
            image: "ghcr.io/dask/dask:latest"
            imagePullPolicy: "IfNotPresent"
            args:
              - dask-worker
              - --name
              - $(DASK_WORKER_NAME)
      scheduler:
        spec:
          containers:
          - name: scheduler
            image: "ghcr.io/dask/dask:latest"
            imagePullPolicy: "IfNotPresent"
            args:
              - dask-scheduler
            ports:
              - name: tcp-comm
                containerPort: 8786
                protocol: TCP
              - name: http-dashboard
                containerPort: 8787
                protocol: TCP
            readinessProbe:
              httpGet:
                port: http-dashboard
                path: /health
              initialDelaySeconds: 5
            …
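To illustrate the DaskWorkerGroup resource, the following sketch generates a manifest for an extra group of GPU workers attached to the `simple-cluster` defined above. It is a hedged example rather than the talk's own manifest: the helper name, the group name, and the image placeholder are hypothetical, and the GPU limit and `dask-cuda-worker` command are assumptions about how a GPU group would be configured. The output is JSON, which `kubectl apply -f` accepts as well as YAML.

```python
import json


def gpu_worker_group(cluster_name, group_name, image, replicas, gpus_per_worker=1):
    """Build a DaskWorkerGroup manifest for GPU workers as a plain dict."""
    worker_container = {
        "name": "worker",
        "image": image,
        "imagePullPolicy": "IfNotPresent",
        # Same argument pattern as the DaskCluster worker, with the CUDA worker.
        "args": ["dask-cuda-worker", "--name", "$(DASK_WORKER_NAME)"],
        "resources": {"limits": {"nvidia.com/gpu": str(gpus_per_worker)}},
    }
    return {
        "apiVersion": "kubernetes.dask.org/v1",
        "kind": "DaskWorkerGroup",
        "metadata": {"name": group_name},
        "spec": {
            # Name of the existing DaskCluster this group joins.
            "cluster": cluster_name,
            "worker": {
                "replicas": replicas,
                "spec": {"containers": [worker_container]},
            },
        },
    }


manifest = gpu_worker_group(
    cluster_name="simple-cluster",
    group_name="simple-cluster-gpu",     # hypothetical name
    image="<a RAPIDS container image>",  # placeholder, not a real tag
    replicas=2,
)
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and applying it with kubectl would add the group alongside the three CPU workers from `cluster.yaml`.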

Slide 22

Slide 22 text

Workload demo

Slide 23

Slide 23 text

Typical ML workflows

Slide 24

Slide 24 text

Typical ML workflows

Slide 25

Slide 25 text

Parallel HPO
Computational Parallelism Beyond a Single Node
GCP T4 Instance: LocalCUDA cluster with four GPU cuda-workers
    X, y = … # NumPy Arrays
    # Optimize in parallel on your Dask cluster
    with parallel_backend("dask"):
        study.optimize(lambda trial: objective(trial, X, y),
                       n_trials=100,
                       n_jobs=4)  # NGPUs on system
GKE Cluster with GPU Pods: KubeCluster with GPU cuda-workers spread across pods
    X, y = … # NumPy Arrays
    # Optimize in parallel on your Dask cluster
    with parallel_backend("dask"):
        study.optimize(lambda trial: objective(trial, X, y),
                       n_trials=100,
                       n_jobs=20)  # NGPUs on K8s cluster
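In the slide code, `parallel_backend` comes from joblib and `study` appears to be an Optuna study, with the trials dispatched to Dask workers. The pattern underneath is simple: trials are independent, so `n_jobs` only changes how many run at once, not which result wins. The sketch below demonstrates that pattern with a deliberately different mechanism, a standard-library thread pool and random search standing in for Dask and Optuna. The objective and parameter ranges are made up for illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def objective(params):
    """Toy loss with its minimum at lr=0.1, max_depth=8 (hypothetical)."""
    return (params["lr"] - 0.1) ** 2 + 1e-3 * (params["max_depth"] - 8) ** 2


def sample_params(seed):
    """Draw one candidate; seeding per trial keeps runs reproducible."""
    rng = random.Random(seed)
    return {"lr": rng.uniform(0.01, 0.5), "max_depth": rng.randint(2, 16)}


def run_trials(n_trials, n_jobs):
    """Evaluate n_trials candidates with up to n_jobs running concurrently."""
    candidates = [sample_params(seed) for seed in range(n_trials)]
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # map preserves input order, so losses[i] belongs to candidates[i].
        losses = list(pool.map(objective, candidates))
    best = min(range(n_trials), key=losses.__getitem__)
    return candidates[best], losses[best]


best_params, best_loss = run_trials(n_trials=100, n_jobs=4)
print(best_params, best_loss)
```

Moving from `n_jobs=4` on one machine to `n_jobs=20` on a Kubernetes cluster follows the same logic: more trials in flight, same search.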

Slide 26

Slide 26 text

Example Notebook
github.com/rapidsai/cloud-ml-examples

Slide 27

Slide 27 text

Slide 28

Slide 28 text

Wrap up

Slide 29

Slide 29 text

RAPIDS Community
Join us
[Logo groups: Open Source, Contributors, Adopters]

Slide 30

Slide 30 text

How to Get Started with RAPIDS
A Variety of Ways to Get Up & Running
More about RAPIDS
● Learn more at RAPIDS.ai
● Read the API docs
● Check out the RAPIDS blog
● Read the NVIDIA DevBlog
Self-Start Resources
● Get started with RAPIDS
● Deploy on the Cloud today
● Start with Google Colab
● Look at the cheat sheets
Discussion & Support
● Check the RAPIDS GitHub
● Use the NVIDIA Forums
● Reach out on Slack
● Talk to NVIDIA Services
Get Engaged
@RAPIDSai
https://github.com/rapidsai
https://rapids-goai.slack.com/join
https://rapids.ai

Slide 31

Slide 31 text

THANK YOU Jacob Tomlinson [email protected] @_jacobtomlinson