Getting science done with accelerated Python computing platforms

Slide 1

Slide 1 text

Getting science done with accelerated Python computing platforms
Jacob Tomlinson, Senior Software Engineer

Slide 2

Slide 2 text

“The amount of data and the demand for instant access will continue to increase”
Grace Hopper, 1982
New Mission Critical Software, 1970s-1980s

Slide 3

Slide 3 text

Abstractions are powerful
[Diagram labels: Excel, Arduino, Python, C/C++, CUDA; axis: Ease of use; audiences: Education, Tool Builders]

Slide 4

Slide 4 text

Abstractions are powerful
[Diagram labels: NumPy/SciPy, BLAS/LAPACK, C/C++/Fortran; axes: Ease of use, Performance; audience: Tool Builders]

Slide 5

Slide 5 text

The First Wave of PyData PyData classic And many, many more

Slide 6

Slide 6 text

But there’s a problem The First Wave of PyData PyData classic

Slide 7

Slide 7 text

The Next Decade of Data Expectations
Internet scale data | Massive models | Real-time performance
[Chart: Data Volume in Zettabytes. Source: IDC, Revelations in the Global DataSphere, US49346123, July 2023]
Workloads: Recommenders, Fraud Detection, LLMs, Genomic Analysis, Forecasting, Cybersecurity

Slide 8

Slide 8 text

The First Wave of PyData Accelerated Computing

Slide 9

Slide 9 text

Handling compute requirements by scaling up

Slide 10

Slide 10 text

Accelerated Computing Swim Lanes
RAPIDS makes accelerated computing more seamless while enabling specialization for maximum performance

Slide 11

Slide 11 text

Accelerated Python Libraries Overview

Slide 12

Slide 12 text

Accelerated pandas
cudf.pandas: the zero-code-change GPU accelerator for pandas, built on cuDF
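As a rough illustration of what "zero code change" means in practice, the sketch below tries to switch pandas over to cudf.pandas and otherwise leaves it on the CPU. The helper name `enable_gpu_pandas` is made up for this example. It assumes that `cudf.pandas.install()` is called before pandas is imported, and the pandas code that follows is the same in both cases.

```python
import importlib.util


def enable_gpu_pandas() -> bool:
    """Try to activate the cudf.pandas accelerator; return True on success.

    Hypothetical helper for illustration only. It must run before
    `import pandas` for the accelerator to take effect.
    """
    if importlib.util.find_spec("cudf") is None:
        return False  # cuDF is not installed, so pandas stays on the CPU
    try:
        import cudf.pandas

        cudf.pandas.install()
    except Exception:
        # For example: cuDF is installed but no usable GPU is present.
        return False
    return True


gpu_enabled = enable_gpu_pandas()
print("pandas accelerated by cuDF:", gpu_enabled)

# From here on the pandas code is identical on CPU and GPU, e.g.:
#   import pandas as pd
#   df = pd.DataFrame({"station": ["a", "b", "a"], "temp": [1.0, 2.0, 3.0]})
#   print(df.groupby("station")["temp"].mean())
```

In a notebook the same effect is usually obtained with `%load_ext cudf.pandas`, and for scripts with `python -m cudf.pandas script.py`.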

Slide 13

Slide 13 text

Bringing NVIDIA accelerated computing to Polars Polars GPU Engine Powered by RAPIDS cuDF https://developer.nvidia.com/blog/polars-gpu-engine-powered-by-rapids-cudf-now-available-in-open-beta/

Slide 14

Slide 14 text

Accelerated Apache Spark
Zero code change acceleration for Spark DataFrames and SQL
CPU Spark:
    spark.sql("""
    select order count(*) as order_count
    from orders""")
GPU Spark:
    spark.conf.set("spark.plugins", "com.nvidia.spark.SQLPlugin")
    spark.sql("""
    select order count(*) as order_count
    from orders""")
Average Speed-Ups: >5x
• Operates as a software plugin to the popular Apache Spark platform
• Automatically accelerates supported operations (with CPU fallback if needed)
• Requires no code changes
• Works with Spark standalone, YARN clusters, Kubernetes clusters
• Deploy on: Apache Spark 3.4.1, RAPIDS Spark release 24.04
See GTC session S62257 for details
NVIDIA Decision Support Benchmark 3TB (Public Cloud)
Amazon EMR, Google Cloud Dataproc

Slide 15

Slide 15 text

Accelerated Dask
Just set “cudf” as the backend and use Dask-CUDA Workers
• Configurable Backend and GPU-Aware Workers
• Memory Spilling (GPU->CPU->Disk)
• Optimized Memory Management
• Accelerated RDMA and Networking (UCX)

Slide 16

Slide 16 text

cuML
Accelerated machine learning with a scikit-learn API
CPU (scikit-learn):
    >>> from sklearn.ensemble import RandomForestClassifier
    >>> clf = RandomForestClassifier()
    >>> clf.fit(x, y)
GPU (cuML):
    >>> from cuml.ensemble import RandomForestClassifier
    >>> clf = RandomForestClassifier()
    >>> clf.fit(x, y)
50+ GPU-Accelerated Algorithms: Time Series, Preprocessing, Classification, Tree Models, Cross Validation, Clustering, Explainability, Dimensionality Reduction, Regression
Benchmark: A100 GPU vs. AMD EPYC 7642 (96 logical cores); cuML 23.04, scikit-learn 1.2.2, umap-learn 0.5.3

Slide 17

Slide 17 text

Accelerated XGBoost
“XGBoost is All You Need” – Bojan Tunguz, 4x Kaggle Grandmaster
CPU (XGBoost):
    >>> from xgboost import XGBClassifier
    >>> clf = XGBClassifier()
    >>> clf.fit(x, y)
GPU (XGBoost):
    >>> from xgboost import XGBClassifier
    >>> clf = XGBClassifier(device="cuda")
    >>> clf.fit(x, y)
Up to 20x Speedups
• One line of code change to unlock up to 20x speedups with GPUs
• Scalable to the world’s largest datasets with Dask and PySpark
• Built-in SHAP support for model explainability
• Deployable with Triton for lightning-fast inference in production
• RAPIDS helps maintain the XGBoost project

Slide 18

Slide 18 text

Accelerated NetworkX
nx-cugraph: the zero-code-change GPU backend for NetworkX
• Zero-code-change GPU acceleration of NetworkX code
• Accelerates algorithms up to 600x, depending on algorithm and graph size
• Support for 60 popular graph algorithms and growing
• Falls back to using CPU NetworkX for unsupported algorithms
Benchmark: NetworkX 3.2, CPU: Intel(R) Xeon(R) Platinum 8480CL 2TB, GPU: NVIDIA H100 80GB
Install:
    pip install nx-cugraph-cu12 --extra-index-url https://pypi.nvidia.com
    conda install -c rapidsai -c conda-forge -c nvidia nx-cugraph
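The CPU-fallback behaviour described for these backends can be pictured with a small dispatcher. This is a conceptual sketch only, not how nx-cugraph or NetworkX implement backend dispatch. The `FallbackDispatcher` class and the toy graph functions are invented for illustration, and the "accelerated" function is a plain-Python stand-in for a GPU kernel.

```python
from typing import Any, Callable, Dict, Optional, Tuple


class FallbackDispatcher:
    """Route a call to an accelerated implementation if one exists, else to the CPU one."""

    def __init__(self) -> None:
        self._cpu: Dict[str, Callable[..., Any]] = {}
        self._accelerated: Dict[str, Callable[..., Any]] = {}

    def register(
        self,
        name: str,
        cpu: Callable[..., Any],
        accelerated: Optional[Callable[..., Any]] = None,
    ) -> None:
        self._cpu[name] = cpu
        if accelerated is not None:
            self._accelerated[name] = accelerated

    def call(self, name: str, *args: Any, **kwargs: Any) -> Tuple[str, Any]:
        fast = self._accelerated.get(name)
        if fast is not None:
            try:
                return "accelerated", fast(*args, **kwargs)
            except NotImplementedError:
                pass  # unsupported arguments: fall through to the CPU path
        return "cpu", self._cpu[name](*args, **kwargs)


def degree_cpu(graph: Dict[str, list]) -> Dict[str, int]:
    return {node: len(neighbours) for node, neighbours in graph.items()}


def degree_accelerated(graph: Dict[str, list]) -> Dict[str, int]:
    # Stand-in for a GPU kernel; it must return the same answer as the CPU version.
    return {node: len(neighbours) for node, neighbours in graph.items()}


def node_count_cpu(graph: Dict[str, list]) -> int:
    return len(graph)


dispatcher = FallbackDispatcher()
dispatcher.register("degree", cpu=degree_cpu, accelerated=degree_accelerated)
dispatcher.register("node_count", cpu=node_count_cpu)  # no accelerated version

graph = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
degree_result = dispatcher.call("degree", graph)
count_result = dispatcher.call("node_count", graph)
print(degree_result)
print(count_result)
```

The caller's code stays the same whether or not an accelerated version exists, which is the property the "zero code change" tools rely on.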

Slide 19

Slide 19 text

And many more
https://github.com/rapidsai

Slide 20

Slide 20 text

Handling data volumes by scaling out

Slide 21

Slide 21 text

Analysis Ready Data
Chunked data stored in parallel file systems and object stores
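The reason chunking matters for scaling out is that each chunk can be processed independently and the partial results combined afterwards. A minimal sketch of that pattern, using only the Python standard library and an in-memory list in place of chunks read from a file system or object store (the function names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple


def split_into_chunks(values: Sequence[float], chunk_size: int) -> List[Sequence[float]]:
    """Cut a sequence into consecutive chunks of at most chunk_size items."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def partial_sum_and_count(chunk: Sequence[float]) -> Tuple[float, int]:
    """Work done on one chunk, independently of all other chunks."""
    return sum(chunk), len(chunk)


def chunked_mean(values: Sequence[float], chunk_size: int, workers: int = 4) -> float:
    chunks = split_into_chunks(values, chunk_size)
    # Each chunk is reduced on its own worker; only small partials are combined.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum_and_count, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


readings = list(range(1, 101))
chunks = split_into_chunks(readings, chunk_size=16)
mean = chunked_mean(readings, chunk_size=16)
print(len(chunks), mean)
```

Distributed frameworks such as Dask apply the same idea across many machines, with the chunks living in storage rather than in memory.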

Slide 22

Slide 22 text

Data Proximate Computing
Calculate near the data

Slide 23

Slide 23 text

RAPIDS Deployment Models
Scales from sharing GPUs to leveraging many GPUs at once
• Single Node: Scale up interactive data science sessions with NVIDIA accelerated tools like cudf.pandas
• Multi Node: Scale out processing and training by leveraging GPU acceleration in distributed frameworks like Dask and Spark
• Shared Node: Scale out AI/ML APIs and model serving with NVIDIA Triton Inference Server and the Forest Inference Library

Slide 24

Slide 24 text

RAPIDS on Managed Notebook Platforms
Serverless Jupyter in the cloud
Example screenshot from Vertex AI documentation
https://docs.rapids.ai/deployment/stable/cloud/gcp/vertex-ai/

Slide 25

Slide 25 text

RAPIDS on Compute Pipelines
Data processing services
Example from AWS EMR documentation
https://docs.nvidia.com/spark-rapids/user-guide/latest/getting-started/aws-emr.html

Slide 26

Slide 26 text

RAPIDS on Virtual Machines
Servers and workstations in the cloud
Example from Azure Virtual Machine documentation
https://docs.rapids.ai/deployment/stable/cloud/azure/azure-vm/

Slide 27

Slide 27 text

RAPIDS on Kubernetes
Unified Cloud Deployments
[Diagram labels: GPU Operator, Kubernetes, GPU (x8)]

Slide 28

Slide 28 text

Reducing cost, doing more with less

Slide 29

Slide 29 text

RAPIDS runs your workloads faster
How do you want to spend those gains?
• Reduce cost: Reduce the amount of time you need to run servers. Beneficial for reducing cloud costs.
• Do more work: Run more workloads for the same time/cost. Process things that were not possible before.
• Performance boost: Get work done faster. May help give a competitive advantage or reduce pressure on SLAs.
• Environmental impact: Reduce the power needed to perform the same calculation. Using less power produces less CO2.
• Reduce context switching: Reduce the time people need to wait for calculations to complete, which helps avoid switching to a different task.
• Improve accuracy: Acceleration could allow for more iterations or processing more data, leading to improved model accuracy.
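The "Reduce cost" option comes down to simple arithmetic: a GPU instance usually costs more per hour, so it saves money only when the speedup exceeds the price ratio. The sketch below uses entirely hypothetical prices, runtimes, and speedup. None of these figures come from the slides or from any cloud provider.

```python
def job_cost(runtime_hours: float, hourly_rate: float) -> float:
    """Cost of running one job on one instance for the given time."""
    return runtime_hours * hourly_rate


# All figures below are made-up inputs for illustration.
cpu_runtime_hours = 10.0
cpu_hourly_rate = 2.0   # hypothetical price per hour
gpu_hourly_rate = 6.0   # hypothetical price per hour
speedup = 5.0           # hypothetical speedup of the GPU run over the CPU run

gpu_runtime_hours = cpu_runtime_hours / speedup
cpu_cost = job_cost(cpu_runtime_hours, cpu_hourly_rate)
gpu_cost = job_cost(gpu_runtime_hours, gpu_hourly_rate)

# The GPU run is cheaper whenever the speedup exceeds the price ratio.
break_even_speedup = gpu_hourly_rate / cpu_hourly_rate

print(f"CPU cost: {cpu_cost:.2f}, GPU cost: {gpu_cost:.2f}")
print(f"Break-even speedup: {break_even_speedup:.1f}x")
```

With these inputs the GPU run finishes in a fifth of the time and costs less overall. The same calculation with a speedup below the break-even value would favour the CPU.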

Slide 30

Slide 30 text

Use Case: Sharing resources with multi-tenancy
Smoothing out demand peaks while reducing context switching
Using Kubernetes, we created an autoscaling cluster for interactive Jupyter sessions. Users only occupy GPUs while they are running computations. The cluster keeps some GPU capacity in reserve so that user computations start quickly.
With 30% overhead capacity, 60% of user computations started within 2 seconds and 90% within 60 seconds. This can be tuned to suit your needs: more overhead capacity results in shorter wait times. Whatever your preference, your cost is always correlated with your compute demand.
https://docs.rapids.ai/deployment/stable/examples/rapids-autoscaling-multi-tenant-kubernetes/notebook/
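One way to read the overhead setting is as spare capacity kept on top of the GPUs currently in use. The slide does not define it precisely, so the sketch below is an assumption rather than the logic of the linked example: the function name, the rounding rule, and the minimum of one GPU are all illustrative choices.

```python
def target_gpu_count(gpus_in_use: int, overhead_percent: int = 30, minimum: int = 1) -> int:
    """GPUs the cluster should provision: current use plus reserved headroom.

    Assumes overhead is a percentage of the GPUs in use, rounded up to a
    whole GPU. Integer arithmetic avoids floating point rounding surprises.
    """
    spare = (gpus_in_use * overhead_percent + 99) // 100  # ceiling division
    return max(minimum, gpus_in_use + spare)


for in_use in (0, 7, 10, 40):
    print(in_use, "in use ->", target_gpu_count(in_use), "provisioned")
```

Raising `overhead_percent` keeps more GPUs idle and ready, which trades higher cost for shorter wait times, as the slide describes.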

Slide 31

Slide 31 text

Thank you!
Learn more at https://rapids.ai
@jacobtomlinson.dev