Getting science done with accelerated Python computing platforms

Jacob Tomlinson

November 16, 2024

Transcript

  1. “The amount of data and the demand for instant access will continue to increase.” — Grace Hopper, 1982
     New Mission Critical Software, 1970s-1980s
  2. The Next Decade of Data Expectations
     Internet-scale data | Massive models | Real-time performance
     [Chart: Data Volume in Zettabytes. Source: IDC, Revelations in the Global DataSphere, US49346123, July 2023]
     Use cases: Recommenders, Fraud Detection, LLMs, Genomic Analysis, Forecasting, Cybersecurity
  3. Accelerated Computing Swim Lanes
     RAPIDS makes accelerated computing more seamless while enabling specialization for maximum performance.
  4. Bringing NVIDIA Accelerated Computing to Polars
     Polars GPU Engine, powered by RAPIDS cuDF
     https://developer.nvidia.com/blog/polars-gpu-engine-powered-by-rapids-cudf-now-available-in-open-beta/
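     A minimal sketch of selecting the GPU engine from Python, assuming Polars is installed with GPU support (pip install polars[gpu] --extra-index-url=https://pypi.nvidia.com); the file and column names here are hypothetical:

         # Minimal sketch: running a Polars lazy query on the GPU engine.
         # "sales.parquet" and its columns are hypothetical.
         import polars as pl

         query = (
             pl.scan_parquet("sales.parquet")
             .group_by("region")
             .agg(pl.col("amount").sum().alias("total"))
         )

         # engine="gpu" asks Polars to run the query on the cuDF-backed GPU
         # engine; unsupported queries fall back to the default CPU engine.
         result = query.collect(engine="gpu")
         print(result)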
  5. Accelerated Apache Spark
     Zero-code-change acceleration for Spark DataFrames and SQL

     CPU Spark:
         spark.sql("""select count(*) as order_count from orders""")

     GPU Spark:
         spark.conf.set("spark.plugins", "com.nvidia.spark.SQLPlugin")
         spark.sql("""select count(*) as order_count from orders""")

     Average speed-ups: >5x (NVIDIA Decision Support Benchmark, 3 TB, public cloud: Amazon EMR, Google Cloud Dataproc)
     • Operates as a software plugin to the popular Apache Spark platform
     • Automatically accelerates supported operations (with CPU fallback if needed)
     • Requires no code changes
     • Works with Spark standalone, YARN clusters, and Kubernetes clusters
     • Deploy on: Apache Spark 3.4.1, RAPIDS Spark release 24.04
     See GTC session S62257 for details.
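     A self-contained sketch of enabling the plugin when building a PySpark session; it assumes the rapids-4-spark jar is already available to the cluster, and the "orders" table is hypothetical:

         # Minimal sketch: enabling the RAPIDS Accelerator for Apache Spark.
         # Assumes the rapids-4-spark jar is on the cluster's classpath;
         # the "orders" table is hypothetical.
         from pyspark.sql import SparkSession

         spark = (
             SparkSession.builder
             .appName("rapids-accelerated-sql")
             # The only change versus CPU Spark: load the RAPIDS plugin.
             .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
             .getOrCreate()
         )

         # Supported operations run on the GPU; the rest fall back to CPU.
         spark.sql("select count(*) as order_count from orders").show()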
  6. Accelerated Dask
     Just set “cudf” as the backend and use Dask-CUDA workers
     • Configurable backend and GPU-aware workers
     • Memory spilling (GPU -> CPU -> disk)
     • Optimized memory management
     • Accelerated RDMA and networking (UCX)
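     A minimal sketch of both pieces together, assuming dask, dask-cuda, and cudf are installed on a machine with NVIDIA GPUs; the dataset path and column names are hypothetical:

         # Minimal sketch: cuDF-backed Dask DataFrames on Dask-CUDA workers.
         # "data/*.parquet" and its columns are hypothetical.
         import dask
         import dask.dataframe as dd
         from dask_cuda import LocalCUDACluster
         from distributed import Client

         # One worker per visible GPU, with GPU -> CPU memory spilling.
         cluster = LocalCUDACluster()
         client = Client(cluster)

         # Build cuDF (GPU) DataFrames instead of pandas ones.
         dask.config.set({"dataframe.backend": "cudf"})

         ddf = dd.read_parquet("data/*.parquet")
         print(ddf.groupby("key")["value"].mean().compute())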
  7. cuML: Accelerated machine learning with a scikit-learn API

     CPU (scikit-learn):
         >>> from sklearn.ensemble import RandomForestClassifier
         >>> clf = RandomForestClassifier()
         >>> clf.fit(x, y)

     GPU (cuML):
         >>> from cuml.ensemble import RandomForestClassifier
         >>> clf = RandomForestClassifier()
         >>> clf.fit(x, y)

     50+ GPU-accelerated algorithms: time series, preprocessing, classification, tree models, cross validation, clustering, explainability, dimensionality reduction, regression
     Benchmark setup: A100 GPU vs. AMD EPYC 7642 (96 logical cores); cuML 23.04, scikit-learn 1.2.2, umap-learn 0.5.3
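     A self-contained version of that import swap, assuming cuml is installed on a machine with an NVIDIA GPU; the synthetic dataset is purely illustrative:

         # Minimal sketch: swapping scikit-learn for cuML is one import.
         # The synthetic dataset is purely illustrative.
         from sklearn.datasets import make_classification

         # from sklearn.ensemble import RandomForestClassifier  # CPU
         from cuml.ensemble import RandomForestClassifier       # GPU

         X, y = make_classification(n_samples=10_000, n_features=20,
                                    random_state=0)
         X = X.astype("float32")  # cuML works best with 32-bit floats
         y = y.astype("int32")

         clf = RandomForestClassifier()
         clf.fit(X, y)
         print(clf.predict(X[:5]))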
  8. Accelerated XGBoost
     “XGBoost is All You Need” – Bojan Tunguz, 4x Kaggle Grandmaster

     CPU:
         >>> from xgboost import XGBClassifier
         >>> clf = XGBClassifier()
         >>> clf.fit(x, y)

     GPU:
         >>> from xgboost import XGBClassifier
         >>> clf = XGBClassifier(device="cuda")
         >>> clf.fit(x, y)

     Up to 20x speed-ups
     • One line of code change to unlock up to 20x speedups with GPUs
     • Scalable to the world's largest datasets with Dask and PySpark
     • Built-in SHAP support for model explainability
     • Deployable with Triton for lightning-fast inference in production
     • RAPIDS helps maintain the XGBoost project
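     A runnable version of that one-line change, assuming xgboost >= 2.0 and an NVIDIA GPU; the synthetic dataset is purely illustrative:

         # Minimal sketch: GPU training in XGBoost via the device parameter
         # (XGBoost >= 2.0). The synthetic dataset is purely illustrative.
         from sklearn.datasets import make_classification
         from xgboost import XGBClassifier

         X, y = make_classification(n_samples=10_000, n_features=20,
                                    random_state=0)

         clf = XGBClassifier(device="cuda")  # device="cpu" for CPU training
         clf.fit(X, y)
         print(clf.predict(X[:5]))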
  9. Accelerated NetworkX
     nx-cugraph: the zero-code-change GPU backend for NetworkX
     • Zero-code-change GPU acceleration of NetworkX code
     • Accelerates algorithms up to 600x, depending on the algorithm and graph size
     • Supports 60 popular graph algorithms, and growing
     • Falls back to CPU NetworkX for unsupported algorithms

     Install:
         pip install nx-cugraph-cu12 --extra-index-url https://pypi.nvidia.com
         conda install -c rapidsai -c conda-forge -c nvidia nx-cugraph

     Benchmark setup: NetworkX 3.2; CPU: Intel(R) Xeon(R) Platinum 8480CL, 2 TB RAM; GPU: NVIDIA H100 80GB
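     A minimal sketch of dispatching a single call to the GPU backend, assuming nx-cugraph is installed; the built-in example graph is purely illustrative:

         # Minimal sketch: routing a NetworkX algorithm to nx-cugraph.
         # Assumes nx-cugraph is installed; the graph is illustrative.
         import networkx as nx

         G = nx.karate_club_graph()

         # Per-call dispatch to the GPU backend; omitting backend= keeps
         # the default CPU code path.
         bc = nx.betweenness_centrality(G, backend="cugraph")
         print(max(bc, key=bc.get))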
  10. RAPIDS Deployment Models
      Scales from sharing GPUs to leveraging many GPUs at once
      • Single node: scale up interactive data science sessions with NVIDIA-accelerated tools like cudf.pandas
      • Multi node: scale out processing and training by leveraging GPU acceleration in distributed frameworks like Dask and Spark
      • Shared node: scale out AI/ML APIs and model serving with NVIDIA Triton Inference Server and the Forest Inference Library
  11. RAPIDS on Managed Notebook Platforms
      Serverless Jupyter in the cloud (example screenshot from the Vertex AI documentation)
      https://docs.rapids.ai/deployment/stable/cloud/gcp/vertex-ai/
  12. RAPIDS on Compute Pipelines
      Data processing services (example from the AWS EMR documentation)
      https://docs.nvidia.com/spark-rapids/user-guide/latest/getting-started/aws-emr.html
  13. RAPIDS on Virtual Machines
      Servers and workstations in the cloud (example from the Azure Virtual Machine documentation)
      https://docs.rapids.ai/deployment/stable/cloud/azure/azure-vm/
  14. RAPIDS on Kubernetes
      Unified cloud deployments with the NVIDIA GPU Operator
      [Diagram: a Kubernetes cluster exposing many GPUs through the GPU Operator]
  15. RAPIDS runs your workloads faster. How do you want to spend those gains?
      • Reduce cost: run servers for less time; beneficial for reducing cloud costs.
      • Do more work: run more workloads for the same time/cost; process things that were not possible before.
      • Performance boost: get work done faster; may give a competitive advantage or reduce pressure on SLAs.
      • Environmental impact: reduce the power needed to perform the same calculation; using less power produces less CO2.
      • Reduce context switching: cut the time people wait for calculations to complete, which helps them avoid switching to a different task.
      • Improve accuracy: acceleration can allow more iterations, or more data to be processed, leading to improved model accuracy.
  16. Use Case: Sharing Resources with Multi-Tenancy
      Smoothing out demand peaks while reducing context switching
      Using Kubernetes, we created an autoscaling cluster for interactive Jupyter sessions. Users only consume GPUs while they are running computations, and the cluster keeps some reserved GPU capacity so that user computations are fulfilled quickly. With an overhead of 30%, 60% of user computations started within 2 seconds and 90% within 60 seconds. This can be tuned to suit your needs: more overhead capacity results in shorter wait times. Whatever your preference, your cost is always correlated with your compute demand.
      https://docs.rapids.ai/deployment/stable/examples/rapids-autoscaling-multi-tenant-kubernetes/notebook/