
Accelerating Python on HPC with Dask and RAPIDS

Dask is a popular Python framework for scaling your workloads, whether you want to leverage all of the cores on your laptop and stream large datasets through memory, or scale your workload out to thousands of cores on large compute clusters. Dask allows you to distribute code using familiar APIs such as pandas, NumPy and scikit-learn or write your own distributed code with powerful parallel task-based programming primitives.

We will start by exploring the concept of adaptive clusters, which allow for dynamic scaling of resources based on the workload's demands. Adaptive clusters automatically submit and manage many jobs to an HPC queue, ensuring efficient resource utilisation and cost-effectiveness. This method is particularly useful for workloads with varying computational requirements, as it adjusts the number of active workers in real-time.
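
For a concrete sketch, dask-jobqueue can provide such an adaptive, SLURM-backed cluster in a few lines; the resource values below are illustrative and would be tuned to the site and workload:

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Each Dask worker runs inside its own SLURM job with these (illustrative) resources
cluster = SLURMCluster(cores=4, memory="16GB", walltime="01:00:00")

# Let Dask submit and cancel worker jobs as the workload demands
cluster.adapt(minimum=0, maximum=20)

client = Client(cluster)
```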

Next, we will dive into using runners that leverage parallel execution environments such as MPI or job schedulers like SLURM to bootstrap Dask clusters within a single large job allocation. Submitting a single job offers some benefits (aside from the fact that HPC administrators often prefer this approach), including better node locality, as the scheduler places processes on nodes that are physically closer together. This results in more efficient communication and reduced latency. Additionally, launching all workers simultaneously ensures balanced data distribution across the cluster.
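
As a sketch of the runner pattern, dask-jobqueue's SLURMRunner (shown again in the transcript below) bootstraps a cluster from inside a single multi-rank job: every rank runs the same script, the ranks elect a scheduler, one rank continues as the client, and the rest become workers.

```python
# runner.py, launched with something like: srun -n 100 python runner.py
from dask.distributed import Client
from dask_jobqueue.slurm import SLURMRunner

# All ranks execute this script; the runner assigns scheduler, worker and client roles
with SLURMRunner(scheduler_file="scheduler-{job_id}.json") as runner:
    # Only the client rank continues past this point
    with Client(runner) as client:
        client.wait_for_workers(runner.n_workers)
        ...  # your Dask workload here
```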

The session will then shift focus to the accelerated side of Dask, demonstrating how to harness the power of GPUs to significantly boost computation speed. We will introduce Dask CUDA, part of RAPIDS, a suite of open-source libraries designed to execute end-to-end data science and analytics pipelines entirely on GPUs. By integrating Dask CUDA, users can achieve substantial speedups, particularly for data-intensive tasks such as machine learning and data preprocessing.
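
A minimal sketch of this on a single node, assuming dask-cuda is installed and GPUs are visible: LocalCUDACluster starts one worker per GPU.

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One Dask worker process per visible GPU on this node
cluster = LocalCUDACluster()
client = Client(cluster)
```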

We will also explore the advantages of using UCX (Unified Communication X) to enhance Dask's performance on HPC systems with advanced networking technologies. UCX provides a high-performance communication layer that supports various network transports, including InfiniBand and NVLink. By leveraging these accelerated networking options, users can achieve lower latency and higher bandwidth, resulting in faster data transfers between Dask workers and more efficient parallel computations.
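
A sketch of enabling this, matching the example in the transcript below: select the "ucx" protocol and the transports available on the node when creating the cluster.

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Use UCX instead of TCP, enabling the accelerated transports present on this node
cluster = LocalCUDACluster(
    protocol="ucx",
    enable_infiniband=True,
    enable_nvlink=True,
)
client = Client(cluster)
```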

Outline:
- Overview of Dask
  - Scaling out pandas and NumPy
  - Custom parallel code
  - Workflow engines
  - Machine learning and AI applications
- Deploying Dask on HPC
  - Adaptive clusters
  - Fixed size runners
- Accelerating Dask on HPC
  - RAPIDS and Dask CUDA
  - UCX

Jacob Tomlinson

August 29, 2024

Transcript

  1. 1 Accelerating Python on HPC with Dask and RAPIDS. Jacob Tomlinson, Dask Maintainer and RAPIDS Developer. EuroSciPy 2024.
  2. 5 Dask Distributed: Dask cluster process architecture. Client: your Dask code that runs the business logic of your workflow; it converts code into task graphs instead of executing it directly. Scheduler: receives task graphs and coordinates the execution of those tasks; also makes autoscaling decisions. Workers: execute individual tasks on remote machines.
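
A minimal sketch of this flow using Dask's public API: the client turns the call into a task, the scheduler assigns it to a worker, and the result is fetched back.

```python
from dask.distributed import Client

client = Client()                       # starts or connects to a scheduler
future = client.submit(sum, [1, 2, 3])  # client hands the task to the scheduler
print(future.result())                  # a worker executes it and returns 6
```
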
  3. 6 Clusters vs Runners: deployment paradigms. Batch Runner: the workload starts as a multi-node job, and the nodes coordinate at startup to elect a scheduler and run the client code. Dynamic Cluster: the workload starts as a single-node job, and Dask spawns single-node worker jobs dynamically as they are required.
  4. 8 dask-jobqueue: directly interact with queues. • Has tools for dynamic clusters and runners. • Supports many schedulers including PBS, SLURM, SGE, OAR and more. • Integrates well with other Dask tooling like Dask's JupyterLab extension.
  5. 9 Cluster Example: interactive dynamic scaling.

        from dask.distributed import Client
        from dask_jobqueue.slurm import SLURMCluster

        cluster = SLURMCluster(cores=1, memory="4GB")
        cluster.scale(2)
        client = Client(cluster)
        ...
        client.close()
        cluster.close()

  6. 10 Runner Example: batch workloads.

        from dask.distributed import Client
        from dask_jobqueue.slurm import SLURMRunner

        with SLURMRunner(scheduler_file="scheduler-{job_id}.json") as runner:
            with Client(runner) as client:
                client.wait_for_workers(runner.n_workers)
                ...

     Launched with: $ srun -n 100 python runner.py

  7. 11 dask-mpi: batch workloads on any MPI system.

        from dask_mpi import initialize
        initialize()

        from dask.distributed import Client
        client = Client()

     Launched with: $ srun -n 100 python mpi-runner.py

     Or run the scheduler and workers from the CLI and connect using a scheduler file:

        $ mpirun -np 4 dask-mpi \
            --scheduler-file scheduler.json

        from distributed import Client
        client = Client(scheduler_file='scheduler.json')

  8. 16 Parallel HPO: computational parallelism beyond a single node. The same code scales from a single node with multiple GPUs (LocalCUDACluster) to multiple nodes with multiple GPUs (SLURMCluster), with n_jobs set to the number of GPUs available on the system or cluster.

        X, y = …  # NumPy Arrays

        # Optimize in parallel on your Dask cluster
        with parallel_backend("dask"):
            study.optimize(lambda trial: objective(trial, X, y),
                           n_trials=100, n_jobs=4)  # n_jobs = number of GPUs

  9. 20 Lightning-Fast End-to-End Performance: reducing data science processes from hours to seconds. 16 A100s provide more power than 100 CPU nodes; 20x more cost-effective than a similar CPU configuration; 70x faster performance than a similar CPU configuration. *CPU approximate to n1-highmem-8 (8 vCPUs, 52GB memory) on Google Cloud Platform. TCO calculations based on Cloud instance costs.
  10. 21 cudf.pandas: cuDF pandas accelerator mode. cuDF pandas accelerator mode (cudf.pandas) is built on cuDF and accelerates pandas code on the GPU. It supports 100% of the pandas API, using the GPU for supported operations and automatically falling back to pandas for everything else. https://docs.rapids.ai/api/cudf/stable/cudf_pandas/
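
A sketch of how the accelerator mode is switched on in a plain Python script (the file and column names here are hypothetical); in Jupyter the documented equivalent is `%load_ext cudf.pandas`.

```python
import cudf.pandas
cudf.pandas.install()  # must run before pandas is imported

import pandas as pd

df = pd.read_parquet("data.parquet")      # hypothetical input file
print(df.groupby("key")["value"].mean())  # runs on the GPU where supported
```
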
  11. 22 150x Faster pandas with Zero Code Change. DuckDB data benchmark, 5GB: performance comparison between traditional pandas v1.5 on an Intel Xeon Platinum 8480CL CPU and pandas v1.5 with RAPIDS cuDF on NVIDIA Grace Hopper. Source: https://developer.nvidia.com/blog/rapids-cudf-accelerates-pandas-nearly-150x-with-zero-code-changes/
  12. 23 Interoperability: zero-copy data sharing between libraries. • Real-world workflows often need to share data between libraries. • RAPIDS supports device memory sharing between many popular data science and deep learning libraries. • Keeps data on the GPU and avoids costly copying back and forth to host memory. • Any library that supports DLPack or __cuda_array_interface__ will allow for sharing of memory buffers between RAPIDS and supported libraries.
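
A sketch of what this zero-copy sharing looks like between CuPy and PyTorch via DLPack, assuming both libraries are installed with CUDA support:

```python
import cupy as cp
import torch

gpu_array = cp.arange(5)               # allocated in GPU memory by CuPy
tensor = torch.from_dlpack(gpu_array)  # PyTorch view of the same device buffer

tensor += 1                            # in-place update is visible to CuPy too
print(gpu_array)                       # [1 2 3 4 5]
```
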
  13. 26 Dask cupy/cudf: using GPU-backed data structures. Using GPU-accelerated libraries in Dask is as easy as changing one setting.

        import dask
        import dask.array as da

        dask.config.set({"array.backend": "cupy"})

        # Get a cupy-backed collection
        darr = da.ones(10, chunks=(5,))

  14. 27 UCX: bringing hardware-accelerated communications to Dask. • TCP sockets are slow! • UCX provides uniform access to transports (TCP, InfiniBand, shared memory, NVLink). • Python bindings for UCX (ucx-py). • Gives Dask the best communication performance available from the hardware on the nodes/cluster. (Benchmark shown: inner join on an NVIDIA DGX-2.)

        cluster = LocalCUDACluster(
            protocol="ucx",
            enable_infiniband=True,
            enable_nvlink=True,
        )
        client = Client(cluster)

  15. 28 NeMo Data Curator: LLM data preprocessing (10s of TBs). https://developer.nvidia.com/blog/curating-trillion-token-datasets-introducing-nemo-data-curator/
  16. 29 Roadmap: longer-term plans for Dask on HPC. • Add more Runners to dask-jobqueue for other schedulers. • Migrate dask-mpi into dask-jobqueue as a Runner. • Improve dask-cuda compatibility in dask-jobqueue. • Build out more Dask-on-HPC documentation and resources.
  17. 30 Learn New Skills For Free when you join the NVIDIA Developer Program. Sharpen your skills with free technical training: you can select a complimentary self-paced course when you register for the NVIDIA Developer Program. Courses on offer include: • Fundamentals of Accelerated Computing with CUDA Python • Getting Started With Accelerated Computing in CUDA C/C++ • Optimizing CUDA Machine Learning Codes with Nsight Profiling Tools • RAPIDS Accelerator for Apache Spark • Accelerating End-to-End Data Science Workflows. By signing up you'll also enjoy other great benefits like access to SDKs, technical documentation, training resources, and more. Join here: https://nvda.ws/45vQqdr