Slide 1

Accelerating Python on HPC with Dask and RAPIDS
Jacob Tomlinson, Dask Maintainer and RAPIDS Developer
EuroSciPy 2024

Slide 2

Overview
What is Dask?

Slide 3

Dask Scales Python
Lazy computation, out-of-core data, and distributed execution for NumPy, pandas, and scikit-learn.
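
A minimal sketch of the lazy, chunked model described above, using the public dask.array API (the array sizes here are illustrative):

import dask.array as da

# Build a lazy, chunked array; nothing is computed yet
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))

# Operations stay lazy too and only add tasks to the graph
y = (x + x.T).mean(axis=0)

# compute() runs the task graph chunk by chunk, so the full
# 20k x 20k array never needs to fit in memory at once
result = y.compute()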

Slide 4

Graph Generation
Dask converts your code into a task graph and then executes it.
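
A small sketch of how calling Dask code builds a task graph rather than running immediately (dask.delayed is used here for illustration; rendering the graph needs the optional graphviz dependency):

from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(a, b):
    return a + b

# Calling delayed functions records tasks instead of executing them
total = add(inc(1), inc(2))

# Render the task graph (optional, needs graphviz), then execute it
total.visualize("graph.svg")
print(total.compute())  # 5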

Slide 5

Dask Distributed
Dask cluster process architecture

● Client: your Dask code that runs the business logic of your workflow. Converts code into task graphs instead of executing it directly.
● Scheduler: receives task graphs and coordinates the execution of those tasks. Also makes autoscaling decisions.
● Workers: execute individual tasks on remote machines.
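
A minimal local sketch of this client/scheduler/worker split; the HPC deployment tools on the following slides simply swap LocalCluster for SLURMCluster, SLURMRunner, and so on:

from dask.distributed import LocalCluster, Client
import dask.array as da

# Start a scheduler plus four workers on this machine
cluster = LocalCluster(n_workers=4)
client = Client(cluster)

# The client turns this into a task graph and sends it to the scheduler,
# which coordinates execution across the workers
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
print(x.mean().compute())

client.close()
cluster.close()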

Slide 6

Clusters vs Runners
Deployment Paradigms

● Batch Runner: the workload starts as a single multi-node job. The nodes coordinate at startup to elect a scheduler and run the client code.
● Dynamic Cluster: the workload starts as a single-node job. Dask spawns additional single-node worker jobs dynamically as they are required.

Slide 7

Tooling
Dask tools for HPC

Slide 8

dask-jobqueue
Directly interact with queues

● Has tools for dynamic clusters and runners
● Supports many schedulers including PBS, SLURM, SGE, OAR and more
● Integrates well with other Dask tooling like Dask’s JupyterLab extension

Slide 9

Cluster Example
Interactive dynamic scaling

from dask.distributed import Client
from dask_jobqueue.slurm import SLURMCluster

cluster = SLURMCluster(cores=1, memory="4GB")
cluster.scale(2)
client = Client(cluster)

...

client.close()
cluster.close()

Slide 10

Runner Example
Batch workloads

from dask.distributed import Client
from dask_jobqueue.slurm import SLURMRunner

with SLURMRunner(scheduler_file="scheduler-{job_id}.json") as runner:
    with Client(runner) as client:
        client.wait_for_workers(runner.n_workers)
        ...

$ srun -n 100 python runner.py

Slide 11

dask-mpi
Batch workloads on any MPI system

# Option 1: initialize Dask from inside your own MPI-launched script
from dask_mpi import initialize
initialize()

from dask.distributed import Client
client = Client()

$ srun -n 100 python mpi-runner.py

# Option 2: launch a standalone cluster with the dask-mpi CLI,
# then connect to it via the scheduler file
$ mpirun -np 4 dask-mpi \
    --scheduler-file scheduler.json

from distributed import Client
client = Client(scheduler_file='scheduler.json')

Slide 12

dask-gateway
Centrally managed cluster spawning
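
A minimal sketch of connecting to a dask-gateway server; the gateway address is a placeholder and authentication is site-specific:

from dask_gateway import Gateway

gateway = Gateway("https://gateway.example.com")

# The gateway server launches the scheduler and workers on your behalf
cluster = gateway.new_cluster()
cluster.scale(4)

client = cluster.get_client()
...
cluster.shutdown()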

Slide 13

Use Cases
Where people benefit from Dask on HPC

Slide 14

Xarray
https://www.youtube.com/watch?v=wJHosuzqLaU
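
A typical Xarray-on-Dask pattern, sketched with a placeholder file path and variable name; passing chunks= makes the arrays Dask-backed so the reduction runs lazily and in parallel:

import xarray as xr

# Open many NetCDF files as one Dask-backed dataset
ds = xr.open_mfdataset("data/*.nc", chunks={"time": 100})

# Lazily build the computation, then execute it on the cluster
monthly_mean = ds["temperature"].groupby("time.month").mean()
monthly_mean.compute()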

Slide 15

Hyperparameter Optimization - Optuna + XGBoost
https://www.youtube.com/watch?v=R0Hdnhey0pc

Slide 16

Parallel HPO
Computational parallelism beyond a single node

Single Node, Multi GPU (LocalCUDACluster: one cuda-worker per GPU)

X, y = …  # NumPy Arrays

# Optimize in parallel on your Dask cluster
with parallel_backend("dask"):
    study.optimize(lambda trial: objective(trial, X, y),
                   n_trials=100,
                   n_jobs=4)  # NGPUs on system

Multi Node, Multi GPU (SLURMCluster: cuda-workers spread across nodes)

X, y = …  # NumPy Arrays

# Optimize in parallel on your Dask cluster
with parallel_backend("dask"):
    study.optimize(lambda trial: objective(trial, X, y),
                   n_trials=100,
                   n_jobs=20)  # NGPUs on SLURM cluster
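
For context, a minimal sketch of the pieces the snippets above assume (an Optuna study, joblib's parallel_backend, and an XGBoost objective); the hyperparameter ranges and model settings are illustrative, not taken from the talk:

import optuna
import xgboost as xgb
from joblib import parallel_backend
from sklearn.model_selection import cross_val_score

def objective(trial, X, y):
    # Sample candidate hyperparameters for this trial
    params = {
        "max_depth": trial.suggest_int("max_depth", 2, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
    }
    model = xgb.XGBRegressor(**params, tree_method="hist", device="cuda")
    return cross_val_score(model, X, y, cv=3).mean()

study = optuna.create_study(direction="maximize")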

Slide 17

Apache Beam Pipelines
https://www.youtube.com/watch?v=uGEQkws1Low

Slide 18

Accelerating
Taking performance further

Slide 19

RAPIDS
https://github.com/rapidsai

Slide 20

Lightning-Fast End-to-End Performance
Reducing data science processes from hours to seconds

● 16 A100s provide more power than 100 CPU nodes
● 20x more cost-effective than a similar CPU configuration
● 70x faster performance than a similar CPU configuration

*CPU approximate to n1-highmem-8 (8 vCPUs, 52GB memory) on Google Cloud Platform. TCO calculations based on Cloud instance costs.

Slide 21

cudf.pandas
cuDF pandas accelerator mode

cuDF pandas accelerator mode (cudf.pandas) is built on cuDF and accelerates pandas code on the GPU. It supports 100% of the pandas API, using the GPU for supported operations and automatically falling back to pandas for everything else.

https://docs.rapids.ai/api/cudf/stable/cudf_pandas/
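
How cudf.pandas is typically enabled (per the cuDF documentation); the CSV path and column names are placeholders:

# In a Jupyter notebook or IPython session
%load_ext cudf.pandas
import pandas as pd

df = pd.read_csv("data.csv")            # runs on the GPU where supported
df.groupby("key")["value"].mean()       # unsupported ops fall back to CPU pandas

# For a plain Python script, with no code changes:
$ python -m cudf.pandas script.py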

Slide 22

150x Faster pandas with Zero Code Change
DuckDB data benchmark, 5GB

Performance comparison between traditional pandas v1.5 on an Intel Xeon Platinum 8480CL CPU and pandas v1.5 with RAPIDS cuDF on NVIDIA Grace Hopper.

Source: https://developer.nvidia.com/blog/rapids-cudf-accelerates-pandas-nearly-150x-with-zero-code-changes/

Slide 23

Interoperability
Zero-copy data sharing between libraries

● Real-world workflows often need to share data between libraries
● RAPIDS supports device memory sharing with many popular data science and deep learning libraries (such as mpi4py)
● Keeps data on the GPU and avoids costly copying back and forth to host memory
● Any library that supports DLPack or __cuda_array_interface__ allows sharing of memory buffers between RAPIDS and supported libraries
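
A minimal sketch of the zero-copy hand-off described above, passing the same GPU buffer between CuPy and cuDF via __cuda_array_interface__ (the array contents are illustrative):

import cupy as cp
import cudf

# A CuPy array living in GPU memory
arr = cp.arange(1_000_000, dtype="float32")

# cuDF wraps the same device buffer without copying it
s = cudf.Series(arr)

# And hands it back to CuPy, again without leaving the GPU
back = cp.asarray(s.values)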

Slide 24

RAPIDS + Dask
Leveraging GPUs with Dask

Slide 25

dask-cuda
Start Dask workers for your GPUs
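
A minimal sketch of using dask-cuda: LocalCUDACluster starts one worker per GPU on the local node, and the dask-cuda-worker CLI does the same on remote nodes (the scheduler address below is a placeholder):

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One worker per visible GPU, each pinned to its own device
cluster = LocalCUDACluster()
client = Client(cluster)

# On other nodes, workers can be started from the command line instead:
# $ dask-cuda-worker tcp://scheduler-address:8786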

Slide 26

Dask cupy/cudf
Using GPU-backed data structures

import dask
import dask.array as da

dask.config.set({"array.backend": "cupy"})

# Get a cupy-backed collection
darr = da.ones(10, chunks=(5,))

Using GPU-accelerated libraries in Dask is as easy as changing one setting.
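
The same backend mechanism applies to DataFrames; a sketch assuming dask-cudf is installed and using a placeholder file path:

import dask
import dask.dataframe as dd

dask.config.set({"dataframe.backend": "cudf"})

# Partitions are now cudf DataFrames instead of pandas DataFrames
ddf = dd.read_parquet("data/*.parquet")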

Slide 27

UCX
Bringing hardware-accelerated communications to Dask

● TCP sockets are slow!
● UCX provides uniform access to transports (TCP, InfiniBand, shared memory, NVLink)
● Python bindings for UCX (ucx-py)
● Gives Dask the best communication performance available based on the hardware in the nodes/cluster

cluster = LocalCUDACluster(
    protocol="ucx",
    enable_infiniband=True,
    enable_nvlink=True,
)
client = Client(cluster)

[Chart: NVIDIA DGX-2 inner join benchmark]

Slide 28

NeMo Data Curator
LLM data preprocessing (10s of TBs)

https://developer.nvidia.com/blog/curating-trillion-token-datasets-introducing-nemo-data-curator/

Slide 29

Roadmap
Longer-term plans for Dask on HPC

● Add more Runners to dask-jobqueue for other schedulers
● Migrate dask-mpi into dask-jobqueue as a Runner
● Improve dask-cuda compatibility in dask-jobqueue
● Build out more Dask-on-HPC documentation and resources

Slide 30

Learn New Skills For Free
When you join the NVIDIA Developer Program

Sharpen your skills with free technical training. You can select a complimentary self-paced course when you register for the NVIDIA Developer Program. Courses on offer include:

• Fundamentals of Accelerated Computing with CUDA Python
• Getting Started With Accelerated Computing in CUDA C/C++
• Optimizing CUDA Machine Learning Codes with Nsight Profiling Tools
• RAPIDS Accelerator for Apache Spark
• Accelerating End-to-End Data Science Workflows

By signing up you'll also enjoy other great benefits like access to SDKs, technical documentation, training resources, and more.

Join here: https://nvda.ws/45vQqdr

Slide 31

Thank you!

Learn more 👇
https://dask.org
https://jobqueue.dask.org
https://rapids.ai