
Accelerating Python on HPC with Dask and RAPIDS

Dask is a popular Python framework for scaling your workloads, whether you want to leverage all of the cores on your laptop and stream large datasets through memory, or scale your workload out to thousands of cores on large compute clusters. Dask allows you to distribute code using familiar APIs such as pandas, NumPy and scikit-learn or write your own distributed code with powerful parallel task-based programming primitives.

We will start by exploring the concept of adaptive clusters, which allow for dynamic scaling of resources based on the workload's demands. Adaptive clusters automatically submit and manage many jobs to an HPC queue, ensuring efficient resource utilisation and cost-effectiveness. This method is particularly useful for workloads with varying computational requirements, as it adjusts the number of active workers in real-time.
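
For a concrete sketch, dask-jobqueue can provide such an adaptive, SLURM-backed cluster in a few lines; the resource values below are illustrative and would be tuned to the site and workload:

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Each Dask worker runs inside its own SLURM job with these (illustrative) resources
cluster = SLURMCluster(cores=4, memory="16GB", walltime="01:00:00")

# Let Dask submit and cancel worker jobs as the workload demands
cluster.adapt(minimum=0, maximum=20)

client = Client(cluster)
```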

Next, we will dive into using runners that leverage parallel execution environments such as MPI or job schedulers like SLURM to bootstrap Dask clusters within a single large job allocation. Submitting a single job offers some benefits (aside from the fact that HPC administrators often prefer this approach), including better node locality, as the scheduler places processes on nodes that are physically closer together. This results in more efficient communication and reduced latency. Additionally, launching all workers simultaneously ensures balanced data distribution across the cluster.
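
As a sketch of the runner pattern, dask-jobqueue's SLURMRunner (shown again in the transcript below) bootstraps a cluster from inside a single multi-rank job: every rank runs the same script, the ranks elect a scheduler, one rank continues as the client, and the rest become workers.

```python
# runner.py, launched with something like: srun -n 100 python runner.py
from dask.distributed import Client
from dask_jobqueue.slurm import SLURMRunner

# All ranks execute this script; the runner assigns scheduler, worker and client roles
with SLURMRunner(scheduler_file="scheduler-{job_id}.json") as runner:
    # Only the client rank continues past this point
    with Client(runner) as client:
        client.wait_for_workers(runner.n_workers)
        ...  # your Dask workload here
```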

The session will then shift focus to the accelerated side of Dask, demonstrating how to harness the power of GPUs to significantly boost computation speed. We will introduce Dask CUDA, part of RAPIDS, a suite of open-source libraries designed to execute end-to-end data science and analytics pipelines entirely on GPUs. By integrating Dask CUDA, users can achieve substantial speedups, particularly for data-intensive tasks such as machine learning and data preprocessing.
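
A minimal sketch of this on a single node, assuming dask-cuda is installed and GPUs are visible: LocalCUDACluster starts one worker per GPU.

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One Dask worker process per visible GPU on this node
cluster = LocalCUDACluster()
client = Client(cluster)
```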

We will also explore the advantages of using UCX (Unified Communication X) to enhance Dask's performance on HPC systems with advanced networking technologies. UCX provides a high-performance communication layer that supports various network transports, including InfiniBand and NVLink. By leveraging these accelerated networking options, users can achieve lower latency and higher bandwidth, resulting in faster data transfers between Dask workers and more efficient parallel computations.
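
A sketch of enabling this, matching the example in the transcript below: select the "ucx" protocol and the transports available on the node when creating the cluster.

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Use UCX instead of TCP, enabling the accelerated transports present on this node
cluster = LocalCUDACluster(
    protocol="ucx",
    enable_infiniband=True,
    enable_nvlink=True,
)
client = Client(cluster)
```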

Outline:
- Overview of Dask
  - Scaling out pandas and NumPy
  - Custom parallel code
  - Workflow engines
  - Machine learning and AI applications
- Deploying Dask on HPC
  - Adaptive clusters
  - Fixed size runners
- Accelerating Dask on HPC
  - RAPIDS and Dask CUDA
  - UCX

Jacob Tomlinson

August 29, 2024

Transcript

  1. 1 Accelerating Python on HPC with Dask and RAPIDS. Jacob Tomlinson, Dask Maintainer and RAPIDS Developer. EuroSciPy 2024.
  2. 5 Dask Distributed: Dask cluster process architecture. Client: your Dask code that runs the business logic of your workflow; it converts code into task graphs instead of executing it directly. Scheduler: receives task graphs and coordinates the execution of those tasks; also makes autoscaling decisions. Workers: execute individual tasks on remote machines.
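
A minimal sketch of this flow using Dask's public API: the client turns the call into a task, the scheduler assigns it to a worker, and the result is fetched back.

```python
from dask.distributed import Client

client = Client()                       # starts or connects to a scheduler
future = client.submit(sum, [1, 2, 3])  # client hands the task to the scheduler
print(future.result())                  # a worker executes it and returns 6
```
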
  3. 6 Clusters vs Runners: deployment paradigms. Batch Runner: the workload starts as a multi-node job, and the nodes coordinate at startup to elect a scheduler and run the client code. Dynamic Cluster: the workload starts as a single-node job, and Dask spawns single-node worker jobs dynamically as they are required.
  4. 8 dask-jobqueue: directly interact with queues. • Has tools for dynamic clusters and runners. • Supports many schedulers including PBS, SLURM, SGE, OAR and more. • Integrates well with other Dask tooling like Dask's JupyterLab extension.
  5. 9 Cluster Example: interactive dynamic scaling.

        from dask.distributed import Client
        from dask_jobqueue.slurm import SLURMCluster

        cluster = SLURMCluster(cores=1, memory="4GB")
        cluster.scale(2)
        client = Client(cluster)
        ...
        client.close()
        cluster.close()

  6. 10 Runner Example: batch workloads.

        from dask.distributed import Client
        from dask_jobqueue.slurm import SLURMRunner

        with SLURMRunner(scheduler_file="scheduler-{job_id}.json") as runner:
            with Client(runner) as client:
                client.wait_for_workers(runner.n_workers)
                ...

     Launched with: $ srun -n 100 python runner.py

  7. 11 dask-mpi: batch workloads on any MPI system.

        from dask_mpi import initialize
        initialize()

        from dask.distributed import Client
        client = Client()

     Launched with: $ srun -n 100 python mpi-runner.py

     Or run the scheduler and workers from the CLI and connect using a scheduler file:

        $ mpirun -np 4 dask-mpi \
            --scheduler-file scheduler.json

        from distributed import Client
        client = Client(scheduler_file='scheduler.json')

  8. 16 Parallel HPO: computational parallelism beyond a single node. The same code scales from a single node with multiple GPUs (LocalCUDACluster) to multiple nodes with multiple GPUs (SLURMCluster), with n_jobs set to the number of GPUs available on the system or cluster.

        X, y = …  # NumPy Arrays

        # Optimize in parallel on your Dask cluster
        with parallel_backend("dask"):
            study.optimize(lambda trial: objective(trial, X, y),
                           n_trials=100, n_jobs=4)  # n_jobs = number of GPUs

  9. 20 Lightning-Fast End-to-End Performance: reducing data science processes from hours to seconds. 16 A100s provide more power than 100 CPU nodes; 20x more cost-effective than a similar CPU configuration; 70x faster performance than a similar CPU configuration. *CPU approximate to n1-highmem-8 (8 vCPUs, 52GB memory) on Google Cloud Platform. TCO calculations based on Cloud instance costs.
  10. 21 cudf.pandas: cuDF pandas accelerator mode. cuDF pandas accelerator mode (cudf.pandas) is built on cuDF and accelerates pandas code on the GPU. It supports 100% of the pandas API, using the GPU for supported operations and automatically falling back to pandas for everything else. https://docs.rapids.ai/api/cudf/stable/cudf_pandas/
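
A sketch of how the accelerator mode is switched on in a plain Python script (the file and column names here are hypothetical); in Jupyter the documented equivalent is `%load_ext cudf.pandas`.

```python
import cudf.pandas
cudf.pandas.install()  # must run before pandas is imported

import pandas as pd

df = pd.read_parquet("data.parquet")      # hypothetical input file
print(df.groupby("key")["value"].mean())  # runs on the GPU where supported
```
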
  11. 22 150x Faster pandas with Zero Code Change. DuckDB data benchmark, 5GB: performance comparison between traditional pandas v1.5 on an Intel Xeon Platinum 8480CL CPU and pandas v1.5 with RAPIDS cuDF on NVIDIA Grace Hopper. Source: https://developer.nvidia.com/blog/rapids-cudf-accelerates-pandas-nearly-150x-with-zero-code-changes/
  12. 23 Interoperability: zero-copy data sharing between libraries. • Real-world workflows often need to share data between libraries. • RAPIDS supports device memory sharing between many popular data science and deep learning libraries. • Keeps data on the GPU and avoids costly copying back and forth to host memory. • Any library that supports DLPack or __cuda_array_interface__ will allow for sharing of memory buffers between RAPIDS and supported libraries.
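
A sketch of what this zero-copy sharing looks like between CuPy and PyTorch via DLPack, assuming both libraries are installed with CUDA support:

```python
import cupy as cp
import torch

gpu_array = cp.arange(5)               # allocated in GPU memory by CuPy
tensor = torch.from_dlpack(gpu_array)  # PyTorch view of the same device buffer

tensor += 1                            # in-place update is visible to CuPy too
print(gpu_array)                       # [1 2 3 4 5]
```
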
  13. 26 Dask cupy/cudf: using GPU-backed data structures. Using GPU-accelerated libraries in Dask is as easy as changing one setting.

        import dask
        import dask.array as da

        dask.config.set({"array.backend": "cupy"})

        # Get a cupy-backed collection
        darr = da.ones(10, chunks=(5,))

  14. 27 UCX: bringing hardware-accelerated communications to Dask. • TCP sockets are slow! • UCX provides uniform access to transports (TCP, InfiniBand, shared memory, NVLink). • Python bindings for UCX (ucx-py). • Gives Dask the best communication performance available from the hardware on the nodes/cluster. (Benchmark shown: inner join on an NVIDIA DGX-2.)

        cluster = LocalCUDACluster(
            protocol="ucx",
            enable_infiniband=True,
            enable_nvlink=True,
        )
        client = Client(cluster)

  15. 28 NeMo Data Curator: LLM data preprocessing (10s of TBs). https://developer.nvidia.com/blog/curating-trillion-token-datasets-introducing-nemo-data-curator/
  16. 29 Roadmap: longer-term plans for Dask on HPC. • Add more Runners to dask-jobqueue for other schedulers. • Migrate dask-mpi into dask-jobqueue as a Runner. • Improve dask-cuda compatibility in dask-jobqueue. • Build out more Dask-on-HPC documentation and resources.
  17. 30 Learn New Skills For Free when you join the NVIDIA Developer Program. Sharpen your skills with free technical training: you can select a complimentary self-paced course when you register for the NVIDIA Developer Program. Courses on offer include: • Fundamentals of Accelerated Computing with CUDA Python • Getting Started With Accelerated Computing in CUDA C/C++ • Optimizing CUDA Machine Learning Codes with Nsight Profiling Tools • RAPIDS Accelerator for Apache Spark • Accelerating End-to-End Data Science Workflows. By signing up you'll also enjoy other great benefits like access to SDKs, technical documentation, training resources, and more. Join here: https://nvda.ws/45vQqdr