NVIDIA_Final_Internship_Presentation_-_GPU_Software_Environments.pdf

1 The Art of Wrangling Your GPU Python Environments
Melody Wang, NVIDIA Intern
Final Internship Presentation

2 Introduction

3 Hello!
Jacob Tomlinson
Jacob Tomlinson is a senior software engineer at NVIDIA. His work involves maintaining open source projects including RAPIDS and Dask. He also tinkers with kr8s in his spare time. He lives in Exeter, UK.
Melody Wang
Melody is an intern at NVIDIA on the RAPIDS Cloud Deployment Team. She is currently a senior studying Statistics & Machine Learning, CS, and Human-Computer Interaction at Carnegie Mellon University. She is super excited to be attending PyData and getting involved in the open source community!

4 The Status Quo of GPU Environments

5 Environmental Stack
In many GPU environments, some layers of the stack are predefined.

6 Where Things Go Wrong: Scenarios
Incompatible NVIDIA Driver
• Installed: NVIDIA Driver 510.
• Required: RAPIDS 23.10 with CUDA 12.1 requires NVIDIA Driver 525+.
Multiple CUDA Versions Installed
• Issue: CUDA 11.2 and CUDA 12.1 are both installed, leading to conflicts in dynamic library loading.
• Fix: uninstall the lower version of CUDA.
Unsupported Hardware
• Issue: The GPU (e.g., GTX 960M) does not support the required CUDA compute capability for RAPIDS (minimum 6.0 for most RAPIDS libraries).
Improperly Configured Environment Variables
• Issue: $LD_LIBRARY_PATH and $PATH point to an old CUDA installation (e.g., CUDA 10.2).
• Fix: re-export the environment variables so they point to the new path.
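Most of these scenarios reduce to small version or path comparisons. The sketch below is one way to express them in Python, assuming the thresholds quoted on the slide (driver 525+, compute capability 6.0) and a `cuda-X.Y` directory naming convention on the library path. The function names are hypothetical and are not taken from any RAPIDS tool.

```python
import re

# Thresholds copied from the scenarios above; treat them as illustrative.
MIN_DRIVER_MAJOR = 525
MIN_COMPUTE_CAPABILITY = (6, 0)


def driver_is_compatible(driver_version: str, minimum_major: int = MIN_DRIVER_MAJOR) -> bool:
    """Return True if a driver string such as '535.104.05' meets the minimum major version."""
    major = int(driver_version.split(".")[0])
    return major >= minimum_major


def gpu_is_supported(compute_capability: str, minimum=MIN_COMPUTE_CAPABILITY) -> bool:
    """Return True if a compute capability string such as '7.5' meets the minimum."""
    major, minor = (int(part) for part in compute_capability.split("."))
    return (major, minor) >= minimum


def cuda_versions_on_path(path_value: str) -> list[str]:
    """Collect CUDA versions named in a colon-separated value like $LD_LIBRARY_PATH.

    Assumes directories follow the 'cuda-X.Y' naming convention.
    """
    versions = []
    for entry in path_value.split(":"):
        match = re.search(r"cuda-(\d+\.\d+)", entry)
        if match and match.group(1) not in versions:
            versions.append(match.group(1))
    return versions
```

More than one version returned by `cuda_versions_on_path` corresponds to the "Multiple CUDA Versions Installed" scenario, and a single stale version corresponds to the misconfigured environment variable case.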

7 What We’ve Tried

8 Virtual Packages
• Represent system-level features (like CUDA) without explicitly installing large system libraries via Conda.
• Ensures RAPIDS libraries are compatible with the underlying GPU setup.
• When Conda detects a GPU with a compatible CUDA version, it creates a virtual package (e.g., __cuda).
• These virtual packages allow Conda to resolve dependencies without actually bundling the entire CUDA toolkit or drivers.
Examples: __cuda, __glibc, __linux, __archspec, etc.
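To see which virtual packages Conda has detected, one option is to read the output of `conda info --json`. The sketch below assumes that output contains a `virtual_pkgs` list of `[name, version, build]` entries; check this against your Conda version before relying on it. The helper names are hypothetical.

```python
import json


def virtual_packages(conda_info_json: str) -> dict[str, str]:
    """Map virtual package names (e.g. '__cuda') to versions.

    Assumes the JSON has a 'virtual_pkgs' list of [name, version, build] entries.
    """
    info = json.loads(conda_info_json)
    return {name: version for name, version, *_ in info.get("virtual_pkgs", [])}


def cuda_at_least(packages: dict[str, str], minimum: str) -> bool:
    """Return True if the detected __cuda version is >= minimum (e.g. '12.0')."""
    detected = packages.get("__cuda")
    if detected is None:
        return False  # no __cuda virtual package was reported

    def as_tuple(version: str) -> tuple[int, ...]:
        return tuple(int(part) for part in version.split("."))

    return as_tuple(detected) >= as_tuple(minimum)
```

In practice the JSON string would come from something like `subprocess.run(["conda", "info", "--json"], capture_output=True, text=True).stdout`.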

9 Conda Forge
• Provides consistent builds across platforms and architectures (Windows, macOS, Linux, ARM).
• Ensures that dependencies between packages are correctly managed to avoid conflicts.
• Uses a centralized dependency graph to coordinate version updates across packages.

10 On the Horizon…

11 Build Infrastructure
1. PEP for index priority
▪ preferentially resolve packages from trusted sources like Conda-Forge
▪ ensures greater reliability

12 Build Infrastructure
2. Arbitrary metadata
▪ Enhances the precision of dependency resolution
▪ Replaces current workarounds

13 Build Infrastructure
3. Shared C++ dependencies
▪ Standardize dependencies across the RAPIDS tool stack
▪ Dramatically reduces the necessary download & install size

14 Build Infrastructure
4. Wheels distributed on PyPI
▪ Decomposing wheels so they share more and can be distributed via PyPI
▪ RAPIDS libraries are moving towards getting more of their CUDA dependencies

15 Build Infrastructure
5. Pre-Installations in Google Colab
▪ Preconfigured cloud-based environment for RAPIDS
▪ Users can get up and running quickly

16 Introducing RAPIDS Doctor…

17 How RAPIDS Doctor bridges it all

18 DEMO

19 ✅ Healthy Environment ❌ Broken Environment

20 Design Highlights
• Different types of checks
  ◦ System Requirements & Recommendations
  ◦ GPU, CUDA Drivers, & OS
• Diagnosis & Prescription
• Library entrypoint plugins
  ◦ cuDF, cuML
  ◦ Morpheus
  ◦ etc.

21 Design Highlights: System & Hardware Checks
Required / Recommended

22 Design Highlights: Diagnosis & Prescription
RAPIDS Doctor goes beyond identifying problems by offering specific, actionable solutions.
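One way to pair each diagnosis with a prescription is to have every check return both together. The sketch below is a hypothetical structure for illustration only; `CheckResult`, `check_driver`, and `format_report` are assumed names and do not describe the actual RAPIDS Doctor implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckResult:
    name: str                            # what was checked
    passed: bool                         # healthy or broken
    diagnosis: str                       # what was found
    prescription: Optional[str] = None   # suggested fix, only for failures


def check_driver(installed_major: int, required_major: int) -> CheckResult:
    """Compare the installed driver major version against a required minimum."""
    if installed_major >= required_major:
        return CheckResult(
            "NVIDIA driver", True,
            f"driver {installed_major} meets the {required_major}+ requirement",
        )
    return CheckResult(
        "NVIDIA driver", False,
        f"driver {installed_major} is older than the required {required_major}+",
        f"upgrade the NVIDIA driver to {required_major} or newer",
    )


def format_report(results: list[CheckResult]) -> str:
    """Render one line per check, plus an indented fix line for each failure."""
    lines = []
    for result in results:
        mark = "✅" if result.passed else "❌"
        lines.append(f"{mark} {result.name}: {result.diagnosis}")
        if result.prescription:
            lines.append(f"   fix: {result.prescription}")
    return "\n".join(lines)
```

Calling `print(format_report([check_driver(510, 525)]))` would report the driver scenario from the earlier slide along with its suggested fix.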

23 Design Highlights: Library EntryPoint Plugins
[Diagram: RAPIDS Doctor with plugins for cuDF, cuML, Morpheus; more to come.]
Clean, modular, extendable design
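A plugin design like this is commonly built on Python package entry points: each library declares its checks under a shared group name, and the host tool discovers them at runtime. The sketch below shows that general mechanism using `importlib.metadata` from the standard library. The group name `example_doctor.checks` is hypothetical and is not the group RAPIDS Doctor actually uses.

```python
from importlib.metadata import EntryPoint, entry_points

# Hypothetical group name, used only for illustration.
CHECK_GROUP = "example_doctor.checks"


def discover_checks(eps=None) -> dict:
    """Load every check registered under CHECK_GROUP.

    With no argument, installed packages are scanned; the group= keyword
    requires Python 3.10 or newer. Passing an explicit iterable of
    EntryPoint objects makes the function easy to try without installing
    any plugin.
    """
    if eps is None:
        eps = entry_points(group=CHECK_GROUP)
    return {ep.name: ep.load() for ep in eps}


# Stand-in plugin: points at math.sqrt instead of a real health check.
demo = [EntryPoint(name="sqrt_check", value="math:sqrt", group=CHECK_GROUP)]
checks = discover_checks(demo)
```

A library would register a real check by declaring an entry point under the same group in its packaging metadata, so the host tool never needs to import the library by name.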

24 Future Roadmap
Platform checks
  ◦ Docker
  ◦ Kubernetes
Integrated checks with additional libraries
Cloud Integrations
  ◦ SageMaker
  ◦ Vertex
  ◦ Databricks, etc.

25 Thank you!
Jacob Tomlinson, Mike McCarty, Katrina Riehl, James Lamb, RAPIDS Team

26 Q & A
