
NVIDIA_Final_Internship_Presentation_-_GPU_Soft...

Melody Wang
December 05, 2024

 NVIDIA_Final_Internship_Presentation_-_GPU_Software_Environments.pdf

NVIDIA Final Internship Presentation Fall 2024


Transcript

  1. 1 The Art of Wrangling Your GPU Python Environments. Melody Wang, NVIDIA Intern. Final Internship Presentation
  2. 3 Hello!
    Jacob Tomlinson
    Jacob Tomlinson is a senior software engineer at NVIDIA. His work involves maintaining open source projects including RAPIDS and Dask. He also tinkers with kr8s in his spare time. He lives in Exeter, UK.
    Melody Wang
    Melody is an intern at NVIDIA on the RAPIDS Cloud Deployment Team. She is currently a senior studying Statistics & Machine Learning, CS, and Human-Computer Interaction at Carnegie Mellon University. She is super excited to be attending PyData and getting involved in the open source community!
  3. 6 Where Things Go Wrong: Scenarios
    Incompatible NVIDIA Driver
    • Installed: NVIDIA Driver 510.
    • Required: RAPIDS 23.10 with CUDA 12.1 requires NVIDIA Driver 525+.
    Multiple CUDA Versions Installed
    • Issue: CUDA 11.2 and CUDA 12.1 are both installed, leading to conflicts in dynamic library loading.
    • Fix: uninstall the lower version of CUDA.
    Unsupported Hardware
    • Issue: The GPU (e.g., GTX 960M) does not support the CUDA compute capability required by RAPIDS (minimum 6.0 for most RAPIDS libraries).
    Improperly Configured Environment Variables
    • Issue: $LD_LIBRARY_PATH and $PATH point to an old CUDA installation (e.g., CUDA 10.2).
    • Fix: re-export the environment variables so they point to the new installation path.
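Two of the scenarios above (several CUDA versions on the library path, and a GPU below the minimum compute capability) can be checked with small helper functions. This is a minimal sketch, not part of RAPIDS: the function names are hypothetical, and it assumes CUDA lives in directories named like `cuda-12.1`, which is a common Linux layout but not guaranteed.

```python
import re


def cuda_versions_on_path(path_value: str) -> list[str]:
    """Return the distinct CUDA versions referenced in a colon-separated path.

    Assumption: install directories are named like 'cuda-12.1'. Other
    layouts will simply not be detected.
    """
    versions = []
    for entry in path_value.split(":"):
        match = re.search(r"cuda-(\d+\.\d+)", entry)
        if match and match.group(1) not in versions:
            versions.append(match.group(1))
    return versions


def compute_capability_ok(capability: str, minimum: tuple[int, int] = (6, 0)) -> bool:
    """Compare a 'major.minor' compute capability against a minimum."""
    major, minor = (int(part) for part in capability.split("."))
    return (major, minor) >= minimum


if __name__ == "__main__":
    # Example value only; a real check would read os.environ["LD_LIBRARY_PATH"].
    ld_path = "/usr/local/cuda-11.2/lib64:/usr/local/cuda-12.1/lib64:/usr/lib"
    found = cuda_versions_on_path(ld_path)
    if len(found) > 1:
        print(f"Multiple CUDA versions on the path: {found}")
    # A GTX 960M has compute capability 5.0, below the 6.0 minimum.
    print("GPU supported:", compute_capability_ok("5.0"))
```

More than one entry in the returned list signals the "Multiple CUDA Versions Installed" scenario. The installed driver version can be read separately, for example with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.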
  4. 8 Virtual Packages
    • Represent system-level features (like CUDA) without explicitly installing large system libraries via Conda.
    • Ensure RAPIDS libraries are compatible with the underlying GPU setup.
    • When Conda detects a GPU with a compatible CUDA version, it creates a virtual package (e.g., __cuda).
    • These virtual packages allow Conda to resolve dependencies without actually bundling the entire CUDA toolkit or drivers.
    Examples: __cuda, __glibc, __linux, __archspec, etc.
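To make the idea concrete, the sketch below models how a package constraint such as `__cuda >=12.0` could be compared against the version a virtual package reports. It is a toy model and not Conda's solver: the function names are hypothetical and only `>=` constraints are handled. On a real system, `conda info` lists the virtual packages that were detected.

```python
def parse_version(text: str, width: int = 3) -> tuple[int, ...]:
    """Turn '12.1' into (12, 1, 0), padding with zeros so '12' equals '12.0'."""
    parts = [int(piece) for piece in text.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def satisfies(detected: str, constraint: str) -> bool:
    """Check a detected virtual-package version against a '>=X.Y' constraint.

    Toy model: real Conda match specs support many more operators.
    """
    if not constraint.startswith(">="):
        raise ValueError("this sketch only understands '>=' constraints")
    required = constraint[2:].strip()
    return parse_version(detected) >= parse_version(required)


if __name__ == "__main__":
    # Pretend the system exposes __cuda=11.8 and a package needs __cuda >=12.0.
    print(satisfies("11.8", ">=12.0"))
    print(satisfies("12.4", ">=12.0"))
```

Because only a version number is compared, nothing from the CUDA toolkit needs to be downloaded to decide whether a package can be installed.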
  5. 9 conda-forge
    • Provides consistent builds across platforms and architectures (Windows, macOS, Linux, ARM).
    • Ensures that dependencies between packages are correctly managed to avoid conflicts.
    • Uses a centralized dependency graph to coordinate version updates across packages.
  6. 11 Build Infrastructure: 1. PEP for index priority
    ▪ Preferentially resolves packages from trusted sources like conda-forge
    ▪ Ensures greater reliability
  7. 12 Build Infrastructure: 2. Arbitrary metadata
    ▪ Enhances the precision of dependency resolution
    ▪ Replaces current workarounds
  8. 13 Build Infrastructure: 3. Shared C++ dependencies
    ▪ Standardizes dependencies across the RAPIDS tool stack
    ▪ Dramatically reduces the necessary download & install size
  9. 14 Build Infrastructure: 4. Wheels distributed on PyPI
    ▪ Decomposing wheels so they share more and can be distributed via PyPI
    ▪ RAPIDS libraries are moving towards getting more of their CUDA dependencies
  10. 15 Build Infrastructure: 5. Pre-Installations in Google Colab
    ▪ Preconfigured cloud-based environment for RAPIDS
    ▪ Users can get up and running quickly
  11. 20 Design Highlights
    • Different types of checks
      ◦ System Requirements & Recommendations
      ◦ GPU, CUDA Drivers, & OS
    • Diagnosis & Prescription
    • Library entrypoint plugins
      ◦ cuDF, cuML
      ◦ Morpheus
      ◦ etc.
  12. 22 Design Highlights: Diagnosis & Prescription
    RAPIDS DOCTOR goes beyond identifying problems by offering specific, actionable solutions
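One way to pair each diagnosis with a prescription is to return both from every check. The sketch below is an illustration only and not the RAPIDS DOCTOR implementation: the `Diagnosis` class, `check_driver`, and `render` are hypothetical names, and the 525 threshold is taken from the driver scenario earlier in the deck.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Diagnosis:
    """Result of one check: what was found and, on failure, how to fix it."""
    check: str
    ok: bool
    problem: Optional[str] = None
    prescription: Optional[str] = None


def check_driver(installed: str, required_major: int = 525) -> Diagnosis:
    """Compare the installed driver's major version against a minimum."""
    installed_major = int(installed.split(".")[0])
    if installed_major >= required_major:
        return Diagnosis(check="nvidia-driver", ok=True)
    return Diagnosis(
        check="nvidia-driver",
        ok=False,
        problem=f"Driver {installed} is older than the required {required_major}+.",
        prescription=f"Upgrade the NVIDIA driver to version {required_major} or newer.",
    )


def render(result: Diagnosis) -> str:
    """Format a diagnosis as a single report line."""
    if result.ok:
        return f"[ok] {result.check}"
    return f"[fail] {result.check}: {result.problem} Fix: {result.prescription}"


if __name__ == "__main__":
    print(render(check_driver("510.47.03")))
    print(render(check_driver("535.104.05")))
```

Keeping the prescription next to the problem means the report can always tell the user what to do next, not just what failed.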
  13. 23 Design Highlights: Library EntryPoint Plugins
    RAPIDS DOCTOR plugins: cuDF, cuML, Morpheus. More to come.
    Clean, modular, extendable design
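Python's standard entry-point mechanism is one way such a plugin design can work: each library registers a check function under a shared group name, and the tool discovers them at run time. The sketch below assumes a hypothetical group name, `rapids_doctor_check`, and hypothetical helper names. The actual group and check signatures used by RAPIDS DOCTOR are not stated in the slides.

```python
from importlib.metadata import entry_points
from typing import Callable, Dict

# Hypothetical entry-point group; the real group name is an assumption.
CHECK_GROUP = "rapids_doctor_check"


def discover_checks(group: str = CHECK_GROUP) -> Dict[str, Callable[[], bool]]:
    """Load every check function that installed packages registered in `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ returns an object with a select() method.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9 return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep.load() for ep in selected}


def run_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Call each check and record whether it passed."""
    return {name: bool(check()) for name, check in checks.items()}


if __name__ == "__main__":
    results = run_checks(discover_checks())
    for name, passed in results.items():
        print(f"{name}: {'ok' if passed else 'failed'}")
```

A library would opt in by declaring an entry point in its packaging metadata, for example under `[project.entry-points."rapids_doctor_check"]` in `pyproject.toml`. New libraries can then add checks without any change to the tool itself.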
  14. 24 Future Roadmap
    Platform checks
      ◦ Docker
      ◦ Kubernetes
    Integrated checks with additional libraries
    Cloud Integrations
      ◦ SageMaker
      ◦ Vertex
      ◦ Databricks, etc.