
Bristech - GPU Computing in Python

When I joined NVIDIA in 2019 I was brand new to GPU development. Since then, I’ve gotten to grips with the fundamentals of writing accelerated code in Python. I was amazed to discover that I didn’t need to learn C++ and I didn’t need new development tools. Writing GPU code in Python is easier today than ever, and in this tutorial I will share what I’ve learned and how you can get started with accelerating your code.

Once we’ve written a bit of GPU code, we will look at some open source Python libraries that are part of the RAPIDS suite of tools. These libraries follow familiar APIs from the PyData ecosystem for working with DataFrames and ND-arrays and for doing statistics and machine learning, but under the hood they’ve been rewritten to run on NVIDIA GPUs, giving large performance gains.


Jacob Tomlinson

November 04, 2021

Transcript

  1. Jacob Tomlinson | Bristech Nov 2021 Intro to RAPIDS and

    GPU development in Python
  2. None
  3. 3 RAPIDS Github https://github.com/rapidsai

  4. 4 Jake VanderPlas - PyCon 2017

  5. 5 Open Source Data Science Ecosystem, Familiar Python APIs (CPU memory): Pandas for data preparation and analytics, Scikit-Learn for machine learning, NetworkX for graph analytics, PyTorch/TensorFlow/MxNet for deep learning, Matplotlib for visualization, Dask for scaling. Data Preparation -> Model Training -> Visualization.
  6. 6 RAPIDS: End-to-End Accelerated GPU Data Science (GPU memory): cuDF/cuIO for data preparation and analytics, cuML for machine learning, cuGraph for graph analytics, PyTorch/TensorFlow/MxNet for deep learning, cuxfilter/pyViz/plotly for visualization, Dask for scaling. Data Preparation -> Model Training -> Visualization.
  7. 7 Dask. EASY SCALABILITY: easy to install and use on a laptop; scales out to thousand-node clusters; modularly built for acceleration. DEPLOYABLE: HPC (SLURM, PBS, LSF, SGE), Cloud (Kubernetes), Hadoop/Spark (Yarn). PYDATA NATIVE: easy migration, built on top of NumPy, Pandas, Scikit-Learn, etc.; easy training, with the same API. POPULAR: the most common parallelism framework today in the PyData and SciPy community, with millions of monthly downloads and dozens of integrations. Scale out / parallelize: from single CPU core, in-memory data (PyData: NumPy, Pandas, Scikit-Learn, Numba and many more) to multi-core and distributed PyData (Dask: NumPy -> Dask Array, Pandas -> Dask DataFrame, Scikit-Learn -> Dask-ML, ... -> Dask Futures).
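
    The slide only lists Dask's features; as a minimal sketch of the "same API" point, a Dask DataFrame can be used much like a pandas one (the file path and column names here are illustrative, not from the talk):

      import dask.dataframe as dd

      # Dask partitions the data and builds a lazy task graph instead of
      # computing eagerly, so the same code runs on a laptop or a cluster.
      df = dd.read_csv("data/*.csv")              # illustrative path
      result = df.groupby("key")["value"].mean()  # same API shape as pandas
      print(result.compute())                     # .compute() triggers execution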
  8. 8 Scale Up / Accelerate, Scale Out / Parallelize. PyData (single CPU core, in-memory data): NumPy, Pandas, Scikit-Learn, Numba and many more. Scale up with RAPIDS and others (accelerated on a single GPU): NumPy -> CuPy/PyTorch/.., Pandas -> cuDF, Scikit-Learn -> cuML, NetworkX -> cuGraph, Numba -> Numba. Scale out with Dask (multi-core and distributed PyData): NumPy -> Dask Array, Pandas -> Dask DataFrame, Scikit-Learn -> Dask-ML, ... -> Dask Futures. Scale out with RAPIDS + Dask with OpenUCX: multi-GPU on a single node (DGX) or across a cluster.
  9. 9 Faster Speeds, Real World Benefits. End-to-end benchmark: time in seconds (shorter is better) for cuIO/cuDF (load and data preparation), data conversion, and XGBoost machine learning; faster data access, less data movement. 200GB CSV dataset; data prep includes joins and variable transformations. CPU cluster configuration: CPU nodes (61 GiB memory, 8 vCPUs, 64-bit platform), Apache Spark. RAPIDS version: RAPIDS 0.17. A100 cluster configuration: 16 A100 GPUs (40GB each).
  10. 10 What are GPUs?

  11. 11 Gaming Hardware For pwning n00bs

  12. 12 Mysterious Machine Learning Hardware. For things like GauGAN: "Semantic Image Synthesis with Spatially-Adaptive Normalization", Taesung Park, Ming-Yu Liu, Ting-Chun Wang, Jun-Yan Zhu, arXiv:1903.07291 [cs.CV]
  13. 13 CPU GPU

  14. 14 https://youtu.be/-P28LKWTzrI

  15. 15 GPU vs CPU https://docs.nvidia.com/cuda/cuda-c-programming-guide/

  16. 16 Using a GPU is like using two computers: Network, SSH, VNC, RD, SCP, FTP, SFTP, Robocopy. (Icons made by Freepik from Flaticon)
  17. 17 Using a GPU is like using two computers: PCI, CUDA
  18. 18 What is CUDA?

  19. 19 CUDA

  20. 20 I don’t write C/C++

  21. 21 What does CUDA do? How do we run stuff on the GPU? Construct GPU code with CUDA C/C++ language extensions; copy data from RAM to the GPU; copy compiled code to the GPU; execute the code; copy data from the GPU back to RAM.
  22. 22 Let’s do it in Python

  23. 23

  24. 24

  25. 25 Live coding

  26. 26 Writing a Kernel. A kernel is a GPU function. Differences between a kernel and a function: a kernel cannot return anything, it must instead modify memory; a kernel must specify its thread hierarchy (threads and blocks).
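
    The kernel code itself isn't captured in this transcript; a minimal Numba CUDA sketch of the idea (the array contents and sizes are illustrative) might look like this:

      from numba import cuda
      import numpy as np

      @cuda.jit
      def add_one(arr):
          i = cuda.grid(1)      # this thread's absolute position
          if i < arr.size:      # guard: the grid may be larger than the data
              arr[i] += 1       # no return value; the kernel modifies memory in place

      data = np.zeros(1_000_000, dtype=np.float32)
      threads_per_block = 128
      blocks_per_grid = (data.size + threads_per_block - 1) // threads_per_block
      add_one[blocks_per_grid, threads_per_block](data)  # thread hierarchy given at launch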
  27. 27 Threads, blocks, grids and warps https://docs.nvidia.com/cuda/cuda-c-programming-guide/

  28. 28 What? Rules of thumb for threads per block: it should be a round multiple of the warp size (32); a good place to start is 128-512, but benchmarking is required to determine the optimal value.
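
    As a sketch, those rules of thumb turn into a launch configuration like this (my_kernel and data are placeholders, not names from the talk):

      import math

      n = 10_000_000
      for threads_per_block in (128, 256, 512):   # round multiples of the warp size (32)
          blocks_per_grid = math.ceil(n / threads_per_block)
          # Time each configuration and keep the fastest, e.g.:
          # my_kernel[blocks_per_grid, threads_per_block](data)
          # cuda.synchronize()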
  29. 29 Imports

  30. 30 Data arrays

  31. 31 Example kernel

  32. 32 Running the kernel

  33. 33 Absolute positions

  34. 34 Thread and block positions

  35. 35 Example kernel (again)
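
    The code on these slides isn't captured here; presumably they contrast computing an index from thread and block positions by hand with Numba's absolute-position helper, roughly:

      from numba import cuda

      @cuda.jit
      def position_example(out):
          # Thread and block positions combined by hand...
          i = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
          # ...or the equivalent absolute-position helper
          j = cuda.grid(1)      # i == j
          if i < out.size:
              out[i] = i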

  36. 36 How did the GPU update our NumPy array? If you call a Numba CUDA kernel with data that isn't on the GPU, it will be copied to the GPU before running the kernel and copied back after. This isn't always ideal, as copying data can waste time.
  37. 37 Create a GPU array

  38. 38 Simplified position kernel

  39. 39 GPU Array

  40. 40 Copy to host
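
    A minimal sketch of that explicit memory management with Numba, reusing the hypothetical add_one kernel sketched earlier:

      from numba import cuda
      import numpy as np

      data = np.zeros(1_000_000, dtype=np.float32)
      d_data = cuda.to_device(data)       # create a GPU (device) array explicitly
      threads_per_block = 256
      blocks_per_grid = (data.size + threads_per_block - 1) // threads_per_block
      add_one[blocks_per_grid, threads_per_block](d_data)  # no implicit host/device copies
      result = d_data.copy_to_host()      # copy back to the CPU (host) only when needed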

  41. 41 Higher level APIs

  42. 42

  43. 43 cuDF

  44. 44 RAPIDS: End-to-End GPU Accelerated Data Science (GPU memory): cuDF/cuIO for data preparation and analytics, cuML for machine learning, cuGraph for graph analytics, PyTorch/TensorFlow/MxNet for deep learning, cuxfilter/pyViz/plotly for visualization, Dask for scaling. Data Preparation -> Model Training -> Visualization.
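
    cuDF mirrors the pandas API while running on the GPU; a minimal sketch (the file and column names here are made up, not from the talk):

      import cudf

      gdf = cudf.read_csv("data.csv")               # reads straight into GPU memory
      summary = gdf.groupby("key")["value"].mean()  # same API shape as pandas
      pdf = summary.to_pandas()                     # move back to pandas when needed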
  45. 45 Interoperability for the Win: DLPack and __cuda_array_interface__; mpi4py

  46. 46 __cuda_array_interface__
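
    A minimal sketch of what that interoperability enables: a CuPy array exposes __cuda_array_interface__, so a Numba kernel can operate on it directly with no copies (the kernel here is illustrative):

      import cupy as cp
      from numba import cuda

      x = cp.arange(1_000_000, dtype=cp.float32)  # allocated on the GPU by CuPy

      @cuda.jit
      def scale(arr, factor):
          i = cuda.grid(1)
          if i < arr.size:
              arr[i] *= factor

      threads_per_block = 256
      blocks_per_grid = (x.size + threads_per_block - 1) // threads_per_block
      scale[blocks_per_grid, threads_per_block](x, 2.0)  # Numba consumes __cuda_array_interface__
      print(x[:5])                                       # still a CuPy array, now scaled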

  47. 47 Recap

  48. 48 Recap: takeaways on GPU computing. GPUs run the same function (kernel) many times in parallel. When being called, each function gets a unique index. CUDA C/C++ is used to write kernels, but high-level languages like Python can also compile to it. Memory must be copied between the CPU (host) and GPU (device). Many familiar Python APIs have GPU-accelerated implementations to abstract all this away.
  49. THANK YOU Jacob Tomlinson @_jacobtomlinson