Bristech - GPU Computing in Python

I joined NVIDIA in 2019, brand new to GPU development. Since then, I’ve gotten to grips with the fundamentals of writing accelerated code in Python. I was amazed to discover that I didn’t need to learn C++ and I didn’t need new development tools. Writing GPU code in Python is easier today than ever, and in this tutorial I will share what I’ve learned and how you can get started with accelerating your code.

Once we’ve written a bit of GPU code, we will look at some open source Python libraries from the RAPIDS suite of tools. These libraries follow familiar APIs from the PyData ecosystem for working with dataframes and ND-arrays and for doing statistics and machine learning, but under the hood they’ve been rewritten to run on NVIDIA GPUs, giving large performance gains.

Jacob Tomlinson

November 04, 2021

Transcript

  1. Jacob Tomlinson | Bristech Nov 2021
    Intro to RAPIDS and GPU development in Python

  3. RAPIDS GitHub
    https://github.com/rapidsai

  4. Jake VanderPlas - PyCon 2017

  5. Open Source Data Science Ecosystem
    Familiar Python APIs
    Data Preparation -> Model Training -> Visualization, all in CPU memory:
    Pandas: Analytics
    Scikit-Learn: Machine Learning
    NetworkX: Graph Analytics
    PyTorch, TensorFlow, MxNet: Deep Learning
    Matplotlib: Visualization
    Dask

  6. RAPIDS
    End-to-End Accelerated GPU Data Science
    Data Preparation -> Model Training -> Visualization, all in GPU memory:
    cuDF, cuIO: Analytics
    cuML: Machine Learning
    cuGraph: Graph Analytics
    PyTorch, TensorFlow, MxNet: Deep Learning
    cuxfilter, pyViz, plotly: Visualization
    Dask

  7. Dask
    PYDATA NATIVE
    ▸ Easy Migration: built on top of NumPy, Pandas, Scikit-Learn, etc.
    ▸ Easy Training: with the same API
    EASY SCALABILITY
    ▸ Easy to install and use on a laptop
    ▸ Scales out to thousand-node clusters
    ▸ Modularly built for acceleration
    DEPLOYABLE
    ▸ HPC: SLURM, PBS, LSF, SGE
    ▸ Cloud: Kubernetes
    ▸ Hadoop/Spark: Yarn
    POPULAR
    ▸ The most common parallelism framework today in the PyData and SciPy community
    ▸ Millions of monthly downloads and dozens of integrations

    Scale Out / Parallelize:
    PYDATA (NumPy, Pandas, Scikit-Learn, Numba and many more): single CPU core, in-memory data
    DASK (multi-core and distributed PyData):
    NumPy -> Dask Array
    Pandas -> Dask DataFrame
    Scikit-Learn -> Dask-ML
    … -> Dask Futures

  8. Scale Out with RAPIDS + Dask with OpenUCX
    Scale Up / Accelerate:
    PYDATA (NumPy, Pandas, Scikit-Learn, Numba and many more): single CPU core, in-memory data
    RAPIDS AND OTHERS (accelerated on a single GPU):
    NumPy -> CuPy/PyTorch/..
    Pandas -> cuDF
    Scikit-Learn -> cuML
    NetworkX -> cuGraph
    Numba -> Numba

    Scale Out / Parallelize:
    DASK (multi-core and distributed PyData):
    NumPy -> Dask Array
    Pandas -> Dask DataFrame
    Scikit-Learn -> Dask-ML
    … -> Dask Futures
    RAPIDS + DASK WITH OPENUCX: multi-GPU on a single node (DGX) or across a cluster

  9. Faster Speeds, Real World Benefits
    Faster Data Access, Less Data Movement
    End-to-end benchmark, time in seconds (shorter is better), broken down into:
    cuIO/cuDF (load and data preparation), data conversion, and XGBoost (machine learning)
    Benchmark: 200GB CSV dataset; data prep includes joins, variable transformations
    CPU Cluster Configuration: CPU nodes (61 GiB memory, 8 vCPUs, 64-bit platform), Apache Spark
    RAPIDS Version: RAPIDS 0.17
    A100 Cluster Configuration: 16 A100 GPUs (40GB each)

  10. What are GPUs?

  11. Gaming Hardware
    For pwning n00bs

  12. Mysterious Machine Learning Hardware
    For things like GauGAN
    Semantic Image Synthesis with Spatially-Adaptive Normalization
    Taesung Park, Ming-Yu Liu, Ting-Chun Wang, Jun-Yan Zhu
    arXiv:1903.07291 [cs.CV]

  13. CPU vs GPU

  14. https://youtu.be/-P28LKWTzrI (Mythbusters Demo: GPU versus CPU)

  15. GPU vs CPU
    https://docs.nvidia.com/cuda/cuda-c-programming-guide/

  16. Using a GPU is like using two computers
    Network: SSH, VNC, RD, SCP, FTP, SFTP, Robocopy
    Icons made by Freepik from Flaticon

  17. Using a GPU is like using two computers
    PCI: CUDA

  18. What is CUDA?

  19. CUDA

  20. I don’t write C/C++

  21. What does CUDA do?
    How do we run stuff on the GPU?
    1. Construct GPU code with CUDA C/C++ language extensions
    2. Copy data from RAM to the GPU
    3. Copy the compiled code to the GPU
    4. Execute the code
    5. Copy data from the GPU back to RAM

  22. Let’s do it in Python

  25. Live coding

  26. Writing a Kernel
    A kernel is a GPU function
    Differences between a kernel and a function (a sketch follows below):
    ● A kernel cannot return anything; it must instead modify memory
    ● A kernel must specify its thread hierarchy (threads and blocks)
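    A minimal sketch of both differences (illustrative, not the slide’s exact code):

        from numba import cuda
        import numpy as np

        @cuda.jit                # compiles this function as a GPU kernel
        def write_answer(out):
            out[0] = 42          # no return value: results are written into memory

        out = np.zeros(1)
        write_answer[1, 1](out)  # the [blocks, threads] launch syntax is the thread hierarchy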

  27. Threads, blocks, grids and warps
    https://docs.nvidia.com/cuda/cuda-c-programming-guide/

  28. What?
    Rules of thumb for threads per block:
    ● Should be a round multiple of the warp size (32)
    ● A good place to start is 128-512, but benchmarking is required to determine the optimal value (see the sketch below)
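    A minimal sketch of picking a launch configuration along these lines (the sizes are illustrative):

        import math

        n = 1_000_000            # number of elements to process
        threads_per_block = 128  # a round multiple of the warp size (32)
        blocks_per_grid = math.ceil(n / threads_per_block)  # enough blocks to cover every element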

  29. Imports
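    The transcript doesn’t capture the code from the slides; for a Numba CUDA session like this one, the imports are typically:

        from numba import cuda  # Numba’s CUDA kernel compiler and device API
        import numpy as np      # host-side (CPU) arrays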

  30. Data arrays
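    A plausible setup for the data arrays (the names and sizes here are illustrative, not the slide’s exact code):

        import numpy as np

        N = 256
        arr = np.zeros(N, dtype=np.int32)  # a host array for the GPU kernels to fill in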

  31. Example kernel
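    The kernel itself isn’t captured in the transcript; a representative first kernel, assuming the usual Numba CUDA style, might be:

        from numba import cuda

        @cuda.jit
        def store_position(arr):
            # Work out which element this thread owns (unpacked on the next slides)
            i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
            if i < arr.size:  # guard: some threads may land beyond the array
                arr[i] = i    # no return value; the kernel writes into the array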

  32. Running the kernel
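    Continuing the sketch above, a kernel is launched with its thread hierarchy in square brackets:

        blocks_per_grid = 2
        threads_per_block = 128
        # 2 blocks x 128 threads = 256 parallel invocations, one per element of arr
        store_position[blocks_per_grid, threads_per_block](arr)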

  33. Absolute positions
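    Each invocation needs a unique absolute position so that every thread works on a different element. Inside a kernel body, that position is computed as:

        # which block we are in, times the block size, plus our position within the block
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x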

  34. Thread and block positions
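    The pieces of that calculation, as Numba exposes them inside a kernel:

        cuda.threadIdx.x  # this thread’s position within its block
        cuda.blockIdx.x   # this block’s position within the grid
        cuda.blockDim.x   # the number of threads in each block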

  35. Example kernel (again)
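    Revisiting the kernel now that the indexing makes sense, the whole example runs end to end (an illustrative sketch, not the slide’s exact code):

        from numba import cuda
        import numpy as np

        @cuda.jit
        def store_position(arr):
            i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
            if i < arr.size:
                arr[i] = i

        arr = np.zeros(256, dtype=np.int32)
        store_position[2, 128](arr)
        print(arr[:5])  # [0 1 2 3 4]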

  36. How did the GPU update our numpy array?
    If you call a Numba CUDA kernel with data that isn’t on the GPU, it will be copied to the GPU before the kernel runs and copied back afterwards. This isn’t always ideal, as copying data can waste time.

  37. Create a GPU array
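    With Numba this means copying a host array to the device explicitly, once:

        from numba import cuda
        import numpy as np

        arr = np.zeros(256, dtype=np.int32)
        d_arr = cuda.to_device(arr)  # one explicit host-to-device copy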

  38. Simplified position kernel
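    Numba’s cuda.grid() helper collapses the position arithmetic into one call; the simplified kernel is presumably along these lines:

        from numba import cuda

        @cuda.jit
        def store_position(arr):
            i = cuda.grid(1)  # shorthand for blockIdx.x * blockDim.x + threadIdx.x
            if i < arr.size:
                arr[i] = i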

  39. GPU Array
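    Calling the kernel on the device array from the earlier sketch avoids the implicit copies; the results stay in GPU memory:

        store_position[2, 128](d_arr)  # d_arr is already on the GPU, so nothing is copied
        type(d_arr)                    # a Numba DeviceNDArray, not a numpy array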

  40. Copy to host
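    When the results are needed on the CPU, one explicit copy brings them back as a NumPy array:

        result = d_arr.copy_to_host()
        print(result[:5])  # [0 1 2 3 4]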

  41. Higher-level APIs

  43. cuDF
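    cuDF mirrors the pandas API; a minimal illustrative example (assuming a CUDA-capable GPU with cudf installed):

        import cudf

        df = cudf.DataFrame({"a": [1, 2, 3, 4], "b": [10.0, 20.0, 30.0, 40.0]})
        print(df.groupby("a").b.mean())  # pandas-style syntax, executed on the GPU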

  44. RAPIDS
    End-to-End GPU Accelerated Data Science
    Data Preparation -> Model Training -> Visualization, all in GPU memory:
    cuDF, cuIO: Analytics
    cuML: Machine Learning
    cuGraph: Graph Analytics
    PyTorch, TensorFlow, MxNet: Deep Learning
    cuxfilter, pyViz, plotly: Visualization
    Dask

  45. Interoperability for the Win
    DLPack and __cuda_array_interface__
    mpi4py
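    Because these libraries agree on how device memory is described, arrays can move between them without copies. An illustrative example, assuming CuPy and Numba are installed:

        import cupy as cp
        from numba import cuda

        @cuda.jit
        def double(arr):
            i = cuda.grid(1)
            if i < arr.size:
                arr[i] *= 2

        arr = cp.arange(10)  # a CuPy array living in GPU memory
        double[1, 32](arr)   # Numba consumes it via __cuda_array_interface__
        print(arr)           # [ 0  2  4  6  8 10 12 14 16 18]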

  46. __cuda_array_interface__
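    Any GPU array that implements the protocol exposes this attribute, a plain dict describing the device allocation:

        import cupy as cp

        arr = cp.arange(10)
        # Contains the device pointer, shape, typestr and strides, which is all
        # another library needs to wrap the same memory without copying it
        print(arr.__cuda_array_interface__)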

  47. Recap

  48. Recap
    Takeaways on GPU computing:
    ● GPUs run the same function (a kernel) many times in parallel
    ● Each invocation of the function gets a unique index
    ● Kernels are written in CUDA C/C++, but high-level languages like Python can also compile to it
    ● Memory must be copied between the CPU (host) and GPU (device)
    ● Many familiar Python APIs have GPU-accelerated implementations that abstract all of this away

  49. THANK YOU
    Jacob Tomlinson
    @_jacobtomlinson
