Tech Exeter Conference: Intro to GPU Development in Python

Writing code for GPUs has come a long way over the last few years and it is now easier than ever to get started. You can even do it in Python! This talk will cover setting up your Python environment for GPU development, how coding for GPUs differs from CPUs, and the kinds of problems GPUs excel at solving. We will dive into some real examples using Numba and also touch on a suite of Python data science tools called RAPIDS.

Session takeaways

* You don't need to learn C++ to develop on GPUs
* GPUs are useful for more than just machine learning
* Hardware accelerators like GPUs are going to be more important than ever for scaling our current workloads

Jacob Tomlinson

September 09, 2020

Transcript

  1. Jacob Tomlinson | Tech Exeter Conference 2020
    Intro to GPU development
    in Python

  2.

  3. RAPIDS Github
    https://github.com/rapidsai

  4. GPU-Accelerated ETL
    The Average Data Scientist Spends 90+% of Their
    Time in ETL as Opposed to Training Models

  5. Lightning-fast performance on real-world use cases
    Up to 350x faster queries; Hours to Seconds!
    TPCx-BB is a data science benchmark consisting of 30 end-to-end
    queries representing real-world ETL and Machine Learning workflows,
    involving both structured and unstructured data. It can be run at
    multiple “Scale Factors”.
    ▸ SF1 - 1 GB
    ▸ SF1K - 1 TB
    ▸ SF10K - 10 TB
    RAPIDS results at SF1K (2 DGX A100s) and SF10K (16 DGX A100s) show
    GPUs provide dramatic cost and time savings for both small-scale and
    large-scale data analytics problems
    ▸ SF1K 37.1x average speed-up
    ▸ SF10K 19.5x average speed-up (7x Normalized for Cost)

  6. Dask
    DEPLOYABLE
    ▸ HPC: SLURM, PBS, LSF, SGE
    ▸ Cloud: Kubernetes
    ▸ Hadoop/Spark: Yarn
    PYDATA NATIVE
    ▸ Easy Migration: Built on top of NumPy, Pandas, Scikit-Learn, etc.
    ▸ Easy Training: With the same APIs
    ▸ Trusted: With the same developer community
    EASY SCALABILITY
    ▸ Easy to install and use on a laptop
    ▸ Scales out to thousand-node clusters
    POPULAR
    ▸ The most common parallelism framework today in the PyData and SciPy community

  7. What are GPUs?

  8. Gaming Hardware
    For pwning n00bs

  9. Mysterious Machine Learning Hardware
    For things like GauGAN
    Semantic Image Synthesis with Spatially-Adaptive Normalization
    Taesung Park, Ming-Yu Liu, Ting-Chun Wang, Jun-Yan Zhu
    arXiv:1903.07291 [cs.CV]

  10. CPU GPU

  11. https://youtu.be/-P28LKWTzrI

  12. GPU vs CPU
    https://docs.nvidia.com/cuda/cuda-c-programming-guide/

  13. Using a GPU is like using two computers
    Icons made by Freepik from Flaticon
    Network: SSH, VNC, RDP, SCP, FTP, SFTP, Robocopy

  14. Using a GPU is like using two computers
    Icons made by Freepik from Flaticon
    PCI: CUDA

  15. What is CUDA?

  16. CUDA

  17. I don’t write C/C++

  18. What does CUDA do?
    How do we run stuff on the GPU?
    ● Construct GPU code with CUDA C/C++ language extensions
    ● Copy data from RAM to GPU
    ● Copy compiled code to GPU
    ● Execute code
    ● Copy data from GPU to RAM

  19. Let’s do it in Python

  20.

  21.

  22. Writing a Kernel
    A kernel is a GPU function
    Differences between a kernel and a function:
    ● A kernel cannot return anything; it must instead modify memory
    ● A kernel must specify its thread hierarchy (threads and blocks)
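
    A minimal sketch of such a kernel using Numba (the function and array names are illustrative, not from the slides):

        from numba import cuda

        @cuda.jit
        def add_one(arr):
            i = cuda.grid(1)   # this thread's absolute position in the grid
            if i < arr.size:   # guard: the grid is usually bigger than the data
                arr[i] += 1    # kernels return nothing; they modify memory in place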

  23. Threads, blocks, grids and warps
    https://docs.nvidia.com/cuda/cuda-c-programming-guide/

  24. What?
    Rules of thumb for threads per block:
    ● Should be a round multiple of the warp size (32)
    ● A good place to start is 128-512, but benchmarking is required to determine the optimal value
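
    For example, a common way to pick a launch configuration (a sketch; the data size is illustrative):

        import math

        n = 1_000_000            # number of elements to process
        threads_per_block = 128  # a round multiple of the warp size (32)
        # Enough blocks to cover every element, rounding up
        blocks_per_grid = math.ceil(n / threads_per_block)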

  25. Imports
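
    A sketch of the imports the examples that follow assume:

        import numpy as np       # host (CPU) arrays
        from numba import cuda   # CUDA kernel support in Numba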

  26. Data arrays
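
    A sketch of creating some input data with NumPy (the sizes and values are assumptions, not from the slides):

        import numpy as np

        a = np.arange(1_000_000, dtype=np.float32)  # input array on the host
        b = np.zeros_like(a)                        # output array the kernel will fill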

  27. Example kernel
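
    A sketch of an element-wise kernel in this style, where each thread handles one element (the names are illustrative):

        from numba import cuda

        @cuda.jit
        def multiply_by_two(x, out):
            i = cuda.grid(1)    # this thread's absolute position
            if i < x.size:      # skip threads that fall beyond the end of the data
                out[i] = x[i] * 2.0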

  28. Running the kernel
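
    Numba kernels are launched with square-bracket syntax specifying blocks and threads; a sketch continuing the illustrative names above:

        import math
        import numpy as np
        from numba import cuda

        @cuda.jit
        def multiply_by_two(x, out):
            i = cuda.grid(1)
            if i < x.size:
                out[i] = x[i] * 2.0

        a = np.arange(1_000_000, dtype=np.float32)
        b = np.zeros_like(a)

        threads_per_block = 128
        blocks_per_grid = math.ceil(a.size / threads_per_block)
        # kernel[blocks, threads](args): Numba copies a and b to the GPU
        # and copies the results back afterwards (see slide 32)
        multiply_by_two[blocks_per_grid, threads_per_block](a, b)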

  29. Absolute positions
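
    cuda.grid(1) gives each thread its absolute position across the whole grid; a sketch:

        from numba import cuda

        @cuda.jit
        def store_position(out):
            # absolute index: threadIdx.x + blockIdx.x * blockDim.x
            i = cuda.grid(1)
            if i < out.size:
                out[i] = i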

  30. Thread and block positions
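
    The same index written out from the raw thread and block positions; a sketch:

        from numba import cuda

        @cuda.jit
        def store_position(out):
            tx = cuda.threadIdx.x  # this thread's index within its block
            bx = cuda.blockIdx.x   # this block's index within the grid
            bw = cuda.blockDim.x   # number of threads per block
            i = tx + bx * bw       # equivalent to cuda.grid(1)
            if i < out.size:
                out[i] = i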

  31. Example kernel (again)

  32. How did the GPU update our numpy array?
    If you call a Numba CUDA kernel with data that isn’t on the GPU, it will be
    copied to the GPU before the kernel runs and copied back afterwards.
    This isn’t always ideal, as copying data can waste time.

  33. Create a GPU array
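
    Numba can copy host arrays to the device explicitly, or allocate memory directly on the GPU; a sketch (array names are illustrative):

        import numpy as np
        from numba import cuda

        a = np.arange(1_000_000, dtype=np.float32)

        d_a = cuda.to_device(a)              # explicit host-to-device copy
        d_out = cuda.device_array_like(d_a)  # allocate output directly on the GPU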

  34. Simplified position kernel

  35. GPU Array

  36. Copy to host
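
    Results left on the device have to be copied back explicitly; a sketch:

        import numpy as np
        from numba import cuda

        d_a = cuda.to_device(np.arange(10, dtype=np.float32))

        result = d_a.copy_to_host()  # returns a NumPy array back on the CPU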

  37. Higher level APIs

  38.

  39. cuDF
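
    cuDF provides a pandas-like DataFrame that lives in GPU memory; a minimal sketch (the data is illustrative):

        import cudf

        # Looks and feels like pandas, but runs on the GPU
        df = cudf.DataFrame({"a": [1, 2, 3, 4], "b": [10.0, 20.0, 30.0, 40.0]})
        print(df.b.mean())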

  40. RAPIDS
    End-to-End GPU Accelerated Data Science
    Workflow stages: Data Preparation, Model Training, Visualization
    ● Analytics: cuDF, cuIO
    ● Machine Learning: cuML
    ● Graph Analytics: cuGraph
    ● Deep Learning: PyTorch, TensorFlow, MxNet
    ● Visualization: cuxfilter, pyViz, plotly
    All built on Dask and GPU Memory

  41. Interoperability for the Win
    DLPack and __cuda_array_interface__
    mpi4py

  42. __cuda_array_interface__
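
    Any two libraries that implement __cuda_array_interface__ can share GPU memory without copying; a sketch handing a Numba device array to CuPy (assumes CuPy is installed):

        import numpy as np
        import cupy as cp
        from numba import cuda

        d_a = cuda.to_device(np.arange(10, dtype=np.float32))

        # CuPy reads d_a.__cuda_array_interface__ and wraps the same GPU buffer,
        # so no copy is made
        c_a = cp.asarray(d_a)
        print(c_a.sum())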

  43. Recap

  44. Recap
    Takeaways on GPU computing:
    ● GPUs run the same function (kernel) many times in parallel
    ● Each invocation of the kernel gets a unique index
    ● Kernels are written in CUDA C/C++, but high-level languages like Python can also compile to it
    ● Memory must be copied between the CPU (host) and GPU (device)
    ● Many familiar Python APIs have GPU-accelerated implementations that abstract all this away

  45. THANK YOU
    Jacob Tomlinson
    @_jacobtomlinson
