Tech Exeter Conference: Intro to GPU Development in Python

Writing code for GPUs has come a long way over the last few years and it is now easier than ever to get started. You can even do it in Python! This talk will cover setting up your Python environment for GPU development, how coding for GPUs differs from coding for CPUs, and the kinds of problems GPUs excel at solving. We will dive into some real examples using Numba and also touch on a suite of Python data science tools called RAPIDS.

Session takeaways

* You don't need to learn C++ to develop on GPUs
* GPUs are useful for more than just machine learning
* Hardware accelerators like GPUs are going to be more important than ever for scaling our current workloads

Jacob Tomlinson

September 09, 2020

  1. Jacob Tomlinson | Tech Exeter Conference 2020 | Intro to GPU development in Python
  2. (no text)
  3. RAPIDS GitHub: https://github.com/rapidsai
  4. GPU-Accelerated ETL: The Average Data Scientist Spends 90+% of Their Time in ETL as Opposed to Training Models
  5. Lightning-fast performance on real-world use cases. Up to 350x faster queries; Hours to Seconds! TPCx-BB is a data science benchmark consisting of 30 end-to-end queries representing real-world ETL and Machine Learning workflows, involving both structured and unstructured data. It can be run at multiple “Scale Factors”: ▸ SF1 - 1 GB ▸ SF1K - 1 TB ▸ SF10K - 10 TB. RAPIDS results at SF1K (2 DGX A100s) and SF10K (16 DGX A100s) show GPUs provide dramatic cost and time savings for small-scale and large-scale data analytics problems: ▸ SF1K 37.1x average speed-up ▸ SF10K 19.5x average speed-up (7x Normalized for Cost)

  6. Cloud: Kubernetes ▸ Hadoop/Spark: Yarn. PYDATA NATIVE ▸ Easy Migration: Built on top of NumPy, Pandas, Scikit-Learn, etc ▸ Easy Training: With the same APIs ▸ Trusted: With the same developer community. EASY SCALABILITY ▸ Easy to install and use on a laptop ▸ Scales out to thousand-node clusters. POPULAR ▸ Most common parallelism framework today in the PyData and SciPy community
  7. What are GPUs?
  8. Gaming Hardware: For pwning n00bs
  9. Mysterious Machine Learning Hardware: For things like GauGAN. Semantic Image Synthesis with Spatially-Adaptive Normalization. Taesung Park, Ming-Yu Liu, Ting-Chun Wang, Jun-Yan Zhu. arXiv:1903.07291 [cs.CV]
  10. CPU GPU
  11. https://youtu.be/-P28LKWTzrI
  12. GPU vs CPU: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
  13. Using a GPU is like using two computers. Network: SSH, VNC, RD, SCP, FTP, SFTP, Robocopy. Icons made by Freepik from Flaticon.
  14. Using a GPU is like using two computers. PCI, CUDA. Icons made by Freepik from Flaticon.
  15. What is CUDA?
  16. CUDA
  17. I don’t write C/C++
  18. How do we run stuff on the GPU? What does CUDA do? • Construct GPU code with CUDA C/C++ language extensions • Copy data from RAM to GPU • Copy compiled code to GPU • Execute code • Copy data from GPU to RAM
  19. Let’s do it in Python
  20. (no text)
  21. (no text)
  22. Writing a Kernel. A kernel is a GPU function. Differences between a kernel and a function: • A kernel cannot return anything; it must instead modify memory • A kernel must specify its thread hierarchy (threads and blocks)
  23. Threads, blocks, grids and warps: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
  24. What? Rules of thumb for threads per block: • Should be a round multiple of the warp size (32) • A good place to start is 128-512, but benchmarking is required to determine the optimal value.
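
As a rough illustration of these rules of thumb, the helper below picks a launch configuration for a 1D problem. The function name `launch_config` and its default of 128 threads per block are assumptions made for this sketch, not something shown in the talk; it is plain Python and needs no GPU.

```python
import math

WARP_SIZE = 32  # warp size quoted on the slide


def launch_config(n_items, threads_per_block=128):
    """Return (blocks_per_grid, threads_per_block) with one thread per item.

    Hypothetical helper: rejects block sizes that are not a multiple of
    the warp size, then rounds the block count up so every item is covered.
    """
    if threads_per_block % WARP_SIZE != 0:
        raise ValueError("threads_per_block should be a multiple of the warp size")
    blocks_per_grid = math.ceil(n_items / threads_per_block)
    return blocks_per_grid, threads_per_block


print(launch_config(1000))       # (8, 128)
print(launch_config(1000, 512))  # (2, 512)
```

Rounding up means the last block usually contains spare threads, which is why kernels typically check their index against the array size before touching memory. The best block size still has to be found by benchmarking.
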
  25. Imports
  26. Data arrays
  27. Example kernel
  28. Running the kernel
  29. Absolute positions
  30. Thread and block positions
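
The code from these slides is not preserved in the transcript. As a stand-in, here is a plain-Python sketch of how an absolute position relates to thread and block positions in one dimension. It emulates the arithmetic only and launches nothing on a GPU. In Numba the corresponding values are `cuda.threadIdx.x`, `cuda.blockIdx.x` and `cuda.blockDim.x`, and `cuda.grid(1)` returns the combined absolute position; the function name `absolute_positions` is hypothetical.

```python
def absolute_positions(blocks_per_grid, threads_per_block):
    """Emulate the 1D index each GPU thread would compute for itself."""
    positions = []
    for block_idx in range(blocks_per_grid):          # plays the role of blockIdx.x
        for thread_idx in range(threads_per_block):   # plays the role of threadIdx.x
            # blockDim.x is the number of threads in each block
            positions.append(thread_idx + block_idx * threads_per_block)
    return positions


positions = absolute_positions(blocks_per_grid=2, threads_per_block=4)
print(positions)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Every thread gets a distinct index, so each one can work on its own element of the data.
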

  31. 31 Example kernel (again)

  32. 32 How did the GPU update our numpy array? If

    you call a Numba CUDA kernel with data that isn’t on the GPU it will be copied to the GPU before running the kernel and copied back after. This isn’t always ideal as copying data can waste time.
  33. Create a GPU array
  34. Simplified position kernel
  35. GPU Array
  36. Copy to host
  37. Higher level APIs
  38. (no text)
  39. cuDF
  40. RAPIDS: End-to-End GPU Accelerated Data Science. Stages: Data Preparation, Model Training, Visualization. Components: cuDF, cuIO (Analytics); cuML (Machine Learning); cuGraph (Graph Analytics); PyTorch, TensorFlow, MxNet (Deep Learning); cuxfilter, pyViz, plotly (Visualization); Dask; GPU Memory
  41. Interoperability for the Win: DLPack and __cuda_array_interface__, mpi4py
  42. __cuda_array_interface__
  43. Recap
  44. Recap: Takeaways on GPU computing • GPUs run the same function (kernel) many times in parallel • When called, each function gets a unique index • CUDA/C++ is used to write kernels, but high-level languages like Python can also compile to it • Memory must be copied between the CPU (host) and GPU (device) • Many familiar Python APIs have GPU-accelerated implementations to abstract all this away
  45. THANK YOU. Jacob Tomlinson, @_jacobtomlinson