Slide 1

Slide 1 text

Intro to RAPIDS and GPU development in Python. Jacob Tomlinson | Bristech, Nov 2021

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

RAPIDS GitHub: https://github.com/rapidsai

Slide 4

Slide 4 text

Jake VanderPlas - PyCon 2017

Slide 5

Slide 5 text

Open Source Data Science Ecosystem: Familiar Python APIs

▸ Data preparation / analytics: Pandas (in CPU memory)
▸ Machine learning: Scikit-Learn
▸ Graph analytics: NetworkX
▸ Deep learning: PyTorch, TensorFlow, MxNet
▸ Visualization: Matplotlib
▸ Scaling: Dask

Slide 6

Slide 6 text

RAPIDS: End-to-End Accelerated GPU Data Science

▸ Data preparation / analytics: cuDF, cuIO (in GPU memory)
▸ Machine learning: cuML
▸ Graph analytics: cuGraph
▸ Deep learning: PyTorch, TensorFlow, MxNet
▸ Visualization: cuxfilter, pyViz, plotly
▸ Scaling: Dask

Slide 7

Slide 7 text

Dask

EASY SCALABILITY
▸ Easy to install and use on a laptop
▸ Scales out to thousand-node clusters
▸ Modularly built for acceleration

DEPLOYABLE
▸ HPC: SLURM, PBS, LSF, SGE
▸ Cloud: Kubernetes
▸ Hadoop/Spark: Yarn

PYDATA NATIVE
▸ Easy migration: built on top of NumPy, Pandas, Scikit-Learn, etc.
▸ Easy training: with the same APIs

POPULAR
▸ Most common parallelism framework today in the PyData and SciPy community
▸ Millions of monthly downloads and dozens of integrations

Scale out / parallelize, from PyData (NumPy, Pandas, Scikit-Learn, Numba and many more; single CPU core, in-memory data) to Dask (multi-core and distributed): NumPy -> Dask Array, Pandas -> Dask DataFrame, Scikit-Learn -> Dask-ML, … -> Dask Futures
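A minimal sketch of the PyData-to-Dask migration shown in the mapping above, assuming `dask` is installed: the NumPy-style code is unchanged apart from declaring chunks and calling `compute()` at the end.

```python
import dask.array as da

# A 10000 x 10000 array split into 100 chunks; each chunk is a plain NumPy array
x = da.ones((10_000, 10_000), chunks=(1_000, 1_000))

# Chunks are reduced in parallel across cores (or, with a distributed scheduler, a cluster)
total = x.sum().compute()
```

The same pattern scales from a laptop to a cluster by swapping the scheduler, not the code.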

Slide 8

Slide 8 text

Scale up / accelerate, with RAPIDS and others (single GPU): NumPy -> CuPy/PyTorch/…, Pandas -> cuDF, Scikit-Learn -> cuML, NetworkX -> cuGraph, Numba -> Numba

Scale out / parallelize, with Dask (multi-core and distributed): NumPy -> Dask Array, Pandas -> Dask DataFrame, Scikit-Learn -> Dask-ML, … -> Dask Futures

Scale out with RAPIDS + Dask with OpenUCX: multi-GPU on a single node (e.g. DGX) or across a cluster

Slide 9

Slide 9 text

Faster Speeds, Real World Benefits

End-to-end benchmark: 200GB CSV dataset; data prep includes joins and variable transformations. Time in seconds (shorter is better) across three stages: cuIO/cuDF (load and data preparation), data conversion, and XGBoost (machine learning). Faster data access, less data movement.

▸ CPU cluster configuration: CPU nodes (61 GiB memory, 8 vCPUs, 64-bit platform), Apache Spark
▸ GPU cluster configuration: 16 A100 GPUs (40GB each), RAPIDS 0.17

Slide 10

Slide 10 text

What are GPUs?

Slide 11

Slide 11 text

Gaming Hardware. For pwning n00bs

Slide 12

Slide 12 text

Mysterious Machine Learning Hardware. For things like GauGAN: "Semantic Image Synthesis with Spatially-Adaptive Normalization", Taesung Park, Ming-Yu Liu, Ting-Chun Wang, Jun-Yan Zhu, arXiv:1903.07291 [cs.CV]

Slide 13

Slide 13 text

CPU vs GPU

Slide 14

Slide 14 text

https://youtu.be/-P28LKWTzrI

Slide 15

Slide 15 text

GPU vs CPU: https://docs.nvidia.com/cuda/cuda-c-programming-guide/

Slide 16

Slide 16 text

Using a GPU is like using two computers. Network tools: SSH, VNC, RDP, SCP, FTP, SFTP, Robocopy. (Icons made by Freepik from Flaticon.)

Slide 17

Slide 17 text

Using a GPU is like using two computers: connected over PCI, programmed with CUDA.

Slide 18

Slide 18 text

What is CUDA?

Slide 19

Slide 19 text

CUDA

Slide 20

Slide 20 text

I don’t write C/C++

Slide 21

Slide 21 text

What does CUDA do? How do we run stuff on the GPU?

1. Construct GPU code with CUDA C/C++ language extensions
2. Copy data from RAM to the GPU
3. Copy compiled code to the GPU
4. Execute the code
5. Copy data from the GPU back to RAM

Slide 22

Slide 22 text

Let’s do it in Python

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

Live coding

Slide 26

Slide 26 text

Writing a Kernel

A kernel is a GPU function. Differences between a kernel and a regular function:

▸ A kernel cannot return anything; it must instead modify memory
▸ A kernel must specify its thread hierarchy (threads and blocks)

Slide 27

Slide 27 text

Threads, blocks, grids and warps: https://docs.nvidia.com/cuda/cuda-c-programming-guide/

Slide 28

Slide 28 text

What? Rules of thumb for threads per block:

▸ Should be a round multiple of the warp size (32)
▸ A good place to start is 128-512, but benchmarking is required to determine the optimal value
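The corresponding launch-configuration arithmetic is just a ceiling division, sketched here with an assumed problem size:

```python
n = 1_000_000       # hypothetical number of elements to process
warp_size = 32

threads_per_block = 128  # a round multiple of the warp size; try 128-512 and benchmark
blocks_per_grid = (n + threads_per_block - 1) // threads_per_block  # ceiling division
```

This guarantees `blocks_per_grid * threads_per_block >= n`, which is why kernels need an `if pos < n` bounds check.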

Slide 29

Slide 29 text

Imports

Slide 30

Slide 30 text

Data arrays
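The slide's code isn't captured; a sketch of the kind of host-side arrays the examples operate on (names and sizes are illustrative):

```python
import numpy as np

# Host (CPU) arrays: an input and a same-shaped output for the kernel to fill
a = np.arange(1000, dtype=np.float32)
out = np.zeros_like(a)
```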

Slide 31

Slide 31 text

Example kernel

Slide 32

Slide 32 text

Running the kernel

Slide 33

Slide 33 text

Absolute positions

Slide 34

Slide 34 text

Thread and block positions

Slide 35

Slide 35 text

Example kernel (again)

Slide 36

Slide 36 text

How did the GPU update our NumPy array?

If you call a Numba CUDA kernel with data that isn't already on the GPU, it will be copied to the GPU before the kernel runs and copied back afterwards. This isn't always ideal, as copying data can waste time.

Slide 37

Slide 37 text

Create a GPU array

Slide 38

Slide 38 text

Simplified position kernel

Slide 39

Slide 39 text

GPU Array

Slide 40

Slide 40 text

Copy to host

Slide 41

Slide 41 text

Higher level APIs

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

cuDF

Slide 44

Slide 44 text

RAPIDS: End-to-End GPU Accelerated Data Science

▸ Data preparation / analytics: cuDF, cuIO (in GPU memory)
▸ Machine learning: cuML
▸ Graph analytics: cuGraph
▸ Deep learning: PyTorch, TensorFlow, MxNet
▸ Visualization: cuxfilter, pyViz, plotly
▸ Scaling: Dask
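cuDF mirrors the pandas API, so existing pandas code often ports with just an import change. A minimal sketch, shown here with pandas so it runs anywhere; on a RAPIDS machine, swapping the import for `import cudf as pd` runs the same code on the GPU:

```python
import pandas as pd  # on a GPU machine with RAPIDS: `import cudf as pd`

df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
means = df.groupby("key")["value"].mean()  # groupby runs on CPU or GPU, same code
```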

Slide 45

Slide 45 text

Interoperability for the Win: DLPack and __cuda_array_interface__; mpi4py

Slide 46

Slide 46 text

__cuda_array_interface__

Slide 47

Slide 47 text

Recap

Slide 48

Slide 48 text

Recap: takeaways on GPU computing

▸ GPUs run the same function (kernel) many times in parallel
▸ Each invocation of the function gets a unique index
▸ CUDA C/C++ is used to write kernels, but high-level languages like Python can also compile to it
▸ Memory must be copied between the CPU (host) and the GPU (device)
▸ Many familiar Python APIs have GPU-accelerated implementations that abstract all this away

Slide 49

Slide 49 text

THANK YOU Jacob Tomlinson @_jacobtomlinson