Slide 1

Slide 1 text

CUDA in your Python: Effective Parallel Programming on the GPU
William Horton
@hortonhearsafoo

Slide 2

Slide 2 text

Moore’s Law is dead @hortonhearsafoo

Slide 3

Slide 3 text

Moore’s Law
The number of transistors on an integrated circuit will double every two years
https://www.intel.com/pressroom/kits/events/moores_law_40th/
@hortonhearsafoo

Slide 4

Slide 4 text

@hortonhearsafoo

Slide 5

Slide 5 text

Chart by ourworldindata.org. Data source: https://en.wikipedia.org/wiki/Transistor_count
Image: https://ourworldindata.org/wp-content/uploads/2013/05/Transistor-Count-over-time.png
Article: https://ourworldindata.org/technological-progress
License: CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=71553709
@hortonhearsafoo

Slide 6

Slide 6 text

Physics!
https://www.amazon.com/Barrons-Physics-Robert-Pelcovits-Ph-D/dp/1438007426

Slide 7

Slide 7 text

The Death of Moore’s Law @hortonhearsafoo

Slide 8

Slide 8 text

The Death of Moore’s Law
“I guess I see Moore’s Law dying here in the next decade or so, but that’s not surprising.” - Gordon Moore, 2015
@hortonhearsafoo

Slide 9

Slide 9 text

Why GPUs? @hortonhearsafoo

Slide 10

Slide 10 text

History of the GPU
The GPU (graphics processing unit) was originally developed for gaming
Designed to be good at matrix operations
The typical workload for gaming graphics requires arithmetic operations on large amounts of data (pixels, objects in a scene, etc.)
@hortonhearsafoo

Slide 11

Slide 11 text

Specs: GPU vs CPU
Nvidia RTX 2080 Ti Founder’s Edition vs. Intel Core i9-9900K
@hortonhearsafoo

Slide 12

Slide 12 text

Specs: GPU vs CPU
Nvidia RTX 2080 Ti Founder’s Edition: 4352 CUDA Cores across 68 Streaming Multiprocessors; Base Clock 1.35 GHz; Boost Clock 1.635 GHz
Intel Core i9-9900K: 8 cores (16 hyperthreads); Base Clock 3.6 GHz; Boost Clock 5.0 GHz
@hortonhearsafoo

Slide 13

Slide 13 text

GPU Architecture from https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf @hortonhearsafoo

Slide 14

Slide 14 text

The rise of GPGPU
General-purpose computing on GPU
“In the past the processing units of the GPU were designed only for computer graphics but now GPUs are truly general-purpose parallel processors.” (“GPGPU Computing”, Oancea 2014)
Different models: CUDA (Nvidia), APP (AMD), OpenCL (open standard maintained by the Khronos Group)
@hortonhearsafoo

Slide 15

Slide 15 text

About me @hortonhearsafoo

Slide 16

Slide 16 text

My work
Senior Software Engineer on the Data team at Compass
Tools my team uses: PySpark, Kafka, Airflow
We’re hiring in NYC and Seattle!
@hortonhearsafoo

Slide 17

Slide 17 text

The Future: GPUs for Data Pipelines? @hortonhearsafoo

Slide 18

Slide 18 text

The Future: GPU Databases? https://eng.uber.com/aresdb/ @hortonhearsafoo

Slide 19

Slide 19 text

My hobbies include... Deep Learning! @hortonhearsafoo

Slide 20

Slide 20 text

Horton’s Law (chart): AWS Bill vs. Interest in deep learning
@hortonhearsafoo

Slide 21

Slide 21 text

@hortonhearsafoo

Slide 22

Slide 22 text

How can I start programming the GPU? @hortonhearsafoo

Slide 23

Slide 23 text

NumPy example

import numpy as np

x = np.random.randn(10000000).astype(np.float32)
y = np.random.randn(10000000).astype(np.float32)
z = x + y

@hortonhearsafoo

Slide 24

Slide 24 text

GPU example

import cupy as cp

x = cp.random.randn(10000000).astype(cp.float32)
y = cp.random.randn(10000000).astype(cp.float32)
z = x + y

@hortonhearsafoo

Slide 25

Slide 25 text

GPU example

import cupy as cp

x = cp.random.randn(10000000).astype(cp.float32)
y = cp.random.randn(10000000).astype(cp.float32)
z = x + y

@hortonhearsafoo

Slide 26

Slide 26 text

The End. Go program the GPU! @hortonhearsafoo

Slide 27

Slide 27 text

Benchmark (using %timeit)
CPU:
GPU:
30x speedup!
@hortonhearsafoo
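The timing numbers on the slide are in an image. A sketch of how the comparison could be reproduced in an IPython/Jupyter session; the 30x figure is the talk's result, not something this sketch guarantees:

import numpy as np
import cupy as cp

x_cpu = np.random.randn(10000000).astype(np.float32)
y_cpu = np.random.randn(10000000).astype(np.float32)
x_gpu = cp.asarray(x_cpu)
y_gpu = cp.asarray(y_cpu)

# CPU
%timeit x_cpu + y_cpu
# GPU (synchronize at the end, since CuPy kernel launches are asynchronous)
%timeit x_gpu + y_gpu; cp.cuda.Stream.null.synchronize()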

Slide 28

Slide 28 text

Different Approaches to CUDA in Python
1. Drop-in replacement
2. Compiling CUDA strings in Python
3. C/C++ extension
@hortonhearsafoo

Slide 29

Slide 29 text

Drop-in replacement @hortonhearsafoo

Slide 30

Slide 30 text

CuPy: a drop-in NumPy replacement
Developed for the deep learning framework Chainer
Supports NumPy-like indexing, data types, broadcasting
@hortonhearsafoo
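In practice, the one addition to the NumPy workflow is moving data between host and device. A minimal sketch using CuPy's conversion helpers:

import numpy as np
import cupy as cp

x_cpu = np.arange(10, dtype=np.float32)
x_gpu = cp.asarray(x_cpu)    # copy host -> device
y_gpu = x_gpu * 2            # the multiply runs on the GPU
y_cpu = cp.asnumpy(y_gpu)    # copy device -> host, back to a NumPy array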

Slide 31

Slide 31 text

API differences @hortonhearsafoo

Slide 32

Slide 32 text

More CUDA drop-ins
From RAPIDS:
cuDF: drop-in for pandas dataframes
cuML: CUDA-powered scikit-learn
https://github.com/rapidsai/cudf
@hortonhearsafoo
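A minimal cuDF sketch, assuming a RAPIDS install; the data and column names are made up, but the point is that the calls mirror pandas:

import cudf

df = cudf.DataFrame({"beds": [1, 2, 2, 3],
                     "price": [500.0, 750.0, 800.0, 1200.0]})
# groupby and the aggregation both run on the GPU
print(df.groupby("beds")["price"].mean())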

Slide 33

Slide 33 text

Compiling CUDA strings in Python @hortonhearsafoo

Slide 34

Slide 34 text

The CUDA API @hortonhearsafoo

Slide 35

Slide 35 text

Threads, Blocks, and Grids
https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
@hortonhearsafoo

Slide 36

Slide 36 text

Threads
Threads execute CUDA code, and have a threadIdx in up to 3 dimensions
The threadIdx is used to specify which part of the data to work on
Why up to 3 dimensions?
@hortonhearsafoo

Slide 37

Slide 37 text

Data parallelism
“Same operations are performed on different subsets of same data.” (https://en.wikipedia.org/wiki/Data_parallelism)
Or: different processors take distinct slices of the data and do the same thing to it
Many operations on vectors and matrices can be performed in a data-parallel way

Slide 38

Slide 38 text

1-D example
Data: [0, 1, 2, 3, 4, 5, 6, 7, 8]
Threads: t0, t1, t2, t3
@hortonhearsafoo

Slide 39

Slide 39 text

1-D example
Threads to indexes:
t0: 0, 4, 8
t1: 1, 5
t2: 2, 6
t3: 3, 7
Rule: threadIdx + numThreads * i
@hortonhearsafoo
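The rule can be sanity-checked with an ordinary Python slice: thread t takes every numThreads-th element starting at its own index.

data = list(range(9))   # [0, 1, 2, ..., 8]
num_threads = 4

for t in range(num_threads):
    # thread t handles indexes t, t + num_threads, t + 2*num_threads, ...
    print(f"t{t}: {data[t::num_threads]}")

# t0: [0, 4, 8]
# t1: [1, 5]
# t2: [2, 6]
# t3: [3, 7]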

Slide 40

Slide 40 text

2-D example
Data: [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
Threads: t0, t1, t2, t3
? ? ? % %
@hortonhearsafoo

Slide 41

Slide 41 text

3-D example ? @hortonhearsafoo

Slide 42

Slide 42 text

2-D example
Data: [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
Threads: t0,0 t0,1 t1,1 t1,0
@hortonhearsafoo

Slide 43

Slide 43 text

from https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf @hortonhearsafoo

Slide 44

Slide 44 text

Blocks & Grids Blocks organize groups of threads “Thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series...Threads within a block can cooperate by sharing data through some shared memory and by synchronizing their execution to coordinate memory accesses” Also can be indexed in up to three dimensions using blockIdx and blockDim Grids: just groups of blocks @hortonhearsafoo

Slide 45

Slide 45 text

Threads, Blocks, and Grids @hortonhearsafoo

Slide 46

Slide 46 text

GPU Architecture from https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf @hortonhearsafoo

Slide 47

Slide 47 text

CUDA Kernel example https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf @hortonhearsafoo

Slide 48

Slide 48 text

Host and device
Diagram: data moves between the CPU (Host) and the GPU (Device)
@hortonhearsafoo

Slide 49

Slide 49 text

CUDA Kernels
Kernels are C/C++ code with additional syntax: most importantly, __global__ for identifying the kernel function, and the <<<...>>> syntax for specifying grid size and block size
@hortonhearsafoo
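For reference, the programming guide's canonical vector-add kernel, written here as a Python string (the form PyCUDA consumes later in the talk):

kernel_source = """
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}
"""
# In standalone CUDA C the kernel is launched with the <<<...>>> syntax,
# e.g. VecAdd<<<1, N>>>(A, B, C); meaning a grid of 1 block with N threads.
# PyCUDA instead passes grid and block sizes as Python arguments.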

Slide 50

Slide 50 text

CUDA Kernel example https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf @hortonhearsafoo

Slide 51

Slide 51 text

PyCUDA @hortonhearsafoo

Slide 52

Slide 52 text

PyCUDA
Built by Andreas Klöckner, a researcher in scientific computing at UIUC
Described in the paper “PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation” (2012)
Used for scientific and research projects: Sailfish (Lattice Boltzmann fluid dynamics), the Copenhagen CT toolbox, LINGO chemical similarities, and more: https://wiki.tiker.net/PyCuda/ShowCase
@hortonhearsafoo

Slide 53

Slide 53 text

PyCUDA example code @hortonhearsafoo
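The slide's code is an image. A minimal reconstruction in the spirit of the PyCUDA tutorial, matching the cuda.InOut call shown a few slides later; the kernel name is illustrative:

import numpy as np
import pycuda.autoinit            # creates and manages a CUDA context
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# compile a CUDA C kernel from a Python string
mod = SourceModule("""
__global__ void double_them(float *a)
{
    int idx = threadIdx.x + threadIdx.y * blockDim.x;
    a[idx] *= 2;
}
""")

func = mod.get_function("double_them")
a = np.random.randn(4, 4).astype(np.float32)
func(cuda.InOut(a), block=(4, 4, 1))   # copy in, run one 4x4 block of threads, copy back
print(a)                               # every element doubled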

Slide 54

Slide 54 text

Benefits of PyCUDA
Automatic Memory Management
Data Transfer: In, Out, and InOut
Automatic Error Checking
Metaprogramming
@hortonhearsafoo

Slide 55

Slide 55 text

Automatic Memory Management
One of the big benefits of PyCUDA: object cleanup is tied to the lifetime of objects. Once your Python object goes out of scope, it will free the CUDA-allocated memory for you.
@hortonhearsafoo

Slide 56

Slide 56 text

Data Transfer: In, Out, and InOut

import numpy
import pycuda.driver as cuda

...

a = numpy.random.randn(4,4).astype(numpy.float32)
func(cuda.InOut(a), block=(4, 4, 1))

@hortonhearsafoo

Slide 57

Slide 57 text

Data Transfer: In, Out, and InOut
These are wrappers around your numpy array that transfer data to and from the GPU
For example, to perform an operation on a np array a, you'd have to:
1. Create the array on the CPU
2. Allocate GPU memory of that size
3. Transfer data from CPU to GPU
4. Run the CUDA kernel
5. Transfer data back from GPU to CPU
Instead you can use cuda.InOut(a) and it does all that for you! (The manual version is sketched below.)
@hortonhearsafoo
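Spelled out manually with PyCUDA's driver API, the five steps look like this; a sketch reusing the doubling kernel from the earlier example:

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

func = SourceModule("""
__global__ void double_them(float *a)
{ a[threadIdx.x + threadIdx.y * blockDim.x] *= 2; }
""").get_function("double_them")

a = np.random.randn(4, 4).astype(np.float32)   # 1. create the array on the CPU
a_gpu = cuda.mem_alloc(a.nbytes)               # 2. allocate GPU memory of that size
cuda.memcpy_htod(a_gpu, a)                     # 3. transfer data from CPU to GPU
func(a_gpu, block=(4, 4, 1))                   # 4. run the CUDA kernel
cuda.memcpy_dtoh(a, a_gpu)                     # 5. transfer data back from GPU to CPU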

Slide 58

Slide 58 text

Automatic Error Checking
“[I]f an asynchronous error occurs, it will be reported by some subsequent unrelated runtime function call.” ???
PyCUDA checks CUDA errors and surfaces them as specific Python Exceptions
@hortonhearsafoo

Slide 59

Slide 59 text

Metaprogramming
When writing CUDA kernels, some parameters have to be chosen carefully, like “What’s the optimal number of threads per block?”
This is often done with heuristics
According to the PyCUDA docs: “The solution to this problem that PyCUDA tries to promote is: Forget heuristics. Benchmark at run time and use whatever works fastest.”
https://documen.tician.de/pycuda/metaprog.html
@hortonhearsafoo
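A minimal sketch of the benchmark-at-run-time idea: generate the kernel source for several block sizes, time each with CUDA events, and keep whatever is fastest. The kernel and the candidate sizes are illustrative, not taken from the PyCUDA docs:

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# the block size is baked into the generated source as a constant
template = """
__global__ void double_them(float *a)
{
    int idx = blockIdx.x * %(block_size)d + threadIdx.x;
    a[idx] *= 2.0f;
}
"""

a = np.random.randn(1 << 20).astype(np.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

for block_size in (32, 64, 128, 256):
    func = SourceModule(template % {"block_size": block_size}).get_function("double_them")
    start, end = cuda.Event(), cuda.Event()
    start.record()
    func(a_gpu, block=(block_size, 1, 1), grid=(a.size // block_size, 1))
    end.record()
    end.synchronize()
    print(block_size, "threads/block:", start.time_till(end), "ms")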

Slide 60

Slide 60 text

Metaprogramming example @hortonhearsafoo

Slide 61

Slide 61 text

CUDA as a C/C++ extension @hortonhearsafoo

Slide 62

Slide 62 text

“Why not just use C++?” @hortonhearsafoo

Slide 63

Slide 63 text

@hortonhearsafoo

Slide 64

Slide 64 text

Python API & C/C++/(CUDA) performance @hortonhearsafoo

Slide 65

Slide 65 text

Python C Extensions https://docs.python.org/3/extending/extending.html @hortonhearsafoo

Slide 66

Slide 66 text

CUDA to C/C++
Nvidia provides the nvcc compiler
nvcc takes in your CUDA C source code and does several things:
1. Compiles the kernel to GPU assembly code
2. Replaces the special syntax (<<<...>>>) in the C/C++ code
3. Optionally, can use your C/C++ compiler to compile the host (CPU) code
@hortonhearsafoo

Slide 67

Slide 67 text

The Python side
Starting point: https://github.com/rmcgibbo/npcuda-example
Uses Cython to generate the C++ wrapper
setuptools then helps us compile and link everything together
@hortonhearsafoo
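A hedged sketch of the Cython layer in that pattern; the header and function names below are illustrative, not taken from the repo. Cython declares the C++/CUDA entry point and hands it a raw pointer into the NumPy array; the .cu side, compiled by nvcc, does the allocation, copies, and kernel launch:

# gpu_op.pyx (illustrative names)
import numpy as np
cimport numpy as np

cdef extern from "gpu_op.hh":
    void gpu_double(float* data, int n)   # implemented in a .cu file built by nvcc

def double_on_gpu(np.ndarray[np.float32_t, ndim=1] a not None):
    # pass the host pointer and length through; the CUDA side handles
    # device allocation, transfers, and the kernel launch
    gpu_double(<float*> a.data, <int> a.shape[0])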

Slide 68

Slide 68 text

The Python side @hortonhearsafoo

Slide 69

Slide 69 text

Manual Memory Management @hortonhearsafoo

Slide 70

Slide 70 text

@hortonhearsafoo

Slide 71

Slide 71 text

Manual Memory Management
Advanced CUDA memory features:
Page-Locked Host Memory
Portable Memory
Write-Combining Memory
Mapped Memory
@hortonhearsafoo

Slide 72

Slide 72 text

A Compiler! @hortonhearsafoo

Slide 73

Slide 73 text

How to get started @hortonhearsafoo

Slide 74

Slide 74 text

Accessing a GPU
Google Colab (free!): https://colab.research.google.com/
Kaggle Kernels (also free!): https://www.kaggle.com/kernels
Cloud GPU instances:
AWS p2.xlarge ($0.90 per hour)
Google Cloud ($0.45 USD per GPU)
@hortonhearsafoo

Slide 75

Slide 75 text

Horton’s Law (chart): AWS Bill vs. Interest in deep learning
@hortonhearsafoo

Slide 76

Slide 76 text

Where to go next?
Applying CUDA to your workflow
Parallel programming algorithms
Other kinds of devices (xPUs like TPU, FPGA through PYNQ)
@hortonhearsafoo

Slide 77

Slide 77 text

The End. Go program the GPU! @hortonhearsafoo

Slide 78

Slide 78 text

Links
“Gordon Moore: The Man Whose Name Means Progress”: https://spectrum.ieee.org/computing/hardware/gordon-moore-the-man-whose-name-means-progress
NVIDIA CUDA C Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
RAPIDS: https://rapids.ai/
AresDB: https://eng.uber.com/aresdb/
CuPy: https://cupy.chainer.org/
PyCUDA: https://documen.tician.de/pycuda/