CUDA in your Python: Effective
Parallel Programming on the GPU
William Horton
@hortonhearsafoo
Moore’s Law is dead
Moore’s Law
The number of transistors on an integrated circuit will double every two years.
https://www.intel.com/pressroom/kits/events/moores_law_40th/
Transistor count over time, by ourworldindata.org (data source: https://en.wikipedia.org/wiki/Transistor_count), CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=71553709
The Death of Moore’s Law
“I guess I see Moore’s Law dying here in the next decade or so, but that’s not surprising.” - Gordon Moore, 2015
Why GPUs?
History of the GPU
The GPU (graphics processing unit) was originally developed for gaming
Designed to be good at matrix operations
A typical gaming graphics workload requires arithmetic operations on large amounts of data (pixels, objects in a scene, etc.)
Specs: GPU vs CPU
Nvidia RTX 2080 Ti Founder’s Edition: 4352 CUDA cores across 68 Streaming Multiprocessors; base clock 1.35 GHz; boost clock 1.635 GHz
Intel Core i9-9900K: 8 cores (16 hyperthreads); base clock 3.6 GHz; boost clock 5.0 GHz
GPU Architecture
from https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
The rise of GPGPU
General-purpose computing on GPU
“In the past the processing units of the GPU were designed only for computer graphics but now GPUs are truly general-purpose parallel processors.” (“GPGPU Computing”, Oancea 2014)
Different models: CUDA (Nvidia), APP (AMD), OpenCL (open standard maintained by the Khronos Group)
About me
My work
Senior Software Engineer on the Data team at Compass
Tools my team uses: PySpark, Kafka, Airflow
We’re hiring in NYC and Seattle!
The Future: GPUs for Data Pipelines?
The Future: GPU Databases?
https://eng.uber.com/aresdb/
My hobbies include...
Deep Learning!
Horton’s Law
[chart: “AWS Bill” rising in step with “Interest in deep learning”]
How can I start programming the GPU?
NumPy example
import numpy as np
# 10 million random single-precision floats, allocated in CPU memory
x = np.random.randn(10000000).astype(np.float32)
y = np.random.randn(10000000).astype(np.float32)
# element-wise addition, computed on the CPU
z = x + y
GPU example
import cupy as cp
# the same 10 million floats, but allocated in GPU memory
x = cp.random.randn(10000000).astype(cp.float32)
y = cp.random.randn(10000000).astype(cp.float32)
# element-wise addition runs on the GPU; z stays in GPU memory
z = x + y
Different Approaches to CUDA in Python
1. Drop-in replacement
2. Compiling CUDA strings in Python
3. C/C++ extension
Drop-in replacement
CuPy: a drop-in NumPy replacement
Developed for the deep learning framework Chainer
Supports NumPy-like indexing, data types, broadcasting
API differences
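One difference worth calling out: CuPy arrays live in GPU memory, so getting results back into NumPy takes an explicit copy. A minimal sketch using CuPy’s documented conversion helpers (cp.asarray and cp.asnumpy):

import numpy as np
import cupy as cp

x_cpu = np.arange(10, dtype=np.float32)
x_gpu = cp.asarray(x_cpu)   # copy host -> device
y_gpu = x_gpu * 2           # computed on the GPU
y_cpu = cp.asnumpy(y_gpu)   # copy device -> host, back to a NumPy array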
More CUDA drop-ins
From RAPIDS:
cuDF: drop-in for pandas dataframes (https://github.com/rapidsai/cudf)
cuML: CUDA-powered scikit-learn
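Because cuDF mirrors the pandas API, dataframe code often ports with little more than a changed import. A hedged sketch (the DataFrame/groupby calls mirror pandas; exact API coverage varies by cuDF version):

import cudf

df = cudf.DataFrame({"key": ["a", "b", "a", "b"], "val": [1.0, 2.0, 3.0, 4.0]})
means = df.groupby("key").mean()   # aggregation runs on the GPU
print(means)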
Compiling CUDA strings in Python
The CUDA API
Threads, Blocks, and Grids
https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
Threads
Threads execute CUDA code, and have a threadIdx in up to 3 dimensions
The threadIdx is used to specify which part of the data to work on
Why up to 3 dimensions?
Data parallelism
“Same operations are performed on different subsets of same data.“ (https://en.wikipedia.org/wiki/Data_parallelism)
In other words, different processors take distinct slices of the data and perform the same operation on each
Many operations on vectors and matrices can be performed in a data-parallel way
1-D example
Data: [0, 1, 2, 3, 4, 5, 6, 7, 8]
Threads: t0 t1 t2 t3
1-D example
Threads to indexes:
t0: 0, 4, 8
t1: 1, 5
t2: 2, 6
t3: 3, 7
Rule: threadIdx + numThreads * i
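To make the rule concrete, here is a pure-Python sketch of the same mapping (num_threads and the data size are just the values from the slide):

data = list(range(9))   # [0, 1, ..., 8]
num_threads = 4
for t in range(num_threads):
    # thread t covers t, t + num_threads, t + 2*num_threads, ...
    indexes = list(range(t, len(data), num_threads))
    print(f"t{t}: {indexes}")
# t0: [0, 4, 8]  t1: [1, 5]  t2: [2, 6]  t3: [3, 7]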
2-D example
Data:
[[0, 1, 2],
 [3, 4, 5],
 [6, 7, 8]]
Threads: t0 t1 t2 t3
How should four 1-D threads cover 2-D data? Mapping 1-D thread IDs onto rows and columns takes extra index arithmetic (e.g. the % operator).
3-D example
?
2-D example
Data:
[[0, 1, 2],
 [3, 4, 5],
 [6, 7, 8]]
Threads, indexed in two dimensions: t0,0 t1,0 t0,1 t1,1
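A pure-Python sketch of how a 2x2 arrangement of threads could cover the 3x3 array, striding by the thread count in each dimension (the scheme just applies the 1-D rule twice):

data = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
tx_count, ty_count = 2, 2   # 2x2 threads
for ty in range(ty_count):
    for tx in range(tx_count):
        # thread (tx, ty) covers every ty_count-th row and tx_count-th column
        cells = [(r, c) for r in range(ty, len(data), ty_count)
                        for c in range(tx, len(data[0]), tx_count)]
        print(f"t{tx},{ty}: {cells}")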
from https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
Blocks & Grids
Blocks organize groups of threads
“Thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series... Threads within a block can cooperate by sharing data through some shared memory and by synchronizing their execution to coordinate memory accesses”
Blocks can also be indexed in up to three dimensions using blockIdx; blockDim gives the number of threads per block, so a thread’s global index along x is blockIdx.x * blockDim.x + threadIdx.x
Grids: just groups of blocks
Host and device
In CUDA terms, the CPU is the “host” and the GPU is the “device”; data must be transferred between them before and after a kernel runs.
CUDA Kernels
Kernels are C/C++ code with additional syntax: most importantly __global__, which identifies the kernel function, and the <<<...>>> syntax for specifying grid size and block size
CUDA Kernel example
https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
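The slide reproduces the guide’s canonical vector-add kernel, which looks roughly like this (adapted from the CUDA C Programming Guide linked above; N is the vector length):

__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

// Kernel invocation with N threads in a single block
VecAdd<<<1, N>>>(A, B, C);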
PyCUDA
PyCUDA
Built by Andreas Klöckner, a researcher in scientific computing at UIUC
Described in the paper “PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation” (2012)
Used for scientific and research projects: Sailfish (Lattice Boltzmann fluid dynamics), the Copenhagen CT toolbox, LINGO chemical similarities, and more: https://wiki.tiker.net/PyCuda/ShowCase
Slide 53
Slide 53 text
PyCUDA example code
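A sketch in the spirit of the PyCUDA tutorial: compile a kernel from a CUDA source string, copy a NumPy array to the device, launch, and copy the result back (doublify is the tutorial’s example kernel; the driver calls are PyCUDA’s documented API):

import numpy as np
import pycuda.autoinit          # creates a CUDA context on import
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# Compile CUDA C source at run time
mod = SourceModule("""
__global__ void doublify(float *a)
{
    int idx = threadIdx.x + threadIdx.y * 4;
    a[idx] *= 2;
}
""")

a = np.random.randn(4, 4).astype(np.float32)
a_gpu = cuda.mem_alloc(a.nbytes)   # allocate device memory
cuda.memcpy_htod(a_gpu, a)         # host -> device

doublify = mod.get_function("doublify")
doublify(a_gpu, block=(4, 4, 1))   # launch with a 4x4 block of threads

a_doubled = np.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu) # device -> host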
Benefits of PyCUDA
Automatic Memory Management
Data Transfer: In, Out, and InOut
Automatic Error Checking
Metaprogramming
Automatic Memory Management
One of the big benefits of PyCUDA: object cleanup is tied to the lifetime of objects. Once your Python object goes out of scope, PyCUDA frees the CUDA-allocated memory for you.
Slide 56
Slide 56 text
Data Transfer: In, Out, and InOut
import numpy
import pycuda.driver as cuda
...
a = numpy.random.randn(4, 4).astype(numpy.float32)
# InOut copies a to the GPU, runs the kernel, and copies the result back into a
func(cuda.InOut(a), block=(4, 4, 1))
Data Transfer: In, Out, and InOut
These are wrappers around your NumPy array that transfer data to and from the GPU
For example, to perform an operation on a NumPy array a, you would have to:
1. Create the array on the CPU
2. Allocate GPU memory of that size
3. Transfer the data from CPU to GPU
4. Run the CUDA kernel
5. Transfer the data back from GPU to CPU
Instead you can use cuda.InOut(a) and it does all of that for you!
Automatic Error Checking
“[I]f an asynchronous error occurs, it will be reported by some subsequent unrelated runtime function call.” ???
PyCUDA checks for CUDA errors and surfaces them as specific Python exceptions
Metaprogramming
When writing CUDA kernels, some parameters have to be chosen carefully, like “What’s the optimal number of threads per block?”
This is often done with heuristics
According to the PyCUDA docs: “The solution to this problem that PyCUDA tries to promote is: Forget heuristics. Benchmark at run time and use whatever works fastest.”
https://documen.tician.de/pycuda/metaprog.html
Metaprogramming example
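A hedged sketch of the idea, after the PyCUDA metaprogramming docs: render the kernel source from a Python string template for several block sizes, time each variant with CUDA events, and keep the fastest. (The template and the benchmark loop are illustrative, not a fixed PyCUDA recipe; the Event and In/Out calls are PyCUDA’s documented API.)

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# Kernel source as a template; the block size is baked in at compile time
template = """
__global__ void add(float *dest, float *a, float *b)
{
    const int i = blockIdx.x * %(block_size)d + threadIdx.x;
    dest[i] = a[i] + b[i];
}
"""

n = 1 << 20
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
dest = np.zeros_like(a)

best = None
for block_size in (32, 64, 128, 256, 512):
    func = SourceModule(template % {"block_size": block_size}).get_function("add")
    start, end = cuda.Event(), cuda.Event()
    start.record()
    func(cuda.Out(dest), cuda.In(a), cuda.In(b),
         block=(block_size, 1, 1), grid=(n // block_size, 1))
    end.record()
    end.synchronize()
    ms = start.time_till(end)   # elapsed milliseconds
    if best is None or ms < best[1]:
        best = (block_size, ms)

print("fastest block size:", best[0])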
CUDA as a C/C++ extension
“Why not just use C++?”
Python API & C/C++/(CUDA) performance
Python C Extensions
https://docs.python.org/3/extending/extending.html
CUDA to C/C++
Nvidia provides the nvcc compiler
nvcc takes in your CUDA C source code and does several things:
1. Compiles the kernel to GPU assembly code
2. Replaces the special syntax (<<<...>>>) in the C/C++ code
3. Optionally, can use your C/C++ compiler to compile the host (CPU) code
The Python side
Starting point: https://github.com/rmcgibbo/npcuda-example
Uses Cython to generate C++ class
setuptools then helps us compile and link everything together
Accessing a GPU
Google Colab (free!) https://colab.research.google.com/
Kaggle Kernels (also free!) https://www.kaggle.com/kernels
Cloud GPU Instances:
AWS p2.xlarge ($0.90 per hour)
Google Cloud ($0.45 per GPU per hour)
Horton’s Law (reprise)
[chart: “AWS Bill” rising in step with “Interest in deep learning”]
Where to go next?
Applying CUDA to your workflow
Parallel programming algorithms
Other kinds of devices (xPUs like TPU, FPGA through PYNQ)
The End. Go program the GPU!
@hortonhearsafoo
Links
“Gordon Moore: The Man Whose Name Means Progress”: https://spectrum.ieee.org/computing/hardware/gordon-moore-the-man-whose-name-means-progress
NVIDIA CUDA C Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
RAPIDS: https://rapids.ai/
AresDB: https://eng.uber.com/aresdb/
CuPy: https://cupy.chainer.org/
PyCUDA: https://documen.tician.de/pycuda/