William Horton - CUDA in your Python: Effective Parallel Programming on the GPU

It’s 2019, and Moore’s Law is dead. CPU performance is plateauing, but GPUs provide a chance for continued hardware performance gains, if you can structure your programs to make good use of them.

CUDA is a platform developed by Nvidia for GPGPU: general-purpose computing on GPUs. It backs some of the most popular deep learning libraries, like TensorFlow and PyTorch, but has broader uses in data analysis, data science, and machine learning.

There are several ways that you can start taking advantage of CUDA in your Python programs.

For some common Python libraries, there are drop-in replacements that let you start running computations on the GPU while still using familiar APIs. For example, CuPy provides a NumPy-like API for interacting with multi-dimensional arrays. Similarly, cuDF is a recent project that mimics the pandas interface for dataframes.

If you want more control over your use of CUDA APIs, you can use the PyCUDA library, which provides bindings for the CUDA API that you can call from your Python code. Compared with drop-in libraries, it gives you the ability to manually allocate memory on the GPU and to write custom CUDA functions (called kernels). However, it has drawbacks: you write your CUDA code as large strings in Python, and that code is compiled at runtime.

Finally, for the best performance you can use the Python C/C++ extension interface, the approach taken by deep learning libraries like PyTorch. One of the strengths of Python is the ability to drop down into C/C++, and libraries like NumPy take advantage of this for increased speed. If you use Nvidia’s nvcc compiler for CUDA, you can use the same extension interface to write custom CUDA kernels, and then call them from your Python code.

This talk will explore each of these methods, provide examples to get started, and discuss in more detail the pros and cons of each approach.

https://us.pycon.org/2019/schedule/presentation/206/

PyCon 2019

May 04, 2019

Transcript

  1. Moore’s Law: “The number of transistors on an integrated circuit will double every two years.” https://www.intel.com/pressroom/kits/events/moores_law_40th/
  2. [Chart: transistor count over time] By ourworldindata.org. Data source: https://en.wikipedia.org/wiki/Transistor_count Image: https://ourworldindata.org/wp-content/uploads/2013/05/Transistor-Count-over-time.png Article: https://ourworldindata.org/technological-progress License: CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=71553709
  3. The Death of Moore’s Law: “I guess I see Moore’s Law dying here in the next decade or so, but that’s not surprising.” - Gordon Moore, 2015
  4. History of the GPU: The GPU (graphics processing unit) was originally developed for gaming. It was designed to be good at matrix operations, since the typical gaming graphics workload requires arithmetic operations on large amounts of data (pixels, objects in a scene, etc.).
  5. Specs: GPU vs CPU - Nvidia RTX 2080 Ti Founder’s Edition vs Intel Core i9-9900K
  6. Specs: GPU vs CPU
     Nvidia RTX 2080 Ti Founder’s Edition - Cores: 4352 CUDA cores across 68 Streaming Multiprocessors; Base Clock: 1.350 GHz; Boost Clock: 1.635 GHz
     Intel Core i9-9900K - Cores: 8 (16 hyperthreads); Base Clock: 3.6 GHz; Boost Clock: 5.0 GHz
  7. The rise of GPGPU: general-purpose computing on GPUs. “In the past the processing units of the GPU were designed only for computer graphics but now GPUs are truly general-purpose parallel processors.” (“GPGPU Computing”, Oancea 2014) Different models: CUDA (Nvidia), APP (AMD), OpenCL (open standard maintained by the Khronos Group)
  8. My work: Senior Software Engineer on the Data team at Compass. Tools my team uses: PySpark, Kafka, Airflow. We’re hiring in NYC and Seattle!
  9. NumPy example
     import numpy as np
     x = np.random.randn(10000000).astype(np.float32)
     y = np.random.randn(10000000).astype(np.float32)
     z = x + y
  10. GPU example
      import cupy as cp
      x = cp.random.randn(10000000).astype(cp.float32)
      y = cp.random.randn(10000000).astype(cp.float32)
      z = x + y
  11. GPU example
      import cupy as cp
      x = cp.random.randn(10000000).astype(cp.float32)
      y = cp.random.randn(10000000).astype(cp.float32)
      z = x + y
  12. Different Approaches to CUDA in Python: 1. Drop-in replacement 2. Compiling CUDA strings in Python 3. C/C++ extension
  13. CuPy: a drop-in NumPy replacement. Developed for the deep learning framework Chainer. Supports NumPy-like indexing, data types, and broadcasting.
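     A minimal interop sketch (assuming CuPy is installed and a CUDA-capable GPU is present): moving data between NumPy on the CPU and CuPy on the GPU is explicit.
        import numpy as np
        import cupy as cp

        x_cpu = np.arange(10, dtype=np.float32)
        x_gpu = cp.asarray(x_cpu)   # host -> device copy
        y_gpu = cp.sqrt(x_gpu)      # runs on the GPU, NumPy-style API
        y_cpu = cp.asnumpy(y_gpu)   # device -> host copy, back to a NumPy array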
  14. More CUDA drop-ins from RAPIDS - cuDF: drop-in for pandas dataframes; cuML: CUDA-powered scikit-learn. https://github.com/rapidsai/cudf
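     A minimal cuDF sketch mirroring the pandas interface (the column names and values here are illustrative):
        import cudf

        df = cudf.DataFrame({'price': [10.0, 20.0, 30.0], 'qty': [1, 2, 3]})
        df['total'] = df['price'] * df['qty']   # column arithmetic runs on the GPU
        print(df['total'].sum())                # 140.0
        pdf = df.to_pandas()                    # convert back to a pandas DataFrame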
  15. Threads: Threads execute CUDA code, and have a threadIdx in up to 3 dimensions. The threadIdx is used to specify which part of the data to do work on. Why up to 3 dimensions?
  16. Data parallelism: “Same operations are performed on different subsets of same data.” (https://en.wikipedia.org/wiki/Data_parallelism) Or: different processors take distinct slices of the data and do the same thing to them. Many operations on vectors and matrices can be performed in a data-parallel way.
  17. 1-D example - Data: [0, 1, 2, 3, 4, 5, 6, 7, 8]; Threads: t0, t1, t2, t3
  18. 1-D example - Threads to indexes: t0: 0, 4, 8; t1: 1, 5; t2: 2, 6; t3: 3, 7. Rule: threadIdx + numThreads * i
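     A pure-Python sketch of that rule, just to show which indexes each thread touches (no GPU involved; t plays the role of threadIdx):
        num_threads, n = 4, 9   # 4 threads, 9 data elements, as on the slide
        for t in range(num_threads):
            idxs, i = [], 0
            while t + num_threads * i < n:   # the rule: threadIdx + numThreads * i
                idxs.append(t + num_threads * i)
                i += 1
            print(f"t{t}: {idxs}")
        # t0: [0, 4, 8]  t1: [1, 5]  t2: [2, 6]  t3: [3, 7]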
  19. 2-D example - Data: [[0, 1, 2], [3, 4, 5], [6, 7, 8]]; Threads: t0, t1, t2, t3 - how should four threads map onto 2-D data?
  20. 2-D example - Data: [[0, 1, 2], [3, 4, 5], [6, 7, 8]]; Threads indexed in two dimensions: t0,0, t0,1, t1,0, t1,1
  21. Blocks & Grids: Blocks organize groups of threads. “Thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series... Threads within a block can cooperate by sharing data through some shared memory and by synchronizing their execution to coordinate memory accesses.” Blocks can also be indexed in up to three dimensions, using blockIdx and blockDim. Grids: just groups of blocks.
  22. CUDA Kernels: Kernels are C/C++ code with additional syntax, most importantly __global__ for identifying the kernel function, and the <<<...>>> syntax for specifying grid size and block size (see the sketch below).
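     As a sketch of that syntax, here is a small vector-add kernel compiled from Python with CuPy’s RawKernel (the string is ordinary CUDA C; PyCUDA’s SourceModule, covered next, works similarly). Note that from Python there is no <<<...>>>: the grid and block sizes are passed as tuples at launch.
        import cupy as cp

        vec_add = cp.RawKernel(r'''
        extern "C" __global__
        void vec_add(const float* x, const float* y, float* z, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
            if (i < n) {
                z[i] = x[i] + y[i];
            }
        }
        ''', 'vec_add')

        n = 1 << 20
        x = cp.random.randn(n, dtype=cp.float32)
        y = cp.random.randn(n, dtype=cp.float32)
        z = cp.empty_like(x)
        threads = 256
        blocks = (n + threads - 1) // threads
        vec_add((blocks,), (threads,), (x, y, z, cp.int32(n)))  # grid, block, args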
  23. PyCUDA: Built by Andreas Klöckner, a researcher in scientific computing at UIUC. Described in the paper “PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation” (2012). Used for scientific and research projects: Sailfish (Lattice Boltzmann fluid dynamics), the Copenhagen CT toolbox, LINGO chemical similarities, and more: https://wiki.tiker.net/PyCuda/ShowCase
  24. Benefits of PyCUDA: automatic memory management; data transfer with In, Out, and InOut; automatic error checking; metaprogramming.
  25. Automatic Memory Management: One of the big benefits of PyCUDA: object cleanup is tied to the lifetime of objects. Once your Python object goes out of scope, PyCUDA frees the CUDA-allocated memory for you.
  26. Data Transfer: In, Out, and InOut
      import numpy
      import pycuda.driver as cuda
      ...
      a = numpy.random.randn(4,4).astype(numpy.float32)
      func(cuda.InOut(a), block=(4, 4, 1))
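     The func in that snippet comes from a kernel compiled at runtime; filled out, close to the example in PyCUDA’s tutorial, it looks like this:
        import numpy
        import pycuda.autoinit            # creates a CUDA context on import
        import pycuda.driver as cuda
        from pycuda.compiler import SourceModule

        # CUDA C source as a Python string, compiled at runtime;
        # doubles every element of a 4x4 array in place
        mod = SourceModule("""
        __global__ void doublify(float *a)
        {
            int idx = threadIdx.x + threadIdx.y * 4;
            a[idx] *= 2;
        }
        """)
        func = mod.get_function("doublify")

        a = numpy.random.randn(4, 4).astype(numpy.float32)
        func(cuda.InOut(a), block=(4, 4, 1))   # copy in, run, copy back out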
  27. Data Transfer: In, Out, and InOut: These are wrappers around your NumPy array that transfer data to and from the GPU. For example, to perform an operation on a NumPy array a, you’d have to: 1. Create the array on the CPU 2. Allocate GPU memory of that size 3. Transfer data from CPU to GPU 4. Run the CUDA kernel 5. Transfer data back from GPU to CPU. Instead you can use cuda.InOut(a), and it does all of that for you (see the sketch below)!
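     Written out with the PyCUDA driver API, steps 2-5 for the doublify example above look roughly like this:
        a_gpu = cuda.mem_alloc(a.nbytes)   # 2. allocate GPU memory of that size
        cuda.memcpy_htod(a_gpu, a)         # 3. transfer data from CPU to GPU
        func(a_gpu, block=(4, 4, 1))       # 4. run the CUDA kernel
        cuda.memcpy_dtoh(a, a_gpu)         # 5. transfer data back from GPU to CPU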
  28. Automatic Error Checking: “[I]f an asynchronous error occurs, it will be reported by some subsequent unrelated runtime function call.” ??? PyCUDA checks CUDA errors and surfaces them as specific Python Exceptions.
  29. Metaprogramming: When writing CUDA kernels, some parameters have to be chosen carefully, like “What’s the optimal number of threads per block?” This is often done with heuristics. According to the PyCUDA docs: “The solution to this problem that PyCUDA tries to promote is: Forget heuristics. Benchmark at run time and use whatever works fastest.” https://documen.tician.de/pycuda/metaprog.html
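     A minimal sketch of that idea (func and args are placeholders for a compiled kernel and its arguments): time each candidate block size and keep the fastest.
        import time
        import pycuda.driver as cuda

        def best_block_size(func, args, n, candidates=(64, 128, 256, 512, 1024)):
            timings = {}
            for threads in candidates:
                blocks = (n + threads - 1) // threads
                start = time.perf_counter()
                func(*args, block=(threads, 1, 1), grid=(blocks, 1))
                cuda.Context.synchronize()   # wait for the GPU before reading the clock
                timings[threads] = time.perf_counter() - start
            return min(timings, key=timings.get)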
  30. CUDA to C/C++: Nvidia provides the nvcc compiler. nvcc takes in your CUDA C source code and does several things: 1. Compiles the kernel to GPU assembly code 2. Replaces the special syntax (<<<...>>>) in the C/C++ code 3. Optionally, uses your C/C++ compiler to compile the host (CPU) code
  31. The Python side: Starting point: https://github.com/rmcgibbo/npcuda-example. Uses Cython to generate a C++ class; setuptools then helps us compile and link everything together (sketched below).
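     A condensed sketch of that layout (file and module names here are illustrative; the real npcuda-example also customizes the build_ext command so that .cu sources are compiled with nvcc):
        from setuptools import setup, Extension
        from Cython.Build import cythonize

        ext = Extension(
            'gpu_module',
            sources=['wrapper.pyx', 'kernel.cu'],  # Cython wrapper + CUDA kernel
            language='c++',
            libraries=['cudart'],                  # link against the CUDA runtime
        )

        # npcuda-example adds a custom build_ext command here that routes the
        # .cu source through nvcc while the rest goes to the host C++ compiler.
        setup(name='gpu_module', ext_modules=cythonize([ext]))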
  32. Manual Memory Management: Advanced CUDA memory features: page-locked host memory, portable memory, write-combining memory, mapped memory.
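     For example, PyCUDA exposes page-locked (pinned) host memory, which speeds up host-device copies because the GPU can DMA from it directly; a small sketch:
        import numpy as np
        import pycuda.autoinit
        import pycuda.driver as cuda

        a = cuda.pagelocked_empty((1024,), dtype=np.float32)  # pinned host array
        a[:] = np.random.randn(1024)
        a_gpu = cuda.mem_alloc(a.nbytes)
        cuda.memcpy_htod(a_gpu, a)   # host -> device copy from pinned memory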
  33. Accessing a GPU: Google Colab (free!) https://colab.research.google.com/ Kaggle Kernels (also free!) https://www.kaggle.com/kernels Cloud GPU instances: AWS p2.xlarge ($0.90 per hour), Google Cloud ($0.45 per GPU per hour)
  34. Where to go next? Applying CUDA to your workflow; parallel programming algorithms; other kinds of devices (xPUs like the TPU, or FPGAs through PYNQ)
  35. Links
      “Gordon Moore: The Man Whose Name Means Progress” https://spectrum.ieee.org/computing/hardware/gordon-moore-the-man-whose-name-means-progress
      NVIDIA CUDA C Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
      RAPIDS: https://rapids.ai/
      AresDB: https://eng.uber.com/aresdb/
      CuPy: https://cupy.chainer.org/
      PyCUDA: https://documen.tician.de/pycuda/