
CUDA in your Python: Parallel Programming on the GPU (PyBay 2019)

It’s 2019, and Moore’s Law is dead. CPU performance is plateauing, but GPUs provide a chance for continued hardware performance gains, if you can structure your programs to make good use of them. In this talk you will learn how to speed up your Python programs using Nvidia’s CUDA platform.

William Horton

August 17, 2019

Transcript

  1. Moore’s Law
    “The number of transistors on an integrated circuit will double every two years.”
    https://www.intel.com/pressroom/kits/events/moores_law_40th/
  2. [Figure: transistor count over time]
    By ourworldindata.org (data source: https://en.wikipedia.org/wiki/Transistor_count; article: https://ourworldindata.org/technological-progress), CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=71553709
  3. The Death of Moore’s Law
    “I guess I see Moore’s Law dying here in the next decade or so, but that’s not surprising.” - Gordon Moore, 2015
  4. History of the GPU
    The GPU (graphics processing unit) was originally developed for gaming.
    Designed to be good at matrix operations.
    The typical workload for gaming graphics requires arithmetic operations on large amounts of data (pixels, objects in a scene, etc.).
  5. Specs: GPU vs CPU
    Nvidia RTX 2080 Ti Founder’s Edition vs. Intel Core i9-9900K
  6. Specs: GPU vs CPU
    Nvidia RTX 2080 Ti Founder’s Edition: 68 Streaming Multiprocessors (SMs) containing 4352 CUDA Cores; Base Clock: 1.350 GHz; Boost Clock: 1.635 GHz
    Intel Core i9-9900K: 8 Cores (16 Hyperthreads); Base Clock: 3.6 GHz; Boost Clock: 5.0 GHz
  7. The rise of GPGPU
    General-purpose computing on GPUs. “In the past the processing units of the GPU were designed only for computer graphics but now GPUs are truly general-purpose parallel processors.” (“GPGPU Computing”, Oancea 2014)
    Different models: CUDA (Nvidia), APP (AMD), OpenCL (open standard maintained by the Khronos Group)
  8. Fundamental Questions
    Why GPUs?
    What is CUDA?
    CUDA in my Python?
    What next?
  9. My work
    Senior Software Engineer working on data pipelines & machine learning.
    Tools we’re using: PySpark, Airflow, Pandas, Jupyter, MXNet/Gluon
  10. NumPy example

      import numpy as np

      x = np.random.randn(10000000).astype(np.float32)
      y = np.random.randn(10000000).astype(np.float32)
      z = x + y
  11. GPU example

      import cupy as cp

      x = cp.random.randn(10000000).astype(cp.float32)
      y = cp.random.randn(10000000).astype(cp.float32)
      z = x + y
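    A caveat worth noting here (my note, not from the deck): CuPy launches GPU work asynchronously, so when timing this against the NumPy version, synchronize the device before reading the clock.

      import time
      import cupy as cp

      x = cp.random.randn(10000000).astype(cp.float32)
      y = cp.random.randn(10000000).astype(cp.float32)

      start = time.time()
      z = x + y
      cp.cuda.Device().synchronize()  # wait for the GPU before stopping the clock
      print(time.time() - start)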
  13. Different Approaches to CUDA in Python
    1. Drop-in replacement
    2. Numba + CUDA JIT
    3. Compiling CUDA strings in Python
    4. C/C++ extension
  14. CuPy: a drop-in NumPy replacement
    Developed for the deep learning framework Chainer.
    Supports NumPy-like indexing, data types, broadcasting.
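    A drop-in replacement still has an explicit device boundary; a small sketch of my own showing how data moves across it:

      import numpy as np
      import cupy as cp

      a = np.arange(10, dtype=np.float32)
      d = cp.asarray(a)   # host -> device
      d = cp.sqrt(d)      # NumPy-like API, executed on the GPU
      b = cp.asnumpy(d)   # device -> host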
  15. More CUDA drop-ins
    From RAPIDS:
    cuDF: drop-in for pandas dataframes (https://github.com/rapidsai/cudf)
    cuML: CUDA-powered scikit-learn
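    For flavor, a tiny cuDF sketch of my own (assuming a recent cudf build); the API mirrors pandas:

      import cudf

      df = cudf.DataFrame({"key": [0, 0, 1], "val": [1.0, 2.0, 3.0]})
      print(df.groupby("key").mean())  # pandas-style groupby, run on the GPU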
  16. Threads
    Threads execute CUDA code, and have a threadIdx in up to 3 dimensions.
    The threadIdx is used for specifying which part of the data to do work on.
    Why up to 3 dimensions?
  17. Data parallelism
    “Same operations are performed on different subsets of same data.” (https://en.wikipedia.org/wiki/Data_parallelism)
    Or: different processors take distinct slices of the data and do the same thing to them.
    Many operations on vectors and matrices can be performed in a data-parallel way.
  18. 1-D example
    Data: [0, 1, 2, 3, 4, 5, 6, 7, 8]
    Threads: t0 t1 t2 t3
  19. 1-D example
    Threads to indexes:
    t0: 0, 4, 8
    t1: 1, 5
    t2: 2, 6
    t3: 3, 7
    Rule: threadIdx + numThreads * i
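    The same rule in code. A minimal Numba sketch of my own (not from the deck): cuda.grid(1) gives the thread’s starting index and cuda.gridsize(1) plays the role of numThreads.

      from numba import cuda
      import numpy as np

      @cuda.jit
      def add_one(data):
          start = cuda.grid(1)            # threadIdx.x + blockIdx.x * blockDim.x
          num_threads = cuda.gridsize(1)  # total threads in the launch
          # each thread covers indexes start, start + num_threads, ...
          for i in range(start, data.shape[0], num_threads):
              data[i] += 1.0

      a = np.arange(9, dtype=np.float32)
      d = cuda.to_device(a)
      add_one[1, 4](d)  # one block of 4 threads, like t0..t3 above
      print(d.copy_to_host())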
  20. 2-D example
    Data: [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
    Threads: t0 t1 t2 t3
    [diagram: mapping 1-D thread ids onto the 2-D data takes / and % index arithmetic]
  21. 2-D example
    Data: [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
    Threads: t0,0 t0,1 t1,1 t1,0
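    With 2-D thread indexes the mapping becomes direct. A minimal sketch (mine, under the same Numba assumptions): cuda.grid(2) returns the thread’s (row, col) position.

      from numba import cuda
      import numpy as np

      @cuda.jit
      def double_matrix(m):
          row, col = cuda.grid(2)  # 2-D thread position, like t0,0 ... t1,1
          if row < m.shape[0] and col < m.shape[1]:
              m[row, col] *= 2.0

      m = np.arange(9, dtype=np.float32).reshape(3, 3)
      d = cuda.to_device(m)
      double_matrix[(1, 1), (3, 3)](d)  # one block of 3x3 threads
      print(d.copy_to_host())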
  22. Blocks & Grids
    Blocks organize groups of threads, and provide two main features:
    - Fast shared memory
    - Thread synchronization
    Blocks can also be indexed in up to three dimensions, using blockIdx and blockDim.
    Grids: just groups of blocks.
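    Both features in one sketch (my example, not from the deck): each block sums its slice of the input in fast shared memory, synchronizing the block’s threads between steps.

      from numba import cuda, float32
      import numpy as np

      THREADS = 128  # must match the block size at launch

      @cuda.jit
      def block_sum(data, out):
          tmp = cuda.shared.array(THREADS, float32)  # per-block shared memory
          tid = cuda.threadIdx.x
          i = cuda.grid(1)
          tmp[tid] = data[i] if i < data.size else 0.0
          cuda.syncthreads()  # wait until every thread has stored its value
          stride = THREADS // 2
          while stride > 0:
              if tid < stride:
                  tmp[tid] += tmp[tid + stride]
              cuda.syncthreads()
              stride //= 2
          if tid == 0:
              out[cuda.blockIdx.x] = tmp[0]  # one partial sum per block

      a = np.ones(1024, dtype=np.float32)
      partial = np.zeros(1024 // THREADS, dtype=np.float32)
      block_sum[1024 // THREADS, THREADS](a, partial)
      print(partial.sum())  # 1024.0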
  23. CUDA Kernels
    Kernels are C/C++ code with additional syntax: most importantly __global__ for identifying the kernel function, and the <<<...>>> syntax for specifying grid size and block size.
  24. What is Numba?
    “Numba is a just-in-time compiler for Python that works best on code that uses NumPy arrays and functions, and loops.”
    “When a call is made to a Numba decorated function it is compiled to machine code ‘just-in-time’ for execution and all or part of your code can subsequently run at native machine code speed!”
    https://numba.pydata.org/numba-doc/dev/user/5minguide.html
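    A quick illustration of the JIT (my example, not from the deck): a plain Python loop over a NumPy array compiles to machine code on the first call.

      from numba import njit
      import numpy as np

      @njit
      def sum_of_squares(a):
          total = 0.0
          for x in a:  # an ordinary loop, compiled by Numba
              total += x * x
          return total

      a = np.random.randn(1_000_000)
      sum_of_squares(a)  # first call triggers compilation
      sum_of_squares(a)  # later calls run at native speed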
  25. Useful features of Numba for CUDA
    Access to CUDA features in Python functions
    CUDA Simulator
    Interoperability using __cuda_array_interface__
  26. Access to CUDA features in Python functions
    Accessing thread position (with some convenience functions)
    Syncing threads
    CUDA atomic operations
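    For example (a sketch of mine): cuda.grid() is one of the convenience functions for thread position, and cuda.atomic.add() makes concurrent updates safe.

      from numba import cuda
      import numpy as np

      @cuda.jit
      def histogram(data, bins):
          i = cuda.grid(1)  # convenience function for the thread's position
          if i < data.size:
              cuda.atomic.add(bins, data[i], 1)  # safe concurrent increment

      data = np.random.randint(0, 10, 100_000)
      bins = np.zeros(10, dtype=np.int64)
      blocks = (data.size + 255) // 256
      histogram[blocks, 256](data, bins)
      print(bins)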
  27. CUDA Simulator
    “Numba includes a CUDA Simulator that implements most of the semantics in CUDA Python using the Python interpreter and some additional Python code.”
    So you can use print statements and debuggers!
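    The simulator is enabled through Numba’s documented NUMBA_ENABLE_CUDASIM environment variable; a small sketch, assuming it is set before Numba is imported:

      import os
      os.environ["NUMBA_ENABLE_CUDASIM"] = "1"  # must be set before importing numba

      import numpy as np
      from numba import cuda

      @cuda.jit
      def probe(x):
          i = cuda.grid(1)
          if i == 0:
              print("x[0] =", x[0])  # print (and pdb) work under the simulator

      probe[1, 32](np.arange(4.0))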
  28. Interoperability using __cuda_array_interface__
    Inspired by NumPy’s __array_interface__.
    A dictionary containing the necessary information: shape, type, a pointer to the location in memory, and optionally the data layout (strides).
    Supported by CuPy, PyTorch, RAPIDS cuDF.
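    In practice this means GPU arrays can cross library boundaries without copies. A sketch of my own: a CuPy array passed straight into a Numba kernel.

      import cupy as cp
      from numba import cuda

      @cuda.jit
      def scale(a, factor):
          i = cuda.grid(1)
          if i < a.size:
              a[i] *= factor

      x = cp.arange(16, dtype=cp.float32)
      scale[1, 16](x, 2.0)  # Numba reads x's __cuda_array_interface__; no copy
      print(x)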
  29. PyCUDA
    Built by Andreas Klöckner, a researcher in scientific computing at UIUC.
    Described in the paper “PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation” (2012).
    Used for scientific and research projects: Sailfish (Lattice Boltzmann fluid dynamics), the Copenhagen CT toolbox, LINGO chemical similarities, and more: https://wiki.tiker.net/PyCuda/ShowCase
  30. Benefits of PyCUDA
    Automatic Memory Management
    Data Transfer: In, Out, and InOut
    Automatic Error Checking
    Metaprogramming
  31. Automatic Memory Management
    One of the big benefits of PyCUDA: object cleanup is tied to the lifetime of objects. Once your Python object goes out of scope, PyCUDA frees the CUDA-allocated memory for you.
  32. Data Transfer: In, Out, and InOut

      import numpy
      import pycuda.driver as cuda
      ...
      a = numpy.random.randn(4, 4).astype(numpy.float32)
      func(cuda.InOut(a), block=(4, 4, 1))
  33. Data Transfer: In, Out, and InOut
    These are wrappers around your NumPy array that transfer data to and from the GPU.
    For example, to perform an operation on a NumPy array a, you’d have to:
    1. Create the array on the CPU
    2. Allocate GPU memory of that size
    3. Transfer data from CPU to GPU
    4. Run the CUDA kernel
    5. Transfer data back from GPU to CPU
    Instead you can use cuda.InOut(a), and it does all that for you! (A filled-in sketch follows below.)
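    Filling in the func that slide 32 elides: a minimal, self-contained sketch in the style of PyCUDA’s tutorial (the kernel body is my own illustration).

      import numpy
      import pycuda.autoinit  # creates a context on the default device
      import pycuda.driver as cuda
      from pycuda.compiler import SourceModule

      mod = SourceModule("""
      __global__ void double_it(float *a)
      {
          int idx = threadIdx.x + threadIdx.y * 4;
          a[idx] *= 2;
      }
      """)
      func = mod.get_function("double_it")

      a = numpy.random.randn(4, 4).astype(numpy.float32)
      func(cuda.InOut(a), block=(4, 4, 1))  # steps 2-5 are handled by InOut
      print(a)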
  34. Automatic Error Checking
    “[I]f an asynchronous error occurs, it will be reported by some subsequent unrelated runtime function call.” ???
    PyCUDA checks CUDA errors and surfaces them as specific Python exceptions.
  35. Metaprogramming
    When writing CUDA kernels, some parameters have to be chosen carefully, like “What’s the optimal number of threads per block?” This is often done with heuristics.
    According to the PyCUDA docs: “The solution to this problem that PyCUDA tries to promote is: Forget heuristics. Benchmark at run time and use whatever works fastest.”
    https://documen.tician.de/pycuda/metaprog.html
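    One way to “benchmark at run time” (my sketch; func and args stand in for a hypothetical compiled kernel and its arguments): time launches with CUDA events and keep the fastest block size.

      import pycuda.autoinit
      import pycuda.driver as cuda

      def time_kernel(func, args, block):
          start, end = cuda.Event(), cuda.Event()
          start.record()
          func(*args, block=block)
          end.record()
          end.synchronize()
          return start.time_till(end)  # elapsed time in milliseconds

      # candidates = (32, 64, 128, 256)
      # best = min((time_kernel(func, args, (n, 1, 1)), n) for n in candidates)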
  36. (CuPy RawKernel)
    From “Parallelizing Custom CuPy Kernels with Dask” by Peter Andreas Entschev: https://medium.com/rapids-ai/parallelizing-custom-cupy-kernels-with-dask-4d2ccd3b0732
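    A minimal RawKernel sketch of my own to show the shape of the API: the CUDA C string is compiled at run time, and the grid and block sizes that <<<...>>> would carry become the first two arguments of the call.

      import cupy as cp

      add = cp.RawKernel(r'''
      extern "C" __global__
      void add(const float* x, const float* y, float* z, int n) {
          int i = blockDim.x * blockIdx.x + threadIdx.x;
          if (i < n) z[i] = x[i] + y[i];
      }
      ''', 'add')

      n = 1 << 20
      x = cp.random.randn(n).astype(cp.float32)
      y = cp.random.randn(n).astype(cp.float32)
      z = cp.empty_like(x)
      threads = 256
      blocks = (n + threads - 1) // threads
      add((blocks,), (threads,), (x, y, z, cp.int32(n)))  # grid, block, args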
  37. Two main challenges
    1. Writing/generating your C++ (host) code
    2. Creating a setup.py file that compiles your CUDA and C++ code
    A good example of how to tackle these: https://github.com/rmcgibbo/npcuda-example by Robert McGibbon and Yutong Zhao
  38. How to write your C++ code
    The ecosystem is pretty crowded. Options: Pyrex, Cython, SWIG, SIP, Boost.Python, PyCXX, Ctypes, Py++, f2py, PyD, Robin, pybind11 (from https://github.com/cython/cython/wiki/WrappingCorCpp)
  39. What to choose?
    I like Cython. It feels like it was written for Python developers.
    Actively maintained and used in a variety of projects: SciPy, spaCy, Falcon.
    http://blog.behnel.de/posts/cython-pybind11-cffi-which-tool-to-choose.html
  40. CUDA to C/C++
    Nvidia provides the nvcc compiler. nvcc takes in your CUDA C source code and does several things:
    1. Compiles the kernel to GPU assembly code
    2. Replaces the special syntax (<<<...>>>) in the C/C++ code
    3. Optionally, uses your C/C++ compiler to compile the host (CPU) code
  41. Fewer dependencies
    You don’t have to worry about versioning of Python libraries.
    Could make it easier to ship to internal or external clients.
  42. Manual Memory Management
    Advanced CUDA memory features:
    - Page-Locked Host Memory
    - Portable Memory
    - Write-Combining Memory
    - Mapped Memory
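    One of these from Python (a sketch of mine, assuming PyCUDA): page-locked (pinned) host memory, which speeds up host-to-device transfers.

      import numpy as np
      import pycuda.autoinit
      import pycuda.driver as cuda

      a = cuda.pagelocked_empty((1 << 20,), dtype=np.float32)  # pinned host buffer
      a[:] = np.random.randn(a.size)
      d = cuda.mem_alloc(a.nbytes)  # raw device allocation
      cuda.memcpy_htod(d, a)        # fast copy out of pinned memory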
  43. Accessing a GPU
    Google Colab (free!): https://colab.research.google.com/
    Kaggle Kernels (also free!): https://www.kaggle.com/kernels
    Cloud GPU instances: AWS p2.xlarge ($0.90 per hour), Google Cloud ($0.45 per GPU per hour)
  44. Udacity Parallel Programming/CUDA Course
    Course: https://classroom.udacity.com/courses/cs344
    YouTube playlist: https://www.youtube.com/watch?v=F620ommtjqk&list=PLAwxTw4SYaPnFKojVQrmyOGFCqHTxfdv2
    Code: https://github.com/udacity/cs344
    Last updated in 2015 (be careful to install the right dependencies!), but still good material.
  45. Where to go next?
    Applying CUDA to your workflow
    Parallel programming algorithms
    Other kinds of devices (xPUs like TPUs, FPGAs through PYNQ)
  46. Links
    “Gordon Moore: The Man Whose Name Means Progress”: https://spectrum.ieee.org/computing/hardware/gordon-moore-the-man-whose-name-means-progress
    NVIDIA CUDA C Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
    RAPIDS: https://rapids.ai/
    AresDB: https://eng.uber.com/aresdb/
    CuPy: https://cupy.chainer.org/
    Numba: http://numba.pydata.org/numba-doc/latest/index.html
    PyCUDA: https://documen.tician.de/pycuda/