William Horton - CUDA in your Python: Effective Parallel Programming on the GPU

It’s 2019, and Moore’s Law is dead. CPU performance is plateauing, but GPUs provide a chance for continued hardware performance gains, if you can structure your programs to make good use of them.

CUDA is a platform developed by Nvidia for GPGPU: general-purpose computing on GPUs. It backs some of the most popular deep learning libraries, like TensorFlow and PyTorch, but has broader uses in data analysis, data science, and machine learning.

There are several ways that you can start taking advantage of CUDA in your Python programs.

For some common Python libraries, there are drop-in replacements that let you start running computations on the GPU while still using familiar APIs. For example, CuPy provides a NumPy-like API for interacting with multi-dimensional arrays. Similarly, cuDF is a recent project that mimics the pandas interface for dataframes.

If you want more control over your use of CUDA APIs, you can use the PyCUDA library, which provides bindings for the CUDA API that you can call from your Python code. Compared with drop-in libraries, it gives you the ability to manually allocate memory on the GPU and to write custom CUDA functions (called kernels). However, it has drawbacks: you write your CUDA code as large strings in Python, and that code is compiled at runtime.

Finally, for the best performance you can use the Python C/C++ extension interface, the approach taken by deep learning libraries like PyTorch. One of the strengths of Python is the ability to drop down into C/C++, and libraries like NumPy take advantage of this for increased speed. If you use Nvidia’s nvcc compiler for CUDA, you can use the same extension interface to write custom CUDA kernels, and then call them from your Python code.

This talk will explore each of these methods, provide examples to get started, and discuss in more detail the pros and cons of each approach.

https://us.pycon.org/2019/schedule/presentation/206/

PyCon 2019

May 04, 2019

Transcript

  1. Moore’s Law: “The number of transistors on an integrated circuit will double every two years.” https://www.intel.com/pressroom/kits/events/moores_law_40th/
  2. [Chart: transistor count over time] By ourworldindata.org. Data source: https://en.wikipedia.org/wiki/Transistor_count Image: https://ourworldindata.org/wp-content/uploads/2013/05/Transistor-Count-over-time.png Article: https://ourworldindata.org/technological-progress License: CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=71553709
  3. The Death of Moore’s Law: “I guess I see Moore’s Law dying here in the next decade or so, but that’s not surprising.” - Gordon Moore, 2015
  4. History of the GPU: The GPU (graphics processing unit) was originally developed for gaming. It was designed to be good at matrix operations, since the typical gaming graphics workload requires arithmetic operations on large amounts of data (pixels, objects in a scene, etc.).
  5. Specs: GPU vs CPU - Nvidia RTX 2080 Ti Founder’s Edition vs Intel Core i9-9900K
  6. Specs: GPU vs CPU
     Nvidia RTX 2080 Ti Founder’s Edition - Cores: 4352 CUDA cores across 68 Streaming Multiprocessors; Base Clock: 1.350 GHz; Boost Clock: 1.635 GHz
     Intel Core i9-9900K - Cores: 8 (16 hyperthreads); Base Clock: 3.6 GHz; Boost Clock: 5.0 GHz
  7. The rise of GPGPU: general-purpose computing on GPUs. “In the past the processing units of the GPU were designed only for computer graphics but now GPUs are truly general-purpose parallel processors.” (“GPGPU Computing”, Oancea 2014) Different models: CUDA (Nvidia), APP (AMD), OpenCL (open standard maintained by the Khronos Group)
  8. My work: Senior Software Engineer on the Data team at Compass. Tools my team uses: PySpark, Kafka, Airflow. We’re hiring in NYC and Seattle!
  9. NumPy example
     import numpy as np
     x = np.random.randn(10000000).astype(np.float32)
     y = np.random.randn(10000000).astype(np.float32)
     z = x + y
  10. GPU example
      import cupy as cp
      x = cp.random.randn(10000000).astype(cp.float32)
      y = cp.random.randn(10000000).astype(cp.float32)
      z = x + y
  11. GPU example
      import cupy as cp
      x = cp.random.randn(10000000).astype(cp.float32)
      y = cp.random.randn(10000000).astype(cp.float32)
      z = x + y
  12. Different Approaches to CUDA in Python: 1. Drop-in replacement 2. Compiling CUDA strings in Python 3. C/C++ extension
  13. CuPy: a drop-in NumPy replacement. Developed for the deep learning framework Chainer. Supports NumPy-like indexing, data types, and broadcasting.
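     A minimal interop sketch (assuming CuPy is installed and a CUDA-capable GPU is present): moving data between NumPy on the CPU and CuPy on the GPU is explicit.
        import numpy as np
        import cupy as cp

        x_cpu = np.arange(10, dtype=np.float32)
        x_gpu = cp.asarray(x_cpu)   # host -> device copy
        y_gpu = cp.sqrt(x_gpu)      # runs on the GPU, NumPy-style API
        y_cpu = cp.asnumpy(y_gpu)   # device -> host copy, back to a NumPy array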
  14. More CUDA drop-ins from RAPIDS - cuDF: drop-in for pandas dataframes; cuML: CUDA-powered scikit-learn. https://github.com/rapidsai/cudf
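     A minimal cuDF sketch mirroring the pandas interface (the column names and values here are illustrative):
        import cudf

        df = cudf.DataFrame({'price': [10.0, 20.0, 30.0], 'qty': [1, 2, 3]})
        df['total'] = df['price'] * df['qty']   # column arithmetic runs on the GPU
        print(df['total'].sum())                # 140.0
        pdf = df.to_pandas()                    # convert back to a pandas DataFrame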
  15. Threads: Threads execute CUDA code, and have a threadIdx in up to 3 dimensions. The threadIdx is used to specify which part of the data to do work on. Why up to 3 dimensions?
  16. Data parallelism: “Same operations are performed on different subsets of same data.” (https://en.wikipedia.org/wiki/Data_parallelism) Or: different processors take distinct slices of the data and do the same thing to them. Many operations on vectors and matrices can be performed in a data-parallel way.
  17. 1-D example - Data: [0, 1, 2, 3, 4, 5, 6, 7, 8]; Threads: t0, t1, t2, t3
  18. 1-D example - Threads to indexes: t0: 0, 4, 8; t1: 1, 5; t2: 2, 6; t3: 3, 7. Rule: threadIdx + numThreads * i
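     A pure-Python sketch of that rule, just to show which indexes each thread touches (no GPU involved; t plays the role of threadIdx):
        num_threads, n = 4, 9   # 4 threads, 9 data elements, as on the slide
        for t in range(num_threads):
            idxs, i = [], 0
            while t + num_threads * i < n:   # the rule: threadIdx + numThreads * i
                idxs.append(t + num_threads * i)
                i += 1
            print(f"t{t}: {idxs}")
        # t0: [0, 4, 8]  t1: [1, 5]  t2: [2, 6]  t3: [3, 7]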
  19. 2-D example - Data: [[0, 1, 2], [3, 4, 5], [6, 7, 8]]; Threads: t0, t1, t2, t3 - how should four threads map onto 2-D data?
  20. 2-D example - Data: [[0, 1, 2], [3, 4, 5], [6, 7, 8]]; Threads indexed in two dimensions: t0,0, t0,1, t1,0, t1,1
  21. Blocks & Grids: Blocks organize groups of threads. “Thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series... Threads within a block can cooperate by sharing data through some shared memory and by synchronizing their execution to coordinate memory accesses.” Blocks can also be indexed in up to three dimensions, using blockIdx and blockDim. Grids: just groups of blocks.
  22. CUDA Kernels: Kernels are C/C++ code with additional syntax, most importantly __global__ for identifying the kernel function, and the <<<...>>> syntax for specifying grid size and block size (see the sketch below).
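     As a sketch of that syntax, here is a small vector-add kernel compiled from Python with CuPy’s RawKernel (the string is ordinary CUDA C; PyCUDA’s SourceModule, covered next, works similarly). Note that from Python there is no <<<...>>>: the grid and block sizes are passed as tuples at launch.
        import cupy as cp

        vec_add = cp.RawKernel(r'''
        extern "C" __global__
        void vec_add(const float* x, const float* y, float* z, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
            if (i < n) {
                z[i] = x[i] + y[i];
            }
        }
        ''', 'vec_add')

        n = 1 << 20
        x = cp.random.randn(n, dtype=cp.float32)
        y = cp.random.randn(n, dtype=cp.float32)
        z = cp.empty_like(x)
        threads = 256
        blocks = (n + threads - 1) // threads
        vec_add((blocks,), (threads,), (x, y, z, cp.int32(n)))  # grid, block, args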
  23. PyCUDA: Built by Andreas Klöckner, a researcher in scientific computing at UIUC. Described in the paper “PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation” (2012). Used for scientific and research projects: Sailfish (Lattice Boltzmann fluid dynamics), the Copenhagen CT toolbox, LINGO chemical similarities, and more: https://wiki.tiker.net/PyCuda/ShowCase
  24. Benefits of PyCUDA: automatic memory management; data transfer with In, Out, and InOut; automatic error checking; metaprogramming.
  25. Automatic Memory Management: One of the big benefits of PyCUDA: object cleanup is tied to the lifetime of objects. Once your Python object goes out of scope, PyCUDA frees the CUDA-allocated memory for you.
  26. Data Transfer: In, Out, and InOut
      import numpy
      import pycuda.driver as cuda
      ...
      a = numpy.random.randn(4,4).astype(numpy.float32)
      func(cuda.InOut(a), block=(4, 4, 1))
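     The func in that snippet comes from a kernel compiled at runtime; filled out, close to the example in PyCUDA’s tutorial, it looks like this:
        import numpy
        import pycuda.autoinit            # creates a CUDA context on import
        import pycuda.driver as cuda
        from pycuda.compiler import SourceModule

        # CUDA C source as a Python string, compiled at runtime;
        # doubles every element of a 4x4 array in place
        mod = SourceModule("""
        __global__ void doublify(float *a)
        {
            int idx = threadIdx.x + threadIdx.y * 4;
            a[idx] *= 2;
        }
        """)
        func = mod.get_function("doublify")

        a = numpy.random.randn(4, 4).astype(numpy.float32)
        func(cuda.InOut(a), block=(4, 4, 1))   # copy in, run, copy back out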
  27. Data Transfer: In, Out, and InOut: These are wrappers around your NumPy array that transfer data to and from the GPU. For example, to perform an operation on a NumPy array a, you’d have to: 1. Create the array on the CPU 2. Allocate GPU memory of that size 3. Transfer data from CPU to GPU 4. Run the CUDA kernel 5. Transfer data back from GPU to CPU. Instead you can use cuda.InOut(a), and it does all of that for you (see the sketch below)!
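     Written out with the PyCUDA driver API, steps 2-5 for the doublify example above look roughly like this:
        a_gpu = cuda.mem_alloc(a.nbytes)   # 2. allocate GPU memory of that size
        cuda.memcpy_htod(a_gpu, a)         # 3. transfer data from CPU to GPU
        func(a_gpu, block=(4, 4, 1))       # 4. run the CUDA kernel
        cuda.memcpy_dtoh(a, a_gpu)         # 5. transfer data back from GPU to CPU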
  28. Automatic Error Checking: “[I]f an asynchronous error occurs, it will be reported by some subsequent unrelated runtime function call.” ??? PyCUDA checks CUDA errors and surfaces them as specific Python Exceptions.
  29. Metaprogramming: When writing CUDA kernels, some parameters have to be chosen carefully, like “What’s the optimal number of threads per block?” This is often done with heuristics. According to the PyCUDA docs: “The solution to this problem that PyCUDA tries to promote is: Forget heuristics. Benchmark at run time and use whatever works fastest.” https://documen.tician.de/pycuda/metaprog.html
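     A minimal sketch of that idea (func and args are placeholders for a compiled kernel and its arguments): time each candidate block size and keep the fastest.
        import time
        import pycuda.driver as cuda

        def best_block_size(func, args, n, candidates=(64, 128, 256, 512, 1024)):
            timings = {}
            for threads in candidates:
                blocks = (n + threads - 1) // threads
                start = time.perf_counter()
                func(*args, block=(threads, 1, 1), grid=(blocks, 1))
                cuda.Context.synchronize()   # wait for the GPU before reading the clock
                timings[threads] = time.perf_counter() - start
            return min(timings, key=timings.get)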
  30. CUDA to C/C++: Nvidia provides the nvcc compiler. nvcc takes in your CUDA C source code and does several things: 1. Compiles the kernel to GPU assembly code 2. Replaces the special syntax (<<<...>>>) in the C/C++ code 3. Optionally, uses your C/C++ compiler to compile the host (CPU) code
  31. The Python side: Starting point: https://github.com/rmcgibbo/npcuda-example. Uses Cython to generate a C++ class; setuptools then helps us compile and link everything together (sketched below).
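     A condensed sketch of that layout (file and module names here are illustrative; the real npcuda-example also customizes the build_ext command so that .cu sources are compiled with nvcc):
        from setuptools import setup, Extension
        from Cython.Build import cythonize

        ext = Extension(
            'gpu_module',
            sources=['wrapper.pyx', 'kernel.cu'],  # Cython wrapper + CUDA kernel
            language='c++',
            libraries=['cudart'],                  # link against the CUDA runtime
        )

        # npcuda-example adds a custom build_ext command here that routes the
        # .cu source through nvcc while the rest goes to the host C++ compiler.
        setup(name='gpu_module', ext_modules=cythonize([ext]))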
  32. Manual Memory Management: Advanced CUDA memory features: page-locked host memory, portable memory, write-combining memory, mapped memory.
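     For example, PyCUDA exposes page-locked (pinned) host memory, which speeds up host-device copies because the GPU can DMA from it directly; a small sketch:
        import numpy as np
        import pycuda.autoinit
        import pycuda.driver as cuda

        a = cuda.pagelocked_empty((1024,), dtype=np.float32)  # pinned host array
        a[:] = np.random.randn(1024)
        a_gpu = cuda.mem_alloc(a.nbytes)
        cuda.memcpy_htod(a_gpu, a)   # host -> device copy from pinned memory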
  33. Accessing a GPU: Google Colab (free!) https://colab.research.google.com/ Kaggle Kernels (also free!) https://www.kaggle.com/kernels Cloud GPU instances: AWS p2.xlarge ($0.90 per hour), Google Cloud ($0.45 per GPU per hour)
  34. Where to go next? Applying CUDA to your workflow; parallel programming algorithms; other kinds of devices (xPUs like the TPU, or FPGAs through PYNQ)
  35. Links
      “Gordon Moore: The Man Whose Name Means Progress” https://spectrum.ieee.org/computing/hardware/gordon-moore-the-man-whose-name-means-progress
      NVIDIA CUDA C Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
      RAPIDS: https://rapids.ai/
      AresDB: https://eng.uber.com/aresdb/
      CuPy: https://cupy.chainer.org/
      PyCUDA: https://documen.tician.de/pycuda/