
CUDA in your Python: Parallel Programming on the GPU (PyBay 2019)

It’s 2019, and Moore’s Law is dead. CPU performance is plateauing, but GPUs provide a chance for continued hardware performance gains, if you can structure your programs to make good use of them. In this talk you will learn how to speed up your Python programs using Nvidia’s CUDA platform.

William Horton

August 17, 2019

Transcript

  1. Moore’s Law
    “The number of transistors on an integrated circuit will double every two years.”
    https://www.intel.com/pressroom/kits/events/moores_law_40th/
  2. [Figure: transistor count over time]
    By ourworldindata.org (data source: https://en.wikipedia.org/wiki/Transistor_count; article: https://ourworldindata.org/technological-progress), CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=71553709
  3. The Death of Moore’s Law
    “I guess I see Moore’s Law dying here in the next decade or so, but that’s not surprising.” - Gordon Moore, 2015
  4. History of the GPU
    The GPU (graphics processing unit) was originally developed for gaming.
    Designed to be good at matrix operations.
    The typical workload for gaming graphics requires arithmetic operations on large amounts of data (pixels, objects in a scene, etc.).
  5. Specs: GPU vs CPU
    Nvidia RTX 2080 Ti Founder’s Edition vs. Intel Core i9-9900K
  6. Specs: GPU vs CPU
    Nvidia RTX 2080 Ti Founder’s Edition: 68 Streaming Multiprocessors (SMs) containing 4352 CUDA Cores; Base Clock: 1.350 GHz; Boost Clock: 1.635 GHz
    Intel Core i9-9900K: 8 Cores (16 Hyperthreads); Base Clock: 3.6 GHz; Boost Clock: 5.0 GHz
  7. The rise of GPGPU
    General-purpose computing on GPUs. “In the past the processing units of the GPU were designed only for computer graphics but now GPUs are truly general-purpose parallel processors.” (“GPGPU Computing”, Oancea 2014)
    Different models: CUDA (Nvidia), APP (AMD), OpenCL (open standard maintained by the Khronos Group)
  8. Fundamental Questions
    Why GPUs?
    What is CUDA?
    CUDA in my Python?
    What next?
  9. My work
    Senior Software Engineer working on data pipelines & machine learning.
    Tools we’re using: PySpark, Airflow, Pandas, Jupyter, MXNet/Gluon
  10. NumPy example

      import numpy as np

      x = np.random.randn(10000000).astype(np.float32)
      y = np.random.randn(10000000).astype(np.float32)
      z = x + y
  11. GPU example

      import cupy as cp

      x = cp.random.randn(10000000).astype(cp.float32)
      y = cp.random.randn(10000000).astype(cp.float32)
      z = x + y
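    A caveat worth noting here (my note, not from the deck): CuPy launches GPU work asynchronously, so when timing this against the NumPy version, synchronize the device before reading the clock.

      import time
      import cupy as cp

      x = cp.random.randn(10000000).astype(cp.float32)
      y = cp.random.randn(10000000).astype(cp.float32)

      start = time.time()
      z = x + y
      cp.cuda.Device().synchronize()  # wait for the GPU before stopping the clock
      print(time.time() - start)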
  13. Different Approaches to CUDA in Python
    1. Drop-in replacement
    2. Numba + CUDA JIT
    3. Compiling CUDA strings in Python
    4. C/C++ extension
  14. CuPy: a drop-in NumPy replacement
    Developed for the deep learning framework Chainer.
    Supports NumPy-like indexing, data types, broadcasting.
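    A drop-in replacement still has an explicit device boundary; a small sketch of my own showing how data moves across it:

      import numpy as np
      import cupy as cp

      a = np.arange(10, dtype=np.float32)
      d = cp.asarray(a)   # host -> device
      d = cp.sqrt(d)      # NumPy-like API, executed on the GPU
      b = cp.asnumpy(d)   # device -> host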
  15. More CUDA drop-ins
    From RAPIDS:
    cuDF: drop-in for pandas dataframes (https://github.com/rapidsai/cudf)
    cuML: CUDA-powered scikit-learn
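    For flavor, a tiny cuDF sketch of my own (assuming a recent cudf build); the API mirrors pandas:

      import cudf

      df = cudf.DataFrame({"key": [0, 0, 1], "val": [1.0, 2.0, 3.0]})
      print(df.groupby("key").mean())  # pandas-style groupby, run on the GPU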
  16. Threads
    Threads execute CUDA code, and have a threadIdx in up to 3 dimensions.
    The threadIdx is used for specifying which part of the data to do work on.
    Why up to 3 dimensions?
  17. Data parallelism
    “Same operations are performed on different subsets of same data.” (https://en.wikipedia.org/wiki/Data_parallelism)
    Or: different processors take distinct slices of the data and do the same thing to them.
    Many operations on vectors and matrices can be performed in a data-parallel way.
  18. 1-D example
    Data: [0, 1, 2, 3, 4, 5, 6, 7, 8]
    Threads: t0 t1 t2 t3
  19. 1-D example
    Threads to indexes:
    t0: 0, 4, 8
    t1: 1, 5
    t2: 2, 6
    t3: 3, 7
    Rule: threadIdx + numThreads * i
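    The same rule in code. A minimal Numba sketch of my own (not from the deck): cuda.grid(1) gives the thread’s starting index and cuda.gridsize(1) plays the role of numThreads.

      from numba import cuda
      import numpy as np

      @cuda.jit
      def add_one(data):
          start = cuda.grid(1)            # threadIdx.x + blockIdx.x * blockDim.x
          num_threads = cuda.gridsize(1)  # total threads in the launch
          # each thread covers indexes start, start + num_threads, ...
          for i in range(start, data.shape[0], num_threads):
              data[i] += 1.0

      a = np.arange(9, dtype=np.float32)
      d = cuda.to_device(a)
      add_one[1, 4](d)  # one block of 4 threads, like t0..t3 above
      print(d.copy_to_host())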
  20. 2-D example
    Data: [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
    Threads: t0 t1 t2 t3
    [diagram: mapping 1-D thread ids onto the 2-D data takes / and % index arithmetic]
  21. 2-D example
    Data: [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
    Threads: t0,0 t0,1 t1,1 t1,0
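    With 2-D thread indexes the mapping becomes direct. A minimal sketch (mine, under the same Numba assumptions): cuda.grid(2) returns the thread’s (row, col) position.

      from numba import cuda
      import numpy as np

      @cuda.jit
      def double_matrix(m):
          row, col = cuda.grid(2)  # 2-D thread position, like t0,0 ... t1,1
          if row < m.shape[0] and col < m.shape[1]:
              m[row, col] *= 2.0

      m = np.arange(9, dtype=np.float32).reshape(3, 3)
      d = cuda.to_device(m)
      double_matrix[(1, 1), (3, 3)](d)  # one block of 3x3 threads
      print(d.copy_to_host())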
  22. Blocks & Grids
    Blocks organize groups of threads, and provide two main features:
    - Fast shared memory
    - Thread synchronization
    Blocks can also be indexed in up to three dimensions, using blockIdx and blockDim.
    Grids: just groups of blocks.
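    Both features in one sketch (my example, not from the deck): each block sums its slice of the input in fast shared memory, synchronizing the block’s threads between steps.

      from numba import cuda, float32
      import numpy as np

      THREADS = 128  # must match the block size at launch

      @cuda.jit
      def block_sum(data, out):
          tmp = cuda.shared.array(THREADS, float32)  # per-block shared memory
          tid = cuda.threadIdx.x
          i = cuda.grid(1)
          tmp[tid] = data[i] if i < data.size else 0.0
          cuda.syncthreads()  # wait until every thread has stored its value
          stride = THREADS // 2
          while stride > 0:
              if tid < stride:
                  tmp[tid] += tmp[tid + stride]
              cuda.syncthreads()
              stride //= 2
          if tid == 0:
              out[cuda.blockIdx.x] = tmp[0]  # one partial sum per block

      a = np.ones(1024, dtype=np.float32)
      partial = np.zeros(1024 // THREADS, dtype=np.float32)
      block_sum[1024 // THREADS, THREADS](a, partial)
      print(partial.sum())  # 1024.0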
  23. CUDA Kernels
    Kernels are C/C++ code with additional syntax: most importantly __global__ for identifying the kernel function, and the <<<...>>> syntax for specifying grid size and block size.
  24. What is Numba?
    “Numba is a just-in-time compiler for Python that works best on code that uses NumPy arrays and functions, and loops.”
    “When a call is made to a Numba decorated function it is compiled to machine code ‘just-in-time’ for execution and all or part of your code can subsequently run at native machine code speed!”
    https://numba.pydata.org/numba-doc/dev/user/5minguide.html
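    A quick illustration of the JIT (my example, not from the deck): a plain Python loop over a NumPy array compiles to machine code on the first call.

      from numba import njit
      import numpy as np

      @njit
      def sum_of_squares(a):
          total = 0.0
          for x in a:  # an ordinary loop, compiled by Numba
              total += x * x
          return total

      a = np.random.randn(1_000_000)
      sum_of_squares(a)  # first call triggers compilation
      sum_of_squares(a)  # later calls run at native speed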
  25. Useful features of Numba for CUDA
    Access to CUDA features in Python functions
    CUDA Simulator
    Interoperability using __cuda_array_interface__
  26. Access to CUDA features in Python functions
    Accessing thread position (with some convenience functions)
    Syncing threads
    CUDA atomic operations
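    For example (a sketch of mine): cuda.grid() is one of the convenience functions for thread position, and cuda.atomic.add() makes concurrent updates safe.

      from numba import cuda
      import numpy as np

      @cuda.jit
      def histogram(data, bins):
          i = cuda.grid(1)  # convenience function for the thread's position
          if i < data.size:
              cuda.atomic.add(bins, data[i], 1)  # safe concurrent increment

      data = np.random.randint(0, 10, 100_000)
      bins = np.zeros(10, dtype=np.int64)
      blocks = (data.size + 255) // 256
      histogram[blocks, 256](data, bins)
      print(bins)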
  27. CUDA Simulator
    “Numba includes a CUDA Simulator that implements most of the semantics in CUDA Python using the Python interpreter and some additional Python code.”
    So you can use print statements and debuggers!
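    The simulator is enabled through Numba’s documented NUMBA_ENABLE_CUDASIM environment variable; a small sketch, assuming it is set before Numba is imported:

      import os
      os.environ["NUMBA_ENABLE_CUDASIM"] = "1"  # must be set before importing numba

      import numpy as np
      from numba import cuda

      @cuda.jit
      def probe(x):
          i = cuda.grid(1)
          if i == 0:
              print("x[0] =", x[0])  # print (and pdb) work under the simulator

      probe[1, 32](np.arange(4.0))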
  28. Interoperability using __cuda_array_interface__
    Inspired by NumPy’s __array_interface__.
    A dictionary containing the necessary information: shape, type, a pointer to the location in memory, and optionally the data layout (strides).
    Supported by CuPy, PyTorch, RAPIDS cuDF.
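    In practice this means GPU arrays can cross library boundaries without copies. A sketch of my own: a CuPy array passed straight into a Numba kernel.

      import cupy as cp
      from numba import cuda

      @cuda.jit
      def scale(a, factor):
          i = cuda.grid(1)
          if i < a.size:
              a[i] *= factor

      x = cp.arange(16, dtype=cp.float32)
      scale[1, 16](x, 2.0)  # Numba reads x's __cuda_array_interface__; no copy
      print(x)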
  29. PyCUDA
    Built by Andreas Klöckner, a researcher in scientific computing at UIUC.
    Described in the paper “PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation” (2012).
    Used for scientific and research projects: Sailfish (Lattice Boltzmann fluid dynamics), the Copenhagen CT toolbox, LINGO chemical similarities, and more: https://wiki.tiker.net/PyCuda/ShowCase
  30. Benefits of PyCUDA
    Automatic Memory Management
    Data Transfer: In, Out, and InOut
    Automatic Error Checking
    Metaprogramming
  31. Automatic Memory Management
    One of the big benefits of PyCUDA: object cleanup is tied to the lifetime of objects. Once your Python object goes out of scope, PyCUDA frees the CUDA-allocated memory for you.
  32. Data Transfer: In, Out, and InOut

      import numpy
      import pycuda.driver as cuda
      ...
      a = numpy.random.randn(4, 4).astype(numpy.float32)
      func(cuda.InOut(a), block=(4, 4, 1))
  33. Data Transfer: In, Out, and InOut
    These are wrappers around your NumPy array that transfer data to and from the GPU.
    For example, to perform an operation on a NumPy array a, you’d have to:
    1. Create the array on the CPU
    2. Allocate GPU memory of that size
    3. Transfer data from CPU to GPU
    4. Run the CUDA kernel
    5. Transfer data back from GPU to CPU
    Instead you can use cuda.InOut(a), and it does all that for you! (A filled-in sketch follows below.)
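    Filling in the func that slide 32 elides: a minimal, self-contained sketch in the style of PyCUDA’s tutorial (the kernel body is my own illustration).

      import numpy
      import pycuda.autoinit  # creates a context on the default device
      import pycuda.driver as cuda
      from pycuda.compiler import SourceModule

      mod = SourceModule("""
      __global__ void double_it(float *a)
      {
          int idx = threadIdx.x + threadIdx.y * 4;
          a[idx] *= 2;
      }
      """)
      func = mod.get_function("double_it")

      a = numpy.random.randn(4, 4).astype(numpy.float32)
      func(cuda.InOut(a), block=(4, 4, 1))  # steps 2-5 are handled by InOut
      print(a)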
  34. Automatic Error Checking
    “[I]f an asynchronous error occurs, it will be reported by some subsequent unrelated runtime function call.” ???
    PyCUDA checks CUDA errors and surfaces them as specific Python exceptions.
  35. Metaprogramming
    When writing CUDA kernels, some parameters have to be chosen carefully, like “What’s the optimal number of threads per block?” This is often done with heuristics.
    According to the PyCUDA docs: “The solution to this problem that PyCUDA tries to promote is: Forget heuristics. Benchmark at run time and use whatever works fastest.”
    https://documen.tician.de/pycuda/metaprog.html
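    One way to “benchmark at run time” (my sketch; func and args stand in for a hypothetical compiled kernel and its arguments): time launches with CUDA events and keep the fastest block size.

      import pycuda.autoinit
      import pycuda.driver as cuda

      def time_kernel(func, args, block):
          start, end = cuda.Event(), cuda.Event()
          start.record()
          func(*args, block=block)
          end.record()
          end.synchronize()
          return start.time_till(end)  # elapsed time in milliseconds

      # candidates = (32, 64, 128, 256)
      # best = min((time_kernel(func, args, (n, 1, 1)), n) for n in candidates)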
  36. (CuPy RawKernel)
    From “Parallelizing Custom CuPy Kernels with Dask” by Peter Andreas Entschev: https://medium.com/rapids-ai/parallelizing-custom-cupy-kernels-with-dask-4d2ccd3b0732
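    A minimal RawKernel sketch of my own to show the shape of the API: the CUDA C string is compiled at run time, and the grid and block sizes that <<<...>>> would carry become the first two arguments of the call.

      import cupy as cp

      add = cp.RawKernel(r'''
      extern "C" __global__
      void add(const float* x, const float* y, float* z, int n) {
          int i = blockDim.x * blockIdx.x + threadIdx.x;
          if (i < n) z[i] = x[i] + y[i];
      }
      ''', 'add')

      n = 1 << 20
      x = cp.random.randn(n).astype(cp.float32)
      y = cp.random.randn(n).astype(cp.float32)
      z = cp.empty_like(x)
      threads = 256
      blocks = (n + threads - 1) // threads
      add((blocks,), (threads,), (x, y, z, cp.int32(n)))  # grid, block, args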
  37. Two main challenges
    1. Writing/generating your C++ (host) code
    2. Creating a setup.py file that compiles your CUDA and C++ code
    A good example of how to tackle these: https://github.com/rmcgibbo/npcuda-example by Robert McGibbon and Yutong Zhao
  38. How to write your C++ code
    The ecosystem is pretty crowded. Options: Pyrex, Cython, SWIG, SIP, Boost.Python, PyCXX, Ctypes, Py++, f2py, PyD, Robin, pybind11 (from https://github.com/cython/cython/wiki/WrappingCorCpp)
  39. What to choose?
    I like Cython. It feels like it was written for Python developers.
    Actively maintained and used in a variety of projects: SciPy, spaCy, Falcon.
    http://blog.behnel.de/posts/cython-pybind11-cffi-which-tool-to-choose.html
  40. CUDA to C/C++
    Nvidia provides the nvcc compiler. nvcc takes in your CUDA C source code and does several things:
    1. Compiles the kernel to GPU assembly code
    2. Replaces the special syntax (<<<...>>>) in the C/C++ code
    3. Optionally, uses your C/C++ compiler to compile the host (CPU) code
  41. Fewer dependencies
    You don’t have to worry about versioning of Python libraries.
    Could make it easier to ship to internal or external clients.
  42. Manual Memory Management
    Advanced CUDA memory features:
    - Page-Locked Host Memory
    - Portable Memory
    - Write-Combining Memory
    - Mapped Memory
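    One of these from Python (a sketch of mine, assuming PyCUDA): page-locked (pinned) host memory, which speeds up host-to-device transfers.

      import numpy as np
      import pycuda.autoinit
      import pycuda.driver as cuda

      a = cuda.pagelocked_empty((1 << 20,), dtype=np.float32)  # pinned host buffer
      a[:] = np.random.randn(a.size)
      d = cuda.mem_alloc(a.nbytes)  # raw device allocation
      cuda.memcpy_htod(d, a)        # fast copy out of pinned memory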
  43. Accessing a GPU
    Google Colab (free!): https://colab.research.google.com/
    Kaggle Kernels (also free!): https://www.kaggle.com/kernels
    Cloud GPU instances: AWS p2.xlarge ($0.90 per hour), Google Cloud ($0.45 per GPU per hour)
  44. Udacity Parallel Programming/CUDA Course
    Course: https://classroom.udacity.com/courses/cs344
    YouTube playlist: https://www.youtube.com/watch?v=F620ommtjqk&list=PLAwxTw4SYaPnFKojVQrmyOGFCqHTxfdv2
    Code: https://github.com/udacity/cs344
    Last updated in 2015 (be careful to install the right dependencies!), but still good material.
  45. Where to go next?
    Applying CUDA to your workflow
    Parallel programming algorithms
    Other kinds of devices (xPUs like TPUs, FPGAs through PYNQ)
  46. Links
    “Gordon Moore: The Man Whose Name Means Progress”: https://spectrum.ieee.org/computing/hardware/gordon-moore-the-man-whose-name-means-progress
    NVIDIA CUDA C Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
    RAPIDS: https://rapids.ai/
    AresDB: https://eng.uber.com/aresdb/
    CuPy: https://cupy.chainer.org/
    Numba: http://numba.pydata.org/numba-doc/latest/index.html
    PyCUDA: https://documen.tician.de/pycuda/