Programming the GPU easily with Python using NumbaPro

NumbaPro allows domain experts who use Python to target the GPU with simple Python syntax.

Travis E. Oliphant

November 20, 2013

Transcript

  1. Applications of Programming the GPU Directly from Python Using NumbaPro

    Supercomputing 2013, November 20, 2013. Travis E. Oliphant, Ph.D.
  2. Introduction

    Enterprise Python: scientific computing, data processing, data analysis, visualization, and scalable computing. Products, training, support, and consulting.
    Anaconda: a free Python distribution; an Enterprise version is available which includes GPU support.
    Wakari: Scientific Python in your browser; also available to install in your data center.
  3. Big Picture

    Empower domain experts, subject-matter experts, and other occasional programmers with high-level tools that exploit modern hardware. How? Array-oriented computing.
  4. Why Array-Oriented Computing

    •Express domain knowledge directly in arrays (tensors, matrices, vectors); it is easier to teach programming in the domain.
    •Arrays can take advantage of parallelism and accelerators like the GPU.

    [Diagram: a table of objects (Object1-Object6) with attributes Attr1, Attr2, Attr3, reorganized column-wise into one array per attribute.]
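    A minimal illustration of the array-oriented style in plain NumPy (the names and numbers here are hypothetical):

    import numpy as np

    # one line of array notation replaces an explicit loop over objects
    prices = np.array([10.0, 12.5, 9.0])
    quantities = np.array([100, 50, 75])
    revenue = prices * quantities  # element-by-element across all objects
    total = revenue.sum()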
  5. Why Python

    License: free and open source, with a permissive license.
    Community: broad and friendly; over 36,000 packages on PyPI; commercial support; many conferences (PyData, SciPy, PyCon, ...).
    Readable syntax: executable pseudo-code; you can understand and edit code a year later; fun to develop; use of indentation.
    Modern constructs: list comprehensions; iterator protocol and generators; meta-programming; introspection; JIT compiler and concurrency (Numba).
    Batteries included: internet protocols (FTP, HTTP, SMTP, XMLRPC); compression and databases; great visualization tools (Bokeh, Matplotlib, etc.); powerful libraries for STEM; integration with C/C++/Fortran.
    IPython: an interactive prompt on steroids (the Notebook); requires less working memory; allows failing quickly for exploration.
  6. Breaking the Speed Barrier (Numba!)

    Numba aims to be the world's best array-oriented compiler. Rapid iteration and development + fast code execution = the ideal combination! Python syntax, but no GIL; native-code speed for numerical computing (NumPy code).
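    A minimal sketch of the workflow, using the autojit decorator that appears on later slides (the function itself is hypothetical):

    from numba import autojit

    @autojit
    def sum1d(a):
        # a plain Python loop, compiled to native code on first call
        total = 0.0
        for i in range(a.shape[0]):
            total += a[i]
        return total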
  7. NumPy + Mamba = Numba

    [Diagram: a Python function is compiled to machine code through the LLVM library (via LLVM-PY), the same infrastructure behind CLANG, OpenCL, ISPC, CUDA, and OpenMP, backed by Intel, Nvidia, Apple, and AMD, with targets including x86 and ARM.]
  8. Compiler Overview

    [Diagram: like C, C++, and Fortran front ends, Numba lowers code to LLVM IR, which LLVM then compiles to x86, ARM, or PTX.]

    Numba turns Python into a "compiled language".
  9. Anaconda Accelerate

    The Python and NumPy stack compiled for parallel architectures (GPUs and multi-core machines). Fast development and execution.
    •Compile NumPy array expressions for the CPU and GPU
    •Create parallel-for loops
    •Parallel execution of ufuncs
    •Run ufuncs on the GPU
    •Write CUDA directly in Python!
    •Requires CUDA 5.5

    $ conda install accelerate
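    After installing, a quick smoke test might look like the following (check_cuda was NumbaPro's CUDA-detection helper; treat the exact name as an assumption if your version differs):

    import numbapro
    numbapro.check_cuda()  # reports the CUDA libraries and GPUs found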
  10. NumbaPro Features

    •CUDA Python
    •Vectorize: NumPy functions on the GPU
    •Array expressions
    •Parallel-for loops
    •Access to fast libraries (cuRAND, cuFFT, cuBLAS)
  11. Compile NumPy array expressions

    import numbapro
    from numba import autojit

    @autojit
    def formula(a, b, c):
        a[1:, 1:] = a[1:, 1:] + b[1:, :-1] + c[1:, :-1]

    @autojit
    def express(m1, m2):
        m2[1:-1:2, 0, ..., ::2] = (m1[1:-1:2, ..., ::2] *
                                   m1[-2:1:-2, ..., ::2])
        return m2
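    A hypothetical call, with shapes invented to satisfy the slicing in express:

    import numpy as np

    a = np.ones((4, 4)); b = np.ones((4, 4)); c = np.ones((4, 4))
    formula(a, b, c)             # updates a in place

    m1 = np.ones((6, 4, 8))
    m2 = np.zeros((6, 1, 4, 8))
    express(m1, m2)              # fills a strided view of m2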
  12. Create parallel-for loops

    The "prange" directive spawns compiled tasks in threads (like an OpenMP parallel-for pragma).

    import numbapro  # import first to make prange available
    from numba import autojit, prange

    @autojit
    def parallel_sum2d(a):
        sum = 0.0
        for i in prange(a.shape[0]):
            for j in range(a.shape[1]):
                sum += a[i, j]
        return sum
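    A quick check against NumPy's serial sum (array and size are hypothetical):

    import numpy as np

    a = np.random.rand(1000, 1000)
    assert np.isclose(parallel_sum2d(a), a.sum())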
  13. Fast vectorize

    NumPy's ufuncs take "kernels" and apply them element-by-element over entire arrays. Write the kernels in Python!

    import numpy
    from numbapro import vectorize
    from math import sin, pi

    @vectorize(['f8(f8)', 'f4(f4)'])
    def sinc(x):
        if x == 0.0:
            return 1.0
        else:
            return sin(x * pi) / (pi * x)

    x = numpy.linspace(-5, 5, 100)
    y = sinc(x)
  14. Ufuncs in parallel (multi-thread or GPU)

    import numpy
    from numbapro import vectorize
    from math import sin, pi

    @vectorize(['f8(f8)', 'f4(f4)'], target='gpu')
    def sinc(x):
        if x == 0.0:
            return 1.0
        else:
            return sin(x * pi) / (pi * x)

    @vectorize(['f8(f8)', 'f4(f4)'], target='parallel')
    def sinc2(x):
        if x == 0.0:
            return 1.0
        else:
            return sin(x * pi) / (pi * x)

    x = numpy.linspace(-5, 5, 1000)
    y = sinc(x)   # on the GPU
    z = sinc2(x)  # on multiple CPUs
  15. Example Benchmark

    The overhead of memory transfer is overcome after about 1 million floats for a simple computation; there is roughly 1 ms of overhead for memory transfer and set-up.
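    The transfer overhead can be measured directly with a host-to-device round trip; a rough sketch (sizes hypothetical):

    import numpy as np
    from numbapro import cuda
    from timeit import default_timer as timer

    a = np.zeros(1000000, dtype=np.float32)  # about 4 MB
    start = timer()
    d_a = cuda.to_device(a)   # host -> device copy
    d_a.copy_to_host(a)       # device -> host copy
    print(timer() - start)    # expect on the order of milliseconds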
  16. Using Vectorize

    from numbapro import vectorize

    sig = 'uint8(uint32, f4, f4, f4, f4, uint32, uint32, uint32)'

    @vectorize([sig], target='gpu')
    def mandel(tid, min_x, max_x, min_y, max_y, width, height, iters):
        pixel_size_x = (max_x - min_x) / width
        pixel_size_y = (max_y - min_y) / height
        x = tid % width
        y = tid / width   # integer division on uint32 operands
        real = min_x + x * pixel_size_x
        imag = min_y + y * pixel_size_y
        c = complex(real, imag)
        z = 0.0j
        for i in range(iters):
            z = z * z + c
            if (z.real * z.real + z.imag * z.imag) >= 4:
                return i
        return 255

    Benchmark (GPU: Tesla S2050):
    Kind     Time (s)   Speed-up
    Python   263.6      1.0x
    CPU      2.639      100x
    GPU      0.1676     1573x
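    A hypothetical driver for the mandel ufunc: each thread id maps to one pixel, so an image is just the ufunc applied to an array of ids (bounds and sizes invented for illustration):

    import numpy as np

    width, height, iters = 1024, 768, 20
    tids = np.arange(width * height, dtype=np.uint32)
    pixels = mandel(tids, -2.0, 1.0, -1.0, 1.0, width, height, iters)
    image = pixels.reshape(height, width)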
  17. Grouping Calculations

    Create "generalized ufuncs" whose elements are "arrays".

    import numpy as np
    from numbapro import guvectorize

    prototype = "void(float32[:,:], float32[:,:], float32[:,:])"

    @guvectorize([prototype], '(m,n),(n,p)->(m,p)', target='gpu')
    def matmul(A, B, C):
        m, n = A.shape
        n, p = B.shape
        for i in range(m):
            for j in range(p):
                C[i, j] = 0
                for k in range(n):
                    C[i, j] += A[i, k] * B[k, j]

    matrix_ct = 1000
    # creates an array of 1000 x 2 x 4
    A = np.arange(matrix_ct * 2 * 4,
                  dtype=np.float32).reshape(matrix_ct, 2, 4)
    # creates an array of 1000 x 4 x 5
    B = np.arange(matrix_ct * 4 * 5,
                  dtype=np.float32).reshape(matrix_ct, 4, 5)
    # outputs an array of 1000 x 2 x 5
    C = matmul(A, B)
  18. Using cuBLAS

    import numpy as np
    from numbapro.cudalib import cublas

    N = 128  # example size; the slide leaves N unspecified
    A = np.array(np.arange(N ** 2, dtype=np.float32).reshape(N, N))
    B = np.array(np.arange(N) + 10, dtype=A.dtype)
    D = np.zeros_like(A, order='F')

    # NumPy
    E = np.dot(A, np.diag(B))

    # cuBLAS
    blas = cublas.Blas()
    blas.gemm('T', 'T', N, N, N, 1.0, A, np.diag(B), 1.0, D)
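    The two paths compute the same product; a sanity check (assuming the setup above, with D in Fortran order so the column-major BLAS result lines up element-for-element):

    assert np.allclose(D, E)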
  19. FFT Convolution with cuFFT

    Works with 1-d, 2-d, and 3-d arrays.

    from numbapro import cuda, vectorize
    from numbapro.cudalib import cufft

    @vectorize(['complex64(complex64, complex64)'], target='gpu')
    def vmult(a, b):
        return a * b

    # host -> device
    d_img = cuda.to_device(img)    # image
    d_fltr = cuda.to_device(fltr)  # filter

    # FFT forward
    cufft.fft_inplace(d_img)
    cufft.fft_inplace(d_fltr)

    # multiply
    vmult(d_img, d_fltr, out=d_img)  # in place

    # FFT inverse
    cufft.ifft_inplace(d_img)

    # device -> host
    filtered_img = d_img.copy_to_host()
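    The snippet assumes img and fltr already exist on the host; a hypothetical setup, with the filter zero-padded to the image size as FFT convolution requires:

    import numpy as np

    img = np.random.rand(256, 256).astype(np.complex64)
    fltr = np.zeros_like(img)
    fltr[:5, :5] = 1.0 / 25.0   # 5x5 box filter, zero-padded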
  20. Monte-Carlo Pricing and cuRAND

    import math
    import numpy as np
    from numbapro import vectorize, cuda
    from numbapro.cudalib import curand

    @vectorize(['f8(f8, f8, f8, f8, f8)'], target='gpu')
    def step(last, dt, c0, c1, noise):
        return last * math.exp(c0 * dt + c1 * noise)

    def monte_carlo_pricer(paths, dt, interest, volatility):
        n = paths.shape[0]
        blksz = cuda.get_current_device().MAX_THREADS_PER_BLOCK
        gridsz = int(math.ceil(float(n) / blksz))  # launch shape for the tuned kernel below
        # Instantiate the cuRAND PRNG
        prng = curand.PRNG(curand.PRNG.MRG32K3A)
        # Allocate a device-side array
        d_normdist = cuda.device_array(n, dtype=np.double)
        c0 = interest - 0.5 * volatility ** 2
        c1 = volatility * math.sqrt(dt)
        # Simulation loop
        d_last = cuda.to_device(paths[:, 0])
        for j in range(1, paths.shape[1]):
            prng.normal(d_normdist, mean=0, sigma=1)
            d_paths = cuda.to_device(paths[:, j])
            step(d_last, dt, c0, c1, d_normdist, out=d_paths)
            d_paths.copy_to_host(paths[:, j])
            d_last = d_paths

    A tuned-kernel version of step:

    from numbapro import jit, cuda

    @jit('void(double[:], double[:], double, double, double, double[:])',
         target='gpu')
    def step(last, paths, dt, c0, c1, normdist):
        # cuda.grid(1) expands to
        # cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
        i = cuda.grid(1)
        if i >= paths.shape[0]:
            return
        noise = normdist[i]
        paths[i] = last[i] * math.exp(c0 * dt + c1 * noise)
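    A hypothetical driver for the pricer (all market parameters invented for illustration):

    import numpy as np

    n_paths, n_steps = 1000000, 100
    paths = np.zeros((n_paths, n_steps), order='F')  # columns contiguous, one per time step
    paths[:, 0] = 100.0                              # initial spot price
    monte_carlo_pricer(paths, 1.0 / 252, 0.02, 0.3)
    print(paths[:, -1].mean())                       # mean terminal price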
  21. CUDA Python

    CUDA development directly in Python.

    import numpy as np
    from numbapro import cuda
    from numba import autojit

    @autojit(target='gpu')
    def array_scale(src, dst, scale):
        tid = cuda.threadIdx.x
        blkid = cuda.blockIdx.x
        blkdim = cuda.blockDim.x
        i = tid + blkid * blkdim
        if i >= dst.shape[0]:  # the slide's bound n is the array length
            return
        dst[i] = src[i] * scale

    N = 1000                        # example size; the slide leaves N unspecified
    block = 256                     # threads per block
    grid = (N + block - 1) // block
    src = np.arange(N, dtype=np.float64)
    dst = np.empty_like(src)
    array_scale[grid, block](src, dst, 5.0)
  22. Example: Matrix multiply

    import numpy as np
    from numbapro import cuda
    from numba import f4

    bpg = 50          # blocks per grid
    tpb = 32          # threads per block
    n = bpg * tpb

    @cuda.jit(argtypes=[f4[:, :], f4[:, :], f4[:, :]])
    def cu_square_matrix_mul(A, B, C):
        sA = cuda.shared.array(shape=(tpb, tpb), dtype=f4)
        sB = cuda.shared.array(shape=(tpb, tpb), dtype=f4)
        tx = cuda.threadIdx.x
        ty = cuda.threadIdx.y
        bx = cuda.blockIdx.x
        by = cuda.blockIdx.y
        bw = cuda.blockDim.x
        bh = cuda.blockDim.y
        x = tx + bx * bw
        y = ty + by * bh
        acc = 0.
        for i in range(bpg):
            if x < n and y < n:
                sA[ty, tx] = A[y, tx + i * tpb]
                sB[ty, tx] = B[ty + i * tpb, x]
            cuda.syncthreads()
            if x < n and y < n:
                for j in range(tpb):
                    acc += sA[ty, j] * sB[j, tx]
            cuda.syncthreads()
        if x < n and y < n:
            C[y, x] = acc

    A = np.array(np.random.random((n, n)), dtype=np.float32)
    B = np.array(np.random.random((n, n)), dtype=np.float32)
    C = np.empty_like(A)

    stream = cuda.stream()
    with stream.auto_synchronize():
        dA = cuda.to_device(A, stream)
        dB = cuda.to_device(B, stream)
        dC = cuda.to_device(C, stream)
        cu_square_matrix_mul[(bpg, bpg), (tpb, tpb), stream](dA, dB, dC)
        dC.to_host(stream)
  23. How to get it

    •Anaconda Accelerate (an Anaconda Add-On)
    •Available as part of on-premise Wakari (enterprise.wakari.io)
    •On wakari.io, cloud-based Python (GPU instances coming soon)
    •Examples: http://github.com/ContinuumIO/numbapro-examples

    Anaconda users:
    conda install accelerate

    Other Python users:
    pip install conda
    conda init
    conda create -n test accelerate