Shohei Hido - CuPy: A NumPy-compatible Library for GPU

A NumPy-compatible Library for GPU
Shohei Hido, VP of Research, Preferred Networks

Preferred Networks: An AI Startup in Japan
• Founded: March 2014 (120 engineers and researchers)
• Major news
  ◦ $100+M investment from Toyota for autonomous driving
  ◦ 2nd place at Amazon Robotics Challenge 2016
  ◦ Fastest ImageNet training on GPU cluster (15 minutes using 1,024 GPUs)
• Deep learning research for industrial applications: manufacturing, automotive, healthcare

Key takeaways
• CuPy is an open-source NumPy for NVIDIA GPU
• Python users can easily write CPU/GPU-agnostic code
• Existing NumPy code can be accelerated thanks to GPU and CUDA libraries

• What is CuPy
• Example: CPU/GPU agnostic implementation of k-means
• Introduction to CuPy
• Recent updates & conclusion

CuPy: A NumPy-Compatible Library for NVIDIA GPU
• NumPy is extensively used in Python, but it does not support GPUs
• GPUs are getting faster and more important for scientific computing
CPU (NumPy):
    import numpy as np
    x_cpu = np.random.rand(10)
    W_cpu = np.random.rand(10, 5)
    y_cpu = np.dot(x_cpu, W_cpu)
NVIDIA GPU (CuPy):
    import cupy as cp
    x_gpu = cp.random.rand(10)
    W_gpu = cp.random.rand(10, 5)
    y_gpu = cp.dot(x_gpu, W_gpu)
Transfer between CPU and GPU:
    y_gpu = cp.asarray(y_cpu)   # CPU -> GPU
    y_cpu = cp.asnumpy(y_gpu)   # GPU -> CPU
CPU/GPU-agnostic:
    for xp in [np, cp]:         # the modules imported above
        x = xp.random.rand(10)
        W = xp.random.rand(10, 5)
        y = xp.dot(x, W)
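The loop above passes the module in explicitly. A function can also look the module up from its argument, which is the purpose of `cupy.get_array_module`. The sketch below wraps that call in a hypothetical helper, `array_module`, that falls back to NumPy when CuPy is not installed. The `softmax` function is only an illustration and does not come from the slides.

```python
import numpy as np


def array_module(x):
    """Return cupy for CuPy arrays and numpy otherwise (hypothetical helper)."""
    try:
        import cupy
    except ImportError:
        # CuPy is not installed, so x can only be a NumPy array.
        return np
    return cupy.get_array_module(x)


def softmax(x):
    """Numerically stable softmax over the last axis, on CPU or GPU."""
    xp = array_module(x)
    shifted = x - x.max(axis=-1, keepdims=True)
    e = xp.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


probs = softmax(np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]))
```

Passing a CuPy array to `softmax` should run the same lines on the GPU, since only `xp.exp` depends on the module.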

CuPy is actively developed (1,600+ GitHub stars, 11,000+ commits)
Ryosuke Okuta, CTO, Preferred Networks

Python libraries powered by CuPy
• Deep learning framework: https://chainer.org/
• Probabilistic and graphical modeling: https://github.com/jmschrei/pomegranate
• Natural language processing: https://spacy.io/

Reputation (1/2): Travis Oliphant, creator of NumPy and SciPy

Reputation (2/2): Stephen Merity of Salesforce Research (MetaMind)

Our mission: make CuPy the default tool for GPU computation in Python
https://anaconda.org/anaconda/cupy/
• CuPy is now available on Anaconda in collaboration with the Anaconda team
• You can install CuPy with “$ conda install cupy” on Linux 64-bit
• We are working on a Windows version

Don’t have a GPU for CuPy? Google Colaboratory gives you one (for free!) …

Implementation of CPU/GPU agnostic k-means fit(): 37 lines https://github.com/cupy/cupy/blob/master/examples/kmeans/kmeans.py

K-means (1/3): Call function and initialization
• fit() follows the training API of scikit-learn
• xp represents either numpy or cupy
• Cluster centers are initialized by positions of random samples
(Slide annotation on the fit() call: specify NumPy or CuPy here)
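The initialization step can be sketched as follows. This is not the code from the linked example: the function name `init_centers`, its `seed` parameter, and the choice to draw the indices on the host with NumPy are assumptions made so that one code path serves both modules.

```python
import numpy as np


def init_centers(X, n_clusters, xp=np, seed=None):
    """Pick n_clusters distinct samples of X as the initial centers.

    xp is the array module X belongs to (numpy here; cupy on a GPU).
    """
    n_samples = len(X)
    # Draw the indices on the host so the same lines work for both modules.
    rng = np.random.RandomState(seed)
    idx = rng.choice(n_samples, n_clusters, replace=False)
    # Move the indices to wherever X lives, then gather the chosen rows.
    return X[xp.asarray(idx)]


X = np.arange(20, dtype=np.float64).reshape(10, 2)
centers = init_centers(X, n_clusters=3, seed=0)
```

With a CuPy array, the call would be `init_centers(X_gpu, 3, xp=cupy)`; only the `xp` argument changes.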

K-means (2/3): Calculate distance to all of the cluster centers
• xp.linalg.norm computes the distances and is supported in both numpy and cupy
• _fit_calc_distances() uses a custom kernel on cupy
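A minimal sketch of the distance step, assuming samples are stored as rows of `X` and centers as rows of `centers`. The function name `assign_labels` is hypothetical. Broadcasting builds every sample-to-center difference at once, and `xp.linalg.norm` reduces over the feature axis.

```python
import numpy as np


def assign_labels(X, centers, xp=np):
    """Return, for each sample, the index of its nearest cluster center."""
    # (n_samples, 1, n_features) - (1, n_clusters, n_features)
    diff = X[:, None, :] - centers[None, :, :]
    # Euclidean distance of every sample to every center.
    distances = xp.linalg.norm(diff, axis=2)
    return xp.argmin(distances, axis=1)


X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
labels = assign_labels(X, centers)
```

The intermediate `diff` array has n_samples × n_clusters × n_features elements, which is one reason a custom kernel can be attractive on the GPU.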

Customized kernel with a C++ snippet in cupy.ElementwiseKernel
• A kernel is generated from an element-wise operation defined in a C++ snippet
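The kernel on the slide is not captured in this transcript, so the sketch below uses a simpler squared-difference kernel instead. `cupy.ElementwiseKernel` takes the input arguments, the output arguments, the C++ body, and a kernel name. The GPU part is guarded so that the block still runs on a machine without CuPy or a CUDA device. The names `squared_diff_cpu`, `squared_diff_gpu`, and `HAVE_GPU` are illustrative.

```python
import numpy as np


def squared_diff_cpu(x, y):
    """NumPy reference for what the kernel below computes per element."""
    return (x - y) * (x - y)


try:
    import cupy as cp
    # Raises (or returns 0) when no usable CUDA device is present.
    HAVE_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:
    HAVE_GPU = False

if HAVE_GPU:
    squared_diff_gpu = cp.ElementwiseKernel(
        'float32 x, float32 y',   # input arguments
        'float32 z',              # output argument
        'z = (x - y) * (x - y)',  # C++ body, executed once per element
        'squared_diff')           # kernel name

    x = cp.arange(6, dtype=cp.float32).reshape(2, 3)
    y = cp.ones(3, dtype=cp.float32)   # broadcast against x
    z = squared_diff_gpu(x, y)
    # The GPU result should match the NumPy reference.
    assert np.allclose(cp.asnumpy(z),
                       squared_diff_cpu(cp.asnumpy(x), cp.asnumpy(y)))
```

The kernel is compiled the first time it is called, so the first call is slower than later ones.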

K-means (3/3): Update positions of cluster centers
• xp.stack assembles the updated cluster centers and is supported in both numpy and cupy
• _fit_calc_center() is also based on a custom kernel
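The update step can be sketched without a custom kernel. The function name `update_centers` is hypothetical, and the sketch assumes that every cluster contains at least one sample. It multiplies by a membership mask rather than indexing with a boolean array, then combines the per-cluster means with `xp.stack`.

```python
import numpy as np


def update_centers(X, labels, n_clusters, xp=np):
    """Return the mean position of the samples assigned to each cluster.

    Assumes no cluster is empty; an empty cluster would divide by zero.
    """
    new_centers = []
    for i in range(n_clusters):
        member = (labels == i)                     # boolean, (n_samples,)
        total = (X * member[:, None]).sum(axis=0)  # sum of member samples
        new_centers.append(total / member.sum())   # divide by member count
    # Stack the per-cluster means into an (n_clusters, n_features) array.
    return xp.stack(new_centers)


X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
labels = np.array([0, 0, 1, 1])
centers = update_centers(X, labels, n_clusters=2)
```

Summing the member points and counting them is the same pair of quantities that the element-wise kernel in the example accumulates.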

Another element-wise kernel
• It simply sums all of the points inside each cluster and counts them

Performance comparison with NumPy
• CuPy is faster than NumPy even for simple manipulation of a large matrix: NumPy wins up to 10^5 elements, CuPy pulls ahead from around 10^6, and it is about 6x faster at 10^8
[Benchmark code]
Benchmark result:
    Size    CuPy [ms]   NumPy [ms]
    10^4       0.58        0.03
    10^5       0.97        0.20
    10^6       1.84        2.00
    10^7      12.48       55.55
    10^8      84.73      517.17
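The benchmark code itself is not part of this transcript, so the following is only a sketch of how such a comparison could be timed. The operation `a * 2.0 + 1.0` is a hypothetical stand-in for the slide's "simple manipulation", and `benchmark` and its `sync` parameter are illustrative names. GPU kernels launch asynchronously, so a fair timing has to synchronize before the clock is read.

```python
import time

import numpy as np


def benchmark(xp, size, repeat=5, sync=lambda: None):
    """Return the mean time in milliseconds of one element-wise pass.

    sync is called before each clock read; for CuPy it should block until
    the GPU has finished (for NumPy the default no-op is enough).
    """
    a = xp.arange(size, dtype=xp.float32)
    sync()
    start = time.perf_counter()
    for _ in range(repeat):
        b = a * 2.0 + 1.0   # hypothetical stand-in for the slide's operation
    sync()
    return (time.perf_counter() - start) / repeat * 1e3


for size in (10**4, 10**5, 10**6):
    print(size, benchmark(np, size))

# On a machine with CuPy and a GPU, the matching call would be:
#   import cupy as cp
#   benchmark(cp, size, sync=cp.cuda.Stream.null.synchronize)
```

Results depend heavily on the hardware and on the operation, so numbers from this sketch are not expected to match the table.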

Compatibility with NumPy
• Data types (dtypes)
  ◦ bool_, int8, int16, int32, int64, uint8, uint16, uint32, uint64, float16, float32, float64, complex64, and complex128
• All basic indexing
  ◦ indexing by ints, slices, newaxes, and Ellipsis
• Most of advanced indexing
  ◦ except indexing patterns with boolean masks
• Most of the array creation routines
  ◦ empty, ones_like, diag, etc...
• Most of the array manipulation routines
  ◦ reshape, rollaxis, concatenate, etc...
• All operators with broadcasting
• All universal functions for element-wise operations
  ◦ except those for complex numbers
• Linear algebra functions accelerated by cuBLAS
  ◦ including product: dot, matmul, etc...
  ◦ including decomposition: cholesky, svd, etc...
• Reduction along axes
  ◦ sum, max, argmax, etc...
• Sort operations implemented by Thrust
  ◦ sort, argsort, and lexsort
• Sparse matrix accelerated by cuSPARSE

Comparison with other Python libraries for/on CUDA
• CuPy is the only library that is designed for high compatibility with NumPy while still allowing users to write customized CUDA kernels for better performance
                                  CuPy   PyCUDA   MinPy*
    NVIDIA CUDA support            ✔       ✔        ✔
    CPU/GPU agnostic coding        ✔       -        ✔
    Automatic gradient support     **      -        ✔
    NumPy compatible interface     ✔       -        ✔
    User-defined CUDA kernel       ✔       ✔        -
* https://github.com/dmlc/minpy
** Autograd is supported by Chainer

Inside CuPy
• CuPy extensively relies on NVIDIA libraries for better performance
Layers (top to bottom): CuPy, NVIDIA libraries, CUDA, NVIDIA GPU
• Linear algebra: cuBLAS, cuSOLVER
• Sparse matrix: cuSPARSE
• DNN: cuDNN
• Random numbers: cuRAND
• Sort: Thrust
• Multi-GPU data transfer: NCCL
• Utility and user-defined CUDA kernel: CUDA

Looks very easy?
• CUDA and its libraries are not designed for Python or NumPy
  ━ CuPy is not just a wrapper of CUDA libraries for Python
  ━ CuPy is a fast numerical computation library on GPU with NumPy-compatible API
• NumPy specification is not documented
  ━ We have carefully investigated some unexpected behaviors of NumPy
  ━ CuPy tries to replicate NumPy’s behavior as much as possible
• NumPy’s behaviors vary between different versions
  ━ e.g., NumPy v1.14 changed the output format of __str__: `[ 0. 1.]` -> `[0. 1.]` (no space)

Advanced features of CuPy (1/2)
Memory pool
• Avoiding cudaMalloc is a common practice in CUDA programming
• CuPy supports memory pooling using Best-Fit with Coalescing (BFC) algorithm
• It reduces memory usage to 25% on seq2seq model
GPU memory profiler
• This enables function-wise memory profiling on Chainer
    Function name    Used Bytes   Acquired Bytes   Occurrence
    LinearFunction   5.16GB       0.18GB           3900
    ReLU             0.99GB       0.46GB           1300
    SoftMaxEnropy    7.71MB       5.08MB           1300
    Accuracy         0.62MB       0.35MB           700

Advanced features of CuPy (2/2)
Kernel fusion (experimental)
    import cupy as cp

    @cp.fuse()
    def fused_func(x, y, z):
        return (x * y) + z
• By adding the decorator @cp.fuse(), CuPy records the series of operations in the function
• Then it compiles a single kernel to execute the operations

What’s new in CuPy v4?
• Pre-built wheel packages of CuPy are now provided
  – cupy-cuda80, cupy-cuda90, and cupy-cuda91
  – $ pip install cupy-cuda80
• Memory pool is now the default allocator
  – Added line memory profiler using memory hook and traceback
• CUDA stream is fully supported
Using a stream as a context manager:
    stream = cupy.cuda.stream.Stream()
    with stream:
        y = cupy.linalg.norm(x)
    stream.synchronize()
Using a stream via use():
    stream = cupy.cuda.stream.Stream()
    stream.use()
    y = cupy.linalg.norm(x)
    stream.synchronize()

Newly added functions in v4
• cupy.argpartition, cupy.unravel_index, cupy.percentile, cupy.moveaxis
• cupy.blackman, cupy.hamming, cupy.hanning
• cupy.isclose, cupy.iscomplex, cupy.iscomplexobj, cupy.isfortran, cupy.isreal, cupy.isrealobj
• cupy.linalg.tensorinv
• cupy.random.shuffle, cupy.random.set_random_state, cupy.random.RandomState.tomaxint
• cupy.sparse.random, cupy.sparse.csr_matrix.eliminate_zeros, cupy.sparse.coo_matrix.eliminate_zeros, cupy.sparse.csc_matrix.eliminate_zeros
• cupyx.scatter_add
• cupy.fft
  ◦ Standard FFTs: fft, ifft, fft2, ifft2, fftn, ifftn
  ◦ Real FFTs: rfft, irfft, rfft2, irfft2, rfftn, irfftn
  ◦ Hermitian FFTs: hfft, ihfft
  ◦ Helper routines: fftfreq, rfftfreq, fftshift, ifftshift

CuPy v5 - planned features
• Windows support
• AMD GPU support via HIP
• More useful fusion function
    @cupy.fuse()
    def sample2(x, y):
        return cupy.sum(x + y, axis=0) * 2
• Add more functions (NumPy, SciPy)
• Add more probability distributions
• Provide simple CUDA kernel
• Support DLPack and Tensor Comprehensions
  – toDLPack() and fromDLPack()

Summary: CuPy is a drop-in replacement of NumPy for GPU
1. Highly compatible with NumPy
  ━ data types, indexing, broadcasting, operations
  ━ Users can write CPU/GPU-agnostic code
2. High performance on NVIDIA GPUs
  ━ cuBLAS, cuDNN, cuRAND, cuSPARSE, and NCCL
3. Easy to install
  ━ $ pip install cupy
  ━ $ conda install cupy
4. Easy to write custom kernels
  ━ ElementwiseKernel, ReductionKernel
CPU version (to GPU: replace np with cp):
    import numpy as np
    x = np.random.rand(10)
    W = np.random.rand(10, 5)
    y = np.dot(x, W)
GPU version (to CPU: replace cp with np):
    import cupy as cp
    x = cp.random.rand(10)
    W = cp.random.rand(10, 5)
    y = cp.dot(x, W)
Your contribution will be highly appreciated & we are hiring!


PyCon 2018
