Fast Numerical Computing and Deep Learning in Ruby with Cumo

Fast Numerical Computing and Deep Learning in Ruby with Cumo

Aef92c3acc29ad8543e04135687fc4f1?s=128

Naotoshi Seo

May 31, 2018
Tweet

Transcript

  1. Fast Numerical Computing and Deep Learning in Ruby with Cumo

    Naotoshi Seo RubyKaigi 2018 May 31
  2. This Talk • About Cumo (pronounced like “koomo”) which I

    am recently working personally with Ruby Association Grant 2017 • Cumo is a high speed numerical computing library for Ruby using CUDA (GPU) • Explains Cumo inside • Also, explains CUDA programming basics • GOAL: To increase people contributing to Ruby in Scientific Computing (Deep Learning) field 2 https://github.com/sonots/cumo
  3. Self Introduction • Naotoshi Seo @sonots • DeNA Co., Ltd.

    • CRuby committer • Recently working on development of DNN framework at Preferred Networks, Inc (出向) 3
  4. Outline 4 • Overview of Scientific Computing in Ruby •

    Why Cumo is needed • CUDA Programming Basics • Cumo Features • Notices (or Difficulties) I met • Feature Proposals to Ruby
  5. Why Cumo is need 5 ( Overview of Scientific Computing

    in Ruby )
  6. 4DJFOUJpD$PNQVUJOHJO1ZUIPO 6 NumPy CuPy PyCUDA Chainer TensorFlow MXNet Cython pybind11

    DNN Tensor CUDA binding Useful tools for writing bindings C++ C++ Python
  7. 4DJFOUJpD$PNQVUJOHJO3VCZ 7 Numo/NArray Cumo RbCUDA Red-chainer TensorFlow.rb MXNet.rb Rubex /"

    DNN Tensor CUDA binding Useful tools for writing bindings (or NMatrix) PyCall Binding to Python C++ C++ Ruby
  8. CUDA Programming Basics 8 ( Why GPU is required )

  9. Why GPU is required • Because it is faster •

    ex) 30 days in CPU -> 4 days in GPU 9 $6%"1SPHSBNNJOH#BTJDT
  10. Characteristics of GPU • GPU is bad at branching •

    GPU simplifies branch prediction and out-of-order mechanism instead. • GPU is suitable for matrix computation 10 • GPU is good at parallel computation • Order of magnitude is like 24 cores with CPU • 3,000 ~ 4,000 cores with GPU $6%"1SPHSBNNJOH#BTJDT
  11. CUDA architecture 1. Allocate GPU memory 2. Transfer data from

    host (CPU) to device (GPU) 3. Process at GPU 4. Transfer result from device (GPU) to host (CPU) 5. Free GPU memory 11 • Typically, CUDA programs are written as $6%"1SPHSBNNJOH#BTJDT
  12. Basic CUDA codes 12 __global__ void addArraysOnGPU(float *A, float *B,

    float *C, const int N) { int i = blockIdx.x * blockDim.x + threadIdx.x; C[i] = A[i] + B[i]; } int main(int argc, char **argv) { // malloc device global memory float *d_A, *d_B, *d_C; cudaMalloc((float**)&d_A, nBytes); cudaMalloc((float**)&d_B, nBytes); cudaMalloc((float**)&d_C, nBytes); // transfer data from host to device cudaMemcpy(d_A, h_A, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(d_B, h_B, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(d_C, gpuRef, nBytes, cudaMemcpyHostToDevice); // invoke kernel addArraysOnGPU<<<grid, block>>>(d_A, d_B, d_C, nElem); // copy kernel result back to host side cudaMemcpy(gpuRef, d_C, nBytes, cudaMemcpyDeviceToHost); // free device global memory CHECK(cudaFree(d_A)); CHECK(cudaFree(d_B)); CHECK(cudaFree(d_C)); } $6%"1SPHSBNNJOH#BTJDT
  13. Asynchrounous 13 CPU GPU Kernel1 Idle • Kernel execution is

    asynchronous • cudaMalloc, cudaMemcpy, cudaFree are synchronous Launch kernel1 cudaMalloc cudaMemcpy (H2D) cudaFree cudaMemcpy (D2H) $6%"1SPHSBNNJOH#BTJDT
  14. GPU is like a job queue • You may assume

    GPU is a job queue • CPU enqueues a job to GPU (launches a kernel) • The point is that CPU latency is hidden as long as it finishes earlier than GPU • Please note that cudaMalloc/Free requires to synchronize GPU and CPU (I will talk how to avoid later) 14 launch sync GPU CPU Idle get results $6%"1SPHSBNNJOH#BTJDT
  15. Outline 15 • Overview of Scientific Computing in Ruby •

    Why Cumo is needed • CUDA Programming Basics • Cumo Features • Notices (or Difficulties) I met • Feature Proposals to Ruby
  16. Cumo Features • Highly compatible with Ruby/Numo • Element-wise operations

    • Reduction operations • Dot operation using cuBLAS • CUDA memory pool • JIT compilation of user-defined functions 16
  17. Highly Compatible with Numo • Ruby/Numo users can easily switch

    into Cumo to leverage power of GPU 17 pOEOBNF SCca YBSHTTFEJa FT/VNP$VNPHa FTOVNPDVNPH JGHQV
 SFRVJSFDVNPOBSSBZ
 9VNP$VNP
 FMTF
 SFRVJSFOVNPOBSSBZ
 9VNP/VNP
 FOE
 B9VNP%'MPBU[FSPT   
 C9VNP%'MPBUPOFT   
 DB C )JHIMZDPNQBUJCMFXJUI/VNP
  18. Element-wise opeations • Element-wise is like matrix additions • All

    elements are independent • Easy to perform in parallel 18 1 2 3 4 5 6 2 3 4 5 6 7 + A B 3 5 7 9 11 13 = C       Thread
 IDs &MFNFOUXJTFPQFSBUJPOT
  19. Reduction opeations • Like sum • sum([1,2,3,4]) #=> 10 •

    Elements are not independent • Not so easy to perform in parallel 19 3FEVDUJPOPQFSBUJPOT
  20. 20 http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf 3FEVDUJPODPNQVUBUJPOT

  21. 21 3FEVDUJPODPNQVUBUJPOT

  22. Dot product using cuBLAS • Dot is more complicated than

    reduction • NVIDIA's cuBLAS library supports it as GEMM (GEneral matrix-matrix mulitplication) and fast • However, cuBLAS supports only f-contiguous (column major) although we write CRuby extensions in C (c-contiguous, raw-major) 22 %PUQSPEVDU 1 2 3 4 5 6 7 8 9 1 4 7 2 5 8 3 6 9 C-contiguous F (Fortran)
  23. Using cuBLAS with c- contiguous data 23 A = [1,

    2, 3, 4, 5, 6] B = [1, 2, 3, 4, 5, 6] C = [9, 12, 15, 19, 26, 33, 29, 40, 51] A = [1, 2, 3, 4, 5, 6] B = [1, 2, 3, 4, 5, 6] C = [9, 12, 15, 19, 26, 33, 29, 40, 51] $DPOUJVHPVT 3PXNBKPS 'DPOUJVHPVT $PMVNONBKPS No data copy, changing only attributes (shape) https://www.christophlassner.de/using-blas-from-c-with-row-major-data.html %PUQSPEVDU
  24. CUDA memory pool 24

  25. Why We Need Memory Pool 25 $6%".FNPSZ1PPM • cudaMalloc /

    cudaFree makes slow • memory allocation / free themselves are slow • cudaMalloc synchronizes CPU and GPU CPU GPU Free Kernel1 cudaFree Launch synchronize Idle Kernel2 something Idle Launch cudaMalloc synchronize Malloc
  26. Memory Pool 26 $6%".FNPSZ1PPM • Cache to memory pool •

    Avoid cudaMalloc / Free as much as possible
  27. (Maybe) The Simplest Way 27 $6%".FNPSZ1PPM 1. Round up memory

    size by 512 2. Append into a list (free list) 3. Search the list to reuse chunk 512 2048 free_list 1024 0 /
  28. Best-fit algorithm 28 $6%".FNPSZ1PPM 512 1024 1536 2048 2560 ….

    free bins or arena chunks size free_list or bin 1. Round up memory size by 512 2. cudaMalloc, and use it 3. Push to arena intead of cudaFree 4. Pop from arena if a block of exactly same size is available in arena instead of cudaMalloc O(1) to find a chunk of 2560
  29. Issues on Best-Fit 29 $6%".FNPSZ1PPM • Cache miss occurs even

    if there exist memory blocks of larger sizes than the required size (best-fit). • The cache miss typically occurs for Natural Language Processing applications whose input data size are varying. 512 1024 1536 2048 2560 …. size Want How about using cache of larger size (2560)?
  30. Best-fit with coalesing (BFC) 30 $6%".FNPSZ1PPM 512 1024 1536 2048

    2560 …. 1. Pop a chunk if larger size than required size is available 2. Split and use only necessary size. Push back a chunk of remained size Split Pop 512 2048 512 1024 1536 2048 2560 …. Push (1) (2) Split and merge 4QMJU Split next prev want to use The one glibc malloc uses
  31. Best-fit with coalesing (BFC) 31 $6%".FNPSZ1PPM 1. Merge the chunk

    to free with next or prev chunks in free lists 2. Push back the merged chunk Merge 512 2048 512 1024 1536 2048 2560 …. Push .FSHF next prev
  32. 32 • Cumo supports users to write their own CUDA

    kernel on Ruby • JIT compile using NVRTC (NVIDIA Runtime Compilation), and caches it on file system. JIT compiling user-defined functions kernel = Cumo::ElementwiseKernel.new(
 'float32 x, float32 y, float32 z',
 'float32 w', # output type
 'w = (x * y) + z;', # CUDA code
 'my_kernel')
 w = kernel.call(x, y, z) +*5DPNQJMJOHVTFSEFpOFEGVODUJPOT
  33. Performance Comparison with Numo 33

  34. Element-wise kernel 4J[F /VNP $VNP ?   ? 

     ?   ?   ?   a = Xumo::Float32.ones(size) b = Xumo::Float32.ones(size) a + b 40 times faster for size of 10^8 34 Smaller is better UIJT Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz NVIDIA Volta v100 (AWS p3 xlarge) 1FSGPSNBODF$PNQBSJTPOXJUI/VNP
  35. Dot product 35 4J[F /VNP $VNP ?   ?

      ?   ?   ?   a = Xumo::Float32.ones(100, size/100) b = Xumo::Float32.ones(size/100, 100) a.dot(b) 2800 times faster for size of 10^8 UIJT ※ Numo without Numo/Linalg Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz NVIDIA Volta v100 (AWS p3 xlarge) Smaller is better 1FSGPSNBODF$PNQBSJTPOXJUI/VNP
  36. red-chainer mnist example 36 • 20 times faster w/o memory

    pool • 75 times faster w/ memory pool Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz NVIDIA Volta v100 (AWS p3 xlarge) 1FSGPSNBODF$PNQBSJTPOXJUI/VNP
  37. Outline 37 • Overview of Scientific Computing in Ruby •

    Why Cumo is needed • CUDA Programming Basics • Cumo Features • Notices (or Difficulties) I met • Feature Proposals to Ruby
  38. Notices (or Difficulties) I met • GPU unfriendness with GC

    • Difficulties compiling CUDA kernels • Lack of mkmf features • Reduction kernels synchronize with CPU • Broadcast operations were slow 38
  39. GPU unfriendness with GC • One criteria to perform GC

    in Ruby is main memory usage (malloc_limit) • GPU memory usage is not taken into account • In the case of CuPy, because Python uses reference counting, we could release GPU memory immediately after the array object is not referenced anymore. 39 def add
 a = Cumo::DFloat.ones(3, 5)
 b = Cumo::DFloat.ones(3, 5)
 a + b end c = add a and b are not immediately freed (16VOGSJFOEOFTTXJUI($
  40. GPU unfriendness with GC (2) • (Partial) Solution • Added

    NArray#free to release memory to GPU on user-desired timing • Future work? • Something like NSAutoreleasePool to release all (or restricted) objects created inside a scope. 40 def add
 a = Cumo::DFloat.ones(3, 5)
 b = Cumo::DFloat.ones(3, 5)
 c = a + b a.free; b.free c end c = add a and b are immediately freed NSAutoreleasePool *pool = \ [[NSAutoreleasePool alloc] init]; NSObject *obj = \ [[[NSObject alloc] init] autorelease]; .... [pool release]; (16VOGSJFOEOFTTXJUI($
  41. 41 • Need to use nvcc (NVIDIA CUDA Compiler) instead

    of gcc to compile CUDA kernels. • However, mkmf supports to specify only CC and CXX compilers (no .cu file) • Solution: Made a wrapper ruby script • For files with .cu extensions, use nvcc • For files with .c extensions, use gcc Lack of mkmf features %J⒏DVMUJFTDPNQJMJOH$6%"LFSOFMT
  42. 42 • Numo returns a Ruby numeric object for reduction

    kernels (for cases of 0-dimensional NArray). • In Cumo, needs to copy GPU memory to host memory to create a Ruby nemeric object. • It results in synchronization with CPU. • Solution: Introduced partial incompatibility with Numo to return 0-dimensional NArray. Reduction Kernels Synchronize with CPU Numo::Int64.ones(2, 3).sum #=> 6
 Cumo::Int64.ones(2, 3).sum #=> Cumo::Int64#shape=[] 6 Returns a 0-dimensional NArray instead of a Ruby numeric object to avoid CPU and GPU synchronization. 3FEVDUJPO,FSOFMT4ZODISPOJ[FXJUI$16
  43. 1 2 3 43 • Broadcast Broadcast operations were slow

    #SPBEDBTUPQFSBUJPOTXFSFTMPX 1 2 3 4 5 6 7 8 9 10 11 12 + 1 2 3 = 2 4 6 5 7 9 8 10 12 11 13 15 1 2 3 1 2 3 4 x 3 1 x 3 4 x 3
  44. 44 How Numo Treats Broadcast Example) 1000 x 3 array

    + 1 x 3 array user loop: loop for 3 narray loop: loop for 1000 int nd = 1; int shape[] = {1000}; for (int i=0; i<nd;++i) { for (int j=0; j<shape[i]; ++j) { (*(nf->func))(&(lp->user)); } } int size = 3; for (int i=0; i<size;++i) { p3[i] = p1[i] + p[2]; } #SPBEDBTUPQFSBUJPOTXFSFTMPX
  45. 45 Launches Many CUDA Kernels user loop: loop for 3

    narray loop: loop for 1000 int nd = 1; int shape[] = {1000}; for (int i=0; i<nd;++i) { for (int j=0; j<shape[i]; ++j) { (*(nf->func))(&(lp->user)); } } __global__ void my_kernel( int* p3, int* p2, int* p1) { int i = blockIdx.x * blockDim.x + threadIdx.x; p3[i] = p1[i] + p2[i]; } Launches CUDA Kernels 1000 times. • In first implementation of Cumo, modified user loop implementation to CUDA kernels #SPBEDBTUPQFSBUJPOTXFSFTMPX
  46. 46 How Slow launching CUDA kernels Type Time(%) Time Calls

    Avg Min Max Name GPU activities: 99.89% 19.439ms 1000 19.439us 18.880us 21.312us cumo_sfloat_add API calls: 27.23% 330.78ms 13 25.445ms 35.083us 68.418ms cudaDeviceSynchronize 26.34% 319.98ms 1 319.98ms 319.98ms 319.98ms cuCtxCreate 25.32% 307.66ms 1477 208.30us 13.408us 275.62ms cudaMallocManaged 2.58% 18.703ms 1002 18.665us 16.184us 216.70us cudaLaunch nvprof • 18 micro second • Time to take cudaLaunch is almost equivalent with adding two arrays of 500,000 elements. • Also, there is a limit of CUDA queue size, e.g., 1,024. #SPBEDBTUPQFSBUJPOTXFSFTMPX
  47. 47 • Finally, stopped using ndloop Solution: Stop using ndloop

    /EMPPQJTTMPXGPS$VNP #SPBEDBTUPQFSBUJPOTXFSFTMPX a = Cumo::DFloat.ones(1000, 768)
 b = Cumo::DFloat.ones(1000, 1)
 a + b 56 times faster
  48. Outline 48 • Overview of Scientific Computing in Ruby •

    Why Cumo is needed • CUDA Programming Basics • Cumo Features • Notices (or Difficulties) I met • Feature Proposals to Ruby
  49. Feature Proposals to Ruby • Inplace math operations • Temporary

    variable 49
  50. Inplace Math Operations • a += b is an abridged

    notation of a = a + b • Imagine a is a large matrix requiring 1GB. • a += b needs to allocate a new 1GB matrix. • Want to redefine for Cumo::NArray objects. • Current compromise: • Python allows to redefine +=. 50 https://bugs.ruby-lang.org/issues/14701 a.inplace + b 'FBUVSF1SPQPTBMTUP3VCZ
  51. Temporary Variable • In python, we can find a variable

    is a temporary or not by seeing reference counts • In NumPy, • is faster than • because (x + 1) is a temporary variable and new memory is not required to compute (x + 1) + 1 51 https://bugs.ruby-lang.org/issues/14710 y = x + 1 + 1 y = x + 1 y + 1 'FBUVSF1SPQPTBMTUP3VCZ
  52. At the end 52

  53. Future Works • Support cuDNN for high performance convolutional networks

    • Support more functions (28/80 files are left) • User-defined kernel (not yet completed) • Kernel fusion • Conversion between Numo::NArray and Cumo::NArray 53
  54. Help Wanted • Resolve future works • support more functions

    (28/80 files are left) • GPU CI server .... 54
  55. Talk Summary 55 • Cumo is CUDA version of Ruby/Numo,

    and easy to switch vice versa • GPU is fast and essential in deep learning • CUDA is like a job queue • Red-chainer mnist example worked 75 times faster with Cumo than Numo • Help wanted: GPU CI server …
  56. Acknowledgements • DeNA Co., Ltd for supporting business trip •

    Ruby Association for Grant • Money - GPU machines cost much • Time keeper • Motivation • @mrkn for his mentoring on the grant • @masa16 for answering my questions about Numo • @hatappi and @naitoh for their work of red-chainer • red-data-tools org and Speee, Inc for hosting meetup. • Preferred Networks, Inc and developers (including me) of Chainer/CuPy for reference implementation • And, my wife for giving time to develop 56