Project Introduction
A GPU simplifies the branch prediction and out-of-order execution mechanisms instead.
• GPU is suitable for matrix computation
• GPU is fast, and has recently become essential for Deep Learning
• GPU is good at parallel computation
  • Order of magnitude: roughly 24 cores with a CPU vs. 3,000~4,000 cores with a GPU

Cumo Features
Convert existing Numo code into Cumo to leverage the power of GPU:

    find . -name "*.rb" | xargs sed -i \
      -e 's/Numo/Cumo/g' \
      -e 's/numo/cumo/g'

Or switch backends at runtime behind a Xumo constant:

    if gpu
      require 'cumo/narray'
      Xumo = Cumo
    else
      require 'numo/narray'
      Xumo = Numo
    end

    a = Xumo::DFloat.zeros(2, 3)
    b = Xumo::DFloat.ones(2, 3)
    c = a + b

Cumo Features
Matrix multiplication is handled differently than reduction:
• NVIDIA's cuBLAS library supports it as GEMM (GEneral Matrix-Matrix multiplication), and it is fast
• However, cuBLAS supports only F-contiguous (column-major) layouts, although we write CRuby extensions in C (C-contiguous, row-major); the standard workaround is sketched below

Layout of a 3x3 matrix holding 1..9:
    C-contiguous (row-major):             1 2 3 4 5 6 7 8 9
    F-contiguous (Fortran, column-major): 1 4 7 2 5 8 3 6 9

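A small sketch of that workaround, shown with Numo's dot for illustration (whether Cumo uses exactly this trick internally is an assumption here): a row-major buffer reinterpreted as column-major data is the transpose, so calling GEMM with the operands swapped yields (A*B)^T in column-major, which read back as row-major is exactly A*B.

    require 'numo/narray'

    a = Numo::DFloat.new(2, 3).seq   # row-major (C-contiguous) A
    b = Numo::DFloat.new(3, 4).seq   # row-major (C-contiguous) B

    # Swap the operands: B^T * A^T == (A * B)^T
    c_t = b.transpose.dot(a.transpose)
    c   = c_t.transpose

    p (c - a.dot(b)).abs.max   # => 0.0
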
CUDA Memory Pool
[Diagram: an arena of free lists keyed by rounded-up sizes (512, 1024, 1536, 2048, 2560, ...); blocks carry next/prev/use fields and can be split]
1. Round up the requested memory size to a multiple of 512
2. cudaMalloc if no free block is available
3. Push to the arena instead of cudaFree
4. Pop from the arena instead of cudaMalloc if a free block is available
Implemented Best-fit with Coalescing (BFC), the same strategy used in malloc(3). A sketch of steps 1-4 follows below.

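A minimal Ruby sketch of those four steps, with a stubbed cuda_malloc standing in for the real CUDA binding; it keeps one free list per rounded-up size and omits the best-fit search, splitting, and coalescing of the real BFC allocator.

    class MemoryPool
      ROUND = 512

      def initialize
        @arena = Hash.new { |h, k| h[k] = [] }  # rounded size => free blocks
      end

      def malloc(size)
        size = round_up(size)                   # 1. round up by 512
        @arena[size].pop || cuda_malloc(size)   # 4. pop from arena, else 2. cudaMalloc
      end

      def free(ptr, size)
        @arena[round_up(size)].push(ptr)        # 3. push to arena instead of cudaFree
      end

      private

      def round_up(size)
        (size + ROUND - 1) / ROUND * ROUND
      end

      def cuda_malloc(size)
        # Stand-in for the real cudaMalloc binding; allocates host memory here.
        String.new(capacity: size)
      end
    end

    pool = MemoryPool.new
    p1 = pool.malloc(1000)   # rounds up to 1024; arena empty, so "cudaMalloc"
    pool.free(p1, 1000)      # pushed onto the 1024 free list
    p2 = pool.malloc(600)    # also rounds up to 1024; reuses p1's block
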
Performance Comparison with Numo

    a = Xumo::Float32.ones(size)
    b = Xumo::Float32.ones(size)
    a + b

40 times faster than Numo for a size of 10^8.
[Bar chart: element-wise addition time by size; smaller is better]
Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz vs. NVIDIA Volta V100 (AWS p3 xlarge)

Performance Comparison with Numo

    a = Xumo::Float32.ones(100, size/100)
    b = Xumo::Float32.ones(size/100, 100)
    a.dot(b)

2,800 times faster than Numo for a size of 10^8 (※ Numo without Numo/Linalg).
[Bar chart: dot-product time by size; smaller is better]
Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz vs. NVIDIA Volta V100 (AWS p3 xlarge)

Difficulties
• Difficulties compiling CUDA kernels
• Lack of mkmf features
• Incompatibility with Numo is required in reduction kernels for performance
• Broadcast operations were slow

GPU Unfriendliness with GC
• The GC trigger in Ruby is main memory usage (malloc_limit)
• GPU memory usage is not taken into account
• In the case of CuPy, because Python uses reference counting, GPU memory can be released immediately once the array object is no longer referenced

    def add
      a = Cumo::DFloat.ones(3, 5)
      b = Cumo::DFloat.ones(3, 5)
      a + b
    end
    c = add
    # a and b are not immediately freed

GPU Unfriendliness with GC
• Added NArray#free to release GPU memory at user-desired timing

    def add
      a = Cumo::DFloat.ones(3, 5)
      b = Cumo::DFloat.ones(3, 5)
      c = a + b
      a.free; b.free
      c
    end
    c = add
    # a and b are immediately freed

• Future work? Something like NSAutoreleasePool to release all (or a restricted set of) objects created inside a scope (see the sketch below):

    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
    NSObject *obj = [[[NSObject alloc] init] autorelease];
    ....
    [pool release];

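For that future-work item, such a pool could be sketched in Ruby on top of NArray#free; the ReleasePool class and its track/scope methods below are hypothetical illustrations, not Cumo API.

    require 'cumo/narray'

    class ReleasePool
      def initialize
        @objects = []
      end

      # Register an array so the pool frees it when the scope ends.
      def track(narray)
        @objects << narray
        narray
      end

      # Run a block; free every tracked array except the block's result.
      def self.scope
        pool = new
        result = yield(pool)
        pool.release(except: result)
        result
      end

      def release(except: nil)
        @objects.each { |o| o.free unless o.equal?(except) }
        @objects.clear
      end
    end

    c = ReleasePool.scope do |pool|
      a = pool.track(Cumo::DFloat.ones(3, 5))
      b = pool.track(Cumo::DFloat.ones(3, 5))
      a + b   # the result escapes the scope; a and b are freed on exit
    end
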
Difficulties Compiling CUDA Kernels
Lack of mkmf features:
• We need to use nvcc instead of gcc to compile CUDA kernels
• However, mkmf supports specifying only the CC and CXX compilers (no support for .cu files)
• Solution: made a wrapper Ruby script (sketched below)
  • For files with .cu extensions, use nvcc
  • For files with .c extensions, use gcc

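A minimal sketch of such a wrapper (an illustration, not Cumo's actual script): mkmf is pointed at this script as its C compiler, and it forwards the command line to nvcc or gcc based on the source extension.

    #!/usr/bin/env ruby
    # Forward the compile command to nvcc for .cu sources, otherwise to gcc.
    args = ARGV.dup
    compiler = args.any? { |a| a.end_with?('.cu') } ? 'nvcc' : 'gcc'
    exec compiler, *args
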
Incompatibility with Numo is Required in Reduction
• Numo returns a Ruby numeric object from reduction kernels (for cases that would be a 0-dimensional NArray)
• In Cumo, creating a Ruby numeric object requires copying GPU memory to host memory
• That copy results in synchronization between GPU and CPU
• Solution: introduced a partial incompatibility with Numo and return a 0-dimensional NArray

    Numo::Int64.ones(2, 3).sum  #=> 6
    Cumo::Int64.ones(2, 3).sum  #=> Cumo::Int64#shape=[] 6

Cumo returns a 0-dimensional NArray instead of a Ruby numeric object to avoid CPU-GPU synchronization.

Broadcast Operations Were Slow
nvprof output:

                    Time(%)      Time  Calls       Avg       Min       Max  Name
    GPU activities:  99.89%  19.439ms   1000  19.439us  18.880us  21.312us  cumo_sfloat_add
    API calls:       27.23%  330.78ms     13  25.445ms  35.083us  68.418ms  cudaDeviceSynchronize
                     26.34%  319.98ms      1  319.98ms  319.98ms  319.98ms  cuCtxCreate
                     25.32%  307.66ms   1477  208.30us  13.408us  275.62ms  cudaMallocManaged
                      2.58%  18.703ms   1002  18.665us  16.184us  216.70us  cudaLaunch

• cudaLaunch takes about 18 microseconds per call
• That is roughly the time it takes to add two arrays of 500,000 elements, so launch overhead dominates small kernels
• Also, there is a limit on the CUDA queue size, e.g., 1,024
An illustrative broadcast example follows.

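For context, a broadcast operation in the Numo/Cumo API looks like the following (an illustrative example, not taken from the slides); each element-wise operation like this pays the ~18 microsecond launch cost.

    require 'cumo/narray'

    a = Cumo::SFloat.ones(1000, 500)
    b = Cumo::SFloat.ones(500)
    c = a + b   # b is broadcast across the rows of a; one kernel launch
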
Feature Proposals to Ruby
https://bugs.ruby-lang.org/issues/14701
• In Ruby, a += b is just notation for a = a + b
• Imagine a is a large matrix requiring 1GB: a += b needs to allocate a new 1GB matrix
• Python allows redefining +=, and we want to redefine it for Cumo::NArray objects
• Current compromise: a.inplace + b (illustrated below)

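Concretely, the difference looks like this (using the inplace API the slide names):

    require 'cumo/narray'

    a = Cumo::DFloat.ones(10_000)
    b = Cumo::DFloat.ones(10_000)

    a += b          # sugar for a = a + b: allocates a whole new array
    a.inplace + b   # current compromise: adds b into a's existing memory
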
Feature Proposals to Ruby
https://bugs.ruby-lang.org/issues/14710
• Proposal: a way to know whether an object is a temporary or not by looking at reference counts
• In NumPy,

    y = x + 1 + 1

  is faster than

    y = x + 1
    y + 1

  because (x + 1) is a temporary whose buffer can be reused, so no new memory is required to compute (x + 1) + 1

Acknowledgements
• The grant, because GPU machines cost much
• @mrkn for his mentoring on the grant (time keeper, motivation)
• @masa16 for answering my questions about Numo
• @hatappi and @naitoh for their work on red-chainer
• The red-data-tools org and Speee, Inc. for hosting meetups
• Preferred Networks, Inc. and the developers (including me) of Chainer/CuPy for the reference implementation
• And my wife, for giving me time to develop