CuPy improvments around memory

1 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. $V1Z *NQSPWFNFOUT
BSPVOENFNPSZ 2017/10/24

Copyright (C) DeNA Co.,Ltd. All Rights Reserved. 4FMG*OUSPEVDUJPO 2 ・Naotoshi
Seo @sonots ・AI System Dept. ・Analytics Infra Group ・Tools Team (Supervisor) ・Cloud Infra (Supervisor) ・Fluentd & Ruby Committer ・Recently, “出向” to PFN ・Chainer/CuPy core dev

3 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. • I
do not talk • Why we use GPU • What is CuPy • I talk • Basic CUDA programming • CuPy improvements %JTDMBJNFS

.FNPSZQPPMJNQSPWFNFOUT XJUICFTUpUXJUIDPBMFTDJOH 4 Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
1 2 3 CuPy Improvements (16NFNPSZQSPpMFSXJUI $V1ZNFNPSZIPPL 4VQQPSU$6%"TUSFBN 0 #BTJD$6%"1SPHSBNNJOH

5 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. #BTJD$6%"QSPHSBNNJOH Typically,
CUDA programs are written as 1. Allocate GPU memory 2. Copy CPU memory to GPU memory 3. Launch CUDA kernel (a function executed by GPU) 4. Copy GPU memory to CPU memory 5. Free GPU memory Basic CUDA programming

6 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. #BTJD$6%"QSPHSBNNJOH Basic
CUDA programming __global__ void sumArraysOnGPU(float *A, float *B, float *C, const int N) { int i = blockIdx.x * blockDim.x + threadIdx.x; C[i] = A[i] + B[i]; } int main(int argc, char **argv) { // malloc device global memory float *d_A, *d_B, *d_C; cudaMalloc((float**)&d_A, nBytes); cudaMalloc((float**)&d_B, nBytes); cudaMalloc((float**)&d_C, nBytes); // transfer data from host to device cudaMemcpy(d_A, h_A, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(d_B, h_B, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(d_C, gpuRef, nBytes, cudaMemcpyHostToDevice); // invoke kernel sumArraysOnGPU<<<grid, block>>>(d_A, d_B, d_C, nElem); // copy kernel result back to host side cudaMemcpy(gpuRef, d_C, nBytes, cudaMemcpyDeviceToHost); // free device global memory CHECK(cudaFree(d_A)); CHECK(cudaFree(d_B)); CHECK(cudaFree(d_C)); }

7 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. #BTJD$6%"QSPHSBNNJOH Basic
CUDA programming CPU GPU Kernel1 Idle • Kernel execution is asynchronous • cudaMalloc, cudaMemcpy, cudaFree are synchronous Launch kernel1 cudaMalloc cudaMemcpy (H2D) cudaFree cudaMemcpy (D2H)

1 2 3 CuPy Improvements (16NFNPSZQSPpMFSXJUI $V1ZNFNPSZIPPL 4VQQPSU$6%"TUSFBN

9 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. 8IZXFOFFE.FNPSZQPPM •
cudaMalloc / cudaFree are slow • memory allocation / free themselves are slow • cudaMalloc synchronizes CPU and GPU Memory pool improvement CPU GPU Free Kernel1 cudaFree Launch synchronize Idle Kernel2 something Idle Launch cudaMalloc synchronize Malloc

Copyright (C) DeNA Co.,Ltd. All Rights Reserved. .FNPSZQPPM Memory pool
improvement • Letʼs cache to memory pool • Avoid cudaMalloc / cudaFree as much as possible

Copyright (C) DeNA Co.,Ltd. All Rights Reserved. .FNPSZQPPMPG$V1Z #FTUpUBMHPSJUIN Memory
pool improvement 512 1024 1536 2048 2560 …. free bins or arena chunks size free_list or bin 1. Round up memory size by 512 2. cudaMalloc, and use it 3. Push to arena intead of cudaFree 4. Pop from arena if a block of exactly same size is available in arena instead of cudaMalloc

12 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. *TTVFTPOCFTUpUBMHPSJUIN •
Cache miss occurs even if there exist memory blocks of larger sizes than the required size (best-ﬁt). • The cache miss typically occurs for Natural Language Processing applications whose input data size are varying. Memory pool improvement 512 1024 1536 2048 2560 …. size Want How about using cache of larger size (2560)?

Copyright (C) DeNA Co.,Ltd. All Rights Reserved. .FNPSZQPPMPG$V1Z #FTUpUXJUIDPBMFTDJOH #'$
Memory pool improvement 512 1024 1536 2048 2560 … 1. Pop a chunk if larger size than required size is available 2. Split and use only necessary size. Push back a chunk of remained size Split Pop 512 2048 512 1024 1536 2048 2560 …. Push (1) (2) Split and merge 4QMJU Split next prev want to use

Copyright (C) DeNA Co.,Ltd. All Rights Reserved. .FNPSZQPPMPG$V1Z #FTUpUXJUIDPBMFTDJOH #'$
Memory pool improvement 1. Merge the chunk to free with next or prev chunks in free lists 2. Push back the merged chunk Merge 512 2048 512 1024 1536 2048 2560 …. Push .FSHF next prev

Copyright (C) DeNA Co.,Ltd. All Rights Reserved. 5IFF⒎FDUPG#'$ Memory pool
improvement • For situations: • 1. Previous input data size is large • 2. Next input data size is smaller • We can still use a cached memory

16 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. 5IFF⒎FDUPG#'$ hQps://github.com/cupy/cupy/pull/168
Memory pool improvement 1/4 memory usage! And, 1.3 times faster

1 2 3 CuPy 2.0 Improvements (16NFNPSZQSPpMFSXJUI $V1ZNFNPSZIPPL 4VQQPSU$6%"TUSFBN

18 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. 8IBUPGUFOIBQQFOT cupy.cuda.runUme.CUDARunUmeError:
cudaErrorMemoryAllocaUon: out of memory • How to investigate it? • Use GPU memory proﬁler GPU memory profiler

19 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. (16$IBJOFS'VODUJPO.FNPSZ1SPpMFS BWBJMBCMFJO$IBJOFS
from chainer.func.on_hooks import CupyMemoryProﬁleHook hook = CupyMemoryProﬁleHook() with hook: trainer.run() hook.print_report() Bytes used from CuPy mem pool Bytes acquired from GPU device GPU memory profiler

20 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. import cupy.cuda.memory_hooks
hook = cupy.cuda.memory_hooks.LineProﬁleHook() with hook: # some cupy codes hook.print_report() _root (4.00KB, 4.00KB) lib/python3.6/runpy.py:193:_run_module_as_main (4.00KB, 4.00KB) lib/python3.6/runpy.py:85:_run_code (4.00KB, 4.00KB) lib/python3.6/uniQest/__main__.py:18:<module> (4.00KB, 4.00KB) lib/python3.6/uniQest/main.py:94:__init__ (4.00KB, 4.00KB) lib/python3.6/uniQest/main.py:255:runTests (4.00KB, 4.00KB) $V1Z-JOF.FNPSZ1SPpMFS BWBJMBCMFJO$V1ZB Bytes used from CuPy mem pool Bytes acquired from GPU device GPU memory profiler

21 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. ':*$IBJOFS'VODUJPO5JNF1SPpMFS BWBJMBCMFJO$IBJOFS
from chainer.func.on_hooks import TimerHook hook = TimerHook() with hook: trainer.run() hook.print_report() FunctionName ElapsedTime Occurrence LinearFunction 1.24sec 3900 ReLU 593.05ms 2600 SoftmaxCrossEntropy 824.11ms 1300 Accuracy 176.54ms 700 GPU memory profiler

22 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. ':*-JOF5JNF1SPpMFS QZUIPOQSPpMFSJTBWBJMBCMF
import cProfile, pstats, io pr = cProfile.Profile() pr.enable() # ... do something ... pr.disable() s = io.StringIO() sortby = 'cumulaUve' ps = pstats.Stats(pr, stream=s).sort_stats(sortby) ps.print_stats() print(s.getvalue()) ncalls tottime percall cumtime percall filename:lineno(function) 1 0.000 0.000 0.001 0.001 <string>:1(<module>) 1 0.000 0.000 0.001 0.001 re.py:212(compile) 1 0.000 0.000 0.001 0.001 re.py:268(_compile) 1 0.000 0.000 0.000 0.000 sre_compile.py:172(_compile_charset) 1 0.000 0.000 0.000 0.000 sre_compile.py:201(_optimize_charset) 4 0.000 0.000 0.000 0.000 sre_compile.py:25(_identityfunction) 3/1 0.000 0.000 0.000 0.000 sre_compile.py:33(_compile) hQps://docs.python.org/3/library/profile.html GPU memory profiler

23 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. ':*(161SPpMFS OWQSPGBOEOWWQJTBWBJMBCMF
$ /usr/local/cuda/bin/nvprof python examples/cusolver.py ==27986== NVPROF is profiling process 27986, command: python examples/stream/cusolver.py ==27986== Profiling application: python examples/stream/cusolver.py ==27986== Profiling result: Time(%) Time Calls Avg Min Max Name 41.70% 125.73us 4 31.431us 30.336us 33.312us void nrm2_kernel<double, double, double, int=0, int=0, int=128, int=0>(cublasNrm2Params<double, double>) 21.94% 66.144us 36 1.8370us 1.7600us 2.1760us [CUDA memcpy DtoH] 13.77% 41.536us 48 865ns 800ns 1.4400us [CUDA memcpy HtoD] 0.98% 2.9440us 2 1.4720us 1.2160us 1.7280us void reset_diagonal_real<double, int=8>(int, double*, int) 0.98% 2.9440us 4 736ns 736ns 736ns [CUDA memset] ==27986== API calls: Time(%) Time Calls Avg Min Max Name 60.34% 408.55ms 9 45.395ms 4.8480us 407.94ms cudaMalloc 37.60% 254.60ms 2 127.30ms 556ns 254.60ms cudaFree 0.94% 6.3542ms 712 8.9240us 119ns 428.32us cuDeviceGetAttribute 0.72% 4.8747ms 8 609.33us 320.37us 885.26us cuDeviceTotalMem 0.10% 693.60us 82 8.4580us 2.8370us 72.004us cudaMemcpyAsync 0.08% 511.79us 1 511.79us 511.79us 511.79us cudaHostAlloc 0.08% 511.75us 8 63.969us 41.317us 99.232us cuDeviceGetName 0.05% 310.04us 1 310.04us 310.04us 310.04us cuModuleLoadData GPU memory profiler

24 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. ':*(161SPpMFS OWQSPGBOEOWWQJTBWBJMBCMF
$ nvprof -o prof.nvvp python examples/cusolver.py $ /Developer/NVIDIA/CUDA-9.0/bin/nvvp prof.nvvp I install nvvp on Mac OSX, and scp prof.nvvp from GPU machine hQps://developer.nvidia.com/cuda-downloads GPU memory profiler

1 2 3 CuPy Improvements (16NFNPSZQSPpMFSXJUI $V1ZNFNPSZIPPL 4VQQPSU$6%"TUSFBN

26 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. 8IBUJT$6%"TUSFBN hQp://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf
• The ability to perform multiple CUDA operations • CUDA Kernel • cudaMemcpyAsync (HostToDevice) • cudaMemcpyAsync (DeviceToHost) • simultaneously Support CUDA stream

27 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. $PODVSSFODZFYBNQMF hQp://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf
• Overlap memcpy and kernel execution Support CUDA stream

28 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. $PODVSSFODZFYBNQMF Support
CUDA stream • NNs with branches, e.g., inception module of GoogLeNet • 4 streams (concurrencies) for 4 branches would improve performance.

29 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. *TTVFTUPTVQQPSU$6%"TUSFBN •
NFNPSZQPPMXJUIPOFTUSFBN CPU stream1 Kernel1 Launch kernel2 Launch kernel1 malloc free malloc free Kernel2 • Returns memory to mem pool before kernel execution ﬁnishes • It was ﬁne because it is sure that kernel2 is ran after kernel1. Support CUDA stream

30 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. *TTVFTUPTVQQPSU$6%"TUSFBN •
CSPLFONFNPSZQPPMXJUIUXPTUSFBNT CPU stream1 Kernel1 Launch kernel2 Launch kernel1 Kernel2 stream2 malloc free malloc free • kernel2 may use memory blocks which kernel1 is still using. Support CUDA stream

32 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. 4FQFSBUFENFNPSZQPPMGPSFBDITUSFBN
Support CUDA stream Mem pool1 CPU stream1 Kernel1 Launch kernel2 Launch kernel1 Kernel2 stream2 malloc free malloc free Mem pool2 • Drawback: stream2 can not reuse cached memory of stream1

33 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. 6TFDVEBTUSFBNDBMMCBDL
CPU stream1 Kernel1 Launch kernel2 Launch kernel1 Kernel2 stream2 malloc free malloc free callback to call free callback to call free Support CUDA stream • kernel2 does not touch memory which kernel1 is using • Drawback: Registering callback to all kernels would degrade performance

Copyright (C) DeNA Co.,Ltd. All Rights Reserved. )PXUPTVQQPSU$6%"TUSFBN • (1)
Create a separated memory pool for each stream • (2) Use cuda stream callback? cudaStreamAddCallback synchronizes CPU and GPU hQps://gist.github.com/sonots/e98a95aaceae65a15d2b59a81bem023 Support CUDA stream So, we chose (1)

35 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. )PXUPVTFTUSFBNTPO$V1Z import
cupy x = cupy.array([1]) y = cupy.array([1]) stream1 = cupy.cuda.stream.Stream() stream2 = cupy.cuda.stream.Stream() with stream1: z1 = x + y with stream2: z2 = x * y # Default stream waits unUl all stream’s operaUons ﬁnish as default z = z1 + z2 Support CUDA stream example.py

36 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. )PXUPVTFTUSFBNTPO$IBJOFS import
chainer import chainer.funcUons as F Import chainer.links as L class MyAwesomeNet(chainer.chain): def __init(self): super(MyAwesomeNet, self).__init__() with self.init_scope(): self.stream1 = chainer.cuda.stream.Stream() self.stream2 = chainer.cuda.stream.Stream() self.conv1 = L.ConvoluUon2D(None, 384, 3, pad=1) def __call__(self, x, t): with stream1: h1 = self.conv1(x) with stream2: h2 = self.conv2(x) example.py Support CUDA stream

37 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. )PXUPVTFTUSFBNTPO$V1Z $
nvprof --print-gpu-trace python example.py Start Duration [omit] Stream Name 617.88ms 1.4400us [omit] 7 [CUDA memcpy HtoD] 840.73ms 2.8480us [omit] 13 cupy_copy [415] 841.58ms 7.2640us [omit] 13 cupy_power [424] 842.33ms 3.5840us [omit] 13 cupy_sum [433] 842.99ms 2.9760us [omit] 13 cupy_sqrt [440] 843.10ms 2.3040us [omit] 14 cupy_copy [446] 843.19ms 6.5600us [omit] 14 cupy_power [452] 843.27ms 3.2320us [omit] 14 cupy_sum [458] 843.33ms 2.6880us [omit] 14 cupy_sqrt [462] Support CUDA stream

.FNPSZQPPMXJUICFTUpU XJUIDPBMFTDJOH 38 Copyright (C) DeNA Co.,Ltd. All Rights Reserved.
1 2 3 Conclusion: CuPy Improvements $IBJOFSGVODUJPO(16 .FNPSZQSPpMFS 4VQQPSU$6%"TUSFBN CuPy 2.0.0 Chainer 3.0.0 CuPy 2.1.0 CuPy 4.0.0b1 $V1ZMJOFNFNPSZ 1SPpMFS

CuPy improvments around memory

CuPy improvments around memory

More Decks by Naotoshi Seo

Other Decks in Programming

Featured

Transcript