Slide 1

CuPy Improvements around memory (2017/10/24)

Slide 2

Self Introduction

・Naotoshi Seo @sonots
・AI System Dept., Analytics Infra Group
・Tools Team (Supervisor)
・Cloud Infra (Supervisor)
・Fluentd & Ruby Committer
・Recently, seconded ("出向") to PFN
・Chainer/CuPy core dev

Slide 3

Disclaimer

• I will not talk about:
  • Why we use GPU
  • What is CuPy
• I will talk about:
  • Basic CUDA programming
  • CuPy improvements

Slide 4

CuPy Improvements

0. Basic CUDA Programming
1. Memory pool improvements with best-fit with coalescing
2. GPU memory profiler with CuPy memory hook
3. Support CUDA stream

Slide 5

Basic CUDA programming

Typically, CUDA programs are written as:
1. Allocate GPU memory
2. Copy CPU memory to GPU memory
3. Launch a CUDA kernel (a function executed by the GPU)
4. Copy GPU memory back to CPU memory
5. Free GPU memory

Slide 6

Basic CUDA programming

#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void sumArraysOnGPU(float *A, float *B, float *C, const int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) C[i] = A[i] + B[i];
}

int main(int argc, char **argv) {
    int nElem = 1 << 24;
    size_t nBytes = nElem * sizeof(float);

    // host memory (h_A, h_B are inputs; gpuRef receives the result)
    float *h_A = (float *)malloc(nBytes);
    float *h_B = (float *)malloc(nBytes);
    float *gpuRef = (float *)malloc(nBytes);

    // malloc device global memory
    float *d_A, *d_B, *d_C;
    cudaMalloc((float**)&d_A, nBytes);
    cudaMalloc((float**)&d_B, nBytes);
    cudaMalloc((float**)&d_C, nBytes);

    // transfer data from host to device
    cudaMemcpy(d_A, h_A, nBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, nBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, gpuRef, nBytes, cudaMemcpyHostToDevice);

    // invoke kernel (grid * block covers all nElem elements)
    dim3 block(512);
    dim3 grid((nElem + block.x - 1) / block.x);
    sumArraysOnGPU<<<grid, block>>>(d_A, d_B, d_C, nElem);

    // copy kernel result back to host side
    cudaMemcpy(gpuRef, d_C, nBytes, cudaMemcpyDeviceToHost);

    // free device global memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}

Slide 7

Basic CUDA programming

• Kernel execution is asynchronous: the CPU launches kernel1 and is then idle (or free to do other work) while the GPU runs it
• cudaMalloc, cudaMemcpy, and cudaFree are synchronous

Timeline on the slide: the CPU calls cudaMalloc, cudaMemcpy (H2D), launches kernel1, then cudaMemcpy (D2H) and cudaFree, while the GPU runs Kernel1 in between. A small timing sketch of this asynchrony follows.
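To see this behavior from Python, here is a small CuPy timing sketch (my illustration, not from the slides; it assumes CuPy and a GPU are available). The launch returns almost immediately; the real kernel time only shows up after synchronizing with the device.

import time
import cupy

a = cupy.arange(4096 * 4096, dtype=cupy.float32).reshape(4096, 4096)
b = cupy.arange(4096 * 4096, dtype=cupy.float32).reshape(4096, 4096)

t0 = time.perf_counter()
c = a.dot(b)                       # launches a kernel and returns immediately
t1 = time.perf_counter()
cupy.cuda.Device().synchronize()   # blocks the CPU until the GPU finishes
t2 = time.perf_counter()

print('launch only: %.6f sec' % (t1 - t0))  # tiny: the launch is asynchronous
print('with sync:   %.6f sec' % (t2 - t0))  # includes the kernel execution time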

Slide 8

CuPy Improvements

1. Memory pool improvements with best-fit with coalescing
2. GPU memory profiler with CuPy memory hook
3. Support CUDA stream

Slide 9

Why we need a memory pool

• cudaMalloc / cudaFree are slow
  • memory allocation / free are slow in themselves
  • cudaMalloc synchronizes CPU and GPU

Timeline on the slide: between Kernel1 and Kernel2 the CPU issues cudaFree and cudaMalloc, each of which synchronizes with the GPU, leaving the GPU idle. A rough timing sketch follows.
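As a rough illustration (my sketch, not the slides'), one can time raw cudaMalloc / cudaFree through cupy.cuda.runtime against pool-backed cupy.empty once the pool is warm:

import time
import cupy
from cupy.cuda import runtime

nbytes = 64 * 1024 * 1024  # 64 MiB

# raw cudaMalloc / cudaFree on every iteration
t0 = time.perf_counter()
for _ in range(100):
    ptr = runtime.malloc(nbytes)   # cudaMalloc (synchronizes CPU and GPU)
    runtime.free(ptr)              # cudaFree
t1 = time.perf_counter()

# pool-backed allocation: freed blocks return to the pool and get reused
for _ in range(10):                # warm up the pool
    cupy.empty(nbytes, dtype=cupy.uint8)
t2 = time.perf_counter()
for _ in range(100):
    cupy.empty(nbytes, dtype=cupy.uint8)  # served from the memory pool
t3 = time.perf_counter()

print('cudaMalloc/cudaFree: %.6f sec' % (t1 - t0))
print('memory pool:         %.6f sec' % (t3 - t2))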

Slide 10

Memory pool

• Let's cache allocations in a memory pool
• Avoid cudaMalloc / cudaFree as much as possible

Slide 11

Memory pool of CuPy (best-fit algorithm)

The arena keeps free chunks in free lists (bins) keyed by size: 512, 1024, 1536, 2048, 2560, ...

1. Round up the requested memory size to a multiple of 512
2. cudaMalloc a block, and use it
3. On free, push the block to the arena instead of calling cudaFree
4. On malloc, pop from the arena instead of calling cudaMalloc, if a block of exactly the same size is available

A minimal sketch of this scheme follows.
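Here is a minimal Python sketch of such an exact-size pool (my illustration of the idea, not CuPy's actual implementation; alloc_device and free_device are hypothetical stand-ins for cudaMalloc / cudaFree):

class ExactFitPool:
    """Cache freed blocks in per-size free lists; reuse only exact sizes."""

    ROUND = 512

    def __init__(self, alloc_device, free_device):
        self._alloc = alloc_device   # e.g. wraps cudaMalloc
        self._free = free_device     # e.g. wraps cudaFree
        self._free_lists = {}        # rounded size -> list of cached blocks

    def malloc(self, size):
        size = ((size + self.ROUND - 1) // self.ROUND) * self.ROUND  # step 1
        cached = self._free_lists.get(size)
        if cached:                   # step 4: exact-size cache hit
            return cached.pop()
        return self._alloc(size)     # step 2: fall back to cudaMalloc

    def free(self, block, size):
        size = ((size + self.ROUND - 1) // self.ROUND) * self.ROUND
        self._free_lists.setdefault(size, []).append(block)  # step 3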

Slide 12

Issues on the best-fit algorithm

• A cache miss occurs even if the arena holds memory blocks larger than the required size (only exact sizes are reused).
• Such cache misses typically occur in Natural Language Processing applications, whose input data sizes vary.

Example on the slide: we want a small block, but the arena only caches larger ones. How about using a cached block of larger size (2560)?

Slide 13

Memory pool of CuPy (best-fit with coalescing, BFC)

1. Pop a chunk if one larger than the required size is available
2. Split it and use only the necessary size; push a chunk of the remaining size back to the arena

Diagram on the slide: a 2560-byte chunk is popped and split into the 512 bytes we want to use and a 2048-byte remainder that is pushed back; the split chunks keep prev/next links to each other.

Slide 14

Memory pool of CuPy (best-fit with coalescing, BFC)

1. On free, merge the chunk with its next or prev chunks if they are in the free lists
2. Push the merged chunk back to the arena

Diagram on the slide: a freed 512-byte chunk is merged with its free 2048-byte neighbor into one 2560-byte chunk, which is pushed back. A condensed sketch of split and merge follows.
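A condensed Python sketch of the split and merge operations (my illustration of the technique; CuPy's real implementation lives in cupy/cuda/memory.pyx and differs in detail):

class Chunk:
    def __init__(self, ptr, size):
        self.ptr = ptr       # offset into the underlying cudaMalloc'ed block
        self.size = size
        self.prev = None     # physically adjacent neighbor chunks
        self.next = None
        self.in_use = False

def split(chunk, want):
    """Keep the first `want` bytes in `chunk`; return the remainder."""
    assert want <= chunk.size
    if want == chunk.size:
        return None
    rest = Chunk(chunk.ptr + want, chunk.size - want)
    # relink: chunk <-> rest <-> old next
    rest.next = chunk.next
    if rest.next is not None:
        rest.next.prev = rest
    rest.prev = chunk
    chunk.next = rest
    chunk.size = want
    return rest  # caller pushes `rest` back to the arena

def merge(chunk):
    """Coalesce a freed chunk with its free physical neighbors."""
    for neighbor in (chunk.next, chunk.prev):
        if neighbor is not None and not neighbor.in_use:
            # caller must also remove `neighbor` from its free list
            lo = chunk if neighbor is chunk.next else neighbor
            hi = neighbor if neighbor is chunk.next else chunk
            lo.size += hi.size
            lo.next = hi.next
            if hi.next is not None:
                hi.next.prev = lo
            chunk = lo
    return chunk  # caller pushes the merged chunk back to the arena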

Slide 15

The effect of BFC

For situations where:
1. the previous input data size is large, and
2. the next input data size is smaller,
we can still use a cached memory block.

Slide 16

The effect of BFC

https://github.com/cupy/cupy/pull/168

1/4 the memory usage! And 1.3 times faster.

Slide 17

CuPy 2.0 Improvements

1. Memory pool improvements with best-fit with coalescing
2. GPU memory profiler with CuPy memory hook
3. Support CUDA stream

Slide 18

What often happens

cupy.cuda.runtime.CUDARuntimeError: cudaErrorMemoryAllocation: out of memory

• How to investigate it?
• Use a GPU memory profiler

Slide 19

GPU Chainer Function Memory Profiler (available in Chainer)

from chainer.function_hooks import CupyMemoryProfileHook
hook = CupyMemoryProfileHook()
with hook:
    trainer.run()
hook.print_report()

The report shows, per Chainer function, the bytes used from the CuPy memory pool and the bytes acquired from the GPU device.

Slide 20

CuPy Line Memory Profiler (available in CuPy)

import cupy
import cupy.cuda.memory_hooks

hook = cupy.cuda.memory_hooks.LineProfileHook()
with hook:
    a = cupy.ones((1000,))  # some cupy code
hook.print_report()

Example report (first column: bytes used from the CuPy memory pool; second: bytes acquired from the GPU device):

_root (4.00KB, 4.00KB)
  lib/python3.6/runpy.py:193:_run_module_as_main (4.00KB, 4.00KB)
    lib/python3.6/runpy.py:85:_run_code (4.00KB, 4.00KB)
      lib/python3.6/unittest/__main__.py:18: (4.00KB, 4.00KB)
        lib/python3.6/unittest/main.py:94:__init__ (4.00KB, 4.00KB)
          lib/python3.6/unittest/main.py:255:runTests (4.00KB, 4.00KB)

Slide 21

FYI: Chainer Function Time Profiler (available in Chainer)

from chainer.function_hooks import TimerHook
hook = TimerHook()
with hook:
    trainer.run()
hook.print_report()

FunctionName          ElapsedTime   Occurrence
LinearFunction        1.24sec       3900
ReLU                  593.05ms      2600
SoftmaxCrossEntropy   824.11ms      1300
Accuracy              176.54ms      700

Slide 22

FYI: Line Time Profiler (Python's profilers are available)

import cProfile, pstats, io
pr = cProfile.Profile()
pr.enable()
# ... do something ...
pr.disable()
s = io.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
print(s.getvalue())

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.000    0.000    0.001    0.001 <string>:1(<module>)
     1    0.000    0.000    0.001    0.001 re.py:212(compile)
     1    0.000    0.000    0.001    0.001 re.py:268(_compile)
     1    0.000    0.000    0.000    0.000 sre_compile.py:172(_compile_charset)
     1    0.000    0.000    0.000    0.000 sre_compile.py:201(_optimize_charset)
     4    0.000    0.000    0.000    0.000 sre_compile.py:25(_identityfunction)
   3/1    0.000    0.000    0.000    0.000 sre_compile.py:33(_compile)

https://docs.python.org/3/library/profile.html

Slide 23

FYI: GPU Profiler (nvprof and nvvp are available)

$ /usr/local/cuda/bin/nvprof python examples/cusolver.py
==27986== NVPROF is profiling process 27986, command: python examples/stream/cusolver.py
==27986== Profiling application: python examples/stream/cusolver.py
==27986== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
 41.70%  125.73us      4  31.431us  30.336us  33.312us  void nrm2_kernel(cublasNrm2Params)
 21.94%  66.144us     36  1.8370us  1.7600us  2.1760us  [CUDA memcpy DtoH]
 13.77%  41.536us     48     865ns     800ns  1.4400us  [CUDA memcpy HtoD]
  0.98%  2.9440us      2  1.4720us  1.2160us  1.7280us  void reset_diagonal_real(int, double*, int)
  0.98%  2.9440us      4     736ns     736ns     736ns   [CUDA memset]

==27986== API calls:
Time(%)      Time  Calls       Avg       Min       Max  Name
 60.34%  408.55ms      9  45.395ms  4.8480us  407.94ms  cudaMalloc
 37.60%  254.60ms      2  127.30ms     556ns  254.60ms  cudaFree
  0.94%  6.3542ms    712  8.9240us     119ns  428.32us  cuDeviceGetAttribute
  0.72%  4.8747ms      8  609.33us  320.37us  885.26us  cuDeviceTotalMem
  0.10%  693.60us     82  8.4580us  2.8370us  72.004us  cudaMemcpyAsync
  0.08%  511.79us      1  511.79us  511.79us  511.79us  cudaHostAlloc
  0.08%  511.75us      8  63.969us  41.317us  99.232us  cuDeviceGetName
  0.05%  310.04us      1  310.04us  310.04us  310.04us  cuModuleLoadData

Note how cudaMalloc and cudaFree dominate the API call time.

Slide 24

FYI: GPU Profiler (nvprof and nvvp are available)

$ nvprof -o prof.nvvp python examples/cusolver.py
$ /Developer/NVIDIA/CUDA-9.0/bin/nvvp prof.nvvp

I install nvvp on Mac OS X, and scp prof.nvvp from the GPU machine.
https://developer.nvidia.com/cuda-downloads

Slide 25

CuPy Improvements

1. Memory pool improvements with best-fit with coalescing
2. GPU memory profiler with CuPy memory hook
3. Support CUDA stream

Slide 26

What is CUDA stream

http://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf

• The ability to perform multiple CUDA operations simultaneously:
  • CUDA kernel execution
  • cudaMemcpyAsync (HostToDevice)
  • cudaMemcpyAsync (DeviceToHost)

A CuPy sketch of overlapping a copy with a kernel follows.
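A small CuPy sketch of the idea (my illustration, not from the slides): with pinned host memory, a host-to-device copy on one stream can overlap a kernel running on another. Whether overlap actually happens depends on the GPU and the CuPy version.

import numpy as np
import cupy

n = 1 << 20  # one million floats

# pinned host memory so that cudaMemcpyAsync can overlap kernels
pinned = cupy.cuda.alloc_pinned_memory(n * 4)
h_x = np.frombuffer(pinned, dtype=np.float32, count=n)
h_x[...] = 1.0

d_x = cupy.empty(n, dtype=cupy.float32)
d_y = cupy.arange(n, dtype=cupy.float32)

copy_stream = cupy.cuda.Stream()
compute_stream = cupy.cuda.Stream()

with compute_stream:
    d_z = cupy.sqrt(d_y)              # kernel queued on compute_stream
d_x.set(h_x, stream=copy_stream)      # H2D copy queued on copy_stream

copy_stream.synchronize()
compute_stream.synchronize()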

Slide 27

Concurrency example

http://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf

• Overlap memcpy and kernel execution

Slide 28

Concurrency example

• NNs with branches, e.g., the inception module of GoogLeNet
• 4 streams (concurrent executions) for the 4 branches would improve performance

Slide 29

Issues to support CUDA stream

• Memory pool with one stream

Timeline on the slide: the CPU launches kernel1, frees its memory to the pool, then mallocs from the pool and launches kernel2, all while kernel1 may still be running on stream1.

• Memory returns to the pool before kernel execution finishes
• This was fine with one stream, because kernel2 is guaranteed to run after kernel1

Slide 30

Issues to support CUDA stream

• Broken memory pool with two streams

Timeline on the slide: kernel1 runs on stream1 and kernel2 on stream2; the CPU frees kernel1's memory to the pool and mallocs it again for kernel2 while kernel1 is still running.

• kernel2 may use memory blocks which kernel1 is still using

Slide 31

How to support CUDA stream

• (1) Create a separate memory pool for each stream
• (2) Use a CUDA stream callback?

Slide 32

Separate memory pool for each stream

Timeline on the slide: stream1's malloc/free go to mem pool 1 and stream2's to mem pool 2, so kernel2 can never receive a block that kernel1 is still using. A toy sketch of this routing follows.

• Drawback: stream2 can not reuse cached memory of stream1
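A sketch of the idea (my illustration only; CuPy's actual pool keys its free lists by stream pointer internally): route every allocation to an arena owned by the current stream.

import cupy

class PerStreamPools:
    """Toy router: one pool per CUDA stream (illustration only)."""

    def __init__(self, make_pool):
        self._make_pool = make_pool  # factory, e.g. lambda: ExactFitPool(...)
        self._pools = {}             # stream_ptr -> pool

    def _pool_for_current_stream(self):
        stream_ptr = cupy.cuda.get_current_stream().ptr
        if stream_ptr not in self._pools:
            self._pools[stream_ptr] = self._make_pool()
        return self._pools[stream_ptr]

    def malloc(self, size):
        # blocks cached under another stream are never handed out here,
        # so a kernel still running on that stream cannot be clobbered
        return self._pool_for_current_stream().malloc(size)

    def free(self, block, size):
        self._pool_for_current_stream().free(block, size)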

Slide 33

Use cuda stream callback

Timeline on the slide: instead of returning memory to the pool immediately, each launch registers a callback on its stream that calls free after the kernel finishes. A sketch using CuPy's stream callback API follows.

• kernel2 does not touch memory which kernel1 is using
• Drawback: registering a callback for every kernel would degrade performance
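For completeness, a sketch of option (2) with CuPy's Stream.add_callback (my illustration only; `pool` here is just a dict standing in for a real memory pool, and as the next slide notes, cudaStreamAddCallback itself synchronizes, which is why this route was rejected):

import cupy

pool = {}  # fake free list: nbytes -> list of blocks (illustration only)

def deferred_free(stream, status, block):
    # invoked by the CUDA runtime after all work queued on the stream so far finishes
    pool.setdefault(block.nbytes, []).append(block)

stream1 = cupy.cuda.Stream()
with stream1:
    x = cupy.arange(10)
    y = x * 2                            # kernel queued on stream1
stream1.add_callback(deferred_free, x)   # return x's memory only after stream1 drains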

Slide 34

How to support CUDA stream

• (1) Create a separate memory pool for each stream
• (2) Use a CUDA stream callback?
  • cudaStreamAddCallback synchronizes CPU and GPU
  • https://gist.github.com/sonots/e98a95aaceae65a15d2b59a81bem023

So, we chose (1).

Slide 35

How to use streams on CuPy

example.py:

import cupy

x = cupy.array([1])
y = cupy.array([1])
stream1 = cupy.cuda.stream.Stream()
stream2 = cupy.cuda.stream.Stream()
with stream1:
    z1 = x + y
with stream2:
    z2 = x * y
# The default stream waits until all streams' operations finish by default
z = z1 + z2

Slide 36

How to use streams on Chainer

example.py:

import chainer
import chainer.functions as F
import chainer.links as L

class MyAwesomeNet(chainer.Chain):
    def __init__(self):
        super(MyAwesomeNet, self).__init__()
        self.stream1 = chainer.cuda.stream.Stream()
        self.stream2 = chainer.cuda.stream.Stream()
        with self.init_scope():
            self.conv1 = L.Convolution2D(None, 384, 3, pad=1)
            self.conv2 = L.Convolution2D(None, 384, 3, pad=1)

    def __call__(self, x, t):
        with self.stream1:
            h1 = self.conv1(x)
        with self.stream2:
            h2 = self.conv2(x)

Slide 37

How to use streams on CuPy

$ nvprof --print-gpu-trace python example.py

Start     Duration  [omit]  Stream  Name
617.88ms  1.4400us  [omit]  7       [CUDA memcpy HtoD]
840.73ms  2.8480us  [omit]  13      cupy_copy [415]
841.58ms  7.2640us  [omit]  13      cupy_power [424]
842.33ms  3.5840us  [omit]  13      cupy_sum [433]
842.99ms  2.9760us  [omit]  13      cupy_sqrt [440]
843.10ms  2.3040us  [omit]  14      cupy_copy [446]
843.19ms  6.5600us  [omit]  14      cupy_power [452]
843.27ms  3.2320us  [omit]  14      cupy_sum [458]
843.33ms  2.6880us  [omit]  14      cupy_sqrt [462]

The Stream column shows the kernels split across two streams (13 and 14).

Slide 38

Conclusion: CuPy Improvements

1. Memory pool with best-fit with coalescing (CuPy 2.0.0)
2. Chainer function GPU memory profiler (Chainer 3.0.0); CuPy line memory profiler (CuPy 4.0.0b1)
3. Support CUDA stream (CuPy 2.1.0)