CuPy improvments around memory

Aef92c3acc29ad8543e04135687fc4f1?s=47 Naotoshi Seo
February 28, 2018

CuPy improvments around memory

Aef92c3acc29ad8543e04135687fc4f1?s=128

Naotoshi Seo

February 28, 2018
Tweet

Transcript

  1. 1 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. $V1Z *NQSPWFNFOUT

    BSPVOENFNPSZ 2017/10/24
  2. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. 4FMG*OUSPEVDUJPO 2 ・Naotoshi

    Seo @sonots ・AI System Dept. ・Analytics Infra Group ・Tools Team (Supervisor) ・Cloud Infra (Supervisor) ・Fluentd & Ruby Committer ・Recently, “出向” to PFN ・Chainer/CuPy core dev
  3. 3 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. • I

    do not talk • Why we use GPU • What is CuPy • I talk • Basic CUDA programming • CuPy improvements %JTDMBJNFS
  4. .FNPSZQPPMJNQSPWFNFOUT XJUICFTUpUXJUIDPBMFTDJOH 4 Copyright (C) DeNA Co.,Ltd. All Rights Reserved.

    1 2 3 CuPy Improvements (16NFNPSZQSPpMFSXJUI $V1ZNFNPSZIPPL 4VQQPSU$6%"TUSFBN 0 #BTJD$6%"1SPHSBNNJOH
  5. 5 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. #BTJD$6%"QSPHSBNNJOH Typically,

    CUDA programs are written as 1. Allocate GPU memory 2. Copy CPU memory to GPU memory 3. Launch CUDA kernel (a function executed by GPU) 4. Copy GPU memory to CPU memory 5. Free GPU memory Basic CUDA programming
  6. 6 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. #BTJD$6%"QSPHSBNNJOH Basic

    CUDA programming __global__ void sumArraysOnGPU(float *A, float *B, float *C, const int N) { int i = blockIdx.x * blockDim.x + threadIdx.x; C[i] = A[i] + B[i]; } int main(int argc, char **argv) { // malloc device global memory float *d_A, *d_B, *d_C; cudaMalloc((float**)&d_A, nBytes); cudaMalloc((float**)&d_B, nBytes); cudaMalloc((float**)&d_C, nBytes); // transfer data from host to device cudaMemcpy(d_A, h_A, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(d_B, h_B, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(d_C, gpuRef, nBytes, cudaMemcpyHostToDevice); // invoke kernel sumArraysOnGPU<<<grid, block>>>(d_A, d_B, d_C, nElem); // copy kernel result back to host side cudaMemcpy(gpuRef, d_C, nBytes, cudaMemcpyDeviceToHost); // free device global memory CHECK(cudaFree(d_A)); CHECK(cudaFree(d_B)); CHECK(cudaFree(d_C)); }
  7. 7 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. #BTJD$6%"QSPHSBNNJOH Basic

    CUDA programming CPU GPU Kernel1 Idle • Kernel execution is asynchronous • cudaMalloc, cudaMemcpy, cudaFree are synchronous Launch kernel1 cudaMalloc cudaMemcpy (H2D) cudaFree cudaMemcpy (D2H)
  8. .FNPSZQPPMJNQSPWFNFOUT XJUICFTUpUXJUIDPBMFTDJOH 8 Copyright (C) DeNA Co.,Ltd. All Rights Reserved.

    1 2 3 CuPy Improvements (16NFNPSZQSPpMFSXJUI $V1ZNFNPSZIPPL 4VQQPSU$6%"TUSFBN
  9. 9 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. 8IZXFOFFE.FNPSZQPPM •

    cudaMalloc / cudaFree are slow • memory allocation / free themselves are slow • cudaMalloc synchronizes CPU and GPU Memory pool improvement CPU GPU Free Kernel1 cudaFree Launch synchronize Idle Kernel2 something Idle Launch cudaMalloc synchronize Malloc
  10. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. .FNPSZQPPM Memory pool

    improvement • Letʼs cache to memory pool • Avoid cudaMalloc / cudaFree as much as possible
  11. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. .FNPSZQPPMPG$V1Z #FTUpUBMHPSJUIN Memory

    pool improvement 512 1024 1536 2048 2560 …. free bins or arena chunks size free_list or bin 1. Round up memory size by 512 2. cudaMalloc, and use it 3. Push to arena intead of cudaFree 4. Pop from arena if a block of exactly same size is available in arena instead of cudaMalloc
  12. 12 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. *TTVFTPOCFTUpUBMHPSJUIN •

    Cache miss occurs even if there exist memory blocks of larger sizes than the required size (best-fit). • The cache miss typically occurs for Natural Language Processing applications whose input data size are varying. Memory pool improvement 512 1024 1536 2048 2560 …. size Want How about using cache of larger size (2560)?
  13. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. .FNPSZQPPMPG$V1Z #FTUpUXJUIDPBMFTDJOH #'$

     Memory pool improvement 512 1024 1536 2048 2560 … 1. Pop a chunk if larger size than required size is available 2. Split and use only necessary size. Push back a chunk of remained size Split Pop 512 2048 512 1024 1536 2048 2560 …. Push (1) (2) Split and merge 4QMJU Split next prev want to use
  14. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. .FNPSZQPPMPG$V1Z #FTUpUXJUIDPBMFTDJOH #'$

     Memory pool improvement 1. Merge the chunk to free with next or prev chunks in free lists 2. Push back the merged chunk Merge 512 2048 512 1024 1536 2048 2560 …. Push .FSHF next prev
  15. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. 5IFF⒎FDUPG#'$ Memory pool

    improvement • For situations: • 1. Previous input data size is large • 2. Next input data size is smaller • We can still use a cached memory
  16. 16 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. 5IFF⒎FDUPG#'$ hQps://github.com/cupy/cupy/pull/168

    Memory pool improvement 1/4 memory usage! And, 1.3 times faster
  17. .FNPSZQPPMJNQSPWFNFOUT XJUICFTUpUXJUIDPBMFTDJOH 17 Copyright (C) DeNA Co.,Ltd. All Rights Reserved.

    1 2 3 CuPy 2.0 Improvements (16NFNPSZQSPpMFSXJUI $V1ZNFNPSZIPPL 4VQQPSU$6%"TUSFBN
  18. 18 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. 8IBUPGUFOIBQQFOT cupy.cuda.runUme.CUDARunUmeError:

    cudaErrorMemoryAllocaUon: out of memory • How to investigate it? • Use GPU memory profiler GPU memory profiler
  19. 19 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. (16$IBJOFS'VODUJPO.FNPSZ1SPpMFS BWBJMBCMFJO$IBJOFS

    from chainer.func.on_hooks import CupyMemoryProfileHook hook = CupyMemoryProfileHook() with hook: trainer.run() hook.print_report() Bytes used from CuPy mem pool Bytes acquired from GPU device GPU memory profiler
  20. 20 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. import cupy.cuda.memory_hooks

    hook = cupy.cuda.memory_hooks.LineProfileHook() with hook: # some cupy codes hook.print_report() _root (4.00KB, 4.00KB) lib/python3.6/runpy.py:193:_run_module_as_main (4.00KB, 4.00KB) lib/python3.6/runpy.py:85:_run_code (4.00KB, 4.00KB) lib/python3.6/uniQest/__main__.py:18:<module> (4.00KB, 4.00KB) lib/python3.6/uniQest/main.py:94:__init__ (4.00KB, 4.00KB) lib/python3.6/uniQest/main.py:255:runTests (4.00KB, 4.00KB) $V1Z-JOF.FNPSZ1SPpMFS BWBJMBCMFJO$V1ZB Bytes used from CuPy mem pool Bytes acquired from GPU device GPU memory profiler
  21. 21 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. ':*$IBJOFS'VODUJPO5JNF1SPpMFS BWBJMBCMFJO$IBJOFS

    from chainer.func.on_hooks import TimerHook hook = TimerHook() with hook: trainer.run() hook.print_report() FunctionName ElapsedTime Occurrence LinearFunction 1.24sec 3900 ReLU 593.05ms 2600 SoftmaxCrossEntropy 824.11ms 1300 Accuracy 176.54ms 700 GPU memory profiler
  22. 22 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. ':*-JOF5JNF1SPpMFS QZUIPOQSPpMFSJTBWBJMBCMF

    import cProfile, pstats, io pr = cProfile.Profile() pr.enable() # ... do something ... pr.disable() s = io.StringIO() sortby = 'cumulaUve' ps = pstats.Stats(pr, stream=s).sort_stats(sortby) ps.print_stats() print(s.getvalue()) ncalls tottime percall cumtime percall filename:lineno(function) 1 0.000 0.000 0.001 0.001 <string>:1(<module>) 1 0.000 0.000 0.001 0.001 re.py:212(compile) 1 0.000 0.000 0.001 0.001 re.py:268(_compile) 1 0.000 0.000 0.000 0.000 sre_compile.py:172(_compile_charset) 1 0.000 0.000 0.000 0.000 sre_compile.py:201(_optimize_charset) 4 0.000 0.000 0.000 0.000 sre_compile.py:25(_identityfunction) 3/1 0.000 0.000 0.000 0.000 sre_compile.py:33(_compile) hQps://docs.python.org/3/library/profile.html GPU memory profiler
  23. 23 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. ':*(161SPpMFS OWQSPGBOEOWWQJTBWBJMBCMF

    $ /usr/local/cuda/bin/nvprof python examples/cusolver.py ==27986== NVPROF is profiling process 27986, command: python examples/stream/cusolver.py ==27986== Profiling application: python examples/stream/cusolver.py ==27986== Profiling result: Time(%) Time Calls Avg Min Max Name 41.70% 125.73us 4 31.431us 30.336us 33.312us void nrm2_kernel<double, double, double, int=0, int=0, int=128, int=0>(cublasNrm2Params<double, double>) 21.94% 66.144us 36 1.8370us 1.7600us 2.1760us [CUDA memcpy DtoH] 13.77% 41.536us 48 865ns 800ns 1.4400us [CUDA memcpy HtoD] 0.98% 2.9440us 2 1.4720us 1.2160us 1.7280us void reset_diagonal_real<double, int=8>(int, double*, int) 0.98% 2.9440us 4 736ns 736ns 736ns [CUDA memset] ==27986== API calls: Time(%) Time Calls Avg Min Max Name 60.34% 408.55ms 9 45.395ms 4.8480us 407.94ms cudaMalloc 37.60% 254.60ms 2 127.30ms 556ns 254.60ms cudaFree 0.94% 6.3542ms 712 8.9240us 119ns 428.32us cuDeviceGetAttribute 0.72% 4.8747ms 8 609.33us 320.37us 885.26us cuDeviceTotalMem 0.10% 693.60us 82 8.4580us 2.8370us 72.004us cudaMemcpyAsync 0.08% 511.79us 1 511.79us 511.79us 511.79us cudaHostAlloc 0.08% 511.75us 8 63.969us 41.317us 99.232us cuDeviceGetName 0.05% 310.04us 1 310.04us 310.04us 310.04us cuModuleLoadData GPU memory profiler
  24. 24 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. ':*(161SPpMFS OWQSPGBOEOWWQJTBWBJMBCMF

    $ nvprof -o prof.nvvp python examples/cusolver.py $ /Developer/NVIDIA/CUDA-9.0/bin/nvvp prof.nvvp I install nvvp on Mac OSX, and scp prof.nvvp from GPU machine hQps://developer.nvidia.com/cuda-downloads GPU memory profiler
  25. .FNPSZQPPMJNQSPWFNFOUT XJUICFTUpUXJUIDPBMFTDJOH 25 Copyright (C) DeNA Co.,Ltd. All Rights Reserved.

    1 2 3 CuPy Improvements (16NFNPSZQSPpMFSXJUI $V1ZNFNPSZIPPL 4VQQPSU$6%"TUSFBN
  26. 26 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. 8IBUJT$6%"TUSFBN hQp://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf

    • The ability to perform multiple CUDA operations • CUDA Kernel • cudaMemcpyAsync (HostToDevice) • cudaMemcpyAsync (DeviceToHost) • simultaneously Support CUDA stream
  27. 27 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. $PODVSSFODZFYBNQMF hQp://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf

    • Overlap memcpy and kernel execution Support CUDA stream
  28. 28 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. $PODVSSFODZFYBNQMF Support

    CUDA stream • NNs with branches, e.g., inception module of GoogLeNet • 4 streams (concurrencies) for 4 branches would improve performance.
  29. 29 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. *TTVFTUPTVQQPSU$6%"TUSFBN •

    NFNPSZQPPMXJUIPOFTUSFBN CPU stream1 Kernel1 Launch kernel2 Launch kernel1 malloc free malloc free Kernel2 • Returns memory to mem pool before kernel execution finishes • It was fine because it is sure that kernel2 is ran after kernel1. Support CUDA stream
  30. 30 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. *TTVFTUPTVQQPSU$6%"TUSFBN •

    CSPLFONFNPSZQPPMXJUIUXPTUSFBNT CPU stream1 Kernel1 Launch kernel2 Launch kernel1 Kernel2 stream2 malloc free malloc free • kernel2 may use memory blocks which kernel1 is still using. Support CUDA stream
  31. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. )PXUPTVQQPSU$6%"TUSFBN • (1)

    Create separated memory pools for each stream • (2) Use cuda stream callback? Support CUDA stream
  32. 32 Copyright (C) DeNA Co.,Ltd. All Rights Reserved.  4FQFSBUFENFNPSZQPPMGPSFBDITUSFBN

    Support CUDA stream Mem pool1 CPU stream1 Kernel1 Launch kernel2 Launch kernel1 Kernel2 stream2 malloc free malloc free Mem pool2 • Drawback: stream2 can not reuse cached memory of stream1
  33. 33 Copyright (C) DeNA Co.,Ltd. All Rights Reserved.  6TFDVEBTUSFBNDBMMCBDL

    CPU stream1 Kernel1 Launch kernel2 Launch kernel1 Kernel2 stream2 malloc free malloc free callback to call free callback to call free Support CUDA stream • kernel2 does not touch memory which kernel1 is using • Drawback: Registering callback to all kernels would degrade performance
  34. Copyright (C) DeNA Co.,Ltd. All Rights Reserved. )PXUPTVQQPSU$6%"TUSFBN • (1)

    Create a separated memory pool for each stream • (2) Use cuda stream callback? cudaStreamAddCallback synchronizes CPU and GPU hQps://gist.github.com/sonots/e98a95aaceae65a15d2b59a81bem023 Support CUDA stream So, we chose (1)
  35. 35 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. )PXUPVTFTUSFBNTPO$V1Z import

    cupy x = cupy.array([1]) y = cupy.array([1]) stream1 = cupy.cuda.stream.Stream() stream2 = cupy.cuda.stream.Stream() with stream1: z1 = x + y with stream2: z2 = x * y # Default stream waits unUl all stream’s operaUons finish as default z = z1 + z2 Support CUDA stream example.py
  36. 36 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. )PXUPVTFTUSFBNTPO$IBJOFS import

    chainer import chainer.funcUons as F Import chainer.links as L class MyAwesomeNet(chainer.chain): def __init(self): super(MyAwesomeNet, self).__init__() with self.init_scope(): self.stream1 = chainer.cuda.stream.Stream() self.stream2 = chainer.cuda.stream.Stream() self.conv1 = L.ConvoluUon2D(None, 384, 3, pad=1) def __call__(self, x, t): with stream1: h1 = self.conv1(x) with stream2: h2 = self.conv2(x) example.py Support CUDA stream
  37. 37 Copyright (C) DeNA Co.,Ltd. All Rights Reserved. )PXUPVTFTUSFBNTPO$V1Z $

    nvprof --print-gpu-trace python example.py Start Duration [omit] Stream Name 617.88ms 1.4400us [omit] 7 [CUDA memcpy HtoD] 840.73ms 2.8480us [omit] 13 cupy_copy [415] 841.58ms 7.2640us [omit] 13 cupy_power [424] 842.33ms 3.5840us [omit] 13 cupy_sum [433] 842.99ms 2.9760us [omit] 13 cupy_sqrt [440] 843.10ms 2.3040us [omit] 14 cupy_copy [446] 843.19ms 6.5600us [omit] 14 cupy_power [452] 843.27ms 3.2320us [omit] 14 cupy_sum [458] 843.33ms 2.6880us [omit] 14 cupy_sqrt [462] Support CUDA stream
  38. .FNPSZQPPMXJUICFTUpU XJUIDPBMFTDJOH 38 Copyright (C) DeNA Co.,Ltd. All Rights Reserved.

    1 2 3 Conclusion: CuPy Improvements $IBJOFSGVODUJPO(16 .FNPSZQSPpMFS 4VQQPSU$6%"TUSFBN CuPy 2.0.0 Chainer 3.0.0 CuPy 2.1.0 CuPy 4.0.0b1 $V1ZMJOFNFNPSZ 1SPpMFS