
CuPy improvements around memory

Naotoshi Seo
February 28, 2018

Transcript

  1. CuPy Improvements around memory

    2017/10/24

  2. Self Introduction
    ・Naotoshi Seo @sonots
    ・AI System Dept.
    ・Analytics Infra Group
    ・Tools Team (Supervisor)
    ・Cloud Infra (Supervisor)
    ・Fluentd & Ruby Committer
    ・Recently, on secondment (出向) to PFN
    ・Chainer/CuPy core dev

  3. Disclaimer
    • I will not talk about:
      • Why we use GPU
      • What is CuPy
    • I will talk about:
      • Basic CUDA programming
      • CuPy improvements

  4. CuPy Improvements
    0. Basic CUDA Programming
    1. Memory pool improvements with best-fit with coalescing
    2. GPU memory profiler with CuPy memory hook
    3. Support CUDA stream

  5. Basic CUDA programming
    Typically, CUDA programs are written as:
    1. Allocate GPU memory
    2. Copy CPU memory to GPU memory
    3. Launch a CUDA kernel (a function executed by the GPU)
    4. Copy GPU memory back to CPU memory
    5. Free GPU memory

  6. Basic CUDA programming

    __global__ void sumArraysOnGPU(float *A, float *B, float *C, const int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        C[i] = A[i] + B[i];
    }

    int main(int argc, char **argv)
    {
        // malloc device global memory
        float *d_A, *d_B, *d_C;
        cudaMalloc((float**)&d_A, nBytes);
        cudaMalloc((float**)&d_B, nBytes);
        cudaMalloc((float**)&d_C, nBytes);

        // transfer data from host to device
        cudaMemcpy(d_A, h_A, nBytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, nBytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_C, gpuRef, nBytes, cudaMemcpyHostToDevice);

        // invoke kernel with launch configuration <<<grid, block>>>
        sumArraysOnGPU<<<grid, block>>>(d_A, d_B, d_C, nElem);

        // copy kernel result back to host side
        cudaMemcpy(gpuRef, d_C, nBytes, cudaMemcpyDeviceToHost);

        // free device global memory
        CHECK(cudaFree(d_A));
        CHECK(cudaFree(d_B));
        CHECK(cudaFree(d_C));
    }
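    For contrast (not on the slide; a minimal CuPy sketch of the same element-wise sum, assuming CuPy is installed on a CUDA machine), CuPy hides steps 1-5 behind NumPy-like calls:

    import numpy as np
    import cupy

    h_A = np.random.rand(1 << 20).astype(np.float32)
    h_B = np.random.rand(1 << 20).astype(np.float32)

    d_A = cupy.asarray(h_A)      # allocate GPU memory + copy host to device
    d_B = cupy.asarray(h_B)
    d_C = d_A + d_B              # launch the addition kernel
    h_C = cupy.asnumpy(d_C)      # copy the result back to the host
    # GPU memory is released (back to CuPy's pool) when the arrays are garbage collected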


  7. Basic CUDA programming
    • Kernel execution is asynchronous
    • cudaMalloc, cudaMemcpy, cudaFree are synchronous
    [Timeline diagram: the CPU launches kernel1 and returns immediately; cudaMalloc, cudaMemcpy (H2D/D2H) and cudaFree block the CPU until the GPU is done]
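    The asynchronous launch is easy to observe from Python (not on the original slide; a minimal CuPy sketch assuming a CUDA-capable machine): the call that enqueues the kernel returns almost immediately, while an explicit device synchronization waits for the actual computation.

    import time
    import cupy

    a = cupy.random.rand(4096, 4096)
    cupy.cuda.Device().synchronize()    # make sure setup work has finished

    t0 = time.time()
    b = cupy.dot(a, a)                  # enqueue the matmul kernel; returns immediately
    t1 = time.time()
    cupy.cuda.Device().synchronize()    # block the CPU until the GPU finishes
    t2 = time.time()

    print('launch: %.6f s, compute: %.6f s' % (t1 - t0, t2 - t1))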


  8. CuPy Improvements
    1. Memory pool improvements with best-fit with coalescing
    2. GPU memory profiler with CuPy memory hook
    3. Support CUDA stream

  9. Why we need a memory pool
    • cudaMalloc / cudaFree are slow
      • memory allocation / free themselves are slow
      • cudaMalloc synchronizes CPU and GPU
    [Timeline diagram: between two kernel launches, cudaFree and cudaMalloc each synchronize the CPU with the GPU, leaving both idle]

  10. Memory pool
    • Let's cache freed blocks in a memory pool
    • Avoid cudaMalloc / cudaFree as much as possible
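    The pool is easy to observe from Python (not on the slide; a minimal sketch assuming a CuPy version whose MemoryPool exposes used_bytes(), total_bytes() and free_all_blocks()):

    import cupy

    pool = cupy.get_default_memory_pool()          # CuPy allocates through this pool by default

    a = cupy.arange(1 << 20, dtype=cupy.float32)
    print(pool.used_bytes(), pool.total_bytes())   # bytes in use / bytes held by the pool

    del a                                          # the block returns to the pool, not to cudaFree
    print(pool.used_bytes(), pool.total_bytes())   # used drops, total stays cached

    pool.free_all_blocks()                         # explicitly give cached blocks back to the device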


  11. Memory pool of CuPy: Best-fit algorithm
    [Diagram: the arena (free bins) holds free lists of chunks keyed by size: 512, 1024, 1536, 2048, 2560, …]
    1. Round up the requested memory size to a multiple of 512
    2. cudaMalloc, and use it
    3. On free, push the chunk to the arena instead of calling cudaFree
    4. On the next allocation, pop from the arena if a chunk of exactly the same size is available, instead of calling cudaMalloc (see the sketch below)
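    The steps above can be sketched in a few lines of plain Python (a hypothetical illustration, not CuPy's actual implementation; cupy.cuda.runtime.malloc stands in for the real allocation path and a CUDA device is assumed):

    import cupy

    ALLOC_UNIT = 512

    class SimplePool(object):
        """Hypothetical exact-size (best-fit) pool; CuPy's real pool lives in cupy.cuda.memory."""

        def __init__(self):
            self.arena = {}                          # rounded size -> list of free device pointers

        def _round_up(self, size):
            return ((size + ALLOC_UNIT - 1) // ALLOC_UNIT) * ALLOC_UNIT   # step 1

        def malloc(self, size):
            size = self._round_up(size)
            free_list = self.arena.get(size)
            if free_list:                            # step 4: exact-size cache hit
                return free_list.pop()
            return cupy.cuda.runtime.malloc(size)    # step 2: real cudaMalloc

        def free(self, ptr, size):
            size = self._round_up(size)
            self.arena.setdefault(size, []).append(ptr)   # step 3: cache instead of cudaFree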


  12. Issues with the best-fit algorithm
    • A cache miss occurs even if memory blocks larger than the required size exist in the pool (best-fit).
    • Such cache misses typically occur in Natural Language Processing applications, whose input data sizes vary.
    [Diagram: the wanted size has no chunk in its own bin even though a larger cached chunk (2560) exists]
    How about using a cached block of a larger size (2560)?

  13. Memory pool of CuPy: Best-fit with coalescing (BFC)
    Split:
    1. Pop a chunk if one of a larger size than the required size is available
    2. Split it and use only the necessary size; push the chunk of the remaining size back to the arena
    [Diagram: a 2560-byte chunk is popped, split into the wanted part and the remainder, and the remainder is pushed back to its bin; split chunks keep prev/next links to each other]

  14. Memory pool of CuPy: Best-fit with coalescing (BFC)
    Merge:
    1. Merge the chunk being freed with its next or prev chunks if they are in the free lists (see the sketch below)
    2. Push the merged chunk back to the arena
    [Diagram: a freed chunk is coalesced with its free neighbors via the prev/next links and pushed back to the bin for the merged size]
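    A rough sketch of the split/merge bookkeeping (hypothetical and heavily simplified compared to cupy.cuda.memory; chunks here are plain Python objects rather than device memory):

    class Chunk(object):
        def __init__(self, offset, size):
            self.offset, self.size = offset, size
            self.prev = None        # neighboring chunks carved out of the same allocation
            self.next = None
            self.in_use = False

    def split(chunk, want):
        """Use the first `want` bytes of `chunk`; return (used chunk, remainder chunk)."""
        used = Chunk(chunk.offset, want)
        remainder = Chunk(chunk.offset + want, chunk.size - want)
        used.prev, used.next = chunk.prev, remainder
        remainder.prev, remainder.next = used, chunk.next
        used.in_use = True
        return used, remainder      # the remainder goes back to its free list

    def merge(chunk):
        """Coalesce a freed chunk with free neighbors before pushing it back to the arena."""
        chunk.in_use = False
        if chunk.next is not None and not chunk.next.in_use:   # merge with next
            chunk.size += chunk.next.size
            chunk.next = chunk.next.next
            if chunk.next is not None:
                chunk.next.prev = chunk
        if chunk.prev is not None and not chunk.prev.in_use:   # merge with prev
            chunk.prev.size += chunk.size
            chunk.prev.next = chunk.next
            if chunk.next is not None:
                chunk.next.prev = chunk.prev
            chunk = chunk.prev
        # in the real pool the absorbed neighbors are also removed from their free lists
        return chunk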


  15. The effect of BFC
    • In situations where:
      1. the previous input data size is large, and
      2. the next input data size is smaller,
    • we can still reuse a cached memory block.

  16. The effect of BFC
    https://github.com/cupy/cupy/pull/168
    1/4 the memory usage!
    And 1.3 times faster

  17. CuPy 2.0 Improvements
    1. Memory pool improvements with best-fit with coalescing
    2. GPU memory profiler with CuPy memory hook
    3. Support CUDA stream

  18. What often happens
    cupy.cuda.runtime.CUDARuntimeError:
    cudaErrorMemoryAllocation: out of memory
    • How to investigate it?
    • Use a GPU memory profiler

  19. GPU: Chainer Function Memory Profiler (available in Chainer)

    from chainer.function_hooks import CupyMemoryProfileHook

    hook = CupyMemoryProfileHook()
    with hook:
        trainer.run()
    hook.print_report()

    The report shows, per Chainer function, the bytes used from the CuPy memory pool and the bytes acquired from the GPU device.

  20. CuPy Line Memory Profiler (available in CuPy)

    import cupy.cuda.memory_hooks

    hook = cupy.cuda.memory_hooks.LineProfileHook()
    with hook:
        pass  # run some CuPy code here
    hook.print_report()

    _root (4.00KB, 4.00KB)
    lib/python3.6/runpy.py:193:_run_module_as_main (4.00KB, 4.00KB)
    lib/python3.6/runpy.py:85:_run_code (4.00KB, 4.00KB)
    lib/python3.6/unittest/__main__.py:18: (4.00KB, 4.00KB)
    lib/python3.6/unittest/main.py:94:__init__ (4.00KB, 4.00KB)
    lib/python3.6/unittest/main.py:255:runTests (4.00KB, 4.00KB)

    Each line shows the bytes used from the CuPy memory pool and the bytes acquired from the GPU device.

  21. FYI: Chainer Function Time Profiler (available in Chainer)

    from chainer.function_hooks import TimerHook

    hook = TimerHook()
    with hook:
        trainer.run()
    hook.print_report()

    FunctionName         ElapsedTime  Occurrence
    LinearFunction       1.24sec      3900
    ReLU                 593.05ms     2600
    SoftmaxCrossEntropy  824.11ms     1300
    Accuracy             176.54ms     700

  22. FYI: Line Time Profiler (Python's profiler is available)

    import cProfile, pstats, io

    pr = cProfile.Profile()
    pr.enable()
    # ... do something ...
    pr.disable()
    s = io.StringIO()
    sortby = 'cumulative'
    ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
    ps.print_stats()
    print(s.getvalue())

    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
         1    0.000    0.000    0.001    0.001  <string>:1(<module>)
         1    0.000    0.000    0.001    0.001  re.py:212(compile)
         1    0.000    0.000    0.001    0.001  re.py:268(_compile)
         1    0.000    0.000    0.000    0.000  sre_compile.py:172(_compile_charset)
         1    0.000    0.000    0.000    0.000  sre_compile.py:201(_optimize_charset)
         4    0.000    0.000    0.000    0.000  sre_compile.py:25(_identityfunction)
       3/1    0.000    0.000    0.000    0.000  sre_compile.py:33(_compile)

    https://docs.python.org/3/library/profile.html

  23. FYI: GPU Profiler (nvprof and nvvp are available)

    $ /usr/local/cuda/bin/nvprof python examples/cusolver.py
    ==27986== NVPROF is profiling process 27986, command: python examples/stream/cusolver.py
    ==27986== Profiling application: python examples/stream/cusolver.py
    ==27986== Profiling result:
    Time(%)  Time      Calls  Avg       Min       Max       Name
    41.70%   125.73us  4      31.431us  30.336us  33.312us  void nrm2_kernel<double, int=0, int=0, int=128, int=0>(cublasNrm2Params)
    21.94%   66.144us  36     1.8370us  1.7600us  2.1760us  [CUDA memcpy DtoH]
    13.77%   41.536us  48     865ns     800ns     1.4400us  [CUDA memcpy HtoD]
    0.98%    2.9440us  2      1.4720us  1.2160us  1.7280us  void reset_diagonal_real<int=8>(int, double*, int)
    0.98%    2.9440us  4      736ns     736ns     736ns     [CUDA memset]
    ==27986== API calls:
    Time(%)  Time      Calls  Avg       Min       Max       Name
    60.34%   408.55ms  9      45.395ms  4.8480us  407.94ms  cudaMalloc
    37.60%   254.60ms  2      127.30ms  556ns     254.60ms  cudaFree
    0.94%    6.3542ms  712    8.9240us  119ns     428.32us  cuDeviceGetAttribute
    0.72%    4.8747ms  8      609.33us  320.37us  885.26us  cuDeviceTotalMem
    0.10%    693.60us  82     8.4580us  2.8370us  72.004us  cudaMemcpyAsync
    0.08%    511.79us  1      511.79us  511.79us  511.79us  cudaHostAlloc
    0.08%    511.75us  8      63.969us  41.317us  99.232us  cuDeviceGetName
    0.05%    310.04us  1      310.04us  310.04us  310.04us  cuModuleLoadData

  24. FYI: GPU Profiler (nvprof and nvvp are available)

    $ nvprof -o prof.nvvp python examples/cusolver.py
    $ /Developer/NVIDIA/CUDA-9.0/bin/nvvp prof.nvvp

    I install nvvp on Mac OS X and scp prof.nvvp from the GPU machine.
    https://developer.nvidia.com/cuda-downloads

  25. CuPy Improvements
    1. Memory pool improvements with best-fit with coalescing
    2. GPU memory profiler with CuPy memory hook
    3. Support CUDA stream

  26. What is a CUDA stream?
    http://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf
    • The ability to perform multiple CUDA operations simultaneously:
      • CUDA kernel execution
      • cudaMemcpyAsync (HostToDevice)
      • cudaMemcpyAsync (DeviceToHost)

  27. Concurrency example
    http://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf
    • Overlap memcpy and kernel execution
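    At the CuPy level, the overlap can be expressed roughly as follows (not from the slide; a hedged sketch assuming pinned host memory, since copies from pageable memory fall back to synchronous behaviour; ndarray.set() accepts an explicit stream for the copy):

    import numpy as np
    import cupy

    copy_stream = cupy.cuda.Stream(non_blocking=True)
    compute_stream = cupy.cuda.Stream(non_blocking=True)

    h_next = np.random.rand(1 << 20).astype(np.float32)        # next input batch (pinned memory is ideal)
    d_next = cupy.empty(h_next.shape, dtype=h_next.dtype)      # pre-allocated device buffer
    d_prev = cupy.asarray(np.random.rand(1 << 20).astype(np.float32))

    d_next.set(h_next, stream=copy_stream)    # queue the H2D copy on one stream ...
    with compute_stream:
        d_out = cupy.sqrt(d_prev)             # ... while a kernel runs on another stream

    copy_stream.synchronize()
    compute_stream.synchronize()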


  28. Concurrency example
    • NNs with branches, e.g., the inception module of GoogLeNet
    • Using 4 streams (concurrent executions) for the 4 branches could improve performance.

  29. Issues in supporting CUDA streams
    • Memory pool with one stream
    [Timeline diagram: the CPU launches kernel1 and kernel2 on stream1, with malloc/free on the CPU between the launches]
    • Memory is returned to the pool before kernel execution finishes
    • This was fine because kernel2 is guaranteed to run after kernel1 on the same stream.

  30. Issues in supporting CUDA streams
    • Broken memory pool with two streams
    [Timeline diagram: kernel1 on stream1 and kernel2 on stream2 run concurrently, while the CPU does malloc/free between the two launches]
    • kernel2 may use memory blocks that kernel1 is still using.

  31. How to support CUDA streams
    • (1) Create a separate memory pool for each stream
    • (2) Use a CUDA stream callback?

  32. (1) A separate memory pool for each stream
    [Timeline diagram: stream1 allocates from Mem pool1 and stream2 from Mem pool2, so kernel1 and kernel2 never share chunks]
    • Drawback: stream2 cannot reuse cached memory of stream1
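    A hypothetical pure-Python sketch of this idea (not CuPy's actual implementation; cupy.cuda.runtime.malloc stands in for the real allocation path) keeps one arena per stream, keyed by the stream pointer:

    import cupy

    class PerStreamPool(object):
        """One arena (rounded size -> cached device pointers) per CUDA stream."""

        def __init__(self):
            self._arenas = {}                            # stream.ptr -> arena dict

        def _arena(self, stream):
            return self._arenas.setdefault(stream.ptr, {})

        def malloc(self, size, stream):
            size = ((size + 511) // 512) * 512           # round up as in the single-stream pool
            free_list = self._arena(stream).get(size)
            if free_list:
                return free_list.pop()                   # reuse only chunks freed on the same stream
            return cupy.cuda.runtime.malloc(size)

        def free(self, ptr, size, stream):
            size = ((size + 511) // 512) * 512
            self._arena(stream).setdefault(size, []).append(ptr)   # cache per stream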


  33. (2) Use a CUDA stream callback
    [Timeline diagram: each kernel launch registers a stream callback that calls free only after that kernel finishes]
    • kernel2 does not touch memory that kernel1 is still using
    • Drawback: registering a callback for every kernel would degrade performance

  34. How to support CUDA streams
    • (1) Create a separate memory pool for each stream
    • (2) Use a CUDA stream callback?
      → cudaStreamAddCallback synchronizes the CPU and GPU
      https://gist.github.com/sonots/e98a95aaceae65a15d2b59a81bem023
    So, we chose (1).

  35. How to use streams with CuPy

    # example.py
    import cupy

    x = cupy.array([1])
    y = cupy.array([1])

    stream1 = cupy.cuda.stream.Stream()
    stream2 = cupy.cuda.stream.Stream()

    with stream1:
        z1 = x + y
    with stream2:
        z2 = x * y

    # By default, the default stream waits until all streams' operations finish
    z = z1 + z2
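    If you need to wait for a particular stream explicitly rather than relying on the default stream (a usage note, not from the slide), Stream.synchronize() blocks the CPU until all work queued on that stream has finished:

    stream1.synchronize()   # wait for z1 = x + y
    stream2.synchronize()   # wait for z2 = x * y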


  36. How to use streams with Chainer

    # example.py
    import chainer
    import chainer.functions as F
    import chainer.links as L

    class MyAwesomeNet(chainer.Chain):

        def __init__(self):
            super(MyAwesomeNet, self).__init__()
            with self.init_scope():
                self.stream1 = chainer.cuda.stream.Stream()
                self.stream2 = chainer.cuda.stream.Stream()
                self.conv1 = L.Convolution2D(None, 384, 3, pad=1)
                self.conv2 = L.Convolution2D(None, 384, 3, pad=1)

        def __call__(self, x, t):
            with self.stream1:
                h1 = self.conv1(x)
            with self.stream2:
                h2 = self.conv2(x)

  37. How to use streams with CuPy

    $ nvprof --print-gpu-trace python example.py
    Start     Duration  [omit]  Stream  Name
    617.88ms  1.4400us  [omit]  7       [CUDA memcpy HtoD]
    840.73ms  2.8480us  [omit]  13      cupy_copy [415]
    841.58ms  7.2640us  [omit]  13      cupy_power [424]
    842.33ms  3.5840us  [omit]  13      cupy_sum [433]
    842.99ms  2.9760us  [omit]  13      cupy_sqrt [440]
    843.10ms  2.3040us  [omit]  14      cupy_copy [446]
    843.19ms  6.5600us  [omit]  14      cupy_power [452]
    843.27ms  3.2320us  [omit]  14      cupy_sum [458]
    843.33ms  2.6880us  [omit]  14      cupy_sqrt [462]

  38. Conclusion: CuPy Improvements
    1. Memory pool with best-fit with coalescing
    2. GPU memory profiler
       • Chainer function GPU memory profiler
       • CuPy line memory profiler
    3. Support CUDA stream
    Released in: CuPy 2.0.0, Chainer 3.0.0, CuPy 2.1.0, CuPy 4.0.0b1