Implementation of Cumo, a CUDA-aware version of Ruby/Numo

A report on the Cumo project for the Ruby Association Grant 2017.

Naotoshi Seo

July 07, 2018

Transcript

  1. Implementation of Ruby/Cumo,
    a CUDA-aware version of
    Ruby/Numo
    Naotoshi Seo
    July 07, 2018
    Grant 2017 Report
    https://github.com/sonots/cumo


  2. Self Introduction
    • Naotoshi Seo @sonots
    • DeNA Co., Ltd.
    • CRuby committer
    • Recently on secondment (出向) to
    Preferred Networks, Inc., working on
    development of a DNN framework


  3. Outline
    • Project Introduction

    • Cumo Features

    • Notices (or Difficulties) I met

    • Feature Proposals to Ruby

    • Project Achievement


  4. Project Introduction

  5. Project Proposal


  6. https://ruby-numo.github.io/narray/


  7. Why GPU?
    • GPU is fast, and recently essential for Deep Learning

    • GPU is good at parallel computation

    • The order of magnitude: about 24 cores on a CPU
    vs. 3,000 ~ 4,000 cores on a GPU

    • GPU is bad at branching

    • Instead, GPU simplifies branch prediction and
    out-of-order execution mechanisms

    • GPU is suitable for matrix computation


  8. Position of Cumo
    in Scientific Computing in Ruby

  9. Scientific Computing in Python
    • DNN: Chainer, TensorFlow (C++), MXNet (C++)

    • Tensor: NumPy (CPU), CuPy (GPU)

    • CUDA binding: PyCUDA

    • Useful tools for writing bindings: Cython, pybind11


  10. Scientific Computing in Ruby
    • DNN: Red-chainer, TensorFlow.rb (C++), MXNet.rb (C++)

    • Tensor: Numo/NArray (or NMatrix) (CPU), Cumo (GPU)

    • CUDA binding: RbCUDA

    • Useful tools for writing bindings: Rubex

    • Binding to Python: PyCall


  11. Outline
    • Project Proposal

    • Cumo Features

    • Notices (or Difficulties) I met

    • Feature Proposals to Ruby

    • Project Achievement


  12. (image-only slide)

  13. (image-only slide)

  14. Cumo Features
    • Highly compatible with Ruby/Numo

    • Element-wise operations

    • Reduction operations

    • Dot operation using cuBLAS

    • CUDA memory pool

    • JIT compilation of user-defined functions

  15. Highly Compatible with Numo
    • Ruby/Numo users can easily switch to
    Cumo to leverage the power of GPU

    find . -name '*.rb' | xargs sed -i \
      -e 's/Numo/Cumo/g' \
      -e 's/numo/cumo/g'

    if gpu
      require 'cumo/narray'
      Xumo = Cumo
    else
      require 'numo/narray'
      Xumo = Numo
    end

    a = Xumo::DFloat.zeros(3)
    b = Xumo::DFloat.ones(3)
    c = a + b


  16. Element-wise operations
    • Element-wise operations are like
    matrix additions

    • All elements are
    independent

    • Easy to perform in
    parallel

    A     = [1 2 3 4 5 6]
    B     = [2 3 4 5 6 7]
    A + B = [3 5 7 9 11 13]
    (one GPU thread ID per element)
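    The same example in Cumo, as a minimal sketch (Cumo mirrors
    Numo's array API, so the bracket constructor and Int32 type
    below follow that API):

    require 'cumo/narray'

    a = Cumo::Int32[1, 2, 3, 4, 5, 6]
    b = Cumo::Int32[2, 3, 4, 5, 6, 7]
    c = a + b  # each element can be computed by an independent GPU thread
    # c => Cumo::Int32[3, 5, 7, 9, 11, 13]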


  17. Reduction operations
    • Like sum

    • sum([1,2,3,4]) #=> 10

    • Elements are not independent

    • Not so easy to perform in parallel


  18. http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
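    The core idea of that deck in a minimal plain-Ruby sketch
    (illustrative only; tree_sum is an invented name, and Cumo
    implements reduction as a CUDA kernel, not in Ruby):

    # Tree reduction: halve the array at each step. All pairs within
    # a step are independent, so a GPU can sum them in parallel,
    # finishing in O(log n) steps.
    def tree_sum(values)
      until values.size == 1
        values = values.each_slice(2).map(&:sum)
      end
      values.first
    end

    tree_sum([1, 2, 3, 4]) #=> 10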


  19. (figure: reduction operations)


  20. Dot product (GEMM) using cuBLAS
    • Dot is more complicated than reduction

    • NVIDIA's cuBLAS library supports it as GEMM
    (GEneral Matrix-Matrix multiplication), and it is fast

    • However, cuBLAS supports only F-contiguous
    (column-major) layout, while CRuby extensions
    are written in C (C-contiguous, row-major)

    Logical matrix:   Memory order:
    1 2 3             C-contiguous (row-major):    1 2 3 4 5 6 7 8 9
    4 5 6             F-contiguous (column-major): 1 4 7 2 5 8 3 6 9
    7 8 9


  21. Using cuBLAS with C-contiguous data
    C-contiguous (row-major): A is 3x2, B is 2x3, C = A.dot(B) is 3x3
    A = [1, 2, 3, 4, 5, 6]
    B = [1, 2, 3, 4, 5, 6]
    C = [9, 12, 15, 19, 26, 33, 29, 40, 51]

    F-contiguous (column-major): the same bytes read as A^T (2x3)
    and B^T (3x2); computing B^T x A^T produces exactly the same
    bytes as C above
    A = [1, 2, 3, 4, 5, 6]
    B = [1, 2, 3, 4, 5, 6]
    C = [9, 12, 15, 19, 26, 33, 29, 40, 51]

    No data copy, changing only attributes (shape)
    https://www.christophlassner.de/using-blas-from-c-with-row-major-data.html
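    The identity behind the trick, (B^T)(A^T) = (A B)^T, checked with
    Ruby's standard Matrix class (a sketch for illustration only; Cumo
    applies the idea at the cuBLAS GEMM level):

    require 'matrix'

    a = Matrix[[1, 2], [3, 4], [5, 6]]  # 3x2, row-major view of [1..6]
    b = Matrix[[1, 2, 3], [4, 5, 6]]    # 2x3, row-major view of [1..6]

    # A column-major GEMM over row-major buffers effectively computes
    # B^T * A^T; reading the result back as row-major (one more
    # transpose) gives A * B with no data copy.
    c = (b.transpose * a.transpose).transpose
    c == a * b  #=> true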


  22. CUDA Memory Pool

  23. Why We Need Memory Pool
    • cudaMalloc / cudaFree make things slow
    • memory allocation / free are slow in themselves
    • cudaMalloc synchronizes CPU and GPU

    (Timeline figure: the CPU launches Kernel1, then calls cudaFree
    and cudaMalloc; each synchronizes with the GPU, which sits idle
    until Kernel2 is finally launched.)


  24. Memory Pool
    • Cache freed blocks in a memory pool
    • Avoid cudaMalloc / cudaFree as much as possible


  25. CUDA Memory Pool
    (Arena figure: free lists binned by rounded sizes 512, 1024,
    1536, 2048, 2560, ...; a block is popped on allocation (1),
    pushed back on free (2), split when larger than needed, and
    carries next/prev/in-use links.)

    1. Round up memory size to a multiple of 512
    2. cudaMalloc if no free block is available
    3. Push to the arena instead of cudaFree
    4. Pop from the arena instead of cudaMalloc
       if a free block is available

    Implemented Best-Fit with Coalescing (BFC),
    which is the approach used in malloc(3)
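    The push/pop flow as a minimal Ruby sketch (illustrative only:
    PoolSketch and Block are invented names, and the real pool is a
    BFC allocator written in C with block splitting and coalescing):

    # One free list per rounded size; reuse cached blocks when possible.
    Block = Struct.new(:size)

    class PoolSketch
      ROUND = 512

      def initialize
        @arena = Hash.new { |h, k| h[k] = [] }  # rounded size => free blocks
      end

      def malloc(bytes)
        size = (bytes + ROUND - 1) / ROUND * ROUND  # 1. round up by 512
        @arena[size].pop ||                         # 4. reuse a cached block,
          Block.new(size)                           # 2. else a real cudaMalloc
      end

      def free(block)
        @arena[block.size] << block                 # 3. cache instead of cudaFree
      end
    end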


  26. JIT compiling user-defined functions (planned)
    • Cumo will support users writing their own
    CUDA kernels in Ruby

    • JIT compile using NVRTC (NVIDIA Runtime
    Compilation), and cache the result on the file system

    kernel = Cumo::ElementwiseKernel.new(
      'float32 x, float32 y, float32 z',  # input types
      'float32 w',                        # output type
      'w = (x * y) + z;',                 # CUDA code
      'my_kernel')                        # kernel name

    w = kernel.call(x, y, z)


  27. Performance Comparison
    with Numo

  28. Element-wise kernel
    (Benchmark table: execution time by size, Numo vs. Cumo;
    smaller is better)

    a = Xumo::Float32.ones(size)
    b = Xumo::Float32.ones(size)
    a + b

    40 times faster for size of 10^8

    Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
    NVIDIA Volta V100
    (AWS p3 xlarge)


  29. Dot product
    (Benchmark table: execution time by size, Numo vs. Cumo;
    smaller is better)

    a = Xumo::Float32.ones(100, size/100)
    b = Xumo::Float32.ones(size/100, 100)
    a.dot(b)

    2800 times faster for size of 10^8
    ※ Numo without Numo/Linalg

    Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
    NVIDIA Volta V100
    (AWS p3 xlarge)


  30. red-chainer mnist example
    • 20 times faster w/o memory pool

    • 75 times faster w/ memory pool

    Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
    NVIDIA Volta V100
    (AWS p3 xlarge)


  31. Outline
    • Project Proposal

    • Cumo Features

    • Notices (or Difficulties) I met

    • Feature Proposals to Ruby

    • Project Achievement


  32. Notices (or Difficulties) I met
    • GPU unfriendliness with GC

    • Difficulties compiling CUDA kernels

    • Lack of mkmf features

    • Incompatibility with Numo is required in
    reduction kernels for performance

    • Broadcast operations were slow


  33. GPU unfriendliness with GC
    • One criterion Ruby uses to trigger GC
    is main memory usage (malloc_limit)

    • GPU memory usage is not
    taken into account

    • In the case of CuPy, because Python
    uses reference counting, GPU memory
    can be released immediately once the
    array object is no longer referenced.

    def add
      a = Cumo::DFloat.ones(3, 5)
      b = Cumo::DFloat.ones(3, 5)
      a + b
    end
    c = add
    # a and b are not immediately freed


  34. GPU unfriendliness with GC (2)
    • (Partial) Solution

    • Added NArray#free to release
    GPU memory at a timing of
    the user's choosing

    • Future work?

    • Something like NSAutoreleasePool
    to release all (or a restricted set of)
    objects created inside a scope.

    def add
      a = Cumo::DFloat.ones(3, 5)
      b = Cumo::DFloat.ones(3, 5)
      c = a + b
      a.free; b.free
      c
    end
    c = add
    # a and b are immediately freed

    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
    NSObject *obj = [[[NSObject alloc] init] autorelease];
    ....
    [pool release];
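    One hypothetical shape for that future work (purely a sketch:
    Cumo does not provide this API, and autorelease_scope / keep are
    invented names for illustration):

    # Free every registered NArray when the scope ends, keeping only
    # the block's return value alive.
    def autorelease_scope
      pool = []
      yield ->(narray) { pool << narray; narray }
    ensure
      pool.each(&:free)
    end

    c = autorelease_scope do |keep|
      a = keep.(Cumo::DFloat.ones(3, 5))
      b = keep.(Cumo::DFloat.ones(3, 5))
      a + b  # returned; a and b are freed on scope exit
    end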


  35. Lack of mkmf features
    • Need to use nvcc (NVIDIA CUDA Compiler)
    instead of gcc to compile CUDA kernels.

    • However, mkmf only supports specifying the CC
    and CXX compilers (it has no rule for .cu files)

    • Solution: Made a wrapper Ruby script

    • For files with .cu extensions, use nvcc

    • For files with .c extensions, use gcc
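    The dispatch idea as a minimal sketch (illustrative only; Cumo's
    actual wrapper is more involved, and the argument handling here
    is simplified):

    #!/usr/bin/env ruby
    # Pretend to be a single C compiler for mkmf: route .cu sources
    # to nvcc and everything else to gcc.
    if ARGV.any? { |arg| arg.end_with?('.cu') }
      exec('nvcc', *ARGV)
    else
      exec('gcc', *ARGV)
    end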


  36. Incompatibility with Numo is required
    in reduction for performance
    • Numo returns a Ruby numeric
    object for reduction kernels (in
    cases where the result is a
    0-dimensional NArray).

    • Cumo would need to copy GPU
    memory to host memory to
    create a Ruby numeric object.

    • That results in synchronization
    between CPU and GPU.

    • Solution: Introduced a partial
    incompatibility with Numo:
    return a 0-dimensional NArray.

    Numo::Int64.ones(2, 3).sum
    #=> 6

    Cumo::Int64.ones(2, 3).sum
    #=> Cumo::Int64#shape=[]
    #   6

    Returns a 0-dimensional NArray
    instead of a Ruby numeric object to
    avoid CPU and GPU synchronization.


  37. Broadcast operations were slow
    • Broadcast: adding a 4x3 array and a 1x3 array;
    the 1x3 row is repeated for each of the 4 rows

    A (4x3):      B (1x3):   A + B (4x3):
     1  2  3      1 2 3       2  4  6
     4  5  6                  5  7  9
     7  8  9                  8 10 12
    10 11 12                 11 13 15


  38. How Numo Treats Broadcast
    Example) 1000 x 3 array + 1 x 3 array

    /* narray loop: loop for 1000 */
    int nd = 1;
    int shape[] = {1000};
    for (int i = 0; i < nd; i++) {
        for (int j = 0; j < shape[i]; j++) {
            (*(nf->func))(&(lp->user));  /* calls the user loop below */
        }
    }

    /* user loop: loop for 3 */
    int size = 3;
    for (int i = 0; i < size; i++) {
        p3[i] = p1[i] + p2[i];
    }


  39. Launches Many CUDA Kernels
    • In the first implementation of Cumo, only the
    user loop was turned into a CUDA kernel

    /* narray loop: still runs on the CPU, loop for 1000 */
    int nd = 1;
    int shape[] = {1000};
    for (int i = 0; i < nd; i++) {
        for (int j = 0; j < shape[i]; j++) {
            (*(nf->func))(&(lp->user));  /* launches the kernel below */
        }
    }

    /* user loop: now a CUDA kernel */
    __global__ void my_kernel(
        int* p3,
        int* p2,
        int* p1)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        p3[i] = p1[i] + p2[i];
    }

    Launches CUDA kernels 1000 times.


  40. How Slow Launching CUDA Kernels Is
    nvprof:
    Type            Time(%)  Time      Calls  Avg       Min       Max       Name
    GPU activities: 99.89%   19.439ms  1000   19.439us  18.880us  21.312us  cumo_sfloat_add
    API calls:      27.23%   330.78ms  13     25.445ms  35.083us  68.418ms  cudaDeviceSynchronize
                    26.34%   319.98ms  1      319.98ms  319.98ms  319.98ms  cuCtxCreate
                    25.32%   307.66ms  1477   208.30us  13.408us  275.62ms  cudaMallocManaged
                    2.58%    18.703ms  1002   18.665us  16.184us  216.70us  cudaLaunch

    • One cudaLaunch takes about 18 microseconds, so 1000 launches
    add roughly 19 ms of launch overhead alone

    • The time of one cudaLaunch is almost equivalent to
    adding two arrays of 500,000 elements

    • Also, there is a limit on the CUDA queue size, e.g., 1,024


  41. Solution: Stop Using ndloop
    • Finally, stopped using ndloop
    (ndloop is slow for Cumo)

    a = Cumo::DFloat.ones(1000, 768)
    b = Cumo::DFloat.ones(1000, 1)
    a + b

    56 times faster

  42. Outline
    • Project Proposal

    • Cumo Features

    • Notices (or Difficulties) I met

    • Feature Proposals to Ruby

    • Project Achievement


  43. Feature Proposals to Ruby
    • Inplace math operations

    • Temporary variable

  44. Inplace Math Operations
    • a += b is an abridged notation of a = a + b

    • Imagine a is a large matrix requiring 1GB.

    • a += b needs to allocate a new 1GB matrix.

    • Want to redefine += for Cumo::NArray objects;
    Python allows redefining +=.

    • Current compromise:

    a.inplace + b

    https://bugs.ruby-lang.org/issues/14701
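    How the compromise reads in practice, as a small sketch (the
    3-element shape here is just for illustration):

    a = Cumo::DFloat.ones(3)
    b = Cumo::DFloat.ones(3)
    a.inplace + b  # writes the result into a's existing GPU buffer
                   # instead of allocating a new array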


  45. Temporary Variable
    • In Python, we can tell whether a value is a
    temporary by looking at its reference count

    • In NumPy,

    y = x + 1 + 1

    • is faster than

    y = x + 1
    y + 1

    • because (x + 1) is a temporary variable, so no new
    memory is required to compute (x + 1) + 1

    https://bugs.ruby-lang.org/issues/14710


  46. Outline
    • Project Proposal

    • Cumo Features

    • Notices (or Difficulties) I met

    • Feature Proposals to Ruby

    • Project Achievement


  47. All Achieved

  48. Future Work
    • Support cuDNN for high performance convolutional networks

    • Support Float16

    • Conversion between Numo::NArray and Cumo::NArray

    • CI server ...

  49. Supported Functions List
    -      <<        atan2        eq       floor    log10      min_index   rms     stddev
    [email protected]  >>  atanh        erf      ge (>=)  log1p      minimum     round   store
    []     |         cbrt         erfc     gemm     log2       mulsum      seq     sum
    []=    ~         ceil         exp      gt (>)   logseq     ne          sign    tan
    *      acos      coerce_cast  exp10    hypot    lt (<)
    **     acosh     conj         exp2     im       max        poly        sin     trunc
    /      allocate  copysign     expm1    inspect  max_index  prod        sinc    var
    &      asin      cos          extract  ldexp    maximum    ptp         sinh
    %      asinh     cosh         eye      le (<=)  mean       reciprocal  sqrt
    ^      atan      divmod       fill     log      min        rint        square
    * 88 methods

    Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64,
    SFloat (float), DFloat (double), SComplex, DComplex (mixed)

  50. Not Yet
    IntXX, FloatXX, ComplexXX (mixed): * 23 methods
    abs       isnan      set_real
    arg       isneginf   sort
    bincount  isposinf   sort_index
    clip      median
    cumprod   minmax
    cumsum    modf
    frexp     rand
    imag      rand_norm
    isfinite  real
    isinf     set_imag

    Bit: * 20 methods (most of all)
    []           count_false
    []=          count_true
    &            eq
    ^            extract
    |            fill
    ~            mask
    all?         none?
    any?         store
    coerce_cast  where
    copy         where2


  51. Acknowledgements
    • Ruby Association for the Grant

    • Money - GPU machines cost a lot

    • Time keeping - the schedule kept me on track

    • Motivation

    • @mrkn for his mentoring on the grant

    • @masa16 for answering my questions about Numo

    • @hatappi and @naitoh for their work on red-chainer

    • red-data-tools org and Speee, Inc for hosting meetups

    • Preferred Networks, Inc and the developers (including me) of
    Chainer/CuPy for the reference implementation

    • And, my wife for giving me time to develop
