Fast Numerical Computing and Deep Learning in Ruby with Cumo

Naotoshi Seo

May 31, 2018

  1. Fast Numerical Computing
    and Deep Learning in Ruby
    with Cumo
    Naotoshi Seo
    RubyKaigi 2018 May 31


  2. This Talk
    • About Cumo (pronounced like “koomo”), which I
    have recently been working on personally with the
    Ruby Association Grant 2017

    • Cumo is a high-speed numerical computing library
    for Ruby using CUDA (GPU)

    • Explains the internals of Cumo
    • Also explains CUDA programming basics

    • GOAL: to increase the number of people contributing
    to Ruby in the scientific computing (deep learning) field
    https://github.com/sonots/cumo


  3. Self Introduction
    • Naotoshi Seo @sonots
    • DeNA Co., Ltd.
    • CRuby committer
    • Recently working on the
    development of a DNN
    framework at Preferred
    Networks, Inc. (on secondment)


  4. Outline
    • Overview of Scientific Computing in Ruby

    • Why Cumo is needed

    • CUDA Programming Basics

    • Cumo Features

    • Notices (or Difficulties) I met

    • Feature Proposals to Ruby


  5. Why Cumo is needed
    ( Overview of Scientific Computing in Ruby )


  6. Scientific Computing in Python
    • DNN: Chainer, TensorFlow, MXNet
    • Tensor: NumPy, CuPy (C++)
    • CUDA binding: PyCUDA (C++)
    • Useful tools for writing bindings: Cython, pybind11

  7. Scientific Computing in Ruby
    • DNN: Red-chainer, TensorFlow.rb, MXNet.rb
    • Tensor: Numo/NArray (or NMatrix), Cumo (C++)
    • CUDA binding: RbCUDA (C++)
    • Useful tools for writing bindings: Rubex (N/A)
    • Binding to Python: PyCall

  8. CUDA Programming Basics
    ( Why GPU is required )


  9. Why GPU is required
    • Because it is faster

    • e.g., 30 days on a CPU -> 4 days on a GPU


  10. Characteristics of GPU
    • GPU is good at parallel computation

    • The order of magnitude: ~24 cores on a CPU
    vs. 3,000-4,000 cores on a GPU

    • GPU is bad at branching

    • GPU instead simplifies the branch
    prediction and out-of-order execution
    mechanisms

    • GPU is suitable for matrix
    computation


  11. CUDA architecture
    • Typically, CUDA programs are written as:

    1. Allocate GPU memory

    2. Transfer data from host (CPU) to device (GPU)

    3. Process at GPU

    4. Transfer result from device (GPU) to host (CPU)

    5. Free GPU memory


  12. Basic CUDA codes
    __global__ void addArraysOnGPU(float *A, float *B, float *C, const int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        C[i] = A[i] + B[i];
    }
    int main(int argc, char **argv)
    {
        // malloc device global memory
        float *d_A, *d_B, *d_C;
        cudaMalloc((float**)&d_A, nBytes);
        cudaMalloc((float**)&d_B, nBytes);
        cudaMalloc((float**)&d_C, nBytes);
        // transfer data from host to device
        cudaMemcpy(d_A, h_A, nBytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, nBytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_C, gpuRef, nBytes, cudaMemcpyHostToDevice);
        // invoke kernel with a <<<grid, block>>> launch configuration
        addArraysOnGPU<<<grid, block>>>(d_A, d_B, d_C, nElem);
        // copy kernel result back to host side
        cudaMemcpy(gpuRef, d_C, nBytes, cudaMemcpyDeviceToHost);
        // free device global memory
        CHECK(cudaFree(d_A));
        CHECK(cudaFree(d_B));
        CHECK(cudaFree(d_C));
    }


  13. Asynchronous
    • Kernel execution is asynchronous
    • cudaMalloc, cudaMemcpy, cudaFree are
    synchronous
    (figure: timeline — the CPU launches kernel1 and the GPU runs it
    while the CPU is otherwise idle; cudaMalloc, cudaMemcpy (H2D/D2H),
    and cudaFree block the CPU)


  14. GPU is like a job queue
    • You may think of the GPU as a
    job queue

    • The CPU enqueues a job to the
    GPU (launches a kernel)

    • The point is that CPU
    latency is hidden as long as
    the CPU finishes earlier than the GPU

    • Please note that
    cudaMalloc/cudaFree require
    synchronizing the GPU and CPU
    (I will talk about how to avoid this later)
    (figure: the CPU launches jobs, the GPU works through the queue,
    and the CPU only blocks at the final sync to get results)

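    To make the job-queue picture concrete, here is a toy Ruby model using
    only Thread::Queue (an analogy only; real CUDA queues kernels on the
    device). The producer keeps enqueuing work and blocks only when it
    finally waits for the result:

    queue = Queue.new
    gpu = Thread.new do
      while (job = queue.pop)      # nil means "no more jobs"
        job.call                   # the "kernel" runs while the CPU goes on
      end
    end

    result = nil
    queue << -> { result = (1..1_000_000).sum }  # launch a "kernel" (async)
    # ... the CPU can do other work here; its latency is hidden ...
    queue << nil
    gpu.join                       # synchronize: wait for the queue to drain
    p result                       #=> 500000500000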

  15. Outline
    • Overview of Scientific Computing in Ruby

    • Why Cumo is needed

    • CUDA Programming Basics

    • Cumo Features

    • Notices (or Difficulties) I met

    • Feature Proposals to Ruby


  16. Cumo Features
    • Highly compatible with Ruby/Numo

    • Element-wise operations

    • Reduction operations

    • Dot operation using cuBLAS

    • CUDA memory pool

    • JIT compilation of user-defined functions


  17. Highly Compatible with Numo
    • Ruby/Numo users can easily switch to
    Cumo to leverage the power of the GPU

    find . -name '*.rb' | xargs sed -i \
      -e 's/Numo/Cumo/g' \
      -e 's/numo/cumo/g'

    if gpu
      require 'cumo/narray'
      Xumo = Cumo
    else
      require 'numo/narray'
      Xumo = Numo
    end

    a = Xumo::DFloat.zeros(...)
    b = Xumo::DFloat.ones(...)
    c = a + b

  18. Element-wise operations
    • Element-wise is like
    matrix addition

    • All elements are
    independent

    • Easy to perform in
    parallel

    A = [1, 2, 3, 4, 5, 6]
    B = [2, 3, 4, 5, 6, 7]
    A + B = C = [3, 5, 7, 9, 11, 13]
    (one CUDA thread ID per element)

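    As a concrete sketch of the figure above (assuming the cumo gem and a
    CUDA-capable GPU), the same addition in code; each output element can
    be computed by its own CUDA thread:

    require 'cumo/narray'

    a = Cumo::Int32[1, 2, 3, 4, 5, 6]
    b = Cumo::Int32[2, 3, 4, 5, 6, 7]
    a + b   #=> Cumo::Int32[3, 5, 7, 9, 11, 13], one thread per element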

  19. Reduction operations
    • Like sum

    • sum([1,2,3,4]) #=> 10

    • Elements are not independent

    • Not so easy to perform in parallel

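    A minimal plain-Ruby sketch of the tree-based reduction idea shown on
    the next slide: pairs are summed level by level, so the additions
    within one level are independent and could run in parallel, finishing
    in O(log n) steps:

    def tree_sum(values)
      values = values.dup
      stride = 1
      while stride < values.size
        # every iteration of this inner loop is independent -> parallelizable
        (0...values.size).step(stride * 2) do |i|
          values[i] += values[i + stride] if i + stride < values.size
        end
        stride *= 2
      end
      values[0]
    end

    tree_sum([1, 2, 3, 4]) #=> 10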

  20. (figure: tree-based parallel reduction, from
    http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf )


  21. (figure: reduction computation, continued)


  22. Dot product using cuBLAS
    • Dot is more complicated than reduction

    • NVIDIA's cuBLAS library supports it as GEMM
    (GEneral Matrix-matrix Multiplication), and it is fast

    • However, cuBLAS supports only F-contiguous
    (column-major) data, although we write CRuby
    extensions in C (C-contiguous, row-major)

    C-contiguous:    F-contiguous (Fortran):
      1 2 3            1 4 7
      4 5 6            2 5 8
      7 8 9            3 6 9


  23. Using cuBLAS with C-contiguous data
    C-contiguous (row-major):
    A = [1, 2, 3, 4, 5, 6]
    B = [1, 2, 3, 4, 5, 6]
    C = [9, 12, 15, 19, 26, 33, 29, 40, 51]

    F-contiguous (column-major):
    A = [1, 2, 3, 4, 5, 6]
    B = [1, 2, 3, 4, 5, 6]
    C = [9, 12, 15, 19, 26, 33, 29, 40, 51]

    No data copy, changing only attributes (shape)
    https://www.christophlassner.de/using-blas-from-c-with-row-major-data.html

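    The trick can be verified with plain Numo (assuming the numo-narray
    gem): reinterpreting a row-major buffer as column-major yields the
    transpose, and (B^T . A^T)^T = A . B, so GEMM can be called with the
    operands swapped and no data copied:

    require 'numo/narray'

    a = Numo::DFloat[[1, 2], [3, 4], [5, 6]]       # 3x2, row-major
    b = Numo::DFloat[[1, 2, 3], [4, 5, 6]]         # 2x3, row-major

    c1 = a.dot(b)                                  # the result we want
    c2 = b.transpose.dot(a.transpose).transpose    # (B^T . A^T)^T
    p c1.to_a == c2.to_a                           #=> true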

  24. CUDA memory pool


  25. Why We Need a Memory Pool
    • cudaMalloc / cudaFree make things slow
    • memory allocation / free themselves are slow
    • cudaMalloc synchronizes the CPU and GPU
    (figure: timeline — cudaFree after Kernel1 and cudaMalloc before
    Kernel2 each synchronize the CPU and GPU, leaving both idle)


  26. Memory Pool
    • Cache freed blocks in a memory pool
    • Avoid cudaMalloc / cudaFree as much as possible


  27. (Maybe) The Simplest Way
    1. Round up memory sizes to a multiple of 512 bytes
    2. On free, append the chunk to a list (free list)
    3. On malloc, search the list for a chunk to reuse
    (figure: free_list -> chunk(512) -> chunk(2048) -> chunk(1024) -> NULL)


  28. Best-fit algorithm
    1. Round up memory sizes to a multiple of 512 bytes
    2. cudaMalloc, and use the block
    3. On free, push the chunk to the arena instead of
    calling cudaFree
    4. On malloc, pop from the arena if a chunk of exactly
    the same size is available, instead of calling cudaMalloc
    (figure: the arena is a set of free bins keyed by size —
    512, 1024, 1536, 2048, 2560, … — so finding a chunk of,
    say, 2560 bytes is O(1))

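    A minimal Ruby sketch of this scheme (a hypothetical class for
    illustration only; the real Cumo pool is implemented in C): sizes are
    rounded up to a multiple of 512, and freed chunks go into a bin per
    rounded size, so an exact-size reuse is an O(1) lookup:

    class SimplePool
      ROUND = 512
      Chunk = Struct.new(:size)

      def initialize
        @bins = Hash.new { |h, k| h[k] = [] }  # rounded size => free chunks
      end

      def malloc(size)
        size = ((size + ROUND - 1) / ROUND) * ROUND  # round up by 512
        @bins[size].pop || Chunk.new(size)  # reuse, else a fresh "cudaMalloc"
      end

      def free(chunk)
        @bins[chunk.size].push(chunk)       # cache instead of cudaFree
      end
    end

    pool = SimplePool.new
    a = pool.malloc(1000)   # rounded up to 1024, freshly "allocated"
    pool.free(a)
    b = pool.malloc(1024)   # O(1) hit: the cached 1024-byte chunk is reused
    p a.equal?(b)           #=> true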

  29. Issues on Best-Fit
    • A cache miss occurs even if memory blocks larger than
    the required size exist in the pool (best-fit reuses
    only exact sizes).
    • Such cache misses typically occur in Natural Language
    Processing applications, whose input data sizes vary.
    • How about using a cached block of a larger size (e.g., 2560)?


  30. Best-fit with coalescing (BFC)
    1. Pop a chunk if one larger than
    the required size is available
    2. Split it and use only the
    necessary size; push back
    a chunk of the remaining size
    (figure: pop a 2560-byte chunk, split it into the 512 bytes we
    want to use plus a 2048-byte remainder, and push the remainder
    back into its bin; split chunks keep prev/next links)
    The one glibc malloc uses


  31. Best-fit with coalescing (BFC)
    1. On free, merge the chunk
    with its next or prev chunks
    if they are in the free lists
    2. Push back the merged
    chunk
    (figure: a freed 512-byte chunk merges with its free 2048-byte
    neighbor via the prev/next links, and the merged chunk is pushed
    back into the bins)

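    A minimal Ruby sketch of the split and merge steps (hypothetical code
    for illustration; Cumo implements this in C). Chunks carry prev/next
    links so a split remainder can be pushed back into a bin and a freed
    chunk can coalesce with a free neighbor:

    Chunk = Struct.new(:offset, :size, :prev, :next, :in_use)

    # Use only `size` bytes of `chunk`; the remainder becomes a new free
    # chunk linked right after it (to be pushed back into its bin).
    def split(chunk, size)
      rest = Chunk.new(chunk.offset + size, chunk.size - size,
                       chunk, chunk.next, false)
      chunk.next.prev = rest if chunk.next
      chunk.next = rest
      chunk.size = size
      rest
    end

    # On free, coalesce with the next neighbor when it is also free, so a
    # single larger chunk goes back into the bins (symmetric for prev).
    def merge_with_next(chunk)
      nxt = chunk.next
      return chunk unless nxt && !nxt.in_use
      chunk.size += nxt.size
      chunk.next = nxt.next
      nxt.next.prev = chunk if nxt.next
      chunk
    end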

  32. JIT compiling user-defined functions
    • Cumo lets users write their own
    CUDA kernels in Ruby

    • JIT compiles them using NVRTC (NVIDIA Runtime
    Compilation), and caches the result on the file system.

    kernel = Cumo::ElementwiseKernel.new(
      'float32 x, float32 y, float32 z',  # input types
      'float32 w',                        # output type
      'w = (x * y) + z;',                 # CUDA code
      'my_kernel')
    w = kernel.call(x, y, z)


  33. Performance Comparison with Numo


  34. Element-wise kernel
    a = Xumo::Float32.ones(size)
    b = Xumo::Float32.ones(size)
    a + b

    (table: times for Numo vs. Cumo at sizes 10^4 to 10^8;
    smaller is better)
    40 times faster for size of 10^8
    Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
    NVIDIA Volta V100
    (AWS p3 xlarge)


  35. Dot product
    a = Xumo::Float32.ones(100, size/100)
    b = Xumo::Float32.ones(size/100, 100)
    a.dot(b)

    (table: times for Numo vs. Cumo at sizes 10^4 to 10^8;
    smaller is better)
    2800 times faster for size of 10^8
    ※ Numo without Numo/Linalg
    Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
    NVIDIA Volta V100
    (AWS p3 xlarge)


  36. red-chainer mnist example
    • 20 times faster w/o memory pool

    • 75 times faster w/ memory pool
    Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
    NVIDIA Volta V100
    (AWS p3 xlarge)


  37. Outline
    • Overview of Scientific Computing in Ruby

    • Why Cumo is needed

    • CUDA Programming Basics

    • Cumo Features

    • Notices (or Difficulties) I met

    • Feature Proposals to Ruby


  38. Notices (or Difficulties) I met
    • GPU unfriendliness with GC

    • Difficulties compiling CUDA kernels

    • Lack of mkmf features

    • Reduction kernels synchronize with CPU

    • Broadcast operations were slow


  39. GPU unfriendliness with GC
    • One criterion for triggering GC
    in Ruby is main memory
    usage (malloc_limit)

    • GPU memory usage is not
    taken into account

    • In the case of CuPy,
    because Python uses
    reference counting, GPU
    memory can be released
    immediately after the array
    object is no longer referenced.

    def add
      a = Cumo::DFloat.ones(3, 5)
      b = Cumo::DFloat.ones(3, 5)
      a + b
    end
    c = add
    # a and b are not immediately freed


  40. GPU unfriendliness with GC (2)
    • (Partial) Solution

    • Added NArray#free to
    release GPU memory at
    a user-desired timing

    • Future work?

    • Something like
    NSAutoreleasePool to
    release all (or a restricted
    set of) objects created
    inside a scope.

    def add
      a = Cumo::DFloat.ones(3, 5)
      b = Cumo::DFloat.ones(3, 5)
      c = a + b
      a.free; b.free
      c
    end
    c = add
    # a and b are immediately freed

    NSAutoreleasePool *pool = \
        [[NSAutoreleasePool alloc] init];
    NSObject *obj = \
        [[[NSObject alloc] init] autorelease];
    ....
    [pool release];

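    For illustration, something in that direction can already be
    approximated on top of NArray#free with an ordinary block helper
    (hypothetical with_gpu_pool; assumes the cumo gem):

    def with_gpu_pool
      pool = []
      yield pool
    ensure
      pool.each(&:free)    # release GPU memory when the scope exits
    end

    c = with_gpu_pool do |pool|
      a = Cumo::DFloat.ones(3, 5).tap { |x| pool << x }
      b = Cumo::DFloat.ones(3, 5).tap { |x| pool << x }
      a + b                # the result escapes; a and b are freed on exit
    end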

  41. Lack of mkmf features
    • Need to use nvcc (NVIDIA CUDA Compiler)
    instead of gcc to compile CUDA kernels.

    • However, mkmf supports specifying only the CC
    and CXX compilers (there is no support for .cu files)

    • Solution: Made a wrapper Ruby script

    • For files with the .cu extension, use nvcc

    • For files with the .c extension, use gcc

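    A sketch of the wrapper-script idea (hypothetical code, not Cumo's
    actual script): mkmf is pointed at this file as the compiler, and it
    forwards to nvcc whenever a .cu source appears in the arguments:

    #!/usr/bin/env ruby
    # Dispatch to nvcc for CUDA sources, to gcc for everything else.
    if ARGV.any? { |arg| arg.end_with?('.cu') }
      exec 'nvcc', *ARGV
    else
      exec 'gcc', *ARGV
    end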

  42. Reduction Kernels Synchronize with CPU
    • Numo returns a Ruby numeric
    object from reduction kernels (in
    the case of a 0-dimensional NArray).

    • In Cumo, this requires copying GPU
    memory to host memory to
    create the Ruby numeric object.

    • That results in synchronization
    with the CPU.

    • Solution: Introduced a partial
    incompatibility with Numo to
    return a 0-dimensional NArray.

    Numo::Int64.ones(2, 3).sum
    #=> 6

    Cumo::Int64.ones(2, 3).sum
    #=> Cumo::Int64#shape=[]
    #   6

    Returns a 0-dimensional NArray
    instead of a Ruby numeric object to
    avoid CPU and GPU
    synchronization.


  43. Broadcast operations were slow
    • Broadcast: a (4 x 3) array + a (1 x 3) array = a (4 x 3) array;
    the (1 x 3) row is virtually repeated along the first axis

     1  2  3       1 2 3       2  4  6
     4  5  6   +   1 2 3   =   5  7  9
     7  8  9       1 2 3       8 10 12
    10 11 12       1 2 3      11 13 15

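    In code, the figure above looks like this (assuming the cumo gem; the
    same lines work with Numo on the CPU):

    require 'cumo/narray'

    a = Cumo::Int32[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]  # 4 x 3
    b = Cumo::Int32[[1, 2, 3]]                                      # 1 x 3
    a + b
    #=> [[2, 4, 6], [5, 7, 9], [8, 10, 12], [11, 13, 15]]   (4 x 3)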

  44. How Numo Treats Broadcast
    Example) 1000 x 3 array + 1 x 3 array

    // narray loop: loop for 1000
    int nd = 1;
    int shape[] = {1000};
    for (int i = 0; i < nd; i++) {
      for (int j = 0; j < shape[i]; j++) {
        (*(nf->func))(&(lp->user));
      }
    }

    // user loop: loop for 3
    int size = 3;
    for (int i = 0; i < size; i++) {
      p3[i] = p1[i] + p2[i];
    }


  45. Launches Many CUDA Kernels
    • In the first implementation of Cumo, the user
    loop was turned into a CUDA kernel

    // narray loop: loop for 1000
    int nd = 1;
    int shape[] = {1000};
    for (int i = 0; i < nd; i++) {
      for (int j = 0; j < shape[i]; j++) {
        (*(nf->func))(&(lp->user));   // launches my_kernel each time
      }
    }

    __global__ void my_kernel(
        int* p3,
        int* p2,
        int* p1)
    {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      p3[i] = p1[i] + p2[i];
    }

    Launches CUDA kernels 1,000 times.


  46. How Slow Launching CUDA Kernels Is
    nvprof:
    Type             Time(%)  Time      Calls  Avg       Min       Max       Name
    GPU activities:  99.89%   19.439ms  1000   19.439us  18.880us  21.312us  cumo_sfloat_add
    API calls:       27.23%   330.78ms  13     25.445ms  35.083us  68.418ms  cudaDeviceSynchronize
                     26.34%   319.98ms  1      319.98ms  319.98ms  319.98ms  cuCtxCreate
                     25.32%   307.66ms  1477   208.30us  13.408us  275.62ms  cudaMallocManaged
                     2.58%    18.703ms  1002   18.665us  16.184us  216.70us  cudaLaunch

    • One cudaLaunch takes about 18 microseconds

    • The time for one cudaLaunch is almost the same as
    adding two arrays of 500,000 elements.
    • Also, there is a limit on the CUDA queue size, e.g., 1,024.


  47. Solution: Stop using ndloop
    • Finally, stopped using ndloop
    (ndloop is slow for Cumo)

    a = Cumo::DFloat.ones(1000, 768)
    b = Cumo::DFloat.ones(1000, 1)
    a + b

    56 times faster


  48. Outline
    • Overview of Scientific Computing in Ruby

    • Why Cumo is needed

    • CUDA Programming Basics

    • Cumo Features

    • Notices (or Difficulties) I met

    • Feature Proposals to Ruby


  49. Feature Proposals to Ruby
    • Inplace math operations

    • Temporary variable


  50. Inplace Math Operations
    • a += b is shorthand for a = a + b

    • Imagine a is a large matrix requiring 1GB.

    • a += b then needs to allocate a new 1GB matrix.

    • Python allows redefining +=; we want to redefine it
    for Cumo::NArray objects too.

    • Current compromise:

    a.inplace + b

    https://bugs.ruby-lang.org/issues/14701

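    For illustration (assuming the cumo gem; Numo's inplace behaves the
    same way), inplace marks the receiver so the next operation writes
    into its existing buffer instead of allocating a new array:

    a = Cumo::DFloat.ones(3)
    b = Cumo::DFloat.ones(3)
    a.inplace + b    # writes into a's existing buffer; no new allocation
    p a.to_a         #=> [2.0, 2.0, 2.0]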

  51. Temporary Variable
    • In Python, we can tell whether a variable is
    temporary by looking at its reference count

    • In NumPy,

    y = x + 1 + 1

    • is faster than

    y = x + 1
    y + 1

    • because (x + 1) is a temporary variable, so no
    new memory is required to compute (x + 1) + 1

    https://bugs.ruby-lang.org/issues/14710


  52. At the end


  53. Future Works
    • Support cuDNN for high performance
    convolutional networks

    • Support more functions (28/80 files are left)

    • User-defined kernel (not yet completed)

    • Kernel fusion

    • Conversion between Numo::NArray and
    Cumo::NArray


  54. Help Wanted
    • Resolve future works

    • support more functions (28/80 files are left)

    • GPU CI server ....


  55. Talk Summary
    • Cumo is a CUDA version of Ruby/Numo, and it is
    easy to switch between them

    • GPU is fast and essential in deep learning

    • CUDA is like a job queue

    • The red-chainer mnist example ran 75 times
    faster with Cumo than with Numo

    • Help wanted: GPU CI server …


  56. Acknowledgements
    • DeNA Co., Ltd. for supporting the business trip

    • Ruby Association for the Grant

    • Money - GPU machines cost a lot

    • Time keeper

    • Motivation

    • @mrkn for his mentoring on the grant

    • @masa16 for answering my questions about Numo

    • @hatappi and @naitoh for their work on red-chainer

    • red-data-tools org and Speee, Inc. for hosting meetups

    • Preferred Networks, Inc. and the developers (including me)
    of Chainer/CuPy for the reference implementation

    • And my wife, for giving me time to develop
