
Introduction of Cumo, and Integration to Red Chainer

Naotoshi Seo
November 17, 2018

A slide deck presented at RubyData Tokyo Meetup 2018, Nov 17.
Transcript

  1. Introduction of Cumo, and
    integration to Red Chainer
    Naotoshi Seo
    Nov 17, 2018
    https://github.com/sonots/cumo
    RubyData Tokyo Meetup


  2. Self Introduction
    • Naotoshi Seo @sonots
    • DeNA Co., Ltd.
    • CRuby committer
    • Recently working on development of a DNN
      framework at Preferred Networks, Inc. (on secondment)

  3. Outline
    • Project Introduction
    • Integration to Red Chainer
    • Future Works

  4. Project Introduction

  5. What is Cumo?
    • (NVIDIA) GPU version of Ruby/Numo
    • Pronounced like koo-mo
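    A minimal usage sketch (assuming the cumo gem is installed and a CUDA
    GPU is present); Cumo mirrors Numo's API:

      require 'cumo/narray'

      a = Cumo::DFloat.new(3).seq   # [0, 1, 2], allocated on the GPU
      b = a * 2                     # element-wise multiply runs on the GPU
      p b                           # => Cumo::DFloat#shape=[3] [0, 2, 4]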

  6. Ruby/Numo
    https://ruby-numo.github.io/numo-narray/

  7. Why GPU?
    • GPU is fast, and recently essential for Deep Learning
    • GPU is good at parallel computation:
      on the order of 24 cores on a CPU vs. 3,000~4,000 cores on a GPU
    • GPU is bad at branching:
      it simplifies branch prediction and out-of-order mechanisms instead
    • GPU is suitable for matrix computation

  8. Performance Comparison with Numo: element-wise operation

      a = xm::Float32.ones(size)
      b = xm::Float32.ones(size)
      a + b

    [Benchmark chart: elapsed time of Numo vs. Cumo for sizes 10^4 to 10^8;
    smaller is better]
    • 40 times faster for size of 10^8
    • Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, NVIDIA Volta V100 (AWS p3.2xlarge)
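    A minimal benchmark sketch along these lines (assuming both gems are
    installed; a rigorous GPU measurement would also synchronize the device
    before stopping the clock, since kernels may run asynchronously):

      require 'benchmark'
      require 'numo/narray'
      require 'cumo/narray'

      size = 10**8
      [Numo, Cumo].each do |xm|
        a = xm::SFloat.ones(size)
        b = xm::SFloat.ones(size)
        a + b  # warm up; for Cumo this also pays one-time CUDA initialization
        puts "#{xm}: #{Benchmark.realtime { a + b }} sec"
      end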

  9. Performance Comparison with Numo: dot product

      a = xm::Float32.ones(100, size/100)
      b = xm::Float32.ones(size/100, 100)
      a.dot(b)

    [Benchmark chart: elapsed time of Numo, Numo + BLAS, and Cumo for sizes
    10^4 to 10^8; smaller is better]
    • 831 times faster than Numo w/ BLAS for size of 10^8
    • Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, NVIDIA Volta V100 (AWS p3.2xlarge)

  10. Performance Comparison with Numo: red-chainer mnist example
    • 380 sec/epoch → 5 sec/epoch (AWS p3.2xlarge): 75 times faster!!
    • Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, NVIDIA Volta V100

  11. Cumo logo: https://github.com/sonots/cumo-logo
    Generated by https://hatchful.shopify.com/

  12. Outline
    • Project Introduction
    • Integration to Red Chainer
    • Future Works

  13. Red Chainer?
    • Ruby port of Chainer
    • See @hatappi's slides today
      https://github.com/red-data-tools/red-chainer

  14. Integrate Cumo into Red Chainer
    https://github.com/red-data-tools/red-chainer/pull/67

  15. Previous Way
    • Cumo is highly compatible with Numo, so sed is enough to make a Numo
      application work with Cumo:

      sed -i -e 's/Numo/Cumo/g' -e 's/numo/cumo/g' *.rb

      # before
      require 'numo/narray'
      a = Numo::SFloat.zeros(2, 3)
      b = Numo::SFloat.ones(2, 3)
      a + b

      # after sed
      require 'cumo/narray'
      a = Cumo::SFloat.zeros(2, 3)
      b = Cumo::SFloat.ones(2, 3)
      a + b

    • It works, but it is nonsense to make users of red-chainer convert
      red-chainer itself.

  16. New Programmable Way

      require 'chainer'

      gpu = 0
      Chainer::CUDA.available?(gpu)        #=> true
      device = Chainer::Device.create(gpu) #=> GpuDevice
      xm = device.xm                       #=> Cumo

      a = xm::SFloat.zeros(2, 3)
      b = xm::SFloat.ones(2, 3)
      a + b

      Chainer.get_array_module(a)          #=> Cumo
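    With this API, downstream code can stay backend-agnostic. A minimal
    sketch (relu here is a hypothetical helper, not part of red-chainer):

      def relu(x)
        xm = Chainer.get_array_module(x)  # => Numo or Cumo, depending on x
        xm::SFloat.maximum(x, 0)          # same element-wise API on both backends
      end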

  17. CUDA APIs of red-chainer

  18. Backend APIs of red-chainer

  19. Device APIs of red-chainer
    AbstractDevice
      ├── CpuDevice
      └── GpuDevice
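    A sketch of how this hierarchy could look in code, based on the
    create/xm calls on the previous slide (constructor details are
    assumptions):

      module Chainer
        class AbstractDevice
          def xm
            raise NotImplementedError
          end
        end

        class CpuDevice < AbstractDevice
          def xm
            Numo
          end
        end

        class GpuDevice < AbstractDevice
          def initialize(id = 0)
            @id = id  # CUDA device ID
          end

          def xm
            Cumo
          end
        end
      end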

  20. Device APIs of red-chainer
    Chainer::Device

  21. Function CPU/GPU Branching

      class Convolution2DFunction < Chainer::Function
        def forward_cpu(inputs)
          x, w, b = inputs
          kh, kw = w.shape[2], w.shape[3]
          @col = Chainer::Utils::Conv.im2col(x, ...)
          y = Chainer::Utils::Math.tensordot(@col, ...)
          y += b if b
          [y.transpose(0, 3, 1, 2)]
        end

        def forward_gpu(inputs)
          x, w, b = inputs
          [Cumo::NArray.conv(x, w, b, ...)]
        end
      end

    • Once Cumo supports cuDNN, using it here will make this faster
      (not supported yet, though)

  22. Function APIs of red-chainer
    Chainer::Function

  23. Outline
    • Project Introduction
    • Integration to Red Chainer
    • Future Works

  24. Future Works around Backend/Device APIs
    • Chainer::Device.get_from_array(array): for this, Cumo::NArray has to
      hold the GPU ID where it is located (see the sketch below).
    • Chainer::Variable.to_gpu / Chainer::Variable.to_cpu: for these, Cumo
      itself needs to support Numo/Cumo conversion.
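    A sketch of what get_from_array could look like; the device_id accessor
    on Cumo::NArray is hypothetical, which is exactly the missing piece
    noted above:

      module Chainer
        module Device
          def self.get_from_array(array)
            case array
            when Cumo::NArray
              GpuDevice.new(array.device_id)  # hypothetical: NArray must remember its GPU
            when Numo::NArray
              CpuDevice.new
            end
          end
        end
      end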

  25. Future Works: Performance Improvement

    Chainer/CuPy is still faster than Red Chainer/Cumo.

    (1) Improve performance of reduction by compacting dimensions of NArray

    (2) Use user-defined kernels to fuse CUDA ops. But Cumo does not
        support them yet:

      kernel = Cumo::ElementwiseKernel.new(
        'float32 x, float32 y, float32 z',
        'float32 w',          # output type
        'w = (x * y) + z;',   # CUDA code
        'my_kernel')
      w = kernel.call(x, y, z)

  26. Future Works: Performance Improvement

    Red-chainer handles NumPy/Numo API differences in Ruby.

    (1) numpy.argmax vs Numo::NArray#max_index

      >>> a = numpy.arange(6).reshape(3,2)*2
      >>> a
      array([[0, 2],
             [4, 6],
             [8, 10]])
      >>> a.argmax(axis=1)
      array([1, 1, 1])

      irb> a = Numo::SFloat.new(3,2).seq*2
      => Numo::SFloat#shape=[3,2]
      [[0, 2],
       [4, 6],
       [8, 10]]
      irb> b = a.max_index(axis: 1)
      => Numo::Int32#shape=[3]
      [1, 3, 5]
      irb> b.to_a.map.with_index {|v, i| v - a.shape[1] * i }
      => [1, 1, 1]
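    The to_a.map round-trip above falls back to plain Ruby Arrays, which is
    what this future work wants to avoid. A sketch of the same conversion
    kept inside NArray arithmetic (argmax_axis1 is a hypothetical helper):

      # numpy-style per-row argmax from Numo's flattened max_index result
      def argmax_axis1(a)
        flat = a.max_index(axis: 1)                          # e.g. [1, 3, 5]
        flat - Numo::Int32.new(a.shape[0]).seq * a.shape[1]  # => [1, 1, 1]
      end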

  27. Future Works: Performance Improvement

    (2) Difference in Advanced Indexing

      >>> a = numpy.arange(9).reshape(3,3)
      >>> a
      array([[0, 1, 2],
             [3, 4, 5],
             [6, 7, 8]])
      >>> a[[0,1],[0,1]]
      array([0, 4])

      irb> a = Numo::SFloat.new(3,3).seq
      => Numo::SFloat#shape=[3,3]
      [[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]]
      irb> a[[0,1],[0,1]]
      => Numo::SFloat(view)#shape=[2,2]
      [[0, 1],
       [3, 4]]
      irb> a[[0,1],[0,1]].diagonal
      => Numo::SFloat(view)#shape=[2]
      [0, 4]

    (3) Fix any other places using to_a, each, or map in red-chainer.

  28. More Future Works

    • Support cuDNN for high-performance convolutional networks
    • Support Float16
    • Conversion between Numo::NArray and Cumo::NArray

      class Convolution2DFunction < Chainer::Function
        def forward_cpu(inputs)
          x, w, b = inputs
          kh, kw = w.shape[2], w.shape[3]
          @col = Chainer::Utils::Conv.im2col(x, ...)
          y = Chainer::Utils::Math.tensordot(@col, ...)
          y += b if b
          [y.transpose(0, 3, 1, 2)]
        end

        def forward_gpu(inputs)
          x, w, b = inputs
          [Cumo::NArray.conv(x, w, b, ...)]  # ← cuDNN would plug in here
        end
      end

  29. Supported Functions List

    For Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64,
    SFloat (float), DFloat (double), SComplex, DComplex, and mixed types
    (88 methods):

    -, -@, [], []=, *, **, /, &, %, ^, <<, >>, |, ~, acos, acosh, allocate,
    asin, asinh, atan, atan2, atanh, cbrt, ceil, coerce_cast, conj,
    copysign, cos, cosh, divmod, eq, erf, erfc, exp, exp10, exp2, expm1,
    extract, eye, fill, floor, ge (>=), gemm, gt (>), hypot, im, inspect,
    ldexp, le (<=), log, log10, log1p, log2, logseq, lt (<), max, max_index,
    maximum, mean, min, min_index, minimum, mulsum, ne, poly, prod, ptp,
    reciprocal, rint, rms, round, seq, sign, sin, sinc, sinh, sqrt, square,
    stddev, store, sum, tan, trunc, var

  30. Not Yet

    For IntXX, FloatXX, ComplexXX, and mixed types (23 methods):

    abs, arg, bincount, clip, cumprod, cumsum, frexp, imag, isfinite,
    isinf, isnan, isneginf, isposinf, median, minmax, modf, rand,
    rand_norm, real, set_imag, set_real, sort, sort_index

    For Bit (20 methods, i.e. most of them):

    [], []=, &, ^, |, ~, all?, any?, coerce_cast, copy, count_false,
    count_true, eq, extract, fill, mask, none?, store, where, where2

  31. end
    • Introduction to Cumo
    • New Backend/Device APIs on red-chainer
    • Future works

    Contributions are welcome!