Introduction of Cumo, and Integration to Red Chainer

Slides from my talk at RubyData Tokyo Meetup 2018, Nov 17.

Naotoshi Seo

November 17, 2018

Transcript

  1. Introduction of Cumo, and Integration to Red Chainer
    Naotoshi Seo
    Nov 17, 2018, RubyData Tokyo Meetup
    https://github.com/sonots/cumo
  2. Self Introduction
    • Naotoshi Seo @sonots
    • DeNA Co., Ltd.
    • CRuby committer
    • Recently working on development of a DNN framework at Preferred Networks, Inc. (on secondment)
  3. Why GPU? (Project Introduction)
    • GPU is fast, and has recently become essential for Deep Learning
    • GPU is good at parallel computation
      • Order of magnitude: about 24 cores on a CPU
      • 3,000 to 4,000 cores on a GPU
    • GPU is bad at branching
      • A GPU simplifies its branch prediction and out-of-order mechanisms instead.
    • GPU is suitable for matrix computation
  4. Element-wise operation (Performance Comparison with Numo)
        a = xm::Float32.ones(size)
        b = xm::Float32.ones(size)
        a + b
    [Table: Size vs. Numo and Cumo timings; smaller is better]
    • 40 times faster for size of 10^8
    • Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
    • NVIDIA Volta v100 (AWS p3 xlarge)
  5. Dot product
        a = xm::Float32.ones(100, size/100)
        b = xm::Float32.ones(size/100, 100)
        a.dot(b)
    [Table: Size vs. Numo, Numo/BLAS, and Cumo timings; smaller is better]
    • 831 times faster than Numo w/ BLAS for size of 10^8
    • Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
    • NVIDIA Volta v100 (AWS p3 xlarge)
  6. red-chainer mnist example
    • 380 sec/epoch → 5 sec/epoch (AWS p3.2xlarge)
    • NVIDIA Volta v100
    • Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
    • 75 times faster!
  7. Red Chainer?
    • Ruby port of Chainer
    • See @hatappi's slides today
    • https://github.com/red-data-tools/red-chainer
  8. Previous Way (Integration to Red Chainer)
    • Cumo is highly compatible with Numo, so sed is enough to make a Numo application work with Cumo.
        sed -i -e 's/Numo/Cumo/g' -e 's/numo/cumo/g' *.rb
    • It works, but it makes no sense to make red-chainer users convert red-chainer itself.
        # Numo (CPU)
        require 'numo/narray'
        a = Numo::SFloat.zeros(2, 3)
        b = Numo::SFloat.ones(2, 3)
        a + b

        # Cumo (GPU), after the sed conversion
        require 'cumo/narray'
        a = Cumo::SFloat.zeros(2, 3)
        b = Cumo::SFloat.ones(2, 3)
        a + b
  9. New Programmable Way
        require 'chainer'
        gpu = 0
        device = Chainer::Device.create(gpu) #=> GpuDevice
        xm = device.xm #=> Cumo
        a = xm::SFloat.zeros(2, 3)
        b = xm::SFloat.ones(2, 3)
        a + b

        Chainer::CUDA.available?(gpu) #=> true
        Chainer.get_array_module(a) #=> Cumo
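The `device.xm` idea on this slide can be illustrated without either gem installed. The sketch below is a minimal, self-contained imitation: `FakeNumo`, `FakeCumo`, `SketchArray`, the two device classes, `create_device`, and `array_module_of` are all hypothetical stand-ins invented for illustration, not red-chainer's implementation. Treating a negative ID as "CPU" is also an assumption of this sketch.

```ruby
# Toy array type shared by both stand-in modules.
# FakeNumo and FakeCumo are hypothetical; they are NOT the real gems.
class SketchArray
  attr_reader :data

  def initialize(data)
    @data = data
  end

  def self.ones(n)
    new(Array.new(n, 1.0))
  end

  # Element-wise addition that keeps the receiver's class.
  def +(other)
    self.class.new(data.zip(other.data).map { |x, y| x + y })
  end
end

module FakeNumo
  class SFloat < SketchArray; end
end

module FakeCumo
  class SFloat < SketchArray; end
end

# Each device knows which array module ("xm") belongs to it.
class SketchCpuDevice
  def xm
    FakeNumo
  end
end

class SketchGpuDevice
  def initialize(id)
    @id = id
  end

  def xm
    FakeCumo
  end
end

# Assumption of this sketch: a negative ID means "use the CPU".
def create_device(gpu_id)
  gpu_id >= 0 ? SketchGpuDevice.new(gpu_id) : SketchCpuDevice.new
end

# Recover the array module from an existing array.
def array_module_of(array)
  array.is_a?(FakeCumo::SFloat) ? FakeCumo : FakeNumo
end

xm = create_device(0).xm
a = xm::SFloat.ones(3)
b = xm::SFloat.ones(3)
c = a + b
```

Because `xm` is an ordinary Ruby value holding a module, the same user code runs against either backend, and no source rewriting is needed.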
  10. Function CPU/GPU Branching
        class Convolution2DFunction < Chainer::Function
          def forward_cpu(inputs)
            x, w, b = inputs
            kh, kw = w.shape[2], w.shape[3]
            @col = Chainer::Utils::Conv.im2col(x, ...)
            y = Chainer::Utils::Math.tensordot(@col, ...)
            y += b if b
            [y.transpose(0, 3, 1, 2)]
          end

          def forward_gpu(inputs)
            x, w, b = inputs
            [Cumo::NArray.conv(x, w, b, ...)]
          end
        end
    • Once Cumo supports cuDNN, using it in forward_gpu will make convolution fast (not supported yet).
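How a base class might choose between `forward_cpu` and `forward_gpu` is not shown on the slide. A minimal sketch, assuming the choice is made from the type of the input arrays (as Chainer does in Python), could look like this. `CpuArray`, `GpuArray`, `SketchFunction`, and `Scale2` are hypothetical names used only here.

```ruby
# Hypothetical stand-ins for a CPU array type and a GPU array type.
CpuArray = Struct.new(:data)
GpuArray = Struct.new(:data)

# Minimal base class: dispatch to forward_gpu when any input is a GPU array.
class SketchFunction
  def forward(inputs)
    if inputs.any? { |x| x.is_a?(GpuArray) }
      forward_gpu(inputs)
    else
      forward_cpu(inputs)
    end
  end
end

# A toy function that doubles its input on either backend.
class Scale2 < SketchFunction
  def forward_cpu(inputs)
    x, = inputs
    [CpuArray.new(x.data.map { |v| v * 2 })]
  end

  def forward_gpu(inputs)
    x, = inputs
    # A real implementation would launch a GPU kernel here instead.
    [GpuArray.new(x.data.map { |v| v * 2 })]
  end
end

f = Scale2.new
cpu_out, = f.forward([CpuArray.new([1, 2])])
gpu_out, = f.forward([GpuArray.new([1, 2])])
```

Subclasses only fill in the two backend-specific methods, so callers never branch on the device themselves.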
  11. Future Works around Backend/Device APIs
    • Chainer::Device.get_from_array(array)
      • Cumo::NArray has to hold the ID of the GPU it is located on.
    • Chainer::Variable.to_gpu, Chainer::Variable.to_cpu
      • Cumo itself needs to support Numo/Cumo conversion.
  12. Future Works Performance Improvement
    • Chainer/CuPy is still faster than Red Chainer/Cumo
    (1) Improve performance of reduction by compacting dimensions of NArray
    (2) Use user-defined kernels to fuse CUDA ops. However, Cumo does not support this yet.
        kernel = Cumo::ElementwiseKernel.new(
          'float32 x, float32 y, float32 z',
          'float32 w', # output type
          'w = (x * y) + z;', # CUDA code
          'my_kernel')
        w = kernel.call(x, y, z)
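Item (1) on this slide does not say how dimensions would be compacted. One plausible reading, sketched below under that assumption, is to merge adjacent axes that are either all reduced or all kept, so that a reduction kernel loops over fewer dimensions. The helper `compact_dims` is hypothetical, and the merge is valid only for a contiguous, row-major array (not a strided view).

```ruby
# Merge adjacent axes that share the same "reduced?" flag.
# Assumes a contiguous row-major array; views with strides are not handled.
# Returns [compacted_shape, compacted_reduce_axes].
def compact_dims(shape, reduce_axes)
  new_shape = []
  new_flags = []
  shape.each_with_index do |n, axis|
    flag = reduce_axes.include?(axis)
    if !new_flags.empty? && new_flags.last == flag
      new_shape[-1] *= n # same kind as the previous axis: fold it in
    else
      new_shape << n
      new_flags << flag
    end
  end
  new_axes = new_flags.each_index.select { |i| new_flags[i] }
  [new_shape, new_axes]
end

# Summing a [2, 3, 4, 5] array over axes 2 and 3 becomes
# summing a [6, 20] array over axis 1.
merged = compact_dims([2, 3, 4, 5], [2, 3]) # => [[6, 20], [1]]

# Nothing can be merged when reduced and kept axes alternate.
unchanged = compact_dims([2, 3, 4], [0, 2]) # => [[2, 3, 4], [0, 2]]
```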
  13. Future Works Performance Improvement
    Red-chainer handles NumPy/Numo API differences in Ruby
    (1) numpy.argmax vs Numo::NArray#max_index
        >>> a = numpy.arange(6).reshape(3,2)*2
        >>> a
        array([[0, 2 ],
               [4, 6 ],
               [8, 10]])
        >>> a.argmax(axis=1)
        array([1, 1, 1])

        irb> a = Numo::SFloat.new(3,2).seq*2
        => Numo::SFloat#shape=[3,2]
        [[0, 2 ],
         [4, 6 ],
         [8, 10]]
        irb> b = a.max_index(axis: 1)
        => Numo::Int32#shape=[3]
        [1, 3, 5]
        irb> b.to_a.map.with_index {|v, i| v - a.shape[1] * i }
        => [1, 1, 1]
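The index arithmetic in this slide can be checked in plain Ruby. The sketch below uses nested arrays as a stand-in for a `[3, 2]` NArray: it first computes flat indices the way `max_index(axis: 1)` reports them, then converts them to per-row indices the way `numpy.argmax(axis=1)` reports them. For the last axis, subtracting `ncols * i` is the same as taking the index modulo `ncols`.

```ruby
# Nested arrays stand in for a Numo::SFloat of shape [3, 2].
a = [[0, 2], [4, 6], [8, 10]]
ncols = a.first.size

# Like max_index(axis: 1): positions in the flattened array.
flat_idx = a.each_with_index.map { |row, i| i * ncols + row.index(row.max) }
# => [1, 3, 5]

# Like numpy.argmax(axis=1): positions within each row.
row_idx_by_offset = flat_idx.each_with_index.map { |v, i| v - ncols * i }
row_idx_by_modulo = flat_idx.map { |v| v % ncols }
# both => [1, 1, 1]
```

The modulo form needs no per-row counter, so on a real NArray it could presumably be written as a single array operation instead of `to_a.map`; that is an assumption, not something the slide states.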
  14. Future Works Performance Improvement
    (2) Difference in Advanced Indexing
        >>> a = numpy.arange(9).reshape(3,3)
        >>> a
        array([[0, 1, 2],
               [3, 4, 5],
               [6, 7, 8]])
        >>> a[[0,1],[0,1]]
        array([0, 4])

        irb> a = Numo::SFloat.new(3,3).seq
        => Numo::SFloat#shape=[3,3]
        [[0, 1, 2],
         [3, 4, 5],
         [6, 7, 8]]
        irb> a[[0,1],[0,1]]
        => Numo::SFloat(view)#shape=[2,2]
        [[0, 1],
         [3, 4]]
        irb> a[[0,1],[0,1]].diagonal
        => Numo::SFloat(view)#shape=[2]
        [0, 4]
    (3) Fix any other places in red-chainer that use to_a, each, or map, since they process elements one by one in Ruby.
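The two indexing behaviours shown on this slide can be reproduced with nested Ruby arrays. This is only an illustration of the semantics, not of how either library implements indexing: NumPy pairs the index lists element by element, while Numo selects the sub-grid formed by the chosen rows and columns, whose diagonal then matches the NumPy result.

```ruby
a = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
rows = [0, 1]
cols = [0, 1]

# NumPy-style: pair the index lists, giving a[0][0] and a[1][1].
numpy_style = rows.zip(cols).map { |r, c| a[r][c] }
# => [0, 4]

# Numo-style: every chosen row crossed with every chosen column.
numo_style = rows.map { |r| cols.map { |c| a[r][c] } }
# => [[0, 1], [3, 4]]

# Taking the diagonal of the sub-grid recovers the NumPy-style result.
diagonal = numo_style.each_with_index.map { |row, i| row[i] }
# => [0, 4]
```

The sub-grid has `rows.size * cols.size` elements while only `rows.size` are needed, which is one reason this workaround costs extra time.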
  15. More Future Works
    • Support cuDNN for high performance convolutional networks
    • Support Float16
    • Conversion between Numo::NArray and Cumo::NArray
        class Convolution2DFunction < Chainer::Function
          def forward_cpu(inputs)
            x, w, b = inputs
            kh, kw = w.shape[2], w.shape[3]
            @col = Chainer::Utils::Conv.im2col(x, ...)
            y = Chainer::Utils::Math.tensordot(@col, ...)
            y += b if b
            [y.transpose(0, 3, 1, 2)]
          end

          def forward_gpu(inputs)
            x, w, b = inputs
            [Cumo::NArray.conv(x, w, b, ...)] # cuDNN
          end
        end
  16. Supported Functions List
    Types: Int8, Int16, Int32, Int64, Uint8, Uint16, Uint32, Uint64, SFloat (float), DFloat (double), SComplex, DComplex (mixed)
    • 88 methods
    Operators: -, -@, [], []=, *, **, /, &, %, ^, <<, >>, |, ~
    Methods: acos, acosh, allocate, asin, asinh, atan, atan2, atanh, cbrt, ceil, coerce_cast, conj, copysign, cos, cosh, divmod, eq, erf, erfc, exp, exp10, exp2, expm1, extract, eye, fill, floor, ge (>=), gemm, gt (>), hypot, im, inspect, ldexp, le (<=), log, log10, log1p, log2, logseq, lt (<), max, max_index, maximum, mean, min, min_index, minimum, mulsum, ne, nearly_eq, poly, prod, ptp, reciprocal, rint, rms, round, seq, sign, signbit, sin, sinc, sinh, sqrt, square, stddev, store, sum, tan, tanh, trunc, var
  17. Not Yet
    IntXX, FloatXX, ComplexXX (mixed)
    • 23 methods
    abs, arg, bincount, clip, cumprod, cumsum, frexp, imag, isfinite, isinf, isnan, isneginf, isposinf, median, minmax, modf, rand, rand_norm, real, set_imag, set_real, sort, sort_index
    Bit
    • 20 methods (most of all)
    [], []=, &, ^, |, ~, all?, any?, coerce_cast, copy, count_false, count_true, eq, extract, fill, mask, none?, store, where, where2
  18. end
    • Introduction to Cumo
    • New Backend/Device APIs on red-chainer
    • Future works
    Contributions are welcome!