Introduction of Cumo, and Integration to Red Chainer

Naotoshi Seo
November 17, 2018


Slides from my talk at RubyData Tokyo Meetup, Nov 17, 2018.


Transcript

  1. Introduction of Cumo, and integration to Red Chainer Naotoshi Seo

    Nov 17, 2018 https://github.com/sonots/cumo RubyData Tokyo Meetup
  2. Self Introduction • Naotoshi Seo @sonots • DeNA Co., Ltd.

• CRuby committer • Recently working on development of a DNN framework at Preferred Networks, Inc. (on secondment) 2
  3. Outline 3 • Project Introduction • Integration to Red Chainer

    • Future Works
  4. Project Introduction 4

  5. 5 What is Cumo? (Project Introduction) • (NVIDIA) GPU version of

    Ruby/Numo • Pronounced like koo-mo
  6. 6 https://ruby-numo.github.io/numo-narray/

  7. Why GPU? 7 • GPU is fast, and recently essential for Deep Learning • GPU

    is good at parallel computation • Order of magnitude: ~24 cores on a CPU vs 3,000-4,000 cores on a GPU • GPU is suitable for matrix computation • GPU is bad at branching: it simplifies the branch-prediction and out-of-order machinery in exchange for many more cores
  8. Element-wise operation 8 (Performance Comparison with Numo)

     • Benchmark: a = xm::Float32.ones(size); b = xm::Float32.ones(size); a + b • 40 times faster for size of 10^8 (smaller is better) • Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz vs NVIDIA Volta V100 (AWS p3.2xlarge)
  9. Dot product 9

     • Benchmark: a = xm::Float32.ones(100, size/100); b = xm::Float32.ones(size/100, 100); a.dot(b) • 831 times faster than Numo w/ BLAS for size of 10^8 (smaller is better) • Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz vs NVIDIA Volta V100 (AWS p3.2xlarge)
  10. red-chainer mnist example 10 • 380 sec/epoch → 5 sec/epoch

    (AWS p3.2xlarge, NVIDIA Volta V100 vs Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz) 75 times faster!!
  11. 11 https://github.com/sonots/cumo-logo Generated by https://hatchful.shopify.com/

  12. Outline 12 • Project Introduction • Integration to Red Chainer

    • Future Works
  13. Red Chainer? 13 • Ruby port of Chainer • See

    @hatappi's slides today https://github.com/red-data-tools/red-chainer
  14. Integrate Cumo into Red Chainer 14 https://github.com/red-data-tools/red-chainer/pull/67 (Integration to Red Chainer)

  15. Previous Way 15 • Cumo is highly compatible with Numo,

    so sed can make a Numo application work with Cumo:
    sed -i -e 's/Numo/Cumo/g' -e 's/numo/cumo/g' *.rb
    • It works, but it is nonsense to ask users of red-chainer to convert red-chainer itself.
    Before: require 'numo/narray'; a = Numo::SFloat.zeros(2, 3); b = Numo::SFloat.ones(2, 3); a + b
    After:  require 'cumo/narray'; a = Cumo::SFloat.zeros(2, 3); b = Cumo::SFloat.ones(2, 3); a + b
  16. New Programmable Way 16

    require 'chainer'
    gpu = 0
    Chainer::CUDA.available?(gpu)         #=> true
    device = Chainer::Device.create(gpu)  #=> GpuDevice
    xm = device.xm                        #=> Cumo
    a = xm::SFloat.zeros(2, 3)
    b = xm::SFloat.ones(2, 3)
    a + b
    Chainer.get_array_module(a)           #=> Cumo
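The `device.xm` dispatch above can be sketched in plain Ruby. This is a minimal sketch only: the `Numo`/`Cumo` modules and device classes below are stand-ins, not the real gems or red-chainer classes, and the negative-id-means-CPU convention is assumed from Chainer:

```ruby
# Minimal sketch of the Device#xm dispatch pattern.
# Numo/Cumo here are empty stand-in modules, not the real gems.
module Numo; end
module Cumo; end

class CpuDevice
  def xm
    Numo
  end
end

class GpuDevice
  attr_reader :id

  def initialize(id)
    @id = id
  end

  def xm
    Cumo
  end
end

module Device
  # Assumes Chainer's convention: a negative id selects the CPU backend.
  def self.create(id)
    id >= 0 ? GpuDevice.new(id) : CpuDevice.new
  end
end

Device.create(0).xm   # => Cumo
Device.create(-1).xm  # => Numo
```

Because user code only touches `xm`, the same script runs on either backend by changing one integer.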
  17. CUDA APIs of red-chainer 17

  18. Backend APIs of red-chainer 18 Chainer

  19. Device APIs of red-chainer 19 AbstractDevice / CpuDevice / GpuDevice

  20. Device APIs of red-chainer 20 Chainer::Device

  21. Function CPU/GPU Branching 21

    class Convolution2DFunction < Chainer::Function
      def forward_cpu(inputs)
        x, w, b = inputs
        kh, kw = w.shape[2], w.shape[3]
        @col = Chainer::Utils::Conv.im2col(x, ...)
        y = Chainer::Utils::Math.tensordot(@col, ...)
        y += b if b
        [y.transpose(0, 3, 1, 2)]
      end

      def forward_gpu(inputs)
        x, w, b = inputs
        [Cumo::NArray.conv(x, w, b, ...)]
      end
    end

    • Once Cumo supports cuDNN, using it will make this faster (not supported yet, though)
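The routing between `forward_cpu` and `forward_gpu` can be sketched as follows; this is a simplified stand-in (the `NArray` stubs and `DoubleFunction` class are hypothetical, and the real `Chainer::Function` does much more bookkeeping), showing only the dispatch-on-array-module idea:

```ruby
# Sketch: a Function base class routes forward() to forward_cpu or
# forward_gpu by inspecting the class of the input arrays.
# The NArray classes are empty stubs, not the real Numo/Cumo gems.
module Numo; class NArray; end; end
module Cumo; class NArray; end; end

class Function
  def forward(inputs)
    if inputs.first.is_a?(Cumo::NArray)
      forward_gpu(inputs)
    else
      forward_cpu(inputs)
    end
  end
end

# Hypothetical concrete function for illustration only.
class DoubleFunction < Function
  def forward_cpu(inputs)
    [:computed_on_cpu]
  end

  def forward_gpu(inputs)
    [:computed_on_gpu]
  end
end

f = DoubleFunction.new
f.forward([Numo::NArray.new])  # => [:computed_on_cpu]
f.forward([Cumo::NArray.new])  # => [:computed_on_gpu]
```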
  22. Function APIs of red-chainer 22 Chainer::Function

  23. Outline 23 • Project Introduction • Integration to Red Chainer

    • Future Works
  24. Future Works around Backend/Device APIs 24 • Chainer::Device.get_from_array(array): Cumo::NArray has to

    hold the ID of the GPU where it is located. • Chainer::Variable.to_gpu / Chainer::Variable.to_cpu: Cumo itself needs to support Numo/Cumo conversion.
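The `get_from_array` idea can be sketched in plain Ruby. The `gpu_id` accessor below is hypothetical (it is exactly what the slide says Cumo::NArray would need to store); the stub classes stand in for the real ones:

```ruby
# Sketch of Device.get_from_array: recover the device from an array.
# gpu_id is a hypothetical attribute; the slide's point is that
# Cumo::NArray must remember which GPU holds its data.
module Numo; class NArray; end; end
module Cumo
  class NArray
    attr_reader :gpu_id

    def initialize(gpu_id = 0)
      @gpu_id = gpu_id  # hypothetical: id of the GPU holding the buffer
    end
  end
end

CpuDevice = Class.new
GpuDevice = Struct.new(:id)

def get_from_array(array)
  if array.is_a?(Cumo::NArray)
    GpuDevice.new(array.gpu_id)
  else
    CpuDevice.new
  end
end

get_from_array(Cumo::NArray.new(1))  # => GpuDevice for GPU 1
get_from_array(Numo::NArray.new)     # => a CpuDevice
```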
  25. Future Works: Performance Improvement 25 • Chainer/CuPy is still faster than Red Chainer/Cumo • (1) Improve performance of reduction by

    compacting dimensions of NArray • (2) Use user-defined kernels to fuse CUDA ops (but Cumo does not support this yet):
 kernel = Cumo::ElementwiseKernel.new(
   'float32 x, float32 y, float32 z',  # input types
   'float32 w',                        # output type
   'w = (x * y) + z;',                 # CUDA code
   'my_kernel')                        # kernel name
 w = kernel.call(x, y, z)
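As a plain-Ruby reference for what the fused kernel computes: elementwise w = (x * y) + z. Fusion matters on a GPU because it avoids materializing the intermediate x * y array and launches one kernel instead of two:

```ruby
# Plain-Ruby reference for the fused elementwise op w = (x * y) + z.
x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
z = [7.0, 8.0, 9.0]

# One pass over the data, no intermediate array for x * y.
w = x.zip(y, z).map { |xi, yi, zi| xi * yi + zi }
# => [11.0, 18.0, 27.0]
```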
  26. Future Works: Performance Improvement 26 • Red-chainer handles NumPy/Numo API differences in Ruby • (1) numpy.argmax vs Numo::NArray#max_index

    >>> a = numpy.arange(6).reshape(3,2)*2
    >>> a
    array([[0,  2],
           [4,  6],
           [8, 10]])
    >>> a.argmax(axis=1)
    array([1, 1, 1])

    irb> a = Numo::SFloat.new(3,2).seq*2
    => Numo::SFloat#shape=[3,2] [[0, 2], [4, 6], [8, 10]]
    irb> b = a.max_index(axis: 1)
    => Numo::Int32#shape=[3] [1, 3, 5]
    irb> b.to_a.map.with_index {|v, i| v - a.shape[1] * i }
    => [1, 1, 1]
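The conversion can be checked in plain Ruby: `max_index` returns indices into the flattened array, so the per-row argmax is recovered by subtracting each row's offset (row index times number of columns):

```ruby
# Model of Numo max_index(axis: 1) vs numpy argmax(axis=1) on the
# slide's example, using plain nested arrays.
a = [[0, 2], [4, 6], [8, 10]]
ncols = a.first.size

# Flat indices, as Numo's max_index(axis: 1) returns them:
flat = a.each_with_index.map do |row, i|
  i * ncols + row.index(row.max)
end
# => [1, 3, 5]

# Per-row indices, as numpy's argmax(axis=1) returns them:
argmax = flat.map.with_index { |v, i| v - ncols * i }
# => [1, 1, 1]
```

Doing this subtraction in Ruby works, but it is exactly the kind of per-element Ruby loop the slide lists as a performance problem.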
  27. Future Works: Performance Improvement 27 • (2) Difference in Advanced Indexing • (3) Fix any other places using to_a, each, or map in red-chainer

    >>> a = numpy.arange(9).reshape(3,3)
    >>> a
    array([[0, 1, 2],
           [3, 4, 5],
           [6, 7, 8]])
    >>> a[[0,1],[0,1]]
    array([0, 4])

    irb> a = Numo::SFloat.new(3,3).seq
    => Numo::SFloat#shape=[3,3] [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
    irb> a[[0,1],[0,1]]
    => Numo::SFloat(view)#shape=[2,2] [[0, 1], [3, 4]]
    irb> a[[0,1],[0,1]].diagonal
    => Numo::SFloat(view)#shape=[2] [0, 4]
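The indexing difference can be modeled in plain Ruby: numpy zips the two index arrays (a pointwise gather), while Numo takes their cross product, which is why the diagonal of Numo's result equals numpy's:

```ruby
# numpy's a[[0,1],[0,1]] gathers a[0][0] and a[1][1] (zipped indices);
# Numo's builds the 2x2 cross product, whose diagonal matches numpy.
a = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
rows = [0, 1]
cols = [0, 1]

numpy_style = rows.zip(cols).map { |i, j| a[i][j] }
# => [0, 4]

numo_style = rows.map { |i| cols.map { |j| a[i][j] } }
# => [[0, 1], [3, 4]]

diagonal = numo_style.each_with_index.map { |row, k| row[k] }
# => [0, 4]
```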
  28. More Future Works 28 • Support cuDNN for high-performance convolutional

    networks • Support Float16 • Conversion between Numo::NArray and Cumo::NArray (The slide repeats the Convolution2DFunction example from slide 21, highlighting where cuDNN would be used in forward_gpu.)
  29. Supported Functions List 29 - << atan2 eq floor log10

    min_index rms stddev -@ >> atanh erf ge (>=) log1p minimum round store [] | cbrt erfc gemm log2 mulsum seq sum []= ~ ceil exp gt (>) logseq ne sign tan * acos coerce_cast exp10 hypot lt (<) nearly_eq signbit tanh ** acosh conj exp2 im max poly sin trunc / allocate copysign expm1 inspect max_index prod sinc var & asin cos extract ldexp maximum ptp sinh % asinh cosh eye le (<=) mean reciprocal sqrt ^ atan divmod fill log min rint square * 88 methods Int8, Int16, Int32, Int64, Uint8, Uint16, Uint32, Uint64,
 SFloat (float), DFloat (double), SComplex, DComplex mixed
  30. Not Yet 30 abs isnan set_real arg isneginf sort bincount

    isposinf sort_index clip median cumprod minmax cumsum modf frexp rand imag rand_norm isfinite real isinf set_imag [] count_false []= count_true & eq ^ extract | fill ~ mask all? none? any? store coerce_cast where copy where2 * 20 methods (most of all) IntXX, FloatXX, ComplexXX mixed Bit * 23 methods
  31. end • Introduction to Cumo • New Backend/Device APIs on

    red-chainer • Future works 31 Contributions are welcome!