Introduction of Cumo, and Integration to Red Chainer

Naotoshi Seo
Nov 17, 2018, RubyData Tokyo Meetup
https://github.com/sonots/cumo

Self Introduction
• Naotoshi Seo @sonots
• DeNA Co., Ltd.
• CRuby committer
• Recently working on development of a DNN framework at Preferred Networks, Inc. (on secondment)

Outline
• Project Introduction
• Integration to Red Chainer
• Future Works

Project Introduction

What is Cumo?
• (NVIDIA) GPU version of Ruby/Numo
• Pronounced like koo-mo

Ruby/Numo (numo-narray): https://ruby-numo.github.io/numo-narray/

Why GPU?
• GPU is fast, and has recently become essential for Deep Learning
• GPU is good at parallel computation
  • By order of magnitude, a CPU has about 24 cores
  • A GPU has 3,000 to 4,000 cores
• GPU is bad at branching
  • In exchange for more cores, a GPU simplifies branch prediction and out-of-order execution
• GPU is suitable for matrix computation

Performance Comparison with Numo

Element-wise operation

    a = xm::Float32.ones(size)
    b = xm::Float32.ones(size)
    a + b

[Table: Size / Numo / Cumo; values lost in extraction. Smaller is better.]
• Cumo is 40 times faster for size of 10^8
• CPU: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
• GPU: NVIDIA Volta v100 (AWS p3 xlarge)

Dot product

    a = xm::Float32.ones(100, size/100)
    b = xm::Float32.ones(size/100, 100)
    a.dot(b)

[Table: Size / Numo / Numo with BLAS / Cumo; values lost in extraction. Smaller is better.]
• Cumo is 831 times faster than Numo w/ BLAS for size of 10^8
• CPU: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
• GPU: NVIDIA Volta v100 (AWS p3 xlarge)

red-chainer mnist example
• 380 sec/epoch → 5 sec/epoch
• 75 Times Faster !!
• CPU: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
• GPU: NVIDIA Volta v100 (AWS p3.2xlarge)

Cumo logo: https://github.com/sonots/cumo-logo (generated by https://hatchful.shopify.com/)

Integration to Red Chainer

Red Chainer?
• Ruby port of Chainer
• See @hatappi's slides today
https://github.com/red-data-tools/red-chainer

Integrate Cumo into Red Chainer
https://github.com/red-data-tools/red-chainer/pull/67

Previous Way
• Cumo is highly compatible with Numo, so a sed substitution makes a Numo application work with Cumo.

    sed -i -e 's/Numo/Cumo/g' -e 's/numo/cumo/g' *.rb

• It works, but it makes no sense to force users of red-chainer to convert red-chainer itself.

Before (Numo):

    require 'numo/narray'
    a = Numo::SFloat.zeros(2,3)
    b = Numo::SFloat.ones(2,3)
    a + b

After (Cumo):

    require 'cumo/narray'
    a = Cumo::SFloat.zeros(2,3)
    b = Cumo::SFloat.ones(2,3)
    a + b
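The sed-based conversion described above is a plain text rewrite, which can be pictured in Ruby itself. This is only a sketch of that rewrite; `numo_to_cumo` is a hypothetical helper name and is not part of Cumo or red-chainer.

```ruby
# Hypothetical helper: rewrite Numo references in Ruby source text to Cumo,
# mirroring the two substitutions s/Numo/Cumo/g and s/numo/cumo/g.
def numo_to_cumo(source)
  source.gsub('Numo', 'Cumo').gsub('numo', 'cumo')
end

numo_source = <<~RUBY
  require 'numo/narray'
  a = Numo::SFloat.zeros(2,3)
RUBY

puts numo_to_cumo(numo_source)
```

The rewrite touches every file that mentions Numo, which is why it is awkward to ask library users to apply it to red-chainer itself.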

New Programmable Way

    require 'chainer'

    gpu = 0
    device = Chainer::Device.create(gpu) #=> GpuDevice
    xm = device.xm #=> Cumo
    a = xm::SFloat.zeros(2,3)
    b = xm::SFloat.ones(2,3)
    a + b

    Chainer::CUDA.available?(gpu) #=> true
    Chainer.get_array_module(a) #=> Cumo
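As a rough illustration of how an array module can be looked up from an array, here is a minimal sketch. `FakeNumo`, `FakeCumo`, `array_module_of`, and `new_array_like` are hypothetical names introduced so the example runs without Numo, Cumo, or a GPU; it is not red-chainer's implementation.

```ruby
# Hypothetical stand-ins that only mimic the namespace layout of Numo and Cumo.
module FakeNumo
  class NArray; end
  class SFloat < NArray; end
end

module FakeCumo
  class NArray; end
  class SFloat < NArray; end
end

# Hypothetical counterpart of get_array_module: choose the module
# from the class of the array itself.
def array_module_of(array)
  array.is_a?(FakeCumo::NArray) ? FakeCumo : FakeNumo
end

# Backend-agnostic code goes through xm instead of naming a library.
def new_array_like(array)
  xm = array_module_of(array)
  xm::SFloat.new
end

p array_module_of(FakeCumo::SFloat.new)  # FakeCumo
p array_module_of(FakeNumo::SFloat.new)  # FakeNumo
```

Code written against `xm` needs no text substitution when the backend changes.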

CUDA APIs of red-chainer

Backend APIs of red-chainer
Chainer

Device APIs of red-chainer
• AbstractDevice
• CpuDevice
• GpuDevice

Device APIs of red-chainer
Chainer::Device
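The device classes named on these slides can be pictured with a small sketch. All class and method names below are hypothetical stand-ins (`SketchAbstractDevice`, `create_device`, `FakeNumo`, `FakeCumo`), and the rule that a negative integer selects the CPU is an assumption borrowed from Chainer's device ID convention.

```ruby
# Hypothetical stand-ins for the Numo and Cumo modules.
module FakeNumo; end
module FakeCumo; end

# Sketch of an abstract device: each device knows its array module (xm).
class SketchAbstractDevice
  def xm
    raise NotImplementedError
  end
end

class SketchCpuDevice < SketchAbstractDevice
  def xm
    FakeNumo
  end
end

class SketchGpuDevice < SketchAbstractDevice
  attr_reader :id

  def initialize(id)
    @id = id
  end

  def xm
    FakeCumo
  end
end

# Assumption: negative integers mean CPU, non-negative integers are GPU IDs.
def create_device(device_spec)
  return device_spec if device_spec.is_a?(SketchAbstractDevice)
  unless device_spec.is_a?(Integer)
    raise ArgumentError, "invalid device_spec: #{device_spec.inspect}"
  end
  device_spec < 0 ? SketchCpuDevice.new : SketchGpuDevice.new(device_spec)
end

p create_device(0).xm   # FakeCumo
p create_device(-1).xm  # FakeNumo
```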

Function CPU/GPU Branching

    class Convolution2DFunction < Chainer::Function
      def forward_cpu(inputs)
        x, w, b = inputs
        kh, kw = w.shape[2], w.shape[3]
        @col = Chainer::Utils::Conv.im2col(x, ...)
        y = Chainer::Utils::Math.tensordot(@col, ...)
        y += b if b
        [y.transpose(0, 3, 1, 2)]
      end

      def forward_gpu(inputs)
        x, w, b = inputs
        [Cumo::NArray.conv(x, w, b, ...)]
      end
    end

• If Cumo supports cuDNN, using it in forward_gpu makes convolution fast (not supported yet, though)
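How a base class might route a call to `forward_cpu` or `forward_gpu` can be sketched as follows. `SketchFunction`, `SketchIdentity`, and `FakeGpuArray` are hypothetical names used only for illustration; the actual dispatch logic in red-chainer may differ.

```ruby
# Hypothetical stand-in for a GPU array class such as Cumo::NArray.
class FakeGpuArray; end

# Sketch of a base class that picks the CPU or GPU path from its inputs.
class SketchFunction
  def forward(inputs)
    if inputs.any? { |x| x.is_a?(FakeGpuArray) }
      forward_gpu(inputs)
    else
      forward_cpu(inputs)
    end
  end

  def forward_cpu(inputs)
    raise NotImplementedError
  end

  def forward_gpu(inputs)
    raise NotImplementedError
  end
end

# A subclass only supplies the two device-specific bodies.
class SketchIdentity < SketchFunction
  def forward_cpu(inputs)
    [:cpu, inputs.size]
  end

  def forward_gpu(inputs)
    [:gpu, inputs.size]
  end
end

f = SketchIdentity.new
p f.forward([[1.0, 2.0]])        # [:cpu, 1]
p f.forward([FakeGpuArray.new])  # [:gpu, 1]
```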

Function APIs of red-chainer
Chainer::Function

Future Works

Future Works around Backend/Device APIs
• Chainer::Device.get_from_array(array)
  • Cumo::NArray has to hold the ID of the GPU it is located on.
• Chainer::Variable.to_gpu, Chainer::Variable.to_cpu
  • Cumo itself needs to support Numo/Cumo conversion.

Future Works: Performance Improvement
• Chainer/CuPy is still faster than Red Chainer/Cumo
(1) Improve performance of reduction by compacting dimensions of NArray
(2) Use user-defined kernels to fuse CUDA ops. But Cumo does not support it yet.

    kernel = Cumo::ElementwiseKernel.new(
      'float32 x, float32 y, float32 z',
      'float32 w', # output type
      'w = (x * y) + z;', # CUDA code
      'my_kernel')
    w = kernel.call(x, y, z)
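To see why fusing ops helps, the computation `w = (x * y) + z` can be compared in an unfused and a fused form. This is a CPU-only analogy on plain Ruby arrays, not Cumo code; `multiply_add_unfused` and `multiply_add_fused` are hypothetical names.

```ruby
# Unfused: each op is a separate pass over the data and allocates a
# temporary array. On a GPU this would correspond to two kernel launches.
def multiply_add_unfused(x, y, z)
  tmp = x.zip(y).map { |xi, yi| xi * yi }  # pass 1: x * y
  tmp.zip(z).map { |ti, zi| ti + zi }      # pass 2: tmp + z
end

# Fused: a single pass computes (x * y) + z per element, with no temporary.
def multiply_add_fused(x, y, z)
  Array.new(x.size) { |i| (x[i] * y[i]) + z[i] }
end

x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
z = [0.5, 0.5, 0.5]
p multiply_add_unfused(x, y, z)  # [4.5, 10.5, 18.5]
p multiply_add_fused(x, y, z)    # [4.5, 10.5, 18.5]
```

Both forms give the same result; the fused form reads and writes memory once, which is the saving a user-defined kernel would aim for.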

Future Works: Performance Improvement
Red-chainer handles NumPy/Numo API differences in Ruby
(1) numpy.argmax vs Numo::NArray#max_index

numpy.argmax returns an index within each row, while Numo::NArray#max_index returns an index into the flattened array, so red-chainer currently converts the result in Ruby.

NumPy:

    >>> a = numpy.arange(6).reshape(3,2)*2
    >>> a
    array([[0, 2 ],
           [4, 6 ],
           [8, 10]])
    >>> a.argmax(axis=1)
    array([1, 1, 1])

Numo:

    irb> a = Numo::SFloat.new(3,2).seq*2
    => Numo::SFloat#shape=[3,2]
    [[0, 2 ],
     [4, 6 ],
     [8, 10]]
    irb> b = a.max_index(axis: 1)
    => Numo::Int32#shape=[3]
    [1, 3, 5]
    irb> b.to_a.map.with_index {|v, i| v - a.shape[1] * i }
    => [1, 1, 1]
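The index conversion shown on this slide can be reproduced with plain Ruby arrays. The sketch below mimics the flat indices that max_index yields for a 2-D array and converts them to per-row indices; `flat_max_index_per_row` and `to_row_local` are hypothetical helper names.

```ruby
# Flat (row-major) index of the maximum of each row, mimicking the values
# Numo::NArray#max_index(axis: 1) gives for a 2-D array.
def flat_max_index_per_row(rows)
  ncols = rows.first.size
  rows.each_with_index.map do |row, i|
    i * ncols + row.index(row.max)
  end
end

# Convert flat indices to per-row indices, as numpy.argmax(axis=1) reports.
# For a row-major 2-D array, v - ncols * i equals v % ncols.
def to_row_local(flat_indices, ncols)
  flat_indices.map { |v| v % ncols }
end

a = [[0, 2], [4, 6], [8, 10]]
flat = flat_max_index_per_row(a)
p flat                              # [1, 3, 5]
p to_row_local(flat, a.first.size)  # [1, 1, 1]
```

The modulo form has no dependency on the row number, so it could presumably be expressed as one element-wise operation on the array instead of a Ruby-side map; that would be an assumption to verify against Cumo.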

Future Works: Performance Improvement
(2) Difference in Advanced Indexing

NumPy pairs the index arrays element by element, while Numo takes every combination of row and column indices, so the NumPy result is the diagonal of the Numo result.

NumPy:

    >>> a = numpy.arange(9).reshape(3,3)
    >>> a
    array([[0, 1, 2],
           [3, 4, 5],
           [6, 7, 8]])
    >>> a[[0,1],[0,1]]
    array([0, 4])

Numo:

    irb> a = Numo::SFloat.new(3,3).seq
    => Numo::SFloat#shape=[3,3]
    [[0, 1, 2],
     [3, 4, 5],
     [6, 7, 8]]
    irb> a[[0,1],[0,1]]
    => Numo::SFloat(view)#shape=[2,2]
    [[0, 1],
     [3, 4]]
    irb> a[[0,1],[0,1]].diagonal
    => Numo::SFloat(view)#shape=[2]
    [0, 4]

(3) Fix any other places using to_a, each, or map in red chainer.
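The two indexing behaviours can be modelled on nested Ruby arrays. This sketch only imitates the semantics shown on the slide; `pairwise_index`, `outer_index`, and `diagonal` are hypothetical helpers, not NumPy or Numo APIs.

```ruby
# NumPy-style: index arrays are paired element by element.
def pairwise_index(matrix, rows, cols)
  rows.zip(cols).map { |r, c| matrix[r][c] }
end

# Numo-style: every row index is combined with every column index.
def outer_index(matrix, rows, cols)
  rows.map { |r| cols.map { |c| matrix[r][c] } }
end

# Diagonal of a square nested array.
def diagonal(matrix)
  matrix.each_with_index.map { |row, i| row[i] }
end

a = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
p pairwise_index(a, [0, 1], [0, 1])         # [0, 4]
p outer_index(a, [0, 1], [0, 1])            # [[0, 1], [3, 4]]
p diagonal(outer_index(a, [0, 1], [0, 1]))  # [0, 4]
```

Taking the diagonal recovers the NumPy result but builds a full block of n × n elements to keep n of them, which is one reason this difference matters for performance.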

More Future Works
• Support cuDNN for high performance convolutional networks (to be called from Convolution2DFunction#forward_gpu)
• Support Float16
• Conversion between Numo::NArray and Cumo::NArray

Supported Functions List
• 88 methods
• Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64, SFloat (float), DFloat (double), SComplex, DComplex (mixed)

-, -@, [], []=, *, **, /, &, %, ^, <<, >>, |, ~,
acos, acosh, allocate, asin, asinh, atan, atan2, atanh,
cbrt, ceil, coerce_cast, conj, copysign, cos, cosh, divmod,
eq, erf, erfc, exp, exp10, exp2, expm1, extract, eye, fill,
floor, ge (>=), gemm, gt (>), hypot, im, inspect, ldexp, le (<=), log,
log10, log1p, log2, logseq, lt (<), max, max_index, maximum, mean, min,
min_index, minimum, mulsum, ne, nearly_eq, poly, prod, ptp, reciprocal, rint,
rms, round, seq, sign, signbit, sin, sinc, sinh, sqrt, square,
stddev, store, sum, tan, tanh, trunc, var

Not Yet
• IntXX, FloatXX, ComplexXX (mixed): 23 methods
  abs, arg, bincount, clip, cumprod, cumsum, frexp, imag, isfinite, isinf,
  isnan, isneginf, isposinf, median, minmax, modf, rand, rand_norm, real, set_imag,
  set_real, sort, sort_index
• Bit: 20 methods (most of all)
  [], []=, &, ^, |, ~, all?, any?, coerce_cast, copy,
  count_false, count_true, eq, extract, fill, mask, none?, store, where, where2

end
• Introduction to Cumo
• New Backend/Device APIs on red-chainer
• Future works
Contributions are welcome!
