Slide 1

Introduction of Cumo, and Integration to Red Chainer
Naotoshi Seo
Nov 17, 2018, RubyData Tokyo Meetup
https://github.com/sonots/cumo

Slide 2

Self Introduction
• Naotoshi Seo @sonots
• DeNA Co., Ltd.
• CRuby committer
• Recently working on DNN framework development at Preferred Networks, Inc. (on secondment)

Slide 3

Outline
• Project Introduction
• Integration to Red Chainer
• Future Works

Slide 4

Project Introduction

Slide 5

Project Introduction
What is Cumo?
• (NVIDIA) GPU version of Ruby/Numo
• Pronounced like koo-mo

Slide 6

https://ruby-numo.github.io/numo-narray/

Slide 7

Why GPU?
• GPU is fast, and has recently become essential for Deep Learning
• GPU is good at parallel computation
• Order of magnitude: roughly 24 cores on a CPU vs. 3,000–4,000 cores on a GPU
• GPU is bad at branching; it simplifies the branch-prediction and out-of-order machinery instead
• GPU is well suited to matrix computation

Slide 8

Performance Comparison with Numo
Element-wise operation
[Chart: elapsed time vs. array size, Numo vs. Cumo; smaller is better]
a = xm::Float32.ones(size)
b = xm::Float32.ones(size)
a + b
40 times faster for size of 10^8
Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz / NVIDIA Volta V100 (AWS p3.2xlarge)

Slide 9

Dot product
[Chart: elapsed time vs. array size, Numo vs. Numo w/ BLAS vs. Cumo; smaller is better]
a = xm::Float32.ones(100, size/100)
b = xm::Float32.ones(size/100, 100)
a.dot(b)
831 times faster than Numo w/ BLAS for size of 10^8
Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz / NVIDIA Volta V100 (AWS p3.2xlarge)

Slide 10

red-chainer MNIST example
• 380 sec/epoch → 5 sec/epoch (AWS p3.2xlarge)
• Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz → NVIDIA Volta V100
• 75 times faster!

Slide 11

https://github.com/sonots/cumo-logo
Generated by https://hatchful.shopify.com/

Slide 12

Outline
• Project Introduction
• Integration to Red Chainer
• Future Works

Slide 13

Red Chainer?
• Ruby port of Chainer
• See @hatappi's slides today
https://github.com/red-data-tools/red-chainer

Slide 14

Integration to Red Chainer
Integrate Cumo into Red Chainer
https://github.com/red-data-tools/red-chainer/pull/67

Slide 15

Previous Way
• Cumo is highly compatible with Numo, so running sed over the sources makes a Numo application work with Cumo:
  sed -i -e 's/Numo/Cumo/g' -e 's/numo/cumo/g' **/*.rb
• It works, but it makes no sense to ask red-chainer users to convert red-chainer itself.

require 'numo/narray'
a = Numo::SFloat.zeros(2, 3)
b = Numo::SFloat.ones(2, 3)
a + b

require 'cumo/narray'
a = Cumo::SFloat.zeros(2, 3)
b = Cumo::SFloat.ones(2, 3)
a + b
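The sed-based conversion above boils down to a textual namespace swap. A minimal plain-Ruby sketch (no Numo/Cumo gems required) shows the same substitution with gsub:

```ruby
# Plain-Ruby illustration of the Numo -> Cumo textual conversion.
src = "require 'numo/narray'\na = Numo::SFloat.zeros(2, 3)\n"
converted = src.gsub('Numo', 'Cumo').gsub('numo', 'cumo')
puts converted
```

Because Cumo mirrors Numo's class names and method signatures, this mechanical rewrite is usually enough for application code, which is exactly why it is a poor fit for rewriting red-chainer itself.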

Slide 16

New Programmable Way

require 'chainer'
gpu = 0
Chainer::CUDA.available?(gpu) #=> true
device = Chainer::Device.create(gpu) #=> GpuDevice
xm = device.xm #=> Cumo
a = xm::SFloat.zeros(2, 3)
b = xm::SFloat.ones(2, 3)
a + b
Chainer.get_array_module(a) #=> Cumo
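The device-based backend selection shown above can be sketched in plain Ruby. This is a hypothetical stub, not red-chainer's actual code: Numo/Cumo are empty modules here, and create_device stands in for Chainer::Device.create. The idea is that a non-negative GPU id yields a device whose xm is Cumo, otherwise Numo:

```ruby
# Hedged sketch of backend selection; Numo/Cumo are stubbed as empty modules
# so this runs without the gems. create_device is a hypothetical stand-in
# for Chainer::Device.create.
module Numo; end
module Cumo; end

class CpuDevice
  def xm
    Numo # CPU backend
  end
end

class GpuDevice
  def initialize(id)
    @id = id # GPU id this device is bound to
  end

  def xm
    Cumo # GPU backend
  end
end

def create_device(gpu)
  gpu >= 0 ? GpuDevice.new(gpu) : CpuDevice.new
end

create_device(0).xm  #=> Cumo
create_device(-1).xm #=> Numo
```

Code written against xm stays backend-agnostic: the same xm::SFloat calls run on either library.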

Slide 17

CUDA APIs of red-chainer

Slide 18

Backend APIs of red-chainer
[Diagram: Chainer]

Slide 19

Device APIs of red-chainer
[Diagram: AbstractDevice with subclasses CpuDevice and GpuDevice]

Slide 20

Device APIs of red-chainer
[Diagram: Chainer::Device]

Slide 21

Function CPU/GPU Branching

class Convolution2DFunction < Chainer::Function
  def forward_cpu(inputs)
    x, w, b = inputs
    kh, kw = w.shape[2], w.shape[3]
    @col = Chainer::Utils::Conv.im2col(x, ...)
    y = Chainer::Utils::Math.tensordot(@col, ...)
    y += b if b
    [y.transpose(0, 3, 1, 2)]
  end

  def forward_gpu(inputs)
    x, w, b = inputs
    [Cumo::NArray.conv(x, w, b, ...)]
  end
end

Once Cumo supports cuDNN, using it here will make this faster (not supported yet).
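How a forward call ends up in forward_cpu or forward_gpu can be sketched as dispatch on the input array's namespace. This is a simplified illustration, not red-chainer's actual implementation (the real framework dispatches via the array module, cf. Chainer.get_array_module); stub classes stand in for the real NArray types so it runs standalone:

```ruby
# Simplified sketch of CPU/GPU branching. Stub classes stand in for the
# real Numo/Cumo NArray types.
module Numo; class NArray; end; end
module Cumo; class NArray; end; end

class Function
  def forward(inputs)
    # Route to the GPU path when the input array comes from the Cumo namespace
    if inputs.first.class.name.start_with?('Cumo')
      forward_gpu(inputs)
    else
      forward_cpu(inputs)
    end
  end
end

class MyFunction < Function
  def forward_cpu(_inputs)
    :cpu
  end

  def forward_gpu(_inputs)
    :gpu
  end
end

MyFunction.new.forward([Numo::NArray.new]) #=> :cpu
MyFunction.new.forward([Cumo::NArray.new]) #=> :gpu
```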

Slide 22

Function APIs of red-chainer
[Diagram: Chainer::Function]

Slide 23

Outline
• Project Introduction
• Integration to Red Chainer
• Future Works

Slide 24

Future Works around Backend/Device APIs
• Chainer::Device.get_from_array(array): Cumo::NArray has to hold the ID of the GPU it is located on.
• Chainer::Variable.to_gpu / Chainer::Variable.to_cpu: Cumo itself needs to support Numo/Cumo conversion.

Slide 25

Future Works: Performance Improvement
Chainer/CuPy is still faster than Red Chainer/Cumo.
(1) Improve the performance of reduction by compacting the dimensions of NArray
(2) Use user-defined kernels to fuse CUDA ops; Cumo does not support this yet:

kernel = Cumo::ElementwiseKernel.new(
  'float32 x, float32 y, float32 z',
  'float32 w',        # output type
  'w = (x * y) + z;', # CUDA code
  'my_kernel')
w = kernel.call(x, y, z)
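What fusing buys can be illustrated in plain Ruby: an unfused w = x*y + z needs two elementwise passes with a temporary array in between, while a fused kernel does one pass per element, which is what an ElementwiseKernel would compile to CUDA:

```ruby
x = [1.0, 2.0]
y = [3.0, 4.0]
z = [0.5, 0.5]

# Unfused: two elementwise operations, one temporary array between them
tmp = x.zip(y).map { |a, b| a * b }
w_unfused = tmp.zip(z).map { |a, b| a + b }

# Fused: a single pass computing (x * y) + z per element, no temporary
w_fused = x.each_index.map { |i| x[i] * y[i] + z[i] }

w_fused == w_unfused #=> true, both are [3.5, 8.5]
```

On a GPU the saving is larger than it looks here: fusion avoids launching two kernels and writing the intermediate array to device memory.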

Slide 26

Future Works: Performance Improvement
Red-chainer handles NumPy/Numo API differences in Ruby.
(1) numpy.argmax vs Numo::NArray#max_index

>>> a = numpy.arange(6).reshape(3, 2) * 2
>>> a
array([[ 0,  2],
       [ 4,  6],
       [ 8, 10]])
>>> a.argmax(axis=1)
array([1, 1, 1])

irb> a = Numo::SFloat.new(3, 2).seq * 2
=> Numo::SFloat#shape=[3,2]
[[0, 2], [4, 6], [8, 10]]
irb> b = a.max_index(axis: 1)
=> Numo::Int32#shape=[3]
[1, 3, 5]
irb> b.to_a.map.with_index {|v, i| v - a.shape[1] * i }
=> [1, 1, 1]
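The index arithmetic in the irb session above, which turns max_index's flat indices into per-row argmax values, can be replayed on plain arrays:

```ruby
# max_index returns flat indices into the whole array; subtracting
# row * ncols recovers the column (argmax) within each row.
flat  = [1, 3, 5] # flat indices for a 3x2 array, as from max_index(axis: 1)
ncols = 2
per_row = flat.map.with_index { |v, i| v - ncols * i }
per_row #=> [1, 1, 1]
```

This round-trip through to_a and map is exactly the kind of Ruby-side work the slide flags as a performance cost: it pulls data off the GPU and loops in the interpreter.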

Slide 27

Future Works: Performance Improvement
(2) Difference in advanced indexing

>>> a = numpy.arange(9).reshape(3, 3)
>>> a
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> a[[0, 1], [0, 1]]
array([0, 4])

irb> a = Numo::SFloat.new(3, 3).seq
=> Numo::SFloat#shape=[3,3]
[[0, 1, 2], [3, 4, 5], [6, 7, 8]]
irb> a[[0, 1], [0, 1]]
=> Numo::SFloat(view)#shape=[2,2]
[[0, 1], [3, 4]]
irb> a[[0, 1], [0, 1]].diagonal
=> Numo::SFloat(view)#shape=[2]
[0, 4]

(3) Fix any other places using to_a, each, or map in red-chainer.
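The .diagonal workaround above exists because Numo's a[[0,1],[0,1]] takes the outer product of the two index lists, whereas NumPy pairs them elementwise. A plain-array sketch of why the diagonal recovers NumPy's result:

```ruby
a = [[0, 1, 2],
     [3, 4, 5],
     [6, 7, 8]]
rows = [0, 1]
cols = [0, 1]

# Numo-style: every (row, col) combination -> a 2x2 result
outer = rows.map { |r| cols.map { |c| a[r][c] } }

# NumPy-style: pair indices elementwise; those pairs sit on the diagonal
diag = outer.each_with_index.map { |row, i| row[i] }

outer #=> [[0, 1], [3, 4]]
diag  #=> [0, 4]
```

The pairs (0,0) and (1,1) are exactly the diagonal entries of the outer-product result, so taking .diagonal after outer-product indexing reproduces NumPy's semantics, at the cost of materializing the larger intermediate view.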

Slide 28

More Future Works
• Support cuDNN for high-performance convolutional networks
• Support Float16
• Conversion between Numo::NArray and Cumo::NArray
[Same Convolution2DFunction code as the Function CPU/GPU Branching slide, with forward_gpu marked as the place where cuDNN would be used]

Slide 29

Supported Functions List
* 88 methods:
- << atan2 eq floor log10 min_index rms stddev
-@ >> atanh erf ge (>=) log1p minimum round store
[] | cbrt erfc gemm log2 mulsum seq sum
[]= ~ ceil exp gt (>) logseq ne sign tan
* acos coerce_cast exp10 hypot lt (<) nearly_eq signbit tanh
** acosh conj exp2 im max poly sin trunc
/ allocate copysign expm1 inspect max_index prod sinc var
& asin cos extract ldexp maximum ptp sinh
% asinh cosh eye le (<=) mean reciprocal sqrt
^ atan divmod fill log min rint square
* Types: Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64, SFloat (float), DFloat (double), SComplex, DComplex, and mixed

Slide 30

Not Yet
* 23 methods for IntXX, FloatXX, ComplexXX, mixed:
abs arg bincount clip cumprod cumsum frexp imag isfinite isinf isnan isneginf isposinf median minmax modf rand rand_norm real set_imag set_real sort sort_index
* 20 methods (most of all) for Bit:
[] []= & ^ | ~ all? any? none? coerce_cast copy count_false count_true eq extract fill mask store where where2

Slide 31

end
• Introduction to Cumo
• New Backend/Device APIs on red-chainer
• Future works
Contributions are welcome!