• Red Chainer is created under Red Data Tools (https://red-data-tools.github.io/) • Red Data Tools is a project that provides data processing tools for Ruby. • It is a port of Chainer (Python) to Ruby. • We want you to have fun doing Deep Learning in Ruby.
• ONNX is a community project created by Facebook and Microsoft. • ONNX's goal is to make it possible for developers to use the right combinations of tools for their projects. • Contents are expressed in Protocol Buffers.
• An ONNX file contains a list of the compute nodes that make up the graph. • Learned parameters are stored in binary. • They can be converted to Numo::NArray with Numo::NArray.from_binary. • Details: https://github.com/onnx/onnx/blob/master/docs/IR.md
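As a sketch of what that binary storage means (pure Ruby stdlib only; `Numo::NArray.from_binary` is the real Numo call, the `pack`/`unpack` here just illustrates the raw little-endian float32 encoding that ONNX uses):

```ruby
# Learned parameters in an ONNX file are raw little-endian float32 bytes.
# With Numo you would call Numo::NArray.from_binary(bytes); here we decode
# by hand with the stdlib to show what that conversion does.

weights = [0.5, -1.25, 3.0]        # example float32 parameters
bytes   = weights.pack('e*')       # serialize as little-endian float32
decoded = bytes.unpack('e*')       # read them back as Ruby Floats

puts bytes.bytesize                # 4 bytes per float32 => 12
puts decoded.inspect
```

The values above are exactly representable in float32, so the round trip is lossless; arbitrary Floats would be truncated to single precision.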
• onnx-red-chainer generates Ruby code of the model and the learned parameters for Red Chainer from an ONNX file. • You can use the models and learned parameters with Red Chainer when inferring. • You may also change the model yourself and train it anew!
Have fun doing deep learning in Ruby. • I want you to write a DNN model in Red Chainer starting from an existing model. • Of course, you can do the porting manually. • But you can get started more easily by generating it automatically.
Have fun doing deep learning in Ruby. • It currently supports the Chainer v3 API; support for v4 is under development. • Models and learned parameters from other frameworks can be used with Red Chainer via onnx-red-chainer.
• Author of Cumo, a CUDA-aware numerical library for Ruby. • CRuby and Chainer committer. • At ZOZO Technologies, Inc. since Jan 2019. • Started an MLOps team this April.
• Pronounced like koo-mo • Supported by the Ruby Association Grant 2017 (https://www.ruby.or.jp/en/news/20171206) • Logo: https://github.com/sonots/cumo-logo

Project Introduction
Deep Learning • GPU is good at parallel computation: on the order of 24 cores with a CPU versus 3,000–4,000 cores with a GPU. • GPU is bad at branching: it simplifies the branch-prediction and out-of-order mechanisms instead. • GPU is therefore well suited to matrix computation.
1. Round up the requested size to a multiple of 512. 2. cudaMalloc if no free block is available. 3. Push the block to the arena instead of cudaFree. 4. Pop from the arena instead of cudaMalloc if a free block is available. • Implemented Best-fit with Coalescing (BFC), the algorithm used in malloc(3).
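The four steps above can be sketched in pure Ruby (`device_malloc` is a stand-in for cudaMalloc; real Cumo also coalesces adjacent free blocks, which this sketch omits):

```ruby
# Sketch of Cumo's memory-pool steps: round sizes up to 512, reuse freed
# blocks from a per-size arena, and only hit the (slow) device allocator
# on a pool miss.
class MemoryPool
  ROUND = 512

  def initialize
    @arena = Hash.new { |h, k| h[k] = [] }  # rounded size => free blocks
    @malloc_count = 0                       # how often we really allocated
  end

  attr_reader :malloc_count

  def malloc(size)
    size = round_up(size)                   # 1. round up size by 512
    if @arena[size].empty?
      @malloc_count += 1
      device_malloc(size)                   # 2. "cudaMalloc" on a miss
    else
      @arena[size].pop                      # 4. pop a cached block instead
    end
  end

  def free(block)
    @arena[block[:size]].push(block)        # 3. push instead of "cudaFree"
  end

  private

  def round_up(size)
    (size + ROUND - 1) / ROUND * ROUND
  end

  def device_malloc(size)
    { size: size }                          # stand-in for a GPU pointer
  end
end

pool = MemoryPool.new
a = pool.malloc(1000)   # rounds to 1024; real allocation
pool.free(a)            # goes to the arena, not back to the device
b = pool.malloc(700)    # also rounds to 1024; reuses the cached block
puts pool.malloc_count  # only one real allocation happened
```

Rounding to 512-byte classes is what makes reuse likely: two requests of 700 and 1000 bytes land in the same 1024-byte bucket.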
[Benchmark figure: element-wise addition for size 10^8 — Size vs. Numo (ms) / Cumo (ms); measured values not recoverable from the slide. Cumo is faster.]
  a = xm::Float32.ones(size)
  b = xm::Float32.ones(size)
  a + b
(AWS p3 xlarge: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, NVIDIA Volta V100)
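The measurement style behind such a comparison can be sketched with Ruby's stdlib Benchmark (the workload here is a plain-Ruby stand-in; with the gems installed you would substitute `xm = Numo` or `xm = Cumo` and time the same block):

```ruby
require 'benchmark'

# Minimal timing harness in the spirit of the Numo-vs-Cumo slides:
# warm up first (caches, lazy init, GPU context), then keep the best
# of several trials to reduce noise.
def measure(label, warmup: 1, trials: 3)
  warmup.times { yield }
  best = trials.times.map { Benchmark.realtime { yield } }.min
  puts format('%-10s %10.3f ms', label, best * 1000)
  best
end

size = 10_000
a = Array.new(size, 1.0)
b = Array.new(size, 1.0)

measure('add') { a.zip(b).map { |x, y| x + y } }  # stand-in for a + b
```

Taking the minimum over trials (rather than the mean) is a common choice for microbenchmarks, since external interference only ever makes a run slower.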
[Benchmark figure: matrix multiplication for size 10^8 — Size vs. Numo (ms) / Numo (BLAS) (ms) / Cumo (ms); measured values not recoverable from the slide. Cumo is faster.]
  a = xm::Float32.ones(100, size/100)
  b = xm::Float32.ones(size/100, 100)
  a.dot(b)
(AWS p3 xlarge: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, NVIDIA Volta V100)

Review of Last Year
• Cumo is highly compatible with Numo, so sed can make a Numo application work with Cumo:

  sed -i -e "s/Numo/Cumo/g" -e "s/numo/cumo/g" **/*.rb

  # before                        # after
  require 'numo/narray'           require 'cumo/narray'
  a = Numo::SFloat.zeros(2, 3)    a = Cumo::SFloat.zeros(2, 3)
  b = Numo::SFloat.ones(2, 3)     b = Cumo::SFloat.ones(2, 3)
  a + b                           a + b

• It works, but it is nonsense to make users of red-chainer convert red-chainer itself.
  class Convolution2DFunction < Chainer::Function
    def forward_cpu(inputs)
      x, w, b = inputs
      kh, kw = w.shape[2], w.shape[3]
      @col = Chainer::Utils::Conv.im2col(x, ...)
      y = Chainer::Utils::Math.tensordot(@col, ...)
      y += b if b
      [y.transpose(0, 3, 1, 2)]
    end

    def forward_gpu(inputs)
      x, w, b = inputs
      [x.conv(w, b, ...)]
    end
  end
• In the 2012 ImageNet Large Scale Visual Recognition Competition (ILSVRC), a method using a CNN called AlexNet won first place, and DNNs became famous. • Red Chainer therefore has to support fast Convolution; otherwise Red Chainer is useless for CNNs. • https://qiita.com/yu4u/items/7e93c454c9410c4b5427

Fast CNN with cuDNN
• The NVIDIA CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. • It provides highly tuned Convolution, Batch Normalization, Pooling, etc. • cuDNN accelerates widely used deep learning frameworks, including Caffe, Caffe2, Chainer, Keras, MATLAB, MxNet, TensorFlow, and PyTorch. • And now, Red Chainer. • https://developer.nvidia.com/cudnn
• cudnnFindConvolutionForwardAlgorithm and cudnnFindConvolutionBackward(Data|Filter)Algorithm • They try all available algorithms on the given input data and find the fastest one. • Cumo's convolution calls them on the first call and caches the result.

Convolution
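The find-once-then-cache pattern can be sketched in Ruby (the candidate procs below are stand-ins for cuDNN's algorithm choices; the cache key mirrors the idea of keying on the convolution configuration, e.g. shapes and dtype):

```ruby
require 'benchmark'

# On the first call for a given configuration, time every candidate
# algorithm and remember the winner; later calls skip the search.
class AlgoCache
  def initialize(candidates)
    @candidates = candidates   # name => callable (stand-ins for cuDNN algos)
    @cache = {}                # config key => fastest callable
    @searches = 0
  end

  attr_reader :searches

  def run(key, *args)
    algo = @cache[key] ||= find_fastest(*args)
    algo.call(*args)
  end

  private

  def find_fastest(*args)
    @searches += 1
    @candidates.values.min_by { |f| Benchmark.realtime { f.call(*args) } }
  end
end

# Two ways to compute the same result, one much faster than the other.
algos = {
  naive:       ->(n) { (1..n).reduce(0) { |s, i| s + i } },
  closed_form: ->(n) { n * (n + 1) / 2 },
}
cache = AlgoCache.new(algos)

key = [:sum, 1000]        # stands in for (input shape, dtype, conv params)
puts cache.run(key, 1000) # first call: searches all candidates
puts cache.run(key, 1000) # second call: cache hit, no search
puts cache.searches       # the search ran only once
```

Note that cuDNN's find functions actually execute each algorithm on the real input, so the first call is expensive; caching per configuration amortizes that cost across a training run.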
• 23 days → 17 hours to finish: 32 times faster! (AWS p3 xlarge: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, NVIDIA Volta V100) • TODO: red-chainer currently has logic that computes in Ruby; removing it should achieve even better performance.
• Implemented in C++ with a thin Python binding = far less host-side overhead • Pluggable device backends = open to quickly adding new device support • Pure C++ API = available for Python-free apps • Goals: Speed, Environment Support, Quick Deployment

ChainerX
  import chainerx as chx

  x = chx.ones((2, 3), dtype=chx.float32, device='cuda:0')
  y = (x + 1).require_grad()
  z = chx.exp(y).sum()
  z.backward()

• NumPy-like API • Provides NN functions such as conv and batch_norm • Multiple device support • Differentiable via require_grad()
(Last Year's Presentation) • Red Chainer integration • Support for fast Convolutional Neural Networks with cuDNN • 32 times faster! • Introduction of ChainerX • A Ruby binding implementation is welcome!
server • My company, ZOZO Technologies, for travel support • @hatappi and @naitoh for their work on red-chainer, Numo, and Cumo • The red-data-tools org and Speee, Inc. for hosting the meetup • Preferred Networks, Inc. and the developers of Chainer/CuPy/ChainerX (including me) for the reference implementation • And my wife, for giving me time to develop