Red Chainer and Cumo: Practical Deep Learning in Ruby at RubyKaigi 2019

by Naotoshi Seo

Slide 1

Slide 1 text

Red Chainer and Cumo: Practical Deep Learning in Ruby RubyKaigi 2019 Naotoshi Seo & Yusaku Hatanaka

Slide 2

Slide 2 text

Outline • Current status of DNN and Scientific Computing in Ruby • Introduction to Red Chainer • Red Chainer's problem, approach to the solution • second part => Cumo

Slide 3

Slide 3 text

Red Chainer and Cumo: Practical Deep Learning in Ruby RubyKaigi 2019 Naotoshi Seo & Yusaku Hatanaka

Slide 4

Slide 4 text

self.introduction • Yusaku Hatanaka (@hatappi) • Red Data Tools Member. • I'm creating a DNN Framework for Ruby! • Merpay, Inc from Jan 2019

Slide 5

Slide 5 text

Outline • Current status of DNN and Scientific Computing in Ruby • Introduction to Red Chainer • Red Chainer's problem, approach to the solution • second part => Cumo

Slide 6

Slide 6 text

Python • DNN: Chainer, TensorFlow, MXNet • Tensor: NumPy, CuPy

Slide 7

Slide 7 text

Ruby • DNN: Red Chainer, TensorFlow.rb, MXNet.rb, menoh-ruby • Tensor: Numo::NArray, Cumo

Slide 8

Slide 8 text

Outline • Current status of DNN and Scientific Computing in Ruby • Introduction to Red Chainer • Red Chainer's problem, approach to the solution • second part => Cumo

Slide 9

Slide 9 text

About Red Chainer • Deep Learning (DNN) framework for Ruby. • Developed as part of Red Data Tools (https://red-data-tools.github.io/) • Red Data Tools is a project that provides data processing tools for Ruby. • A port of Chainer (Python) to Ruby. • I want you to have fun doing deep learning in Ruby

Slide 10

Slide 10 text

Structure of Red Chainer • Red Chainer runs on Numo::NArray (CPU) or Cumo (GPU)

Slide 11

Slide 11 text

MNIST • Input: 28 x 28 image (784 values) • Fully Connected (1000 units) + ReLU • Fully Connected (1000 units) + ReLU • Fully Connected (10 units) • Softmax cross entropy over the 10 outputs (example output values: -8.644561, -10.105622, 2.354139)
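
The layer stack on this slide can be sketched in plain Ruby without any gem. This is only an illustration of the arithmetic, not Red Chainer's implementation: the helper names (`linear`, `relu`, `softmax`, `softmax_cross_entropy`) are hypothetical, and the three logits are the example values shown on the slide, treated here as a three-class output for brevity.

```ruby
# Fully connected layer: y[j] = sum_i(w[j][i] * x[i]) + b[j]
def linear(x, w, b)
  w.each_with_index.map { |row, j| row.zip(x).sum { |wi, xi| wi * xi } + b[j] }
end

# ReLU keeps positive values and clamps the rest to zero.
def relu(x)
  x.map { |v| v > 0 ? v : 0.0 }
end

# Numerically stable softmax: subtract the maximum before exponentiating.
def softmax(logits)
  max = logits.max
  exps = logits.map { |v| Math.exp(v - max) }
  total = exps.sum
  exps.map { |e| e / total }
end

# Cross entropy of the softmax output against an integer class label.
def softmax_cross_entropy(logits, label)
  -Math.log(softmax(logits)[label])
end

# Example logits from the slide; index 2 has the largest score.
logits = [-8.644561, -10.105622, 2.354139]
probs = softmax(logits)
loss = softmax_cross_entropy(logits, 2)
```

The loss is close to zero here because almost all probability mass falls on the class with the largest logit.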

Slide 12

Slide 12 text

MNIST

Slide 13

Slide 13 text

MNIST

Slide 14

Slide 14 text

Red Chainer’s history • 2017/08 First commit • 2017/10 First release • 2018/05 Convolutional Neural Network support • 2019/03 Support for Chainer v3

Slide 15

Slide 15 text

Outline • Current status of DNN and Scientific Computing in Ruby • Introduction to Red Chainer • Red Chainer's problem, approach to the solution • second part => Cumo

Slide 16

Slide 16 text

Red Chainer's problem • Speed • Collaboration with other DNN frameworks 

Slide 17

Slide 17 text

Red Chainer's problem • Speed <= sonots san • Collaboration with other DNN frameworks <= me!!! 

Slide 18

Slide 18 text

Collaboration with other DNN frameworks • For example, there are models and learned parameters in Chainer. • But you cannot use them in Red Chainer.

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

What’s ONNX • ONNX is the Open Neural Network Exchange format. • A community project created by Facebook and Microsoft. • ONNX's goal is to let developers use the right combination of tools for their project. • Contents are expressed in Protocol Buffers.

Slide 21

Slide 21 text

Protocol Buffers • released to the open source community by Google in 2008 • language-neutral, platform-neutral extensible mechanism for serializing structured data.

Slide 22

Slide 22 text

protoc --ruby_out=. profile.proto
protoc --go_out=. profile.proto

Slide 23

Slide 23 text

GET / serialized data

Slide 24

Slide 24 text

Chainer output ONNX file • github.com/chainer/onnx-chainer • ONNX support for Chainer • Outputs an ONNX file from a Chainer model

Slide 25

Slide 25 text

ONNX Intermediate Representation • An ONNX file contains a list of the parameters that make up the graph and a list of compute nodes. • Learned parameters are stored in binary. • They can be converted to Numo::NArray with Numo::NArray.from_binary. • Details: https://github.com/onnx/onnx/blob/master/docs/IR.md
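
To make "stored in binary" concrete, here is a minimal sketch that decodes a byte string of little-endian float32 values into nested Ruby arrays using only the standard library. The helper name `decode_float32` and the sample bytes are assumptions for illustration; with numo-narray installed, the slide's `from_binary` call on a typed class such as `Numo::SFloat` would play the same role.

```ruby
# Sample bytes standing in for a tensor's raw data:
# six little-endian float32 values in row-major order.
raw = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0].pack('e*')
shape = [2, 3]

# Hypothetical helper: decode float32 bytes and reshape them to `shape`.
def decode_float32(raw, shape)
  values = raw.unpack('e*') # 'e' = single-precision float, little-endian
  expected = shape.inject(1, :*)
  unless values.size == expected
    raise ArgumentError, "expected #{expected} values, got #{values.size}"
  end
  # Rebuild the nesting from the innermost dimension outward.
  shape.drop(1).reverse.inject(values) { |flat, dim| flat.each_slice(dim).to_a }
end

tensor = decode_float32(raw, shape)
```

The element count check guards against a shape that does not match the byte length, which is the most likely mistake when reading tensors by hand.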

Slide 26

Slide 26 text

ONNX visualization • You can also visualize models from ONNX files! • Netron is a viewer for neural network, deep learning and machine learning models.

Slide 27

Slide 27 text

Using ONNX with Ruby • menoh-ruby • Menoh (C++) is a DNN inference library. • You can run inference in Ruby using ONNX!

Slide 28

Slide 28 text

menoh-ruby

Slide 29

Slide 29 text

What does Red Chainer use ONNX for?

Slide 30

Slide 30 text

Automatic generation of Ruby code for Red Chainer

Slide 31

Slide 31 text

What’s Automatic generation of Ruby code • github.com/hatappi/onnx-red-chainer • Outputs Ruby code for the model and its learned parameters for Red Chainer from an ONNX file • Use the model and learned parameters with Red Chainer for inference • You can also modify the model yourself and train it anew!

Slide 32

Slide 32 text

onnx-red-chainer

Slide 33

Slide 33 text

onnx-red-chainer + learned parameters

Slide 34

Slide 34 text

Why Automatically Generate Ruby Code? • Red Chainer is for having fun with deep learning in Ruby. • I want you to write DNN models in Red Chainer starting from an existing model. • Of course, you can port a model manually. • Generating the code automatically makes it easy to get started.

Slide 35

Slide 35 text

Future of onnx-red-chainer • Increase the number of supported operators. • Provide an ONNX file export function.

Slide 36

Slide 36 text

Summary (Red Chainer) • Red Chainer is a framework for having fun with deep learning in Ruby. • It currently supports Chainer v3; development will continue toward v4. • Models and learned parameters from other frameworks can be used with Red Chainer via onnx-red-chainer

Slide 37

Slide 37 text

Red Chainer and Cumo: Practical Deep Learning in Ruby RubyKaigi 2019 Naotoshi Seo & Yusaku Hatanaka

Slide 38

Slide 38 text

Self Introduction • Naotoshi Seo @sonots • Author of Cumo, a CUDA-aware numerical library for Ruby. • CRuby and Chainer committer • ZOZO Technologies, Inc. from Jan 2019. • Started an MLOps team this April.

Slide 39

Slide 39 text

Outline • Project Introduction of Cumo (Review of Last Year's Presentation) • What's new in Cumo • Red Chainer Integration • Support Fast Convolutional Neural Networks with cuDNN • Introduction of ChainerX

Slide 40

Slide 40 text

Project Introduction (Review of Last Year's Presentation)

Slide 41

Slide 41 text

What is Cumo? • (NVIDIA) GPU version of Ruby/Numo • Pronounced like koo-mo • Ruby Association Grant 2017: https://www.ruby.or.jp/en/news/20171206 • Logo: https://github.com/sonots/cumo-logo

Slide 42

Slide 42 text

Why GPU? • GPUs are fast, and have recently become essential for deep learning. • GPUs are good at parallel computation. • Order of magnitude: about 24 cores on a CPU versus 3,000 to 4,000 cores on a GPU. • GPUs are bad at branching. • GPUs simplify branch prediction and out-of-order mechanisms instead. • GPUs are suitable for matrix computation.

Slide 43

Slide 43 text

CUDA Memory Pool 1. Round up the memory size to a multiple of 512. 2. Call cudaMalloc if no block is available. 3. Push to the arena instead of calling cudaFree. 4. Pop from the arena instead of calling cudaMalloc if a free block is available. Implements Best-fit with Coalescing (BFC), which is the one used in malloc(3).
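
The four steps on this slide can be sketched in plain Ruby. This is a simplified simulation under stated assumptions, not Cumo's C implementation: `MemoryPoolSketch` and `fake_cuda_malloc` are hypothetical names, device memory is replaced by a fake address counter, and free blocks are reused only on an exact rounded-size match, so the best-fit search and coalescing of BFC are omitted.

```ruby
# Simplified memory pool: rounds sizes up to 512 and recycles freed blocks.
class MemoryPoolSketch
  ROUND = 512

  attr_reader :malloc_calls

  def initialize
    @arena = Hash.new { |h, k| h[k] = [] } # rounded size => free blocks
    @malloc_calls = 0
    @next_address = 0
  end

  # Step 1: round the requested size up to a multiple of 512.
  def round_up(size)
    ((size + ROUND - 1) / ROUND) * ROUND
  end

  # Steps 2 and 4: reuse a free block if one exists, otherwise allocate.
  def alloc(size)
    rounded = round_up(size)
    block = @arena[rounded].pop
    return block if block
    fake_cuda_malloc(rounded)
  end

  # Step 3: return the block to the arena instead of releasing it.
  def free(block)
    @arena[block[:size]].push(block)
  end

  private

  # Stand-in for cudaMalloc: hands out increasing fake addresses.
  def fake_cuda_malloc(size)
    @malloc_calls += 1
    block = { address: @next_address, size: size }
    @next_address += size
    block
  end
end

pool = MemoryPoolSketch.new
first = pool.alloc(100)  # rounded up to 512, triggers one fake cudaMalloc
pool.free(first)         # goes back to the arena
second = pool.alloc(300) # also rounds to 512, so the freed block is reused
```

Rounding makes requests of different sizes land in the same bin, which is what lets the second allocation avoid a new device allocation.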

Slide 44

Slide 44 text

Element-wise Operation • 40 times faster for size of 10^8
a = xm::Float32.ones(size)
b = xm::Float32.ones(size)
a + b
[Table: Size, Numo (ms), Cumo (ms)] • (AWS p3 xlarge) Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz NVIDIA Volta v100

Slide 45

Slide 45 text

Dot product • 831 times faster than Numo w/ BLAS for size of 10^8
a = xm::Float32.ones(100, size/100)
b = xm::Float32.ones(size/100, 100)
a.dot(b)
[Table: Size, Numo (ms), Numo BLAS (ms), Cumo (ms)] • (AWS p3 xlarge) Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz NVIDIA Volta v100

Slide 46

Slide 46 text

Red-chainer mnist example • 380 sec/epoch → 5 sec/epoch • 75 Times Faster !! • (AWS p3 xlarge) Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz NVIDIA Volta v100

Slide 47

Slide 47 text

Red Chainer Integration

Slide 48

Slide 48 text

Last Year's Way • Cumo is highly compatible with Numo, so sed is enough to make a Numo application work with Cumo.
sed -i -e 's/Numo/Cumo/g' -e 's/numo/cumo/g' *.rb
# Numo version
require 'numo/narray'
a = Numo::SFloat.zeros(2, 3)
b = Numo::SFloat.ones(2, 3)
a + b
# Cumo version after sed
require 'cumo/narray'
a = Cumo::SFloat.zeros(2, 3)
b = Cumo::SFloat.ones(2, 3)
a + b
• It works, but it makes no sense to have red-chainer users convert red-chainer itself.

Slide 49

Slide 49 text

New Programmable Way
require 'chainer'
gpu = Chainer::CUDA.available? ? 0 : -1
xm = Chainer::Device.create(gpu).xm #=> Cumo
a = xm::SFloat.zeros(2, 3)
b = xm::SFloat.ones(2, 3)
a + b
Chainer.get_array_module(a) #=> Cumo

Slide 50

Slide 50 text

CUDA APIs of red-chainer

Slide 51

Slide 51 text

Backend APIs of red-chainer • Chainer

Slide 52

Slide 52 text

Device APIs of red-chainer • Chainer::Device

Slide 53

Slide 53 text

Function CPU/GPU Branching
class Convolution2DFunction < Chainer::Function
  def forward_cpu(inputs)
    x, w, b = inputs
    kh, kw = w.shape[2], w.shape[3]
    @col = Chainer::Utils::Conv.im2col(x, ...)
    y = Chainer::Utils::Math.tensordot(@col, ...)
    y += b if b
    [y.transpose(0, 3, 1, 2)]
  end
  def forward_gpu(inputs)
    x, w, b = inputs
    [x.conv(w, b, ...)]
  end
end
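
One way such a `forward_cpu` / `forward_gpu` pair could be selected at runtime is by inspecting the type of the input arrays. The sketch below assumes that dispatch rule and uses stand-in classes so it runs without Numo, Cumo, or red-chainer; `FakeCpuArray`, `FakeGpuArray`, `FunctionSketch`, and `IdentitySketch` are all hypothetical names, not red-chainer's API.

```ruby
# Stand-ins for CPU (Numo) and GPU (Cumo) arrays.
class FakeCpuArray; end
class FakeGpuArray; end

# Base class: picks the GPU path when any input lives on the GPU.
class FunctionSketch
  def forward(inputs)
    if inputs.any? { |x| x.is_a?(FakeGpuArray) }
      forward_gpu(inputs)
    else
      forward_cpu(inputs)
    end
  end

  def forward_cpu(inputs)
    raise NotImplementedError, 'subclasses must implement forward_cpu'
  end

  # Functions without a dedicated GPU kernel fall back to the CPU code.
  def forward_gpu(inputs)
    forward_cpu(inputs)
  end
end

# Toy subclass that only reports which path was taken.
class IdentitySketch < FunctionSketch
  def forward_cpu(inputs)
    [:cpu, inputs.size]
  end

  def forward_gpu(inputs)
    [:gpu, inputs.size]
  end
end

f = IdentitySketch.new
cpu_result = f.forward([FakeCpuArray.new])
gpu_result = f.forward([FakeCpuArray.new, FakeGpuArray.new])
```

Keeping the branch in the base class means each function only has to supply the two bodies, as the convolution example on the slide does.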

Slide 54

Slide 54 text

Support Faster Convolutional Neural Networks (CNN) with cuDNN

Slide 55

Slide 55 text

Convolutional Neural Networks (CNN) • In the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a CNN-based method called AlexNet won first place, and DNNs became famous. • Red Chainer needs to support fast convolution; otherwise it would be of little practical use. • https://qiita.com/yu4u/items/7e93c454c9410c4b5427

Slide 56

Slide 56 text

What is cuDNN • The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. • Supports highly tuned Convolution, Batch Normalization, Pooling, etc. • cuDNN accelerates widely used deep learning frameworks, including Caffe, Caffe2, Chainer, Keras, MATLAB, MXNet, TensorFlow, and PyTorch. • And, Red Chainer now. • https://developer.nvidia.com/cudnn

Slide 57

Slide 57 text

Cumo supports • conv(x, w, b, stride, pad) • conv_transpose • conv_grad_w • batch_norm(x, w, b, stride, pad) • batch_norm_backward • (max|avg)_pool • (max|avg)_pool_backward

Slide 58

Slide 58 text

Convolution Layer • In addition: Batch, Input channel, Output channel, Bias. • https://speakerdeck.com/nineties/convolutionfalseshu-li-toarugorizumu

Slide 59

Slide 59 text

Direct Algorithm • https://speakerdeck.com/nineties/convolutionfalseshu-li-toarugorizumu
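
As a rough illustration of the direct algorithm, the sketch below slides a kernel over a single-channel 2D input with nested loops. It is a teaching example in plain Ruby, not Cumo's or cuDNN's code: `conv2d_direct` is a hypothetical name, and batch, channel, and bias dimensions are left out. Like most deep learning frameworks, it does not flip the kernel.

```ruby
# Direct 2D convolution on nested arrays.
# x: [height][width] input, w: [kh][kw] kernel.
def conv2d_direct(x, w, stride: 1, pad: 0)
  height = x.size
  width = x[0].size
  kh = w.size
  kw = w[0].size
  out_h = (height + 2 * pad - kh) / stride + 1
  out_w = (width + 2 * pad - kw) / stride + 1

  Array.new(out_h) do |oy|
    Array.new(out_w) do |ox|
      acc = 0.0
      kh.times do |ky|
        kw.times do |kx|
          iy = oy * stride + ky - pad
          ix = ox * stride + kx - pad
          # Positions that fall in the zero padding contribute nothing.
          next if iy < 0 || iy >= height || ix < 0 || ix >= width
          acc += x[iy][ix] * w[ky][kx]
        end
      end
      acc
    end
  end
end

x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
w = [[1, 1], [1, 1]]
y = conv2d_direct(x, w)
```

Every output element costs `kh * kw` multiplications, which is why alternative formulations become attractive for larger inputs.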

Slide 60

Slide 60 text

More Algorithms • Direct • Im2col • FFT • Winograd • Which is best? It depends on the input tensor size and available memory.

Slide 61

Slide 61 text

Auto tune • cuDNN supports auto-tuning of the algorithm: • cudnnFindConvolutionForwardAlgorithm • cudnnFindConvolutionBackward(Data|Filter)Algorithm • They try all algorithms for the input data and find the fastest one. • Cumo's convolution calls them on the first call and caches the results.
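
The "benchmark once, then cache" idea can be sketched independently of cuDNN. In this plain-Ruby illustration, `AutoTunerSketch` is a hypothetical class, the candidate algorithms are ordinary lambdas, and the cache key (something like the input shape) is an assumption about what a real implementation would key on.

```ruby
# Times every candidate on the first call for a key and remembers the winner.
class AutoTunerSketch
  attr_reader :benchmark_runs

  def initialize(algorithms)
    @algorithms = algorithms # name => callable
    @cache = {}              # key => name of the fastest algorithm
    @benchmark_runs = 0
  end

  # Runs the cached winner for `key`, benchmarking first if needed.
  def call(key, *args)
    name = (@cache[key] ||= find_fastest(*args))
    @algorithms.fetch(name).call(*args)
  end

  private

  # Tries every algorithm on the given input and returns the fastest name.
  def find_fastest(*args)
    @benchmark_runs += 1
    @algorithms.min_by do |_name, algo|
      started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      algo.call(*args)
      Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    end.first
  end
end

tuner = AutoTunerSketch.new({
  slow: ->(n) { sleep(0.01); n * 2 },
  fast: ->(n) { n * 2 }
})
first = tuner.call([:shape, 4], 4)  # benchmarks both candidates, then runs the winner
second = tuner.call([:shape, 4], 5) # same key, so the cached choice is reused
```

The first call for a new key pays the cost of running every candidate, so this trade-off suits workloads that repeat the same shapes many times, such as training loops.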

Slide 62

Slide 62 text

Convolution • 3906 times faster than Numo for size of 2^10
x = xm::Float32.ones(32, 3, size, size)
w = xm::Float32.ones(2, 3, 3, 3)
b = xm::Float32.ones(2)
y = F.convolution_2d(x, w, b, stride: 2, pad: 1)
[Table: Size, Numo (ms), Cumo (ms), Ratio] • (AWS p3 xlarge) Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz NVIDIA Volta v100

Slide 63

Slide 63 text

Red-chainer cifar example (resnet-18) • 0.12 iters/sec → 3.8 iters/sec • 23 days → 17 hours to finish • 32 Times Faster ! • (AWS p3 xlarge) Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz NVIDIA Volta v100 • TODO: red-chainer currently performs some computations in pure Ruby. Removing them should yield even better performance.

Slide 64

Slide 64 text

ChainerX

Slide 65

Slide 65 text

ChainerX = NumPy-like ndarray + autograd in C++ • Written in C++ with a thin Python binding = far less host-side overhead (Speed) • With pluggable device backends = open to quickly add new device support (Environment Support) • With pure C++ API = available for Python-free apps (Quick Deployment)

Slide 66

Slide 66 text

ChainerX

Slide 67

Slide 67 text

ChainerX

Slide 68

Slide 68 text

ChainerX Python API
import chainerx as chx
x = chx.ones((2, 3), dtype=chx.float32, device='cuda:0')
y = (x + 1).require_grad()
z = chx.exp(y).sum()
z.backward()
• NumPy-like API • Provides NN functions such as conv, batch_norm • Supports multiple devices • Made differentiable by require_grad()

Slide 69

Slide 69 text

ChainerX Ruby API
require 'chainerx'
x = ChainerX.ones([2, 3], dtype: ChainerX::Float32, device: 'cuda:0')
y = (x + 1).require_grad
z = ChainerX.exp(y).sum
z.backward
• NumPy-like API for Ruby • Reuses core code

Slide 70

Slide 70 text

GSoC 2019 on Chainer • https://github.com/chainer/chainer/wiki/GSoC-2019-Project-Ideas

Slide 71

Slide 71 text

Summary (Cumo) • Project Introduction of Cumo (Review of Last Year's Presentation) • Red Chainer Integration • Support Fast Convolutional Neural Networks with cuDNN • 32 times faster! • Introduction of ChainerX • Ruby binding implementation is welcome!

Slide 72

Slide 72 text

Acknowledgements • Ruby Association • 2017 Grant and GPU server • My company, ZOZO Technologies, for travel support. • @hatappi and @naitoh for their work on red-chainer, Numo, and Cumo • red-data-tools org and Speee, Inc. for hosting meetups. • Preferred Networks, Inc. and the developers of Chainer/CuPy/ChainerX (including me) as a reference implementation • And my wife, for giving me time to develop