• Red Chainer is created under Red Data Tools (https://red-data-tools.github.io/) • Red Data Tools is a project that provides data processing tools for Ruby. • It is a port of Chainer (Python) to Ruby. • We want you to have fun doing Deep Learning in Ruby.
• ONNX is a community project created by Facebook and Microsoft. • ONNX's goal is to make it possible for developers to use the right combinations of tools for their projects. • Contents are expressed in Protocol Buffers.
• An ONNX file contains a list of the compute nodes that make up the graph. • Learned parameters are stored in binary. • They can be converted to Numo::NArray with Numo::NArray.from_binary. • Details: https://github.com/onnx/onnx/blob/master/docs/IR.md
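As a sketch of what that binary storage means (pure Ruby stdlib only; `Numo::NArray.from_binary` is the real Numo call, the `pack`/`unpack` here just illustrates the raw little-endian float32 encoding that ONNX uses):

```ruby
# Learned parameters in an ONNX file are raw little-endian float32 bytes.
# With Numo you would call Numo::NArray.from_binary(bytes); here we decode
# by hand with the stdlib to show what that conversion does.

weights = [0.5, -1.25, 3.0]        # example float32 parameters
bytes   = weights.pack('e*')       # serialize as little-endian float32
decoded = bytes.unpack('e*')       # read them back as Ruby Floats

puts bytes.bytesize                # 4 bytes per float32 => 12
puts decoded.inspect
```

The values above are exactly representable in float32, so the round trip is lossless; arbitrary Floats would be truncated to single precision.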
• onnx-red-chainer generates Ruby code of the model and the learned parameters for Red Chainer from an ONNX file. • You can use the models and learned parameters with Red Chainer when inferring. • You may also change the model yourself and train it anew!
Have fun doing deep learning in Ruby. • I want you to write a DNN model in Red Chainer starting from an existing model. • Of course, you can do the porting manually. • But you can get started more easily by generating it automatically.
Have fun doing deep learning in Ruby. • It currently supports the Chainer v3 API; support for v4 is under development. • Models and learned parameters from other frameworks can be used with Red Chainer via onnx-red-chainer.
• Author of Cumo, a CUDA-aware numerical library for Ruby. • CRuby and Chainer committer. • At ZOZO Technologies, Inc. since Jan 2019. • Started an MLOps team this April.
• Pronounced like koo-mo • Supported by the Ruby Association Grant 2017 (https://www.ruby.or.jp/en/news/20171206) • Logo: https://github.com/sonots/cumo-logo

Project Introduction
Deep Learning • GPU is good at parallel computation: on the order of 24 cores with a CPU versus 3,000–4,000 cores with a GPU. • GPU is bad at branching: it simplifies the branch-prediction and out-of-order mechanisms instead. • GPU is therefore well suited to matrix computation.
1. Round up the requested size to a multiple of 512. 2. cudaMalloc if no free block is available. 3. Push the block to the arena instead of cudaFree. 4. Pop from the arena instead of cudaMalloc if a free block is available. • Implemented Best-fit with Coalescing (BFC), the algorithm used in malloc(3).
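The four steps above can be sketched in pure Ruby (`device_malloc` is a stand-in for cudaMalloc; real Cumo also coalesces adjacent free blocks, which this sketch omits):

```ruby
# Sketch of Cumo's memory-pool steps: round sizes up to 512, reuse freed
# blocks from a per-size arena, and only hit the (slow) device allocator
# on a pool miss.
class MemoryPool
  ROUND = 512

  def initialize
    @arena = Hash.new { |h, k| h[k] = [] }  # rounded size => free blocks
    @malloc_count = 0                       # how often we really allocated
  end

  attr_reader :malloc_count

  def malloc(size)
    size = round_up(size)                   # 1. round up size by 512
    if @arena[size].empty?
      @malloc_count += 1
      device_malloc(size)                   # 2. "cudaMalloc" on a miss
    else
      @arena[size].pop                      # 4. pop a cached block instead
    end
  end

  def free(block)
    @arena[block[:size]].push(block)        # 3. push instead of "cudaFree"
  end

  private

  def round_up(size)
    (size + ROUND - 1) / ROUND * ROUND
  end

  def device_malloc(size)
    { size: size }                          # stand-in for a GPU pointer
  end
end

pool = MemoryPool.new
a = pool.malloc(1000)   # rounds to 1024; real allocation
pool.free(a)            # goes to the arena, not back to the device
b = pool.malloc(700)    # also rounds to 1024; reuses the cached block
puts pool.malloc_count  # only one real allocation happened
```

Rounding to 512-byte classes is what makes reuse likely: two requests of 700 and 1000 bytes land in the same 1024-byte bucket.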
[Benchmark figure: element-wise addition for size 10^8 — Size vs. Numo (ms) / Cumo (ms); measured values not recoverable from the slide. Cumo is faster.]
  a = xm::Float32.ones(size)
  b = xm::Float32.ones(size)
  a + b
(AWS p3 xlarge: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, NVIDIA Volta V100)
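The measurement style behind such a comparison can be sketched with Ruby's stdlib Benchmark (the workload here is a plain-Ruby stand-in; with the gems installed you would substitute `xm = Numo` or `xm = Cumo` and time the same block):

```ruby
require 'benchmark'

# Minimal timing harness in the spirit of the Numo-vs-Cumo slides:
# warm up first (caches, lazy init, GPU context), then keep the best
# of several trials to reduce noise.
def measure(label, warmup: 1, trials: 3)
  warmup.times { yield }
  best = trials.times.map { Benchmark.realtime { yield } }.min
  puts format('%-10s %10.3f ms', label, best * 1000)
  best
end

size = 10_000
a = Array.new(size, 1.0)
b = Array.new(size, 1.0)

measure('add') { a.zip(b).map { |x, y| x + y } }  # stand-in for a + b
```

Taking the minimum over trials (rather than the mean) is a common choice for microbenchmarks, since external interference only ever makes a run slower.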
[Benchmark figure: matrix multiplication for size 10^8 — Size vs. Numo (ms) / Numo (BLAS) (ms) / Cumo (ms); measured values not recoverable from the slide. Cumo is faster.]
  a = xm::Float32.ones(100, size/100)
  b = xm::Float32.ones(size/100, 100)
  a.dot(b)
(AWS p3 xlarge: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, NVIDIA Volta V100)

Review of Last Year
• Cumo is highly compatible with Numo, so sed can make a Numo application work with Cumo:

  sed -i -e "s/Numo/Cumo/g" -e "s/numo/cumo/g" **/*.rb

  # before                        # after
  require 'numo/narray'           require 'cumo/narray'
  a = Numo::SFloat.zeros(2, 3)    a = Cumo::SFloat.zeros(2, 3)
  b = Numo::SFloat.ones(2, 3)     b = Cumo::SFloat.ones(2, 3)
  a + b                           a + b

• It works, but it is nonsense to make users of red-chainer convert red-chainer itself.
  class Convolution2DFunction < Chainer::Function
    def forward_cpu(inputs)
      x, w, b = inputs
      kh, kw = w.shape[2], w.shape[3]
      @col = Chainer::Utils::Conv.im2col(x, ...)
      y = Chainer::Utils::Math.tensordot(@col, ...)
      y += b if b
      [y.transpose(0, 3, 1, 2)]
    end

    def forward_gpu(inputs)
      x, w, b = inputs
      [x.conv(w, b, ...)]
    end
  end
• In the 2012 ImageNet Large Scale Visual Recognition Competition (ILSVRC), a method using a CNN called AlexNet won first place, and DNNs became famous. • Red Chainer therefore has to support fast Convolution; otherwise Red Chainer is useless for CNNs. • https://qiita.com/yu4u/items/7e93c454c9410c4b5427

Fast CNN with cuDNN
• The NVIDIA CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. • It provides highly tuned Convolution, Batch Normalization, Pooling, etc. • cuDNN accelerates widely used deep learning frameworks, including Caffe, Caffe2, Chainer, Keras, MATLAB, MxNet, TensorFlow, and PyTorch. • And now, Red Chainer. • https://developer.nvidia.com/cudnn
• cudnnFindConvolutionForwardAlgorithm and cudnnFindConvolutionBackward(Data|Filter)Algorithm • They try all available algorithms on the given input data and find the fastest one. • Cumo's convolution calls them on the first call and caches the result.

Convolution
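The find-once-then-cache pattern can be sketched in Ruby (the candidate procs below are stand-ins for cuDNN's algorithm choices; the cache key mirrors the idea of keying on the convolution configuration, e.g. shapes and dtype):

```ruby
require 'benchmark'

# On the first call for a given configuration, time every candidate
# algorithm and remember the winner; later calls skip the search.
class AlgoCache
  def initialize(candidates)
    @candidates = candidates   # name => callable (stand-ins for cuDNN algos)
    @cache = {}                # config key => fastest callable
    @searches = 0
  end

  attr_reader :searches

  def run(key, *args)
    algo = @cache[key] ||= find_fastest(*args)
    algo.call(*args)
  end

  private

  def find_fastest(*args)
    @searches += 1
    @candidates.values.min_by { |f| Benchmark.realtime { f.call(*args) } }
  end
end

# Two ways to compute the same result, one much faster than the other.
algos = {
  naive:       ->(n) { (1..n).reduce(0) { |s, i| s + i } },
  closed_form: ->(n) { n * (n + 1) / 2 },
}
cache = AlgoCache.new(algos)

key = [:sum, 1000]        # stands in for (input shape, dtype, conv params)
puts cache.run(key, 1000) # first call: searches all candidates
puts cache.run(key, 1000) # second call: cache hit, no search
puts cache.searches       # the search ran only once
```

Note that cuDNN's find functions actually execute each algorithm on the real input, so the first call is expensive; caching per configuration amortizes that cost across a training run.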
• 23 days → 17 hours to finish: 32 times faster! (AWS p3 xlarge: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, NVIDIA Volta V100) • TODO: red-chainer currently has logic that computes in Ruby; removing it should achieve even better performance.
• Implemented in C++ with a thin Python binding = far less host-side overhead • Pluggable device backends = open to quickly adding new device support • Pure C++ API = available for Python-free apps • Goals: Speed, Environment Support, Quick Deployment

ChainerX
  import chainerx as chx

  x = chx.ones((2, 3), dtype=chx.float32, device='cuda:0')
  y = (x + 1).require_grad()
  z = chx.exp(y).sum()
  z.backward()

• NumPy-like API • Provides NN functions such as conv and batch_norm • Multiple device support • Differentiable via require_grad()
(Last Year's Presentation) • Red Chainer integration • Support for fast Convolutional Neural Networks with cuDNN • 32 times faster! • Introduction of ChainerX • A Ruby binding implementation is welcome!
server • My company, ZOZO Technologies, for travel support • @hatappi and @naitoh for their work on red-chainer, Numo, and Cumo • The red-data-tools org and Speee, Inc. for hosting the meetup • Preferred Networks, Inc. and the developers of Chainer/CuPy/ChainerX (including me) for the reference implementation • And my wife, for giving me time to develop