Red Chainer and Cumo: Practical Deep Learning in Ruby at RubyKaigi 2019

Naotoshi Seo

April 20, 2019

Transcript

  1. Red Chainer and Cumo: Practical Deep Learning in Ruby RubyKaigi

    2019 Naotoshi Seo & Yusaku Hatanaka
  2. Outline • Current status of DNN and Scientific Computing in

    Ruby • Introduction to Red Chainer • Red Chainer's problem, approach to the solution • second part => Cumo
  3. Red Chainer and Cumo: Practical Deep Learning in Ruby RubyKaigi

    2019 Naotoshi Seo & Yusaku Hatanaka
  4. self.introduction • Yusaku Hatanaka (@hatappi) • Red Data Tools Member.

    • I'm creating a DNN Framework for Ruby! • Merpay, Inc from Jan 2019
  5. Outline • Current status of DNN and Scientific Computing in

    Ruby • Introduction to Red Chainer • Red Chainer's problem, approach to the solution • second part => Cumo
  6. Python Chainer DNN TensorFlow MXNet NumPy Tensor CuPy

  7. Ruby Red Chainer DNN TensorFlow.rb MXNet.rb menoh-ruby Numo::NArray Tensor Cumo

  8. Outline • Current status of DNN and Scientific Computing in

    Ruby • Introduction to Red Chainer • Red Chainer's problem, approach to the solution • second part => Cumo
  9. About Red Chainer

    • Deep Learning (DNN) framework for Ruby.
    • Created within Red Data Tools (https://red-data-tools.github.io/)
    • Red Data Tools is a project that provides data processing tools for Ruby.
    • A port of Chainer (Python) to Ruby.
    • I want you to have fun doing Deep Learning in Ruby
  10. structure of Red Chainer Red Chainer Numo::NArray Cumo CPU GPU

  11. MNIST

    Network: 28 x 28 input image → 784 values → Fully Connected (1000 units) → ReLU → Fully Connected (1000 units) → ReLU → Fully Connected (10 units) → softmax cross entropy. Example output scores: -8.644561, -10.105622, 2.354139, ...
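
The last stage of the network above, softmax cross entropy, can be sketched in plain Ruby. This is a minimal illustration rather than Red Chainer's implementation: `softmax_cross_entropy` is a hypothetical helper, and the three scores from the slide are reused as a toy 3-class input (the real network has 10 outputs).

```ruby
# Numerically stable softmax cross entropy for a single sample.
# logits: Array of Float scores, one per class; label: Integer class index.
def softmax_cross_entropy(logits, label)
  max = logits.max
  # log(sum(exp(z))), with the maximum subtracted first to avoid overflow
  log_sum_exp = max + Math.log(logits.sum { |z| Math.exp(z - max) })
  # The loss is -log(softmax(logits)[label])
  log_sum_exp - logits[label]
end

# Toy 3-class input reusing the three scores visible on the slide.
logits = [-8.644561, -10.105622, 2.354139]
loss_correct = softmax_cross_entropy(logits, 2) # label matches the largest score
loss_wrong = softmax_cross_entropy(logits, 0)   # label points at a low score
puts format("loss (label 2): %.6f", loss_correct)
puts format("loss (label 0): %.6f", loss_wrong)
```

The loss is close to zero when the label matches the highest score and large when it does not, which is what training tries to minimize.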
  12. MNIST

  13. MNIST

  14. Red Chainer’s history

    • 2017/08 First commit
    • 2017/10 First release
    • 2018/05 Convolutional Neural Network
    • 2019/03 Support for Chainer v3
  15. Outline • Current status of DNN and Scientific Computing in

    Ruby • Introduction to Red Chainer • Red Chainer's problem, approach to the solution • second part => Cumo
  16. Red Chainer's problem • Speed • Collaboration with other DNN

    frameworks

  17. Red Chainer's problem • Speed <= sonots san • Collaboration

    with other DNN frameworks <= me!!!

  18. Collaboration with other DNN frameworks • For example, there are

    models and learned parameters in Chainer. • But you cannot use them in Red Chainer.
  19. None
  20. What’s ONNX

    • ONNX is the Open Neural Network Exchange format.
    • A community project created by Facebook and Microsoft.
    • ONNX's goal is to make it possible for developers to use the right combinations of tools for their project.
    • Contents are expressed in Protocol Buffers.
  21. Protocol Buffers • released to the open source community by

    Google in 2008 • language-neutral, platform-neutral extensible mechanism for serializing structured data.
  22. protoc --ruby_out=. profile.proto protoc --go_out=. profile.proto

  23. GET / serialized data

  24. Chainer output ONNX file • github.com/chainer/onnx-chainer • ONNX support by

    Chainer • output ONNX file from Chainer Model
  25. ONNX Intermediate Representation

    • ONNX contains a list of parameters that make up the graph and a list of each compute node.
    • Learned parameters are stored in binary
    • Can be converted to Numo::NArray with Numo::NArray.from_binary
    • detail: https://github.com/onnx/onnx/blob/master/docs/IR.md
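
To make "stored in binary" concrete, here is a small sketch that decodes a byte string of float32 values using only core Ruby. It assumes the parameters are packed as little-endian 32-bit floats; `decode_float32` is a hypothetical helper for illustration, and in practice Numo::NArray.from_binary (named on the slide) does this job.

```ruby
# Decode a little-endian float32 byte string into a nested Ruby array.
# bytes: binary String; shape: Array of Integer dimensions.
def decode_float32(bytes, shape)
  values = bytes.unpack("e*") # "e" = single-precision float, little-endian
  expected = shape.inject(1, :*)
  raise ArgumentError, "size mismatch" unless values.size == expected
  # Fold the flat list into nested rows, innermost dimension first.
  shape.drop(1).reverse.inject(values) { |flat, dim| flat.each_slice(dim).to_a }
end

# Stand-in for the raw bytes of a 2 x 3 weight tensor (assumed layout).
raw = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0].pack("e*")
weights = decode_float32(raw, [2, 3])
p weights # => [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
```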
  26. ONNX visualization • You can also visualize models from ONNX

    files! • Netron is a viewer for neural network, deep learning and machine learning models.
  27. Using ONNX with Ruby

    • menoh-ruby
    • Menoh (C++) is a DNN inference library.
    • You can run inference in Ruby using ONNX!
  28. menoh-ruby

  29. What does Red Chainer use ONNX for?

  30. Automatic generation of Ruby code for Red Chainer

  31. What’s Automatic generation of Ruby code

    • github.com/hatappi/onnx-red-chainer
    • Outputs Ruby code of the model and learned parameters for Red Chainer from an ONNX file
    • Use the model and learned parameters with Red Chainer for inference
    • You may also change the model yourself and train it anew!
  32. onnx-red-chainer

  33. onnx-red-chainer + learned parameters

  34. Why Automatically Generate Ruby Code? • Red Chainer is for

    fun and deep learning in Ruby. • I want you to write a DNN model in Red Chainer using an existing model. • Of course you can do porting manually. • You can easily get started by creating it automatically.
  35. Future of onnx-red-chainer

    • Increase the number of supported operators.
    • Provide a function to export ONNX files.
  36. Summary (Red Chainer)

    • Red Chainer is a framework for having fun with deep learning in Ruby.
    • Currently supports Chainer v3; development toward v4 will continue.
    • Models and learned parameters from other frameworks can be used with Red Chainer by using onnx-red-chainer
  37. Red Chainer and Cumo: Practical Deep Learning in Ruby RubyKaigi

    2019 Naotoshi Seo & Yusaku Hatanaka
  38. Self Introduction

    • Naotoshi Seo @sonots
    • The author of Cumo, a CUDA-aware numerical library for Ruby.
    • CRuby and Chainer committer
    • ZOZO Technologies, Inc. from Jan 2019.
    • Started the MLOps team this April.
  39. Outline

    • Project Introduction of Cumo (Review of Last Year Presentation)
    • What's new to Cumo
    • Red Chainer Integration
    • Support Fast Convolutional Neural Networks with cuDNN
    • Introduction of ChainerX
  40. Project Introduction (Review of Last Year Presentation)

  41. What is Cumo?

    • (NVIDIA) GPU version of Ruby/Numo
    • Pronounced like koo-mo
    • Ruby Association Grant 2017 https://www.ruby.or.jp/en/news/20171206
    • Logo: https://github.com/sonots/cumo-logo
  42. Why GPU?

    • GPU is fast, and recently essential for Deep Learning
    • GPU is good at parallel computation
      • Order of magnitude is like 24 cores with CPU
      • 3,000 ~ 4,000 cores with GPU
    • GPU is bad at branching
      • GPU simplifies branch prediction and out-of-order mechanism instead.
    • GPU is suitable for matrix computation
  43. CUDA Memory Pool

    1. Round up the memory size to a multiple of 512
    2. cudaMalloc if no block is available
    3. Push to arena instead of cudaFree
    4. Pop from arena if a free block is available instead of cudaMalloc

    Implemented Best-fit with Coalescing (BFC), which is the one used in malloc(3)
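
The four steps above can be sketched with plain Ruby objects. This is a toy model, not Cumo's code: `ToyMemoryPool` and `device_malloc` are hypothetical names, no CUDA call is made, and free blocks are matched by exact rounded size rather than by Best-fit with Coalescing.

```ruby
# A toy memory pool that mimics the four steps with plain Ruby objects.
# "device_malloc" only counts simulated allocations; it never touches a GPU.
class ToyMemoryPool
  ROUND = 512
  attr_reader :malloc_count

  def initialize
    @arena = Hash.new { |h, size| h[size] = [] } # rounded size => free blocks
    @malloc_count = 0
  end

  # Step 1: round the requested size up to a multiple of 512.
  def round_up(size)
    ((size + ROUND - 1) / ROUND) * ROUND
  end

  def alloc(size)
    rounded = round_up(size)
    free_blocks = @arena[rounded]
    # Step 4: reuse a cached block when one is available.
    return free_blocks.pop unless free_blocks.empty?

    # Step 2: otherwise fall back to a (simulated) device allocation.
    device_malloc(rounded)
  end

  # Step 3: keep the block in the arena instead of releasing it.
  def free(block)
    @arena[block[:size]].push(block)
  end

  private

  def device_malloc(size)
    @malloc_count += 1
    { id: @malloc_count, size: size }
  end
end

pool = ToyMemoryPool.new
a = pool.alloc(1000) # rounded to 1024, simulated malloc
pool.free(a)
b = pool.alloc(600)  # also rounds to 1024, served from the arena
puts b[:size]          # 1024
puts pool.malloc_count # 1
```

The point of the design is that the second allocation never reaches the expensive allocator, which is where the pool saves time.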
  44. Element-wise Operation (Review of Last Year)

    40 times faster for size of 10^8

        a = xm::Float32.ones(size)
        b = xm::Float32.ones(size)
        a + b

    [Table: Size, Numo (ms), Cumo (ms); values lost in extraction]

    Environment (AWS p3 xlarge): Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, NVIDIA Volta v100
  45. Dot product (Review of Last Year)

    831 times faster than Numo w/ BLAS for size of 10^8

        a = xm::Float32.ones(100, size/100)
        b = xm::Float32.ones(size/100, 100)
        a.dot(b)

    [Table: Size, Numo (ms), Numo/BLAS (ms), Cumo (ms); values lost in extraction]

    Environment (AWS p3 xlarge): Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, NVIDIA Volta v100
  46. Red-chainer mnist example (Review of Last Year)

    380 sec/epoch → 5 sec/epoch: 75 Times Faster!!

    Environment (AWS p3 xlarge): Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, NVIDIA Volta v100
  47. Red Chainer Integration

  48. Last Year's Way

    • Cumo is highly compatible with Numo, so sed can make a Numo application work with Cumo.

        sed -i -e 's/Numo/Cumo/g' -e 's/numo/cumo/g' *.rb

      Before:

        require 'numo/narray'
        a = Numo::SFloat.zeros(2, 3)
        b = Numo::SFloat.ones(2, 3)
        a + b

      After:

        require 'cumo/narray'
        a = Cumo::SFloat.zeros(2, 3)
        b = Cumo::SFloat.ones(2, 3)
        a + b

    • It works, but it makes no sense to have users of red-chainer convert red-chainer itself.
  49. New Programmable Way

        require 'chainer'
        gpu = Chainer::CUDA.available? ? 0 : -1
        xm = Chainer::Device.create(gpu).xm #=> Cumo
        a = xm::SFloat.zeros(2, 3)
        b = xm::SFloat.ones(2, 3)
        a + b

        Chainer.get_array_module(a) #=> Cumo
  50. CUDA APIs of red-chainer

  51. Backend APIs of red-chainer (Chainer)

  52. Device APIs of red-chainer (Chainer::Device)

  53. Function CPU/GPU Branching

        class Convolution2DFunction < Chainer::Function
          def forward_cpu(inputs)
            x, w, b = inputs
            kh, kw = w.shape[2], w.shape[3]
            @col = Chainer::Utils::Conv.im2col(x, ...)
            y = Chainer::Utils::Math.tensordot(@col, ...)
            y += b if b
            [y.transpose(0, 3, 1, 2)]
          end

          def forward_gpu(inputs)
            x, w, b = inputs
            [x.conv(w, b, ...)]
          end
        end
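
How a call could end up in `forward_cpu` or `forward_gpu` can be sketched without any gems. Everything here is a stand-in chosen for illustration: `CpuArray` and `GpuArray` play the roles of Numo and Cumo arrays, and `ToyFunction` is a hypothetical base class, not red-chainer's `Chainer::Function`.

```ruby
CpuArray = Struct.new(:data) # stand-in for a Numo array
GpuArray = Struct.new(:data) # stand-in for a Cumo array

# Hypothetical base class: picks the implementation from the input types.
class ToyFunction
  def forward(inputs)
    if inputs.any? { |x| x.is_a?(GpuArray) }
      forward_gpu(inputs)
    else
      forward_cpu(inputs)
    end
  end
end

# A function that doubles its input, with one code path per device.
class ToyScale < ToyFunction
  attr_reader :last_path

  def forward_cpu(inputs)
    @last_path = :cpu
    x, = inputs
    [CpuArray.new(x.data.map { |v| v * 2 })]
  end

  def forward_gpu(inputs)
    @last_path = :gpu
    x, = inputs
    [GpuArray.new(x.data.map { |v| v * 2 })]
  end
end

f = ToyScale.new
y_cpu, = f.forward([CpuArray.new([1, 2, 3])])
cpu_path = f.last_path
y_gpu, = f.forward([GpuArray.new([1, 2, 3])])
gpu_path = f.last_path
puts "#{cpu_path} #{gpu_path}"
```

Callers only ever invoke `forward`, so user code stays the same on both devices while each function is free to use a device-specific fast path.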
  54. Support Faster Convolutional Neural Networks (CNN) with cuDNN

  55. Convolutional Neural Networks (CNN)

    • In the 2012 image recognition competition ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a CNN-based method called AlexNet won first place, and DNNs became famous.
    • Red Chainer needs to support fast convolution; otherwise it could fairly be called useless.

    https://qiita.com/yu4u/items/7e93c454c9410c4b5427
  56. What is cuDNN

    • The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks.
    • Supports highly tuned Convolution, Batch Normalization, Pooling, etc.
    • cuDNN accelerates widely used deep learning frameworks, including Caffe, Caffe2, Chainer, Keras, MATLAB, MxNet, TensorFlow, and PyTorch.
    • And, Red Chainer now.
    • https://developer.nvidia.com/cudnn
  57. Cumo supports

    • conv(x, w, b, stride, pad)
    • conv_transpose
    • conv_grad_w
    • batch_norm(x, w, b, stride, pad)
    • batch_norm_backward
    • (max|avg)_pool
    • (max|avg)_pool_backward
  58. Convolution Layer

    https://speakerdeck.com/nineties/convolutionfalseshu-li-toarugorizumu
    • In addition, Batch, Input channel, Output channel, Bias.
  59. Direct Algorithm

    https://speakerdeck.com/nineties/convolutionfalseshu-li-toarugorizumu

  60. More Algorithms

    • Direct
    • Im2col
    • FFT
    • Winograd

    Which is best? It depends on input tensor size and available memory
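
To show that two of these algorithms compute the same thing, here is a pure-Ruby sketch of the Direct and Im2col approaches. It is a simplified illustration under stated assumptions: one 2-D input, one kernel, stride 1, no padding, no batch or channel dimensions, and hypothetical helper names (`conv_direct`, `im2col`, `conv_im2col`).

```ruby
# Direct algorithm: slide the kernel over the image and accumulate products.
def conv_direct(x, w)
  kh, kw = w.size, w[0].size
  oh, ow = x.size - kh + 1, x[0].size - kw + 1
  Array.new(oh) do |i|
    Array.new(ow) do |j|
      sum = 0
      kh.times { |di| kw.times { |dj| sum += x[i + di][j + dj] * w[di][dj] } }
      sum
    end
  end
end

# im2col: copy every kernel-sized patch of the image into one flat row.
def im2col(x, kh, kw)
  oh, ow = x.size - kh + 1, x[0].size - kw + 1
  (0...oh).flat_map do |i|
    (0...ow).map do |j|
      (0...kh).flat_map { |di| x[i + di][j, kw] }
    end
  end
end

# Im2col algorithm: the convolution becomes a matrix-vector product.
def conv_im2col(x, w)
  kh, kw = w.size, w[0].size
  ow = x[0].size - kw + 1
  flat_w = w.flatten
  products = im2col(x, kh, kw).map { |patch| patch.zip(flat_w).sum { |a, b| a * b } }
  products.each_slice(ow).to_a
end

x = [[1, 2, 3, 0],
     [4, 5, 6, 1],
     [7, 8, 9, 2]]
w = [[1, 0],
     [0, -1]]
p conv_direct(x, w) # => [[-4, -4, 2], [-4, -4, 4]]
p conv_im2col(x, w) # => [[-4, -4, 2], [-4, -4, 4]]
```

Im2col trades extra memory (the patch matrix duplicates overlapping pixels) for the ability to reuse a fast matrix multiplication, which is one reason the best choice depends on tensor size and available memory.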
  61. Auto tune

    • cuDNN supports auto-tuning of algorithm
      • cudnnFindConvolutionForwardAlgorithm
      • cudnnFindConvolutionBackward(Data|Filter)Algorithm
    • They try all algorithms for the input data, and find the fastest one.
    • Cumo's convolution calls them on the first call and caches the results.
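
The "try everything once, then cache the winner" idea can be sketched in plain Ruby. This does not call cuDNN; `ToyAutoTuner` and its methods are hypothetical, and two ways of summing an array stand in for the competing convolution algorithms.

```ruby
require "benchmark"

# Hypothetical auto-tuner: times every candidate once per input key and
# remembers the winner, so later calls with the same key skip the search.
class ToyAutoTuner
  attr_reader :search_count

  def initialize(candidates)
    @candidates = candidates # name => callable
    @cache = {}              # key => name of the fastest candidate
    @search_count = 0
  end

  def call(key, *args)
    name = @cache[key] ||= find_fastest(*args)
    @candidates.fetch(name).call(*args)
  end

  def best_for(key)
    @cache[key]
  end

  private

  def find_fastest(*args)
    @search_count += 1
    fastest = @candidates.min_by { |_name, algo| Benchmark.realtime { algo.call(*args) } }
    fastest.first
  end
end

sum_loop = lambda do |xs|
  total = 0
  xs.each { |v| total += v }
  total
end
sum_builtin = ->(xs) { xs.sum }
tuner = ToyAutoTuner.new({ loop: sum_loop, builtin: sum_builtin })

data = (1..10_000).to_a
first = tuner.call([data.size], data)  # runs the search, then the winner
second = tuner.call([data.size], data) # cache hit: no second search
puts tuner.search_count
```

Keying the cache on the input shape matters because, as the previous slide notes, the fastest algorithm depends on the tensor size.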
  62. Convolution

    3906 times faster than Numo for size of 2^10

        x = xm::Float32.ones(32, 3, size, size)
        w = xm::Float32.ones(2, 3, 3, 3)
        b = xm::Float32.ones(2)
        y = F.convolution_2d(x, w, b, stride: 2, pad: 1)

    [Table: Size, Numo (ms), Cumo (ms), Ratio; values lost in extraction]

    Environment (AWS p3 xlarge): Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, NVIDIA Volta v100
  63. Red-chainer cifar example (resnet-18)

    • 0.12 iters/sec → 3.8 iters/sec
    • 23 days → 17 hours to finish
    • 32 Times Faster!

    Environment (AWS p3 xlarge): Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz, NVIDIA Volta v100

    TODO: red-chainer currently has some logic that is computed in Ruby. Once it is removed, we should be able to achieve better performance.
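
The three numbers on this slide are consistent with each other, which a few lines of Ruby arithmetic can confirm. The variable names are illustrative; the inputs are the figures quoted on the slide.

```ruby
cpu_iters_per_sec = 0.12 # Numo (CPU) throughput from the slide
gpu_iters_per_sec = 3.8  # Cumo + cuDNN (GPU) throughput from the slide
speedup = gpu_iters_per_sec / cpu_iters_per_sec

cpu_days = 23.0 # CPU training time from the slide
gpu_hours = cpu_days * 24 / speedup

puts format("speedup: %.1fx", speedup)                  # about 31.7x
puts format("GPU training time: %.1f hours", gpu_hours) # about 17.4 hours
```

Rounded, that is the 32x speedup and the 17 hours quoted above.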
  64. ChainerX

  65. ChainerX = NumPy-like ndarray + autograd in C++

    • Speed: written in C++ w/ a thin Python binding = far less host-side overhead
    • Environment Support: with pluggable device backends = open to quickly add new device support
    • Quick Deployment: with pure C++ API = available for Python-free apps
  66. ChainerX

  67. ChainerX

  68. ChainerX Python API

        import chainerx as chx
        x = chx.ones((2, 3), dtype=chx.float32, device='cuda:0')
        y = (x + 1).require_grad()
        z = chx.exp(y).sum()
        z.backward()

    • NumPy-like API
    • Provides NN functions such as conv, batch_norm
    • Supports multiple devices
    • Arrays become differentiable via require_grad()
  69. ChainerX Ruby API

        require 'chainerx'
        x = ChainerX.ones([2, 3], dtype: ChainerX::Float32, device: 'cuda:0')
        y = (x + 1).require_grad
        z = ChainerX.exp(y).sum
        z.backward

    • NumPy-like API for Ruby
    • Reuse core codes
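
What `require_grad` and `backward` compute in the example above can be illustrated with a tiny reverse-mode autograd in plain Ruby. This is a scalar-only toy under stated assumptions, not the ChainerX API: the `Var` class is hypothetical, and an array of three scalars stands in for the tensor.

```ruby
# A minimal scalar reverse-mode autograd, just enough for z = sum(exp(x + 1)).
class Var
  attr_reader :value
  attr_accessor :grad

  def initialize(value, parents = [])
    @value = value
    @parents = parents # pairs of [parent Var, local derivative]
    @grad = 0.0
  end

  def +(other)
    other = Var.new(other.to_f) unless other.is_a?(Var)
    Var.new(@value + other.value, [[self, 1.0], [other, 1.0]])
  end

  def exp
    out = Math.exp(@value)
    Var.new(out, [[self, out]]) # d exp(v) / dv = exp(v)
  end

  # Push the upstream gradient to every ancestor via the chain rule.
  def backward(upstream = 1.0)
    @grad += upstream
    @parents.each { |parent, local| parent.backward(upstream * local) }
  end
end

xs = [1.0, 1.0, 1.0].map { |v| Var.new(v) } # plays the role of x (all ones)
ys = xs.map { |x| x + 1 }                   # y = x + 1
z = ys.map(&:exp).inject(:+)                # z = sum(exp(y))
z.backward

puts z.value         # 3 * e^2
puts ys.map(&:grad).inspect # each entry is e^2
```

Each gradient equals exp(y) = e^2, matching the analytic derivative of sum(exp(y)) with respect to y.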
  70. GSoC 2019 on Chainer https://github.com/chainer/chainer/wiki/GSoC-2019-Project-Ideas

  71. Summary (Cumo)

    • Project Introduction of Cumo (Review of Last Year Presentation)
    • Red Chainer Integration
    • Support Fast Convolutional Neural Networks with cuDNN
      • 32 times faster!
    • Introduction of ChainerX
      • Ruby binding implementation is welcome!
  72. Acknowledgements

    • Ruby Association, for the 2017 Grant and GPU server
    • My company, ZOZO Technologies, for travel support.
    • @hatappi and @naitoh for their work on red-chainer, Numo, and Cumo
    • red-data-tools org and Speee, Inc. for hosting meetups.
    • Preferred Networks, Inc. and developers of Chainer/CuPy/ChainerX (including me) as a reference implementation
    • And my wife, for giving me time to develop