Slide 1

Slide 1 text

Using Ruby in data science Kenta Murata, Speee Inc. Fri, Nov 17, 2017 RubyConf 2017 in New Orleans, LA, US These slides are available at https://speakerdeck.com/mrkn/using-ruby-in-data-science 1 / 56

Slide 2

Slide 2 text

How many of you work with data science? Data scientists, data engineers, developers of applications that utilize data 2 / 56

Slide 3

Slide 3 text

We need to be familiar with how data is utilized 3 / 56

Slide 4

Slide 4 text

But Ruby hinders us from becoming familiar with data science because Ruby is difficult to use in data science 4 / 56

Slide 5

Slide 5 text

The situation has been changing recently 5 / 56

Slide 6

Slide 6 text

Now Ruby is getting easier to use with data science little by little 6 / 56

Slide 7

Slide 7 text

Today I'll talk about how we can use Ruby in data science 7 / 56

Slide 8

Slide 8 text

self.introduction Kenta Murata @mrkn (GitHub, Twitter, etc.) moo-ra-ken Full-time CRuby committer and researcher at Speee Inc. bigdecimal, enumerable-statistics, pycall, mxnet.rb, etc. 8 / 56

Slide 9

Slide 9 text

9 / 56

Slide 10

Slide 10 text

"Speee" == "Speed".succ 10 / 56

Slide 11

Slide 11 text

"Speee" == "Speed".succ faster than fast what “speed” means 11 / 56

Slide 12

Slide 12 text

"Speee" == "Speed".succ faster than fast what “speed” means iterate business trial cycles at overwhelming speed 12 / 56

Slide 13

Slide 13 text

Full-time CRuby committer My company employs me as a full-time CRuby committer. My company permits me to do anything that benefits the Ruby ecosystem. This year, I'm working entirely on tools for data science that can be used with applications written in Ruby 13 / 56

Slide 14

Slide 14 text

Topics in this talk 1. The current status of Ruby in data science 2. The patterns to use Ruby in data science 3. The future perspective 4. Conclusion 14 / 56

Slide 15

Slide 15 text

The current status of Ruby in data science 15 / 56

Slide 16

Slide 16 text

Three major projects for data science in Ruby 16 / 56

Slide 17

Slide 17 text

The 1st project: SciRuby NMatrix Daru, Daru-IO, Daru-View RB-GSL GnuplotRB Statsample Mixed_models ArrayFire etc. 17 / 56

Slide 18

Slide 18 text

http://gems.sciruby.com/ 18 / 56

Slide 19

Slide 19 text

SciRuby's benefits and drawbacks Benefits: You only need Ruby. NMatrix supports in-memory sparse matrices, but the supported operations are limited. You can use data frames with Daru. 19 / 56

Slide 20

Slide 20 text

Data frames The basic data structure to manipulate and visualize living data in data science. 2D table data structure like a SQL table. In Ruby, we can use data frames with Daru (or Pandas via pycall as described later). 20 / 56
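
To make the "SQL table" analogy concrete, here is a minimal sketch in plain Ruby that models a data frame as an array of row hashes. This is not Daru's API; the column names and values are made up for illustration. A real data frame library adds typed columns, indexing, and vectorized operations on top of this idea.

```ruby
# A tiny table: each row is a hash keyed by column name (illustrative data only).
rows = [
  { name: 'alice', dept: 'sales', amount: 120 },
  { name: 'bob',   dept: 'sales', amount: 80 },
  { name: 'carol', dept: 'dev',   amount: 200 }
]

# SELECT amount FROM rows  (column access)
amounts = rows.map { |r| r[:amount] }

# SELECT * FROM rows WHERE amount > 100  (row filtering)
large = rows.select { |r| r[:amount] > 100 }

# SELECT dept, SUM(amount) FROM rows GROUP BY dept  (aggregation)
totals = rows.group_by { |r| r[:dept] }
             .transform_values { |rs| rs.sum { |r| r[:amount] } }

p amounts
p large.map { |r| r[:name] }
p totals
```

Column access, row filtering, and grouped aggregation are the operations a data frame library is expected to make concise and fast.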

Slide 21

Slide 21 text

SciRuby's benefits and drawbacks Drawbacks: NMatrix is slow for large amounts of data [NMatrix#362]. Daru lacks functionality needed for practical data science work. Poorly documented, so difficult to use. Reason for the drawbacks: the small population of developers and users. 21 / 56

Slide 22

Slide 22 text

The 2nd project: Ruby Numo Numo::NArray Numo::FFTE Numo::FFTW Numo::Gnuplot Numo::GSL Numo::Linalg 22 / 56

Slide 23

Slide 23 text

Ruby Numo's benefits and drawbacks Benefits: You only need Ruby. Numo::NArray is faster than NMatrix and pure Ruby. Drawbacks: No sparse matrix support. No data frame support. Poorly documented. 23 / 56

Slide 24

Slide 24 text

For the details of Ruby Numo You can watch Masa Tanaka's talk in RubyKaigi 2017 at http://rubykaigi.org/2017/presentations/masa16- tanaka.html 24 / 56

Slide 25

Slide 25 text

Which is better, SciRuby or Ruby Numo? For data science (w/o other languages), SciRuby is better because it has Daru. For scientific computing, Ruby Numo is better because NMatrix is too slow. 25 / 56

Slide 26

Slide 26 text

The 3rd project: Red Data Tools red-arrow red-chainer red-arrow-nmatrix red-arrow-numo-narray red-arrow-pycall 26 / 56

Slide 27

Slide 27 text

Red Data Tools' benefits and drawbacks Benefits: It supports Apache Arrow. The core developer, Kohei Suto, is a member of Apache Arrow's PMC. Drawbacks: Too young to use in production. Currently it supports only data I/O; data manipulation is not yet supported. 27 / 56

Slide 28

Slide 28 text

It is hard to do data science with Ruby alone 28 / 56

Slide 29

Slide 29 text

Almost all data scientists wouldn't want to use Ruby in their jobs 29 / 56

Slide 30

Slide 30 text

Because they need the full power of the standard data tools in Python and R, especially in exploratory data analysis 30 / 56

Slide 31

Slide 31 text

Ruby and Ruby on Rails are best for writing business web applications. 31 / 56

Slide 32

Slide 32 text

You should use Ruby and other languages like Python together 32 / 56

Slide 33

Slide 33 text

pycall 33 / 56

Slide 34

Slide 34 text

What is pycall? Pycall allows you to use Python libraries from your Ruby code very naturally. Pycall consists of two parts: the Ruby binding library for libpython.so, and an object-oriented protocol gateway between Ruby and Python 34 / 56

Slide 35

Slide 35 text

Example use of pycall: This example uses numpy via pycall.
[1] pry(main)> require 'numpy'
=> true
[2] pry(main)> x = Numpy.arange(2 * 3).reshape([2, 3])
=> array([[0, 1, 2],
       [3, 4, 5]])
[3] pry(main)> y = Numpy.arange(3 * 4).reshape([3, 4])
=> array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
[4] pry(main)> z = x.dot y
=> array([[20, 23, 26, 29],
       [56, 68, 80, 92]])
[5] pry(main)> z.shape
=> (2, 4)
35 / 56

Slide 36

Slide 36 text

Pycall family wrapper gems The following gems are available: numpy, pandas, matplotlib. The following gems are future work: scikit-learn, seaborn, bokeh, keras, etc. You can use any Python library even without a wrapper gem 36 / 56

Slide 37

Slide 37 text

Demonstration 37 / 56

Slide 38

Slide 38 text

Resources about pycall I used in RubyKaigi 2017 Demonstrations

Slide 39

Slide 39 text

Other example usages of pycall: Blog posts about scikit-learn examples by Soren D ("Using the scikit-learn machine learning library in Ruby using PyCall" and "Implementing OCR using a Random Forest Classifier in Ruby"). Mai Nguyen's workshop material from the KiwiRuby conference

Slide 40

Slide 40 text

Pycall provides us access to Python's data tools 40 / 56

Slide 41

Slide 41 text

You can use all the following tools from your Ruby code: numpy, pandas, pillow, matplotlib, bokeh, holoviews, scikit-learn, scikit-image, keras, tensorflow, etc. 41 / 56

Slide 42

Slide 42 text

The current best patterns to use Ruby in data science 42 / 56

Slide 43

Slide 43 text

You should use Ruby and other languages together Because: Almost all data scientists wouldn't want to use Ruby in their jobs. They need the full power of standard data tools like pandas in exploratory data analysis. Ruby and Ruby on Rails are best for writing business web applications. 43 / 56

Slide 44

Slide 44 text

Three implementation patterns I propose three implementation patterns to integrate applications written in Ruby with data processing systems written in Python 1. Referring to the same database directly 2. RPC by serialized data like JSON 3. Calling directly via pycall 44 / 56

Slide 45

Slide 45 text

1. Referring to the same database directly 45 / 56

Slide 46

Slide 46 text

2. RPC by serialized data like JSON 46 / 56
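
A minimal sketch of this pattern from the Ruby side, assuming a hypothetical Python prediction service that accepts a JSON request and returns a JSON reply. The helper names (`build_request`, `parse_response`), the `predict` method name, and the payload fields and values are illustrative assumptions, not part of the talk. The transport (HTTP, a message queue, etc.) is left out; only the serialization boundary is shown.

```ruby
require 'json'

# Serialize a request for a hypothetical Python-side service.
def build_request(method, params, id: 1)
  JSON.generate({ 'method' => method, 'params' => params, 'id' => id })
end

# Parse the JSON reply and raise if the service reported an error.
def parse_response(body)
  reply = JSON.parse(body)
  raise "remote error: #{reply['error']}" if reply['error']
  reply['result']
end

request_body = build_request('predict', { 'features' => [5.1, 3.5, 1.4, 0.2] })

# Stand-in for what the Python side might send back (made-up values).
response_body = '{"id": 1, "result": {"label": "setosa", "score": 0.97}, "error": null}'
result = parse_response(response_body)

puts request_body
puts result['label']
```

Because only plain JSON crosses the boundary, the Ruby application and the Python system can be deployed and scaled separately, at the cost of serializing every request and reply.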

Slide 47

Slide 47 text

3. Calling directly via pycall 47 / 56

Slide 48

Slide 48 text

Choose the right pattern according to the situation. 1. Referring to the same database directly 2. RPC by serialized data like JSON 3. Calling directly via pycall 48 / 56

Slide 49

Slide 49 text

The future perspective 49 / 56

Slide 50

Slide 50 text

Two topics about the future: Apache Arrow; GPGPU and deep learning 50 / 56

Slide 51

Slide 51 text

Apache Arrow and Red Data Tools Apache Arrow will be the core of almost all data tools. Pandas 2.0 will employ Apache Arrow as its core. PySpark already uses Apache Arrow to exchange data between Python and Spark. Red Data Tools is important for the future of Ruby's data science ecosystem. You should join the Red Data Tools project if you are interested in Apache Arrow. https://red-data-tools.github.io/ 51 / 56

Slide 52

Slide 52 text

GPGPU Now we have ArrayFire by @prasunanand. He is also developing RbCUDA in RubyGrant 2017. @sonots will develop Cumo, a Cupy clone for Numo::NArray, in RubyGrant 2017 52 / 56

Slide 53

Slide 53 text

Deep Learning We already have tensorflow.rb written by @Arafatk. In Red Data Tools, @hatappi started to make RedChainer, a Chainer clone. I'm working on a Ruby binding for MXNet 53 / 56

Slide 54

Slide 54 text

Conclusion 54 / 56

Slide 55

Slide 55 text

Conclusion I described three major data science projects in Ruby. I demonstrated an example usage of pycall. I illustrated three patterns to integrate applications written in Ruby with data processing systems written in Python. I talked about the future perspective 55 / 56

Slide 56

Slide 56 text

Docker image to try data tools for Ruby We prepared a Docker image for you to try data tools for Ruby.
$ docker run -it --rm -p 8888:8888 -v $(pwd):/home/jovyan/work rubydata/notebooks
56 / 56