Using Ruby in data science

Kenta Murata
November 17, 2017

The slides for my talk, to be presented at RubyConf 2017.

Transcript

  1. Using Ruby in data science

    Kenta Murata, Speee Inc. Fri, Nov 17, 2017. RubyConf 2017 in New Orleans, LA, US. These slides are available at https://speakerdeck.com/mrkn/using-ruby-in-data-science
  2. How many of you are involved in data science?

    Data scientists, data engineers, or developers of applications that utilize data.
  3. But Ruby hinders us from becoming familiar with data science, because Ruby is difficult to use in data science.
  4. self.introduction

    Kenta Murata, @mrkn ("moo-ra-ken") on GitHub, Twitter, etc. Full-time CRuby committer and researcher at Speee Inc. Works on bigdecimal, enumerable-statistics, pycall, mxnet.rb, etc.
  5. "Speee" == "Speed".succ

    Faster than fast. What "speed" means: iterating business trial cycles at overwhelming speed.
  6. Full-time CRuby committer

    My company employs me as a full-time CRuby committer and permits me to do anything that benefits the Ruby ecosystem. This year, I am working entirely on tools for data science that can be used with applications written in Ruby.
  7. Topics in this talk

    1. The current status of Ruby in data science
    2. The patterns for using Ruby in data science
    3. The future perspective
    4. Conclusion
  8. SciRuby's benefits and drawbacks

    Benefits: You need only Ruby. NMatrix supports in-memory sparse matrices, but the supported operations are limited. You can use data frames with Daru.
  9. Data frames

    The basic data structure for manipulating and visualizing live data in data science. It is a 2D table data structure, like a SQL table. In Ruby, we can use data frames with Daru (or pandas via pycall, as described later).
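
To make the "2D table like a SQL table" idea concrete, here is a minimal pure-Ruby sketch of the two operations a data frame is used for most often: filtering rows and aggregating by group. It deliberately does not use Daru or pandas; the sample rows and the helper names `select_where` and `sum_by` are made up for illustration.

```ruby
# Toy "data frame": an Array of rows, each row a Hash of column name => value.
# Illustrative only; this is not Daru's or pandas' API, and the data is invented.
ROWS = [
  { city: "Tokyo",       year: 2016, sales: 120 },
  { city: "Tokyo",       year: 2017, sales: 150 },
  { city: "New Orleans", year: 2016, sales: 80 },
  { city: "New Orleans", year: 2017, sales: 95 },
].freeze

# Roughly: SELECT <columns> FROM rows WHERE <block returns true>
def select_where(rows, columns)
  rows.select { |row| yield(row) }.map { |row| row.slice(*columns) }
end

# Roughly: SELECT key, SUM(value) FROM rows GROUP BY key
def sum_by(rows, key, value)
  rows.group_by { |row| row[key] }
      .transform_values { |group| group.sum { |row| row[value] } }
end

select_where(ROWS, [:city, :sales]) { |row| row[:year] == 2017 }.each do |row|
  puts "#{row[:city]}: #{row[:sales]}"
end
# prints:
# Tokyo: 150
# New Orleans: 95

sum_by(ROWS, :city, :sales).each { |city, total| puts "#{city}: #{total}" }
# prints:
# Tokyo: 270
# New Orleans: 175
```

A real data frame library adds typed columns, indexing, missing-value handling, and plotting on top of this shape, which is where the tools discussed in this talk come in.
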
  10. SciRuby's benefits and drawbacks

    Drawbacks: NMatrix is slow for large amounts of data [NMatrix#362]. Daru lacks functionality needed for practical data science work. Documentation is sparse, which makes the tools difficult to use. Reason for the drawbacks: the small population of developers and users.
  11. Ruby Numo's benefits and drawbacks

    Benefits: You need only Ruby. Numo::NArray is faster than NMatrix and pure Ruby. Drawbacks: No sparse matrix support. No data frame support. Documentation is sparse.
  12. For the details of Ruby Numo

    You can watch Masa Tanaka's talk at RubyKaigi 2017: http://rubykaigi.org/2017/presentations/masa16tanaka.html
  13. Which is better, SciRuby or Ruby Numo?

    For data science (without other languages), SciRuby is better because it has Daru. For scientific computing, Ruby Numo is better because NMatrix is too slow.
  14. Red Data Tools benefits and drawbacks

    Benefits: It supports Apache Arrow. The core developer, Kohei Suto, is a member of Apache Arrow's PMC. Drawbacks: It is too young to use in production. For now it supports only data I/O; data manipulation is not supported.
  15. Because they need the full power of the standard data tools in Python and R, especially in exploratory data analysis.
  16. What is pycall?

    Pycall allows you to use Python libraries from your Ruby code very naturally. Pycall consists of two parts: a Ruby binding for libpython.so, and an object-oriented protocol gateway between Ruby and Python.
  17. Example use of pycall

    This example uses numpy via pycall.

        [1] pry(main)> require 'numpy'
        => true
        [2] pry(main)> x = Numpy.arange(2 * 3).reshape([2, 3])
        => array([[0, 1, 2],
                  [3, 4, 5]])
        [3] pry(main)> y = Numpy.arange(3 * 4).reshape([3, 4])
        => array([[ 0,  1,  2,  3],
                  [ 4,  5,  6,  7],
                  [ 8,  9, 10, 11]])
        [4] pry(main)> z = x.dot y
        => array([[20, 23, 26, 29],
                  [56, 68, 80, 92]])
        [5] pry(main)> z.shape
        => (2, 4)
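
For readers who want to check what `x.dot y` computes in the session above, the same product can be reproduced with plain nested arrays and no Python at all. This is only a cross-check sketch: `matmul` is a hypothetical helper written for this note, not part of pycall or numpy, and a naive triple loop like this is exactly the kind of code that numpy replaces with optimized native routines.

```ruby
# Plain nested arrays standing in for the two numpy arrays in the session.
X = [[0, 1, 2], [3, 4, 5]].freeze                        # shape (2, 3)
Y = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]].freeze  # shape (3, 4)

# Naive matrix product: entry (i, j) is the dot product of
# row i of a and column j of b. (Hypothetical helper, not a library call.)
def matmul(a, b)
  columns = b.transpose
  a.map do |row|
    columns.map { |col| row.zip(col).sum { |p, q| p * q } }
  end
end

matmul(X, Y).each { |row| puts row.join(" ") }
# prints:
# 20 23 26 29
# 56 68 80 92
```
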
  18. Pycall family wrapper gems

    The following gems are available: numpy, pandas, matplotlib. The following gems are future work: scikit-learn, seaborn, bokeh, keras, etc. You can use any Python library without a wrapper gem.
  19. Other example usage of pycall

    Blog posts about scikit-learn examples by Soren D: "Using the scikit-learn machine learning library in Ruby using PyCall" and "Implementing OCR using a Random Forest Classifier in Ruby". Mai Nguyen's workshop material from the KiwiRuby conference.
  20. You can use all the following tools from your Ruby code

    numpy, pandas, pillow, matplotlib, bokeh, holoviews, scikit-learn, scikit-image, keras, tensorflow, etc.
  21. You should use Ruby and other languages together

    Because: Almost all data scientists would not want to use Ruby in their jobs. They need the full power of standard data tools like pandas in exploratory data analysis. Ruby and Ruby on Rails are best for writing business web applications.
  22. Three implementation patterns

    I propose three implementation patterns for integrating an application written in Ruby with data processing systems written in Python:

    1. Referring to the same database directly
    2. RPC with serialized data such as JSON
    3. Calling directly via pycall
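
As a rough illustration of the second pattern (RPC with serialized data such as JSON), the Ruby side only needs to encode a request, hand it to some transport, and decode the reply. The sketch below is an assumption-laden outline, not the talk's implementation: the `method`/`params`/`result`/`error` envelope, the `mean` operation, and all helper names are hypothetical, and the Python service is replaced by a Ruby lambda so the example runs on its own.

```ruby
require "json"

# Encode one call as JSON. The "method"/"params" envelope is an
# assumption made for this sketch, not a format defined in the talk.
def build_request(method_name, params)
  JSON.generate({ "method" => method_name, "params" => params })
end

# Decode the JSON reply and return its "result" field,
# raising if the service reported an error.
def parse_response(body)
  reply = JSON.parse(body)
  raise ArgumentError, reply["error"] if reply["error"]
  reply.fetch("result")
end

# Perform one call through any transport that maps a request String
# to a response String (an HTTP client, a message queue, and so on).
def call_remote(transport, method_name, params)
  parse_response(transport.call(build_request(method_name, params)))
end

# Stand-in for the Python side so the sketch is self-contained:
# it returns the mean of the numbers it receives.
FAKE_PYTHON_SERVICE = lambda do |request_body|
  request = JSON.parse(request_body)
  values = request.fetch("params").fetch("values")
  JSON.generate({ "result" => values.sum.to_f / values.size })
end

puts call_remote(FAKE_PYTHON_SERVICE, "mean", { "values" => [1, 2, 3, 4] })
# prints: 2.5
```

Because the two sides share only the serialized format, the Python system can be deployed, scaled, and upgraded independently of the Ruby application, at the cost of encoding and transport overhead on every call.
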
  23. Choose the right pattern according to the situation.

    1. Referring to the same database directly
    2. RPC with serialized data such as JSON
    3. Calling directly via pycall
  24. Apache Arrow and Red Data Tools

    Apache Arrow will be the core of almost all data tools. Pandas 2.0 will employ Apache Arrow as its core. PySpark already uses Apache Arrow to exchange data between Python and Spark. Red Data Tools is important for the future of Ruby's data science ecosystem. You should join the Red Data Tools project if you are interested in Apache Arrow. https://red-data-tools.github.io/
  25. GPGPU

    We now have ArrayFire by @prasunanand. He is also developing RbCUDA under RubyGrant 2017. @sonots will develop Cumo, a CuPy clone for Numo::NArray, under RubyGrant 2017.
  26. Deep Learning

    We already have tensorflow.rb, written by @Arafatk. In Red Data Tools, @hatappi has started to make RedChainer, a Chainer clone. I am working on a Ruby binding for MXNet.
  27. Conclusion

    I described three major projects in Ruby for data science. I demonstrated an example use of pycall. I illustrated three patterns for integrating an application written in Ruby with a data processing system written in Python. I talked about the future perspective.
  28. Docker image to try data tools for Ruby

    We prepared a Docker image for you to try data tools for Ruby.

        $ docker run -it --rm -p 8888:8888 -v $(pwd):/home/jovyan/work rubydata/notebooks