
Using Ruby in data science

Kenta Murata
November 17, 2017

Slides for my talk at RubyConf 2017

Transcript

  1. Using Ruby in data science
    Kenta Murata, Speee Inc.
    Fri, Nov 17, 2017, RubyConf 2017 in New Orleans, LA, US
    This slide is available at https://speakerdeck.com/mrkn/using-ruby-in-data-science
  2. How many of you are involved in data science?
    Data scientists
    Data engineers
    Developers of applications that utilize data
  3. We need to be familiar with how data is utilized
  4. But Ruby hinders us from becoming familiar with data science, because Ruby is difficult to use in data science
  5. The situation has been changing recently
  6. Now Ruby is getting easier to use for data science, little by little
  7. Today I'll talk about how we can use Ruby in data science
  8. self.introduction
    Kenta Murata
    @mrkn (github, twitter, etc.) moo-ra-ken
    Full-time CRuby committer and researcher at Speee Inc.
    bigdecimal, enumerable-statistics, pycall, mxnet.rb, etc.
  9. (no text on this slide)
  10. "Speee" == "Speed".succ
  11. "Speee" == "Speed".succ
    faster than fast
    what “speed” means
  12. "Speee" == "Speed".succ
    faster than fast
    what “speed” means
    iterate business trial cycles at overwhelming speed
  13. Full-time CRuby committer
    My company employs me as a full-time CRuby committer.
    My company permits me to do anything great for the Ruby ecosystem.
    This year, I'm working entirely on tools for data science that are used with applications written in Ruby.
  14. Topics in this talk
    1. The current status of Ruby in data science
    2. The patterns to use Ruby in data science
    3. The future perspective
    4. Conclusion
  15. The current status of Ruby in data science
  16. Three major projects for data science in Ruby
  17. The 1st project: SciRuby
    NMatrix
    Daru, Daru-IO, Daru-View
    RB-GSL
    GnuplotRB
    Statsample
    Mixed_models
    ArrayFire
    etc.
  18. http://gems.sciruby.com/
  19. SciRuby's benefits and drawbacks
    Benefits
    You only need Ruby.
    NMatrix supports in-memory sparse matrices, but the supported operations are limited.
    You can use data frames with Daru.
  20. Data frames
    The basic data structure for manipulating and visualizing living data in data science.
    A 2D table data structure, like a SQL table.
    In Ruby, we can use data frames with Daru (or Pandas via pycall, as described later).
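
To make the "2D table like a SQL table" idea concrete, here is a minimal plain-Ruby sketch. It deliberately uses neither Daru nor Pandas; the `rows` data and the column names are made up for illustration, and a real data frame library offers these operations (and many more) on a column-oriented structure.

```ruby
# A toy "table": each row is a Hash keyed by column name (hypothetical data).
rows = [
  { city: "Tokyo", year: 2016, sales: 120 },
  { city: "Tokyo", year: 2017, sales: 150 },
  { city: "Osaka", year: 2016, sales: 80 },
  { city: "Osaka", year: 2017, sales: 95 },
]

# Roughly: SELECT sales FROM rows WHERE year = 2017
sales_2017 = rows.select { |r| r[:year] == 2017 }.map { |r| r[:sales] }

# Roughly: SELECT city, SUM(sales) FROM rows GROUP BY city
total_by_city = rows
  .group_by { |r| r[:city] }
  .transform_values { |group| group.sum { |r| r[:sales] } }

# Column-wise view of the same table: one Array per column name.
columns = rows.first.keys.to_h { |name| [name, rows.map { |r| r[name] }] }

p sales_2017       # [150, 95]
total_by_city.each { |city, total| puts "#{city}: #{total}" }
p columns[:sales]  # [120, 150, 80, 95]
```
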
  21. SciRuby's benefits and drawbacks
    Drawbacks
    NMatrix is slow for large amounts of data [NMatrix#362].
    Daru lacks functionality needed for practical data science work.
    It is poorly documented, so it is difficult to use.
    Reason for the drawbacks
    The small population of developers and users.
  22. The 2nd project: Ruby Numo
    Numo::NArray
    Numo::FFTE
    Numo::FFTW
    Numo::Gnuplot
    Numo::GSL
    Numo::Linalg
  23. Ruby Numo's benefits and drawbacks
    Benefits
    You only need Ruby.
    Numo::NArray is faster than NMatrix and pure Ruby.
    Drawbacks
    No sparse matrix support.
    No data frame support.
    Poorly documented.
  24. For the details of Ruby Numo
    You can watch Masa Tanaka's talk at RubyKaigi 2017: http://rubykaigi.org/2017/presentations/masa16-tanaka.html
  25. Which is better, SciRuby or Ruby Numo?
    For data science (w/o other languages): SciRuby is better because it has Daru.
    For scientific computing: Ruby Numo is better because NMatrix is too slow.
  26. The 3rd project: Red Data Tools
    red-arrow
    red-chainer
    red-arrow-nmatrix
    red-arrow-numo-narray
    red-arrow-pycall
  27. Red Data Tools' benefits and drawbacks
    Benefits
    It supports Apache Arrow.
    The core developer, Kohei Suto, is a member of Apache Arrow's PMC.
    Drawbacks
    Too young to use in production.
    For now it only supports data I/O; data manipulation is not supported.
  28. It is hard to do data science with Ruby alone
  29. Most data scientists would not want to use Ruby in their jobs
  30. Because they need the full power of the standard data tools in Python and R, especially in exploratory data analysis
  31. Ruby and Ruby on Rails are best for writing business web applications.
  32. You should use Ruby together with other languages like Python
  33. pycall
  34. What is pycall?
    Pycall allows you to use Python libraries from your Ruby code very naturally.
    Pycall consists of two parts:
    The Ruby binding library of libpython.so
    An object-oriented protocol gateway between Ruby and Python
  35. Example use of pycall
    This example uses numpy via pycall.
    [1] pry(main)> require 'numpy'
    => true
    [2] pry(main)> x = Numpy.arange(2 * 3).reshape([2, 3])
    => array([[0, 1, 2], [3, 4, 5]])
    [3] pry(main)> y = Numpy.arange(3 * 4).reshape([3, 4])
    => array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]])
    [4] pry(main)> z = x.dot y
    => array([[20, 23, 26, 29], [56, 68, 80, 92]])
    [5] pry(main)> z.shape
    => (2, 4)
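
As a sanity check on the numbers in the session above, the same matrix product can be reproduced without Python using Ruby's standard `matrix` library. This is only a cross-check sketch and not part of the original demo; `Matrix` is a general-purpose class and is not designed for large numerical workloads.

```ruby
require 'matrix'

# Same values as Numpy.arange(6).reshape([2, 3]) and Numpy.arange(12).reshape([3, 4]).
x = Matrix[*(0...6).each_slice(3)]
y = Matrix[*(0...12).each_slice(4)]

# Matrix product, the counterpart of x.dot(y) in the numpy session.
z = x * y

p z.to_a                         # [[20, 23, 26, 29], [56, 68, 80, 92]]
p [z.row_count, z.column_count]  # [2, 4]
```
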
  36. Pycall family wrapper gems
    The following gems are available: numpy, pandas, matplotlib
    The following gems are planned as future work: scikit-learn, seaborn, bokeh, keras, etc.
    You can use any Python library even without a wrapper gem
  37. Demonstration
  38. Resources about pycall
    Demonstrations I used in RubyKaigi 2017
  39. Other example usage of pycall
    Blog posts about scikit-learn examples by Soren D:
    Using the scikit-learn machine learning library in Ruby using PyCall
    Implementing OCR using a Random Forest Classifier in Ruby
    Mai Nguyen's workshop material in KiwiRuby conference
  40. Pycall provides us access to Python's data tools
  41. You can use all the following tools from your Ruby code
    numpy, pandas, pillow, matplotlib, bokeh, holoviews, scikit-learn, scikit-image, keras, tensorflow, etc.
  42. The current best patterns to use Ruby in data science
  43. You should use Ruby and other languages together
    Because:
    Most data scientists would not want to use Ruby in their jobs.
    They need the full power of standard data tools like pandas in exploratory data analysis.
    Ruby and Ruby on Rails are best for writing business web applications.
  44. Three implementation patterns
    I propose three implementation patterns to integrate applications written in Ruby with data processing systems written in Python:
    1. Referring to the same database directly
    2. RPC with serialized data like JSON
    3. Calling directly via pycall
  45. 1. Referring to the same database directly
  46. 2. RPC with serialized data like JSON
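
A minimal sketch of the JSON RPC pattern as seen from the Ruby side, using only the standard `json` library. The method name `predict`, the payload fields, and the canned response are all hypothetical; in a real system the request string would be sent to a Python service (for example over HTTP), and the response would come back from that service.

```ruby
require 'json'

# Build the request the Ruby application would send (hypothetical schema).
def build_request(features)
  JSON.generate({ method: "predict", params: { features: features } })
end

# Parse the reply from the Python side (hypothetical schema).
def parse_response(body)
  data = JSON.parse(body, symbolize_names: true)
  raise "RPC error: #{data[:error]}" if data[:error]
  data.fetch(:result)
end

request_body = build_request([5.1, 3.5, 1.4, 0.2])

# Stand-in for the network round trip: a canned reply instead of a real service.
response_body = '{"result": {"label": "setosa", "score": 0.97}}'

result = parse_response(response_body)
puts request_body
puts "#{result[:label]} (#{result[:score]})"
```

Because only plain data crosses the boundary, the Ruby application and the Python system can be deployed separately; the cost is serialization overhead when the payloads are large.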

  47. 3. Calling directly via pycall
  48. Choose the right pattern according to the situation.
    1. Referring to the same database directly
    2. RPC with serialized data like JSON
    3. Calling directly via pycall
  49. The future perspective
  50. Two topics about the future
    Apache Arrow
    GPGPU and deep learning
  51. Apache Arrow and Red Data Tools
    Apache Arrow will be the core of almost all data tools.
    Pandas 2.0 will employ Apache Arrow as its core.
    PySpark already uses Apache Arrow to exchange data between Python and Spark.
    Red Data Tools is important for the future of Ruby's data science ecosystem.
    You should join the Red Data Tools project if you are interested in Apache Arrow.
    https://red-data-tools.github.io/
  52. GPGPU
    Now we have ArrayFire by @prasunanand.
    He is also making RbCUDA as part of RubyGrant 2017.
    @sonots will make Cumo, a Cupy clone for Numo::NArray, as part of RubyGrant 2017.
  53. Deep Learning
    We already have tensorflow.rb, written by @Arafatk.
    In Red Data Tools, @hatappi has started making RedChainer, a Chainer clone.
    I'm working on a Ruby binding for MXNet.
  54. Conclusion
  55. Conclusion
    I described three major projects for data science in Ruby.
    I demonstrated an example usage of pycall.
    I illustrated three patterns to integrate applications written in Ruby with data processing systems written in Python.
    I talked about the future perspective.
  56. Docker image to try data tools for Ruby
    We prepared a Docker image for you to try data tools for Ruby.
    $ docker run -it --rm -p 8888:8888 -v $(pwd):/home/jovyan/work rubydata/notebooks