Using Ruby in data science

Kenta Murata
November 17, 2017

Slides for my talk at RubyConf 2017

Transcript

  1. Using Ruby in data science
    Kenta Murata, Speee Inc.
    Fri, Nov 17, 2017 RubyConf 2017 in New Orleans, LA, US
    These slides are available at
    https://speakerdeck.com/mrkn/using-ruby-in-data-science

  2. How many of you are involved in data science?
    Data scientists
    Data engineers
    Developers of applications that utilize data

  3. We need to be familiar with how data is
    put to use

  4. But Ruby hinders us from becoming familiar
    with data science because Ruby is difficult
    to use in data science

  5. The situation has been changing recently

  6. Ruby is gradually getting easier to use
    for data science

  7. Today I'll talk about how we can use Ruby
    in data science

  8. self.introduction
    Kenta Murata
    @mrkn (github, twitter, etc.)
    moo-ra-ken
    Full-time CRuby committer and Researcher at
    Speee Inc.
    bigdecimal, enumerable-statistics, pycall,
    mxnet.rb, etc.

  9. (no transcribed text)

  10. "Speee" == "Speed".succ

  11. "Speee" == "Speed".succ
    faster than what “speed” means

  12. "Speee" == "Speed".succ
    faster than what “speed” means
    iterate on business trial cycles at overwhelming speed

  13. Full-time CRuby committer
    My company employs me as a full-time CRuby
    committer.
    My company lets me do anything that benefits
    the Ruby ecosystem.
    This year I'm working full-time on data science
    tools that can be used with applications
    written in Ruby

  14. Topics in this talk
    1. The current status of Ruby in data science
    2. Patterns for using Ruby in data science
    3. The future perspective
    4. Conclusion

  15. The current status of Ruby in data science

  16. Three major projects for data science in
    Ruby

  17. The 1st project: SciRuby
    NMatrix
    Daru, Daru-IO, Daru-View
    RB-GSL
    GnuplotRB
    Statsample
    Mixed_models
    ArrayFire
    etc.

  18. http://gems.sciruby.com/

  19. SciRuby's benefits and drawbacks
    Benefits
    You only need Ruby
    NMatrix supports in-memory sparse matrices.
    But supported operations are limited.
    You can use data frames with Daru.

  20. Data frames
    The basic data structure for manipulating and
    visualizing live data in data science.
    A 2D tabular data structure, like a SQL table.
    In Ruby, we can use data frames with Daru (or
    Pandas via pycall as described later).
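
    To make the idea concrete, here is a minimal sketch of a data-frame-like
    table in plain Ruby. It is not Daru's API: the sample rows are invented,
    and the `column` and `mean_by` helpers are names introduced only for this
    illustration.

```ruby
# A tiny table: each row is a Hash and each key is a column name.
# The data is made up for illustration.
rows = [
  { city: "Tokyo",       year: 2016, sales: 120 },
  { city: "Tokyo",       year: 2017, sales: 150 },
  { city: "New Orleans", year: 2016, sales: 80 },
  { city: "New Orleans", year: 2017, sales: 100 }
]

# Extract one column as an Array, like SELECT sales FROM rows.
def column(rows, name)
  rows.map { |row| row[name] }
end

# Group rows by a key column and average a value column,
# like SELECT key, AVG(value) ... GROUP BY key.
def mean_by(rows, key, value)
  rows.group_by { |row| row[key] }.transform_values do |group|
    column(group, value).sum.to_f / group.size
  end
end

# Filter rows, like a WHERE clause.
recent = rows.select { |row| row[:year] == 2017 }

p column(recent, :sales)
p mean_by(rows, :city, :sales)
```

    A real data frame library adds typed columns, indexes, missing-value
    handling, and plotting on top of this row-and-column idea.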

  21. SciRuby's benefits and drawbacks
    Drawbacks
    NMatrix is slow for large amounts of data
    [NMatrix#362]
    Daru lacks functionality needed for practical
    data science work.
    Poorly documented, so difficult to use.
    Reason for the drawbacks
    The small population of developers and users.

  22. The 2nd project: Ruby Numo
    Numo::NArray
    Numo::FFTE
    Numo::FFTW
    Numo::Gnuplot
    Numo::GSL
    Numo::Linalg

  23. Ruby Numo's benefits and drawbacks
    Benefits
    You need only Ruby
    Numo::NArray is faster than NMatrix and pure
    Ruby
    Drawbacks
    No sparse matrix support
    No data frame support
    Poorly documented.

  24. For the details of Ruby Numo
    You can watch Masa Tanaka's RubyKaigi 2017 talk at
    http://rubykaigi.org/2017/presentations/masa16-
    tanaka.html

  25. Which is better, SciRuby or Ruby Numo?
    For data science (w/o other languages)
    SciRuby is better because it has Daru
    For scientific computing
    Ruby Numo is better because NMatrix is too
    slow

  26. The 3rd project: Red Data Tools
    red-arrow
    red-chainer
    red-arrow-nmatrix
    red-arrow-numo-narray
    red-arrow-pycall

  27. Red Data Tools' benefits and drawbacks
    Benefits
    It supports Apache Arrow.
    The core developer, Kohei Suto, is a member of
    Apache Arrow's PMC.
    Drawbacks
    Too young to use in production.
    Currently only data I/O is supported; data
    manipulation is not.

  28. It is hard to do data science with Ruby alone

  29. Almost all data scientists probably don't
    want to use Ruby in their jobs

  30. Because they need the full power of the
    standard data tools in Python and R,
    especially in exploratory data analysis

  31. Ruby and Ruby on Rails are best for writing
    business web applications.

  32. You should use Ruby and other languages
    like Python together

  33. pycall

  34. What is pycall?
    Pycall allows you to use Python libraries from
    your Ruby code very naturally.
    Pycall consists of two parts:
    The Ruby binding library of libpython.so
    Object-oriented protocol gateway between
    Ruby and Python

  35. Example use of pycall:
    This example uses numpy via pycall.
    [1] pry(main)> require 'numpy'
    => true
    [2] pry(main)> x = Numpy.arange(2 * 3).reshape([2, 3])
    => array([[0, 1, 2],
              [3, 4, 5]])
    [3] pry(main)> y = Numpy.arange(3 * 4).reshape([3, 4])
    => array([[ 0,  1,  2,  3],
              [ 4,  5,  6,  7],
              [ 8,  9, 10, 11]])
    [4] pry(main)> z = x.dot y
    => array([[20, 23, 26, 29],
              [56, 68, 80, 92]])
    [5] pry(main)> z.shape
    => (2, 4)
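
    As a cross-check that needs no Python, the same product can be computed
    with nested Ruby arrays. This is only a sketch for small inputs; the `dot`
    helper is a name introduced here and is not part of pycall or numpy.

```ruby
# Build the same 2x3 and 3x4 matrices as nested arrays.
x = (0...6).each_slice(3).to_a   # [[0, 1, 2], [3, 4, 5]]
y = (0...12).each_slice(4).to_a  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]

# Matrix product: each entry is the dot product of a row of `a`
# with a column of `b`.
def dot(a, b)
  columns = b.transpose
  a.map do |row|
    columns.map { |col| row.zip(col).sum { |u, v| u * v } }
  end
end

z = dot(x, y)
shape = [z.size, z.first.size]

p z      # [[20, 23, 26, 29], [56, 68, 80, 92]]
p shape  # [2, 4]
```

    The pure-Ruby version agrees with the numpy result above, but it is far
    slower on large matrices, which is the gap that numpy and Numo::NArray
    fill.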

  36. Pycall family wrapper gems
    The following gems are available:
    numpy, pandas, matplotlib
    Wrapper gems for the following are planned:
    scikit-learn, seaborn, bokeh, keras, etc.
    You can use any Python library even without a
    wrapper gem

  37. Demonstration

  38. Pycall resources I used at RubyKaigi
    2017
    Demonstrations


  39. Other example usages of pycall
    Blog posts about scikit-learn examples by Soren
    D
    Using the scikit-learn machine learning
    library in Ruby using PyCall
    Implementing OCR using a Random Forest
    Classifier in Ruby
    Mai Nguyen's workshop material from the KiwiRuby
    conference


  40. Pycall gives us access to Python's data
    tools

  41. You can use all the following tools from
    your Ruby code
    numpy, pandas, pillow,
    matplotlib, bokeh, holoviews,
    scikit-learn, scikit-image,
    keras, tensorflow, etc.

  42. The current best patterns for using Ruby
    in data science

  43. You should use Ruby and other languages
    together
    Because:
    Almost all data scientists probably don't want to
    use Ruby in their jobs
    They need the full power of standard data
    tools like pandas in exploratory data analysis
    Ruby and Ruby on Rails are best for writing
    business web applications.

  44. Three implementation patterns
    I propose three implementation patterns for integrating
    applications written in Ruby with data processing
    systems written in Python
    1. Referring to the same database directly
    2. RPC by serialized data like JSON
    3. Calling directly via pycall

  45. 1. Referring to the same database directly

  46. 2. RPC by serialized data like JSON
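
    A minimal sketch of the Ruby side of this pattern, using only the standard
    `json` library. The `predict` method name, the payload fields, the
    `call_remote` helper, and the `fake_python_service` stand-in are all
    assumptions made for illustration. In a real system the request string
    would travel over HTTP or a message queue to a service written in Python.

```ruby
require "json"

# Hypothetical stand-in for the Python service. It accepts a JSON request
# string and returns a JSON response string, which is all the Ruby
# application ever sees in this pattern.
fake_python_service = lambda do |request_json|
  request = JSON.parse(request_json)
  features = request.fetch("params").fetch("features")
  # Pretend "prediction": the mean of the features.
  score = features.sum.to_f / features.size
  JSON.generate({ "id" => request["id"], "result" => { "score" => score } })
end

# Ruby application side: serialize the call, send it, parse the reply.
def call_remote(service, method_name, params, id: 1)
  request_json = JSON.generate(
    { "id" => id, "method" => method_name, "params" => params }
  )
  response = JSON.parse(service.call(request_json))
  raise "mismatched response id" unless response["id"] == id
  response.fetch("result")
end

result = call_remote(fake_python_service, "predict", { "features" => [1, 2, 3, 6] })
puts result["score"]
```

    Because only plain JSON crosses the boundary, the two sides can be
    deployed, scaled, and upgraded independently, at the cost of
    serialization overhead on every call.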

  47. 3. Calling directly via pycall

  48. Choose the right pattern according to the
    situation.
    1. Referring to the same database directly
    2. RPC by serialized data like JSON
    3. Calling directly via pycall

  49. The future perspective

  50. Two topics about the future
    Apache Arrow
    GPGPU and deep learning

  51. Apache Arrow and Red Data Tools
    Apache Arrow will be the core of almost all data
    tools.
    Pandas 2.0 will employ Apache Arrow as its
    core.
    PySpark already uses Apache Arrow to exchange
    data between Python and Spark
    Red Data Tools is important for the future of
    Ruby's data science ecosystem.
    You should join the Red Data Tools project if
    you are interested in Apache Arrow.
    https://red-data-tools.github.io/

  52. GPGPU
    Now we have ArrayFire by @prasunanand
    He is also developing RbCUDA as a RubyGrant 2017
    project
    @sonots will develop Cumo, a CuPy clone for
    Numo::NArray, as a RubyGrant 2017 project

  53. Deep Learning
    We already have tensorflow.rb written by
    @Arafatk
    In Red Data Tools, @hatappi has started to make
    RedChainer, a Chainer clone
    I'm working on a Ruby binding for MXNet

  54. Conclusion

  55. Conclusion
    I described three major data science projects
    in Ruby
    I demonstrated an example usage of pycall
    I illustrated three patterns for integrating
    applications written in Ruby with data processing
    systems written in Python
    I talked about the future perspective

  56. Docker image to try data tools for Ruby
    We have prepared a Docker image so you can try data tools
    for Ruby.
    $ docker run -it --rm -p 8888:8888 \
        -v $(pwd):/home/jovyan/work \
        rubydata/notebooks