
Development of Data Science Ecosystem for Ruby

Kenta Murata
September 18, 2017

My talk in RubyKaigi 2017


Transcript

  1. Development of Data Science Ecosystem for Ruby

    Kenta Murata, Speee Inc.
    2017.09.19, RubyKaigi 2017 in Hiroshima, Japan

    1 / 55
  2. RubyKaigi 2016 in Kyoto

    I stated that Ruby was not practically usable in data science.

    7 / 55
  3. Today's Topics

    1. The current situation: Ruby is now practically usable for data science
    2. For the future: what we should do to keep Ruby available for data science
    3. Request for you: shall we develop our tools and community?

    8 / 55
  4. self.introduce

    - Kenta Murata
    - @mrkn (GitHub, Twitter, etc.)
    - Researcher at Speee Inc.
    - CRuby committer
    - bigdecimal, enumerable-statistics, pycall, etc.

    9 / 55
  5. The past two options for using Ruby in data science

    1. Ruby-only way
    2. Use Python and R for data analysis and connect via a JSON API

    12 / 55
  6. Ruby-only way

    - There are large restrictions in the data processing parts
    - The existing tools have few capabilities

    13 / 55
  7. Use Python or R together with Ruby

    - Exchanging data via a JSON API is costly
    - Development and maintenance of API endpoints
    - JSON serialization for exchanging data
    - Letting data processing systems refer to the same database as the main application
    - It increases the development cost of the main application

    14 / 55
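
    A rough sketch of the serialization cost mentioned on this slide, using only Ruby's standard `json` and `benchmark` libraries. The array size and variable names are assumptions chosen for illustration, not figures from the talk; the point is only that every exchange pays for an encode step and a decode step.

    ```ruby
    require 'json'
    require 'benchmark'

    # Hypothetical payload: 100_000 random floats standing in for analysis data.
    values = Array.new(100_000) { rand }

    payload = nil
    decoded = nil

    # Encoding on the sender side (e.g. the analysis service).
    encode_time = Benchmark.realtime { payload = JSON.generate(values) }

    # Decoding on the receiver side (e.g. the Ruby application).
    decode_time = Benchmark.realtime { decoded = JSON.parse(payload) }

    puts "payload size: #{payload.bytesize} bytes"
    puts format('encode: %.4f s, decode: %.4f s', encode_time, decode_time)
    ```

    The timings depend on the machine and Ruby version, so no expected output is given here.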
  8. The third option introduced by PyCall

    - PyCall allows us to use a Python interpreter together with a Ruby interpreter in the same process.
    - PyCall provides low-cost ways of exchanging data.
    - Direct data conversion to Python data types.
    - Sharing the same memory pointers.
    - Using the Apache Arrow data structure via the red-arrow-pycall library

    15 / 55
  9. Three options available today

    1. Ruby-only way
    2. Using Python and R for data analysis, and connecting via a JSON API or letting them look at the same DB
    3. Use PyCall to call Python from Ruby

    16 / 55
  10. For Example: Use seaborn for visualizing benchmark results

    - Measure benchmark results and collect them in a pandas dataframe
    - Visualize the results using seaborn, a Python visualization library built on matplotlib
    - Perform all of the above in one Ruby script

    18 / 55
  11. sum_bench.rb

        # Benchmark ================================================
        require 'benchmark'
        N, L = 100, 1_000_000
        ary = Array.new(L) { rand }
        methods, times = [], []
        N.times do
          methods << :inject
          times << Benchmark.realtime { ary.inject(:+) }
          methods << :while # ----------------------------------
          times << Benchmark.realtime {
            sum, i = ary[0], 1
            while i < L
              sum += ary[i]; i += 1
            end
          }
          methods << :sum # ------------------------------------
          times << Benchmark.realtime { ary.sum }
        end
        # Make dataframe ===========================================
        require 'pandas'
        df = Pandas::DataFrame.new(data: { method: methods, time: times })
        Pandas.options.display.width = `tput cols`.to_i
        puts df.groupby(:method).describe
        # Visualization ============================================
        require 'matplotlib'
        plt = Matplotlib::Pyplot
        sns = PyCall.import_module('seaborn')
        sns.barplot(x: 'method', y: 'time', data: df)
        plt.title("Array summation benchmark (#{N} trials)")
        plt.savefig('bench.png', dpi: 100)

    19 / 55
  12. $ ruby sum_bench.rb

                 time
                count      mean       std       min       25%       50%       75%       max
        method
        inject  100.0  0.140720  0.020082  0.126592  0.132516  0.135811  0.139753
        sum     100.0  0.017629  0.001289  0.015933  0.016553  0.017336  0.018437
        while   100.0  0.126714  0.012356  0.116296  0.121269  0.123295  0.127468

    20 / 55
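
    For comparison with the Ruby-only option, the grouped statistics above can be approximated without pandas. This is a minimal sketch in plain Ruby: the `describe` helper and the sample timings are hypothetical (they are not data from the talk), and the standard deviation uses the sample (n - 1) form, which is what pandas' `describe` reports.

    ```ruby
    # Hypothetical timings (seconds) keyed by method name; not data from the talk.
    samples = {
      'inject' => [0.14, 0.13, 0.15],
      'sum'    => [0.017, 0.018, 0.016],
      'while'  => [0.12, 0.13, 0.125]
    }

    # Summarize one group: count, mean, sample standard deviation, min and max.
    def describe(times)
      n = times.size
      mean = times.sum / n
      variance = times.sum { |t| (t - mean)**2 } / (n - 1)
      { count: n, mean: mean, std: Math.sqrt(variance), min: times.min, max: times.max }
    end

    summary = samples.transform_values { |times| describe(times) }

    summary.each do |name, stats|
      puts format('%-6s count=%d mean=%.6f std=%.6f min=%.6f max=%.6f',
                  name, stats[:count], stats[:mean], stats[:std], stats[:min], stats[:max])
    end
    ```

    This covers only a few of the columns; percentiles, plotting and larger datasets are where the Python tools save the most effort.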
  13. Example 3: Object recognition by Keras

    - Detecting bounding boxes of objects in a photo
    - Keras model of SSD300

    25 / 55
  14. In fact, PyCall is just a wrapper library for libpython that is written in C

    28 / 55
  15. PyCall is still young, so it needs to be tried in various use cases

    30 / 55
  16. Three options (again)

    1. Ruby-only way
    2. Using Python and R for data analysis, and connecting via a JSON API or letting them look at the same DB
    3. Use PyCall to call Python from Ruby

    34 / 55
  17. Python's case: only two options

    1. Python-only way, which is the best practice
    2. Use R only for statistical analysis methods that are unavailable in Python; we can use rpy2 for this case

    35 / 55
  18. Exchanging data between multiple systems in data science

    - E.g. data extraction from an RDBMS to client programs

    41 / 55
  19. The current way to exchange data between systems

    - Each system has its own internal memory format
    - Serializing and deserializing data for exchange wastes a lot of CPU time
    - Similar functions are implemented in multiple systems

    42 / 55
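
    To make the serialization overhead concrete, the sketch below contrasts a text format (JSON) with a fixed-width binary buffer produced by `Array#pack`. The packed buffer only illustrates the idea of a shared, contiguous memory layout; it is not Apache Arrow's actual format, and the data is hypothetical.

    ```ruby
    require 'json'

    # Hypothetical column of floating-point values.
    column = [1.5, 2.25, 3.0, 4.125]

    # Text serialization: each value is rendered as digits and re-parsed on the other side.
    as_json = JSON.generate(column)

    # Fixed-width binary layout: 8 bytes per value, little-endian IEEE 754 doubles.
    as_buffer = column.pack('E*')

    puts "JSON:   #{as_json.bytesize} bytes"
    puts "buffer: #{as_buffer.bytesize} bytes"

    # Reading the buffer back reinterprets bytes directly, with no text parsing.
    restored = as_buffer.unpack('E*')

    # Any value can be located by offset without scanning the preceding ones.
    third = as_buffer.byteslice(2 * 8, 8).unpack1('E')
    puts "third value: #{third}"
    ```

    When two systems agree on such a layout in advance, the receiver can use the bytes as they are, which is the kind of saving a common in-memory format aims for.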
  20. Big News in Red Data Tools

    - Kouhei Sutou (@kou) officially became a member of the PMC (project management committee) of Apache Arrow yesterday

    49 / 55
  21. Big News in Red Data Tools

    - Kouhei Sutou (@kou) officially became a member of the PMC (project management committee) of Apache Arrow yesterday
    - This means there is at least one person who develops Ruby support for Apache Arrow as a core developer
    - So you will be able to use Apache Arrow's new features ASAP

    50 / 55
  22. Join Red Data Tools

    - There are Gitter channels in both English and Japanese
    - https://gitter.im/red-data-tools/en
    - https://gitter.im/red-data-tools/ja

    51 / 55
  23. Summary

    - Ruby is already a programming language that is usable in data science
    - You can use Python tools from Ruby by using PyCall, as in the demonstrations I performed in this talk
    - Red Data Tools enables us to use Apache Arrow, and it guarantees that Ruby will be connected to multiple data processing systems in the future
    - But there are lots of things that should be done for the future

    52 / 55
  24. Requests for you

    - Try PyCall in real-world use cases, and find bugs
    - Join Red Data Tools to contribute to the future of Ruby in data science
    - Join the workshop tomorrow

    53 / 55
  25. RubyData Workshop in RubyKaigi 2017

    - 13:50-15:50 in Room Ran
    - https://github.com/RubyData/rubykaigi2017

    1. PyCall Lecture
    2. Getting started with the Red Data Tools project

    54 / 55