Development of Data Science Ecosystem for Ruby

Kenta Murata
September 18, 2017

My talk at RubyKaigi 2017

Transcript

  1. Development of Data Science Ecosystem for Ruby
      Kenta Murata, Speee Inc.
      2017.09.19, RubyKaigi 2017 in Hiroshima, Japan
  2. Big data is important in your business
  3. Data Science, Data Analysis, Data Aggregation, Machine Learning
  4. Data Mining
  5. Programming Languages for Data Science
  6. Python, R, Scala
  7. RubyKaigi 2016 in Kyoto: I stated that Ruby was not practically usable in data science.
  8. Today's Topics
      1. The current situation: Ruby is now practically usable for data science
      2. For the future: what we should do to keep Ruby available for data science
      3. Request for you: shall we develop our tools and community?
  9. self.introduce
      - Kenta Murata
      - @mrkn (github, twitter, etc.)
      - Researcher at Speee Inc.
      - CRuby committer
      - bigdecimal, enumerable-statistics, pycall, etc.
  10. The current situation of Ruby in data science
  11. How to use Ruby in data science
  12. The past two options for using Ruby in data science
      1. Ruby-only way
      2. Use Python and R for data analysis and connect via JSON API
  13. Ruby-only way
      - There are large restrictions in the data processing parts
      - The existing tools offer few capabilities
  14. Use Python or R together with Ruby
      - Exchanging data via a JSON API is costly
          - Development and maintenance of API endpoints
          - JSON serialization for exchanging data
      - Letting data processing systems refer to the same database as the main application
          - This increases the development cost of the main application
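
The JSON serialization cost named on slide 14 can be sketched with Ruby's standard library alone. This is a minimal illustration, not a measurement from the talk: the helper name `json_round_trip` and the sample size are assumptions, and the timing will vary by machine.

```ruby
require 'json'
require 'benchmark'

# Hypothetical helper: simulates handing numeric data to another process
# through a JSON API by serializing it to text and parsing it back.
def json_round_trip(values)
  payload = JSON.generate(values)   # Ruby objects -> JSON text
  restored = JSON.parse(payload)    # JSON text -> freshly allocated Ruby objects
  [payload, restored]
end

# Sample size chosen arbitrarily for the illustration.
values = Array.new(100_000) { |i| i * 0.5 }

payload = restored = nil
elapsed = Benchmark.realtime { payload, restored = json_round_trip(values) }

puts "payload size: #{payload.bytesize} bytes"
puts "round trip:   #{(elapsed * 1000).round(2)} ms"
puts "data intact:  #{restored == values}"
```

Every value is converted to text and then re-parsed into a new object, which is the overhead that the in-process approach on the next slide aims to avoid.
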
  15. The third option introduced by PyCall
      - PyCall allows us to use a Python interpreter together with a Ruby interpreter in the same process.
      - PyCall provides low-cost ways of exchanging data.
          - Direct data conversion to Python data types.
          - Sharing the same memory pointers.
          - Using Apache Arrow data structures via the red-arrow-pycall library
  16. Three options available today
      1. Ruby-only way
      2. Using Python and R for data analysis, and connecting via JSON API or letting them read the same DB
      3. Use PyCall to call Python from Ruby
  17. PyCall
  18. For example: use seaborn for visualizing benchmark results
      - Measure benchmark results and collect them in a pandas dataframe
      - Visualize the results with seaborn, a Python visualization library built on matplotlib
      - Perform all of the above in one Ruby script
  19. Ruby script:

          # Benchmark ================================================
          require 'benchmark'
          N, L = 100, 1_000_000
          ary = Array.new(L) { rand }
          methods, times = [], []
          N.times do
            methods << :inject
            times << Benchmark.realtime { ary.inject(:+) }

            methods << :while # ----------------------------------
            times << Benchmark.realtime {
              sum, i = ary[0], 1
              while i < L
                sum += ary[i]; i += 1
              end
            }

            methods << :sum # ------------------------------------
            times << Benchmark.realtime { ary.sum }
          end

          # Make dataframe ===========================================
          require 'pandas'
          df = Pandas::DataFrame.new(data: { method: methods, time: times })
          Pandas.options.display.width = `tput cols`.to_i
          puts df.groupby(:method).describe

          # Visualization ============================================
          require 'matplotlib'
          plt = Matplotlib::Pyplot
          sns = PyCall.import_module('seaborn')
          sns.barplot(x: 'method', y: 'time', data: df)
          plt.title("Array summation benchmark (#{N} trials)")
          plt.savefig('bench.png', dpi: 100)

  20. $ ruby sum_bench.rb

                   time
                  count      mean       std       min       25%       50%       75%  max
          method
          inject  100.0  0.140720  0.020082  0.126592  0.132516  0.135811  0.139753
          sum     100.0  0.017629  0.001289  0.015933  0.016553  0.017336  0.018437
          while   100.0  0.126714  0.012356  0.116296  0.121269  0.123295  0.127468

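
For comparison with the pandas `describe` output on slide 20, here is a rough sketch of how a few of the same statistics could be computed in plain Ruby. The helper name `describe` and the timings are made up for illustration; they are not from the talk. It assumes Ruby 2.4 or later for `Array#sum`, and it uses the sample standard deviation (dividing by n - 1), which is the pandas default.

```ruby
# Hypothetical helper mirroring part of what pandas' describe reports.
# Uses the sample standard deviation (divides by n - 1).
def describe(samples)
  n = samples.size
  mean = samples.sum / n.to_f
  variance = samples.sum { |x| (x - mean)**2 } / (n - 1)
  { count: n, mean: mean, std: Math.sqrt(variance), min: samples.min, max: samples.max }
end

# Made-up timings grouped by method name (not the numbers from the slide).
timings = {
  inject: [0.14, 0.13, 0.15, 0.14],
  sum:    [0.017, 0.018, 0.016, 0.017]
}

timings.each do |method, samples|
  stats = describe(samples)
  puts format('%-7s count=%d mean=%.4f std=%.4f min=%.4f max=%.4f',
              method, stats[:count], stats[:mean], stats[:std], stats[:min], stats[:max])
end
```

Quartiles, grouping, and tabular display would each need more hand-written code, which is the kind of gap the "few capabilities" remark about the Ruby-only way points at.
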
  21. (no text on this slide)
  22. Example 2: Use pandas in a Rails app
  23. (no text on this slide)
  24. https://github.com/mrkn/bugs-viewer-rk2017
  25. Example 3: Object recognition by Keras
      - Detecting bboxes of objects in a photo
      - Keras's model of SSD300
  26. PyCall makes Ruby easily usable for data manipulation, data visualization, and machine learning
  27. From now on, Python is Ruby's best friend
  28. In fact, PyCall is just a wrapper library of libpython that is written in C
  29. Try PyCall
  30. PyCall is still young, so it needs to be tried in various use cases
  31. https://github.com/mrkn/pycall.rb
  32. Ask me if you want to try PyCall in your business
  33. (no text on this slide)
  34. Three options (again)
      1. Ruby-only way
      2. Using Python and R for data analysis, and connecting via JSON API or letting them read the same DB
      3. Use PyCall to call Python from Ruby
  35. Python's case: only two options
      1. Python-only way, which is the best practice
      2. Use R only for statistical analysis methods that are unavailable in Python
          - We can use Rpy2 for this case
  36. Python can be easily used for almost all situations in data science
  37. PyCall should be a temporary measure until Ruby is ready for data science
  38. Look ahead to the near future
  39. Red Data Tools project
      https://red-data-tools.github.io/
  40. Apache Arrow
  41. Exchanging data between multiple systems in data science
      - E.g. data extraction from an RDBMS to client programs
  42. The current way to exchange data between systems
      - Each system has its own internal memory format
      - Serializing and deserializing to exchange data wastes a lot of CPU time
      - Similar functions are implemented in multiple systems
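
Slide 42's point about per-system memory formats can be illustrated with a fixed-width binary buffer. The sketch below packs a column of floats into contiguous little-endian 64-bit doubles using only core Ruby. It is meant to show why an agreed binary layout avoids text parsing; it is not the Apache Arrow format and does not use the red-arrow API. The helper names `pack_column` and `unpack_column` are hypothetical.

```ruby
# Hypothetical helper: encode a numeric column as contiguous
# little-endian 64-bit doubles (8 bytes per value).
def pack_column(values)
  values.pack('E*')
end

# Hypothetical helper: decode the whole buffer back into Ruby Floats.
def unpack_column(buffer)
  buffer.unpack('E*')
end

column = [1.5, 2.25, -3.0, 4.0]
buffer = pack_column(column)

puts "buffer size: #{buffer.bytesize} bytes"

# With a fixed width, the value at index 2 sits at byte offset 2 * 8,
# so it can be read without decoding the rest of the buffer.
third = buffer.byteslice(2 * 8, 8).unpack1('E')
puts "third value: #{third}"
```

Any system that agrees on this layout can read the bytes directly, which is the general idea behind sharing one in-memory format instead of converting between private ones.
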
  43. The current situation of data exchange
  44. The near future with Apache Arrow
  45. The future in which Ruby can be used with Apache Arrow
  46. Red Arrow
      https://github.com/red-data-tools/red-arrow
  47. Red Data Tools products
      - red-arrow
      - red-chainer
      - red-arrow-pycall
      - red-arrow-numo-narray
      - red-arrow-nmatrix
      - red-arrow-activerecord
      - etc.
  48. Big News in Red Data Tools
  49. Big News in Red Data Tools
      - Kouhei Sutou (@kou) officially became a member of the PMC (project management committee) of Apache Arrow yesterday
  50. Big News in Red Data Tools
      - Kouhei Sutou (@kou) officially became a member of the PMC (project management committee) of Apache Arrow yesterday
      - This means there is at least one person who develops Ruby support for Apache Arrow as a core developer
      - So you will be able to use Apache Arrow's new features ASAP
  51. Join Red Data Tools
      - There are gitter channels in both English and Japanese
      - https://gitter.im/red-data-tools/en
      - https://gitter.im/red-data-tools/ja
  52. Summary
      - Ruby is already a programming language that is usable in data science
      - You can use Python tools from Ruby by using PyCall, as demonstrated in this talk
      - Red Data Tools enables us to use Apache Arrow, and it guarantees that Ruby will be connected to multiple data processing systems in the future
      - But there are lots of things that should be done for the future
  53. Requests for you
      - Try PyCall to make real-world use cases, and find bugs
      - Join Red Data Tools to contribute to the future of Ruby in data science
      - Join the workshop tomorrow
  54. RubyData Workshop in RubyKaigi 2017
      - 13:50-15:50 in Room Ran
      - https://github.com/RubyData/rubykaigi2017
      1. PyCall Lecture
      2. Getting started with the Red Data Tools project
  55. (no text on this slide)