Slide 1

Slide 1 text

Development of Data Science Ecosystem for Ruby
Kenta Murata, Speee Inc.
2017.09.19 RubyKaigi 2017 in Hiroshima, Japan
1 / 55

Slide 2

Slide 2 text

BigData is important in your business 2 / 55

Slide 3

Slide 3 text

Data Science Data Analysis Data Aggregation Machine Learning 3 / 55

Slide 4

Slide 4 text

Data Mining 4 / 55

Slide 5

Slide 5 text

Programming Languages for Data Science 5 / 55

Slide 6

Slide 6 text

Python R Scala 6 / 55

Slide 7

Slide 7 text

At RubyKaigi 2016 in Kyoto, I stated that Ruby was not practically usable for data science.
7 / 55

Slide 8

Slide 8 text

Today's Topics
1. The current situation
   Ruby is now practically usable for data science
2. For the future
   What we should work on to keep Ruby available for data science
3. Request for you
   Shall we develop our tools and community?
8 / 55

Slide 9

Slide 9 text

self.introduce
Kenta Murata
@mrkn (github, twitter, etc.)
Researcher at Speee Inc.
CRuby committer
bigdecimal, enumerable-statistics, pycall, etc.
9 / 55

Slide 10

Slide 10 text

The current situation of Ruby in data science 10 / 55

Slide 11

Slide 11 text

How to use Ruby in data science 11 / 55

Slide 12

Slide 12 text

The two options we had in the past for using Ruby in data science
1. Ruby-only way
2. Use Python and R for data analysis and connect via a JSON API
12 / 55

Slide 13

Slide 13 text

Ruby-only way
The data processing parts are heavily restricted
The existing tools offer few capabilities
13 / 55

Slide 14

Slide 14 text

Use Python or R together with Ruby
Exchanging data through a JSON API is costly:
  Development and maintenance of API endpoints
  JSON serialization for exchanging data
Letting the data processing systems refer to the same database as the main application
  This increases the development cost of the main application
14 / 55
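The serialization cost listed above can be sketched with Ruby's standard json and benchmark libraries. This is only an illustration under stated assumptions: the `rows` payload and the `round_trip` helper are made up for this example, and no real API endpoint or network transfer is involved.

```ruby
require 'json'
require 'benchmark'

# Hypothetical payload: rows a Ruby app might send to a Python or R service.
rows = Array.new(100_000) { |i| { 'id' => i, 'value' => i * 0.5 } }

# Serialize on the sending side, then parse on the receiving side.
# Both steps copy every value, which is the cost the slide refers to.
def round_trip(rows)
  payload = JSON.generate(rows) # Ruby objects -> JSON text
  JSON.parse(payload)           # JSON text -> Ruby objects again
end

restored = nil
elapsed = Benchmark.realtime { restored = round_trip(rows) }

puts "round trip of #{rows.size} rows took #{format('%.3f', elapsed)} s"
puts "data preserved: #{restored == rows}"
```

In a real deployment the HTTP request and the endpoint code would come on top of this conversion work.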

Slide 15

Slide 15 text

The third option introduced by PyCall
PyCall allows us to use a Python interpreter together with a Ruby interpreter in the same process.
PyCall provides low-cost ways of exchanging data:
  Direct conversion to Python data types
  Sharing the same memory pointers
  Using the Apache Arrow data structure via the red-arrow-pycall library
15 / 55

Slide 16

Slide 16 text

Three options available today
1. Ruby-only way
2. Using Python and R for data analysis, and connect via a JSON API or let them read the same DB
3. Use PyCall to call Python from Ruby
16 / 55

Slide 17

Slide 17 text

PyCall 17 / 55

Slide 18

Slide 18 text

For example: use seaborn to visualize benchmark results
Measure benchmark results and collect them in a pandas dataframe
Visualize the results with seaborn, a Python visualization library built on matplotlib
Perform all of the above in one Ruby script
18 / 55

Slide 19

Slide 19 text

# Benchmark ================================================
require 'benchmark'
N, L = 100, 1_000_000
ary = Array.new(L) { rand }
methods, times = [], []
N.times do
  methods << :inject
  times << Benchmark.realtime { ary.inject(:+) }
  methods << :while # ----------------------------------
  times << Benchmark.realtime {
    sum, i = ary[0], 1
    while i < L
      sum += ary[i]; i += 1
    end
  }
  methods << :sum # ------------------------------------
  times << Benchmark.realtime { ary.sum }
end
# Make dataframe ===========================================
require 'pandas'
df = Pandas::DataFrame.new(data: { method: methods, time: times })
Pandas.options.display.width = `tput cols`.to_i
puts df.groupby(:method).describe
# Visualization ============================================
require 'matplotlib'
plt = Matplotlib::Pyplot
sns = PyCall.import_module('seaborn')
sns.barplot(x: 'method', y: 'time', data: df)
plt.title("Array summation benchmark (#{N} trials)")
plt.savefig('bench.png', dpi: 100)

19 / 55

Slide 20

Slide 20 text

$ ruby sum_bench.rb
         time
        count      mean       std       min       25%       50%       75%  max
method
inject  100.0  0.140720  0.020082  0.126592  0.132516  0.135811  0.139753
sum     100.0  0.017629  0.001289  0.015933  0.016553  0.017336  0.018437
while   100.0  0.126714  0.012356  0.116296  0.121269  0.123295  0.127468

20 / 55
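For readers unfamiliar with pandas, the count, mean, and std columns of this output can be approximated in plain Ruby. The following is a minimal sketch, assuming the sample standard deviation (dividing by n - 1, which is the pandas default); the `describe` helper and the timings are hypothetical and are not the measurements from the talk.

```ruby
# Hypothetical helper mirroring three columns of pandas' describe output.
def describe(values)
  n = values.size
  mean = values.sum / n.to_f
  # Sample variance (divide by n - 1), matching pandas' default ddof=1.
  variance = values.sum { |v| (v - mean)**2 } / (n - 1)
  { count: n, mean: mean, std: Math.sqrt(variance) }
end

# Made-up timings grouped by method name, standing in for the benchmark data.
samples = {
  inject: [0.14, 0.13, 0.15],
  sum:    [0.017, 0.018, 0.016]
}

samples.each do |name, times|
  stats = describe(times)
  puts format('%-7s count=%d mean=%.6f std=%.6f',
              name, stats[:count], stats[:mean], stats[:std])
end
```

The quartile columns (25%, 50%, 75%) are left out to keep the sketch short.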

Slide 21

Slide 21 text

21 / 55

Slide 22

Slide 22 text

Example 2: Use pandas in Rails app 22 / 55

Slide 23

Slide 23 text

23 / 55

Slide 24

Slide 24 text

https://github.com/mrkn/bugs-viewer-rk2017
24 / 55

Slide 25

Slide 25 text

Example 3: Object recognition by Keras
Detecting bounding boxes (bboxes) of objects in a photo
Keras's model of SSD300
25 / 55

Slide 26

Slide 26 text

PyCall makes Ruby easily usable for data manipulation, data visualization, and machine learning
26 / 55

Slide 27

Slide 27 text

From now on, Python is Ruby's best friend
27 / 55

Slide 28

Slide 28 text

In fact, PyCall is just a wrapper library for libpython, written in C
28 / 55

Slide 29

Slide 29 text

Try PyCall 29 / 55

Slide 30

Slide 30 text

PyCall is still young, so it needs to be tried in a variety of use cases
30 / 55

Slide 31

Slide 31 text

https://github.com/mrkn/pycall.rb 31 / 55

Slide 32

Slide 32 text

Ask me if you want to try PyCall in your business 32 / 55

Slide 33

Slide 33 text

33 / 55

Slide 34

Slide 34 text

Three options (again)
1. Ruby-only way
2. Using Python and R for data analysis, and connect via a JSON API or let them read the same DB
3. Use PyCall to call Python from Ruby
34 / 55

Slide 35

Slide 35 text

Python's case: only two options
1. Python-only way, which is the best practice
2. Use R only for statistical analysis methods that are unavailable in Python
   We can use Rpy2 for this case
35 / 55

Slide 36

Slide 36 text

Python can be easily used for almost all situations in data science 36 / 55

Slide 37

Slide 37 text

PyCall should be a temporary measure until Ruby is ready for data science
37 / 55

Slide 38

Slide 38 text

Look ahead to the near future 38 / 55

Slide 39

Slide 39 text

Red Data Tools project
https://red-data-tools.github.io/
39 / 55

Slide 40

Slide 40 text

Apache Arrow 40 / 55

Slide 41

Slide 41 text

Exchanging data between multiple systems in data science
E.g. data extraction from an RDBMS to client programs
41 / 55

Slide 42

Slide 42 text

The current way to exchange data between systems
Each system has its own internal memory format
Serializing and deserializing data for exchange wastes a lot of CPU time
Similar functions are implemented in multiple systems
42 / 55
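The alternative to per-system formats, a memory layout that every participant agrees on, can be illustrated with a toy example. The sketch below is not Apache Arrow's actual format or API. It only assumes a hypothetical fixed layout (little-endian 64-bit floats packed back to back), and the `write_column` and `read_value` helpers are invented for illustration.

```ruby
# Toy shared layout: each value is an 8-byte little-endian double.
BYTES_PER_VALUE = 8

# Writer side: pack a column of floats into one contiguous binary string.
def write_column(values)
  values.pack('E*')
end

# Reader side: read the value at the given index directly from the shared
# bytes, without converting the rest of the column.
def read_value(buffer, index)
  buffer.byteslice(index * BYTES_PER_VALUE, BYTES_PER_VALUE).unpack1('E')
end

column = [1.5, 2.25, 3.0, 4.75]
buffer = write_column(column)

puts "buffer size: #{buffer.bytesize} bytes"
puts "value at index 2: #{read_value(buffer, 2)}"
```

Because both sides agree on the layout, a reader can locate any value by offset instead of parsing a text format first.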

Slide 43

Slide 43 text

The current situation of data exchanging 43 / 55

Slide 44

Slide 44 text

The near future with Apache Arrow 44 / 55

Slide 45

Slide 45 text

The future in which Ruby can be used with Apache Arrow 45 / 55

Slide 46

Slide 46 text

Red Arrow https://github.com/red-data-tools/red-arrow 46 / 55

Slide 47

Slide 47 text

Red Data Tools products
red-arrow
red-chainer
red-arrow-pycall
red-arrow-numo-narray
red-arrow-nmatrix
red-arrow-activerecord
etc.
47 / 55

Slide 48

Slide 48 text

Big News in Red Data Tools 48 / 55

Slide 49

Slide 49 text

Big News in Red Data Tools
Kouhei Sutou (@kou) officially became a member of the PMC (Project Management Committee) of Apache Arrow yesterday
49 / 55

Slide 50

Slide 50 text

Big News in Red Data Tools
Kouhei Sutou (@kou) officially became a member of the PMC (Project Management Committee) of Apache Arrow yesterday
This means there is at least one person who develops Ruby support for Apache Arrow as a core developer
So you will be able to use Apache Arrow's new features ASAP
50 / 55

Slide 51

Slide 51 text

Join Red Data Tools
There are Gitter channels in both English and Japanese
https://gitter.im/red-data-tools/en
https://gitter.im/red-data-tools/ja
51 / 55

Slide 52

Slide 52 text

Summary
Ruby is already a programming language that is usable in data science
You can use Python tools from Ruby through PyCall, as demonstrated in this talk
Red Data Tools enables us to use Apache Arrow, and this guarantees that Ruby will be connected to multiple data processing systems in the future
But there are still many things to be done for the future
52 / 55

Slide 53

Slide 53 text

Requests for you
Try PyCall to create real-world use cases, and find bugs
Join Red Data Tools to contribute to the future of Ruby in data science
Join the workshop tomorrow
53 / 55

Slide 54

Slide 54 text

RubyData Workshop in RubyKaigi 2017
13:50-15:50 in Room Ran
https://github.com/RubyData/rubykaigi2017
1. PyCall Lecture
2. Getting started with the Red Data Tools project
54 / 55

Slide 55

Slide 55 text

55 / 55