Slide 1

Slide 1 text

RubyData and Rails To make Ruby ready-to-use in the data science field. And the impact that it has on Rails applications. Kenta Murata Speee Inc. Rails Developer Meetup 2019 2019-03-22

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

About me • @mrkn (Kenta Murata) • Full-time CRuby committer at Speee Inc. • Apache Arrow contributor • My gems: • pycall, enumerable-statistics, bigdecimal, etc. • Hobbies: • Mathematics, Physics, Photography

Slide 4

Slide 4 text

Contents • Introduction • RubyData • Red Data Tools • Apache Arrow • The impact on Rails applications • Conclusion

Slide 5

Slide 5 text

Introduction

Slide 6

Slide 6 text

About 3 years ago… • I started to make data tools for Ruby • At that time, I tried and failed to perform image analysis and simple data analysis with Ruby • Ruby couldn’t compete with Python and Julia for tasks I wanted to do

Slide 7

Slide 7 text

My Goal • Make Ruby a natural candidate for building data processing systems • We can analyze small data with Ruby’s data tools, without extra effort just to keep using Ruby • There are developers and users of Ruby’s data tools around the world

Slide 8

Slide 8 text

Now • Ruby still cannot compete with Python and Julia • But the necessary tool ecosystems are growing steadily through the efforts of several developers and non-developers who share a similar vision • I think we have a bright future

Slide 9

Slide 9 text

The map of data tool ecosystems around Ruby [Diagram showing Ruby, Red Data Tools, pycall.rb, Ruby Numo, Apache Arrow, SciRuby, the Python ecosystem, and other languages' ecosystems]

Slide 11

Slide 11 text

RubyData

Slide 12

Slide 12 text

What is RubyData RubyData has two sides: • Community • Project

Slide 13

Slide 13 text

On the community side RubyData is a community for developers and users who use open-source data tools with Ruby Hosting Discourse • discourse.ruby-data.org • It was used for discussion during GSoC 2018 Holding workshops • Several workshops at past RubyKaigis • The next one will be held on the 2nd day of RubyKaigi 2019

Slide 14

Slide 14 text

On the project side • RubyData is the umbrella project for integrating many data tools for Ruby • RubyData provides two Docker images • rubydata/minimal-notebook • rubydata/datascience-notebook • These notebooks are available on Binder

Slide 15

Slide 15 text

The map of data tool ecosystems around Ruby [Diagram showing Ruby, Red Data Tools, pycall.rb, Ruby Numo, Apache Arrow, SciRuby, the Python ecosystem, and other languages' ecosystems, with RubyData added]

Slide 17

Slide 17 text

rubydata/minimal-notebook • Based on the jupyter/scipy-notebook image • It consists of … • Jupyter Notebook and JupyterLab • The SciPy stack (scipy, numpy, pandas, matplotlib, IPython, etc.) • IRuby and data tools for Ruby, including pycall.rb

Slide 18

Slide 18 text

rubydata/datascience-notebook • Based on the jupyter/datascience-notebook image • It consists of … • Everything in rubydata/minimal-notebook • IJulia and Julia tools • IRKernel and R tools

Slide 19

Slide 19 text

Availability on Binder https://github.com/RubyData/docker-stacks

Slide 20

Slide 20 text

The objective of RubyData • Encourage and help Rubyists who try to use Ruby in the data science field • Developing data tools for Ruby • Using Ruby as a frontend language for data analysis • Integrating Rails applications with data processing tools

Slide 21

Slide 21 text

Red Data Tools

Slide 22

Slide 22 text

What is Red Data Tools • Red Data Tools is a development project • Its objective is to develop data tools for Ruby • It holds a development meetup every month

Slide 23

Slide 23 text

The policy of Red Data Tools 1. Collaborate beyond the Ruby community 2. Acting rather than blaming 3. Continuous, iterative progress, however small, rather than a short, big project 4. The current lack of knowledge doesn't matter 5. Ignore criticism from outsiders 6. Fun!

Slide 24

Slide 24 text

Red Data Tools Products • Apache Arrow related tools (explained below) • Charty … visualization • Red Chainer … deep learning • etc.

Slide 25

Slide 25 text

OSS Gate for Red Data Tools • Development meetup held every month at Speee Lounge, Tokyo • Not only Red Data Tools members but also Ruby Numo people, SciRuby people, and others have attended • Like asakusa.rb, this meetup concentrates on developing data tools for Ruby • Two Apache Arrow committers are among the participants

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

A good story • Red Data Tools has two Apache Arrow committers • One is @kou (Kouhei Sutou), the founder of the Red Data Tools project • The other is @shiro615, who started contributing to Apache Arrow as his first OSS activity at a Red Data Tools meetup and got the commit bit in Nov 2018

Slide 31

Slide 31 text

Apache Arrow

Slide 32

Slide 32 text

Existing data tools are outdated • Not optimized for contemporary computer architectures • Single-threaded algorithms do not take advantage of multi-core CPUs and GPGPUs • Data layout is not optimized for the CPU cache • Tools are fragmented across programming language ecosystems • In-memory data cannot be shared among tools

Slide 33

Slide 33 text

Arrow’s Key Idea • Language agnostic, open standard in-memory format for columnar data (i.e. data frames) • Bring together database and data science communities to collaborate on shared computational technologies • Defragment data access among different tools

Slide 34

Slide 34 text

Before and After Without Arrow • Each system has its own internal memory format • 80% of computation is wasted on serialization & deserialization • Similar functionality is implemented in multiple projects With Arrow • All systems utilize the same memory format • No overhead for cross-system communication • Projects can share functionality https://arrow.apache.org/
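
The contrast above can be sketched in plain Ruby. This is only an illustration under stated assumptions: both "tools" live in one process, JSON stands in for an arbitrary per-tool wire format, and the shared format is simply a contiguous buffer of little-endian float64 values. The helper read_float64 is hypothetical, and none of this is the Arrow API.

```ruby
require "json"

# Sketch only: the producer and consumer are two pieces of code in one process.
values = [1.5, 2.5, 4.0]

# Without a shared format: the producer serializes, the consumer parses,
# and every value is rebuilt as a new object on the consumer side.
wire = JSON.generate(values)
copied = JSON.parse(wire)

# With a shared format: both sides agree on little-endian float64 values
# laid out back to back, so the buffer is handed over by reference.
buffer = values.pack("E*").freeze

# Hypothetical helper: decode the float64 stored at the given position.
def read_float64(buffer, index)
  buffer.byteslice(index * 8, 8).unpack1("E")
end

shared = buffer # no serialization step, the same bytes are reused
total = (0...values.length).sum { |i| read_float64(shared, i) }

puts copied.inspect
puts total
```

The point of the sketch is only that an agreed memory layout removes the generate/parse round trip; real Arrow implementations add schemas, null handling, and cross-process sharing on top of this idea.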

Slide 35

Slide 35 text

Modules • Arrow (In-memory storage) • Parquet (File storage) • Gandiva (Computation engine) • Plasma (Distributed object store) • Flight (Efficient gRPC transport)

Slide 36

Slide 36 text

Use cases of Apache Arrow Accessing data • Reading and writing widely used storage formats • Interacting with databases and other data sources Exchanging data • Zero-copy IPC • Efficient RPC and client-server communications Computation with data • Efficient in-memory and out-of-core data frame analysis • JIT compilation of vectorized expression evaluation with LLVM

Slide 37

Slide 37 text

Red Arrow [Layer diagram] • Red Arrow: Ruby binding, built on Arrow GLib through GObject Introspection • Arrow GLib: C binding, wraps Arrow C++ with extern “C” functions • Arrow C++: Core library

Slide 38

Slide 38 text

Red Arrow family [Diagram: each Ruby binding sits on its C++ library through GObject Introspection] • Red Arrow: Arrow C++ • Red Parquet: Parquet C++ • Red Gandiva: Gandiva C++ • Red Plasma: Plasma C++

Slide 39

Slide 39 text

Red Arrow family is available • It can be used as a memory-efficient collection of primitive data types • Objects of Red Arrow classes can be passed to Python without copying by using red-arrow-pycall • You can read and write the Parquet file format
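
To make "memory-efficient collection of primitive data types" concrete, here is a minimal pure-Ruby sketch of the general idea: values packed into one contiguous buffer, plus a validity bitmap marking nulls. PackedInt32Column is a hypothetical class written for this illustration; it is not a Red Arrow class and does not use the Red Arrow API.

```ruby
# Sketch only: a toy int32 column stored as packed bytes plus a validity bitmap.
class PackedInt32Column
  BYTES_PER_VALUE = 4

  attr_reader :length

  def initialize(values)
    @length = values.length
    # One bit per value, least-significant bit first; 1 means "not null".
    bitmap = Array.new((@length + 7) / 8, 0)
    values.each_with_index do |value, i|
      bitmap[i / 8] |= (1 << (i % 8)) unless value.nil?
    end
    @validity = bitmap.pack("C*")
    # Nulls take a zero-filled slot so every value sits at a fixed offset.
    @data = values.map { |value| value || 0 }.pack("l<*")
  end

  # Random access: check the validity bit, then decode 4 bytes at a fixed offset.
  def [](index)
    return nil if index < 0 || index >= @length
    return nil if @validity.getbyte(index / 8)[index % 8].zero?
    @data.byteslice(index * BYTES_PER_VALUE, BYTES_PER_VALUE).unpack1("l<")
  end

  # Total bytes held by the two buffers.
  def bytesize
    @validity.bytesize + @data.bytesize
  end
end

column = PackedInt32Column.new([10, nil, -3, 42])
puts column[0].inspect
puts column[1].inspect
puts column.bytesize
```

Each element costs four bytes plus one bit, with no per-element Ruby object, which is where the memory saving comes from.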

Slide 40

Slide 40 text

The impact on Rails applications

Slide 41

Slide 41 text

ActiveRecord + Apache Arrow • I ran an experiment integrating ActiveRecord with Apache Arrow • Arrow::RecordBatch was used as the internal data representation of AR::Result • A RecordBatch represents a chunk of columnar table data • mysql2 was modified to generate an Arrow::RecordBatch instance directly from a query result • The memory consumption and computation time of AR’s pluck method were compared between the original and Apache Arrow versions
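
Why a columnar result suits pluck can be sketched without any gems. RowResult and ColumnarResult below are hypothetical stand-ins invented for this illustration; they are not ActiveRecord::Result or Arrow::RecordBatch, and the experiment's actual implementation is not shown in these slides.

```ruby
# Sketch only: a row-oriented result, as a list of rows plus column names.
class RowResult
  def initialize(columns, rows)
    @columns = columns
    @rows = rows
  end

  # Walks every row and picks one field out of each, building a new array.
  def pluck(name)
    index = @columns.index(name) || raise(KeyError, "unknown column: #{name}")
    @rows.map { |row| row[index] }
  end
end

# Sketch only: a column-oriented result, as a hash of name => array of values.
class ColumnarResult
  def initialize(columns)
    @columns = columns
  end

  # Builds the columnar form from row data (assumes at least one row).
  def self.from_rows(names, rows)
    new(names.zip(rows.transpose).to_h)
  end

  # The requested column already exists as one unit, so it is returned as is.
  def pluck(name)
    @columns.fetch(name)
  end
end

names = ["id", "name"]
rows = [[1, "alice"], [2, "bob"], [3, "carol"]]

row_result = RowResult.new(names, rows)
columnar_result = ColumnarResult.from_rows(names, rows)

puts row_result.pluck("name").inspect
puts columnar_result.pluck("name").inspect
```

Both calls return the same values; the difference is that the row-oriented version has to touch every row to assemble the answer, while the columnar version only looks up a column that is already stored together.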

Slide 42

Slide 42 text

Result • Memory consumption is reduced by more than 10x in the Arrow version • Computation time is also reduced in the Arrow version [Chart annotation: x12 less]

Slide 43

Slide 43 text

The effect of Apache Arrow • Using Apache Arrow greatly reduces the memory consumption of the pluck method without loss of computational speed • My experimental implementation can be applied just by changing the connection adapter name: “mysql2” → “arrow-mysql2” • Technologies built for systems that handle massive data are also applicable to Web applications • Work on Ruby’s data tools can benefit Rails applications • I’ll explain the details of this experiment, and the additional research I’m now doing to improve the pluck method, at RubyKaigi 2019

Slide 44

Slide 44 text

Conclusion

Slide 45

Slide 45 text

Wrap up this talk • Development of data tools for Ruby is progressing steadily, but more developers are needed to accelerate it, and people who contribute through non-development activities are needed too • Apache Arrow is important for the future of Ruby in the data science field • Apache Arrow is also important for Rails applications • I will talk about the mechanisms by which Apache Arrow improves the pluck method at RubyKaigi 2019