RubyData and Rails To make Ruby ready-to-use in the data science field. And the impact that it has on Rails applications. Kenta Murata Speee Inc. Rails Developer Meetup 2019 2019-03-22

About me • @mrkn (Kenta Murata) • Full-time CRuby committer at Speee Inc. • Apache Arrow contributor • My gems: • pycall, enumerable-statistics, bigdecimal, etc. • Hobbies: • Mathematics, Physics, Photography

Contents • Introduction • RubyData • Red Data Tools • Apache Arrow • The impact on Rails applications • Conclusion

About 3 years ago… • I started to make data tools for Ruby • At that time, I tried and failed to perform image analysis and simple data analysis with Ruby • Ruby couldn’t compete with Python and Julia for tasks I wanted to do

My Goal • Make Ruby a natural candidate for building data processing systems • We can analyze small data with Ruby’s data tools, so staying with Ruby takes no extra effort • There are developers and users of Ruby’s data tools around the world

Now • Ruby still cannot compete with Python and Julia • But the necessary tool ecosystems are growing steadily through the efforts of several developers and non-developers who share a similar vision • I think we have a bright future

Figure: The map of data tool ecosystems around Ruby (Red Data Tools, pycall.rb, Ruby Numo, Apache Arrow, SciRuby, Python ecosystem, other languages' ecosystems)

What is RubyData • RubyData has two sides • Community • Project

On the community side • RubyData is a community for developers and users who use open source data tools with Ruby • Hosting Discourse • discourse.ruby- • It was used for discussions during GSoC 2018 • Holding workshops • Several workshops at past RubyKaigis • The next one will be held on the 2nd day of RubyKaigi 2019

On the project side • RubyData is the umbrella project for integrating many data tools for Ruby • RubyData provides two Docker images • rubydata/minimal-notebook • rubydata/datascience-notebook • These notebooks are available on Binder

Figure: The map of data tool ecosystems around Ruby, now including RubyData (Red Data Tools, pycall.rb, Ruby Numo, Apache Arrow, SciRuby, Python ecosystem, other languages' ecosystems)

rubydata/minimal-notebook • Based on the jupyter/scipy-notebook image • It consists of … • Jupyter Notebook and JupyterLab • SciPy stack (scipy, numpy, pandas, matplotlib, IPython, etc.) • IRuby and data tools for Ruby including pycall.rb

rubydata/datascience-notebook • Based on the jupyter/datascience-notebook image • It consists of … • Things in rubydata/minimal-notebook • IJulia and Julia tools • IRKernel and R tools

Availability on Binder

The objective of RubyData • Encourage and help Rubyists who try to use Ruby in the data science field • Developing data tools for Ruby • Using Ruby as a frontend language for data analysis • Integrating Rails applications with data processing tools

Red Data Tools

What is Red Data Tools • Red Data Tools is a development project • The objective is developing data tools for Ruby • Holding the development meetup every month

The policy of Red Data Tools 1. Collaborate beyond the boundaries of the Ruby community 2. Act rather than blame 3. Continuous, iterative progress rather than a short, big project 4. The current lack of knowledge doesn't matter 5. Ignore criticism from outsiders 6. Fun!

Red Data Tools Products • Apache Arrow related tools (explained below) • Charty … visualization • Red Chainer … deep learning • etc.

OSS Gate for Red Data Tools • Development meetup held every month at Speee Lounge, Tokyo • Not only Red Data Tools people, but also Ruby Numo people, SciRuby people, and others have attended • Like asakusa.rb, in this meetup we concentrate on the development of data tools for Ruby • There are two Apache Arrow committers

A good story • Red Data Tools has two Apache Arrow committers • One is @kou, Kouhei Sutou, the founder of the Red Data Tools project • The other is @shiro615, who started contributing to Apache Arrow as his first OSS activity at a Red Data Tools meetup, and got the commit bit in Nov 2018

Apache Arrow

Existing data tools are too dated • Not optimized for contemporary computer architecture • Single-threaded algorithms are not friendly to multi-core CPUs and GPGPU • Data layout is not optimized for the CPU cache • Tools are fragmented across programming language ecosystems • Data in memory cannot be shared among tools

Arrow’s Key Idea • Language agnostic, open standard in-memory format for columnar data (i.e. data frames) • Bring together database and data science communities to collaborate on shared computational technologies • Defragment data access among different tools
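To make "columnar" concrete, here is a minimal plain-Ruby sketch of the difference between a row-oriented and a column-oriented layout. It does not use Apache Arrow at all; the sample records and every variable name are assumptions made up for illustration.

```ruby
# Row-oriented layout: one Ruby Hash per record (hypothetical sample data).
rows = [
  { id: 1, name: "alice", score: 90 },
  { id: 2, name: "bob",   score: 75 },
  { id: 3, name: "carol", score: 82 },
]

# Column-oriented layout: one Array per column, keyed by column name.
column_names = rows.first.keys
columns = column_names.to_h do |name|
  [name, rows.map { |row| row[name] }]
end

# Reading one column from the row layout has to visit every record...
scores_from_rows = rows.map { |row| row[:score] }

# ...while the columnar layout already stores that column as a single unit.
scores_from_columns = columns[:score]

# Column-wise computations work directly on that one Array.
average_score = scores_from_columns.sum / scores_from_columns.size.to_f
```

Arrow goes further than this sketch by fixing the byte-level layout of each column, which is what lets different tools and languages read the same memory.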

Before and After • Without Arrow: each system has its own internal memory format; 80% of computation is wasted on serialization and deserialization; similar functionality is implemented in multiple projects • With Arrow: all systems use the same memory format; no overhead for cross-system communication; projects can share functionality

Modules • Arrow (In-memory storage) • Parquet (File storage) • Gandiva (Computation engine) • Plasma (Distributed object store) • Flight (Efficient gRPC transport)

Use cases of Apache Arrow • Accessing data: reading and writing widely used storage formats; interacting with databases and other data sources • Exchanging data: zero-copy IPC; efficient RPC and client-server communication • Computation with data: efficient in-memory and out-of-core data frame analysis; JIT compilation of vectorized expression evaluation by LLVM

Red Arrow • Red Arrow (Ruby binding), built on Arrow GLib through GObject Introspection • Arrow GLib (C binding), which wraps Arrow C++ with extern “C” functions • Arrow C++ (core library)

Red Arrow family • Red Arrow: Arrow C++ through GObject Introspection • Red Parquet: Parquet C++ through GObject Introspection • Red Gandiva: Gandiva C++ through GObject Introspection • Red Plasma: Plasma C++ through GObject Introspection

Red Arrow family is available • It can be used as a memory-efficient collection of primitive data types • Objects of Red Arrow classes can be passed to Python without copying by using red-arrow-pycall • You can read and write the Parquet file format
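The idea behind a memory-efficient collection of primitives can be sketched in plain Ruby: store fixed-width values back to back in one binary buffer instead of as separate array slots. The class below is a hypothetical illustration, not Red Arrow's API, and it omits things a real Arrow array has, such as null handling.

```ruby
# Hypothetical fixed-width column (NOT part of Red Arrow): signed 64-bit
# integers packed into one binary String, 8 bytes per value.
class PackedInt64Column
  include Enumerable

  WIDTH = 8 # bytes per 64-bit integer

  attr_reader :size

  def initialize(values)
    # "q*" packs every value as a native-endian signed 64-bit integer.
    # Values outside the 64-bit range raise RangeError.
    @buffer = values.pack("q*")
    @size = values.size
  end

  # Number of bytes the packed values occupy.
  def bytesize
    @buffer.bytesize
  end

  # Decode a single value on demand; returns nil when out of range.
  def [](index)
    return nil if index < 0 || index >= @size

    @buffer.byteslice(index * WIDTH, WIDTH).unpack1("q")
  end

  def each
    return enum_for(:each) unless block_given?

    @size.times { |index| yield self[index] }
  end
end

column = PackedInt64Column.new([10, -20, 30])
column[1]       # decodes the second value from the buffer
column.bytesize # 3 values * 8 bytes each
```

Because the buffer is a plain run of bytes with a known width, it could in principle be handed to another runtime without converting each element, which is the property the zero-copy bullet above relies on.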

The impact on Rails applications

ActiveRecord + Apache Arrow • I performed an experiment to integrate ActiveRecord and Apache Arrow • Arrow::RecordBatch was employed as the internal data representation of AR::Result • A RecordBatch represents a batch of columnar table data • mysql2 was modified to generate an instance of Arrow::RecordBatch directly from a query result • The memory consumption and computation time of AR’s pluck method were compared between the original and Apache Arrow versions

Result • Memory consumption is reduced by more than 10x in the Arrow version • Computation time is also reduced in the Arrow version (chart label: x12 less)

The effect of Apache Arrow • Using Apache Arrow tremendously reduces the memory consumption of the pluck method without any loss of computational speed • My experimental implementation can be applied just by changing the connection adapter name: “mysql2” → “arrow-mysql2” • Technologies built for systems that handle massive data are also applicable to Web applications • Work on Ruby’s data tools can have good effects on Rails applications • I’ll explain the details of this experiment, and the additional research I’m now doing to improve the pluck method, at RubyKaigi 2019
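As a rough sketch of what "only changing the connection adapter name" means, the snippet below models connection settings as a plain Ruby Hash. Only the two adapter names come from the slide; the host, database, and username values are hypothetical, and no database connection is made.

```ruby
# Hypothetical connection settings; only the adapter names are from the talk.
mysql2_config = {
  adapter:  "mysql2",
  host:     "localhost",
  database: "app_development",
  username: "app",
}

# Switching to the experimental adapter replaces one value and nothing else.
arrow_config = mysql2_config.merge(adapter: "arrow-mysql2")

# List the settings that differ between the two configurations.
changed_keys = arrow_config.keys.select do |key|
  arrow_config[key] != mysql2_config[key]
end
```

In a Rails application the same one-line change would normally be made to the `adapter:` entry in `config/database.yml`.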

Wrap up this talk • The development of data tools for Ruby is progressing steadily, but more developers are needed to accelerate it, and people who contribute in ways other than development are needed too • Apache Arrow is important for the future of Ruby in the data science field • Apache Arrow is also important for Rails applications • I will talk about the mechanisms by which Apache Arrow improves the pluck method at RubyKaigi 2019