RubyData and Rails

To make Ruby ready-to-use in the data science field. And the impact that it has on Rails applications.

at Rails Developer Meetup 2019

Kenta Murata

March 22, 2019

Transcript

  1. 1.

    RubyData and Rails
    To make Ruby ready-to-use in the data science field. And the impact that it has on Rails applications.
    Kenta Murata, Speee Inc.
    Rails Developer Meetup 2019, 2019-03-22
  3. 3.

    About me
    • @mrkn (Kenta Murata)
    • Full-time CRuby committer at Speee Inc.
    • Apache Arrow contributor
    • My gems:
      • pycall, enumerable-statistics, bigdecimal, etc.
    • Hobbies:
      • Mathematics, Physics, Photography
  4. 4.

    Contents
    • Introduction
    • RubyData
    • Red Data Tools
    • Apache Arrow
    • The impact on Rails applications
    • Conclusion
  5. 6.

    About 3 years ago…
    • I started to make data tools for Ruby
    • At that time, I tried and failed to perform image analysis and simple data analysis with Ruby
    • Ruby couldn’t compete with Python and Julia for the tasks I wanted to do
  6. 7.

    My Goal
    • Make Ruby a natural candidate for building data processing systems
    • We can analyze small data with Ruby’s data tools, so that staying with Ruby takes no hard effort
    • There are developers and users of Ruby’s data tools all around the world
  7. 8.

    Now
    • Ruby still cannot compete with Python and Julia
    • But the necessary tool ecosystems are growing steadily through the efforts of several developers and non-developers who share similar visions
    • I think we have a bright future
  8. 9.

    The map of data tool ecosystems around Ruby
    [Diagram labels: Red Data Tools, pycall.rb, Ruby Numo, Apache Arrow, SciRuby, Python ecosystem, other languages’ ecosystems]
  11. 13.

    On the community side
    RubyData is a community for developers and users who use open-source data tools with Ruby
    Hosting a Discourse forum
    • discourse.ruby-data.org
    • It was used for discussion during GSoC 2018
    Holding workshops
    • Several workshops at past RubyKaigis
    • The next one will be held on the 2nd day of RubyKaigi 2019
  12. 14.

    On the project side
    • RubyData is the umbrella project for integrating many data tools for Ruby
    • RubyData provides two Docker images
      • rubydata/minimal-notebook
      • rubydata/datascience-notebook
    • These notebooks can be used on Binder
  13. 15.

    The map of data tool ecosystems around Ruby
    [Diagram labels: Red Data Tools, pycall.rb, Ruby Numo, Apache Arrow, SciRuby, Python ecosystem, other languages’ ecosystems, RubyData]
  15. 17.

    rubydata/minimal-notebook
    • Based on the jupyter/scipy-notebook image
    • It consists of …
      • Jupyter Notebook and JupyterLab
      • The SciPy stack (scipy, numpy, pandas, matplotlib, IPython, etc.)
      • IRuby and data tools for Ruby, including pycall.rb
  16. 18.

    rubydata/datascience-notebook
    • Based on the jupyter/datascience-notebook image
    • It consists of …
      • Things in rubydata/minimal-notebook
      • IJulia and Julia tools
      • IRKernel and R tools
  17. 20.

    The objective of RubyData
    • Encourage and help Rubyists who try to utilize Ruby in the data science field
      • Developing data tools for Ruby
      • Using Ruby as a frontend language for data analysis
      • Integrating Rails applications with data processing tools
  18. 22.

    What is Red Data Tools
    • Red Data Tools is a development project
    • The objective is to develop data tools for Ruby
    • It holds a development meetup every month
  19. 23.

    The policy of Red Data Tools
    1. Collaborate beyond the Ruby community
    2. Acting rather than blaming
    3. Continuous, iterative progress rather than a short, big project
    4. The current lack of knowledge doesn't matter
    5. Ignore criticism from outsiders
    6. Fun!
  20. 24.

    Red Data Tools Products
    • Apache Arrow related tools (explained below)
    • Charty … visualization
    • Red Chainer … deep learning
    • etc.
  21. 25.

    OSS Gate for Red Data Tools
    • Development meetup held every month in Speee Lounge, Tokyo
    • Not only Red Data Tools people, but also Ruby Numo people, SciRuby people, and others have attended
    • Like asakusa.rb, but in this meetup we concentrate on developing data tools for Ruby
    • There are two Apache Arrow committers
  26. 30.

    A good story
    • Red Data Tools has two Apache Arrow committers
    • One is @kou, Kouhei Sutou, the founder of the Red Data Tools project
    • The other is @shiro615, who started contributing to Apache Arrow as his first OSS activity at a Red Data Tools meetup and got the commit bit in Nov 2018
  27. 32.

    Existing data tools are too dated
    • Not optimized for contemporary computer architectures
      • Single-threaded algorithms are not friendly to multi-core CPUs and GPGPUs
      • Data layout is not optimized for the CPU cache
    • Tools are fragmented across programming language ecosystems
      • Data in memory can’t be shared among tools
  28. 33.

    Arrow’s Key Idea
    • Language-agnostic, open standard in-memory format for columnar data (i.e. data frames)
    • Bring together the database and data science communities to collaborate on shared computational technologies
    • Defragment data access among different tools
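
The columnar idea can be illustrated with a minimal sketch in plain Ruby. It does not use the Arrow libraries, and the sample records and variable names are assumptions made up for illustration. It only shows the difference between keeping one object per record and keeping one array (or one packed buffer) per field.

```ruby
# Row-oriented layout: one Hash per record (hypothetical sample data).
rows = [
  { id: 1, price: 9.5 },
  { id: 2, price: 12.0 },
  { id: 3, price: 7.25 },
]

# Column-oriented layout: one array per field, as in a data frame.
columns = {
  id: rows.map { |row| row[:id] },
  price: rows.map { |row| row[:price] },
}

# Summing a field has to walk every record in the row layout...
row_total = rows.sum { |row| row[:price] }

# ...but touches only a single array in the column layout.
column_total = columns[:price].sum

# A column of one primitive type can also be packed into a contiguous
# binary buffer ("E*" = little-endian 64-bit floats), 8 bytes per value.
price_buffer = columns[:price].pack("E*")

puts row_total == column_total
puts price_buffer.bytesize
```

Arrow's actual format adds metadata, validity bitmaps, and a cross-language specification on top of this basic layout; the sketch only conveys why a per-column buffer is a convenient unit to share.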
  29. 34.

    Before and After
    Without Arrow
    • Each system has its own internal memory format
    • 80% of computation wasted on serialization & deserialization
    • Similar functionality implemented in multiple projects
    With Arrow
    • All systems utilize the same memory format
    • No overhead for cross-system communication
    • Projects can share functionality
    https://arrow.apache.org/
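
As a rough in-process analogy for the contrast above, the following plain-Ruby sketch compares handing data over through serialization with handing over a reference to one shared buffer. `Marshal` stands in for any system-specific format and the variable names are assumptions. Real Arrow shares buffers across processes and languages, which this sketch does not attempt.

```ruby
prices = [9.5, 12.0, 7.25]

# Without a shared format: each hand-off serializes and deserializes
# the data, so the receiver ends up with its own copy.
copied = Marshal.load(Marshal.dump(prices))

# With a shared format: producer and consumer agree on one layout
# (here, little-endian 64-bit floats) and read the same buffer.
buffer = prices.pack("E*").freeze
shared = buffer

# The consumer reads the second value directly at byte offset 8.
second_price = shared.byteslice(8, 8).unpack1("E")

puts copied.equal?(prices)
puts shared.equal?(buffer)
puts second_price
```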
  30. 35.

    Modules
    • Arrow (In-memory storage)
    • Parquet (File storage)
    • Gandiva (Computation engine)
    • Plasma (Distributed object store)
    • Flight (Efficient gRPC transport)
  31. 36.

    Use cases of Apache Arrow
    Accessing data
    • Reading and writing widely used storage formats
    • Interacting with databases and other data sources
    Exchanging data
    • Zero-copy IPC
    • Efficient RPC and client-server communications
    Computation with data
    • Efficient in-memory and out-of-core data frame analysis
    • JIT compilation of vectorized expression evaluations by LLVM
  32. 37.

    Red Arrow
    • Red Arrow: Ruby binding (via GObject Introspection)
    • Arrow GLib: C binding (wraps Arrow C++ with extern “C” functions)
    • Arrow C++: Core library
  33. 38.

    Red Arrow family
    • Red Arrow: Arrow C++ via GObject Introspection
    • Red Parquet: Parquet C++ via GObject Introspection
    • Red Gandiva: Gandiva C++ via GObject Introspection
    • Red Plasma: Plasma C++ via GObject Introspection
  34. 39.

    Red Arrow family is available
    • It can be used as a memory-efficient collection for primitive data types
    • Objects of Red Arrow classes can be passed to Python without copying by using red-arrow-pycall
    • You can read and write the Parquet file format
  35. 41.

    ActiveRecord + Apache Arrow
    • I performed an experiment to integrate ActiveRecord and Apache Arrow
    • Arrow::RecordBatch was employed as the internal data representation of AR::Result
      • A RecordBatch represents a chunk of columnar table data
    • mysql2 was modified to generate an instance of Arrow::RecordBatch directly from a query result
    • The memory consumption and computation time of AR’s pluck method are compared between the original and Apache Arrow versions
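
To give an intuition for why a columnar result could help pluck, here is a minimal sketch in plain Ruby. `RowResult` and `ColumnarResult` are hypothetical classes invented for illustration. They are not ActiveRecord's or Red Arrow's implementation, and they say nothing about how the experiment itself was coded.

```ruby
# Hypothetical row-oriented result: one Array per row.
class RowResult
  def initialize(columns, rows)
    @columns = columns
    @rows = rows
  end

  # Has to visit every row and pick one cell out of each,
  # building a new Array on every call.
  def pluck(name)
    index = @columns.index(name)
    raise KeyError, "unknown column: #{name}" if index.nil?

    @rows.map { |row| row[index] }
  end
end

# Hypothetical column-oriented result: one Array per column.
class ColumnarResult
  def initialize(columns)
    @columns = columns
  end

  # The requested column already exists as a single collection,
  # so it is returned as-is with no per-row work.
  def pluck(name)
    @columns.fetch(name)
  end
end

names = ["alice", "bob", "carol"]
row_result = RowResult.new(["id", "name"], [[1, "alice"], [2, "bob"], [3, "carol"]])
columnar_result = ColumnarResult.new("id" => [1, 2, 3], "name" => names)

puts row_result.pluck("name") == columnar_result.pluck("name")
```

In the row layout, the database driver also has to materialize every cell of every row as a Ruby object before pluck can run; a columnar batch built directly from the query result avoids that intermediate step.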
  36. 42.

    Result
    • Memory consumption is reduced by more than 10x in the Arrow version
    • Computation time is also reduced in the Arrow version
    [Chart annotation: x12 less]
  37. 43.

    The effect of Apache Arrow
    • Using Apache Arrow tremendously reduces the memory consumption of the pluck method without any loss of computational speed
    • My experimental implementation can be applied just by changing the connection adapter name: “mysql2” → “arrow-mysql2”
    • The technologies for systems that utilize massive data are also applicable to Web applications
    • The activity around Ruby’s data tools can have good effects on Rails applications
    • At RubyKaigi 2019, I’ll explain the details of this experiment and the additional research I’m now doing to improve the pluck method
  38. 45.

    Wrap up this talk
    • The development of data tools for Ruby is progressing steadily, but more developers are needed to accelerate it, and so are people who contribute in ways other than development
    • Apache Arrow is important for the future of Ruby in the data science field
    • Apache Arrow is also important for Rails applications
    • I will talk about how Apache Arrow improves the pluck method at RubyKaigi 2019