RubyData and Rails

To make Ruby ready-to-use in the data science field. And the impact that it has on Rails applications.

at Rails Developer Meetup 2019

Kenta Murata

March 22, 2019

Transcript

  1. RubyData and Rails
    To make Ruby ready-to-use in the data science field.
    And the impact that it has on Rails applications.
    Kenta Murata
    Speee Inc.
    Rails Developer Meetup 2019 2019-03-22

  2. About me
    • @mrkn (Kenta Murata)
    • Full-time CRuby committer at Speee Inc.
    • Apache Arrow contributor
    • My gems:
    • pycall, enumerable-statistics, bigdecimal, etc.
    • Hobbies:
    • Mathematics, Physics, Photography

  3. Contents
    • Introduction
    • RubyData
    • Red Data Tools
    • Apache Arrow
    • The impact on Rails applications
    • Conclusion

  4. Introduction

  5. About 3 years ago…
    • I started to make data tools for Ruby
    • At that time, I tried and failed to perform image
    analysis and simple data analysis with Ruby
    • Ruby couldn’t compete with Python and Julia for
    tasks I wanted to do

  6. My Goal
    • Make Ruby a natural candidate to build data
    processing systems
    • We can analyze small data with Ruby’s data tools,
    without making a special effort to keep using Ruby
    • There are developers and users of Ruby’s data tools
    around the world

  7. Now
    • Ruby still cannot compete with Python and Julia
    • But the necessary tool ecosystems are growing steadily
    through the efforts of several developers and non-developers
    who share a similar vision
    • I think we have a bright future

  8. The map of data tool ecosystems around Ruby
    (Diagram. Labels on the map: Red Data Tools, pycall.rb, Ruby Numo, Apache Arrow, SciRuby, Python ecosystem, Other languages' ecosystems)

  10. What is RubyData
    RubyData has two sides: Community and Project

  11. On the community side
    RubyData is a community for developers and users
    who use open-source data tools with Ruby
    Hosting Discourse
    • discourse.ruby-data.org
    • It was used for discussions during GSoC 2018
    Holding workshops
    • Several workshops at past RubyKaigis
    • The next one will be held on the 2nd day of RubyKaigi 2019

  12. On the project side
    • RubyData is the umbrella project for integrating a
    lot of data tools for Ruby
    • RubyData provides two Docker images
    • rubydata/minimal-notebook
    • rubydata/datascience-notebook
    • These notebooks are available on Binder

  13. The map of data tool ecosystems around Ruby
    (Diagram. Labels on the map: Red Data Tools, pycall.rb, Ruby Numo, Apache Arrow, SciRuby, Python ecosystem, Other languages' ecosystems, now with RubyData added)

  15. rubydata/minimal-notebook
    • Based on the jupyter/scipy-notebook image
    • It consists of …
    • Jupyter Notebook and JupyterLab
    • The SciPy stack (scipy, numpy, pandas, matplotlib, IPython, etc.)
    • IRuby and data tools for Ruby including pycall.rb

  16. rubydata/datascience-notebook
    • Based on the jupyter/datascience-notebook image
    • It consists of …
    • Things in rubydata/minimal-notebook
    • IJulia and Julia tools
    • IRKernel and R tools

  17. Availability on binder
    https://github.com/RubyData/docker-stacks

  18. The objective of RubyData
    • Encourage and help Rubyists who try to use Ruby
    in the data science field
    • Developing data tools for Ruby
    • Using Ruby as a frontend language for data analysis
    • Integrating Rails applications with data
    processing tools

  19. Red Data Tools

  20. What is Red Data Tools
    • Red Data Tools is a development project
    • Its objective is to develop data tools for Ruby
    • It holds a development meetup every month

  21. The policy of Red Data Tools
    1. Collaborate beyond the Ruby community
    2. Act rather than blame
    3. Continuous, iterative progress, even if small, rather than a short, big project
    4. The current lack of knowledge doesn't matter
    5. Ignore criticism from outsiders
    6. Fun!

  22. Red Data Tools Products
    • Apache Arrow related tools (explained below)
    • Charty … visualization
    • Red Chainer … deep learning
    • etc.

  23. OSS Gate for Red Data Tools
    • Development meetup held every month at Speee
    Lounge, Tokyo
    • Not only Red Data Tools members, but also Ruby Numo
    people, SciRuby people, and others have attended
    • Like asakusa.rb, in this meetup we concentrate on the
    development of data tools for Ruby
    • There are two Apache Arrow committers

  24. A good story
    • Red Data Tools has two Apache Arrow committers
    • One is @kou, Kouhei Sutou, the founder of Red Data
    Tools project
    • The other is @shiro615. He started contributing to
    Apache Arrow as his first OSS activity at a Red Data
    Tools meetup, and got the commit bit in Nov 2018

  25. Apache Arrow

  26. Existing data tools are too dated
    • Not optimized for contemporary computer
    architectures
    • Single-threaded algorithms are not friendly to
    multi-core CPUs and GPGPUs
    • Data layout is not optimized for the CPU cache
    • Tools are fragmented across programming language
    ecosystems
    • Data in memory cannot be shared among tools

  27. Arrow’s Key Idea
    • Language agnostic, open standard in-memory
    format for columnar data (i.e. data frames)
    • Bring together database and data science
    communities to collaborate on shared
    computational technologies
    • Defragment data access among different tools

  28. Before and After
    Without Arrow
    • Each system has its own internal memory format
    • 80% of computation is wasted on serialization & deserialization
    • Similar functionality is implemented in multiple projects
    With Arrow
    • All systems utilize the same memory format
    • No overhead for cross-system communication
    • Projects can share functionality
    Source: https://arrow.apache.org/

  29. Modules
    • Arrow (In-memory storage)
    • Parquet (File storage)
    • Gandiva (Computation engine)
    • Plasma (Distributed object store)
    • Flight (Efficient gRPC transport)

  30. Use cases of Apache Arrow
    Accessing data
    • Reading and writing widely used storage formats
    • Interacting with database and other data sources
    Exchanging data
    • Zero-copy IPC
    • Efficient RPC and client-server communications
    Computation with data
    • Efficient in-memory and out-of-core data frame analysis
    • JIT compilation of vectorized expression evaluation using LLVM

  31. Red Arrow
    • Red Arrow: Ruby binding, generated through GObject Introspection
    • Arrow GLib: C binding, wrapping Arrow C++ with extern “C” functions
    • Arrow C++: core library

  32. Red Arrow family
    • Red Arrow: binds Arrow C++ through GObject Introspection
    • Red Parquet: binds Parquet C++ through GObject Introspection
    • Red Gandiva: binds Gandiva C++ through GObject Introspection
    • Red Plasma: binds Plasma C++ through GObject Introspection

  33. Red Arrow family is available
    • It can be used as a memory-efficient collection for
    primitive data types
    • Red Arrow objects can be passed to Python without
    copying by using red-arrow-pycall
    • You can read and write the Parquet file format
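
To make the "memory-efficient collection for primitive data types" point concrete, here is a minimal sketch in plain Ruby. It does not use the Red Arrow API. It only assumes that the values fit in 32 bits and uses `String#pack` as a stand-in for a contiguous, fixed-width column buffer. The helper `int32_at` is a hypothetical name introduced for illustration.

```ruby
# Sample integers; small enough to fit in 32 bits (an assumption of this sketch).
values = [3, -1, 4, 1, -5, 9, 2, 6]

# A Ruby Array keeps one object reference per element (8 bytes each on 64-bit CRuby).
array_slot_bytes = values.size * 8

# Pack the same values into one contiguous buffer of little-endian signed int32s.
packed = values.pack("l<*")

# Read a single element back by byte offset, without unpacking the whole buffer.
# (Hypothetical helper, not part of any library.)
def int32_at(buffer, index)
  buffer.byteslice(index * 4, 4).unpack1("l<")
end

puts "array slots: #{array_slot_bytes} bytes, packed buffer: #{packed.bytesize} bytes"
puts int32_at(packed, 4)
```

A real Arrow array carries more than this (for example, null handling and type metadata), so the sketch only shows why a fixed-width contiguous layout is smaller than an array of object references.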

  34. The impact on Rails applications

  35. ActiveRecord + Apache Arrow
    • I performed an experiment to integrate ActiveRecord and
    Apache Arrow
    • Arrow::RecordBatch was employed as the internal data
    representation of AR::Result
    • A RecordBatch represents a chunk of columnar table data
    • mysql2 was modified to generate an instance of
    Arrow::RecordBatch directly from a query result
    • The memory consumption and computation time of AR’s pluck
    method were compared between the original and Apache Arrow
    versions
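
Why a columnar result can help `pluck` may be easier to see in a small sketch. The code below is not the experimental implementation from the talk and uses neither ActiveRecord nor Red Arrow. `RowResult` and `ColumnarResult` are hypothetical classes that only contrast a row-oriented layout with a column-oriented one.

```ruby
# Row-oriented result: one Array per row, as a conventional query result might hold it.
class RowResult
  def initialize(column_names, rows)
    @column_names = column_names
    @rows = rows
  end

  # Walks every row and picks one cell from each, building a new Array.
  def pluck(name)
    index = @column_names.index(name)
    raise KeyError, "unknown column: #{name}" if index.nil?
    @rows.map { |row| row[index] }
  end
end

# Column-oriented result: one Array per column, loosely mirroring a record batch.
class ColumnarResult
  def initialize(columns)
    @columns = columns # { "column name" => [values...] }
  end

  # Builds the columnar layout once from row data.
  def self.from_rows(column_names, rows)
    new(column_names.zip(rows.transpose).to_h)
  end

  # The requested column already exists as a single collection.
  def pluck(name)
    @columns.fetch(name)
  end
end

names = ["id", "title"]
rows = [[1, "a"], [2, "b"], [3, "c"]]

row_result = RowResult.new(names, rows)
col_result = ColumnarResult.from_rows(names, rows)

p row_result.pluck("title")
p col_result.pluck("title")
```

Both calls return the same values. The difference is that the row-oriented version touches every row, while the columnar version hands back a column it already holds.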

  36. Result
    • Memory consumption is reduced by more than 10x in the Arrow version
    • Computation time is also reduced in the Arrow version
    Chart annotation: 12x less

  37. The effect of Apache Arrow
    • Using Apache Arrow tremendously reduces the memory
    consumption of the pluck method without any loss of computational
    speed
    • My experimental implementation can be applied just by changing
    the connection adapter name:
    “mysql2” → “arrow-mysql2”
    • The technologies for systems that handle massive data are also
    applicable to Web applications
    • Work on Ruby’s data tools can have good effects on Rails
    applications
    • I’ll explain the details of this experiment, and the additional research
    I’m now doing to improve the pluck method, at RubyKaigi 2019
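
The adapter-name change mentioned above can be pictured with a plain Ruby hash shaped like a Rails connection configuration. Only the names "mysql2" and "arrow-mysql2" come from the talk. The host and database values are placeholders, and the sketch does not load ActiveRecord or the experimental adapter.

```ruby
# Hypothetical connection settings in the shape of a Rails database configuration.
# The host and database values are placeholders for illustration only.
original_config = {
  adapter: "mysql2",
  host: "localhost",
  database: "example_development"
}

# Switching to the experimental adapter changes the adapter name and nothing else.
arrow_config = original_config.merge(adapter: "arrow-mysql2")

# List the keys whose values differ between the two configurations.
changed_keys = original_config.keys.select { |key| original_config[key] != arrow_config[key] }
p changed_keys
```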

  38. Wrap up this talk
    • Development of data tools for Ruby is progressing steadily,
    but more developers are needed to accelerate it, and so are
    people who contribute in ways other than development
    • Apache Arrow is important for the future of Ruby in
    the data science field
    • Apache Arrow is also important for Rails applications
    • I will talk about the mechanisms by which Apache Arrow
    improves the pluck method at RubyKaigi 2019
