
Reducing ActiveRecord memory consumption using Apache Arrow

Using Apache Arrow as ActiveRecord's internal data structure reduced the execution time of the pluck method and made the method more parallelizable.

I also suggested that DBMSs should support Apache Arrow's data format for their query results, so that ActiveRecord's memory consumption is reduced as well.

Kenta Murata

April 20, 2019

Transcript

  1. Reducing ActiveRecord memory
    consumption using Apache Arrow
    Kenta Murata
    Speee Inc.
    2019-04-20
    RubyKaigi 2019

  2. Contents
    • Background
    • ActiveRecord and the pluck method
    • Table data structure in Apache Arrow
    • Experiment and results
    • Future development plans
    • Conclusion

  3. About myself
    • name: mrkn
    • full_name: Kenta Murata
    • affiliation: Speee Inc.
    • job: Full-time CRuby committer
    • hobby: CS, Math, Camera

  4.

  5. Background

  6. I found Apache Arrow in 2016

  7. Apache Arrow
    • Language-agnostic, open standard in-memory format for column-major data (i.e. data frames)
    • Started in Sep 2015, publicly announced in Feb 2016
    • Defragment data access among different tools and languages
    • Bring together databases and data science communities to
    collaborate on shared computational technologies

  8. Before and After
    Without Arrow (Now):
    • Each system has its own internal memory format
    • 80% of computation wasted on serialization & deserialization
    • Similar functionality implemented in multiple projects
    With Arrow (Future):
    • All systems utilize the same memory format
    • No overhead for cross-system communication
    • Projects can share functionality
    https://arrow.apache.org/

  9. 2016

  10. 2018
    • I started to contribute to the development of Apache Arrow
    • Apache Arrow was already usable from Ruby via the red-arrow gem at that time
    • That is thanks to Kouhei Sutou
    • My contribution had two goals:
    1. Making a sparse tensor library for Ruby and making it interoperable with other systems such as the SciPy stack on Python
    2. Making a data frame library for Ruby using Apache Arrow's data structures and operations
    • As I investigated the implementation of the Apache Arrow C++ library, I speculated that Apache Arrow might improve the performance of applications using ActiveRecord

  11. How does AR manage the result of a query?
    [Diagram: mysqld returns a MYSQL_RES, whose Result is extracted into an AR::Result used by AR::Relation]
    AR::Result holds:
    • fields : an array of the information of the columns
    • rows : an array of the records in a result from the DBMS

  12. Use Apache Arrow in AR::Result
    [Diagram: mysqld returns a MYSQL_RES, whose Result is extracted into an Arrow::Table held by AR::Result and used by AR::Relation]
    AR::Result now holds:
    • fields : an array of the information of the columns
    • columns : an array of the column arrays in a result from the DBMS

  13. Apache Arrow’s benefit
    • Apache Arrow uses a column-major layout
    • It has advantages in cache efficiency and SIMD-instruction applicability when scanning columns to filter records
    • Its memory consumption can be much more efficient than a row-major layout
    • Almost all operations are implemented in C++, so they are faster than equivalents written in Ruby
    • It is more parallelizable than code written in Ruby because these operations can be performed while releasing the RubyVM GIL

  14. Row-major vs Column-major
    id name height weight
    1  A  172 67
    2  B  185 78
    3  C  166 60
    4  D  158 62
    5  E  190 75
    6  F  178 69
    Row-major layout:    1 A 172 67 | 2 B 185 78 | 3 C 166 60 | 4 D 158 62 | 5 E 190 75 | 6 F 178 69
    Column-major layout: 1 2 3 4 5 6 | A B C D E F | 172 185 166 158 190 178 | 67 78 60 62 75 69

  15. Selecting height ≥ 170
    Using the same table as above, only the height column needs to be scanned:
    Row-major layout:    1 A 172 67 | 2 B 185 78 | 3 C 166 60 | 4 D 158 62 | 5 E 190 75 | 6 F 178 69 (low efficiency: every record must be visited)
    Column-major layout: 1 2 3 4 5 6 | A B C D E F | 172 185 166 158 190 178 | 67 78 60 62 75 69 (high efficiency: scan just the height column)
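The two layouts above can be mimicked in a few lines of plain Ruby (an illustration only, not Arrow's actual memory representation) to show why the column-major form makes this selection cheap: only the height column's values need to be scanned.

```ruby
rows = [  # row-major: one array per record
  [1, "A", 172, 67], [2, "B", 185, 78], [3, "C", 166, 60],
  [4, "D", 158, 62], [5, "E", 190, 75], [6, "F", 178, 69],
]

columns = {  # column-major: one contiguous array per column
  id:     [1, 2, 3, 4, 5, 6],
  name:   %w[A B C D E F],
  height: [172, 185, 166, 158, 190, 178],
  weight: [67, 78, 60, 62, 75, 69],
}

# Row-major: every record is visited, touching all four fields.
selected_rows = rows.select { |_, _, height, _| height >= 170 }

# Column-major: only the height column is scanned to build a mask.
mask = columns[:height].map { |h| h >= 170 }
selected_ids = columns[:id].zip(mask).select { |_, m| m }.map(&:first)

selected_rows.map(&:first)  # => [1, 2, 5, 6]
selected_ids                # => [1, 2, 5, 6]
```

In the column-major case the CPU streams through one contiguous integer array, which is what makes the cache and SIMD benefits on the previous slide possible.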

  16. Apache Arrow’s benefit
    • Apache Arrow uses a column-major layout
    • It has advantages in cache efficiency and SIMD-instruction applicability when scanning columns to filter records
    • Its memory consumption can be much more efficient than a row-major layout
    • Almost all operations are implemented in C++, so they are faster than equivalents written in Ruby
    • It is more parallelizable than code written in Ruby because these operations can be performed while releasing the RubyVM GIL

  17. I performed a PoC experiment
    • Making a rough implementation of a new subclass of AR::Result, called the ArrowResult class, that uses Arrow::Table
    • Making a rough implementation of a new connection adapter that generates an ArrowResult object as the result of a query
    • Comparing the execution time and memory consumption of the pluck method between the original and Arrow versions

  18. Why pluck?
    • The pluck method is simple
    • It just converts an AR::Result to an Array of Arrays
    • It is easy to understand
    • It makes it easy to demonstrate the superiority of Apache Arrow
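What pluck essentially does can be sketched in plain Ruby (a sketch only, not ActiveRecord's real implementation; the Result struct here is a hypothetical stand-in for AR::Result):

```ruby
# A stand-in for AR::Result: column names plus an array of row arrays.
Result = Struct.new(:fields, :rows)

result = Result.new(
  ["id", "name", "height"],
  [[1, "A", 172], [2, "B", 185], [3, "C", 166]]
)

# pluck's essential job: pick the requested columns out of each row,
# returning an Array of Arrays.
def pluck(result, *columns)
  indexes = columns.map { |c| result.fields.index(c.to_s) }
  result.rows.map { |row| row.values_at(*indexes) }
end

pluck(result, :name, :height)
# => [["A", 172], ["B", 185], ["C", 166]]
```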

  19. The result of the PoC
    • I already reported it at Rails Developer Meetup on 22 Mar 2019:
    • The execution time was not increased
    • The memory consumption was EXTREMELY reduced!!
    • It’s amazing!!! I didn’t expect such a result.
    But …

  20. The PoC Code was Broken!!!

  21. The reported result is wrong

  22. I fixed the code and performed the experiments again for today

  23. The new, correct result
    • The execution time was much reduced
    • The memory consumption was increased
    • And an additional result:
    • The pluck method became more parallelizable than before

  24. ActiveRecord and the
    pluck method

  25. ActiveRecord classes and modules
    [Diagram: AR::Base creates AR::Relation, which includes the AR::Calculations module (pluck is here!); a query goes through AR::ConnectionAdapter, which uses Mysql2::Client, Pg::Connection, etc. to query the DBMS; the result comes back as Mysql2::Result, PG::Result, etc. and is wrapped in an AR::Result]

  26. What I tweaked for the experiments
    [The same diagram as the previous slide, highlighting the parts tweaked for the experiments]

  27. Breaking down the pluck method
    args (columns) → build SQL → a SQL string → query → a Mysql2::Result → fetch → an AR::Result → cast values → an Array of records (return value)
    The query, fetch, and cast-values stages occupy almost all of the execution time and almost all of the consumed memory.

  28. Focusing on 3 stages
    a SQL string → query (Mysql2::Client#query) → a Mysql2::Result → fetch (Mysql2::Result#to_a) → an AR::Result → cast_values (AR::Result#cast_values) → an Array of records
    The query stage runs without the GIL (waiting on I/O, etc.), so other threads can run simultaneously.
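The three stages can be sketched with in-memory stubs (the Stub* classes and cast_values helper below are hypothetical stand-ins for Mysql2::Client, Mysql2::Result, and AR::Result#cast_values, since running the real pipeline needs a MySQL server):

```ruby
# "fetch" stage: materialize raw rows (stand-in for Mysql2::Result#to_a).
class StubResult
  def initialize(rows)
    @rows = rows
  end

  def to_a
    @rows
  end
end

# "query" stage: send SQL, get a result handle
# (stand-in for Mysql2::Client#query).
class StubClient
  def query(_sql)
    StubResult.new([["1", "172"], ["2", "185"]])
  end
end

# "cast_values" stage: convert raw strings into typed Ruby values.
def cast_values(rows, casters)
  rows.map { |row| row.zip(casters).map { |value, cast| cast.call(value) } }
end

client = StubClient.new
raw    = client.query("SELECT id, height FROM people").to_a
casted = cast_values(raw, [->(v) { Integer(v) }, ->(v) { Integer(v) }])
# => [[1, 172], [2, 185]]
```

Only the query stage waits on the network, which is why it alone runs without the GIL in the original version.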

  29. Table data structure in
    Apache Arrow

  30. Arrow::Table
    • A Table object consists of a Schema object and a vector of Column objects
    • A Schema consists of a vector of Fields, and a Field consists of its data type and its name
    • A Column consists of a Field object and a ChunkedArray object
    • The field of the column at a certain index is the same as the field at the same index in the schema
    • Each column has the same length
    [Diagram: a Table holds a Schema (a vector of Fields f1–f5, where Field := (data_type, name)) and Columns (Column := (field, chunked_array)); each column's field is the same as the corresponding field in the schema]
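The structure described above can be mirrored with plain Ruby Structs (an illustration of the layout only, not red-arrow's actual classes; a column's array of chunks stands in for its ChunkedArray):

```ruby
Field  = Struct.new(:data_type, :name)
Column = Struct.new(:field, :chunks)  # chunks stand in for a ChunkedArray
Schema = Struct.new(:fields)
Table  = Struct.new(:schema, :columns)

height = Column.new(Field.new(:int32, "height"),
                    [[172, 185], [166, 158], [190, 178]])
weight = Column.new(Field.new(:int32, "weight"),
                    [[67, 78, 60], [62, 75, 69]])

table = Table.new(Schema.new([height.field, weight.field]),
                  [height, weight])

# The field at index i of the schema is the field of column i.
table.schema.fields[0] == table.columns[0].field  # => true
# Columns have the same total length even with different chunk sizes.
table.columns.map { |c| c.chunks.sum(&:size) }    # => [6, 6]
```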

  31. ChunkedArray
    • A ChunkedArray consists of multiple arrays with the same value type
    • Each array has a contiguous buffer for storing values
    • Each array can have a different length
    • It allows us to treat multiple arrays as a single array
    • It allows us to build a table in parallel by a divide-and-conquer method
    [Diagram: a ChunkedArray holds arrays a1–a5, all of the same type, each with its own contiguous Buffer]
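A minimal ChunkedArray sketch in plain Ruby (red-arrow's real class wraps the C++ implementation; this only illustrates the "many arrays behaving as one" idea):

```ruby
class SimpleChunkedArray
  def initialize(chunks)
    @chunks = chunks  # each chunk is a plain Array with its own storage
  end

  # Total logical length across all chunks.
  def length
    @chunks.sum(&:length)
  end

  # Index into the logical array by walking chunk boundaries.
  def [](index)
    @chunks.each do |chunk|
      return chunk[index] if index < chunk.length
      index -= chunk.length
    end
    nil
  end

  def to_a
    @chunks.flatten(1)
  end
end

heights = SimpleChunkedArray.new([[172, 185], [166], [158, 190, 178]])
heights.length  # => 6
heights[2]      # => 166
heights.to_a    # => [172, 185, 166, 158, 190, 178]
```

Because each chunk is independent, several threads can build chunks simultaneously and the results can be presented as one array without copying, which is the divide-and-conquer property the slide mentions.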

  32. Experiment and results

  33. The Experiment Setting
    • Comparing the execution time and the memory consumption between the two versions of the pluck method: original vs. Arrow
    • In the Arrow version, the result of the query is divided into multiple batches that are processed in parallel

  34. Table definition and query statement
    CREATE TABLE IF NOT EXISTS mysql2_test (
    null_test VARCHAR(10),
    bit_test BIT,
    tiny_int_test TINYINT,
    small_int_test SMALLINT,
    medium_int_test MEDIUMINT,
    int_test INT,
    big_int_test BIGINT,
    float_test FLOAT(10,3),
    float_zero_test FLOAT(10,3),
    double_test DOUBLE(10,3),
    decimal_test DECIMAL(10,3),
    decimal_zero_test DECIMAL(10,3),
    date_test DATE,
    date_time_test DATETIME,
    timestamp_test TIMESTAMP,
    time_test TIME,
    year_test YEAR(4),
    char_test CHAR(10),
    varchar_test VARCHAR(10),
    binary_test BINARY(10),
    varbinary_test VARBINARY(10),
    tiny_blob_test TINYBLOB,
    tiny_text_test TINYTEXT,
    blob_test BLOB,
    text_test TEXT,
    medium_blob_test MEDIUMBLOB,
    medium_text_test MEDIUMTEXT,
    long_blob_test LONGBLOB,
    long_text_test LONGTEXT,
    enum_test ENUM('val1', 'val2'),
    set_test SET('val1', 'val2')
    ) DEFAULT CHARSET=utf8
    Mysql2Test.limit(n).pluck(
    :int_test, :double_test,
    :varchar_test, :text_test,
    )

  35. Changes of 3 stages
    Original version: SQL → query (Mysql2::Client#query) → Mysql2::Result → fetch (Mysql2::Result#to_a) → AR::Result → cast_values (AR::Result#cast_values) → Array
    Arrow version: SQL → query (Mysql2::Client#query) → Mysql2::Result → fetch (Mysql2::Result#to_arrow) → ArrowResult (Arrow::Table) → cast_values (Arrow::Table#raw_records) → Array

  36. Mysql2::Result#to_arrow
    In [1]: client = Mysql2::Client.new(host: 'localhost', username: 'root', database: 'test')
    result = client.query('select * from ruby_committers', as: :array)
    records = result.to_a
    records[0..2]
    Out[1]: [[1, "matz", "3db12e8b236ac8f88db8eb4690d10e4a3b8dbcd4", 1998-01-16 21:13:05 +0900],
    [2, "aamine", "6f4751f5f63d177534f34e0c278967883e11c035", 1999-12-18 00:00:13 +0900],
    [3, "shugo", "6d36661c5eb87831cc60fa416470c8459d2a16c0", 1999-12-20 12:27:48 +0900]]
    In [2]: result = client.query('select * from ruby_committers', as: :array)
    table = result.to_arrow
    Out[2]: #
    id svn first_commit
    0 1 matz 3db12e8b236ac8f88db8eb4690d10e4a3b8dbcd4
    1 2 aamine 6f4751f5f63d177534f34e0c278967883e11c035
    2 3 shugo 6d36661c5eb87831cc60fa416470c8459d2a16c0
    : : : :
    The result of Mysql2::Result#to_a is an array of records.
    The result of Mysql2::Result#to_arrow is an Arrow::Table.

  37. Arrow::Table#raw_records
    In [3]: records2 = table.raw_records
    records2[0..2]
    Out[3]: [[1, "matz", "3db12e8b236ac8f88db8eb4690d10e4a3b8dbcd4", 1998-01-16 21:13:05 +0900],
    [2, "aamine", "6f4751f5f63d177534f34e0c278967883e11c035", 1999-12-18 00:00:13 +0900],
    [3, "shugo", "6d36661c5eb87831cc60fa416470c8459d2a16c0", 1999-12-20 12:27:48 +0900]]
    In [4]: records == records2
    Out[4]: true
    The result of Arrow::Table#raw_records is an array of records.
    This equals the result of Mysql2::Result#to_a.

  38. Execution time
    [Chart: execution time [sec] vs. total number of records fetched from DB]

  39. Ratio of Arrow / Original
    [Chart: execution-time ratio of Arrow / Original]
    Actual batch sizes:
    Total records  Batch size
    1000           42
    2000           84
    3000           125
    5000           209
    10000          417
    20000          834
    30000          1250
    50000          2084

  40. Maximum RSS
    [Chart: maximum RSS [MB] vs. total number of records fetched from DB]

  41. Ratio of Arrow / Original
    [Chart: maximum-RSS ratio of Arrow / Original]
    Actual batch sizes:
    Total records  Batch size
    1000           42
    2000           84
    3000           125
    5000           209
    10000          417
    20000          834
    30000          1250
    50000          2084

  42. Best batch sizes for total records
    [Chart: best batch size for each total record count, plotted on better↔worse axes]
    Actual batch sizes:
    Total records  Batch size
    1000           42
    2000           84
    3000           125
    5000           209
    10000          417
    20000          834
    30000          1250
    50000          2084
    ** This may depend on the machine spec **

  43. Breaking down the execution time
    Original version: SQL → query (Mysql2::Client#query) → Mysql2::Result → fetch (Mysql2::Result#to_a) → AR::Result → cast_values (AR::Result#cast_values) → Array
    Arrow version: SQL → query (Mysql2::Client#query) → Mysql2::Result → fetch (Mysql2::Result#to_arrow) → Arrow::Table → cast_values (Arrow::Table#raw_records) → Array

  44. Breaking down the execution time
    [Chart: execution time [msec] of the data transformation stages (query → fetch → cast_values)]

  45. GIL handling on each stage
    Original version: SQL → query (Mysql2::Client#query) → Mysql2::Result → fetch (Mysql2::Result#to_a) → AR::Result → cast_values (AR::Result#cast_values) → Array
    Arrow version: SQL → query (Mysql2::Client#query) → Mysql2::Result → fetch (Mysql2::Result#to_arrow) → Arrow::Table → cast_values (Arrow::Table#raw_records) → Array

  46. GIL handling on each stage
    Original version: query (Mysql2::Client#query, without GIL) → fetch (Mysql2::Result#to_a, with GIL) → cast_values (AR::Result#cast_values, with GIL)
    Arrow version: query (Mysql2::Client#query, without GIL) → fetch (Mysql2::Result#to_arrow, without GIL) → cast_values (Arrow::Table#raw_records, with GIL)

  47. Using Arrow can release the GIL for a longer time
    • We can run other threads preemptively during the stages from query to fetch if we use the Arrow version
    • We can speculate that if we perform this benchmark in multiple threads, the Arrow version may be faster than the original version
    • Let’s try with 10 threads
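The intuition can be checked with a toy benchmark: sleep, like the blocking I/O inside Mysql2::Client#query, does not hold the GIL, so ten threads can all wait at once (fake_query is a hypothetical stand-in, not a real DB call):

```ruby
require "benchmark"

# Stand-in for a GIL-releasing stage (I/O wait, or Arrow's C++ work).
def fake_query
  sleep 0.1
  [[1, 172], [2, 185]]
end

# Serially, ten queries pay ten waits.
serial = Benchmark.realtime { 10.times { fake_query } }

# On ten threads the waits overlap, because sleep releases the GIL.
threaded = Benchmark.realtime do
  Array.new(10) { Thread.new { fake_query } }.each(&:join)
end

# serial is about 1.0s; threaded is about 0.1s.
```

The Arrow version benefits more from this effect because its fetch stage (to_arrow, running in C++) also releases the GIL, not just the query stage.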

  48. Multi-threaded execution time

  49. Multi-threaded execution time

  50. Breaking down the memory consumption
    • Use the massif tool of valgrind to investigate how the heap is used
    • Visualize it with the massif-visualizer tool
    • Compare the memory consumption between the Arrow and original versions

  51. Massif result of the arrow version
    [Massif chart:]
    • 64.1MB : the size of the builders of column arrays
    • 41.5MB : the size of the resultant Arrow::Table (building the Arrow::Table shrinks memory usage)
    • 43.4MB : the size of the working memory of the MySQL client
    • ≒ 40MB

  52. Massif result of the original version
    [Massif chart: peaks of ≒ 40MB and 43.4MB]

  53. Memory consumption summary
    Memory Consumption   Original   Arrow
    MySQL Client         43MB       43MB
    Array Builders       0          64MB
    Arrow::Table         0          42MB
    Ruby objects         40MB       40MB
    These can vanish if MySQL returns an Arrow::Table.

  54. Summary of Experiments
    • The Arrow version has advantages in both speed and parallelism when processing large amounts of data obtained from the DB
    • These advantages are especially important for batch jobs running on Sidekiq workers
    • The increase in memory consumption is at most 2x
    • The memory consumption can be reduced if MySQL returns an Arrow::Table object instead of a MYSQL_RES object

  55. We should let MySQL support
    Apache Arrow’s data format

  56. Future development plan

  57. Apache Arrow
    • Language-agnostic, open standard in-memory format for column-major data (i.e. data frames)
    • Started in Sep 2015, publicly announced in Feb 2016
    • Defragment data access among different tools and languages
    • Bring together databases and data science communities to
    collaborate on shared computational technologies

  58. We should let MySQL support
    Apache Arrow’s data format

  59. We should let any DBMSs support
    Apache Arrow’s data format

  60. ARROW-3193

  61. ARROW-1106

  62. ARROW-3102

  63. ARROW-3155

  64. We should let any DBMSs support
    Apache Arrow’s data format

  65. Welcome to

    Red Data Tools project

  66. The other perspective

  67. Making Red Arrow a solid
    data frame library

  68. Using data frames in ActiveRecord
    • Replacing AR::Result with Arrow::Table and rewriting the operations of AR::Relation using Arrow’s operations will improve the performance of all operations of ActiveRecord
    • e.g. we can do the following things without issuing a query to the DBMS:
    • We can pick up certain records in an AR::Relation using a SQL statement
    • We can join the records of multiple AR::Relation objects
    • etc.

  69. Making Red Arrow a solid data frame library
    • Making Arrow::Table object like pandas.DataFrame
    • Making Arrow::Column object like pandas.Series
    • etc.
    • We’ve already started this work.

  70. Welcome to

    Red Data Tools project

  71. Conclusions

  72. Conclusion
    • We performed experiments replacing the internal data of AR::Result with an Arrow::Table object
    • This reduced the execution time of the pluck method
    • This improved the parallelizability of the pluck method
    • The memory consumption increased 2x-3x
    • But it can be reduced by letting MySQL support returning results in the Arrow format
