
Reducing ActiveRecord memory consumption using Apache Arrow


By using Apache Arrow as the internal data structure of ActiveRecord, the execution time of the pluck method was reduced and the method became more parallelizable.

I also suggested that DBMSs should support the Apache Arrow data format for their query results so that the memory consumption of Active Record is reduced as well.

Kenta Murata

April 20, 2019

Transcript

  1. Contents • Background • ActiveRecord and the pluck method •

    Table data structure in Apache Arrow • Experiment and results • Future development plans • Conclusion
  2. About myself • name: mrkn • full_name: Kenta Murata •

    affiliation: Speee Inc. • job: Full-time CRuby committer • hobby: CS, Math, Camera
  3. Apache Arrow • Language-agnostic, open standard in-memory format for column-

    major data (i.e. data frames) • Started in Sep 2015, publicly announced in Feb 2016 • Defragment data access among different tools and languages • Bring together databases and data science communities to collaborate on shared computational technologies
  4. Before and After • Without Arrow (Now): each system has its own internal memory format • 80% of computation is wasted on serialization & deserialization • similar functionality is implemented in multiple projects • With Arrow (Future): all systems utilize the same memory format • no overhead for cross-system communication • projects can share functionality https://arrow.apache.org/
  5. 2018 • I joined the development of Apache Arrow • Apache Arrow was already usable from Ruby via red-arrow.gem at that time • Thanks to Kouhei Suto • I had two goals: 1. Making a sparse tensor library for Ruby and making it interoperable with other systems such as the SciPy stack on Python 2. Making a data frame library for Ruby using Apache Arrow's data structures and operations • While investigating the implementation of the Apache Arrow C++ library, I speculated that Apache Arrow might improve the performance of applications using ActiveRecord
  6. How does AR manage the result of a query • mysqld returns a MYSQL_RES; the result is extracted into an AR::Result for the AR::Relation • fields: an array of the information of the columns • rows: an array of the records in a result from the DBMS
  7. Use Apache Arrow in AR::Result • mysqld returns a MYSQL_RES; the result is extracted into an Arrow::Table held by the AR::Result • fields: an array of the information of the columns • columns: an array of the column arrays in a result from the DBMS
  8. Apache Arrow's benefit • Apache Arrow uses a column-major layout • It has an advantage in cache efficiency and SIMD-instruction applicability when scanning columns to filter records • The memory consumption can be much more efficient than with a row-major layout • Almost all operations are implemented in C++, so they are faster than the same things written in Ruby • It is much more parallelizable than code written in Ruby because these operations can run while releasing the RubyVM GIL
  9. Row-major vs Column-major

    id  name  height  weight
    1   A     172     67
    2   B     185     78
    3   C     166     60
    4   D     158     62
    5   E     190     75
    6   F     178     69

    Row-major layout: records stored one after another (1 A 172 67 | 2 B 185 78 | 3 C 166 60 | 4 D 158 62 | 5 E 190 75 | 6 F 178 69)
    Column-major layout: each column stored contiguously (1 2 3 4 5 6 | A B C D E F | 172 185 166 158 190 178 | 67 78 60 62 75 69)
  10. Selecting height ≥ 170 • Only the height column needs to be scanned • Row-major layout: low efficiency (the scan has to stride over every record) • Column-major layout: high efficiency (the height values are contiguous) (a small sketch follows below)
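As a minimal illustration (plain Ruby, not part of the talk's code), the same six records can be held in either layout; with the column-major layout, selecting height ≥ 170 only has to scan the height array:

    # Row-major: an array of records
    rows = [
      [1, "A", 172, 67], [2, "B", 185, 78], [3, "C", 166, 60],
      [4, "D", 158, 62], [5, "E", 190, 75], [6, "F", 178, 69],
    ]

    # Column-major: one contiguous array per column
    columns = {
      id:     [1, 2, 3, 4, 5, 6],
      name:   %w[A B C D E F],
      height: [172, 185, 166, 158, 190, 178],
      weight: [67, 78, 60, 62, 75, 69],
    }

    # Row-major selection touches every field of every record ...
    tall_rows = rows.select { |_id, _name, height, _weight| height >= 170 }
    # ... while column-major selection scans only the height column.
    tall_indexes = columns[:height].each_index.select { |i| columns[:height][i] >= 170 }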
  11. Apache Arrow's benefit • Apache Arrow uses a column-major layout • It has an advantage in cache efficiency and SIMD-instruction applicability when scanning columns to filter records • The memory consumption can be much more efficient than with a row-major layout • Almost all operations are implemented in C++, so they are faster than the same things written in Ruby • It is much more parallelizable than code written in Ruby because these operations can run while releasing the RubyVM GIL
  12. I performed a PoC experiment • Made a rough implementation of a new subclass of AR::Result, called ArrowResult, that uses Arrow::Table • Made a rough implementation of a new connection adapter that generates an ArrowResult object as the result of a query • Compared the execution time and memory consumption of the pluck method between the original and Arrow versions (a rough sketch of the ArrowResult idea follows below)
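For illustration, here is a rough sketch of what such an ArrowResult subclass could look like (the class name is from the slide; the constructor and method bodies are my assumptions based on slides 24-26, not the actual PoC code):

    # Hypothetical sketch only, not the actual PoC implementation.
    class ArrowResult < ActiveRecord::Result
      # Hold an Arrow::Table instead of an array of row arrays.
      def initialize(fields, arrow_table)
        @arrow_table = arrow_table
        super(fields, [])  # rows are materialized from the table only when needed
      end

      # Slide 24: Arrow::Table#raw_records takes over the role of cast_values.
      def cast_values(type_overrides = {})
        @arrow_table.raw_records
      end
    end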
  13. Why pluck? • The pluck method is simple • It just converts an AR::Result to an Array of Arrays • It is easy to understand • It makes it easy to demonstrate the superiority of Apache Arrow (see the example below)
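For reference, a typical pluck call on the experiment's model looks like this (the returned values are illustrative):

    # pluck issues the query and returns an Array of Arrays, no model objects:
    Mysql2Test.limit(3).pluck(:int_test, :varchar_test)
    # => [[1, "foo"], [2, "bar"], [3, "baz"]]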
  14. The result of the PoC • I already reported it at Rails Developer Meetup on 22 Mar 2019: • The execution time was not increased • The memory consumption was EXTREMELY reduced!! • It's amazing!!! I didn't expect such a result. But …
  15. The new, corrected result • The execution time was much reduced • The memory consumption was increased • And the additional result: • The pluck method became much more parallelizable than before
  16. ActiveRecord classes and modules • AR::Base creates an AR::Relation • AR::Relation includes the AR::Calculations module (pluck is here!) • The query goes through the AR::ConnectionAdapter (Mysql2::Client, Pg::Connection, etc.) to the DBMS • The DBMS returns a Mysql2::Result, PG::Result, etc., and the result is wrapped as an AR::Result
  17. What I tweaked for the experiments • The same diagram: AR::Base creates an AR::Relation, which includes the AR::Calculations module • The query goes through the AR::ConnectionAdapter (Mysql2::Client, Pg::Connection, etc.) to the DBMS, and the Mysql2::Result, PG::Result, etc. is wrapped as an AR::Result • The parts tweaked for the experiments are the connection adapter and AR::Result (see slide 12)
  18. Breaking down the pluck method • columns/args → build SQL query → a SQL string → query → a Mysql2::Result → fetch → an AR::Result → cast values → an Array of records (the return value) • These stages account for almost all of the execution time and almost all of the consumed memory
  19. Focusing on 3 stages: query → fetch → cast_values • a SQL string —Mysql2::Client#query→ a Mysql2::Result —Mysql2::Result#to_a→ an AR::Result —AR::Result#cast_values→ an Array of records • The query stage runs without the GIL (waiting on I/O, etc.) => other threads can run simultaneously (see the sketch below)
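Condensed into plain Ruby, the three stages of the original version look roughly like this (a simplified sketch using the method names from the slide, not how Rails wires them internally):

    client = Mysql2::Client.new(host: 'localhost', username: 'root', database: 'test')
    sql    = 'SELECT int_test, double_test, varchar_test, text_test FROM mysql2_test LIMIT 10000'

    mysql2_result = client.query(sql, as: :array)                # query: waits on I/O
    ar_result     = ActiveRecord::Result.new(mysql2_result.fields,
                                             mysql2_result.to_a) # fetch: builds Ruby rows
    records       = ar_result.cast_values                        # cast_values: Ruby-level type casting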
  20. Arrow::Table • A Table object consists of a Schema object and a vector of Column objects • A Schema consists of a vector of Fields, and a Field consists of its data type and its name (Field := (data_type, name)) • A Column consists of a Field object and a ChunkedArray object (Column := (field, chunked_array)) • The field of the column at a certain index is the same as the field at the same index in the schema • Each column has the same length
  21. ChunkedArray • A ChunkedArray consists of multiple arrays with the same value type • Each array has a contiguous buffer for storing values • Each array can have a different length • It allows us to treat multiple arrays as a single array • It allows us to build a table by a divide-and-conquer approach in parallel (a sketch follows below)
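A small Red Arrow sketch of the idea (the constructor forms shown are as I understand the red-arrow gem and may differ between versions):

    require 'arrow'

    # Two chunks with the same value type but different lengths ...
    chunk1  = Arrow::Int32Array.new([172, 185, 166])
    chunk2  = Arrow::Int32Array.new([158, 190, 178])
    heights = Arrow::ChunkedArray.new([chunk1, chunk2])  # ... treated as one 6-element array

    # A column of a table is then a (field, chunked_array) pair; each chunk could
    # have been built independently (e.g. in parallel) and combined here.
    table = Arrow::Table.new('height' => heights)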
  22. The experiment setting • Comparing the execution time and the memory consumption between the two versions of the pluck method: original vs arrow • In the arrow version, the result of the query is divided into multiple batches and they are processed in parallel
  23. Table definition and query statement CREATE TABLE IF NOT EXISTS

    mysql2_test ( null_test VARCHAR(10), bit_test BIT, tiny_int_test TINYINT, small_int_test SMALLINT, medium_int_test MEDIUMINT, int_test INT, big_int_test BIGINT, float_test FLOAT(10,3), float_zero_test FLOAT(10,3), double_test DOUBLE(10,3), decimal_test DECIMAL(10,3), decimal_zero_test DECIMAL(10,3), date_test DATE, date_time_test DATETIME, timestamp_test TIMESTAMP, time_test TIME, year_test YEAR(4), char_test CHAR(10), varchar_test VARCHAR(10), binary_test BINARY(10), varbinary_test VARBINARY(10), tiny_blob_test TINYBLOB, tiny_text_test TINYTEXT, blob_test BLOB, text_test TEXT, medium_blob_test MEDIUMBLOB, medium_text_test MEDIUMTEXT, long_blob_test LONGBLOB, long_text_test LONGTEXT, enum_test ENUM('val1', 'val2'), set_test SET('val1', 'val2') ) DEFAULT CHARSET=utf8 Mysql2Test.limit(n).pluck( :int_test, :double_test, :varchar_test, :text_test, )
  24. Changes of 3 stages: query → fetch → cast_values • Original version: SQL —Mysql2::Client#query→ Mysql2::Result —Mysql2::Result#to_a→ AR::Result —AR::Result#cast_values→ Array • Arrow version: SQL —Mysql2::Client#query→ Mysql2::Result —Mysql2::Result#to_arrow→ ArrowResult (Arrow::Table) —Arrow::Table#raw_records→ Array
  25. Mysql2::Result#to_arrow

    In [1]: client = Mysql2::Client.new(host: 'localhost', username: 'root', database: 'test')
            result = client.query('select * from ruby_committers', as: :array)
            records = result.to_a
            records[0..2]
    Out[1]: [[1, "matz", "3db12e8b236ac8f88db8eb4690d10e4a3b8dbcd4", 1998-01-16 21:13:05 +0900],
             [2, "aamine", "6f4751f5f63d177534f34e0c278967883e11c035", 1999-12-18 00:00:13 +0900],
             [3, "shugo", "6d36661c5eb87831cc60fa416470c8459d2a16c0", 1999-12-20 12:27:48 +0900]]

    In [2]: result = client.query('select * from ruby_committers', as: :array)
            table = result.to_arrow
    Out[2]: #<Arrow::Table:0x7fb850b2beb0 ptr=0x7fb8512b9f10>
               id  svn     first_commit
            0  1   matz    3db12e8b236ac8f88db8eb4690d10e4a3b8dbcd4
            1  2   aamine  6f4751f5f63d177534f34e0c278967883e11c035
            2  3   shugo   6d36661c5eb87831cc60fa416470c8459d2a16c0
            :  :   :       :

    The result of Mysql2::Result#to_a is an array of records. The result of Mysql2::Result#to_arrow is an Arrow::Table.
  26. Arrow::Table#raw_records

    In [3]: records2 = table.raw_records
            records2[0..2]
    Out[3]: [[1, "matz", "3db12e8b236ac8f88db8eb4690d10e4a3b8dbcd4", 1998-01-16 21:13:05 +0900],
             [2, "aamine", "6f4751f5f63d177534f34e0c278967883e11c035", 1999-12-18 00:00:13 +0900],
             [3, "shugo", "6d36661c5eb87831cc60fa416470c8459d2a16c0", 1999-12-20 12:27:48 +0900]]

    In [4]: records == records2
    Out[4]: true

    The result of Arrow::Table#raw_records is an array of records. It equals the result of Mysql2::Result#to_a.
  27. Ratio of Arrow / Original [chart] • Actual batch size for 0:

    Total records | Batch size
    1000          | 42
    2000          | 84
    3000          | 125
    5000          | 209
    10000         | 417
    20000         | 834
    30000         | 1250
    50000         | 2084
  28. Ratio of Arrow / Original [chart] • Actual batch size for 0:

    Total records | Batch size
    1000          | 42
    2000          | 84
    3000          | 125
    5000          | 209
    10000         | 417
    20000         | 834
    30000         | 1250
    50000         | 2084
  29. Best batch sizes for total records [charts: better vs worse batch sizes] • Actual batch size for 0:

    Total records | Batch size
    1000          | 42
    2000          | 84
    3000          | 125
    5000          | 209
    10000         | 417
    20000         | 834
    30000         | 1250
    50000         | 2084

    ** This may depend on machine spec **
  30. Breaking down the execution time: query → fetch → cast_values • Original version: SQL —Mysql2::Client#query→ Mysql2::Result —Mysql2::Result#to_a→ AR::Result —AR::Result#cast_values→ Array • Arrow version: SQL —Mysql2::Client#query→ Mysql2::Result —Mysql2::Result#to_arrow→ Arrow::Table —Arrow::Table#raw_records→ Array
  31. Breaking down the execution time • [Chart: execution time [msec] of the data transformation stages (query → fetch → cast_values)]
  32. GIL handling on each stage: query → fetch → cast_values • Original version: SQL —Mysql2::Client#query→ Mysql2::Result —Mysql2::Result#to_a→ AR::Result —AR::Result#cast_values→ Array • Arrow version: SQL —Mysql2::Client#query→ Mysql2::Result —Mysql2::Result#to_arrow→ Arrow::Table —Arrow::Table#raw_records→ Array
  33. GIL handling on each stage: query → fetch → cast_values • Original version: Mysql2::Client#query (without GIL) → Mysql2::Result#to_a (with GIL) → AR::Result#cast_values (with GIL) • Arrow version: Mysql2::Client#query (without GIL) → Mysql2::Result#to_arrow (without GIL) → Arrow::Table#raw_records (with GIL)
  34. Using Arrow can release the GIL for a longer time • With the Arrow version, other threads can run preemptively during the stages from query through fetch • We can speculate that if we run this benchmark in multiple threads, the Arrow version may be faster than the original version • Let's try with 10 threads (a sketch follows below)
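A rough sketch of the suggested multi-threaded run (not the talk's actual benchmark code; it assumes the ActiveRecord connection pool is large enough for 10 threads):

    require 'benchmark'

    elapsed = Benchmark.realtime do
      threads = 10.times.map do
        Thread.new do
          # With the Arrow version, the query and fetch stages release the GIL,
          # so these threads can overlap instead of running one at a time.
          Mysql2Test.limit(50_000).pluck(:int_test, :double_test, :varchar_test, :text_test)
        end
      end
      threads.each(&:join)
    end
    puts format('10 threads: %d msec', elapsed * 1000)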
  35. Breaking down the memory consumption • Use the massif tool of valgrind to investigate how the heap is used • Visualize it with the massif-visualizer tool • Compare the memory consumption between the arrow and original versions
  36. Massif result of the arrow version • The size of the builders of the column arrays: 64.1MB • The size of the resultant Arrow::Table: 41.5MB (building the Arrow::Table shrinks memory usage) • The size of the working memory of the MySQL client: 43.4MB ≒ 40MB
  37. Memory consumption summary

    Memory consumption | Original | Arrow
    MySQL client       | 43MB     | 43MB
    Array builders     | 0        | 64MB
    Arrow::Table       | 0        | 42MB
    Ruby objects       | 40MB     | 40MB

    These could vanish if MySQL returned an Arrow::Table.
  38. Summary of experiments • The Arrow version has advantages in both speed and parallelism when processing a large amount of data obtained from the DB • These advantages are especially important for batch jobs running on Sidekiq workers • The increase in memory consumption is at most 2x • The memory consumption can be reduced if MySQL returns an Arrow::Table object instead of a MYSQL_RES object
  39. Apache Arrow • Language-agnostic, open standard in-memory format for column-

    major data (i.e. data frames) • Started in Sep 2015, publicly announced in Feb 2016 • Defragment data access among different tools and languages • Bring together databases and data science communities to collaborate on shared computational technologies
  40. Using data frames in ActiveRecord • Replacing AR::Result with Arrow::Table and rewriting the operations of AR::Relation using Arrow's operations will improve the performance of all operations of Active Record • e.g. we can do the following things without issuing a query to the DBMS (see the sketch below): • We can pick up certain records in an AR::Relation, as with a SQL statement • We can join the records in multiple AR::Relation objects • etc.
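As an illustration of the idea, Red Arrow can already filter an in-memory table without touching the DBMS (the slicer-style API shown here is as I understand red-arrow and may differ between versions):

    require 'arrow'

    # Data that has already been fetched into an Arrow::Table ...
    table = Arrow::Table.new('id'     => [1, 2, 3, 4, 5, 6],
                             'height' => [172, 185, 166, 158, 190, 178])

    # ... can be filtered in memory, without issuing another query.
    tall = table.slice { |slicer| slicer.height >= 170 }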
  41. Making Red Arrow a solid data frame library • Making the Arrow::Table object work like pandas.DataFrame • Making the Arrow::Column object work like pandas.Series • etc. • We've already started this work.
  42. Conclusion • We performed experiments replacing the internal data of AR::Result with an Arrow::Table object • This reduced the execution time of the pluck method • This improved the parallelizability of the pluck method • The memory consumption was increased 2x-3x • But it can be reduced by letting MySQL support returning results in the Arrow format