
Reducing ActiveRecord memory consumption using Apache Arrow


By using Apache Arrow as the internal data structure of ActiveRecord, the execution time of the pluck method was reduced and the method became more parallelizable.

I also suggested that DBMSs should support the Apache Arrow data format for their query results so that the memory consumption of Active Record is reduced as well.

Kenta Murata

April 20, 2019

Transcript

  1. Contents • Background • ActiveRecord and the pluck method •

    Table data structure in Apache Arrow • Experiment and results • Future development plans • Conclusion
  2. About myself • name: mrkn • full_name: Kenta Murata •

    affiliation: Speee Inc. • job: Full-time CRuby committer • hobby: CS, Math, Camera
  3. Apache Arrow • Language-agnostic, open standard in-memory format for column-

    major data (i.e. data frames) • Started in Sep 2015, publicly announced in Feb 2016 • Defragment data access among different tools and languages • Bring together databases and data science communities to collaborate on shared computational technologies
  4. Before and After • Without Arrow (Now): each system has its own internal memory format • 80% of computation is wasted on serialization & deserialization • similar functionality is implemented in multiple projects • With Arrow (Future): all systems utilize the same memory format • no overhead for cross-system communication • projects can share functionality https://arrow.apache.org/
  5. 2018 • I joined the development of Apache Arrow • Apache Arrow was already usable from Ruby via red-arrow.gem at that time • Thanks to Kouhei Suto • I had two goals: 1. Making a sparse tensor library for Ruby and making it interoperable with other systems such as the SciPy stack on Python 2. Making a data frame library for Ruby using Apache Arrow's data structures and operations • While investigating the implementation of the Apache Arrow C++ library, I speculated that Apache Arrow might improve the performance of applications using ActiveRecord
  6. How does AR manage the result of a query • mysqld returns a MYSQL_RES; the result is extracted into an AR::Result for the AR::Relation • fields: an array of the information of the columns • rows: an array of the records in a result from the DBMS
  7. Use Apache Arrow in AR::Result • mysqld returns a MYSQL_RES; the result is extracted into an Arrow::Table held by the AR::Result • fields: an array of the information of the columns • columns: an array of the column arrays in a result from the DBMS
  8. Apache Arrow's benefit • Apache Arrow uses a column-major layout • It has an advantage in cache efficiency and SIMD-instruction applicability when scanning columns to filter records • The memory consumption can be much more efficient than with a row-major layout • Almost all operations are implemented in C++, so they are faster than the same things written in Ruby • It is much more parallelizable than code written in Ruby because these operations can run while releasing the RubyVM GIL
  9. Row-major vs Column-major

    id  name  height  weight
    1   A     172     67
    2   B     185     78
    3   C     166     60
    4   D     158     62
    5   E     190     75
    6   F     178     69

    Row-major layout: records stored one after another (1 A 172 67 | 2 B 185 78 | 3 C 166 60 | 4 D 158 62 | 5 E 190 75 | 6 F 178 69)
    Column-major layout: each column stored contiguously (1 2 3 4 5 6 | A B C D E F | 172 185 166 158 190 178 | 67 78 60 62 75 69)
  10. Selecting height ≥ 170 • Only the height column needs to be scanned • Row-major layout: low efficiency (the scan has to stride over every record) • Column-major layout: high efficiency (the height values are contiguous) (a small sketch follows below)
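As a minimal illustration (plain Ruby, not part of the talk's code), the same six records can be held in either layout; with the column-major layout, selecting height ≥ 170 only has to scan the height array:

    # Row-major: an array of records
    rows = [
      [1, "A", 172, 67], [2, "B", 185, 78], [3, "C", 166, 60],
      [4, "D", 158, 62], [5, "E", 190, 75], [6, "F", 178, 69],
    ]

    # Column-major: one contiguous array per column
    columns = {
      id:     [1, 2, 3, 4, 5, 6],
      name:   %w[A B C D E F],
      height: [172, 185, 166, 158, 190, 178],
      weight: [67, 78, 60, 62, 75, 69],
    }

    # Row-major selection touches every field of every record ...
    tall_rows = rows.select { |_id, _name, height, _weight| height >= 170 }
    # ... while column-major selection scans only the height column.
    tall_indexes = columns[:height].each_index.select { |i| columns[:height][i] >= 170 }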
  11. Apache Arrow's benefit • Apache Arrow uses a column-major layout • It has an advantage in cache efficiency and SIMD-instruction applicability when scanning columns to filter records • The memory consumption can be much more efficient than with a row-major layout • Almost all operations are implemented in C++, so they are faster than the same things written in Ruby • It is much more parallelizable than code written in Ruby because these operations can run while releasing the RubyVM GIL
  12. I performed a PoC experiment • Made a rough implementation of a new subclass of AR::Result, called ArrowResult, that uses Arrow::Table • Made a rough implementation of a new connection adapter that generates an ArrowResult object as the result of a query • Compared the execution time and memory consumption of the pluck method between the original and Arrow versions (a rough sketch of the ArrowResult idea follows below)
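For illustration, here is a rough sketch of what such an ArrowResult subclass could look like (the class name is from the slide; the constructor and method bodies are my assumptions based on slides 24-26, not the actual PoC code):

    # Hypothetical sketch only, not the actual PoC implementation.
    class ArrowResult < ActiveRecord::Result
      # Hold an Arrow::Table instead of an array of row arrays.
      def initialize(fields, arrow_table)
        @arrow_table = arrow_table
        super(fields, [])  # rows are materialized from the table only when needed
      end

      # Slide 24: Arrow::Table#raw_records takes over the role of cast_values.
      def cast_values(type_overrides = {})
        @arrow_table.raw_records
      end
    end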
  13. Why pluck? • The pluck method is simple • It just converts an AR::Result to an Array of Arrays • It is easy to understand • It makes it easy to demonstrate the superiority of Apache Arrow (see the example below)
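For reference, a typical pluck call on the experiment's model looks like this (the returned values are illustrative):

    # pluck issues the query and returns an Array of Arrays, no model objects:
    Mysql2Test.limit(3).pluck(:int_test, :varchar_test)
    # => [[1, "foo"], [2, "bar"], [3, "baz"]]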
  14. The result of the PoC • I already reported it at Rails Developer Meetup on 22 Mar 2019: • The execution time was not increased • The memory consumption was EXTREMELY reduced!! • It's amazing!!! I didn't expect such a result. But …
  15. The new, corrected result • The execution time was much reduced • The memory consumption was increased • And the additional result: • The pluck method became much more parallelizable than before
  16. ActiveRecord classes and modules • AR::Base creates an AR::Relation • AR::Relation includes the AR::Calculations module (pluck is here!) • The query goes through the AR::ConnectionAdapter (Mysql2::Client, Pg::Connection, etc.) to the DBMS • The DBMS returns a Mysql2::Result, PG::Result, etc., and the result is wrapped as an AR::Result
  17. What I tweaked for the experiments • The same diagram: AR::Base creates an AR::Relation, which includes the AR::Calculations module • The query goes through the AR::ConnectionAdapter (Mysql2::Client, Pg::Connection, etc.) to the DBMS, and the Mysql2::Result, PG::Result, etc. is wrapped as an AR::Result • The parts tweaked for the experiments are the connection adapter and AR::Result (see slide 12)
  18. Breaking down the pluck method • columns/args → build SQL query → a SQL string → query → a Mysql2::Result → fetch → an AR::Result → cast values → an Array of records (the return value) • These stages account for almost all of the execution time and almost all of the consumed memory
  19. Focusing on 3 stages: query → fetch → cast_values • a SQL string —Mysql2::Client#query→ a Mysql2::Result —Mysql2::Result#to_a→ an AR::Result —AR::Result#cast_values→ an Array of records • The query stage runs without the GIL (waiting on I/O, etc.) => other threads can run simultaneously (see the sketch below)
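Condensed into plain Ruby, the three stages of the original version look roughly like this (a simplified sketch using the method names from the slide, not how Rails wires them internally):

    client = Mysql2::Client.new(host: 'localhost', username: 'root', database: 'test')
    sql    = 'SELECT int_test, double_test, varchar_test, text_test FROM mysql2_test LIMIT 10000'

    mysql2_result = client.query(sql, as: :array)                # query: waits on I/O
    ar_result     = ActiveRecord::Result.new(mysql2_result.fields,
                                             mysql2_result.to_a) # fetch: builds Ruby rows
    records       = ar_result.cast_values                        # cast_values: Ruby-level type casting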
  20. Arrow::Table • A Table object consists of a Schema object and a vector of Column objects • A Schema consists of a vector of Fields, and a Field consists of its data type and its name (Field := (data_type, name)) • A Column consists of a Field object and a ChunkedArray object (Column := (field, chunked_array)) • The field of the column at a certain index is the same as the field at the same index in the schema • Each column has the same length
  21. ChunkedArray • A ChunkedArray consists of multiple arrays with the same value type • Each array has a contiguous buffer for storing values • Each array can have a different length • It allows us to treat multiple arrays as a single array • It allows us to build a table by a divide-and-conquer approach in parallel (a sketch follows below)
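A small Red Arrow sketch of the idea (the constructor forms shown are as I understand the red-arrow gem and may differ between versions):

    require 'arrow'

    # Two chunks with the same value type but different lengths ...
    chunk1  = Arrow::Int32Array.new([172, 185, 166])
    chunk2  = Arrow::Int32Array.new([158, 190, 178])
    heights = Arrow::ChunkedArray.new([chunk1, chunk2])  # ... treated as one 6-element array

    # A column of a table is then a (field, chunked_array) pair; each chunk could
    # have been built independently (e.g. in parallel) and combined here.
    table = Arrow::Table.new('height' => heights)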
  22. The experiment setting • Comparing the execution time and the memory consumption between the two versions of the pluck method: original vs arrow • In the arrow version, the result of the query is divided into multiple batches and they are processed in parallel
  23. Table definition and query statement CREATE TABLE IF NOT EXISTS

    mysql2_test ( null_test VARCHAR(10), bit_test BIT, tiny_int_test TINYINT, small_int_test SMALLINT, medium_int_test MEDIUMINT, int_test INT, big_int_test BIGINT, float_test FLOAT(10,3), float_zero_test FLOAT(10,3), double_test DOUBLE(10,3), decimal_test DECIMAL(10,3), decimal_zero_test DECIMAL(10,3), date_test DATE, date_time_test DATETIME, timestamp_test TIMESTAMP, time_test TIME, year_test YEAR(4), char_test CHAR(10), varchar_test VARCHAR(10), binary_test BINARY(10), varbinary_test VARBINARY(10), tiny_blob_test TINYBLOB, tiny_text_test TINYTEXT, blob_test BLOB, text_test TEXT, medium_blob_test MEDIUMBLOB, medium_text_test MEDIUMTEXT, long_blob_test LONGBLOB, long_text_test LONGTEXT, enum_test ENUM('val1', 'val2'), set_test SET('val1', 'val2') ) DEFAULT CHARSET=utf8 Mysql2Test.limit(n).pluck( :int_test, :double_test, :varchar_test, :text_test, )
  24. Changes of 3 stages: query → fetch → cast_values • Original version: SQL —Mysql2::Client#query→ Mysql2::Result —Mysql2::Result#to_a→ AR::Result —AR::Result#cast_values→ Array • Arrow version: SQL —Mysql2::Client#query→ Mysql2::Result —Mysql2::Result#to_arrow→ ArrowResult (Arrow::Table) —Arrow::Table#raw_records→ Array
  25. Mysql2::Result#to_arrow

    In [1]: client = Mysql2::Client.new(host: 'localhost', username: 'root', database: 'test')
            result = client.query('select * from ruby_committers', as: :array)
            records = result.to_a
            records[0..2]
    Out[1]: [[1, "matz", "3db12e8b236ac8f88db8eb4690d10e4a3b8dbcd4", 1998-01-16 21:13:05 +0900],
             [2, "aamine", "6f4751f5f63d177534f34e0c278967883e11c035", 1999-12-18 00:00:13 +0900],
             [3, "shugo", "6d36661c5eb87831cc60fa416470c8459d2a16c0", 1999-12-20 12:27:48 +0900]]

    In [2]: result = client.query('select * from ruby_committers', as: :array)
            table = result.to_arrow
    Out[2]: #<Arrow::Table:0x7fb850b2beb0 ptr=0x7fb8512b9f10>
               id  svn     first_commit
            0  1   matz    3db12e8b236ac8f88db8eb4690d10e4a3b8dbcd4
            1  2   aamine  6f4751f5f63d177534f34e0c278967883e11c035
            2  3   shugo   6d36661c5eb87831cc60fa416470c8459d2a16c0
            :  :   :       :

    The result of Mysql2::Result#to_a is an array of records. The result of Mysql2::Result#to_arrow is an Arrow::Table.
  26. Arrow::Table#raw_records

    In [3]: records2 = table.raw_records
            records2[0..2]
    Out[3]: [[1, "matz", "3db12e8b236ac8f88db8eb4690d10e4a3b8dbcd4", 1998-01-16 21:13:05 +0900],
             [2, "aamine", "6f4751f5f63d177534f34e0c278967883e11c035", 1999-12-18 00:00:13 +0900],
             [3, "shugo", "6d36661c5eb87831cc60fa416470c8459d2a16c0", 1999-12-20 12:27:48 +0900]]

    In [4]: records == records2
    Out[4]: true

    The result of Arrow::Table#raw_records is an array of records. It equals the result of Mysql2::Result#to_a.
  27. Ratio of Arrow / Original [chart] • Actual batch size for 0:

    Total records | Batch size
    1000          | 42
    2000          | 84
    3000          | 125
    5000          | 209
    10000         | 417
    20000         | 834
    30000         | 1250
    50000         | 2084
  28. Ratio of Arrow / Original [chart] • Actual batch size for 0:

    Total records | Batch size
    1000          | 42
    2000          | 84
    3000          | 125
    5000          | 209
    10000         | 417
    20000         | 834
    30000         | 1250
    50000         | 2084
  29. Best batch sizes for total records [charts: better vs worse batch sizes] • Actual batch size for 0:

    Total records | Batch size
    1000          | 42
    2000          | 84
    3000          | 125
    5000          | 209
    10000         | 417
    20000         | 834
    30000         | 1250
    50000         | 2084

    ** This may depend on machine spec **
  30. Breaking down the execution time: query → fetch → cast_values • Original version: SQL —Mysql2::Client#query→ Mysql2::Result —Mysql2::Result#to_a→ AR::Result —AR::Result#cast_values→ Array • Arrow version: SQL —Mysql2::Client#query→ Mysql2::Result —Mysql2::Result#to_arrow→ Arrow::Table —Arrow::Table#raw_records→ Array
  31. Breaking down the execution time • [Chart: execution time [msec] of the data transformation stages (query → fetch → cast_values)]
  32. GIL handling on each stage: query → fetch → cast_values • Original version: SQL —Mysql2::Client#query→ Mysql2::Result —Mysql2::Result#to_a→ AR::Result —AR::Result#cast_values→ Array • Arrow version: SQL —Mysql2::Client#query→ Mysql2::Result —Mysql2::Result#to_arrow→ Arrow::Table —Arrow::Table#raw_records→ Array
  33. GIL handling on each stage: query → fetch → cast_values • Original version: Mysql2::Client#query (without GIL) → Mysql2::Result#to_a (with GIL) → AR::Result#cast_values (with GIL) • Arrow version: Mysql2::Client#query (without GIL) → Mysql2::Result#to_arrow (without GIL) → Arrow::Table#raw_records (with GIL)
  34. Using Arrow can release the GIL for a longer time • With the Arrow version, other threads can run preemptively during the stages from query through fetch • We can speculate that if we run this benchmark in multiple threads, the Arrow version may be faster than the original version • Let's try with 10 threads (a sketch follows below)
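A rough sketch of the suggested multi-threaded run (not the talk's actual benchmark code; it assumes the ActiveRecord connection pool is large enough for 10 threads):

    require 'benchmark'

    elapsed = Benchmark.realtime do
      threads = 10.times.map do
        Thread.new do
          # With the Arrow version, the query and fetch stages release the GIL,
          # so these threads can overlap instead of running one at a time.
          Mysql2Test.limit(50_000).pluck(:int_test, :double_test, :varchar_test, :text_test)
        end
      end
      threads.each(&:join)
    end
    puts format('10 threads: %d msec', elapsed * 1000)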
  35. Breaking down the memory consumption • Use the massif tool of valgrind to investigate how the heap is used • Visualize it with the massif-visualizer tool • Compare the memory consumption between the arrow and original versions
  36. Massif result of the arrow version • The size of the builders of the column arrays: 64.1MB • The size of the resultant Arrow::Table: 41.5MB (building the Arrow::Table shrinks memory usage) • The size of the working memory of the MySQL client: 43.4MB ≒ 40MB
  37. Memory consumption summary

    Memory consumption | Original | Arrow
    MySQL client       | 43MB     | 43MB
    Array builders     | 0        | 64MB
    Arrow::Table       | 0        | 42MB
    Ruby objects       | 40MB     | 40MB

    These could vanish if MySQL returned an Arrow::Table.
  38. Summary of experiments • The Arrow version has advantages in both speed and parallelism when processing a large amount of data obtained from the DB • These advantages are especially important for batch jobs running on Sidekiq workers • The increase in memory consumption is at most 2x • The memory consumption can be reduced if MySQL returns an Arrow::Table object instead of a MYSQL_RES object
  39. Apache Arrow • Language-agnostic, open standard in-memory format for column-

    major data (i.e. data frames) • Started in Sep 2015, publicly announced in Feb 2016 • Defragment data access among different tools and languages • Bring together databases and data science communities to collaborate on shared computational technologies
  40. Using data frames in ActiveRecord • Replacing AR::Result with Arrow::Table and rewriting the operations of AR::Relation using Arrow's operations will improve the performance of all operations of Active Record • e.g. we can do the following things without issuing a query to the DBMS (see the sketch below): • We can pick up certain records in an AR::Relation, as with a SQL statement • We can join the records in multiple AR::Relation objects • etc.
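As an illustration of the idea, Red Arrow can already filter an in-memory table without touching the DBMS (the slicer-style API shown here is as I understand red-arrow and may differ between versions):

    require 'arrow'

    # Data that has already been fetched into an Arrow::Table ...
    table = Arrow::Table.new('id'     => [1, 2, 3, 4, 5, 6],
                             'height' => [172, 185, 166, 158, 190, 178])

    # ... can be filtered in memory, without issuing another query.
    tall = table.slice { |slicer| slicer.height >= 170 }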
  41. Making Red Arrow a solid data frame library • Making the Arrow::Table object work like pandas.DataFrame • Making the Arrow::Column object work like pandas.Series • etc. • We've already started this work.
  42. Conclusion • We performed experiments replacing the internal data of AR::Result with an Arrow::Table object • This reduced the execution time of the pluck method • This improved the parallelizability of the pluck method • The memory consumption was increased 2x-3x • But it can be reduced by letting MySQL support returning results in the Arrow format