Reducing ActiveRecord memory consumption using Apache Arrow

Slide 1

Slide 1 text

Reducing ActiveRecord memory consumption using Apache Arrow Kenta Murata Speee Inc. 2019-04-20 RubyKaigi 2019

Slide 2

Slide 2 text

Contents • Background • ActiveRecord and the pluck method • Table data structure in Apache Arrow • Experiment and results • Future development plans • Conclusion

Slide 3

Slide 3 text

About myself • name: mrkn • full_name: Kenta Murata • affiliation: Speee Inc. • job: Full-time CRuby committer • hobby: CS, Math, Camera

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

Background

Slide 6

Slide 6 text

I found Apache Arrow in 2016

Slide 7

Slide 7 text

Apache Arrow • Language-agnostic, open standard in-memory format for column- major data (i.e. data frames) • Started in Sep 2015, publicly announced in Feb 2016 • Defragment data access among different tools and languages • Bring together databases and data science communities to collaborate on shared computational technologies

Slide 8

Slide 8 text

Before and After • Each system has own internal memory format • 80% computation wasted on serialization & deserialization • Similar functionality implemented in multiple projects • All systems utilize the same memory format • No overhead for cross-system communication • Projects can share functionality Without Arrow (Now) With Arrow (Future) https://arrow.apache.org/

Slide 9

Slide 9 text

2016

Slide 10

Slide 10 text

2018 • I started to join to the development of Apache Arrow • Apache Arrow was ready to use from Ruby using red-arrow.gem at that time • It’s thanks to Kouhei Suto • The purpose is for two goals: 1. Making sparse tensor library for Ruby and letting it interoperable with other system such as SciPy stack on Python 2. Making data frame library for Ruby using Apache Arrow’s data structures and operations • I speculated that Apache Arrow may improve the performance of the application of ActiveRecord as I investigated the implementation of Apache Arrow C++ library

Slide 11

Slide 11 text

How does AR manage the result of query MYSQL_RES Result AR::Relation Extracted mysqld • fields : An array of the information of columns • rows : An array of the records in a result from DBMS AR::Result fields rows

Slide 12

Slide 12 text

Use Apache Arrow in AR::Result MYSQL_RES Result AR::Relation Extracted mysqld • fields : An array of the information of columns • columns : An array of the column arrays in a result from DBMS Arrow::Table AR::Result fields columns

Slide 13

Slide 13 text

Apache Arrow’s beneﬁt • Apache Arrow uses column-major layout • It has an advantage on the cache efficiency and SIMD-instruction applicability for scanning columns to filter records • The memory consumption can be much efficient than row-major layout • Almost all operations are implemented in C++ so they are faster than things written in Ruby • It is much parallelizable than things written in Ruby because these operations can perform with releasing RubyVM GIL

Slide 14

Slide 14 text

Row-major vs Column-major id name height weight 1 A 172 67 2 B 185 78 3 C 166 60 4 D 158 62 5 E 190 75 6 F 178 69 1 A 172 67 2 B 185 78 3 C 166 60 4 D 158 62 5 E 190 75 6 F 178 69 1 2 3 4 5 6 A B C D E F 172 185 166 158 190 178 67 78 60 62 75 69 Row-major layout Column-major layout

Slide 15

Slide 15 text

Selecting height ≥ 170 id name height weight 1 A 172 67 2 B 185 78 3 C 166 60 4 D 158 62 5 E 190 75 6 F 178 69 1 A 172 67 2 B 185 78 3 C 166 60 4 D 158 62 5 E 190 75 6 F 178 69 1 2 3 4 5 6 A B C D E F 172 185 166 158 190 178 67 78 60 62 75 69 Row-major layout Column-major layout ↑Need to scan the height column Low efficiency High efficiency

Slide 16

Slide 16 text

Slide 17

Slide 17 text

I performed a PoC experiment • Making rough implementation of a new subclass of AR::Result, called ArrowResult class, that uses Arrow::Table • Making rough implementation of the new connection adapter that generates ArrowResult object as the result of a query • Comparing execution time and memory consumption of pluck method between the original and the Arrow versions

Slide 18

Slide 18 text

Why pluck? • pluck method is simple • It just converts AR::Result to Array of Array • It is easy to understand • It is easy to appeal the superiority of Apache Arrow

Slide 19

Slide 19 text

The result of PoC • I already reported at Rails Developer Meetup on 22 Mar 2019: • The execution time was not increased • The memory consumption was EXTREMELY reduced!! • It’s amazing!!! I didn’t expected such the result. But …

Slide 20

Slide 20 text

The PoC Code was Broken!!!

Slide 21

Slide 21 text

The reported result is wrong

Slide 22

Slide 22 text

I ﬁxed the code and perform the experiments again for today

Slide 23

Slide 23 text

The new right result • The execution time was much reduced • The memory consumption was increased • And the additional results: • pluck method got highly parallelizable than before

Slide 24

Slide 24 text

ActiveRecord and the pluck method

Slide 25

Slide 25 text

ActiveRecord classes and modules AR::ConnectionAdapter Mysql2::Client, Pg::Connection, etc. AR::Relation AR::Base creates query AR::Result query DBMS Mysql2::Result,  PG::Result, etc. query AR::Calculations (module) include pluck is here! result

Slide 26

Slide 26 text

What I tweaked for the experiments AR::ConnectionAdapter Mysql2::Client, Pg::Connection, etc. AR::Relation AR::Base creates query AR::Result query DBMS Mysql2::Result,  PG::Result, etc. query AR::Calculations (module) include result

Slide 27

Slide 27 text

Breaking down pluck method columns a SQL string a Mysql2::Result an AR::Result an Array of records build SQL query fetch cast values args return value these stages occupy almost execution time  and almost consumed memory

Slide 28

Slide 28 text

Focusing 3 stages query fetch cast_values a SQL string a Mysql2::Result an AR::Result an Array of records Mysql2::Client#query Mysql2::Result#to_a AR::Result#cast_values running without GIL (waiting I/O, etc.) => other threads can run simultaneously

Slide 29

Slide 29 text

Table data structure in Apache Arrow

Slide 30

Slide 30 text

Arrow::Table • A Table object consists of a Schema object and a vector of Column objects • A Schema consists of a vector of Field, and a Field consists of its data type and its name • A Column consists of a Field object and a ChunkedArray object • The field of the column at a certain index is same as the field at the same index in the schema • Each column has the same length Schema Table vector ↑Field := (data_type, name) f1 f2 f3 f4 f5 f1 f2 f3 f4 f5 ↑Column := (field, chunked_array) ChunkedArray ChunkedArray ChunkedArray ChunkedArray ChunkedArray same field

Slide 31

Slide 31 text

ChunkedArray • ChunkedArray consists of multiple arrays with the same value type • Each array has a contiguous buffer for storing values • Each array can have different length • It allows us to treat multiple arrays as a single array • It allows us to build a table by divide and conquer method in parallel ChunkedArray a1 a2 a3 a4 a5 type type type type type Buffer Buffer Buffer Buffer Buffer Arrays

Slide 32

Slide 32 text

Experiment and results

Slide 33

Slide 33 text

The Experiment Setting • Comparing the execution time and the memory consumption between the two versions of pluck method: original vs arrow • In the arrow version, the result of the query is divided multiple batches and they are processed in parallel •

Slide 34

Slide 34 text

Table definition and query statement CREATE TABLE IF NOT EXISTS mysql2_test ( null_test VARCHAR(10), bit_test BIT, tiny_int_test TINYINT, small_int_test SMALLINT, medium_int_test MEDIUMINT, int_test INT, big_int_test BIGINT, float_test FLOAT(10,3), float_zero_test FLOAT(10,3), double_test DOUBLE(10,3), decimal_test DECIMAL(10,3), decimal_zero_test DECIMAL(10,3), date_test DATE, date_time_test DATETIME, timestamp_test TIMESTAMP, time_test TIME, year_test YEAR(4), char_test CHAR(10), varchar_test VARCHAR(10), binary_test BINARY(10), varbinary_test VARBINARY(10), tiny_blob_test TINYBLOB, tiny_text_test TINYTEXT, blob_test BLOB, text_test TEXT, medium_blob_test MEDIUMBLOB, medium_text_test MEDIUMTEXT, long_blob_test LONGBLOB, long_text_test LONGTEXT, enum_test ENUM('val1', 'val2'), set_test SET('val1', 'val2') ) DEFAULT CHARSET=utf8 Mysql2Test.limit(n).pluck( :int_test, :double_test, :varchar_test, :text_test, )

Slide 35

Slide 35 text

Changes of 3 stages query fetch cast_values SQL Mysql::Result AR::Result Array SQL Mysql::Result ArrowResult (Arrow::Table) Array Mysql2::Client#query Mysql2::Client#query Mysql2::Result#to_a Mysql2::Result#to_arrow AR::Result#cast_values Arrow::Table#raw_records Original version Arrow version

Slide 36

Slide 36 text

Mysql2::Result#to_arrow In [1]: client = Mysql2::Client.new(host: 'localhost', username: 'root', database: 'test') result = client.query('select * from ruby_committers', as: :array) records = result.to_a records[0..2] Out[1]: [[1, "matz", "3db12e8b236ac8f88db8eb4690d10e4a3b8dbcd4", 1998-01-16 21:13:05 +0900], [2, "aamine", "6f4751f5f63d177534f34e0c278967883e11c035", 1999-12-18 00:00:13 +0900], [3, "shugo", "6d36661c5eb87831cc60fa416470c8459d2a16c0", 1999-12-20 12:27:48 +0900]] In [2]: result = client.query('select * from ruby_committers', as: :array) table = result.to_arrow Out[2]: # id svn first_commit 0 1 matz 3db12e8b236ac8f88db8eb4690d10e4a3b8dbcd4 1 2 aamine 6f4751f5f63d177534f34e0c278967883e11c035 2 3 shugo 6d36661c5eb87831cc60fa416470c8459d2a16c0 : : : : The result of Mysql2::Result#to_a is an array of records. The result of Mysql2::Result#to_arrow is an Arrow::Table.

Slide 37

Slide 37 text

Arrow::Table#raw_records In [3]: records2 = table.raw_records records2[0..2] Out[3]: [[1, "matz", "3db12e8b236ac8f88db8eb4690d10e4a3b8dbcd4", 1998-01-16 21:13:05 +0900], [2, "aamine", "6f4751f5f63d177534f34e0c278967883e11c035", 1999-12-18 00:00:13 +0900], [3, "shugo", "6d36661c5eb87831cc60fa416470c8459d2a16c0", 1999-12-20 12:27:48 +0900]] In [4]: records == records2 Out[4]: true The result of Arrow::Table#raw_records is an array of records. This equals to the results of Mysql2::Result#to_a.

Slide 38

Slide 38 text

Execution time Execution time [sec] Total number of records fetched from DB

Slide 39

Slide 39 text

Ratio of Arrow / Original Actual batch size for 0 Total records Batch size 1000 42 2000 84 3000 125 5000 209 10000 417 20000 834 30000 1250 50000 2084

Slide 40

Slide 40 text

Maximum RSS Maximum RSS [MB] Total number of records fetched from DB

Slide 41

Slide 41 text

Ratio of Arrow / Original Actual batch size for 0 Total records Batch size 1000 42 2000 84 3000 125 5000 209 10000 417 20000 834 30000 1250 50000 2084

Slide 42

Slide 42 text

Best batch sizes for total records ←Better Worse→ ←Better Worse→ Actual batch size for 0 Total records Batch size 1000 42 2000 84 3000 125 5000 209 10000 417 20000 834 30000 1250 50000 2084 ** This may depends on machine spec **

Slide 43

Slide 43 text

Breaking down the execution time query fetch cast_values SQL Mysql::Result AR::Result Array SQL Mysql::Result Arrow::Table Array Mysql2::Client#query Mysql2::Client#query Mysql2::Result#to_a Mysql2::Result#to_arrow AR::Result#cast_values Arrow::Table#raw_records Original version Arrow version

Slide 44

Slide 44 text

Breaking down the execution time Data transformation stage (query → fetch → cast_values) Execution time [msec]

Slide 45

Slide 45 text

GIL handling on each stage query fetch cast_values SQL Mysql::Result AR::Result Array SQL Mysql::Result Arrow::Table Array Mysql2::Client#query Mysql2::Client#query Mysql2::Result#to_a Mysql2::Result#to_arrow AR::Result#cast_values Arrow::Table#raw_records Original version Arrow version

Slide 46

Slide 46 text

Slide 47

Slide 47 text

Using Arrow can release GIL for longer time • We can run other threads preemptively during the stages from the query to the fetch if we use Arrow version • We can speculate if we perform this benchmark in multiple threads, Arrow version may be faster than the original version • Let’s try with 10 threads

Slide 48

Slide 48 text

Multi-threaded execution time

Slide 49

Slide 49 text

Multi-threaded execution time

Slide 50

Slide 50 text

Breaking down the memory consumption • Use massif tool of valgrind to investigate how heap is used • Visualize it by massif-visualizer tool • Compare the memory consumption between arrow and original versions

Slide 51

Slide 51 text

Massif result of the arrow version 64.1MB 41.5MB The size of the resultant Arrow::Table The size of the builders of column arrays Building Arrow::Table shrinks memory usage The size of the working memory of MySQL client 43.4MB ≒ 40MB

Slide 52

Slide 52 text

Massif result of the original version ≒ 40MB 43.4MB

Slide 53

Slide 53 text

Memory consumption summary Memory Consumption Original Arrow MySQL Client 43MB 43MB Array Builders 0 64MB Arrow::Table 0 42MB Ruby objects 40MB 40MB These can be vanished if MySQL returns Array::Table.

Slide 54

Slide 54 text

Summary of Experiments • Arrow version has advantages about both speed and parallelism on the case for processing large amount of data obtained from DB • These advantages are important especially for batch jobs running on Sidekiq workers • The increase of memory consumption is at most x2 • The memory consumption can be reduced if MySQL returns Arrow::Table object instead of MYSQL_RES object

Slide 55

Slide 55 text

We should let MySQL support Apahe Arrow’s data format

Slide 56

Slide 56 text

Future development plan

Slide 57

Slide 57 text

Slide 58

Slide 58 text

We should let MySQL support Apahe Arrow’s data format

Slide 59

Slide 59 text

We should let any DBMSs support Apahe Arrow’s data format

Slide 60

Slide 60 text

ARROW-3193

Slide 61

Slide 61 text

ARROW-1106

Slide 62

Slide 62 text

ARROW-3102

Slide 63

Slide 63 text

ARROW-3155

Slide 64

Slide 64 text

We should let any DBMSs support Apahe Arrow’s data format

Slide 65

Slide 65 text

Welcome to  Red Data Tools project

Slide 66

Slide 66 text

The other perspective

Slide 67

Slide 67 text

Making Red Arrow a solid data frame library

Slide 68

Slide 68 text

Using data frame in ActiveRecord • Replacing AR::Result with Arrow::Table and rewriting the operation of AR::Relation using Arrow’s operation will improve the performance of all operations of Active Record • e.g. we can do the following things without issuing query to DBMS: • We can pick up certain records in an AR::Relation using SQL statement • We can join the records in multiple AR::Relation objects • etc.

Slide 69

Slide 69 text

Making Red Arrow as a solid data frame library • Making Arrow::Table object like pandas.DataFrame • Making Arrow::Column object like pandas.Series • etc. • We’ve already started this work.

Slide 70

Slide 70 text

Welcome to  Red Data Tools project

Slide 71

Slide 71 text

Conclusions

Slide 72

Slide 72 text

Conclusion • We performed experiments to replace the internal data of AR::Result with Arrow::Table object • This reduced the execution time of pluck method • This improved the parallelizability of pluck method • The memory consumption was increased x2-x3 • But it can be reduced by letting MySQL support to return results in Arrow format