Upgrade to Pro — share decks privately, control downloads, hide ads and more …

ActiveRecordでの大量データとの付き合い方 / How to handle a lot of records with ActiveRecord

ActiveRecordでの大量データとの付き合い方 / How to handle a lot of records with ActiveRecord

ActiveRecordで数十万から数億程度のデータを扱ったときにしたことを社内勉強会で話したスライド

Yuji Hanamura

August 30, 2019
Tweet

More Decks by Yuji Hanamura

Other Decks in Programming

Transcript

  1. 複数件のレコードを取得するコードの例 以下のコードは1⾏⽬の時点ではクエリーは発⾏されない ActiveRecord::Relation ( Account::ActiveRecord_Relation ) のインスタンス が返る 2⾏⽬の accounts.each

    で Account のインスタンスが必要になるのでクエリーが 発⾏される 作成された Account のインスタンスは Account::ActiveRecord_Relation 内 にキャッシュされる 5⾏⽬の accounts.each では Account のインスタンスがキャッシュ済みなのでク エリーは発⾏されない accounts = Account.where(is_active: true) accounts.each do |account| # do something end accounts.each do |account| # do something end 9
  2. find_each 1レコードずつ処理する場合に使⽤する デフォルトでは1000件ずつメモリーにロードしてクラスのインスタンスを1件ずつ ブロックに渡す 例: 1アカウント毎にメール送信 Account.where(is_active: true).find_each do |account|

    account.send_mail_magazine end SELECT "accounts".* FROM "accounts" WHERE "accounts"."is_active" = 1 ORDER BY "accounts"."id" ASC LIMIT 1000 SELECT "accounts".* FROM "accounts" WHERE "accounts"."is_active" = 1 AND "accounts"."id" > 1943 ORDER BY "accounts"."id" ASC LIMIT 1000 12
  3. find_in_batches 複数レコードをまとめて処理したい場合に使⽤する デフォルトでは1000件ずつメモリーにロードしてクラスのインスタンスの配列を ブロックに渡す 例: 1000件ごとにトランザクションかけて処理する Account.where(is_active: true).find_in_batches do |accounts|

    Account.transaction do accounts.each do |account| # do something end end end SELECT "accounts".* FROM "accounts" WHERE "accounts"."is_active" = 1 ORDER BY "accounts"."id" ASC LIMIT 1000 BEGIN ... COMMIT SELECT "accounts".* FROM "accounts" WHERE "accounts"."is_active" = 1 AND "accounts"."id" > 1943 ORDER BY "accounts"."id" ASC LIMIT 1000 BEGIN ... COMMIT 13
  4. in_batches レコード群ごとにクエリーを発⾏したい場合に使⽤する デフォルトで1000件ずつ処理するActiveRecord::Relationをブロックに渡す 例: 1000件ごとにUPDATEクエリーを発⾏する Account.where(is_active: true).in_batches do |account_relation| #

    account_relationはAccount::ActiveRecord_Relationのインスタンス account_relation.update_all(is_active: false, updated_at: Time.current) end SELECT "accounts"."id" FROM "accounts" WHERE "accounts"."is_active" = 1 ORDER BY "accounts"."id" ASC LIMIT 1000 UPDATE "accounts" SET "is_active" = 0, "updated_at" = '2019-08-28 14:57:06.163225' WHERE "accounts"."is_active" = 1 AND "accounts"."id" IN (4, 5, ..., 1942, 1943) SELECT "accounts"."id" FROM "accounts" WHERE "accounts"."is_active" = 1 AND "accounts"."id" > 1943 ORDER BY "accounts"."id" ASC LIMIT 1000 UPDATE "accounts" SET "is_active" = 0, "updated_at" = '2019-08-28 14:57:06.192346' WHERE "accounts"."is_active" = 1 AND "accounts"."id" IN (1946, 1949, ..., 2495, 2500) 14
  5. ActiveRecord::Calculations#pluck を使う pluck を使うことでAccountクラスのインスタンスを作らずにカラムの値だけ読 み込むことができる ⽣SQLとは違い型変換は⾏われる Account.where(is_active: true).in_batches do |accounts|

    accounts.pluck(:name, :email).each do |name, email| # do something end end SELECT "accounts"."id" FROM "accounts" WHERE "accounts"."is_active" = 1 ORDER BY "accounts"."id" ASC LIMIT 1000 SELECT "accounts"."name", "accounts"."email" FROM "accounts" WHERE "accounts"."is_active" = 1 AND "accounts"."id" IN (1, 3, ..., 2056, 2059) SELECT "accounts"."id" FROM "accounts" WHERE "accounts"."is_active" = 1 AND "accounts"."id" > 2059 ORDER BY "accounts"."id" ASC LIMIT 1000 SELECT "accounts"."name", "accounts"."email" FROM "accounts" WHERE "accounts"."is_active" = 1 AND "accounts"."id" IN (2061, 2062, ..., 2499, 2500) 16
  6. pluck と Struct を組み合わせて使う pluck の結果はそのままだと扱いにくい 普通の2次元配列(1次元の場合もあるが) Struct で軽量なクラスを作って使⽤する SimpleAccount

    = Struct.new(:name, :email) Account.where(is_active: true).in_batches do |account_relation| accounts = account_relation.pluck(:name, :email).map! { |columns| SimpleAccount.new(*columns) } accounts.each do |account| # do something end end SELECT "accounts"."id" FROM "accounts" WHERE "accounts"."is_active" = 1 ORDER BY "accounts"."id" ASC LIMIT 1000 SELECT "accounts"."name", "accounts"."email" FROM "accounts" WHERE "accounts"."is_active" = 1 AND "accounts"."id" IN (1, 3, ..., 2056, 2059) SELECT "accounts"."id" FROM "accounts" WHERE "accounts"."is_active" = 1 AND "accounts"."id" > 2059 ORDER BY "accounts"."id" ASC LIMIT 1000 SELECT "accounts"."name", "accounts"."email" FROM "accounts" WHERE "accounts"."is_active" = 1 AND "accounts"."id" IN (2061, 2062, ..., 2499, 2500) 17
  7. なぜ Struct なのか︖ 組み込みライブラリー オブジェクトっぽく扱いたい Hash だとコードを追わないと構造がわかりにくい Hash はtypoに弱い お⼿軽に定義できる

    POROでもいいけどちょっと定義が⾯倒 任意のメソッドを追加できる そこそこ⾼速 似たようなクラスに OpenStruct があるが遅い OpenStruct は Hash 同様にtypoに弱い 18
  8. Bluk insert ActiveRecord-Import gemを使う Rails 6.0からは insert_all や upsert_all など標準で機能が提供されるよう

    になった Account.bulk_import accounts INSERT INTO "accounts" ("id","name","email","is_active","created_at","updated_at","created_by","updated_by") VALUES (NULL,'new account 0','[email protected]',NULL,'2019-08-28 15:05:13.555747','2019-08-28 15:05:13.555747',NULL,NULL), (NULL,'new account 1','[email protected]',NULL,'2019-08-28 15:05:13.555747','2019-08-28 15:05:13.555747',NULL,NULL), (NULL,'new account 2','[email protected]',NULL,'2019-08-28 15:05:13.555747','2019-08-28 15:05:13.555747',NULL,NULL), (NULL,'new account 3','[email protected]',NULL,'2019-08-28 15:05:13.555747','2019-08-28 15:05:13.555747',NULL,NULL), (NULL,'new account 4','[email protected]',NULL,'2019-08-28 15:05:13.555747','2019-08-28 15:05:13.555747',NULL,NULL), (NULL,'new account 5','[email protected]',NULL,'2019-08-28 15:05:13.555747','2019-08-28 15:05:13.555747',NULL,NULL), (NULL,'new account 6','[email protected]',NULL,'2019-08-28 15:05:13.555747','2019-08-28 15:05:13.555747',NULL,NULL), (NULL,'new account 7','[email protected]',NULL,'2019-08-28 15:05:13.555747','2019-08-28 15:05:13.555747',NULL,NULL), (NULL,'new account 8','[email protected]',NULL,'2019-08-28 15:05:13.555747','2019-08-28 15:05:13.555747',NULL,NULL), (NULL,'new account 9','[email protected]',NULL,'2019-08-28 15:05:13.555747','2019-08-28 15:05:13.555747',NULL,NULL) 20
  9. 並列処理 Parallel gemを使う Parallel.each(Account.where(is_active: true).in_batches, in_threads: 10) do |account_relation| ActiveRecord::Base.connection_pool.with_connection

    do account_relation.update_all(is_active: false, updated_at: Time.current) end end SELECT "accounts"."id" FROM "accounts" WHERE "accounts"."is_active" = 1 ORDER BY "accounts"."id" ASC LIMIT 1000 SELECT "accounts"."id" FROM "accounts" WHERE "accounts"."is_active" = 1 AND "accounts"."id" > 2059 ORDER BY "accounts"."id" ASC LIMIT 1000 ... SELECT "accounts"."id" FROM "accounts" WHERE "accounts"."is_active" = 1 AND "accounts"."id" > 22232 ORDER BY "accounts"."id" ASC LIMIT 1000 UPDATE "accounts" SET "is_acive" = 0, "updated_at" = '2019-08-29 15:26:53.659967' WHERE "accounts"."is_active" = 1 AND "accounts"."id" IN (1, 3, ..., 2056, 2059) ... UPDATE "accounts" SET "is_active" = 0, "updated_at" = '2019-08-29 15:27:23.589577' WHERE "accounts"."is_active" = 1 AND "accounts"."id" IN (4185, 4188, ..., 6154, 6158) UPDATE "accounts" SET "is_active" = 0, "updated_at" = '2019-08-29 15:27:23.907770' WHERE "accounts"."is_active" = 1 AND "accounts"."id" IN (20253, 20257, ..., 22231, 22232) 21
  10. 何が起きたのか︖ Parallel.each は第⼀引数を配列に変換して保持する Account::ActiveRecord_Relation の配列が保持される ActiveRecord::Relation はインスタンスが必要になったタイミングでレコードを ロードしてキャッシュする 処理対象の数百万件のレコードがすべてモデルのインスタンスになってメモリー 上に載ってしまった

    結果としてメモリー不⾜で処理が詰まりOOMKillerの餌⾷になった SELECT "accounts"."id" FROM "accounts" WHERE "accounts"."is_active" = 1 ORDER BY "accounts"."id" ASC LIMIT 1000 SELECT "accounts"."id" FROM "accounts" WHERE "accounts"."is_active" = 1 AND "accounts"."id" > 2059 ORDER BY "accounts"."id" ASC LIMIT 1000 ... SELECT "accounts"."id" FROM "accounts" WHERE "accounts"."is_active" = 1 AND "accounts"."id" > 20252 ORDER BY "accounts"."id" ASC LIMIT 1000 SELECT "accounts"."id" FROM "accounts" WHERE "accounts"."is_active" = 1 AND "accounts"."id" > 22232 ORDER BY "accounts"."id" ASC LIMIT 1000 SELECT "accounts".* FROM "accounts" WHERE "accounts"."is_active" = 1 AND "accounts"."id" IN (1, 3, ..., 2056, 2059) SELECT "accounts".* FROM "accounts" WHERE "accounts"."is_active" = 1 AND "accounts"."id" IN (2061, 2062, ..., 4183, 4184) ... 24
  11. どうやって解決したか 最初に処理対象のレコードの id をメモリーにロード ロードした処理対象の id を each_slice で分割して少しずつ読み込みながら処理 した

    account_ids = Account.where(is_active: true).pluck(:id) Parallel.each(account_ids.each_slice(1000), in_threads: 10) do |ids| ActiveRecord::Base.connection_pool.with_connection do Account.where(id: ids).each do |account| # do something end end end 25
  12. 何が起きたのか︖ 予想よりもだいぶ遅かった in_batches の対象選択クエリーでJOINするとJOIN対象のレコード数が多くなる in_batches の対象選択と更新とそれぞれでJOINが⾏われた SELECT "favorites"."id" FROM "favorites"

    INNER JOIN "accounts" ON "accounts"."id" = "favorites"."account_id" WHERE "favorites"."is_public" = 0 AND "accounts"."is_active" = 1 ORDER BY "favorites"."id" ASC LIMIT 1000 UPDATE "favorites" SET "is_public" = 1, "updated_at" = '2019-08-29 11:34:41.676789' WHERE "favorites"."id" IN ( SELECT "favorites"."id" FROM "favorites" INNER JOIN "accounts" ON "accounts"."id" = "favorites"."account_id" WHERE "favorites"."is_public" = 0 AND "accounts"."is_active" = 1 AND "favorites"."id" IN (1, 2, ..., 2034, 2036) ) SELECT "favorites"."id" FROM "favorites" INNER JOIN "accounts" ON "accounts"."id" = "favorites"."account_id" WHERE "favorites"."is_public" = 0 AND "accounts"."is_active" = 1 AND "favorites"."id" > 2036 ORDER BY "favorites"."id" ASC LIMIT 1000 UPDATE "favorites" SET "is_public" = 1, "updated_at" = '2019-08-29 11:34:41.676789' WHERE "favorites"."id" IN ( SELECT "favorites"."id" FROM "favorites" INNER JOIN "accounts" ON "accounts"."id" = "favorites"."account_id" WHERE "favorites"."is_public" = 0 AND "accounts"."is_active" = 1 AND "favorites"."id" IN (2041, 2043, ..., 4002, 4005) ) ... 29
  13. まとめ ActiveRecordのクラスのインスタンス⽣成は重い ⼤量データを扱うときは ActiveRecord::Batches のメソッド群を使い分ける pluck や Struct を使ってメモリー消費を抑える ⽤法容量を守る

    レコード数が少ないのが分かり切っているのに find_each などを使わない︕ pluck などを多⽤して可読性を落とさないように注意︕ 実際にどういうクエリーが発⾏されるのかを確認しながら開発するの⼤事 31