snlp2023_beyond_neural_scaling_laws

Slide 1

Slide 1 text

Beyond neural scaling laws: beating power law scaling via data pruning
Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, Ari Morcos
NeurIPS 2022
Presented by: Sho Takase (LINE)
2023/8/28

Slide 2

Slide 2 text

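(Neural) Scaling law
• Empirical finding: the performance of a neural model scales log-linearly (a power law, i.e. a straight line on log-log axes) with
  – the number of training examples
  – the number of parameters
• Cited especially often in the context of large language models
[Figure: test loss vs. compute (PF-days, non-embedding), dataset size (tokens), and parameters (non-embedding) [Kaplan+ 20]]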
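As a rough illustration of what a power-law scaling curve means, the sketch below fits loss ≈ c · N^(−ν) by a straight-line fit in log-log space. The function name `fit_power_law` and all numbers are hypothetical and synthetic; they are not taken from [Kaplan+ 20] or from the paper.

```python
import numpy as np

def fit_power_law(sizes, losses):
    """Fit losses ~ c * sizes**(-nu) via least squares on log-log axes."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
    return -slope, np.exp(intercept)  # (nu, c)

# Synthetic data generated from an exact power law (illustration only).
sizes = np.array([1e3, 1e4, 1e5, 1e6])
losses = 5.0 * sizes ** -0.1

nu, c = fit_power_law(sizes, losses)
print(f"exponent nu = {nu:.3f}, prefactor c = {c:.3f}")
```

With a small exponent such as the one used here, each tenfold increase in data buys only a modest drop in loss, which is the motivation for asking whether better data selection can do better.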

Slide 3

Slide 3 text

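Approach of this paper
• Revisits the scaling law for the number of training examples
  – Examines the effect of selecting training data well
• Validated on toy data + real image recognition datasets
• Also evaluates existing data cleaning methods
  – ImageNet can perhaps be pruned down to about 80%
• Impression: not as efficient as in the idealized setting
• The gains the title suggests seem hard to achieve on real data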

Slide 4

Slide 4 text

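Validation procedure on toy data
• Task: binary classification
• Model: simple perceptron
• Procedure
  – Extract the hardest N% of examples
    • Train a model
    • Sort examples by score (inner product of the input vector and the weights) in reverse order
      – that is, hardest to classify first (a small absolute score means the example lies close to the decision boundary)
    • Extract N% of all examples as the high-difficulty examples
  – Train a separate model on those N% of examples
    • Evaluate performance (error rate) on test data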
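The procedure above can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: labels come from a random teacher vector, training uses the classic perceptron update rule, and difficulty is taken to be the absolute value of the probe model's score. All names (`train_perceptron`, `keep_hardest`, `error_rate`) and sizes are hypothetical.

```python
import numpy as np

def train_perceptron(X, y, epochs=20):
    """Classic perceptron rule: add y*x to w for every misclassified example."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:
                w += yi * xi
    return w

def keep_hardest(X, y, w_probe, frac):
    """Keep the `frac` of examples with the smallest |w_probe . x| (assumed difficulty)."""
    margin = np.abs(X @ w_probe)
    n_keep = max(1, int(round(frac * len(X))))
    idx = np.argsort(margin)[:n_keep]  # smallest margin = hardest first
    return X[idx], y[idx]

def error_rate(w, X, y):
    return float(np.mean(np.sign(X @ w) != y))

rng = np.random.default_rng(0)
dim, n_train, n_test = 20, 2000, 5000
teacher = rng.standard_normal(dim)  # assumed source of ground-truth labels
X_train = rng.standard_normal((n_train, dim))
y_train = np.sign(X_train @ teacher)
X_test = rng.standard_normal((n_test, dim))
y_test = np.sign(X_test @ teacher)

w_probe = train_perceptron(X_train, y_train)                         # step 1: probe model
X_hard, y_hard = keep_hardest(X_train, y_train, w_probe, frac=0.5)   # step 2: hardest N%
w_student = train_perceptron(X_hard, y_hard)                         # step 3: separate model
print("test error with hardest 50% kept:", error_rate(w_student, X_test, y_test))
```

Changing `frac` and `n_train` gives one point per setting on a test-error curve of the kind discussed on the next slide.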

Slide 5

Slide 5 text

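Validation on toy data
[Figure: perceptron in teacher-student setting; test error (%) vs. training examples per parameter, theory and simulation, for fractions of data kept from 100% down to 10%, with the Pareto frontier marked; panels compare keeping easy examples with keeping hard examples]
• x-axis: amount of training data after extraction
• Frac. data kept: what percentage of the full training data is retained; a small percentage means only high-difficulty data remain
• A fixed amount is extracted from the full training data, so the extracted amount is the same across curves; a smaller percentage therefore means a larger full training set (presumably)
• Little training data (left side) → better to include more low-difficulty data
• Lots of training data (right side) → the more high-difficulty data, the better
• The optimal fraction to keep when extracting training data depends on the total amount of training data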
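The presenter's reading of the x-axis (same number of kept examples, so a smaller kept fraction implies a larger original pool) comes down to simple arithmetic. The helper name `total_needed` and the example numbers below are hypothetical.

```python
def total_needed(n_kept, frac_kept):
    """Size of the full pool such that keeping `frac_kept` of it leaves `n_kept` examples."""
    return n_kept / frac_kept

for frac in (1.0, 0.5, 0.1):
    pool = total_needed(10_000, frac)
    print(f"keep {frac:.0%} -> full pool of about {pool:,.0f} examples")
```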

Slide 6

Slide 6 text

What if difficulty is hard to judge?
[Figure: pruning with an imperfect metric; test error vs. training examples per parameter, theory and simulation, for fractions of data kept from 20% to 100%; panels B–D]
• One panel is identical to the example on the previous slide; in the other panels, difficulty is hard to judge
• When difficulty is hard to judge, it is better not to prune much of the training data (the harder the judgment, the less can be pruned)
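One way to get a feel for an imperfect difficulty metric is to score examples with a probe vector tilted away from the true teacher and check how many of the truly hardest examples it still picks out. This is only a sketch under assumptions: modelling the imperfection as an angle between probe and teacher, and the names `probe_at_angle` and `hardest_overlap`, are choices made here for illustration.

```python
import numpy as np

def probe_at_angle(teacher, theta, rng):
    """Unit vector at angle `theta` (radians) from `teacher`."""
    t = teacher / np.linalg.norm(teacher)
    r = rng.standard_normal(t.shape)
    r -= (r @ t) * t              # remove the component along the teacher
    r /= np.linalg.norm(r)
    return np.cos(theta) * t + np.sin(theta) * r

def hardest_overlap(X, teacher, probe, frac):
    """Share of the truly hardest examples that the probe also ranks as hardest."""
    n_keep = int(round(frac * len(X)))
    true_hard = set(np.argsort(np.abs(X @ teacher))[:n_keep])
    probe_hard = set(np.argsort(np.abs(X @ probe))[:n_keep])
    return len(true_hard & probe_hard) / n_keep

rng = np.random.default_rng(0)
dim, n = 50, 5000
teacher = rng.standard_normal(dim)
X = rng.standard_normal((n, dim))
for theta_deg in (0, 20, 40):
    probe = probe_at_angle(teacher, np.deg2rad(theta_deg), rng)
    print(theta_deg, "deg ->", hardest_overlap(X, teacher, probe, frac=0.2))
```

The less the probe agrees with the teacher, the less the selected subset resembles the truly hard examples, which is consistent with the slide's advice to prune less aggressively in that case.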

Slide 7

Slide 7 text

Validation on image recognition datasets
[Figure: test error vs. number of training examples for ResNet18 on SVHN, ResNet18 on CIFAR-10 (test error, %), and ResNet50 on ImageNet (top-5 test error, %), for fractions of data kept ranging from 100% down to 20% or 1% depending on the panel, with Pareto frontiers. Pruning scores were EL2N [10] for CIFAR-10 and SVHN and memorization [13] …; see App. B for all pruning/training details and App. D for similar ImageNet plots]
• When extracting from the same training set, a small kept percentage leaves performance low (the training set is too small for performance to saturate)
• Takase's opinion
  – On real datasets, keeping about 70-80% looks reasonable
  – Depending on the dataset, pruning down to about 60% may also be fine
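For reference, an EL2N-style score is the L2 norm of the difference between a model's softmax output and the one-hot label, so confidently wrong examples score high and confidently correct ones score low. The sketch below computes it from a single set of logits; the original metric is usually averaged over several models early in training, which is omitted here. Function names and the toy logits are hypothetical.

```python
import numpy as np

def el2n_scores(logits, labels):
    """Per-example L2 norm of (softmax(logits) - one_hot(labels))."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    one_hot = np.eye(logits.shape[1])[labels]
    return np.linalg.norm(probs - one_hot, axis=1)

def keep_hardest_indices(scores, frac):
    """Indices of the `frac` highest-scoring (hardest) examples, hardest first."""
    n_keep = max(1, int(round(frac * len(scores))))
    return np.argsort(scores)[::-1][:n_keep]

# Toy logits for three examples that all have true class 0.
logits = np.array([[4.0, 0.0, 0.0],   # confidently correct
                   [0.0, 0.0, 0.0],   # uncertain
                   [0.0, 4.0, 0.0]])  # confidently wrong
labels = np.array([0, 0, 0])

scores = el2n_scores(logits, labels)
kept = keep_hardest_indices(scores, frac=2 / 3)
print("scores:", scores, "kept indices:", kept)
```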

Slide 8

Slide 8 text

Does this also hold for pre-training?
• Pruning data is effective for pre-training as well
  – Pruning well can even improve performance
• How much of the pre-training data to prune
  – Train on ImageNet → fine-tune on CIFAR-10
  – Prune the pre-training data (ImageNet)
• Even when ImageNet is pruned down to about 70-80%
  – performance changes little
  – performance improves for some data selection methods
[Figure 4: Data pruning improves transfer learning. A: CIFAR-10 performance of a ViT pre-trained on all of ImageNet2… and fine-tuned on different pruned subsets of CIFAR-10 under the EL2N metric. B: CIFAR-10 performance of ResNet50s pre-trained on different pruned subsets of ImageNet1K and fine-tuned on all of CIFAR-10; axes: test accuracy (%) vs. fraction of ImageNet used for pretraining]

Slide 9

Slide 9 text

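Data cleaning strategies
• Skipping the image recognition experiments (from the presented paper)
• Covering the training of large language models instead
  – Deduplicating Training Data Makes Language Models Better [Lee+ 22]
  – Web data contains many (partially) duplicated documents
  – Removing duplicated documents
    • improves training efficiency
      – larger performance gains relative to the amount of training data
    • sometimes improves performance
[Figure from [Lee+ 22]: language model perplexity (PPL) with near-duplicate removal and with duplicate removal]
• When training after deduplication, language model PPL
  – is not adversely affected
  – improves for some data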
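The two kinds of removal named on this slide can be sketched as below. This is a deliberately simplified stand-in and not the pipeline of [Lee+ 22]: exact duplicates are found by hashing whitespace- and case-normalized text, and near-duplicates by brute-force Jaccard similarity over word 3-grams, which only suits small collections. All function names, the threshold, and the sample documents are hypothetical.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

def drop_exact_duplicates(docs):
    """Keep the first occurrence of each normalized document."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def shingles(text, n=3):
    """Set of word n-grams; short texts become a single shingle."""
    words = normalize(text).split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def drop_near_duplicates(docs, threshold=0.8, n=3):
    """Drop a document if it is too similar to any document already kept."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, n)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept

docs = [
    "The quick brown fox jumps over the lazy dog",
    "the quick  brown fox jumps over the lazy dog",        # same after normalization
    "The quick brown fox jumps over the lazy dog today",   # near-duplicate
    "Scaling laws relate loss to dataset size",
]
exact_kept = drop_exact_duplicates(docs)
near_kept = drop_near_duplicates(exact_kept)
print(len(docs), "->", len(exact_kept), "->", len(near_kept))
```

At web scale the pairwise comparison would have to be replaced by an approximate or index-based method.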

Slide 10

Slide 10 text

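Summary
• Revisits the scaling law for the number of training examples
  – Examines the effect of selecting training data well
• Validated on toy data + real image recognition datasets
• Training is efficient if the training data can be selected well
  – On real data, pruning down to around 70-80% seems feasible
• Duplicate documents should be removed when training large language models
  – Training efficiency improves + performance sometimes improves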