Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

EMNLP 2022. Reader: 木山朔 (Hajime Kiyama)

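Abstract
• How much do the demonstrations (demos) in GPT few-shot prompting actually contribute?
• On classification tasks, randomly replacing the gold labels barely lowers performance
  ◦ Providing the format of the demos is enough for few-shot
  ◦ Are gold labels perhaps unnecessary?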

Related work (old)
• Noisy channel for few-shot [Min+, ACL2022]
  ◦ Applies the noisy channel (predicting the input from the output) to few-shot learning
• MetaICL [Min+, NAACL2022]
  ◦ Trains a model on multiple tasks in the ICL format
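The difference between the usual (direct) scoring and noisy-channel scoring can be written as the pair of decision rules below. This is a simplified sketch: the symbols $x$ (input), $y$ (label), and $\mathcal{C}$ (label set) are notation chosen here, the demonstrations that condition the model are omitted, and the channel rule assumes a uniform prior over labels so that $P(y \mid x) \propto P(x \mid y)$.

```latex
\begin{align}
\text{Direct:}\quad  & \hat{y} = \arg\max_{y \in \mathcal{C}} P(y \mid x) \\
\text{Channel:}\quad & \hat{y} = \arg\max_{y \in \mathcal{C}} P(x \mid y)
\end{align}
```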

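Setup
• Models
  ◦ GPT series (see figure)
• Evaluation data
  ◦ Drawn from 26 datasets
  ◦ Classification tasks and multi-choice tasks
  ◦ Selected to be diverse
• Other settings
  ◦ Classification tasks: Macro-F1
  ◦ Multi-choice tasks: Accuracy
  ◦ Few-shot is run with k=16
  ◦ Reported results are averages over 5 runs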
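As a reminder of how the two metrics differ, here is a minimal pure-Python sketch. The function names and the toy labels are illustrative assumptions, not taken from the paper's evaluation code.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every class seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


def accuracy(gold, pred):
    """Fraction of examples whose prediction matches the gold answer."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy example (hypothetical data): one "pos" example is misclassified as "neg".
gold = ["pos", "pos", "neg", "neg"]
pred = ["pos", "neg", "neg", "neg"]
print(macro_f1(gold, pred), accuracy(gold, pred))
```

Macro-F1 weights every class equally regardless of its frequency, which is why it is a common choice for classification datasets with imbalanced labels.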

gold label vs random label
Three methods are compared:
1. No demonstrations
   a. Conventional zero-shot
2. Demonstrations w/ gold labels
   a. Conventional few-shot
3. Demonstrations w/ random labels
   a. Labels are sampled from a uniform distribution
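The three conditions above can be sketched as a single prompt builder. This is an illustration under stated assumptions, not the paper's code: the `Input:`/`Label:` template, the function name `build_prompt`, and the example reviews are all hypothetical.

```python
import random


def build_prompt(demos, test_input, mode="gold", label_space=None, seed=0):
    """Build a prompt for one test input.

    demos: list of (text, gold_label) pairs.
    mode: "none" (zero-shot), "gold" (true labels), or
          "random" (each label drawn uniformly from label_space).
    """
    if mode not in ("none", "gold", "random"):
        raise ValueError(f"unknown mode: {mode}")
    if mode == "random" and not label_space:
        raise ValueError("label_space is required for mode='random'")

    rng = random.Random(seed)
    blocks = []
    if mode != "none":
        for text, gold_label in demos:
            # Random mode ignores the gold label entirely.
            label = gold_label if mode == "gold" else rng.choice(label_space)
            blocks.append(f"Input: {text}\nLabel: {label}")
    # The test input always comes last, with the label left for the model.
    blocks.append(f"Input: {test_input}\nLabel:")
    return "\n\n".join(blocks)


# Hypothetical sentiment demos.
demos = [("great movie", "positive"), ("dull plot", "negative")]
labels = ["positive", "negative"]
for mode in ("none", "gold", "random"):
    print(f"--- {mode} ---")
    print(build_prompt(demos, "fine acting", mode=mode, label_space=labels))
```

Note that the random condition keeps the inputs, the label space, and the format identical to the gold condition; only the input-label pairing changes.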

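Main results
• The gap between gold and random labels is small compared with the gap to no demos
• Confirms that performance improves as long as demos are present, even without gold labels
  ◦ Perhaps the model is able to recover the input-label correspondence on its own?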

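Does the number of correct labels matter?
• Ablation that varies the fraction of correct labels
  ◦ Having correct labels is better
  ◦ On the other hand, performance does not drop much even when every label is wrong
  ◦ (The amount of degradation differs by model)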
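One way to construct demos for this ablation is to keep a fixed fraction of gold labels and replace the rest with a wrong label. The sketch below assumes that procedure; the function name `corrupt_labels`, the rounding rule, and the toy data are assumptions made for illustration.

```python
import random


def corrupt_labels(demos, label_space, correct_fraction, seed=0):
    """Keep gold labels on round(correct_fraction * k) demos; give the rest a wrong label."""
    rng = random.Random(seed)
    k = len(demos)
    n_correct = round(correct_fraction * k)
    keep = set(rng.sample(range(k), n_correct))  # indices that stay correct

    result = []
    for i, (text, gold) in enumerate(demos):
        if i in keep:
            result.append((text, gold))
        else:
            # Sample uniformly among the labels that differ from the gold one.
            wrong = [label for label in label_space if label != gold]
            result.append((text, rng.choice(wrong)))
    return result


# Hypothetical k=16 demo set with three classes.
label_space = ["positive", "negative", "neutral"]
demos = [(f"example {i}", label_space[i % 3]) for i in range(16)]
for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
    mixed = corrupt_labels(demos, label_space, fraction)
    n_ok = sum(new == old for (_, new), (_, old) in zip(mixed, demos))
    print(fraction, n_ok)
```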

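Is the result consistent with varying k?
• How much do the results change with the few-shot k?
  ◦ Even at k=4, using demos is significantly better than using none
  ◦ From k=8 onward, the performance gap stays about the same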

Is the result consistent with better templates?
• Experiments with manually written templates
  ◦ Motivation: the templates used so far were minimal
  ◦ The trend is unchanged, with comparable performance

Why does ICL work?
• Analyzes the components of the demos from four aspects
1. The input-label mapping
2. The distribution of the input text
3. The label space
4. The format

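Impact of the distribution of the input text
• Adds a condition with out-of-distribution (OOD) inputs and random labels
  ◦ With OOD inputs and random labels, performance drops substantially
  ◦ (The task is hard unless the relevant knowledge was acquired in pretraining)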

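Impact of the label space
• Comparison with replacing the labels by random English words
  ◦ Direct models: performance drops
    ▪ Does it drop because the correspondence in the pretrained model differs?
  ◦ Channel models: performance drops only slightly
    ▪ Can the correspondence still be made because input and output are swapped?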

Impact of input-label pairing
• Compares a variety of conditions (concrete examples are shown below)

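Impact of input-label pairing
• Tests demos that contain inputs only and demos that contain labels only
  ◦ Performance changes greatly depending on whether the format is kept (light green vs green)
    ▪ The format matters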
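The inputs-only and labels-only conditions can be pictured as variants of the same demo formatter. The sketch below is an assumption about how such variants might be built; the variant names, the `Input:`/`Label:` template, and the example data are hypothetical and do not reproduce the paper's exact setup.

```python
def format_demos(demos, variant="paired"):
    """Render demos under one of three illustrative variants.

    "paired":      each demo shows an input followed by its label.
    "inputs_only": the labels are dropped, so the pair format is lost.
    "labels_only": the inputs are dropped, so the pair format is lost.
    """
    if variant == "paired":
        blocks = [f"Input: {text}\nLabel: {label}" for text, label in demos]
    elif variant == "inputs_only":
        blocks = [text for text, _ in demos]
    elif variant == "labels_only":
        blocks = [label for _, label in demos]
    else:
        raise ValueError(f"unknown variant: {variant}")
    return "\n\n".join(blocks)


# Hypothetical sentiment demos.
demos = [("great movie", "positive"), ("dull plot", "negative")]
for variant in ("paired", "inputs_only", "labels_only"):
    print(f"--- {variant} ---")
    print(format_demos(demos, variant))
```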

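Impact of meta-training
• MetaICL: a model trained with ICL as its objective
  ◦ The input-label correspondence is not that important
  ◦ The format of the demos is important
  ◦ Through meta-learning, the model can extract the simpler aspects of the demos
  ◦ In short, because it is trained primarily for ICL, it adapts to ICL better than a model pretrained with language modeling alone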

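Summary
• Analyzed ICL performance on classification tasks
• Compared gold labels vs random labels
  ◦ Performance does not drop much even without gold labels
• Set up four aspects to analyze ICL and compared them
  ◦ The format-related components turned out to matter
  ◦ Even without gold labels, good performance seems attainable as long as the prompt format is provided
  ◦ However, performance drops with OOD inputs, so the result depends on pretraining knowledge
  ◦ Good performance can be expected when the labels are common ones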

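Related work (new)
• Task recognition and Task learning [Pan+, ACL Findings 2023]
  ◦ Can ICL be split into a part that recognizes the task and a part that learns the task?
  ◦ Task recognition
    ▪ The ability to recognize the task through the demos and apply the distribution learned in pretraining
    ▪ An ability that does not scale (already acquired in pretraining)
  ◦ Task learning
    ▪ The ability to capture input-label mappings not seen during pretraining
    ▪ An ability that scales (what is mainly learned through ICL)
  ◦ Goes one step beyond the paper presented here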
