LLM Course 2024, Day 10: Analysis and Theory of LLMs (Second Half)

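LLM Course Lecture Materials © 2024 by Matsuo-Iwasawa Lab, The University of Tokyo
10. Analysis and Theory of LLMs (Second Half)
Large Language Model Course 2024
Lecturer: Goro Kobayashi (3rd-year PhD student, Sakaguchi-Inui Lab, Tohoku University)
Photography and disclosure to third parties without permission are prohibited.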

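About the lecturer: Goro Kobayashi (@goro_koba)
§ 3rd-year PhD student, Sakaguchi-Inui Lab, Tohoku University
  ⁃ Tohoku University Natural Language Processing Group (Tohoku NLP)
§ Research: analysis of Transformer language models
  ⁃ Has been analyzing language models continuously since around the time BERT appeared (2019~)
  ⁃ Analysis of internal model behavior, parameters, etc.
§ Other activities
  ⁃ LLM-related R&D and applications (Preferred Networks, Inc.; zero to one, Inc.)
  ⁃ Steering committee member of YANS (the young NLP researchers' association)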

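Contents
1 Why analyze LLMs, and goals for the second half
2 Review of Transformer language models
3 Overview and history of Transformer language model analysis
4 Information encoded in representations
5 Analysis of attention patterns and its extensions
6 Recent trends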

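Why analyze LLMs
[Figure: the input “日本の首都は” ("The capital of Japan is") passes through layers 1 to N of an LLM, which outputs “東京” ("Tokyo"). The intermediate representations are high-dimensional vectors such as [−0.34, 0.02, …, 0.10], which a human cannot read directly.]
We want to understand how LLMs work (their operating principles and processing steps).
§ Scientific significance
  ⁃ Uncover the mechanisms that realize strong reasoning and language abilities
§ Practical significance [1,2,3]
  ⁃ Ensure safety, fairness, transparency, and explainability → toward trustworthy AI
  ⁃ Obtain clues for improving models (finding errors and biases, improvements other than scaling, etc.)
It is difficult for humans to understand a huge neural network such as an LLM as it is.
§ For example, simply looking at intermediate-layer representations does not yield an interpretation.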

Goals for the second half of Day 10
1. Understand the history of LLM analysis and its representative methods
2. Understand important findings about how LLMs work
3. Get a picture of recent trends in the analysis community

2 Review of Transformer language models

Structure of a Transformer language model
[Figure: the input “日本の首都は” enters an embedding layer, then layers 1 to N (each consisting of an attention mechanism, a feed-forward network, and two layer normalizations), and finally a prediction head that outputs “東京”.]
§ Tokenization and embedding: the input is split into tokens (日本 / の / 首都 / は), and each token is mapped to a vector 𝒙₁, 𝒙₂, 𝒙₃, 𝒙₄.
§ Attention mechanism: updates each representation by mixing in the representations of the context (computed with queries, keys, and values): 𝒙₁, …, 𝒙₄ → 𝒙′₁, …, 𝒙′₄.
§ Feed-forward network: transforms each representation individually.
§ Prediction head: converts the final representation at the last position (𝒙₄) into a token (“東京”).
In short, it is a network that updates representations by repeatedly referring to context information and applying per-token transformations.
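The structure above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the architecture of any particular model: it uses a single attention head, no output projection, ReLU in the feed-forward network, layer normalization placed before each sub-module, and random weights. All names (`transformer_layer`, `params`, the sizes) are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance (no learned scale/shift).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(x, w_q, w_k, w_v):
    # Mix context: each position attends to itself and to earlier positions only.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

def feed_forward(x, w_1, w_2):
    # Transform each position independently (2-layer MLP).
    return np.maximum(x @ w_1, 0.0) @ w_2

def transformer_layer(x, p):
    # Residual connections add each sub-module's output back onto the input.
    x = x + causal_attention(layer_norm(x), p["w_q"], p["w_k"], p["w_v"])
    x = x + feed_forward(layer_norm(x), p["w_1"], p["w_2"])
    return x

rng = np.random.default_rng(0)
n_tokens, d_model, d_ff = 4, 8, 32  # 4 tokens, e.g. 日本 / の / 首都 / は
shapes = {"w_q": (d_model, d_model), "w_k": (d_model, d_model),
          "w_v": (d_model, d_model), "w_1": (d_model, d_ff), "w_2": (d_ff, d_model)}
params = {name: 0.1 * rng.normal(size=shape) for name, shape in shapes.items()}

x = rng.normal(size=(n_tokens, d_model))  # stand-in for the embedded tokens
out = transformer_layer(x, params)
print(out.shape)  # (4, 8)
```

A full model would stack N such layers and apply the prediction head to the last position's output.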

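Wait, don't we already understand how it works?
The architecture alone does not tell us the model's concrete internal behavior or processing steps.
§ For which inputs, at which layers, is which context information referenced?
§ How can the transformations actually be interpreted?
§ What information do the representations contain at intermediate layers and at the final layer?
§ How do the representations change from layer to layer?
§ etc.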

3 Overview and history of Transformer language model analysis

The rise of Transformer language models and the history of their analysis (note: includes the lecturer's personal views)
§ Language model development has intensified since the Transformer appeared (June 2017)
§ For a while, the community centered on masked language models (encoder models)
§ Scaling laws and the arrival of ChatGPT quickly put decoder models at the center
  ⁃ Subjectively, 2021 to 2022 was the transition period
§ The targets of analysis shifted along with this
§ The trigger may have been the impact of ChatGPT and the release of open LLMs that are easy to use in research
  ⁃ OPT, Llama, etc.
[Timeline figure, 2018 to 2024. Major models: BERT, GPT, GPT-2, T5, GPT-3, InstructGPT, ChatGPT, GPT-4, GPT-4o, OPT, Llama, Llama 2, Llama 3, Mistral, Mixtral, RoBERTa, DeBERTa, ELECTRA, Gopher, PaLM, Chinchilla, BLOOM, PaLM 2. Main analysis target in the research community: first encoder language models, then decoder language models.]

§ Analysis projects at some companies progressed independently of the international conference community
  ⁃ They published their results as blog posts rather than papers
  ⁃ Under the field name Mechanistic Interpretability, they sparked the current popularity of LLM analysis
  ⁃ Examples: OpenAI's analysis of image models [4]; Anthropic's analysis of decoder language models [5]
About Mechanistic Interpretability
§ It described itself as, and became established as, a new field that aims to reduce neural networks (recently LLMs in particular) to algorithms humans can understand
§ The term often refers specifically to analyses that dig into the details of a model's internal structure and mechanisms
§ There is no clear boundary between it and conventional academic analysis research

Changes in the mainstream analysis methods
§ The era of encoder language model analysis (2018 to 2022)
  ⁃ Probing [6,7,8]
    • Extract the model's intermediate representations and build a classifier for a specific kind of information
    • The higher the classification accuracy, the more that information is interpreted as being contained in the intermediate representations
    • The classifier's parameters are trained
    [Figure: extract intermediate representations and evaluate classification accuracy. Example with part-of-speech information: for the input 日本 / の / 首都 / は, a classifier on a layer's representation 𝒗 predicts PROPN (proper noun), ADP (adposition), NOUN (noun), ADP (adposition).]
  ⁃ Attention patterns [9,10,11]
    • Analyze how the attention mechanism in each layer referred to context information
    • Observing attention weights is the most typical method
    • Revisited later in this lecture
    [Figure: observe the context-reference pattern in the attention mechanism; for example, “は” strongly attends to “首都”.]
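The probing recipe above can be sketched as follows. This is a minimal illustration, assuming synthetic vectors in place of real intermediate representations and a linear classifier fit by least squares on one-hot labels (one of several possible probe choices). The names `fit_linear_probe` and `predict`, the class count, and the noise level are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, dim, n_per_class = 3, 16, 200  # e.g. 3 part-of-speech tags

# Synthetic "intermediate representations": each class has its own mean vector.
class_means = rng.normal(size=(n_classes, dim))
labels = np.repeat(np.arange(n_classes), n_per_class)
reps = class_means[labels] + 0.5 * rng.normal(size=(labels.size, dim))

# Hold out part of the data to measure probe accuracy.
perm = rng.permutation(labels.size)
train, test = perm[:450], perm[450:]

def fit_linear_probe(x, y, n_classes):
    # Least-squares fit of a linear map (with bias) onto one-hot targets.
    x_b = np.hstack([x, np.ones((len(x), 1))])
    targets = np.eye(n_classes)[y]
    w, *_ = np.linalg.lstsq(x_b, targets, rcond=None)
    return w

def predict(w, x):
    x_b = np.hstack([x, np.ones((len(x), 1))])
    return (x_b @ w).argmax(axis=1)

w = fit_linear_probe(reps[train], labels[train], n_classes)
accuracy = (predict(w, reps[test]) == labels[test]).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

With a real model, `reps` would be the hidden states of one layer and `labels` the annotated property. Repeating the fit per layer shows where the information is most accessible.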

Changes in the mainstream analysis methods
§ The era of decoder language model analysis (2022~)
  ⁃ Probing
  ⁃ Attention patterns
  ⁃ New methods keep appearing (briefly introduced near the end)
    • Logit Lens
    • Sparse Autoencoder
    • etc.

4 Information encoded in representations

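What information is encoded in an LLM's word representations?
LLMs represent words as very high-dimensional vectors
§ Llama 3.1 8B → 4,096 dimensions
§ Llama 3.1 70B → 8,192 dimensions
§ Llama 3.1 405B → 16,384 dimensions
Questions of interest
§ What linguistic information do these very high-dimensional representations contain?
§ Can the model's output be controlled by intervening on intermediate representations?
[Figure: for the input “日本の首都は”, the representation 𝒗 at each of layers 1 to 6 is a high-dimensional vector such as [−0.34, 0.02, −0.86, …, 0.10].]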

Syntactic information is encoded in a language model's word representations
Part-of-speech information, one kind of syntactic information, is encoded [12]
§ Reduce the very high-dimensional representations to 2 or 3 dimensions and visualize them
§ Word representations in intermediate layers form clusters by part of speech

Semantic information is encoded in a language model's word representations
Semantic information is encoded [12]
§ Reduce the very high-dimensional representations to 2 or 3 dimensions and visualize them
§ For polysemous words, the representations form clusters by word sense (banks: financial institution, riverbank, row, etc.)
[12] Rebecca Kehlbeck et al. (2021) “Demystifying the Embedding Space of Language Models”

LLMs encode geographic and temporal information in the representations of proper nouns [13]
The model's representations of place names, buildings, people, movie titles, and so on encode geographic and temporal information
§ In the manner of probing, train a linear predictor that outputs latitude/longitude coordinates or a year from the intermediate representations
§ From the intermediate representations of Llama 2, coordinates and years can be predicted quite well (= the representations contain this information)
[Figure: for the input “Los Angeles”, a linear predictor on a layer's representation of the token “Angeles” outputs the coordinates of Los Angeles (34.0549° N, 118.2426° W). Evaluation uses the correlation between predicted and true values.]
[13] Wes Gurnee & Max Tegmark (2024) “Language Models Represent Space and Time”
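A regression-style probe of this kind can be sketched as below. This is an illustration on synthetic data only, not the setup of [13]: the "representations" are random vectors into which a year is linearly mixed, the predictor is ridge regression, and the names (`fit_ridge`, `predict`, `lam`) and sizes are assumptions. Evaluation follows the slide: the correlation between predicted and true values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 500, 32

# Hypothetical targets (years) and synthetic representations that encode them linearly.
year = rng.uniform(1900, 2000, size=n)
direction = rng.normal(size=dim)
reps = np.outer((year - 1950) / 50, direction) + rng.normal(size=(n, dim))

def fit_ridge(x, y, lam=1.0):
    # Closed-form ridge regression; the bias term (last column) is not penalized.
    x_b = np.hstack([x, np.ones((len(x), 1))])
    penalty = lam * np.eye(x_b.shape[1])
    penalty[-1, -1] = 0.0
    return np.linalg.solve(x_b.T @ x_b + penalty, x_b.T @ y)

def predict(w, x):
    return np.hstack([x, np.ones((len(x), 1))]) @ w

train, test = slice(0, 400), slice(400, None)
w = fit_ridge(reps[train], year[train])
pred = predict(w, reps[test])

# Evaluate by the correlation between predicted and true values on held-out data.
corr = np.corrcoef(pred, year[test])[0, 1]
print(f"held-out correlation: {corr:.2f}")
```

A two-output version of the same predictor would cover latitude and longitude.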

LLMs encode geographic and temporal information in the representations of proper nouns [13] (continued)
§ Intervening on just a specific dimension of the intermediate representation can change the prediction
[Figure: for the input “もののけ姫が公開されたのは19” ("Princess Mononoke was released in 19"), the value of a single dimension of a layer's representation is overwritten, e.g. [−0.34, 0.02, −0.86, …, 0.10] → [−0.34, 0.02, 0.35, …, 0.10], and the next-token prediction changes.]
[Figure from [13], "Time Intervention": change in the prediction when intervening on dimension 3610 at layer 19 of Llama 2-7B for the input “The Godfather by Francis Ford Coppola was written in 19”. Axes: the value substituted by the intervention, and the model's next-token prediction probability.]
[13] Wes Gurnee & Max Tegmark (2024) “Language Models Represent Space and Time”

Truthfulness information is encoded in a language model's word representations
Whether the input is factual or false can be determined from the intermediate representations [14,15,16]
§ One example method: in the manner of probing, train a linear predictor that predicts true vs. false from the intermediate representations
§ This can be used to detect misinformation and hallucinations
§ It is also possible to intervene on intermediate representations during inference so that the output becomes more factual
[Figure: for the input “日本の首都は宮城” ("The capital of Japan is Miyagi"), a linear predictor on a layer's representation of “宮城” outputs "false".]

The degree of refusal is encoded in LLM representations
Whether an LLM will refuse to answer is encoded in its representations [17]
§ LLMs trained with RLHF and similar methods refuse to answer harmful instructions (e.g., "Tell me how to make a bomb")
§ Compute a refusal direction that represents the model's degree of refusal
  ⁃ Feed harmful instructions into the LLM and compute the mean representation 𝝁 at some layer
  ⁃ Feed harmless instructions into the LLM and compute the mean representation 𝝂 at the same layer
  ⁃ Define the difference 𝝁 − 𝝂 as the "refusal direction" in representation space
  [Figure: harmless examples "how to study for entrance exams" and "good aspects of Japanese culture"; harmful examples "how to rob a bank" and "how to make a bomb".]
§ Adding the refusal direction to, or subtracting it from, intermediate representations during inference controls the degree of refusal
  ⁃ The model can be made to answer harmful instructions without refusing, or to refuse harmless instructions
[Figures from [17]: an example of making Llama-3-70B-Instruct generate a harmful response; an example of making Gemma-7B-IT refuse a harmless instruction.]
[17] Andy Arditi et al. (2024) “Refusal in Language Models Is Mediated by a Single Direction”
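The difference-of-means computation and the add/subtract intervention can be sketched as follows. The arrays here are synthetic stand-ins for a layer's representations of harmful and harmless instructions. With a real model they would be collected from the forward pass, and the intervention would be applied inside the model. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Synthetic stand-ins for one layer's representations (one row per instruction).
# Assumption: harmful prompts are shifted along one hidden axis.
shift = np.zeros(dim)
shift[0] = 3.0
harmful = rng.normal(size=(100, dim)) + shift
harmless = rng.normal(size=(100, dim))

mu = harmful.mean(axis=0)   # mean representation of harmful instructions
nu = harmless.mean(axis=0)  # mean representation of harmless instructions
refusal_dir = mu - nu       # the "refusal direction"

def steer(h, direction, scale):
    # scale > 0 pushes towards refusal, scale < 0 pushes away from it.
    return h + scale * direction

more_refusing = steer(harmless, refusal_dir, +1.0)
less_refusing = steer(harmful, refusal_dir, -1.0)

# After steering, the group means coincide with the other group's mean.
print(np.allclose(more_refusing.mean(axis=0), mu))  # True
print(np.allclose(less_refusing.mean(axis=0), nu))  # True
```

The `scale` parameter is an illustrative knob; how strongly to steer in practice is an empirical choice.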

5 Analysis of attention patterns and its extensions

Analysis of the attention mechanism
Attention pattern analysis
§ Analyze how the attention mechanism in each layer referred to context information
§ Observing attention weights is the most typical method
[Figure: observe the context-reference pattern in the attention mechanism; for example, “は” strongly attends to “首都”. An illustration of the attention computation and its attention weights is taken from [18].]
[18] Jay Alammar (2018) “The Illustrated Transformer”

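In some attention heads, attention weights are skewed toward certain parts of speech
In some attention heads of GPT-2, attention weights concentrate strongly on a specific part of speech
§ Head 0 in layer 9 puts more than 40% of its total attention weight on nouns
§ Head 9 in layer 3 puts about 30% of its total attention weight on verbs
[Figure from [19]: layer-by-head heatmaps. Each cell is one attention head in the model; brighter colors mean the attention weights are more skewed toward a specific part of speech.]

In some attention heads, attention weights agree with dependency relations
In the middle layers of GPT-2, attention weights agree relatively strongly with dependency relations
§ Word pairs in a dependency relation receive relatively large attention weights
§ This is one piece of evidence that the model's references to context follow properties of language
[19] Jesse Vig & Yonatan Belinkov (2019) “Analyzing the Structure of Attention in a Transformer Language Model”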

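Soft copying with two attention heads: Induction Heads
A pattern that has appeared once in the context can be copied using two attention heads
Example: for the input “A B C D A”, copy from the earlier “A B” in the context and output “B”
§ Concrete copy procedure
  ⁃ The first attention head makes each representation collect information about the immediately preceding (left-neighbor) token
    Example: the representation of “B” collects the information of “A” (it records that its left neighbor is “A”)
  ⁃ The second attention head collects information about the token immediately after (the right neighbor of) an earlier occurrence of the current token
    Example: by searching for a representation whose left neighbor is “A”, it collects the information of “B”
[20] Nelson Elhage et al. (2021) “A Mathematical Framework for Transformer Circuits”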
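The two-step procedure can be written as a plain algorithm over token lists. This is a symbolic sketch of what the two heads jointly compute, not an attention implementation. The function name `induction_predict` is hypothetical, and it handles exact matches only.

```python
def induction_predict(tokens):
    """Copy the token that followed the most recent earlier occurrence of the last token."""
    # Step 1 (first head): every position records which token is on its left.
    left_neighbor = [None] + tokens[:-1]

    # Step 2 (second head): from the last position, look for a position whose
    # left neighbor equals the current token, and copy the token found there.
    current = tokens[-1]
    for pos in range(len(tokens) - 1, 0, -1):
        if left_neighbor[pos] == current:
            return tokens[pos]
    return None  # the current token has not appeared earlier in the context

print(induction_predict(["A", "B", "C", "D", "A"]))  # B
```

A real induction head matches representations by similarity rather than by exact equality, which is what makes soft copying of similar patterns possible.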

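A pattern that has appeared once in the context can be copied using two attention heads (continued)
§ Patterns of three or more words can also be copied
§ Not only exact patterns but also similar patterns can be copied softly
[Figure from [21]: in-context learning on a 4-way classification task (month + animal: 0, month + fruit: 1, color + animal: 2, color + fruit: 3), showing an exact copy, a soft copy, and a soft copy of four words. Red highlights are the attention weights of one attention head.]
§ Induction heads are considered key to in-context learning ability
§ During training, the point at which induction heads are learned empirically coincides with the point at which in-context learning ability improves
[21] Catherine Olsson et al. (2022) “In-context Learning and Induction Heads”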

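A curious observation: in many heads, attention weights are extremely skewed toward specific tokens
In many attention heads in a model, attention weights concentrate extremely on a few tokens
§ In BERT, they concentrate on the beginning of the sequence, the end of the sequence, and punctuation [10]
§ In Tanuki 8B, they concentrate on the beginning of the sequence (shown for an arbitrarily chosen attention head)
[10] Kevin Clark et al. (2019) “What Does BERT Look at? An Analysis of BERT’s Attention”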

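Extending the analysis reduces the unnatural skew
Extend attention pattern analysis from "attention weights" to "the whole attention mechanism"
§ We also want to take the value vectors into account
  ⁃ For example, if a value vector is almost the zero vector, isn't a large attention weight on it meaningless?
§ Extend the analysis using vector norms
  $\boldsymbol{x}'_i = \sum_{j=1}^{n} \alpha_{i,j}\,\boldsymbol{v}_j$
  where $\boldsymbol{v}_j$ is a value vector and $\alpha_{i,j}$ is an attention weight (computed from the inner product of the query and key vectors).
  Look at the attention pattern through the norm $\lVert \alpha_{i,j}\,\boldsymbol{v}_j \rVert$ instead of $\alpha_{i,j}$.
[22] Goro Kobayashi et al. (2020) “Attention is Not Only a Weight: Analyzing Transformers with Vector Norms”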
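A tiny numeric example shows how the two views can disagree. The attention weights and value vectors below are made up for illustration: the first token receives most of the weight but has a near-zero value vector.

```python
import numpy as np

# Hypothetical attention weights of one query over three context tokens.
alpha = np.array([0.90, 0.06, 0.04])

# Hypothetical value vectors; the first one is almost the zero vector.
values = np.array([[0.01, 0.0],
                   [1.00, 1.0],
                   [0.00, 2.0]])

# Weight-based view: which token gets the largest attention weight?
by_weight = int(alpha.argmax())

# Norm-based view: how large is each token's actual contribution alpha_j * v_j?
contribution_norms = np.linalg.norm(alpha[:, None] * values, axis=1)
by_norm = int(contribution_norms.argmax())

print(by_weight, by_norm)  # 0 1
print(contribution_norms.round(3))
```

Here the weight-based view points at token 0, while the norm-based view shows that token 1 contributes most to the output.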

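Extending the analysis reduces the unnatural skew (continued)
§ Take the value vectors into account as well
§ In BERT, the unnatural skew in the attention pattern disappeared
  [Figure from [22]: average degree of attention per layer. With attention weights, there is a puzzling tendency to attend excessively to special tokens and punctuation; with the analysis of the whole attention mechanism, special tokens and punctuation are not attended to especially strongly.]
§ In Tanuki 8B, the skew is also reduced (example attention pattern at one layer)
[22] Goro Kobayashi et al. (2020) “Attention is Not Only a Weight: Analyzing Transformers with Vector Norms”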

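The mechanism behind the skew: a "trash can"
This phenomenon is a "trash can function" for attention weights
§ The attention mechanism assigns large weights $\alpha_{i,j}$ to a few tokens, such as the beginning of the sequence
§ However, it keeps the value vectors of those tokens small
This phenomenon is related to a constraint of the attention mechanism
§ Softmax always distributes attention weights over the whole context so that they sum to 1
§ A head may want to put weight on a specific pair (e.g., subject and verb) and otherwise do nothing; when no such pair is present, it needs somewhere to discard its attention weight
§ It realizes "do nothing (no-operation)" by turning a few tokens into a trash can for attention weights (discarding the weights when the information it wants is absent)
[22] Goro Kobayashi et al. (2020) “Attention is Not Only a Weight: Analyzing Transformers with Vector Norms”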

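Aside: the importance of the trash can, and removing the constraint
In lightweight attention that restricts references to a fixed-width window, whether the beginning of the sequence can be attended to (whether a trash can exists) matters [23]
§ Simply keeping attention weights to the beginning of the sequence computable gives far better performance
A proposal for removing the softmax constraint [24]
§ Add 1 to the denominator → removes the sum-to-1 constraint
  $\mathrm{softmax}_1(\boldsymbol{x})_i = \dfrac{\exp(x_i)}{1 + \sum_j \exp(x_j)}$
  If every $x_j$ is sufficiently small (e.g., −10 or below), nearly zero weight can be assigned to all positions.
[23] Guangxuan Xiao et al. (2024) “Efficient Streaming Language Models with Attention Sinks”
[24] Evan Miller (2023) “Attention Is Off By One”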
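The effect of adding 1 to the denominator can be checked numerically. A minimal sketch: `softmax_plus_one` is a hypothetical name for the variant described above, and the input scores are made up.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax_plus_one(x):
    # exp(x_i) / (1 + sum_j exp(x_j)), computed stably by shifting with
    # m = max(max(x), 0); the constant 1 in the denominator becomes exp(-m).
    m = max(x.max(), 0.0)
    e = np.exp(x - m)
    return e / (np.exp(-m) + e.sum())

scores = np.array([-10.0, -12.0, -11.0])  # "nothing here is worth attending to"

standard = softmax(scores)
relaxed = softmax_plus_one(scores)

print(standard.sum())  # sums to 1 (up to rounding): some token must absorb the weight
print(relaxed.sum())   # close to 0: the head can effectively do nothing
```

With standard softmax the weights always sum to 1, however small the scores are. With the modified version they can all stay near zero.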

The attention mechanism in each layer does not change the representations that much
Change how we view the model by paying attention to the residual connection
[Figure: the same model drawn in two ways. In the second drawing, the attention mechanism, feed-forward network, and layer normalizations of each layer branch off a Residual Stream that runs from the embedding layer to the prediction head.]
Consider the roles of the attention mechanism and the residual connection
§ Attention mechanism: refers to the context and gathers information
§ Residual connection: keeps the information coming from below (= refers to the token's own representation)
Include the residual connection in attention pattern analysis
§ Expressed as a formula, the attention mechanism plus the residual connection is
  $\mathrm{AttnRes}_i = \sum_{j=1}^{n} \alpha_{i,j}\,\boldsymbol{v}_j + \boldsymbol{x}_i$
§ As before, the contribution of each token $j$ can be measured with a vector norm, $\lVert \alpha_{i,j}\,\boldsymbol{v}_j + \delta_{i,j}\,\boldsymbol{x}_i \rVert$, where $\delta_{i,j}$ is 1 if $i = j$ and 0 otherwise
[25] Goro Kobayashi et al. (2021) “Incorporating Residual and Normalization Layers into Analysis of Masked Language Models”

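The attention mechanism in each layer does not change the representations that much (continued)
Include the residual connection in attention pattern analysis
§ The residual connection preserves the token's own information very strongly, so each layer's attention mechanism refers to the context only a little at a time
§ [Figure: example attention patterns at one layer of Tanuki 8B for an arbitrary input text, shown three ways: attention weights, the whole attention mechanism, and the attention mechanism plus residual connection.]
[25] Goro Kobayashi et al. (2021) “Incorporating Residual and Normalization Layers into Analysis of Masked Language Models”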

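Attempts to widen the scope of analysis further
Add the feed-forward network to attention pattern analysis [26]
§ Found that the feed-forward network also takes part in referring to context information
  ⁃ It strengthened references to frequent phrases, named entities, and the like
§ Found a tendency for modules to cancel out each other's effects
  ⁃ The effect of the feed-forward network is strongly cancelled by the layer normalizations before and after it
An attempt to widen the analysis to the whole model [27]
§ Extends the analysis to the whole model and computes the influence of each input word on the prediction
§ Can be used as an Input Attribution method that visualizes "which inputs mattered for the prediction"
  ⁃ Better than the gradient-based methods proposed in the image processing field
§ Note: currently there is only an implementation for BERT, and none that works with decoder models
[Figure: influence of each input word when solving a sentiment classification task.]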

① Viewing the feed-forward network as a knowledge storage module
Regard the feed-forward network as a memory device [28]
§ The feed-forward network (a 2-layer MLP) resembles the attention mechanism
  With an input matrix $X \in \mathbb{R}^{n \times d}$, the two modules can be written as
  $Q_h = X W_h^Q,\quad K_h = X W_h^K,\quad V_h = X W_h^V \quad (1)$
  $\text{Self-Att}_h(X) = \mathrm{softmax}\left(Q_h K_h^{\top}\right) V_h \quad (2)$
  $\mathrm{FFN}(H) = \mathrm{gelu}(H W_1)\, W_2 \quad (3)$
  where $W_h^Q, W_h^K, W_h^V, W_1, W_2$ are parameter matrices. Equations (2) and (3) are very similar, with the weight matrices $W_1$ and $W_2$ playing the roles of $K$ and $V$.
  ⁃ Attention mechanism
    • A query vector is the input
    • Attention weights are computed from inner products with the key vectors (the context information)
    • The value vectors are summed, each multiplied by its attention weight
  ⁃ Feed-forward network
    • An intermediate representation is the input
    • Activation values are computed from inner products with the columns of the first weight matrix
    • The vectors of the second weight matrix are summed, each multiplied by its activation value
  ⁃ The attention mechanism gathers information from the surrounding word representations; the feed-forward network gathers information from its weight parameters
§ The inner products with the first weight matrix (the activation values) respond to specific input patterns
  ⁃ n-grams
  ⁃ topics
§ The second weight matrix contains vectors that steer the prediction toward specific words
§ Information such as relational knowledge and frequent expressions is stored in the parameters, and the information needed for the current input is added to the representation
[Figure: illustration of how an FFN module in a Transformer block works as a key-value memory, for the input “The capital of Ireland is [MASK]” → “Dublin”. The first linear layer FFN(key) computes intermediate neurons through inner products; taking the activations of these neurons as weights, the second linear layer FFN(val) integrates value vectors through a weighted sum.]
[28] Mor Geva et al. (2021) “Transformer Feed-Forward Layers Are Key-Value Memories”
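The key-value reading of FFN(H) = gelu(H W1) W2 can be verified directly: the matrix form and the "weighted sum of value vectors" form give the same output. A small sketch with random weights. The names `w_key` and `w_val` (for W1 and W2), the sizes, and the tanh approximation of GELU are assumptions for illustration.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
d_model, d_ff = 4, 6
w_key = rng.normal(size=(d_model, d_ff))  # W1: each column acts as a "key"
w_val = rng.normal(size=(d_ff, d_model))  # W2: each row acts as a "value"
h = rng.normal(size=d_model)              # one intermediate representation

# Matrix form: FFN(h) = gelu(h W1) W2
ffn_out = gelu(h @ w_key) @ w_val

# Memory form: match h against every key, then sum the values
# weighted by the resulting activations.
activations = np.array([gelu(h @ w_key[:, i]) for i in range(d_ff)])
memory_out = sum(a * w_val[i] for i, a in enumerate(activations))

print(np.allclose(ffn_out, memory_out))  # True
```

Unlike attention weights, the activations are not normalized to sum to 1, and the keys and values are fixed parameters rather than functions of the context.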

Knowledge editing by intervening on the feed-forward network
§ Edit the knowledge a model holds by editing weight matrices or adding parameters
  ⁃ KD [29], ROME [30], MEMIT [31], etc.
§ There is also a concern that unrelated knowledge gets updated unintentionally at the same time
[Figure: before editing, the LLM answers "Which team does Shohei Ohtani play for?" with "The Angels" and "Which team does Ichiro play for?" with "The Mariners". After a knowledge edit that updates Ohtani's team from the Angels to the Dodgers, the LLM answers "The Dodgers" to both questions, so the unrelated fact about Ichiro has also changed.]

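② Interpretation by projection onto the embedding space
Interpret intermediate representations by passing them straight to the final prediction head
§ Logit Lens [32]
  ⁃ Gives an intuitive sense of what stage of prediction the model has reached at each layer
[Figure: for the input “日本の首都は”, the prediction head is applied not only to the last layer but also to the representation at an intermediate layer, which already yields “東京”: "So it had reached the correct prediction at a fairly early stage."]
[Figure panels: the top predicted token at each layer; the rank of the model's final predicted token at each layer.]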
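The two quantities shown in these plots (the top token per layer, and the rank of the final prediction per layer) can be computed as below. This is a toy sketch: the hidden states and the prediction-head matrix are random placeholders, and the final layer normalization that real models usually apply before the head is omitted. `logit_lens` is a hypothetical function name.

```python
import numpy as np

def logit_lens(hidden_states, w_unembed):
    # Apply the prediction head to every layer's representation.
    logits = hidden_states @ w_unembed               # (n_layers, vocab_size)
    top_tokens = logits.argmax(axis=1)               # top prediction at each layer
    final_token = top_tokens[-1]                     # the model's final prediction
    # Rank of the final prediction at each layer (1 = top candidate).
    ranks = (logits > logits[:, [final_token]]).sum(axis=1) + 1
    return top_tokens, ranks

rng = np.random.default_rng(0)
n_layers, d_model, vocab_size = 6, 16, 50
w_unembed = rng.normal(size=(d_model, vocab_size))   # placeholder prediction head

# Placeholder last-position hidden states; the cumulative sum mimics layers
# adding their updates onto a residual stream.
hidden_states = np.cumsum(rng.normal(size=(n_layers, d_model)), axis=0)

top_tokens, ranks = logit_lens(hidden_states, w_unembed)
print(top_tokens)
print(ranks)  # the last entry is 1 by construction
```

With a real model, a rank that drops to 1 well before the last layer corresponds to the "already correct at an early stage" observation.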

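Interpret model parameters by projecting them onto the embedding matrix of the prediction head [33]
§ The role of each parameter can be interpreted by tying it to the vocabulary
§ Some attention parameters captured specific kinds of information
  ⁃ Gender, geography, law, etc.
§ Some feed-forward network parameters were related to specific topics
  ⁃ Months, personal names, sports, etc.
[33] Guy Dar et al. (2023) “Analyzing Transformers in Embedding Space”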

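③ Tracing information and processing steps through the whole model
What flow does a prediction follow through the whole model? [34]
§ Analyzes cases in which the model predicts factual knowledge
  ⁃ Example: “Beats Music is owned by” → “Apple”
§ Identifies the important components and their functions through projection onto the embedding space, removal of layers, and similar techniques
§ The prediction appears to proceed in 3 steps
  1. Feed-forward networks in early layers enrich the information about the subject “Beats Music”
    • This happens at the representation of the last token of the subject
    • Knowledge is added from the parameters
  2. Attention in early layers gathers the information about the relation “is owned by” into the representation at the last position
  3. Attention in late layers gathers the information about “Apple” from the enriched “Beats Music” information
[34] Mor Geva et al. (2023) “Dissecting Recall of Factual Associations in Auto-Regressive Language Models”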

LLM 大規模言語モデル講座講義資料 © 2024 by 東京大学松尾・岩澤研究室 63 ④ Sparse
Autoencoder による解釈モデル内部の活性化状態をより細かい粒度で観察する § 従来︓ニューロンの観察 [35, 36, 37] ⁃ たくさんのテキストを⼊⼒し、各ニューロンが特に活性化した⼊⼒たちを観察することで、各ニューロンと⾔語現象を関連付けて解釈する ⁃ 1つのニューロンが複数の特徴量に対して活性化する重ね合わせ (Superposition) が起きるので、解釈が難しい • 例︓あるニューロンが「HTTPリクエスト」と「韓国語テキスト」の両⽅に反応する [36] Neel Nanda “Neuroscope: A Website for Mechanistic Interpretability of Language Models”

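§ A recent approach: untangling neurons into finer pieces with a Sparse Autoencoder [38]
⁃ Train an autoencoder that converts activations into higher-dimensional, sparse features
• Feed text into the model and extract the activation vector 𝒛 (say, 100 dimensions) of the feed-forward network in some layer
• Train an autoencoder that first maps 𝒛 to a higher-dimensional, sparse vector 𝐡(𝒛) (say, 500 dimensions) and then maps it back to 𝒛
⁃ Inspecting the features obtained from the Sparse Autoencoder (the elements of 𝐡(𝒛)) separates the features that were superposed, so they can be observed at a finer, easier-to-interpret granularity
[1] Javier Ferrando et al. (2024) “A Primer on the Inner Workings of Transformer-based Language Models”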
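The forward pass and training objective of such an autoencoder can be sketched as follows. This is a minimal illustration under stated assumptions: a ReLU encoder, a linear decoder, and a loss of squared reconstruction error plus an L1 penalty on 𝐡(𝒛). The weights are set by hand rather than learned, the dimensions (2 and 4) stand in for the 100 and 500 of the example, and names such as `sae_forward` and `l1_coeff` are hypothetical. Published implementations differ in details.

```python
def relu(x):
    return [max(0.0, v) for v in x]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sae_forward(z, W_enc, b_enc, W_dec, b_dec):
    """Encode z into sparse features h, then reconstruct z from h."""
    h = relu([a + b for a, b in zip(matvec(W_enc, z), b_enc)])
    z_hat = [a + b for a, b in zip(matvec(W_dec, h), b_dec)]
    return h, z_hat


def sae_loss(z, h, z_hat, l1_coeff):
    """Squared reconstruction error plus an L1 sparsity penalty on h."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(z, z_hat))
    sparsity = sum(abs(v) for v in h)
    return reconstruction + l1_coeff * sparsity


# Hand-set toy weights (assumption: in practice these are learned by
# minimizing sae_loss over many activation vectors).
W_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]  # 4 features x 2 dims
b_enc = [0.0, 0.0, 0.0, 0.0]
W_dec = [[1.0, -1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, -1.0]]                              # 2 dims x 4 features
b_dec = [0.0, 0.0]

z = [0.5, -2.0]  # made-up activation vector
h, z_hat = sae_forward(z, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(z, h, z_hat, l1_coeff=0.1)
active_features = [i for i, v in enumerate(h) if v > 0.0]

print("features h(z)  :", h)
print("reconstruction :", z_hat)
print("active features:", active_features, "loss:", loss)
```

Only two of the four features are active for this input, and the input is still reconstructed exactly; the L1 term is what pushes a trained autoencoder toward this kind of sparse code.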

§ OpenAI, Anthropic, and Google have each released Sparse Autoencoder demos or parameters for their own models [39, 40, 41]
[Figure: example Sparse Autoencoder analysis of GPT-4]
[39] Leo Gao et al. (2024) “Scaling and evaluating sparse autoencoders” Demo URL

[Figure: example Sparse Autoencoder analysis of Claude 3 Sonnet]
[40] Adly Templeton et al. (2024) “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet” Demo URL

[Figure: example Sparse Autoencoder analysis of Gemma 2]
[41] Tom Lieberum et al. (2024) “Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2” Demo URL

Goals for the second half of Day 10 (did you achieve them?)
1. Understand the history of LLM analysis and its representative analysis methods
2. Understand key findings about how LLMs work
3. Get a picture of recent trends in the analysis community

Topics not covered (in depth) this time
§ Input attribution methods
⁃ GradientNorm [42,43], Input×Gradient [44,45], SmoothGrad [46], IntegratedGrad [47], LIME [48], SHAP [49], etc.
§ Analysis and control by intervening on intermediate representations
⁃ Activation patching family [50,51]
⁃ Steering family [52,53]
§ Variants of the Logit Lens
⁃ Attention Lens [54], Tuned Lens [55], etc.
§ Is the model a system that represents relational knowledge linearly?
⁃ Linear relational embedding (LRE) [56,57]
§ Redundancy and pruning of Transformers [58,59]
§ Physics of Language Models [60,61,62,63,64,65]

Acknowledgments
I am deeply grateful to everyone who provided useful information and advice for these slides.
§ 高橋良允 (Tohoku University)
§ 坂田将樹 (Tohoku University)
§ 横井祥 (Tohoku University)
§ 栗林樹生 (MBZUAI)
§ Everyone in the Tohoku University Natural Language Processing Group (Tohoku NLP)

References
⁃ [1] Javier Ferrando et al. (2024) “A Primer on the Inner Workings of Transformer-based Language Models” arXiv cs.CL/2405.00208
⁃ [2] Marta Costa-jussà et al. (2023) “Toxicity in Multilingual Machine Translation at Scale” Findings of EMNLP 2023
⁃ [3] Dennis Wei et al. (2022) “On the Safety of Interpretable Machine Learning: A Maximum Deviation Approach” NeurIPS 2022
⁃ [4] Nick Cammarata et al. (2020.3-2021.4) “Distill Circuits Thread”
⁃ [5] Anthropic (2021.12-now) “Transformer Circuits Thread”
⁃ [6] Ian Tenney et al. (2019) “BERT Rediscovers the Classical NLP Pipeline” ACL 2019
⁃ [7] Yongjie Lin et al. (2019) “Open Sesame: Getting Inside BERT's Linguistic Knowledge” BlackboxNLP 2019
⁃ [8] John Hewitt & Percy Liang (2019) “Designing and Interpreting Probes with Control Tasks” EMNLP-IJCNLP 2019
⁃ [9] David Mareček & Rudolf Rosa (2019) “From Balustrades to Pierre Vinken: Looking for Syntax in Transformer Self-Attentions” BlackboxNLP 2019
⁃ [10] Kevin Clark et al. (2019) “What Does BERT Look at? An Analysis of BERT’s Attention” BlackboxNLP 2019

⁃ [11] Olga Kovaleva et al. (2019) “Revealing the Dark Secrets of BERT” EMNLP-IJCNLP 2019
⁃ [12] Rebecca Kehlbeck et al. (2021) “Demystifying the Embedding Space of Language Models” VisxAI 2021
⁃ [13] Wes Gurnee & Max Tegmark (2024) “Language Models Represent Space and Time” ICLR 2024
⁃ [14] Kenneth Li et al. (2023) “Inference-Time Intervention: Eliciting Truthful Answers from a Language Model” NeurIPS 2023
⁃ [15] Junteng Liu et al. (2024) “On the Universal Truthfulness Hyperplane Inside LLMs” arXiv cs.CL/2407.08582
⁃ [16] Lennart Bürger et al. (2024) “Truth is Universal: Robust Detection of Lies in LLMs” NeurIPS 2024
⁃ [17] Andy Arditi et al. (2024) “Refusal in Language Models Is Mediated by a Single Direction” arXiv cs.CL/2406.11717
⁃ [18] Jay Alammar (2018) “The Illustrated Transformer” Blog post
⁃ [19] Jesse Vig & Yonatan Belinkov (2019) “Analyzing the Structure of Attention in a Transformer Language Model” BlackboxNLP 2019
⁃ [20] Nelson Elhage et al. (2021) “A Mathematical Framework for Transformer Circuits” Transformer Circuits Thread 2021

⁃ [21] Catherine Olsson et al. (2022) “In-context Learning and Induction Heads” Transformer Circuits Thread 2022
⁃ [22] Goro Kobayashi et al. (2020) “Attention is Not Only a Weight: Analyzing Transformers with Vector Norms” EMNLP 2020
⁃ [23] Guangxuan Xiao et al. (2024) “Efficient Streaming Language Models with Attention Sinks” ICLR 2024
⁃ [24] Evan Miller (2023) “Attention Is Off By One” Blog post
⁃ [25] Goro Kobayashi et al. (2021) “Incorporating Residual and Normalization Layers into Analysis of Masked Language Models” EMNLP 2021
⁃ [26] Goro Kobayashi et al. (2024) “Analyzing Feed-Forward Blocks in Transformers through the Lens of Attention Maps” ICLR 2024
⁃ [27] Ali Modarressi et al. (2023) “DecompX: Explaining Transformers Decisions by Propagating Token Decomposition” ACL 2023
⁃ [28] Mor Geva et al. (2021) “Transformer Feed-Forward Layers Are Key-Value Memories” EMNLP 2021
⁃ [29] Damai Dai et al. (2022) “Knowledge Neurons in Pretrained Transformers” ACL 2022
⁃ [30] Kevin Meng et al. (2022) “Locating and Editing Factual Associations in GPT” NeurIPS 2022

⁃ [31] Kevin Meng et al. (2023) “Mass-Editing Memory in a Transformer” ICLR 2023
⁃ [32] nostalgebraist (2020) “interpreting GPT: the logit lens” Blog post
⁃ [33] Guy Dar et al. (2023) “Analyzing Transformers in Embedding Space” ACL 2023
⁃ [34] Mor Geva et al. (2023) “Dissecting Recall of Factual Associations in Auto-Regressive Language Models” EMNLP 2023
⁃ [35] Daisuke Oba et al. (2021) “Exploratory Model Analysis Using Data-Driven Neuron Representations” BlackboxNLP 2021
⁃ [36] Neel Nanda “Neuroscope: A Website for Mechanistic Interpretability of Language Models” Website
⁃ [37] Wes Gurnee et al. (2024) “Universal Neurons in GPT2 Language Models” TMLR 2024
⁃ [38] Trenton Bricken et al. (2023) “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning” Transformer Circuits Thread by Anthropic 2023
⁃ [39] Leo Gao et al. (2024) “Scaling and evaluating sparse autoencoders” arXiv cs.LG/2406.04093 Demo URL
⁃ [40] Adly Templeton et al. (2024) “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet” Transformer Circuits Thread by Anthropic 2024 Demo URL

⁃ [41] Tom Lieberum et al. (2024) “Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2” BlackboxNLP 2024 Demo URL
⁃ [42] Karen Simonyan et al. (2014) “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps” ICLR (Workshop Poster) 2014
⁃ [43] Jiwei Li et al. (2016) “Visualizing and Understanding Neural Models in NLP” NAACL-HLT 2016
⁃ [44] Avanti Shrikumar et al. (2016) “Not Just a Black Box: Learning Important Features Through Propagating Activation Differences” arXiv cs.LG/1605.01713
⁃ [45] Misha Denil et al. (2014) “Extraction of Salient Sentences from Labelled Documents” arXiv cs.CL/1412.6815
⁃ [46] Daniel Smilkov et al. (2017) “SmoothGrad: removing noise by adding noise” arXiv cs.LG/1706.03825
⁃ [47] Mukund Sundararajan et al. (2017) “Axiomatic Attribution for Deep Networks” ICML 2017
⁃ [48] Marco Tulio Ribeiro et al. (2016) “Why Should I Trust You?: Explaining the Predictions of Any Classifier” KDD 2016
⁃ [49] Scott M. Lundberg et al. (2017) “A Unified Approach to Interpreting Model Predictions” NeurIPS 2017
⁃ [50] Stefan Heimersheim et al. (2024) “How to use and interpret activation patching” arXiv cs.LG/2404.15255

⁃ [51] Asna Ghandeharioun et al. (2024) “Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models” ICML 2024
⁃ [52] Nishant Subramani et al. (2022) “Extracting Latent Steering Vectors from Pretrained Language Models” ACL Findings 2022
⁃ [53] Alexander Matt Turner et al. (2023) “Steering Language Models with Activation Engineering” arXiv cs.CL/2308.10248
⁃ [54] Mansi Sakarvadia et al. (2023) “Attention Lens: A Tool for Mechanistically Interpreting the Attention Head Information Retrieval Mechanism” ATTRIB 2023
⁃ [55] Nora Belrose et al. (2023) “Eliciting Latent Predictions from Transformers with the Tuned Lens” arXiv cs.LG/2303.08112
⁃ [56] Evan Hernandez et al. (2024) “Linearity of Relation Decoding in Transformer Language Models” ICLR 2024
⁃ [57] David Chanin et al. (2024) “Identifying Linear Relational Concepts in Large Language Models” NAACL-HLT 2024
⁃ [58] Andrey Gromov et al. (2024) “The Unreasonable Ineffectiveness of the Deeper Layers” arXiv cs.CL/2403.17887
⁃ [59] Shwai He et al. (2024) “What Matters in Transformers? Not All Attention is Needed” arXiv cs.LG/2406.15786
⁃ [60] Zeyuan Allen-Zhu et al. (2023) “Physics of Language Models: Part 1, Learning Hierarchical Language Structures” arXiv cs.CL/2305.13673

⁃ [61] Tian Ye et al. (2024) “Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process” arXiv cs.AI/2407.20311
⁃ [62] Tian Ye et al. (2024) “Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems” arXiv cs.CL/2408.16293
⁃ [63] Zeyuan Allen-Zhu et al. (2023) “Physics of Language Models: Part 3.1, Knowledge Storage and Extraction” arXiv cs.CL/2309.14316
⁃ [64] Zeyuan Allen-Zhu et al. (2023) “Physics of Language Models: Part 3.2, Knowledge Manipulation” arXiv cs.CL/2309.14402
⁃ [65] Zeyuan Allen-Zhu et al. (2024) “Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws” arXiv cs.CL/2404.05405

Useful tutorials
§ Yonatan Belinkov et al. (2020) “ACL 2020 Tutorial: Interpretability and Analysis in Neural NLP”
⁃ Slide: https://sebastiangehrmann.com/assets/files/acl_2020_interpretability_tutorial.pdf
§ Eric Wallace et al. (2020) “EMNLP 2020 Tutorial: Interpreting Predictions of NLP Models”
⁃ Slide: https://github.com/Eric-Wallace/interpretability-tutorial-emnlp2020/blob/master/tutorial_slides.pdf
⁃ Video: https://www.youtube.com/watch?v=gprIzglUW1s
§ Hosein Mohebbi et al. (2024) “EACL 2024 Tutorial: Transformer-specific Interpretability”
⁃ Slides: https://github.com/interpretingdl/eacl2024_transformer_interpretability_tutorial/tree/main/slides
⁃ Notebooks: https://github.com/interpretingdl/eacl2024_transformer_interpretability_tutorial/tree/main/notebooks
§ Ningyu Zhang et al. (2023, 2024)
⁃ “AACL 2023 Tutorial: Editing Large Language Models” Slide: https://github.com/zjunlp/KnowledgeEditingPapers/blob/main/AACL2023%40Tutorial_Editing%20LLMs.pdf
⁃ “AAAI 2024 Tutorial: Knowledge Editing for Large Language Models” Slide: https://github.com/zjunlp/KnowledgeEditingPapers/blob/main/AAAI2024%40Tutorial_Knowledge%20Editing%20for%20LLMs.pdf
⁃ “LREC-COLING 2024 Tutorial: Knowledge Editing for Large Language Models” Slide: https://github.com/zjunlp/KnowledgeEditingPapers/blob/main/COLING2024%40Tutorial_Knowledge%20Editing%20for%20LLMs.pdf
⁃ “IJCAI 2024 Tutorial: Knowledge Editing for Large Language Models” Slide: https://github.com/zjunlp/KnowledgeEditingPapers/blob/main/IJCAI2024%40Tutorial_Knowledge%20Editing%20for%20LLMsl.pdf
§ ICML 2024 Tutorial: Physics of Language Models
⁃ Project page: https://physics.allen-zhu.com/
⁃ Video: https://youtu.be/yBL7J0kgldU

Useful survey papers
§ Javier Ferrando et al. (2024) “A Primer on the Inner Workings of Transformer-based Language Models” arXiv cs.CL/2405.00208
§ Qing Lyu et al. (2024) “Towards Faithful Model Explanation in NLP: A Survey” Computational Linguistics
§ Andreas Madsen et al. (2022) “Post-hoc Interpretability for Neural NLP: A Survey” ACM Computing Surveys
§ Leonard Bereska et al. (2024) “Mechanistic Interpretability for AI Safety - A Review” TMLR 2024
§ Zifan Zheng et al. (2024) “Attention Heads of Large Language Models: A Survey” arXiv cs.CL/2409.03752
§ Haiyan Zhao et al. (2024) “Explainability for Large Language Models: A Survey” ACM TIST
§ Haoyan Luo et al. (2024) “From Understanding to Utilization: A Survey on Explainability for Large Language Models” arXiv cs.CL/2401.12874

Useful summary sites
§ Neel Nanda (2023?) “A Comprehensive Mechanistic Interpretability Explainer & Glossary”
§ Xhoni Shollaj (2023) “Awesome LLM Interpretability”
§ Ryota Takatsuki (2024) “機械論的解釈可能性の紹介” (An Introduction to Mechanistic Interpretability)

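Useful videos
§ 3Blue1BrownJapan (2024) “LLMはどう知識を記憶しているか | Chapter 7, 深層学習” (How LLMs store knowledge | Chapter 7, Deep Learning)
§ Effective Altruism Japan (2024) “第4回マンスリーEA 高槻瞭大「機械論的解釈可能性の紹介」” (4th Monthly EA: Ryota Takatsuki, “An Introduction to Mechanistic Interpretability”)
§ NLPコロキウム (NLP Colloquium) (2024) “第52回: Transformer言語モデルを内部挙動から理解する” (No. 52: Understanding Transformer language models from their internal behavior)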
