
[Paper Introduction] Differentiation and Specialization of Attention Heads via the Refined LLC (rLLC)

An introduction to a study that classifies the attention heads of a two-layer attention-only Transformer, and how they change during training, using the LLC (local learning coefficient), a numerical approximation of the real log canonical threshold, a quantity from singular learning theory that describes the properties of statistical models and neural networks.


xiangze

December 27, 2025

Transcript

  1. [Paper Introduction] Differentiation and Specialization of Attention Heads via the Refined LLC (rLLC)

    George Wang, Jesse Hoogland, Stan van Wingerden, Zach Furman, Daniel Murfet https://arxiv.org/abs/2410.02984 2025/12/28 xiangze
  2. Definition and properties of the local learning coefficient (LLC)

    Idea: to understand the properties of a DNN during or after training, it is enough to look at the neighborhood of a local solution where the posterior is high.
    For the volume V(ε) of the set of weights whose loss exceeds the local optimum by at most ε, there exist a rational number λ(w*) and a positive integer m(w*) such that V(ε) ≈ c ε^{λ(w*)} (−log ε)^{m(w*)−1} as ε → 0; equivalently, λ(w*) = lim_{ε→0} log V(ε) / log ε.
    The physical interpretation is the relationship between the number of data points N and this volume.

    For a regular model, λ = d/2; the dimension d is a (Kolmogorov-style) measure of complexity, and the LLC generalizes it.
    The LLC can also be read as the number of additional bits B needed to push an already small error ε down to ε/2 (obtained by rearranging the volume-scaling relation).
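
For reference, a hedged sketch of the standard SLT statements these bullet points compress (my reconstruction, not the slide's exact equations):

```latex
% Volume scaling near a local optimum w^* (defines the LLC \lambda and the multiplicity m):
\[
  V(\varepsilon) \;=\; \int_{\{w \,:\, L(w)-L(w^*) \le \varepsilon\}} dw
  \;\sim\; c\,\varepsilon^{\lambda(w^*)}\,\bigl(-\log\varepsilon\bigr)^{m(w^*)-1},
  \qquad \varepsilon \to 0 .
\]
% Relation to the number of data points N (local free-energy asymptotics):
\[
  F_N \;\approx\; N\,L(w^*) \;+\; \lambda(w^*)\log N .
\]
% "Additional bits needed to halve an already small error":
\[
  \log_2 \frac{V(\varepsilon)}{V(\varepsilon/2)} \;\approx\; \lambda(w^*).
\]
```

The free-energy line is the usual asymptotic expansion restricted to a neighborhood of w*, which is presumably what "the relationship between the number of data points N and the volume" refers to.
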
  3. Estimating the LLC

    Use Stochastic Gradient Langevin Dynamics (SGLD), taking the step Δw_t of the Langevin equation as defined on the slide (a sketch of a typical implementation follows below).
    HMC / NUTS via pyro (PyTorch) is also possible; it tunes the step size automatically but is computationally expensive.
    → For linear neural networks (matrix-factorization models) the exact learning coefficient is known, so the estimates can be compared against it.
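
A minimal sketch of what SGLD-based LLC estimation typically looks like, assuming the commonly used localized estimator λ̂ = nβ(E_β[L_n(w)] − L_n(w*)) with a restoring force toward w*; the function name, the default β = 1/log n, and all hyperparameters are illustrative rather than the paper's settings.

```python
# Minimal sketch of SGLD-based LLC estimation: draw samples around w* with a
# tempered, localized Langevin update and compare the mean sampled loss to L_n(w*).
import torch


def estimate_llc(model, loss_fn, data_loader, n_data,
                 num_steps=1000, step_size=1e-4, beta=None, gamma=100.0):
    beta = beta if beta is not None else 1.0 / torch.log(torch.tensor(float(n_data)))
    w_star = [p.detach().clone() for p in model.parameters()]

    # Average loss at the (approximate) local optimum w*
    with torch.no_grad():
        loss_star = sum(loss_fn(model(x), y).item() for x, y in data_loader) / len(data_loader)

    sampled_losses = []
    data_iter = iter(data_loader)
    for _ in range(num_steps):
        try:
            x, y = next(data_iter)
        except StopIteration:
            data_iter = iter(data_loader)
            x, y = next(data_iter)

        loss = loss_fn(model(x), y)
        model.zero_grad()
        loss.backward()

        with torch.no_grad():
            for p, p0 in zip(model.parameters(), w_star):
                # Localized SGLD step: tempered gradient + pull toward w* + Gaussian noise
                drift = -(step_size / 2) * (float(beta) * n_data * p.grad + gamma * (p - p0))
                p.add_(drift + torch.randn_like(p) * step_size ** 0.5)

        sampled_losses.append(loss.item())

    mean_loss = sum(sampled_losses) / len(sampled_losses)
    # lambda_hat = n * beta * (E_beta[L_n(w)] - L_n(w*))
    return n_data * float(beta) * (mean_loss - loss_star)
```

In practice one would discard burn-in draws, average over several chains, and track a full-batch loss estimate instead of reusing the minibatch loss as done here for brevity.
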
  4. Formal definition of the refined LLC

    Replace the distribution q by q' and the loss function ℓ by ℓ', and split the domain of w as W = U × V with w* = (u*, v*).
    With respect to the volume of the (restricted) loss, the weight- and data-refined LLC λ is defined as on the slide, with the posterior replaced accordingly (a sketch of the definition follows below).
    When W = V it is the data-refined LLC; when q' = q it is the weight-refined LLC.
    Caveat: the estimated LLC is only an estimate of the learning coefficient, but the paper argues that the SGLD-based LLC estimator is mature enough to be used.
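
The defining equation is not reproduced in the transcript; as a hedged reading of the slide (the block of weights that is allowed to vary is V, with u frozen at u*, and the population loss taken under q'), the volume-scaling definition becomes:

```latex
% Refined volume: only the chosen block v \in V varies (u is frozen at u^*), and the
% population loss L' is taken under the replaced distribution q' / loss \ell'.
\[
  V_{V}^{q'}(\varepsilon)
  \;=\; \int_{\{v \in V \,:\, L'(u^*, v) - L'(u^*, v^*) \le \varepsilon\}} dv
  \;\sim\; c\,\varepsilon^{\lambda}\,\bigl(-\log\varepsilon\bigr)^{m-1},
  \qquad
  \lambda_{V}^{q'}(w^*) \;=\; \lim_{\varepsilon \to 0} \frac{\log V_{V}^{q'}(\varepsilon)}{\log\varepsilon}.
\]
```

Taking V = W (no weights frozen) and only changing the data distribution gives the data-refined LLC; taking q' = q with a proper block V gives the weight-refined LLC, matching the two special cases on the slide.
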
  5. What the refined LLC means: differentiation and specialization

    Weight-refined LLCs reveal how attention heads differentiate across training; a wrLLC is defined for each layer (block of weights).
    Data-refined LLCs reveal how attention heads specialize across training; a drLLC is defined for each input dataset.
  6. Appendix

    • Appendix A: provides theoretical background on the local learning coefficient.
    • Appendix B: provides further details on the head classification. We describe the methodology we followed to manually classify each head, and offer more explanation and examples of each head's classification and specializations.
    • Appendix C: discusses the significance of critical points, increases, and decreases in the (r)LLC in relation to stagewise development in artificial and biological neural networks. We provide additional results for the full-weights data-refined LLC for a variety of common datasets.
    • Appendix D: compares the rLLC to the Hessian trace, Fisher Information Matrix (FIM) trace, and Hessian rank. We show that the LLC consistently outperforms these other techniques.
    • Appendix E: compares the rLLC against ablation-based metrics that are common in mechanistic interpretability (zero ablations, mean ablations, and resample ablations). We discuss the strengths and weaknesses of each of these methods in relation to the rLLC.
    • Appendix F: provides more experimental details on the architecture, training setup, LLC hyperparameters, using models as the generating process for data-refined LLCs, composition scores, and automated clustering analysis.
    • Appendix G: examines the consistency of our findings across different random initializations, supporting the generality of our conclusions. This analysis further supports the robustness of our observations across various model components.
    Additional figures & data can be found at the following repository: https://github.com/timaeus-research/paper-rllcs-2024
  7. Classifying tokens in context (Appendix B)

    • Induction patterns (Appendix B.2)
    • Dyck patterns (Appendix B.3)
    • Skip n-grams (Appendix B.4)
    • n-grams (Appendix B.5)
  8. Stagewise development (Appendix C)

    • Per-head rLLCs over the course of training are plotted and clustered for the following datasets:
      ◦ The Pile (mostly natural language)
      ◦ Arxiv
      ◦ Github/codeparrot
    • Increases in the refined LLC
      ◦ In the usual SLT picture (N → ∞), the model transitions to a saddle point with a larger LLC (learning coefficient).
    • Decreases in the refined LLC
      ◦ A reduction in complexity (reorganization of information?)
      ◦ Related to "critical periods" of memorization and learning
      ◦ Pruning in neuroscience
  9. Relation to work that uses the Hessian as a metric (Appendix D)

    • Hessian trace
      ◦ At a saddle point it can be computed quickly with Hutchinson trace estimation (also used in SAM and in quantization); see the sketch below.
      ◦ It carries less information than the LLC (naturally).
    • Fisher Information Matrix (FIM) trace
      ◦ Gives even coarser information than the Hessian.
    • Hessian rank
      ◦ Efficient methods exist for large matrices, using matrix polynomials; the zero threshold can be kept fixed during training or adapted.
      ◦ Less informative than the LLC, but it can express singularity.
      ◦ One can see the degree of degeneracy decreasing, but the stage boundaries are hard to see:
        "It remains unclear whether the problem lies in the theoretical underpinnings of the Hessian rank approach (e.g., its potential inability to capture higher-order degeneracy), in the practical implementation (due to flawed estimation methodology), or both."
    • Are there quantities visible only through the rLLC?
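
A minimal sketch of Hutchinson trace estimation for the Hessian of a loss, using Hessian-vector products via double backprop; the Rademacher probes and the number of probes are illustrative choices, not taken from the paper.

```python
# Hutchinson estimator: tr(H) = E[v^T H v] for random probes v with E[v v^T] = I.
import torch


def hutchinson_hessian_trace(loss, params, num_probes=50):
    params = [p for p in params if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)

    trace_estimate = 0.0
    for _ in range(num_probes):
        # Rademacher probe vectors with entries +/- 1
        vs = [torch.empty_like(p).bernoulli_(0.5) * 2 - 1 for p in params]
        # Hessian-vector product: gradient of (grad . v) with respect to the parameters
        hvps = torch.autograd.grad(grads, params, grad_outputs=vs, retain_graph=True)
        # Accumulate v^T H v
        trace_estimate += sum((v * hvp).sum().item() for v, hvp in zip(vs, hvps))
    return trace_estimate / num_probes


# Usage sketch: the loss must be built with a graph, e.g.
#   loss = loss_fn(model(x), y)
#   tr_H = hutchinson_hessian_trace(loss, model.parameters())
```

The same HVP machinery is what makes the trace cheap relative to forming the full Hessian.
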
  10. Distinguishing heads with ablation-based metrics (Appendix E)

    • Method (a sketch of the ablations follows below)
      ◦ Zero ablation: setting the targeted activations to zero.
      ◦ Mean ablation: replacing the targeted activations with their mean value across the dataset.
      ◦ Resampling ablation: replacing the targeted activations with those from a randomly selected different input.
      ◦ Path patching: first run two forward passes, one on uncorrupted inputs/activations and one on corrupted inputs/activations; then run a final forward pass, patching in the uncorrupted and corrupted activations so as to isolate the role of a specific computational path.
    • Result
      ◦ The previous-token heads 0:1 and 0:4 can be identified by an increase in the ablation scores during stage LM4.
      ◦ The current-token head 0:5 is also distinguishable by its increase across the ablation scores starting towards the end of LM2 until it reaches a peak during LM4, after which it decreases. This is especially pronounced in the resampling ablation scores.
      ◦ The induction head 1:7 is clearly distinguished by the increase of the ablation scores in LM4. The other induction head 1:6 is less clearly distinct, though its ablation scores do have a different shape from the multigram heads.
      ◦ The multigram head 0:0 has similar rLLC curves to the other layer 0 multigram heads (though it is relatively larger, see Figure 2). The ablation scores suggest that 0:0 is a distinct type of head throughout much of training, with substantially higher values and a qualitatively distinct shape. It is not until LM5 that this head's ablation scores settle to a value comparable to the other layer 0 multigram heads. This complements the analysis in Appendix B.6 that suggests 0:0 starts out as a "space" head that attends to whitespace tokens.
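
A minimal sketch of zero / mean / resample ablation of one attention head's output, implemented as a PyTorch forward hook. The (batch, seq, n_heads, d_head) layout of the hooked module's output and all names here are assumptions for illustration; path patching needs the extra clean/corrupted bookkeeping described above and is not shown.

```python
import torch


def make_ablation_hook(head_idx, mode, mean_activation=None, resample_source=None):
    def hook(module, inputs, output):
        out = output.clone()                          # assumed shape (batch, seq, n_heads, d_head)
        if mode == "zero":
            out[..., head_idx, :] = 0.0               # zero ablation
        elif mode == "mean":
            out[..., head_idx, :] = mean_activation   # (d_head,) mean over a reference dataset
        elif mode == "resample":
            # replace with the same head's activations from a randomly permuted batch
            perm = torch.randperm(out.shape[0])
            out[..., head_idx, :] = resample_source[perm][..., head_idx, :]
        return out
    return hook


# Usage sketch (attn_module, model, loss_fn, batch, targets, clean_loss are placeholders):
#   handle = attn_module.register_forward_hook(make_ablation_hook(head_idx=3, mode="zero"))
#   ablation_score = loss_fn(model(batch), targets) - clean_loss
#   handle.remove()
```
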
  11. wrLLC and ablation-based metrics: on the complementarity of the two approaches

    The wrLLC operates in weight space, while ablation methods work in activation space (complementary).
    • The ablation scores are better at identifying a discrepancy in 0:0. On the other hand, the wrLLC is better at identifying that this head ultimately matures into a multigram head (which we confirm separately).
    • Some heads (e.g., induction head 1:6 and the current-token head 0:5) are more distinguishable in the wrLLC than in ablations.
    • Ablation methods are generally computationally more efficient. However, they lack the theoretical grounding of the wrLLC (which is reflected in the existence of many different possible choices of ablation).
  12. Personal remarks

    • Narrow scope of the study
      ◦ Targets a two-layer, attention-only network (no MLP), the same setting as the induction-head work; all 2×8 heads can be enumerated.
      ◦ Uses the GPT2 tokenizer.
      ◦ The dataset is The Pile (diverse text data: natural language, Github, etc.).
      ◦ How effective is multi-head attention, and how many heads are needed? In practice, 8 seems to be enough for now.
    • Classification and visualization of heads
      ◦ Only token-level relations can be seen. Higher-order relations, i.e., groups of multiple tokens such as noun phrases or participles, are handled elsewhere (implemented in Sparse Autoencoders, Transformer circuits, Neuronpedia).
      ◦ Relations between groups of tokens (already used in "What does BERT look at?"), higher-order functions (function vectors).
    • Stagewise development
      ◦ Can the existence of plateaus be taken to mean that the model is moving between saddle points?
      ◦ If the LLC is viewed as a generalization metric, is a decrease a sign of overfitting? A comeback of early stopping?
    • Relation to the Hessian (Appendix D)
      ◦ The RLCT (LLC) describes more information than the Hessian, but the singularity structure of the Hessian at saddle points w* of loss(w, x), where ∇_w loss(w*, x) = 0, may be simpler than that of a general algebraic variety (representable by combinations of zero eigenvalues and degenerate eigenvectors?) (f(x) = 0, ∇·f(x)).
      ◦ The connection to other singular points lying along the degenerate directions, and its relation to global minima.
  13. Related work and resources

    • Learning Capacity: A Measure of the Effective Dimensionality of a Model (with commentary)
    • Compressibility Measures Complexity: MDL Meets SLT (Timaeus)
    • In-context Learning and Induction Heads, Transformer Circuits (Anthropic)
    • Studies on the interpretability of Dyck languages in Transformers
      ◦ Theoretical Limitations of Self-Attention in Neural Sequence Models
      ◦ Self-Attention Networks Can Process Bounded Hierarchical Languages (a stack machine can be represented with two layers)
      ◦ Theoretical Analysis of Hierarchical Language Recognition and Generation by Transformers without Positional Encoding: the Chomsky-Schützenberger representation theorem (context-free grammars can be expressed with regular grammars and Dyck languages), context-sensitive grammars
    • Ways to realize the recursiveness of language (attention, GNNs, function maps): showing that higher-order relations such as dependencies between phrases and clauses can be represented independently of tokens
    • An attempt to have an (attention-only) Transformer interpret S-expressions and inspect the attention maps: https://github.com/xiangze/Transformer_learns_sexp
  14. Aside: ChatGPT's curious code for computing the LLC, a "volume method"

    https://github.com/xiangze/RLCT_extimation/blob/master/llc_training_experiment.py#L256
    Procedure:
    1. Treat current model parameters as θ*.
    2. Run SGLD around θ* to sample θ_i.
    3. Compute ΔL_i = L(θ_i) - L(θ*).
    4. For thresholds ε_j, compute V(ε_j) ≈ P[ΔL_i <= ε_j].
    5. Fit log V(ε) vs log ε; the slope is ~ λ (the local learning coefficient).

    The fitting code:

        import numpy as np

        # Fit log V ≈ λ * log ε + c
        log_eps = np.log(epsilons)
        log_V = np.log(Vs)
        A = np.vstack([log_eps, np.ones_like(log_eps)]).T
        lam, c = np.linalg.lstsq(A, log_V, rcond=None)[0]  # least-squares slope and intercept

    It converges poorly, but is correct in principle (a fuller sketch follows below). Did it learn this from https://github.com/suswei/RLCT? (See the code walkthrough.)
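
A self-contained sketch of steps 3-5, assuming delta_L holds the sampled loss gaps ΔL_i from step 2; the synthetic samples and the threshold grid below are illustrative, not taken from the linked script.

```python
# Steps 3-5 of the "volume method": estimate V(eps) empirically from loss gaps,
# then fit log V against log eps to read off the slope lambda.
import numpy as np

rng = np.random.default_rng(0)
delta_L = rng.normal(size=5000) ** 2          # stand-in for Delta L_i from SGLD samples

epsilons = np.geomspace(delta_L.max() * 1e-3, delta_L.max() * 1e-1, 20)
Vs = np.array([(delta_L <= eps).mean() for eps in epsilons])   # V(eps_j) ~ P[Delta L <= eps_j]

# Fit log V ≈ lambda * log eps + c; the slope estimates the local learning coefficient
A = np.vstack([np.log(epsilons), np.ones_like(epsilons)]).T
lam, c = np.linalg.lstsq(A, np.log(Vs), rcond=None)[0]
print(f"estimated lambda = {lam:.3f}")        # roughly 0.5 for this 1-D quadratic toy example
```
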