
[Paper Introduction] Differentiation and Specialization of Attention Heads via the Refined LLC (rLLC)

An introduction to a study that classifies the attention heads of a two-layer attention-only Transformer, and how they change during training, using the LLC (local learning coefficient), a numerical approximation of the real log canonical threshold, a quantity from singular learning theory that describes the properties of statistical models and neural networks.


xiangze

December 27, 2025

Transcript

  1. [Paper Introduction] Differentiation and Specialization of Attention Heads via the Refined LLC (rLLC)

    George Wang, Jesse Hoogland, Stan van Wingerden, Zach Furman, Daniel Murfet https://arxiv.org/abs/2410.02984 2025/12/28 xiangze
  2. Definition and properties of the local learning coefficient (LLC)

    Idea: to understand the properties of a DNN during or after training, it is enough to look at the neighborhood of a local solution where the posterior is high.
    For the volume V(ε) of the set of weights whose loss exceeds the local optimum by at most ε, there exist a rational number λ(w*) and a positive integer m(w*) such that V(ε) ≈ c ε^{λ(w*)} (−log ε)^{m(w*)−1} as ε → 0; equivalently, λ(w*) = lim_{ε→0} log V(ε) / log ε.
    The physical interpretation is the relationship between the number of data points N and this volume.

    For a regular model, λ = d/2; the dimension d is a (Kolmogorov-style) measure of complexity, and the LLC generalizes it.
    The LLC can also be read as the number of additional bits B needed to push an already small error ε down to ε/2 (obtained by rearranging the volume-scaling relation).
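
For reference, a hedged sketch of the standard SLT statements these bullet points compress (my reconstruction, not the slide's exact equations):

```latex
% Volume scaling near a local optimum w^* (defines the LLC \lambda and the multiplicity m):
\[
  V(\varepsilon) \;=\; \int_{\{w \,:\, L(w)-L(w^*) \le \varepsilon\}} dw
  \;\sim\; c\,\varepsilon^{\lambda(w^*)}\,\bigl(-\log\varepsilon\bigr)^{m(w^*)-1},
  \qquad \varepsilon \to 0 .
\]
% Relation to the number of data points N (local free-energy asymptotics):
\[
  F_N \;\approx\; N\,L(w^*) \;+\; \lambda(w^*)\log N .
\]
% "Additional bits needed to halve an already small error":
\[
  \log_2 \frac{V(\varepsilon)}{V(\varepsilon/2)} \;\approx\; \lambda(w^*).
\]
```

The free-energy line is the usual asymptotic expansion restricted to a neighborhood of w*, which is presumably what "the relationship between the number of data points N and the volume" refers to.
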
  3. Estimating the LLC

    Use Stochastic Gradient Langevin Dynamics (SGLD), taking the step Δw_t of the Langevin equation as defined on the slide (a sketch of a typical implementation follows below).
    HMC / NUTS via pyro (PyTorch) is also possible; it tunes the step size automatically but is computationally expensive.
    → For linear neural networks (matrix-factorization models) the exact learning coefficient is known, so the estimates can be compared against it.
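
A minimal sketch of what SGLD-based LLC estimation typically looks like, assuming the commonly used localized estimator λ̂ = nβ(E_β[L_n(w)] − L_n(w*)) with a restoring force toward w*; the function name, the default β = 1/log n, and all hyperparameters are illustrative rather than the paper's settings.

```python
# Minimal sketch of SGLD-based LLC estimation: draw samples around w* with a
# tempered, localized Langevin update and compare the mean sampled loss to L_n(w*).
import torch


def estimate_llc(model, loss_fn, data_loader, n_data,
                 num_steps=1000, step_size=1e-4, beta=None, gamma=100.0):
    beta = beta if beta is not None else 1.0 / torch.log(torch.tensor(float(n_data)))
    w_star = [p.detach().clone() for p in model.parameters()]

    # Average loss at the (approximate) local optimum w*
    with torch.no_grad():
        loss_star = sum(loss_fn(model(x), y).item() for x, y in data_loader) / len(data_loader)

    sampled_losses = []
    data_iter = iter(data_loader)
    for _ in range(num_steps):
        try:
            x, y = next(data_iter)
        except StopIteration:
            data_iter = iter(data_loader)
            x, y = next(data_iter)

        loss = loss_fn(model(x), y)
        model.zero_grad()
        loss.backward()

        with torch.no_grad():
            for p, p0 in zip(model.parameters(), w_star):
                # Localized SGLD step: tempered gradient + pull toward w* + Gaussian noise
                drift = -(step_size / 2) * (float(beta) * n_data * p.grad + gamma * (p - p0))
                p.add_(drift + torch.randn_like(p) * step_size ** 0.5)

        sampled_losses.append(loss.item())

    mean_loss = sum(sampled_losses) / len(sampled_losses)
    # lambda_hat = n * beta * (E_beta[L_n(w)] - L_n(w*))
    return n_data * float(beta) * (mean_loss - loss_star)
```

In practice one would discard burn-in draws, average over several chains, and track a full-batch loss estimate instead of reusing the minibatch loss as done here for brevity.
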
  4. Formal definition of the refined LLC

    Replace the distribution q by q' and the loss function ℓ by ℓ', and split the domain of w as W = U × V with w* = (u*, v*).
    With respect to the volume of the (restricted) loss, the weight- and data-refined LLC λ is defined as on the slide, with the posterior replaced accordingly (a sketch of the definition follows below).
    When W = V it is the data-refined LLC; when q' = q it is the weight-refined LLC.
    Caveat: the estimated LLC is only an estimate of the learning coefficient, but the paper argues that the SGLD-based LLC estimator is mature enough to be used.
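
The defining equation is not reproduced in the transcript; as a hedged reading of the slide (the block of weights that is allowed to vary is V, with u frozen at u*, and the population loss taken under q'), the volume-scaling definition becomes:

```latex
% Refined volume: only the chosen block v \in V varies (u is frozen at u^*), and the
% population loss L' is taken under the replaced distribution q' / loss \ell'.
\[
  V_{V}^{q'}(\varepsilon)
  \;=\; \int_{\{v \in V \,:\, L'(u^*, v) - L'(u^*, v^*) \le \varepsilon\}} dv
  \;\sim\; c\,\varepsilon^{\lambda}\,\bigl(-\log\varepsilon\bigr)^{m-1},
  \qquad
  \lambda_{V}^{q'}(w^*) \;=\; \lim_{\varepsilon \to 0} \frac{\log V_{V}^{q'}(\varepsilon)}{\log\varepsilon}.
\]
```

Taking V = W (no weights frozen) and only changing the data distribution gives the data-refined LLC; taking q' = q with a proper block V gives the weight-refined LLC, matching the two special cases on the slide.
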
  5. What the refined LLC means: differentiation and specialization

    Weight-refined LLCs reveal how attention heads differentiate across training; a wrLLC is defined for each layer (block of weights).
    Data-refined LLCs reveal how attention heads specialize across training; a drLLC is defined for each input dataset.
  6. Appendix

    • Appendix A: provides theoretical background on the local learning coefficient.
    • Appendix B: provides further details on the head classification. We describe the methodology we followed to manually classify each head, and offer more explanation and examples of each head's classification and specializations.
    • Appendix C: discusses the significance of critical points, increases, and decreases in the (r)LLC in relation to stagewise development in artificial and biological neural networks. We provide additional results for the full-weights data-refined LLC for a variety of common datasets.
    • Appendix D: compares the rLLC to the Hessian trace, Fisher Information Matrix (FIM) trace, and Hessian rank. We show that the LLC consistently outperforms these other techniques.
    • Appendix E: compares the rLLC against ablation-based metrics that are common in mechanistic interpretability (zero ablations, mean ablations, and resample ablations). We discuss the strengths and weaknesses of each of these methods in relation to the rLLC.
    • Appendix F: provides more experimental details on the architecture, training setup, LLC hyperparameters, using models as the generating process for data-refined LLCs, composition scores, and automated clustering analysis.
    • Appendix G: examines the consistency of our findings across different random initializations, supporting the generality of our conclusions. This analysis further supports the robustness of our observations across various model components.
    Additional figures & data can be found at the following repository: https://github.com/timaeus-research/paper-rllcs-2024
  7. Classifying tokens in context (Appendix B)

    • Induction patterns (Appendix B.2)
    • Dyck patterns (Appendix B.3)
    • Skip n-grams (Appendix B.4)
    • n-grams (Appendix B.5)
  8. Stagewise development (Appendix C)

    • Per-head rLLCs over the course of training are plotted and clustered for the following datasets:
      ◦ The Pile (mostly natural language)
      ◦ Arxiv
      ◦ Github/codeparrot
    • Increases in the refined LLC
      ◦ In the usual SLT picture (N → ∞), the model transitions to a saddle point with a larger LLC (learning coefficient).
    • Decreases in the refined LLC
      ◦ A reduction in complexity (reorganization of information?)
      ◦ Related to "critical periods" of memorization and learning
      ◦ Pruning in neuroscience
  9. Relation to work that uses the Hessian as a metric (Appendix D)

    • Hessian trace
      ◦ At a saddle point it can be computed quickly with Hutchinson trace estimation (also used in SAM and in quantization); see the sketch below.
      ◦ It carries less information than the LLC (naturally).
    • Fisher Information Matrix (FIM) trace
      ◦ Gives even coarser information than the Hessian.
    • Hessian rank
      ◦ Efficient methods exist for large matrices, using matrix polynomials; the zero threshold can be kept fixed during training or adapted.
      ◦ Less informative than the LLC, but it can express singularity.
      ◦ One can see the degree of degeneracy decreasing, but the stage boundaries are hard to see:
        "It remains unclear whether the problem lies in the theoretical underpinnings of the Hessian rank approach (e.g., its potential inability to capture higher-order degeneracy), in the practical implementation (due to flawed estimation methodology), or both."
    • Are there quantities visible only through the rLLC?
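
A minimal sketch of Hutchinson trace estimation for the Hessian of a loss, using Hessian-vector products via double backprop; the Rademacher probes and the number of probes are illustrative choices, not taken from the paper.

```python
# Hutchinson estimator: tr(H) = E[v^T H v] for random probes v with E[v v^T] = I.
import torch


def hutchinson_hessian_trace(loss, params, num_probes=50):
    params = [p for p in params if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)

    trace_estimate = 0.0
    for _ in range(num_probes):
        # Rademacher probe vectors with entries +/- 1
        vs = [torch.empty_like(p).bernoulli_(0.5) * 2 - 1 for p in params]
        # Hessian-vector product: gradient of (grad . v) with respect to the parameters
        hvps = torch.autograd.grad(grads, params, grad_outputs=vs, retain_graph=True)
        # Accumulate v^T H v
        trace_estimate += sum((v * hvp).sum().item() for v, hvp in zip(vs, hvps))
    return trace_estimate / num_probes


# Usage sketch: the loss must be built with a graph, e.g.
#   loss = loss_fn(model(x), y)
#   tr_H = hutchinson_hessian_trace(loss, model.parameters())
```

The same HVP machinery is what makes the trace cheap relative to forming the full Hessian.
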
  10. Distinguishing heads with ablation-based metrics (Appendix E)

    • Method (a sketch of the ablations follows below)
      ◦ Zero ablation: setting the targeted activations to zero.
      ◦ Mean ablation: replacing the targeted activations with their mean value across the dataset.
      ◦ Resampling ablation: replacing the targeted activations with those from a randomly selected different input.
      ◦ Path patching: first run two forward passes, one on uncorrupted inputs/activations and one on corrupted inputs/activations; then run a final forward pass, patching in the uncorrupted and corrupted activations so as to isolate the role of a specific computational path.
    • Result
      ◦ The previous-token heads 0:1 and 0:4 can be identified by an increase in the ablation scores during stage LM4.
      ◦ The current-token head 0:5 is also distinguishable by its increase across the ablation scores starting towards the end of LM2 until it reaches a peak during LM4, after which it decreases. This is especially pronounced in the resampling ablation scores.
      ◦ The induction head 1:7 is clearly distinguished by the increase of the ablation scores in LM4. The other induction head 1:6 is less clearly distinct, though its ablation scores do have a different shape from the multigram heads.
      ◦ The multigram head 0:0 has similar rLLC curves to the other layer 0 multigram heads (though it is relatively larger, see Figure 2). The ablation scores suggest that 0:0 is a distinct type of head throughout much of training, with substantially higher values and a qualitatively distinct shape. It is not until LM5 that this head's ablation scores settle to a value comparable to the other layer 0 multigram heads. This complements the analysis in Appendix B.6 that suggests 0:0 starts out as a "space" head that attends to whitespace tokens.
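
A minimal sketch of zero / mean / resample ablation of one attention head's output, implemented as a PyTorch forward hook. The (batch, seq, n_heads, d_head) layout of the hooked module's output and all names here are assumptions for illustration; path patching needs the extra clean/corrupted bookkeeping described above and is not shown.

```python
import torch


def make_ablation_hook(head_idx, mode, mean_activation=None, resample_source=None):
    def hook(module, inputs, output):
        out = output.clone()                          # assumed shape (batch, seq, n_heads, d_head)
        if mode == "zero":
            out[..., head_idx, :] = 0.0               # zero ablation
        elif mode == "mean":
            out[..., head_idx, :] = mean_activation   # (d_head,) mean over a reference dataset
        elif mode == "resample":
            # replace with the same head's activations from a randomly permuted batch
            perm = torch.randperm(out.shape[0])
            out[..., head_idx, :] = resample_source[perm][..., head_idx, :]
        return out
    return hook


# Usage sketch (attn_module, model, loss_fn, batch, targets, clean_loss are placeholders):
#   handle = attn_module.register_forward_hook(make_ablation_hook(head_idx=3, mode="zero"))
#   ablation_score = loss_fn(model(batch), targets) - clean_loss
#   handle.remove()
```
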
  11. wrLLC and ablation-based metrics: on the complementarity of the two approaches

    The wrLLC operates in weight space, while ablation methods work in activation space (complementary).
    • The ablation scores are better at identifying a discrepancy in 0:0. On the other hand, the wrLLC is better at identifying that this head ultimately matures into a multigram head (which we confirm separately).
    • Some heads (e.g., induction head 1:6 and the current-token head 0:5) are more distinguishable in the wrLLC than in ablations.
    • Ablation methods are generally computationally more efficient. However, they lack the theoretical grounding of the wrLLC (which is reflected in the existence of many different possible choices of ablation).
  12. Personal remarks

    • Narrow scope of the study
      ◦ Targets a two-layer, attention-only network (no MLP), the same setting as the induction-head work; all 2×8 heads can be enumerated.
      ◦ Uses the GPT2 tokenizer.
      ◦ The dataset is The Pile (diverse text data: natural language, Github, etc.).
      ◦ How effective is multi-head attention, and how many heads are needed? In practice, 8 seems to be enough for now.
    • Classification and visualization of heads
      ◦ Only token-level relations can be seen. Higher-order relations, i.e., groups of multiple tokens such as noun phrases or participles, are handled elsewhere (implemented in Sparse Autoencoders, Transformer circuits, Neuronpedia).
      ◦ Relations between groups of tokens (already used in "What does BERT look at?"), higher-order functions (function vectors).
    • Stagewise development
      ◦ Can the existence of plateaus be taken to mean that the model is moving between saddle points?
      ◦ If the LLC is viewed as a generalization metric, is a decrease a sign of overfitting? A comeback of early stopping?
    • Relation to the Hessian (Appendix D)
      ◦ The RLCT (LLC) describes more information than the Hessian, but the singularity structure of the Hessian at saddle points w* of loss(w, x), where ∇_w loss(w*, x) = 0, may be simpler than that of a general algebraic variety (representable by combinations of zero eigenvalues and degenerate eigenvectors?) (f(x) = 0, ∇·f(x)).
      ◦ The connection to other singular points lying along the degenerate directions, and its relation to global minima.
  13. Related work and resources

    • Learning Capacity: A Measure of the Effective Dimensionality of a Model (with commentary)
    • Compressibility Measures Complexity: MDL Meets SLT (Timaeus)
    • In-context Learning and Induction Heads, Transformer Circuits (Anthropic)
    • Studies on the interpretability of Dyck languages in Transformers
      ◦ Theoretical Limitations of Self-Attention in Neural Sequence Models
      ◦ Self-Attention Networks Can Process Bounded Hierarchical Languages (a stack machine can be represented with two layers)
      ◦ Theoretical Analysis of Hierarchical Language Recognition and Generation by Transformers without Positional Encoding: the Chomsky-Schützenberger representation theorem (context-free grammars can be expressed with regular grammars and Dyck languages), context-sensitive grammars
    • Ways to realize the recursiveness of language (attention, GNNs, function maps): showing that higher-order relations such as dependencies between phrases and clauses can be represented independently of tokens
    • An attempt to have an (attention-only) Transformer interpret S-expressions and inspect the attention maps: https://github.com/xiangze/Transformer_learns_sexp
  14. Aside: ChatGPT's curious code for computing the LLC, a "volume method"

    https://github.com/xiangze/RLCT_extimation/blob/master/llc_training_experiment.py#L256
    Procedure:
    1. Treat current model parameters as θ*.
    2. Run SGLD around θ* to sample θ_i.
    3. Compute ΔL_i = L(θ_i) - L(θ*).
    4. For thresholds ε_j, compute V(ε_j) ≈ P[ΔL_i <= ε_j].
    5. Fit log V(ε) vs log ε; the slope is ~ λ (the local learning coefficient).

    The fitting code:

        import numpy as np

        # Fit log V ≈ λ * log ε + c
        log_eps = np.log(epsilons)
        log_V = np.log(Vs)
        A = np.vstack([log_eps, np.ones_like(log_eps)]).T
        lam, c = np.linalg.lstsq(A, log_V, rcond=None)[0]  # least-squares slope and intercept

    It converges poorly, but is correct in principle (a fuller sketch follows below). Did it learn this from https://github.com/suswei/RLCT? (See the code walkthrough.)
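
A self-contained sketch of steps 3-5, assuming delta_L holds the sampled loss gaps ΔL_i from step 2; the synthetic samples and the threshold grid below are illustrative, not taken from the linked script.

```python
# Steps 3-5 of the "volume method": estimate V(eps) empirically from loss gaps,
# then fit log V against log eps to read off the slope lambda.
import numpy as np

rng = np.random.default_rng(0)
delta_L = rng.normal(size=5000) ** 2          # stand-in for Delta L_i from SGLD samples

epsilons = np.geomspace(delta_L.max() * 1e-3, delta_L.max() * 1e-1, 20)
Vs = np.array([(delta_L <= eps).mean() for eps in epsilons])   # V(eps_j) ~ P[Delta L <= eps_j]

# Fit log V ≈ lambda * log eps + c; the slope estimates the local learning coefficient
A = np.vstack([np.log(epsilons), np.ones_like(epsilons)]).T
lam, c = np.linalg.lstsq(A, np.log(Vs), rcond=None)[0]
print(f"estimated lambda = {lam:.3f}")        # roughly 0.5 for this 1-D quadratic toy example
```
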