Knowledge Neurons in Pretrained Transformers (for SNLP2022)

Kogoro
September 20, 2022

2022/09/26-27, the 14th SNLP study group (第14回最先端NLP勉強会)
https://sites.google.com/view/snlp-jp/home/2022

A presentation of the paper: Dai et al., Knowledge Neurons in Pretrained Transformers (ACL 2022).
https://aclanthology.org/2022.acl-long.581/

Transcript

  1. Knowledge Neurons in Pretrained Transformers (ACL 2022)

    Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, Furu Wei
    Presenter: Goro Kobayashi (小林 悟郎; Tohoku University, Inui Lab, D1)
    2022/09/26-27, the 14th SNLP study group (第14回最先端NLP勉強会)
    Session: Knowledge and Language Models
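  2. Overview: knowledge is stored in the feed-forward networks of pretrained models

    2022/09/27, the 14th SNLP study group (第14回最先端NLP勉強会)
    [Figure: BERT receives "The capital of Ireland is [MASK]" and outputs "Dublin"; shown next to an excerpt of "Figure 1: The Transformer - model architecture".]
    • Introduces the concept of "knowledge neurons," which are responsible for outputting knowledge inside pretrained models (e.g., BERT), and proposes a method to identify them
    • Idea: regard the weights of the feed-forward network as the place where knowledge is stored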
  3. Background & Prior Work

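  4. Background: knowledge can be extracted successfully from pretrained models

    • Factual knowledge contained in the training data can be extracted from pretrained BERT without any additional training [Petroni+'19; Jiang+'20]
    • The larger the model, the more knowledge it can store [Roberts+'20]
    How is knowledge stored inside the model?
    [Figure: BERT receives "The capital of Ireland is [MASK]" and outputs "Dublin".]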
  5. Prior work: viewing the feed-forward network as a memory device (key-value memory) [Geva+'21]

    This work contains an important idea, so I will explain it in some detail.
    [Screenshot of the paper's first page: "Transformer Feed-Forward Layers Are Key-Value Memories," Mor Geva, Roei Schuster, Jonathan Berant, Omer Levy (Tel-Aviv University; Allen Institute for Artificial Intelligence; Cornell Tech), EMNLP 2021.]
    Abstract excerpt: "Feed-forward layers constitute two-thirds of a transformer model's parameters, yet their role in the network remains under-explored. We show that feed-forward layers in transformer-based language models operate as key-value memories, where each key correlates with textual patterns in the training examples, and each value induces a distribution over the output vocabulary. …"
  6. Prior work: viewing the feed-forward network as a memory device (key-value memory) [Geva+'21]

    The feed-forward network (= 2-layer MLP) resembles the attention mechanism.
    [Figure 2 of Dai et al.: an attention head (attention weights obtained by inner products with key vectors, followed by a weighted sum of value vectors) shown next to an FFN module. Caption: "Illustration of how an FFN module in a Transformer block works as a key-value memory. The first linear layer FFN(key) computes intermediate neurons through inner product. Taking the activation of these neurons as weights, the second linear layer FFN(val) integrates value vectors through weighted sum. We hypothesize that knowledge neurons in the FFN module are responsible for expressing factual knowledge."]
    Paper excerpt (Section 2, Background: Transformer): "Transformer (Vaswani et al., 2017) is one of the most popular and effective NLP architectures. A Transformer encoder is stacked with L identical blocks. Each Transformer block mainly contains two modules: a self-attention module, and a feed-forward network (abbreviated as FFN) module. Let $X \in \mathbb{R}^{n \times d}$ denote the input matrix, two modules can be formulated as follows:"
      $Q_h = X W_h^Q, \quad K_h = X W_h^K, \quad V_h = X W_h^V$   (1)
      $\text{Self-Att}_h(X) = \mathrm{softmax}(Q_h K_h^{\top}) V_h$   (2)
      $\mathrm{FFN}(H) = \mathrm{gelu}(H W_1) W_2$   (3)
    "where $W_h^Q, W_h^K, W_h^V, W_1, W_2$ are parameter matrices …"
    Slide annotations: $W_1$ and $W_2$ are each labeled "(weight matrix)"; Eqs. (2) and (3) are marked as "very similar".
  7. Prior work: viewing the feed-forward network as a memory device (key-value memory) [Geva+'21]

    The feed-forward network (= 2-layer MLP) resembles the attention mechanism.
    [Same Figure 2 and Eqs. (1)-(3) as on the previous slide; Eqs. (2) and (3) are marked as "very similar", and $W_1$ and $W_2$ are each labeled "(weight matrix)".]
    Both modules follow the same two steps:
    1. Treat the input as a query and compute weights by taking the inner product with each key
    2. Sum up the values while multiplying each by its weight (weighted sum)
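The two steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not code from Dai et al. or Geva et al.: the function names, the toy vectors, and the use of per-neuron lists instead of matrices are all assumptions, and the attention head follows Eq. (2) as written on the slide (no scaling term).

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(dim)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def attention_head(query, keys, values):
    """Keys and values come from the input: one pair per input token."""
    weights = softmax([dot(query, k) for k in keys])   # step 1: inner products
    return weighted_sum(weights, values)               # step 2: weighted sum

def ffn_as_memory(h, ffn_keys, ffn_values):
    """Keys and values are learned weights: one pair per intermediate neuron.
    ffn_keys[i] plays the role of column i of W1, ffn_values[i] of row i of W2."""
    activations = [gelu(dot(h, k)) for k in ffn_keys]  # step 1: inner products
    return weighted_sum(activations, ffn_values), activations  # step 2

# Hypothetical toy weights: 2-dim hidden state, 3 intermediate neurons.
ffn_keys = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0]]
ffn_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
output, activations = ffn_as_memory([1.0, 0.0], ffn_keys, ffn_values)
```

The only structural difference between the two functions is where the keys and values come from, and that the FFN uses GELU activations as weights instead of a softmax.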
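  8. Prior work: viewing the feed-forward network as a memory device (key-value memory) [Geva+'21]

    [Figure: attention head (attention weights, key vectors, value vectors; inner product, then weighted sum) next to the FFN, whose two weight matrices play the roles of keys and values.]
    What corresponds to the keys and values differs:
    • Attention: as many key/value vectors as there are words in the input sentence
    • FFN: as many weight-parameter vectors as there are dimensions in the intermediate representation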
  9. Prior work: viewing the feed-forward network as a memory device (key-value memory) [Geva+'21]

    [Same figure as on the previous slide.]
    What corresponds to the keys and values differs:
    • Attention: gathers information from the surrounding word representations
    • FFN: gathers information from the model's weight parameters
  10. Prior work: viewing the feed-forward network as a memory device (key-value memory) [Geva+'21]

    [Same figure as on the previous slide.]
    The FFN retrieves information depending on the input:
    • The weights (activations) are determined by the input representation
    • The information stored in the model's weight matrices is retrieved
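  11. Prior work: viewing the feed-forward network as a memory device (key-value memory) [Geva+'21]

    Experiments with a Transformer-based left-to-right LM:
    • Each key responds to a specific input pattern
      • A specific n-gram (e.g., responds to the word "substitutes")
      • A topic (e.g., responds to inputs about TV shows)
    • Each value, especially in the later layers, is a vector that steers the prediction toward the correct word
    The feed-forward network contributes to the prediction by retrieving information stored in its weights, depending on the input.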
  12. Proposal: Knowledge Neurons

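  13. Introducing knowledge neurons

    • Assume that knowledge is stored in the feed-forward networks
    • Intermediate neurons that activate and contribute to the prediction when the model outputs relational knowledge in a fill-in-the-blank task are called knowledge neurons
    • (Bringing the "FF as key-value memory" view to MLMs is also part of the novelty)
    [Figure: BERT receives "The capital of Ireland is [MASK]" and outputs "Dublin"; shown next to an excerpt of "Figure 1: The Transformer - model architecture".]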
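  14. Identifying knowledge neurons (1/2)

    1. Feed the model a prompt in which the tail of a relational fact <head, relation, tail> is replaced with [MASK]
       − e.g., <Ireland, capital, Dublin> → "The capital of Ireland is [MASK]"
    2. For each intermediate neuron $w_i^{(l)}$, compute its contribution to the correct prediction
       − Uses Integrated Gradients [Sundararajan+'17] (detailed explanation omitted 🙇)
    3. Collect the neurons whose attribution score exceeds a threshold

    $\mathrm{Attr}(w_i^{(l)}) = \bar{w}_i^{(l)} \int_{\alpha=0}^{1} \frac{\partial P_x(\alpha \bar{w}_i^{(l)})}{\partial w_i^{(l)}} \, d\alpha$

    The gradient is integrated while the activation $w_i^{(l)}$ of the neuron of interest is varied from 0 to its original value $\bar{w}_i^{(l)}$.
    $P_x$: the probability of predicting the blank correctly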
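In practice the integral in the attribution score has to be approximated numerically. The following minimal sketch uses a Riemann sum over a toy, one-neuron stand-in for $P_x$; the sigmoid model, its parameters `a` and `b`, and the step count `steps` are hypothetical choices for illustration and are not taken from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def prob_correct(w, a=2.0, b=-1.0):
    """Toy stand-in for P_x(w): probability of the correct answer
    as a function of a single neuron's activation w (hypothetical model)."""
    return sigmoid(a * w + b)

def grad_prob_correct(w, a=2.0, b=-1.0):
    """Analytic derivative dP_x/dw of the toy model above."""
    p = sigmoid(a * w + b)
    return a * p * (1.0 - p)

def integrated_gradient(w_bar, grad_fn, steps=1000):
    """Riemann-sum approximation of
    Attr(w) = w_bar * integral_{alpha=0}^{1} dP_x(alpha * w_bar)/dw d(alpha)."""
    total = sum(grad_fn(k / steps * w_bar) for k in range(1, steps + 1))
    return w_bar * total / steps

w_bar = 1.5  # hypothetical original activation of the neuron
attr = integrated_gradient(w_bar, grad_prob_correct)
```

For a single neuron the attribution should approach `prob_correct(w_bar) - prob_correct(0.0)`, which is the completeness property of Integrated Gradients and a convenient sanity check.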
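  15. Identifying knowledge neurons (2/2)

    Neurons unrelated to the knowledge itself (e.g., ones responding to syntactic or surface information) may be mixed in, so the candidates are narrowed down further:
    I. Input the same relational fact using four or more different prompts
       − "The capital of Ireland is [MASK]"
       − "[MASK] is the capital of Ireland"
       − "Ireland, which has the capital city [MASK]"
       …
    II. Compute the attribution score of every neuron for each prompt
    III. Keep only the 2 to 5 neurons whose attribution scores are high across the prompts as the knowledge neurons of that fact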
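Steps I to III can be sketched as a per-prompt threshold followed by a vote across prompts. The defaults `t=0.2` and `p=0.7` follow the experimental-settings excerpt quoted later in the deck; everything else (function names, the `(layer, index)` neuron ids, the toy scores, and whether the comparisons are strict) is an assumption made for illustration. The paper's adjustment of p in steps of 0.05 is not implemented here.

```python
from collections import Counter

def coarse_select(scores, t=0.2):
    """Keep neurons whose attribution exceeds t times the maximum score of this prompt."""
    threshold = t * max(scores.values())
    return {neuron for neuron, s in scores.items() if s > threshold}

def refine(per_prompt_scores, t=0.2, p=0.7):
    """Keep neurons selected in at least a fraction p of the prompts."""
    counts = Counter()
    for scores in per_prompt_scores:
        counts.update(coarse_select(scores, t))
    min_prompts = p * len(per_prompt_scores)
    return {neuron for neuron, c in counts.items() if c >= min_prompts}

# Hypothetical attribution scores for one fact under four prompts;
# neurons are identified as (layer, index).
prompts = [
    {(9, 10): 0.9, (9, 11): 0.5, (2, 7): 0.30},
    {(9, 10): 0.8, (9, 11): 0.1, (2, 7): 0.05},
    {(9, 10): 0.7, (9, 11): 0.6, (2, 7): 0.00},
    {(9, 10): 1.0, (9, 11): 0.4, (2, 7): 0.10},
]
knowledge_neurons = refine(prompts)
```

In this toy case neuron `(2, 7)` passes the threshold for only one prompt and is filtered out, which mirrors the idea of discarding neurons that react to a particular surface form rather than to the fact.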
  16. Experiments

  17. Experimental settings

    • Model analyzed: BERT-base cased
      • 12 layers
      • The intermediate representation of the feed-forward network has 3,072 dimensions
        → 36,864 candidate neurons (12 × 3,072)
    • Data:
      • A relational-knowledge dataset created by experts (PARAREL)
      • 27,738 relational facts (34 relations) are used
    • Baseline: simply select the neurons with high activation values
      • This baseline corresponds to using attention weights in the attention mechanism
    [Cropped excerpt of the paper's text: the fill-in-the-blank prompts are based on the PARAREL dataset, whose prompt templates are curated for relations from the T-REx dataset; the hyperparameters t and p% of the baseline are chosen in the same way so that the average number of knowledge neurons per relation lies in [2, 5]; the activation-based baseline is motivated by the similarity between the FFN and the self-attention mechanism.]
  18. Properties of knowledge neurons

    Knowledge neurons identified by the proposed method are
    • distributed mostly in the later layers
      • 🤔 Could this be due to vanishing gradients caused by the depth of the Transformer [Takase+'22]?
    • more exclusive
      • They can be shared among facts with the same relation: <Ireland, capital, Dublin> and <France, capital, Paris>
      • They are not shared among facts with different relations: <Ireland, capital, Dublin> and <macOS, developer, Apple>

    Table 1 (paper excerpt; one of the three relations is cropped out): Example prompt templates of three relations in PARAREL. [X] and [Y] are the placeholders for the head and tail entities, respectively. Owing to the page width, we show only three templates for each relation. Prompt templates in PARAREL produce 253,448 knowledge-expressing prompts in total for 27,738 relational facts.
      P463 (member_of):         [X] is a member of [Y]    [X] belongs to the organization of [Y]    [X] is affiliated with [Y]
      P407 (language_of_work):  [X] was written in [Y]    The language of [X] is [Y]                [X] was a [Y]-language work

    Paper excerpt (Experimental Settings): "We conduct experiments for BERT-base-cased (Devlin et al., 2019), one of the most widely-used pretrained models. It contains 12 Transformer blocks, where the hidden size is 768 and the FFN inner hidden size is 3,072. Notice that our method is not limited to BERT and can be easily generalized to other pretrained models. For each prompt, we set the attribution threshold t to 0.2 times the maximum attribution score. For each relation, we initialize the refining threshold p% (Section 3.3) as 0.7. Then, we increase or decrease it by 0.05 at a time until the average number of knowledge neurons lies in [2, 5]. We run our experiments on NVIDIA Tesla V100 GPUs. …"

    [Figure 3: Percentage of knowledge neurons identified by our method in each Transformer layer.]

    Table 2: Statistics of knowledge neurons. ∩ denotes the intersection of knowledge neurons of fact pairs. "rel." is the shorthand of relation. Our method identifies more exclusive knowledge neurons.
      Type of Neurons               Ours    Baseline
      Knowledge neurons             4.13    3.96
      ∩ of intra-rel. fact pairs    1.23    2.85
      ∩ of inter-rel. fact pairs    0.09    1.92
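  19. Knowledge neurons do contribute to the output of knowledge

    Effect of overwriting the activations of the knowledge neurons:
    • Setting them to zero markedly degrades the prediction
    • Doubling them improves the prediction
    [Two bar charts: prediction change rate, aggregated per relation type.]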
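The overwrite experiment can be mimicked on a toy FFN by scaling selected activations before the weighted sum. This is a hedged sketch only: the two-neuron FFN, the two-word vocabulary, and the shortcut of reading the FFN output directly as logits are assumptions for illustration, not the BERT setup used in the paper.

```python
import math

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def ffn_with_override(h, keys, values, scale=None):
    """FFN forward pass in which activation i is multiplied by scale[i]
    (0.0 suppresses the neuron, 2.0 doubles it, default 1.0 leaves it as is)."""
    scale = scale or {}
    acts = [gelu(sum(a * b for a, b in zip(h, k))) for k in keys]
    acts = [a * scale.get(i, 1.0) for i, a in enumerate(acts)]
    dim = len(values[0])
    return [sum(a * v[j] for a, v in zip(acts, values)) for j in range(dim)]

def prob_of(index, logits):
    """Softmax probability of one entry of the logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return exps[index] / sum(exps)

# Hypothetical toy model: vocabulary = ["Dublin", "Tokyo"]; neuron 0 is the
# "knowledge neuron" whose value vector pushes toward "Dublin".
h = [1.0, 0.5]
keys = [[2.0, 0.0], [0.0, 2.0]]
values = [[3.0, 0.0], [0.0, 1.0]]

p_base = prob_of(0, ffn_with_override(h, keys, values))
p_suppressed = prob_of(0, ffn_with_override(h, keys, values, {0: 0.0}))
p_amplified = prob_of(0, ffn_with_override(h, keys, values, {0: 2.0}))
```

Here suppressing neuron 0 lowers the probability of "Dublin" and doubling it raises the probability, which is the qualitative pattern the slide reports.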
  20. Case study 1: updating knowledge

    Can a simple operation around the knowledge neurons update only the desired knowledge in the model?
    • Procedure
      • Find the knowledge neurons of the relational fact to be updated
      • Move the corresponding weight vectors (the counterparts of values) toward the embedding of the new target word
    [Diagram: using the embedding matrix, the embedding of "Dublin" is subtracted (−) and the embedding of "Tokyo" is added (+), turning <Ireland, capital, Dublin> into <Ireland, capital, Tokyo>.]
    • Results

    Table 6: Case studies of updating facts. ↑ means the higher the better, and ↓ means the lower the better. "rel." is the shorthand of relation. …
      Metric             Knowledge Neurons    Random Neurons
      Change rate ↑      48.5%                4.7%
      Success rate ↑     34.4%                0.0%
      Intra-rel. PPL ↓   8.4                  10.1
      Inter-rel. PPL ↓   7.2                  4.3

    [Cropped excerpt of the paper's setup: for a fact ⟨h, r, t⟩ learned by the pretrained model, a different entity t′ of the same type as t is chosen at random as the update target, and only about four top knowledge neurons are manipulated.]
    Slide annotations: the targeted fact is updated successfully, with almost no effect on other knowledge.
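The update step can be sketched as subtracting the old answer's embedding from a knowledge neuron's value vector and adding the new answer's embedding. The scaling factors `lam_old` and `lam_new`, the three-dimensional toy embeddings, and the function name are hypothetical; the slide only states that the value vector is moved toward the embedding of the new word.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def update_value(value, old_emb, new_emb, lam_old=1.0, lam_new=1.0):
    """Return a copy of the value vector shifted away from the old answer's
    embedding and toward the new answer's embedding."""
    return [v - lam_old * o + lam_new * n
            for v, o, n in zip(value, old_emb, new_emb)]

# Hypothetical toy embeddings and a value vector that currently points toward "Dublin".
dublin = [1.0, 0.0, 0.0]
tokyo = [0.0, 1.0, 0.0]
value = [0.9, 0.1, 0.0]

updated = update_value(value, dublin, tokyo)
```

After the update, the value vector has a larger inner product with the "Tokyo" embedding than with the "Dublin" embedding, so activating this neuron now pushes the prediction toward the new answer.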
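  21. Case study 2: erasing knowledge

    Can a simple operation around the knowledge neurons erase only the desired knowledge (e.g., personal information) from the model?
    • Procedure
      • Find the knowledge neurons of the relational knowledge to be erased
      • Replace the corresponding weight vectors (the counterparts of values) with zero vectors
    [Diagram: a value vector is replaced with a zero vector.]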
  22. Case study 2: erasing knowledge

    Can a simple operation around the knowledge neurons erase only the desired knowledge (e.g., personal information) from the model?
    • Procedure
      • Find the knowledge neurons of the relational knowledge to be erased
      • Replace the corresponding weight vectors (the counterparts of values) with zero vectors
    [Diagram: a value vector is replaced with a zero vector.]
    • Results

    Table 5: Case studies of erasing relations. The influence on knowledge expression is measured by the perplexity change. The knowledge erasing operation significantly affects the erased relation, and has just a moderate influence on the expression of other knowledge.
                                       Perplexity (Erased Relation)         Perplexity (Other Relations)
      Erased Relations                 Before Erasing    After Erasing      Before Erasing    After Erasing
      P19 (place_of_birth)             1450.0            2996.0 (+106.6%)   120.3             121.6 (+1.1%)
      P27 (country_of_citizenship)     28.0              38.3 (+36.7%)      143.6             149.5 (+4.2%)
      P106 (occupation)                2279.0            5202.0 (+128.2%)   120.1             125.3 (+4.3%)
      P937 (work_location)             58.0              140.0 (+141.2%)    138.0             151.9 (+10.1%)

    Slide annotations: the targeted knowledge changes in the direction of being erased, with almost no effect on other knowledge.
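The erasing operation itself is small enough to show as a sketch: the value vectors of the selected neurons are replaced with zero vectors, so those neurons contribute nothing to the FFN output however strongly they activate. The list-of-rows layout and the function name are assumptions made for illustration.

```python
def erase_neurons(ffn_values, neuron_ids):
    """Return a copy of the value vectors in which the rows of the given
    neurons are replaced with zero vectors; all other rows are kept."""
    dim = len(ffn_values[0])
    return [[0.0] * dim if i in neuron_ids else list(row)
            for i, row in enumerate(ffn_values)]

# Hypothetical value vectors for three intermediate neurons.
ffn_values = [[3.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
erased = erase_neurons(ffn_values, {0, 2})
```

Unlike overwriting an activation at inference time, this edits the weights themselves, so the change applies to every input processed afterwards.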
  23. Summary

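  24. Summary

    • Introduces knowledge neurons and proposes a method to identify them
      • Idea: regard the weights of the feed-forward network as the place where knowledge is stored
      • Identifies the knowledge neurons that contribute to fill-in-the-blank predictions of relational knowledge
    • Shows that simple operations around the knowledge neurons may make it possible to update or erase knowledge memorized by the model
    Impressions and comments
    • Is the identification method based on Integrated Gradients appropriate? Might it put the early layers at a disadvantage?
    • Looking forward to studies covering a wider range of models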
  25. References

    • [Petroni+'19] Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick S. H. Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander H. Miller. "Language Models as Knowledge Bases?" (EMNLP 2019)
    • [Jiang+'20] Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. "How Can We Know What Language Models Know?" (TACL 2020)
    • [Geva+'21] Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. "Transformer Feed-Forward Layers Are Key-Value Memories" (EMNLP 2021)
    • [Sundararajan+'17] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. "Axiomatic Attribution for Deep Networks" (ICML 2017)
    • [Takase+'22] Sho Takase, Shun Kiyono, Sosuke Kobayashi, and Jun Suzuki. "On Layer Normalizations and Residual Connections in Transformers" (arXiv 2022)