
Knowledge Neurons in Pretrained Transformers (for SNLP2022)

Kogoro
September 20, 2022


2022/09/26-27, 第14回最先端NLP勉強会
https://sites.google.com/view/snlp-jp/home/2022

An introduction to the paper: Dai et al., Knowledge Neurons in Pretrained Transformers (ACL 2022).
https://aclanthology.org/2022.acl-long.581/


Transcript

  1. Knowledge Neurons
    in Pretrained Transformers
    (ACL2022)
    Damai Dai, Li Dong, Yaru Hao,
    Zhifang Sui, Baobao Chang, Furu Wei
    Presenter: Goro Kobayashi (Tohoku University, Inui Lab, 1st-year PhD student)
    2022/09/26-27
    The 14th SNLP Workshop (最先端NLP勉強会), Session: Knowledge and Language Models


  2. Overview: Knowledge is stored in the feed-forward
    networks of pretrained models

    BERT
    💡
    The capital of Ireland is [MASK]
    Dublin
    (Illustration: the Transformer architecture figure, with the feed-forward sublayer highlighted)


    • Introduces the concept of "knowledge neurons", intermediate neurons inside a
      pretrained model (e.g., BERT) that are responsible for expressing factual
      knowledge, and proposes a method to identify them
    • Key idea: treat the feed-forward network weights as the storage of knowledge


  3. Background & Prior Work


  4. Background: Factual knowledge can be extracted
    from pretrained models
    • Pretrained BERT can recall factual knowledge from its training data without
      any additional training, via fill-in-the-blank prompts [Petroni+'19; Jiang+'20]
      (see the sketch below)
    • The larger the model, the more knowledge it can store [Roberts+'20]

    How is this knowledge stored inside the model?
    BERT
    The capital of Ireland is [MASK]
    Dublin
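
    As a concrete illustration of this probing setup (not from the slides), here is a
    minimal sketch using the HuggingFace fill-mask pipeline; the model name and prompt
    simply mirror the slide's example.

    from transformers import pipeline

    # Query BERT with a fill-in-the-blank prompt, LAMA-style probing.
    fill = pipeline("fill-mask", model="bert-base-cased")
    for pred in fill("The capital of Ireland is [MASK].", top_k=3):
        print(pred["token_str"], round(pred["score"], 3))
    # "Dublin" is expected to appear among the top predictions.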


  5. Prior work: Viewing the feed-forward network
    as a key-value memory [Geva+'21]
    This contains the key idea behind the paper, so it is explained in some detail

    (Screenshot of the paper: "Transformer Feed-Forward Layers Are Key-Value Memories",
    Mor Geva, Roei Schuster, Jonathan Berant, Omer Levy, EMNLP 2021. Abstract: feed-forward
    layers constitute two-thirds of a transformer's parameters and operate as key-value
    memories, where each key correlates with textual patterns in the training examples and
    each value induces a distribution over the output vocabulary.)


  6. Prior work: Viewing the feed-forward network
    as a key-value memory [Geva+'21]
    The feed-forward network (a 2-layer MLP) is structurally similar to the attention mechanism

    (Diagram: an attention head computes inner products between a query and key vectors,
    then takes a weighted sum of value vectors using the resulting attention weights)





    (Excerpt from the paper, Figure 2 caption: an FFN module in a Transformer block works as a
    key-value memory; the first linear layer FFN^(key) computes intermediate neurons through
    inner products, and, taking these activations as weights, the second linear layer FFN^(val)
    integrates value vectors through a weighted sum. The knowledge neurons in the FFN module
    are hypothesized to be responsible for expressing factual knowledge.)

    Background equations (Dai et al., Section 2), where W_h^Q, W_h^K, W_h^V, W_1, W_2 are
    weight matrices:

    Q_h = X W_h^Q,  K_h = X W_h^K,  V_h = X W_h^V      (1)
    Self-Att_h(X) = softmax(Q_h K_h^T) V_h             (2)
    FFN(H) = gelu(H W_1) W_2                           (3)

    → Eq. (2) and Eq. (3) are remarkably similar in form

  7. Prior work: Viewing the feed-forward network
    as a key-value memory [Geva+'21]
    The feed-forward network (a 2-layer MLP) is structurally similar to the attention mechanism

    1. Using the input as a query, compute a weight for each key via an inner product
    2. Take the weighted sum of the value vectors using these weights (see the sketch below)
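
    To make the analogy concrete, here is a minimal sketch (random weights, not taken from
    the paper) showing that the FFN in Eq. (3) can be read exactly as this key-value lookup:
    each column of W_1 is a key, each row of W_2 is a value, and the GELU activations play
    the role of (unnormalized) attention weights.

    import torch
    import torch.nn.functional as F

    d_model, d_ff = 768, 3072
    W1 = torch.randn(d_model, d_ff) / d_model ** 0.5   # keys: one column per neuron
    W2 = torch.randn(d_ff, d_model) / d_ff ** 0.5      # values: one row per neuron
    h = torch.randn(d_model)                           # input representation = the "query"

    weights = F.gelu(h @ W1)   # step 1: inner product with every key -> one weight per neuron
    out = weights @ W2         # step 2: weighted sum of the value vectors

    # The same output, written explicitly as a sum of value vectors scaled by their weights.
    out_explicit = torch.stack([weights[i] * W2[i] for i in range(d_ff)]).sum(dim=0)
    assert torch.allclose(out, out_explicit, atol=1e-4)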


  8. Prior work: Viewing the feed-forward network
    as a key-value memory [Geva+'21]
    (Diagram: attention head vs. FFN, with keys, values, weights, inner product, weighted sum)
    What corresponds to the keys and values differs:
    • Attention: one key/value vector per token of the input sentence
    • FFN: one weight parameter vector per dimension of the intermediate representation


  9. Prior work: Viewing the feed-forward network
    as a key-value memory [Geva+'21]
    (Diagram: attention head vs. FFN, with keys, values, weights, inner product, weighted sum)
    What corresponds to the keys and values differs:
    • Attention: gathers information from the surrounding token representations
    • FFN: gathers information from the model's weight parameters


  10. Prior work: Viewing the feed-forward network
    as a key-value memory [Geva+'21]
    (Diagram: attention head vs. FFN, with keys, values, weights, inner product, weighted sum)
    Both retrieve information depending on the input:
    • The FFN retrieves information stored in the model's weight matrices
    • The weights (activation values) are determined by the input representation


  11. Prior work: Viewing the feed-forward network
    as a key-value memory [Geva+'21]
    Experiments on a Transformer-based left-to-right LM show:
    • Each key responds to specific input patterns
      • specific n-grams (e.g., a key that fires on the word "substitutes")
      • topics (e.g., a key that fires on inputs about TV shows)
    • Especially in later layers, each value acts as a vector that steers the model
      toward predicting the correct word
    The feed-forward network contributes to prediction by retrieving, depending on the
    input, information stored in its weights


  12. Proposal: Knowledge Neurons


  13. Introducing knowledge neurons
    • Assume that factual knowledge is memorized in the feed-forward networks
    • An intermediate neuron that activates and contributes to the prediction when the
      model fills in a relational fact is called a knowledge neuron
    • (Carrying the FFN-as-key-value-memory view over to masked LMs is itself a novel point)

    BERT
    💡
    The capital of Ireland is [MASK]
    Dublin



  14. Identifying knowledge neurons (1/2)
    1. Replace the tail entity of a relational fact with [MASK] and feed the
       resulting prompt to the model
       − e.g., "The capital of Ireland is [MASK]"
    2. Compute each intermediate neuron w_i^(l)'s contribution to the correct prediction
       − using Integrated Gradients
    3. Collect the neurons whose attribution score exceeds a threshold

    Attr(w_i^(l)) = w̄_i^(l) ∫₀¹ ∂P_x(α w̄_i^(l)) / ∂w_i^(l) dα
    (details omitted here 🙇)
    Integrate the gradient of P_x while scaling the activation of the neuron of interest
    from 0 up to its original value w̄_i^(l) [Sundararajan+'17]
    P_x: the probability of correctly predicting the answer for the [MASK]
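
    Below is a minimal sketch (not the authors' code) of approximating this integral with a
    Riemann sum over m steps; prob_of_answer is a hypothetical helper that runs the model
    with the neuron of interest clamped to a given activation value and returns P_x as a
    differentiable scalar.

    import torch

    def attribution(prob_of_answer, w_bar: float, m: int = 20) -> float:
        """Attr(w) ≈ (w_bar / m) * sum_{k=1..m} ∂P_x(k/m * w_bar) / ∂w."""
        grad_sum = 0.0
        for k in range(1, m + 1):
            w = torch.tensor(k / m * w_bar, requires_grad=True)
            p = prob_of_answer(w)   # forward pass with the neuron's activation set to w
            p.backward()            # ∂P_x/∂w at this interpolation point
            grad_sum += w.grad.item()
        return w_bar * grad_sum / m

    # Toy usage with a stand-in for the real model:
    print(attribution(lambda w: torch.sigmoid(w), w_bar=2.0))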


  15. Identifying knowledge neurons (2/2)
    Neurons unrelated to the knowledge itself (e.g., ones that respond to syntactic or
    surface features) may be mixed in, so the set is refined further (see the sketch below):
    I. Express the same relational fact with four or more different prompts
       − "The capital of Ireland is [MASK]"
       − "[MASK] is the capital of Ireland"
       − "Ireland, which has the capital city [MASK]" …
    II. Compute each neuron's attribution score for each prompt
    III. Keep only the 2-5 neurons whose attribution scores are high across the prompts
       as the knowledge neurons of that fact
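
    A minimal sketch of this refinement step; the threshold ratio t = 0.2 × the per-prompt
    maximum and the sharing ratio p = 0.7 follow the settings reported in the paper, but the
    function itself is illustrative, not the authors' implementation.

    import torch

    def refine(attr: torch.Tensor, t_ratio: float = 0.2, p: float = 0.7) -> list[int]:
        """attr: (num_prompts, num_neurons) attribution scores for one relational fact."""
        # A neuron "passes" a prompt if its score exceeds t_ratio * that prompt's maximum.
        thresholds = t_ratio * attr.max(dim=1, keepdim=True).values
        passes = attr > thresholds
        # Keep neurons that pass for at least a fraction p of the prompts.
        shared = passes.float().mean(dim=0) >= p
        return shared.nonzero(as_tuple=True)[0].tolist()

    # Example: 4 prompts x 8 neurons of random scores.
    print(refine(torch.rand(4, 8)))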


  16. Experiments


  17. Experimental setup
    • Model analyzed: BERT-base cased
      • 12 layers
      • FFN intermediate representation of 3,072 dimensions
        → 12 × 3,072 = 36,864 candidate neurons
    • Data:
      • an expert-curated relational-knowledge dataset (PARAREL)
      • 27,738 relational facts covering 34 relations
    • Baseline method: simply pick the neurons with the highest activations
      (see the sketch after the excerpt below)
      • the analogue of reading attention weights in the attention mechanism

    (Excerpt from the paper's experimental settings and Table 2: prompts are built from
    PARAREL, which is curated from the T-REx dataset; the attribution threshold t is 0.2
    times the maximum attribution score per prompt, and the refining threshold p% is
    adjusted from 0.7 until each relation has on average 2-5 knowledge neurons per fact;
    experiments run on NVIDIA Tesla V100 GPUs.)

  18. Properties of knowledge neurons
    The knowledge neurons identified by the proposed method are
    • distributed mostly in the later layers
      • 🤔 possibly related to gradient vanishing with Transformer depth [Takase+'22]?
    • more exclusive to their fact

    (Excerpt from the paper: Table 1 shows example PARAREL prompt templates, e.g. for
    P463 (member_of) "[X] is a member of [Y]" and P407 (language_of_work) "[X] was written
    in [Y]"; the templates produce 253,448 knowledge-expressing prompts for 27,738 facts.
    Figure 3 shows the percentage of knowledge neurons identified in each Transformer layer.)

    Table 2: Statistics of knowledge neurons (∩ denotes the intersection of the knowledge
    neurons of fact pairs; "rel." is short for relation)
    Type of neurons            | Ours | Baseline
    Knowledge neurons          | 4.13 | 3.96
    ∩ of intra-rel. fact pairs | 1.23 | 2.85
    ∩ of inter-rel. fact pairs | 0.09 | 1.92
    → The proposed method identifies more exclusive knowledge neurons.
    • Knowledge neurons can be shared between facts that have the same relation
    • Knowledge neurons are rarely shared between facts with different relations


  19. Knowledge neurons do contribute to expressing knowledge
    Effect of overwriting the activation values of the knowledge neurons
    (see the hook sketch below):
    • setting them to zero substantially degrades the prediction
    • doubling them improves the prediction

    (Plots: prediction change rate when suppressing / amplifying knowledge neurons,
    aggregated per relation type)
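
    A minimal sketch of this suppress/amplify intervention on a HuggingFace BERT (the layer
    and neuron indices are placeholders; for simplicity the activations are scaled at every
    token position):

    import torch
    from transformers import BertForMaskedLM

    model = BertForMaskedLM.from_pretrained("bert-base-cased")
    layer, neuron_ids, scale = 9, [123, 456], 0.0   # 0.0 suppresses, 2.0 amplifies

    def scale_neurons(module, inputs, output):
        out = output.clone()
        out[..., neuron_ids] *= scale   # overwrite the selected FFN activations
        return out

    # Hook the post-GELU intermediate activations of the chosen layer.
    handle = model.bert.encoder.layer[layer].intermediate.register_forward_hook(scale_neurons)
    # ... run the masked prompt here and compare P([MASK] = correct answer) ...
    handle.remove()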


  20. Case study 1: Updating a fact
    Can a simple operation around the knowledge neurons update just the desired fact
    in the model?
    • Procedure (see the sketch below)
      • Find the knowledge neurons of the fact to be updated
      • Move the corresponding weight vectors (the values) toward the embedding of
        the new target word
    • Results

    Table 6: Case studies of updating facts (↑ higher is better, ↓ lower is better;
    "rel." is short for relation)
    Metric           | Knowledge neurons | Random neurons
    Change rate ↑    | 48.5%             | 4.7%
    Success rate ↑   | 34.4%             | 0.0%
    Intra-rel. PPL ↓ | 8.4               | 10.1
    Inter-rel. PPL ↓ | 7.2               | 4.3

    (Diagram: the value vector is shifted away from the "Dublin" embedding and toward the
    "Tokyo" embedding in the embedding matrix)
    Successfully updates the target fact, with almost no effect on other knowledge!
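
    A minimal sketch of this surgery on a HuggingFace BERT (the layer/neuron indices and
    the exact update rule are illustrative, not the paper's coefficients): shift the
    neuron's value vector, i.e. one column of the FFN output matrix, from the old answer's
    embedding toward the new answer's embedding.

    import torch
    from transformers import BertForMaskedLM, BertTokenizer

    model = BertForMaskedLM.from_pretrained("bert-base-cased")
    tok = BertTokenizer.from_pretrained("bert-base-cased")
    layer, neuron = 9, 123                                   # an identified knowledge neuron

    emb = model.bert.embeddings.word_embeddings.weight       # (vocab_size, hidden_size)
    old_id = tok.convert_tokens_to_ids("Dublin")
    new_id = tok.convert_tokens_to_ids("Tokyo")

    W_val = model.bert.encoder.layer[layer].output.dense.weight   # (hidden_size, ffn_size)
    with torch.no_grad():
        # Column `neuron` is the value vector this neuron writes into the residual stream.
        W_val[:, neuron] += emb[new_id] - emb[old_id]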


  21. Case study 2: Erasing knowledge
    Can a simple operation around the knowledge neurons erase just the desired knowledge
    from the model (e.g., personal information)?
    • Procedure (see the sketch below)
      • Find the knowledge neurons of the fact to be erased
      • Replace the corresponding weight vectors (the values) with zero vectors

    (Diagram: the value vector is replaced with a zero vector)
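
    A minimal sketch of the erasing operation on a HuggingFace BERT (layer/neuron indices
    are placeholders):

    import torch
    from transformers import BertForMaskedLM

    model = BertForMaskedLM.from_pretrained("bert-base-cased")
    layer, neuron = 9, 123   # an identified knowledge neuron of the fact to erase

    W_val = model.bert.encoder.layer[layer].output.dense.weight   # (hidden_size, ffn_size)
    with torch.no_grad():
        W_val[:, neuron] = 0.0   # the neuron can no longer write its value vector out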


  22. Case study 2: Erasing knowledge
    (Same procedure as the previous slide: replace the value vectors of the fact's
    knowledge neurons with zero vectors)
    • Results
    Table 5: Case studies of erasing relations (influence on knowledge expression measured
    by the perplexity change)
    Erased relation              | PPL (erased rel.) before → after | PPL (other rel.) before → after
    P19 (place_of_birth)         | 1450.0 → 2996.0 (+106.6%)        | 120.3 → 121.6 (+1.1%)
    P27 (country_of_citizenship) | 28.0 → 38.3 (+36.7%)             | 143.6 → 149.5 (+4.2%)
    P106 (occupation)            | 2279.0 → 5202.0 (+128.2%)        | 120.1 → 125.3 (+4.3%)
    P937 (work_location)         | 58.0 → 140.0 (+141.2%)           | 138.0 → 151.9 (+10.1%)
    The erasing operation significantly affects the erased relation while having only a
    moderate influence on the expression of other knowledge.
    Perplexity of the erased relation clearly rises (the target knowledge is being removed),
    with almost no effect on other knowledge


  23. Summary


  24. Summary
    • Introduced knowledge neurons and proposed a method to identify them
      • Key idea: treat the feed-forward network weights as the storage of knowledge
      • Identified knowledge neurons that contribute to fill-in-the-blank predictions
        of relational facts
    • Showed that simple operations around the knowledge neurons can potentially update
      or erase knowledge memorized by the model
    Impressions / comments
    • Is the Integrated-Gradients-based identification sound? Doesn't it put the earlier
      layers at a disadvantage?
    • Looking forward to investigations on a wider range of models


  25. References
    • [Petroni+ʼ19] Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick
    S. H. Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander H. Miller.
    “Language Models as Knowledge Bases?” (EMNLP 2019)
    • [Jiang+ʼ20] Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham
    Neubig. “How Can We Know What Language Models Know?” (TACL
    2020)
    • [Geva+ʼ21] Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy.
    “Transformer Feed-Forward Layers Are Key-Value Memories” (EMNLP
    2021)
    • [Sundararajan+ʼ17] Mukund Sundararajan, Ankur Taly, and Qiqi Yan.
    “Axiomatic Attribution for Deep Networks” (ICML 2017)
    • [Takase+ʼ22] Sho Takase, Shun Kiyono, Sosuke Kobayashi, and Jun
    Suzuki. “On Layer Normalizations and Residual Connections in
    Transformers” (arXiv 2022)
