
LUKE@NLPコロキウム

Ikuya Yamada
January 18, 2022

Slides presented at the NLPコロキウム (NLP Colloquium).


Transcript

  1. LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention
    Ikuya Yamada1,2, Akari Asai3, Hiroyuki Shindo4,2, Hideaki Takeda5, and Yuji Matsumoto2
    1Studio Ousia 2RIKEN AIP 3University of Washington 4Nara Institute of Science and Technology
    5National Institute of Informatics


  2. About Me
    Ikuya Yamada (@ikuyamada)
    Co-founder and Chief Scientist at Studio Ousia
    Software engineer, serial entrepreneur, and researcher
    Visiting researcher at RIKEN AIP (Knowledge Acquisition Team; Language Information Access Technology Team)
    ● Founded a student startup upon entering university and later sold it (2000–2006)
    ○ Led R&D on Internet infrastructure technology (NAT traversal for peer-to-peer communication)
    ○ The acquiring company went public
    ● Co-founded Studio Ousia to work on natural language processing (2007–)
    ○ Leads R&D on NLP with a focus on question answering
    ● Loves programming
    ○ Frequently used libraries: PyTorch, PyTorch Lightning, transformers, Wikipedia2Vec
    ● Has competed in various competitions and shared tasks
    ○ First-place finishes: #Microposts @ WWW 2015, W-NUT Task #1 @ ACL 2015, HCQA @ NAACL 2016,
    HCQA @ NIPS 2017, Semantic Web Challenge @ ISWC 2020


  3. Overview
    ● LUKE is a new contextualized representation of words and entities, built on an
    improved transformer architecture with a novel entity-aware self-attention mechanism
    ● The effectiveness of LUKE is demonstrated by achieving
    state-of-the-art results on five important entity-related tasks:
    SQuAD, ReCoRD, CoNLL-2003, TACRED, and Open Entity
    ● LUKE is officially supported by Hugging Face Transformers
    ● LUKE has been cited more than 100 times within a year

  4. Background
    Contextualized word representations (CWRs) don’t represent entities in text well
    ○ CWRs do not provide span-level representations of entities
    ○ It is difficult to capture the relationships between entities split into multiple tokens
    ○ The pretraining task of CWRs is not suitable for entities, e.g., predicting “Rings”
    given “The Lord of the [MASK]” is clearly easier than predicting the entire entity
    [Image: Mark Hamill (photo by Gage Skidmore), captioned “Bert....? Elmo…? The Force is
    not strong with them.”]

  5. LUKE: Language Understanding with Knowledge-based Embeddings
    LUKE is a pretrained contextualized representation based on a transformer
    ● New architecture that treats both words and entities as tokens
    ● New pretraining strategy: randomly masking and predicting both words and entities
    ● Entity-aware self-attention mechanism
    Input text w/ Wikipedia entity annotations: Beyoncé lives in Los Angeles


  6. The Architecture of LUKE
    ● LUKE treats words and entities as independent tokens
    ● Because entities are treated as tokens:
    ○ LUKE provides span-level entity representations
    ○ The relationships between entities can be directly captured inside the transformer
    Input text w/ Wikipedia entity annotations: Beyoncé lives in Los Angeles
    Computing Input Representations
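Since LUKE is available in Hugging Face Transformers, this word-and-entity token design can be seen directly in the library's API. A minimal sketch (the model name, example text, and character spans follow the Transformers documentation example):

```python
import torch
from transformers import LukeTokenizer, LukeModel

# Load the pretrained LUKE base model and its tokenizer
tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
model = LukeModel.from_pretrained("studio-ousia/luke-base")

text = "Beyoncé lives in Los Angeles."
# Character-level spans of the entity mentions "Beyoncé" and "Los Angeles"
entity_spans = [(0, 7), (17, 28)]

inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Contextualized representations of the word tokens
print(outputs.last_hidden_state.shape)         # (1, num_word_tokens, 768)
# Span-level representations of the two entity tokens
print(outputs.entity_last_hidden_state.shape)  # (1, 2, 768)
```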


  7. Input Representations: Three Types of Embeddings
    ● Token embedding: represents the corresponding token in the vocabulary
    ○ The entity token embedding is decomposed into two small matrices, B (projection matrix) and U
    ● Position embedding: represents the position of the token in the word sequence
    ○ An entity spanning multiple tokens is represented by the average of the corresponding
    position embedding vectors
    ● Entity type embedding: indicates that the token is an entity

  8. Input Representations: Word and Entity Input Representations
    ● Word input representation:
    token embedding + position embedding
    ● Entity input representation:
    token embedding + position embedding + entity type embedding
    (A minimal sketch of the entity input computation follows below.)
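The sketch below assembles an entity input representation from the three embeddings described on the last two slides. All tensor names and sizes are illustrative assumptions, not the actual LUKE implementation:

```python
import torch

H, H_e = 768, 256                # word hidden size; reduced entity embedding size (illustrative)
num_entities, max_pos = 500000, 512

U = torch.nn.Embedding(num_entities, H_e)  # entity token embeddings (small matrix U)
B = torch.nn.Linear(H_e, H, bias=False)    # projection matrix B; the decomposition keeps the table small
pos_emb = torch.nn.Embedding(max_pos, H)   # position embeddings over the word sequence
entity_type_emb = torch.nn.Parameter(torch.zeros(H))  # single "this token is an entity" vector

def entity_input_representation(entity_id: int, positions: list[int]) -> torch.Tensor:
    token = B(U(torch.tensor(entity_id)))  # projected entity token embedding
    # An entity spanning multiple words averages the position embeddings of those positions
    pos = pos_emb(torch.tensor(positions)).mean(dim=0)
    return token + pos + entity_type_emb   # sum of the three embeddings

# e.g., an entity such as "Los Angeles" occupying word positions 4 and 5
vec = entity_input_representation(entity_id=42, positions=[4, 5])
print(vec.shape)  # torch.Size([768])
```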


  9. Pretraining: Masking Words and Entities
    LUKE is trained to predict randomly masked words and entities in
    an entity-annotated corpus obtained from Wikipedia
    ● Wikipedia hyperlinks are treated as entity annotations
    ● 15% of words and entities are randomly replaced with [MASK] words
    and [MASK] entities
    Original: Born and raised in Houston, Texas, Beyoncé performed in various singing
    and dancing competitions as a child. She rose to fame in the late 1990s as the
    lead singer of Destiny's Child
    Masked: Born and [MASK] in Houston, Texas, [MASK] performed in various [MASK]
    and dancing competitions as a [MASK]. She rose to fame in the [MASK] 1990s as
    the lead singer of Destiny's Child


  10. Pretraining: Task
    LUKE is trained to predict randomly masked words and entities in
    an entity-annotated corpus obtained from Wikipedia:
    ● predict the original word behind each masked word, over all words in the word vocabulary
    ● predict the original entity behind each masked entity, over all entities in the entity vocabulary
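A toy sketch of the masking and prediction steps above: mask 15% of word and entity tokens, then predict each masked item over its own vocabulary with cross-entropy. All names, ids, and sizes here are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def mask_tokens(ids: torch.Tensor, mask_id: int, ratio: float = 0.15):
    """Replace a random 15% of token ids with [MASK]; return masked ids and labels."""
    mask = torch.rand(ids.shape) < ratio
    mask.view(-1)[torch.randint(0, mask.numel(), (1,))] = True  # toy safeguard: mask at least one token
    labels = torch.where(mask, ids, torch.full_like(ids, -100))  # -100 is ignored by cross_entropy
    masked_ids = torch.where(mask, torch.full_like(ids, mask_id), ids)
    return masked_ids, labels

# Toy batch: word ids and entity ids for one annotated sequence
word_ids = torch.randint(0, 50000, (1, 12))
entity_ids = torch.randint(0, 500000, (1, 2))
masked_words, word_labels = mask_tokens(word_ids, mask_id=50264)      # [MASK] word id (illustrative)
masked_entities, entity_labels = mask_tokens(entity_ids, mask_id=2)   # [MASK] entity id (illustrative)

# Pretend hidden states from the transformer; one output head per vocabulary
h_words, h_entities = torch.randn(1, 12, 768), torch.randn(1, 2, 768)
word_head = torch.nn.Linear(768, 50000)     # predicts over the whole word vocabulary
entity_head = torch.nn.Linear(768, 500000)  # predicts over the whole entity vocabulary

loss = F.cross_entropy(word_head(h_words).flatten(0, 1), word_labels.flatten(), ignore_index=-100) \
     + F.cross_entropy(entity_head(h_entities).flatten(0, 1), entity_labels.flatten(), ignore_index=-100)
```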


  11. Background: Transformer’s Self-attention Mechanism
    The transformer’s self-attention mechanism relates tokens to each other
    based on the attention weight between each pair of tokens.
    Given the input vector sequence $x_1, x_2, \ldots, x_k$, the output vector $y_i$
    corresponding to the i-th token is computed as the weighted sum of the projected
    input vectors of all tokens:
    $$y_i = \sum_{j=1}^{k} \alpha_{ij} V x_j$$
    The attention weight $\alpha_{ij}$ is computed based on the dot product of two vectors:
    $$\alpha_{ij} = \operatorname{softmax}_j\!\left(\frac{(K x_j)^\top Q x_i}{\sqrt{L}}\right)$$
    ○ $Q x_i$: the input vector corresponding to the attending token, projected by the
    query matrix $Q$
    ○ $K x_j$: the input vector corresponding to the token attended to, projected by the
    key matrix $K$
    ○ $V$ is the value matrix and $L$ is the hidden dimension
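The same computation as a few lines of PyTorch (a single attention head, with names mirroring the equations above):

```python
import math
import torch

k_tokens, L = 5, 768                  # sequence length and hidden size
x = torch.randn(k_tokens, L)          # input vectors x_1, ..., x_k
Q, K, V = (torch.randn(L, L) for _ in range(3))  # query / key / value matrices

q, k, v = x @ Q.T, x @ K.T, x @ V.T   # project every token; q[i] = Q x_i, etc.
e = q @ k.T / math.sqrt(L)            # e_ij = (K x_j)^T (Q x_i) / sqrt(L)
alpha = torch.softmax(e, dim=-1)      # attention weights alpha_ij
y = alpha @ v                         # y_i = sum_j alpha_ij V x_j
print(y.shape)                        # (5, 768)
```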


  12. Proposed Method: Entity-aware Self-attention Mechanism
    A simple extension of the self-attention mechanism that allows the model to use
    the information of the token types when computing attention weights
    ● We extend the self-attention mechanism by using a different query matrix for
    each possible pair of token types of $x_i$ and $x_j$: the query matrix $Q$ above is
    replaced by one of $Q$, $Q_{w2e}$, $Q_{e2w}$, or $Q_{e2e}$, depending on whether
    the attending token and the token attended to are words or entities
    [Figure: original self-attention mechanism vs. entity-aware self-attention mechanism]
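A sketch of the entity-aware variant: the only change from the block above is that the query projection is chosen per token-type pair (word-to-word, word-to-entity, entity-to-word, entity-to-entity). Tensor names and the loop-based form are illustrative:

```python
import math
import torch

L = 768
x = torch.randn(6, L)                         # 4 word tokens followed by 2 entity tokens
is_entity = torch.tensor([0, 0, 0, 0, 1, 1])  # token type of each position
K, V = torch.randn(L, L), torch.randn(L, L)   # key and value matrices are shared as before
# One query matrix per (type of x_i, type of x_j) pair: w2w, w2e, e2w, e2e
Qs = torch.randn(2, 2, L, L)

k = x @ K.T
e = torch.empty(6, 6)
for i in range(6):
    for j in range(6):
        Q_pair = Qs[is_entity[i], is_entity[j]]          # pick the query matrix for this type pair
        e[i, j] = (Q_pair @ x[i]) @ k[j] / math.sqrt(L)  # type-aware attention score
alpha = torch.softmax(e, dim=-1)
y = alpha @ (x @ V.T)                         # weighted sum still uses the shared value matrix
```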


  13. Experiments: Overview
    We advance the state of the art on five diverse tasks, using a similar architecture for
    all tasks: a linear classifier on top of the representations of words, entities, or both

    Dataset     | Task
    ------------|-------------------------
    Open Entity | Entity typing
    TACRED      | Relation classification
    CoNLL-2003  | Named entity recognition
    ReCoRD      | Cloze-style QA
    SQuAD       | Extractive QA


  14. How to Compute Entity Representations in Downstream Tasks
    Entity representations can be computed by
    ● using the [MASK] entity as the input token(s)
    ○ The model gathers the information regarding the entity from the input text
    ○ Used in all tasks except extractive QA (SQuAD)
    ● using the Wikipedia entity as the input token(s)
    ○ The entity representation is computed based on the information stored in the entity
    token embedding
    ○ The word representations are enriched by the entity representations inside the transformer
    ○ Used in the extractive QA (SQuAD) task
    (Both modes are illustrated in the sketch below.)
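Both input modes are exposed by the Transformers API. If I read the LukeTokenizer documentation correctly, passing only entity_spans fills each span with the [MASK] entity, while passing Wikipedia titles via entities selects the pretrained entity token embeddings. A sketch, reusing the documentation's example text:

```python
from transformers import LukeTokenizer, LukeModel

tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
model = LukeModel.from_pretrained("studio-ousia/luke-base")

text = "Beyoncé lives in Los Angeles."
entity_spans = [(0, 7), (17, 28)]

# Mode 1: [MASK] entities — with `entities` omitted, each span is filled with the
# [MASK] entity, and the model gathers entity information from the input text
inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
mask_reprs = model(**inputs).entity_last_hidden_state

# Mode 2: Wikipedia entities — entity titles select pretrained entity token
# embeddings, which also enrich the word representations inside the transformer
inputs = tokenizer(text, entities=["Beyoncé", "Los Angeles"],
                   entity_spans=entity_spans, return_tensors="pt")
wiki_reprs = model(**inputs).entity_last_hidden_state
```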


  15. Experiments: Entity Typing, Relation Classification, Cloze-style QA
    Datasets:
    ● Open Entity (entity typing)
    ● TACRED (relation classification)
    ● ReCoRD (cloze-style QA)
    Model:
    A linear classifier with the output entity representation(s) as the input feature
    Model inputs:
    ● Words in the target sentence
    ● [MASK] entity representing the target entity span(s)
    SOTA on three important entity-related tasks
    [Tables: results on Open Entity, TACRED, and ReCoRD]
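For example, the Open Entity setup can be run with the fine-tuned checkpoint on the Hugging Face Hub; this mirrors the entity-classification example in the Transformers documentation:

```python
from transformers import LukeTokenizer, LukeForEntityClassification

model_name = "studio-ousia/luke-large-finetuned-open-entity"
tokenizer = LukeTokenizer.from_pretrained(model_name)
model = LukeForEntityClassification.from_pretrained(model_name)

text = "Beyoncé lives in Los Angeles."
# The target entity span is represented internally with the [MASK] entity
inputs = tokenizer(text, entity_spans=[(0, 7)], return_tensors="pt")
logits = model(**inputs).logits

predicted = logits.argmax(-1).item()
print("Predicted type:", model.config.id2label[predicted])  # e.g., "person"
```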


  16. Experiments: Named Entity Recognition (CoNLL-2003)
    Model:
    1. Enumerate all possible spans in the input text as entity name candidates
    2. Classify each span into an entity type or the non-entity type, using a linear
    classifier over the entity representation and the word representations of the
    first and last words in the span
    3. Greedily select spans based on the logits
    Model inputs:
    ● Words in the input text
    ● [MASK] entities corresponding to all possible entity name candidates
    SOTA on the CoNLL-2003 named entity recognition dataset
    [Table: results on CoNLL-2003]
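A sketch of this span-enumeration approach using the fine-tuned CoNLL-2003 checkpoint, following the Transformers documentation example; step 3's greedy decoding is simplified here to an independent argmax per span:

```python
from transformers import LukeTokenizer, LukeForEntitySpanClassification

model_name = "studio-ousia/luke-large-finetuned-conll-2003"
tokenizer = LukeTokenizer.from_pretrained(model_name)
model = LukeForEntitySpanClassification.from_pretrained(model_name)

text = "Beyoncé lives in Los Angeles"
# 1. Enumerate all word-aligned spans as entity name candidates
word_starts = [0, 8, 14, 17, 21]  # character offsets where each word starts
word_ends = [7, 13, 16, 20, 28]   # character offsets where each word ends
entity_spans = [(s, e) for i, s in enumerate(word_starts) for e in word_ends[i:]]

# 2. Classify every candidate span ([MASK] entities are attached automatically)
inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
logits = model(**inputs).logits

# 3. Keep spans whose best class is not the non-entity type (label 0)
for span, idx in zip(entity_spans, logits.argmax(-1).squeeze(0).tolist()):
    if idx != 0:
        print(text[span[0]:span[1]], "->", model.config.id2label[idx])
```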


  17. Analysis: Named Entity Recognition (CoNLL-2003)
    http://explainaboard.nlpedia.ai/leaderboard/task-ner/


  18. Experiments: Extractive Question Answering (SQuAD v1.1)
    Model:
    Two linear classifiers on top of the output word representations to predict
    the start and end positions of the answer
    Model inputs:
    ● Words in the question and the passage
    ● Wikipedia entities in the passage
    ○ Automatically generated using a heuristic entity linking method
    SOTA on the SQuAD v1.1 extractive question answering dataset;
    LUKE reached #1 on the leaderboard
    [Table: results on SQuAD v1.1]
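A minimal sketch of the answer-extraction head described above: two linear classifiers over the output word representations, one for the start and one for the end position. Tensor names and sizes are illustrative assumptions:

```python
import torch

seq_len, H = 384, 1024                    # question+passage length, hidden size (illustrative)
word_states = torch.randn(1, seq_len, H)  # output word representations from the model

start_head = torch.nn.Linear(H, 1)        # scores each token as the answer start
end_head = torch.nn.Linear(H, 1)          # scores each token as the answer end

start_logits = start_head(word_states).squeeze(-1)  # (1, seq_len)
end_logits = end_head(word_states).squeeze(-1)

start = start_logits.argmax(-1).item()
# Constrain the end position to come at or after the start position
end = start + end_logits[0, start:].argmax(-1).item()
print(f"Predicted answer span: tokens {start}..{end}")
```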


  19. Ablation Study (1): Entity Representations
    When addressing the tasks without inputting entities, performance degrades
    significantly on CoNLL-2003 and SQuAD v1.1
    [Table: ablation results; CoNLL-2003 uses [MASK] entities as inputs, while
    SQuAD v1.1 uses Wikipedia entities as inputs]


  20. Ablation Study (2): Entity-aware Self-attention
    Our entity-aware self-attention mechanism consistently outperforms the
    original mechanism across all tasks


  21. Adding LUKE to Hugging Face Transformers
    ● LUKE is officially supported by Hugging Face Transformers
    ● The state-of-the-art results reported in the paper can now be easily
    reproduced using Transformers on Colab notebooks!
    ○ NER on CoNLL-2003
    ○ Relation extraction on TACRED
    ○ Entity typing on Open Entity
    https://github.com/studio-ousia/luke/issues/38


  22. Summary
    ● LUKE is a new contextualized representation of words and entities, built on
    an improved transformer architecture with a novel entity-aware self-attention
    mechanism
    ● The effectiveness of LUKE is demonstrated by achieving state-of-the-art
    results on five important entity-related tasks
    [email protected] / @ikuyamada
    Paper: https://arxiv.org/abs/2010.01057
    Code: https://github.com/studio-ousia/luke
