Slide 1

Slide 1 text

Ikuya Yamada¹,², Akari Asai³, Hiroyuki Shindo⁴,², Hideaki Takeda⁵, and Yuji Matsumoto²: Deep Contextualized Entity Representations with Entity-aware Self-attention. ¹Studio Ousia, ²RIKEN AIP, ³University of Washington, ⁴Nara Institute of Science and Technology, ⁵National Institute of Informatics

Slide 2

Slide 2 text

Self-introduction: Ikuya Yamada (@ikuyamada). Co-founder and Chief Scientist at Studio Ousia. Software engineer, serial entrepreneur, and researcher. Visiting researcher at RIKEN AIP (Knowledge Acquisition Team; Language Information Access Technology Team). ● Founded a student startup upon entering university and later sold it (2000-2006) ○ Led R&D on core Internet technology (NAT traversal for peer-to-peer communication) ○ The acquiring company went public ● Co-founded Studio Ousia and has worked on natural language processing since 2007 ○ Leads R&D on NLP centered on question answering ● Loves programming ○ Libraries used often lately: PyTorch, PyTorch-lightning, transformers, Wikipedia2Vec ● Has entered various competitions and shared tasks ○ Tasks won: #Microposts @ WWW2015, W-NUT Task #1 @ ACL 2015, HCQA @ NAACL 2016, HCQA @ NIPS 2017, Semantic Web Challenge @ ISWC 2020

Slide 3

Slide 3 text

Overview ● LUKE is a new contextualized representation of words and entities, based on an improved transformer architecture with a novel entity-aware self-attention mechanism

Slide 4

Slide 4 text

Overview ● LUKE is a new contextualized representation of words and entities, based on an improved transformer architecture with a novel entity-aware self-attention mechanism ● The effectiveness of LUKE is demonstrated by achieving state-of-the-art results on five important entity-related tasks: SQuAD, ReCoRD, CoNLL-2003, TACRED, and Open Entity

Slide 5

Slide 5 text

Overview ● LUKE is a new contextualized representation of words and entities, based on an improved transformer architecture with a novel entity-aware self-attention mechanism ● The effectiveness of LUKE is demonstrated by achieving state-of-the-art results on five important entity-related tasks: SQuAD, ReCoRD, CoNLL-2003, TACRED, and Open Entity ● LUKE is officially supported by Hugging Face Transformers

Slide 6

Slide 6 text

Overview ● LUKE is a new contextualized representation of words and entities, based on an improved transformer architecture with a novel entity-aware self-attention mechanism ● The effectiveness of LUKE is demonstrated by achieving state-of-the-art results on five important entity-related tasks: SQuAD, ReCoRD, CoNLL-2003, TACRED, and Open Entity ● LUKE is officially supported by Hugging Face Transformers ● LUKE has been cited more than 100 times within a year

Slide 7

Slide 7 text

Background ● Contextualized word representations (CWRs) don't represent entities in text well ○ CWRs do not provide span-level representations of entities ○ It is difficult to capture the relationships between entities split into multiple tokens ○ The pretraining task of CWRs is not suitable for entities. (Bert...? Elmo...? The Force is not strong with them. Image: Mark Hamill by Gage Skidmore)

Slide 8

Slide 8 text

Background ● Contextualized word representations (CWRs) don't represent entities in text well ○ CWRs do not provide span-level representations of entities ○ It is difficult to capture the relationships between entities split into multiple tokens ○ The pretraining task of CWRs is not suitable for entities: predicting "Rings" given "The Lord of the [MASK]" is clearly easier than predicting the entire entity

Slide 9

Slide 9 text

LUKE: Language Understanding with Knowledge-based Embeddings ● LUKE is a pretrained contextualized representation based on the transformer ● New architecture that treats both words and entities as tokens ● New pretraining strategy: randomly masking and predicting words and entities ● Entity-aware self-attention mechanism. Input text w/ Wikipedia entity annotations: Beyoncé lives in Los Angeles

Slide 10

Slide 10 text

The Architecture of LUKE ● LUKE treats words and entities as independent tokens ● Because entities are treated as tokens: ○ LUKE provides span-level entity representations ○ The relationships between entities can be directly captured in the transformer. Figure: Computing Input Representations. Input text w/ Wikipedia entity annotations: Beyoncé lives in Los Angeles

Slide 11

Slide 11 text

Input Representations: Three Types of Embeddings ● Token embedding: represents the corresponding token in the vocabulary ○ The entity token embedding is represented by the product of two small matrices: B (projection matrix) and U ● Position embedding: represents the position of the token in the word sequence ○ An entity spanning multiple tokens is represented by the average of the corresponding position embedding vectors ● Entity type embedding: indicates that the token is an entity
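A minimal PyTorch sketch of how an entity input representation could be assembled from these three embeddings; the module, argument names, and default sizes are illustrative assumptions, not the exact implementation.

import torch
import torch.nn as nn

class EntityInputEmbedding(nn.Module):
    # Entity input = projected token embedding + averaged position embedding + entity type embedding.
    def __init__(self, entity_vocab_size=500000, entity_emb_size=256,
                 hidden_size=768, max_position=512):
        super().__init__()
        # U: small entity token embeddings; B: projection up to the transformer's hidden size.
        self.entity_embeddings = nn.Embedding(entity_vocab_size, entity_emb_size)        # U
        self.embedding_projection = nn.Linear(entity_emb_size, hidden_size, bias=False)  # B
        self.position_embeddings = nn.Embedding(max_position, hidden_size)
        self.entity_type_embedding = nn.Parameter(torch.zeros(hidden_size))

    def forward(self, entity_ids, entity_position_ids):
        # entity_ids: (batch, n_entities); entity_position_ids: (batch, n_entities, n_words_in_span)
        token_emb = self.embedding_projection(self.entity_embeddings(entity_ids))
        # An entity covering several words takes the average of those words' position embeddings.
        position_emb = self.position_embeddings(entity_position_ids).mean(dim=2)
        return token_emb + position_emb + self.entity_type_embedding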

Slide 12

Slide 12 text

Input Representations: Three Types of Embeddings ● Token embedding: represents the corresponding token in the vocabulary ○ The entity token embedding is represented by the product of two small matrices: B (projection matrix) and U ● Position embedding: represents the position of the token in the word sequence ○ An entity spanning multiple tokens is represented by the average of the corresponding position embedding vectors ● Entity type embedding: indicates that the token is an entity

Slide 13

Slide 13 text

Input Representations: Three Types of Embeddings ● Token embedding: represents the corresponding token in the vocabulary ○ The entity token embedding is represented by the product of two small matrices: B (projection matrix) and U ● Position embedding: represents the position of the token in the word sequence ○ An entity spanning multiple tokens is represented by the average of the corresponding position embedding vectors ● Entity type embedding: indicates that the token is an entity

Slide 14

Slide 14 text

Input Representations: Three Types of Embeddings ● Token embedding: represents the corresponding token in the vocabulary ○ The entity token embedding is represented by the product of two small matrices: B (projection matrix) and U ● Position embedding: represents the position of the token in the word sequence ○ An entity spanning multiple tokens is represented by the average of the corresponding position embedding vectors ● Entity type embedding: indicates that the token is an entity

Slide 15

Slide 15 text

Input Representations: Three Types of Embeddings ● Token embedding: represents the corresponding token in the vocabulary ○ The entity token embedding is represented by the product of two small matrices: B (projection matrix) and U ● Position embedding: represents the position of the token in the word sequence ○ An entity spanning multiple tokens is represented by the average of the corresponding position embedding vectors ● Entity type embedding: indicates that the token is an entity

Slide 16

Slide 16 text

Input Representations: Word Input Representation ● Word input representation: token embedding + position embedding ● Entity input representation: token embedding + position embedding + entity type embedding

Slide 17

Slide 17 text

Input Representations: Entity Input Representation ● Word input representation: token embedding + position embedding ● Entity input representation: token embedding + position embedding + entity type embedding

Slide 18

Slide 18 text

Pretraining: Masking Words and Entities ● LUKE is trained to predict randomly masked words and entities in an entity-annotated corpus obtained from Wikipedia ● Wikipedia hyperlinks are treated as entity annotations ● 15% of randomly chosen words and entities are replaced with [MASK] words and [MASK] entities. Example: "Born and raised in Houston, Texas, Beyoncé performed in various singing and dancing competitions as a child. She rose to fame in the late 1990s as the lead singer of Destiny's Child" becomes "Born and [MASK] in Houston, Texas, [MASK] performed in various [MASK] and dancing competitions as a [MASK]. She rose to fame in the [MASK] 1990s as the lead singer of Destiny's Child"
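A rough sketch of this masking step, assuming word and entity id tensors and dedicated [MASK] ids; the 15% rate follows the slide, while the function and variable names are illustrative.

import torch

def mask_tokens(word_ids, entity_ids, word_mask_id, entity_mask_id, mask_prob=0.15):
    # Replace ~15% of word and entity tokens with their respective [MASK] ids.
    masked_word_ids = word_ids.clone()
    masked_entity_ids = entity_ids.clone()
    word_mask = torch.rand(word_ids.shape) < mask_prob
    entity_mask = torch.rand(entity_ids.shape) < mask_prob
    masked_word_ids[word_mask] = word_mask_id
    masked_entity_ids[entity_mask] = entity_mask_id
    # The original ids at masked positions become the prediction targets (-100 = ignored).
    word_labels = torch.where(word_mask, word_ids, torch.full_like(word_ids, -100))
    entity_labels = torch.where(entity_mask, entity_ids, torch.full_like(entity_ids, -100))
    return masked_word_ids, masked_entity_ids, word_labels, entity_labels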

Slide 19

Slide 19 text

Pretraining: Task ● LUKE is trained to predict randomly masked words and entities in an entity-annotated corpus obtained from Wikipedia ● Specifically, LUKE is trained to ○ predict the original word of each masked word over the entire word vocabulary ○ predict the original entity of each masked entity over the entire entity vocabulary

Slide 20

Slide 20 text

Background: Transformer's Self-attention Mechanism ● The transformer's self-attention mechanism relates tokens to each other based on the attention weight between each pair of tokens ● Given the input vector sequence x_1, x_2, ..., x_k, the output vector y_i corresponding to the i-th token is computed as the weighted sum of the projected input vectors of all tokens ● The attention weight α_ij is computed based on the dot product of two vectors: ○ Qx_i: the input vector of the attending token, projected by the query matrix Q ○ Kx_j: the input vector of the token being attended to, projected by the key matrix K

Slide 21

Slide 21 text

Background: Transformer's Self-attention Mechanism ● The transformer's self-attention mechanism relates tokens to each other based on the attention weight between each pair of tokens ● Given the input vector sequence x_1, x_2, ..., x_k, the output vector y_i corresponding to the i-th token is computed as the weighted sum of the projected input vectors of all tokens ● The attention weight α_ij is computed based on the dot product of two vectors: ○ Qx_i: the input vector of the attending token, projected by the query matrix Q ○ Kx_j: the input vector of the token being attended to, projected by the key matrix K
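For reference, a compact single-head sketch of the standard self-attention described above; the value projection V and the 1/sqrt(d) scaling are part of the usual transformer formulation and are assumed here.

import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    # y_i = sum_j softmax_j((Qx_i . Kx_j) / sqrt(d)) Vx_j
    def __init__(self, hidden_size):
        super().__init__()
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)

    def forward(self, x):  # x: (batch, seq_len, hidden)
        q, k, v = self.query(x), self.key(x), self.value(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))  # attention weights alpha_ij
        return torch.softmax(scores, dim=-1) @ v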

Slide 22

Slide 22 text

Proposed Method: Entity-aware Self-attention Mechanism ● A simple extension of the self-attention mechanism that lets the model use the types of the target tokens when computing attention weights ● We extend the self-attention mechanism by using a different query matrix for each possible pair of token types of x_i and x_j. Figure: original self-attention mechanism vs. entity-aware self-attention mechanism
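A sketch of the entity-aware variant: four query matrices, one per (attending token type, attended token type) pair, selected by whether each token is a word or an entity, with keys and values shared. The class and attribute names are illustrative assumptions.

import math
import torch
import torch.nn as nn

class EntityAwareSelfAttention(nn.Module):
    # Query matrix chosen per (type of x_i, type of x_j); key and value projections are shared.
    def __init__(self, hidden_size):
        super().__init__()
        # Separate query projections: word->word, word->entity, entity->word, entity->entity.
        self.q_w2w = nn.Linear(hidden_size, hidden_size)
        self.q_w2e = nn.Linear(hidden_size, hidden_size)
        self.q_e2w = nn.Linear(hidden_size, hidden_size)
        self.q_e2e = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)

    def forward(self, word_x, entity_x):
        x = torch.cat([word_x, entity_x], dim=1)  # (batch, n_words + n_entities, hidden)
        k, v = self.key(x), self.value(x)
        n_w = word_x.size(1)
        # Attention scores for word tokens attending to words and entities, and likewise for entities.
        s_w = torch.cat([self.q_w2w(word_x) @ k[:, :n_w].transpose(-2, -1),
                         self.q_w2e(word_x) @ k[:, n_w:].transpose(-2, -1)], dim=-1)
        s_e = torch.cat([self.q_e2w(entity_x) @ k[:, :n_w].transpose(-2, -1),
                         self.q_e2e(entity_x) @ k[:, n_w:].transpose(-2, -1)], dim=-1)
        scores = torch.cat([s_w, s_e], dim=1) / math.sqrt(word_x.size(-1))
        return torch.softmax(scores, dim=-1) @ v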

Slide 23

Slide 23 text

Experiments: Overview ● We advance the state of the art on five diverse tasks, using a similar architecture for all tasks: a linear classifier on top of the representations of words, entities, or both ● Dataset / Task: Open Entity (entity typing), TACRED (relation classification), CoNLL-2003 (named entity recognition), ReCoRD (cloze-style QA), SQuAD (extractive QA)

Slide 24

Slide 24 text

How to Compute Entity Representations in Downstream Tasks ● Entity representations can be computed by: ● Using the [MASK] entity as input token(s) ○ The model gathers information about the entities from the input text ○ Used in all tasks except extractive QA (SQuAD) ● Using Wikipedia entities as input token(s) ○ The entity representations are computed based on the information stored in the entity token embeddings ○ The word representations are enriched by the entity representations inside the transformer ○ Used in the extractive QA (SQuAD) task

Slide 25

Slide 25 text

How to Compute Entity Representations in Downstream Tasks ● Entity representations can be computed by: ● Using the [MASK] entity as input token(s) ○ The model gathers information about the entities from the input text ○ Used in all tasks except extractive QA (SQuAD) ● Using Wikipedia entities as input token(s) ○ The entity representations are computed based on the information stored in the entity token embeddings ○ The word representations are enriched by the entity representations inside the transformer ○ Used in the extractive QA (SQuAD) task
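As a usage illustration, the Hugging Face LUKE interface (mentioned later in the talk) exposes both input modes: passing only entity_spans fills the entity sequence with [MASK] entities, while additionally passing Wikipedia titles via entities uses the pretrained entity embeddings. This is a sketch based on the library's documented API; the checkpoint name is the publicly released base model.

from transformers import LukeTokenizer, LukeModel

tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
model = LukeModel.from_pretrained("studio-ousia/luke-base")
text = "Beyoncé lives in Los Angeles."

# Mode 1: [MASK] entities -- the model gathers entity information from the input text.
inputs = tokenizer(text, entity_spans=[(0, 7), (17, 28)], return_tensors="pt")
outputs = model(**inputs)
mask_entity_reps = outputs.entity_last_hidden_state  # (1, 2, hidden_size)

# Mode 2: Wikipedia entities -- representations draw on the pretrained entity token embeddings.
inputs = tokenizer(text, entities=["Beyoncé", "Los Angeles"],
                   entity_spans=[(0, 7), (17, 28)], return_tensors="pt")
outputs = model(**inputs)
enriched_word_reps = outputs.last_hidden_state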

Slide 26

Slide 26 text

Experiments: Entity Typing, Relation Classification, Cloze-style QA ● Datasets: Open Entity (entity typing), TACRED (relation classification), ReCoRD (cloze-style QA) ● Model: a linear classifier with the output entity representation(s) as input features ● Model inputs: ○ Words in the target sentence ○ [MASK] entities representing the target entity span(s) ● SOTA on these three important entity-related tasks. Results on Open Entity; results on TACRED; results on ReCoRD
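A minimal sketch of this shared classification head: a linear layer over the output [MASK]-entity representation, or over the concatenation of two such representations for relation classification. Module names and sizes are illustrative assumptions.

import torch
import torch.nn as nn

class EntityClassificationHead(nn.Module):
    # Linear classifier over one or more output entity representations.
    def __init__(self, hidden_size, num_labels, num_entities=1, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size * num_entities, num_labels)

    def forward(self, entity_hidden_states):  # (batch, num_entities, hidden)
        # Entity typing: one [MASK] entity; relation classification: two entity representations concatenated.
        features = entity_hidden_states.flatten(start_dim=1)
        return self.classifier(self.dropout(features))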

Slide 27

Slide 27 text

Experiments: Named Entity Recognition (CoNLL-2003) ● Model: 1. Enumerate all possible spans in the input text as entity name candidates 2. Classify each span into an entity type or the non-entity type using a linear classifier over the entity representation and the word representations of the first and last words in the span 3. Greedily select spans based on the logits ● Model inputs: ○ Words in the input text ○ [MASK] entities corresponding to all possible entity name candidates ● SOTA on the CoNLL-2003 named entity recognition dataset. Results on CoNLL-2003
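A sketch of the greedy span selection in step 3, assuming each candidate span already has a logit vector from the linear classifier over its [MASK]-entity representation and its first/last word representations. Function and argument names are illustrative.

import torch

def greedy_decode(span_logits, spans, non_entity_index=0):
    # span_logits: (num_spans, num_labels) logits per candidate span
    # spans: list of (start, end) word offsets aligned with span_logits rows
    scores, labels = span_logits.max(dim=-1)
    order = torch.argsort(scores, descending=True)
    selected, used = [], set()
    for i in order.tolist():
        if labels[i].item() == non_entity_index:
            continue  # classified as non-entity
        start, end = spans[i]
        if any(pos in used for pos in range(start, end)):
            continue  # overlaps a higher-scoring span already selected
        used.update(range(start, end))
        selected.append((start, end, labels[i].item()))
    return selected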

Slide 28

Slide 28 text

Analysis: Named Entity Recognition (CoNLL-2003) http://explainaboard.nlpedia.ai/leaderboard/task-ner/

Slide 29

Slide 29 text

Analysis: Named Entity Recognition (CoNLL-2003) http://explainaboard.nlpedia.ai/leaderboard/task-ner/

Slide 30

Slide 30 text

Experiments: Extractive Question Answering (SQuAD v1.1) ● Model: two linear classifiers on top of the output word representations to predict the start and end positions of the answer ● Model inputs: ○ Words in the question and the passage ○ Wikipedia entities in the passage, automatically generated with a heuristic entity linking method ● SOTA on the SQuAD v1.1 extractive question answering dataset; LUKE reached #1 on the leaderboard. Results on SQuAD v1.1
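A small sketch of such an extractive-QA head: a single linear layer over the output word representations producing one start logit and one end logit per token. Names are illustrative assumptions.

import torch
import torch.nn as nn

class SpanPredictionHead(nn.Module):
    # Predict answer start and end positions from the output word representations.
    def __init__(self, hidden_size):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden_size, 2)  # one logit each for start and end

    def forward(self, word_hidden_states):  # (batch, seq_len, hidden)
        start_logits, end_logits = self.qa_outputs(word_hidden_states).split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)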

Slide 31

Slide 31 text

Ablation Study (1): Entity Representations ● When addressing the tasks without inputting entities, performance degrades significantly on CoNLL-2003 and SQuAD v1.1

Slide 32

Slide 32 text

Ablation Study (1): Entity Representations ● When addressing the tasks without inputting entities, performance degrades significantly on CoNLL-2003 and SQuAD v1.1. Table columns: using [MASK] entities as inputs; using Wikipedia entities as inputs

Slide 33

Slide 33 text

Ablation Study (2): Entity-aware Self-attention ● Our entity-aware self-attention mechanism consistently outperforms the original mechanism across all tasks

Slide 34

Slide 34 text

Adding LUKE to Hugging Face Transformers ● LUKE is officially supported by Hugging Face Transformers ● The state-of-the-art results reported in the paper can now be easily reproduced with Transformers on Colab notebooks: ○ NER on CoNLL-2003 ○ Relation extraction on TACRED ○ Entity typing on Open Entity ● https://github.com/studio-ousia/luke/issues/38
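An example of reproducing one of these results with a released fine-tuned checkpoint, following the library's documented entity-typing interface; the checkpoint name is the one published for Open Entity, and this is a sketch rather than the exact notebook code.

import torch
from transformers import LukeTokenizer, LukeForEntityClassification

model_name = "studio-ousia/luke-large-finetuned-open-entity"
tokenizer = LukeTokenizer.from_pretrained(model_name)
model = LukeForEntityClassification.from_pretrained(model_name)

text = "Beyoncé lives in Los Angeles."
# Character-based span of the target entity mention "Beyoncé".
inputs = tokenizer(text, entity_spans=[(0, 7)], return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted = model.config.id2label[int(logits.argmax(-1))]
print(predicted)  # e.g. "person"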

Slide 35

Slide 35 text

Summary ● LUKE is a new contextualized representation of words and entities, based on an improved transformer architecture with a novel entity-aware self-attention mechanism ● The effectiveness of LUKE is demonstrated by achieving state-of-the-art results on five important entity-related tasks ● Contact: [email protected], @ikuyamada ● Paper: https://arxiv.org/abs/2010.01057 ● Code: https://github.com/studio-ousia/luke