LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention
Ikuya Yamada1,2, Akari Asai3, Hiroyuki Shindo4,2, Hideaki Takeda5, Yuji Matsumoto2
1Studio Ousia  2RIKEN AIP  3University of Washington  4Nara Institute of Science and Technology  5National Institute of Informatics
• LUKE: new pretrained contextualized representations of words and entities with an improved transformer architecture and a novel entity-aware self-attention mechanism
• The effectiveness of LUKE is demonstrated by achieving state-of-the-art results on five important entity-related tasks: SQuAD, ReCoRD, CoNLL-2003, TACRED, and Open Entity
• LUKE is officially supported by Huggingface Transformers
• LUKE has been cited more than 100 times within a year
Background
◦ CWRs (contextualized word representations) do not provide span-level representations of entities
◦ It is difficult to capture the relationships between entities that are split into multiple tokens
◦ The pretraining task of CWRs is not well suited to entities: predicting “Rings” given “The Lord of the [MASK]” is clearly easier than predicting the entire entity
BERT…? ELMo…? The Force is not strong with them.
LUKE: Language Understanding with Knowledge-based Embeddings
• New architecture that treats both words and entities as tokens
• New pretraining strategy: randomly masking and predicting both words and entities
• Entity-aware self-attention mechanism
Input text w/ Wikipedia entity annotations: Beyoncé lives in Los Angeles
Computing Input Representations
• LUKE treats words and entities in the input text as independent tokens
• Because entities are treated as tokens:
◦ LUKE provides span-level entity representations
◦ The relationships between entities can be directly captured inside the transformer
Input text w/ Wikipedia entity annotations: Beyoncé lives in Los Angeles
Input Representations: Three Types of Embeddings
• Token embedding: represents the corresponding token in the vocabulary
◦ The entity token embedding is represented by the product of two small matrices, B (projection matrix) and U
• Position embedding: represents the position of the token in the word sequence
◦ An entity spanning multiple tokens is represented as the average of the corresponding position embedding vectors
• Entity type embedding: indicates that the token is an entity
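The B/U decomposition above can be sketched as follows. The sizes used here are illustrative assumptions, not the ones used in the paper; the point is that storing a small per-entity matrix U plus a shared projection B is much cheaper than a full-size entity embedding matrix.

```python
import numpy as np

# Illustrative sizes (assumptions): 500K entities, hidden size 768,
# reduced entity embedding dimension 256.
num_entities, hidden, reduced = 500_000, 768, 256
rng = np.random.default_rng(0)

U = rng.normal(size=(num_entities, reduced))  # small per-entity vectors
B = rng.normal(size=(reduced, hidden))        # shared projection matrix

# The full-size embedding of an entity is materialized on demand.
e = U[42] @ B
assert e.shape == (hidden,)

# Parameter count vs. a full num_entities x hidden matrix:
full = num_entities * hidden
factored = num_entities * reduced + reduced * hidden
print(factored / full)  # about a third of the parameters
```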
Pretraining
LUKE is trained to predict randomly masked words and entities in an entity-annotated corpus obtained from Wikipedia, using hyperlinks as entity annotations. 15% of the words and entities are randomly replaced with [MASK] words and [MASK] entities.
Original: Born and raised in Houston, Texas, Beyoncé performed in various singing and dancing competitions as a child. She rose to fame in the late 1990s as the lead singer of Destiny's Child
Masked: Born and [MASK] in Houston, Texas, [MASK] performed in various [MASK] and dancing competitions as a [MASK]. She rose to fame in the [MASK] 1990s as the lead singer of Destiny's Child
The pretraining task:
• predict the original word of each masked word from all words in the vocabulary
• predict the original entity of each masked entity from all entities in the vocabulary
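The masking step can be sketched as follows. `mask_tokens` is a hypothetical helper, not LUKE's actual preprocessing code; it simply replaces a random 15% of tokens with [MASK] and remembers the originals as prediction targets.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", rate=0.15, seed=1):
    """Replace ~rate of tokens with [MASK]; return the masked
    sequence and a {position: original token} target dict."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < rate:
            masked.append(mask_token)
            targets[i] = tok  # the model must predict this token
        else:
            masked.append(tok)
    return masked, targets

words = "Born and raised in Houston , Texas , Beyoncé performed".split()
masked_words, word_targets = mask_tokens(words)

# Entities are masked the same way, but as a separate token sequence.
entities = ["Houston", "Beyoncé"]
masked_entities, entity_targets = mask_tokens(entities, seed=2)
```

In LUKE both sequences are masked independently, so a word can stay visible while the entity covering it is masked, and vice versa.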
Background: Transformer’s Self-attention Mechanism
The transformer’s self-attention mechanism relates tokens to each other based on an attention weight computed for each pair of tokens:
• Given the input vector sequence x_1, x_2, …, x_k, the output vector y_i corresponding to the i-th token is computed as the weighted sum of the projected input vectors of all tokens
• The attention weight is based on the dot product of two vectors:
◦ Qx_i: the input vector of the attending token, projected by the query matrix Q
◦ Kx_j: the input vector of the token attended to, projected by the key matrix K
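A minimal single-head version of this mechanism in plain NumPy (a sketch only: real transformers use multiple heads, biases, and batched computation):

```python
import numpy as np

def self_attention(X, Q, K, V):
    """Single-head self-attention.
    X: (k, d) input vectors; Q, K, V: (d, d) projection matrices.
    y_i is the weighted sum of the value vectors V x_j, with weights
    given by a softmax over the scaled dot products (Q x_i) . (K x_j)."""
    queries, keys, values = X @ Q.T, X @ K.T, X @ V.T
    scores = queries @ keys.T / np.sqrt(X.shape[1])   # attention logits
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)     # row-wise softmax
    return weights @ values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                # 5 tokens, dimension 8
Q, K, V = (rng.normal(size=(8, 8)) for _ in range(3))
Y = self_attention(X, Q, K, V)
assert Y.shape == (5, 8)
```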
Proposed Method: Entity-aware Self-attention Mechanism
A simple extension of the self-attention mechanism that lets the model use the types of the target tokens when computing attention weights: instead of a single query matrix Q, LUKE uses a separate query matrix for each possible pair of token types of x_i and x_j (word–word, word–entity, entity–word, entity–entity).
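The variant above can be sketched by selecting a query matrix per token-type pair. This is an illustrative sketch, not LUKE's implementation: the per-pair Python loop is for clarity only, and an efficient implementation would batch the four cases.

```python
import numpy as np

def entity_aware_attention(X, is_entity, queries, K, V):
    """Entity-aware self-attention sketch.
    X: (k, d) inputs; is_entity[i] marks entity tokens;
    queries: dict mapping the type pair ('w'|'e', 'w'|'e') of
    (attending token, attended token) to a (d, d) query matrix."""
    k, d = X.shape
    keys, values = X @ K.T, X @ V.T
    scores = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            ti = "e" if is_entity[i] else "w"
            tj = "e" if is_entity[j] else "w"
            Qij = queries[(ti, tj)]  # type-pair-specific query matrix
            scores[i, j] = (Qij @ X[i]) @ keys[j] / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ values

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 6))           # 2 word tokens + 2 entity tokens
is_entity = [False, False, True, True]
queries = {(a, b): rng.normal(size=(6, 6)) for a in "we" for b in "we"}
K, V = rng.normal(size=(6, 6)), rng.normal(size=(6, 6))
Y = entity_aware_attention(X, is_entity, queries, K, V)
assert Y.shape == (4, 6)
```

Only the query projection depends on the token types; the key and value projections are shared, matching the description above.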
Fine-tuning LUKE on Downstream Tasks
LUKE can be fine-tuned on diverse tasks using similar architectures for all tasks, each based on a linear classifier on top of the representations of words, entities, or both:
• Open Entity: entity typing
• TACRED: relation classification
• CoNLL-2003: named entity recognition
• ReCoRD: cloze-style QA
• SQuAD: extractive QA
How to Compute Entity Representations in Downstream Tasks
Entity representations can be computed in two ways:
• Using the [MASK] entity as input token(s)
◦ The model gathers information about the entities from the input text
◦ Used in all tasks except extractive QA (SQuAD)
• Using the Wikipedia entity as input token(s)
◦ The entity representations are computed from the information stored in the entity token embeddings
◦ The word representations are enriched by the entity representations inside the transformer
◦ Used in the extractive QA (SQuAD) task
Approach: a linear classifier with the output entity representation(s) as input features
Model inputs:
• Words in the target sentence
• [MASK] entity/entities representing the target entity span(s)
Datasets:
• Open Entity (entity typing)
• TACRED (relation classification)
• ReCoRD (cloze-style QA)
SOTA on all three entity-related tasks
Results on Open Entity / Results on TACRED / Results on ReCoRD
Approach:
1. Enumerate all possible spans in the input text as entity name candidates
2. Classify each span into an entity type or the non-entity type using a linear classifier based on the entity representation and the word representations of the first and last words in the span
3. Greedily select spans based on the logits
Model inputs:
• Words in the input text
• [MASK] entities corresponding to all possible entity name candidates
SOTA on the CoNLL-2003 named entity recognition dataset
Results on CoNLL-2003
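The greedy selection step might look like the sketch below. `greedy_select` is a hypothetical helper, and the exact filtering and tie-breaking in LUKE's implementation may differ; the idea is to take the highest-scoring spans first and skip any span that overlaps one already chosen.

```python
def greedy_select(spans):
    """spans: list of ((start, end), label, logit) candidates, with
    non-entity predictions already filtered out. Spans use half-open
    [start, end) word offsets. Returns non-overlapping spans chosen
    greedily in descending logit order."""
    selected = []
    for (start, end), label, logit in sorted(spans, key=lambda s: -s[2]):
        no_overlap = all(end <= s0 or start >= e0
                         for (s0, e0), _, _ in selected)
        if no_overlap:
            selected.append(((start, end), label, logit))
    return selected

spans = [((0, 2), "PER", 5.0), ((1, 3), "ORG", 4.0), ((4, 5), "LOC", 3.0)]
result = greedy_select(spans)
# ((1, 3), "ORG") is dropped because it overlaps the higher-scoring ((0, 2), "PER")
```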
Approach: linear classifiers on top of the output word representations predict the start and end positions of the answer
Model inputs:
• Words in the question and the passage
• Wikipedia entities in the passage
◦ Automatically generated with a heuristic entity linking method
SOTA on the SQuAD v1.1 extractive question answering dataset
Results on SQuAD v1.1: LUKE reached #1 on the leaderboard
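Decoding an answer span from the two classifiers can be sketched as follows. `extract_answer` and the `max_len` constraint are illustrative assumptions, not taken from the paper; the standard recipe is to pick the (start, end) pair that maximizes the sum of the two logits.

```python
import numpy as np

def extract_answer(start_logits, end_logits, max_len=30):
    """Return the (start, end) token pair maximizing
    start_logits[s] + end_logits[e], subject to s <= e < s + max_len."""
    best, best_score = (0, 0), -np.inf
    for s in range(len(start_logits)):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

start_logits = np.array([0.1, 2.0, 0.3])
end_logits = np.array([0.0, 0.5, 3.0])
span = extract_answer(start_logits, end_logits)  # (1, 2)
```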
Ablation study: without inputting entities, performance degrades significantly on both CoNLL-2003 (which uses [MASK] entities as inputs) and SQuAD v1.1 (which uses Wikipedia entities as inputs)
LUKE is officially supported by Huggingface Transformers
• The state-of-the-art results reported in the paper can now be easily reproduced using Transformers on Colab notebooks:
◦ NER on CoNLL-2003
◦ Relation extraction on TACRED
◦ Entity typing on Open Entity
https://github.com/studio-ousia/luke/issues/38
Conclusion
• LUKE: new pretrained contextualized representations of words and entities with an improved transformer architecture and a novel entity-aware self-attention mechanism
• The effectiveness of LUKE is demonstrated by achieving state-of-the-art results on five important entity-related tasks
Contact: ikuya@ousia.jp  Twitter: @ikuyamada
Paper: https://arxiv.org/abs/2010.01057
Code: https://github.com/studio-ousia/luke