LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention
Ikuya Yamada1,2, Akari Asai3, Hiroyuki Shindo4,2, Hideaki Takeda5, Yuji Matsumoto2
1Studio Ousia  2RIKEN AIP  3University of Washington  4Nara Institute of Science and Technology  5National Institute of Informatics
• LUKE: new pretrained contextualized representations of words and entities with an improved transformer architecture and a novel entity-aware self-attention mechanism
• The effectiveness of LUKE is demonstrated by achieving state-of-the-art results on five important entity-related tasks: SQuAD, ReCoRD, CoNLL-2003, TACRED, and Open Entity
• LUKE is officially supported by Huggingface Transformers
• LUKE has been cited more than 100 times within a year
Background
◦ CWRs (contextualized word representations) do not provide span-level representations of entities
◦ It is difficult to capture the relationships between entities that are split into multiple tokens
◦ The pretraining task of CWRs is not well suited to entities: predicting “Rings” given “The Lord of the [MASK]” is clearly easier than predicting the entire entity
BERT…? ELMo…? The Force is not strong with them.
LUKE: Language Understanding with Knowledge-based Embeddings
• New architecture that treats both words and entities as tokens
• New pretraining strategy: randomly masking and predicting both words and entities
• Entity-aware self-attention mechanism
Input text w/ Wikipedia entity annotations: Beyoncé lives in Los Angeles
Computing Input Representations
• LUKE treats words and entities in the input text as independent tokens
• Because entities are treated as tokens:
◦ LUKE provides span-level entity representations
◦ The relationships between entities can be directly captured inside the transformer
Input text w/ Wikipedia entity annotations: Beyoncé lives in Los Angeles
Input Representations: Three Types of Embeddings
• Token embedding: represents the corresponding token in the vocabulary
◦ The entity token embedding is represented by the product of two small matrices, B (projection matrix) and U
• Position embedding: represents the position of the token in the word sequence
◦ An entity spanning multiple tokens is represented as the average of the corresponding position embedding vectors
• Entity type embedding: indicates that the token is an entity
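The B/U decomposition above can be sketched as follows. The sizes used here are illustrative assumptions, not the ones used in the paper; the point is that storing a small per-entity matrix U plus a shared projection B is much cheaper than a full-size entity embedding matrix.

```python
import numpy as np

# Illustrative sizes (assumptions): 500K entities, hidden size 768,
# reduced entity embedding dimension 256.
num_entities, hidden, reduced = 500_000, 768, 256
rng = np.random.default_rng(0)

U = rng.normal(size=(num_entities, reduced))  # small per-entity vectors
B = rng.normal(size=(reduced, hidden))        # shared projection matrix

# The full-size embedding of an entity is materialized on demand.
e = U[42] @ B
assert e.shape == (hidden,)

# Parameter count vs. a full num_entities x hidden matrix:
full = num_entities * hidden
factored = num_entities * reduced + reduced * hidden
print(factored / full)  # about a third of the parameters
```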
Pretraining
LUKE is trained to predict randomly masked words and entities in an entity-annotated corpus obtained from Wikipedia, using hyperlinks as entity annotations. 15% of the words and entities are randomly replaced with [MASK] words and [MASK] entities.
Original: Born and raised in Houston, Texas, Beyoncé performed in various singing and dancing competitions as a child. She rose to fame in the late 1990s as the lead singer of Destiny's Child
Masked: Born and [MASK] in Houston, Texas, [MASK] performed in various [MASK] and dancing competitions as a [MASK]. She rose to fame in the [MASK] 1990s as the lead singer of Destiny's Child
The pretraining task:
• predict the original word of each masked word from all words in the vocabulary
• predict the original entity of each masked entity from all entities in the vocabulary
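The masking step can be sketched as follows. `mask_tokens` is a hypothetical helper, not LUKE's actual preprocessing code; it simply replaces a random 15% of tokens with [MASK] and remembers the originals as prediction targets.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", rate=0.15, seed=1):
    """Replace ~rate of tokens with [MASK]; return the masked
    sequence and a {position: original token} target dict."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < rate:
            masked.append(mask_token)
            targets[i] = tok  # the model must predict this token
        else:
            masked.append(tok)
    return masked, targets

words = "Born and raised in Houston , Texas , Beyoncé performed".split()
masked_words, word_targets = mask_tokens(words)

# Entities are masked the same way, but as a separate token sequence.
entities = ["Houston", "Beyoncé"]
masked_entities, entity_targets = mask_tokens(entities, seed=2)
```

In LUKE both sequences are masked independently, so a word can stay visible while the entity covering it is masked, and vice versa.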
Background: Transformer’s Self-attention Mechanism
The transformer’s self-attention mechanism relates tokens to each other based on an attention weight computed for each pair of tokens:
• Given the input vector sequence x_1, x_2, …, x_k, the output vector y_i corresponding to the i-th token is computed as the weighted sum of the projected input vectors of all tokens
• The attention weight is based on the dot product of two vectors:
◦ Qx_i: the input vector of the attending token, projected by the query matrix Q
◦ Kx_j: the input vector of the token attended to, projected by the key matrix K
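A minimal single-head version of this mechanism in plain NumPy (a sketch only: real transformers use multiple heads, biases, and batched computation):

```python
import numpy as np

def self_attention(X, Q, K, V):
    """Single-head self-attention.
    X: (k, d) input vectors; Q, K, V: (d, d) projection matrices.
    y_i is the weighted sum of the value vectors V x_j, with weights
    given by a softmax over the scaled dot products (Q x_i) . (K x_j)."""
    queries, keys, values = X @ Q.T, X @ K.T, X @ V.T
    scores = queries @ keys.T / np.sqrt(X.shape[1])   # attention logits
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)     # row-wise softmax
    return weights @ values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                # 5 tokens, dimension 8
Q, K, V = (rng.normal(size=(8, 8)) for _ in range(3))
Y = self_attention(X, Q, K, V)
assert Y.shape == (5, 8)
```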
Proposed Method: Entity-aware Self-attention Mechanism
A simple extension of the self-attention mechanism that lets the model use the types of the target tokens when computing attention weights: instead of a single query matrix Q, LUKE uses a separate query matrix for each possible pair of token types of x_i and x_j (word–word, word–entity, entity–word, entity–entity).
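The variant above can be sketched by selecting a query matrix per token-type pair. This is an illustrative sketch, not LUKE's implementation: the per-pair Python loop is for clarity only, and an efficient implementation would batch the four cases.

```python
import numpy as np

def entity_aware_attention(X, is_entity, queries, K, V):
    """Entity-aware self-attention sketch.
    X: (k, d) inputs; is_entity[i] marks entity tokens;
    queries: dict mapping the type pair ('w'|'e', 'w'|'e') of
    (attending token, attended token) to a (d, d) query matrix."""
    k, d = X.shape
    keys, values = X @ K.T, X @ V.T
    scores = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            ti = "e" if is_entity[i] else "w"
            tj = "e" if is_entity[j] else "w"
            Qij = queries[(ti, tj)]  # type-pair-specific query matrix
            scores[i, j] = (Qij @ X[i]) @ keys[j] / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ values

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 6))           # 2 word tokens + 2 entity tokens
is_entity = [False, False, True, True]
queries = {(a, b): rng.normal(size=(6, 6)) for a in "we" for b in "we"}
K, V = rng.normal(size=(6, 6)), rng.normal(size=(6, 6))
Y = entity_aware_attention(X, is_entity, queries, K, V)
assert Y.shape == (4, 6)
```

Only the query projection depends on the token types; the key and value projections are shared, matching the description above.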
Fine-tuning LUKE on Downstream Tasks
LUKE can be fine-tuned on diverse tasks using similar architectures for all tasks, each based on a linear classifier on top of the representations of words, entities, or both:
• Open Entity: entity typing
• TACRED: relation classification
• CoNLL-2003: named entity recognition
• ReCoRD: cloze-style QA
• SQuAD: extractive QA
How to Compute Entity Representations in Downstream Tasks
Entity representations can be computed in two ways:
• Using the [MASK] entity as input token(s)
◦ The model gathers information about the entities from the input text
◦ Used in all tasks except extractive QA (SQuAD)
• Using the Wikipedia entity as input token(s)
◦ The entity representations are computed from the information stored in the entity token embeddings
◦ The word representations are enriched by the entity representations inside the transformer
◦ Used in the extractive QA (SQuAD) task
Approach: a linear classifier with the output entity representation(s) as input features
Model inputs:
• Words in the target sentence
• [MASK] entity/entities representing the target entity span(s)
Datasets:
• Open Entity (entity typing)
• TACRED (relation classification)
• ReCoRD (cloze-style QA)
SOTA on all three entity-related tasks
Results on Open Entity / Results on TACRED / Results on ReCoRD
Approach:
1. Enumerate all possible spans in the input text as entity name candidates
2. Classify each span into an entity type or the non-entity type using a linear classifier based on the entity representation and the word representations of the first and last words in the span
3. Greedily select spans based on the logits
Model inputs:
• Words in the input text
• [MASK] entities corresponding to all possible entity name candidates
SOTA on the CoNLL-2003 named entity recognition dataset
Results on CoNLL-2003
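The greedy selection step might look like the sketch below. `greedy_select` is a hypothetical helper, and the exact filtering and tie-breaking in LUKE's implementation may differ; the idea is to take the highest-scoring spans first and skip any span that overlaps one already chosen.

```python
def greedy_select(spans):
    """spans: list of ((start, end), label, logit) candidates, with
    non-entity predictions already filtered out. Spans use half-open
    [start, end) word offsets. Returns non-overlapping spans chosen
    greedily in descending logit order."""
    selected = []
    for (start, end), label, logit in sorted(spans, key=lambda s: -s[2]):
        no_overlap = all(end <= s0 or start >= e0
                         for (s0, e0), _, _ in selected)
        if no_overlap:
            selected.append(((start, end), label, logit))
    return selected

spans = [((0, 2), "PER", 5.0), ((1, 3), "ORG", 4.0), ((4, 5), "LOC", 3.0)]
result = greedy_select(spans)
# ((1, 3), "ORG") is dropped because it overlaps the higher-scoring ((0, 2), "PER")
```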
Approach: linear classifiers on top of the output word representations predict the start and end positions of the answer
Model inputs:
• Words in the question and the passage
• Wikipedia entities in the passage
◦ Automatically generated with a heuristic entity linking method
SOTA on the SQuAD v1.1 extractive question answering dataset
Results on SQuAD v1.1: LUKE reached #1 on the leaderboard
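Decoding an answer span from the two classifiers can be sketched as follows. `extract_answer` and the `max_len` constraint are illustrative assumptions, not taken from the paper; the standard recipe is to pick the (start, end) pair that maximizes the sum of the two logits.

```python
import numpy as np

def extract_answer(start_logits, end_logits, max_len=30):
    """Return the (start, end) token pair maximizing
    start_logits[s] + end_logits[e], subject to s <= e < s + max_len."""
    best, best_score = (0, 0), -np.inf
    for s in range(len(start_logits)):
        for e in range(s, min(s + max_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

start_logits = np.array([0.1, 2.0, 0.3])
end_logits = np.array([0.0, 0.5, 3.0])
span = extract_answer(start_logits, end_logits)  # (1, 2)
```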
Ablation study: without inputting entities, performance degrades significantly on both CoNLL-2003 (which uses [MASK] entities as inputs) and SQuAD v1.1 (which uses Wikipedia entities as inputs)
LUKE is officially supported by Huggingface Transformers
• The state-of-the-art results reported in the paper can now be easily reproduced using Transformers on Colab notebooks:
◦ NER on CoNLL-2003
◦ Relation extraction on TACRED
◦ Entity typing on Open Entity
https://github.com/studio-ousia/luke/issues/38
Conclusion
• LUKE: new pretrained contextualized representations of words and entities with an improved transformer architecture and a novel entity-aware self-attention mechanism
• The effectiveness of LUKE is demonstrated by achieving state-of-the-art results on five important entity-related tasks
Contact: ikuya@ousia.jp  Twitter: @ikuyamada
Paper: https://arxiv.org/abs/2010.01057
Code: https://github.com/studio-ousia/luke