• Word-level vectorization • Contextualized vectorization (embeddings) • Sentence- or document-level vectorization

…fail to capture higher-level concepts in context, such as polysemous disambiguation, syntactic structures, and anaphora. The second-generation PTMs focus on contextual word embeddings, such as CoVe [126], OpenAI GPT [142] and BERT [36]. These learned encoders are still needed to represent words in context for downstream tasks. Besides, various pre-training tasks are also proposed to learn PTMs for different purposes.

The contributions of this survey can be summarized as follows:
1. Comprehensive review. We provide a comprehensive review of PTMs for NLP, including background knowledge, model architecture, pre-training tasks, various extensions, adaption approaches, and applications.
2. New taxonomy. We propose a taxonomy of PTMs for NLP, which categorizes existing PTMs from four different perspectives: 1) representation type; 2) model architecture; 3) type of pre-training task; 4) extensions for specific types of scenarios.
3. Abundant resources. We collect abundant resources on PTMs, including open-source implementations, visualization tools, corpora, and paper lists.
4. Future directions. We discuss and analyze the limitations of existing PTMs, and suggest possible future research directions.

…represent the meaning of a piece of text by low-dimensional real-valued vectors. Each dimension of the vector has no corresponding sense, while the whole vector represents a concrete concept. Figure 1 illustrates the generic neural architecture for NLP. There are two kinds of word embeddings: non-contextual and contextual embeddings. The difference between them is whether the embedding for a word changes dynamically according to the context it appears in.

Figure 1: Generic Neural Architecture for NLP (non-contextual embeddings e_{x_1}, …, e_{x_7} feed into a contextual encoder, which produces contextual embeddings h_1, …, h_7 for a task-specific model)

Non-contextual Embeddings. The first step of representing language is to map discrete language symbols into a distributed embedding space. Formally, for each word (or sub-word) x in a vocabulary V, we map it to a vector e_x ∈ R^{D_e} with a lookup table E ∈ R^{D_e × |V|}, where D_e is a hyper-parameter indicating the dimension of token embeddings. These embeddings are trained on task data along with other model parameters.
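As a concrete illustration of the lookup table E ∈ R^{D_e × |V|} described above, the following minimal PyTorch sketch maps token ids to their non-contextual vectors e_x. The toy vocabulary and dimensions are invented for illustration; this assumes PyTorch is installed and is not tied to any specific PTM.

```python
import torch
import torch.nn as nn

# Minimal sketch of a non-contextual embedding lookup table E in R^{D_e x |V|}.
# The vocabulary and embedding dimension here are illustrative hyper-parameters.
vocab = {"[PAD]": 0, "the": 1, "bank": 2, "river": 3, "money": 4}
D_e = 8                                                    # token embedding dimension
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=D_e)

# Map a token sequence to its (context-independent) vectors e_x.
token_ids = torch.tensor([[vocab["the"], vocab["bank"]]])  # shape (1, 2)
e_x = embedding(token_ids)                                 # shape (1, 2, D_e)
print(e_x.shape)  # torch.Size([1, 2, 8]); the same id always yields the same vector
```

Note that "bank" receives the same vector regardless of whether it appears next to "river" or "money"; producing context-dependent vectors is the job of the contextual encoder described next.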
…is called the contextual embedding of token x_t because of the contextual information included in it.

2.2 Neural Contextual Encoders
Most of the neural contextual encoders can be classified into two categories: sequence models and graph-based models. Figure 2 illustrates the architectures of these models.

Figure 2: Neural Contextual Encoders: (a) Convolutional Model, (b) Recurrent Model, (c) Fully-Connected Self-Attention Model

2.2.1 Sequence Models
Sequence models usually capture the local context of a word in sequential order.

Convolutional Models. Convolutional models take the embeddings of words in the input sentence and capture the meaning of a word by aggregating local information from its neighbors with convolution operations. […]

…A successful instance of the fully-connected self-attention model is the Transformer [184], which also needs other supplementary modules, such as positional embeddings, layer normalization, residual connections and position-wise feed-forward network (FFN) layers.

2.2.3 Analysis
Sequence models learn the contextual representation of a word with a locality bias and have difficulty capturing long-range interactions between words. Nevertheless, sequence models are usually easy to train and give good results on various NLP tasks. In contrast, as an instantiated fully-connected self-attention model, the Transformer can directly model the dependency between any two words in a sequence […]
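To make the "fully-connected self-attention" idea concrete, here is a hedged, single-head scaled dot-product self-attention sketch in PyTorch. It is a simplification of the Transformer's multi-head attention: positional embeddings, masking, and multiple heads are omitted, and all class and variable names are illustrative.

```python
import math
import torch
import torch.nn as nn

class SingleHeadSelfAttention(nn.Module):
    """Minimal single-head scaled dot-product self-attention.

    Every position attends to every other position, so pairwise (long-range)
    dependencies are modeled directly, unlike convolutional or recurrent encoders."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq_len, d_model)
        Q, K, V = self.q(x), self.k(x), self.v(x)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(x.size(-1))  # (batch, seq, seq)
        weights = scores.softmax(dim=-1)   # fully-connected attention weights
        return weights @ V                 # contextual representations h_1 .. h_T

x = torch.randn(1, 5, 16)                  # 5 tokens, d_model = 16
h = SingleHeadSelfAttention(16)(x)
print(h.shape)                             # torch.Size([1, 5, 16])
```

The (seq × seq) attention matrix is what gives the model its "fully-connected" character and also its quadratic cost in sequence length.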
Encoder and Decoder Stacks

Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.

https://arxiv.org/abs/1706.03762
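The following PyTorch sketch mirrors the sub-layer structure quoted above, i.e. LayerNorm(x + Sublayer(x)) around a multi-head self-attention sub-layer and a position-wise FFN, with d_model = 512. It is a simplified illustration (no dropout, no positional encoding, no attention mask), not the reference implementation.

```python
import torch
import torch.nn as nn

class TransformerEncoderLayer(nn.Module):
    """One encoder layer as described above: two sub-layers (multi-head
    self-attention, position-wise FFN), each wrapped as LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        attn_out, _ = self.attn(x, x, x)                  # sub-layer 1: self-attention
        x = self.norm1(x + attn_out)                      # residual + layer norm
        x = self.norm2(x + self.ffn(x))                   # sub-layer 2: position-wise FFN
        return x

x = torch.randn(2, 10, 512)
print(TransformerEncoderLayer()(x).shape)                 # torch.Size([2, 10, 512])
```

Stacking N = 6 such layers (plus embeddings and positional encodings) gives the encoder described in the excerpt.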
[Figure residue: BERT input/output layout with [CLS], Tok 1 … Tok N, [SEP] tokens; pre-training with Masked LM and NSP over an unlabeled sentence pair A/B; fine-tuning on SQuAD (question/paragraph with start/end span prediction), NER, MNLI.]

Figure 1: Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating questions/answers).

…language modeling and auto-encoder objectives have been used for pre-training such models (Howard and Ruder, 2018; Radford et al., 2018; Dai and Le, 2015). […] …minimal difference between the pre-trained architecture and the final downstream architecture.

https://arxiv.org/abs/1810.04805
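As a usage-level illustration of the masked-LM objective mentioned in the caption, the sketch below queries a pre-trained BERT masked-LM head through the Hugging Face Transformers library (assuming the `transformers` package and the `bert-base-uncased` checkpoint are available). It only shows inference with a [MASK] token, not the actual pre-training loop.

```python
# Sketch of BERT's masked-LM interface via the Hugging Face Transformers library.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# The tokenizer adds [CLS] ... [SEP] automatically; we mask one token ourselves.
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                      # (1, seq_len, vocab_size)

# Find the masked position and read off BERT's most likely replacement token.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = logits[0, mask_pos].argmax(-1)
print(tokenizer.decode([predicted_id.item()]))           # typically "paris"
```

Fine-tuning for a downstream task reuses the same encoder weights and simply swaps the output layer, as the figure caption states.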
Table 5: Resources of PTMs

Open-Source Implementations §
  word2vec: CBOW, Skip-Gram. https://github.com/tmikolov/word2vec
  GloVe: Pre-trained word vectors. https://nlp.stanford.edu/projects/glove
  FastText: Pre-trained word vectors. https://github.com/facebookresearch/fastText
  Transformers: Framework: PyTorch & TF; PTMs: BERT, GPT-2, RoBERTa, XLNet, etc. https://github.com/huggingface/transformers
  Fairseq: Framework: PyTorch; PTMs: English LM, German LM, RoBERTa, etc. https://github.com/pytorch/fairseq
  Flair: Framework: PyTorch; PTMs: BERT, ELMo, GPT, RoBERTa, XLNet, etc. https://github.com/flairNLP/flair
  AllenNLP [47]: Framework: PyTorch; PTMs: ELMo, BERT, GPT-2, etc. https://github.com/allenai/allennlp
  fastNLP: Framework: PyTorch; PTMs: RoBERTa, GPT, etc. https://github.com/fastnlp/fastNLP
  UniLMs: Framework: PyTorch; PTMs: UniLM v1&v2, MiniLM, LayoutLM, etc. https://github.com/microsoft/unilm
  Chinese-BERT [29]: Framework: PyTorch & TF; PTMs: BERT, RoBERTa, etc. (for Chinese). https://github.com/ymcui/Chinese-BERT-wwm
  BERT [36]: Framework: TF; PTMs: BERT, BERT-wwm. https://github.com/google-research/bert
  RoBERTa [117]: Framework: PyTorch. https://github.com/pytorch/fairseq/tree/master/examples/roberta
  XLNet [209]: Framework: TF. https://github.com/zihangdai/xlnet/
  ALBERT [93]: Framework: TF. https://github.com/google-research/ALBERT
  T5 [144]: Framework: TF. https://github.com/google-research/text-to-text-transfer-transformer
  ERNIE (Baidu) [170, 171]: Framework: PaddlePaddle. https://github.com/PaddlePaddle/ERNIE
  CTRL [84]: Conditional Transformer Language Model for controllable generation. https://github.com/salesforce/ctrl
  BertViz [185]: Visualization tool. https://github.com/jessevig/bertviz
  exBERT [65]: Visualization tool. https://github.com/bhoov/exbert
  TextBrewer [210]: PyTorch-based toolkit for distillation of NLP models. https://github.com/airaria/TextBrewer
  DeepPavlov: Conversational AI library; PTMs for Russian, Polish, Bulgarian, Czech, and informal English. https://github.com/deepmipt/DeepPavlov

Corpora
  OpenWebText: Open clone of OpenAI's unreleased WebText dataset. https://github.com/jcpeterson/openwebtext
  Common Crawl: A very large collection of text. http://commoncrawl.org/
  WikiEn: English Wikipedia dumps. https://dumps.wikimedia.org/enwiki/

Other Resources
  Paper list: https://github.com/thunlp/PLMpapers
  Paper list: https://github.com/tomohideshibata/BERT-related-papers
  Paper list: https://github.com/cedrickchee/awesome-bert-nlp
  Bert Lang Street: A collection of BERT models with reported performances on different datasets, tasks and languages. https://bertlang.unibocconi.it/

§ Most papers on PTMs release links to their official implementations; listed here are some popular third-party and official implementations.

However, motivated by the fact that recent progress has dramatically eroded headroom on the GLUE benchmark, a new benchmark called SuperGLUE [189] was presented. Compared to GLUE, SuperGLUE has more challenging tasks and more diverse task formats (e.g., coreference resolution and question answering). State-of-the-art PTMs are listed on the corresponding leaderboards.

…(HotpotQA) [208]. BERT creatively transforms the extractive QA task into a span prediction task that predicts the starting position as well as the ending position of the answer [36]. Since then, using a PTM as an encoder to predict spans has become a competitive baseline. For extractive QA, Zhang et al.
[215] proposed a retrospective reader architecture and initialized the encoder with a PTM (e.g., ALBERT). For multi-round generative QA, Ju …
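The span-prediction formulation described above can be sketched as a small head on top of a PTM encoder: one linear layer produces a start logit and an end logit per token, conceptually similar to what Hugging Face's BertForQuestionAnswering does. The random tensor below is a stand-in for real encoder outputs, and all names are illustrative.

```python
import torch
import torch.nn as nn

class SpanPredictionHead(nn.Module):
    """BERT-style extractive-QA head: a single linear layer maps each token's
    contextual embedding to a start logit and an end logit; the answer span is
    the (start, end) pair with the highest score."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden_size, 2)      # 2 = (start, end)

    def forward(self, hidden_states: torch.Tensor):      # (batch, seq_len, hidden)
        start_logits, end_logits = self.qa_outputs(hidden_states).split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)

# Toy usage with random "encoder outputs" standing in for a PTM such as BERT or ALBERT.
hidden = torch.randn(1, 20, 768)
start_logits, end_logits = SpanPredictionHead()(hidden)
start, end = start_logits.argmax(-1).item(), end_logits.argmax(-1).item()
print(f"predicted answer span: tokens {start}..{end}")
```

In practice the input is the concatenated [CLS] question [SEP] paragraph [SEP] sequence, and the span is constrained to lie inside the paragraph.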
In this paper, Facebook AI Research demonstrates unsupervised translation between C++, Java, and Python.

[Figure residue omitted: C++/Python code snippets illustrating cross-lingual masked language modeling, denoising auto-encoding, and back-translation; see the caption below.]

Figure 1: Illustration of the three principles of unsupervised machine translation used by our approach. The first principle initializes the model with cross-lingual masked language model pretraining. As a result, pieces of code that express the same instructions are mapped to the same representation, regardless of the programming language. Denoising auto-encoding, the second principle, trains the decoder to always generate valid sequences, even when fed with noisy data, and increases the encoder robustness to input noise. Back-translation, the last principle, allows the model to generate parallel data which can be used for training. Whenever the Python → C++ model becomes better, it generates more accurate data for the C++ → Python model, and vice versa. Figure 5 in the appendix provides a representation of the cross-lingual embeddings we obtain after training.

The cross-lingual nature of the resulting model comes from the significant number of common tokens (anchor points) that exist across languages. In the context of English-French translation, the …

https://arxiv.org/abs/2006.03511
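The back-translation principle from the caption can be sketched as the following training round. `py2cpp_model`, `cpp2py_model` and `train_step` are invented placeholders for illustration, not the actual TransCoder API (in TransCoder both directions share one encoder-decoder with language embeddings).

```python
# Hedged sketch of one round of back-translation between Python and C++.
# All objects passed in are hypothetical: a seq2seq model with a `translate`
# method per direction, and a supervised `train_step(model, src, tgt)` update.

def back_translation_round(python_corpus, cpp_corpus,
                           py2cpp_model, cpp2py_model, train_step):
    # 1) Translate monolingual Python code into (noisy) C++ to create
    #    pseudo-parallel data, then train the reverse model C++ -> Python on it.
    for py_code in python_corpus:
        pseudo_cpp = py2cpp_model.translate(py_code)
        train_step(cpp2py_model, src=pseudo_cpp, tgt=py_code)

    # 2) Symmetrically, translate C++ into pseudo-Python and train Python -> C++.
    for cpp_code in cpp_corpus:
        pseudo_py = cpp2py_model.translate(cpp_code)
        train_step(py2cpp_model, src=pseudo_py, tgt=cpp_code)

# As each direction improves, the pseudo-parallel data it generates for the
# other direction becomes more accurate, so the two models bootstrap each other.
```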
(The pre-training objective used is GSG, Gap Sentences Generation.)

PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
Jingqing Zhang*, Yao Zhao*, Mohammad Saleh, Peter J. Liu

Abstract: Recent work pre-training Transformers with self-supervised objectives on large text corpora has shown great success when fine-tuned on downstream NLP tasks including text summarization. However, pre-training objectives tailored for abstractive text summarization have not been explored. Furthermore, there is a lack of systematic evaluation across diverse domains. In this work, we propose pre-training large Transformer-based encoder-decoder models on massive text corpora with a new self-supervised objective. In PEGASUS, important sentences are removed/masked from an input document and are generated together as one output sequence from the remaining sentences, similar …

Figure 1: The base architecture of PEGASUS is a standard Transformer encoder-decoder. Both GSG and MLM are applied simultaneously to this example as pre-training objectives. Originally there are three sentences. One sentence is masked with [MASK1] and used as target generation text (GSG). The other two sentences remain in the input, but some tokens are randomly masked by [MASK2] (MLM).

https://arxiv.org/abs/1912.08777
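To illustrate how a GSG training pair might be constructed, here is a rough sketch that masks one "important" sentence with [MASK1] and uses it as the generation target. A crude word-overlap score stands in for the ROUGE-based sentence selection used in the paper, and the function name and toy document are invented; token-level [MASK2] masking for the MLM objective is omitted.

```python
# Minimal sketch of building a single GSG (gap-sentences generation) example.
# PEGASUS selects "important" sentences by ROUGE against the rest of the document;
# here a crude word-overlap score is used as a stand-in for ROUGE.

def make_gsg_example(sentences):
    def overlap(i):
        rest = set(w for j, s in enumerate(sentences) if j != i
                   for w in s.lower().split())
        words = sentences[i].lower().split()
        return sum(w in rest for w in words) / max(len(words), 1)

    # Pick the sentence with the highest overlap with the remaining document.
    target_idx = max(range(len(sentences)), key=overlap)
    target = sentences[target_idx]

    # Input: the document with the selected sentence replaced by [MASK1];
    # output: the masked sentence, to be generated by the decoder.
    source = " ".join("[MASK1]" if i == target_idx else s
                      for i, s in enumerate(sentences))
    return source, target

doc = ["PEGASUS is pre-trained on large corpora.",
       "Important sentences are masked from the document.",
       "The model learns to generate the masked sentences."]
src, tgt = make_gsg_example(doc)
print(src)
print(tgt)
```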