• Word-level vectorization • Contextualized vectorization (embeddings) • Sentence- or document-level vectorization

…fail to capture higher-level concepts in context, such as polysemous disambiguation, syntactic structures, and anaphora. The second-generation PTMs focus on contextual word embeddings, such as CoVe [126], OpenAI GPT [142] and BERT [36]. These learned encoders are still needed to represent words in context for downstream tasks. Besides, various pre-training tasks are also proposed to learn PTMs for different purposes.

The contributions of this survey can be summarized as follows:
1. Comprehensive review. We provide a comprehensive review of PTMs for NLP, including background knowledge, model architecture, pre-training tasks, various extensions, adaption approaches, and applications.
2. New taxonomy. We propose a taxonomy of PTMs for NLP, which categorizes existing PTMs from four different perspectives: 1) representation type; 2) model architecture; 3) type of pre-training task; 4) extensions for specific types of scenarios.
3. Abundant resources. We collect abundant resources on PTMs, including open-source implementations, visualization tools, corpora, and paper lists.
4. Future directions. We discuss and analyze the limitations of existing PTMs, and suggest possible future research directions.

…represent the meaning of a piece of text by low-dimensional real-valued vectors. Each dimension of the vector has no corresponding sense, while the whole vector represents a concrete concept. Figure 1 illustrates the generic neural architecture for NLP. There are two kinds of word embeddings: non-contextual and contextual embeddings. The difference between them is whether the embedding for a word changes dynamically according to the context it appears in.

Figure 1: Generic Neural Architecture for NLP (non-contextual embeddings e_{x_1}, …, e_{x_7} feed into a contextual encoder, which produces contextual embeddings h_1, …, h_7 for a task-specific model)

Non-contextual Embeddings. The first step of representing language is to map discrete language symbols into a distributed embedding space. Formally, for each word (or sub-word) x in a vocabulary V, we map it to a vector e_x ∈ R^{D_e} with a lookup table E ∈ R^{D_e × |V|}, where D_e is a hyper-parameter indicating the dimension of token embeddings. These embeddings are trained on task data along with other model parameters.
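As a concrete illustration of the lookup table E ∈ R^{D_e × |V|} described above, the following minimal PyTorch sketch maps token ids to their non-contextual vectors e_x. The toy vocabulary and dimensions are invented for illustration; this assumes PyTorch is installed and is not tied to any specific PTM.

```python
import torch
import torch.nn as nn

# Minimal sketch of a non-contextual embedding lookup table E in R^{D_e x |V|}.
# The vocabulary and embedding dimension here are illustrative hyper-parameters.
vocab = {"[PAD]": 0, "the": 1, "bank": 2, "river": 3, "money": 4}
D_e = 8                                                    # token embedding dimension
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=D_e)

# Map a token sequence to its (context-independent) vectors e_x.
token_ids = torch.tensor([[vocab["the"], vocab["bank"]]])  # shape (1, 2)
e_x = embedding(token_ids)                                 # shape (1, 2, D_e)
print(e_x.shape)  # torch.Size([1, 2, 8]); the same id always yields the same vector
```

Note that "bank" receives the same vector regardless of whether it appears next to "river" or "money"; producing context-dependent vectors is the job of the contextual encoder described next.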
…is called the contextual embedding of token x_t because of the contextual information included in it.

2.2 Neural Contextual Encoders
Most of the neural contextual encoders can be classified into two categories: sequence models and graph-based models. Figure 2 illustrates the architectures of these models.

Figure 2: Neural Contextual Encoders: (a) Convolutional Model, (b) Recurrent Model, (c) Fully-Connected Self-Attention Model

2.2.1 Sequence Models
Sequence models usually capture the local context of a word in sequential order.

Convolutional Models. Convolutional models take the embeddings of words in the input sentence and capture the meaning of a word by aggregating local information from its neighbors with convolution operations. […]

…A successful instance of the fully-connected self-attention model is the Transformer [184], which also needs other supplementary modules, such as positional embeddings, layer normalization, residual connections and position-wise feed-forward network (FFN) layers.

2.2.3 Analysis
Sequence models learn the contextual representation of a word with a locality bias and have difficulty capturing long-range interactions between words. Nevertheless, sequence models are usually easy to train and give good results on various NLP tasks. In contrast, as an instantiated fully-connected self-attention model, the Transformer can directly model the dependency between any two words in a sequence […]
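To make the "fully-connected self-attention" idea concrete, here is a hedged, single-head scaled dot-product self-attention sketch in PyTorch. It is a simplification of the Transformer's multi-head attention: positional embeddings, masking, and multiple heads are omitted, and all class and variable names are illustrative.

```python
import math
import torch
import torch.nn as nn

class SingleHeadSelfAttention(nn.Module):
    """Minimal single-head scaled dot-product self-attention.

    Every position attends to every other position, so pairwise (long-range)
    dependencies are modeled directly, unlike convolutional or recurrent encoders."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq_len, d_model)
        Q, K, V = self.q(x), self.k(x), self.v(x)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(x.size(-1))  # (batch, seq, seq)
        weights = scores.softmax(dim=-1)   # fully-connected attention weights
        return weights @ V                 # contextual representations h_1 .. h_T

x = torch.randn(1, 5, 16)                  # 5 tokens, d_model = 16
h = SingleHeadSelfAttention(16)(x)
print(h.shape)                             # torch.Size([1, 5, 16])
```

The (seq × seq) attention matrix is what gives the model its "fully-connected" character and also its quadratic cost in sequence length.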
Encoder and Decoder Stacks

Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.

https://arxiv.org/abs/1706.03762
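The following PyTorch sketch mirrors the sub-layer structure quoted above, i.e. LayerNorm(x + Sublayer(x)) around a multi-head self-attention sub-layer and a position-wise FFN, with d_model = 512. It is a simplified illustration (no dropout, no positional encoding, no attention mask), not the reference implementation.

```python
import torch
import torch.nn as nn

class TransformerEncoderLayer(nn.Module):
    """One encoder layer as described above: two sub-layers (multi-head
    self-attention, position-wise FFN), each wrapped as LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        attn_out, _ = self.attn(x, x, x)                  # sub-layer 1: self-attention
        x = self.norm1(x + attn_out)                      # residual + layer norm
        x = self.norm2(x + self.ffn(x))                   # sub-layer 2: position-wise FFN
        return x

x = torch.randn(2, 10, 512)
print(TransformerEncoderLayer()(x).shape)                 # torch.Size([2, 10, 512])
```

Stacking N = 6 such layers (plus embeddings and positional encodings) gives the encoder described in the excerpt.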
[Figure residue: BERT input/output layout with [CLS], Tok 1 … Tok N, [SEP] tokens; pre-training with Masked LM and NSP over an unlabeled sentence pair A/B; fine-tuning on SQuAD (question/paragraph with start/end span prediction), NER, MNLI.]

Figure 1: Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating questions/answers).

…language modeling and auto-encoder objectives have been used for pre-training such models (Howard and Ruder, 2018; Radford et al., 2018; Dai and Le, 2015). […] …minimal difference between the pre-trained architecture and the final downstream architecture.

https://arxiv.org/abs/1810.04805
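As a usage-level illustration of the masked-LM objective mentioned in the caption, the sketch below queries a pre-trained BERT masked-LM head through the Hugging Face Transformers library (assuming the `transformers` package and the `bert-base-uncased` checkpoint are available). It only shows inference with a [MASK] token, not the actual pre-training loop.

```python
# Sketch of BERT's masked-LM interface via the Hugging Face Transformers library.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# The tokenizer adds [CLS] ... [SEP] automatically; we mask one token ourselves.
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                      # (1, seq_len, vocab_size)

# Find the masked position and read off BERT's most likely replacement token.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = logits[0, mask_pos].argmax(-1)
print(tokenizer.decode([predicted_id.item()]))           # typically "paris"
```

Fine-tuning for a downstream task reuses the same encoder weights and simply swaps the output layer, as the figure caption states.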
Table 5: Resources of PTMs

Open-Source Implementations §
  word2vec: CBOW, Skip-Gram. https://github.com/tmikolov/word2vec
  GloVe: Pre-trained word vectors. https://nlp.stanford.edu/projects/glove
  FastText: Pre-trained word vectors. https://github.com/facebookresearch/fastText
  Transformers: Framework: PyTorch & TF; PTMs: BERT, GPT-2, RoBERTa, XLNet, etc. https://github.com/huggingface/transformers
  Fairseq: Framework: PyTorch; PTMs: English LM, German LM, RoBERTa, etc. https://github.com/pytorch/fairseq
  Flair: Framework: PyTorch; PTMs: BERT, ELMo, GPT, RoBERTa, XLNet, etc. https://github.com/flairNLP/flair
  AllenNLP [47]: Framework: PyTorch; PTMs: ELMo, BERT, GPT-2, etc. https://github.com/allenai/allennlp
  fastNLP: Framework: PyTorch; PTMs: RoBERTa, GPT, etc. https://github.com/fastnlp/fastNLP
  UniLMs: Framework: PyTorch; PTMs: UniLM v1&v2, MiniLM, LayoutLM, etc. https://github.com/microsoft/unilm
  Chinese-BERT [29]: Framework: PyTorch & TF; PTMs: BERT, RoBERTa, etc. (for Chinese). https://github.com/ymcui/Chinese-BERT-wwm
  BERT [36]: Framework: TF; PTMs: BERT, BERT-wwm. https://github.com/google-research/bert
  RoBERTa [117]: Framework: PyTorch. https://github.com/pytorch/fairseq/tree/master/examples/roberta
  XLNet [209]: Framework: TF. https://github.com/zihangdai/xlnet/
  ALBERT [93]: Framework: TF. https://github.com/google-research/ALBERT
  T5 [144]: Framework: TF. https://github.com/google-research/text-to-text-transfer-transformer
  ERNIE (Baidu) [170, 171]: Framework: PaddlePaddle. https://github.com/PaddlePaddle/ERNIE
  CTRL [84]: Conditional Transformer Language Model for controllable generation. https://github.com/salesforce/ctrl
  BertViz [185]: Visualization tool. https://github.com/jessevig/bertviz
  exBERT [65]: Visualization tool. https://github.com/bhoov/exbert
  TextBrewer [210]: PyTorch-based toolkit for distillation of NLP models. https://github.com/airaria/TextBrewer
  DeepPavlov: Conversational AI library; PTMs for Russian, Polish, Bulgarian, Czech, and informal English. https://github.com/deepmipt/DeepPavlov

Corpora
  OpenWebText: Open clone of OpenAI's unreleased WebText dataset. https://github.com/jcpeterson/openwebtext
  Common Crawl: A very large collection of text. http://commoncrawl.org/
  WikiEn: English Wikipedia dumps. https://dumps.wikimedia.org/enwiki/

Other Resources
  Paper list: https://github.com/thunlp/PLMpapers
  Paper list: https://github.com/tomohideshibata/BERT-related-papers
  Paper list: https://github.com/cedrickchee/awesome-bert-nlp
  Bert Lang Street: A collection of BERT models with reported performances on different datasets, tasks and languages. https://bertlang.unibocconi.it/

§ Most papers on PTMs release links to their official implementations; listed here are some popular third-party and official implementations.

However, motivated by the fact that recent progress has dramatically eroded headroom on the GLUE benchmark, a new benchmark called SuperGLUE [189] was presented. Compared to GLUE, SuperGLUE has more challenging tasks and more diverse task formats (e.g., coreference resolution and question answering). State-of-the-art PTMs are listed on the corresponding leaderboards.

…(HotpotQA) [208]. BERT creatively transforms the extractive QA task into a span prediction task that predicts the starting position as well as the ending position of the answer [36]. Since then, using a PTM as an encoder to predict spans has become a competitive baseline. For extractive QA, Zhang et al.
[215] proposed a retrospective reader architecture and initialized the encoder with a PTM (e.g., ALBERT). For multi-round generative QA, Ju …
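The span-prediction formulation described above can be sketched as a small head on top of a PTM encoder: one linear layer produces a start logit and an end logit per token, conceptually similar to what Hugging Face's BertForQuestionAnswering does. The random tensor below is a stand-in for real encoder outputs, and all names are illustrative.

```python
import torch
import torch.nn as nn

class SpanPredictionHead(nn.Module):
    """BERT-style extractive-QA head: a single linear layer maps each token's
    contextual embedding to a start logit and an end logit; the answer span is
    the (start, end) pair with the highest score."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden_size, 2)      # 2 = (start, end)

    def forward(self, hidden_states: torch.Tensor):      # (batch, seq_len, hidden)
        start_logits, end_logits = self.qa_outputs(hidden_states).split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)

# Toy usage with random "encoder outputs" standing in for a PTM such as BERT or ALBERT.
hidden = torch.randn(1, 20, 768)
start_logits, end_logits = SpanPredictionHead()(hidden)
start, end = start_logits.argmax(-1).item(), end_logits.argmax(-1).item()
print(f"predicted answer span: tokens {start}..{end}")
```

In practice the input is the concatenated [CLS] question [SEP] paragraph [SEP] sequence, and the span is constrained to lie inside the paragraph.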
In this paper, Facebook AI Research demonstrates unsupervised translation between C++, Java, and Python.

[Figure residue omitted: C++/Python code snippets illustrating cross-lingual masked language modeling, denoising auto-encoding, and back-translation; see the caption below.]

Figure 1: Illustration of the three principles of unsupervised machine translation used by our approach. The first principle initializes the model with cross-lingual masked language model pretraining. As a result, pieces of code that express the same instructions are mapped to the same representation, regardless of the programming language. Denoising auto-encoding, the second principle, trains the decoder to always generate valid sequences, even when fed with noisy data, and increases the encoder robustness to input noise. Back-translation, the last principle, allows the model to generate parallel data which can be used for training. Whenever the Python → C++ model becomes better, it generates more accurate data for the C++ → Python model, and vice versa. Figure 5 in the appendix provides a representation of the cross-lingual embeddings we obtain after training.

The cross-lingual nature of the resulting model comes from the significant number of common tokens (anchor points) that exist across languages. In the context of English-French translation, the …

https://arxiv.org/abs/2006.03511
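The back-translation principle from the caption can be sketched as the following training round. `py2cpp_model`, `cpp2py_model` and `train_step` are invented placeholders for illustration, not the actual TransCoder API (in TransCoder both directions share one encoder-decoder with language embeddings).

```python
# Hedged sketch of one round of back-translation between Python and C++.
# All objects passed in are hypothetical: a seq2seq model with a `translate`
# method per direction, and a supervised `train_step(model, src, tgt)` update.

def back_translation_round(python_corpus, cpp_corpus,
                           py2cpp_model, cpp2py_model, train_step):
    # 1) Translate monolingual Python code into (noisy) C++ to create
    #    pseudo-parallel data, then train the reverse model C++ -> Python on it.
    for py_code in python_corpus:
        pseudo_cpp = py2cpp_model.translate(py_code)
        train_step(cpp2py_model, src=pseudo_cpp, tgt=py_code)

    # 2) Symmetrically, translate C++ into pseudo-Python and train Python -> C++.
    for cpp_code in cpp_corpus:
        pseudo_py = cpp2py_model.translate(cpp_code)
        train_step(py2cpp_model, src=pseudo_py, tgt=cpp_code)

# As each direction improves, the pseudo-parallel data it generates for the
# other direction becomes more accurate, so the two models bootstrap each other.
```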
(The pre-training objective used is GSG, Gap Sentences Generation.)

PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
Jingqing Zhang*, Yao Zhao*, Mohammad Saleh, Peter J. Liu

Abstract: Recent work pre-training Transformers with self-supervised objectives on large text corpora has shown great success when fine-tuned on downstream NLP tasks including text summarization. However, pre-training objectives tailored for abstractive text summarization have not been explored. Furthermore, there is a lack of systematic evaluation across diverse domains. In this work, we propose pre-training large Transformer-based encoder-decoder models on massive text corpora with a new self-supervised objective. In PEGASUS, important sentences are removed/masked from an input document and are generated together as one output sequence from the remaining sentences, similar …

Figure 1: The base architecture of PEGASUS is a standard Transformer encoder-decoder. Both GSG and MLM are applied simultaneously to this example as pre-training objectives. Originally there are three sentences. One sentence is masked with [MASK1] and used as target generation text (GSG). The other two sentences remain in the input, but some tokens are randomly masked by [MASK2] (MLM).

https://arxiv.org/abs/1912.08777
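To illustrate how a GSG training pair might be constructed, here is a rough sketch that masks one "important" sentence with [MASK1] and uses it as the generation target. A crude word-overlap score stands in for the ROUGE-based sentence selection used in the paper, and the function name and toy document are invented; token-level [MASK2] masking for the MLM objective is omitted.

```python
# Minimal sketch of building a single GSG (gap-sentences generation) example.
# PEGASUS selects "important" sentences by ROUGE against the rest of the document;
# here a crude word-overlap score is used as a stand-in for ROUGE.

def make_gsg_example(sentences):
    def overlap(i):
        rest = set(w for j, s in enumerate(sentences) if j != i
                   for w in s.lower().split())
        words = sentences[i].lower().split()
        return sum(w in rest for w in words) / max(len(words), 1)

    # Pick the sentence with the highest overlap with the remaining document.
    target_idx = max(range(len(sentences)), key=overlap)
    target = sentences[target_idx]

    # Input: the document with the selected sentence replaced by [MASK1];
    # output: the masked sentence, to be generated by the decoder.
    source = " ".join("[MASK1]" if i == target_idx else s
                      for i, s in enumerate(sentences))
    return source, target

doc = ["PEGASUS is pre-trained on large corpora.",
       "Important sentences are masked from the document.",
       "The model learns to generate the masked sentences."]
src, tgt = make_gsg_example(doc)
print(src)
print(tgt)
```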