
Recent Trends in Natural Language Processing Models


Based on a survey paper, this deck introduces recent trends in pre-trained natural language processing models, along with recent NLP-related news.

masa-ita

June 20, 2020



Transcript

  1. Recent Trends in Natural Language Processing Models: based on "Pre-trained Models for Natural Language Processing: A Survey"

     板垣正敏 (masa-ita), 2020/6/20, @Python Machine Learning Study Group in Niigata, Restart #11 Online
  2. Reference paper

     • Pre-trained Models for Natural Language Processing: A Survey
     • Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai & Xuanjing Huang
     • https://arxiv.org/abs/2003.08271
     • A comprehensive review of pre-trained models (PTMs) for natural language processing (NLP)
  3. Structure of the paper

     1. Introduction
     2. Background: background on language representation learning
     3. Overview of PTMs: an overview of pre-trained models
     4. Extensions of PTMs: extensions of pre-trained models
     5. Adapting PTMs to Downstream Tasks: applying pre-trained models to downstream tasks
     6. Resources of PTMs: where to find information on pre-trained models
     7. Applications: applications of pre-trained models
     8. Future Directions
     9. Conclusion
  4. Language representation learning

     • Broadly speaking, language representations are learned in two stages (see the sketch below):
     • Non-contextual vectorization (embeddings): word-level vectors
     • Contextual vectorization (embeddings): sentence- or document-level vectors
     (The slide shows Figure 1 of the survey, "Generic Neural Architecture for NLP": non-contextual embeddings feed a contextual encoder, followed by a task-specific model. The difference between the two kinds of embedding is whether a word's vector changes dynamically with the context it appears in. Formally, a non-contextual embedding maps each word or sub-word $x$ in a vocabulary $\mathcal{V}$ to a vector $e_x \in \mathbb{R}^{D_e}$ via a lookup table $E \in \mathbb{R}^{D_e \times |\mathcal{V}|}$, where $D_e$ is the token embedding dimension.)
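To make the two stages concrete, here is a minimal PyTorch sketch (names, dimensions and the LSTM encoder are illustrative assumptions, not taken from the survey): a lookup-table embedding gives a token the same vector wherever it occurs, while a contextual encoder produces position-dependent representations.

```python
# Minimal sketch: non-contextual vs. contextual embeddings (illustrative only).
import torch
import torch.nn as nn

vocab_size, d_embed, d_hidden = 1000, 64, 64

# Non-contextual: a lookup table E in R^{d_embed x |V|}; a token id always gets the same vector.
embedding = nn.Embedding(vocab_size, d_embed)

# Contextual: an encoder mixes information from the surrounding tokens.
encoder = nn.LSTM(d_embed, d_hidden, batch_first=True, bidirectional=True)

token_ids = torch.tensor([[5, 42, 7, 42]])   # token id 42 appears in two different contexts
e = embedding(token_ids)                      # (1, 4, 64): identical rows for both occurrences of 42
h, _ = encoder(e)                             # (1, 4, 128): the two occurrences now differ
print(torch.allclose(e[0, 1], e[0, 3]))       # True  (non-contextual)
print(torch.allclose(h[0, 1], h[0, 3]))       # False (contextual)
```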
  5. History of pre-trained NLP models

     • First generation: pre-trained word embeddings (Word2Vec, GloVe)
     • Second generation: pre-trained contextual encoders (LSTM, Seq2Seq, BiLM (bidirectional LSTM), ELMo, ULMFiT, GPT, BERT)
  6. Non-contextual embeddings: Word2Vec and GloVe

     • Skip-gram with Negative Sampling (SGNS), used in Word2Vec (Mikolov+ 2013b): for each target word, context words within a window are sampled (the slide's example uses window width 2 and one negative sample per positive pair), and the word and context vectors are updated so that inner products grow for observed pairs and shrink for the sampled negatives.
     • GloVe (Pennington+ 2014): learns word vectors by weighted least squares over the word-context co-occurrence matrix $M$ (see the sketch below), trained with AdaGrad (SGD):
       $J = \sum_{i,j=1}^{V} f(M_{ij})\,\bigl(\mathbf{w}_i^{\top}\tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log M_{ij}\bigr)^2$,
       where $f(x) = (x/x_{\max})^{\alpha}$ if $x < x_{\max}$ and $1$ otherwise (the slide's example uses $x_{\max} = 100$, $\alpha = 0.75$).
       As in SGNS, each word has two parameter sets; the cited work uses $\mathbf{w}_i + \tilde{\mathbf{w}}_i$ as the final vector, which improves accuracy.
     https://www.slideshare.net/naoakiokazaki/20150530-jsai2015
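As a reading aid for the GloVe objective above, here is a small NumPy sketch of the weighted least-squares loss and its weighting function. Array names and the toy data are assumptions, not the reference implementation; as in GloVe, only observed (non-zero) co-occurrences contribute (and f(0) = 0 in any case).

```python
# Toy sketch of the GloVe objective J = sum_ij f(M_ij) (w_i . w~_j + b_i + b~_j - log M_ij)^2
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Co-occurrence weighting f(x): down-weights rare pairs, caps frequent ones at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(M, W, W_tilde, b, b_tilde):
    """M: (V, V) co-occurrence counts; W, W_tilde: (V, d) word/context vectors; b, b_tilde: (V,) biases."""
    mask = M > 0                                   # the sum runs over observed co-occurrences only
    pred = W @ W_tilde.T + b[:, None] + b_tilde[None, :]
    err = np.where(mask, pred - np.log(np.where(mask, M, 1.0)), 0.0)
    return np.sum(glove_weight(M) * err ** 2)

rng = np.random.default_rng(0)
V, d = 50, 16
M = rng.poisson(1.0, size=(V, V)).astype(float)
loss = glove_loss(M, rng.normal(size=(V, d)), rng.normal(size=(V, d)),
                  rng.normal(size=V), rng.normal(size=V))
print(loss)
```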
  7. Neural Contextual Encoders (Qiu et al., Pre-trained Models for Natural Language Processing: A Survey, March 2020)

     Figure 2 of the survey shows the three encoder families: (a) convolutional model, (b) recurrent model, (c) fully-connected self-attention model.
     • Sequence models (convolutional and recurrent) capture the local context of a word in sequential order; they have a locality bias and struggle to capture long-range interactions between words, but they are usually easy to train and give good results on many NLP tasks.
     • The fully-connected self-attention model, whose most successful instance is the Transformer, can directly model the dependency between any pair of words, but needs supplementary modules such as positional embeddings, layer normalization, residual connections and position-wise feed-forward network (FFN) layers (a minimal self-attention sketch follows below).
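Since the fully-connected self-attention model is the key contrast on this slide, here is a minimal sketch of single-head scaled dot-product self-attention; it is the generic textbook formulation with made-up matrix names, not code from the survey.

```python
# Scaled dot-product self-attention over a sequence of token vectors (single head, no masking).
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (n, d) token representations; Wq/Wk/Wv: (d, d_k) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # every token attends to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ V                             # (n, d_k) contextual representations

rng = np.random.default_rng(0)
n, d, d_k = 5, 32, 16
X = rng.normal(size=(n, d))
out = self_attention(X, rng.normal(size=(d, d_k)), rng.normal(size=(d, d_k)), rng.normal(size=(d, d_k)))
print(out.shape)  # (5, 16)
```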
  8. Transformer

     Figure 1 of the paper: the Transformer model architecture.
     3.1 Encoder and Decoder Stacks. Encoder: the encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers: the first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. A residual connection [11] is employed around each of the two sub-layers, followed by layer normalization [1]; that is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
     https://arxiv.org/abs/1706.03762
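The quoted LayerNorm(x + Sublayer(x)) structure can be written down directly; below is an illustrative post-norm encoder layer in PyTorch (1.9+ for batch_first). d_model = 512, 8 heads and the 2048-dimensional FFN follow the paper; everything else is an assumption, not the reference implementation.

```python
# Post-norm Transformer encoder layer: LayerNorm(x + Sublayer(x)) around attention and FFN.
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)     # sub-layer 1: multi-head self-attention
        x = self.norm1(x + attn_out)         # residual connection + layer normalization
        x = self.norm2(x + self.ffn(x))      # sub-layer 2: position-wise feed-forward network
        return x

layer = EncoderLayer()
x = torch.randn(2, 10, 512)                  # (batch, sequence length, d_model)
print(layer(x).shape)                        # torch.Size([2, 10, 512])
```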
  9. BERT: Bidirectional Encoder Representations from Transformers

     Figure 1 of the paper: overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architectures are used in both pre-training and fine-tuning, and the same pre-trained parameters initialize models for different downstream tasks; during fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating questions/answers). Pre-training applies the masked LM (Mask LM) and next sentence prediction (NSP) objectives to unlabeled sentence pairs; fine-tuning adapts the model to downstream tasks such as SQuAD question answering (start/end span prediction), NER and MNLI.
     https://arxiv.org/abs/1810.04805
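To make the pre-training inputs in the figure concrete, here is an illustrative sketch of assembling a [CLS] ... [SEP] ... [SEP] sentence pair and corrupting it for the masked-LM objective. The 15% / 80-10-10 masking recipe is the one described in the BERT paper; the toy vocabulary and helper name are assumptions, and the NSP label (whether sentence B really follows A) would be produced alongside.

```python
# Illustrative BERT-style input construction: [CLS] sentence A [SEP] sentence B [SEP],
# with masked-LM corruption (15% of tokens: 80% -> [MASK], 10% -> random, 10% -> unchanged).
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "dogs", "bark", "loudly", "a", "quiet"]

def make_mlm_example(tokens_a, tokens_b, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            continue
        labels[i] = tok                       # the model must predict the original token here
        r = rng.random()
        if r < 0.8:
            inputs[i] = "[MASK]"
        elif r < 0.9:
            inputs[i] = rng.choice(VOCAB)     # replace with a random token
        # else: keep the original token unchanged
    return inputs, segment_ids, labels

inputs, segments, labels = make_mlm_example(["the", "cat", "sat"], ["dogs", "bark", "loudly"])
print(inputs, labels)
```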
  10. Classification by use of context (Qiu et al., Pre-trained Models for Natural Language Processing: A Survey, March 2020)

     • Non-contextual: CBOW, Skip-Gram [129], GloVe [133]
     • Contextual: ELMo [135], GPT [142], BERT [36]
     (The architecture and pre-training-task branches of the same taxonomy continue on the next two slides.)
  11. Classification by architecture

     • LSTM: LM-LSTM [30], Shared LSTM [109], ELMo [135], CoVe [126]
     • Transformer encoder: BERT [36], SpanBERT [117], XLNet [209], RoBERTa [117]
     • Transformer decoder: GPT [142], GPT-2 [143]
     • Full Transformer (encoder-decoder): MASS [160], BART [100], XNLG [19], mBART [118]
  12. Classification by pre-training task

     • LM: language modeling (ELMo [135], GPT [142], GPT-2 [143], UniLM [39])
     • MLM: masked language modeling (BERT [36], SpanBERT [117], RoBERTa [117], XLM-R [28]); TLM, translation language modeling (XLM [27]); Seq2Seq MLM (MASS [160], T5 [144])
     • PLM: permuted language modeling (XLNet [209])
     • DAE: denoising auto-encoder (BART [100])
     • CTL: contrastive learning, including RTD, replaced token detection (CBOW-NS [129], ELECTRA [24]); NSP, next sentence prediction (BERT [36], UniLM [39]); and SOP, sentence order prediction (ALBERT [93], StructBERT [193])
     • Supervised MT: CoVe [126]
     (A toy comparison of these objectives follows below; the Extensions branch of the taxonomy is on the next slide.)
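A purely schematic comparison of what these objectives ask the model to predict, on a single token sequence. Real models work on subword ids with far more elaborate corruption; all names here are made up.

```python
# Toy illustration of the pre-training objectives listed above (schematic only).
import random

tokens = ["pre", "trained", "models", "help", "downstream", "tasks"]
rng = random.Random(0)

# LM: predict each next token from its left context.
lm_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# MLM: mask some tokens and predict the originals using both left and right context.
masked = [t if rng.random() > 0.3 else "[MASK]" for t in tokens]

# PLM: predict tokens in a random factorization order (XLNet-style).
perm_order = rng.sample(range(len(tokens)), len(tokens))

# DAE: corrupt the sequence (here: drop a token and rotate the rest) and reconstruct the original.
corrupted = tokens[3:] + tokens[:2]              # drops "models", rotates the remainder

print(lm_pairs[2])      # (['pre', 'trained', 'models'], 'help')
print(masked)
print(perm_order)
print(corrupted)
```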
  13. Taxonomy of PTMs (continued): extensions (Figure 3 of the survey, "Taxonomy of PTMs with Representative Examples")

     • Knowledge-enriched: ERNIE(THU) [214], KnowBERT [136], K-BERT [111], SentiLR [83], KEPLER [195], WKLM [202]
     • Multilingual: XLU (mBERT [36], Unicoder [68], XLM [27], XLM-R [28], MultiFit [42]); XLG (MASS [160], mBART [118], XNLG [19])
     • Language-specific: ERNIE(Baidu) [170], BERT-wwm-Chinese [29], NEZHA [198], ZEN [37], BERTje [33], CamemBERT [125], FlauBERT [95], RobBERT [35]
     • Multi-modal: image (ViLBERT [120], LXMERT [175], VisualBERT [103], B2T2 [2], VL-BERT [163]); video (VideoBERT [165], CBT [164]); speech (SpeechBERT [22])
     • Domain-specific: SentiLR [83], BioBERT [98], SciBERT [11], PatentBERT [97]
     • Model compression: pruning (CompressingBERT [51]); quantization (Q-BERT [156], Q8BERT [211]); parameter sharing (ALBERT [93]); distillation (DistilBERT [152], TinyBERT [75], MiniLM [194]); module replacing (BERT-of-Theseus [203])
  14. Transfer learning and fine-tuning

     Transfer learning
     • A way of applying the "knowledge" of a model trained on one task to a different task
     • How well a model suits a downstream task depends on its pre-training task, its architecture, and the corpus used for training
     Fine-tuning (a minimal sketch follows below)
     • Up to which layer of the pre-trained model should be reused?
     • Plain fine-tuning
     • Two-stage fine-tuning
     • Multi-task fine-tuning
     • Fine-tuning with additional adaptation modules
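Below is a minimal sketch of plain fine-tuning with a recent version of the Hugging Face Transformers library (listed among the resources on the next slide). The checkpoint name, label count and mini-batch are placeholders, and freezing the encoder is optional, shown only to illustrate the question of how much of the PTM to reuse.

```python
# Minimal fine-tuning sketch with Hugging Face Transformers (model name and data are placeholders).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"                       # any pre-trained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Optionally freeze the pre-trained encoder and train only the task head
# (one answer to "up to which layer of the PTM should be reused?").
for p in model.base_model.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=2e-5)

batch = tokenizer(["a great movie", "a boring movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

model.train()
outputs = model(**batch, labels=labels)                # the head returns the classification loss
outputs.loss.backward()
optimizer.step()
```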
  15. Resources of PTMs (Qiu et al., Pre-trained Models for Natural Language Processing: A Survey, March 2020, Table 5)

     Open-source implementations (popular third-party and official implementations; most PTM papers also link an official release):
     • word2vec: CBOW, Skip-Gram. https://github.com/tmikolov/word2vec
     • GloVe: pre-trained word vectors. https://nlp.stanford.edu/projects/glove
     • FastText: pre-trained word vectors. https://github.com/facebookresearch/fastText
     • Transformers: framework PyTorch & TF; PTMs: BERT, GPT-2, RoBERTa, XLNet, etc. https://github.com/huggingface/transformers
     • Fairseq: framework PyTorch; PTMs: English LM, German LM, RoBERTa, etc. https://github.com/pytorch/fairseq
     • Flair: framework PyTorch; PTMs: BERT, ELMo, GPT, RoBERTa, XLNet, etc. https://github.com/flairNLP/flair
     • AllenNLP [47]: framework PyTorch; PTMs: ELMo, BERT, GPT-2, etc. https://github.com/allenai/allennlp
     • fastNLP: framework PyTorch; PTMs: RoBERTa, GPT, etc. https://github.com/fastnlp/fastNLP
     • UniLMs: framework PyTorch; PTMs: UniLM v1 & v2, MiniLM, LayoutLM, etc. https://github.com/microsoft/unilm
     • Chinese-BERT [29]: framework PyTorch & TF; PTMs: BERT, RoBERTa, etc. (for Chinese). https://github.com/ymcui/Chinese-BERT-wwm
     • BERT [36]: framework TF; PTMs: BERT, BERT-wwm. https://github.com/google-research/bert
     • RoBERTa [117]: framework PyTorch. https://github.com/pytorch/fairseq/tree/master/examples/roberta
     • XLNet [209]: framework TF. https://github.com/zihangdai/xlnet/
     • ALBERT [93]: framework TF. https://github.com/google-research/ALBERT
     • T5 [144]: framework TF. https://github.com/google-research/text-to-text-transfer-transformer
     • ERNIE(Baidu) [170, 171]: framework PaddlePaddle. https://github.com/PaddlePaddle/ERNIE
     • CTRL [84]: conditional Transformer language model for controllable generation. https://github.com/salesforce/ctrl
     • BertViz [185]: visualization tool. https://github.com/jessevig/bertviz
     • exBERT [65]: visualization tool. https://github.com/bhoov/exbert
     • TextBrewer [210]: PyTorch-based toolkit for distillation of NLP models. https://github.com/airaria/TextBrewer
     • DeepPavlov: conversational AI library; PTMs for Russian, Polish, Bulgarian, Czech, and informal English. https://github.com/deepmipt/DeepPavlov
     Corpora:
     • OpenWebText: open clone of OpenAI's unreleased WebText dataset. https://github.com/jcpeterson/openwebtext
     • Common Crawl: a very large collection of text. http://commoncrawl.org/
     • WikiEn: English Wikipedia dumps. https://dumps.wikimedia.org/enwiki/
     Other resources:
     • Paper list: https://github.com/thunlp/PLMpapers
     • Paper list: https://github.com/tomohideshibata/BERT-related-papers
     • Paper list: https://github.com/cedrickchee/awesome-bert-nlp
     • Bert Lang Street: a collection of BERT models with reported performance on different datasets, tasks and languages. https://bertlang.unibocconi.it/
     (Body text from the survey visible on the slide: because recent progress has dramatically eroded the headroom on the GLUE benchmark, a new benchmark called SuperGLUE [189] was introduced, with more challenging tasks and more diverse task formats such as coreference resolution and question answering; state-of-the-art PTMs are listed on the corresponding leaderboards. For extractive QA, e.g. HotpotQA [208], BERT recast the task as predicting the start and end spans of the answer [36], and PTMs used as span-prediction encoders have since become a competitive baseline.)
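As a quick taste of the huggingface/transformers entry above, a two-line fill-mask example; the checkpoint name is an assumption and the exact output fields may vary between library versions.

```python
# Quick taste of the huggingface/transformers library listed above.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("Pre-trained models are [MASK] for many NLP tasks."):
    print(candidate["token_str"], candidate["score"])
```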
  16. Main application areas

     • General evaluation benchmarks (GLUE and its tasks: CoLA, SST-2, MNLI, RTE, WNLI, QQP, MRPC, STS-B, QNLI, etc.)
     • Question answering (QA)
     • Sentiment analysis
     • Entity recognition (named entity extraction)
     • Machine translation
     • Summarization
     • Adversarial attacks and their defenses
  17. GPT-3 / OpenAI API

     • GPT-2, the natural language generation model that was briefly treated as dangerous because it could produce text "as if written by a human", has been scaled up even further
     • The parameter count grew more than 100-fold, from GPT-2's 1.5 billion to 175 billion
     • An API exposing the "knowledge" of the trained GPT-3 model is also being made available
     https://openai.com/blog/openai-api/
     https://github.com/openai/gpt-3
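For orientation only, a hypothetical sketch of what a completion request through the OpenAI API's Python client looks like; the engine name, parameters and availability are assumptions (the API was invite-only when announced), not specifics documented in this deck.

```python
# Hypothetical sketch of a text-completion request via the OpenAI API Python client.
# The engine name and availability are assumptions; the API was invite-only when announced.
import openai

openai.api_key = "YOUR_API_KEY"

response = openai.Completion.create(
    engine="davinci",                      # assumed engine name
    prompt="Summarize recent trends in pre-trained NLP models:",
    max_tokens=64,
    temperature=0.7,
)
print(response.choices[0].text)
```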
  18. Unsupervised Translation of Programming Languages

     • "Unsupervised learning" makes it possible to "translate" between programming languages
     • This paper from Facebook Research demonstrates mutual conversion among C++, Java and Python
     Figure 1 of the paper illustrates the three principles of unsupervised machine translation used by the approach: (1) cross-lingual masked language model pretraining initializes the model so that pieces of code expressing the same instructions map to the same representation regardless of programming language; (2) denoising auto-encoding trains the decoder to always generate valid sequences, even from noisy input, and makes the encoder robust to input noise; (3) back-translation lets the model generate parallel data for its own training: whenever the Python-to-C++ model becomes better, it generates more accurate data for the C++-to-Python model, and vice versa (schematic sketch below).
     https://arxiv.org/abs/2006.03511
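The back-translation principle from the caption can be summarized as a schematic loop; the model class below is a trivial stub standing in for the real sequence-to-sequence models, so this is an outline of the idea rather than actual training code.

```python
# Schematic back-translation loop between the two translation directions (stub models).
class StubTranslator:
    def __init__(self, name):
        self.name = name
    def translate(self, source_code):
        return f"<{self.name} translation of: {source_code}>"
    def train_step(self, pseudo_parallel_pairs):
        print(f"{self.name}: trained on {len(pseudo_parallel_pairs)} pseudo-parallel pairs")

def back_translation_round(py_to_cpp, cpp_to_py, python_corpus, cpp_corpus):
    # Use the current Python -> C++ model to create (C++, Python) pseudo-parallel data,
    # then train the reverse direction on it.
    cpp_to_py.train_step([(py_to_cpp.translate(src), src) for src in python_corpus])
    # Symmetrically, improve Python -> C++ with data generated by the C++ -> Python model.
    py_to_cpp.train_step([(cpp_to_py.translate(src), src) for src in cpp_corpus])
    # As one direction improves, it produces more accurate training data for the other.

back_translation_round(StubTranslator("py->cpp"), StubTranslator("cpp->py"),
                       python_corpus=["def f(x): return x + 1"],
                       cpp_corpus=["int f(int x) { return x + 1; }"])
```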
  19. Wiki-40B: Multilingual Language Model Dataset

     • Per-language Wikipedia datasets published through TensorFlow Datasets (loading sketch below)
     • Pre-trained models trained on this dataset are also released:
     • 12-layer TransformerXL
     • 768-dimensional word embedding vectors
     • 12 attention heads of 64 dimensions each
     • 32,000-token vocabulary learned with SentencePiece subword sampling
     https://research.google/pubs/pub49029/
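A hedged sketch of loading the Japanese split with TensorFlow Datasets; the config name "wiki40b/ja" and the "text" feature follow the TFDS catalog as I recall it, so treat them as assumptions.

```python
# Sketch: loading the Japanese portion of Wiki-40B with TensorFlow Datasets.
# The config name "wiki40b/ja" and the "text" feature are assumptions based on the TFDS catalog.
import tensorflow_datasets as tfds

ds = tfds.load("wiki40b/ja", split="train")
for example in ds.take(1):
    print(example["text"].numpy()[:200])   # cleaned Wikipedia text with structure markers
```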
  20. spaCy / UD_Japanese-GSD

     • spaCy: a natural language processing library written in Python/Cython
     • Supports many languages
     • Includes morphological analysis, dependency parsing, named entity recognition and other language processing, plus a visualizer, text classification models and deep learning models
     • MIT-licensed and aimed at industrial use
     • UD_Japanese-GSD is a tagged training dataset for Japanese named entity recognition; thanks to it, a Japanese model has been added to spaCy (usage sketch below)
     https://github.com/megagonlabs/UD_Japanese-GSD
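A hedged sketch of Japanese NER with spaCy; the package name ja_core_news_sm is my assumption about the released Japanese model, and the sample sentence is made up.

```python
# Sketch: Japanese named entity recognition with spaCy.
# Assumes the Japanese model has been installed, e.g.: python -m spacy download ja_core_news_sm
import spacy

nlp = spacy.load("ja_core_news_sm")
doc = nlp("板垣正敏は2020年6月20日に新潟で自然言語処理について発表した。")
for ent in doc.ents:
    print(ent.text, ent.label_)
```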
  21. PEGASUS

     • A model specialized for the summarization task
     • As its pre-training task it uses sentence-level GSG (Gap Sentence Generation) in place of token-level masking (MLM)
     PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. Jingqing Zhang, Yao Zhao, Mohammad Saleh, Peter J. Liu.
     From the paper's Figure 1: the base architecture of PEGASUS is a standard Transformer encoder-decoder, and both GSG and MLM are applied simultaneously as pre-training objectives. In the illustrated example there are three sentences: one is masked with [MASK1] and used as the target generation text (GSG), while the other two remain in the input with some tokens randomly masked by [MASK2] (MLM). A toy sketch of this corruption scheme follows below.
     https://arxiv.org/abs/1912.08777
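A toy sketch of the GSG corruption described in the caption: pick one "important" sentence as the generation target, replace it with [MASK1] in the input, and mask some tokens of the remaining sentences with [MASK2]. Sentence selection here is a naive longest-sentence heuristic; PEGASUS itself scores sentences with ROUGE against the rest of the document.

```python
# Toy Gap Sentence Generation (GSG) + MLM corruption, mirroring the PEGASUS figure description.
import random

def gsg_example(sentences, token_mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    # Naive importance heuristic: take the longest sentence as the gap sentence
    # (PEGASUS actually scores sentences by ROUGE against the rest of the document).
    target_idx = max(range(len(sentences)), key=lambda i: len(sentences[i].split()))
    target = sentences[target_idx]
    inputs = []
    for i, sent in enumerate(sentences):
        if i == target_idx:
            inputs.append("[MASK1]")                         # GSG: whole sentence becomes the target
        else:
            words = [w if rng.random() >= token_mask_prob else "[MASK2]" for w in sent.split()]
            inputs.append(" ".join(words))                   # MLM: mask some remaining tokens
    return " ".join(inputs), target

doc = ["PEGASUS targets abstractive summarization.",
       "Important sentences are removed from the document and generated as the output.",
       "The rest of the document is the input."]
print(gsg_example(doc))
```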
  22. Denshōsen (電笑戦)

     • A contest held as part of AWS Summit Tokyo 2020, which took place as an online event
     • Based on post data from the comedy-sharing service "Bokete" (https://bokete.jp), it asks whether AI can create "humor that surpasses humans"
     https://aws.amazon.com/jp/builders-flash/202006/bokete/