June 20, 2020

  1. Recent Trends in Natural Language Processing Models: From "Pre-trained Models for Natural Language Processing: A Survey"

    板垣正敏, 2020/6/20 @ Python機械学習勉強会in新潟 (Python Machine Learning Study Group in Niigata) Restart#11 Online
  2. Reference paper • Pre-trained Models for Natural Language Processing: A Survey

    Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai & Xuanjing Huang • https://arxiv.org/abs/2003.08271 • A comprehensive review of pre-trained models (PTMs) in natural language processing (NLP)
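  3. Structure of the paper

    1. Introduction
    2. Background (background knowledge on language representation learning)
    3. Overview of PTMs (overview of pre-trained models)
    4. Extensions of PTMs (extensions of pre-trained models)
    5. Adapting PTMs to Downstream Tasks (applying pre-trained models to various tasks)
    6. Resources of PTMs (sources of information on pre-trained models)
    7. Applications (applications of pre-trained models)
    8. Future Direction (future directions)
    9. Conclusion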
  4. Language representation learning • Broadly speaking, language representations are learned in two stages. • Context-independent vectorization (embedding)

    Excerpts from the survey: "The second-generation PTMs focus [...]textual word embeddings, such as CoVe [126], OpenAI GPT [142] and BERT [36]." "[...] meaning of a piece of text by low-dimensional real-valued vectors. And each dimension of the vector has no corresponding sense, while the whole represents a concrete concept. Figure 1 illustrates the generic neural architecture for NLP. There are two kinds of word embeddings: non-contextual and contextual embeddings. The difference between them is whether the embedding for a word dynamically changes according to the context it appears in."

    [Figure 1: Generic Neural Architecture for NLP. Non-contextual embeddings e_x1 ... e_x7 feed a contextual encoder, which produces contextual embeddings h1 ... h7 for a task-specific model.]
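    The distinction between non-contextual and contextual embeddings can be sketched with a toy example. The vocabulary, the 2-dimensional vectors, and the window-averaging `contextual_encode` function are all made up for illustration; real contextual encoders (LSTMs, Transformers) are learned and far richer. The only point shown is that a lookup table returns the same vector for "bank" everywhere, while a context-aware encoder does not.

```python
# Toy illustration: static lookup vs. context-dependent vectors.
# The vocabulary and vectors below are invented for this example.
STATIC = {
    "river": [1.0, 0.0],
    "bank": [0.5, 0.5],
    "money": [0.0, 1.0],
}


def non_contextual(tokens):
    """Look each token up in a fixed table; context is ignored."""
    return [STATIC[t] for t in tokens]


def contextual_encode(tokens, window=1):
    """Hypothetical encoder: average the static vectors in a local window."""
    static = non_contextual(tokens)
    dim = len(static[0])
    out = []
    for i in range(len(tokens)):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        neighbours = static[lo:hi]
        out.append([sum(v[d] for v in neighbours) / len(neighbours) for d in range(dim)])
    return out


s1 = ["river", "bank"]
s2 = ["money", "bank"]
# The static embedding of "bank" is identical in both sentences.
same_static = non_contextual(s1)[1] == non_contextual(s2)[1]
# The contextual vector of "bank" differs because its neighbours differ.
h1 = contextual_encode(s1)[1]
h2 = contextual_encode(s2)[1]
```

    Here `same_static` is true, while `h1` and `h2` differ because each one mixes in a different neighbouring word.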
  5. History of pre-trained NLP models • First generation: pre-trained word embeddings • Word2Vec • GloVe • LSTM • Seq2Seq •

    BiLM (bidirectional LSTM) • ELMo • ULMFiT • GPT • BERT • Second generation: pre-trained contextual encoders
  6. Skip-Gram as used in Word2Vec • Skip-gram with Negative Sampling (SGNS) (Mikolov+ 2013b)

    Non-Contextual Embedding: Word2Vec/GloVe • Source: https://www.slideshare.net/naoakiokazaki/20150530-jsai2015
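    As a rough sketch of what Skip-gram with Negative Sampling optimizes, the loss for one (center word, context word) pair with a few sampled negative words can be written as -log σ(c·w) - Σ_k log σ(-n_k·w). The 2-dimensional vectors below are invented for illustration, and the function names are not taken from any library.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sgns_loss(center_vec, context_vec, negative_vecs):
    """Negative-sampling loss for one (center, context) pair.

    loss = -log sigma(c . w) - sum_k log sigma(-n_k . w)
    """
    loss = -math.log(sigmoid(dot(context_vec, center_vec)))
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(neg, center_vec)))
    return loss


# Made-up vectors: in `good` the true context is aligned with the center
# word and the sampled negative points the other way; `bad` swaps them.
w = [1.0, 0.0]
good = sgns_loss(w, [2.0, 0.0], [[-2.0, 0.0]])
bad = sgns_loss(w, [-2.0, 0.0], [[2.0, 0.0]])
```

    The loss is small when the true context scores high and the negatives score low, so `good` is lower than `bad`.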
  7. Neural Contextual Encoders • QIU XP, et al. Pre-trained Models for Natural Language Processing: A Survey March (2020)

    [Figure 2: Neural Contextual Encoders. (a) Convolutional Model, (b) Recurrent Model, (c) Fully-Connected Self-Attention Model; each maps inputs x1 ... x5 to hidden states h1 ... h5.]

    Slide labels: convolutional model • recurrent model • fully-connected self-attention model

    Excerpts from the survey:
    2.2 Neural Contextual Encoders. Most of the neural contextual encoders can be classified into two categories: sequence models and graph-based models. Figure 2 illustrates the architecture of these models.
    2.2.1 Sequence Models. Sequence models usually capture local context of a word in sequential order. Convolutional Models: Convolutional models take the embeddings of words in the input sentence and capture the mean[...]
    [...] successful instance of fully-connected self-attention model is the Transformer [184], which also needs other supplement modules, such as positional embeddings, layer normalization, residual connections and position-wise feed-forward network (FFN) layers.
    2.2.3 Analysis. Sequence models learn the contextual representation of the word with locality bias and are hard to capture the long-range interactions between words. Nevertheless, sequence models are usually easy to train and get good results for various NLP tasks. In contrast, as an instantiated fully-connected self-attention model, the Transformer can directly model the dependency [...]
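    To make the "fully-connected self-attention" idea concrete, here is a minimal single-head sketch in plain Python. It assumes Q = K = V = x, that is, it leaves out the learned projection matrices, multiple heads, and positional embeddings of a real Transformer. It only shows that every position computes weights over every other position in one step, with no locality bias.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product self-attention with Q = K = V = x.

    Real models apply learned projections first; they are omitted here.
    """
    d = len(x[0])
    out, weights = [], []
    for q in x:
        # Score this position against every position in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        weights.append(w)
        out.append([sum(w[j] * x[j][i] for j in range(len(x))) for i in range(d)])
    return out, weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h, attn = self_attention(x)
```

    Each row of `attn` sums to 1, and position 0 puts weight on position 2 directly, however far apart the two are in the sequence.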
  8. Transformer • Figure 1: The Transformer - model architecture.

    3.1 Encoder and Decoder Stacks. Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512. https://arxiv.org/abs/1706.03762
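    The expression LayerNorm(x + Sublayer(x)) can be sketched for a single position as follows. This is a simplified illustration: the layer normalization has no learned gain or bias, and `toy_ffn` is a hypothetical stand-in for a real sub-layer. The sub-layer must return a vector of the same dimension as its input, which is why all sub-layers share the same output dimension.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_block(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for one position."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


def toy_ffn(x):
    """Stand-in sub-layer (hypothetical): element-wise ReLU, then scaling."""
    return [0.5 * max(0.0, v) for v in x]


out = residual_block([1.0, -2.0, 3.0, 0.0], toy_ffn)
```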
  9. BERT: Bidirectional Encoder Representations from Transformers

    [Figure 1 of the BERT paper. Pre-training panel: unlabeled sentence A and B pair, masked sentence A and masked sentence B as input, NSP and Mask LM as outputs. Fine-tuning panel: MNLI, NER, SQuAD; question answer pair (question, paragraph) as input, start/end span as output.]

    Figure 1: Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating questions/answers).

    [...]ing and auto-encoder objectives have been used for pre-training such models (Howard and Ruder, 2018; Radford et al., 2018; Dai and Le, 2015). [...]mal difference between the pre-trained architecture and the final downstream architecture. https://arxiv.org/abs/1810.04805
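    The role of [CLS] and [SEP] in the caption can be illustrated with a small input-building sketch. The function name is hypothetical, tokenization into word pieces is skipped, and the trailing [SEP] and the 0/1 segment ids follow common BERT implementations rather than anything stated on the slide.

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Return (tokens, segment_ids) in the [CLS] A [SEP] B [SEP] layout."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        # Everything belonging to the second segment is marked with 1.
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


# A question/answer style pair, as in the fine-tuning panel of the figure.
tokens, segments = build_bert_input(["who", "wrote", "it"], ["she", "did"])
```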
  10. Classification by contextuality

    Contextual?
    • Non-Contextual: CBOW, Skip-Gram [129], GloVe [133]
    • Contextual: ELMo [135], GPT [142], BERT [36]
  11. Classification by architecture

    Excerpt from the PTM taxonomy:
    • Contextual?
      • Non-Contextual: CBOW, Skip-Gram [129], GloVe [133]
      • Contextual: ELMo [135], GPT [142], BERT [36]
    • Architectures
      • LSTM: LM-LSTM [30], Shared LSTM [109], ELMo [135], CoVe [126]
      • Transformer Enc.: BERT [36], SpanBERT [117], XLNet [209], RoBERTa [117]
      • Transformer Dec.: GPT [142], GPT-2 [143]
      • Transformer: MASS [160], BART [100], XNLG [19], mBART [118]
    • Task Types
      • Supervised: MT: CoVe [126]
      • Unsupervised/Self-Supervised:
        • LM: ELMo [135], GPT [142], GPT-2 [143], UniLM [39]
        • MLM: BERT [36], SpanBERT [117], RoBERTa [117], XLM-R [28]
        • TLM: XLM [27]
        • Seq2Seq MLM: MASS [160], T5 [144]
        • PLM: XLNet [209]
        • DAE: BART [100]
  12. Classification by pre-training task • LM: language modeling • MLM: masked language modeling • PLM: permuted language modeling

    • DAE: denoising autoencoder • CTL: contrastive learning

    Task Types branch of the PTM taxonomy:
    • Supervised: MT: CoVe [126]
    • Unsupervised/Self-Supervised:
      • LM: ELMo [135], GPT [142], GPT-2 [143], UniLM [39]
      • MLM: BERT [36], SpanBERT [117], RoBERTa [117], XLM-R [28]
      • TLM: XLM [27]
      • Seq2Seq MLM: MASS [160], T5 [144]
      • PLM: XLNet [209]
      • DAE: BART [100]
      • CTL:
        • RTD: CBOW-NS [129], ELECTRA [24]
        • NSP: BERT [36], UniLM [39]
        • SOP: ALBERT [93], StructBERT [193]
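    The difference between the LM and MLM objectives can be sketched as two ways of turning a token sequence into training examples. This is a simplified illustration: `lm_examples` and `mask_tokens` are hypothetical helper names, the default `mask_prob=0.15` mirrors the rate commonly cited for BERT and is an assumption here, and refinements used by real models (such as sometimes keeping or randomly replacing a selected token) are omitted.

```python
import random


def lm_examples(tokens):
    """Left-to-right LM: predict each token from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """MLM: replace a random subset of tokens with mask_token.

    Returns (masked_tokens, labels); labels hold the original token at
    masked positions and None elsewhere, so the loss covers only those.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = lm_examples(sentence)
masked, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
```

    In the LM case the model only ever sees the left context; in the MLM case it sees both sides of each masked position.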
  13. Taxonomy of PTMs (continued)

    • Self-Supervised task types (continued)
      • PLM: XLNet [209]
      • DAE: BART [100]
      • CTL:
        • RTD: CBOW-NS [129], ELECTRA [24]
        • NSP: BERT [36], UniLM [39]
        • SOP: ALBERT [93], StructBERT [193]
    • Extensions
      • Knowledge-Enriched: ERNIE(THU) [214], KnowBERT [136], K-BERT [111], SentiLR [83], KEPLER [195], WKLM [202]
      • Multilingual
        • XLU: mBERT [36], Unicoder [68], XLM [27], XLM-R [28], MultiFit [42]
        • XLG: MASS [160], mBART [118], XNLG [19]
      • Language-Specific: ERNIE(Baidu) [170], BERT-wwm-Chinese [29], NEZHA [198], ZEN [37], BERTje [33], CamemBERT [125], FlauBERT [95], RobBERT [35]
      • Multi-Modal
        • Image: ViLBERT [120], LXMERT [175], VisualBERT [103], B2T2 [2], VL-BERT [163]
        • Video: VideoBERT [165], CBT [164]
        • Speech: SpeechBERT [22]
      • Domain-Specific: SentiLR [83], BioBERT [98], SciBERT [11], PatentBERT [97]
      • Model Compression
        • Model Pruning: CompressingBERT [51]
        • Quantization: Q-BERT [156], Q8BERT [211]
        • Parameter Sharing: ALBERT [93]
        • Distillation: DistilBERT [152], TinyBERT [75], MiniLM [194]
        • Module Replacing: BERT-of-Theseus [203]

    Figure 3: Taxonomy of PTMs with Representative Examples
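  14. Transfer learning and fine-tuning • Transfer learning • A method of applying the "knowledge" of a model trained on one task to a different task • Suitability for the downstream task varies with the pre-training task, the model architecture, and the corpus used for training • Up to which layer of the pre-trained model should be used?

    Fine-tuning • Simple fine-tuning • Two-stage fine-tuning • Multi-task fine-tuning • Fine-tuning with additional adaptation modules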
  15. QIU XP, et al. Pre-trained Models for Natural Language Processing: A Survey March (2020), p. 17
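  16. Main application areas • General evaluation benchmarks (GLUE, CoLA, SST-2, MNLI, RTE, WNLI, QQP, MRPC,

    STS-B, QNLI, etc.) • QA • Sentiment analysis • Entity recognition (named entity recognition) • Machine translation • Summarization • Adversarial attacks and defenses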
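  17. GPT-3 / OpenAI API • GPT-2, the natural language generation model that was for a time regarded as dangerous because it can generate text that reads "as if written by a human", has been scaled up further

    • 175 billion parameters versus 1.5 billion for GPT-2, more than 100 times as many (roughly 117x) • An API that gives access to the "knowledge" of the trained GPT-3 is also being released https://openai.com/blog/openai-api/ https://github.com/openai/gpt-3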
  18. Unsupervised Translation of Programming Languages • "Unsupervised learning" makes "translation" between multiple programming languages possible • This paper from Facebook Research demonstrates conversion among C++, Java, and Python

    Figure 1: Illustration of the three principles of unsupervised machine translation used by our approach. The first principle initializes the model with cross-lingual masked language model pretraining. As a result, pieces of code that express the same instructions are mapped to the same representation, regardless of the programming language. Denoising auto-encoding, the second principle, trains the decoder to always generate valid sequences, even when fed with noisy data, and increases the encoder robustness to input noise. Back-translation, the last principle, allows the model to generate parallel data which can be used for training. Whenever the Python → C++ model becomes better, it generates more accurate data for the C++ → Python model, and vice versa.

    Figure 5 in the appendix provides a representation of the cross-lingual embeddings we obtain after training. The cross-lingual nature of the resulting model comes from the significant number of common tokens (anchor points) that exist across languages. In the context of English-French translation, the [...] https://arxiv.org/abs/2006.03511
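    The data flow of back-translation can be sketched as below. This is only an illustration of how the synthetic pairs are formed: `toy_python_to_cpp` is a hypothetical string-replacement stand-in for a learned Python → C++ model, and no training is performed. The key point is that the target side of every pair is real, human-written code, while the source side is generated by the current model.

```python
def back_translation_pairs(monolingual_target, target_to_source):
    """Create synthetic parallel data from monolingual target-language code.

    Each real snippet is translated by the current (possibly poor) model;
    the generated text becomes the input and the real snippet becomes the
    reference for training the model in the opposite direction.
    """
    return [(target_to_source(snippet), snippet) for snippet in monolingual_target]


def toy_python_to_cpp(snippet):
    """Hypothetical stand-in for a learned Python -> C++ model."""
    return snippet.replace("max(a, b)", "a > b ? a : b")


python_corpus = ["max(a, b)", "max(a, b) + 1"]
# Pairs of (generated C++, real Python) for training a C++ -> Python model.
pairs = back_translation_pairs(python_corpus, toy_python_to_cpp)
```

    As the Python → C++ direction improves, the generated inputs get cleaner, which in turn gives better training data for the C++ → Python direction.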
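  19. Wiki-40B: Multilingual Language Model Dataset • A Wikipedia dataset for many languages, published through TensorFlow Datasets • PTMs trained on this dataset have also been released

    • 12-layer TransformerXL • 768-dimensional word embedding vectors • 12 attention heads of 64 dimensions each • Vocabulary of 32,000 entries learned with SentencePiece on sampled text https://research.google/pubs/pub49029/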
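    The hyperparameters listed above fit together arithmetically: 12 heads of 64 dimensions each give 12 × 64 = 768, the same as the embedding size. The sketch below records the slide's numbers in a small config object and checks that relation. The class and field names are illustrative and not taken from the released model, and the parameter count is a back-of-the-envelope figure for a single token embedding table of that shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Wiki40BModelConfig:
    """Hyperparameters listed on the slide; field names are illustrative."""
    num_layers: int = 12
    embedding_dim: int = 768
    num_heads: int = 12
    head_dim: int = 64
    vocab_size: int = 32_000

    def attention_width(self) -> int:
        """Total width of all attention heads concatenated."""
        return self.num_heads * self.head_dim


config = Wiki40BModelConfig()
# Size of one vocab_size x embedding_dim embedding table.
embedding_params = config.vocab_size * config.embedding_dim
```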
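  20. spaCy/UD_Japanese-GSD • spaCy • A natural language processing library written in Python/Cython • Supports many languages • Includes language processing such as morphological analysis, dependency parsing, and named entity recognition, as well as visualizers, text classification models, and deep learning models

    • MIT-licensed and aimed at industrial use • UD_Japanese-GSD is an annotated training dataset for Japanese named entity recognition; with it, a Japanese model was added to spaCy https://github.com/megagonlabs/UD_Japanese-GSD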
  21. PEGASUS • A model specialized for the summarization task • As its pre-training task it uses sentence-level GSG (Gap Sentence Generation) in place of MLM, which masks individual tokens

    PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. Jingqing Zhang, Yao Zhao, Mohammad Saleh, Peter J. Liu.

    Figure 1: The base architecture of PEGASUS is a standard Transformer encoder-decoder. Both GSG and MLM are applied simultaneously to this example as pre-training objectives. Originally there are three sentences. One sentence is masked with [MASK1] and used as target generation text (GSG). The other two sentences remain in the input, but some tokens are randomly masked by [MASK2] (MLM). https://arxiv.org/abs/1912.08777
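    The example described in the figure caption can be sketched as a function that builds one (input, target) pair. This is a minimal illustration under assumptions: the caller supplies `gap_indices`, whereas the paper is about choosing important sentences and that selection step is not implemented here. The `mlm_prob` value and the function name are hypothetical, and the sample document is made up.

```python
import random


def gsg_mlm_example(sentences, gap_indices, mlm_prob=0.15, seed=0):
    """Build one (input_text, target_text) pair in the GSG + MLM style.

    Sentences at gap_indices are replaced by [MASK1] and become the target;
    tokens of the remaining sentences are randomly replaced by [MASK2].
    """
    rng = random.Random(seed)
    inputs, targets = [], []
    for i, sentence in enumerate(sentences):
        if i in gap_indices:
            inputs.append("[MASK1]")
            targets.append(sentence)
        else:
            tokens = [
                "[MASK2]" if rng.random() < mlm_prob else tok
                for tok in sentence.split()
            ]
            inputs.append(" ".join(tokens))
    return " ".join(inputs), " ".join(targets)


doc = [
    "Pegasus is mythical .",
    "It is pure white .",
    "It names the model .",
]
# mlm_prob=0.0 turns token masking off so only the GSG part is visible.
source, target = gsg_mlm_example(doc, gap_indices={1}, mlm_prob=0.0)
```

    With token masking switched off, `source` contains the first and third sentences around a single [MASK1], and `target` is the removed middle sentence.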
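  22. 電笑戦 • A contest for which entries were solicited at AWS Summit Tokyo 2020, which was held as an online event • Using posts from the humor-sharing service

    "ボケて" (bokete, https://bokete.jp) as data, participants compete over whether AI can produce "laughs that surpass humans" https://aws.amazon.com/jp/builders-flash/202006/bokete/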