
Recent Trends in Natural Language Processing Models


An introduction to recent trends in pre-trained NLP models, based on a survey paper, together with recent NLP-related news.

masa-ita

June 20, 2020

Transcript

  1. Recent Trends in Natural Language Processing Models
     Based on "Pre-trained Models for Natural Language Processing: A Survey"
     板垣正敏 (masa-ita)
     2020/6/20
     @Python Machine Learning Study Group in Niigata, Restart #11 Online


  2. Reference Paper
     • Pre-trained Models for Natural Language Processing: A Survey
     Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai & Xuanjing Huang
     • https://arxiv.org/abs/2003.08271
     • A comprehensive review of pre-trained models (PTMs) in natural language processing (NLP)


  3. Structure of the Paper
     1. Introduction
     2. Background — background knowledge on language representation learning
     3. Overview of PTMs — an overview of pre-trained models
     4. Extensions of PTMs — extensions of pre-trained models
     5. Adapting PTMs to Downstream Tasks — applying pre-trained models to downstream tasks
     6. Resources of PTMs — sources of information on pre-trained models
     7. Applications — applications of pre-trained models
     8. Future Directions
     9. Conclusion


  4. 2. Background
     Background knowledge behind pre-trained models


  5. Language Representation Learning
     • Roughly speaking, language representations are learned in two stages:
     • Context-independent vectorization (embedding)
       • word-level vectorization
     • Context-dependent vectorization (embedding)
       • sentence- or document-level vectorization
     [Excerpt from the survey (Qiu et al., 2020):] First-generation embeddings fail to capture higher-level concepts in context, such as polysemous disambiguation, syntactic structures, and anaphora. The second-generation PTMs focus on contextual word embeddings, such as CoVe [126], ELMo [135], OpenAI GPT [142] and BERT [36]. Various pre-training tasks are proposed to learn PTMs for different purposes. The survey's contributions: a comprehensive review; a taxonomy of PTMs by representation type, model architecture, pre-training task, and extensions for specific scenarios; abundant resources; and future directions.
     Word embeddings represent the meaning of a piece of text by low-dimensional real-valued vectors; each dimension of the vector has no corresponding sense, while the whole represents a concrete concept. There are two kinds of word embeddings: non-contextual and contextual embeddings. The difference between them is whether the embedding for a word dynamically changes according to the context it appears in.
     [Figure 1: Generic Neural Architecture for NLP — non-contextual embeddings e_x1 … e_x7 feed a contextual encoder that produces contextual embeddings h1 … h7, followed by a task-specific model.]
     Non-contextual Embeddings: The first step of representing language is to map discrete language symbols into a distributed embedding space. Formally, for each word (or sub-word) x in a vocabulary V, we map it to a vector e_x ∈ R^(D_e) with a lookup table E ∈ R^(D_e × |V|), where D_e is a hyper-parameter indicating the dimension of token embeddings. These embeddings are trained on task data along with other model parameters.
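     In code, this lookup is just an embedding matrix indexed by token ids. A minimal PyTorch sketch (the vocabulary size, dimension, and token ids below are illustrative, not taken from the survey):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 30000, 300               # |V| and D_e (illustrative values)
embedding = nn.Embedding(vocab_size, embed_dim)  # lookup table E, shape (|V|, D_e)

token_ids = torch.tensor([[12, 845, 3, 7]])      # a batch with one sequence of token ids
vectors = embedding(token_ids)                   # non-contextual embeddings e_x
print(vectors.shape)                             # torch.Size([1, 4, 300])
```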


  6. A Brief History of Pre-trained NLP Models
     Generation 1: pre-trained word embeddings
     • Word2Vec
     • GloVe
     Generation 2: pre-trained contextual encoders
     • LSTM
     • Seq2Seq
     • BiLM (bidirectional LSTM)
     • ELMo
     • ULMFiT
     • GPT
     • BERT


  7. Skip-Gram as Used in Word2Vec
     Skip-gram with Negative Sampling (SGNS) (Mikolov+ 2013b)
     [Figure from the source slides (Okazaki, JSAI 2015): an SGNS update example over a corpus window — for an observed word–context pair the inner product of the word vector w (D dimensions) and the context vector w̃ (D dimensions) is pushed up (+λ), while for negative-sampled pairs it is pushed down (−λ); example with context width h = 2 and k = 1 negative sample, where the same word may be sampled more than once.]
     GloVe: learning word vectors by least squares (Pennington+ 2014)
     Objective: J = Σ_{i,j=1}^{|V|} f(M_ij) ( w_i^T w̃_j + b_i + b̃_j − log M_ij )^2
     where f(x) = (x / x_max)^α if x < x_max, and 1 otherwise;
     M_ij is the co-occurrence count of words i and j, |V| the vocabulary size, w_i the vector of word i, w̃_j the context vector of word j, and b_i, b̃_j their bias terms. Trained with AdaGrad (SGD) using x_max = 100, α = 0.75.
     As in SGNS, each word has two sets of parameters; taking (w_i + w̃_i) as the final vector for word i improves accuracy.
     Non-Contextual Embedding: Word2Vec / GloVe
     https://www.slideshare.net/naoakiokazaki/20150530-jsai2015
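     As a rough illustration of the SGNS update described above, a plain-numpy sketch of one training step (not the original word2vec implementation; the learning rate and names are illustrative). The positive pair pushes the inner product up, the sampled negatives push theirs down:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(w, w_tilde, center, context, negatives, lr=0.025):
    """One SGNS update for a (center, context) pair plus negative samples.
    w, w_tilde: (|V|, D) word and context embedding matrices, updated in place."""
    # positive pair: gradient of -log sigmoid(w_c . w~_o)
    g = sigmoid(w[center] @ w_tilde[context]) - 1.0
    grad_center = g * w_tilde[context]
    w_tilde[context] -= lr * g * w[center]

    # negative samples: gradient of -log sigmoid(-w_c . w~_k)
    for k in negatives:
        g = sigmoid(w[center] @ w_tilde[k])
        grad_center += g * w_tilde[k]
        w_tilde[k] -= lr * g * w[center]

    w[center] -= lr * grad_center

rng = np.random.default_rng(0)
w, w_tilde = rng.normal(scale=0.1, size=(2, 1000, 100))   # |V| = 1000, D = 100
sgns_step(w, w_tilde, center=3, context=17, negatives=[42])
```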


  8. Neural Contextual Encoders
     QIU XP, et al., Pre-trained Models for Natural Language Processing: A Survey (2020)
     [Figure 2: Neural Contextual Encoders — (a) Convolutional Model, (b) Recurrent Model, (c) Fully-Connected Self-Attention Model.]
     Most of the neural contextual encoders can be classified into two categories: sequence models and graph-based models. Figure 2 illustrates the architecture of these models.
     Sequence models usually capture the local context of a word in sequential order. Convolutional models take the embeddings of words in the input sentence and capture their meaning …
     A successful instance of the fully-connected self-attention model is the Transformer [184], which also needs other supplementary modules, such as positional embeddings, layer normalization, residual connections and position-wise feed-forward network (FFN) layers.
     Analysis: Sequence models learn the contextual representation of a word with a locality bias and find it hard to capture long-range interactions between words. Nevertheless, sequence models are usually easy to train and get good results on various NLP tasks. In contrast, as an instantiated fully-connected self-attention model, the Transformer can directly model the dependency …
     Slide labels: convolutional model / recurrent model / fully-connected self-attention model
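     A minimal numpy sketch of the scaled dot-product self-attention that underlies the fully-connected self-attention model in Figure 2(c) (single head, no masking; shapes are illustrative):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one sequence.
    X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # every token attends to every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # contextual representations h_1..h_n

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                         # 5 tokens, d_model = 16
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (5, 16)
```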


  9. LSTM: Long Short-Term Memory
     https://www.researchgate.net/figure/Structure-of-the-LSTM-cell-and-equations-that-describe-the-gates-of-an-LSTM-cell_fig5_329362532
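     The gate equations in the linked figure correspond to the standard LSTM cell; a small numpy sketch under that standard formulation (variable names and sizes are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the stacked parameters of the
    forget (f), input (i), output (o) and candidate (g) gates."""
    z = W @ x_t + U @ h_prev + b          # (4 * hidden,) pre-activations
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    g = np.tanh(g)
    c_t = f * c_prev + i * g              # cell state: forget old content, add new
    h_t = o * np.tanh(c_t)                # hidden state exposed to the next layer
    return h_t, c_t

rng = np.random.default_rng(0)
D_in, D_h = 8, 16
W, U, b = rng.normal(size=(4*D_h, D_in)), rng.normal(size=(4*D_h, D_h)), np.zeros(4*D_h)
h, c = np.zeros(D_h), np.zeros(D_h)
for x_t in rng.normal(size=(5, D_in)):    # run over a sequence of 5 input vectors
    h, c = lstm_cell(x_t, h, c, W, U, b)
print(h.shape)                            # (16,)
```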


  10. Transformer
    Figure 1: The Transformer - model architecture.
    3.1 Encoder and Decoder Stacks
    Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two
    sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-
    wise fully connected feed-forward network. We employ a residual connection [11] around each of
    the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is
    LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer
    itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding
    layers, produce outputs of dimension dmodel = 512.
    https://arxiv.org/abs/1706.03762
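     A minimal PyTorch sketch of one encoder layer following the LayerNorm(x + Sublayer(x)) pattern described above (post-norm; assumes a recent PyTorch where nn.MultiheadAttention accepts batch_first):

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One post-norm Transformer encoder layer: self-attention and FFN sub-layers,
    each wrapped as LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)        # multi-head self-attention sub-layer
        x = self.norm1(x + attn_out)            # residual connection + layer norm
        x = self.norm2(x + self.ffn(x))         # position-wise FFN sub-layer
        return x

x = torch.randn(2, 10, 512)                     # (batch, seq_len, d_model)
print(EncoderLayer()(x).shape)                  # torch.Size([2, 10, 512])
```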


  11. BERT:
    Bidirectional Encoder Representations from Transformers
     [Figure 1 from the BERT paper: pre-training (Masked LM + NSP on an unlabeled sentence pair A/B) and fine-tuning (e.g. SQuAD question–answer pairs, NER, MNLI) share the same architecture.]
     Figure 1: Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating questions/answers).
    https://arxiv.org/abs/1810.04805
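     A minimal sketch of obtaining contextual representations from a pre-trained BERT with the Hugging Face transformers library (not part of the slides; assumes the tokenizer call-style API of transformers 3.x or later):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Pre-trained models capture context.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs[0]: contextual embeddings for every token, shape (1, seq_len, 768);
# [CLS] and [SEP] are added automatically by the tokenizer.
print(outputs[0].shape)
```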


  12. 3. Overview of PTMs
     An overview of pre-trained models


  13. Classification by Use of Context
     QIU XP, et al., Pre-trained Models for Natural Language Processing: A Survey (2020)
     Contextual?
     Non-Contextual
     CBOW, Skip-Gram [129]
     GloVe [133]
     Contextual ELMo [135], GPT [142], BERT [36]
     Architectures
     LSTM LM-LSTM [30], Shared LSTM [109], ELMo [135], CoVe [126]
     Transformer Enc. BERT [36], SpanBERT [117], XLNet [209], RoBERTa [117]
     Transformer Dec. GPT [142], GPT-2 [143]
     Transformer
     MASS [160], BART [100]
     XNLG [19], mBART [118]
     Supervised MT CoVe [126]
     LM ELMo [135], GPT [142], GPT-2 [143], UniLM [39]

  14. Classification by Architecture
     PTMs
    Contextual?
    Non-Contextual
    CBOW, Skip-Gram [129]
    GloVe [133]
    Contextual ELMo [135], GPT [142], BERT [36]
    Architectures
    LSTM LM-LSTM [30], Shared LSTM[109], ELMo [135], CoVe [126]
    Transformer Enc. BERT [36], SpanBERT [117], XLNet [209], RoBERTa [117]
    Transformer Dec. GPT [142], GPT-2 [143]
    Transformer
    MASS [160], BART [100]
    XNLG [19], mBART [118]
    Task Types
    Supervised MT CoVe [126]
    Unsupervised/
    Self-Supervised
    LM ELMo [135], GPT [142], GPT-2 [143], UniLM [39]
    MLM
    BERT [36], SpanBERT [117], RoBERTa [117], XLM-R [28]
    TLM XLM [27]
    Seq2Seq MLM MASS [160], T5 [144]
    PLM XLNet [209]
    DAE BART [100]


  15. Classification by Pre-training Task
     • LM: language modeling
     • MLM: masked language modeling (sketch below)
     • PLM: permuted language modeling
     • DAE: denoising autoencoder
     • CTL: contrastive learning
    PTMs
    Contextual?
    Non-Contextual
    GloVe [133]
    Contextual ELMo [135], GPT [142], BERT [36]
    Architectures
    LSTM LM-LSTM [30], Shared LSTM[109], ELMo [135], CoVe [126]
    Transformer Enc. BERT [36], SpanBERT [117], XLNet [209], RoBERTa [117]
    Transformer Dec. GPT [142], GPT-2 [143]
    Transformer
    MASS [160], BART [100]
    XNLG [19], mBART [118]
    Task Types
    Supervised MT CoVe [126]
    Unsupervised/
    Self-Supervised
    LM ELMo [135], GPT [142], GPT-2 [143], UniLM [39]
    MLM
    BERT [36], SpanBERT [117], RoBERTa [117], XLM-R [28]
    TLM XLM [27]
    Seq2Seq MLM MASS [160], T5 [144]
    PLM XLNet [209]
    DAE BART [100]
    CTL
    RTD CBOW-NS [129], ELECTRA [24]
    NSP BERT [36], UniLM [39]
    SOP ALBERT [93], StructBERT [193]
    Extensions
    Knowledge-Enriched
    ERNIE(THU) [214], KnowBERT [136], K-BERT [111]
    SentiLR [83], KEPLER [195], WKLM [202]
    Multilingual
    XLU mBERT [36], Unicoder [68], XLM [27], XLM-R [28], MultiFit [42]
    XLG MASS [160], mBART [118], XNLG [19]
    Language-Specific
    ERNIE(Baidu) [170], BERT-wwm-Chinese [29], NEZHA [198], ZEN [37]
    BERTje [33], CamemBERT [125], FlauBERT [95], RobBERT [35]
    Multi-Modal
    Image
    ViLBERT [120], LXMERT [175],
    VisualBERT [103], B2T2 [2], VL-BERT [163]
    Video VideoBERT [165], CBT [164]
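     To make the MLM entry above concrete, a toy sketch of BERT-style token masking (15% of tokens selected; of those, 80% become [MASK], 10% a random token, 10% stay unchanged). This illustrates the task, not any model's exact implementation:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """BERT-style masking: returns the corrupted input and the prediction targets."""
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)                       # model must predict the original token
            r = random.random()
            if r < 0.8:
                inputs.append("[MASK]")              # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)                   # 10%: keep unchanged
        else:
            inputs.append(tok)
            labels.append(None)                      # not a prediction target
    return inputs, labels

print(mask_tokens("the cat sat on the mat".split(), vocab=["dog", "ran", "big"]))
```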


  16. 4. Extensions of PTMs
     Extensions of pre-trained models


  17. PTMs Self-Supervised PLM XLNet [209]
    DAE BART [100]
    CTL
    RTD CBOW-NS [129], ELECTRA [24]
    NSP BERT [36], UniLM [39]
    SOP ALBERT [93], StructBERT [193]
    Extensions
    Knowledge-Enriched
    ERNIE(THU) [214], KnowBERT [136], K-BERT [111]
    SentiLR [83], KEPLER [195], WKLM [202]
    Multilingual
    XLU mBERT [36], Unicoder [68], XLM [27], XLM-R [28], MultiFit [42]
    XLG MASS [160], mBART [118], XNLG [19]
    Language-Specific
    ERNIE(Baidu) [170], BERT-wwm-Chinese [29], NEZHA [198], ZEN [37]
    BERTje [33], CamemBERT [125], FlauBERT [95], RobBERT [35]
    Multi-Modal
    Image
    ViLBERT [120], LXMERT [175],
    VisualBERT [103], B2T2 [2], VL-BERT [163]
    Video VideoBERT [165], CBT [164]
    Speech SpeechBERT [22]
    Domain-Specific SentiLR [83], BioBERT [98], SciBERT [11], PatentBERT [97]
    Model Compression
    Model Pruning CompressingBERT [51]
    Quantization Q-BERT [156], Q8BERT [211]
    Parameter Sharing ALBERT [93]
    Distillation DistilBERT [152], TinyBERT [75], MiniLM [194]
    Module Replacing BERT-of-Theseus [203]
    Figure 3: Taxonomy of PTMs with Representative Examples


  18. 5. Adapting PTMs to
    Downstream Tasks
     Applying pre-trained models to downstream tasks


  19. Transfer Learning and Fine-Tuning
     Transfer learning
     • A way to apply the "knowledge" of a model trained on one task to another task
     • How well a pre-trained model suits a downstream task depends on its pre-training task, its architecture, and the corpus it was trained on
     • Up to which layer of the pre-trained model should be reused?
     Fine-tuning (a minimal sketch follows this slide)
     • Plain fine-tuning
     • Two-stage fine-tuning
     • Multi-task fine-tuning
     • Fine-tuning with additional adaptation modules
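     A minimal sketch of the plain fine-tuning setting with the Hugging Face transformers library (the two-label task, optimizer, and learning rate are illustrative assumptions, not from the slides):

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["great movie", "terrible movie"], return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # all parameters are fine-tuned
outputs = model(**batch, labels=labels)
loss = outputs[0]                                           # cross-entropy on the new head
loss.backward()
optimizer.step()
```

     Freezing the encoder parameters and training only the new classification head would instead use the PTM as a fixed feature extractor, one answer to the "which layers to reuse" question above.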


  20. 6. Resources of PTMs


  21. QIU XP, et al. Pre-trained Models for Natural Language Processing: A Survey March (2020) 17
    Table 5: Resources of PTMs
    Resource Description URL
    Open-Source Implementations §
     word2vec CBOW, Skip-Gram https://github.com/tmikolov/word2vec
    GloVe Pre-trained word vectors https://nlp.stanford.edu/projects/glove
    FastText Pre-trained word vectors https://github.com/facebookresearch/fastText
    Transformers Framework: PyTorch&TF, PTMs: BERT, GPT-2, RoBERTa, XLNet, etc. https://github.com/huggingface/transformers
     Fairseq Framework: PyTorch, PTMs: English LM, German LM, RoBERTa, etc. https://github.com/pytorch/fairseq
     Flair Framework: PyTorch, PTMs: BERT, ELMo, GPT, RoBERTa, XLNet, etc. https://github.com/flairNLP/flair
    AllenNLP [47] Framework: PyTorch, PTMs: ELMo, BERT, GPT-2, etc. https://github.com/allenai/allennlp
    fastNLP Framework: PyTorch, PTMs: RoBERTa, GPT, etc. https://github.com/fastnlp/fastNLP
    UniLMs Framework: PyTorch, PTMs: UniLM v1&v2, MiniLM, LayoutLM, etc. https://github.com/microsoft/unilm
    Chinese-BERT [29] Framework: PyTorch&TF, PTMs: BERT, RoBERTa, etc. (for Chinese) https://github.com/ymcui/Chinese-BERT-wwm
    BERT [36] Framework: TF, PTMs: BERT, BERT-wwm https://github.com/google-research/bert
    RoBERTa [117] Framework: PyTorch https://github.com/pytorch/fairseq/tree/master/examples/roberta
    XLNet [209] Framework: TF https://github.com/zihangdai/xlnet/
    ALBERT [93] Framework: TF https://github.com/google-research/ALBERT
    T5 [144] Framework: TF https://github.com/google-research/text-to-text-transfer-transformer
    ERNIE(Baidu) [170, 171] Framework: PaddlePaddle https://github.com/PaddlePaddle/ERNIE
    CTRL [84] Conditional Transformer Language Model for Controllable Generation. https://github.com/salesforce/ctrl
    BertViz [185] Visualization Tool https://github.com/jessevig/bertviz
    exBERT [65] Visualization Tool https://github.com/bhoov/exbert
    TextBrewer [210] PyTorch-based toolkit for distillation of NLP models. https://github.com/airaria/TextBrewer
    DeepPavlov Conversational AI Library. PTMs for the Russian, Polish, Bulgarian,
    Czech, and informal English.
    https://github.com/deepmipt/DeepPavlov
    Corpora
    OpenWebText Open clone of OpenAI’s unreleased WebText dataset. https://github.com/jcpeterson/openwebtext
    Common Crawl A very large collection of text. http://commoncrawl.org/
    WikiEn English Wikipedia dumps. https://dumps.wikimedia.org/enwiki/
    Other Resources
    Paper List https://github.com/thunlp/PLMpapers
    Paper List https://github.com/tomohideshibata/BERT-related-papers
    Paper List https://github.com/cedrickchee/awesome-bert-nlp
     Bert Lang Street A collection of BERT models with reported performances on different
     datasets, tasks and languages.
     https://bertlang.unibocconi.it/
     § Most PTM papers release links to their official implementations. Here we list some popular third-party and official implementations.
     However, motivated by the fact that the progress in recent years has eroded headroom on the GLUE benchmark dramatically, a new benchmark called SuperGLUE [189] was presented. Compared to GLUE, SuperGLUE has more challenging tasks and more diverse task formats (e.g., coreference resolution and question answering). State-of-the-art PTMs are listed in the corresponding leaderboards.
     BERT creatively transforms the extractive QA task into a span-prediction task that predicts the starting span as well as the ending span of the answer [36]. After that, using a PTM as an encoder for predicting spans has become a competitive baseline. For extractive QA, Zhang et al. [215] proposed a retrospective reader architecture and initialized the encoder with a PTM (e.g., ALBERT). …


  22. 7. Applications
     Applications of pre-trained models


  23. Main Application Areas
     • General evaluation benchmarks (GLUE, CoLA, SST-2, MNLI, RTE, WNLI, QQP, MRPC, STS-B, QNLI, etc.)
     • Question answering (QA)
     • Sentiment analysis
     • Named entity recognition
     • Machine translation
     • Summarization
     • Adversarial attacks and defenses


  24. 8. Future Directions
     Directions for future work


  25. Future Directions
     Pushing the upper bound of performance
     Improving model architectures
     Task-oriented pre-training and model compression
     Knowledge transfer beyond fine-tuning
     Interpretability and reliability of pre-trained models


  26. From Recent News
     NLP-related news items I have seen over roughly the past month


  27. GPT-3 / OpenAI API
     • A much more powerful successor to GPT-2, the text-generation model that was briefly considered dangerous because it could generate text "as if written by a human"
     • 175 billion parameters, more than 100 times GPT-2's 1.5 billion
     • An API exposing the "knowledge" of the trained GPT-3 is also being released
     https://openai.com/blog/openai-api/
     https://github.com/openai/gpt-3


  28. Unsupervised Translation of Programming Languages
     • "Unsupervised learning" makes it possible to "translate" between programming languages
     • This Facebook Research paper demonstrates translation among C++, Java, and Python
     Figure 1: Illustration of the three principles of unsupervised machine translation used by our approach. The first principle initializes the model with cross-lingual masked language model pretraining. As a result, pieces of code that express the same instructions are mapped to the same representation, regardless of the programming language. Denoising auto-encoding, the second principle, trains the decoder to always generate valid sequences, even when fed with noisy data, and increases the encoder robustness to input noise. Back-translation, the last principle, allows the model to generate parallel data which can be used for training. Whenever the Python → C++ model becomes better, it generates more accurate data for the C++ → Python model, and vice versa. Figure 5 in the appendix provides a representation of the cross-lingual embeddings we obtain after training.
     https://arxiv.org/abs/2006.03511


  29. AllenNLP 1.0
     • A collection of NLP libraries from AI2 (the Allen Institute for AI), the research institute founded by Microsoft's "other co-founder" Paul Allen
     • Supports a wide range of NLP tasks
     • Reached v1.0.0 this week
     https://allennlp.org/


  30. Wiki-40B: Multilingual Language Model Dataset
     • Per-language Wikipedia datasets published through TensorFlow Datasets (a loading sketch follows this slide)
     • Pre-trained models trained on this dataset are also available:
       • 12-layer Transformer-XL
       • 768-dimensional word embeddings
       • 12 attention heads of 64 dimensions each
       • a 32,000-token vocabulary learned with SentencePiece subword sampling
     https://research.google/pubs/pub49029/
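     A minimal loading sketch via TensorFlow Datasets (the dataset/config names and feature keys below follow my reading of the TFDS catalog and are an assumption, not taken from the slides):

```python
import tensorflow_datasets as tfds

# Japanese split of Wiki-40B; other languages use configs such as "wiki40b/en".
# try_gcs=True reads the already-prepared data hosted on Google Cloud Storage.
ds = tfds.load("wiki40b/ja", split="train", try_gcs=True)

for example in ds.take(1):
    print(example["wikidata_id"].numpy())
    print(example["text"].numpy()[:200])   # cleaned article text with section markers
```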


  31. spaCy / UD_Japanese-GSD
     • spaCy
       • An NLP library written in Python/Cython
       • Supports many languages
       • Includes morphological analysis, dependency parsing, named entity recognition, visualizers, text classification models and deep learning models
       • MIT-licensed and aimed at industrial use
     • UD_Japanese-GSD is a tagged training dataset for Japanese named entity recognition; thanks to it, a Japanese model has been added to spaCy (see the usage sketch below)
     https://github.com/megagonlabs/UD_Japanese-GSD
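     A minimal usage sketch, assuming the Japanese model name introduced with spaCy 2.3 (ja_core_news_sm); the example sentence is illustrative:

```python
import spacy

# requires: python -m spacy download ja_core_news_sm
nlp = spacy.load("ja_core_news_sm")
doc = nlp("新潟でPythonの勉強会が開かれた。")

for token in doc:
    print(token.text, token.pos_, token.dep_)   # morphology + dependency parse
for ent in doc.ents:
    print(ent.text, ent.label_)                 # named entities
```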


  32. PEGASUS
     • A model specialized for summarization
     • As its pre-training task, instead of token-level masking (MLM) it uses sentence-level GSG (Gap Sentence Generation) (a toy sketch follows this slide)
     PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization
     Jingqing Zhang, Yao Zhao, Mohammad Saleh, Peter J. Liu
     Abstract (excerpt): Recent work pre-training Transformers with self-supervised objectives on large text corpora has shown great success when fine-tuned on downstream NLP tasks including text summarization. However, pre-training objectives tailored for abstractive text summarization have not been explored. Furthermore there is a lack of systematic evaluation across diverse domains. In this work, we propose pre-training large Transformer-based encoder-decoder models on massive text corpora with a new self-supervised objective. In PEGASUS, important sentences are removed/masked from an input document and are generated together as one output sequence from the remaining sentences, similar to an extractive summary.
    Figure 1: The base architecture of PEGASUS is a standard
    Transformer encoder-decoder. Both GSG and MLM are
    applied simultaneously to this example as pre-training ob-
    jectives. Originally there are three sentences. One sentence
    is masked with [MASK1] and used as target generation text
    (GSG). The other two sentences remain in the input, but
    some tokens are randomly masked by [MASK2] (MLM).
    https://arxiv.org/abs/1912.08777
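     A toy sketch of the gap-sentence idea: selected sentences are removed from the input and concatenated as the generation target. Here the selection is random, whereas PEGASUS selects important sentences (e.g., scored by ROUGE against the rest of the document):

```python
import random

def gap_sentence_example(sentences, mask_ratio=1/3):
    """Build a (source, target) pair: masked sentences become the target sequence."""
    n_mask = max(1, int(len(sentences) * mask_ratio))
    masked_idx = set(random.sample(range(len(sentences)), n_mask))
    source = " ".join("[MASK1]" if i in masked_idx else s
                      for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(masked_idx))
    return source, target

doc = ["Pegasus is a mythical horse.", "It has wings.", "It appears in Greek myth."]
print(gap_sentence_example(doc))
```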


  33. 電笑戦 (an AI comedy contest)
     • A contest announced at AWS Summit Tokyo 2020, which was held online this year
     • Using submission data from the comedy-sharing service "Bokete" (https://bokete.jp), entrants compete over whether an AI can produce "humor that surpasses humans"
     https://aws.amazon.com/jp/builders-flash/202006/bokete/
