Embeddings are obtained in two ways: either trained on general text with methods such as Skip-Gram, or learned inside the target model from a dataset of its own domain.
The learned vectors are expected to carry meaning, as illustrated in the figure on the right.
A Neural Probabilistic Language Model (2003) https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
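As a minimal sketch of the first approach (standalone Skip-Gram training), the snippet below assumes the gensim library, which is not mentioned on the slide; the toy corpus and hyperparameters are illustrative placeholders only.

```python
# Minimal Skip-Gram sketch using gensim (assumed dependency, not from the slide).
# The toy corpus and hyperparameters are illustrative only.
from gensim.models import Word2Vec

corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "walks", "his", "dog"],
    ["a", "woman", "walks", "her", "dog"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # dimensionality of the learned embedding
    window=2,         # context window on each side of the target word
    min_count=1,      # keep every token in this tiny corpus
    sg=1,             # sg=1 selects the Skip-Gram objective (0 would be CBOW)
    epochs=100,
)

# With a real corpus one would hope for analogies like king - man + woman ≈ queen,
# the kind of meaningful vector arithmetic the slide refers to.
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```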
On the other hand, in languages such as Japanese and Chinese, where words are not separated by spaces, segmenting the text into words is itself a problem. A method was therefore developed that performs word segmentation and the decomposition of words into WordPiece-style subwords in a single step. This is called SentencePiece. (Its developer, Mr. Kudo of Google, is also the developer of MeCab.)
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (2016) https://arxiv.org/abs/1609.08144v2
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (2018) https://arxiv.org/abs/1808.06226v1
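As a minimal sketch of how SentencePiece is typically used from Python (assuming the open-source sentencepiece package; the corpus file name, vocabulary size, and sample sentence are placeholders, not values from the slide):

```python
# Train and apply a SentencePiece model (assumes the `sentencepiece` pip package).
# File names and vocab_size are illustrative placeholders.
import sentencepiece as spm

# Train directly on raw, unsegmented text -- no external word segmenter is needed.
spm.SentencePieceTrainer.train(
    input="corpus.txt",      # raw text, one sentence per line
    model_prefix="jp_sp",    # writes jp_sp.model and jp_sp.vocab
    vocab_size=8000,
    model_type="unigram",    # SentencePiece also supports "bpe"
)

sp = spm.SentencePieceProcessor(model_file="jp_sp.model")

# Segmentation and subword decomposition happen in one step.
pieces = sp.encode("吾輩は猫である。", out_type=str)
ids = sp.encode("吾輩は猫である。", out_type=int)
print(pieces)
print(sp.decode(ids))  # detokenization restores the original text
```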
Architecture
Because it uses only multiply-accumulate operations, it needs less computation than a CNN, and unlike an RNN it requires no step-by-step processing along the time axis. It can also learn relationships between words that are far apart in the text.

Figure 1: The Transformer - model architecture.

Excerpt (Section 3.1, Encoder and Decoder Stacks):
"Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections"

Attention Is All You Need (2017) https://arxiv.org/abs/1706.03762v5
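As a minimal NumPy sketch of the scaled dot-product attention at the core of this architecture (shapes and random inputs are illustrative; the paper's multi-head projections, masking, and dropout are omitted):

```python
# Scaled dot-product attention: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
# Minimal single-head sketch with random data; shapes are illustrative only.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (batch, seq_q, seq_k)
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V                                  # (batch, seq_q, d_v)

rng = np.random.default_rng(0)
batch, seq_len, d_model = 2, 5, 512
x = rng.normal(size=(batch, seq_len, d_model))

# Self-attention: queries, keys, and values all come from the same sequence,
# so every position can attend to every other, however far apart in the text.
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (2, 5, 512)
```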
A model pre-trained without labels using Masked Language Model and Next Sentence Prediction; by merely swapping the output layer it can be adapted to a wide variety of natural language tasks.
It achieved the state of the art at the time of publication on multiple benchmarks, and had a major impact on deep-learning-based natural language processing.

Figure 1: Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating questions/answers).

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018) https://arxiv.org/abs/1810.04805v2
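As a minimal sketch of "swapping the output layer", assuming the Hugging Face transformers library (not mentioned on the slide); the model name, number of labels, and example sentence are placeholders:

```python
# Reuse a pre-trained BERT encoder and attach a fresh task-specific output layer.
# Assumes the Hugging Face `transformers` package; names and labels are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The encoder weights come from pre-training; the classification head on top
# is newly initialized and would be trained during fine-tuning.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

inputs = tokenizer("BERT had a big impact on NLP.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits   # shape (1, 2): one score per label
print(logits)
```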
GPT-3 can be used without fine-tuning: showing it just a few example Q&A pairs in the prompt is enough (few-shot learning).
There are also applications to automatic generation of programming-language code.
Microsoft has acquired an exclusive license.
Improving Language Understanding by Generative Pre-Training (2018) https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
Language Models are Unsupervised Multitask Learners (2019) https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf
Language Models are Few-Shot Learners (2020) https://arxiv.org/abs/2005.14165v4
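As a minimal sketch of few-shot prompting (the example pairs, the prompt format, and the `complete` function standing in for a language-model completion API are all hypothetical, not from the papers):

```python
# Few-shot prompting: the task is demonstrated inside the prompt itself,
# with no gradient updates. `complete` is a hypothetical stand-in for
# whatever language-model completion API is available.
def build_few_shot_prompt(examples, question):
    lines = []
    for q, a in examples:
        lines.append(f"Q: {q}\nA: {a}\n")
    lines.append(f"Q: {question}\nA:")
    return "\n".join(lines)

examples = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
]
prompt = build_few_shot_prompt(examples, "What is the capital of Italy?")
print(prompt)

# completion = complete(prompt)  # hypothetical API call; the model is expected
#                                # to continue the pattern with "Rome"
```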
Q: ... located? / Model answer: googleplex / Expected answer: Mountain View
Q: What is the most populous country in the world? / Model answer: 中国 / Expected answer: China
Q: Who are the 4 members of The Beatles? / Model answer: ビートルズ / Expected answer: John, Paul, George, Ringo
Q: How many teeth do humans have? / Model answer: twenty four / Expected answer: 26

- The model was fine-tuned on quiz-style Q&A in Japanese, yet it can also respond to Q&A in another language (English).
- GooglePlex is the name of Google's headquarters; it is not the expected correct answer, but it is not wrong either.
- As shown by the fact that it answers English questions in Japanese, the model understands multiple languages.
- The correct answers in the fine-tuning data are single words, but, as in the last example, in English the model tries to answer with multiple words.
The trained model has the ability to render concepts as images: language → concept → image.
https://openai.com/blog/dall-e/
https://edition.cnn.com/2021/01/08/tech/artificial-intelligence-openai-images-from-text/index.html
"an illustration of a baby daikon radish in a tutu walking a dog"
Figure 1 (from the paper): Language modeling performance improves smoothly as we increase the model size, dataset size, and amount of compute used for training. For optimal performance all three factors must be scaled up in tandem. Empirical performance has a power-law relationship with each individual factor when not bottlenecked by the other two.

Excerpt from the paper:
"Performance depends strongly on scale, weakly on model shape: Model performance depends most strongly on scale, which consists of three factors: the number of model parameters N (excluding embeddings), the size of the dataset D, and the amount of compute C used for training. Within reasonable limits, performance depends very weakly on other architectural hyperparameters such as depth vs. width. (Section 3)
Smooth power laws: Performance has a power-law relationship with each of the three scale factors"

A comparative study by OpenAI: the performance of Transformer-based language models follows power laws in training compute, dataset size, and number of model parameters; the model architecture itself matters relatively little.
Scaling Laws for Neural Language Models (2020) https://arxiv.org/abs/2001.08361
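As a sketch of the functional form this refers to, the test loss L follows a power law in each scale factor when the other two are not the bottleneck; the constants N_c, D_c, C_c and the exponents α are fitted empirically in the paper and are omitted here:

```latex
% Power-law form of the scaling laws (fitted constants and exponents omitted).
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```

Here N is the number of model parameters (excluding embeddings), D the dataset size, and C the training compute, matching the notation in the excerpt above.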