Embeddings are obtained in two ways: either trained on general text with methods such as Skip-Gram, or learned inside the target model from a dataset of its own domain.
The learned vectors are expected to carry meaning, as illustrated in the figure on the right.
A Neural Probabilistic Language Model (2003) https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
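As a minimal sketch of the first approach (standalone Skip-Gram training), the snippet below assumes the gensim library, which is not mentioned on the slide; the toy corpus and hyperparameters are illustrative placeholders only.

```python
# Minimal Skip-Gram sketch using gensim (assumed dependency, not from the slide).
# The toy corpus and hyperparameters are illustrative only.
from gensim.models import Word2Vec

corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "walks", "his", "dog"],
    ["a", "woman", "walks", "her", "dog"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # dimensionality of the learned embedding
    window=2,         # context window on each side of the target word
    min_count=1,      # keep every token in this tiny corpus
    sg=1,             # sg=1 selects the Skip-Gram objective (0 would be CBOW)
    epochs=100,
)

# With a real corpus one would hope for analogies like king - man + woman ≈ queen,
# the kind of meaningful vector arithmetic the slide refers to.
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```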
On the other hand, in languages such as Japanese and Chinese, where words are not separated by spaces, segmenting the text into words is itself a problem. A method was therefore developed that performs word segmentation and the decomposition of words into WordPiece-style subwords in a single step. This is called SentencePiece. (Its developer, Mr. Kudo of Google, is also the developer of MeCab.)
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (2016) https://arxiv.org/abs/1609.08144v2
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (2018) https://arxiv.org/abs/1808.06226v1
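As a minimal sketch of how SentencePiece is typically used from Python (assuming the open-source sentencepiece package; the corpus file name, vocabulary size, and sample sentence are placeholders, not values from the slide):

```python
# Train and apply a SentencePiece model (assumes the `sentencepiece` pip package).
# File names and vocab_size are illustrative placeholders.
import sentencepiece as spm

# Train directly on raw, unsegmented text -- no external word segmenter is needed.
spm.SentencePieceTrainer.train(
    input="corpus.txt",      # raw text, one sentence per line
    model_prefix="jp_sp",    # writes jp_sp.model and jp_sp.vocab
    vocab_size=8000,
    model_type="unigram",    # SentencePiece also supports "bpe"
)

sp = spm.SentencePieceProcessor(model_file="jp_sp.model")

# Segmentation and subword decomposition happen in one step.
pieces = sp.encode("吾輩は猫である。", out_type=str)
ids = sp.encode("吾輩は猫である。", out_type=int)
print(pieces)
print(sp.decode(ids))  # detokenization restores the original text
```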
Architecture
Because it uses only multiply-accumulate operations, it needs less computation than a CNN, and unlike an RNN it requires no step-by-step processing along the time axis. It can also learn relationships between words that are far apart in the text.

Figure 1: The Transformer - model architecture.

Excerpt (Section 3.1, Encoder and Decoder Stacks):
"Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections"

Attention Is All You Need (2017) https://arxiv.org/abs/1706.03762v5
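As a minimal NumPy sketch of the scaled dot-product attention at the core of this architecture (shapes and random inputs are illustrative; the paper's multi-head projections, masking, and dropout are omitted):

```python
# Scaled dot-product attention: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
# Minimal single-head sketch with random data; shapes are illustrative only.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (batch, seq_q, seq_k)
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V                                  # (batch, seq_q, d_v)

rng = np.random.default_rng(0)
batch, seq_len, d_model = 2, 5, 512
x = rng.normal(size=(batch, seq_len, d_model))

# Self-attention: queries, keys, and values all come from the same sequence,
# so every position can attend to every other, however far apart in the text.
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (2, 5, 512)
```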
A model pre-trained without labels using Masked Language Model and Next Sentence Prediction; by merely swapping the output layer it can be adapted to a wide variety of natural language tasks.
It achieved the state of the art at the time of publication on multiple benchmarks, and had a major impact on deep-learning-based natural language processing.

Figure 1: Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architectures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating questions/answers).

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018) https://arxiv.org/abs/1810.04805v2
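As a minimal sketch of "swapping the output layer", assuming the Hugging Face transformers library (not mentioned on the slide); the model name, number of labels, and example sentence are placeholders:

```python
# Reuse a pre-trained BERT encoder and attach a fresh task-specific output layer.
# Assumes the Hugging Face `transformers` package; names and labels are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The encoder weights come from pre-training; the classification head on top
# is newly initialized and would be trained during fine-tuning.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

inputs = tokenizer("BERT had a big impact on NLP.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits   # shape (1, 2): one score per label
print(logits)
```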
GPT-3 can be used without fine-tuning: showing it just a few example Q&A pairs in the prompt is enough (few-shot learning).
There are also applications to automatic generation of programming-language code.
Microsoft has acquired an exclusive license.
Improving Language Understanding by Generative Pre-Training (2018) https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
Language Models are Unsupervised Multitask Learners (2019) https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf
Language Models are Few-Shot Learners (2020) https://arxiv.org/abs/2005.14165v4
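As a minimal sketch of few-shot prompting (the example pairs, the prompt format, and the `complete` function standing in for a language-model completion API are all hypothetical, not from the papers):

```python
# Few-shot prompting: the task is demonstrated inside the prompt itself,
# with no gradient updates. `complete` is a hypothetical stand-in for
# whatever language-model completion API is available.
def build_few_shot_prompt(examples, question):
    lines = []
    for q, a in examples:
        lines.append(f"Q: {q}\nA: {a}\n")
    lines.append(f"Q: {question}\nA:")
    return "\n".join(lines)

examples = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
]
prompt = build_few_shot_prompt(examples, "What is the capital of Italy?")
print(prompt)

# completion = complete(prompt)  # hypothetical API call; the model is expected
#                                # to continue the pattern with "Rome"
```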
Q: ... located? / Model answer: googleplex / Expected answer: Mountain View
Q: What is the most populous country in the world? / Model answer: 中国 / Expected answer: China
Q: Who are the 4 members of The Beatles? / Model answer: ビートルズ / Expected answer: John, Paul, George, Ringo
Q: How many teeth do humans have? / Model answer: twenty four / Expected answer: 26

- The model was fine-tuned on quiz-style Q&A in Japanese, yet it can also respond to Q&A in another language (English).
- GooglePlex is the name of Google's headquarters; it is not the expected correct answer, but it is not wrong either.
- As shown by the fact that it answers English questions in Japanese, the model understands multiple languages.
- The correct answers in the fine-tuning data are single words, but, as in the last example, in English the model tries to answer with multiple words.
The trained model has the ability to render concepts as images: language → concept → image.
https://openai.com/blog/dall-e/
https://edition.cnn.com/2021/01/08/tech/artificial-intelligence-openai-images-from-text/index.html
"an illustration of a baby daikon radish in a tutu walking a dog"
Figure 1 (from the paper): Language modeling performance improves smoothly as we increase the model size, dataset size, and amount of compute used for training. For optimal performance all three factors must be scaled up in tandem. Empirical performance has a power-law relationship with each individual factor when not bottlenecked by the other two.

Excerpt from the paper:
"Performance depends strongly on scale, weakly on model shape: Model performance depends most strongly on scale, which consists of three factors: the number of model parameters N (excluding embeddings), the size of the dataset D, and the amount of compute C used for training. Within reasonable limits, performance depends very weakly on other architectural hyperparameters such as depth vs. width. (Section 3)
Smooth power laws: Performance has a power-law relationship with each of the three scale factors"

A comparative study by OpenAI: the performance of Transformer-based language models follows power laws in training compute, dataset size, and number of model parameters; the model architecture itself matters relatively little.
Scaling Laws for Neural Language Models (2020) https://arxiv.org/abs/2001.08361
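As a sketch of the functional form this refers to, the test loss L follows a power law in each scale factor when the other two are not the bottleneck; the constants N_c, D_c, C_c and the exponents α are fitted empirically in the paper and are omitted here:

```latex
% Power-law form of the scaling laws (fitted constants and exponents omitted).
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```

Here N is the number of model parameters (excluding embeddings), D the dataset size, and C the training compute, matching the notation in the excerpt above.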