

 GPT-3

Language Models are Few-Shot Learners
https://arxiv.org/abs/2005.14165

tomohideshibata

October 10, 2020

Transcript

  1. Language Models are Few-Shot Learners (GPT-3), OpenAI. Tomohide Shibata (柴田知秀).
    * Most of the figures and tables are taken from the GPT-3 paper.
  2. Overview
    Recent NLP: pre-training on large-scale text, then fine-tuning on each task (with thousands to tens of thousands of labeled examples).
    Motivation: get by with little or no labeled data for each task.
    Labeled data / parameters / task accuracy / notes:
    GPT-2: zero-shot, 1.5B parameters, accuracy still lacking; known for fake-news generation.
    GPT-3: few-shot, 175B parameters, reasonably accurate; can do far more than expected.
  3. GPT-3 has been used for a wide range of applications
    • Search engines
    • Article generation
    • HTML layout generation
    • LaTeX formula generation
    • ...
    • Demos can be found at pages such as:
    • http://deeplearning.hatenablog.com/entry/gpt3
    • https://github.com/elyase/awesome-gpt3
    • https://machinelearningtokyo.com/2020/07/26/10-cool-gpt-3-demos/
  4. Language model (言語モデル)
    • Defines the probability of generating a sentence
    • Uses
    • How plausible a sentence generated by a system is
    • Classical speech recognition → acoustic model × language model (called the "language" model in contrast to the acoustic one)
    • Classical machine translation → translation model × language model
    • Sentence generation: generate sentences according to the probability
    • How the probability is estimated: recently, with neural networks
    P(w_1, w_2, w_3, ..., w_N) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ...
    Example: P(私, は, 学生, ...) = P(私) P(は | 私) P(学生 | 私, は) ...
    (The next few slides are preliminaries.)
    Note: recent models such as BERT are also often called "language models" in the sense that they model language, but GPT-3 is a language model in the traditional sense (next slide).
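    As a concrete illustration of the chain-rule factorization above, here is a minimal Python sketch; next_token_prob is a hypothetical stand-in for whatever model estimates P(w_t | w_1, ..., w_{t-1}) (an n-gram model, an RNN, a Transformer, ...).

      import math

      def sentence_log_prob(tokens, next_token_prob):
          # Chain rule: log P(w_1, ..., w_N) = sum_t log P(w_t | w_1, ..., w_{t-1}).
          # next_token_prob(prefix, token) is a hypothetical callable that returns
          # P(token | prefix) from some language model.
          log_p = 0.0
          for t, token in enumerate(tokens):
              log_p += math.log(next_token_prob(tokens[:t], token))
          return log_p

      # Toy usage: a uniform "model" over a 5-word vocabulary.
      uniform = lambda prefix, token: 1.0 / 5
      print(sentence_log_prob(["私", "は", "学生", "です", "。"], uniform))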
  5. RNN language model [Mikolov+ 10] → now replaced by Transformers
    [Figure: an RNN language model reading 私, は, 学生, です word by word; with weight matrices U (input), W (recurrent), V (output), it outputs a distribution over the next word at each step, e.g. P(学生 | 私, は).]
    Learned from large-scale unlabeled text.
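    A minimal numpy sketch of one step of the RNN language model in the figure (the sizes are hypothetical, not from the slides): the hidden state is updated as h_t = tanh(U x_t + W h_{t-1}) and the next-word distribution is y_t = softmax(V h_t).

      import numpy as np

      def rnn_lm_step(x_t, h_prev, U, W, V):
          # One step of a vanilla RNN language model [Mikolov+ 10].
          # x_t: one-hot input word, shape (vocab,); h_prev: hidden state, shape (hidden,).
          h_t = np.tanh(U @ x_t + W @ h_prev)   # new hidden state
          logits = V @ h_t                      # one score per vocabulary word
          y_t = np.exp(logits - logits.max())
          y_t /= y_t.sum()                      # softmax: P(next word | history)
          return h_t, y_t

      # Toy usage: vocabulary of 5 words, hidden size 8, random weights.
      rng = np.random.default_rng(0)
      U, W, V = rng.normal(size=(8, 5)), rng.normal(size=(8, 8)), rng.normal(size=(5, 8))
      h = np.zeros(8)
      h, y = rnn_lm_step(np.eye(5)[2], h, U, W, V)
      print(y)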
  6. History of GPT
    • 2018.6: GPT-1, "Improving Language Understanding by Generative Pre-Training"
    • 2018.10: BERT
    • 2019.2: GPT-2, "Language Models are Unsupervised Multitask Learners"
    • 2020.5: GPT-3, "Language Models are Few-Shot Learners"
    This is where the approaches diverge sharply (= zero-shot).
    GPT is also read as Generative Pre-trained Transformer.
  7. 1. Pre-training / 2. Fine-tuning (example task: sentiment analysis)
    GPT-1 pre-training task: language modeling. Given 電池 の 消耗 が 激しい ... ("the battery drains quickly"), predict each next word; a token can only look at the preceding context.
    BERT pre-training task: fill in the blanks. Given the sentence with some words replaced by [MASK] (e.g. [CLS] 電池 の [MASK] が ...), predict the masked words; a token can look at both the preceding and the following context!
    Fine-tuning: この 電池 は すぐ ... ("this battery immediately ...") → negative. The parameters learned in pre-training are used as the initial values and are updated during fine-tuning.
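    The "can only look at the preceding context" vs. "can look at both sides" contrast above amounts to different attention masks; a minimal illustrative sketch (not from the slides):

      import numpy as np

      def causal_mask(n):
          # GPT-style language modeling: position t may attend only to positions <= t.
          return np.tril(np.ones((n, n), dtype=bool))

      def bidirectional_mask(n):
          # BERT-style masked-word prediction: every position may attend to every position.
          return np.ones((n, n), dtype=bool)

      print(causal_mask(4).astype(int))         # 1s on and below the diagonal only
      print(bidirectional_mask(4).astype(int))  # all 1s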
  8. 1. Pre-training / 2. Fine-tuning
    GPT-1: pre-training task is language modeling (電池 の 消耗 が 激しい ...), followed by fine-tuning on the target task (この 電池 は すぐ ... → negative).
    GPT-2, 3: the parameters are never updated; the pre-trained language model is simply applied. For example, given a context like "... この 電池 ... 英語で:" ("... this battery ... in English:"), the model continues with "This battery ...".
  9. Why are GPT-2 and GPT-3 possible at all?
    Example: English ⇒ French. The web contains plenty of text like the following, and the language model is trained on it (examples quoted from the GPT-2 paper, "Language Models are Unsupervised Multitask Learners"):
    "I'm not the cleverest man in the world, but like they say in French: Je ne suis pas un imbecile [I'm not a fool]."
    In a now-deleted post from Aug. 16, Soheil Eid, Tory candidate in the riding of Joliette, wrote in French: "Mentez mentez, il en restera toujours quelque chose," which translates as, "Lie lie and something will always remain."
    "I hate the word 'perfume,'" Burr says. "It's somewhat better in French: 'parfum.'"
    If listened carefully at 29:55, a conversation can be heard between two guys in French: "-Comment on fait pour aller de l'autre côté? -Quel autre côté?", which means "-How do you get to the other side? -What side?".
    If this sounds like a bit of a stretch, consider this question in French: As-tu aller au cinéma?, or Did you go to the movies?, which literally translates as Have-you to go to movies/theater?
  10. GPT-2, 3
    [Figure 2.1 of the GPT-3 paper: zero-shot, one-shot, and few-shot prompting, contrasted with fine-tuning, as four ways of performing a task with a language model; in the few-shot setting the model is typically shown a few dozen examples at inference time.]
    The language model reads [task description], ([example],) [prompt] and then generates the continuation (no parameter updates).
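    A minimal sketch of how such a prompt could be assembled (the example translation pairs are the ones shown in Figure 2.1 of the paper; the layout here is illustrative, and the exact formats used for each task are listed in the paper's appendix):

      def build_few_shot_prompt(task_description, examples, prompt):
          # Assemble [task description], ([example],) ..., [prompt] into one string;
          # the language model's continuation of this string is taken as the answer.
          lines = [task_description]
          for source, target in examples:        # K in-context examples
              lines.append(f"{source} => {target}")
          lines.append(f"{prompt} =>")           # the query, answer left for the model
          return "\n".join(lines)

      print(build_few_shot_prompt(
          "Translate English to French:",
          [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
          "cheese",
      ))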
  11. Number of examples vs. accuracy
    [Figure: accuracy as a function of the number of in-context examples, shown for the Symbol Insertion task.]
    Context → Please unscramble the letters into a word, and write that word: r e!c.i p r o.c a/l =
    Completion → reciprocal
    (Figure G.26: evaluation example for Symbol Insertion)
    Context → Please unscramble the letters into a word, and write that word: taefed =
    Completion → defeat
    (Figure G.27: evaluation example for Reversed Words)
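    A minimal sketch of how evaluation items like the two above could be constructed (illustrative only; the paper describes its own construction of these word-manipulation tasks):

      import random

      PROMPT = "Please unscramble the letters into a word, and write that word:"

      def reversed_word_item(word):
          # "Reversed Words": the letters are reversed; the model must recover the word.
          return f"{PROMPT} {word[::-1]} =", word

      def symbol_insertion_item(word, symbols=".!/-", rng=random):
          # "Symbol Insertion": random punctuation is sprinkled between the letters;
          # the model must output the clean word.
          noisy = word[0] + "".join(
              (rng.choice(symbols) if rng.random() < 0.5 else " ") + c for c in word[1:]
          )
          return f"{PROMPT} {noisy} =", word

      print(reversed_word_item("defeat")[0])        # ... taefed =
      print(symbol_insertion_item("reciprocal")[0]) # ... r e.c!i p ... =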
  12. Model sizes and compute cost
    Model Name              n_params  n_layers  d_model  n_heads  d_head  Batch Size  Learning Rate
    GPT-3 Small             125M      12        768      12       64      0.5M        6.0e-4
    GPT-3 Medium            350M      24        1024     16       64      0.5M        3.0e-4
    GPT-3 Large             760M      24        1536     16       96      0.5M        2.5e-4
    GPT-3 XL                1.3B      24        2048     24       128     1M          2.0e-4
    GPT-3 2.7B              2.7B      32        2560     32       80      1M          1.6e-4
    GPT-3 6.7B              6.7B      32        4096     32       128     2M          1.2e-4
    GPT-3 13B               13.0B     40        5140     40       128     2M          1.0e-4
    GPT-3 175B or "GPT-3"   175.0B    96        12288    96       128     3.2M        0.6e-4
    (Table 2.1 of the paper: sizes, architectures, and learning hyper-parameters, with batch size in tokens; all models were trained for a total of 300 billion tokens. The architecture is the same as GPT-2, except for alternating dense and locally banded sparse attention as in the Sparse Transformer.)
    GPT-2 (1.5B) is roughly the size of GPT-3 XL (1.3B).
    On a single GPU, training would take tens to hundreds of years (and is impossible anyway, since the model does not fit in GPU memory).
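    A rough back-of-the-envelope check of the "tens to hundreds of years on a single GPU" note, using the common ~6 × n_params × n_tokens estimate of training FLOPs; the GPU throughput figure is an assumption, not a number from the slides or the paper:

      # Very rough training-compute estimate for GPT-3 175B.
      n_params = 175e9                       # parameters
      n_tokens = 300e9                       # training tokens (Table 2.1)
      total_flops = 6 * n_params * n_tokens  # ~3.15e23 FLOPs (6*N*D rule of thumb)

      gpu_flops = 15e12                      # assumed sustained throughput of one 2020-era GPU
      years = total_flops / gpu_flops / (3600 * 24 * 365)
      print(f"{total_flops:.2e} FLOPs  ->  about {years:.0f} years on one GPU")  # ~665 years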
  13. Datasets for training the language model
    Dataset                  Quantity (tokens)  Weight in training mix  Epochs elapsed when training for 300B tokens
    Common Crawl (filtered)  410 billion        60%                     0.44
    WebText2                 19 billion         22%                     2.9
    Books1                   12 billion         8%                      1.9
    Books2                   55 billion         8%                      0.43
    Wikipedia                3 billion          3%                      3.4
    (Table 2.2 of the paper: "Weight in training mix" is the fraction of training examples drawn from a given dataset, intentionally not proportional to dataset size; when training for 300 billion tokens, some datasets are seen up to 3.4 times while others are seen less than once.)
    High-quality text is weighted more heavily.
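    A minimal sketch of drawing training documents according to the mixture weights rather than dataset size (illustrative only, not the paper's actual data pipeline):

      import random

      # Weights from Table 2.2 (rounded percentages; deliberately not proportional to size).
      mixture = {
          "common_crawl": 0.60,
          "webtext2": 0.22,
          "books1": 0.08,
          "books2": 0.08,
          "wikipedia": 0.03,
      }

      def sample_dataset(mixture, rng=random):
          # Pick which dataset the next training document is drawn from.
          names, weights = zip(*mixture.items())
          return rng.choices(names, weights=weights, k=1)[0]

      counts = {name: 0 for name in mixture}
      for _ in range(10_000):
          counts[sample_dataset(mixture)] += 1
      print(counts)   # roughly 60 / 22 / 8 / 8 / 3 percent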
  14. Experimental setup
    • How the in-context examples are chosen
    • Sampled at random from the training data
    • As many as fit in the context length (2,048 tokens) → 10 to 100 examples
    • Multiple-choice questions
    • Compute the likelihood of each option and choose the highest (see the sketch below)
    • Free-form questions
    • Generate the answer with beam search
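    A minimal sketch of the multiple-choice procedure mentioned above: score each option by its language-model likelihood given the context and pick the best; token_log_probs is a hypothetical stand-in for the model (the paper normalizes the likelihood, e.g. per token, on most tasks, and this sketch does the same):

      def score_option(context, option, token_log_probs):
          # token_log_probs(context, option) is a hypothetical callable returning
          # log P(token_i | context, option_<i) for every token of `option`.
          logps = token_log_probs(context, option)
          return sum(logps) / len(logps)     # per-token (length-normalized) log-likelihood

      def answer_multiple_choice(context, options, token_log_probs):
          # Choose the option whose completion the language model finds most likely.
          return max(options, key=lambda o: score_option(context, o, token_log_probs))

      # Toy usage with a dummy scorer that simply prefers shorter options.
      dummy = lambda ctx, opt: [-float(len(opt))]
      print(answer_multiple_choice("Q: 2+2? A:", [" 4", " 22"], dummy))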
  15. Tasks
    1. Language Modeling, Cloze, and Completion
    2. Closed Book Question Answering
    3. Translation
    4. Winograd-Style Tasks
    5. Common Sense Reasoning
    6. Reading Comprehension
    7. SuperGLUE
    8. NLI
    9. Synthetic and Qualitative Tasks
  16. 2. Closed Book Question Answering
    Setting                                        NaturalQS  WebQS  TriviaQA
    RAG (Fine-tuned, Open-Domain) [LPP+20]         44.5       45.5   68.0
    T5-11B+SSM (Fine-tuned, Closed-Book) [RRS20]   36.6       44.7   60.5
    T5-11B (Fine-tuned, Closed-Book)               34.5       37.4   50.1
    GPT-3 Zero-Shot                                14.6       14.4   64.3
    GPT-3 One-Shot                                 23.0       25.3   68.0
    GPT-3 Few-Shot                                 29.9       41.5   71.2
    (Table 3.3 of the paper: results on three open-domain QA tasks; the TriviaQA few-shot result is evaluated on the wiki split test server.)
    Example (Figure G.35, WebQA): Context → Q: What school did burne hogarth establish? A:  /  Completion → School of Visual Arts
    (Figure G.34: TriviaQA allows multiple valid completions, e.g. "Marcel Duchamp", "Duchampian", "marcel du champ", ...)
    Note: the "Open-Domain" setting (e.g. RAG) retrieves relevant documents; the closed-book settings do not.
  17. 2. Closed Book Question Answering
    (Same Table 3.3 as on the previous slide.)
    Figure 3.3 of the paper: on TriviaQA, GPT-3's performance grows smoothly with model size.
    Again, note that the "Open-Domain" setting retrieves relevant documents.
  18. 7. SuperGLUE (1/2)
    SuperGLUE results (Table 3.8 of the paper; all results on the test set; GPT-3 few-shot is given 32 examples in context per task and performs no gradient updates):
                            Avg    BoolQ  CB(acc)  CB(F1)  COPA   RTE
    Fine-tuned SOTA         89.0   91.0   96.9     93.9    94.8   92.5
    Fine-tuned BERT-Large   69.0   77.4   83.6     75.7    70.6   71.7
    GPT-3 Few-Shot          71.8   76.4   75.6     52.0    92.0   69.0
                            WiC    WSC    MultiRC(acc)  MultiRC(F1a)  ReCoRD(acc)  ReCoRD(F1)
    Fine-tuned SOTA         76.1   93.8   62.3          88.2          92.5         93.3
    Fine-tuned BERT-Large   69.6   64.6   24.1          70.0          71.3         72.0
    GPT-3 Few-Shot          49.4   80.1   30.5          75.4          90.2         91.1
    Example (WiC): Context → An outfitter provided everything needed for the safari. Before his first walking holiday, he went to a specialist outfitter to buy some boots. question: Is the word 'outfitter' used in the same way in the two sentences above? answer:  /  Completion → no
    Example (Figure G.5, COPA): Context → My body cast a shadow over the grass because  /  Correct Answer → the sun was rising.  /  Incorrect Answer → the grass was cut.
    [Also captured from the paper's appendix: part of a PIQA example (Figure G.4) and a ReCoRD example about Yuval Rabin with correct and incorrect candidate answers.]
    Some of these tasks are shoehorned into the few-shot format, so it is understandable that GPT-3 cannot do all of them.
  19. 7. SuperGLUE (2/2)
    [Figure 3.8 of the paper: performance on SuperGLUE increases with model size and with the number of examples in context. A value of K = 32 means the model was shown 32 examples per task, for 256 examples in total divided across the 8 tasks.]
  20. 9. Synthetic and Qualitative Tasks (1/2)
    Arithmetic:
    Context → Q: What is 98 plus 45? A:     Completion → 143   (Figure G.44, 2-digit addition)
    Context → Q: What is 95 times 45? A:    Completion → 4275  (Figure G.45, 2-digit multiplication)
    Context → Q: What is 509 minus 488? A:  Completion → 21
    These tasks suddenly become solvable once the model size is large enough.
    News article generation:
    Human accuracy at identifying whether short (~200 word) news articles were written by a model (Table 3.11, mean accuracy with 95% confidence interval):
    Control (deliberately bad model)  86%  (83%-90%)
    GPT-3 Small                       76%  (72%-80%)
    GPT-3 Medium                      61%  (58%-65%)
    GPT-3 Large                       68%  (64%-72%)
    GPT-3 XL                          62%  (59%-65%)
    GPT-3 2.7B                        62%  (58%-65%)
    GPT-3 6.7B                        60%  (56%-63%)
    GPT-3 13B                         55%  (52%-58%)
    GPT-3 175B                        52%  (49%-54%)
    Figure 3.14: the GPT-3-generated news article that humans had the greatest difficulty distinguishing from a human-written one (accuracy: 12%): "United Methodists Agree to Historic Split / Those who oppose gay marriage will form their own denomination".
    Also shown: another generated article, "Star's Tux Promise Draws Megyn Kelly's Sarcasm / Joaquin Phoenix pledged to not change for each awards event".
    For the largest model, humans essentially cannot tell whether the article was written by a human or by GPT-3.
  21. 9. Synthetic and Qualitative Tasks (2/2): Correcting English Grammar
    Poor English input: I eated the purple berries.
    Good English output: I ate the purple berries.
    Poor English input: Thank you for picking me as your designer. I'd appreciate it.
    Good English output: Thank you for choosing me as your designer. I appreciate it.
    Poor English input: The mentioned changes have done. or I did the alteration that you requested. or I changed things you wanted and did the modifications.
    Good English output: The requested changes have been made. or I made the alteration that you requested. or I changed things you wanted and made the modifications.
    Poor English input: I'd be more than happy to work with you in another project.
    Poor English input: Please provide me with a short brief of the design you're looking for and that'd be nice if you could share some examples or project you did before.
  22. Summary
    • Large improvements come from the combination of few-shot prompting and a huge number of parameters
    • Extremely strong at generation tasks
    • Though it sometimes produces repetitive or incoherent text
    • Analysis tasks also improve substantially, and on some tasks GPT-3 beats fine-tuning
    • Weak at tasks that compare two sentences (WiC), which is understandable
    • GPT-3 (probably) does not solve everything, but it holds a great deal of potential
    • The API has been released → Microsoft has taken an exclusive license