GPT-3

Language Models are Few-Shot Learners
https://arxiv.org/abs/2005.14165

tomohideshibata

October 10, 2020
Transcript

  1. Language Models are Few-Shot Learners (GPT-3)
    OpenAI
    柴田 知秀 (Tomohide Shibata)
    Note: most of the figures and tables are quoted from the GPT-3 paper.


  2. Overview
    Recent NLP: pre-training on large-scale text, then fine-tuning on each task
    (thousands to tens of thousands of labeled examples).
    Motivation: we would like to need no labeled data, or only a small amount, per task.

            Labeled data   Model parameters   Task accuracy     Other
    GPT-2   zero-shot      1.5B               still limited     fake-news generation
    GPT-3   few-shot       175B               fairly good       can do far more than expected


  3. Using GPT-3 for a wide range of applications
    • Search engines
    • Article generation
    • HTML layout generation
    • LaTeX formula generation
    • ...
    • Demos can be found at, for example:
    • http://deeplearning.hatenablog.com/entry/gpt3
    • https://github.com/elyase/awesome-gpt3
    • https://machinelearningtokyo.com/2020/07/26/10-cool-gpt-3-demos/


  4. Language Model
    • Defines the probability of generating a sentence
    • Uses
    • How plausible is a sentence generated by a system?
    • Classical speech recognition → acoustic model x language model ("language" in contrast to "acoustic")
    • Classical machine translation → translation model x language model
    • Sentence generation: generate sentences according to the probability
    • How the probability is estimated
    • Recently, with neural networks (see the sketch after this slide)
    P(w_1, w_2, w_3, ..., w_N) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ...
    Example: P(私, は, 学生, ...) = P(私) P(は | 私) P(学生 | 私, は) ...
    (The next few slides are background material.)
    Note: recent models such as BERT are also often called "language models" in the sense
    that they model language, but GPT-3 is a language model in the traditional sense above
    (see the next slide).
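    As a minimal illustration of the chain rule above (the conditional probabilities here are a
    hypothetical toy table; GPT-3 estimates them with a neural network):

        import math

        # Hypothetical conditional probabilities P(next word | preceding words) for
        # the example sentence "I am a student" (私 は 学生 です).
        cond_prob = {
            ("<s>",): {"I": 0.2},
            ("<s>", "I"): {"am": 0.5},
            ("<s>", "I", "am"): {"a": 0.4},
            ("<s>", "I", "am", "a"): {"student": 0.1},
        }

        def sentence_log_prob(words):
            """Sum log P(w_t | w_1..w_{t-1}) over the sentence (chain rule)."""
            context, total = ("<s>",), 0.0
            for w in words:
                total += math.log(cond_prob[context][w])
                context = context + (w,)
            return total

        # P(I, am, a, student) = 0.2 * 0.5 * 0.4 * 0.1 = 0.004
        print(math.exp(sentence_log_prob(["I", "am", "a", "student"])))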


  5. RNN language models [Mikolov+ 10] → now Transformers
    [Figure: an RNN language model unrolled over the input 私 は 学生 ("I am a student"):
    each input word x_t is fed through weights U into a hidden state h_t, which is carried
    forward through recurrent weights W and mapped through output weights V to a distribution
    y_t over the next word, e.g. P(学生 | 私, は).]
    Trained on large-scale unlabeled text. (A minimal code sketch follows.)
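    A minimal sketch of one step of such an RNN language model, using the same weight names
    (U, W, V) as the figure; the sizes and random weights below are hypothetical, whereas a
    real model learns them from large unlabeled text:

        import numpy as np

        rng = np.random.default_rng(0)
        vocab_size, hidden_size = 10, 8
        U = rng.normal(scale=0.1, size=(hidden_size, vocab_size))   # input -> hidden
        W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (recurrence)
        V = rng.normal(scale=0.1, size=(vocab_size, hidden_size))   # hidden -> output

        def rnn_lm_step(x_onehot, h_prev):
            """h_t = tanh(U x_t + W h_{t-1}); y_t = softmax(V h_t) over the vocabulary."""
            h = np.tanh(U @ x_onehot + W @ h_prev)
            logits = V @ h
            y = np.exp(logits - logits.max())
            return h, y / y.sum()

        h = np.zeros(hidden_size)
        for word_id in [3, 1, 7]:              # hypothetical ids for 私, は, 学生
            x = np.eye(vocab_size)[word_id]    # one-hot input vector
            h, next_word_dist = rnn_lm_step(x, h)
        print(next_word_dist)                  # P(next word | 私, は, 学生)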


  6. History of GPT
    • 2018.6: GPT-1
    • "Improving Language Understanding by Generative Pre-Training"
    • 2018.10: BERT
    • 2019.2: GPT-2
    • "Language Models are Unsupervised Multitask Learners" (= zero-shot)
    • 2020.5: GPT-3
    • "Language Models are Few-Shot Learners"
    This is where the two lines of work diverge sharply (BERT-style fine-tuning vs. GPT-2/3-style zero/few-shot).
    GPT is also read as "Generative Pre-trained Transformer".


  7. Pre-training and fine-tuning (GPT-1 vs. BERT)
    1. Pre-training
    • GPT-1 (task: language modeling): can only look at the preceding context
      (e.g. given 電池 の 消耗 が ..., predict の 消耗 が 激しい one token at a time)
    • BERT (task: cloze, fill in the blank): can look at both the left and the right context!
      (e.g. [CLS] 電池 の [MASK] が ... → predict 消耗)
    2. Fine-tuning (example task: sentiment analysis)
    • GPT-1: この 電池 は すぐ ... → negative
    • BERT: [CLS] この 電池 は すぐ ... → negative
    The parameters learned in pre-training are used as initial values and updated during fine-tuning.


  8. Pre-training and "fine-tuning" for GPT-2 and GPT-3
    1. Pre-training
    • Same as GPT-1 (task: language modeling)
      (e.g. given 電池 の 消耗 が ..., predict の 消耗 が 激しい)
    2. No fine-tuning: the parameters are not updated
    • GPT-1 (for comparison) is fine-tuned, e.g. この 電池 は すぐ ... → negative
    • GPT-2 and GPT-3 simply apply the language model as is
      (e.g. read この 電池 ... 英語: and continue with This battery ...)


  9. Why do GPT-2 and GPT-3 work?
    Example: English ⇒ French. Text like the following is abundant on the web, and the
    language model is trained on it:
    "I'm not the cleverest man in the world, but like they say in French: Je ne suis pas un imbecile [I'm not a fool]."
    In a now-deleted post from Aug. 16, Soheil Eid, Tory candidate in the riding of Joliette, wrote in French: "Mentez mentez, il en restera toujours quelque chose," which translates as, "Lie lie and something will always remain."
    "I hate the word 'perfume,'" Burr says. "It's somewhat better in French: 'parfum.'"
    If listened carefully at 29:55, a conversation can be heard between two guys in French: "-Comment on fait pour aller de l'autre côté? -Quel autre côté?", which means "-How do you get to the other side? -What side?".
    If this sounds like a bit of a stretch, consider this question in French: As-tu aller au cinéma?, or Did you go to the movies?, which literally translates as Have-you to go to movies/theater?
    (Examples quoted from the GPT-2 paper, "Language Models are Unsupervised Multitask Learners".)


  10. GPT-2, 3
    [Figure 2.1 of the paper: zero-shot, one-shot and few-shot prompting, contrasted with
    fine-tuning, as four ways of performing a task with a language model; one- and few-shot
    provide demonstrations at inference time, typically up to a few dozen examples in the
    context window. Zero-shot is roughly how GPT-2 was used; few-shot is the GPT-3 setting.]
    The language model reads [task description], ([example],) [prompt] as input and then
    generates the continuation; the parameters are not updated.
    (A prompt-construction sketch follows this slide.)
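    A minimal sketch of how such a few-shot prompt can be assembled (the "=>" formatting and
    the translation demonstrations follow Figure 2.1 of the paper; the helper function name is
    just illustrative):

        def build_few_shot_prompt(task_description, examples, query):
            """Concatenate [task description], [example]s and the final [prompt] into one text."""
            lines = [task_description]
            for source, target in examples:        # in-context demonstrations
                lines.append(f"{source} => {target}")
            lines.append(f"{query} =>")            # the model continues from here
            return "\n".join(lines)

        prompt = build_few_shot_prompt(
            "Translate English to French:",
            [("sea otter", "loutre de mer"), ("plush girafe", "girafe peluche")],
            "cheese",
        )
        print(prompt)
        # The frozen language model reads this prompt and generates the continuation
        # (e.g. "fromage"); no parameters are updated.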


  11. Number of examples vs. accuracy
    Symbol Insertion task (Figure G.26):
    Context → Please unscramble the letters into a word, and write that word: r e!c.i p r o.c a/l =
    Completion → reciprocal
    Reversed Words task (Figure G.27):
    Context → Please unscramble the letters into a word, and write that word: taefed =
    Completion → defeat


  12. Model size and compute cost
    Model Name             n_params  n_layers  d_model  n_heads  d_head  Batch Size  Learning Rate
    GPT-3 Small            125M      12        768      12       64      0.5M        6.0 × 10^-4
    GPT-3 Medium           350M      24        1024     16       64      0.5M        3.0 × 10^-4
    GPT-3 Large            760M      24        1536     16       96      0.5M        2.5 × 10^-4
    GPT-3 XL               1.3B      24        2048     24       128     1M          2.0 × 10^-4
    GPT-3 2.7B             2.7B      32        2560     32       80      1M          1.6 × 10^-4
    GPT-3 6.7B             6.7B      32        4096     32       128     2M          1.2 × 10^-4
    GPT-3 13B              13.0B     40        5140     40       128     2M          1.0 × 10^-4
    GPT-3 175B or "GPT-3"  175.0B    96        12288    96       128     3.2M        0.6 × 10^-4
    Table 2.1: Sizes, architectures, and learning hyper-parameters (batch size in tokens and learning rate) of the models trained. All models were trained for a total of 300 billion tokens.
    (Section 2.1 of the paper: the model and architecture are the same as GPT-2, including the modified initialization, pre-normalization, and reversible tokenization, except that alternating dense and locally banded sparse attention patterns are used, as in the Sparse Transformer. Eight model sizes are trained, spanning three orders of magnitude from 125M to 175B parameters; the largest is the model called GPT-3.)
    GPT-2 (1.5B) is roughly the size of GPT-3 XL.
    With a single GPU, training would take tens to hundreds of years (and is in fact impossible, since the model would not even fit in GPU memory).
    (A back-of-the-envelope parameter count follows this slide.)
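    A rough sanity check of the parameter counts in Table 2.1, using the common rule of thumb
    (an assumption, not something stated in the paper) that a GPT-style transformer has about
    12 · n_layers · d_model^2 parameters in its attention and feed-forward blocks, ignoring
    embeddings:

        def approx_params(n_layers, d_model):
            # 4*d^2 for the attention projections + 8*d^2 for the feed-forward block, per layer
            return 12 * n_layers * d_model ** 2

        for name, n_layers, d_model in [("GPT-3 Small", 12, 768),
                                        ("GPT-3 XL", 24, 2048),
                                        ("GPT-3 175B", 96, 12288)]:
            print(f"{name}: ~{approx_params(n_layers, d_model) / 1e9:.1f}B parameters")
        # GPT-3 Small: ~0.1B, GPT-3 XL: ~1.2B, GPT-3 175B: ~173.9B -- close to the table above.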


  13. Datasets for training the language model
    Dataset                  Quantity (tokens)  Weight in training mix  Epochs elapsed when training for 300B tokens
    Common Crawl (filtered)  410 billion        60%                     0.44
    WebText2                 19 billion         22%                     2.9
    Books1                   12 billion         8%                      1.9
    Books2                   55 billion         8%                      0.43
    Wikipedia                3 billion          3%                      3.4
    Table 2.2: Datasets used to train GPT-3. "Weight in training mix" refers to the fraction of examples during training drawn from a given dataset, which is intentionally not made proportional to the size of the dataset. When training for 300 billion tokens, some datasets are seen up to 3.4 times during training while others are seen less than once.
    (Figure 2.2 of the paper: total compute used during training. Following the scaling-law analysis, much larger models are trained on many fewer tokens than is typical; the calculation methodology is in Appendix D.)
    Higher-quality text is given a larger weight (see the sampling sketch after this slide).
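    A minimal sketch of sampling training examples according to the "weight in training mix"
    column rather than in proportion to dataset size (the sampling loop itself is a
    simplification, not the paper's actual data pipeline):

        import random

        datasets = ["Common Crawl (filtered)", "WebText2", "Books1", "Books2", "Wikipedia"]
        weights = [0.60, 0.22, 0.08, 0.08, 0.03]   # fraction of training examples per dataset

        random.seed(0)
        counts = {name: 0 for name in datasets}
        for _ in range(10_000):                    # draw 10k examples for illustration
            counts[random.choices(datasets, weights=weights)[0]] += 1
        print(counts)  # roughly 60% / 22% / 8% / 8% / 3%, regardless of each dataset's size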


  14. Experiments
    • How in-context examples are selected
    • Sampled at random from the training data
    • As many as fit in the context length (2,048 tokens) → 10 to 100 examples
    • Multiple-choice questions
    • Compute the likelihood of each choice and pick the highest (see the sketch after this slide)
    • Free-form questions
    • Generate the answer with beam search
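    A minimal sketch of the multiple-choice procedure: score each candidate answer by its
    log-likelihood under the language model and pick the best one. The scorer below is a
    hypothetical stand-in; a real implementation would sum the model's log P(token | preceding
    tokens) over the continuation:

        def lm_log_prob(context, continuation):
            # Hypothetical toy scores standing in for a real language model
            # (COPA-style example: "My body cast a shadow over the grass because ...").
            toy_scores = {"the sun was rising.": -2.1, "the grass was cut.": -5.7}
            return toy_scores[continuation]

        def pick_choice(context, choices):
            """Return the choice with the highest likelihood given the context."""
            return max(choices, key=lambda c: lm_log_prob(context, c))

        context = "My body cast a shadow over the grass because"
        choices = ["the sun was rising.", "the grass was cut."]
        print(pick_choice(context, choices))       # -> "the sun was rising."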


  15. Tasks
    1. Language Modeling, Cloze, and Completion
    2. Closed Book Question Answering
    3. Translation
    4. Winograd-Style Tasks
    5. Common Sense Reasoning
    6. Reading Comprehension
    7. SuperGLUE
    8. NLI
    9. Synthetic and Qualitative Tasks


  16. 2. Closed Book Question Answering
    Setting                                       NaturalQS  WebQS  TriviaQA
    RAG (Fine-tuned, Open-Domain) [LPP+20]        44.5       45.5   68.0
    T5-11B+SSM (Fine-tuned, Closed-Book) [RRS20]  36.6       44.7   60.5
    T5-11B (Fine-tuned, Closed-Book)              34.5       37.4   50.1
    GPT-3 Zero-Shot                               14.6       14.4   64.3
    GPT-3 One-Shot                                23.0       25.3   68.0
    GPT-3 Few-Shot                                29.9       41.5   71.2
    Table 3.3: Results on three open-domain QA tasks. GPT-3 is shown in the few-, one-, and zero-shot settings, compared to prior SOTA results for closed-book and open-domain settings. The TriviaQA few-shot result is evaluated on the wiki split test server.
    "Open-Domain" systems such as RAG first retrieve relevant documents; closed-book models answer without retrieval.
    WebQA example (Figure G.35):
    Context → Q: What school did burne hogarth establish?  A:
    Completion → School of Visual Arts
    TriviaQA allows multiple valid completions (Figure G.34), e.g. Marcel Duchamp, MARCEL DUCHAMP, marcel du champ, Duchamp.


  17. 2. Closed Book Question Answering
    (Same Table 3.3 as on the previous slide.)
    Figure 3.3: On TriviaQA, GPT-3's performance grows smoothly with model size, suggesting that language models continue to absorb knowledge as their capacity increases.
    "Open-Domain" systems retrieve relevant documents; GPT-3 answers without retrieval.


  18. 7. SuperGLUE (1/2)
                            SuperGLUE  BoolQ  CB     CB    COPA   RTE
                            Average    Acc.   Acc.   F1    Acc.   Acc.
    Fine-tuned SOTA         89.0       91.0   96.9   93.9  94.8   92.5
    Fine-tuned BERT-Large   69.0       77.4   83.6   75.7  70.6   71.7
    GPT-3 Few-Shot          71.8       76.4   75.6   52.0  92.0   69.0

                            WiC    WSC    MultiRC  MultiRC  ReCoRD  ReCoRD
                            Acc.   Acc.   Acc.     F1a      Acc.    F1
    Fine-tuned SOTA         76.1   93.8   62.3     88.2     92.5    93.3
    Fine-tuned BERT-Large   69.6   64.6   24.1     70.0     71.3    72.0
    GPT-3 Few-Shot          49.4   80.1   30.5     75.4     90.2    91.1
    Table 3.8: Performance of GPT-3 on SuperGLUE compared to fine-tuned baselines and SOTA. All results are reported on the test set. GPT-3 few-shot is given a total of 32 examples within the context of each task and performs no gradient updates.
    WiC example:
    Context → An outfitter provided everything needed for the safari. Before his first walking holiday, he went to a specialist outfitter to buy some boots.
      question: Is the word 'outfitter' used in the same way in the two sentences above?
      answer:
    Completion → no
    COPA example (Figure G.5):
    Context → My body cast a shadow over the grass because
    Correct Answer → the sun was rising.
    Incorrect Answer → the grass was cut.
    (The captured page also shows a ReCoRD example about an op-ed by Yuval Rabin, truncated in this transcript.)
    GPT-3 is applied to these tasks somewhat forcibly, so it is not surprising that it fails on some of them.


  19. 7. SuperGLUE (2/2)
    Figure 3.8: Performance on SuperGLUE increases with model size and number of examples in context. A value of K = 32 means that our model was shown 32 examples per task, for 256 examples total divided across the 8 tasks in SuperGLUE.


  20. 9. Synthetic and Qualitative Tasks (1/2)
    Arithmetic
    Context → Q: What is 98 plus 45?  A:       Completion → 143   (Figure G.44, 2-digit addition)
    Context → Q: What is 95 times 45?  A:      Completion → 4275  (Figure G.45, 2-digit multiplication)
    Context → Q: What is 509 minus 488?  A:    Completion → 21
    As the model size grows, these tasks rather suddenly become solvable.

    News Article Generation
                                      Mean accuracy  95% Confidence Interval (low, hi)
    Control (deliberately bad model)  86%            83%–90%
    GPT-3 Small                       76%            72%–80%
    GPT-3 Medium                      61%            58%–65%
    GPT-3 Large                       68%            64%–72%
    GPT-3 XL                          62%            59%–65%
    GPT-3 2.7B                        62%            58%–65%
    GPT-3 6.7B                        60%            56%–63%
    GPT-3 13B                         55%            52%–58%
    GPT-3 175B                        52%            49%–54%
    Table 3.11: Human accuracy in identifying whether short (~200 word) news articles are model generated (measured by the ratio of correct assignments to non-neutral assignments).
    People can barely tell whether an article was written by a human or by GPT-3.
    Figure 3.14: the GPT-3 generated news article that humans had the greatest difficulty distinguishing from a human-written article (accuracy: 12%):
    Title: United Methodists Agree to Historic Split
    Subtitle: Those who oppose gay marriage will form their own denomination
    Another generated example:
    Title: Star's Tux Promise Draws Megyn Kelly's Sarcasm
    Subtitle: Joaquin Phoenix pledged to not change for each awards event


  21. 9. Synthetic and Qualitative Tasks (2/2)
    Poor English input: I eated the purple berries.
    Good English output: I ate the purple berries.
    Poor English input: Thank you for picking me as your designer. I’d appreciate it.
    Good English output: Thank you for choosing me as your designer. I appreciate it.
    Poor English input: The mentioned changes have done. or I did the alteration that you
    requested. or I changed things you wanted and did the modifications.
    Good English output: The requested changes have been made. or I made the alteration that you
    requested. or I changed things you wanted and made the modifications.
    Poor English input: I’d be more than happy to work with you in another project.
    Poor English input: Please provide me with a short brief of the design you’re looking for and
    that’d be nice if you could share some examples or project you did before.
    Correcting English Grammar


  22. Summary
    • The combination of few-shot prompting and a model with a huge number of parameters brings large improvements
    • Extremely strong on generation tasks
    • It still occasionally produces repetitive or incoherent text
    • Analysis-style tasks also improve substantially, and on some tasks GPT-3 beats fine-tuned models
    • Tasks that compare two sentences (e.g. WiC) remain weak, which is understandable
    • GPT-3 will (presumably) not solve everything, but it holds a great deal of potential
    • The API has been released → Microsoft signed an exclusive license agreement
