Paper Explainer: InstructGPT: Training language models to follow instructions with human feedback

Paper explainer: InstructGPT: Training language models to follow instructions with human feedback
Takehiro Matsuda

2 ChatGPT keeps getting more amazing … https://twitter.com/i/status/1598994604912939009
"The almighty AI has settled the name of that snack you all keep arguing about."

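3 Paper information
Title: Training language models to follow instructions with human feedback
• Paper: https://arxiv.org/abs/2203.02155
• Code: none
• Venue: NeurIPS 2022
• Authors: Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, et al.
• Affiliation: OpenAI
Why I chose it:
• It is the predecessor of ChatGPT, which has been getting a lot of attention recently.
• I wanted to understand how a large model is tuned with human feedback.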

4 Comparing ChatGPT and InstructGPT
ChatGPT
• Based on GPT-3.5 (training finished in the first half of 2022).
• Trained mainly on conversation (chat) data.
• (Presumably tuned with the same method as InstructGPT plus some additions.)
InstructGPT
• Based on GPT-3 (announced July 2020).
• Tuned (aligned) using prompts submitted to the OpenAI API and feedback from hired human labelers.
https://openai.com/blog/chatgpt/
We trained this model using Reinforcement Learning from Human Feedback (RLHF), using the same methods as InstructGPT, but with slight differences in the data collection setup. We trained an initial model using supervised fine-tuning: human AI trainers provided conversations in which they played both sides, the user and an AI assistant. We gave the trainers access to model-written suggestions to help them compose their responses.

5 GPT-3 Overview
[Figure: decoder block repeated N times: Embedding Layer, Layer Normalization, Masked Self-Attention, Layer Normalization, MLP]
NeurIPS 2020 https://arxiv.org/abs/2005.14165
• Has 175 billion parameters.
• Trained on a corpus of more than 570 GB of text, filtered from 45 TB of Common Crawl data (collected from the internet) covering 2016 to 2019.

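6 Issues with GPT-3
GPT-3 is highly versatile, can generate human-like text, and demonstrated the power of large language models. However, problems remained.
• It quite often fails to produce the output the user expects.
• It sometimes outputs inaccurate answers.
• It sometimes outputs morally objectionable or biased answers.
Can the model be aligned so that it produces the outputs users (humans) prefer?
Train the model from human feedback rather than from a hand-designed loss function alone: RLHF (Reinforcement Learning from Human Feedback).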

7 InstructGPT Overview
3 steps
• Step 1: Supervised fine-tuning (SFT)
• Step 2: Training a reward model
• Step 3: RLHF (Reinforcement Learning from Human Feedback)

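8 Step 1 SFT (Supervised Fine-Tuning): Train a supervised policy
Trained labelers (humans) write input prompts and the desired output text for each prompt. GPT-3 is fine-tuned on this data, and the resulting model is called the SFT (Supervised Fine-Tuning) model. About 13,000 training examples are created.
Training settings
• 16 epochs
• cosine learning rate decay
• residual dropout of 0.2
The final SFT model is selected by its RM score (described later) on the validation dataset.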
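The cosine learning rate decay listed in the training settings can be sketched as follows. This is a minimal illustration, not the authors' training code: the function name `cosine_lr`, the peak learning rate, and the floor fraction `final_frac` are assumptions chosen for the example and are not stated on the slide.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float,
              final_frac: float = 0.1) -> float:
    """Cosine decay from peak_lr down to final_frac * peak_lr.

    final_frac is an illustrative assumption; the slide only says
    "cosine learning rate decay".
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so steps past the end stay at the floor.
    progress = min(max(step / total_steps, 0.0), 1.0)
    floor = final_frac * peak_lr
    return floor + 0.5 * (peak_lr - floor) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical peak learning rate, printed at a few points in training.
    for s in (0, 25, 50, 75, 100):
        print(s, cosine_lr(s, total_steps=100, peak_lr=1e-5))
```

The schedule starts at the peak value, falls slowly at first, fastest in the middle, and flattens out near the floor at the end of training.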

9 Labeler (human) information
40 labelers were hired through Upwork and Scale AI. They communicated with the research team through meetings, chat, and so on to share goals and direction. 96% are English speakers.
Guidelines given to labelers
• Plain: We simply ask the labelers to come up with an arbitrary task, while ensuring the tasks had sufficient diversity.
• Few-shot: We ask the labelers to come up with an instruction, and multiple query/response pairs for that instruction.
• User-based: We had a number of use-cases stated in waitlist applications to the OpenAI API. We asked labelers to come up with prompts corresponding to these use cases.
The paper notes that the labelers do not cover the full demographic distribution of humanity. (They seem to skew toward younger people with science or computing backgrounds.)

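10 Step 2 RM (Reward Modeling): Train with comparison data
Build a model that rates the quality of an output (generated text) in place of a human. The final layer of the SFT model is removed and replaced with a layer that outputs a scalar value (score). (The architecture is basically GPT-3 with 6 billion parameters.) The training data consists of rankings of outputs for about 33,000 prompts from the API and from labelers.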

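11 RM data
For each prompt, several outputs (K = 4 to 9) are prepared and shown in an interface like the one on the left of the slide [figure: ranking interface]. The rankings that labelers (humans) assign to those outputs become the training data.
Helpful: is it useful to the user?
Truthfulness: is it correct?
Harmless: is it free of harmful content that could hurt people?
On the priority among these criteria: helpfulness is prioritized for training data, while truthfulness and harmlessness are prioritized for evaluation data.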
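To see how one ranking turns into training comparisons, here is a small sketch that expands a ranked list of K outputs into (preferred, rejected) pairs. The function name `ranking_to_pairs` is hypothetical, and the sketch assumes a strict ranking with no ties.

```python
from itertools import combinations


def ranking_to_pairs(ranked_outputs):
    """Expand a best-first ranking into (preferred, rejected) pairs.

    combinations() keeps input order, so the first element of every
    pair is always the higher-ranked output. Ties are not handled.
    """
    return list(combinations(ranked_outputs, 2))


if __name__ == "__main__":
    # K outputs give K * (K - 1) / 2 comparison pairs.
    for k in (4, 9):
        outputs = [f"output_{i}" for i in range(k)]
        print(k, len(ranking_to_pairs(outputs)))
```

A single labeled prompt with K = 4 outputs therefore yields 6 comparisons, and one with K = 9 yields 36, which is why ranking is a more efficient way to collect comparison data than labeling pairs one at a time.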

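12 RM training
[Equation figure: reward model loss, with the following annotations]
• better output (higher rank)
• worse output (lower rank)
• reward model with parameters θ
• sigmoid
The loss is computed over every pairwise combination of the K = 4 to 9 outputs.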
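The loss that these annotations describe is, in the InstructGPT paper, loss(θ) = −(1 / C(K,2)) · E[ log σ( r_θ(x, y_w) − r_θ(x, y_l) ) ], where y_w is the better output and y_l the worse one of a pair. The sketch below computes it for a single prompt, given the scalar rewards of its K outputs listed best first. The function names are hypothetical, and real training would run on batched tensors in a deep learning framework rather than Python floats.

```python
import math
from itertools import combinations


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def pairwise_ranking_loss(rewards_best_first):
    """Mean of -log(sigmoid(r_w - r_l)) over all C(K, 2) pairs.

    rewards_best_first: reward model scores for one prompt's outputs,
    ordered from the highest-ranked output to the lowest-ranked one.
    """
    pairs = list(combinations(rewards_best_first, 2))
    if not pairs:
        raise ValueError("need at least two outputs to form a pair")
    total = sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs)
    return -total / len(pairs)


if __name__ == "__main__":
    print(pairwise_ranking_loss([3.0, 2.0, 1.0, 0.0]))  # agrees with ranking
    print(pairwise_ranking_loss([0.0, 1.0, 2.0, 3.0]))  # contradicts ranking
```

The loss falls below log 2 when the reward model scores the outputs in the same order as the labeler, and rises above it when the order is reversed.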

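13 Step 3 RLHF (Reinforcement Learning from Human Feedback)
The SFT model is fine-tuned with reinforcement learning to maximize the score given by the reward model obtained in Step 2. PPO (Proximal Policy Optimization) is used. Training uses about 31,000 prompts collected from the API.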

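14 PPO: Reinforcement Learning
[Equation figure: RL objective, with the following annotations]
1. Reward model with parameters θ: the basic goal is to maximize this.
2. KL regularization term: introduced because we want to avoid the policy drifting too far from the original SFT model and producing broken outputs.
3. Pretraining term: added to avoid performance degradation on public NLP benchmark datasets (the alignment tax). The model trained with this term is called the PPO-ptx model.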
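For reference, the objective that this slide annotates can be written out as below. This is reconstructed from the InstructGPT paper rather than from the slide text, so the symbols φ (policy parameters), β (KL coefficient), and γ (pretraining coefficient) do not appear in the transcript itself.

```latex
\operatorname{objective}(\phi) =
  \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}
  \left[ r_\theta(x,y)
         - \beta \log \frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}
                           {\pi^{\mathrm{SFT}}(y \mid x)} \right]
  + \gamma \, \mathbb{E}_{x \sim D_{\mathrm{pretrain}}}
  \left[ \log \pi_\phi^{\mathrm{RL}}(x) \right]
```

The first term inside the brackets is the reward model score, the β term is the KL penalty that keeps the policy close to the SFT model, and the γ term mixes in the pretraining objective. Setting γ = 0 gives plain PPO, and γ > 0 gives PPO-ptx.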

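15 Model update
Step 2 (building or updating the reward model) and Step 3 (reinforcement learning) are run repeatedly to improve performance. For diverse language tasks where it is hard to assign an absolute score (a ground-truth value) directly, this cycle is a good way to tackle the problem of getting a model to produce the answers humans find desirable.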

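16 Result
Human evaluations of various models on our API prompt distribution.
The metric is the fraction of cases in which people preferred a model's output over that of the 175B-parameter SFT model. PPO-ptx and PPO beat the 175B SFT model even at 1.3B parameters. SFT and RLHF are effective for producing outputs that humans prefer.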
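The win-rate metric used in this figure can be sketched in a few lines. The function name `win_rate` and the sample judgments are hypothetical and serve only to show the arithmetic; they are not results from the paper.

```python
def win_rate(preferences):
    """Fraction of comparisons in which the candidate model won.

    preferences: iterable of booleans, True when the labeler preferred
    the candidate model's output over the baseline's output.
    """
    prefs = list(preferences)
    if not prefs:
        raise ValueError("no comparisons given")
    return sum(prefs) / len(prefs)


if __name__ == "__main__":
    # Made-up judgments for illustration only.
    print(win_rate([True, False, True, True]))
```

A value above 0.5 means labelers preferred the candidate over the baseline more often than not. By construction the baseline compared against itself sits at 0.5.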

17 output distribution

18 Evaluation for truthfulness using the TruthfulQA dataset (Lin et al., 2021).
Gray bars: Truthfulness, Colored bars: Truthfulness and Informativeness
Instruction: the model is told to reply "I have no comment" when it cannot give a confident answer.

19 TruthfulQA

20 Evaluation of harmlessness using RealToxicityPrompts (Gehman et al., 2020).

21 RealToxicityPrompts

22 Winogender

23 Labeling RealToxicityPrompts distribution

24 RealToxicityPrompts: Different instruction input

25 Prompt about Non-English, Code
The model can also interpret non-English languages and code fairly well.

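26 Mistake samples
• When an instruction contains a false premise, the model goes along with it.
• Instead of answering simply, it hedges excessively and avoids a direct answer, which makes the response muddled.
• Degradation was also seen when multiple constraints were imposed, e.g. “list 10 movies made in the 1930’s set in France”.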
