post-training - Speaker Deck

Slide 1

Slide 1 text

Post-training LLMs: From Algorithms to Training Infrastructures

Slide 2

Slide 2 text

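$whoami
• Name: Junichiro Takahashi (高橋淳一郎)
• Fourth-year student, Department of Information and Communication Engineering, Faculty of Engineering, The University of Tokyo
• My main work is HPC research
• Let's all buy NVIDIA Project DIGITS
• Part-time research assistant (RA) at the University of Tokyo Hospital
• Attended a NeurIPS 2024 workshop
• X: @takanas0517
• Implementation experience
  • Pretraining, using the NanoGPT repository as a reference
  • SFT on QA data
  • Have run TRL a little
Photo captions: a stray cat near my home; the GPU I use for research, a GH200 (so cool!!)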

Slide 3

Slide 3 text

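Today's talk
• A recap of the post-training workshop that ByteDance gave at NeurIPS 2024, rebroadcast with extra information added
• Many developers are likely in the audience today, so I added a good deal of information on implementation and applications
• The material is public, so the main content can be viewed at any time on ByteDance's GitHub →
• The original talk was, I believe, about 120 minutes long, so I will move quickly. My apologies.
• Today's topic is post-training
• Preparing these slides made me painfully aware of the gaps in my reinforcement learning knowledge, so please correct me where needed.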

Slide 4

Slide 4 text

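What is post-training?
• All training performed after next-word-prediction pretraining
• It adds the ability to follow instructions, reduces harmful outputs, reflects preferences, and so on
https://ai.gopubby.com/the-training-pipeline-of-large-language-models-afd5fa57df46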

Slide 5

Slide 5 text

Post-training Algorithms Figure from [Ouyang et al. 2022]

Slide 6

Slide 6 text

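SFT
• A tuning method that trains on a prompt together with its corresponding response as the ground-truth label
• Example: Prompt: ぬるぽ (nullpo) Response: ガッ！ (ga!)
• Gemini's post-training team is excellent, so it handles even ancient otaku memes ↓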

Slide 7

Slide 7 text

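How the training is implemented
• Tokenize the prompt and the response, then concatenate them
• Compute the loss only over the response part ← this depends on the framework, but typically you set ignore_index on the prompt part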
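The masking step above can be sketched in plain Python. This is a minimal illustration, not any framework's implementation: `build_example` and `masked_next_token_loss` are hypothetical helper names, and the sentinel -100 is chosen because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_example(prompt_ids, response_ids):
    """Concatenate prompt and response; mask the prompt positions in labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


def masked_next_token_loss(logits, labels):
    """Mean cross-entropy over response tokens only.

    logits[t] is the (unnormalized) score vector predicting token t + 1.
    Positions whose target is IGNORE_INDEX contribute nothing to the loss.
    """
    total, count = 0.0, 0
    for t in range(len(labels) - 1):
        target = labels[t + 1]
        if target == IGNORE_INDEX:
            continue
        row = logits[t]
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[target]
        count += 1
    return total / count if count else 0.0
```

With a prompt of two tokens and a response of two tokens, only the two response tokens enter the average, so the prompt never pulls on the gradients.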

Slide 8

Slide 8 text

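Counterpoints on SFT…
• Opinions vary, but some argue that if prompts and their corresponding responses are already included at the pretraining stage, SFT may not be needed at all → Instruction Pre-Training: Language Models are Supervised Multitask Learners
• A NeurIPS 2024 poster also argued that, depending on the ratio of prompt length to response length, SFT works well simply by computing the loss the same way as in pretraining → Instruction Tuning With Loss Over Instructions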

Slide 9

Slide 9 text

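RL
• The approach of using reinforcement learning to train LLMs
• A detailed theoretical treatment would take too long, so today I will only convey the general idea (ways to learn more come later)
• Reinforcement learning requires designing a reward function
  • Human preferences: turn preference rankings into a reward
  • Code: turn unit test results into a reward
  • Math: compare against the ground-truth answer and reward by whether it matches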
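The two rule-based rewards in the list can be sketched as follows. Both functions are hypothetical illustrations, not code from the talk: `math_reward` does a simple normalized comparison, and `unit_test_reward` runs candidate code with `exec`, which a real pipeline would replace with a sandboxed runner.

```python
def math_reward(model_answer: str, ground_truth: str) -> float:
    """Return 1.0 if the answer matches the ground truth, else 0.0."""
    a, b = model_answer.strip(), ground_truth.strip()
    try:
        # Compare numerically when both sides parse as numbers.
        return 1.0 if float(a) == float(b) else 0.0
    except ValueError:
        # Otherwise fall back to an exact string match.
        return 1.0 if a == b else 0.0


def unit_test_reward(code: str, tests) -> float:
    """Return 1.0 if the code defines what every test needs and all pass.

    `tests` is a list of callables that receive the executed namespace and
    raise (for example via assert) on failure. Assumption: the code is
    trusted enough to exec; real systems isolate this step.
    """
    namespace = {}
    try:
        exec(code, namespace)
        for test in tests:
            test(namespace)
    except Exception:
        return 0.0
    return 1.0
```

Both rewards are binary and only available once a full response exists, which is the sparsity that the later process-supervision slides address.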

Slide 10

Slide 10 text

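RL for LLM
• Basic strategy: train a policy that maximizes the reward
→ Train the model (the policy) so that it matches human preferences, passes unit tests, or solves math problems (the reward)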

Slide 11

Slide 11 text

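How do we do it? 🤔
• Case: offline Proximal Policy Optimization (PPO)
• Reward setup: train a classification model on data that reflects human preferences
• Objective function (maximized); x is the input and y is the output
• r_θ(x, y) is the reward
• The second term keeps the policy's distribution from drifting too far from the SFT model
• The third term keeps it from drifting too far from the distribution learned during pretraining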
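The equation image on this slide did not survive in the transcript. The three terms described match the RLHF objective of Ouyang et al. 2022 (the paper cited on the earlier figure slide), reproduced here for reference. The symbols beyond x, y, and r_θ are that paper's notation: φ for the policy parameters, β for the KL coefficient, and γ for the pretraining-mix coefficient.

```latex
\mathrm{objective}(\phi) =
  \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}
  \left[ r_\theta(x,y)
       - \beta \log \frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \right]
  + \gamma \, \mathbb{E}_{x \sim D_{\mathrm{pretrain}}}
  \left[ \log \pi_\phi^{\mathrm{RL}}(x) \right]
```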

Slide 12

Slide 12 text

About the third term
• Isn't the pretraining data usually impossible to get hold of?
• Yes, so OpenRLHF and similar frameworks appear to skip this term when it is None
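The "skip it when None" behavior can be illustrated on a single sample. This is a hypothetical sketch of the idea, not OpenRLHF's code: the function name, argument names, and the per-sample scalar form are all assumptions made for illustration.

```python
from typing import Optional


def rlhf_objective(
    reward: float,
    logp_rl: float,
    logp_sft: float,
    beta: float,
    pretrain_logp: Optional[float] = None,
    gamma: float = 0.0,
) -> float:
    """Per-sample value of the objective described on the previous slide.

    reward        : r_theta(x, y) from the reward model
    logp_rl       : log-probability of y under the policy being trained
    logp_sft      : log-probability of y under the frozen SFT model
    pretrain_logp : log-probability of a pretraining sample, or None
    """
    # First and second terms: reward minus the penalty for leaving the SFT model.
    value = reward - beta * (logp_rl - logp_sft)
    # Third term: only added when pretraining data is actually available.
    if pretrain_logp is not None:
        value += gamma * pretrain_logp
    return value
```

Leaving `pretrain_logp` at its default reduces the objective to the reward plus the KL-style penalty, which is the common setup when the original pretraining corpus is unavailable.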

Slide 13

Slide 13 text

Zheng et al. 2023

Slide 14

Slide 14 text

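Other algorithms
• DPO: PPO needs too many compute resources, so DPO solves the PPO objective in closed form and trains on the result
  • It is derived while ignoring the third term of the formula
  • No reward model training is needed
• DQO: RL methods before DQO did not train well on tasks involving multi-step reasoning, so DQO formulates training as a Markov Decision Process
  • The OpenReview discussion looks rough, though… https://openreview.net/forum?id=k2q0rUX2lx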
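For a single preference pair, the standard DPO loss is -log σ(β[(log π(y_w|x) - log π_ref(y_w|x)) - (log π(y_l|x) - log π_ref(y_l|x))]), where y_w is the preferred response and y_l the rejected one. A minimal sketch follows; the function and argument names are illustrative, and the inputs are assumed to be summed sequence log-probabilities.

```python
import math


def dpo_loss(
    pi_logp_chosen: float,
    pi_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair of responses."""
    # Implicit reward margin: how much more the policy favors the chosen
    # response than the reference model does, relative to the rejected one.
    margin = beta * (
        (pi_logp_chosen - ref_logp_chosen)
        - (pi_logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The reference model plays the role of the SFT model in the PPO objective, and no separate reward model appears anywhere in the computation.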

Slide 15

Slide 15 text

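Aside: I want to get started with reinforcement learning, but mapping the math to the code is hard and I don't know where to begin
• I think volume 4 of the ゼロから作るDeep Learning ("Deep Learning from Scratch") series is the best recommendation
• It balances implementation and theory well, and beginners can read it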

Slide 16

Slide 16 text

Aside: I want to reach the cutting edge
• I follow Hugging Face's course. It goes all the way to RLHF, which is great.

Slide 17

Slide 17 text

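Reinforcement Learning for Code Generation
• Reward signals for RL in code generation
  • Whether the unit tests pass without errors
  • Speed, coding conventions, and so on
• However, the typical RL workflow generates the entire program and then judges unit test pass/fail, which makes it hard to improve the code iteratively.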

Slide 18

Slide 18 text

Process Supervision-Guided RL
By providing intermediate rewards that assess the correctness of partial code sequences, our approach guides the model more effectively toward generating correct programs.
● Traditional method: Reinforcement Learning from Unit Test Feedback (RLTF)
● Our method: Process Supervision-Guided RLTF
Code repo and talk materials (slides from): https://github.com/eric-haibin-lin/verl-data/tree/neurips
Paper: https://arxiv.org/html/2410.17621v1

Slide 19

Slide 19 text

Process Reward Model (PRM)
What is a PRM? A model that provides feedback for each line of code during generation, enabling step-by-step corrections and guidance.
● A PRM evaluates the correctness of each line of code by predicting a score between -1 (incorrect) and +1 (correct), providing feedback for the generated code snippet.

Slide 20

Slide 20 text

Process Supervision-Guided Policy Optimization
In the RL training phase, the PRM is integrated to supply dense rewards (line-by-line guidance) and value initialization (starting points based on PRM predictions).

Slide 21

Slide 21 text

PRM Data Collection Process
For each generated code sequence, a binary search using a BoN (Best-of-N) completer identifies the first line where errors occur.
Step 1: Start with the whole code sequence
Step 2: Check whether the prefix (up to a midpoint line) can be completed by the BoN completer into a correct program that passes the tests
Step 3: Narrow down by adjusting the midpoint until finding the line where errors start
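The three steps can be sketched as a binary search over prefix lengths. Everything named here is hypothetical: `can_complete` stands in for the BoN completer plus the unit tests, and the search assumes that if a prefix can be completed, every shorter prefix can too. The labeling rule (+1 before the first error, -1 from it onward) is one plausible reading, not something the slide states.

```python
from typing import Callable, List, Optional, Sequence


def first_error_line(
    lines: Sequence[str],
    can_complete: Callable[[Sequence[str]], bool],
) -> Optional[int]:
    """Return the 0-based index of the first erroneous line, or None.

    can_complete(prefix) should report whether the prefix can still be
    extended into a program that passes the tests.
    """
    if not lines or can_complete(lines):
        return None  # the whole program is already correct
    lo, hi = 0, len(lines)  # prefix[:lo] is completable, prefix[:hi] is not
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if can_complete(lines[:mid]):
            lo = mid
        else:
            hi = mid
    return hi - 1  # adding this line is what breaks completability


def label_lines(
    lines: Sequence[str],
    can_complete: Callable[[Sequence[str]], bool],
) -> List[int]:
    """Assumed labeling: +1 before the first error, -1 from it onward."""
    k = first_error_line(lines, can_complete)
    if k is None:
        return [1] * len(lines)
    return [1] * k + [-1] * (len(lines) - k)
```

The search calls the completer O(log n) times per program instead of once per line, which matters because each call samples several completions and runs the tests.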

Slide 22

Slide 22 text

Overall Training Pipeline
To develop a PRM capable of providing meaningful process supervision throughout the RL training process, we designed the following training pipeline to create a robust PRM and integrate it into RL training:
1) RL Baseline Training: Fine-tune the SFT policy using PPO
2) PRM Data Collection:
● Sample multiple policy checkpoints in RL baseline training to cover the state space
● Collect training data for PRM using binary search to label actions for each checkpoint
3) PRM Training: Train the PRM using regression loss on collected data
4) Integrating PRM into RL:
● Start from scratch, fine-tune the SFT policy using RL with PRM
● Use PRM as dense, step-wise rewards in PPO (DenseReward)
● Use PRM as the initialization of the critic in PPO (ValueInit)

Slide 23

Slide 23 text

Supplement: on labeling

Slide 24

Slide 24 text

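Supplement: techniques that the ByteDance slides did not explain but the actual paper describes
• Normalize the reward by the length of the generated text
• Set the reward for comment lines to 0
• During PPO, add whether the unit tests passed to the reward alongside the PRM reward; this is included to curb over-fitting to the PRM.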
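The three techniques can be combined into one per-line reward function. This is a rough sketch under stated assumptions, not the paper's implementation: length is measured here in lines, a comment is any line starting with `#`, the unit test outcome is a 0/1 bonus on the final line, and `shape_rewards` and `prm_weight` are made-up names.

```python
from typing import List, Sequence


def shape_rewards(
    lines: Sequence[str],
    prm_scores: Sequence[float],
    passed_tests: bool,
    prm_weight: float = 1.0,
) -> List[float]:
    """Return one reward per generated line.

    prm_scores[i] is the PRM score in [-1, +1] for lines[i].
    """
    n = len(lines)
    rewards = []
    for line, score in zip(lines, prm_scores):
        if line.strip().startswith("#"):
            rewards.append(0.0)  # comment lines earn nothing
        else:
            rewards.append(prm_weight * score / n)  # length-normalized
    if rewards:
        # Outcome reward: keeps the policy anchored to real test results.
        rewards[-1] += 1.0 if passed_tests else 0.0
    return rewards
```

Dividing by the length removes the incentive to pad a response with extra lines that the PRM scores favorably, and zeroing comments closes the easiest way to collect such lines.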

Slide 25

Slide 25 text

PRM Enhances Exploration in RL Training
We compare the Best-of-K performance of the policy learned across all four settings on the training set. Both DenseReward and ValueInit independently enhance performance. Furthermore, when both are enabled, the model achieves the greatest improvement.
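For readers unfamiliar with the metric, one common reading of Best-of-K is "a problem counts as solved if any of K sampled programs passes its tests". The sketch below implements that reading; it is an assumption about the metric's definition, and `best_of_k_rate` is a hypothetical name.

```python
from typing import Sequence


def best_of_k_rate(results: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of problems solved by at least one of the first k samples.

    results[i][j] is True when sample j for problem i passed the tests.
    """
    solved = sum(1 for samples in results if any(samples[:k]))
    return solved / len(results)
```

Under this reading, the metric rises with K only if the policy produces diverse attempts, which is why it serves as a proxy for exploration.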

Slide 26

Slide 26 text

PRM Improves Long-horizon Code Generation
We measure the improvement of the policy trained with PRM compared to the baseline policy and analyze its effect based on the length of the generated responses. Our findings show that PRM provides greater benefits for generating longer responses.

Slide 27

Slide 27 text

Trying out a PRM in practice
• Hugging Face appears to have one, although it is experimental → https://huggingface.co/docs/trl/prm_trainer
• I am wary about whether HF's TRL works correctly, though, so I will comment after reading the source code.
• I also often see code that uses OpenRLHF: https://github.com/OpenRLHF/OpenRLHF
• OpenRLHF has a PRM as well: https://github.com/OpenRLHF/OpenRLHF/blob/main/openrlhf/trainer/prm_trainer.py

Slide 28

Slide 28 text

Aside: Jeff Dean's NeurIPS talk
• He discussed areas where applying AI to computer systems looks promising
• e.g., compiler optimization, chip design support, model inference cost reduction
https://youtu.be/2A31Amaq_c?si=dPYhGY5ROA3x6uiV

Slide 29

Slide 29 text

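Aside: DeepSeek became a hot topic while I was making these slides.
• From now on, how about holding ML events one week, or even three days, after they are planned?
• It appears to be trained with a training flow quite different from the post-training I explained today.
• That aside, the paper also has interesting elements from an HPC standpoint, so let's talk about it at the reception
• I found a repo that replicates DeepSeek-R1, so I am sharing it: https://github.com/hkust-nlp/simpleRL-reason/tree/main
• Hugging Face seems to have released one recently too: https://github.com/huggingface/open-r1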