throughout the RL training process, we designed the following training pipeline to create a robust PRM and integrate it into RL training:

1) RL Baseline Training: fine-tune the SFT policy using PPO.
2) PRM Data Collection:
   • Sample multiple policy checkpoints during RL baseline training to cover the state space.
   • Collect PRM training data by using binary search to label the actions of each checkpoint (see the labeling sketch after this list).
3) PRM Training: train the PRM with a regression loss on the collected data (see the regression-loss sketch below).
4) Integrating PRM into RL:
   • Starting from scratch, fine-tune the SFT policy using RL with the PRM.
   • Use the PRM as a dense, step-wise reward in PPO (DenseReward; see the reward-shaping sketch below).
   • Use the PRM to initialize the critic in PPO (ValueInit).

Figure: Overall Training Pipeline.

Code repo and talk materials: slides from https://github.com/eric-haibin-lin/verl-data/tree/neurips
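The binary-search labeling in step 2 relies on correctness being roughly monotone along a solution: once a step is wrong, later prefixes do not recover. Below is a minimal Python sketch of that idea; `prefix_is_recoverable` is a hypothetical oracle (in practice, e.g., Monte Carlo rollouts from the sampled policy checkpoint that check whether any continuation still reaches the correct final answer) and is not an API from the talk or the verl repo.

```python
from typing import Callable, List


def label_steps_binary_search(
    steps: List[str],
    prefix_is_recoverable: Callable[[List[str]], bool],
) -> List[int]:
    """Label each step of a solution with 0/1 using binary search.

    Finds the first step after which the prefix is no longer recoverable,
    labels all earlier steps 1 and all later steps 0.  Assumes the oracle
    is monotone: once a prefix fails, every longer prefix also fails.
    """
    lo, hi = 1, len(steps) + 1  # candidate index of the first bad prefix
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_is_recoverable(steps[:mid]):
            lo = mid + 1  # prefix of length mid is still fine, look later
        else:
            hi = mid      # prefix of length mid already fails, look earlier
    first_bad = lo        # len(steps) + 1 if every prefix is recoverable
    return [1 if i < first_bad else 0 for i in range(1, len(steps) + 1)]


if __name__ == "__main__":
    demo_steps = ["step 1", "step 2 (first error)", "step 3", "step 4"]
    # Hypothetical oracle for the demo: prefixes containing the erroneous step fail.
    oracle = lambda prefix: all("error" not in s for s in prefix)
    print(label_steps_binary_search(demo_steps, oracle))  # [1, 0, 0, 0]
```

Binary search cuts the number of oracle calls from linear in the number of steps to logarithmic, which matters when each call requires several policy rollouts.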
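For step 3, here is a minimal sketch of the regression objective: a scalar head on top of the backbone LM's hidden states is trained with an MSE loss against the 0/1 step labels, evaluated only at step-boundary tokens. The layer names, shapes, and masking scheme are illustrative assumptions, not the exact PRM architecture from the talk.

```python
import torch
import torch.nn as nn


class PRMHead(nn.Module):
    """Scalar regression head that scores each token position."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: [batch, seq_len, hidden_size] from the backbone LM
        return self.score(hidden_states).squeeze(-1)  # [batch, seq_len]


def prm_regression_loss(step_scores, step_labels, step_mask):
    """MSE between predicted scores and 0/1 labels, masked to step-boundary tokens."""
    sq_err = (step_scores - step_labels) ** 2
    return (sq_err * step_mask).sum() / step_mask.sum().clamp(min=1)


if __name__ == "__main__":
    torch.manual_seed(0)
    batch, seq_len, hidden = 2, 16, 32
    head = PRMHead(hidden)
    hidden_states = torch.randn(batch, seq_len, hidden)     # stand-in for LM outputs
    labels = torch.randint(0, 2, (batch, seq_len)).float()  # binary-search labels
    mask = torch.zeros(batch, seq_len)
    mask[:, 3::4] = 1.0                                      # pretend every 4th token ends a step
    loss = prm_regression_loss(head(hidden_states), labels, mask)
    loss.backward()
    print(float(loss))
```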
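For step 4's DenseReward variant, one common way to turn PRM scores into step-wise PPO rewards is to add the PRM score at each step boundary on top of the sparse outcome reward placed on the final token. The sketch below assumes that shaping scheme and a hypothetical weighting coefficient; it is not verl's exact reward-manager API. The ValueInit variant instead loads the trained PRM weights as the initialization of the PPO critic, so the value function starts from step-level score estimates rather than from scratch.

```python
import torch


def dense_rewards_from_prm(
    prm_step_scores: torch.Tensor,  # [seq_len] PRM score for each response token
    step_end_mask: torch.Tensor,    # [seq_len] 1.0 where a reasoning step ends
    outcome_reward: float,          # sparse final reward (e.g. answer correctness)
    prm_coef: float = 0.1,          # hypothetical weighting coefficient
) -> torch.Tensor:
    """Build token-level rewards: PRM score at step boundaries, outcome reward on the last token."""
    rewards = prm_coef * prm_step_scores * step_end_mask
    rewards[-1] = rewards[-1] + outcome_reward
    return rewards


if __name__ == "__main__":
    scores = torch.tensor([0.2, 0.9, 0.1, 0.8, 0.7, 0.95])
    ends = torch.tensor([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])  # three steps end here
    print(dense_rewards_from_prm(scores, ends, outcome_reward=1.0))
```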