
post-training

SuperHotDog
January 27, 2025

Transcript

1. $whoami • Name: Junichiro Takahashi (高橋淳一郎) • 4th-year student, Department of Electronic and Information Engineering, Faculty of Engineering, The University of Tokyo • My main work is HPC research • Go buy an NVIDIA Project DIGITS • Part-time RA at the University of Tokyo Hospital • Attended a NeurIPS 2024 workshop • X: @takanas0517 • Implementation experience: pretraining based on the NanoGPT repository; SFT with QA data; have run TRL a little (Photos: a stray cat near my house; the GH200 GPU we use for research, so cool!!)
2. On SFT… • There are various theories, but there has also been some buzz around the claim that if you already prepare prompts and their corresponding responses at the pretraining stage, SFT may not be needed at all → Instruction Pre-Training: Language Models are Supervised Multitask Learners • A NeurIPS 2024 poster also argued that, depending on the prompt-to-response length ratio, simply computing the loss the way pretraining does (over the instruction as well) is enough for good SFT → Instruction Tuning With Loss Over Instructions (a sketch of this loss-masking difference follows below)
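
The difference between the two loss conventions is easiest to see in code. Below is a minimal, hypothetical sketch (the function and argument names are mine, not from the papers): standard SFT masks the prompt tokens out of the loss, while the "loss over instructions" variant keeps them, which makes the objective look like pretraining.

```python
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_len, loss_over_instruction=False):
    """Next-token cross-entropy for one prompt+response sequence.

    logits: (seq_len, vocab) model outputs; input_ids: (seq_len,) token ids;
    prompt_len: number of prompt tokens at the start of the sequence.
    """
    shift_logits = logits[:-1]             # position t predicts token t+1
    shift_labels = input_ids[1:].clone()
    if not loss_over_instruction:
        # Standard SFT: no loss on the prompt tokens, only on the response.
        shift_labels[: prompt_len - 1] = -100
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
```
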
3. How do we do it 🤔 • The offline Proximal Policy Optimization (PPO) case • Reward setup: train a classification model on data reflecting human preferences • Objective function (to maximize), where x is the input and y is the output • r_θ(x, y) is the reward • the second term keeps the policy's distribution from drifting too far from the SFT model • the third term keeps it from drifting too far from the pretraining-time distribution (a reconstruction of the objective is shown below)
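
The three-term objective described on this slide matches the InstructGPT-style RLHF objective; a reconstruction in LaTeX (the coefficients β, γ and the distribution names are conventional notation, not taken from the slide):

```latex
\text{objective}(\phi) =
  \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}
    \Big[ r_\theta(x, y)
          - \beta \log \frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \Big]
  + \gamma \, \mathbb{E}_{x \sim D_{\mathrm{pretrain}}}
    \big[ \log \pi_\phi^{\mathrm{RL}}(x) \big]
```

The first term is the learned reward, the second is the per-sample penalty toward the SFT model, and the third is the pretraining term that keeps the policy close to the pretraining distribution.
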
4. Other algorithms • DPO: PPO needs too much computational resource, so DPO solves the PPO objective in closed form and trains on that solution directly • it drops the third term (the pretraining term) of the objective when solving • no reward-model training is needed (the resulting loss is shown below) • DQO: RL before DQO struggled to train on tasks that require multi-step reasoning, so DQO trains using a Markov Decision Process formulation • the OpenReview discussion looks rough, though… https://openreview.net/forum?id=k2q0rUX2lx
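
For reference, the closed-form result the slide refers to is the standard DPO loss, where y_w and y_l are the preferred and dispreferred responses and π_ref is the frozen SFT reference policy:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\phi; \pi_{\mathrm{ref}}) =
  -\, \mathbb{E}_{(x, y_w, y_l) \sim D}
  \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\phi(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\phi(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```
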
5. Reinforcement Learning for Code Generation • Reward signals for RL in code generation: • whether the unit tests pass without errors • speed, coding conventions, and so on • However, with the typical RL workflow that generates the whole program and then judges pass/fail on the unit tests, it becomes hard to improve the code iteratively (a sketch of that sparse reward follows below)
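
For concreteness, here is a minimal sketch of the sparse, program-level unit-test reward used by that "typical" workflow (the file names and the pytest invocation are illustrative assumptions, not from the talk):

```python
import os
import subprocess
import tempfile

def unit_test_reward(code: str, test_code: str, timeout: float = 10.0) -> float:
    """Return +1.0 if all unit tests pass, -1.0 otherwise (or on timeout)."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "solution.py"), "w") as f:
            f.write(code)
        with open(os.path.join(tmp, "test_solution.py"), "w") as f:
            f.write(test_code)
        try:
            result = subprocess.run(
                ["python", "-m", "pytest", "-q", "test_solution.py"],
                cwd=tmp, capture_output=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return -1.0  # treat hanging code as a failure
        return 1.0 if result.returncode == 0 else -1.0
```

The reward arrives only once, after the whole program has been generated, which is exactly why the following slides introduce intermediate (process) rewards.
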
6. By providing intermediate rewards that assess the correctness of partial code sequences, our approach guides the model more effectively toward generating correct programs • Traditional method: Reinforcement Learning from Unit Test Feedback (RLTF) • Our method: Process Supervision-Guided RLTF (Process Supervision-Guided RL) • Slides from the code repo and talk materials: https://github.com/eric-haibin-lin/verl-data/tree/neurips • Paper: https://arxiv.org/html/2410.17621v1
7. What is a PRM? A model that provides feedback for each line of code during generation, enabling step-by-step corrections and guidance. • A PRM evaluates the correctness of each line of code by predicting a score between -1 (incorrect) and +1 (correct), providing feedback for the generated code snippet. Process Reward Model (PRM)
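
One way such a per-line scorer could look in code. This is a hypothetical sketch: the class name, the scalar head on top of a causal LM, and reading the score at each end-of-line token are my assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LinePRM(nn.Module):
    """Scores each line of generated code with a value in [-1, 1]."""

    def __init__(self, base_lm: nn.Module, hidden_size: int):
        super().__init__()
        self.base_lm = base_lm                      # any causal LM exposing hidden states
        self.score_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask, line_end_positions):
        # Last-layer hidden states: (batch, seq_len, hidden_size)
        hidden = self.base_lm(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        ).hidden_states[-1]
        scores = torch.tanh(self.score_head(hidden)).squeeze(-1)   # (batch, seq_len)
        # Keep one score per line: the score at each line's final token.
        return torch.gather(scores, 1, line_end_positions)
```
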
8. In the RL training phase, the PRM is integrated to supply dense rewards (line-by-line guidance) and value initialization (starting points based on PRM predictions). Process Supervision-Guided Policy Optimization
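
A rough sketch of how the dense-reward side could be wired into PPO, assuming an additive combination of per-line PRM scores and the terminal unit-test reward (the coefficient and the combination rule are assumptions; the paper's exact shaping may differ):

```python
def dense_rewards(line_scores, final_unit_test_reward, dense_coef=0.5):
    """Blend per-line PRM scores with the sparse end-of-sequence reward.

    line_scores: PRM scores in [-1, 1], one per generated line.
    Returns one reward per line, with the unit-test reward added at the end.
    """
    rewards = [dense_coef * s for s in line_scores]
    rewards[-1] += final_unit_test_reward   # terminal reward on the last step
    return rewards
```

ValueInit would then correspond to initializing the PPO critic from the trained PRM weights rather than from scratch.
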
9. For each generated code sequence, a binary search identifies the first line where errors occur, using a Best-of-N (BoN) completer. PRM Data Collection Process • Step 1: Start with the whole code sequence • Step 2: Check whether the prefix (up to a midpoint line) can be completed into a correct program that passes the tests with the BoN completer • Step 3: Narrow down by adjusting the midpoint until finding the line where errors start
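
A minimal sketch of that binary search, assuming correctness of prefixes is monotone (once a prefix can no longer be completed into a passing program, no longer prefix can be). Here `can_complete` is a hypothetical callback standing in for the Best-of-N completion plus unit-test check.

```python
def first_error_line(code_lines, can_complete):
    """Return the index of the first line that breaks completability,
    or len(code_lines) if the whole sequence is fine.

    can_complete(prefix_lines) -> bool: True if a Best-of-N completer can
    extend the prefix into a program that passes the unit tests.
    """
    if can_complete(code_lines):
        return len(code_lines)             # no error line found
    lo, hi = 0, len(code_lines)            # prefix of length lo: good; length hi: bad
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if can_complete(code_lines[:mid]):
            lo = mid
        else:
            hi = mid
    return hi - 1                          # index of the first bad line
```

Lines before that index can then be labeled +1 and the rest -1, matching the score range on the PRM slide (my reading of the data-collection step).
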
10. To develop a PRM capable of providing meaningful process supervision throughout the RL training process, we designed the following training pipeline to create a robust PRM and integrate it into RL training:
    1) RL Baseline Training: fine-tune the SFT policy using PPO
    2) PRM Data Collection: • sample multiple policy checkpoints from the RL baseline run to cover the state space • collect PRM training data by using the binary search to label actions for each checkpoint
    3) PRM Training: train the PRM using a regression loss on the collected data (a sketch follows below)
    4) Integrating PRM into RL: • starting from scratch, fine-tune the SFT policy using RL with the PRM • use the PRM as dense, step-wise rewards in PPO (DenseReward) • use the PRM as the initialization of the critic in PPO (ValueInit)
    Overall Training Pipeline
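
A minimal sketch of step 3, the PRM regression loss, reusing the hypothetical `LinePRM` sketched above (the field names in the batch are assumptions, not the paper's schema):

```python
import torch.nn.functional as F

def prm_regression_loss(prm, batch):
    """MSE between predicted per-line scores and the binary-search labels
    (+1 for lines before the first error, -1 from it onward)."""
    pred = prm(
        batch["input_ids"],
        batch["attention_mask"],
        batch["line_end_positions"],
    )                                      # (batch, num_lines), values in [-1, 1]
    return F.mse_loss(pred, batch["labels"].float())
```
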
11. PRM Enhances Exploration in RL Training • We compare the Best-of-K performance of the policy learned across all four settings on the training set. Both DenseReward and ValueInit independently enhance performance. Furthermore, when both are enabled, the model achieves the greatest improvement.
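
For clarity, Best-of-K typically means a problem counts as solved if any of K sampled programs passes its tests; a tiny illustrative sketch of the metric (the data layout is my assumption):

```python
def best_of_k(pass_flags_per_problem, k):
    """pass_flags_per_problem: one list of booleans per problem, each flag
    saying whether a sampled program passed the unit tests."""
    solved = sum(any(flags[:k]) for flags in pass_flags_per_problem)
    return solved / len(pass_flags_per_problem)
```
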
12. PRM Improves Long-horizon Code Generation • We measure the improvement of the policy trained with PRM compared to the baseline policy and analyze its effect based on the length of the generated responses. Our findings show that PRM provides greater benefits for generating longer responses.
13. Aside: Jeff Dean's NeurIPS talk • He discussed areas where applying AI to computer systems looks interesting • e.g. compiler optimization, chip design support, reducing model inference cost, and so on https://youtu.be/2A31Amaq_c?si=dPYhGY5ROA3x6uiV