[Paper Introduction] Steering Your Diffusion Policy with Latent Space Reinforcement Learning

2025/7/17
Paper Introduction @Tanichu-lab.
https://sites.google.com/view/tanichu-lab-ku/home-jp


Rei Ando

July 17, 2025

Transcript

  1. Paper Information
     Title: Steering Your Diffusion Policy with Latent Space Reinforcement Learning
     Authors: Andrew Wagenmaker et al.
     Pub. date: 2025/6/25
     Link: https://diffusion-steering.github.io
  2. Background
     Diffusion Policy is a method for modeling complex action distributions and shows high performance in Behavior Cloning (Imitation Learning).
     [Diagram: initial noise w ~ N(0, I) → denoising process → action, conditioned on images, language instructions, etc.]
     Fine-tuning is one way to adapt the model to new environments, but it involves various challenges.
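     To make the pipeline in the diagram concrete, here is a minimal sketch of diffusion-policy action sampling. The denoise_step network and the loop structure are assumptions for illustration, not the authors' implementation; the point is only that the action is obtained by denoising the initial noise w conditioned on the state.

```python
import torch

def sample_action(denoise_step, s, action_dim, num_steps=16):
    # Start from initial noise w ~ N(0, I), then run the conditional
    # denoising process pi_dp(s, w) to obtain an action.
    w = torch.randn(action_dim)
    a = w
    for t in reversed(range(num_steps)):
        a = denoise_step(a, t, s)  # one denoising step conditioned on state s
    return a
```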
  3. DSRL - overview
     Diffusion Steering via Reinforcement Learning (DSRL): optimize the Diffusion Policy's initial noise with Reinforcement Learning, instead of fine-tuning the model parameters.
     Notation: s = state, a = action, w = initial noise, π_dp = denoising process.
  4. DSRL - detail (1/3)
     Original Diffusion Policy: a ~ π_dp(s, w), where s is the state, s′ the next state, a the action, and w the initial noise.
     Considering the MDP, the state transitions satisfy P(s′|s, a) = P(s′|s, π_dp(s, w)), so we can define
     State transitions: P_w(s′|s, w) := P(s′|s, π_dp(s, w))
     Reward function: r_w(s, w) := r(s, π_dp(s, w))
     → We can treat w as the action; this defines the latent-action MDP.
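     As a rough illustration of the latent-action MDP (a sketch assuming a gym-style environment API, not the paper's code), the RL agent outputs the noise w and the frozen diffusion policy decodes it into a real action:

```python
class LatentActionMDP:
    # Assumes env.reset() -> s and env.step(a) -> (s', r, done, info);
    # pi_dp is the frozen denoising process mapping (s, w) -> a.
    def __init__(self, env, pi_dp):
        self.env = env
        self.pi_dp = pi_dp
        self.s = None

    def reset(self):
        self.s = self.env.reset()
        return self.s

    def step(self, w):
        # Decode the latent action a = pi_dp(s, w), then step the real env,
        # so P_w(s'|s, w) = P(s'|s, pi_dp(s, w)) and r_w(s, w) = r(s, pi_dp(s, w)).
        a = self.pi_dp(self.s, w)
        s_next, r, done, info = self.env.step(a)
        self.s = s_next
        return s_next, r, done, info
```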
  5. DSRL - detail (2/3)
     In the latent-action MDP, the initial noise w is treated as the action, so we can define a policy over w; optimizing this policy can be an effective way to steer action generation.
     Optimize the policy with a DDPG-style method. Actor: w ~ π_w(s); critic: Q_w(s, w).
     However, offline datasets consist of (s, a, r, s′) tuples, so data in the latent-action MDP is not directly usable. DSRL therefore adopts an additional critic to bridge the real environment and the latent-action environment. A-critic: Q_A(s, a).
  6. DSRL - detail (3/3)
     Actor: π_w(s); critic: Q_w(s, w); A-critic: Q_A(s, a).
     Training combines three updates: a TD error for Q_A, matching the two critics (fitting Q_w to Q_A), and a policy gradient for π_w. The added networks are shallow NNs.
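     One reading of these three updates, as a hedged PyTorch-style sketch: the function names, the batch format, and the way the next latent action is decoded through π_dp are my assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def dsrl_losses(Q_A, Q_w, pi_w, pi_dp, batch, gamma=0.99):
    s, a, r, s_next = batch

    # 1) TD error: train the A-critic Q_A on real transitions (s, a, r, s').
    #    The next latent action comes from pi_w and is decoded through pi_dp.
    with torch.no_grad():
        a_next = pi_dp(s_next, pi_w(s_next))
        td_target = r + gamma * Q_A(s_next, a_next)
    loss_td = F.mse_loss(Q_A(s, a), td_target)

    # 2) Match the two critics: Q_w(s, w) should agree with Q_A(s, pi_dp(s, w))
    #    for freshly sampled noise w (same shape as the action, by assumption).
    w = torch.randn_like(a)
    with torch.no_grad():
        q_a_value = Q_A(s, pi_dp(s, w))
    loss_match = F.mse_loss(Q_w(s, w), q_a_value)

    # 3) Policy gradient (DDPG-style): push pi_w toward noise with high Q_w.
    loss_actor = -Q_w(s, pi_w(s)).mean()

    return loss_td, loss_match, loss_actor
```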
  7. Experiment - for Task-Specific Policy
     Task: 4 tasks from Robomimic (simulation).
     Base π_dp: Diffusion Policies from public checkpoints and ones prepared by the authors.
     Trials with 5 seeds.
     Result: DSRL realizes near-optimal behavior and effectively modifies the base policy's behavior.
  8. Experiment - for Generalized Policy
     π_0 (https://www.pi.website/blog/pi0) is adopted as the generalized base policy.
     Collect 60-80 online data points; evaluated in both a simulation environment and a real environment, trials with 3 seeds.
     Result: improvement is shown with only a small amount of data.
  9. Conclusion
     A novel method, DSRL, is proposed.
     • Higher performance is achieved in various experiments by optimizing the initial noise of a Diffusion Policy.
     • Compared with fine-tuning, a much smaller dataset is enough to obtain improvement.
     Limitations
     • Exploration capability is limited by π_dp.
     • A reward signal must be designed for online training.
  10. Appendix - DQN and DDPG
      DQN: approximate the true action value with an NN, Q(s, a) ≈ Q_θ(s, a).
      Objective: E_{(s,a,r,s′)~B} [ (r + γ max_{a′} Q_θ(s′, a′) − Q_θ(s, a))² ]
      If a is continuous, the max over a′ is difficult to compute.
      DDPG: approximate the value in the continuous-action case. Actor: μ_φ(s); critic: Q_θ(s, a).
      Objective for θ: E_{(s,a,r,s′)~B} [ (r + γ Q_θ(s′, μ_φ(s′)) − Q_θ(s, a))² ]
      Objective for φ: E_{s~B} [ −Q_θ(s, μ_φ(s)) ]
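      The same two DDPG objectives as a minimal PyTorch-style sketch (illustrative only; Q, mu, and the batch format are placeholder names assumed here, not code from the paper):

```python
import torch
import torch.nn.functional as F

def ddpg_losses(Q, mu, batch, gamma=0.99):
    s, a, r, s_next = batch

    # Critic loss (for theta): regress Q(s, a) toward r + gamma * Q(s', mu(s')).
    with torch.no_grad():
        target = r + gamma * Q(s_next, mu(s_next))
    critic_loss = F.mse_loss(Q(s, a), target)

    # Actor loss (for phi): minimizing -Q(s, mu(s)) ascends the critic.
    actor_loss = -Q(s, mu(s)).mean()

    return critic_loss, actor_loss
```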