Slide 1


Steering Your Diffusion Policy with Latent Space Reinforcement Learning
Symbol Emergence Systems Lab, Rei Ando

Slide 2


Paper Information
Title: Steering Your Diffusion Policy with Latent Space Reinforcement Learning
Authors: Andrew Wagenmaker et al.
Pub. date: 2025/6/25
Link: https://diffusion-steering.github.io

Slide 3


Contents
1. Background
2. DSRL
   1. Overview
   2. Detail
3. Experiment
   1. For Task-Specific Policy
   2. For Generalized Policy
4. Conclusion

Slide 4


Background
Diffusion Policy is a method to represent complex action distributions, showing high performance in Behavior Cloning (Imitation Learning).
The denoising process maps initial noise w ~ N(0, I) to an action, conditioned on observations (images, language instructions, etc.).
Fine-tuning is a way to adapt the model to new environments, but it faces various challenges.

Slide 5


DSRL - Overview
Diffusion Steering via Reinforcement Learning (DSRL): optimize the Diffusion Policy's initial noise with Reinforcement Learning, instead of the model parameters with fine-tuning.
Notation: s: state, a: action, w: initial noise, Ο€_dp: denoising process.

Slide 6


DSRL - Detail (1/3)
Original Diffusion Policy: a ~ Ο€_dp(s, w)
(s: state, s': next state, a: action, w: initial noise)

Considering the MDP, the state transitions satisfy
P(s' | s, a) = P(s' | s, Ο€_dp(s, w)) =: P_w(s' | s, w),
so we can regard w as the action. Defining the reward function
r_w(s, w) := r(s, Ο€_dp(s, w)),
the state transitions P_w and reward r_w form the latent-action MDP.
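The latent-action MDP above can be sketched as an environment wrapper whose `step` takes the initial noise w instead of the action. Everything here is a toy stand-in (the environment, dynamics, and `pi_dp` are illustrative, not the paper's models); only the wrapping structure is the point.

```python
import numpy as np

# Hypothetical toy environment, for illustration only.
class ToyEnv:
    def __init__(self):
        self.state = np.zeros(2)
    def step(self, action):
        # Simple dynamics: move toward the action; reward = -distance to goal (1, 1).
        self.state = self.state + 0.5 * (action - self.state)
        reward = -np.linalg.norm(self.state - np.ones(2))
        return self.state.copy(), reward

def pi_dp(state, w, n_steps=5):
    """Stand-in denoising process: deterministically maps (state, initial noise w)
    to an action. A real diffusion policy would run learned denoising steps."""
    a = w
    for _ in range(n_steps):
        a = a - 0.1 * (a - state)  # placeholder "denoising" update
    return a

class LatentActionEnv:
    """Wraps an env so the agent's action is the initial noise w:
    P_w(s'|s, w) := P(s'|s, pi_dp(s, w)), r_w(s, w) := r(s, pi_dp(s, w))."""
    def __init__(self, env):
        self.env = env
    def step(self, w):
        a = pi_dp(self.env.state, w)  # denoise w into a real action
        return self.env.step(a)       # transition in the underlying MDP

env = LatentActionEnv(ToyEnv())
w = np.random.default_rng(0).standard_normal(2)  # w ~ N(0, I)
s_next, r = env.step(w)
```

Any RL algorithm can now treat `LatentActionEnv` as an ordinary MDP over w, while the diffusion policy itself stays frozen.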

Slide 7


DSRL - Detail (2/3)
In the latent-action MDP, the initial noise w is treated as the action, so we can define a policy over w; optimizing this policy can effectively steer the generated actions.

The policy is optimized with the DDPG method:
actor: w ~ Ο€_w(s)
critic: Q_w(s, w)

However, offline datasets consist of (s, a, r, s') tuples, so data in the latent-action MDP is not directly usable. An additional critic is adopted to bridge the real environment and the latent-action environment:
A-critic: Q_a(s, a)
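The bridging idea can be illustrated in a few lines: an A-critic Q_a is (assumed to be) fit on real (s, a, r, s') data, and the latent critic Q_w is then fit to Q_a evaluated through the denoiser, so the noise that produced each dataset action never needs to be known. All functions below are hypothetical scalar stand-ins, not the paper's networks.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative stand-ins: a linear "denoiser" and an A-critic assumed to be
# already trained on real (s, a, r, s') transitions.
def pi_dp(s, w):
    return 0.5 * s + 0.3 * w   # denoising process: (state, noise) -> action

def Q_a(s, a):
    return -(a - 0.1) ** 2     # in this toy, the best action is a = 0.1

# Bridge: fit Q_w(s, w) to Q_a(s, pi_dp(s, w)) on freshly sampled noises,
# so the latent critic is learned without the unobserved dataset noises.
s = 0.2
ws = rng.standard_normal(256)
targets = Q_a(s, pi_dp(s, ws))
X = np.stack([np.ones_like(ws), ws, ws ** 2], axis=1)  # tiny quadratic model
coef, *_ = np.linalg.lstsq(X, targets, rcond=None)

# The actor can now pick w maximizing Q_w; analytic argmax of the quadratic.
w_star = -coef[1] / (2 * coef[2])
```

Here `pi_dp(s, w_star)` recovers the toy-optimal action 0.1, showing how optimizing over w steers the frozen denoiser.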

Slide 8


DSRL - Detail (3/3)
Networks: actor Ο€_w(s), critic Q_w(s, w), A-critic Q_a(s, a); each consists of a shallow NN.
Training losses:
β€’ TD error: update the A-critic Q_a(s, a) on real transitions.
β€’ Match the two critics: fit Q_w(s, w) to Q_a(s, Ο€_dp(s, w)).
β€’ Policy gradient: update the actor Ο€_w to maximize Q_w(s, Ο€_w(s)).
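The three losses can be written out on a toy batch. All "networks" below are hypothetical closed-form functions standing in for the shallow NNs; only the structure of the three loss definitions is the point (target networks and optimizers are omitted).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch of real transitions (s, a, r, s') plus sampled latent noises w.
s  = rng.standard_normal(32)
a  = rng.standard_normal(32)
r  = rng.standard_normal(32)
s2 = rng.standard_normal(32)
w  = rng.standard_normal(32)
gamma = 0.99

pi_dp = lambda s, w: 0.5 * s + 0.3 * w  # frozen denoising process
pi_w  = lambda s: 0.1 * s               # latent-noise actor
Q_a   = lambda s, a: -(s - a) ** 2      # A-critic on real actions
Q_w   = lambda s, w: -(s - w) ** 2      # latent critic

# (1) TD error: Q_a regresses onto r + gamma * Q_a(s', pi_dp(s', pi_w(s'))).
td_target = r + gamma * Q_a(s2, pi_dp(s2, pi_w(s2)))
loss_td = np.mean((Q_a(s, a) - td_target) ** 2)

# (2) Critic matching: Q_w(s, w) should equal Q_a(s, pi_dp(s, w)).
loss_match = np.mean((Q_w(s, w) - Q_a(s, pi_dp(s, w))) ** 2)

# (3) Policy gradient (DDPG-style): maximize Q_w(s, pi_w(s)).
loss_actor = -np.mean(Q_w(s, pi_w(s)))
```

In a real implementation each loss would drive a gradient step on its own network while the denoiser Ο€_dp stays fixed.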

Slide 9


Experiment - For Task-Specific Policy
Task: 4 tasks from Robomimic (simulation); trials with 5 seeds.
Base Ο€_dp: Diffusion Policies from public checkpoints and ones prepared by the authors.
DSRL realizes near-optimal behavior and effectively modifies the base policy.

Slide 10


Experiment - For Generalized Policy
Ο€β‚€ (https://www.pi.website/blog/pi0) is adopted as the generalized policy.
Setup: simulation env and real env; 60-80 online data collected; trials with 3 seeds.
DSRL shows improvement with small data.

Slide 11


Conclusion
A novel method, DSRL, is proposed.
β€’ Higher performance is achieved in various experiments by optimizing the initial noise of the Diffusion Policy.
β€’ Compared with fine-tuning, a much smaller dataset is enough to improve performance.
Limitations
β€’ Exploration capability is determined by Ο€_dp.
β€’ A reward signal must be designed for online training.

Slide 12


Appendix - DQN and DDPG
DQN: approximate the true action value with a NN, Q(s, a) β‰ˆ Q_ΞΈ(s, a).
Objective function:
E_{(s,a,r,s') ~ B} [ ( r + Ξ³ max_{a'} Q_ΞΈ(s', a') βˆ’ Q_ΞΈ(s, a) )Β² ]
If a is continuous, taking this max is difficult.

DDPG: approximate the value for continuous actions with an actor ΞΌ_Ο†(s) and a critic Q_ΞΈ(s, a).
Objective functions:
for ΞΈ: E_{(s,a,r,s') ~ B} [ ( r + Ξ³ Q_ΞΈ(s', ΞΌ_Ο†(s')) βˆ’ Q_ΞΈ(s, a) )Β² ]
for Ο†: E_{s ~ B} [ βˆ’Q_ΞΈ(s, ΞΌ_Ο†(s)) ]
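The difference between the two targets can be shown concretely: DQN takes a max over a discrete action set, while DDPG replaces the max with the actor's action. The Q-table, critic, and actor below are toy stand-ins chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
gamma = 0.99

# --- DQN target: r + gamma * max_a' Q(s', a') (discrete actions) ---
Q_table = rng.standard_normal((4, 3))  # 4 states, 3 discrete actions
s, a, r, s2 = 1, 2, 0.5, 3
dqn_target = r + gamma * Q_table[s2].max()
dqn_loss = (dqn_target - Q_table[s, a]) ** 2

# --- DDPG: the intractable max is replaced by the actor mu_phi(s') ---
Q_theta = lambda s, a: -(a - s) ** 2   # continuous-action critic (toy)
mu_phi  = lambda s: 0.9 * s            # deterministic actor (toy)
s, a, r, s2 = 0.2, 0.1, 0.5, 0.4
ddpg_target = r + gamma * Q_theta(s2, mu_phi(s2))
critic_loss = (ddpg_target - Q_theta(s, a)) ** 2   # objective for theta
actor_loss  = -Q_theta(s, mu_phi(s))               # objective for phi
```

Both squared terms correspond to the bracketed objectives above; the actor loss descends βˆ’Q_ΞΈ(s, ΞΌ_Ο†(s)), i.e., ascends the critic along the actor's action.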