Slide 1


Steering Your Diffusion Policy with Latent Space Reinforcement Learning
Symbol Emergence Systems Lab, Rei Ando

Slide 2


Paper Information
Title: Steering Your Diffusion Policy with Latent Space Reinforcement Learning
Authors: Andrew Wagenmaker et al.
Pub. date: 2025/6/25
Link: https://diffusion-steering.github.io

Slide 3


Contents
1. Background
2. DSRL
   1. Overview
   2. Detail
3. Experiment
   1. For Task-Specific Policy
   2. For Generalized Policy
4. Conclusion

Slide 4


Background
Diffusion Policy is a method to represent complex action distributions, showing high performance in Behavior Cloning (Imitation Learning).
The denoising process maps initial noise w ~ N(0, I) to an action, conditioned on observations (images, language instructions, etc.).
Fine-tuning is a way to adapt the model to new environments, but it faces various challenges.

Slide 5


DSRL - Overview
Diffusion Steering via Reinforcement Learning (DSRL): optimize the Diffusion Policy's initial noise with Reinforcement Learning, instead of the model parameters with fine-tuning.
Notation: s: state, a: action, w: initial noise, Ο€_dp: denoising process.

Slide 6


DSRL - Detail (1/3)
Original Diffusion Policy: a ~ Ο€_dp(s, w)
(s: state, s': next state, a: action, w: initial noise)

Considering the MDP, the state transitions satisfy
P(s' | s, a) = P(s' | s, Ο€_dp(s, w)) =: P_w(s' | s, w),
so we can regard w as the action. Defining the reward function
r_w(s, w) := r(s, Ο€_dp(s, w)),
the state transitions P_w and reward r_w form the latent-action MDP.
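The latent-action MDP above can be sketched as an environment wrapper whose `step` takes the initial noise w instead of the action. Everything here is a toy stand-in (the environment, dynamics, and `pi_dp` are illustrative, not the paper's models); only the wrapping structure is the point.

```python
import numpy as np

# Hypothetical toy environment, for illustration only.
class ToyEnv:
    def __init__(self):
        self.state = np.zeros(2)
    def step(self, action):
        # Simple dynamics: move toward the action; reward = -distance to goal (1, 1).
        self.state = self.state + 0.5 * (action - self.state)
        reward = -np.linalg.norm(self.state - np.ones(2))
        return self.state.copy(), reward

def pi_dp(state, w, n_steps=5):
    """Stand-in denoising process: deterministically maps (state, initial noise w)
    to an action. A real diffusion policy would run learned denoising steps."""
    a = w
    for _ in range(n_steps):
        a = a - 0.1 * (a - state)  # placeholder "denoising" update
    return a

class LatentActionEnv:
    """Wraps an env so the agent's action is the initial noise w:
    P_w(s'|s, w) := P(s'|s, pi_dp(s, w)), r_w(s, w) := r(s, pi_dp(s, w))."""
    def __init__(self, env):
        self.env = env
    def step(self, w):
        a = pi_dp(self.env.state, w)  # denoise w into a real action
        return self.env.step(a)       # transition in the underlying MDP

env = LatentActionEnv(ToyEnv())
w = np.random.default_rng(0).standard_normal(2)  # w ~ N(0, I)
s_next, r = env.step(w)
```

Any RL algorithm can now treat `LatentActionEnv` as an ordinary MDP over w, while the diffusion policy itself stays frozen.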

Slide 7


DSRL - Detail (2/3)
In the latent-action MDP, the initial noise w is treated as the action, so we can define a policy over w; optimizing this policy can effectively steer the generated actions.

The policy is optimized with the DDPG method:
actor: w ~ Ο€_w(s)
critic: Q_w(s, w)

However, offline datasets consist of (s, a, r, s') tuples, so data in the latent-action MDP is not directly usable. An additional critic is adopted to bridge the real environment and the latent-action environment:
A-critic: Q_a(s, a)
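The bridging idea can be illustrated in a few lines: an A-critic Q_a is (assumed to be) fit on real (s, a, r, s') data, and the latent critic Q_w is then fit to Q_a evaluated through the denoiser, so the noise that produced each dataset action never needs to be known. All functions below are hypothetical scalar stand-ins, not the paper's networks.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative stand-ins: a linear "denoiser" and an A-critic assumed to be
# already trained on real (s, a, r, s') transitions.
def pi_dp(s, w):
    return 0.5 * s + 0.3 * w   # denoising process: (state, noise) -> action

def Q_a(s, a):
    return -(a - 0.1) ** 2     # in this toy, the best action is a = 0.1

# Bridge: fit Q_w(s, w) to Q_a(s, pi_dp(s, w)) on freshly sampled noises,
# so the latent critic is learned without the unobserved dataset noises.
s = 0.2
ws = rng.standard_normal(256)
targets = Q_a(s, pi_dp(s, ws))
X = np.stack([np.ones_like(ws), ws, ws ** 2], axis=1)  # tiny quadratic model
coef, *_ = np.linalg.lstsq(X, targets, rcond=None)

# The actor can now pick w maximizing Q_w; analytic argmax of the quadratic.
w_star = -coef[1] / (2 * coef[2])
```

Here `pi_dp(s, w_star)` recovers the toy-optimal action 0.1, showing how optimizing over w steers the frozen denoiser.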

Slide 8


DSRL - Detail (3/3)
Networks: actor Ο€_w(s), critic Q_w(s, w), A-critic Q_a(s, a); each consists of a shallow NN.
Training losses:
β€’ TD error: update the A-critic Q_a(s, a) on real transitions.
β€’ Match the two critics: fit Q_w(s, w) to Q_a(s, Ο€_dp(s, w)).
β€’ Policy gradient: update the actor Ο€_w to maximize Q_w(s, Ο€_w(s)).
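The three losses can be written out on a toy batch. All "networks" below are hypothetical closed-form functions standing in for the shallow NNs; only the structure of the three loss definitions is the point (target networks and optimizers are omitted).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch of real transitions (s, a, r, s') plus sampled latent noises w.
s  = rng.standard_normal(32)
a  = rng.standard_normal(32)
r  = rng.standard_normal(32)
s2 = rng.standard_normal(32)
w  = rng.standard_normal(32)
gamma = 0.99

pi_dp = lambda s, w: 0.5 * s + 0.3 * w  # frozen denoising process
pi_w  = lambda s: 0.1 * s               # latent-noise actor
Q_a   = lambda s, a: -(s - a) ** 2      # A-critic on real actions
Q_w   = lambda s, w: -(s - w) ** 2      # latent critic

# (1) TD error: Q_a regresses onto r + gamma * Q_a(s', pi_dp(s', pi_w(s'))).
td_target = r + gamma * Q_a(s2, pi_dp(s2, pi_w(s2)))
loss_td = np.mean((Q_a(s, a) - td_target) ** 2)

# (2) Critic matching: Q_w(s, w) should equal Q_a(s, pi_dp(s, w)).
loss_match = np.mean((Q_w(s, w) - Q_a(s, pi_dp(s, w))) ** 2)

# (3) Policy gradient (DDPG-style): maximize Q_w(s, pi_w(s)).
loss_actor = -np.mean(Q_w(s, pi_w(s)))
```

In a real implementation each loss would drive a gradient step on its own network while the denoiser Ο€_dp stays fixed.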

Slide 9


Experiment - For Task-Specific Policy
Task: 4 tasks from Robomimic (simulation); trials with 5 seeds.
Base Ο€_dp: Diffusion Policies from public checkpoints and ones prepared by the authors.
DSRL realizes near-optimal behavior and effectively modifies the base policy.

Slide 10


Experiment - For Generalized Policy
Ο€β‚€ (https://www.pi.website/blog/pi0) is adopted as the generalized policy.
Setup: simulation env and real env; 60-80 online data collected; trials with 3 seeds.
DSRL shows improvement with small data.

Slide 11


Conclusion
A novel method, DSRL, is proposed.
β€’ Higher performance is achieved in various experiments by optimizing the initial noise of the Diffusion Policy.
β€’ Compared with fine-tuning, a much smaller dataset is enough to improve performance.
Limitations
β€’ Exploration capability is determined by Ο€_dp.
β€’ A reward signal must be designed for online training.

Slide 12


Appendix - DQN and DDPG
DQN: approximate the true action value with a NN, Q(s, a) β‰ˆ Q_ΞΈ(s, a).
Objective function:
E_{(s,a,r,s') ~ B} [ ( r + Ξ³ max_{a'} Q_ΞΈ(s', a') βˆ’ Q_ΞΈ(s, a) )Β² ]
If a is continuous, taking this max is difficult.

DDPG: approximate the value for continuous actions with an actor ΞΌ_Ο†(s) and a critic Q_ΞΈ(s, a).
Objective functions:
for ΞΈ: E_{(s,a,r,s') ~ B} [ ( r + Ξ³ Q_ΞΈ(s', ΞΌ_Ο†(s')) βˆ’ Q_ΞΈ(s, a) )Β² ]
for Ο†: E_{s ~ B} [ βˆ’Q_ΞΈ(s, ΞΌ_Ο†(s)) ]
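The difference between the two targets can be shown concretely: DQN takes a max over a discrete action set, while DDPG replaces the max with the actor's action. The Q-table, critic, and actor below are toy stand-ins chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
gamma = 0.99

# --- DQN target: r + gamma * max_a' Q(s', a') (discrete actions) ---
Q_table = rng.standard_normal((4, 3))  # 4 states, 3 discrete actions
s, a, r, s2 = 1, 2, 0.5, 3
dqn_target = r + gamma * Q_table[s2].max()
dqn_loss = (dqn_target - Q_table[s, a]) ** 2

# --- DDPG: the intractable max is replaced by the actor mu_phi(s') ---
Q_theta = lambda s, a: -(a - s) ** 2   # continuous-action critic (toy)
mu_phi  = lambda s: 0.9 * s            # deterministic actor (toy)
s, a, r, s2 = 0.2, 0.1, 0.5, 0.4
ddpg_target = r + gamma * Q_theta(s2, mu_phi(s2))
critic_loss = (ddpg_target - Q_theta(s, a)) ** 2   # objective for theta
actor_loss  = -Q_theta(s, mu_phi(s))               # objective for phi
```

Both squared terms correspond to the bracketed objectives above; the actor loss descends βˆ’Q_ΞΈ(s, ΞΌ_Ο†(s)), i.e., ascends the critic along the actor's action.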