MULTIPOLAR: Multi-Source Policy Aggregation for Transfer Reinforcement Learning between Diverse Environmental Dynamics (IJCAI'20)

Mohammadamin Barekatain, Ryo Yonetani, Masashi Hamaya, "MULTIPOLAR: Multi-Source Policy Aggregation for Transfer Reinforcement Learning between Diverse Environmental Dynamics", IJCAI'20

paper: https://www.ijcai.org/Proceedings/2020/430
blog: https://medium.com/sinicx/multipolar-multi-source-policy-aggregation-for-transfer-reinforcement-learning-between-diverse-bc42a152b0f5
code: https://github.com/Mohammadamin-Barekatain/multipolar

OMRON SINIC X

March 30, 2021

Transcript

  1. MULTIPOLAR: Multi-source Policy Aggregation for Transfer Reinforcement Learning between Diverse Environmental Dynamics (IJCAI'20). Mohammadamin Barekatain (TUM), Ryo Yonetani, and Masashi Hamaya (OMRON SINIC X). International Joint Conference on Artificial Intelligence (IJCAI 2020), Jan. 13, 2021.
  2. MULTIPOLAR: Multi-source Policy Aggregation for Transfer RL between Diverse Environmental Dynamics. Mohammadamin Barekatain (Technical University of Munich), Ryo Yonetani (OMRON SINIC X), Masashi Hamaya (OMRON SINIC X).
  3. Objective: Transfer RL between Heterogeneous Dynamics. [Videos of a real robot; see our presentation on YouTube (https://youtu.be/adUnIj83RtU) for the video materials.]
  4. Problem Setting
      • Multiple source environmental instances, a single target instance
      • Each instance shares the same state/action spaces and reward function but differs in its state-transition dynamics
      • Only the source policies are shared with the target agent
      • No prior knowledge about the source policies or the source environmental dynamics is accessible
      [Figure: multiple source agents, each in its own environment, and a single target agent]
  5. Problem Setting
      • Multiple source environmental instances, a single target instance
      • Each instance shares the same state/action spaces and reward function but differs in its state-transition dynamics
        - Not able to adopt policy reuse [Fernández and Veloso, AAMAS'06; Zheng et al., NeurIPS'18; etc.] or option frameworks [Sutton et al., Art. Intel.'99; Bacon et al., AAAI'17; etc.], as they assume all policies share the same transition dynamics
      • Only the source policies are shared with the target agent
      • No prior knowledge about the source policies or the source environmental dynamics is accessible
        - Not able to adopt existing transfer-RL or meta-RL methods that require access to the source environmental dynamics [Lazaric et al., ICML'08; Chen et al., NeurIPS'18; Yu et al., ICLR'19; etc.]
      [Fernández and Veloso, AAMAS'06] "Probabilistic Policy Reuse in a Reinforcement Learning Agent", AAMAS'06
      [Zheng et al., NeurIPS'18] "A Deep Bayesian Policy Reuse Approach against Non-Stationary Agents", NeurIPS'18
      [Sutton et al., Art. Intel.'99] "Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning", Artificial Intelligence'99
      [Bacon et al., AAAI'17] "The Option-Critic Architecture", AAAI'17
  6. MULTIPOLAR: MULTI-source POLicy AggRegation
      • Source policies $L = \{\mu_1, \dots, \mu_K\}$, each queried with the state $s_t$
      • Adaptive aggregation of source policies: $F_{\mathrm{agg}}(s_t; L, \theta_{\mathrm{agg}}) = \frac{1}{K}\,\mathbf{1}_K^{\top}\bigl(\theta_{\mathrm{agg}} \odot M(s_t)\bigr)$, where $M(s_t)$ stacks the $K$ source actions row-wise and $\theta_{\mathrm{agg}}$ holds trainable, state-independent aggregation parameters
      • Auxiliary network for predicting residuals: $F_{\mathrm{aux}}(s_t; \theta_{\mathrm{aux}})$
      • Combined output: $F(s_t; L, \theta_{\mathrm{agg}}, \theta_{\mathrm{aux}}) = F_{\mathrm{agg}}(s_t; L, \theta_{\mathrm{agg}}) + F_{\mathrm{aux}}(s_t; \theta_{\mathrm{aux}})$
      • Continuous action space: $\pi_{\mathrm{target}} \equiv \mathcal{N}\bigl(F(s_t; L, \theta_{\mathrm{agg}}, \theta_{\mathrm{aux}}),\, \Sigma\bigr)$
      • Discrete action space: $\pi_{\mathrm{target}} \equiv \mathrm{Softmax}\bigl(F(s_t; L, \theta_{\mathrm{agg}}, \theta_{\mathrm{aux}})\bigr)$
      (A code sketch of this architecture follows.)
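Below is a minimal PyTorch sketch of the architecture above, assuming the source policies are frozen, black-box callables mapping a state batch to deterministic actions. The class name `MultipolarHead`, the hidden-layer size, and the elementwise-mean form of $F_{\mathrm{agg}}$ are illustrative choices, not the authors' implementation (see the linked repository for the official code).

```python
import torch
import torch.nn as nn

class MultipolarHead(nn.Module):
    """Sketch of MULTIPOLAR's aggregation + auxiliary residual (hypothetical names).

    source_policies: list of K frozen, black-box callables mapping a state
    batch of shape (B, state_dim) to actions of shape (B, action_dim).
    """

    def __init__(self, source_policies, state_dim, action_dim, hidden=64):
        super().__init__()
        self.sources = source_policies
        K = len(source_policies)
        # theta_agg: trainable, state-independent aggregation parameters (K x D)
        self.theta_agg = nn.Parameter(torch.ones(K, action_dim))
        # F_aux: a small MLP predicting a state-dependent residual
        self.aux = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, s):
        # M(s): stack the K source actions -> (B, K, D); no_grad matches the
        # black-box assumption, so only theta_agg and the MLP get gradients.
        with torch.no_grad():
            M = torch.stack([mu(s) for mu in self.sources], dim=1)
        # F_agg: mean over sources of the elementwise-scaled source actions
        f_agg = (self.theta_agg * M).mean(dim=1)
        # F = F_agg + F_aux
        return f_agg + self.aux(s)
```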
  7. MULTIPOLAR: Advantages
      • Able to work with black-box source policies
      • No need to know the source dynamics: the aggregation is adapted to maximize the target-task performance
      • No need to know the original task performances: the auxiliary network ensures the policy's expressiveness
      • Applicable to both continuous and discrete action spaces (see the sketch below)
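As a hedged illustration of the last point, the head sketched above can feed either action-space variant; the state-independent diagonal covariance via `log_std` is an assumption of this sketch.

```python
import torch
from torch.distributions import Normal, Categorical

action_dim = 2  # example size

# Continuous actions: F(s_t) is the mean of a Gaussian policy.
# A state-independent diagonal Sigma (via log_std) is assumed here.
log_std = torch.zeros(action_dim, requires_grad=True)

def continuous_policy(head, s):
    return Normal(head(s), log_std.exp())

# Discrete actions: F(s_t) gives the logits of a categorical policy.
def discrete_policy(head, s):
    return Categorical(logits=head(s))
```

Only the aggregation parameters, the auxiliary network, and (in the continuous case) `log_std` are trained; the source policies themselves receive no gradients.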
  8. Environments: CartPole, Acrobot, Lunar Lander, Roboschool Hopper, Roboschool Ant, Roboschool Inverted Pendulum Swingup (instances with distinct dynamics; see the sketch below)
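The evaluation on the next slide uses many instances of each environment that differ only in dynamics. Here is a sketch of how such instances could be generated for CartPole by resampling its physical parameters; the sampling ranges below are made up for illustration and are not the paper's actual scheme.

```python
import gym
import numpy as np

def make_cartpole_instance(rng):
    """Create one CartPole instance with randomized dynamics (illustrative ranges)."""
    env = gym.make("CartPole-v1")
    p = env.unwrapped
    p.masscart = rng.uniform(0.5, 2.0)
    p.masspole = rng.uniform(0.05, 0.5)
    p.length = rng.uniform(0.25, 1.0)
    p.force_mag = rng.uniform(5.0, 15.0)
    # Keep the env's derived quantities consistent with the resampled values.
    p.total_mass = p.masspole + p.masscart
    p.polemass_length = p.masspole * p.length
    return env

rng = np.random.default_rng(0)
envs = [make_cartpole_instance(rng) for _ in range(100)]  # 100 unique instances
```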
  9. Quantitative Evaluation
      • Experimental setup
        - 100 instances with unique dynamics for each environment
        - 3 random choices of 4 source instances for each of the 100 target instances
        - 3 random seeds, i.e., 900 training sessions for each of the 6 environments!
      • Baselines
        - MLP learned from scratch
        - Extension of Residual Policy Learning (RPL) [Johannink et al., ICRA'19]: single source policy + auxiliary network
        - A2T [Rajendran et al., ICLR'17]: 4 source policies + auxiliary network + state-dependent attention
      • Metrics
        - Average episodic reward over training samples [Henderson et al., AAAI'18]
        - Bootstrap mean + 95% confidence bounds (see the sketch below)
      [Johannink et al., ICRA'19] "Residual Reinforcement Learning for Robot Control", ICRA'19
      [Rajendran et al., ICLR'17] "Attend, Adapt and Transfer: Attentive Deep Architecture for Adaptive Transfer from Multiple Sources in the Same Domain", ICLR'17
      [Henderson et al., AAAI'18] "Deep Reinforcement Learning That Matters", AAAI'18
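A minimal sketch of the bootstrap statistic used as the metric, assuming one scalar (average episodic reward) per training session; the function name and resample count are arbitrary choices.

```python
import numpy as np

def bootstrap_mean_ci(session_rewards, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap mean and (1 - alpha) confidence bounds over sessions."""
    rng = np.random.default_rng(seed)
    x = np.asarray(session_rewards, dtype=float)
    # Resample sessions with replacement and record each resample's mean.
    means = np.array([rng.choice(x, size=x.size, replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return x.mean(), lo, hi
```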
  10. Quantitative Evaluation: Learning Curves. [Plots: learning curves for MULTIPOLAR (ours), A2T [Rajendran et al., ICLR'17], RPL, and MLP in each environment.]
  11. Average Episodic Rewards. [Bar charts: average episodic rewards of MLP, RPL, A2T, and MULTIPOLAR on Roboschool Ant and Roboschool Hopper.]
  12. Analysis. [Plots on Hopper: effect of the number of source policies (K = 1, 4, 16), and an ablation study comparing the full model with uniform aggregation weights and with a state-independent residual.]
  13. Analysis: aggregation weights over training timesteps (Hopper). [Plot: weights for high-performing sources grow while weights for low-performing sources are suppressed.]

      Source policies        | Average episodic reward
      -----------------------|------------------------
      Random                 | 283
      4 high                 | 420
      2 high / 2 low         | 208
      4 low                  | 92
      Learning from scratch  | 92

      MULTIPOLAR avoids negative transfer by suppressing low-performing source policies.
  14. Summary
      • MULTIPOLAR: multi-source policy aggregation for transfer RL between diverse environmental dynamics
      • Able to work with black-box source policies
      • Able to work with both continuous and discrete action spaces
      • Confirmed effective on a variety of Gym environments
      • Code available: https://github.com/Mohammadamin-Barekatain/multipolar
      • Interested in an internship at OMRON SINIC X? Contact us at [email protected] with your CV!
      [Video: learning from scratch vs. MULTIPOLAR]