Slide 5
Slide 5 text
Problem Setting
• Transfer from multiple source environmental instances to a single target instance
• Each instance shares the same state/action spaces and reward function but differs in state-transition dynamics (formalized in the sketch after this list)
• Not able to adopt policy reuse [Fernandez and Veloso, AAMAS’06; Zheng et al., NeurIPS’18; etc.] or option frameworks [Sutton
et al., Art. Intel. ’98; Bacon et al., AAAI‘17; etc.], as they assume the source policies were obtained under the same transition dynamics as the target.
• Only the source policies are shared with the target agent
• No prior knowledge about the source policies or the source environmental dynamics is accessible
• Not able to adopt existing transfer RL or meta-RL methods that require access to the source environmental dynamics
[Lazaric et al., ICML’08; Chen et al., NeurIPS’18; Yu et al., ICLR’19; etc.]
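A minimal formalization of this setting (a sketch; the symbols N, M_i, P_i, and pi_i are assumed notation, not taken from the slide):

\[
\mathcal{M}_i = \langle \mathcal{S}, \mathcal{A}, P_i, R \rangle,\ i = 1, \dots, N
\qquad
\mathcal{M}_{\mathrm{tgt}} = \langle \mathcal{S}, \mathcal{A}, P_{\mathrm{tgt}}, R \rangle,
\qquad P_i \neq P_{\mathrm{tgt}} \text{ in general}
\]
\[
\text{Given to the target agent: } \{\pi_i : \mathcal{S} \to \Delta(\mathcal{A})\}_{i=1}^{N}
\qquad
\text{Not given: } \{P_i\}_{i=1}^{N} \text{ or any other source-side knowledge}
\]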
[Fernandez and Veloso, AAMAS’06] “Probabilistic Policy Reuse in a Reinforcement Learning Agent”, AAMAS’06
[Zheng et al., NeurIPS’18] “A Deep Bayesian Policy Reuse Approach against Non-Stationary Agents”, NeurIPS’18
[Sutton et al., Art. Intel. ’98] “Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning”, Artificial Intelligence ’98
[Bacon et al., AAAI’17] “The Option-Critic Architecture”, AAAI’17