MULTIPOLAR: Multi-Source Policy Aggregation for Transfer Reinforcement Learning between Diverse Environmental Dynamics (IJCAI'20)

Mohammadamin Barekatain, Ryo Yonetani, Masashi Hamaya, "MULTIPOLAR: Multi-Source Policy Aggregation for Transfer Reinforcement Learning between Diverse Environmental Dynamics", IJCAI'20

paper: https://www.ijcai.org/Proceedings/2020/430
blog: https://medium.com/sinicx/multipolar-multi-source-policy-aggregation-for-transfer-reinforcement-learning-between-diverse-bc42a152b0f5
code: https://github.com/Mohammadamin-Barekatain/multipolar

OMRON SINIC X

March 30, 2021

Transcript

  1. MULTIPOLAR: Multi-source Policy Aggregation for Transfer Reinforcement Learning between Diverse Environmental Dynamics (IJCAI'20). Mohammadamin Barekatain (TUM), Ryo Yonetani, and Masashi Hamaya (OMRON SINIC X). International Joint Conference on Artificial Intelligence (IJCAI 2020), Jan. 13, 2021.
  2. MULTIPOLAR: Multi-source Policy Aggregation for Transfer RL between Diverse Environmental Dynamics. Mohammadamin Barekatain (Technical University of Munich), Ryo Yonetani (OMRON SINIC X), Masashi Hamaya (OMRON SINIC X).
  3. Objective: Transfer RL between Heterogeneous Dynamics. [Videos of a real robot; see our presentation on YouTube (https://youtu.be/adUnIj83RtU) for the video materials.]
  4. Problem Setting
      • Multiple source environmental instances, a single target instance
      • Each instance shares the same state/action spaces and reward function but differs in its state-transition dynamics
      • Only the source policies are shared with the target agent
      • No prior knowledge about the source policies or the source environmental dynamics is accessible
      [Figure: multiple source agents, each in its own environment, and a single target agent]
  5. Problem Setting
      • Multiple source environmental instances, a single target instance
      • Each instance shares the same state/action spaces and reward function but differs in its state-transition dynamics
        - Not able to adopt policy reuse [Fernández and Veloso, AAMAS'06; Zheng et al., NeurIPS'18; etc.] or option frameworks [Sutton et al., Art. Intel.'99; Bacon et al., AAAI'17; etc.], as they assume all policies share the same transition dynamics
      • Only the source policies are shared with the target agent
      • No prior knowledge about the source policies or the source environmental dynamics is accessible
        - Not able to adopt existing transfer-RL or meta-RL methods that require access to the source environmental dynamics [Lazaric et al., ICML'08; Chen et al., NeurIPS'18; Yu et al., ICLR'19; etc.]
      [Fernández and Veloso, AAMAS'06] "Probabilistic Policy Reuse in a Reinforcement Learning Agent", AAMAS'06
      [Zheng et al., NeurIPS'18] "A Deep Bayesian Policy Reuse Approach against Non-Stationary Agents", NeurIPS'18
      [Sutton et al., Art. Intel.'99] "Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning", Artificial Intelligence'99
      [Bacon et al., AAAI'17] "The Option-Critic Architecture", AAAI'17
  6. MULTIPOLAR: MULTI-source POLicy AggRegation
      • Source policies $L = \{\mu_1, \dots, \mu_K\}$, each queried with the state $s_t$
      • Adaptive aggregation of source policies: $F_{\mathrm{agg}}(s_t; L, \theta_{\mathrm{agg}}) = \frac{1}{K}\,\mathbf{1}_K^{\top}\bigl(\theta_{\mathrm{agg}} \odot M(s_t)\bigr)$, where $M(s_t)$ stacks the $K$ source actions row-wise and $\theta_{\mathrm{agg}}$ holds trainable, state-independent aggregation parameters
      • Auxiliary network for predicting residuals: $F_{\mathrm{aux}}(s_t; \theta_{\mathrm{aux}})$
      • Combined output: $F(s_t; L, \theta_{\mathrm{agg}}, \theta_{\mathrm{aux}}) = F_{\mathrm{agg}}(s_t; L, \theta_{\mathrm{agg}}) + F_{\mathrm{aux}}(s_t; \theta_{\mathrm{aux}})$
      • Continuous action space: $\pi_{\mathrm{target}} \equiv \mathcal{N}\bigl(F(s_t; L, \theta_{\mathrm{agg}}, \theta_{\mathrm{aux}}),\, \Sigma\bigr)$
      • Discrete action space: $\pi_{\mathrm{target}} \equiv \mathrm{Softmax}\bigl(F(s_t; L, \theta_{\mathrm{agg}}, \theta_{\mathrm{aux}})\bigr)$
      (A code sketch of this architecture follows.)
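Below is a minimal PyTorch sketch of the architecture above, assuming the source policies are frozen, black-box callables mapping a state batch to deterministic actions. The class name `MultipolarHead`, the hidden-layer size, and the elementwise-mean form of $F_{\mathrm{agg}}$ are illustrative choices, not the authors' implementation (see the linked repository for the official code).

```python
import torch
import torch.nn as nn

class MultipolarHead(nn.Module):
    """Sketch of MULTIPOLAR's aggregation + auxiliary residual (hypothetical names).

    source_policies: list of K frozen, black-box callables mapping a state
    batch of shape (B, state_dim) to actions of shape (B, action_dim).
    """

    def __init__(self, source_policies, state_dim, action_dim, hidden=64):
        super().__init__()
        self.sources = source_policies
        K = len(source_policies)
        # theta_agg: trainable, state-independent aggregation parameters (K x D)
        self.theta_agg = nn.Parameter(torch.ones(K, action_dim))
        # F_aux: a small MLP predicting a state-dependent residual
        self.aux = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, s):
        # M(s): stack the K source actions -> (B, K, D); no_grad matches the
        # black-box assumption, so only theta_agg and the MLP get gradients.
        with torch.no_grad():
            M = torch.stack([mu(s) for mu in self.sources], dim=1)
        # F_agg: mean over sources of the elementwise-scaled source actions
        f_agg = (self.theta_agg * M).mean(dim=1)
        # F = F_agg + F_aux
        return f_agg + self.aux(s)
```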
  7. MULTIPOLAR: Advantages
      • Able to work with black-box source policies
      • No need to know the source dynamics: the aggregation is adapted to maximize the target-task performance
      • No need to know the original task performances: the auxiliary network ensures the policy's expressiveness
      • Applicable to both continuous and discrete action spaces (see the sketch below)
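As a hedged illustration of the last point, the head sketched above can feed either action-space variant; the state-independent diagonal covariance via `log_std` is an assumption of this sketch.

```python
import torch
from torch.distributions import Normal, Categorical

action_dim = 2  # example size

# Continuous actions: F(s_t) is the mean of a Gaussian policy.
# A state-independent diagonal Sigma (via log_std) is assumed here.
log_std = torch.zeros(action_dim, requires_grad=True)

def continuous_policy(head, s):
    return Normal(head(s), log_std.exp())

# Discrete actions: F(s_t) gives the logits of a categorical policy.
def discrete_policy(head, s):
    return Categorical(logits=head(s))
```

Only the aggregation parameters, the auxiliary network, and (in the continuous case) `log_std` are trained; the source policies themselves receive no gradients.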
  8. Environments: CartPole, Acrobot, Lunar Lander, Roboschool Hopper, Roboschool Ant, Roboschool Inverted Pendulum Swingup (instances with distinct dynamics; see the sketch below)
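The evaluation on the next slide uses many instances of each environment that differ only in dynamics. Here is a sketch of how such instances could be generated for CartPole by resampling its physical parameters; the sampling ranges below are made up for illustration and are not the paper's actual scheme.

```python
import gym
import numpy as np

def make_cartpole_instance(rng):
    """Create one CartPole instance with randomized dynamics (illustrative ranges)."""
    env = gym.make("CartPole-v1")
    p = env.unwrapped
    p.masscart = rng.uniform(0.5, 2.0)
    p.masspole = rng.uniform(0.05, 0.5)
    p.length = rng.uniform(0.25, 1.0)
    p.force_mag = rng.uniform(5.0, 15.0)
    # Keep the env's derived quantities consistent with the resampled values.
    p.total_mass = p.masspole + p.masscart
    p.polemass_length = p.masspole * p.length
    return env

rng = np.random.default_rng(0)
envs = [make_cartpole_instance(rng) for _ in range(100)]  # 100 unique instances
```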
  9. Quantitative Evaluation
      • Experimental setup
        - 100 instances with unique dynamics for each environment
        - 3 random choices of 4 source instances for each of the 100 target instances
        - 3 random seeds, i.e., 900 training sessions for each of the 6 environments!
      • Baselines
        - MLP learned from scratch
        - Extension of Residual Policy Learning (RPL) [Johannink et al., ICRA'19]: single source policy + auxiliary network
        - A2T [Rajendran et al., ICLR'17]: 4 source policies + auxiliary network + state-dependent attention
      • Metrics
        - Average episodic reward over training samples [Henderson et al., AAAI'18]
        - Bootstrap mean + 95% confidence bounds (see the sketch below)
      [Johannink et al., ICRA'19] "Residual Reinforcement Learning for Robot Control", ICRA'19
      [Rajendran et al., ICLR'17] "Attend, Adapt and Transfer: Attentive Deep Architecture for Adaptive Transfer from Multiple Sources in the Same Domain", ICLR'17
      [Henderson et al., AAAI'18] "Deep Reinforcement Learning That Matters", AAAI'18
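A minimal sketch of the bootstrap statistic used as the metric, assuming one scalar (average episodic reward) per training session; the function name and resample count are arbitrary choices.

```python
import numpy as np

def bootstrap_mean_ci(session_rewards, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap mean and (1 - alpha) confidence bounds over sessions."""
    rng = np.random.default_rng(seed)
    x = np.asarray(session_rewards, dtype=float)
    # Resample sessions with replacement and record each resample's mean.
    means = np.array([rng.choice(x, size=x.size, replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return x.mean(), lo, hi
```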
  10. Quantitative Evaluation: Learning Curves. [Plots: learning curves for MULTIPOLAR (ours), A2T [Rajendran et al., ICLR'17], RPL, and MLP in each environment.]
  11. Average Episodic Rewards. [Bar charts: average episodic rewards of MLP, RPL, A2T, and MULTIPOLAR on Roboschool Ant and Roboschool Hopper.]
  12. Analysis. [Plots on Hopper: effect of the number of source policies (K = 1, 4, 16), and an ablation study comparing the full model with uniform aggregation weights and with a state-independent residual.]
  13. Analysis: aggregation weights over training timesteps (Hopper). [Plot: weights for high-performing sources grow while weights for low-performing sources are suppressed.]

      Source policies        | Average episodic reward
      -----------------------|------------------------
      Random                 | 283
      4 high                 | 420
      2 high / 2 low         | 208
      4 low                  | 92
      Learning from scratch  | 92

      MULTIPOLAR avoids negative transfer by suppressing low-performing source policies.
  14. Summary
      • MULTIPOLAR: multi-source policy aggregation for transfer RL between diverse environmental dynamics
      • Able to work with black-box source policies
      • Able to work with both continuous and discrete action spaces
      • Confirmed effective on a variety of Gym environments
      • Code available: https://github.com/Mohammadamin-Barekatain/multipolar
      • Interested in an internship at OMRON SINIC X? Contact us at [email protected] with your CV!
      [Video: learning from scratch vs. MULTIPOLAR]