Slide 5
Slide 5 text
Problem Setting
• Transfer from multiple source environmental instances to a single target instance
• Each instance shares the same state/action spaces and reward function but differs in state-transition dynamics (formalized in the sketch after this list)
• Not able to adopt policy reuse [Fernandez and Veloso, AAMAS’06; Zheng et al., NeurIPS’18; etc.] or option frameworks [Sutton
et al., Art. Intel. ’98; Bacon et al., AAAI‘17; etc.], as they assume the source policies were obtained under the same transition dynamics as the target.
• Only the source policies are shared with the target agent
• No prior knowledge about the source policies or the source environmental dynamics is accessible
• Not able to adopt existing transfer RL or meta-RL methods that require access to the source environmental dynamics
[Lazaric et al., ICML’08; Chen et al., NeurIPS’18; Yu et al., ICLR’19; etc.]
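A minimal formalization of this setting (a sketch; the symbols N, M_i, P_i, and pi_i are assumed notation, not taken from the slide):

\[
\mathcal{M}_i = \langle \mathcal{S}, \mathcal{A}, P_i, R \rangle,\ i = 1, \dots, N
\qquad
\mathcal{M}_{\mathrm{tgt}} = \langle \mathcal{S}, \mathcal{A}, P_{\mathrm{tgt}}, R \rangle,
\qquad P_i \neq P_{\mathrm{tgt}} \text{ in general}
\]
\[
\text{Given to the target agent: } \{\pi_i : \mathcal{S} \to \Delta(\mathcal{A})\}_{i=1}^{N}
\qquad
\text{Not given: } \{P_i\}_{i=1}^{N} \text{ or any other source-side knowledge}
\]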
[Fernandez and Veloso, AAMAS’06] “Probabilistic Policy Reuse in a Reinforcement Learning Agent”, AAMAS’06
[Zheng et al., NeurIPS’18] “A Deep Bayesian Policy Reuse Approach against Non-Stationary Agents”, NeurIPS’18
[Sutton et al., Art. Intel. ’98] “Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning”, Artificial Intelligence ’98
[Bacon et al., AAAI’17] “The Option-Critic Architecture”, AAAI’17