ICML 2018 Reading Group: Policy and Value Transfer in Lifelong Reinforcement Learning
Yuu David Jinnai
July 28, 2018
Transcript
Policy and Value Transfer in Lifelong Reinforcement Learning
David Abel*, Yuu Jinnai*, George Konidaris, Michael Littman, Yue Guo (Brown University)
Motivation: Solving Multiple Tasks (Arumugam et al. 2017; Konidaris et al. 2017)
Markov Decision Processes
An MDP is a tuple M = (S, A, T, R, γ): S is the set of states, A the set of actions, T the transition function, R the reward function, and γ the discount factor.
Objective: find a policy π(a | s) that maximizes the total discounted reward.
[Figure: agent-environment loop producing a trajectory s_t, a_t, r_{t+1}, s_{t+1}, ...]
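As a concrete reference point, here is a minimal sketch of solving such a tabular MDP with value iteration; the dictionary representation of S, A, T, R and the name value_iteration are illustrative assumptions, not part of the talk.

def value_iteration(S, A, T, R, gamma, tol=1e-8):
    """Sketch: optimal state values and a greedy policy for a small tabular MDP.
    T[s][a] is a dict {next_state: probability}; R[s][a] is the expected reward."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            q = [R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items()) for a in A]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Deterministic policy that is greedy with respect to V
    pi = {s: max(A, key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items()))
          for s in S}
    return V, pi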
Optimal Fixed Policy
Given a distribution over tasks, what single policy maximizes the expected performance?
[Figure: tasks M1, M2, M3 all served by one fixed policy]
Previous Work: Action Prior (Rosman & Ramamoorthy 2012)
The action prior is the probability of each action being the optimal action under the task distribution.
[Figure: example with Pr(M1) = 0.5 and Pr(M2) = 0.5, giving an action prior of 0.5 / 0.5]
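A rough sketch of how such an action prior could be computed; the helper optimal_actions and the dictionary layout are assumptions for illustration, not taken from the cited work.

def action_prior(tasks, probs, S, A, optimal_actions):
    """Sketch of an action prior: for each state-action pair, the probability under
    the task distribution that the action is optimal in that state.
    optimal_actions(task, s) is an assumed helper returning the optimal actions of `task` at s."""
    return {s: {a: sum(p for task, p in zip(tasks, probs) if a in optimal_actions(task, s))
                for a in A}
            for s in S}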
Algorithm: Average MDP
[Figure: example with Pr(M1) = 0.5 and Pr(M2) = 0.5; the resulting policy assigns probabilities 0.0 and 1.0 to the two actions]
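A minimal sketch of the idea, assuming all tasks share S, A, T, γ and differ only in their rewards; it reuses the value_iteration sketch above, and the names tasks, probs, and average_mdp_policy are illustrative.

def average_mdp_policy(tasks, probs, S, A, T, gamma):
    """Sketch of the Average MDP idea: keep S, A, T, gamma, replace the reward with its
    expectation under the task distribution, then solve the resulting single MDP.
    tasks[i] is a reward table R_i[s][a]; probs[i] = Pr(M_i)."""
    R_bar = {s: {a: sum(p * R[s][a] for R, p in zip(tasks, probs)) for a in A} for s in S}
    return value_iteration(S, A, T, R_bar, gamma)  # value_iteration as sketched earlier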
(Theorem) The Average MDP's policy is an optimal fixed policy when only the reward function varies across tasks (i.e. S, A, T, γ are fixed) (Ramachandran & Amir 2007).
Results
Lifelong Reinforcement Learning
Repeat:
1. The agent samples an MDP from the task distribution: M ← sample(D)
2. Solve it: π ← solve(M)
[Figure: distribution D generating a sequence of tasks M1, M2, M3, ...]
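The protocol above written out as a small sketch; representing D as (task, probability) pairs and the names lifelong_rl and solve are assumptions for illustration.

import random

def lifelong_rl(task_distribution, solve, num_rounds):
    """Sketch of the lifelong RL loop: repeatedly sample a task M ~ D and solve it.
    task_distribution is a list of (mdp, probability) pairs; solve(M) returns a policy."""
    mdps, probs = zip(*task_distribution)
    policies = []
    for _ in range(num_rounds):
        M = random.choices(mdps, weights=probs, k=1)[0]  # M <- sample(D)
        policies.append(solve(M))                        # pi <- solve(M)
    return policies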
Optimistic Initialization (Kearns & Singh 2002)
Initialize the Q-values optimistically to encourage exploration.
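A minimal sketch of one standard form of optimistic initialization, assuming rewards are bounded by r_max: the value r_max / (1 - γ) upper-bounds any discounted return, so every unvisited state-action pair looks worth trying. The function name is illustrative.

def optimistic_q_init(S, A, r_max, gamma):
    """Sketch: set every Q(s, a) to the largest possible discounted return,
    r_max / (1 - gamma), so unvisited state-action pairs look attractive and get explored."""
    v_max = r_max / (1.0 - gamma)
    return {s: {a: v_max for a in A} for s in S}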
PAC-MDP (Strehl et al. '09; Rao & Whiteson '12; Mann & Choe '13)
(Theorem) The sample complexity of PAC-MDP algorithms is (bound shown on slide) if (condition shown on slide).
Minimize: the overestimate
Subject to: (condition on slide)
Solution: (shown on slide)
Algorithm: MaxQInit
[Figure: previously solved tasks M1, M2, ..., Mm drawn from the distribution]
(Theorem) For m sufficiently large, MaxQInit preserves the PAC-MDP property with high probability.
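A minimal sketch of the initialization step as described in the talk: after solving a number of tasks from D, start the next task from the element-wise maximum of their optimal Q-values. The dictionary layout and the name max_q_init are assumptions.

def max_q_init(solved_q_tables, S, A):
    """Sketch of MaxQInit: initialize the Q-table for a new task to the element-wise
    maximum of the optimal Q-values of previously solved tasks. Per the theorem above,
    with enough previous tasks this remains a PAC-MDP-preserving initialization with
    high probability, while being less optimistic than r_max / (1 - gamma)."""
    return {s: {a: max(Q[s][a] for Q in solved_q_tables) for a in A} for s in S}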
Results: Delayed Q-Learning
Results: R-Max (Brafman & Tennenholtz 2002)
Results: Q-Learning (Watkins 1992)
Tradeoff: jumpstart performance vs. convergence time.
Conclusions
Average MDP: an optimal fixed policy for a distribution of tasks.
MaxQInit: Q-value initialization that preserves the PAC-MDP property in lifelong RL.
[Figure: summary diagrams over tasks M1, M2, M3, ...]