Yuu David Jinnai
July 28, 2018

# ICML 2018 Reading Group: Policy and Value Transfer in Lifelong Reinforcement Learning


## Transcript

1. Policy and Value Transfer in Lifelong Reinforcement Learning
David Abel*, Yuu Jinnai*, George Konidaris, Michael Littman, Yue Guo
Brown University

2. Motivation: Solving Multiple Tasks
(Arumugam et al. 2017) (Konidaris et al. 2017)

3. Markov Decision Processes
M = (S, A, T, R, γ)
S: set of states
A: set of actions
T: transition function
R: reward function
γ: discount factor
Objective: Find a policy π(a | s) that maximizes the total discounted reward.
[Figure: agent-environment loop over timesteps — state s_t, action a_t, reward r_{t+1}, next state s_{t+1}, ...]
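
As a concrete reference, here is a minimal tabular sketch of this formalism in Python; the class, the array shapes, and the `value_iteration` helper are illustrative choices of mine, not code from the paper.

```python
import numpy as np

class MDP:
    """Tabular MDP M = (S, A, T, R, gamma); states and actions are indices."""
    def __init__(self, T, R, gamma):
        self.T = T          # T[s, a, s']: transition probabilities
        self.R = R          # R[s, a]: expected immediate reward
        self.gamma = gamma  # discount factor in [0, 1)
        self.n_states, self.n_actions = R.shape

def value_iteration(mdp, tol=1e-8):
    """Return optimal values V* and a greedy policy maximizing discounted reward."""
    V = np.zeros(mdp.n_states)
    while True:
        # Bellman optimality backup: Q[s, a] = R[s, a] + gamma * E[V(s')]
        Q = mdp.R + mdp.gamma * (mdp.T @ V)
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```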

4. Optimal Fixed Policy
Given a distribution over tasks, what single policy maximizes the expected performance?
[Figure: one policy evaluated against tasks M_1, M_2, M_3.]

5. Previous Work: Action Prior (Rosman & Ramamoorthy 2012)
Pr(M_1) = 0.5, Pr(M_2) = 0.5
[Figure: the probability of each action being the optimal action is (0.5, 0.5).]

6. Previous Work: Action Prior (Rosman & Ramamoorthy 2012)
Pr(M_1) = 0.5, Pr(M_2) = 0.5
[Figure: the probability of each action being the optimal action is (0.5, 0.5).]

7. Algorithm: Average MDP
Pr(M_1) = 0.5, Pr(M_2) = 0.5
[Figure: under the Average MDP the corresponding annotations become (0.0, 1.0).]
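
A sketch of that idea, reusing the `MDP`/`value_iteration` helpers above and assuming the tasks share S, A, T, and γ (only R differs):

```python
def average_mdp(mdps, probs):
    """Build the Average MDP by mixing the tasks' reward functions."""
    R_bar = sum(p * m.R for p, m in zip(probs, mdps))
    return MDP(mdps[0].T, R_bar, mdps[0].gamma)

# With Pr(M_1) = Pr(M_2) = 0.5 as on the slide:
# V_bar, pi_bar = value_iteration(average_mdp([M_1, M_2], [0.5, 0.5]))
```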

8. (Theorem) The Average MDP yields an optimal policy when only the reward function varies across the distribution
(i.e., S, A, T, γ are fixed) (Ramachandran & Amir 2007)
Results
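
Spelled out (my paraphrase of the claim on the slide): the Average MDP replaces the reward with its expectation over the distribution, and its optimal policy maximizes expected performance across tasks:

```latex
\bar{R}(s,a) = \mathbb{E}_{M \sim D}\!\left[ R_M(s,a) \right],
\qquad
\pi^*_{\bar{M}} \in \arg\max_{\pi} \; \mathbb{E}_{M \sim D}\!\left[ V^{\pi}_{M}(s_0) \right].
```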

9. Optimal Fixed Policy
Given a distribution over tasks, what single policy maximizes the expected performance?
[Figure: one policy evaluated against tasks M_1, M_2, M_3.]

10. Lifelong Reinforcement Learning
Repeat:
1. Agent samples an MDP from a distribution D: M ← sample(D)
2. Solve it: π ← solve(M)
[Figure: a stream of sampled tasks M_1, M_2, M_3, ... drawn from D.]
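
In code, this protocol is just the following loop; `sample_task` and `solve` are placeholders of mine for the task distribution D and whatever learning algorithm is plugged in:

```python
def lifelong_rl(sample_task, solve, n_tasks):
    """Lifelong RL: repeatedly draw a task from D and solve it."""
    policies = []
    for _ in range(n_tasks):
        M = sample_task()          # M <- sample(D)
        policies.append(solve(M))  # pi <- solve(M)
    return policies
```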

11. Optimistic Initialization (Kearns & Singh 2002)
Initialize the Q-values optimistically to encourage exploration.
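
The standard form of this initialization (not necessarily the slide's exact notation) sets every Q-value to the largest achievable return, so unvisited state-action pairs look attractive:

```python
import numpy as np

def optimistic_q_init(n_states, n_actions, r_max, gamma):
    """Q_0(s, a) = Vmax = Rmax / (1 - gamma) for all (s, a)."""
    return np.full((n_states, n_actions), r_max / (1.0 - gamma))
```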

12. PAC-MDP (Strehl et al. ‘09; Rao, Whiteson ‘12; Mann, Choe ‘13)
(Theorem) The sample complexity of PAC-MDP algorithms is:
IF:

13. PAC-MDP (Strehl et al. ‘09; Rao, Whiteson ‘12; Mann, Choe ‘13)
(Theorem) The sample complexity of PAC-MDP algorithms is:
Minimize: the overestimate
Subject to:

14. PAC-MDP (Strehl et al. ‘09; Rao, Whiteson ‘12; Mann, Choe ‘13)
(Theorem) The sample complexity of PAC-MDP algorithms is:
Minimize: the overestimate
Subject to:
Solution:
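
The equations on these slides did not survive transcription. For context, the usual PAC-MDP guarantee (Strehl et al. 2009) says that, with probability at least 1 − δ, the number of timesteps on which the agent acts more than ε below optimal is polynomially bounded:

```latex
\#\Big\{ t : V^{\mathcal{A}_t}(s_t) < V^*(s_t) - \epsilon \Big\}
\;\le\; \mathrm{poly}\!\left(|S|,\, |A|,\, \tfrac{1}{\epsilon},\, \tfrac{1}{\delta},\, \tfrac{1}{1-\gamma}\right).
```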

15. Algorithm: MaxQInit
[Figure: Q-values from previously solved tasks M_1, M_2, ..., M_m.]
(Theorem) For m sufficiently large, MaxQInit preserves the PAC-MDP property with high probability.
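
A minimal sketch of the initialization itself, assuming we have kept the optimal Q-tables of the m tasks solved so far (array names are mine):

```python
import numpy as np

def max_q_init(q_tables):
    """MaxQInit: initialize the next task's Q-values to the entrywise
    maximum over previously solved tasks' optimal Q-tables. For m large
    enough this remains optimistic for the task distribution with high
    probability, which is what preserves the PAC-MDP property."""
    return np.max(np.stack(q_tables), axis=0)

# Q0 = max_q_init([Q_M1, Q_M2, Q_M3])  # then hand Q0 to the PAC-MDP learner
```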

16. Results: Delayed Q-Learning

17. Results: Delayed Q-Learning

18. Results: R-Max (Brafman & Tennenholtz 2002)

19. Results: Q-Learning (Watkins 1992)
Tradeoff between jumpstart performance and convergence time.

20. Conclusions
Average MDP
[Figure: tasks paired with per-task policies — M_1 → Policy 1, M_2 → Policy 2, ...]
MaxQInit