Yuu David Jinnai
July 28, 2018

# ICML 2018 Reading Group: Policy and Value Transfer in Lifelong Reinforcement Learning


## Transcript

1. ### Policy and Value Transfer in Lifelong Reinforcement Learning

David Abel*, Yuu Jinnai*, George Konidaris, Michael Littman, Yue Guo (Brown University)
3. ### Markov Decision Processes M = (S, A, T, R, γ)

S: set of states; A: set of actions; T: transition function; R: reward function; γ: discount factor. Objective: find a policy π(a | s) that maximizes the total discounted reward. (Figure: the agent-environment loop s_t → a_t → r_{t+1}, s_{t+1} → …)
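The MDP objective above can be sketched with value iteration on a toy two-state chain (the MDP below is a made-up example, not the one from the talk):

```python
# A minimal sketch of an MDP (S, A, T, R, γ) solved by value iteration.
# The two-state chain here is an illustrative toy, not the talk's example.
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
# T[s, a, s'] : transition probabilities; R[s, a] : rewards (toy values)
T = np.zeros((n_states, n_actions, n_states))
T[0, 0, 0] = 1.0   # action 0 stays in state 0
T[0, 1, 1] = 1.0   # action 1 moves to state 1
T[1, :, 1] = 1.0   # state 1 is absorbing
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])  # reward 1 for acting in state 1

def value_iteration(T, R, gamma, tol=1e-8):
    """Compute Q*(s, a) by iterating the Bellman optimality backup."""
    Q = np.zeros(R.shape)
    while True:
        V = Q.max(axis=1)            # V(s) = max_a Q(s, a)
        Q_new = R + gamma * (T @ V)  # Q(s,a) = R(s,a) + γ Σ_s' T(s,a,s') V(s')
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

Q_star = value_iteration(T, R, gamma)
pi = Q_star.argmax(axis=1)           # greedy policy π(s) = argmax_a Q*(s, a)
```

Here the absorbing state yields Q* = 1/(1 − γ) = 10, so the greedy policy moves to state 1.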

5. ### Previous Work: Action Prior (Rosman & Ramamoorthy 2012)

Pr(M1) = 0.5, Pr(M2) = 0.5. The action prior is the probability of each action being the optimal action under the task distribution; in the example, each action is optimal with probability 0.5.
7. ### Algorithm: Average MDP

Pr(M1) = 0.5, Pr(M2) = 0.5. The Average MDP averages over the task distribution; in the example, the resulting action values are 0.0 and 1.0, so it can distinguish actions that the action prior rated equally.
8. ### (Theorem) The Average MDP policy is optimal if only the reward function is distributed (i.e. S, A, T, γ are fixed across tasks) (Ramachandran & Amir 2007)

9. ### Results
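When only the reward function varies across tasks, the Average MDP simply averages the reward functions under the prior and plans in the resulting single MDP. A toy sketch (task names and reward numbers are made up for illustration; they echo the idea that the action prior can be uninformative while the average is not):

```python
# Sketch of the Average MDP when only R varies: R_avg(s,a) = Σ_M Pr(M) R_M(s,a).
# Hypothetical tasks M1, M2 with a single state and two actions.
import numpy as np

prior = {"M1": 0.5, "M2": 0.5}
# R[s, a] per task; action 0 is optimal in M1, action 1 in M2,
# so an action prior would rate both actions 0.5 / 0.5.
rewards = {"M1": np.array([[1.0, 0.9]]),
           "M2": np.array([[0.0, 0.9]])}

# Average MDP reward, weighted by the task prior
R_avg = sum(p * rewards[m] for m, p in prior.items())

# The average distinguishes the actions (0.5 vs 0.9) even though each
# is optimal in exactly one task.
best_action = int(R_avg[0].argmax())
```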

10. ### Lifelong Reinforcement Learning

Given a task distribution D, repeat: 1. the agent samples an MDP from the distribution, M ← sample(D); 2. solves it, π ← solve(M). (Figure: a stream of tasks M1, M2, M3, …)
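The lifelong loop above can be sketched in a few lines; `sample` and `solve` below are hypothetical stand-ins for the talk's sample(D) and solve(M):

```python
# Minimal sketch of the lifelong RL loop: repeatedly sample a task from
# the distribution D and solve it. Placeholder implementations only.
import random

random.seed(0)
D = ["M1", "M2", "M3"]            # toy uniform task distribution

def sample(D):
    """Stand-in for M <- sample(D)."""
    return random.choice(D)

def solve(M):
    """Stand-in for pi <- solve(M); returns a task-labelled policy."""
    return f"pi_{M}"

policies = [solve(sample(D)) for _ in range(5)]
```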
11. ### Optimistic Initialization (Kearns & Singh 2002)

Initialize the Q-values optimistically to encourage exploration.
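A common form of optimistic initialization sets every Q-value to the upper bound Rmax / (1 − γ), so untried actions look attractive until they are explored. A single-state, two-action toy sketch (the reward numbers are illustrative, not from the talk):

```python
# Sketch of optimistic initialization: start Q at Rmax / (1 - γ) so a
# greedy agent is still driven to try every action. Toy bandit-style MDP.
gamma, alpha, r_max = 0.9, 0.5, 1.0
q_init = r_max / (1 - gamma)      # optimistic upper bound on Q*
rewards = [0.2, 1.0]              # true mean reward of each action
Q = [q_init, q_init]

for _ in range(200):
    a = Q.index(max(Q))           # greedy w.r.t. the optimistic Q-values
    r = rewards[a]
    # single-state Q-learning update: Q(a) += α (r + γ max_a' Q(a') - Q(a))
    Q[a] += alpha * (r + gamma * max(Q) - Q[a])

best = Q.index(max(Q))            # action 1, whose Q* = 1 / (1 - γ) = 10
```

After trying action 0 once, its optimistic value drops, and the agent settles on action 1, whose value stays at the true optimum.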
12. ### PAC-MDP (Strehl et al. ‘09; Rao, Whiteson ‘12; Mann, Choe ‘13)

(Theorem) The sample complexity bounds of PAC-MDP algorithms hold if the initial Q-values overestimate the optimal values.

13. To transfer safely, minimize the overestimate subject to the initialization remaining an overestimate.

14. Solution: initialize each Q-value to the maximum of the optimal Q-values across previously solved MDPs (MaxQInit).

(Theorem) For a sufficiently large number of previously solved MDPs m, MaxQInit preserves the PAC-MDP property with high probability.
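The MaxQInit idea can be sketched directly: take the elementwise maximum of the optimal Q-tables from the solved tasks and use it as the initialization for the next task. The Q-tables below are hypothetical toy values:

```python
# Sketch of MaxQInit: Q_0(s, a) = max_i Q*_{M_i}(s, a) over previously
# solved tasks M_1..M_m. Toy Q*-tables (2 states, 2 actions); numbers
# are illustrative, not from the paper's experiments.
import numpy as np

solved_q_stars = [
    np.array([[1.0, 0.2], [0.5, 0.8]]),
    np.array([[0.6, 0.9], [0.4, 0.7]]),
    np.array([[0.3, 0.4], [0.9, 0.1]]),
]

# MaxQInit: elementwise max across the solved tasks' optimal Q-values
Q_init = np.max(np.stack(solved_q_stars), axis=0)

# Compare with Rmax-style optimism Q_0 = Rmax / (1 - γ): MaxQInit is a
# much tighter optimistic start, so less exploration is wasted.
r_max, gamma = 1.0, 0.9
q_rmax = r_max / (1 - gamma)
assert np.all(Q_init <= q_rmax)
```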
