ICML 2018 Reading Group: Policy and Value Transfer in Lifelong Reinforcement Learning

Yuu David Jinnai

July 28, 2018

Transcript

  1. Policy and Value Transfer in Lifelong Reinforcement Learning
     David Abel*, Yuu Jinnai*, George Konidaris, Michael Littman, Yue Guo
     Brown University
  2. Markov Decision Processes M = (S, A, T, R, γ)
     S: set of states, A: set of actions, T: transition function, R: reward function, γ: discount factor.
     Objective: find a policy π(a | s) that maximizes the total discounted reward.
     [Figure: agent-environment trajectory s_t, a_t, r_t+1, s_t+1, ...]
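
For reference, the objective named on this slide is the standard expected discounted return; writing it out explicitly (standard RL notation, not a formula taken from the deck):

\[
  \pi^{*} \;=\; \arg\max_{\pi} \; \mathbb{E}\Big[ \textstyle\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \;\Big|\; \pi \Big]
\]
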
  3. Optimal Fixed Policy
     Given a distribution of tasks, what single fixed policy maximizes the expected performance?
     [Figure: tasks M_1, M_2, M_3 all mapped to one fixed policy]
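
One standard way to formalize the question on this slide (the start state s_0 is my assumption, not notation from the deck):

\[
  \pi^{*}_{D} \;=\; \arg\max_{\pi} \; \mathbb{E}_{M \sim D}\big[ V^{\pi}_{M}(s_{0}) \big]
\]
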
  4. Previous Work: Action Prior (Rosman & Ramamoorthy 2012)
     Pr(M_1) = 0.5, Pr(M_2) = 0.5.
     The action prior is the probability of each action being the optimal action under the task distribution.
     [Figure: two tasks, each with probability 0.5, inducing action probabilities of 0.5]
  5. Previous Work: Action Prior (Rosman & Ramamoorthy 2012), continued
     Pr(M_1) = 0.5, Pr(M_2) = 0.5.
     [Figure: the resulting action prior, i.e. the probability of each action being the optimal action, is 0.5 per action]
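
One way to write the action prior sketched on these two slides (my paraphrase of Rosman & Ramamoorthy's idea, not a formula from the deck): the probability that action a is optimal in state s is the probability mass of tasks in which a is an optimal action,

\[
  \Pr(a \text{ optimal in } s) \;=\; \sum_{M} \Pr(M)\, \mathbf{1}\big[\, a \in \arg\max_{a'} Q^{*}_{M}(s, a') \,\big].
\]

With Pr(M_1) = Pr(M_2) = 0.5 and each task favoring a different action, each of the two actions gets prior 0.5, matching the numbers on the slide.
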
  6. Results
     (Theorem) The optimal policy of the Average MDP is an optimal fixed policy if only the reward function varies across the distribution (i.e. S, A, T, γ are fixed) (Ramachandran & Amir 2007).
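
The Average MDP here is the MDP whose reward is the expected reward under the task distribution (a standard construction; the notation below is mine, not the slide's):

\[
  \bar{M} \;=\; (S, A, T, \bar{R}, \gamma), \qquad \bar{R}(s, a) \;=\; \mathbb{E}_{M \sim D}\big[ R_{M}(s, a) \big].
\]

The theorem says that when only R varies across tasks, the optimal policy of \(\bar{M}\) is an optimal fixed policy for the distribution.
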
  7. Optimal Fixed Policy (recap)
     Given a distribution of tasks, what single fixed policy maximizes the expected performance?
     [Figure: tasks M_1, M_2, M_3 all mapped to one fixed policy]
  8. Lifelong Reinforcement Learning
     Repeat:
     1. The agent samples an MDP from a distribution: M ← sample(D)
     2. Solve it: π ← solve(M)
     [Figure: a stream of tasks M_1, M_2, M_3, ... drawn from the distribution D]
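
A minimal Python sketch of the lifelong RL loop on this slide; sample and solve are the hypothetical helpers named on the slide, not a real library API:

    # Lifelong RL interaction loop (sketch): repeatedly draw a task from the
    # distribution D and solve it, optionally reusing knowledge from earlier tasks.
    def lifelong_rl(D, num_tasks, sample, solve):
        policies = []
        for _ in range(num_tasks):
            M = sample(D)      # 1. sample an MDP from the distribution
            pi = solve(M)      # 2. solve it (possibly warm-started by transfer)
            policies.append(pi)
        return policies
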
  9. PAC-MDP (Strehl et al. '09; Rao, Whiteson '12; Mann, Choe '13)
     (Theorem) The sample complexity of PAC-MDP algorithms is: [bound not captured in transcript]
     IF: [condition not captured in transcript]
  10. PAC-MDP (Strehl et al. '09; Rao, Whiteson '12; Mann, Choe '13)
      (Theorem) The sample complexity of PAC-MDP algorithms is: [bound not captured in transcript]
      Minimize: the overestimate
      Subject to: [constraint not captured in transcript]
  11. PAC-MDP (Strehl et al. '09; Rao, Whiteson '12; Mann, Choe '13)
      (Theorem) The sample complexity of PAC-MDP algorithms is: [bound not captured in transcript]
      Minimize: the overestimate
      Subject to: [constraint not captured in transcript]
      Solution: [not captured in transcript]
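
The slide leaves the optimization and its solution to the figure; based on the MaxQInit algorithm introduced on the next slide, the intended idea is presumably to choose the smallest Q-value initialization that remains optimistic for the tasks in the distribution (my reconstruction, not the slide's exact formula):

\[
  \min_{Q_{\text{init}}} \; \big( Q_{\text{init}}(s, a) - Q^{*}_{M}(s, a) \big)
  \quad \text{s.t.} \quad
  Q_{\text{init}}(s, a) \;\ge\; Q^{*}_{M}(s, a) \;\; \forall M,
\]
\[
  \Longrightarrow \qquad
  Q_{\text{init}}(s, a) \;=\; \max_{M} Q^{*}_{M}(s, a).
\]
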
  12. Algorithm: MaxQInit
      [Figure: previously solved tasks M_1, M_2, ..., M_m used to initialize the next task]
      (Theorem) For m sufficiently large, MaxQInit preserves the PAC-MDP property with high probability.
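
A minimal sketch of MaxQInit-style initialization, assuming each previously solved task exposes a table of optimal Q-values (function and variable names are mine, not the authors' code):

    def max_q_init(solved_q_tables, states, actions):
        # solved_q_tables: non-empty list of dicts (s, a) -> optimal Q-value,
        # one per previously solved task M_1, ..., M_m.
        # Initialize the new task's Q-values to the maximum over solved tasks,
        # which stays optimistic with high probability once enough tasks have
        # been sampled (the theorem on this slide).
        q_init = {}
        for s in states:
            for a in actions:
                q_init[(s, a)] = max(q[(s, a)] for q in solved_q_tables)
        return q_init
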
  13. Conclusions
      [Figure: summary diagram contrasting the Average MDP approach and the MaxQInit approach across tasks M_1, M_2, M_3, ...]