
ICML 2018 Reading Group: Policy and Value Transfer in Lifelong Reinforcement Learning


Yuu David Jinnai

July 28, 2018

Transcript

  1. Policy and Value Transfer in
    Lifelong Reinforcement
    Learning
    David Abel*, Yuu Jinnai*, George Konidaris,
    Michael Littman, Yue Guo
    Brown University


  2. Motivation: Solving Multiple Tasks
    (Arumugam et al. 2017) (Konidaris et al. 2017)


  3. Markov Decision Processes
    M = (S, A, T, R, γ)
    S: set of states
    A: set of actions
    T: transition function
    R: reward function
    γ: discount factor
    Objective: find a policy π(a | s) that maximizes the expected total discounted reward
    [Figure: agent-environment interaction, s_t → a_t → s_{t+1}, r_{t+1} → ⋯]
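    A minimal sketch of this setup in Python (illustrative only; the array layout and
    the value_iteration helper are my own naming, not from the slides):

    import numpy as np

    def value_iteration(T, R, gamma=0.95, tol=1e-6):
        """Solve a tabular MDP. T[s, a, s'] are transition probabilities,
        R[s, a] are rewards; returns the optimal Q-values."""
        n_states, n_actions, _ = T.shape
        Q = np.zeros((n_states, n_actions))
        while True:
            V = Q.max(axis=1)              # greedy state values
            Q_new = R + gamma * (T @ V)    # Bellman optimality backup
            if np.abs(Q_new - Q).max() < tol:
                return Q_new
            Q = Q_new

    # The greedy policy pi(s) = argmax_a Q[s, a] then maximizes the discounted return.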


  4. Optimal Fixed Policy
    Given a distribution over tasks, what single policy maximizes the expected performance?
    [Figure: tasks M_1, M_2, M_3 all served by one fixed policy]


  5. Previous Work: Action Prior (Rosman & Ramamoorthy 2012)
    Pr(M_1) = 0.5, Pr(M_2) = 0.5
    [Figure: probability of each action being the optimal action, here 0.5 / 0.5]
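    A sketch of how such an action prior could be computed from the optimal Q-tables
    of the individual tasks (function and variable names are mine, not from the paper):

    import numpy as np

    def action_prior(q_tables, task_probs):
        """Pr(a is optimal in state s) under the task distribution.
        q_tables: one optimal Q array of shape (S, A) per task M_i,
        task_probs: the corresponding Pr(M_i)."""
        n_states, n_actions = q_tables[0].shape
        prior = np.zeros((n_states, n_actions))
        for Q, p in zip(q_tables, task_probs):
            greedy = Q.argmax(axis=1)                # optimal action per state in task M_i
            prior[np.arange(n_states), greedy] += p  # accumulate Pr(M_i)
        return prior                                 # each row sums to 1 over actions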


  6. Previous Work: Action Prior (Rosman & Ramamoorthy 2012)
    Pr(M_1) = 0.5, Pr(M_2) = 0.5
    [Figure: probability of each action being the optimal action, here 0.5 / 0.5]


  7. Algorithm: Average MDP
    Pr(M_1) = 0.5, Pr(M_2) = 0.5
    [Figure: the policy of the Average MDP puts probability 0.0 / 1.0 on the two actions]
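    A sketch of the Average MDP idea, reusing the value_iteration helper above and
    assuming only the reward function differs across tasks:

    def average_mdp_policy(rewards, task_probs, T, gamma=0.95):
        """Average the reward functions weighted by Pr(M_i), then solve the
        resulting single MDP; S, A, T, gamma are shared across tasks."""
        R_avg = sum(p * R for R, p in zip(rewards, task_probs))
        Q = value_iteration(T, R_avg, gamma)
        return Q.argmax(axis=1)            # deterministic greedy policy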


  8. Results
    (Theorem) The Average MDP policy is optimal if only the reward function is distributed
    (i.e., S, A, T, γ are fixed across tasks) (Ramachandran & Amir 2007)


  9. Optimal Fixed Policy
    Given a distribution over tasks, what single policy maximizes the expected performance?
    [Figure: tasks M_1, M_2, M_3 all served by one fixed policy]


  10. Lifelong Reinforcement Learning
    Given a task distribution D, repeat:
    1. The agent samples an MDP from the distribution: M ← sample(D)
    2. Solve it: π ← solve(M)
    [Figure: stream of tasks M_1, M_2, M_3, ⋯ drawn from D]
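    The loop above as a Python sketch; sample_task and solve stand in for the task
    distribution D and for whatever planner or RL agent is used:

    def lifelong_rl(sample_task, solve, n_tasks=100):
        """Repeatedly draw a task M ~ D and solve it."""
        policies = []
        for _ in range(n_tasks):
            M = sample_task()      # M <- sample(D)
            pi = solve(M)          # pi <- solve(M)
            policies.append(pi)
        return policies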


  11. Optimistic Initialization (Kearns & Singh 2002)
    Initialize the Q-values optimistically to encourage exploration
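    A minimal sketch of optimistic initialization for a tabular agent, assuming rewards
    are bounded by r_max (a standard assumption; the names are mine, not from the slide):

    import numpy as np

    def optimistic_q_init(n_states, n_actions, r_max, gamma):
        """Set every Q(s, a) to the largest possible return r_max / (1 - gamma),
        so unvisited state-action pairs look attractive and get explored."""
        v_max = r_max / (1.0 - gamma)
        return np.full((n_states, n_actions), v_max)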


  12. PAC-MDP (Strehl et al. ‘09; Rao & Whiteson ‘12; Mann & Choe ‘13)
    (Theorem) The sample complexity of PAC-MDP algorithms is:
    IF:


  13. PAC-MDP (Strehl et al. ‘09; Rao & Whiteson ‘12; Mann & Choe ‘13)
    (Theorem) The sample complexity of PAC-MDP algorithms is:
    Minimize: the overestimate
    Subject to:


  14. PAC-MDP (Strehl et al. ‘09; Rao & Whiteson ‘12; Mann & Choe ‘13)
    (Theorem) The sample complexity of PAC-MDP algorithms is:
    Minimize: the overestimate
    Subject to:
    Solution:


  15. Algorithm: MaxQInit
    [Figure: previously solved tasks M_1, M_2, ⋯, M_m]
    (Theorem) For m sufficiently large, MaxQInit preserves the PAC-MDP property with high probability
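    A minimal sketch of the MaxQInit initialization as described on the slide: the new
    task's Q-values start at the elementwise maximum of the optimal Q-values of the m
    previously solved tasks (variable names are mine):

    import numpy as np

    def max_q_init(prev_q_tables):
        """prev_q_tables: list of optimal Q arrays, one (S, A) array per
        previously solved task M_1, ..., M_m."""
        return np.maximum.reduce([np.asarray(Q) for Q in prev_q_tables])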


  16. Results: Delayed Q-Learning


  17. Results: Delayed Q-Learning


  18. Results: R-Max (Brafman & Tennenholtz 2002)


  19. Results: Q-Learning (Watkins & Dayan 1992)
    Trade-off between jumpstart performance and convergence time


  20. Conclusions
    Average MDP: a single fixed policy chosen for the whole task distribution (the Optimal Fixed Policy setting)
    MaxQInit: a per-task policy learned for each sampled task in lifelong RL, with transferred Q-value initialization
    [Figure: tasks M_1, M_2, M_3 and their policies under the two approaches]
