ICML 2018 Reading Group: Policy and Value Transfer in Lifelong Reinforcement Learning
Yuu David Jinnai
July 28, 2018
Transcript
Policy and Value Transfer in Lifelong Reinforcement Learning
David Abel*, Yuu Jinnai*, George Konidaris, Michael Littman, Yue Guo
Brown University
Motivation: Solving Multiple Tasks (Arumugam et al. 2017; Konidaris et al. 2017)
Markov Decision Processes
M = (S, A, T, R, γ)
S: set of states; A: set of actions; T: transition function; R: reward function; γ: discount factor
Objective: find a policy π(a | s) that maximizes the total discounted reward.
[Figure: agent-environment interaction, s_t →(a_t)→ s_{t+1} with reward r_{t+1}, repeating]
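To make the tuple concrete, here is a minimal sketch of value iteration over a tabular MDP. The array shapes and the function name value_iteration are illustrative assumptions, not code from the talk.

    import numpy as np

    def value_iteration(T, R, gamma, tol=1e-6):
        """Compute the optimal Q-values Q*(s, a) of a tabular MDP.

        T: (|S|, |A|, |S|) transition probabilities T(s' | s, a)
        R: (|S|, |A|) expected rewards
        gamma: discount factor in [0, 1)
        """
        Q = np.zeros(T.shape[:2])
        while True:
            V = Q.max(axis=1)            # V(s) = max_a Q(s, a)
            Q_new = R + gamma * (T @ V)  # Bellman optimality backup
            if np.abs(Q_new - Q).max() < tol:
                return Q_new
            Q = Q_new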
Optimal Fixed Policy
Given a distribution over tasks, what single policy maximizes the expected performance?
[Figure: tasks M_1, M_2, M_3 all mapped to one policy]
Previous Work: Action Prior (Rosman & Ramamoorthy 2012)
For each state, the prior gives the probability of each action being the optimal action under the task distribution.
[Figure: with Pr(M_1) = 0.5 and Pr(M_2) = 0.5, each of the two actions is optimal with probability 0.5]
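A sketch of how such a prior could be computed from the optimal Q-values of each task; the function name and the uniform tie-splitting are my assumptions, not the authors' implementation.

    import numpy as np

    def action_prior(q_stars, task_probs):
        """Pr(a is optimal in s) under the task distribution.

        q_stars: list of (|S|, |A|) optimal Q-value arrays, one per task
        task_probs: Pr(M) for each task, summing to 1
        """
        prior = np.zeros_like(q_stars[0])
        for Q, p in zip(q_stars, task_probs):
            # Indicator of the optimal action(s) in each state for this task.
            optimal = (Q == Q.max(axis=1, keepdims=True)).astype(float)
            # Split probability mass uniformly across ties.
            prior += p * optimal / optimal.sum(axis=1, keepdims=True)
        return prior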
Algorithm: Average MDP
[Figure: with Pr(M_1) = 0.5 and Pr(M_2) = 0.5, the resulting policy puts weights 0.0 and 1.0 on the two actions]
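A sketch of how the Average MDP could be constructed when tasks share S, A, T, γ and differ only in reward (the setting of the next slide's theorem): average the reward functions under Pr(M) and solve the result. This reuses the value_iteration sketch above; all names are illustrative.

    def average_mdp_policy(T, rewards, task_probs, gamma):
        """Greedy policy of the Average MDP (shared S, A, T, gamma).

        rewards: list of (|S|, |A|) reward functions, one per task
        task_probs: Pr(M) for each task
        """
        # Expected reward under the task distribution.
        R_avg = sum(p * R for p, R in zip(task_probs, rewards))
        Q = value_iteration(T, R_avg, gamma)  # sketch from the MDP slide
        return Q.argmax(axis=1)               # one fixed policy for all tasks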
(Theorem) The Average MDP yields an optimal policy if only the reward function is distributed (i.e., S, A, T, γ are fixed) (Ramachandran & Amir 2007).
Results [figure omitted from transcript]
Lifelong Reinforcement Learning
Repeat:
1. The agent samples an MDP from a distribution: M ← sample(D)
2. Solve it: π ← solve(M)
[Figure: a stream of tasks M_1, M_2, M_3, … drawn from D]
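The loop on the slide, written out. Both parameters here are placeholders: sample stands in for any task distribution and solve for any RL algorithm.

    def lifelong_rl(D, solve, num_tasks):
        """Repeatedly draw a task from the distribution D and solve it.

        D: object with a sample() method returning an MDP (placeholder)
        solve: RL algorithm mapping an MDP to a policy (placeholder)
        """
        policies = []
        for _ in range(num_tasks):
            M = D.sample()   # 1. agent samples an MDP from the distribution
            pi = solve(M)    # 2. solve it
            policies.append(pi)
        return policies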
Optimistic Initialization (Kearns & Singh 2002)
Initialize Q-values optimistically to encourage exploration.
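A minimal sketch of the idea, assuming tabular Q-learning and a known reward upper bound Rmax: every entry starts at the largest achievable return, Vmax = Rmax / (1 − γ), so unvisited actions look attractive until tried.

    import numpy as np

    def optimistic_q_init(n_states, n_actions, r_max, gamma):
        """Start all Q-values at Vmax so untried actions look best,
        which drives the agent to explore them."""
        v_max = r_max / (1.0 - gamma)
        return np.full((n_states, n_actions), v_max)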
PAC-MDP (Strehl et al. '09; Rao & Whiteson '12; Mann & Choe '13)
(Theorem) The sample complexity of PAC-MDP algorithms is: [bound omitted from transcript], if: [condition omitted from transcript].
Minimize: the overestimate. Subject to: [constraint omitted from transcript]. Solution: [equation omitted from transcript; the MaxQInit initialization on the next slide].
Algorithm: MaxQInit
Initialize Q-values for a new task from the previously sampled tasks M_1, M_2, …, M_m.
(Theorem) For m sufficiently large, MaxQInit preserves the PAC-MDP property with high probability.
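A sketch of the MaxQInit initialization, under the assumption that the optimal Q-values of the m previously sampled tasks have been stored; the function name is mine.

    import numpy as np

    def max_q_init(q_stars):
        """Initialize a new task's Q-table as the elementwise maximum of
        the optimal Q-values of previously solved tasks; for m large
        enough this upper-bounds the new task's Q* with high probability.

        q_stars: list of (|S|, |A|) arrays, one per solved task M_1..M_m
        """
        return np.max(np.stack(q_stars), axis=0)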
Results: Delayed Q-Learning
Results: R-Max (Brafman & Tennenholtz 2002)
Results: Q-Learning (Watkins 1992)
There is a tradeoff between jumpstart performance and convergence time.
Conclusions
Average MDP: one fixed policy for a distribution of tasks.
MaxQInit: value transfer across tasks in lifelong reinforcement learning.
[Figure: the two approaches summarized side by side over tasks M_1, M_2, M_3, …]