ICML2018 Reading Group: Policy and Value Transfer in Lifelong Reinforcement Learning
Yuu David Jinnai
July 28, 2018
Transcript
Policy and Value Transfer in Lifelong Reinforcement Learning
David Abel*, Yuu Jinnai*, George Konidaris, Michael Littman, Yue Guo
Brown University
Motivation: Solving Multiple Tasks (Arumugam et al. 2017) (Konidaris et al. 2017)
Markov Decision Processes
M = (S, A, T, R, γ)
S: set of states; A: set of actions; T: transition function; R: reward function; γ: discount factor
Objective: find a policy π(a | s) that maximizes the total discounted reward.
[Figure: agent-environment loop over s_t, a_t, r_{t+1}, s_{t+1}, ...]
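As a sketch (not written out on the slide itself), the objective can be stated as the expected discounted return:

```latex
% Expected discounted return maximized by the policy \pi in M = (S, A, T, R, \gamma)
\max_{\pi} \; \mathbb{E}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} \, r_{t+1} \;\middle|\; \pi \right]
```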
Optimal Fixed Policy
Given a distribution over tasks, which single fixed policy maximizes the expected performance?
[Figure: one Policy applied to Task M_1, Task M_2, Task M_3]
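One hedged way to write this objective, assuming a task distribution D and writing V_M^π for the value of policy π in task M:

```latex
% A fixed policy maximizing expected performance over the task distribution D
\pi^{*} \in \arg\max_{\pi} \; \mathbb{E}_{M \sim D}\!\left[ V^{\pi}_{M} \right]
```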
Previous Work: Action Prior (Rosman & Ramamoorthy 2012)
Pr(M_1) = 0.5, Pr(M_2) = 0.5
[Figure: for each action, the probability (here 0.5 / 0.5) of that action being the optimal action under the task distribution]
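A minimal sketch of the action-prior idea, assuming we already know the optimal actions of each sampled task; the function and variable names are illustrative, not from the paper:

```python
from collections import defaultdict

def action_prior(task_probs, optimal_actions):
    """Estimate Pr(action a is optimal in state s) under the task distribution.

    task_probs:      dict mapping task id -> probability Pr(M)
    optimal_actions: dict mapping task id -> {state: set of optimal actions in that task}
    """
    prior = defaultdict(lambda: defaultdict(float))
    for task, p in task_probs.items():
        for state, actions in optimal_actions[task].items():
            for a in actions:
                prior[state][a] += p  # a is optimal in this task, which occurs with probability p
    return prior

# Example matching the slide: two equally likely tasks with different optimal actions
# action_prior({"M1": 0.5, "M2": 0.5},
#              {"M1": {"s0": {"left"}}, "M2": {"s0": {"right"}}})
# -> prior["s0"] == {"left": 0.5, "right": 0.5}
```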
Algorithm: Average MDP
Pr(M_1) = 0.5, Pr(M_2) = 0.5
[Figure: action probabilities under the averaged MDP, here 0.0 / 1.0]
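A minimal sketch of the Average MDP idea, assuming the tasks share S, A, T, γ and differ only in reward, as in the theorem below; the value-iteration solver and array shapes are illustrative assumptions:

```python
import numpy as np

def average_mdp_policy(rewards, task_probs, T, gamma, n_iters=1000):
    """Average the task reward functions (weighted by task probability)
    and solve the resulting single MDP with value iteration.

    rewards:    list of arrays R_m[s, a], one per task
    task_probs: list of task probabilities Pr(M_m)
    T:          transition tensor T[s, a, s'] shared across tasks
    gamma:      discount factor
    """
    R_avg = sum(p * R for p, R in zip(task_probs, rewards))  # expected reward over tasks
    V = np.zeros(R_avg.shape[0])
    for _ in range(n_iters):
        Q = R_avg + gamma * T @ V   # Q[s, a] = R_avg[s, a] + gamma * sum_s' T[s, a, s'] V[s']
        V = Q.max(axis=1)
    return Q.argmax(axis=1)         # greedy policy with respect to the averaged MDP
```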
(Theorem) The Average MDP policy is optimal when only the reward function varies across tasks (i.e. S, A, T, γ are fixed) (Ramachandran & Amir 2007).
Results
Lifelong Reinforcement Learning
Repeat:
1. The agent samples an MDP from a distribution: M ← sample(D)
2. Solve it: π ← solve(M)
[Figure: sequence of sampled tasks M_1, M_2, M_3, ...]
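A minimal sketch of this protocol, assuming a uniform task distribution given as a list and an arbitrary `solve` routine; both names are placeholders, not from the paper:

```python
import random

def lifelong_rl(task_distribution, solve, n_tasks=10):
    """Lifelong RL loop: repeatedly sample a task and solve it."""
    policies = []
    for _ in range(n_tasks):
        mdp = random.choice(task_distribution)   # M <- sample(D)  (uniform sampling assumed here)
        policies.append(solve(mdp))              # pi <- solve(M)  (any RL algorithm)
    return policies
```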
Optimistic Initialization (Kearns & Singh 2002)
Initialize the Q-values optimistically to encourage exploration.
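A small sketch of the standard optimistic initialization, using the largest possible return R_max / (1 − γ) as the initial value (an assumption consistent with Kearns & Singh, not spelled out on the slide):

```python
def optimistic_q_init(n_states, n_actions, r_max, gamma):
    """Set every Q-value to the maximum achievable return so that
    unvisited state-action pairs look attractive and get explored."""
    q_init = r_max / (1.0 - gamma)
    return [[q_init] * n_actions for _ in range(n_states)]
```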
PAC-MDP (Strehl et al. '09; Rao & Whiteson '12; Mann & Choe '13)
(Theorem) The sample complexity of PAC-MDP algorithms is: [bound] if: [condition on the Q-value initialization]
Minimize: the overestimate
Subject to: [constraint]
Solution: [see MaxQInit below]
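The equations on these slides are not recoverable from the transcript; one hedged reading of the optimization, assuming the constraint is that the initialization must stay optimistic (an upper bound on every task's optimal Q-values), is:

```latex
% Assumed reading: choose the smallest initialization that is still optimistic
% for every task M in the support of the distribution D.
\min_{Q_{\text{init}}} \; Q_{\text{init}}(s,a)
\quad \text{s.t.} \quad Q_{\text{init}}(s,a) \ge Q^{*}_{M}(s,a) \;\; \forall M, s, a
% Under this reading the tightest choice is the per-(s,a) maximum over tasks:
Q_{\text{init}}(s,a) = \max_{M} Q^{*}_{M}(s,a)
```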
Algorithm: MaxQInit
[Figure: optimal Q-values from sampled tasks Task M_1, Task M_2, ..., Task M_m]
(Theorem) For m sufficiently large, MaxQInit preserves the PAC-MDP property with high probability.
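A minimal sketch of the MaxQInit initialization described on this slide, assuming access to (estimated) optimal Q-value tables for the m tasks solved so far; how those estimates are obtained is omitted:

```python
import numpy as np

def max_q_init(q_tables):
    """For each (s, a), take the maximum optimal Q-value observed across the
    m previously solved tasks and use it as the initialization for the next task.

    q_tables: list of m arrays of shape (n_states, n_actions) holding
              (estimated) optimal Q-values, one array per sampled task.
    """
    return np.max(np.stack(q_tables), axis=0)
```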
Results: Delayed Q-Learning
Results: Delayed Q-Learning
Results: R-Max (Brafman & Tennenholtz 2002)
Results: Q-Learning (Watkins 1992)
Tradeoff between jumpstart performance and convergence time.
Conclusions
Average MDP: [diagram with Task M_1, Policy 1, Task M_2, Policy 2, ...]
MaxQInit: [diagram with a Policy and Task M_1, Task M_2, Task M_3]