ICML2018 Reading Group: Policy and Value Transfer in Lifelong Reinforcement Learning
Yuu David Jinnai
July 28, 2018
Transcript
Policy and Value Transfer in Lifelong Reinforcement Learning
David Abel*, Yuu Jinnai*, George Konidaris, Michael Littman, Yue Guo (Brown University)
Motivation: Solving Multiple Tasks (Arumugam et al. 2017; Konidaris et al. 2017)
Markov Decision Processes
An MDP is a tuple M = (S, A, T, R, γ): S is the set of states, A the set of actions, T the transition function, R the reward function, and γ the discount factor.
Objective: find a policy π(a | s) that maximizes the expected total discounted reward E[Σ_t γ^t r_{t+1}].
[Agent-environment interaction diagram: s_t, a_t, r_{t+1}, s_{t+1}, ...]
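As a small illustration of these definitions (not part of the slides), the sketch below encodes a two-state, two-action MDP as arrays and computes a discounted return; the chain, reward values, and γ are assumed purely for the example.

```python
import numpy as np

# Illustrative MDP M = (S, A, T, R, gamma): two states, two actions.
# The transition probabilities and rewards below are assumptions for
# this example, not the tasks from the paper.
gamma = 0.95
T = np.array([[[0.9, 0.1],   # T[s, a, s'] = Pr(s' | s, a)
               [0.1, 0.9]],
              [[0.8, 0.2],
               [0.2, 0.8]]])
R = np.array([[0.0, 1.0],    # R[s, a] = expected immediate reward
              [0.5, 0.0]])

def discounted_return(rewards, gamma):
    """Total discounted reward: sum_t gamma^t * r_{t+1}."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([0.0, 1.0, 1.0], gamma))  # 1.8525
```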
Optimal Fixed Policy
Given a distribution over tasks, what single fixed policy maximizes the expected performance across tasks M_1, M_2, M_3, ...?
Previous Work: Action Prior (Rosman & Ramamoorthy 2012)
Given a task distribution, e.g. Pr(M_1) = 0.5 and Pr(M_2) = 0.5, the action prior is the probability of each action being the optimal action under that distribution.
Algorithm: Average MDP
Example task distribution: Pr(M_1) = 0.5, Pr(M_2) = 0.5.
(Theorem) The Average MDP policy is optimal if only the reward function is distributed (i.e., S, A, T, and γ are fixed across tasks) (Ramachandran & Amir 2007).
Results
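For intuition, here is a rough sketch of the Average MDP construction under the theorem's assumption (shared S, A, T, γ; only R varies): average the reward functions weighted by the task probabilities and solve the averaged MDP, e.g. with value iteration. The array shapes and the solver loop are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def average_mdp_policy(T, rewards, probs, gamma=0.95, iters=500):
    """Average MDP sketch: tasks share S, A, T and gamma and differ only
    in reward. Average the rewards by task probability, then solve the
    averaged MDP with value iteration and return the greedy fixed policy.

    T:       (S, A, S) shared transition probabilities
    rewards: list of (S, A) reward arrays, one per task M_i
    probs:   task probabilities Pr(M_i), summing to 1
    """
    R_avg = sum(p * R for p, R in zip(probs, rewards))
    V = np.zeros(R_avg.shape[0])
    for _ in range(iters):
        Q = R_avg + gamma * (T @ V)   # Bellman optimality backup, shape (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)           # one action per state: the fixed policy
```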
Lifelong Reinforcement Learning
Repeat:
1. The agent samples an MDP from the distribution D: M ← sample(D)
2. Solve it: π ← solve(M)
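A minimal sketch of this sample-and-solve loop, assuming the distribution D is given as (task, probability) pairs and `solve` is any single-task RL routine; both are placeholders for illustration, not the paper's implementation.

```python
import random

def lifelong_rl(D, solve, n_rounds=100, seed=0):
    """Lifelong RL sketch: repeatedly sample a task M ~ D and solve it.

    D:     list of (task, probability) pairs describing the distribution
    solve: any single-task solver returning a policy (placeholder)
    """
    rng = random.Random(seed)
    tasks, probs = zip(*D)
    policies = []
    for _ in range(n_rounds):
        M = rng.choices(tasks, weights=probs, k=1)[0]  # M <- sample(D)
        policies.append(solve(M))                      # pi <- solve(M)
    return policies
```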
Optimistic Initialization (Kearns & Singh 2002)
Initialize the Q-values optimistically to encourage exploration.
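A common form of optimistic initialization (the exact constant on the slide is not shown, so Vmax = Rmax / (1 - γ) below is an assumption) sets every Q-value to the largest achievable return:

```python
import numpy as np

def optimistic_q_init(n_states, n_actions, r_max, gamma):
    """Initialize every Q(s, a) to Vmax = Rmax / (1 - gamma), an upper
    bound on any return, so that untried actions look attractive and the
    agent is pushed to explore them."""
    v_max = r_max / (1.0 - gamma)
    return np.full((n_states, n_actions), v_max)
```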
PAC-MDP (Strehl et al. 2009; Rao & Whiteson 2012; Mann & Choe 2013)
(Theorem) The sample complexity of PAC-MDP algorithms scales with how much the initial Q-values overestimate the optimal values, provided the initialization remains optimistic (an upper bound).
Therefore: minimize the overestimate, subject to the initialization staying optimistic. The solution is the MaxQInit initialization below.
Algorithm: MaxQInit
Sample and solve tasks M_1, M_2, ..., M_m, then initialize the Q-values for a new task to the maximum of their optimal Q-values: Q_init(s, a) = max_i Q*_{M_i}(s, a).
(Theorem) For m sufficiently large, MaxQInit preserves the PAC-MDP property with high probability.
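A sketch of the MaxQInit initialization described above: for each state-action pair, take the maximum optimal Q-value across the m previously solved tasks. The tabular (S, A) array representation is an assumption for illustration.

```python
import numpy as np

def max_q_init(q_stars):
    """MaxQInit sketch: q_stars is a list of optimal Q-value tables
    Q*_{M_1}, ..., Q*_{M_m}, each of shape (S, A), from previously solved
    tasks. The new task's Q-values start at their elementwise maximum,
    which (per the theorem) remains optimistic for a newly sampled task
    with high probability once m is large enough."""
    return np.max(np.stack(q_stars, axis=0), axis=0)
```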
Results: Delayed Q-Learning
Results: R-Max (Brafman & Tennenholtz 2002)
Results: Q-Learning (Watkins 1992)
Tradeoff: jumpstart performance vs. convergence time.
Conclusions
Average MDP: an optimal fixed policy for a distribution of tasks when only the reward function varies.
MaxQInit: Q-value initialization from previously solved tasks that preserves the PAC-MDP property in lifelong RL.