ICML 2018 Reading Group: Policy and Value Transfer in Lifelong Reinforcement Learning
Yuu David Jinnai
July 28, 2018
Transcript
Policy and Value Transfer in Lifelong Reinforcement Learning
David Abel*, Yuu Jinnai*, George Konidaris, Michael Littman, Yue Guo
Brown University
Motivation: Solving Multiple Tasks (Arumugam et al. 2017) (Konidaris et al. 2017)
Markov Decision Processes
M = (S, A, T, R, γ)
S: set of states; A: set of actions; T: transition function; R: reward function; γ: discount factor
Objective: find a policy π(a | s) which maximizes the expected total discounted reward.
[Figure: agent-environment interaction s_t, a_t, r_{t+1}, s_{t+1}, …]
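As a concrete illustration of this objective (not part of the slides), here is a minimal tabular value-iteration sketch for a finite MDP; the function name and array layout (T of shape S×A×S, R of shape S×A) are assumptions made for the example.

import numpy as np

def value_iteration(T, R, gamma, tol=1e-8):
    # Minimal sketch: compute Q* and a greedy policy for a finite MDP.
    # T: transition probabilities, shape (S, A, S); R: expected rewards, shape (S, A).
    S, A = R.shape
    Q = np.zeros((S, A))
    while True:
        V = Q.max(axis=1)              # greedy state values
        Q_new = R + gamma * (T @ V)    # Bellman optimality backup
        if np.max(np.abs(Q_new - Q)) < tol:
            break
        Q = Q_new
    return Q, Q.argmax(axis=1)         # Q* and a deterministic greedy policy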
Optimal Fixed Policy
Given a distribution over tasks, what single policy maximizes the expected performance?
[Figure: one policy shared across tasks M1, M2, M3]
Previous Work: Action Prior (Rosman & Ramamoorthy 2012)
[Figure: two tasks with Pr(M1) = 0.5 and Pr(M2) = 0.5; the prior gives the probability of each action being the optimal action]
Algorithm: Average MDP
[Figure: two tasks with Pr(M1) = 0.5 and Pr(M2) = 0.5; the policy that is optimal for the averaged MDP]
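A minimal sketch of the Average MDP idea (my illustration, reusing the hypothetical value_iteration above): assuming the tasks share S, A, T and γ and differ only in their reward functions, average the rewards under the task distribution and solve the resulting single MDP.

def average_mdp_policy(reward_fns, probs, T, gamma):
    # reward_fns: list of reward arrays R_i, shape (S, A), one per task M_i
    # probs: task probabilities Pr(M_i); T, gamma: shared dynamics and discount
    R_avg = sum(p * R for p, R in zip(probs, reward_fns))   # expected reward
    return value_iteration(T, R_avg, gamma)                  # one fixed policy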
(Theorem) The Average MDP policy is optimal when only the reward function is distributed, i.e. S, A, T, γ are fixed across tasks (Ramachandran & Amir 2007).
Results [figure]
Optimal Fixed Policy (revisited)
Given a distribution over tasks, what single policy maximizes the expected performance?
[Figure: one policy shared across tasks M1, M2, M3]
Lifelong Reinforcement Learning
Repeat:
1. The agent samples an MDP from a distribution: M ← sample(D)
2. Solve it: π ← solve(M)
[Figure: sequence of tasks M1, M2, M3, … drawn from D]
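A minimal sketch of this interaction protocol (the names sample_task and solve are placeholders, not an API from the paper):

def lifelong_rl(sample_task, solve, num_tasks):
    # Repeatedly draw a task M ~ D and solve it; any per-task learner
    # (Q-learning, R-Max, Delayed Q-learning, ...) can play the role of solve().
    policies = []
    for _ in range(num_tasks):
        M = sample_task()           # step 1: M <- sample(D)
        policies.append(solve(M))   # step 2: pi <- solve(M)
    return policies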
Optimistic Initialization (Kearns & Singh 2002)
Initialize Q-values optimistically to encourage exploration.
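A common form of optimistic initialization (a standard choice shown here as an assumption, not necessarily the exact constant on the slide) starts every Q-value at the largest attainable discounted return:

import numpy as np

def optimistic_q_init(num_states, num_actions, r_max, gamma):
    # Every (s, a) starts at V_max = R_max / (1 - gamma), so actions that
    # have not been tried yet always look at least as good as tried ones,
    # which drives the agent to explore them.
    v_max = r_max / (1.0 - gamma)
    return np.full((num_states, num_actions), v_max)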
PAC-MDP (Strehl et al. 2009; Rao & Whiteson 2012; Mann & Choe 2013)
(Theorem) The sample complexity of PAC-MDP algorithms scales with how much the initial Q-values overestimate the optimal Q-values, provided the initialization remains an upper bound on Q*.
Minimize: the overestimate.
Subject to: the initialization stays an upper bound, so the PAC-MDP guarantee is preserved.
Solution: MaxQInit.
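Stated as an optimization problem (my paraphrase of the slide, not a formula quoted from the paper), the goal is an initialization that is as tight as possible while staying optimistic:

\min_{Q_{\text{init}}} \sum_{s,a} \bigl( Q_{\text{init}}(s,a) - Q^{*}(s,a) \bigr)
\quad \text{subject to} \quad Q_{\text{init}}(s,a) \ge Q^{*}(s,a) \;\; \forall s, a.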
Algorithm: MaxQInit
[Figure: Q-values from previously solved tasks M1, M2, …, Mm]
(Theorem) For m sufficiently large, MaxQInit preserves the PAC-MDP property with high probability.
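A minimal sketch of the MaxQInit idea, assuming the optimal Q-values of previously solved tasks are available as arrays (the function name and data layout are illustrative, not the paper's code):

import numpy as np

def max_q_init(solved_q_values):
    # Initialize the Q-function of a new task with the elementwise maximum of
    # the optimal Q-values of previously solved tasks M_1, ..., M_m.
    # With enough sampled tasks this remains an upper bound on the new task's
    # Q* with high probability, so exploration stays provably efficient while
    # the overestimate (and hence the sample complexity) shrinks.
    return np.max(np.stack(solved_q_values, axis=0), axis=0)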
Results: Delayed Q-Learning [figures]
Results: R-Max (Brafman & Tennenholtz 2002) [figure]
Results: Q-Learning (Watkins 1992)
Tradeoff between jumpstart performance and convergence time.
Conclusions
Average MDP: a single fixed policy for a distribution of tasks.
MaxQInit: Q-value initialization for lifelong RL, transferring from previously solved tasks.
[Figure: tasks M1, M2, … with the policies produced by each approach]