ICML2018 Reading Group: Policy and Value Transfer in Lifelong Reinforcement Learning
Yuu David Jinnai
July 28, 2018
Transcript
Policy and Value Transfer in Lifelong Reinforcement Learning
David Abel*, Yuu Jinnai*, George Konidaris, Michael Littman, Yue Guo (Brown University)
Motivation: Solving Multiple Tasks (Arumugam et al. 2017; Konidaris et al. 2017)
Markov Decision Processes: M = (S, A, T, R, γ)
S: set of states; A: set of actions; T: transition function; R: reward function; γ: discount factor.
Objective: find a policy π(a | s) that maximizes the total discounted reward.
[Figure: agent-environment interaction, s_t, a_t, r_{t+1}, s_{t+1}, ...]
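For reference in the later slides, here is a minimal tabular MDP plus value iteration in Python. This code is not from the talk; `MDP` and `value_iteration` are illustrative names reused by the sketches below.

```python
import numpy as np

class MDP:
    """Tabular MDP M = (S, A, T, R, gamma)."""
    def __init__(self, T, R, gamma):
        self.T = T            # T[s, a, s'] : transition probabilities
        self.R = R            # R[s, a]     : expected immediate reward
        self.gamma = gamma    # discount factor in [0, 1)
        self.nS, self.nA = R.shape

def value_iteration(mdp, tol=1e-8):
    """Return the optimal Q-table and its greedy deterministic policy."""
    Q = np.zeros((mdp.nS, mdp.nA))
    while True:
        V = Q.max(axis=1)                        # current estimate of V*(s')
        Q_new = mdp.R + mdp.gamma * (mdp.T @ V)  # Bellman optimality backup
        if np.abs(Q_new - Q).max() < tol:
            return Q_new, Q_new.argmax(axis=1)
        Q = Q_new
```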
Optimal Fixed Policy: given a distribution over tasks, what single policy maximizes the expected performance?
[Figure: tasks M_1, M_2, M_3 all mapped to one policy]
Previous Work: Action Prior (Rosman & Ramamoorthy 2012)
Pr(M_1) = 0.5, Pr(M_2) = 0.5. The prior gives the probability of each action being the optimal action under the task distribution.
[Figure: gridworld example with action probabilities 0.5 / 0.5]
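A sketch of how such a prior could be computed from the task distribution, reusing the illustrative `MDP`/`value_iteration` helpers above. This is a simplification for intuition, not Rosman & Ramamoorthy's exact procedure.

```python
import numpy as np

def action_prior(tasks, probs):
    """Pr(a is optimal in s), averaged over tasks M_i with probabilities Pr(M_i)."""
    nS, nA = tasks[0].nS, tasks[0].nA
    prior = np.zeros((nS, nA))
    for mdp, p in zip(tasks, probs):
        _, pi = value_iteration(mdp)       # optimal (greedy) policy of this task
        prior[np.arange(nS), pi] += p      # weight the optimal action by Pr(M_i)
    return prior                           # rows sum to 1 when probs sum to 1
```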
Algorithm: Average MDP
Pr(M_1) = 0.5, Pr(M_2) = 0.5. [Figure: gridworld example with action probabilities 0.0 / 1.0]
(Theorem) The Average MDP policy is optimal if only the reward function varies across the distribution (i.e. S, A, T, γ are fixed) (Ramachandran & Amir 2007).
Results: [plot omitted in transcript]
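A sketch of the Average MDP construction under the theorem's assumption that only R varies across tasks (S, A, T, γ shared); `average_mdp` is an illustrative name, not the paper's code.

```python
def average_mdp(tasks, probs):
    """Average the reward functions over the task distribution (S, A, T, gamma shared)."""
    R_avg = sum(p * mdp.R for mdp, p in zip(tasks, probs))
    base = tasks[0]                        # any task supplies the shared T and gamma
    return MDP(base.T, R_avg, base.gamma)

# The single fixed policy is the greedy policy of the averaged task:
# _, pi_avg = value_iteration(average_mdp(tasks, probs))
```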
Optimal Fixed Policy: given a distribution over tasks, what single policy maximizes the expected performance?
[Figure: tasks M_1, M_2, M_3 all mapped to one policy]
Lifelong Reinforcement Learning: repeat:
1. The agent samples an MDP from a distribution: M ← sample(D)
2. Solve it: π ← solve(M)
[Figure: sequence of tasks M_1, M_2, M_3, ... drawn from D]
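The same loop in Python form (a sketch; `sample_task` and `solve` stand in for the distribution D and for whatever RL algorithm is run on each task).

```python
def lifelong_rl(sample_task, solve, num_tasks):
    """Repeat: draw M ~ D, then solve it; any transfer happens inside `solve`."""
    policies = []
    for _ in range(num_tasks):
        mdp = sample_task()            # M <- sample(D)
        policies.append(solve(mdp))    # pi <- solve(M)
    return policies
```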
Optimistic Initialization (Kearns & Singh 2002): initialize the Q-values optimistically to encourage exploration. [Initialization equation shown as a figure]
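A sketch of optimistic initialization for a tabular learner, assuming the common upper bound Q_0(s, a) = R_max / (1 − γ); this particular constant is my assumption, not read from the slide.

```python
import numpy as np

def optimistic_q_init(nS, nA, r_max, gamma):
    """Initialize every Q(s, a) to an upper bound on any achievable return."""
    v_max = r_max / (1.0 - gamma)        # assumed upper bound V_max
    return np.full((nS, nA), v_max)      # a greedy agent is pulled toward untried actions
```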
PAC-MDP (Strehl et al. '09; Rao & Whiteson '12; Mann & Choe '13)
(Theorem) The sample complexity of a PAC-MDP algorithm is bounded, provided its Q-value initialization satisfies the stated condition. [Bound and condition shown as equations in the slides]
Minimize: the overestimate of the initial Q-values. Subject to: the condition above. Solution: initialize with the maximum optimal Q-value across the task distribution (MaxQInit).
Algorithm: MaxQInit
[Figure: previously solved tasks M_1, M_2, ..., M_m]
(Theorem) For m sufficiently large, MaxQInit preserves the PAC-MDP property with high probability.
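A sketch of the MaxQInit initialization as described on the slide: the Q-table for a new task starts at the element-wise maximum of the optimal Q-tables of the m tasks solved so far (function name is illustrative).

```python
import numpy as np

def max_q_init(solved_q_tables):
    """Element-wise max over the optimal Q-tables of previously solved tasks.

    For m sufficiently large this remains optimistic for a newly sampled task
    with high probability, so the PAC-MDP property is preserved (per the theorem)
    while being far tighter than an R_max / (1 - gamma) initialization.
    """
    return np.maximum.reduce(solved_q_tables)
```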
Results: Delayed Q-Learning [plots omitted in transcript]
Results: R-Max (Brafman & Tennenholtz 2002) [plots omitted in transcript]
Results: Q-Learning (Watkins 1992): trade-off between jumpstart performance and convergence time. [plots omitted in transcript]
Conclusions: the Average MDP gives the optimal fixed policy for a task distribution (when only rewards vary), while MaxQInit transfers Q-values across tasks in lifelong RL, where each sampled task M_1, M_2, M_3, ... is solved in turn.
[Summary figure omitted in transcript]