ICML 2018 Reading Group: Policy and Value Transfer in Lifelong Reinforcement Learning
Yuu David Jinnai
July 28, 2018
Transcript
Policy and Value Transfer in Lifelong Reinforcement Learning
David Abel*, Yuu Jinnai*, George Konidaris, Michael Littman, Yue Guo
Brown University
Motivation: Solving Multiple Tasks (Arumugam et al. 2017; Konidaris et al. 2017)
Markov Decision Processes
M = (S, A, T, R, γ)
S: set of states; A: set of actions; T: transition function; R: reward function; γ: discount factor.
Objective: find a policy π(a | s) that maximizes the expected total discounted reward.
[Figure: agent-environment loop: in state s_t the agent takes action a_t, receives reward r_{t+1}, and moves to state s_{t+1}, repeating.]
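To make the objective concrete, here is a minimal value-iteration sketch for solving a finite MDP; the function name, array layout, and tolerance are illustrative assumptions, not from the slides.

import numpy as np

def value_iteration(T, R, gamma, tol=1e-8):
    """Solve a finite MDP by value iteration (illustrative sketch).

    T: transition probabilities, shape (S, A, S)
    R: expected rewards, shape (S, A)
    gamma: discount factor in [0, 1)
    Returns the optimal Q-values, shape (S, A).
    """
    Q = np.zeros(R.shape)
    while True:
        V = Q.max(axis=1)            # greedy state values V(s) = max_a Q(s, a)
        Q_new = R + gamma * (T @ V)  # Bellman optimality backup
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new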
Optimal Fixed Policy
Given a distribution over tasks, what single fixed policy maximizes the expected performance?
[Figure: tasks M_1, M_2, M_3 all mapped to one shared policy.]
Previous Work: Action Prior (Rosman & Ramamoorthy 2012)
For Pr(M_1) = 0.5 and Pr(M_2) = 0.5, the action prior gives each action's probability of being the optimal action (0.5 each in the example).
Algorithm: Average MDP
[Figure: for Pr(M_1) = 0.5 and Pr(M_2) = 0.5, the Average MDP yields a deterministic policy (action probabilities 0.0 and 1.0).]
(Theorem) The Average MDP policy is an optimal fixed policy when only the reward function varies across the task distribution (i.e. S, A, T, γ are fixed) (Ramachandran & Amir 2007).
Results: [performance plots for the Average MDP policy.]
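A hedged sketch of the construction behind this theorem: average the reward functions under the task distribution and solve the single resulting MDP. It assumes the theorem's setting (shared S, A, T, γ; only R varies) and reuses the value_iteration sketch above; all names are illustrative.

def average_mdp_policy(rewards, probs, T, gamma):
    """Average MDP (sketch): average the tasks' reward functions,
    weighted by the task probabilities, then solve the resulting MDP.
    Assumes all tasks share S, A, T, gamma and differ only in R,
    which is the setting where the theorem applies.

    rewards: list of reward arrays, each shape (S, A)
    probs:   task probabilities, same length as rewards
    """
    R_avg = sum(p * R for p, R in zip(probs, rewards))
    Q = value_iteration(T, R_avg, gamma)  # from the sketch above
    return Q.argmax(axis=1)               # deterministic greedy policy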
Lifelong Reinforcement Learning
Repeat:
1. The agent samples an MDP from a distribution: M ← sample(D)
2. Solve it: π ← solve(M)
[Figure: a stream of sampled tasks M_1, M_2, M_3, ...]
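A minimal sketch of this loop, assuming D is a uniform distribution over a finite task list; sample and solve stand in for whatever sampler and single-task solver are in use.

import random

def lifelong_rl(tasks, solve, num_rounds):
    """Lifelong RL loop (sketch): repeatedly draw a task M from the
    distribution D and solve it. Here D is assumed uniform over a
    finite task list; `solve` is a placeholder single-task solver
    (e.g. the value_iteration sketch above)."""
    policies = []
    for _ in range(num_rounds):
        M = random.choice(tasks)  # M ← sample(D)
        pi = solve(M)             # π ← solve(M)
        policies.append(pi)
    return policies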
Optimistic Initialization (Kearns & Singh 2002)
Initialize the Q-values optimistically to encourage exploration.
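A common concrete choice (an assumption here; the slide gives no formula) is to start every Q-value at the largest possible discounted return, R_max / (1 - γ):

import numpy as np

def optimistic_q_init(num_states, num_actions, r_max, gamma):
    """Optimistic initialization (sketch): unvisited state-action
    pairs start at the maximum possible return, so the agent is
    drawn to try them. R_max / (1 - gamma) is the standard upper
    bound, assumed here rather than taken from the slides."""
    v_max = r_max / (1.0 - gamma)
    return np.full((num_states, num_actions), v_max)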
PAC-MDP (Strehl et al. 2009; Rao & Whiteson 2012; Mann & Choe 2013)
(Theorem) The sample complexity of a PAC-MDP algorithm depends on its Q-value initialization: the looser the optimistic overestimate, the more samples are needed.
Goal: minimize the overestimate, subject to the initialization remaining optimistic. Solution: the MaxQInit initialization below.
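The slides' equations were images; as a hedged LaTeX reconstruction of the idea (Q_0 is the initialization, Q*_M the optimal Q-values of a task M drawn from D):

% Hedged reconstruction, not the slides' exact equations:
% choose the tightest initialization that stays optimistic
% for the tasks in the distribution.
\begin{aligned}
\min_{Q_0}\quad & \sum_{s,a} Q_0(s,a) \\
\text{s.t.}\quad & Q_0(s,a) \ge Q^*_M(s,a) \quad \forall s, a,\ \forall M \in \operatorname{supp}(D)
\end{aligned}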
Algorithm: MaxQInit
After solving tasks M_1, M_2, ..., M_m, initialize the Q-values for the next task to the maximum, over the solved tasks, of their optimal Q-values.
(Theorem) For m sufficiently large, MaxQInit preserves the PAC-MDP property with high probability.
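A minimal sketch of the initialization step, assuming the optimal Q-values of the m solved tasks are available as arrays:

import numpy as np

def max_q_init(q_stars):
    """MaxQInit (sketch): initialize the next task's Q-values to the
    elementwise maximum of the solved tasks' optimal Q-values. With
    enough sampled tasks, this remains optimistic for a new task
    from the same distribution with high probability.

    q_stars: list of arrays, each shape (S, A), one per solved task.
    """
    return np.max(np.stack(q_stars), axis=0)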
Results: Delayed Q-Learning
Results: R-Max (Brafman & Tennenholtz 2002)
Results: Q-Learning (Watkins 1992)
There is a tradeoff between jumpstart performance and convergence time.
Conclusions
Average MDP: given a task distribution, compute one fixed policy that maximizes expected performance across tasks (Task M_1, M_2, M_3 → one Policy).
MaxQInit: in lifelong RL, transfer the solved tasks' Q-values to initialize each new task's learner (Task M_1 → Policy 1, Task M_2 → Policy 2, ...), preserving the PAC-MDP property.