ICML 2018 Reading Group: Policy and Value Transfer in Lifelong Reinforcement Learning
Yuu David Jinnai
July 28, 2018
Transcript
Policy and Value Transfer in Lifelong Reinforcement Learning
David Abel*, Yuu Jinnai*, George Konidaris, Michael Littman, Yue Guo
Brown University
Motivation: Solving Multiple Tasks (Arumugam et al. 2017; Konidaris et al. 2017)
Markov Decision Processes
An MDP is a tuple M = (S, A, T, R, γ): S is the set of states, A the set of actions, T the transition function, R the reward function, and γ the discount factor.
Objective: find a policy π(a | s) that maximizes the expected total discounted reward.
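The objective on this slide can be written in the standard form (a LaTeX reconstruction; the slide shows it as an agent-environment timeline diagram):

    \pi^* \in \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} \, r_{t+1} \right]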
Optimal Fixed Policy
Given a distribution over tasks M1, M2, M3, ..., what single policy maximizes the expected performance?
Previous Work: Action Prior (Rosman & Ramamoorthy 2012)
Given the task distribution, e.g. Pr(M1) = 0.5 and Pr(M2) = 0.5, compute for each state the probability that each action is the optimal action, and use that distribution as a prior to bias action selection.
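A minimal Python sketch of this idea, assuming tabular tasks that share S and A and that each task's optimal Q-values are available (the function name and array layout are mine, not from the slides):

    import numpy as np

    def action_prior(q_stars, probs):
        # q_stars: per-task |S| x |A| optimal Q-value tables
        # probs: task probabilities Pr(M_i)
        S, A = q_stars[0].shape
        prior = np.zeros((S, A))
        for p, Q in zip(probs, q_stars):
            best = Q.argmax(axis=1)            # optimal action per state
            prior[np.arange(S), best] += p     # accumulate Pr(M_i)
        return prior  # prior[s, a] = Pr(a is optimal in s)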
Algorithm: Average MDP
Solve the single MDP whose reward function is the expectation of the task reward functions under the task distribution (e.g. Pr(M1) = 0.5, Pr(M2) = 0.5).
(Theorem) The policy of the Average MDP is an optimal fixed policy if only the reward function is distributed (i.e. S, A, T, and γ are fixed across tasks) (Ramachandran & Amir 2007).
Results: [comparison plots not transcribed]
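A minimal sketch of the Average MDP, assuming the tasks share S, A, T, and γ and differ only in reward, as the theorem requires (helper names are mine):

    import numpy as np

    def average_mdp_q(task_rewards, probs, T, gamma, tol=1e-8):
        # task_rewards: list of |S| x |A| reward tables, one per task M_i
        # probs: Pr(M_i); T: shared |S| x |A| x |S| transition tensor
        R_avg = sum(p * R for p, R in zip(probs, task_rewards))
        Q = np.zeros_like(R_avg)
        while True:  # value iteration on the Average MDP
            Q_new = R_avg + gamma * T @ Q.max(axis=1)
            if np.abs(Q_new - Q).max() < tol:
                return Q_new  # act greedily: pi(s) = argmax_a Q[s, a]
            Q = Q_new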
Lifelong Reinforcement Learning
Repeat:
1. The agent samples an MDP from a distribution: M ← sample(D)
2. Solve it: π ← solve(M)
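The loop above, as a short Python sketch (tasks and solve are stand-ins for the task distribution D and any learning algorithm, e.g. a PAC-MDP learner):

    import random

    def lifelong_rl(tasks, solve, num_tasks):
        # tasks: the MDPs making up D (sampled uniformly here for
        # simplicity); solve: any learner returning a policy
        policies = []
        for _ in range(num_tasks):
            M = random.choice(tasks)      # M <- sample(D)
            policies.append(solve(M))     # pi <- solve(M)
        return policies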
Optimistic Initialization (Kearns & Singh 2002)
Initialize the Q-values optimistically to encourage exploration.
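A minimal sketch, using the standard bound Vmax = Rmax / (1 - γ) as the optimistic value (the function name is mine):

    def optimistic_q_init(num_states, num_actions, r_max, gamma):
        # Every Q-value starts at Vmax, an upper bound on any possible
        # discounted return, so untried actions always look promising.
        v_max = r_max / (1.0 - gamma)
        return [[v_max] * num_actions for _ in range(num_states)]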
PAC-MDP (Strehl et al. '09; Rao & Whiteson '12; Mann & Choe '13)
(Theorem) The sample complexity of a PAC-MDP algorithm is preserved if the initial Q-values upper-bound the optimal value function.
This turns initialization into an optimization problem. Minimize: the overestimate of the optimal Q-values (a tighter initialization gives a better jumpstart). Subject to: the initialization remaining an upper bound (so the PAC-MDP guarantee is kept). Solution: MaxQInit.
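In symbols, one way to write the optimization these slides build up (a reconstruction; the slide equations did not survive transcription), with Q_0 the initialization and Q*_M the optimal Q-function of a task M drawn from D:

    \min_{Q_0} \; \mathbb{E}_{M \sim D}\!\left[ Q_0(s, a) - Q^*_M(s, a) \right]
    \quad \text{s.t.} \quad Q_0(s, a) \ge Q^*_M(s, a) \;\; \forall s, a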
Algorithm: MaxQInit
Given previously solved tasks M1, M2, ..., Mm, initialize the Q-values for a new task to the maximum of those tasks' optimal Q-values.
(Theorem) For m sufficiently large, MaxQInit preserves the PAC-MDP property with high probability.
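A minimal sketch of the MaxQInit initialization, assuming the optimal Q-value tables of the m solved tasks are stored as NumPy arrays:

    import numpy as np

    def max_q_init(q_stars):
        # Q_0(s, a) = max_i Q*_{M_i}(s, a): the tightest initialization
        # that still upper-bounds Q* for a new task with high probability
        # once enough tasks m have been seen.
        return np.maximum.reduce(q_stars)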
Results: Delayed Q-Learning [plots not transcribed]
Results: R-Max (Brafman & Tennenholtz 2002)
Results: Q-Learning (Watkins 1992)
There is a tradeoff between jumpstart performance and convergence time.
Conclusions
Average MDP: a single fixed policy that is optimal in expectation over the task distribution when only the reward function varies.
MaxQInit: per-task policies in the lifelong setting (tasks M1, M2, M3, ...), with transferred Q-value initializations that preserve the PAC-MDP property with high probability.