ICML 2018 Paper Reading Session: Policy and Value Transfer in Lifelong Reinforcement Learning
Yuu David Jinnai
July 28, 2018
Research
Transcript
Policy and Value Transfer in Lifelong Reinforcement Learning. David Abel*, Yuu Jinnai*, George Konidaris, Michael Littman, Yue Guo. Brown University.
Motivation: Solving Multiple Tasks (Arumugam et al. 2017; Konidaris et al. 2017)
Markov Decision Processes: M = (S, A, T, R, γ). S: set of states; A: set of actions; T: transition function; R: reward function; γ: discount factor. Objective: find a policy π(a | s) that maximizes the total discounted reward. [Figure: agent-environment interaction loop with s_t, a_t, r_{t+1}, s_{t+1}, ...]
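For reference, the objective above can be written as the expected discounted return (standard notation, matching the symbols defined on this slide):

    \pi^* = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \right]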
Optimal Fixed Policy. Given a distribution over tasks, what single policy maximizes the expected performance? [Figure: tasks M_1, M_2, M_3 all mapped to one fixed policy]
Previous Work: Action Prior (Rosman & Ramamoorthy 2012). Pr(M_1) = 0.5, Pr(M_2) = 0.5. [Figure: for each state, the probability of each action being the optimal action under the task distribution]
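A minimal sketch of the action-prior idea, assuming each task's optimal actions are already known; the helper names (action_prior, optimal_actions) are illustrative, not from Rosman & Ramamoorthy's implementation.

    from collections import defaultdict

    def action_prior(tasks, task_probs, optimal_actions, states):
        # Estimate Pr(a is optimal in s) under the task distribution D.
        # optimal_actions(task, s) returns the set of optimal actions of `task` in state s.
        prior = defaultdict(float)
        for task, p in zip(tasks, task_probs):
            for s in states:
                for a in optimal_actions(task, s):
                    prior[(s, a)] += p
        return dict(prior)  # prior[(s, a)] = probability that a is optimal in s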
Algorithm: Average MDP. Pr(M_1) = 0.5, Pr(M_2) = 0.5. [Figure: the two tasks combined into a single average MDP]
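A minimal sketch of the Average MDP idea, under the assumption (made precise in the theorem below) that only the reward function differs across tasks: average the reward functions under the task distribution and solve the resulting single MDP. Tabular NumPy arrays and the value-iteration loop are assumptions for illustration, not the authors' code.

    import numpy as np

    def average_mdp_policy(reward_fns, task_probs, T, gamma, n_iters=1000):
        # reward_fns: list of reward arrays R_i of shape (S, A), one per task M_i
        # task_probs: Pr(M_i) for each task; T: shared transition tensor of shape (S, A, S)
        R_avg = sum(p * R for p, R in zip(task_probs, reward_fns))  # E_{M~D}[R_M]
        n_s, n_a = R_avg.shape
        Q = np.zeros((n_s, n_a))
        for _ in range(n_iters):  # value iteration on the average MDP
            V = Q.max(axis=1)
            Q = R_avg + gamma * T.reshape(n_s * n_a, n_s).dot(V).reshape(n_s, n_a)
        return Q.argmax(axis=1)  # one fixed (deterministic) policy for all tasks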
(Theorem) The optimal policy of the Average MDP is an optimal fixed policy if only the reward function is distributed (i.e. S, A, T, γ are fixed across tasks) (Ramachandran & Amir 2007).
Results. [Figure: empirical results for the Average MDP policy]
Lifelong Reinforcement Learning. Repeat: 1. The agent samples an MDP from a distribution: M ← sample(D). 2. Solve it: π ← solve(M). [Figure: a sequence of tasks M_1, M_2, M_3, ... drawn from D]
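The protocol above, written as a loop; sample_task and solve stand in for the task distribution D and the learner, and are placeholders rather than a fixed API.

    def lifelong_rl(sample_task, solve, n_tasks):
        # Repeat: draw M ~ D, then solve it; solve() may reuse knowledge across tasks.
        policies = []
        for _ in range(n_tasks):
            M = sample_task()   # M <- sample(D)
            pi = solve(M)       # pi <- solve(M)
            policies.append(pi)
        return policies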
Optimistic Initialization (Kearns & Singh 2002). Initialize Q-values optimistically to encourage exploration.
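A minimal sketch of optimistic initialization, assuming rewards bounded in [0, R_max]: every Q-value starts at the largest possible discounted return, so unvisited state-action pairs look attractive and get explored.

    import numpy as np

    def optimistic_q_init(n_states, n_actions, r_max, gamma):
        # R_max / (1 - gamma) upper-bounds any discounted return, so starting every
        # Q(s, a) there is optimistic for all state-action pairs.
        v_max = r_max / (1.0 - gamma)
        return np.full((n_states, n_actions), v_max)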
PAC-MDP (Strehl et al. 2009; Rao & Whiteson 2012; Mann & Choe 2013). (Theorem) The sample complexity of PAC-MDP algorithms is: [bound shown in figure]. Minimize: the overestimate. Subject to: [constraint shown in figure]. Solution: [shown in figure].
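The bound and constraint on this slide are in the figures and are not reproduced here. Roughly, the sample complexity grows with how far the initial Q-values sit above the optimal values, while the PAC-MDP guarantee requires the initialization to remain optimistic; a hedged paraphrase of the stated optimization, with assumed notation:

    \min_{Q_0} \; Q_0(s, a) \quad \text{subject to} \quad Q_0(s, a) \ge Q^*_M(s, a) \;\; \forall (s, a), \text{ with high probability over } M \sim D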
Algorithm: MaxQInit. [Figure: optimal Q-values from solved tasks M_1, M_2, ..., M_m combined into an initialization]. (Theorem) For m sufficiently large, MaxQInit preserves the PAC-MDP property with high probability.
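A minimal sketch of the MaxQInit initialization, assuming the previously sampled tasks have been solved to (near-)optimal tabular Q-values; the array shapes are an assumption for illustration.

    import numpy as np

    def max_q_init(prev_q_stars):
        # prev_q_stars: list of optimal Q-value arrays Q*_{M_1}, ..., Q*_{M_m},
        # each of shape (S, A), from previously solved tasks.
        # The elementwise maximum initializes the next task's Q-values; for m
        # large enough it stays optimistic with high probability.
        return np.max(np.stack(prev_q_stars, axis=0), axis=0)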
Results: Delayed Q-Learning
Results: R-Max (Brafman & Tennenholtz 2002)
Results: Q-Learning (Watkins 1992). Tradeoff between jumpstart performance and convergence time.
Conclusions. Average MDP: [diagram: Task M_1 / Policy 1, Task M_2 / Policy 2, ...]. MaxQInit: [diagram: tasks M_1, M_2, M_3 and a policy].