ICML2018 Reading Group: Policy and Value Transfer in Lifelong Reinforcement Learning
Yuu David Jinnai
July 28, 2018
Transcript
Policy and Value Transfer in Lifelong Reinforcement Learning
David Abel*, Yuu Jinnai*, George Konidaris, Michael Littman, Yue Guo
Brown University
Motivation: Solving Multiple Tasks (Arumugam et al. 2017; Konidaris et al. 2017)
Markov Decision Processes
An MDP is a tuple M = (S, A, T, R, γ): S is the set of states, A the set of actions, T the transition function, R the reward function, and γ the discount factor.
Objective: find a policy π(a | s) which maximizes the total discounted reward. (Diagram: agent-environment interaction, s_t, a_t, r_{t+1}, s_{t+1}, ...)
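The slide gives only the MDP tuple; as a concrete illustration, here is a minimal sketch (the container class, its field names, and the value-iteration helper are my own, not from the slides) of a finite MDP and of computing its optimal Q-values. The helper is reused in the sketches further down.

```python
import numpy as np

# Minimal finite-MDP container; field names are illustrative, not from the slides.
class MDP:
    def __init__(self, n_states, n_actions, T, R, gamma):
        self.n_states = n_states      # |S|
        self.n_actions = n_actions    # |A|
        self.T = T                    # T[s, a, s'] = transition probability
        self.R = R                    # R[s, a]     = expected reward
        self.gamma = gamma            # discount factor

def value_iteration(mdp, tol=1e-8):
    """Return Q* for a finite MDP by iterating the Bellman optimality backup."""
    Q = np.zeros((mdp.n_states, mdp.n_actions))
    while True:
        V = Q.max(axis=1)                      # V(s) = max_a Q(s, a)
        Q_new = mdp.R + mdp.gamma * (mdp.T @ V)  # Bellman optimality backup
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new
```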
Optimal Fixed Policy
Given a distribution over tasks, what single policy maximizes the expected performance? (Diagram: one policy applied to tasks M1, M2, M3.)
Previous Work: Action Priors (Rosman & Ramamoorthy 2012)
With Pr(M1) = 0.5 and Pr(M2) = 0.5, the prior assigns to each action the probability of that action being optimal under the task distribution.
Algorithm: Average MDP
Solve the single MDP whose reward is the expectation of the tasks' rewards under the task distribution (here Pr(M1) = 0.5, Pr(M2) = 0.5).
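A minimal sketch of the Average MDP construction, assuming all tasks share S, A, T, and γ and differ only in reward (it reuses the MDP container and value_iteration sketched above; the helper name is mine):

```python
def average_mdp(tasks, probs):
    """Build the Average MDP: same S, A, T, gamma; reward averaged over the task distribution."""
    base = tasks[0]
    R_avg = sum(p * task.R for task, p in zip(tasks, probs))   # E_{M ~ D}[R_M(s, a)]
    return MDP(base.n_states, base.n_actions, base.T, R_avg, base.gamma)

# The fixed policy is then the greedy policy of the averaged task:
# Q_avg = value_iteration(average_mdp(tasks, probs)); pi(s) = argmax_a Q_avg[s, a]
```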
(Theorem) The Average MDP policy is optimal for the task distribution if only the reward function varies across tasks (i.e. S, A, T, γ are fixed) (Ramachandran & Amir 2007).
Results: [performance plots for the fixed-policy setting]
Lifelong Reinforcement Learning
Repeat:
1. The agent samples an MDP from the distribution D: M ← sample(D)
2. Solve it: π ← solve(M)
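A minimal sketch of this loop (the sampling and the solve interface are placeholders standing in for whichever base learner is used; the names are mine, not from the slides):

```python
import random

def lifelong_rl(task_distribution, n_rounds, solve):
    """Repeat: sample a task M from D, then solve it to obtain a policy."""
    policies = []
    for _ in range(n_rounds):
        M = random.choice(task_distribution)   # M <- sample(D); uniform over a finite support here
        pi = solve(M)                          # e.g. Q-learning, R-Max, or Delayed Q-learning
        policies.append(pi)
    return policies
```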
Optimistic Initialization (Kearns & Singh 2002)
Initialize the Q-values optimistically to encourage exploration.
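A minimal sketch of the standard optimistic initialization (the Rmax/(1-γ) upper bound is the usual choice; the slide's exact formula is not preserved in this transcript):

```python
import numpy as np

def optimistic_q_init(n_states, n_actions, r_max, gamma):
    """Initialize every Q(s, a) to Vmax = Rmax / (1 - gamma), an upper bound on any return."""
    v_max = r_max / (1.0 - gamma)
    return np.full((n_states, n_actions), v_max)
```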
PAC-MDP (Strehl et al. 2009; Rao & Whiteson 2012; Mann & Choe 2013)
(Theorem) The sample complexity of PAC-MDP algorithms is bounded if the Q-value initialization is optimistic (an upper bound on the optimal Q-values); the looser that overestimate, the more exploration is needed.
This suggests choosing the initialization as follows. Minimize: the overestimate. Subject to: the initialization remaining optimistic for every task in the distribution. Solution: initialize each Q(s, a) with the maximum optimal Q-value over the tasks (MaxQInit, next slide).
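The formulas on these slides are not preserved in the transcript; a plausible reconstruction of the optimization they describe, under the assumption that all tasks share S and A, is:

```latex
% Choose initial Q-values that are as tight as possible while remaining optimistic
% for every task M in the support of the task distribution D.
\begin{align*}
\min_{Q_0} \quad & \sum_{s,a} Q_0(s,a) \\
\text{s.t.} \quad & Q_0(s,a) \ge Q^{*}_{M}(s,a) \qquad \forall (s,a),\ \forall M \in \operatorname{supp}(D) \\
\Rightarrow \quad & Q_0(s,a) = \max_{M \in \operatorname{supp}(D)} Q^{*}_{M}(s,a).
\end{align*}
```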
Algorithm: MaxQInit
Given previously solved tasks M1, M2, ..., Mm, initialize Q(s, a) to the maximum of their optimal Q-values.
(Theorem) For m sufficiently large, MaxQInit preserves the PAC-MDP property with high probability.
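A minimal sketch of MaxQInit under the same shared-state-action assumption (the optimal Q-values of the m previously sampled tasks are assumed available as arrays; the helper name is mine):

```python
import numpy as np

def max_q_init(solved_q_values):
    """MaxQInit: Q0(s, a) = max over previously solved tasks Mi of Q*_Mi(s, a)."""
    # solved_q_values: list of (n_states, n_actions) arrays, one per solved task
    return np.maximum.reduce(solved_q_values)
```

The base learner on a new task (Delayed Q-learning, R-Max, or Q-learning in the results below) then starts from this initialization instead of the Vmax bound.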
Results: Delayed Q-Learning [learning curves]
Results: R-Max (Brafman & Tennenholtz 2002) [learning curves]
Results: Q-Learning (Watkins 1992)
Tradeoff between jumpstart performance and convergence time.
Conclusions
Average MDP: a single fixed policy computed from the task distribution and reused on every task.
MaxQInit: transfer Q-values from previously solved tasks to initialize learning on new tasks in the lifelong setting, preserving the PAC-MDP property.