ICML2018 Reading Group: Policy and Value Transfer in Lifelong Reinforcement Learning
Yuu David Jinnai
July 28, 2018
Transcript
Policy and Value Transfer in Lifelong Reinforcement Learning David Abel*,
Yuu Jinnai*, George Konidaris, Michael Littman, Yue Guo Brown University
Motivation: Solving Multiple Tasks (Arumugam et al. 2017) (Konidaris et
al. 2017)
Markov Decision Processes: M = (S, A, T, R, γ). S: set of states, A: set of actions, T: transition function, R: reward function, γ: discount factor. Objective: find a policy π(a | s) that maximizes the total discounted reward. [Diagram: agent-environment interaction loop s_t, a_t, r_{t+1}, s_{t+1}, ...]
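As a point of reference for the notation above, here is a minimal tabular value-iteration sketch for solving a single MDP; the array shapes and function name are illustrative assumptions, not the authors' code.

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    """T: (S, A, S) transition probabilities; R: (S, A) expected rewards."""
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') V(s')
        Q = R + gamma * (T @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    # Greedy policy: pi(s) = argmax_a Q(s, a)
    return Q.argmax(axis=1), V_new
```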
Optimal Fixed Policy: given a distribution of tasks, what fixed policy maximizes the expected performance? [Diagram: tasks M_1, M_2, M_3 all mapped to a single policy]
Previous Work: Action Prior (Rosman & Ramamoorthy 2012). Pr(M_1) = 0.5, Pr(M_2) = 0.5. [Figure: probability of each action being the optimal action, 0.5 / 0.5]
Algorithm: Average MDP. Pr(M_1) = 0.5, Pr(M_2) = 0.5. [Figure: action probabilities under the average MDP, 0.0 / 1.0]
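A hedged sketch of the Average MDP idea under the setting stated in the theorem on the next slide (only the reward function varies across tasks): average the task reward functions weighted by their probabilities and plan in the resulting single MDP. It reuses the value_iteration sketch above; all names are illustrative.

```python
def average_mdp_policy(task_rewards, task_probs, T, gamma=0.95):
    """task_rewards: list of (S, A) reward arrays, one per task M_i.
    task_probs: probability of each task under the distribution D.
    T: shared (S, A, S) transition function."""
    # Average the reward functions weighted by task probabilities.
    R_avg = sum(p * R for p, R in zip(task_probs, task_rewards))
    # Plan in the single "average" MDP.
    policy, _ = value_iteration(T, R_avg, gamma)
    return policy
```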
(Theorem) The average MDP yields an optimal fixed policy when only the reward function varies across the task distribution (i.e., S, A, T, γ are fixed) (Ramachandran & Amir 2007).
Results
Optimal Fixed Policy: given a distribution of tasks, what fixed policy maximizes the expected performance? [Diagram: tasks M_1, M_2, M_3 all mapped to a single policy]
Lifelong Reinforcement Learning: repeat: 1. the agent samples an MDP from a distribution, M ← sample(D); 2. solve it, π ← solve(M). [Diagram: sequence of sampled tasks M_1, M_2, M_3, ... drawn from D]
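A minimal sketch of this lifelong loop; sample_task and solve_task are hypothetical placeholders for the task distribution and the per-task learner (e.g. Q-learning or R-Max).

```python
def lifelong_rl(sample_task, solve_task, n_tasks):
    """Repeatedly draw a task M ~ D and solve it, as in the loop above."""
    policies = []
    for _ in range(n_tasks):
        M = sample_task()        # M <- sample(D)
        pi = solve_task(M)       # pi <- solve(M)
        policies.append(pi)
    return policies
```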
Optimistic Initialization (Kearns & Singh 2002): initialize the Q-values optimistically to encourage exploration.
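A minimal sketch of optimistic initialization, assuming a known reward upper bound RMAX: every Q(s, a) starts at the largest possible discounted return, so unexplored actions look attractive.

```python
import numpy as np

def optimistic_q_init(n_states, n_actions, r_max, gamma):
    # Largest possible discounted return: RMAX / (1 - gamma).
    return np.full((n_states, n_actions), r_max / (1.0 - gamma))
```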
PAC-MDP (Strehl et al. '09; Rao, Whiteson '12; Mann, Choe '13)
(Theorem) The sample complexity of PAC-MDP algorithms is bounded in terms of how much the initial Q-values overestimate the optimal values, provided the initialization remains optimistic.
Goal: minimize the overestimate, subject to the initialization staying an upper bound on the optimal Q-values.
Solution: transfer value estimates from previously solved tasks (MaxQInit, next slide).
Algorithm: MaxQInit. [Diagram: previously solved tasks M_1, M_2, ..., M_m] (Theorem) For m sufficiently large, MaxQInit preserves the PAC-MDP property with high probability.
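A hedged sketch of the MaxQInit idea: initialize a new task's Q-values as the elementwise maximum of the optimal Q-values from the m tasks solved so far, which is an optimistic upper bound with high probability once m is large enough. The function name and input format are illustrative assumptions.

```python
import numpy as np

def max_q_init(solved_q_tables):
    """solved_q_tables: list of (S, A) optimal Q-value arrays from past tasks.
    Returns Q_init(s, a) = max_j Q*_Mj(s, a)."""
    return np.max(np.stack(solved_q_tables, axis=0), axis=0)
```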
Results: Delayed Q-Learning
Results: Delayed Q-Learning
Results: R-Max (Brafman&Tennenholtz 2002)
Results: Q-Learning (Watkins 1992). Tradeoff between jumpstart performance and convergence time.
Conclusions: Average MDP and MaxQInit. [Diagram: summary of the two settings: a single fixed policy computed for a distribution of tasks, and per-task policies M_1, M_2, M_3, ... learned in lifelong RL with transferred Q-value initialization]