ICML 2018 Reading Group: Policy and Value Transfer in Lifelong Reinforcement Learning
Yuu David Jinnai
July 28, 2018
Transcript

Policy and Value Transfer in Lifelong Reinforcement Learning
David Abel*, Yuu Jinnai*, George Konidaris, Michael Littman, Yue Guo (Brown University)

Motivation: Solving Multiple Tasks (Arumugam et al. 2017) (Konidaris et al. 2017)

Markov Decision Processes
M = (S, A, T, R, γ)
S: set of states, A: set of actions, T: transition function, R: reward function, γ: discount factor.
Objective: find a policy π(a | s) that maximizes the expected total discounted reward E[Σ_t γ^t r_{t+1}].
[Figure: agent-environment loop s_t, a_t, r_{t+1}, s_{t+1}, ...]
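
To make the tuple concrete, here is a minimal tabular sketch in Python (not from the talk; the sizes, rewards, and transition probabilities below are made up, and value iteration stands in for "find a policy").

import numpy as np

# Minimal tabular MDP M = (S, A, T, R, γ): shapes T[s, a, s'] = Pr(s' | s, a), R[s, a] = expected reward.
# All numbers below are made up for illustration.
n_states, n_actions, gamma = 3, 2, 0.95
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # a random transition kernel
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))              # rewards in [0, 1]

def value_iteration(T, R, gamma, tol=1e-8):
    """Return Q* by iterating the Bellman optimality backup."""
    Q = np.zeros_like(R)
    while True:
        Q_new = R + gamma * T @ Q.max(axis=1)   # R(s,a) + γ Σ_{s'} T(s'|s,a) max_{a'} Q(s',a')
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

Q_star = value_iteration(T, R, gamma)
pi_star = Q_star.argmax(axis=1)   # greedy policy π(s) = argmax_a Q*(s, a)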

Optimal Fixed Policy
Given a distribution of tasks, what single policy maximizes the expected performance?
[Figure: one policy shared across tasks M1, M2, M3]
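
One way to read "expected performance" concretely is E_{M~D}[V^π_M(s0)]. A small illustrative sketch (my own, assuming a deterministic tabular policy given as an array of action indices and a shared start state):

import numpy as np

def policy_value(T, R, pi, gamma):
    """V^π for a tabular MDP with deterministic policy pi (array of action indices)."""
    n = R.shape[0]
    P_pi = T[np.arange(n), pi]   # P_π[s, s'] = T(s' | s, π(s))
    R_pi = R[np.arange(n), pi]   # R_π[s] = R(s, π(s))
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)   # solve (I - γ P_π) V = R_π

def expected_performance(pi, tasks, task_probs, gamma, start_state=0):
    """E_{M~D}[ V^π_M(s0) ]: the quantity a single fixed policy should maximize."""
    return sum(p * policy_value(T, R, pi, gamma)[start_state]
               for p, (T, R) in zip(task_probs, tasks))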

Previous Work: Action Prior (Rosman & Ramamoorthy 2012)
Given Pr(M1) = 0.5 and Pr(M2) = 0.5, estimate for each state the probability of each action being the optimal action (0.5 / 0.5 in the example).
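
A rough reconstruction of the action-prior idea (illustrative only, not the authors' code): count, across the optimal policies of sampled tasks, how often each action is optimal in each state.

import numpy as np

def action_prior(optimal_policies, n_states, n_actions):
    """Pr(a is optimal in s), estimated from the optimal policies of sampled tasks."""
    counts = np.zeros((n_states, n_actions))
    for pi in optimal_policies:              # pi[s] = optimal action of one sampled task
        counts[np.arange(n_states), pi] += 1.0
    return counts / len(optimal_policies)    # each row is a distribution over actions

# Toy example matching the slide: two equally likely tasks, Pr(M1) = Pr(M2) = 0.5.
pi_1 = np.array([0, 1, 1])   # illustrative optimal actions for M1
pi_2 = np.array([0, 0, 1])   # illustrative optimal actions for M2
prior = action_prior([pi_1, pi_2], n_states=3, n_actions=2)   # state 1 gets 0.5 / 0.5

With unequal task probabilities, each count would be weighted by Pr(M_i) instead of 1.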

Algorithm: Average MDP
Solve the single MDP whose reward is the expectation over the task distribution (here Pr(M1) = 0.5, Pr(M2) = 0.5).
(Theorem) The Average MDP policy is an optimal fixed policy if only the reward function varies across tasks, i.e. S, A, T, γ are fixed (Ramachandran & Amir 2007).

Results
[Figure omitted]
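
Returning to the Average MDP construction above: a minimal sketch under the theorem's assumption that the tasks share S, A, T, γ and differ only in R (illustrative, not the authors' implementation).

import numpy as np

def average_mdp_policy(T, rewards, task_probs, gamma, tol=1e-8):
    """Solve the MDP with reward E_{M~D}[R_M]; S, A, T, γ are shared across tasks."""
    R_avg = sum(p * R for p, R in zip(task_probs, rewards))   # averaged reward function
    Q = np.zeros_like(R_avg)
    while True:                                               # plain value iteration on the averaged MDP
        Q_new = R_avg + gamma * T @ Q.max(axis=1)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new.argmax(axis=1)                       # the fixed policy for the whole distribution
        Q = Q_new

# Toy usage: two equally likely reward functions on a shared 3-state, 2-action MDP (numbers made up).
rng = np.random.default_rng(1)
T = rng.dirichlet(np.ones(3), size=(3, 2))
R1, R2 = rng.uniform(size=(3, 2)), rng.uniform(size=(3, 2))
pi_fixed = average_mdp_policy(T, [R1, R2], task_probs=[0.5, 0.5], gamma=0.95)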

Lifelong Reinforcement Learning
Repeat:
1. The agent samples an MDP from a distribution D: M ← sample(D)
2. Solve it: π ← solve(M)
[Figure: task sequence M1, M2, M3, ...]
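
The lifelong setting is just this loop. A skeletal sketch, where sample and solve are hypothetical placeholders for the task distribution D and the per-task learner:

def lifelong_rl(sample, solve, n_tasks):
    """Repeat: 1. M ← sample(D); 2. π ← solve(M)."""
    policies = []
    for _ in range(n_tasks):
        M = sample()           # draw a task from the distribution D
        pi = solve(M)          # run any single-task learner (Q-learning, R-Max, Delayed Q-learning, ...)
        policies.append(pi)    # knowledge from earlier tasks can seed later ones (see MaxQInit below)
    return policies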

Optimistic Initialization (Kearns & Singh 2002)
Initialize the Q-values optimistically to encourage exploration.
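
For rewards in [0, Rmax], a standard optimistic choice is to set every Q-value to the largest possible return Rmax / (1 - γ). A one-line sketch (illustrative):

import numpy as np

def optimistic_q_init(n_states, n_actions, r_max, gamma):
    """Set every Q(s, a) to the maximum possible return so untried actions look promising."""
    return np.full((n_states, n_actions), r_max / (1.0 - gamma))

Q0 = optimistic_q_init(n_states=3, n_actions=2, r_max=1.0, gamma=0.95)   # every entry = 20.0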

PAC-MDP (Strehl et al. '09; Rao & Whiteson '12; Mann & Choe '13)
(Theorem) The sample complexity of these PAC-MDP algorithms can be bounded in terms of the initial Q-values, provided the initialization stays optimistic [bound omitted on the slides].
This gives the design problem: minimize the overestimate of the initial Q-values, subject to keeping them optimistic so the PAC-MDP guarantee is preserved. Solution: MaxQInit, next.

Algorithm: MaxQInit
For a new task, initialize Q to the maximum of the optimal Q-values of the previously solved tasks M1, M2, ..., Mm.
(Theorem) For m sufficiently large, MaxQInit preserves the PAC-MDP property with high probability.
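
A minimal sketch of the MaxQInit idea as stated above (an illustrative reconstruction, not the authors' released code): take the element-wise maximum of the optimal Q-value tables of the tasks solved so far, falling back to plain optimism when no tasks have been solved yet.

import numpy as np

def max_q_init(q_stars, n_states, n_actions, r_max, gamma):
    """Q0(s, a) = max_i Q*_{M_i}(s, a) over the previously solved tasks M_1, ..., M_m.

    With enough sampled tasks this remains optimistic for a new task with high probability,
    while being much tighter than r_max / (1 - γ)."""
    if not q_stars:                                              # nothing solved yet: plain optimism
        return np.full((n_states, n_actions), r_max / (1.0 - gamma))
    return np.maximum.reduce([np.asarray(Q) for Q in q_stars])   # element-wise max over solved tasks

Inside the lifelong loop, the returned table is handed to a PAC-MDP learner (Delayed Q-learning, R-Max, ...) as its initial Q-values.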

Results: Delayed Q-Learning [figures omitted]
Results: R-Max (Brafman & Tennenholtz 2002)
Results: Q-Learning (Watkins 1992)
Tradeoff in jumpstart performance vs. convergence time.

Conclusions
Average MDP: a single fixed policy that is optimal in expectation over the task distribution (tasks M1, M2, ... share one policy).
MaxQInit: in lifelong RL, initialize each new task's Q-values from previously solved tasks, preserving the PAC-MDP property.