ICML 2018 Paper Reading Group: Policy and Value Transfer in Lifelong Reinforcement Learning
Yuu David Jinnai
July 28, 2018
Transcript
Policy and Value Transfer in Lifelong Reinforcement Learning
David Abel*, Yuu Jinnai*, George Konidaris, Michael Littman, Yue Guo (Brown University)
Motivation: Solving Multiple Tasks (Arumugam et al. 2017; Konidaris et al. 2017)
Markov Decision Processes
An MDP is M = (S, A, T, R, γ), where S is the set of states, A the set of actions, T the transition function, R the reward function, and γ the discount factor.
Objective: find a policy π(a | s) that maximizes total discounted reward.
[Figure: agent-environment loop over s_t, a_t, r_{t+1}, s_{t+1}, …]
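To make the definition concrete, here is a minimal sketch of a tabular MDP and value iteration in Python; TabularMDP and value_iteration are illustrative names, not from the talk.

```python
# A minimal tabular MDP and value iteration, assuming small finite S and A.
import numpy as np

class TabularMDP:
    def __init__(self, T, R, gamma):
        self.T = T          # T[s, a, s'] = transition probability
        self.R = R          # R[s, a]     = expected immediate reward
        self.gamma = gamma  # discount factor
        self.n_states, self.n_actions = R.shape

def value_iteration(mdp, tol=1e-8):
    """Return the optimal action values Q*(s, a)."""
    Q = np.zeros((mdp.n_states, mdp.n_actions))
    while True:
        V = Q.max(axis=1)                        # V(s) = max_a Q(s, a)
        Q_new = mdp.R + mdp.gamma * (mdp.T @ V)  # Bellman optimality backup
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new
```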
Optimal Fixed Policy
Given a distribution over tasks, what single policy maximizes expected performance?
[Figure: tasks M_1, M_2, M_3 all mapped to one shared policy]
Previous Work: Action Prior (Rosman & Ramamoorthy 2012)
With Pr(M_1) = 0.5 and Pr(M_2) = 0.5, the prior assigns each action the probability of its being optimal under the task distribution.
[Figure: two tasks, each action optimal with probability 0.5]
Algorithm: Average MDP
Combine the tasks into a single MDP by averaging the reward functions under the distribution (here Pr(M_1) = 0.5, Pr(M_2) = 0.5) and solve it.
[Figure: the average MDP; the resulting policy is deterministic (action probabilities 0.0 and 1.0)]
(Theorem) The average MDP's policy is optimal if only the reward function varies across the distribution, i.e., S, A, T, and γ are fixed (Ramachandran & Amir 2007).
Results: [figure]
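A sketch of the average MDP construction under the theorem's assumption (shared S, A, T, γ; only R varies), reusing the hypothetical TabularMDP and value_iteration from the sketch above:

```python
# Average MDP: average the reward functions under D, then solve once.
def average_mdp_policy(mdps, probs):
    R_avg = sum(p * m.R for p, m in zip(probs, mdps))
    avg = TabularMDP(mdps[0].T, R_avg, mdps[0].gamma)
    Q = value_iteration(avg)
    return Q.argmax(axis=1)  # one fixed policy used on every task in D
```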
Lifelong Reinforcement Learning
Repeat:
1. The agent samples an MDP from a distribution: M ← sample(D)
2. Solve it: π ← solve(M)
[Figure: sequence of tasks M_1, M_2, M_3, … drawn from D]
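The protocol above in a few lines of Python; sample_task stands in for the distribution D, and solve can be any RL algorithm:

```python
# Lifelong RL loop: repeatedly draw a task from D and solve it.
def lifelong_rl(sample_task, solve, n_tasks):
    policies = []
    for _ in range(n_tasks):
        M = sample_task()          # M ~ D
        policies.append(solve(M))  # pi <- solve(M)
    return policies
```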
Optimistic Initialization (Kearns & Singh 2002)
Initialize the Q-values optimistically, e.g. Q_0(s, a) ← R_max / (1 − γ), to encourage exploration.
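A minimal sketch, assuming the standard choice V_max = R_max / (1 − γ) as the optimistic initial value:

```python
# Optimistic initialization: start every Q-value at the largest possible
# return V_max = R_max / (1 - gamma), so untried actions look best and
# the agent is pushed to explore them.
import numpy as np

def optimistic_q_table(n_states, n_actions, r_max, gamma):
    v_max = r_max / (1.0 - gamma)
    return np.full((n_states, n_actions), v_max)
```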
PAC-MDP (Strehl et al. '09; Rao & Whiteson '12; Mann & Choe '13)
(Theorem) The sample complexity of a PAC-MDP algorithm is preserved if the initial Q-values stay optimistic: Q_0(s, a) ≥ Q*(s, a) for all s, a.
For the fastest learning, minimize the overestimate Q_0(s, a) − Q*(s, a), subject to Q_0 remaining optimistic for every task in the distribution.
Solution: initialize with the tightest value that is still optimistic across tasks.
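The equations on these three slides did not survive the transcript; reconstructed from the surrounding argument, the initialization problem is roughly:

```latex
% Reconstructed optimization behind the slides: stay optimistic (to keep
% the PAC-MDP guarantee) while making the overestimate as small as possible.
\begin{aligned}
\min_{Q_0}\; & \sum_{s,a} \bigl(Q_0(s,a) - Q^*_M(s,a)\bigr)\\
\text{s.t.}\; & Q_0(s,a) \ge Q^*_M(s,a) \quad \forall s \in S,\ a \in A,\ M \in \mathrm{supp}(D)\\
\Rightarrow\; & Q_0(s,a) = \max_{M \in \mathrm{supp}(D)} Q^*_M(s,a)
\end{aligned}
```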
Algorithm: MaxQInit
After solving tasks M_1, M_2, …, M_m sampled from D, initialize the next task with Q_0(s, a) ← max_{i ≤ m} Q*_{M_i}(s, a).
(Theorem) For m sufficiently large, MaxQInit preserves the PAC-MDP property with high probability.
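A sketch of the MaxQInit rule itself, assuming the solved tasks' optimal Q-values are available as equally-shaped arrays (e.g. from the value_iteration sketch above):

```python
# MaxQInit: elementwise max over the solved tasks' Q* tables. With m
# large enough, this is optimistic for a fresh task from D with high
# probability (the paper's theorem), so PAC-MDP guarantees carry over.
import numpy as np

def max_q_init(solved_q_tables):
    """solved_q_tables: list of Q* arrays, one per solved task."""
    return np.maximum.reduce(solved_q_tables)

# e.g.: q0 = max_q_init([value_iteration(m) for m in sampled_mdps])
```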
Results: Delayed Q-Learning [figures]
Results: R-Max (Brafman & Tennenholtz 2002) [figure]
Results: Q-Learning (Watkins 1992) [figure]
Tradeoff: jumpstart performance vs. convergence time.
Conclusions
Average MDP: compute one fixed policy for the whole task distribution (Task M_1, M_2, M_3 → a single policy).
MaxQInit: learn each task separately, transferring value-function initializations between tasks (Task M_1 → Policy 1, Task M_2 → Policy 2, …).