
ICLR2018 Yomikai: Deep Reinforcement Learning

tos
May 26, 2018


Transcript

  1. Notation
     • π: policy
     • V(s) ~ E[R | s0 = s]: state value
     • Q(s, a) ~ E[R | s0 = s, a0 = a]: action value
     • η(π) (or J(π)) = Eπ[R]
     • s (or x): state
     • a: action
     • r: reward
     • γ: discount factor
     • R = ∑t γ^t r_t: return
  2. Q-learning, policy gradient method
     • V(s) = E_{a~π(·|s)}[Q(s, a)]
     • Q(s, a) = E_environment[r + γ V(s')]
     • policy gradient:
       ∇η(π) = ∇( E_{(s0, a0, ...) ~ trajectory(π)}[R] )
             = E_{(s0, a0, ...) ~ trajectory(π)}[ R ∇log( ∏t π(at|st) ) ]
             = E_{(s0, a0, ...) ~ trajectory(π)}[ R ∑t ∇log π(at|st) ]
     • log-derivative trick (see the sketch below):
       ∇θ( E_{x~p(·; θ)}[f(x)] ) = ∇θ ∫ f(x) p(x; θ) dx
                                 = ∫ f(x) (∇θ p(x; θ)) dx
                                 = E_{x~p(·; θ)}[ f(x) ∇θ log p(x; θ) ]
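To make the log-derivative trick concrete, here is a minimal numerical sketch (my own toy example, not from the slides): it estimates ∇θ E_{x~N(θ,1)}[f(x)] with the score-function estimator and compares it with the analytic answer; the choice f(x) = x² and the Gaussian family are assumptions for illustration.

import numpy as np

# Score-function (log-derivative trick) estimate of d/dtheta E_{x ~ N(theta, 1)}[f(x)],
# using f(x) = x**2, for which the analytic answer is d/dtheta (theta**2 + 1) = 2*theta.
rng = np.random.default_rng(0)
theta, sigma = 1.5, 1.0
x = rng.normal(theta, sigma, size=200_000)      # samples from p(x; theta)
score = (x - theta) / sigma**2                  # d/dtheta log N(x; theta, sigma^2)
grad_estimate = np.mean((x**2) * score)         # E[f(x) * score]
print(grad_estimate, 2 * theta)                 # both close to 3.0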
  3. Baseline for policy gradient
     • ∇η(π) = E_{(s0, a0, ...) ~ trajectory(π)}[ ∑t (∑_{t'≥t} r_{t'}) ∇log π(at|st) ]
     • REINFORCE (REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility)
     • i.e. update θ ← θ + α (R − b) ∇θ log π, with learning rate α > 0
     • b: baseline
  4. Baseline for policy gradient
     • ∇η(π) = E_{(s0, a0, ...) ~ trajectory(π)}[ ∑t ((∑_{t'≥t} r_{t'}) − b(st)) ∇log π(at|st) ]
     • still an unbiased estimate, but with lower variance:
       E_{x~p(·; θ)}[ (f(x) − b) ∇θ log p(x; θ) ] = ∇θ( E_{x~p(·; θ)}[f(x) − b] )
                                                  = ∇θ( E_{x~p(·; θ)}[f(x)] − b )
                                                  = ∇θ( E_{x~p(·; θ)}[f(x)] )
     • advantage actor-critic (see the sketch below):
       E_{(s0, a0, ...) ~ trajectory(π)}[ ∑t (Q(st, at) − V(st)) ∇log π(at|st) ]
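A minimal sketch (my own one-step toy setup, not from the slides) showing that subtracting a baseline leaves the REINFORCE gradient estimate unbiased while shrinking its variance; a Gaussian "policy" over a scalar action with reward f(a) = −(a − 2)² and a mean-reward baseline stand in for the trajectory case.

import numpy as np

# One-step REINFORCE on a scalar Gaussian policy a ~ N(theta, 1) with reward f(a) = -(a - 2)^2.
# Each sample of the gradient is (f(a) - b) * d/dtheta log pi(a), with or without a baseline b.
rng = np.random.default_rng(0)
theta, sigma, n = 0.0, 1.0, 100_000
a = rng.normal(theta, sigma, size=n)
reward = -(a - 2.0) ** 2
score = (a - theta) / sigma**2                  # d/dtheta log N(a; theta, sigma^2)

g_no_baseline = reward * score                  # plain REINFORCE samples
b = reward.mean()                               # simple baseline (a stand-in for V(s))
g_baseline = (reward - b) * score               # baseline-corrected samples

print(g_no_baseline.mean(), g_baseline.mean())  # both ~ 4.0 (unbiased): d/dtheta of -((theta-2)^2 + 1)
print(g_no_baseline.var(), g_baseline.var())    # variance is clearly smaller with the baseline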
  5. Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines (1803.07246)
     • assume a factorized policy: π(a|s) = π1(a1|s) … πd(ad|s)
     • "very common" for action spaces {a = (a1, …, ad)}, e.g. Gaussian with diagonal covariance
     • baseline b(s, a) = b1 + … + bd
     • use all the other dimensions' values a−i = (a1, …, ai−1, ai+1, …, ad) in the baseline for the i-th dimension, because the estimate stays unbiased: (equation on slide)
  6. Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines (1803.07246)
     • [Appendix A] the optimal action-independent baseline b*(s)
       • is a weighted mean of Q(s, a), but not V(s) = E_{a~π}[Q(s, a)]
       • depends on the parameterization of the policy
  7. Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines (1803.07246)
     • b(s) = V(s) = E_{a~π}[Q(s, a)] "is often used"
     • bi(s, a−i) = E_{ai~πi}[Q(s, a)]
     • Monte Carlo marginalized / mean marginalized versions of this expectation (a sketch of both follows below)
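A minimal sketch (with a made-up quadratic critic q_hat and a diagonal-Gaussian policy, both my assumptions) of the two ways the slide names for computing the per-dimension baseline bi(s, a−i) = E_{ai~πi}[Q(s, a)]: Monte Carlo marginalization over ai, and plugging in the mean of πi.

import numpy as np

rng = np.random.default_rng(0)

def q_hat(s, a):
    # toy critic; any differentiable function of (s, a) would do here
    return -np.sum((a - s) ** 2, axis=-1)

mu = np.array([0.5, -1.0, 2.0])               # diagonal Gaussian policy pi(a|s): mean ...
sigma = np.array([0.3, 0.5, 0.2])             # ... and per-dimension std
s = np.array([0.0, 0.0, 1.0])                 # a fixed state
a = rng.normal(mu, sigma)                     # the sampled action
i = 1                                         # dimension whose baseline we want

# Monte Carlo marginalized: resample only a_i from pi_i, keep a_{-i} fixed, average Q.
a_resampled = np.tile(a, (1000, 1))
a_resampled[:, i] = rng.normal(mu[i], sigma[i], size=1000)
b_mc = q_hat(s, a_resampled).mean()

# Mean marginalized: plug the mean of pi_i into dimension i instead of integrating.
a_mean = a.copy()
a_mean[i] = mu[i]
b_mean = q_hat(s, a_mean)

print(b_mc, b_mean)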
  8. Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines (1803.07246)
     • environments: MuJoCo; door opening (multi-fingered hand); blind peg-insertion (PO); multi-agent communication
     • Random Fourier Feature baseline for a fair comparison
  9. Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines (1803.07246)
     • [Appendix E] for general actions
     • given a DAG model of the dependency among dimensions
     • bi can depend on aj ⇔ not j → i
  10. Q-Prop (1611.02247, ICLR'17)
     • assume the policy has an (analytic) mean: μ(s) = E_{a~π(·|s)}[a]
     • baseline affine in the action a
     • use a first-order Taylor expansion of an (off-policy) critic (sketch below)
     • to reduce the variance of the on-policy estimate
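A rough sketch of the first-order Taylor control variate that, as I understand Q-Prop, is built from an off-policy critic Q_w around the policy mean μ(s). The critic, state, policy mean, and advantage value below are placeholders of my own, and this shows only the corrected advantage term, not the full Q-Prop estimator.

import numpy as np

def critic(s, a):
    # placeholder off-policy critic Q_w(s, a); assumed differentiable in a
    return -np.sum((a - 0.5 * s) ** 2)

def critic_grad_a(s, a, eps=1e-5):
    # numerical gradient of Q_w w.r.t. the action (an analytic gradient would be used in practice)
    g = np.zeros_like(a)
    for i in range(a.size):
        da = np.zeros_like(a)
        da[i] = eps
        g[i] = (critic(s, a + da) - critic(s, a - da)) / (2 * eps)
    return g

s = np.array([1.0, -2.0])
mu = 0.3 * s                                              # placeholder policy mean mu(s)
a = mu + 0.1 * np.random.default_rng(0).normal(size=2)    # sampled action
mc_advantage = 1.7                                        # stand-in for the Monte Carlo advantage estimate

# First-order Taylor expansion of the critic around a = mu(s), used as a control variate:
taylor_cv = critic_grad_a(s, mu) @ (a - mu)
corrected = mc_advantage - taylor_cv   # this residual multiplies grad log pi in the corrected estimator
print(corrected)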
  11. Action-dependent Control Variates for Policy Optimization via Stein Identity (1710.11198)
     • assume a re-parameterizable policy
     • baseline φ(s, a): arbitrary dependency on the action a
       (cf. ∇a,a φ(s, a) = 0 in Q-Prop)
     • Gaussian policy; θ1: mean parameters, θℓ: variance parameters
  12. Action-dependent Control Variates for Policy Optimization via Stein Identity (1710.11198)
     • I didn't know the Stein identity, but I regard eq. (6) as a direct consequence of the re-parameterization trick (I agree with AnonReviewer2); see the sketch below
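Since the slide frames eq. (6) as the re-parameterization trick, here is a minimal numerical check of that trick on a toy Gaussian (my own example, not the paper's setting): the gradient of E_{a~N(μ,σ²)}[f(a)] with respect to μ is estimated by differentiating through a = μ + σε.

import numpy as np

# Re-parameterization trick: a = mu + sigma * eps with eps ~ N(0, 1), so
# d/dmu E[f(a)] = E[f'(mu + sigma * eps)].  Toy choice: f(a) = a**2, analytic gradient 2*mu.
rng = np.random.default_rng(0)
mu, sigma = 0.7, 0.5
eps = rng.normal(size=200_000)
a = mu + sigma * eps
grad_estimate = np.mean(2.0 * a)     # f'(a) = 2a, pushed through the re-parameterization
print(grad_estimate, 2 * mu)         # both ~ 1.4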
  13. Guide Actor-Critic for Continuous Control (1705.07606)
     • second-order update in action space
     • method:
       • learn a guide actor (for each state) using second-order information
       • learn the parameterized actor πθ based on the guide actor
     • assume a Gaussian policy; the authors recommend a state-independent covariance
  14. Guide Actor-Critic for Continuous Control (1705.07606)
     • compute the guide actor: solve (the dual of) the objective on the slide
     • use the guide actor: a first-order method (e.g. Adam) to minimize the MSE
  15. TRPO (ICML'15, 1502.05477)
     • trust region: a constraint on KL(πθold || πθ)
     • §C Efficiently Solving the Trust-Region Constrained Optimization Problem (sketch below)
     • cf. Conservative Policy Iteration: Approximately Optimal Approximate Reinforcement Learning (ICML'02)
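A compact sketch of the kind of trust-region step §C describes: solve Hx ≈ g with conjugate gradients (using only Fisher-vector products) and rescale the step to the KL radius δ. The quadratic H and gradient g below are synthetic stand-ins, and a real implementation also adds a backtracking line search on the surrogate objective.

import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    # solve H x = g using only Hessian(Fisher)-vector products
    x, r = np.zeros_like(g), g.copy()
    p, rs_old = r.copy(), g @ g
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# synthetic positive-definite "Fisher" matrix and policy gradient
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
H = A @ A.T + 5 * np.eye(5)
g = rng.normal(size=5)
delta = 0.01                                        # KL trust-region radius

x = conjugate_gradient(lambda v: H @ v, g)          # x ~ H^{-1} g (natural gradient direction)
step = np.sqrt(2 * delta / (x @ g)) * x             # scale so that 0.5 * step^T H step ~ delta
print(0.5 * step @ H @ step)                        # ~ delta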
  16. Trust-PCL: An Off-Policy Trust Region Method for Continuous Control (1707.01891)
     • Path Consistency Learning (NIPS'17, 1702.08892) + a "surrogate" trust region method
     • PCL: consistency with an entropy bonus; 1-step consistency (sketch below)
     • "surrogate" trust region: (equations on slide)
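A small sketch of what I understand the 1-step path-consistency residual to be under entropy regularization; the value estimates, reward, policy probability, and coefficients below are placeholder numbers. PCL trains both the value function and the policy to drive this residual toward zero on on- and off-policy transitions.

import numpy as np

def one_step_consistency(v_s, v_next, r, log_pi, gamma=0.99, tau=0.1):
    # soft consistency: V(s) should equal r - tau * log pi(a|s) + gamma * V(s')
    # for the entropy-regularized optimal policy; PCL minimizes the squared residual
    return -v_s + r - tau * log_pi + gamma * v_next

# placeholder transition (s, a, r, s') with current value and policy estimates
residual = one_step_consistency(v_s=1.2, v_next=1.0, r=0.3, log_pi=np.log(0.25))
loss = 0.5 * residual ** 2          # both V and pi are updated to reduce this
print(residual, loss)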
  17. Trust-PCL: An Off-Policy Trust Region Method for Continuous Control (1707.01891)
     • "surrogate" trust region: (equations on slide)
     • automatically tune λ to hit a target divergence
     • KL(π*||π̃) is expressed analytically, with an expectation over a trajectory of π̃
     • the last 100 episodes are used for the approximation
  18. PPO Algorithms (1707.06347)
     • the "surrogate" trust region had already been mentioned in Deep Reinforcement Learning Through Policy Optimization [NIPS'16 Tutorial]
     • clipped surrogate objective (sketch below)
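A minimal sketch of the clipped surrogate objective; the batch of log-probabilities and advantages below is made up for illustration. The probability ratio is clipped to [1 − ε, 1 + ε] and the pessimistic minimum of the clipped and unclipped terms is taken.

import numpy as np

def ppo_clip_objective(log_pi_new, log_pi_old, advantage, clip_eps=0.2):
    # L^CLIP = E[ min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t) ],  r_t = pi_new / pi_old
    ratio = np.exp(log_pi_new - log_pi_old)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))   # maximized w.r.t. the new policy

# placeholder batch of action log-probabilities and advantage estimates
log_new = np.log(np.array([0.30, 0.10, 0.60]))
log_old = np.log(np.array([0.25, 0.20, 0.55]))
adv = np.array([1.0, -0.5, 2.0])
print(ppo_clip_objective(log_new, log_old, adv))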

  19. Parameter Space Noise for Exploration (1706.01905)
     • "To achieve structured exploration, we sample from a set of policies by applying additive Gaussian noise to the parameter vector of the current policy"
     • sample once per episode
     • adaptive noise scaling (sketch below)
     • re-parameterization trick for the on-policy case
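A minimal sketch (toy linear policy; the threshold, scaling factor, and batch of states are placeholders of my own) of additive Gaussian parameter noise with the adaptive-scaling idea: grow the noise scale when the perturbed policy's actions stay close to the unperturbed ones, shrink it otherwise.

import numpy as np

rng = np.random.default_rng(0)

def act(theta, s):
    return theta @ s                      # toy deterministic linear policy

theta = rng.normal(size=(2, 4))           # current policy parameters
sigma, delta, alpha = 0.1, 0.05, 1.01     # noise scale, target action distance, scaling factor

for episode in range(5):
    theta_tilde = theta + sigma * rng.normal(size=theta.shape)   # perturb once per episode
    states = rng.normal(size=(32, 4))                            # placeholder batch of states
    # distance between perturbed and unperturbed actions on the same states
    d = np.sqrt(np.mean((act(theta, states.T) - act(theta_tilde, states.T)) ** 2))
    sigma = sigma * alpha if d < delta else sigma / alpha        # adaptive noise scaling
    print(episode, d, sigma)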
  20. Parameter Space Noise for Exploration (1706.01905)
     • state-dependent exploration: "This ensures consistency in actions"
     • experiments: DQN/ALE, DDPG/MuJoCo, TRPO/MuJoCo
       • to test sample efficiency: a toy discrete env and continuous-control envs with sparse rewards
  21. Parameter Space Noise for Exploration (1706.01905)
     • separate policy head: π(·|s) trained to predict argmax_a Q(s, a)
     • perturb π
  22. Noisy Networks for Exploration (1706.10295)
     • noisy linear layer: learn μ and σ
     • sample noise per step if off-policy
     • factorized Gaussian noise (sketch below)
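A minimal sketch of a noisy linear layer with factorized Gaussian noise as I recall it: the weight noise is the outer product of per-input and per-output noise vectors passed through f(x) = sign(x)·√|x|. The layer sizes and initialization constants here are placeholders; in a real layer μ and σ are trained by backpropagation.

import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # squashing function used for factorized noise
    return np.sign(x) * np.sqrt(np.abs(x))

n_in, n_out = 4, 3
mu_w = rng.uniform(-1 / np.sqrt(n_in), 1 / np.sqrt(n_in), size=(n_out, n_in))
mu_b = rng.uniform(-1 / np.sqrt(n_in), 1 / np.sqrt(n_in), size=n_out)
sigma_w = np.full((n_out, n_in), 0.5 / np.sqrt(n_in))   # learnable in the real layer
sigma_b = np.full(n_out, 0.5 / np.sqrt(n_in))

def noisy_linear(x):
    # factorized Gaussian noise: one noise vector per input unit, one per output unit
    eps_in, eps_out = rng.normal(size=n_in), rng.normal(size=n_out)
    w = mu_w + sigma_w * np.outer(f(eps_out), f(eps_in))
    b = mu_b + sigma_b * f(eps_out)
    return w @ x + b

print(noisy_linear(np.ones(n_in)))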
  23. Unifying Count-Based Exploration and Intrinsic Motivation (NIPS'16, 1606.01868)
     • for a state x, define a pseudo-count N̂n(x) and a pseudo-count total n̂
       from a density model ρ(x), where ρ'(x) is the density after observing a new x (sketch below)
     • experiments: ALE; improved a lot on Montezuma's Revenge
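A small sketch of the pseudo-count recovered from a density model, assuming the relation between ρ(x) (before the observation) and ρ'(x) (after observing x one more time); the two density values below are placeholder numbers.

def pseudo_count(rho, rho_prime):
    # Solve  rho = N / n  and  rho' = (N + 1) / (n + 1)  for the pseudo-count N:
    #   N_hat = rho * (1 - rho') / (rho' - rho)
    return rho * (1.0 - rho_prime) / (rho_prime - rho)

# placeholder densities of a state before and after it is observed once more
rho, rho_prime = 0.010, 0.012
n_hat = pseudo_count(rho, rho_prime)
print(n_hat)          # pseudo-count; an exploration bonus decreasing in n_hat is then added to the reward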
  24. DORA The Explorer: Directed Outreaching Reinforcement Action-Selection (1804.04012)
     • initialize E(s, a) = 1, then update (sketch below)
     • "The logarithm of E-Values can be thought of as a generalization of visit counters"
     • choose argmax( (policy using Q) / counter )
     • experiments: grid bridge, Freeway (Atari)
     ※ the title is a nod to the TV show "Dora the Explorer"
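A rough sketch of E-values as I read the quoted claim: they are learned like action values with zero reward and initialization 1, so repeated visits shrink them and their logarithm behaves like a (generalized) visit counter. The exact update form, the learning rate, and γ_E are my assumptions, not taken from the slide.

import numpy as np

n_states, n_actions = 5, 2
alpha, gamma_e = 0.1, 0.0                 # gamma_e = 0 makes the counter interpretation exact
E = np.ones((n_states, n_actions))        # initialize E(s, a) = 1

def update_e(s, a, s_next, a_next):
    # SARSA-like update with reward 0 (an assumption about the exact form);
    # gamma_e > 0 would also propagate "unexploredness" from successor states
    E[s, a] += alpha * (0.0 + gamma_e * E[s_next, a_next] - E[s, a])

for _ in range(10):                       # visit (s=0, a=0) ten times
    update_e(0, 0, 1, 0)

# log E (base 1 - alpha) recovers the visit count here, matching the
# "generalization of visit counters" reading of the E-values
counter = np.log(E[0, 0]) / np.log(1.0 - alpha)
print(E[0, 0], counter)                   # E ~ 0.9**10, counter ~ 10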
  25. MAML (1703.03400, ICML'17)
     • Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
     • average the second updates over sampled tasks
     • (equations on slide: adaptation update, meta update)
  26. MAML (1703.03400, ICML'17)
     • Algorithm 3: MAML for Reinforcement Learning (§3.2)
     • loss for an RL task: 1. sample trajectories, 2. compute the policy gradient
     • (equations on slide: adaptation update, meta update; sketch below)
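A minimal sketch of the two-level MAML structure (adaptation update inside, meta update outside) on a toy quadratic loss whose gradients are analytic; the tasks, learning rates, and loss are my own stand-ins for the RL version on the slide.

import numpy as np

# Toy MAML: per-task loss L_i(theta) = 0.5 * ||theta - c_i||^2, so grad L_i(theta) = theta - c_i.
alpha, beta = 0.1, 0.1                    # adaptation / meta learning rates
theta = np.zeros(2)                       # meta-parameters
tasks = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]   # task optima c_i

for _ in range(500):
    meta_grad = np.zeros_like(theta)
    for c in tasks:
        theta_i = theta - alpha * (theta - c)          # adaptation update (one inner gradient step)
        # meta-gradient of L_i(theta_i) w.r.t. theta; the (1 - alpha) factor is the
        # second-order term from differentiating through the adaptation update
        meta_grad += (1.0 - alpha) * (theta_i - c)
    theta -= beta * meta_grad / len(tasks)             # meta update, averaged over sampled tasks

print(theta)    # ends up near the average of the task optima for this symmetric toy problem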
  27. Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments (1710.03641)
     • "A nonstationary environment can be seen as a sequence of stationary tasks, and hence we propose to tackle it as a multi-task learning problem"
     • method: meta-learning at training time, adaptation at execution time
  28. meta-learning at training time (Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments, 1710.03641)
     • Markov chain of tasks
     • meta-loss on a pair of consecutive tasks: train with (K trajectories of) Ti, test with Ti+1
     • start from θ (not φi) "due to stability considerations"
  29. adaptation at execution time (Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments, 1710.03641)
     • importance-weight correction
     • trained: adaptation update from θ
     • available: a trajectory of Ti−1 collected using φi−1
  30. experiments (Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments, 1710.03641)
     • dynamic env: the torques of 2 legs decay (panel b)
     • competitive env: RoboSumo (panel c)
  31. Learning an Embedding Space for Transferable Robot Skills
     • §7 Experimental Results: "Our experiments aim to answer the following questions:"
       1. Can our method learn versatile skills?
       2. Can it determine how many distinct skills are necessary to accomplish a set of tasks?
       3. Can we use the learned skill embedding for control in an unseen scenario?
       4. Is it important for the skills to be versatile to use their embedding for control?
       5. Is it more efficient to use the learned embedding rather than to learn to solve a task from scratch?
  32. Learning an Embedding Space for Transferable Robot Skills
     • manipulation tasks: learn two skills (columns 1, 2) and use them (column 3)
     • Spring-wall, L-wall, Rail-push
  33. Summary
     • I grouped the papers into rough themes for this presentation:
       baseline, 2nd-order gradient, exploration, meta-learning
     • papers fairly often span more than one theme
     • even when the motivations are similar, the proposals can differ
     • on arXiv, competing papers appear within a month of each other