
ICLR2018 Yomikai: Deep Reinforcement Learning

tos
May 26, 2018


Transcript

  1. Notation
     • π: policy
     • V(s) ~ E[R | s0 = s]: state value
     • Q(s, a) ~ E[R | s0 = s, a0 = a]: action value
     • η(π) (or J(π)) = Eπ[R]
     • s (or x): state
     • a: action
     • r: reward
     • γ: discount factor
     • R = ∑t γ^t r_t: return
  2. Q-learning, policy gradient method
     • V(s) = E_{a~π(·|s)}[Q(s, a)]
     • Q(s, a) = E_environment[r + γ V(s')]
     • policy gradient:
       ∇η(π) = ∇( E_{(s0, a0, ...) ~ trajectory(π)}[R] )
             = E_{(s0, a0, ...) ~ trajectory(π)}[ R ∇log( ∏t π(at|st) ) ]
             = E_{(s0, a0, ...) ~ trajectory(π)}[ R ∑t ∇log π(at|st) ]
     • log-derivative trick (see the sketch below):
       ∇θ( E_{x~p(·; θ)}[f(x)] ) = ∇θ ∫ f(x) p(x; θ) dx
                                 = ∫ f(x) (∇θ p(x; θ)) dx
                                 = E_{x~p(·; θ)}[ f(x) ∇θ log p(x; θ) ]
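To make the log-derivative trick concrete, here is a minimal numerical sketch (my own toy example, not from the slides): it estimates ∇θ E_{x~N(θ,1)}[f(x)] with the score-function estimator and compares it with the analytic answer; the choice f(x) = x² and the Gaussian family are assumptions for illustration.

import numpy as np

# Score-function (log-derivative trick) estimate of d/dtheta E_{x ~ N(theta, 1)}[f(x)],
# using f(x) = x**2, for which the analytic answer is d/dtheta (theta**2 + 1) = 2*theta.
rng = np.random.default_rng(0)
theta, sigma = 1.5, 1.0
x = rng.normal(theta, sigma, size=200_000)      # samples from p(x; theta)
score = (x - theta) / sigma**2                  # d/dtheta log N(x; theta, sigma^2)
grad_estimate = np.mean((x**2) * score)         # E[f(x) * score]
print(grad_estimate, 2 * theta)                 # both close to 3.0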
  3. Baseline for policy gradient
     • ∇η(π) = E_{(s0, a0, ...) ~ trajectory(π)}[ ∑t (∑_{t'≥t} r_{t'}) ∇log π(at|st) ]
     • REINFORCE (REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility)
     • i.e. update θ ← θ + α (R − b) ∇θ log π, with learning rate α > 0
     • b: baseline
  4. Baseline for policy gradient
     • ∇η(π) = E_{(s0, a0, ...) ~ trajectory(π)}[ ∑t ((∑_{t'≥t} r_{t'}) − b(st)) ∇log π(at|st) ]
     • still an unbiased estimate, but with lower variance:
       E_{x~p(·; θ)}[ (f(x) − b) ∇θ log p(x; θ) ] = ∇θ( E_{x~p(·; θ)}[f(x) − b] )
                                                  = ∇θ( E_{x~p(·; θ)}[f(x)] − b )
                                                  = ∇θ( E_{x~p(·; θ)}[f(x)] )
     • advantage actor-critic (see the sketch below):
       E_{(s0, a0, ...) ~ trajectory(π)}[ ∑t (Q(st, at) − V(st)) ∇log π(at|st) ]
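A minimal sketch (my own one-step toy setup, not from the slides) showing that subtracting a baseline leaves the REINFORCE gradient estimate unbiased while shrinking its variance; a Gaussian "policy" over a scalar action with reward f(a) = −(a − 2)² and a mean-reward baseline stand in for the trajectory case.

import numpy as np

# One-step REINFORCE on a scalar Gaussian policy a ~ N(theta, 1) with reward f(a) = -(a - 2)^2.
# Each sample of the gradient is (f(a) - b) * d/dtheta log pi(a), with or without a baseline b.
rng = np.random.default_rng(0)
theta, sigma, n = 0.0, 1.0, 100_000
a = rng.normal(theta, sigma, size=n)
reward = -(a - 2.0) ** 2
score = (a - theta) / sigma**2                  # d/dtheta log N(a; theta, sigma^2)

g_no_baseline = reward * score                  # plain REINFORCE samples
b = reward.mean()                               # simple baseline (a stand-in for V(s))
g_baseline = (reward - b) * score               # baseline-corrected samples

print(g_no_baseline.mean(), g_baseline.mean())  # both ~ 4.0 (unbiased): d/dtheta of -((theta-2)^2 + 1)
print(g_no_baseline.var(), g_baseline.var())    # variance is clearly smaller with the baseline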
  5. Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines (1803.07246)
     • assume a factorized policy: π(a|s) = π1(a1|s) … πd(ad|s)
     • "very common" for action spaces {a = (a1, …, ad)}, e.g. Gaussian with diagonal covariance
     • baseline b(s, a) = b1 + … + bd
     • use all the other dimensions' values a−i = (a1, …, ai−1, ai+1, …, ad) in the baseline for the i-th dimension, because the estimate stays unbiased: (equation on slide)
  6. Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines (1803.07246)
     • [Appendix A] the optimal action-independent baseline b*(s)
       • is a weighted mean of Q(s, a), but not V(s) = E_{a~π}[Q(s, a)]
       • depends on the parameterization of the policy
  7. Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines (1803.07246)
     • b(s) = V(s) = E_{a~π}[Q(s, a)] "is often used"
     • bi(s, a−i) = E_{ai~πi}[Q(s, a)]
     • Monte Carlo marginalized / mean marginalized versions of this expectation (a sketch of both follows below)
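A minimal sketch (with a made-up quadratic critic q_hat and a diagonal-Gaussian policy, both my assumptions) of the two ways the slide names for computing the per-dimension baseline bi(s, a−i) = E_{ai~πi}[Q(s, a)]: Monte Carlo marginalization over ai, and plugging in the mean of πi.

import numpy as np

rng = np.random.default_rng(0)

def q_hat(s, a):
    # toy critic; any differentiable function of (s, a) would do here
    return -np.sum((a - s) ** 2, axis=-1)

mu = np.array([0.5, -1.0, 2.0])               # diagonal Gaussian policy pi(a|s): mean ...
sigma = np.array([0.3, 0.5, 0.2])             # ... and per-dimension std
s = np.array([0.0, 0.0, 1.0])                 # a fixed state
a = rng.normal(mu, sigma)                     # the sampled action
i = 1                                         # dimension whose baseline we want

# Monte Carlo marginalized: resample only a_i from pi_i, keep a_{-i} fixed, average Q.
a_resampled = np.tile(a, (1000, 1))
a_resampled[:, i] = rng.normal(mu[i], sigma[i], size=1000)
b_mc = q_hat(s, a_resampled).mean()

# Mean marginalized: plug the mean of pi_i into dimension i instead of integrating.
a_mean = a.copy()
a_mean[i] = mu[i]
b_mean = q_hat(s, a_mean)

print(b_mc, b_mean)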
  8. Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines (1803.07246)
     • environments: MuJoCo; door opening (multi-fingered hand); blind peg-insertion (PO); multi-agent communication
     • Random Fourier Feature baseline for a fair comparison
  9. Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines (1803.07246)
     • [Appendix E] for general actions
     • given a DAG model of the dependency among dimensions
     • bi can depend on aj ⇔ not j → i
  10. Q-Prop (1611.02247, ICLR'17)
     • assume the policy has an (analytic) mean: μ(s) = E_{a~π(·|s)}[a]
     • baseline affine in the action a
     • use a first-order Taylor expansion of an (off-policy) critic (sketch below)
     • to reduce the variance of the on-policy estimate
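A rough sketch of the first-order Taylor control variate that, as I understand Q-Prop, is built from an off-policy critic Q_w around the policy mean μ(s). The critic, state, policy mean, and advantage value below are placeholders of my own, and this shows only the corrected advantage term, not the full Q-Prop estimator.

import numpy as np

def critic(s, a):
    # placeholder off-policy critic Q_w(s, a); assumed differentiable in a
    return -np.sum((a - 0.5 * s) ** 2)

def critic_grad_a(s, a, eps=1e-5):
    # numerical gradient of Q_w w.r.t. the action (an analytic gradient would be used in practice)
    g = np.zeros_like(a)
    for i in range(a.size):
        da = np.zeros_like(a)
        da[i] = eps
        g[i] = (critic(s, a + da) - critic(s, a - da)) / (2 * eps)
    return g

s = np.array([1.0, -2.0])
mu = 0.3 * s                                              # placeholder policy mean mu(s)
a = mu + 0.1 * np.random.default_rng(0).normal(size=2)    # sampled action
mc_advantage = 1.7                                        # stand-in for the Monte Carlo advantage estimate

# First-order Taylor expansion of the critic around a = mu(s), used as a control variate:
taylor_cv = critic_grad_a(s, mu) @ (a - mu)
corrected = mc_advantage - taylor_cv   # this residual multiplies grad log pi in the corrected estimator
print(corrected)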
  11. Action-dependent Control Variates for Policy Optimization via Stein Identity (1710.11198)
     • assume a re-parameterizable policy
     • baseline φ(s, a): arbitrary dependency on the action a
       (cf. ∇a,a φ(s, a) = 0 in Q-Prop)
     • Gaussian policy; θ1: mean parameters, θℓ: variance parameters
  12. Action-dependent Control Variates for Policy Optimization via Stein Identity (1710.11198)
     • I didn't know the Stein identity, but I regard eq. (6) as a direct consequence of the re-parameterization trick (I agree with AnonReviewer2); see the sketch below
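Since the slide frames eq. (6) as the re-parameterization trick, here is a minimal numerical check of that trick on a toy Gaussian (my own example, not the paper's setting): the gradient of E_{a~N(μ,σ²)}[f(a)] with respect to μ is estimated by differentiating through a = μ + σε.

import numpy as np

# Re-parameterization trick: a = mu + sigma * eps with eps ~ N(0, 1), so
# d/dmu E[f(a)] = E[f'(mu + sigma * eps)].  Toy choice: f(a) = a**2, analytic gradient 2*mu.
rng = np.random.default_rng(0)
mu, sigma = 0.7, 0.5
eps = rng.normal(size=200_000)
a = mu + sigma * eps
grad_estimate = np.mean(2.0 * a)     # f'(a) = 2a, pushed through the re-parameterization
print(grad_estimate, 2 * mu)         # both ~ 1.4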
  13. Guide Actor-Critic for Continuous Control (1705.07606)
     • second-order update in action space
     • method:
       • learn a guide actor (for each state) using second-order information
       • learn the parameterized actor πθ based on the guide actor
     • assume a Gaussian policy; the authors recommend a state-independent covariance
  14. Guide Actor-Critic for Continuous Control (1705.07606)
     • compute the guide actor: solve (the dual of) the objective on the slide
     • use the guide actor: a first-order method (e.g. Adam) to minimize the MSE
  15. TRPO (ICML'15, 1502.05477)
     • trust region: a constraint on KL(πθold || πθ)
     • §C Efficiently Solving the Trust-Region Constrained Optimization Problem (sketch below)
     • cf. Conservative Policy Iteration: Approximately Optimal Approximate Reinforcement Learning (ICML'02)
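A compact sketch of the kind of trust-region step §C describes: solve Hx ≈ g with conjugate gradients (using only Fisher-vector products) and rescale the step to the KL radius δ. The quadratic H and gradient g below are synthetic stand-ins, and a real implementation also adds a backtracking line search on the surrogate objective.

import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    # solve H x = g using only Hessian(Fisher)-vector products
    x, r = np.zeros_like(g), g.copy()
    p, rs_old = r.copy(), g @ g
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# synthetic positive-definite "Fisher" matrix and policy gradient
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
H = A @ A.T + 5 * np.eye(5)
g = rng.normal(size=5)
delta = 0.01                                        # KL trust-region radius

x = conjugate_gradient(lambda v: H @ v, g)          # x ~ H^{-1} g (natural gradient direction)
step = np.sqrt(2 * delta / (x @ g)) * x             # scale so that 0.5 * step^T H step ~ delta
print(0.5 * step @ H @ step)                        # ~ delta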
  16. Trust-PCL: An Off-Policy Trust Region Method for Continuous Control (1707.01891)
     • Path Consistency Learning (NIPS'17, 1702.08892) + a "surrogate" trust region method
     • PCL: consistency with an entropy bonus; 1-step consistency (sketch below)
     • "surrogate" trust region: (equations on slide)
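A small sketch of what I understand the 1-step path-consistency residual to be under entropy regularization; the value estimates, reward, policy probability, and coefficients below are placeholder numbers. PCL trains both the value function and the policy to drive this residual toward zero on on- and off-policy transitions.

import numpy as np

def one_step_consistency(v_s, v_next, r, log_pi, gamma=0.99, tau=0.1):
    # soft consistency: V(s) should equal r - tau * log pi(a|s) + gamma * V(s')
    # for the entropy-regularized optimal policy; PCL minimizes the squared residual
    return -v_s + r - tau * log_pi + gamma * v_next

# placeholder transition (s, a, r, s') with current value and policy estimates
residual = one_step_consistency(v_s=1.2, v_next=1.0, r=0.3, log_pi=np.log(0.25))
loss = 0.5 * residual ** 2          # both V and pi are updated to reduce this
print(residual, loss)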
  17. Trust-PCL: An Off-Policy Trust Region Method for Continuous Control (1707.01891)
     • "surrogate" trust region: (equations on slide)
     • automatically tune λ to hit a target divergence
     • KL(π*||π̃) is expressed analytically, with an expectation over a trajectory of π̃
     • the last 100 episodes are used for the approximation
  18. PPO Algorithms (1707.06347)
     • the "surrogate" trust region had already been mentioned in Deep Reinforcement Learning Through Policy Optimization [NIPS'16 Tutorial]
     • clipped surrogate objective (sketch below)
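A minimal sketch of the clipped surrogate objective; the batch of log-probabilities and advantages below is made up for illustration. The probability ratio is clipped to [1 − ε, 1 + ε] and the pessimistic minimum of the clipped and unclipped terms is taken.

import numpy as np

def ppo_clip_objective(log_pi_new, log_pi_old, advantage, clip_eps=0.2):
    # L^CLIP = E[ min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t) ],  r_t = pi_new / pi_old
    ratio = np.exp(log_pi_new - log_pi_old)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))   # maximized w.r.t. the new policy

# placeholder batch of action log-probabilities and advantage estimates
log_new = np.log(np.array([0.30, 0.10, 0.60]))
log_old = np.log(np.array([0.25, 0.20, 0.55]))
adv = np.array([1.0, -0.5, 2.0])
print(ppo_clip_objective(log_new, log_old, adv))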

  19. Parameter Space Noise for Exploration (1706.01905)
     • "To achieve structured exploration, we sample from a set of policies by applying additive Gaussian noise to the parameter vector of the current policy"
     • sample once per episode
     • adaptive noise scaling (sketch below)
     • re-parameterization trick for the on-policy case
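A minimal sketch (toy linear policy; the threshold, scaling factor, and batch of states are placeholders of my own) of additive Gaussian parameter noise with the adaptive-scaling idea: grow the noise scale when the perturbed policy's actions stay close to the unperturbed ones, shrink it otherwise.

import numpy as np

rng = np.random.default_rng(0)

def act(theta, s):
    return theta @ s                      # toy deterministic linear policy

theta = rng.normal(size=(2, 4))           # current policy parameters
sigma, delta, alpha = 0.1, 0.05, 1.01     # noise scale, target action distance, scaling factor

for episode in range(5):
    theta_tilde = theta + sigma * rng.normal(size=theta.shape)   # perturb once per episode
    states = rng.normal(size=(32, 4))                            # placeholder batch of states
    # distance between perturbed and unperturbed actions on the same states
    d = np.sqrt(np.mean((act(theta, states.T) - act(theta_tilde, states.T)) ** 2))
    sigma = sigma * alpha if d < delta else sigma / alpha        # adaptive noise scaling
    print(episode, d, sigma)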
  20. Parameter Space Noise for Exploration (1706.01905)
     • state-dependent exploration: "This ensures consistency in actions"
     • experiments: DQN/ALE, DDPG/MuJoCo, TRPO/MuJoCo
       • to test sample efficiency: a toy discrete env and continuous-control envs with sparse rewards
  21. Parameter Space Noise for Exploration (1706.01905)
     • separate policy head: π(·|s) trained to predict argmax_a Q(s, a)
     • perturb π
  22. Noisy Networks for Exploration (1706.10295)
     • noisy linear layer: learn μ and σ
     • sample noise per step if off-policy
     • factorized Gaussian noise (sketch below)
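A minimal sketch of a noisy linear layer with factorized Gaussian noise as I recall it: the weight noise is the outer product of per-input and per-output noise vectors passed through f(x) = sign(x)·√|x|. The layer sizes and initialization constants here are placeholders; in a real layer μ and σ are trained by backpropagation.

import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # squashing function used for factorized noise
    return np.sign(x) * np.sqrt(np.abs(x))

n_in, n_out = 4, 3
mu_w = rng.uniform(-1 / np.sqrt(n_in), 1 / np.sqrt(n_in), size=(n_out, n_in))
mu_b = rng.uniform(-1 / np.sqrt(n_in), 1 / np.sqrt(n_in), size=n_out)
sigma_w = np.full((n_out, n_in), 0.5 / np.sqrt(n_in))   # learnable in the real layer
sigma_b = np.full(n_out, 0.5 / np.sqrt(n_in))

def noisy_linear(x):
    # factorized Gaussian noise: one noise vector per input unit, one per output unit
    eps_in, eps_out = rng.normal(size=n_in), rng.normal(size=n_out)
    w = mu_w + sigma_w * np.outer(f(eps_out), f(eps_in))
    b = mu_b + sigma_b * f(eps_out)
    return w @ x + b

print(noisy_linear(np.ones(n_in)))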
  23. Unifying Count-Based Exploration and Intrinsic Motivation (NIPS'16, 1606.01868)
     • for a state x, define a pseudo-count N̂n(x) and a pseudo-count total n̂
       from a density model ρ(x), where ρ'(x) is the density after observing a new x (sketch below)
     • experiments: ALE; improved a lot on Montezuma's Revenge
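A small sketch of the pseudo-count recovered from a density model, assuming the relation between ρ(x) (before the observation) and ρ'(x) (after observing x one more time); the two density values below are placeholder numbers.

def pseudo_count(rho, rho_prime):
    # Solve  rho = N / n  and  rho' = (N + 1) / (n + 1)  for the pseudo-count N:
    #   N_hat = rho * (1 - rho') / (rho' - rho)
    return rho * (1.0 - rho_prime) / (rho_prime - rho)

# placeholder densities of a state before and after it is observed once more
rho, rho_prime = 0.010, 0.012
n_hat = pseudo_count(rho, rho_prime)
print(n_hat)          # pseudo-count; an exploration bonus decreasing in n_hat is then added to the reward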
  24. DORA The Explorer: Directed Outreaching Reinforcement Action-Selection (1804.04012)
     • initialize E(s, a) = 1, then update (sketch below)
     • "The logarithm of E-Values can be thought of as a generalization of visit counters"
     • choose argmax( (policy using Q) / counter )
     • experiments: grid bridge, Freeway (Atari)
     ※ the title is a nod to the TV show "Dora the Explorer"
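A rough sketch of E-values as I read the quoted claim: they are learned like action values with zero reward and initialization 1, so repeated visits shrink them and their logarithm behaves like a (generalized) visit counter. The exact update form, the learning rate, and γ_E are my assumptions, not taken from the slide.

import numpy as np

n_states, n_actions = 5, 2
alpha, gamma_e = 0.1, 0.0                 # gamma_e = 0 makes the counter interpretation exact
E = np.ones((n_states, n_actions))        # initialize E(s, a) = 1

def update_e(s, a, s_next, a_next):
    # SARSA-like update with reward 0 (an assumption about the exact form);
    # gamma_e > 0 would also propagate "unexploredness" from successor states
    E[s, a] += alpha * (0.0 + gamma_e * E[s_next, a_next] - E[s, a])

for _ in range(10):                       # visit (s=0, a=0) ten times
    update_e(0, 0, 1, 0)

# log E (base 1 - alpha) recovers the visit count here, matching the
# "generalization of visit counters" reading of the E-values
counter = np.log(E[0, 0]) / np.log(1.0 - alpha)
print(E[0, 0], counter)                   # E ~ 0.9**10, counter ~ 10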
  25. MAML (1703.03400, ICML'17)
     • Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
     • average the second updates over sampled tasks
     • (equations on slide: adaptation update, meta update)
  26. MAML (1703.03400, ICML'17)
     • Algorithm 3: MAML for Reinforcement Learning (§3.2)
     • loss for an RL task: 1. sample trajectories, 2. compute the policy gradient
     • (equations on slide: adaptation update, meta update; sketch below)
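A minimal sketch of the two-level MAML structure (adaptation update inside, meta update outside) on a toy quadratic loss whose gradients are analytic; the tasks, learning rates, and loss are my own stand-ins for the RL version on the slide.

import numpy as np

# Toy MAML: per-task loss L_i(theta) = 0.5 * ||theta - c_i||^2, so grad L_i(theta) = theta - c_i.
alpha, beta = 0.1, 0.1                    # adaptation / meta learning rates
theta = np.zeros(2)                       # meta-parameters
tasks = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]   # task optima c_i

for _ in range(500):
    meta_grad = np.zeros_like(theta)
    for c in tasks:
        theta_i = theta - alpha * (theta - c)          # adaptation update (one inner gradient step)
        # meta-gradient of L_i(theta_i) w.r.t. theta; the (1 - alpha) factor is the
        # second-order term from differentiating through the adaptation update
        meta_grad += (1.0 - alpha) * (theta_i - c)
    theta -= beta * meta_grad / len(tasks)             # meta update, averaged over sampled tasks

print(theta)    # ends up near the average of the task optima for this symmetric toy problem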
  27. Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments (1710.03641)
     • "A nonstationary environment can be seen as a sequence of stationary tasks, and hence we propose to tackle it as a multi-task learning problem"
     • method: meta-learning at training time, adaptation at execution time
  28. meta-learning at training time (Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments, 1710.03641)
     • Markov chain of tasks
     • meta-loss on a pair of consecutive tasks: train with (K trajectories of) Ti, test with Ti+1
     • start from θ (not φi) "due to stability considerations"
  29. adaptation at execution time (Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments, 1710.03641)
     • importance-weight correction
     • trained: adaptation update from θ
     • available: a trajectory of Ti−1 collected using φi−1
  30. experiments (Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments, 1710.03641)
     • dynamic env: the torques of 2 legs decay (panel b)
     • competitive env: RoboSumo (panel c)
  31. Learning an Embedding Space for Transferable Robot Skills
     • §7 Experimental Results: "Our experiments aim to answer the following questions:"
       1. Can our method learn versatile skills?
       2. Can it determine how many distinct skills are necessary to accomplish a set of tasks?
       3. Can we use the learned skill embedding for control in an unseen scenario?
       4. Is it important for the skills to be versatile to use their embedding for control?
       5. Is it more efficient to use the learned embedding rather than to learn to solve a task from scratch?
  32. Learning an Embedding Space for Transferable Robot Skills
     • manipulation tasks: learn two skills (columns 1, 2) and use them (column 3)
     • Spring-wall, L-wall, Rail-push
  33. Summary
     • I grouped the papers into rough themes for this presentation:
       baseline, 2nd-order gradient, exploration, meta-learning
     • papers fairly often span more than one theme
     • even when the motivations are similar, the proposals can differ
     • on arXiv, competing papers appear within a month of each other