
Deep Reinforcement Learning

Yuchu Luo
August 28, 2017

Transcript

  1. Play Atari Game
     • Objective: Complete the game with the highest score
     • State: Raw pixel inputs of the game state
     • Action: Game controls, e.g. Left, Right, Up, Down
     • Reward: Score increase/decrease at each time step
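As a concrete picture of this loop, here is a minimal sketch assuming an OpenAI-Gym-style Atari environment with the classic 4-tuple step API; the environment id and the random placeholder policy are assumptions, not part of the slides.

```python
# Minimal agent-environment loop, assuming a Gym-style Atari env with the
# classic 4-tuple step API ("Breakout-v4" is just a placeholder id).
import gym

env = gym.make("Breakout-v4")
state = env.reset()                       # state: raw pixel observation
total_score, done = 0.0, False

while not done:
    action = env.action_space.sample()    # placeholder policy: random game control
    state, reward, done, info = env.step(action)   # reward: score change this step
    total_score += reward

print("episode score:", total_score)
```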
  2. Markov Decision Process
     • A set of states s ∈ S
     • A set of actions (per state) a ∈ A
     • A model T(s, a, s')
     • A reward function R(s, a, s')
     • Looking for a policy π*(s) that maximizes the cumulative discounted reward ∑_t γ^t r_t
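A minimal sketch of how these pieces can be written down in code; the tiny two-state MDP below is invented purely for illustration.

```python
# The MDP ingredients as plain Python containers (toy two-state example).
states  = ["s0", "s1"]
actions = {"s0": ["stay", "go"], "s1": ["stay"]}         # actions per state

# Model T(s, a, s') and reward function R(s, a, s')
T = {("s0", "stay", "s0"): 1.0, ("s0", "go", "s1"): 1.0, ("s1", "stay", "s1"): 1.0}
R = {("s0", "stay", "s0"): 0.0, ("s0", "go", "s1"): 1.0, ("s1", "stay", "s1"): 0.5}

gamma = 0.9  # discount factor

def discounted_return(rewards, gamma=gamma):
    """Cumulative discounted reward: sum_t gamma^t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```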
  3. Find the optimal policy π*(s). Two routes:
     • Policy Learning: learn the policy directly and act by sampling, a ~ π*(s) (local information)
     • Value Learning: find the optimal Q-value function Q*(s,a), then act by a = argmax_{a'} Q*(s,a') (global information)
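The value-learning route reduces acting to an argmax over Q-values; a small sketch, assuming Q* is stored as a dict keyed by (state, action):

```python
# a = argmax_{a'} Q*(s, a'), with Q* stored as a dict keyed by (state, action).
def greedy_action(Q, state, available_actions):
    return max(available_actions, key=lambda a: Q[(state, a)])
```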
  4. Q*(s_t, a_t) is the maximum expected future reward starting at state s_t, choosing action a_t, and then following an optimal policy π*:
     Q*(s_t, a_t) = max_π E[ ∑_{i=t}^{T} γ^{i−t} r_i ]
  5. Bellman Equation
     Q*(s,a) = E_{s'∼ε}[ r + γ max_{a'} Q*(s',a') | s, a ]
     Intuition: cut a small piece off the end of an optimally chosen path, and the remaining path is still an optimal path.
  6. Solving for the Optimal Q-Value: Value Iteration
     Q_{k+1}(s,a) = E[ R(s,a,s') + γ max_{a'} Q_k(s',a') | s, a ]
                  = ∑_{s'} T(s,a,s') [ R(s,a,s') + γ max_{a'} Q_k(s',a') ]
     This iteration is convergent.
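A sketch of value iteration on Q, assuming the finite MDP is given as the dicts T[(s, a, s')] and R[(s, a, s')] from the slide-2 sketch above; it simply applies the Bellman backup for a fixed number of sweeps.

```python
# Q-value iteration: repeatedly apply the Bellman backup above.
def q_value_iteration(states, actions, T, R, gamma=0.9, n_sweeps=100):
    Q = {(s, a): 0.0 for s in states for a in actions[s]}
    for _ in range(n_sweeps):
        new_Q = {}
        for s in states:
            for a in actions[s]:
                backup = 0.0
                for s2 in states:
                    p = T.get((s, a, s2), 0.0)
                    if p == 0.0:
                        continue
                    best_next = max(Q[(s2, a2)] for a2 in actions[s2])
                    backup += p * (R.get((s, a, s2), 0.0) + gamma * best_next)
                new_Q[(s, a)] = backup
        Q = new_Q                      # the backup is a contraction, so this converges
    return Q
```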
  7. Partially Observed MDP (POMDP)
     Life is always hard: we don't know T(s,a,s') and R(s,a,s').
     This is the setting of Reinforcement Learning.
  8. Episode: a sequence of states and actions
     s0, a0, r0, s1, a1, r1, s2, a2, r2, …, s_{T−1}, a_{T−1}, r_{T−1}, s_T, r_T
     The agent observes states, takes actions, and obtains reward R until Game Over.
     Learn to maximize the expected cumulative reward per episode.
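A sketch of collecting one such episode and its cumulative reward, again assuming a Gym-style 4-tuple step API and some `policy(state)` function:

```python
# Roll out one episode s0,a0,r0,...,s_T,r_T and record its total reward.
def run_episode(env, policy):
    trajectory = []                    # (state, action, reward) triples
    state = env.reset()
    done, total_reward = False, 0.0
    while not done:                    # until Game Over
        action = policy(state)
        next_state, reward, done, _ = env.step(action)
        trajectory.append((state, action, reward))
        total_reward += reward
        state = next_state
    return trajectory, total_reward
```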
  9. Recap: Approximate Q-Learning
     Linear value functions: Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + … + w_n f_n(s,a)
     Feature-based representations, e.g. for Pacman:
     • Distance to closest ghost
     • Distance to closest dot
     • Number of ghosts
     • 1 / (distance to dot)²
     • Is Pacman in a tunnel? (0/1)
     • …
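A sketch of such a feature-based linear Q-function; the Pacman-style features below are illustrative placeholders, and in a real implementation they would be computed from the state reached by taking `action`.

```python
import numpy as np

def features(state, action):
    # Placeholder Pacman-style features f_i(s, a); real ones would look at the
    # successor position implied by `action`.
    return np.array([
        state["dist_closest_ghost"],
        state["dist_closest_dot"],
        state["num_ghosts"],
        1.0 / (state["dist_closest_dot"] ** 2 + 1e-8),
        1.0 if state["in_tunnel"] else 0.0,
    ])

def q_value(w, state, action):
    # Q(s, a) = w . f(s, a)
    return float(np.dot(w, features(state, action)))
```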
  10. Q-Learning Update
      Q(s,a) ← (1−α) Q(s,a) + α (r + γ max_{a'} Q(s',a'))
      (the first term keeps historical experience, the second is what is learned from the new (s, a, r, s') pair)
      Equivalently, with difference = [r + γ max_{a'} Q(s',a')] − Q(s,a):
      Q(s,a) ← Q(s,a) + α · difference
      For a feature-based Q-function, the weight update is w_i ← w_i + α · difference · f_i(s,a)
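A sketch of this update for a single (s, a, r, s') sample, reusing the `features` / `q_value` helpers from the previous sketch; the learning-rate and discount values are arbitrary.

```python
# One approximate Q-learning step: w_i <- w_i + alpha * difference * f_i(s, a).
def q_learning_update(w, s, a, r, s_next, next_actions, alpha=0.01, gamma=0.99):
    best_next = max(q_value(w, s_next, a2) for a2 in next_actions)
    difference = (r + gamma * best_next) - q_value(w, s, a)
    return w + alpha * difference * features(s, a)
```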
  11. Deep Q-Learning
      Now we have deep learning: make the function approximator a deep neural network, Q(s,a;θ) ≈ Q*(s,a).
      Loss function: L_i(θ_i) = E_{s,a∼ρ(·)}[ (y_i − Q(s,a;θ_i))² ]
      where y_i = E_{s'∼ε}[ r + γ max_{a'} Q(s',a';θ_{i−1}) | s, a ]
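A sketch of this loss on a mini-batch in PyTorch, assuming a separate `target_net` holds the older parameters θ_{i−1}; the tensor shapes and the `dones` mask are assumptions.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # Q(s, a; theta) for the actions actually taken (actions: int64 tensor).
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                   # y_i uses old params theta_{i-1}
        max_next = target_net(next_states).max(dim=1).values
        y = rewards + gamma * (1.0 - dones) * max_next      # no bootstrap past terminal states
    return F.mse_loss(q_sa, y)                              # (y_i - Q(s,a;theta_i))^2
```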
  12. Deep Q-Learning: Feedforward Pass
      Current state s_t (84×84×4, a stack of the last 4 frames) → Convolutional Neural Network → Fully Connected Layer → Output (4 Q-values)
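A sketch of this feedforward pass in PyTorch, loosely following the DeepMind DQN layer sizes; the exact kernel sizes, strides, and the 512-unit hidden layer are assumptions.

```python
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, n_actions=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84x84 -> 20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20x20 -> 9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9x9   -> 7x7
        )
        self.fc = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),      # one Q-value per action (4 for this game)
        )

    def forward(self, x):                   # x: (batch, 4, 84, 84) stack of last 4 frames
        h = self.conv(x)
        return self.fc(h.flatten(start_dim=1))
```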
  13. Experience Replay
      Learning from batches of consecutive samples is problematic:
      • Samples are correlated => inefficient learning
      • Current Q-network parameters determine the next training samples (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side) => can lead to bad feedback loops
      Address these problems using experience replay (see the sketch below):
      • Continually update a replay memory table of transitions (s_t, a_t, r_t, s_{t+1}) as game (experience) episodes are played
      • Train the Q-network on random mini-batches of transitions from the replay memory, instead of consecutive samples
      • Each transition can also contribute to multiple weight updates => greater data efficiency
      From CS231n
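A sketch of such a replay memory; the capacity and batch size are arbitrary.

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)      # oldest transitions dropped first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Random mini-batch instead of consecutive samples -> breaks correlation.
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))                  # states, actions, rewards, next_states, dones
```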
  14. Policy Gradients
      Instead of learning the exact value of every (state, action) pair, just find the best policy from a collection of policies.
      A neural network maps the state to a probability for each action, e.g. left 0.6, right 0.1, fire 0.3.
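A sketch of such a policy network in PyTorch; the hidden-layer size and a flat state vector (rather than raw pixels) are assumptions.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, state_dim, n_actions=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),                     # logits for left / right / fire
        )

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)      # e.g. [0.6, 0.1, 0.3]
```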
  15. REINFORCE Algorithm
      ∇_θ J(θ) ≈ r(τ) ∑_{t≥0} ∇_θ log π_θ(a_t | s_t)
      Intuition:
      • If r(τ) is high, push up the probabilities of the actions seen
      • If r(τ) is low, push down the probabilities of the actions seen
      Learn more in the supplied materials.
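A sketch of one REINFORCE update for a single episode, reusing the `PolicyNet` sketch above; weighting the summed log-probabilities by the whole-episode return r(τ) is exactly the approximation on the slide, so a high return pushes all visited actions up and a low return pushes them down.

```python
import torch
from torch.distributions import Categorical

def reinforce_update(policy_net, optimizer, states, actions, episode_return):
    # states: (T, state_dim) float tensor, actions: (T,) int64 tensor.
    probs = policy_net(states)                              # pi_theta(. | s_t) for every step
    log_probs = Categorical(probs=probs).log_prob(actions)  # log pi_theta(a_t | s_t)
    loss = -episode_return * log_probs.sum()                # minimizing -J ascends J(theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```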
  16. Actor-Critic Neural Network
      • Actor network: outputs a probability for each action, e.g. left 0.6, right 0.1, fire 0.3
      • Critic network: outputs a Q-value for each action, e.g. left 40, right 33, fire 72
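A sketch of the two networks in PyTorch, kept as separate heads for simplicity; the layer sizes and the flat state input are assumptions.

```python
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, state_dim, n_actions=3):
        super().__init__()
        self.actor = nn.Sequential(                 # action probabilities, e.g. [0.6, 0.1, 0.3]
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions), nn.Softmax(dim=-1),
        )
        self.critic = nn.Sequential(                # one Q-value per action, e.g. [40, 33, 72]
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, state):
        return self.actor(state), self.critic(state)
```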
  17. Summary
      • Policy gradients: general, but suffer from high variance, so they require a lot of samples. Challenge: sample efficiency
      • Q-learning: does not always work, but when it works it is usually more sample-efficient. Challenge: exploration
      • Guarantees:
        • Policy gradients: converge to a local minimum of J(θ), often good enough!
        • Q-learning: zero guarantees, since you are approximating the Bellman equation with a complicated function approximator
      From CS231n