
Deep Reinforcement Learning

Yuchu Luo
August 28, 2017

Transcript

  1. Play Atari Game
     • Objective: Complete the game with the highest score
     • State: Raw pixel inputs of the game state
     • Action: Game controls, e.g. Left, Right, Up, Down
     • Reward: Score increase/decrease at each time step
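As a concrete picture of this loop, here is a minimal sketch assuming an OpenAI-Gym-style Atari environment with the classic 4-tuple step API; the environment id and the random placeholder policy are assumptions, not part of the slides.

```python
# Minimal agent-environment loop, assuming a Gym-style Atari env with the
# classic 4-tuple step API ("Breakout-v4" is just a placeholder id).
import gym

env = gym.make("Breakout-v4")
state = env.reset()                       # state: raw pixel observation
total_score, done = 0.0, False

while not done:
    action = env.action_space.sample()    # placeholder policy: random game control
    state, reward, done, info = env.step(action)   # reward: score change this step
    total_score += reward

print("episode score:", total_score)
```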
  2. Markov Decision Process
     • A set of states s ∈ S
     • A set of actions (per state) a ∈ A
     • A model T(s, a, s')
     • A reward function R(s, a, s')
     • Looking for a policy π*(s) that maximizes the cumulative discounted reward ∑_t γ^t r_t
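A minimal sketch of how these pieces can be written down in code; the tiny two-state MDP below is invented purely for illustration.

```python
# The MDP ingredients as plain Python containers (toy two-state example).
states  = ["s0", "s1"]
actions = {"s0": ["stay", "go"], "s1": ["stay"]}         # actions per state

# Model T(s, a, s') and reward function R(s, a, s')
T = {("s0", "stay", "s0"): 1.0, ("s0", "go", "s1"): 1.0, ("s1", "stay", "s1"): 1.0}
R = {("s0", "stay", "s0"): 0.0, ("s0", "go", "s1"): 1.0, ("s1", "stay", "s1"): 0.5}

gamma = 0.9  # discount factor

def discounted_return(rewards, gamma=gamma):
    """Cumulative discounted reward: sum_t gamma^t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```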
  3. Find the optimal policy π*(s). Two routes:
     • Policy Learning: learn the policy directly and act by sampling, a ~ π*(s) (local information)
     • Value Learning: find the optimal Q-value function Q*(s,a), then act by a = argmax_{a'} Q*(s,a') (global information)
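The value-learning route reduces acting to an argmax over Q-values; a small sketch, assuming Q* is stored as a dict keyed by (state, action):

```python
# a = argmax_{a'} Q*(s, a'), with Q* stored as a dict keyed by (state, action).
def greedy_action(Q, state, available_actions):
    return max(available_actions, key=lambda a: Q[(state, a)])
```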
  4. Q*(s_t, a_t) is the maximum expected future reward starting at state s_t, choosing action a_t, and then following an optimal policy π*:
     Q*(s_t, a_t) = max_π E[ ∑_{i=t}^{T} γ^{i−t} r_i ]
  5. Bellman Equation
     Q*(s,a) = E_{s'∼ε}[ r + γ max_{a'} Q*(s',a') | s, a ]
     Intuition: cut a small piece off the end of an optimally chosen path, and the remaining path is still an optimal path.
  6. Solving for the Optimal Q-Value: Value Iteration
     Q_{k+1}(s,a) = E[ R(s,a,s') + γ max_{a'} Q_k(s',a') | s, a ]
                  = ∑_{s'} T(s,a,s') [ R(s,a,s') + γ max_{a'} Q_k(s',a') ]
     This iteration is convergent.
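A sketch of value iteration on Q, assuming the finite MDP is given as the dicts T[(s, a, s')] and R[(s, a, s')] from the slide-2 sketch above; it simply applies the Bellman backup for a fixed number of sweeps.

```python
# Q-value iteration: repeatedly apply the Bellman backup above.
def q_value_iteration(states, actions, T, R, gamma=0.9, n_sweeps=100):
    Q = {(s, a): 0.0 for s in states for a in actions[s]}
    for _ in range(n_sweeps):
        new_Q = {}
        for s in states:
            for a in actions[s]:
                backup = 0.0
                for s2 in states:
                    p = T.get((s, a, s2), 0.0)
                    if p == 0.0:
                        continue
                    best_next = max(Q[(s2, a2)] for a2 in actions[s2])
                    backup += p * (R.get((s, a, s2), 0.0) + gamma * best_next)
                new_Q[(s, a)] = backup
        Q = new_Q                      # the backup is a contraction, so this converges
    return Q
```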
  7. Partially Observed MDP (POMDP)
     Life is always hard: we don't know T(s,a,s') and R(s,a,s').
     This is the setting of Reinforcement Learning.
  8. Episode: a sequence of states and actions
     s0, a0, r0, s1, a1, r1, s2, a2, r2, …, s_{T−1}, a_{T−1}, r_{T−1}, s_T, r_T
     The agent observes states, takes actions, and obtains reward R until Game Over.
     Learn to maximize the expected cumulative reward per episode.
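A sketch of collecting one such episode and its cumulative reward, again assuming a Gym-style 4-tuple step API and some `policy(state)` function:

```python
# Roll out one episode s0,a0,r0,...,s_T,r_T and record its total reward.
def run_episode(env, policy):
    trajectory = []                    # (state, action, reward) triples
    state = env.reset()
    done, total_reward = False, 0.0
    while not done:                    # until Game Over
        action = policy(state)
        next_state, reward, done, _ = env.step(action)
        trajectory.append((state, action, reward))
        total_reward += reward
        state = next_state
    return trajectory, total_reward
```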
  9. Recap: Approximate Q-Learning
     Linear value functions: Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + … + w_n f_n(s,a)
     Feature-based representations, e.g. for Pacman:
     • Distance to closest ghost
     • Distance to closest dot
     • Number of ghosts
     • 1 / (distance to dot)²
     • Is Pacman in a tunnel? (0/1)
     • …
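A sketch of such a feature-based linear Q-function; the Pacman-style features below are illustrative placeholders, and in a real implementation they would be computed from the state reached by taking `action`.

```python
import numpy as np

def features(state, action):
    # Placeholder Pacman-style features f_i(s, a); real ones would look at the
    # successor position implied by `action`.
    return np.array([
        state["dist_closest_ghost"],
        state["dist_closest_dot"],
        state["num_ghosts"],
        1.0 / (state["dist_closest_dot"] ** 2 + 1e-8),
        1.0 if state["in_tunnel"] else 0.0,
    ])

def q_value(w, state, action):
    # Q(s, a) = w . f(s, a)
    return float(np.dot(w, features(state, action)))
```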
  10. Q-Learning Update
      Q(s,a) ← (1−α) Q(s,a) + α (r + γ max_{a'} Q(s',a'))
      (the first term keeps historical experience, the second is what is learned from the new (s, a, r, s') pair)
      Equivalently, with difference = [r + γ max_{a'} Q(s',a')] − Q(s,a):
      Q(s,a) ← Q(s,a) + α · difference
      For a feature-based Q-function, the weight update is w_i ← w_i + α · difference · f_i(s,a)
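A sketch of this update for a single (s, a, r, s') sample, reusing the `features` / `q_value` helpers from the previous sketch; the learning-rate and discount values are arbitrary.

```python
# One approximate Q-learning step: w_i <- w_i + alpha * difference * f_i(s, a).
def q_learning_update(w, s, a, r, s_next, next_actions, alpha=0.01, gamma=0.99):
    best_next = max(q_value(w, s_next, a2) for a2 in next_actions)
    difference = (r + gamma * best_next) - q_value(w, s, a)
    return w + alpha * difference * features(s, a)
```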
  11. Deep Q-Learning
      Now we have deep learning: make the function approximator a deep neural network, Q(s,a;θ) ≈ Q*(s,a).
      Loss function: L_i(θ_i) = E_{s,a∼ρ(·)}[ (y_i − Q(s,a;θ_i))² ]
      where y_i = E_{s'∼ε}[ r + γ max_{a'} Q(s',a';θ_{i−1}) | s, a ]
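A sketch of this loss on a mini-batch in PyTorch, assuming a separate `target_net` holds the older parameters θ_{i−1}; the tensor shapes and the `dones` mask are assumptions.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # Q(s, a; theta) for the actions actually taken (actions: int64 tensor).
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                   # y_i uses old params theta_{i-1}
        max_next = target_net(next_states).max(dim=1).values
        y = rewards + gamma * (1.0 - dones) * max_next      # no bootstrap past terminal states
    return F.mse_loss(q_sa, y)                              # (y_i - Q(s,a;theta_i))^2
```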
  12. Deep Q-Learning: Feedforward Pass
      Current state s_t (84×84×4, a stack of the last 4 frames) → Convolutional Neural Network → Fully Connected Layer → Output (4 Q-values)
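A sketch of this feedforward pass in PyTorch, loosely following the DeepMind DQN layer sizes; the exact kernel sizes, strides, and the 512-unit hidden layer are assumptions.

```python
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, n_actions=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84x84 -> 20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20x20 -> 9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9x9   -> 7x7
        )
        self.fc = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),      # one Q-value per action (4 for this game)
        )

    def forward(self, x):                   # x: (batch, 4, 84, 84) stack of last 4 frames
        h = self.conv(x)
        return self.fc(h.flatten(start_dim=1))
```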
  13. Experience Replay
      Learning from batches of consecutive samples is problematic:
      • Samples are correlated => inefficient learning
      • Current Q-network parameters determine the next training samples (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side) => can lead to bad feedback loops
      Address these problems using experience replay (see the sketch below):
      • Continually update a replay memory table of transitions (s_t, a_t, r_t, s_{t+1}) as game (experience) episodes are played
      • Train the Q-network on random mini-batches of transitions from the replay memory, instead of consecutive samples
      • Each transition can also contribute to multiple weight updates => greater data efficiency
      From CS231n
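A sketch of such a replay memory; the capacity and batch size are arbitrary.

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)      # oldest transitions dropped first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Random mini-batch instead of consecutive samples -> breaks correlation.
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))                  # states, actions, rewards, next_states, dones
```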
  14. Policy Gradients
      Instead of learning the exact value of every (state, action) pair, just find the best policy from a collection of policies.
      A neural network maps the state to a probability for each action, e.g. left 0.6, right 0.1, fire 0.3.
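A sketch of such a policy network in PyTorch; the hidden-layer size and a flat state vector (rather than raw pixels) are assumptions.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, state_dim, n_actions=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),                     # logits for left / right / fire
        )

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)      # e.g. [0.6, 0.1, 0.3]
```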
  15. REINFORCE Algorithm
      ∇_θ J(θ) ≈ r(τ) ∑_{t≥0} ∇_θ log π_θ(a_t | s_t)
      Intuition:
      • If r(τ) is high, push up the probabilities of the actions seen
      • If r(τ) is low, push down the probabilities of the actions seen
      Learn more in the supplied materials.
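A sketch of one REINFORCE update for a single episode, reusing the `PolicyNet` sketch above; weighting the summed log-probabilities by the whole-episode return r(τ) is exactly the approximation on the slide, so a high return pushes all visited actions up and a low return pushes them down.

```python
import torch
from torch.distributions import Categorical

def reinforce_update(policy_net, optimizer, states, actions, episode_return):
    # states: (T, state_dim) float tensor, actions: (T,) int64 tensor.
    probs = policy_net(states)                              # pi_theta(. | s_t) for every step
    log_probs = Categorical(probs=probs).log_prob(actions)  # log pi_theta(a_t | s_t)
    loss = -episode_return * log_probs.sum()                # minimizing -J ascends J(theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```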
  16. Actor-Critic Neural Network
      • Actor network: outputs a probability for each action, e.g. left 0.6, right 0.1, fire 0.3
      • Critic network: outputs a Q-value for each action, e.g. left 40, right 33, fire 72
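A sketch of the two networks in PyTorch, kept as separate heads for simplicity; the layer sizes and the flat state input are assumptions.

```python
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, state_dim, n_actions=3):
        super().__init__()
        self.actor = nn.Sequential(                 # action probabilities, e.g. [0.6, 0.1, 0.3]
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions), nn.Softmax(dim=-1),
        )
        self.critic = nn.Sequential(                # one Q-value per action, e.g. [40, 33, 72]
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, state):
        return self.actor(state), self.critic(state)
```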
  17. Summary
      • Policy gradients: general, but suffer from high variance, so they require a lot of samples. Challenge: sample efficiency
      • Q-learning: does not always work, but when it works it is usually more sample-efficient. Challenge: exploration
      • Guarantees:
        • Policy gradients: converge to a local minimum of J(θ), often good enough!
        • Q-learning: zero guarantees, since you are approximating the Bellman equation with a complicated function approximator
      From CS231n