Stephen Tu on "Least Squares Policy Iteration"

LEAST SQUARES POLICY ITERATION (LSPI)
M. G. Lagoudakis and R. Parr
Presented by Stephen Tu, PWL 8/31

REINFORCEMENT LEARNING IN < 5 MINUTES

RL PRIMER

RL PRIMER
• Formalized via a Markov Decision Process (MDP).
• An MDP is a 5-tuple $(S, A, p, \gamma, r)$:
  • $S$ is the state space (e.g. position on a grid),
  • $A$ is the action space (e.g. left and right),
  • $p : S \times A \to \Delta(S)$ is the transition function,
  • $\gamma$ is the discount factor,
  • $r : S \times A \to \mathbb{R}$ is the reward function.
Unknown to the algorithm!

RL PRIMER
• Goal of RL is to find a policy $\pi : S \to A$ that optimizes (over all policies)
$$V^\pi = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r(s_k, a_k)\right], \qquad \underbrace{s_{k+1} \sim p(\cdot \mid s_k, \pi(s_k))}_{\text{“dynamics”}}.$$
• In this talk, we focus primarily on the easier problem of scoring a particular (fixed) policy.

RL PRIMER
• For a policy $\pi$, define the value function as
$$V^\pi(s) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r(s_k, a_k) \,\middle|\, s_0 = s\right].$$
• How do we evaluate this function?
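One direct (if inefficient) answer is to average truncated discounted returns over sampled rollouts. The sketch below is only an illustration of the definition above, not something from the talk: the two-state environment built by `make_step`, its `slip` probability, and the helper names `rollout_return` and `mc_value` are all hypothetical.

```python
import numpy as np

def make_step(slip):
    """Hypothetical 2-state MDP: action 0 stays, action 1 flips the state.

    With probability `slip` the opposite action is applied.
    The reward is 1 in state 0 and 0 in state 1.
    """
    def step(s, a, rng):
        if rng.random() < slip:
            a = 1 - a
        reward = 1.0 if s == 0 else 0.0
        s_next = s if a == 0 else 1 - s
        return reward, s_next
    return step

def rollout_return(step, policy, s0, gamma, horizon, rng):
    """Discounted return sum_k gamma^k r(s_k, a_k), truncated at `horizon` steps."""
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        r, s = step(s, a, rng)
        total += discount * r
        discount *= gamma
    return total

def mc_value(step, policy, s0, gamma, horizon=200, n_rollouts=100, seed=0):
    """Monte Carlo estimate of V^pi(s0): average return over independent rollouts."""
    rng = np.random.default_rng(seed)
    returns = [rollout_return(step, policy, s0, gamma, horizon, rng)
               for _ in range(n_rollouts)]
    return float(np.mean(returns))

always_flip = lambda s: 1  # a fixed policy pi(s) = 1
print(mc_value(make_step(slip=0.1), always_flip, s0=0, gamma=0.9))
```

With `slip=0.0` the rollouts are deterministic and the estimate should match the closed form $1/(1-\gamma^2)$ up to the truncation error, which makes a convenient sanity check.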

RL PRIMER
• Fundamental equation of RL (Bellman's equation):
$$V^\pi(s) = r(s, \pi(s)) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s, \pi(s))}\left[V^\pi(s')\right].$$
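For a finite MDP with known dynamics, this equation is a linear system $V = r_\pi + \gamma P_\pi V$ and can be solved directly. The following is a minimal sketch under that assumption; the arrays `P` and `r`, the policy, and the function name `evaluate_policy` are made up for illustration.

```python
import numpy as np

def evaluate_policy(P, r, policy, gamma):
    """Solve V = r_pi + gamma * P_pi V for a finite MDP.

    P[s, a, s'] are transition probabilities, r[s, a] are rewards,
    and policy[s] is the action chosen in state s.
    """
    n_states = P.shape[0]
    states = np.arange(n_states)
    P_pi = P[states, policy]   # (n_states, n_states): transitions under the policy
    r_pi = r[states, policy]   # (n_states,): rewards under the policy
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Made-up 2-state, 2-action MDP.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])
policy = np.array([1, 0])
gamma = 0.9

V = evaluate_policy(P, r, policy, gamma)
print(V)
```

In RL proper, `P` and `r` are not available to the algorithm, which is why the rest of the talk works from sampled transitions instead.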

BELLMAN’S EQUATION IN ACTION

LINEAR QUADRATIC REGULATOR
• (Discrete-time) linear, time-invariant system:
$$x_{k+1} = A x_k + B u_k + w_k, \qquad w_k \sim \mathcal{N}(0, I).$$
• Quadratic reward function:
$$r(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k, \qquad Q \succeq 0, \quad R \succ 0.$$
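To make the setup concrete, here is a small sketch that rolls out this system under a linear feedback $u_k = K x_k$ and records the per-step reward. The matrices `A`, `B`, `K`, `Q`, `R` and the name `simulate_lqr` are illustrative assumptions, not values from the talk.

```python
import numpy as np

def simulate_lqr(A, B, K, Q, R, n_steps, rng, x0=None):
    """Roll out x_{k+1} = A x_k + B u_k + w_k with u_k = K x_k, w_k ~ N(0, I)."""
    d = A.shape[0]
    x = np.zeros(d) if x0 is None else np.asarray(x0, dtype=float)
    states = np.empty((n_steps + 1, d))
    rewards = np.empty(n_steps)
    states[0] = x
    for k in range(n_steps):
        u = K @ x
        rewards[k] = x @ Q @ x + u @ R @ u          # quadratic reward r(x_k, u_k)
        x = A @ x + B @ u + rng.standard_normal(d)  # noisy linear dynamics
        states[k + 1] = x
    return states, rewards

# Made-up 2-dimensional system with a stabilizing gain.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
K = np.array([[-0.1, -0.2]])
Q, R = np.eye(2), np.eye(1)

states, rewards = simulate_lqr(A, B, K, Q, R, n_steps=1000,
                               rng=np.random.default_rng(0))
print(states.shape, rewards.mean())
```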

LINEAR QUADRATIC REGULATOR
• Let’s derive $V^\pi(x)$ for LQR, for feedback $\pi(x) = Kx$.
• Step 1: Guess that $V^\pi(x) = x^T P x + q$.
• Step 2: Plug assumption into Bellman equation and solve for $P$, $q$.

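• Bellman:
$$V^\pi(x) = r(x, \pi(x)) + \gamma\, \mathbb{E}_{x' \sim p(\cdot \mid x, \pi(x))}\left[V^\pi(x')\right].$$
• Plug in $V^\pi(x) = x^T P x + q$:
$$x^T P x + q = x^T (Q + K^T R K) x + \gamma\, \mathbb{E}_{z \sim \mathcal{N}((A + BK)x,\, I)}\left[z^T P z + q\right].$$
• Solve, using $\mathbb{E}[z^T P z] = x^T L^T P L x + \mathrm{Tr}(P)$:
$$\gamma L^T P L - P + Q + K^T R K = 0, \qquad L = A + BK, \qquad q = \frac{\gamma}{1 - \gamma} \mathrm{Tr}(P).$$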
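The equation for $P$ is linear in the entries of $P$, so one way to solve it numerically is to vectorize it with a Kronecker product. The sketch below assumes $\gamma\,\rho(A + BK)^2 < 1$ so that the solution exists; the function name `lqr_policy_value` and the example matrices are hypothetical.

```python
import numpy as np

def lqr_policy_value(A, B, K, Q, R, gamma):
    """Return (P, q) with V(x) = x^T P x + q under the feedback u = K x.

    Solves gamma * L^T P L - P + Q + K^T R K = 0 with L = A + B K.
    """
    L = A + B @ K
    M = Q + K.T @ R @ K          # per-step reward matrix under u = K x
    d = A.shape[0]
    # vec(L^T P L) = (L^T kron L^T) vec(P), so the equation is linear in vec(P).
    lhs = np.eye(d * d) - gamma * np.kron(L.T, L.T)
    P = np.linalg.solve(lhs, M.reshape(-1)).reshape(d, d)
    P = 0.5 * (P + P.T)          # remove round-off asymmetry
    q = gamma / (1.0 - gamma) * np.trace(P)
    return P, q

# Made-up example system.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
K = np.array([[-0.1, -0.2]])
Q, R = np.eye(2), np.eye(1)
gamma = 0.9

P, q = lqr_policy_value(A, B, K, Q, R, gamma)
L = A + B @ K
residual = gamma * L.T @ P @ L - P + Q + K.T @ R @ K
print(np.abs(residual).max(), q)
```

The residual should be at round-off level, and plugging `P`, `q` back into the Bellman equation at any state `x` should give equal left and right sides.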

LEAST SQUARES TEMPORAL DIFFERENCING

LEAST SQUARES TD
• Let’s do the LQR example again, with more generality.
• Assume “linear architecture”:
$$V^\pi(x) = \phi(x)^T w.$$
• For LQR:
$$\phi(x) = \mathrm{svec}\left(x x^T + \frac{\gamma}{1 - \gamma} I\right).$$
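The talk does not spell out `svec`. A common convention, assumed here, stacks the upper triangle of a symmetric matrix and scales the off-diagonal entries by $\sqrt{2}$, so that $\langle \mathrm{svec}(A), \mathrm{svec}(B) \rangle = \mathrm{Tr}(AB)$. Under that assumption the LQR value function $x^T P x + \frac{\gamma}{1-\gamma}\mathrm{Tr}(P)$ is exactly linear in these features with $w = \mathrm{svec}(P)$, as the sketch below checks numerically. The helper names and test values are illustrative.

```python
import numpy as np

def svec(M):
    """Stack the upper triangle of symmetric M, scaling off-diagonals by sqrt(2).

    With this scaling, svec(A) @ svec(B) == trace(A @ B) for symmetric A, B.
    """
    rows, cols = np.triu_indices(M.shape[0])
    scale = np.where(rows == cols, 1.0, np.sqrt(2.0))
    return scale * M[rows, cols]

def features(x, gamma):
    """LQR features phi(x) = svec(x x^T + gamma / (1 - gamma) * I)."""
    d = x.shape[0]
    return svec(np.outer(x, x) + gamma / (1.0 - gamma) * np.eye(d))

# Made-up symmetric P and state x.
gamma = 0.9
x_demo = np.array([1.0, -2.0, 0.5])
P_demo = np.array([[2.0, 0.3, 0.0],
                   [0.3, 1.5, -0.2],
                   [0.0, -0.2, 1.0]])

linear_value = features(x_demo, gamma) @ svec(P_demo)
quadratic_value = x_demo @ P_demo @ x_demo + gamma / (1.0 - gamma) * np.trace(P_demo)
print(linear_value, quadratic_value)
```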

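• Bellman:
$$V^\pi(x) = r(x, \pi(x)) + \gamma\, \mathbb{E}_{x' \sim p(\cdot \mid x, \pi(x))}\left[V^\pi(x')\right].$$
• Plug in linear assumption $V^\pi(x) = \phi(x)^T w$:
$$\begin{aligned}
\phi(x)^T w &= r(x, \pi(x)) + \gamma\, \mathbb{E}_{x' \sim p(\cdot \mid x, \pi(x))}\left[\phi(x')^T w\right] \\
&= r(x, \pi(x)) + \gamma\, \mathbb{E}_{x' \sim p(\cdot \mid x, \pi(x))}\left[\phi(x')\right]^T w \\
\iff r(x, \pi(x)) &= \left\langle \phi(x) - \gamma\, \mathbb{E}_{x' \sim p(\cdot \mid x, \pi(x))}\left[\phi(x')\right],\, w \right\rangle.
\end{aligned}$$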

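• Bellman + linear assumption:
$$r(x, \pi(x)) = \left\langle \phi(x) - \gamma\, \mathbb{E}_{x' \sim p(\cdot \mid x, \pi(x))}\left[\phi(x')\right],\, w \right\rangle.$$
• This suggests solving a system of linear equations:
$$\text{Covariate}: \ \phi(x_i) - \gamma\, \mathbb{E}_{x' \sim p(\cdot \mid x_i, \pi(x_i))}\left[\phi(x')\right], \qquad \text{Target}: \ r(x_i, \pi(x_i)).$$
• We can’t evaluate the expectation though!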

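• A natural idea is to use the transition samples in place of the expectation, and use least squares:
$$\text{Covariate}: \ \phi(x_i) - \gamma\, \phi(x_{i+1}), \qquad \text{Target}: \ r(x_i, \pi(x_i)).$$
• This yields the estimator (with $v^{\otimes 2} = v v^T$):
$$\widehat{w} = \left(\sum_{i=1}^{n} \left(\phi(x_i) - \gamma\, \phi(x_{i+1})\right)^{\otimes 2}\right)^{-1} \sum_{i=1}^{n} \left(\phi(x_i) - \gamma\, \phi(x_{i+1})\right) r_i.$$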

• Previous estimator is not quite what you want (has bias)!
• The fix for this is known as the Least Squares Temporal Differencing (LSTD) estimator:
$$\widehat{w} = \left(\sum_{i=1}^{n} \phi(x_i) \left(\phi(x_i) - \gamma\, \phi(x_{i+1})\right)^T\right)^{-1} \sum_{i=1}^{n} \phi(x_i)\, r_i.$$
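Both estimators are a few lines of linear algebra once the features are stacked row-wise. The sketch below implements the plain least-squares estimator and LSTD side by side and runs them on a scalar closed-loop LQR system, where the true weight has the closed form $m / (1 - \gamma l^2)$. The scalar system, its constants, and the function names are assumptions chosen to keep the example short.

```python
import numpy as np

def naive_least_squares(phi, phi_next, r, gamma):
    """Least squares with covariates phi(x_i) - gamma * phi(x_{i+1})."""
    D = phi - gamma * phi_next
    return np.linalg.solve(D.T @ D, D.T @ r)

def lstd(phi, phi_next, r, gamma):
    """LSTD: solve sum_i phi_i (phi_i - gamma * phi_{i+1})^T w = sum_i phi_i r_i."""
    return np.linalg.solve(phi.T @ (phi - gamma * phi_next), phi.T @ r)

# Scalar closed-loop system x_{k+1} = l * x_k + w_k with reward m * x_k^2.
# (l plays the role of A + B K and m the role of Q + K^T R K.)
rng = np.random.default_rng(0)
gamma, l, m, n = 0.9, 0.8, 1.0, 20000
x = np.zeros(n + 1)
for k in range(n):
    x[k + 1] = l * x[k] + rng.standard_normal()

# Scalar version of the LQR features: phi(x) = x^2 + gamma / (1 - gamma).
phi_all = (x ** 2 + gamma / (1.0 - gamma)).reshape(-1, 1)
phi, phi_next = phi_all[:-1], phi_all[1:]
r = m * x[:-1] ** 2

print("true w:", m / (1.0 - gamma * l ** 2))
print("naive :", naive_least_squares(phi, phi_next, r, gamma))
print("LSTD  :", lstd(phi, phi_next, r, gamma))
```

When the transitions are noise-free the two estimators coincide; on noisy data they generally differ, which is the bias the slide refers to.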

FROM EVALUATION TO OPTIMIZATION

FROM LSTD TO LSPI
• LSPI uses LSTD as an evaluation primitive.
• LSPI does an outer loop of policy iteration, and an inner loop of LSTD on the Q-function
$$Q^\pi(s, a) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r(s_k, a_k) \,\middle|\, s_0 = s,\ a_0 = a\right].$$

FROM LSTD TO LSPI
• From a Q-function, we can construct a new policy as follows:
$$\pi_+(s) = \arg\max_{a \in A} Q^\pi(s, a).$$
• Then, compute $Q^{\pi_+}(s, a)$ (with LSTD) and repeat the process.
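The loop described above can be sketched on a tiny finite MDP. This is a simplified illustration rather than the algorithm as specified in the paper: it assumes one-hot features on (state, action) pairs, a fixed batch of samples covering every pair, and a hypothetical deterministic 4-state chain with actions left (0) and right (1) that pays reward 1 for landing in the rightmost state.

```python
import numpy as np

def lstdq(samples, policy, n_states, n_actions, gamma):
    """LSTD on the Q-function with one-hot (state, action) features."""
    dim = n_states * n_actions
    A_hat = np.zeros((dim, dim))
    b_hat = np.zeros(dim)
    for s, a, r, s_next in samples:
        i = s * n_actions + a                     # index of phi(s, a)
        j = s_next * n_actions + policy[s_next]   # index of phi(s', pi(s'))
        A_hat[i, i] += 1.0                        # phi phi^T term
        A_hat[i, j] -= gamma                      # -gamma * phi phi'^T term
        b_hat[i] += r
    return np.linalg.solve(A_hat, b_hat).reshape(n_states, n_actions)

def lspi(samples, n_states, n_actions, gamma, max_iters=20):
    """Outer loop: evaluate Q^pi with LSTD, then act greedily, until stable."""
    policy = np.zeros(n_states, dtype=int)
    for _ in range(max_iters):
        Q = lstdq(samples, policy, n_states, n_actions, gamma)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, Q

def chain_samples(n_states):
    """One (s, a, r, s') sample per state-action pair of the hypothetical chain."""
    samples = []
    for s in range(n_states):
        for a in (0, 1):
            s_next = max(s - 1, 0) if a == 0 else min(s + 1, n_states - 1)
            r = 1.0 if s_next == n_states - 1 else 0.0
            samples.append((s, a, r, s_next))
    return samples

policy, Q = lspi(chain_samples(4), n_states=4, n_actions=2, gamma=0.9)
print(policy)
print(Q)
```

The same sample batch is reused in every iteration; only the policy used to pick the next action inside `lstdq` changes between iterations.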

QUESTIONS?

• Dynamic Programming and Optimal Control - D. P. Bertsekas.
• Linear Least-Squares Algorithms for Temporal Difference Learning - S. J. Bradtke and A. G. Barto.
• Reinforcement Learning: An Introduction - R. S. Sutton and A. G. Barto.
• David Silver’s RL Course (videos online).

BACKUP SLIDES

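• Least squares estimator has an issue:
$$r_i = \left\langle \phi(x_i) - \gamma\, \phi(x_{i+1}),\, w \right\rangle + \underbrace{\gamma \left\langle \phi(x_{i+1}) - \mathbb{E}_{x' \sim p(\cdot \mid x_i, \pi(x_i))}\left[\phi(x')\right],\, w \right\rangle}_{\text{noise}}.$$
• The “noise” is not uncorrelated with the observation (both depend on $x_{i+1}$).
• LSTD “fixes” this issue.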

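INTUITION
• Introduce 3 matrices:
$$\Phi = (\phi(x_1), \dots, \phi(x_n))^T, \qquad \widetilde{\Phi} = (\phi(x_2), \dots, \phi(x_{n+1}))^T,$$
$$\Psi = \left(\mathbb{E}_{x' \sim p(\cdot \mid x_1, \pi(x_1))}[\phi(x')], \dots, \mathbb{E}_{x' \sim p(\cdot \mid x_n, \pi(x_n))}[\phi(x')]\right)^T.$$
• In this notation, LSTD is (with $R = (r_1, \dots, r_n)^T$):
$$\widehat{w} = \left(\Phi^T(\Phi - \gamma \widetilde{\Phi})\right)^{-1} \Phi^T R.$$
• Bellman equation is:
$$R = (\Phi - \gamma \Psi) w.$$

INTUITION
• Combine LSTD estimator and Bellman:
$$\begin{aligned}
\widehat{w} - w &= \left(\Phi^T(\Phi - \gamma \widetilde{\Phi})\right)^{-1} \Phi^T (\Phi - \gamma \Psi) w - w \\
&= \left(\Phi^T(\Phi - \gamma \widetilde{\Phi})\right)^{-1} \Phi^T (\Phi - \gamma \widetilde{\Phi} + \gamma \widetilde{\Phi} - \gamma \Psi) w - w \\
&= \gamma \left(\Phi^T(\Phi - \gamma \widetilde{\Phi})\right)^{-1} \Phi^T (\widetilde{\Phi} - \Psi) w.
\end{aligned}$$
• The term $\Phi^T(\widetilde{\Phi} - \Psi) w$ is zero-mean.
• Show concentration of $\sigma_{\min}\left(\Phi^T(\Phi - \gamma \widetilde{\Phi})\right)$ and $\left\| \Phi^T(\widetilde{\Phi} - \Psi) w \right\|$.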
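The error identity for LSTD is purely algebraic, so it can be checked numerically on any example where the expected next feature is computable. The sketch below uses a made-up 3-state Markov chain (the fixed policy is folded into the transition matrix `P_pi`) with one-hot features, so the expected next feature for state $s$ is simply row $s$ of `P_pi` and the true weights are the true values. All names and numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
P_pi = np.array([[0.5, 0.3, 0.2],
                 [0.2, 0.6, 0.2],
                 [0.3, 0.3, 0.4]])   # transition matrix under the fixed policy
r = np.array([1.0, 0.0, 2.0])        # reward per state under the fixed policy
w = np.linalg.solve(np.eye(3) - gamma * P_pi, r)   # true weights (= true values)

# Sample one trajectory s_1, ..., s_{n+1}.
n = 500
s = np.empty(n + 1, dtype=int)
s[0] = 0
for i in range(n):
    s[i + 1] = rng.choice(3, p=P_pi[s[i]])

Phi = np.eye(3)[s[:-1]]          # rows phi(x_i)
Phi_tilde = np.eye(3)[s[1:]]     # rows phi(x_{i+1})
Psi = P_pi[s[:-1]]               # rows E[phi(x') | x_i]
R = (Phi - gamma * Psi) @ w      # Bellman equation; equals the observed rewards

G = Phi.T @ (Phi - gamma * Phi_tilde)
w_hat = np.linalg.solve(G, Phi.T @ R)                            # LSTD estimate
error_formula = gamma * np.linalg.solve(G, Phi.T @ (Phi_tilde - Psi) @ w)

print(w_hat - w)
print(error_formula)
```

The two printed vectors should agree to round-off, while neither is exactly zero for a finite sample.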
