Papers_We_Love
August 31, 2017

# Stephen Tu on "Least Squares Policy Iteration"

Policy iteration is a classic dynamic programming algorithm for solving a Markov Decision Process (MDP). The algorithm alternates between two steps: 1) a policy evaluation step, which, given the current policy, computes the state-action value function (commonly known as the Q-function) of that policy, and 2) a policy improvement step, which uses the Q-function to greedily improve the current policy. When the number of states and actions of the MDP is finite and small, policy iteration performs well and comes with nice theoretical guarantees. However, when the state and action spaces are large (possibly continuous), policy iteration becomes intractable, and approximate methods for solving MDPs must be used.
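The two steps above can be sketched in a few lines of numpy. This is a minimal tabular version (not from the talk): the MDP is represented by a transition tensor `P` and reward table `r`, evaluation solves the Bellman linear system exactly, and improvement acts greedily.

```python
import numpy as np

def policy_iteration(P, r, gamma, n_iters=50):
    """Tabular policy iteration.

    P: (A, S, S) transition probabilities, r: (S, A) rewards.
    Returns the final policy (length-S array of actions) and its Q-function.
    """
    A, S, _ = P.shape
    pi = np.zeros(S, dtype=int)  # arbitrary initial policy
    for _ in range(n_iters):
        # Policy evaluation: solve V = r_pi + gamma * P_pi V exactly.
        P_pi = P[pi, np.arange(S), :]          # (S, S) transitions under pi
        r_pi = r[np.arange(S), pi]             # (S,) rewards under pi
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Q(s, a) = r(s, a) + gamma * E_{s'}[V(s')]
        Q = r + gamma * np.einsum('ast,t->sa', P, V)
        # Policy improvement: act greedily with respect to Q.
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):
            break
        pi = new_pi
    return pi, Q
```

On a toy two-state MDP (action 0 stays put, action 1 switches states, and only state 1 yields reward), this converges in two iterations to the policy "move to state 1, then stay".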

Least Squares Policy Iteration (LSPI) is one method for approximately solving an MDP. The key idea here is to approximate the Q-function as a linear functional in a lifted, higher dimensional space, analogous to the idea of feature maps in supervised learning. Plugging this approximation into the Bellman equation gives a tractable linear system of equations to solve for the policy evaluation step. Furthermore, the policy improvement step remains the same as before.

This talk describes LSPI and some of its subtleties. One subtlety arises due to the fact that the Bellman operator is not necessarily invariant on our approximate function class, and hence an extra projection step is typically used to minimize the Bellman residual after projecting back on the function space. Furthermore, in order to build intuition for LSPI, I will also talk about what the LSPI algorithm does in the context of a well studied continuous optimal control problem known as the Linear Quadratic Regulator (LQR).


## Transcript

1. ### LEAST SQUARES POLICY ITERATION (LSPI)

M. G. Lagoudakis and R. Parr. Presented by Stephen Tu, PWL 8/31.

4. ### RL PRIMER

• Formalized via a Markov Decision Process (MDP).
• An MDP is a 5-tuple (S, A, p, γ, r):
  • S is the state space (e.g. position on a grid),
  • A is the action space (e.g. left and right),
  • p : S × A → Δ(S) is the transition function,
  • γ is the discount factor,
  • r : S × A → ℝ is the reward function.
• The transition and reward functions are unknown to the algorithm!
5. ### RL PRIMER

• The goal of RL is to find a policy π : S → A that optimizes (over all policies)

$$V^\pi = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r(s_k, a_k)\right], \qquad \underbrace{s_{k+1} \sim p(\cdot \mid s_k, \pi(s_k))}_{\text{“dynamics”}}.$$

• In this talk, we focus primarily on the easier problem of scoring a particular (fixed) policy.
6. ### RL PRIMER

• For a policy π, define the value function as

$$V^\pi(s) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r(s_k, a_k) \,\middle|\, s_0 = s\right].$$

• How do we evaluate this function?
7. ### RL PRIMER

• Fundamental equation of RL (the Bellman equation):

$$V^\pi(s) = r(s, \pi(s)) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s, \pi(s))}\left[V^\pi(s')\right].$$
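The Bellman equation can be checked numerically: solving it as a linear system gives the same answer as directly estimating the discounted-return series by Monte Carlo rollouts. A small sketch with a hypothetical two-state chain under a fixed policy (the transition matrix and rewards below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# A tiny 2-state MDP under a fixed policy: P_pi[s, s'] and r_pi[s]
# (hypothetical numbers, just for illustration).
P_pi = np.array([[0.7, 0.3],
                 [0.2, 0.8]])
r_pi = np.array([0.0, 1.0])

# Exact V^pi from the Bellman equation: V = r_pi + gamma * P_pi V.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Monte Carlo estimate of V^pi(0) from truncated rollouts.
total = 0.0
n_rollouts, horizon = 20000, 100
for _ in range(n_rollouts):
    s, discount = 0, 1.0
    for _ in range(horizon):
        total += discount * r_pi[s]
        discount *= gamma
        s = rng.choice(2, p=P_pi[s])
total /= n_rollouts
# The linear-system solution and the rollout average should agree closely.
```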

9. ### LINEAR QUADRATIC REGULATOR

• (Discrete-time) linear, time-invariant system:

$$x_{k+1} = A x_k + B u_k + w_k, \qquad w_k \sim \mathcal{N}(0, I).$$

• Quadratic reward function:

$$r(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k, \qquad Q \succeq 0,\; R \succ 0.$$
10. ### LINEAR QUADRATIC REGULATOR

• Let's derive $V^\pi$ for LQR, for the feedback policy $\pi(x) = Kx$.
• Step 1: Guess that $V^\pi(x) = x^T P x + q$.
• Step 2: Plug this assumption into the Bellman equation and solve for $P$, $q$.
11. ### • Bellman:

$$V^\pi(x) = r(x, \pi(x)) + \gamma\, \mathbb{E}_{x' \sim p(\cdot \mid x, \pi(x))}\left[V^\pi(x')\right].$$

• Plug in $V^\pi(x) = x^T P x + q$:

$$x^T P x + q = x^T (Q + K^T R K)\, x + \gamma\, \mathbb{E}_{z \sim \mathcal{N}((A+BK)x,\, I)}\left[z^T P z + q\right].$$

• Solve:

$$\gamma L^T P L - P + Q + K^T R K = 0, \qquad L = A + BK, \qquad q = \frac{\gamma}{1-\gamma}\,\mathrm{Tr}(P).$$
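The solve step above is a fixed-point (Lyapunov-type) equation in P, which can be computed by simple iteration when the closed loop L = A + BK is stable. A sketch with hypothetical system matrices (chosen small so the iteration converges), verifying the Bellman equation via the Gaussian identity E[zᵀPz] = μᵀPμ + Tr(P) for z ∼ N(μ, I):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, gamma = 3, 2, 0.9

# Hypothetical system: A is stable and K is small, so L = A + BK is
# stable enough for the fixed-point iteration below to converge.
A = 0.5 * np.eye(n)
B = rng.standard_normal((n, m))
K = 0.01 * rng.standard_normal((m, n))
Q, R = np.eye(n), np.eye(m)
L = A + B @ K

# Iterate P <- Q + K^T R K + gamma * L^T P L to its fixed point.
P = np.zeros((n, n))
for _ in range(500):
    P = Q + K.T @ R @ K + gamma * L.T @ P @ L
q = gamma / (1 - gamma) * np.trace(P)

# Check the Bellman equation at a random state, using
# E_{z ~ N(Lx, I)}[z^T P z + q] = (Lx)^T P (Lx) + Tr(P) + q.
x = rng.standard_normal(n)
lhs = x @ P @ x + q
rhs = x @ Q @ x + (K @ x) @ R @ (K @ x) \
      + gamma * ((L @ x) @ P @ (L @ x) + np.trace(P) + q)
```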

13. ### LEAST SQUARES TD

• Let's do the LQR example again, with more generality.
• Assume a "linear architecture": $V^\pi(x) = \phi(x)^T w$.
• For LQR:

$$\phi(x) = \mathrm{svec}\left(xx^T + \frac{\gamma}{1-\gamma}\, I\right).$$
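A quick numerical check of this feature map, using the standard svec convention that scales off-diagonal entries by √2 so that ⟨svec(A), svec(B)⟩ = Tr(AB) for symmetric A, B (a sketch; the specific matrices are made up):

```python
import numpy as np

def svec(S):
    """Vectorize a symmetric matrix so that <svec(A), svec(B)> = Tr(A B):
    diagonal entries kept as-is, upper-triangle off-diagonals scaled by sqrt(2)."""
    iu = np.triu_indices(S.shape[0])
    scale = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))
    return scale * S[iu]

rng = np.random.default_rng(0)
n, gamma = 3, 0.9

M = rng.standard_normal((n, n))
P = M @ M.T                              # a symmetric PSD "value" matrix
q = gamma / (1 - gamma) * np.trace(P)    # the constant offset from slide 11
x = rng.standard_normal(n)

phi = svec(np.outer(x, x) + gamma / (1 - gamma) * np.eye(n))
w = svec(P)
# phi^T w = Tr((xx^T + gamma/(1-gamma) I) P) = x^T P x + q,
# i.e. the quadratic value function is linear in the lifted features.
```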
14. ### • Bellman:

$$V^\pi(x) = r(x, \pi(x)) + \gamma\, \mathbb{E}_{x' \sim p(\cdot \mid x, \pi(x))}\left[V^\pi(x')\right].$$

• Plug in the linear assumption $V^\pi(x) = \phi(x)^T w$:

$$\begin{aligned}
\phi(x)^T w &= r(x, \pi(x)) + \gamma\, \mathbb{E}_{x' \sim p(\cdot \mid x, \pi(x))}\left[\phi(x')^T w\right] \\
&= r(x, \pi(x)) + \gamma\, \mathbb{E}_{x' \sim p(\cdot \mid x, \pi(x))}\left[\phi(x')\right]^T w \\
\Longleftrightarrow\;\; r(x, \pi(x)) &= \left\langle \phi(x) - \gamma\, \mathbb{E}_{x' \sim p(\cdot \mid x, \pi(x))}\left[\phi(x')\right],\; w \right\rangle.
\end{aligned}$$
15. ### • Bellman + linear assumption:

$$r(x, \pi(x)) = \left\langle \phi(x) - \gamma\, \mathbb{E}_{x' \sim p(\cdot \mid x, \pi(x))}\left[\phi(x')\right],\; w \right\rangle.$$

• This suggests solving a system of linear equations:

$$\text{Covariate:}\;\; \phi(x_i) - \gamma\, \mathbb{E}_{x' \sim p(\cdot \mid x_i, \pi(x_i))}\left[\phi(x')\right], \qquad \text{Target:}\;\; r(x_i, \pi(x_i)).$$

• We can't evaluate the expectation, though!
16. ### • A natural idea is to use the transition samples in place of the expectation, and use least squares:

$$\text{Covariate:}\;\; \phi(x_i) - \gamma\, \phi(x_{i+1}), \qquad \text{Target:}\;\; r(x_i, \pi(x_i)).$$

• This yields the estimator:

$$\hat{w} = \left( \sum_{i=1}^{n} \left( \phi(x_i) - \gamma\, \phi(x_{i+1}) \right)^{\otimes 2} \right)^{-1} \sum_{i=1}^{n} \left( \phi(x_i) - \gamma\, \phi(x_{i+1}) \right) r_i.$$
17. ### • The previous estimator is not quite what you want (it has bias)!

• The fix for this is known as the Least Squares Temporal Difference (LSTD) estimator:

$$\hat{w} = \left( \sum_{i=1}^{n} \phi(x_i) \left( \phi(x_i) - \gamma\, \phi(x_{i+1}) \right)^T \right)^{-1} \sum_{i=1}^{n} \phi(x_i)\, r_i.$$
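The LSTD estimator is a few lines of numpy. A minimal sketch, checked on a toy deterministic 3-state cycle with one-hot features (my setup, not from the talk), where the sampled next state equals its expectation and LSTD therefore recovers the value function exactly:

```python
import numpy as np

def lstd(phis, next_phis, rewards, gamma):
    """Solve (sum_i phi_i (phi_i - gamma * phi'_i)^T) w = sum_i phi_i r_i."""
    A = sum(np.outer(p, p - gamma * pn) for p, pn in zip(phis, next_phis))
    b = sum(p * r for p, r in zip(phis, rewards))
    return np.linalg.solve(A, b)

# Deterministic 3-state cycle 0 -> 1 -> 2 -> 0 with one-hot features.
gamma = 0.9
r = np.array([1.0, 0.0, 0.0])        # reward depends only on the state
states = [i % 3 for i in range(10)]  # a trajectory around the cycle
phis = [np.eye(3)[s] for s in states[:-1]]
next_phis = [np.eye(3)[s] for s in states[1:]]
rewards = [r[s] for s in states[:-1]]

w = lstd(phis, next_phis, rewards, gamma)

# Ground truth: V = (I - gamma * P)^{-1} r for the cycle's transition matrix.
P = np.roll(np.eye(3), 1, axis=1)    # P[s, s'] = 1 iff s' = s + 1 mod 3
V = np.linalg.solve(np.eye(3) - gamma * P, r)
```

With one-hot features the weight vector is the value function itself, so `w` matches `V` to machine precision here.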

19. ### FROM LSTD TO LSPI

• LSPI uses LSTD as an evaluation primitive.
• LSPI does an outer loop of policy iteration, and an inner loop of LSTD on the Q-function:

$$Q^\pi(s, a) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r(s_k, a_k) \,\middle|\, s_0 = s,\; a_0 = a\right].$$
20. ### FROM LSTD TO LSPI

• From a Q-function, construct a new policy as follows:

$$\pi^+(s) = \arg\max_{a \in A} Q^\pi(s, a).$$

• Then, compute $Q^{\pi^+}(s, a)$ (with LSTD) and repeat the process.
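Putting the two loops together gives a compact sketch of LSPI. This toy version (my own, with hypothetical helper names) uses one-hot features over state-action pairs on a tiny deterministic MDP, so the inner LSTD-Q solve is exact and the outer loop converges to the optimal policy:

```python
import numpy as np

def lstdq(samples, phi, pi, gamma, d):
    """LSTD on the Q-function: samples are (s, a, r, s') tuples; the
    next action is chosen by the current policy pi."""
    A = np.zeros((d, d))
    b = np.zeros(d)
    for s, a, r, s2 in samples:
        f, f2 = phi(s, a), phi(s2, pi(s2))
        A += np.outer(f, f - gamma * f2)
        b += f * r
    return np.linalg.solve(A, b)

def lspi(samples, phi, n_states, n_actions, gamma, n_iters=20):
    d = n_states * n_actions
    w = np.zeros(d)
    for _ in range(n_iters):
        # Greedy policy with respect to the current Q-function estimate.
        pi = lambda s, w=w: int(np.argmax([phi(s, a) @ w for a in range(n_actions)]))
        w = lstdq(samples, phi, pi, gamma, d)
    return w

# Toy deterministic 2-state MDP: action 1 switches states, action 0 stays;
# only state 1 gives reward. One-hot features over (s, a).
n_states, n_actions, gamma = 2, 2, 0.9
phi = lambda s, a: np.eye(n_states * n_actions)[s * n_actions + a]
trans = lambda s, a: s if a == 0 else 1 - s
rew = lambda s: float(s == 1)
samples = [(s, a, rew(s), trans(s, a)) for s in range(2) for a in range(2)]

w = lspi(samples, phi, n_states, n_actions, gamma)
greedy = [int(np.argmax([phi(s, a) @ w for a in range(2)])) for s in range(2)]
```

The greedy policy extracted from `w` is "switch from state 0, stay in state 1", matching exact policy iteration on the same MDP.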

22. ### • Dynamic Programming and Optimal Control - D. P. Bertsekas.

• Linear Least-Squares Algorithms for Temporal Difference Learning - S. J. Bradtke and A. G. Barto.
• Reinforcement Learning: An Introduction - R. S. Sutton and A. G. Barto.
• David Silver's RL Course (videos online).

24. ### • The least squares estimator has an issue: the "noise" is not uncorrelated with the observation.

$$r_i = \left\langle \phi(x_i) - \gamma\, \phi(x_{i+1}),\; w \right\rangle + \underbrace{\gamma \left\langle \phi(x_{i+1}) - \mathbb{E}_{x' \sim p(\cdot \mid x_i, \pi(x_i))}\left[\phi(x')\right],\; w \right\rangle}_{\text{noise}}.$$

• LSTD "fixes" this issue.
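The bias is easy to see empirically. A toy scalar example (not from the talk): a noisy linear system x_{i+1} = a·x_i + w_i with reward x_i² and features φ(x) = (x², 1), where the Bellman equation gives the true weights in closed form. The naive least squares estimator is pulled toward zero by the noise in the sampled feature differences, while LSTD is consistent:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, a, n = 0.9, 0.9, 100_000

# Scalar system x_{i+1} = a x_i + w_i, reward r_i = x_i^2,
# features phi(x) = (x^2, 1).
x = np.zeros(n + 1)
for i in range(n):
    x[i + 1] = a * x[i] + rng.standard_normal()
phi = np.stack([x**2, np.ones(n + 1)], axis=1)
rewards = x[:-1] ** 2
diff = phi[:-1] - gamma * phi[1:]   # phi(x_i) - gamma * phi(x_{i+1})

# Naive least squares: regress rewards on the sampled feature differences
# (the regressor is correlated with the noise, hence biased).
w_naive = np.linalg.lstsq(diff, rewards, rcond=None)[0]
# LSTD: use phi(x_i) on the left instead of the noisy difference.
w_lstd = np.linalg.solve(phi[:-1].T @ diff, phi[:-1].T @ rewards)

# True weights from the Bellman equation: V(x) = c1 x^2 + c0 with
# c1 = 1/(1 - gamma a^2) and c0 = gamma c1 / (1 - gamma).
c1 = 1 / (1 - gamma * a**2)
c0 = gamma * c1 / (1 - gamma)
w_true = np.array([c1, c0])
```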
25. ### INTUITION

• Introduce 3 matrices:

$$\Phi = \left(\phi(x_1), \ldots, \phi(x_n)\right)^T, \quad \widetilde{\Phi} = \left(\phi(x_2), \ldots, \phi(x_{n+1})\right)^T, \quad \Psi = \left(\mathbb{E}_{x' \sim p(\cdot \mid x_1, \pi(x_1))}[\phi(x')], \ldots, \mathbb{E}_{x' \sim p(\cdot \mid x_n, \pi(x_n))}[\phi(x')]\right)^T.$$

• In this notation, LSTD is:

$$\hat{w} = \left(\Phi^T (\Phi - \gamma \widetilde{\Phi})\right)^{-1} \Phi^T R.$$

• The Bellman equation is:

$$R = (\Phi - \gamma \Psi)\, w.$$
26. ### INTUITION

• Combine the LSTD estimator and the Bellman equation:

$$\begin{aligned}
\hat{w} - w &= \left(\Phi^T(\Phi - \gamma\widetilde{\Phi})\right)^{-1} \Phi^T (\Phi - \gamma\Psi)\, w - w \\
&= \left(\Phi^T(\Phi - \gamma\widetilde{\Phi})\right)^{-1} \Phi^T \left(\Phi - \gamma\widetilde{\Phi} + \gamma\widetilde{\Phi} - \gamma\Psi\right) w - w \\
&= \gamma \left(\Phi^T(\Phi - \gamma\widetilde{\Phi})\right)^{-1} \Phi^T (\widetilde{\Phi} - \Psi)\, w.
\end{aligned}$$

• The term $\Phi^T(\widetilde{\Phi} - \Psi)\, w$ is zero-mean.
• Show concentration of $\sigma_{\min}\left(\Phi^T(\Phi - \gamma\widetilde{\Phi})\right)$ and $\left\|\Phi^T(\widetilde{\Phi} - \Psi)\, w\right\|$.