Slide 1

Slide 1 text

LEAST SQUARES POLICY ITERATION (LSPI)
M. G. Lagoudakis and R. Parr
Presented by Stephen Tu
PWL 8/31

Slide 2

Slide 2 text

REINFORCEMENT LEARNING IN < 5 MINUTES

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

RL PRIMER

Slide 5

Slide 5 text

RL PRIMER
• Formalized via a Markov Decision Process (MDP).
• An MDP is a 5-tuple (S, A, p, γ, r):
  • S is the state space (e.g. position on a grid),
  • A is the action space (e.g. left and right),
  • p : S × A → Δ(S) is the transition function,
  • γ is the discount factor,
  • r : S × A → ℝ is the reward function.
• The transition function p and the reward function r are unknown to the algorithm!
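A minimal sketch of this 5-tuple as code (not from the slides; the tabular representation and names below are my own illustration):

import numpy as np

class MDP:
    # Tabular MDP: states and actions are indices 0..nS-1 and 0..nA-1.
    def __init__(self, P, R, gamma):
        self.P = P          # P[s, a] is a probability vector over next states, i.e. p : S × A → Δ(S)
        self.R = R          # R[s, a] is the reward r(s, a)
        self.gamma = gamma  # discount factor γ

    def step(self, s, a, rng):
        # Sample one transition; observed samples like this are all an RL
        # algorithm gets to see, since p and r themselves are unknown to it.
        s_next = rng.choice(self.P.shape[-1], p=self.P[s, a])
        return self.R[s, a], s_next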

Slide 6

Slide 6 text

RL PRIMER
• The goal of RL is to find a policy π : S → A that optimizes (over all policies)
  V^π = E[ ∑_{k=0}^∞ γ^k r(s_k, a_k) ],   s_{k+1} ∼ p(·|s_k, π(s_k))   ("the dynamics").
• In this talk, we focus primarily on the easier problem of scoring a particular (fixed) policy.

Slide 7

Slide 7 text

RL PRIMER
• For a policy π, define the value function as
  V^π(s) = E[ ∑_{k=0}^∞ γ^k r(s_k, a_k) | s_0 = s ].
• How do we evaluate this function?
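One brute-force answer (my own sketch, not on the slide): roll the policy out many times and average the truncated discounted returns, reusing the tabular MDP container above with the policy given as an array of action indices.

def mc_value(mdp, policy, s, horizon=200, n_rollouts=1000, seed=0):
    # Monte Carlo estimate of V^π(s): average truncated discounted returns under π.
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_rollouts):
        state, ret, disc = s, 0.0, 1.0
        for _ in range(horizon):
            r, state = mdp.step(state, policy[state], rng)
            ret += disc * r
            disc *= mdp.gamma
        total += ret
    return total / n_rollouts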

Slide 8

Slide 8 text

RL PRIMER
• Fundamental equation of RL (Bellman's equation):
  V^π(s) = r(s, π(s)) + γ E_{s'∼p(·|s,π(s))}[ V^π(s') ].
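When p and r are known and S is finite, Bellman's equation is just a linear system V^π = r^π + γ P^π V^π that can be solved directly; RL is interesting precisely because p is unknown. A small sketch (mine, not from the slides), again using the tabular container above:

def evaluate_policy_exact(mdp, policy):
    # Solve (I - γ P_π) V = r_π for a fully known tabular MDP.
    nS = mdp.P.shape[0]
    P_pi = np.array([mdp.P[s, policy[s]] for s in range(nS)])  # row s is p(·|s, π(s))
    r_pi = np.array([mdp.R[s, policy[s]] for s in range(nS)])
    return np.linalg.solve(np.eye(nS) - mdp.gamma * P_pi, r_pi)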

Slide 9

Slide 9 text

BELLMAN’S EQUATION IN ACTION

Slide 10

Slide 10 text

LINEAR QUADRATIC REGULATOR
• (Discrete-time) linear, time-invariant system:
  x_{k+1} = A x_k + B u_k + w_k,   w_k ∼ N(0, I).
• Quadratic reward function:
  r(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k,   Q ⪰ 0,   R ≻ 0.
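For concreteness, a sketch of simulating this system under a fixed linear feedback u_k = K x_k (my own helper, not from the slides; it just records the states and rewards used in the later examples):

def rollout_lqr(A, B, K, Q, R, x0, n_steps, seed=0):
    # Simulate x_{k+1} = A x_k + B u_k + w_k with u_k = K x_k, w_k ~ N(0, I).
    rng = np.random.default_rng(seed)
    x, xs, rewards = x0, [x0], []
    for _ in range(n_steps):
        u = K @ x
        rewards.append(x @ Q @ x + u @ R @ u)   # r(x_k, u_k) as defined on the slide
        x = A @ x + B @ u + rng.standard_normal(x.shape[0])
        xs.append(x)
    return np.array(xs), np.array(rewards)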

Slide 11

Slide 11 text

LINEAR QUADRATIC REGULATOR
• Let's derive V^π for LQR, for the feedback policy π(x) = Kx.
• Step 1: Guess that V^π(x) = x^T P x + q.
• Step 2: Plug this guess into the Bellman equation and solve for P, q.

Slide 12

Slide 12 text

• Bellman:
  V^π(x) = r(x, π(x)) + γ E_{x'∼p(·|x,π(x))}[ V^π(x') ].
• Plug in V^π(x) = x^T P x + q:
  x^T P x + q = x^T (Q + K^T R K) x + γ E_{z∼N((A+BK)x, I)}[ z^T P z + q ].
• Solve:
  γ L^T P L − P + Q + K^T R K = 0,   L = A + BK,   q = (γ / (1 − γ)) Tr(P).
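A sketch of carrying out Step 2 numerically (my own, and it assumes the discounted setting as reconstructed above): iterate the fixed-point map for P and then read off q.

def lqr_policy_evaluation(A, B, K, Q, R, gamma, n_iters=1000):
    # Iterate P <- Q + K^T R K + γ L^T P L with L = A + B K;
    # converges when γ * ρ(L)^2 < 1.
    L = A + B @ K
    P = np.zeros_like(Q)
    for _ in range(n_iters):
        P = Q + K.T @ R @ K + gamma * L.T @ P @ L
    q = gamma / (1.0 - gamma) * np.trace(P)
    return P, q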

Slide 13

Slide 13 text

LEAST SQUARES TEMPORAL DIFFERENCING

Slide 14

Slide 14 text

LEAST SQUARES TD
• Let's do the LQR example again, with more generality.
• Assume a "linear architecture":  V^π(x) = φ(x)^T w.
• For LQR:  φ(x) = svec( x x^T + (γ / (1 − γ)) I ).
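A sketch of this feature map (my own; here svec is taken to be the symmetric vectorization that scales off-diagonal entries by √2, so that ⟨svec(M), svec(P)⟩ = Tr(MP) and hence φ(x)^T svec(P) = x^T P x + (γ/(1−γ)) Tr(P), recovering the value function derived above):

def svec(M):
    # Stack the upper triangle of a symmetric matrix, scaling off-diagonals by sqrt(2).
    i, j = np.triu_indices(M.shape[0])
    return np.where(i == j, 1.0, np.sqrt(2.0)) * M[i, j]

def phi_lqr(x, gamma):
    # φ(x) = svec(x x^T + γ/(1-γ) I), so w = svec(P) gives V^π(x) = φ(x)^T w.
    return svec(np.outer(x, x) + gamma / (1.0 - gamma) * np.eye(len(x)))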

Slide 15

Slide 15 text

• Bellman:
  V^π(x) = r(x, π(x)) + γ E_{x'∼p(·|x,π(x))}[ V^π(x') ].
• Plug in the linear assumption V^π(x) = φ(x)^T w:
  φ(x)^T w = r(x, π(x)) + γ E_{x'∼p(·|x,π(x))}[ φ(x')^T w ]
           = r(x, π(x)) + γ E_{x'∼p(·|x,π(x))}[ φ(x') ]^T w
  ⟺  r(x, π(x)) = ⟨ φ(x) − γ E_{x'∼p(·|x,π(x))}[ φ(x') ], w ⟩.

Slide 16

Slide 16 text

• Bellman + linear assumption:
  r(x, π(x)) = ⟨ φ(x) − γ E_{x'∼p(·|x,π(x))}[ φ(x') ], w ⟩.
• This suggests solving a system of linear equations with
  Covariate:  φ(x_i) − γ E_{x'∼p(·|x_i,π(x_i))}[ φ(x') ],    Target:  r(x_i, π(x_i)).
• We can't evaluate the expectation, though!

Slide 17

Slide 17 text

• A natural idea is to use the transition samples in place of the expectation, and use least squares:
  Covariate:  φ(x_i) − γ φ(x_{i+1}),    Target:  r(x_i, π(x_i)).
• This yields the estimator (here v^{⊗2} := v v^T):
  ŵ = ( ∑_{i=1}^n (φ(x_i) − γ φ(x_{i+1}))^{⊗2} )^{-1} ∑_{i=1}^n (φ(x_i) − γ φ(x_{i+1})) r_i.
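In code this plug-in least-squares estimator is a few lines (a sketch of mine; phis stacks φ(x_1), ..., φ(x_{n+1}) along a trajectory, e.g. built with phi_lqr above, and rewards holds r_1, ..., r_n):

def naive_ls_weights(phis, rewards, gamma):
    # Covariates c_i = φ(x_i) - γ φ(x_{i+1}); solve (Σ c_i c_i^T) w = Σ c_i r_i.
    C = phis[:-1] - gamma * phis[1:]
    return np.linalg.solve(C.T @ C, C.T @ rewards)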

Slide 18

Slide 18 text

• The previous estimator is not quite what you want (it has bias)!
• The fix for this is known as the Least Squares Temporal Differencing (LSTD) estimator:
  ŵ = ( ∑_{i=1}^n φ(x_i) (φ(x_i) − γ φ(x_{i+1}))^T )^{-1} ∑_{i=1}^n φ(x_i) r_i.
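The LSTD estimator differs only in which matrix multiplies on the left (same sketch conventions as above):

def lstd_weights(phis, rewards, gamma):
    # Solve (Σ φ(x_i)(φ(x_i) - γ φ(x_{i+1}))^T) w = Σ φ(x_i) r_i.
    Phi = phis[:-1]
    return np.linalg.solve(Phi.T @ (Phi - gamma * phis[1:]), Phi.T @ rewards)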

Slide 19

Slide 19 text

FROM EVALUATION TO OPTIMIZATION

Slide 20

Slide 20 text

FROM LSTD TO LSPI
• LSPI uses LSTD as an evaluation primitive.
• LSPI does an outer loop of policy iteration, and an inner loop of LSTD on the Q-function
  Q^π(s, a) = E[ ∑_{k=0}^∞ γ^k r(s_k, a_k) | s_0 = s, a_0 = a ].
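A sketch of that inner evaluation step applied to the Q-function, often called LSTD-Q (my paraphrase, not code from the paper): phi_sa is an assumed feature map over state-action pairs, and data is a batch of observed tuples (s, a, r, s').

def lstdq_weights(data, policy, phi_sa, gamma):
    # LSTD for Q^π with linear features: pair φ(s_i, a_i) with φ(s'_i, π(s'_i)).
    Phi = np.array([phi_sa(s, a) for (s, a, r, s_next) in data])
    Phi_next = np.array([phi_sa(s_next, policy(s_next)) for (s, a, r, s_next) in data])
    rewards = np.array([r for (s, a, r, s_next) in data])
    return np.linalg.solve(Phi.T @ (Phi - gamma * Phi_next), Phi.T @ rewards)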

Slide 21

Slide 21 text

FROM LSTD TO LSPI
• From a Q-function, we can construct a new (greedy) policy as follows:
  π⁺(s) = argmax_{a∈A} Q^π(s, a).
• Then, compute Q^{π⁺}(s, a) (with LSTD) and repeat the process.
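Putting the two steps together, a sketch of the LSPI outer loop (my own simplification; it reuses the hypothetical lstdq_weights and phi_sa above and assumes a finite action set actions):

def lspi(data, phi_sa, actions, gamma, n_iters=20):
    # Alternate greedy improvement π⁺(s) = argmax_a φ(s,a)^T w with LSTD-Q evaluation,
    # reusing the same batch of transition data at every iteration.
    w = np.zeros(len(phi_sa(data[0][0], data[0][1])))
    for _ in range(n_iters):
        policy = lambda s, w=w: max(actions, key=lambda a: phi_sa(s, a) @ w)  # bind current w
        w = lstdq_weights(data, policy, phi_sa, gamma)
    return w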

Slide 22

Slide 22 text

QUESTIONS?

Slide 23

Slide 23 text

• Dynamic Programming and Optimal Control - D. P. Bertsekas.
• Linear Least-Squares Algorithms for Temporal Difference Learning - S. J. Bradtke and A. G. Barto.
• Reinforcement Learning: An Introduction - R. S. Sutton and A. G. Barto.
• David Silver's RL Course (videos online).

Slide 24

Slide 24 text

BACKUP SLIDES

Slide 25

Slide 25 text

• The least squares estimator has an issue:
  r_i = ⟨ φ(x_i) − γ φ(x_{i+1}), w ⟩ + γ ⟨ φ(x_{i+1}) − E_{x'∼p(·|x_i,π(x_i))}[ φ(x') ], w ⟩,
  where the second term plays the role of the "noise".
• The "noise" is not uncorrelated with the covariate φ(x_i) − γ φ(x_{i+1}).
• LSTD "fixes" this issue.
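A toy numerical illustration of that correlation (entirely my own setup, not from the slides): a 3-state chain with one-hot features, where the true w can be computed exactly and compared against the two estimators sketched earlier.

def bias_demo(n_steps=200_000, gamma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    P = np.array([[0.9, 0.1, 0.0],      # fixed Markov chain (the policy is already baked in)
                  [0.5, 0.0, 0.5],
                  [0.0, 0.1, 0.9]])
    r = np.array([1.0, 0.0, -1.0])
    w_true = np.linalg.solve(np.eye(3) - gamma * P, r)   # one-hot features => w = V^π

    # Sample one long trajectory.
    states = [0]
    for _ in range(n_steps):
        states.append(rng.choice(3, p=P[states[-1]]))
    phis = np.eye(3)[states]                             # φ(x_i) is a one-hot vector
    rewards = r[states[:-1]]

    print("true :", w_true)
    print("naive:", naive_ls_weights(phis, rewards, gamma))   # noticeably biased
    print("lstd :", lstd_weights(phis, rewards, gamma))       # close to w_true for large n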

Slide 26

Slide 26 text

INTUITION
• Introduce 3 matrices:
  Φ = ( φ(x_1), ..., φ(x_n) )^T ,
  Φ̃ = ( φ(x_2), ..., φ(x_{n+1}) )^T ,
  Φ̄ = ( E_{x'∼p(·|x_1,π(x_1))}[ φ(x') ], ..., E_{x'∼p(·|x_n,π(x_n))}[ φ(x') ] )^T .
• In this notation, LSTD is:  ŵ = ( Φ^T (Φ − γ Φ̃) )^{-1} Φ^T R.
• The Bellman equation is:  R = (Φ − γ Φ̄) w.
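In NumPy this matrix form is a one-liner (a sketch, equivalent to the lstd_weights function earlier, with Phi, Phi_tilde, R, and gamma built accordingly):

# ŵ = (Φ^T (Φ - γ Φ̃))^{-1} Φ^T R
w_hat = np.linalg.solve(Phi.T @ (Phi - gamma * Phi_tilde), Phi.T @ R)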

Slide 27

Slide 27 text

INTUITION
• Combine the LSTD estimator and the Bellman equation:
  ŵ − w = ( Φ^T (Φ − γ Φ̃) )^{-1} Φ^T (Φ − γ Φ̄) w − w
        = ( Φ^T (Φ − γ Φ̃) )^{-1} Φ^T ( (Φ − γ Φ̃) + γ (Φ̃ − Φ̄) ) w − w
        = γ ( Φ^T (Φ − γ Φ̃) )^{-1} Φ^T (Φ̃ − Φ̄) w.
• The term Φ^T (Φ̃ − Φ̄) w is zero-mean.
• Show concentration of σ_min( Φ^T (Φ − γ Φ̃) ) and ‖ Φ^T (Φ̃ − Φ̄) w ‖.