
Stephen Tu on "Least Squares Policy Iteration"

Policy iteration is a classic dynamic programming algorithm for solving a Markov Decision Process (MDP). The algorithm alternates between two steps: 1) a policy evaluation step, which, given the current policy, computes the state-action value function (commonly known as the Q-function) for that policy, and 2) a policy improvement step, which uses the Q-function to greedily improve the current policy. When the number of states and actions of the MDP is finite and small, policy iteration performs well and comes with nice theoretical guarantees. However, when the state and action spaces are large (possibly continuous), policy iteration becomes intractable, and approximate methods for solving the MDP must be used.
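The two alternating steps can be made concrete with a short tabular sketch. The two-state MDP below, with its transitions and rewards, is invented purely for illustration:

```python
import numpy as np

# Toy MDP (made-up numbers): 2 states, 2 actions.
# Action 0 stays in place, action 1 moves to the other state.
# State 0 is the only rewarding state.
n_states, n_actions = 2, 2
gamma = 0.9
# P[a, s, s'] = probability of landing in s' after taking action a in state s.
P = np.zeros((n_actions, n_states, n_states))
P[0] = np.eye(2)                        # "stay"
P[1] = np.array([[0., 1.], [1., 0.]])   # "move"
r = np.array([[1., 1.], [0., 0.]])      # r[s, a]: reward 1 only in state 0

pi = np.zeros(n_states, dtype=int)      # initial policy: always "stay"
while True:
    # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly.
    P_pi = np.array([P[pi[s], s] for s in range(n_states)])
    r_pi = np.array([r[s, pi[s]] for s in range(n_states)])
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    # Policy improvement: act greedily with respect to the Q-function.
    Q = r.T + gamma * P @ V             # Q[a, s] = r(s, a) + gamma * E[V(s')]
    pi_new = np.argmax(Q, axis=0)
    if np.array_equal(pi_new, pi):
        break
    pi = pi_new

print(pi, V)
```

The loop terminates when the greedy policy stops changing; here it settles on "stay" in the rewarding state and "move" in the other.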

Least Squares Policy Iteration (LSPI) is one such method for approximately solving an MDP. The key idea is to approximate the Q-function as a linear function in a lifted, higher-dimensional space, analogous to feature maps in supervised learning. Plugging this approximation into the Bellman equation yields a tractable linear system of equations for the policy evaluation step, while the policy improvement step remains the same as before.

This talk describes LSPI and some of its subtleties. One subtlety arises because the Bellman operator does not necessarily map our approximate function class to itself, so an extra projection step is typically used: the Bellman residual is minimized after projecting back onto the function class. To build intuition for LSPI, I will also discuss what the algorithm does on a well-studied continuous optimal control problem, the Linear Quadratic Regulator (LQR).

Papers_We_Love

August 31, 2017

Transcript

  1. LEAST SQUARES POLICY
    ITERATION (LSPI)
    M. G. Lagoudakis and R. Parr
    Presented by Stephen Tu
    PWL 8/31


  2. REINFORCEMENT LEARNING
    IN < 5 MINUTES


  3. RL PRIMER
    • Formalized via a Markov Decision Process (MDP).
    • An MDP is a 5-tuple (S, A, p, γ, r)
    • S is state-space (e.g. position on a grid),
    • A is action-space (e.g. left and right),
    • p : S × A → Δ(S) is the transition function,
    • γ is the discount factor,
    • r : S × A → ℝ is the reward function.
    (Unknown to the algorithm!)


  4. RL PRIMER
    • Goal of RL is to find a policy π : S → A that
    optimizes (over all policies)
    • In this talk, we focus primarily on the easier
    problem of scoring a particular (fixed) policy.
    $$V^\pi = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r(s_k, a_k)\right], \qquad s_{k+1} \sim \underbrace{p(\cdot \mid s_k, \pi(s_k))}_{\text{``dynamics''}}.$$

  5. RL PRIMER
    • For a policy, define the value function as
    • How do we evaluate this function?
    $$V^\pi(s) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r(s_k, a_k) \,\Big|\, s_0 = s\right].$$

  6. RL PRIMER
    • Fundamental equation of RL
    $$V^\pi(s) = r(s, \pi(s)) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s, \pi(s))}\left[V^\pi(s')\right].$$
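Numerically, the value function of a fixed policy is the unique solution of the resulting linear system. A minimal sketch on a made-up two-state chain (transition matrix and rewards invented for illustration):

```python
import numpy as np

# Hypothetical 2-state chain under a fixed policy:
# P[s, s'] is the induced transition matrix, r[s] the reward at state s.
gamma = 0.9
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])

# The value function is the fixed point of the Bellman equation,
# obtained by solving the linear system (I - gamma P) V = r.
V = np.linalg.solve(np.eye(2) - gamma * P, r)

# Check: V(s) = r(s) + gamma * E_{s'}[V(s')] holds exactly.
assert np.allclose(V, r + gamma * P @ V)
print(V)
```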

  7. BELLMAN’S EQUATION IN
    ACTION


  8. LINEAR QUADRATIC
    REGULATOR
    • (Discrete-time) linear, time-invariant system:
    • Quadratic reward function:
    $$x_{k+1} = A x_k + B u_k + w_k, \qquad w_k \sim \mathcal{N}(0, I).$$
    $$r(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k, \qquad Q \succeq 0, \; R \succ 0.$$
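To make the setup concrete, here is a minimal sketch that rolls out the closed-loop system under a fixed linear feedback u_k = K x_k, with the noise switched off so the discounted reward sum is deterministic. The matrices A, B, K, Q, R below are made-up numbers, chosen so the closed-loop matrix A + BK is stable:

```python
import numpy as np

# Hypothetical 1-dimensional instance (all numbers made up):
A, B = np.array([[0.9]]), np.array([[1.0]])
K = np.array([[-0.5]])                  # linear feedback u = K x
Q, R = np.array([[1.0]]), np.array([[0.1]])
gamma = 0.9

# Noiseless rollout of x_{k+1} = A x_k + B u_k with u_k = K x_k,
# accumulating the discounted quadratic reward.
x = np.array([1.0])
total = 0.0
for k in range(300):
    u = K @ x
    total += gamma**k * (x @ Q @ x + u @ R @ u)
    x = A @ x + B @ u

print(total)
```

Since |A + BK| = 0.4 < 1 here, each term shrinks geometrically and the discounted sum converges to 1.025 / (1 − 0.144) in closed form.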

  9. LINEAR QUADRATIC
    REGULATOR
    • Let’s derive $V^\pi(x)$ for LQR, for feedback $\pi(x) = Kx$.
    • Step 1: Guess that $V^\pi(x) = x^T P x + q$.
    • Step 2: Plug this guess into the Bellman equation and solve for $P$, $q$.

  10. • Bellman:
    $$V^\pi(x) = r(x, \pi(x)) + \gamma\, \mathbb{E}_{x' \sim p(\cdot \mid x, \pi(x))}\left[V^\pi(x')\right].$$
    • Plug in $V^\pi(x) = x^T P x + q$:
    $$x^T P x + q = x^T (Q + K^T R K)\, x + \gamma\, \mathbb{E}_{z \sim \mathcal{N}((A+BK)x,\, I)}\left[z^T P z + q\right].$$
    • Solve:
    $$\gamma L^T P L - P + Q + K^T R K = 0, \qquad L = A + BK, \qquad q = \frac{\gamma}{1-\gamma}\,\mathrm{Tr}(P).$$
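The fixed-point equations for P and q can be solved by simple iteration, since the map P ↦ Q + KᵀRK + γ LᵀPL is a contraction when L is stable. A sketch on a made-up one-dimensional system (the same hypothetical numbers as before):

```python
import numpy as np

# Hypothetical 1-d system (all numbers made up for illustration).
A, B = np.array([[0.9]]), np.array([[1.0]])
K = np.array([[-0.5]])
Q, R = np.array([[1.0]]), np.array([[0.1]])
gamma = 0.9
L = A + B @ K                      # closed-loop matrix, here 0.4

# Iterate P <- Q + K^T R K + gamma * L^T P L to the fixed point.
P = np.zeros_like(Q)
for _ in range(500):
    P = Q + K.T @ R @ K + gamma * L.T @ P @ L
q = gamma / (1 - gamma) * np.trace(P)

print(P, q)
```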

  11. LEAST SQUARES TEMPORAL
    DIFFERENCING


  12. LEAST SQUARES TD
    • Let’s do the LQR example again, with more
    generality.
    • Assume a “linear architecture”:
    $$V^\pi(x) = \phi(x)^T w.$$
    • For LQR:
    $$\phi(x) = \mathrm{svec}\left(xx^T + \frac{\gamma}{1-\gamma}\, I\right), \qquad w = \mathrm{svec}(P).$$
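svec has several conventions; assuming the common one that scales off-diagonal entries by √2 (so that ⟨svec(A), svec(B)⟩ = Tr(AB) for symmetric A, B), the linear architecture reproduces the quadratic value function exactly. A sketch with made-up P and x:

```python
import numpy as np

def svec(M):
    """Vectorize the upper triangle of a symmetric matrix, scaling
    off-diagonal entries by sqrt(2) so that svec(A) @ svec(B) = Tr(A B)."""
    i, j = np.triu_indices(M.shape[0])
    scale = np.where(i == j, 1.0, np.sqrt(2.0))
    return scale * M[i, j]

rng = np.random.default_rng(0)
n, gamma = 3, 0.9
S = rng.standard_normal((n, n))
P = S + S.T                        # an arbitrary symmetric "P" (made up)
x = rng.standard_normal(n)

c = gamma / (1 - gamma)
phi = svec(np.outer(x, x) + c * np.eye(n))
w = svec(P)

# phi(x)^T w = x^T P x + c * Tr(P), i.e. the quadratic value function.
assert np.isclose(phi @ w, x @ P @ x + c * np.trace(P))
```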

  13. • Bellman:
    $$V^\pi(x) = r(x, \pi(x)) + \gamma\, \mathbb{E}_{x' \sim p(\cdot \mid x, \pi(x))}\left[V^\pi(x')\right].$$
    • Plug in the linear assumption $V^\pi(x) = \phi(x)^T w$:
    $$\phi(x)^T w = r(x, \pi(x)) + \gamma\, \mathbb{E}_{x' \sim p(\cdot \mid x, \pi(x))}\left[\phi(x')^T w\right] = r(x, \pi(x)) + \gamma\, \mathbb{E}_{x' \sim p(\cdot \mid x, \pi(x))}\left[\phi(x')\right]^T w$$
    $$\iff\; r(x, \pi(x)) = \left\langle \phi(x) - \gamma\, \mathbb{E}_{x' \sim p(\cdot \mid x, \pi(x))}\left[\phi(x')\right],\; w \right\rangle.$$

  14. • Bellman + linear assumption:
    $$r(x, \pi(x)) = \left\langle \phi(x) - \gamma\, \mathbb{E}_{x' \sim p(\cdot \mid x, \pi(x))}\left[\phi(x')\right],\; w \right\rangle.$$
    • This suggests solving a system of linear equations:
    $$\text{Covariate:}\;\; \phi(x_i) - \gamma\, \mathbb{E}_{x' \sim p(\cdot \mid x_i, \pi(x_i))}\left[\phi(x')\right], \qquad \text{Target:}\;\; r(x_i, \pi(x_i)).$$
    • We can’t evaluate the expectation though!

  15. • A natural idea is to use the transition samples in
    place of the expectation, and use least squares:
    $$\text{Covariate:}\;\; \phi(x_i) - \gamma\, \phi(x_{i+1}), \qquad \text{Target:}\;\; r(x_i, \pi(x_i)).$$
    • This yields the estimator:
    $$\hat{w} = \left( \sum_{i=1}^{n} \left( \phi(x_i) - \gamma\, \phi(x_{i+1}) \right)^{\otimes 2} \right)^{-1} \sum_{i=1}^{n} \left( \phi(x_i) - \gamma\, \phi(x_{i+1}) \right) r_i.$$

  16. • The previous estimator is not quite what you want (it has bias)!
    • The fix is known as the Least Squares
    Temporal Differencing (LSTD) estimator:
    $$\hat{w} = \left( \sum_{i=1}^{n} \phi(x_i) \left( \phi(x_i) - \gamma\, \phi(x_{i+1}) \right)^T \right)^{-1} \sum_{i=1}^{n} \phi(x_i)\, r_i.$$
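To compare the two estimators empirically, here is a sketch on a small two-state chain with one-hot features, so the linear architecture holds exactly and the true weight vector is the value function itself. All of the numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
P = np.array([[0.9, 0.1],          # transition matrix under the fixed policy
              [0.2, 0.8]])
r = np.array([1.0, 0.0])           # reward depends only on the state
V_true = np.linalg.solve(np.eye(2) - gamma * P, r)

# Sample one long trajectory from the chain.
n = 200_000
u = rng.random(n)
xs = np.empty(n + 1, dtype=int)
xs[0] = 0
for i in range(n):
    xs[i + 1] = 1 if u[i] < P[xs[i], 1] else 0

phi = np.eye(2)                    # one-hot features: phi(s) = e_s
F  = phi[xs[:-1]]                  # phi(x_i),     shape (n, 2)
Fn = phi[xs[1:]]                   # phi(x_{i+1}), shape (n, 2)
R  = r[xs[:-1]]

D = F - gamma * Fn
w_naive = np.linalg.solve(D.T @ D, D.T @ R)   # plain least squares (biased)
w_lstd  = np.linalg.solve(F.T @ D, F.T @ R)   # LSTD

print(V_true, w_naive, w_lstd)
```

With one-hot features LSTD reduces to the certainty-equivalent estimate built from the empirical transition frequencies, so it converges to V_true as n grows; the plain least-squares estimator does not, because its noise is correlated with the covariate.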

  17. FROM EVALUATION TO
    OPTIMIZATION


  18. FROM LSTD TO LSPI
    • LSPI uses LSTD as an evaluation primitive.
    • LSPI does an outer loop of policy iteration, and
    an inner loop of LSTD on the Q-function
    $$Q^\pi(s, a) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^k r(s_k, a_k) \,\Big|\, s_0 = s,\, a_0 = a \right].$$

  19. FROM LSTD TO LSPI
    • From a Q-function, we can construct a new policy as follows:
    $$\pi_+(s) = \arg\max_{a \in A} Q^\pi(s, a).$$
    • Then, compute $Q^{\pi_+}(s, a)$ (with LSTD) and repeat the process.
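In the tabular case the improvement step is just a per-state argmax over the rows of the Q-function; a minimal sketch with a made-up Q table:

```python
import numpy as np

# Hypothetical Q-function estimate, Q[s, a], e.g. as produced by an
# LSTD-style evaluation step (numbers invented for illustration).
Q = np.array([[10.0, 9.1],
              [ 8.1, 9.0]])

# Greedy improvement: pi_plus(s) = argmax_a Q(s, a).
pi_plus = np.argmax(Q, axis=1)
print(pi_plus)
```

LSPI then re-evaluates Q for this improved policy and repeats until the greedy policy stops changing.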

  20. • Dynamic Programming and Optimal Control - D.
    P. Bertsekas.
    • Linear Least-Squares Algorithms for Temporal
    Difference Learning - S. J. Bradtke and A. G. Barto.
    • Reinforcement Learning: An Introduction - R. S.
    Sutton and A. G. Barto.
    • David Silver’s RL Course (videos online).


  21. BACKUP SLIDES


  22. • The least squares estimator has an issue:
    $$r_i = \left\langle \phi(x_i) - \gamma\, \phi(x_{i+1}),\; w \right\rangle + \underbrace{\gamma \left\langle \phi(x_{i+1}) - \mathbb{E}_{x' \sim p(\cdot \mid x_i, \pi(x_i))}\left[\phi(x')\right],\; w \right\rangle}_{\text{noise}}.$$
    • The “noise” term is correlated with the covariate $\phi(x_i) - \gamma\, \phi(x_{i+1})$.
    • LSTD “fixes” this issue.

  23. INTUITION
    • Introduce three matrices:
    $$\Phi = \left( \phi(x_1), \ldots, \phi(x_n) \right)^T, \qquad \tilde{\Phi} = \left( \phi(x_2), \ldots, \phi(x_{n+1}) \right)^T,$$
    $$\bar{\Phi} = \left( \mathbb{E}_{x' \sim p(\cdot \mid x_1, \pi(x_1))}\left[\phi(x')\right], \ldots, \mathbb{E}_{x' \sim p(\cdot \mid x_n, \pi(x_n))}\left[\phi(x')\right] \right)^T.$$
    • In this notation, LSTD is:
    $$\hat{w} = \left( \Phi^T (\Phi - \gamma \tilde{\Phi}) \right)^{-1} \Phi^T R.$$
    • The Bellman equation is:
    $$R = (\Phi - \gamma \bar{\Phi})\, w.$$

  24. INTUITION
    • Combine the LSTD estimator and the Bellman equation:
    $$\begin{aligned}
    \hat{w} - w &= \left( \Phi^T (\Phi - \gamma \tilde{\Phi}) \right)^{-1} \Phi^T (\Phi - \gamma \bar{\Phi})\, w - w \\
    &= \left( \Phi^T (\Phi - \gamma \tilde{\Phi}) \right)^{-1} \Phi^T \left( (\Phi - \gamma \tilde{\Phi}) + \gamma (\tilde{\Phi} - \bar{\Phi}) \right) w - w \\
    &= \gamma \left( \Phi^T (\Phi - \gamma \tilde{\Phi}) \right)^{-1} \Phi^T (\tilde{\Phi} - \bar{\Phi})\, w.
    \end{aligned}$$
    • The term $\Phi^T (\tilde{\Phi} - \bar{\Phi})\, w$ is zero-mean.
    • Show concentration of $\sigma_{\min}\left( \Phi^T (\Phi - \gamma \tilde{\Phi}) \right)$ and $\left\| \Phi^T (\tilde{\Phi} - \bar{\Phi})\, w \right\|$.