LQR Learning Pipelines - Speaker Deck

Slide 1

Slide 1 text

LQR Learning Pipelines Florian Dörfler KIOS Graduate Training School 2024

Slide 9

Slide 9 text

9 are identification, and certainty-equivalence control The conventional approach to data-driven LQR is indirect: t a parametric state-space model is identified from data, d later on controllers are synthesized based on this model n Section II-A. We will briefly review this approach. Regarding the identification task, consider a T-long time es of inputs, disturbances, states, and successor states U0 := ⇥ u(0) u(1) . . . u(T − 1) ⇤2 Rm ⇥T , D0 := ⇥ d(0) d(1) . . . d(T − 1) ⇤2 Rn⇥T , X 0 := ⇥ x(0) x(1) . . . x(T − 1) ⇤2 Rn⇥T , X 1 := ⇥ x(1) x(2) . . . x(T) ⇤2 Rn⇥T sfying the dynamics (1), that is, X 1 − D0 = ⇥ B A ⇤ U0 X 0 . (5) s convenient to record the data as consecutive time series, , column i of X 1 coincides with column i + 1 of X 0 , but s is not strictly needed for our developments: the data may ginate from independent experiments. Let for brevity are identification, and certainty-equivalence control The conventional approach to data-driven LQR is indirect: t a parametric state-space model is identified from data, d later on controllers are synthesized based on this model n Section II-A. We will briefly review this approach. Regarding the identification task, consider a T-long time es of inputs, disturbances, states, and successor states U0 := ⇥ u(0) u(1) . . . u(T − 1) ⇤2 Rm ⇥T , D0 := ⇥ d(0) d(1) . . . d(T − 1) ⇤2 Rn⇥T , X 0 := ⇥ x(0) x(1) . . . x(T − 1) ⇤2 Rn⇥T , X 1 := ⇥ x(1) x(2) . . . x(T) ⇤2 Rn⇥T sfying the dynamics (1), that is, X 1 − D0 = ⇥ B A ⇤ U0 X 0 . (5) s convenient to record the data as consecutive time series, , column i of X 1 coincides with column i + 1 of X 0 , but s is not strictly needed for our developments: the data may ginate from independent experiments. Let for brevity > > : z(k) = Q 0 0 R1/ 2 x(k) u(k) where k 2 N, x 2 Rn is the state, u 2 Rm is the control input, d isadisturbanceterm, and z istheperformancesignal of interest. We assume that (A, B) is stabilizable. Finally, Q 0 and R 0 are weighting matrices. Here, (⌫ ) and ≺ ( ) denote positive and negative (semi)definiteness. The problem of interest is linear quadratic regulation phrased as designing a state-feedback gain K that renders A + BK Schur and minimizes the H2 -norm of the transfer function T (K ) := d ! z of the closed-loop system1 x(k + 1) z(k) = 2 4 A + BK I Q1/ 2 R1/ 2K 0 3 5 x(k) d(k) , (2) where our notation T (K ) emphasizes the dependence of the transfer function on K . When A + BK is Schur, it holds that kT (K )k2 2 = trace(QP) + trace K > RK P , (3) where P is the controllability Gramian of the closed-loop system (2), which coincides with the unique solution to the > first a parametric state-space model is iden and later on controllers are synthesized bas as in Section II-A. We will briefly review t Regarding the identification task, conside series of inputs, disturbances, states, and su U0 := ⇥ u(0) u(1) . . . u(T − 1) ⇤2 D0 := ⇥ d(0) d(1) . . . d(T − 1) ⇤2 X 0 := ⇥ x(0) x(1) . . . x(T − 1) ⇤2 X 1 := ⇥ x(1) x(2) . . . x(T) ⇤2 Rn satisfying the dynamics (1), that is, X 1 − D0 = ⇥ B A ⇤ U0 X 0 It is convenient to record the data as consec i.e., column i of X 1 coincides with column this is not strictly needed for our developme originate from independent experiments. Le W0 := U0 X . > > : z(k) = Q 0 0 R1/ 2 x(k) u(k) where k 2 N, x 2 Rn is the state, u 2 Rm is the control input, d isadisturbanceterm, and z istheperformancesignal of interest. We assume that (A, B) is stabilizable. Finally, Q 0 and R 0 are weighting matrices. Here, (⌫ ) and ≺ ( ) denote positive and negative (semi)definiteness. The problem of interest is linear quadratic regulation phrased as designing a state-feedback gain K that renders A + BK Schur and minimizes the H2 -norm of the transfer function T (K ) := d ! z of the closed-loop system1 x(k + 1) z(k) = 2 4 A + BK I Q1/ 2 R1/ 2K 0 3 5 x(k) d(k) , (2) where our notation T (K ) emphasizes the dependence of the transfer function on K . When A + BK is Schur, it holds that kT (K )k2 2 = trace(QP) + trace K > RK P , (3) where P is the controllability Gramian of the closed-loop system (2), which coincides with the unique solution to the first a parametric state-space model is iden and later on controllers are synthesized base as in Section II-A. We will briefly review t Regarding the identification task, conside series of inputs, disturbances, states, and su U0 := ⇥ u(0) u(1) . . . u(T − 1) ⇤2 D0 := ⇥ d(0) d(1) . . . d(T − 1) ⇤2 X 0 := ⇥ x(0) x(1) . . . x(T − 1) ⇤2 X 1 := ⇥ x(1) x(2) . . . x(T) ⇤2 Rn satisfying the dynamics (1), that is, X 1 − D0 = ⇥ B A ⇤ U0 X 0 It is convenient to record the data as consec i.e., column i of X 1 coincides with column this is not strictly needed for our developme originate from independent experiments. Le W0 := U0 . X 1 = AX 0 + BU0 + D0 Indirect & certainty-equivalence LQR • collect I/O data (𝑋 , 𝑈 , 𝑋 ) with 𝐷 unknown & PE: rank 𝑈 𝑋 = 𝑛 + 𝑚 • indirect & certainty- equivalence LQR (optimal in MLE setting) least squares SysID certainty- equivalent LQR

Slide 10

Slide 10 text

10 x The conventional approach to data-driven LQR is indirect: st a parametric state-space model is identified from data, d later on controllers are synthesized based on this model in Section II-A. We will briefly review this approach. Regarding the identification task, consider a T-long time ries of inputs, disturbances, states, and successor states U0 := ⇥ u(0) u(1) . . . u(T − 1) ⇤2 Rm ⇥T , D0 := ⇥ d(0) d(1) . . . d(T − 1) ⇤2 Rn⇥T , X 0 := ⇥ x(0) x(1) . . . x(T − 1) ⇤2 Rn⇥T , X 1 := ⇥ x(1) x(2) . . . x(T) ⇤2 Rn⇥T isfying the dynamics (1), that is, X 1 − D0 = ⇥ B A ⇤ U0 X 0 . (5) s convenient to record the data as consecutive time series, ., column i of X 1 coincides with column i + 1 of X 0 , but s isnot strictly needed for our developments: the data may ginate from independent experiments. Let for brevity The conventional approach to data-driven LQR is indirect: st a parametric state-space model is identified from data, d later on controllers are synthesized based on this model in Section II-A. We will briefly review this approach. Regarding the identification task, consider a T-long time ries of inputs, disturbances, states, and successor states U0 := ⇥ u(0) u(1) . . . u(T − 1) ⇤2 Rm ⇥T , D0 := ⇥ d(0) d(1) . . . d(T − 1) ⇤2 Rn⇥T , X 0 := ⇥ x(0) x(1) . . . x(T − 1) ⇤2 Rn⇥T , X 1 := ⇥ x(1) x(2) . . . x(T) ⇤2 Rn⇥T isfying the dynamics (1), that is, X 1 − D0 = ⇥ B A ⇤ U0 X 0 . (5) s convenient to record the data as consecutive time series, ., column i of X 1 coincides with column i + 1 of X 0 , but s isnot strictly needed for our developments: the data may ginate from independent experiments. Let for brevity where k 2 N, x 2 Rn is the state, u 2 Rm is the control input, d isadisturbanceterm, and z istheperformancesignal of interest. We assume that (A, B) is stabilizable. Finally, Q 0 and R 0 are weighting matrices. Here, (⌫ ) and ≺ ( ) denote positive and negative (semi)definiteness. The problem of interest is linear quadratic regulation phrased as designing a state-feedback gain K that renders A + BK Schur and minimizes the H2 -norm of the transfer function T (K ) := d ! z of the closed-loop system1 x(k + 1) z(k) = 2 4 A + BK I Q1/ 2 R1/ 2K 0 3 5 x(k) d(k) , (2) where our notation T (K ) emphasizes the dependence of the transfer function on K . When A + BK is Schur, it holds that kT (K )k2 2 = trace(QP) + trace K > RK P , (3) where P is the controllability Gramian of the closed-loop system (2), which coincides with the unique solution to the Lyapunov equation (A + BK )P(A + BK )> − P + I = 0. We refer to [34] for properties and interpretations of the and later on controllers are synthesized bas as in Section II-A. We will briefly review Regarding the identification task, consid series of inputs, disturbances, states, and s U0 := ⇥ u(0) u(1) . . . u(T − 1) ⇤ D0 := ⇥ d(0) d(1) . . . d(T − 1) ⇤ X 0 := ⇥ x(0) x(1) . . . x(T − 1) ⇤ X 1 := ⇥ x(1) x(2) . . . x(T) ⇤2 R satisfying the dynamics (1), that is, X 1 − D0 = ⇥ B A ⇤ U0 X 0 It is convenient to record the data as consec i.e., column i of X 1 coincides with column this is not strictly needed for our developme originate from independent experiments. L W0 := U0 X 0 . where k 2 N, x 2 Rn is the state, u 2 Rm is the control input, d isadisturbanceterm, and z istheperformancesignal of interest. We assume that (A, B) is stabilizable. Finally, Q 0 and R 0 are weighting matrices. Here, (⌫ ) and ≺ ( ) denote positive and negative (semi)definiteness. The problem of interest is linear quadratic regulation phrased as designing a state-feedback gain K that renders A + BK Schur and minimizes the H2 -norm of the transfer function T (K ) := d ! z of the closed-loop system1 x(k + 1) z(k) = 2 4 A + BK I Q1/ 2 R1/ 2K 0 3 5 x(k) d(k) , (2) where our notation T (K ) emphasizes the dependence of the transfer function on K . When A + BK is Schur, it holds that kT (K )k2 2 = trace(QP) + trace K > RK P , (3) where P is the controllability Gramian of the closed-loop system (2), which coincides with the unique solution to the Lyapunov equation (A + BK )P(A + BK )> − P + I = 0. We refer to [34] for properties and interpretations of the and later on controllers are synthesized bas as in Section II-A. We will briefly review Regarding the identification task, consid series of inputs, disturbances, states, and s U0 := ⇥ u(0) u(1) . . . u(T − 1) ⇤ D0 := ⇥ d(0) d(1) . . . d(T − 1) ⇤ X 0 := ⇥ x(0) x(1) . . . x(T − 1) ⇤ X 1 := ⇥ x(1) x(2) . . . x(T) ⇤2 R satisfying the dynamics (1), that is, X 1 − D0 = ⇥ B A ⇤ U0 X 0 It is convenient to record the data as consec i.e., column i of X 1 coincides with column this is not strictly needed for our developme originate from independent experiments. L W0 := U0 X 0 . X 1 = AX 0 + BU0 + D0 Recall indirect approach on the board • I/O data (𝑋 , 𝑈 , 𝑋 ) with 𝐷 unknown & PE: rank 𝑈 𝑋 = 𝑛 + 𝑚 G = 0 : 0 = z(X . - [ii][) SG) · Log i ⑫ j - > [i] = X (7) t [ Moor - Penrose invertible are uniquely determined = X W I right / inverse I due to = X []"(i][7") - 1 ↑ E us model-based design

Slide 11

Slide 11 text

11 The conventional approach to data-driven LQR is indirect: st a parametric state-space model is identified from data, d later on controllers are synthesized based on this model in Section II-A. We will briefly review this approach. Regarding the identification task, consider a T-long time ries of inputs, disturbances, states, and successor states U0 := ⇥ u(0) u(1) . . . u(T − 1) ⇤2 Rm ⇥T , D0 := ⇥ d(0) d(1) . . . d(T − 1) ⇤2 Rn⇥T , X 0 := ⇥ x(0) x(1) . . . x(T − 1) ⇤2 Rn⇥T , X 1 := ⇥ x(1) x(2) . . . x(T) ⇤2 Rn⇥T isfying the dynamics (1), that is, X 1 − D0 = ⇥ B A ⇤ U0 X 0 . (5) s convenient to record the data as consecutive time series, ., column i of X 1 coincides with column i + 1 of X 0 , but s isnot strictly needed for our developments: the data may ginate from independent experiments. Let for brevity The conventional approach to data-driven LQR is indirect: st a parametric state-space model is identified from data, d later on controllers are synthesized based on this model in Section II-A. We will briefly review this approach. Regarding the identification task, consider a T-long time ries of inputs, disturbances, states, and successor states U0 := ⇥ u(0) u(1) . . . u(T − 1) ⇤2 Rm ⇥T , D0 := ⇥ d(0) d(1) . . . d(T − 1) ⇤2 Rn⇥T , X 0 := ⇥ x(0) x(1) . . . x(T − 1) ⇤2 Rn⇥T , X 1 := ⇥ x(1) x(2) . . . x(T) ⇤2 Rn⇥T isfying the dynamics (1), that is, X 1 − D0 = ⇥ B A ⇤ U0 X 0 . (5) s convenient to record the data as consecutive time series, ., column i of X 1 coincides with column i + 1 of X 0 , but s isnot strictly needed for our developments: the data may ginate from independent experiments. Let for brevity where k 2 N, x 2 Rn is the state, u 2 Rm is the control input, d isadisturbanceterm, and z istheperformancesignal of interest. We assume that (A, B) is stabilizable. Finally, Q 0 and R 0 are weighting matrices. Here, (⌫ ) and ≺ ( ) denote positive and negative (semi)definiteness. The problem of interest is linear quadratic regulation phrased as designing a state-feedback gain K that renders A + BK Schur and minimizes the H2 -norm of the transfer function T (K ) := d ! z of the closed-loop system1 x(k + 1) z(k) = 2 4 A + BK I Q1/ 2 R1/ 2K 0 3 5 x(k) d(k) , (2) where our notation T (K ) emphasizes the dependence of the transfer function on K . When A + BK is Schur, it holds that kT (K )k2 2 = trace(QP) + trace K > RK P , (3) where P is the controllability Gramian of the closed-loop system (2), which coincides with the unique solution to the Lyapunov equation (A + BK )P(A + BK )> − P + I = 0. We refer to [34] for properties and interpretations of the and later on controllers are synthesized bas as in Section II-A. We will briefly review Regarding the identification task, consid series of inputs, disturbances, states, and s U0 := ⇥ u(0) u(1) . . . u(T − 1) ⇤ D0 := ⇥ d(0) d(1) . . . d(T − 1) ⇤ X 0 := ⇥ x(0) x(1) . . . x(T − 1) ⇤ X 1 := ⇥ x(1) x(2) . . . x(T) ⇤2 R satisfying the dynamics (1), that is, X 1 − D0 = ⇥ B A ⇤ U0 X 0 It is convenient to record the data as consec i.e., column i of X 1 coincides with column this is not strictly needed for our developme originate from independent experiments. L W0 := U0 X 0 . where k 2 N, x 2 Rn is the state, u 2 Rm is the control input, d isadisturbanceterm, and z istheperformancesignal of interest. We assume that (A, B) is stabilizable. Finally, Q 0 and R 0 are weighting matrices. Here, (⌫ ) and ≺ ( ) denote positive and negative (semi)definiteness. The problem of interest is linear quadratic regulation phrased as designing a state-feedback gain K that renders A + BK Schur and minimizes the H2 -norm of the transfer function T (K ) := d ! z of the closed-loop system1 x(k + 1) z(k) = 2 4 A + BK I Q1/ 2 R1/ 2K 0 3 5 x(k) d(k) , (2) where our notation T (K ) emphasizes the dependence of the transfer function on K . When A + BK is Schur, it holds that kT (K )k2 2 = trace(QP) + trace K > RK P , (3) where P is the controllability Gramian of the closed-loop system (2), which coincides with the unique solution to the Lyapunov equation (A + BK )P(A + BK )> − P + I = 0. We refer to [34] for properties and interpretations of the and later on controllers are synthesized bas as in Section II-A. We will briefly review Regarding the identification task, consid series of inputs, disturbances, states, and s U0 := ⇥ u(0) u(1) . . . u(T − 1) ⇤ D0 := ⇥ d(0) d(1) . . . d(T − 1) ⇤ X 0 := ⇥ x(0) x(1) . . . x(T − 1) ⇤ X 1 := ⇥ x(1) x(2) . . . x(T) ⇤2 R satisfying the dynamics (1), that is, X 1 − D0 = ⇥ B A ⇤ U0 X 0 It is convenient to record the data as consec i.e., column i of X 1 coincides with column this is not strictly needed for our developme originate from independent experiments. L W0 := U0 X 0 . X 1 = AX 0 + BU0 + D0 Derivation of a direct approach on the board • I/O data (𝑋 , 𝑈 , 𝑋 ) with 𝐷 unknown & PE: rank 𝑈 𝑋 = 𝑛 + 𝑚 · PE implies that UK 76 so that [i] = [Y6 · subspace relations for closed-loop matrix A + Bu = [ BA](E) = (BA][Y] = [AX + BU0]6 = (x= - Do)6 m can replace A + BK in any LME by (1 - Do)6 ~ data driven parameterization of linear control design

Slide 12

Slide 12 text

12 are identification, and certainty-equivalence control The conventional approach to data-driven LQR is indirect: t a parametric state-space model is identified from data, d later on controllers are synthesized based on this model n Section II-A. We will briefly review this approach. Regarding the identification task, consider a T-long time es of inputs, disturbances, states, and successor states U0 := ⇥ u(0) u(1) . . . u(T − 1) ⇤2 Rm ⇥T , D0 := ⇥ d(0) d(1) . . . d(T − 1) ⇤2 Rn⇥T , X 0 := ⇥ x(0) x(1) . . . x(T − 1) ⇤2 Rn⇥T , X 1 := ⇥ x(1) x(2) . . . x(T) ⇤2 Rn⇥T sfying the dynamics (1), that is, X 1 − D0 = ⇥ B A ⇤ U0 X 0 . (5) s convenient to record the data as consecutive time series, , column i of X 1 coincides with column i + 1 of X 0 , but s is not strictly needed for our developments: the data may ginate from independent experiments. Let for brevity are identification, and certainty-equivalence control The conventional approach to data-driven LQR is indirect: t a parametric state-space model is identified from data, d later on controllers are synthesized based on this model n Section II-A. We will briefly review this approach. Regarding the identification task, consider a T-long time es of inputs, disturbances, states, and successor states U0 := ⇥ u(0) u(1) . . . u(T − 1) ⇤2 Rm ⇥T , D0 := ⇥ d(0) d(1) . . . d(T − 1) ⇤2 Rn⇥T , X 0 := ⇥ x(0) x(1) . . . x(T − 1) ⇤2 Rn⇥T , X 1 := ⇥ x(1) x(2) . . . x(T) ⇤2 Rn⇥T sfying the dynamics (1), that is, X 1 − D0 = ⇥ B A ⇤ U0 X 0 . (5) s convenient to record the data as consecutive time series, , column i of X 1 coincides with column i + 1 of X 0 , but s is not strictly needed for our developments: the data may ginate from independent experiments. Let for brevity > > : z(k) = Q 0 0 R1/ 2 x(k) u(k) where k 2 N, x 2 Rn is the state, u 2 Rm is the control input, d isadisturbanceterm, and z istheperformancesignal of interest. We assume that (A, B) is stabilizable. Finally, Q 0 and R 0 are weighting matrices. Here, (⌫ ) and ≺ ( ) denote positive and negative (semi)definiteness. The problem of interest is linear quadratic regulation phrased as designing a state-feedback gain K that renders A + BK Schur and minimizes the H2 -norm of the transfer function T (K ) := d ! z of the closed-loop system1 x(k + 1) z(k) = 2 4 A + BK I Q1/ 2 R1/ 2K 0 3 5 x(k) d(k) , (2) where our notation T (K ) emphasizes the dependence of the transfer function on K . When A + BK is Schur, it holds that kT (K )k2 2 = trace(QP) + trace K > RK P , (3) where P is the controllability Gramian of the closed-loop system (2), which coincides with the unique solution to the > first a parametric state-space model is iden and later on controllers are synthesized bas as in Section II-A. We will briefly review t Regarding the identification task, conside series of inputs, disturbances, states, and su U0 := ⇥ u(0) u(1) . . . u(T − 1) ⇤2 D0 := ⇥ d(0) d(1) . . . d(T − 1) ⇤2 X 0 := ⇥ x(0) x(1) . . . x(T − 1) ⇤2 X 1 := ⇥ x(1) x(2) . . . x(T) ⇤2 Rn satisfying the dynamics (1), that is, X 1 − D0 = ⇥ B A ⇤ U0 X 0 It is convenient to record the data as consec i.e., column i of X 1 coincides with column this is not strictly needed for our developme originate from independent experiments. Le W0 := U0 X . > > : z(k) = Q 0 0 R1/ 2 x(k) u(k) where k 2 N, x 2 Rn is the state, u 2 Rm is the control input, d isadisturbanceterm, and z istheperformancesignal of interest. We assume that (A, B) is stabilizable. Finally, Q 0 and R 0 are weighting matrices. Here, (⌫ ) and ≺ ( ) denote positive and negative (semi)definiteness. The problem of interest is linear quadratic regulation phrased as designing a state-feedback gain K that renders A + BK Schur and minimizes the H2 -norm of the transfer function T (K ) := d ! z of the closed-loop system1 x(k + 1) z(k) = 2 4 A + BK I Q1/ 2 R1/ 2K 0 3 5 x(k) d(k) , (2) where our notation T (K ) emphasizes the dependence of the transfer function on K . When A + BK is Schur, it holds that kT (K )k2 2 = trace(QP) + trace K > RK P , (3) where P is the controllability Gramian of the closed-loop system (2), which coincides with the unique solution to the first a parametric state-space model is iden and later on controllers are synthesized base as in Section II-A. We will briefly review t Regarding the identification task, conside series of inputs, disturbances, states, and su U0 := ⇥ u(0) u(1) . . . u(T − 1) ⇤2 D0 := ⇥ d(0) d(1) . . . d(T − 1) ⇤2 X 0 := ⇥ x(0) x(1) . . . x(T − 1) ⇤2 X 1 := ⇥ x(1) x(2) . . . x(T) ⇤2 Rn satisfying the dynamics (1), that is, X 1 − D0 = ⇥ B A ⇤ U0 X 0 It is convenient to record the data as consec i.e., column i of X 1 coincides with column this is not strictly needed for our developme originate from independent experiments. Le W0 := U0 . X 1 = AX 0 + BU0 + D0 Direct approach from subspace relations in data • PE data: rank 𝑈 𝑋 = 𝑛 + 𝑚 • subspace relations • data-driven LQR LMIs by substituting  certainty equivalence by neglecting noise :

Slide 26

Slide 26 text

27 Online & adaptive solutions • shortcoming of separating offline learning & online control → cannot improve policy online & cheaply / rapidly adapt to changes • (elitist) desired adaptive solution: direct, online (non-episodic/non-batch) algorithms, with closed-loop data, & recursive algorithmic implementation • “best” way to improve policy with new data → go down the gradient ! PII: S0005–1098(98)00089–2 Automatica, Vol. 34, No. 10, pp. 1161— 1167, 1998 1998 IFAC. Published by Elsevier Science Ltd All rights reserved. Printed in Great Britain 0005-1098/98 $—see front matter Adaptive Control: Towards a Complexity-Based General Theory* G. ZAM ES- Key Words—H control; adaptive control; learning control; performance analysis. Abstract—Two recent developments are pointing the way towards an input— output theory of H ! l adaptive feedback: The solution of problems involving: (1) feedback performance exact optimization under large plant uncertainty on the one hand (thetwo-disc problem of H ); and (2) optimally fast identification in H on the other. Taken together, these are yielding adaptive algorithms for slowly varying data in H ! l . At a conceptual level, theseresultsmotivatea general input— output theory linking identification, adaptation, and control learning. In such a theory, thedefinition of adaptation isbased on system performance under uncertainty, and is independent of internal structure, presence or absence of variable parameters, or even feedback. 1998 IFAC. Published by Elsevier Science Ltd. All rights reserved. 1. INTRODUCTION certain difficulties. Controllers with identical external behavior can have an endless variety of parametrizations; variable parameters in one parametrization may be replaced by a fixed para- meter nonlinearity in another. In most of therecent control literature there is no clear separation between the concepts of adaptation and nonlinear feedback, or between research on adaptive control and nonlinear stability. This lack of clarity extends to fields other than control; e.g. in debates as to whether neural nets do or do not have a learning capacity;or in theclassical 1960sChomsky vsSkin- ner argument as to whether children’s language “adaptive = improve over best control with a priori info” * disclaimer: a large part of the adaptive control community focuses on stability & not optimality

Slide 27

Slide 27 text

28 Ingredient 1: policy gradient methods • LQR viewed as smooth program (many formulations) • 𝐽 𝐾 is not convex … after eliminating (unique) P, denote this as 𝐽 𝐾 Fact: policy gradient descent 𝐾 = 𝐾 − 𝜂 ∇𝐽 𝐾 initialized from a stabilizing policy converges linearly to 𝐾∗. Annual Review of Control, Robotics , and AutonomousSys tems Toward aT heoretical Foundation of Policy Optimization for Learning Control Policies Bin Hu,1 Kaiqing Zhang,2,3 Na Li,4 Mehran Mesbahi,5 Maryam Fazel,6 and Tamer Ba¸ sar1 1Coordinated Science Laboratory and Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, Illinois, USA; email: binhu7@ illinois.edu, basar1@ illinois.edu 2Laboratory for Information and Decision Systems and Computer Science and Artif cial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA 3Current aff liation: Department of Electrical and Computer Engineering and Institute for Systems Research, University of Maryland, College Park, Maryland, USA; email: kaiqing@ umd.edu 4School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts, USA; email: nali@ seas.harvard.edu 5Department of Aeronautics and Astronautics, University of Washington, Seattle, Washington, USA; email: mesbahi@ uw.edu 6Department of Electrical and Computer Engineering, University of Washington, Seattle, Washington, USA; email: mfazel@ uw.edu Annu. Rev. Control Robot. Auton. Syst. 2023. 6:123–58 T he Annual Review of Control, Robotics , and AutonomousSystemsisonline at control.annualreviews.org https://doi.org/10.1146/annurev-control-042920- 020021 Copyright © 2023 by the author(s). T his work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. See credit lines of images or other third-party material in this article for license information. Keywords policy optimization, reinforcement learning, feedback control synthesis Abstract Gradient-based methodshave been widely used for system design and optimization in diverse application domains. Recently, there hasbeen arenewed interest in studying theoretical propertiesof thesemethodsin thecontext of control and reinforcement learning. T hisarticle surveyssome of the recent developments on policy optimization, a gradient-based iterative approach for feedback control synthesis that hasbeen popularized by successesof reinforcement learning. We take an interdisciplinary perspective in our expo- sition that connects control theory, reinforcement learning, and large-scale optimization. We review anumber of recently developed theoretical results on the optimization landscape, global convergence, and sample complexity 123 but on the set of stabilizing gains K , it’s • coercive with compact sublevel sets, • smooth with bounded Hessian, & • degree-2 gradient dominated 𝐽 𝐾 − 𝐽∗ ≤ 𝑐𝑜𝑛𝑠𝑡. ∇𝐽 𝐾

Slide 1

Slide 1 text

Slide 2

Slide 2 text

Slide 3

Slide 3 text

Slide 4

Slide 4 text

Slide 5

Slide 5 text

Slide 6

Slide 6 text

Slide 7

Slide 7 text

Slide 8

Slide 8 text

Slide 9

Slide 9 text

Slide 10

Slide 10 text

Slide 11

Slide 11 text

Slide 12

Slide 12 text

Slide 13

Slide 13 text

Slide 14

Slide 14 text

Slide 15

Slide 15 text

Slide 16

Slide 16 text

Slide 17

Slide 17 text

Slide 18

Slide 18 text

Slide 19

Slide 19 text

Slide 20

Slide 20 text

Slide 21

Slide 21 text

Slide 22

Slide 22 text

Slide 23

Slide 23 text

Slide 24

Slide 24 text

Slide 25

Slide 25 text

Slide 26

Slide 26 text

Slide 27

Slide 27 text

Slide 28

Slide 28 text

Slide 29

Slide 29 text

Slide 30

Slide 30 text

Slide 31

Slide 31 text

Slide 32

Slide 32 text

Slide 33

Slide 33 text

Slide 34

Slide 34 text

Slide 35

Slide 35 text

Slide 36

Slide 36 text

Slide 37

Slide 37 text

Slide 38

Slide 38 text

Slide 39

Slide 39 text

Slide 40

Slide 40 text