
LQR Learning Pipelines

Florian Dörfler
October 27, 2024


Transcript

  1. 2 Data-driven control
     • indirect data-driven control: data → model + uncertainty → control
     • direct data-driven control: by-passes models & ID
     The direct approach is a viable alternative
     • for some applications: the model-based approach is too complex to be useful → complex processes, sensing modalities, environments
     • due to shortcomings of ID → cumbersome, models not identified for control, model selection, or incompatible uncertainty estimates
     • in adaptive settings: stability, nothing unmodeled, online efficiency
     • when sufficient brute-force data / compute / storage is available
     • … a long-debated topic with a long list of pros & cons: [link]
     • trade-offs: (non)modular, (in)tractable, (sub)optimal, data size, online adaptation
     today: give explicit answers for LQR
  2. 3 LQR
     • cornerstone of automatic control
     • parameterization (can be posed as convex SDP, as differentiable program, as …)
     • the benchmark for all data-driven control approaches of the last decades(!)
     • system: x⁺ = Ax + Bu + d with performance signal z = [Q^{1/2}x; R^{1/2}u] & state feedback u = Kx
     • design axes: indirect vs. direct & online/adaptive vs. offline/batch
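To make the benchmark concrete, here is a minimal numpy/scipy sketch (not from the slides; the system matrices A, B & weights Q, R are illustrative placeholders): solve the LQR via the discrete-time Riccati equation & evaluate the H₂ cost through the closed-loop Gramian.

```python
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

# hypothetical system x+ = A x + B u + d and weights (illustration only)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)

# model-based LQR: u = K x from the discrete algebraic Riccati equation
P = solve_discrete_are(A, B, Q, R)
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

# H2 cost ||T(K)||_2^2 = trace(Q Pc) + trace(K' R K Pc), with Pc the
# controllability Gramian of the closed loop: Pc = Acl Pc Acl' + I
Acl = A + B @ K
Pc = solve_discrete_lyapunov(Acl, np.eye(2))
cost = np.trace(Q @ Pc) + np.trace(K.T @ R @ K @ Pc)
print(K, cost)
```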
  3. 4 Contents
     1. model-based pipeline with model-free elements → data-driven parametrization & robustifying regularization
     2. model-free pipeline with model-based elements → adaptive method: policy gradient & sample covariance
     3. case studies: academic & power systems/electronics → LQR is an academic example but can be made useful
  4. 5 Contents
     1. model-based pipeline with model-free elements → data-driven parametrization & robustifying regularization
     with Pietro Tesi (Florence) & Claudio de Persis (Groningen)
  5. 6 Indirect & certainty-equivalence LQR
     • setting (from the paper excerpt on the slide): x(k+1) = Ax(k) + Bu(k) + d(k), z(k) = [Q^{1/2}x(k); R^{1/2}u(k)], where k ∈ ℕ, x ∈ ℝⁿ is the state, u ∈ ℝᵐ is the control input, d is a disturbance term, & z is the performance signal of interest; (A, B) stabilizable, weights Q ≻ 0 & R ≻ 0
     • LQR: design a state-feedback gain K that renders A + BK Schur & minimizes the H₂-norm of the closed-loop transfer function T(K): d → z; when A + BK is Schur, ‖T(K)‖₂² = trace(QP) + trace(KᵀRKP), where P is the controllability Gramian of the closed loop
     • data: a T-long time series of inputs, disturbances, states, & successor states
       U0 := [u(0) u(1) … u(T−1)] ∈ ℝ^{m×T}, D0 := [d(0) d(1) … d(T−1)] ∈ ℝ^{n×T},
       X0 := [x(0) x(1) … x(T−1)] ∈ ℝ^{n×T}, X1 := [x(1) x(2) … x(T)] ∈ ℝ^{n×T},
       satisfying the dynamics X1 = AX0 + BU0 + D0, i.e., X1 − D0 = [B A][U0; X0]; consecutive time series are convenient but not needed — the data may originate from independent experiments
     • collect I/O data (X0, U0, X1) with D0 unknown & PE: rank [U0; X0] = n + m
     • indirect & certainty-equivalence LQR (optimal in an MLE setting): least-squares SysID → certainty-equivalent LQR (see the sketch below)
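A minimal sketch of this indirect pipeline, assuming data arrays U0, X0, X1 with the shapes defined above (the helper name is ours):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def ce_lqr(U0, X0, X1, Q, R):
    """Indirect, certainty-equivalence LQR: least-squares ID, then Riccati."""
    W0 = np.vstack([U0, X0])              # regressor [U0; X0], rank n + m under PE
    BA = X1 @ np.linalg.pinv(W0)          # least-squares fit of [B A] from X1 = [B A] W0 + D0
    m = U0.shape[0]
    B_hat, A_hat = BA[:, :m], BA[:, m:]
    P = solve_discrete_are(A_hat, B_hat, Q, R)   # certainty-equivalent design
    return -np.linalg.solve(R + B_hat.T @ P @ B_hat, B_hat.T @ P @ A_hat)
```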
  6. 7 Direct approach from subspace relations in data
     • same data (X0, U0, X1) with X1 = AX0 + BU0 + D0 & PE data: rank [U0; X0] = n + m
     • subspace relations: under PE, for any K there is a G with [K; I] = [U0; X0]G, hence A + BK = [B A][U0; X0]G = (X1 − D0)G
     • data-driven LQR LMIs by substituting A + BK → X1G → certainty equivalence by neglecting the noise term D0G (see the sketch below)
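A sketch of the subspace substitution (our helper, assuming PE data & the noise-free approximation X1 G ≈ A + BK):

```python
import numpy as np

def closed_loop_from_data(K, U0, X0, X1):
    """Direct substitution A + B K = X1 G (exact when the noise D0 = 0)."""
    n = X0.shape[0]
    W0 = np.vstack([U0, X0])                        # PE: rank(W0) = n + m
    target = np.vstack([K, np.eye(n)])              # [K; I]
    G = np.linalg.lstsq(W0, target, rcond=None)[0]  # any solution of W0 G = [K; I]
    return X1 @ G                                   # ≈ A + B K up to the noise term D0 G
```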
  7. 8 Equivalence: direct + orthogonality ⇔ indirect
     • direct approach vs. indirect approach → the indirect optimizer has a nullspace → adding an orthogonality constraint to the direct program yields equivalent constraints
  8. 9 Regularized, certainty-equivalent, & direct LQR
     • orthogonality constraint lifted to a regularizer (equivalent for large λ) … but may not be robust (?)
     • interpolates between control & SysID
     • effect of noise entering the data: the Lyapunov constraint becomes perturbed by the noise term; for robustness this perturbation should be small → forced by a small ‖G‖ (see the sketch below)
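One way to code the regularized direct design is the SDP below: a cvxpy sketch assuming the standard data-driven LQR LMIs with decision variable G & P = X0 G, plus the regularizer λ‖(I − Π)G‖ on the nullspace component of G; treat it as illustrative rather than the exact program on the slide.

```python
import cvxpy as cp
import numpy as np
from scipy.linalg import sqrtm

def regularized_direct_lqr(U0, X0, X1, Q, R, lam):
    """Direct data-driven LQR with orthogonality regularizer (illustrative)."""
    m, n, T = U0.shape[0], X0.shape[0], U0.shape[1]
    W0 = np.vstack([U0, X0])
    Pi_perp = np.eye(T) - np.linalg.pinv(W0) @ W0   # projector onto null(W0)
    Rh = np.real(sqrtm(R))
    G = cp.Variable((T, n))
    P = cp.Variable((n, n), symmetric=True)         # P = X0 G, closed-loop Gramian
    Xv = cp.Variable((m, m), symmetric=True)        # bounds trace(K' R K P)
    cost = cp.trace(Q @ P) + cp.trace(Xv) + lam * cp.norm(Pi_perp @ G, "fro")
    constraints = [
        P == X0 @ G,
        cp.bmat([[Xv, Rh @ U0 @ G], [(Rh @ U0 @ G).T, P]]) >> 0,
        cp.bmat([[P - np.eye(n), X1 @ G], [(X1 @ G).T, P]]) >> 0,
    ]
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return U0 @ G.value @ np.linalg.inv(X0 @ G.value)   # K = U0 G (X0 G)^{-1}
```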
  9. 10 Performance & robustness certificates
     • compare the realized cost of the regularized design (with large λ) against the optimal cost attainable if the exact system matrices A & B were known
     • SNR (signal-to-noise ratio)
     • relative performance metric
     Certificate: for sufficiently large SNR, the optimal control problem is feasible (robustly stabilizing) with relative performance ~ 𝒪(1/SNR).
  10. 11 Numerical case study
      • case study [Dean et al. '19]: discrete-time system with noise variance σ² = 0.01 & variable regularization coefficient λ
      • [plots: % of stabilizing controllers & median relative performance error vs. regularization coefficient λ]
      • take-home message: regularization is needed for robustness & performance → breaks without the regularizer, works well with it … but learning is offline
  11. 12 Why online, adaptive, & what does it mean anyways?
      • shortcoming of separating offline learning & online control → cannot improve the policy online & cheaply / rapidly adapt to changes
      • (elitist) desired adaptive solution: direct, online (non-episodic / non-batch) algorithms, with closed-loop data, & recursive algorithmic implementation
      • “best” way to improve a policy with new data → go down the gradient!
      • “adaptive = improve over best control with a priori info” [G. Zames, “Adaptive Control: Towards a Complexity-Based General Theory,” Automatica 34(10):1161–1167, 1998]
      * disclaimer: a large part of the adaptive control community focuses on stability & not optimality
  12. 13 Contents
      2. model-free pipeline with model-based elements → adaptive method: policy gradient & sample covariance
      with Alessandro Chiuso (Padova), Feiran Zhao, & Keyou You (Tsinghua)
  13. 14 Ingredient 1: policy gradient methods
      • LQR viewed as a smooth program (many formulations); after eliminating the (unique) P, denote the objective as J(K)
      • J(K) is not convex, but on the set of stabilizing gains K it is
        • coercive with compact sublevel sets,
        • smooth with bounded Hessian, &
        • gradient dominated of degree 2: J(K) − J* ≤ const · ‖∇J(K)‖²
      Fact: policy gradient descent K⁺ = K − η∇J(K) initialized from a stabilizing policy converges linearly to K*.
      [survey: Hu, Zhang, Li, Mesbahi, Fazel, & Başar, “Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies,” Annu. Rev. Control Robot. Auton. Syst. 6:123–158, 2023]
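A minimal sketch of the iteration in the Fact above, with the exact gradient computed from two Lyapunov equations (known-model setting; cf. the Anderson-Moore formula on the next slide):

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def lqr_gradient(K, A, B, Q, R):
    """Exact policy gradient grad J(K) = 2 (R K + B' P (A + B K)) S, with P the
    value matrix and S the closed-loop controllability Gramian."""
    Acl = A + B @ K
    P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)   # P = Acl' P Acl + Q + K'RK
    S = solve_discrete_lyapunov(Acl, np.eye(A.shape[0]))  # S = Acl S Acl' + I
    return 2 * (R @ K + B.T @ P @ Acl) @ S

def policy_gradient_descent(K, A, B, Q, R, eta=1e-3, iters=5000):
    for _ in range(iters):            # linear convergence from a stabilizing K
        K = K - eta * lqr_gradient(K, A, B, Q, R)
    return K
```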
  14. 15 Model-free policy gradient methods
      • policy gradient K⁺ = K − η∇J(K) converges linearly to K*
      • model-based setting: explicit Anderson-Moore formula for ∇J(K) based on closed-loop controllability + observability Gramians
      • model-free 0th-order methods construct a two-point gradient estimate from numerous & very long trajectories → extremely sample-inefficient (see the sketch below):

        relative performance gap ε         | 1    | 0.1   | 0.01
        # trajectories (100 samples each)  | 1414 | 43850 | 142865

      • IMO: policy gradient is a potentially great candidate for direct adaptive control but sadly useless in practice: sample-inefficient, episodic, …
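A sketch of the two-point zeroth-order estimator referenced above, assuming access to a cost oracle J(·) evaluated by (long) rollouts; names & constants are illustrative:

```python
import numpy as np

def two_point_gradient(J, K, r=0.01, n_dirs=100):
    """Model-free two-point gradient estimate of grad J at K: average of
    (J(K + rU) - J(K - rU)) / (2r) * U over random unit directions U.
    Each direction costs two rollouts -> the sample inefficiency above."""
    grad = np.zeros_like(K)
    for _ in range(n_dirs):
        U = np.random.randn(*K.shape)
        U /= np.linalg.norm(U)                       # random unit direction
        grad += (J(K + r * U) - J(K - r * U)) / (2 * r) * U
    return (K.size / n_dirs) * grad                  # sphere-smoothing scaling
```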
  15. 16 Ingredient 2: covariance parameterization
      prior parameterization
      • PE condition: full row rank of [U0; X0]
      • A + BK = [B A][K; I] = [B A][U0; X0]G = X1 G
      • robustness: G = [U0; X0]ᵀ · requires regularization
      • dimension of all matrices grows with t
      covariance parameterization
      • sample covariance Λ = (1/t)[U0; X0][U0; X0]ᵀ ≻ 0
      • A + BK = [B A][K; I] = [B A]ΛV = (1/t) X1 [U0; X0]ᵀ V = X̄1 V
      • robustness for free, without regularization
      • dimension of all matrices is constant
      with data U0 = [u(0) u(1) ⋯ u(t−1)], X0 = [x(0) x(1) ⋯ x(t−1)], X1 = [x(1) x(2) ⋯ x(t)] & noise-free dynamics X1 = AX0 + BU0
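A small numpy sketch of the covariance parameterization (noise-free data assumed; the helper name is ours):

```python
import numpy as np

def covariance_parameterization(K, U0, X0, X1):
    """Constant-dimension parameterization: [K; I] = Lam V with the sample
    covariance Lam = (1/t) W0 W0', W0 = [U0; X0]."""
    n, t = X0.shape
    W0 = np.vstack([U0, X0])
    Lam = (W0 @ W0.T) / t                    # (m+n) x (m+n), pos. def. under PE
    X1bar = (X1 @ W0.T) / t                  # n x (m+n)
    V = np.linalg.solve(Lam, np.vstack([K, np.eye(n)]))  # V = Lam^{-1} [K; I]
    return X1bar @ V                         # = A + B K for noise-free data
```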
  16. 17 Covariance parameterization of the LQR
      • state/input sample covariance Λ = (1/t)[U0; X0][U0; X0]ᵀ & X̄1 = (1/t) X1 [U0; X0]ᵀ
      • closed-loop matrix A + BK = X̄1 V with [K; I] = ΛV = [Ū0; X̄0]V
      • covariance parameterization after eliminating K, with variable V: smooth {LQR cost + Lyapunov eqn} & linear covariance-matrix constraint

        min_{V, Σ≻0}  trace(QΣ) + trace(Vᵀ Ū0ᵀ R Ū0 V Σ)
        s.t.  Σ = I + X̄1 V Σ Vᵀ X̄1ᵀ,   I = X̄0 V
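To make the elimination concrete, a sketch that evaluates J(V) by solving the Lyapunov equation for the unique Σ, assuming X̄1 V is Schur (the helper name is ours):

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def J_of_V(V, Lam, X1bar, Q, R, m):
    """Covariance-parameterized LQR cost: eliminate Sigma from
    Sigma = I + (X1bar V) Sigma (X1bar V)', then evaluate
    trace(Q Sigma) + trace(K' R K Sigma) with K = U0bar V."""
    U0bar = Lam[:m, :]                       # top rows of Lam = [U0bar; X0bar]
    Acl = X1bar @ V                          # closed loop A + B K
    Sig = solve_discrete_lyapunov(Acl, np.eye(Acl.shape[0]))
    K = U0bar @ V                            # since [K; I] = Lam V
    return np.trace(Q @ Sig) + np.trace(K.T @ R @ K @ Sig)
```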
  17. 18 Policy gradient with covariance parameterization
      • warm-up scenario: offline PE data (X0, U0, X1) without disturbances d(t)
      • parameterization (as above):

        min_{V, Σ≻0}  trace(QΣ) + trace(Vᵀ Ū0ᵀ R Ū0 V Σ)
        s.t.  Σ = I + X̄1 V Σ Vᵀ X̄1ᵀ,   I = X̄0 V

        after eliminating the (unique) Σ, we denote the cost (blue on the slide) by J(V)
      • data-enabled policy optimization (DeePO) via projected gradient
        V⁺ = V − η Π_{X̄0}(∇J(V))
        where Π_{X̄0} projects onto the constraint I = X̄0 V & the gradient ∇J(V) is computed by solving two Lyapunov equations parameterized by sample covariance matrices (see the sketch below)
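A sketch of one DeePO step under these definitions; the gradient expression below is our reading (obtained by the same Lagrangian computation as in the model-based case), not a verbatim formula from the slide:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def deepo_step(V, Lam, X1bar, Q, R, m, eta):
    """One projected-gradient step V+ = V - eta * Pi_{X0bar}(grad J(V))."""
    U0bar, X0bar = Lam[:m, :], Lam[m:, :]
    Acl = X1bar @ V                                          # A + B K
    Sig = solve_discrete_lyapunov(Acl, np.eye(Acl.shape[0])) # Sig = I + Acl Sig Acl'
    P = solve_discrete_lyapunov(
        Acl.T, Q + V.T @ U0bar.T @ R @ U0bar @ V)            # P = Acl' P Acl + Q + K'RK
    grad = 2 * (U0bar.T @ R @ U0bar @ V + X1bar.T @ P @ X1bar @ V) @ Sig
    # project the gradient onto null(X0bar) so that X0bar V+ = I is preserved
    Pi = np.eye(V.shape[0]) - np.linalg.pinv(X0bar) @ X0bar
    return V - eta * (Pi @ grad)
```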
  18. 19 Features of data-enabled policy optimization (DeePO)
      • optimization landscape: for any feasible V ∈ 𝒮_a = {V | J(V) ≤ a}
        • projected gradient dominance of degree 1: J(V) − J* ≤ μ(a)‖Π_{X̄0}(∇J(V))‖
        • smoothness with a bounded Hessian: ‖∇²J(V)‖ ≤ l(a)
      • simulation: 4th-order system, 8 samples
      Sublinear convergence: for a feasible initialization V₀ ∈ 𝒮_a & step size η ∈ (0, 1/l(a)], for all ε > 0, J(V_k) − J* ≤ ε whenever k ≥ 2μ(a)² / (ε·(2η − l(a)η²)).
      note: empirically we observe a linear rate, faster than k ≥ 𝒪(1/ε)
  19. 20 Online & adaptive DeePO
      • features: direct, online, closed-loop data, & recursive implementation, where X_{0,t+1} = [x(0), x(1), …, x(t)] is updated with the new sample x(t+1), & similarly for the other matrices
      • policy update (input: (X_{0,t+1}, U_{0,t+1}, X_{1,t+1}), K_t; output: K_{t+1}):
        ① V_{t+1} = Λ_{t+1}⁻¹ [K_t; I_n]
        ② V′_{t+1} = V_{t+1} − η Π_{X̄_{0,t+1}}(∇J_{t+1}(V_{t+1}))
        ③ K_{t+1} = Ū_{0,t+1} V′_{t+1}
      • cheap & recursive implementation: rank-1 update of the sample covariances, cheap computation, & no memory needed for old data (see the sketch below)
      [block diagram: plant (A, B) driven by u(t) & disturbance d(t), controller gain updated online]
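A sketch of the recursive implementation, reusing the hypothetical deepo_step helper from the previous sketch; the rank-1 recursions below encode “no memory needed for old data”:

```python
import numpy as np

def deepo_online_update(Lam, X1bar, t, x, u, x_next, K, Q, R, eta):
    """Rank-1 update of the sample covariances, then one DeePO step."""
    phi = np.concatenate([u, x])[:, None]          # regressor [u(t); x(t)]
    Lam1 = (t * Lam + phi @ phi.T) / (t + 1)       # Lam_{t+1}
    X1bar1 = (t * X1bar + x_next[:, None] @ phi.T) / (t + 1)
    m, n = u.size, x.size
    V = np.linalg.solve(Lam1, np.vstack([K, np.eye(n)]))  # step 1: V = Lam^{-1}[K; I]
    V = deepo_step(V, Lam1, X1bar1, Q, R, m, eta)         # step 2: projected gradient
    K_next = Lam1[:m, :] @ V                              # step 3: K = U0bar V
    return Lam1, X1bar1, K_next
```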
  20. 21 Underlying assumptions for theoretic certificates
      • initially stabilizing controller: the LQR problem parameterized by the offline data (X_{0,t₀}, U_{0,t₀}, X_{1,t₀}) is feasible with stabilizing gain K_{t₀}
      • BIBO: there are ū, x̄ such that ‖u(t)‖ ≤ ū & ‖x(t)‖ ≤ x̄
      • persistency of excitation due to process noise or probing: σ_min(ℋ_{n+1}(U_{0,t})) ≥ γ√t with Hankel matrix ℋ_{n+1}(U_{0,t})
      • bounded noise: ‖d(t)‖ ≤ δ for all t → signal-to-noise ratio SNR := γ/δ
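A sketch of checking the excitation condition with a hand-rolled block-Hankel constructor (the √t scaling follows our reading of the assumption):

```python
import numpy as np

def block_hankel(U, L):
    """Block Hankel matrix H_L(U) with L block rows from the columns of U (m x t)."""
    cols = U.shape[1] - L + 1
    return np.vstack([U[:, i:i + cols] for i in range(L)])

def is_persistently_exciting(U, n, gamma):
    """PE test: smallest singular value of H_{n+1}(U) >= gamma * sqrt(t)."""
    t = U.shape[1]
    H = block_hankel(U, n + 1)
    return np.linalg.svd(H, compute_uv=False)[-1] >= gamma * np.sqrt(t)
```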
  21. 22 Bounded regret of DeePO in the adaptive setting
      • average regret performance metric: Regret_T := (1/T) Σ_{t=t₀}^{t₀+T−1} (J(K_t) − J*)
      Sublinear regret: under the assumptions, there are ν₁, ν₂, ν₃, ν₄ > 0 such that for η ∈ (0, ν₁] & SNR ≥ ν₂, every K_t is stabilizing & Regret_T ≤ ν₃/√T + ν₄/SNR.
      • comments on the qualitatively expected result:
        • the analysis is independent of the noise statistics & consistent: Regret_{T→∞} → 0
        • favorable sample complexity: the sublinear term matches the best rate 𝒪(1/√T) of first-order methods in online convex optimization
        • empirically we observe a smaller bias term: 𝒪(1/SNR²) & not 𝒪(1/SNR)
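For completeness, the metric itself as code, reusing a hypothetical cost oracle J:

```python
import numpy as np

def average_regret(J, gains, J_star):
    """Regret_T = (1/T) * sum over the online gains K_t of (J(K_t) - J*)."""
    return np.mean([J(K) - J_star for K in gains])
```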
  22. 23 Comparison case study
      • case study [Dean et al. '19]: discrete-time system & noise variance σ² = 0.01
      • [plot: relative performance gap (J(K_t) − J*)/J* over time]
      • comparison: offline LQR vs. direct adaptive DeePO vs. indirect adaptive (RLS + dlqr) [Wang et al. '21, Lu et al. '23]
      → adaptive outperforms offline
      → direct/indirect rates are matching, but the implementation effort is much higher for the indirect method
  23. 24 Comparison of computational & sample complexity
      • [plots: compute time to reach accuracy ε & compute time for increasing dimension n]
      ← direct DeePO significantly outperforms the indirect adaptive design in computational effort

        relative performance gap ε                           | 1    | 0.1   | 0.01
        # long trajectories (100 samples) for 0th-order LQR  | 1414 | 43850 | 142865
        DeePO (# I/O samples)                                | 10   | 24    | 48

      ↓ DeePO requires significantly fewer data samples than model-free 0th-order gradient methods
  24. 25 Power systems / electronics case study
      • wind turbine becomes unstable in weak grids with nonlinear oscillations
      • converter, turbine, & grid are a black box for the commissioning engineer
      • construct the state from time shifts (5 ms sampling) of y(t), u(t) & use DeePO (see the sketch below)
      • setting: synchronous generator & full-scale converter
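A sketch of the time-shift state construction (delay embedding; the depth L is an illustrative choice):

```python
import numpy as np

def delay_embedded_state(y_hist, u_hist, L):
    """Stack the last L output samples & L-1 input samples (5 ms apart) into a
    surrogate state x(t) = [y(t); ...; y(t-L+1); u(t-1); ...; u(t-L+1)].
    y_hist / u_hist are lists of 1-D arrays, most recent sample last."""
    ys = [y_hist[-1 - i] for i in range(L)]        # y(t), y(t-1), ...
    us = [u_hist[-1 - i] for i in range(1, L)]     # u(t-1), u(t-2), ...
    return np.concatenate(ys + us)
```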
  25. 26 Power systems / electronics case study
      [plot: active power (p.u.) vs. time [s] — probe & collect data, oscillation observed, activate DeePO; traces: without DeePO, with DeePO (100 iterations), with DeePO (1 iteration)]
  26. 27 Power systems / electronics case study
      [plot: active power (p.u.) vs. time [s] — probe & collect data, oscillation observed, activate DeePO; traces: without DeePO, with adaptive DeePO]
  27. 28 Conclusions
      • Summary
        • model-based pipeline with model-free block: data-driven LQR parametrization → works well when regularized (note: further flexible regularizations available)
        • model-free pipeline with model-based block: policy gradient & sample covariance → DeePO is adaptive, online, with closed-loop data, & recursive implementation
        • academic case studies & can be made useful in power systems
      • Future work
        • technicalities: weaken assumptions & improve rates
        • control: based on output feedback & for other objectives
        • further system classes: stochastic, time-varying, & nonlinear
        • open questions: online vs. episodic? “best” batch size?