LQR Learning Pipelines - Speaker Deck

Slide 1

Slide 1 text

LQR Learning Pipelines between reinforcement learning & adaptive control Florian Dörfler, ETH Zürich IEEE 14th Data Driven Control and Learning Systems Conference, Wuxi, China, 2025 1

Slide 12

Slide 12 text

● shortcoming of separating offline learning & online control → cannot improve policy online & cheaply / rapidly adapt ● ● desired adaptive solution: online (non-episodic / batch) algorithm, with closed-loop data, recursive implementation, & (in?)direct ● how to “best” improve policy online → go down the gradient ! PII: S0005–1098(98)00089–2 Automatica, Vol. 34, No. 10, pp. 1161— 1167, 1998 IFAC. Published by Elsevier Scienc All rights reserved. Printed in Great B 0005-1098/98 $—see front m Adaptive Control: Towards a Complexity-Based General Theory* G. ZAM ES- Key Words—H control; adaptive control; learning control; performance analysis. Abstract—Two recent developments are pointing the way towards an input— output theory of H ! l adaptive feedback: The solution of problems involving: (1) feedback performance exact optimization under large plant uncertainty on the one hand (thetwo-disc problem of H ); and (2) optimally fast identification in H on the other. Taken together, these are yielding adaptive algorithms for slowly varying data in H ! l . At a conceptual level, theseresultsmotivatea general input— output theory linking identification, adaptation, and control learning. In such a theory, thedefinition of adaptation isbased on system performance under uncertainty, and is independent of internal structure, presence or absence of variable parameters, or even feedback. 1998 IFAC. Published by Elsevier Science Ltd. All rights reserved. 1. INTRODUCTION What should the terms ‘‘adaptive’’ and ‘‘learning’’ mean in the context of control?Is it possibleto tell whether or not a black box is adaptive without certain difficulties. Controllers with ident external behavior can have an endless vari of parametrizations; variable parameters in o parametrization may be replaced by a fixed pa meter nonlinearity in another. In most of therec control literature there is no clear separation tween the concepts of adaptation and nonlin feedback, or between research on adaptive cont and nonlinear stability. This lack of clarity exten to fields other than control; e.g. in debates as whether neural nets do or do not have a learn capacity;or in theclassical 1960sChomsky vsSk ner argument as to whether children’s langu skills arelearned from theenvironment tabula r style, or to a largeextent are‘‘built in’’. (How co one tell the difference anyway?). It can be argu that the lack of a conceptual framework for ad * disclaimer: a large part of the adaptive control community focuses on stability & not optimality monotonicity principles of adaptive control: acquire information & improve control performance over time Towards an online & adaptive implementation 12

Slide 19

Slide 19 text

Model-based policy gradient min 𝐾 𝐽 𝐾 = 𝑇𝑟 𝑄Σ) + 𝑇𝑟(𝐾𝑇𝑅𝐾Σ Annual Review of Control, Robotics , and AutonomousSys tems Toward aT heoretical Foundation of Policy Optimization for Learning Control Policies Bin Hu,1 Kaiqing Zhang,2,3 Na Li,4 Mehran Mesbahi,5 Maryam Fazel,6 and Tamer Ba¸ sar1 1Coordinated Science Laboratory and Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, Illinois, USA; email: binhu7@ illinois.edu, basar1@ illinois.edu 2Laboratory for Information and Decision Systems and Computer Science and Artif cial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA 3Current aff liation: Department of Electrical and Computer Engineering and Institute for Systems Research, University of Maryland, College Park, Maryland, USA; email: kaiqing@ umd.edu 4School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts, USA; email: nali@ seas.harvard.edu 5Department of Aeronautics and Astronautics, University of Washington, Seattle, Washington, USA; email: mesbahi@ uw.edu 6Department of Electrical and Computer Engineering, University of Washington, Seattle, Washington, USA; email: mfazel@ uw.edu Annu. Rev. Control Robot. Auton. Syst. 2023. 6:123–58 T he Annual Review of Control, Robotics , and AutonomousSystemsisonline at control.annualreviews.org https://doi.org/10.1146/annurev-control-042920- 020021 Copyright © 2023 by the author(s). T his work is licensed under aCreative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. See credit lines of images or other third-party material in this article for license information. Keywords policy optimization, reinforcement learning, feedback control synthesis Abstract Gradient-based methodshave been widely used for system design and optimization in diverse application domains. Recently, there hasbeen arenewed interest in studying theoretical propertiesof thesemethodsin thecontext of control and reinforcement learning. T hisarticle surveyssome of the recent developments on policy optimization, a gradient-based iterative approach for feedback control synthesis that hasbeen popularized by successesof reinforcement learning. We take an interdisciplinary perspective in our expo- sition that connects control theory, reinforcement learning, and large-scale Fact: For initial 𝐾0 stabilizing & small 𝜂, gradient descent 𝐾+ = 𝐾 − 𝜂 ∇𝐽 𝐾 converges linearly to 𝐾∗. where Σ = 𝐼 + 𝐴 + 𝐵𝐾 Σ 𝐴 + 𝐵𝐾 𝑇 Algorithm: model-based adaptive control via policy gradient 1. data collection: refresh (𝑋0 , 𝑈0 , 𝑋1 ) 2. identification of ෠ 𝐵, መ 𝐴 via recursive LS 3. policy gradient: 𝐾+ = 𝐾 − 𝜂 ∇𝐽 𝐾 using estimates ෠ 𝐵, መ 𝐴 & closed-loop Gramians Σ, 𝑃 actuate & repeat 19

Slide 33

Slide 33 text

Information metric • bounded noise covariance: 1 𝑡 𝑊0 𝑊0 𝑇 ≤ 𝛿𝑡 2 for some 𝛿𝑡 ≥ 0 • persistency of excitation due to probing: 𝜎 Λt ≥ 𝛾𝑡 2 for some 𝛾𝑡 ≥ 0 • information metric = signal-to-noise ratio 𝑆𝑁𝑅𝑡 ≔ Τ 𝛾𝑡 𝛿𝑡 Uniform 𝛿𝑡 ∼ 𝑂(1) Constant 𝛾𝑡 ∼ 𝑂(1) 𝑆𝑁𝑅𝑡 ∼ 𝑂(1) Gaussian 𝛿𝑡 ∼ 𝑂(1/ 𝑡) Constant 𝛾𝑡 ∼ 𝑂(1) 𝑆𝑁𝑅𝑡 ∼ 𝑂( 𝑡) Decay 𝛾𝑡 ∼ 𝑂(𝑡−1/4) 𝑆𝑁𝑅𝑡 ∼ 𝑂(𝑡1/4) satisfies Zames’ first monotonicity principle: information acquisition = SNR increases 33 PII: S0005–1098(98)00089–2 Automatica, Vol. 34, No. 10, pp. 1161— 1167, 199 1998 IFAC. Published by Elsevier Science L All rights reserved. Printed in Great Brita 0005-1098/98 $—see front matt Adaptive Control: Towards a Complexity-Based General Theory* G. ZAM ES- Key Words—H control; adaptive control; learning control; performance analysis. Abstract—Two recent developments are pointing the way towards an input— output theory of H ! l adaptive feedback: The solution of problems involving: (1) feedback performance exact optimization under large plant uncertainty on the one hand (thetwo-disc problem of H ); and (2) optimally fast identification in H on the other. Taken together, these are yielding adaptive algorithms for slowly varying data in H ! l . At a conceptual level, theseresultsmotivatea general input— output theory linking identification, adaptation, and control learning. In such a theory, thedefinition of adaptation isbased on system performance under uncertainty, and is independent of internal structure, presence or absence of variable parameters, or even feedback. 1998 IFAC. Published by Elsevier Science Ltd. All rights reserved. 1. INTRODUCTION What should the terms ‘‘adaptive’’ and ‘‘learning’’ mean in the context of control?Is it possible to tell whether or not a black box is adaptive without knowledge of its internal structure? In design, is it possible to determine beforehand whether it is ne- cessary for a controller to adapt and learn in order to meet performance specifications, or is adaptation a matter of choice? In this overview we shall describe recent work in the H framework which provides a means of computing certain kinds of adaptive controllers, but which also sheds some light on these more conceptual questions. Despite the long history of research on adaptive control, and the considerable practical success of adaptive strategies associated with the names of certain difficulties. Controllers with identica external behavior can have an endless variet of parametrizations; variable parameters in on parametrization may be replaced by a fixed para meter nonlinearity in another. In most of therecen control literature there is no clear separation be tween the concepts of adaptation and nonlinea feedback, or between research on adaptive contro and nonlinear stability. This lack of clarity extend to fields other than control; e.g. in debates as t whether neural nets do or do not have a learnin capacity;or in theclassical 1960sChomsky vsSkin ner argument as to whether children’s languag skills arelearned from theenvironment tabula ras style, or to a largeextent are‘‘built in’’. (How coul one tell the difference anyway?). It can be argue that the lack of a conceptual framework for adap tive control has inhibited research in this area an made it difficult to compare alternative designs. We would like to re-examine these issues in th light of recent developments linking the theories o feedback, identification, complexity and time-vary ing optimization. The perspective here is actuall not new, having been outlined by theauthor on an off since the 1970s (Zames, 1976, 1979, 1981, 1989 However, the key mathematical details have bee worked out only recently, notably in joint wor with Lin et al. (Lin et al., 1992; Zames and Wang

Slide 1

Slide 1 text

Slide 2

Slide 2 text

Slide 3

Slide 3 text

Slide 4

Slide 4 text

Slide 5

Slide 5 text

Slide 6

Slide 6 text

Slide 7

Slide 7 text

Slide 8

Slide 8 text

Slide 9

Slide 9 text

Slide 10

Slide 10 text

Slide 11

Slide 11 text

Slide 12

Slide 12 text

Slide 13

Slide 13 text

Slide 14

Slide 14 text

Slide 15

Slide 15 text

Slide 16

Slide 16 text

Slide 17

Slide 17 text

Slide 18

Slide 18 text

Slide 19

Slide 19 text

Slide 20

Slide 20 text

Slide 21

Slide 21 text

Slide 22

Slide 22 text

Slide 23

Slide 23 text

Slide 24

Slide 24 text

Slide 25

Slide 25 text

Slide 26

Slide 26 text

Slide 27

Slide 27 text

Slide 28

Slide 28 text

Slide 29

Slide 29 text

Slide 30

Slide 30 text

Slide 31

Slide 31 text

Slide 32

Slide 32 text

Slide 33

Slide 33 text

Slide 34

Slide 34 text

Slide 35

Slide 35 text

Slide 36

Slide 36 text

Slide 37

Slide 37 text

Slide 38

Slide 38 text

Slide 39

Slide 39 text

Slide 40

Slide 40 text