Slide 19
Model-based policy gradient
$$\min_{K}\; J(K) = \mathrm{Tr}(Q\Sigma) + \mathrm{Tr}(K^{T} R K \Sigma)$$
Reference: B. Hu, K. Zhang, N. Li, M. Mesbahi, M. Fazel, and T. Başar, "Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies," Annual Review of Control, Robotics, and Autonomous Systems 6:123–158, 2023. https://doi.org/10.1146/annurev-control-042920-020021
Fact: For an initial stabilizing gain $K_0$ and a sufficiently small step size $\eta$, gradient descent $K^{+} = K - \eta \nabla J(K)$ converges linearly to the optimal gain $K^{*}$.
Here $\Sigma$ is the closed-loop state covariance, solving the Lyapunov equation $\Sigma = I + (A + BK)\,\Sigma\,(A + BK)^{T}$.
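A minimal numerical sketch of this fact, assuming the standard discrete-time LQR policy-gradient formula $\nabla J(K) = 2\big((R + B^{T} P B)K + B^{T} P A\big)\Sigma$, where $P$ is the closed-loop cost Gramian (this matches the $u = Kx$ convention implied by the closed loop $A + BK$ above); the system matrices, initial gain, and step size below are hypothetical:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def lqr_cost_and_grad(K, A, B, Q, R):
    """J(K) = Tr(Q Sigma) + Tr(K^T R K Sigma) and its gradient.

    Sigma solves Sigma = I + (A+BK) Sigma (A+BK)^T   (state covariance),
    P     solves P = Q + K^T R K + (A+BK)^T P (A+BK) (cost-to-go Gramian).
    """
    Acl = A + B @ K
    Sigma = solve_discrete_lyapunov(Acl, np.eye(A.shape[0]))
    P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
    J = np.trace(Q @ Sigma) + np.trace(K.T @ R @ K @ Sigma)
    grad = 2.0 * ((R + B.T @ P @ B) @ K + B.T @ P @ A) @ Sigma
    return J, grad

# Hypothetical 2-state, 1-input instance; K is chosen stabilizing.
A = np.array([[1.0, 0.5], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K = np.array([[-0.5, -1.0]])      # initial stabilizing gain K0
eta = 1e-3                        # small step size
for t in range(5000):
    J, G = lqr_cost_and_grad(K, A, B, Q, R)
    K = K - eta * G               # K+ = K - eta * grad J(K)
print(J)  # approaches the optimal LQR cost at a linear rate
```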
Algorithm: model-based adaptive control via policy gradient
1. Data collection: refresh the dataset $(X_0, U_0, X_1)$.
2. Identification: estimate $\hat{A}, \hat{B}$ via recursive least squares.
3. Policy gradient step: $K^{+} = K - \eta \nabla J(K)$, computed from the estimates $\hat{A}, \hat{B}$ and the closed-loop Gramians $\Sigma, P$ (see the sketch after this list).
4. Actuate with the new gain and repeat.
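A self-contained sketch of this loop, under assumed notation: $X_0, U_0$ collect past states and inputs and $X_1$ their successors, so $X_1 \approx [A\;B][X_0; U_0]$. Batch least squares stands in for the recursive update, and the system matrices, noise levels, exploratory input noise, and step size are hypothetical:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

rng = np.random.default_rng(0)
A_true = np.array([[1.0, 0.5], [0.0, 0.9]])   # unknown to the controller
B_true = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
n, m = 2, 1

K = np.array([[-0.5, -1.0]])   # initial stabilizing gain K0
eta = 1e-3
x = np.zeros((n, 1))
X0, U0, X1 = [], [], []

for t in range(200):
    # 1. Data collection: actuate with exploratory noise, refresh (X0, U0, X1).
    u = K @ x + 0.1 * rng.standard_normal((m, 1))
    x_next = A_true @ x + B_true @ u + 0.01 * rng.standard_normal((n, 1))
    X0.append(x); U0.append(u); X1.append(x_next)
    x = x_next
    if t < 20:
        continue  # wait for enough data before identifying

    # 2. Identification: fit [A_hat B_hat] by least squares
    #    (batch refit here; a recursive LS update would be incremental).
    Z = np.hstack([np.hstack(X0).T, np.hstack(U0).T])  # rows [x_t^T  u_t^T]
    Y = np.hstack(X1).T                                # rows x_{t+1}^T
    theta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    A_hat, B_hat = theta.T[:, :n], theta.T[:, n:]

    # 3. Policy gradient step on the identified model.
    Acl = A_hat + B_hat @ K
    if np.max(np.abs(np.linalg.eigvals(Acl))) >= 1.0:
        continue  # skip the update if the estimated closed loop is unstable
    Sigma = solve_discrete_lyapunov(Acl, np.eye(n))
    P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
    grad = 2.0 * ((R + B_hat.T @ P @ B_hat) @ K + B_hat.T @ P @ A_hat) @ Sigma
    K = K - eta * grad
    # 4. The next loop iteration actuates with the updated K and repeats.
```

The gain improves gradually as the model estimates sharpen; the stability check on the estimated closed loop is a pragmatic safeguard, since early estimates $\hat{A}, \hat{B}$ can make the Lyapunov equations unsolvable.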