Chaoyue Zhao (University of Washington, Seattle, USA) Provably Convergent Policy Optimization via Metric-aware Trust Region Methods

Provably Convergent Policy Optimization via Metric-aware Trust Region Methods
Chaoyue Zhao
Industrial and Systems Engineering, University of Washington
OT-DOM Workshop 2024
Zhao, UW WPO/SPO March 2024 1 / 30

Reinforcement Learning
What is Reinforcement Learning (RL)?
▶ An agent acts in an uncertain environment
▶ The agent obtains rewards through interactions with the environment

Components of Reinforcement Learning
Model of environment: Markov Decision Process (MDP)
▶ Defines how the environment reacts to actions
▶ Denoted by the tuple $(S, A, P, r, \gamma)$: $S$ - state space, $A$ - action space, $r$ - reward function $S \times A \to \mathbb{R}$, $P$ - transition probabilities $S \times A \times S \to \mathbb{R}$, $\gamma$ - discount factor
Policy specifies the strategy of the RL agent
▶ Provides a guideline on the action to take in a given state
▶ Different representations: stochastic policy $\pi(a|s)$, deterministic policy $\pi(s)$

Components of Reinforcement Learning
Objective of RL - maximize the discounted return:
▶ $J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$
Functions that reflect the expected future rewards
▶ Q function
  ⋆ $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r(s_{t+k+1}, a_{t+k+1}) \mid s_t = s, a_t = a\right]$
▶ V function
  ⋆ $V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot|s)} Q^{\pi}(s, a)$
▶ Advantage function
  ⋆ $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$
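The relation between the Q, V and advantage functions can be checked numerically for a single state. The sketch below is illustrative only: the helper name `value_and_advantage` and the numbers are assumptions, not taken from the talk. It shows that the advantage averages to zero under the policy.

```python
def value_and_advantage(q_row, pi_row):
    """Return V(s) and the list of A(s, a) for one state.

    q_row[a] is Q(s, a) and pi_row[a] is pi(a|s); both are hypothetical inputs.
    """
    v = sum(p * q for p, q in zip(pi_row, q_row))  # V(s) = E_{a~pi} Q(s, a)
    adv = [q - v for q in q_row]                   # A(s, a) = Q(s, a) - V(s)
    return v, adv


q_row = [1.0, 2.0, 4.0]
pi_row = [0.5, 0.25, 0.25]
v, adv = value_and_advantage(q_row, pi_row)
# The policy-weighted mean of the advantage is zero by construction.
mean_adv = sum(p * a for p, a in zip(pi_row, adv))
```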

Key Concepts of Reinforcement Learning
Model-based
▶ Know the model of the environment, e.g., AlphaZero [Silver et al., 2017]
▶ Learn the model of the environment, e.g., Dyna [Sutton, 1991]
▶ Example: Q value iteration: $Q_{k+1}(s, a) = \sum_{s'} p(s'|s, a)\left(r(s, a) + \gamma \max_{a'} Q_k(s', a')\right)$
Model-free
▶ Solve for the strategy without using or learning the environment model
▶ Example: Q-learning: $Q_{k+1}(s, a) = Q_k(s, a) + \alpha\left[r(s, a) + \gamma \max_{a'} Q_k(s', a') - Q_k(s, a)\right]$
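The two update rules above can be sketched on a toy problem. Everything below is an assumption made for illustration: the two-state MDP, the data layout `P[s][a] = [(next_state, prob), ...]` and the function names. The model-based routine uses the transition model, while the model-free step only uses one sampled transition.

```python
def q_value_iteration(P, r, gamma, iters=200):
    """Model-based: repeatedly apply the Q value iteration backup."""
    n_s, n_a = len(r), len(r[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        # Synchronous update: the right-hand side only reads the old Q.
        Q = [[sum(p * (r[s][a] + gamma * max(Q[s2])) for s2, p in P[s][a])
              for a in range(n_a)] for s in range(n_s)]
    return Q


def q_learning_step(Q, s, a, reward, s_next, alpha, gamma):
    """Model-free: one Q-learning update from a sampled transition."""
    td_target = reward + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q


# Toy MDP (hypothetical): action 0 stays, action 1 switches state;
# staying in state 1 pays reward 1, everything else pays 0.
P = [[[(0, 1.0)], [(1, 1.0)]],
     [[(1, 1.0)], [(0, 1.0)]]]
r = [[0.0, 0.0], [1.0, 0.0]]
Q_star = q_value_iteration(P, r, gamma=0.5)

Q_sampled = q_learning_step([[0.0, 0.0], [0.0, 0.0]],
                            s=1, a=0, reward=1.0, s_next=1,
                            alpha=0.5, gamma=0.5)
```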

Model-Free Reinforcement Learning
Value-based
▶ Only the value function is stored and learnt
▶ Policy is implicit - derived from the value function
▶ Examples: Q-learning [Watkins, 1989], DQN [Mnih et al., 2013], etc.
Policy-based
▶ Store and learn the policy directly
▶ Examples: REINFORCE [Williams, 1992], actor-critic [Konda and Tsitsiklis, 2000], deterministic policy gradients [Silver et al., 2014], deep deterministic policy gradients [Lillicrap et al., 2015], etc.

Policy Gradient
Policy gradient (PG) is a prominent RL method
▶ Model-free
▶ Policy-based
▶ Policy is modeled as $\pi_\theta(a|s)$
  ⋆ A probability distribution controlled by parameter $\theta$
▶ Gradient ascent is applied to move $\theta$ towards the highest return
  ⋆ i.e., $\theta \leftarrow \theta + \alpha \nabla_\theta J(\pi_\theta)$
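A minimal sketch of the gradient ascent rule, assuming a stateless three-armed bandit with a softmax policy (a setting chosen here for brevity, not one used in the talk). In this special case the exact gradient is $\partial J / \partial \theta_a = \pi_\theta(a)(r_a - J)$, so no sampling is needed; the reward values and step size are illustrative.

```python
import math


def softmax(theta):
    m = max(theta)  # subtract the max for numerical stability
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def expected_return(theta, rewards):
    return sum(p * r for p, r in zip(softmax(theta), rewards))


def gradient(theta, rewards):
    # Exact gradient for a softmax bandit: dJ/dtheta_a = pi(a) * (r_a - J).
    pi = softmax(theta)
    j = sum(p * r for p, r in zip(pi, rewards))
    return [p * (r - j) for p, r in zip(pi, rewards)]


def gradient_ascent(theta, rewards, alpha, steps):
    for _ in range(steps):
        g = gradient(theta, rewards)
        theta = [t + alpha * gi for t, gi in zip(theta, g)]  # theta <- theta + alpha * grad
    return theta


rewards = [1.0, 2.0, 3.0]   # hypothetical arm rewards
theta0 = [0.0, 0.0, 0.0]
theta = gradient_ascent(theta0, rewards, alpha=0.1, steps=2000)
j_start = expected_return(theta0, rewards)
j_end = expected_return(theta, rewards)
```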

One Limitation of Policy Gradient
Difficult to determine the right step size for the policy update
▶ Step size too small: slow convergence
▶ Step size too large: catastrophically bad policy updates
Solution: restrict the policy deviations - add $D(\pi_{\theta_{t+1}} | \pi_{\theta_t}) \le \delta$ as a "trust region" constraint
▶ $D(\cdot|\cdot)$ can be some divergence or metric that measures the difference between two policies
▶ Example - KL divergence: $D_{KL}(P|Q) := \int_{\Omega} \log\left(\frac{dP}{dQ}\right) dP$
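For discrete action distributions the KL integral reduces to a finite sum. The sketch below (function name and example distributions are assumptions) also shows that the divergence becomes infinite as soon as $P$ puts mass where $Q$ has none.

```python
import math


def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P|Q) = sum_i p_i * log(p_i / q_i)."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue          # 0 * log(0 / q) is taken as 0
        if q_i == 0.0:
            return math.inf   # P is not absolutely continuous w.r.t. Q
        total += p_i * math.log(p_i / q_i)
    return total


kl_same = kl_divergence([0.5, 0.5], [0.5, 0.5])
kl_small = kl_divergence([0.5, 0.5], [0.25, 0.75])
kl_disjoint = kl_divergence([0.5, 0.5, 0.0], [0.5, 0.0, 0.5])
```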

Trust Region Based Policy Gradient
Control the size of the policy update
Related work:
▶ Kullback-Leibler (KL) divergence based trust region constraint: trust region policy optimization [Schulman et al., 2015], maximum a posteriori policy optimization [Abdolmaleki et al., 2018], advantage-weighted regression [Peng et al., 2019]
▶ Penalize the size of the policy update: natural policy gradient [Kakade, 2001], proximal policy optimization [Schulman et al., 2017]

Sample Inefficiency of Trust Region Based PG
RL agents may need to consume an enormous number of samples to keep learning a finer policy
Two main causes:
▶ Approximations made when solving the policy optimization
  ⋆ Policy updates are rejected by the line search process
▶ Inaccurate evaluations of the value functions
  ⋆ Slow down the learning speed

Suboptimality of Trust Region Based PG
Parametric policy assumption leads to suboptimality
▶ Policy is assumed to follow a particular parametric distribution, e.g., Gaussian [Schulman et al., 2015]
▶ Hard to predetermine the distribution class of the optimal policy
▶ Parametric distributions are not convex in the distribution space
Figure: Objective in distribution space Π / parameter space Θ [Tessler et al., 2019]

Our Proposal: Wasserstein and Sinkhorn Policy Optimization
Innovative trust-region-based PG methods:
▶ Stability guaranteed
▶ No restrictive parametric policy assumption
▶ More robust to value function inaccuracies
A combination of:
▶ Optimistic Distributionally Robust Optimization
▶ Trust region constructed by a suitable metric

Optimistic Distributionally Robust Optimization
Formulation: $\max_{P \in D} \mathbb{E}_P[Q(x, \xi)]$
▶ The distribution $P$ of the random parameter $\xi$ is not precisely known but is assumed to belong to an ambiguity set (trust region) $D$
▶ The optimization problem is set to identify the optimistic policy within the trust region
All admissible policy distributions are considered
▶ Reduce the rejection rate of policy updates
  ⋆ Improve sample efficiency
▶ Open up the possibility of converging to a better final policy

Mathematical Framework
Formulation
$\max_{\pi' \in D} \mathbb{E}_{s \sim \rho^{\pi}_{\upsilon},\, a \sim \pi'(\cdot|s)}[A^{\pi}(s, a)]$ where $D = \{\pi' \mid \mathbb{E}_{s \sim \rho^{\pi}_{\upsilon}}[d(\pi'(\cdot|s), \pi(\cdot|s))] \le \delta\}$. (1)
Mathematical notations
▶ $A^{\pi}(s, a)$ - advantage function of policy $\pi$ associated with state $s$ and action $a$
▶ $D$ - trust region/ambiguity set
▶ $\rho^{\pi}_{\upsilon}$ - unnormalized discounted visitation frequencies with initial state distribution $\upsilon$: $\rho^{\pi}_{\upsilon}(s) = \mathbb{E}_{s_0 \sim \upsilon}\left[\sum_{t=0}^{\infty} \gamma^t P(s_t = s | s_0)\right]$
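The unnormalized discounted visitation frequencies can be approximated by truncating the sum over $t$. The sketch below assumes a small finite state space and a state-to-state transition matrix already induced by the policy. The matrix, the horizon and the function name are illustrative choices. Because the frequencies are unnormalized, they sum to $1/(1-\gamma)$.

```python
def discounted_visitation(P, upsilon, gamma, horizon=500):
    """Approximate rho(s) = E_{s0~upsilon}[sum_t gamma^t P(s_t = s | s0)].

    P[s][s2] is the probability of moving from s to s2 under the policy
    (a hypothetical input); the infinite sum is truncated at `horizon`.
    """
    n = len(upsilon)
    dist = list(upsilon)      # state distribution at time t
    rho = [0.0] * n
    weight = 1.0              # gamma^t
    for _ in range(horizon):
        for s in range(n):
            rho[s] += weight * dist[s]
        dist = [sum(dist[s] * P[s][s2] for s in range(n)) for s2 in range(n)]
        weight *= gamma
    return rho


# Two states: state 1 is absorbing, state 0 leaks into it half of the time.
P = [[0.5, 0.5], [0.0, 1.0]]
rho = discounted_visitation(P, upsilon=[1.0, 0.0], gamma=0.9)
```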

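Mathematical Framework
Formulation
$\max_{\pi' \in D} \mathbb{E}_{s \sim \rho^{\pi}_{\upsilon},\, a \sim \pi'(\cdot|s)}[A^{\pi}(s, a)]$ where $D = \{\pi' \mid \mathbb{E}_{s \sim \rho^{\pi}_{\upsilon}}[d(\pi'(\cdot|s), \pi(\cdot|s))] \le \delta\}$.
Motivation of objective:
▶ $\eta(\pi') = \eta(\pi) + \mathbb{E}_{s \sim \rho^{\pi'}_{\upsilon},\, a \sim \pi'}[A^{\pi}(s, a)]$
▶ $\eta(\pi)$ - performance of policy $\pi$: $\eta(\pi) = \mathbb{E}_{s_0, a_0, s_1, \ldots}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$
▶ Approximation: use $\rho^{\pi}_{\upsilon}$ instead of $\rho^{\pi'}_{\upsilon}$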

Motivation of using Wasserstein Metric/Sinkhorn Divergence
Kullback-Leibler (KL) based trust regions are pervasively used to stabilize policy optimization in model-free RL.
▶ Other distances such as Wasserstein and Sinkhorn are underexplored
Wasserstein has several advantages over KL:
▶ Considers the geometry of the metric space [Panaretos et al., 2019]
▶ Allows distributions to have different or even non-overlapping supports

Motivating Example
Three actions: Left, Right and Pickup. $d(\text{left}, \text{right}) = 1$, $d(\text{left}, \text{pickup}) = d(\text{right}, \text{pickup}) = 4$.
Figure: Motivating grid world example
Figure: Wasserstein utilizes the geometric features of the action space. (a) Policy shift of close action (b) Policy shift of far action

Motivating Example (Cont)
Figure: Demonstration of policy updates under different trust regions
Wasserstein metric achieves better exploration than KL

Definitions of Wasserstein and Sinkhorn distances
Wasserstein metric:
$d_W(\pi', \pi) = \inf_{Q \in \Pi(\pi', \pi)} \langle Q, M \rangle$, (2)
Sinkhorn divergence [Cuturi, 2013] provides a smooth approximation of the Wasserstein distance by adding an entropic regularizer:
$d_S(\pi', \pi | \lambda) = \inf_{Q \in \Pi(\pi', \pi)} \left\{ \langle Q, M \rangle - \frac{1}{\lambda} h(Q) \right\}$, (3)
where $h(Q) = -\sum_{i=1}^{N} \sum_{j=1}^{N} Q_{ij} \log Q_{ij}$ represents the entropy term.
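To make definitions (2) and (3) concrete, the sketch below runs plain Sinkhorn matrix scaling on the three-action example with distances 1 and 4, and compares the resulting transport cost $\langle Q, M \rangle$ with the KL divergence. The uniform starting policy, the shifted mass of 0.2, the value $\lambda = 10$ and the iteration count are all illustrative assumptions. The exact Wasserstein costs of the two shifts are $0.2 \times 1 = 0.2$ and $0.2 \times 4 = 0.8$, and the entropic plan should land close to them, while KL cannot tell the two shifts apart.

```python
import math


def sinkhorn_plan(a, b, M, lam, iters=2000):
    """Entropic-regularized transport plan with row marginal a and column marginal b."""
    n = len(a)
    K = [[math.exp(-lam * M[i][j]) for j in range(n)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * n
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(n)]
    return [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(n)]


def transport_cost(Q, M):
    return sum(Q[i][j] * M[i][j] for i in range(len(M)) for j in range(len(M)))


def kl(p, q):
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0.0)


# Actions: left, right, pickup, with the distances from the motivating example.
M = [[0.0, 1.0, 4.0],
     [1.0, 0.0, 4.0],
     [4.0, 4.0, 0.0]]
pi = [1 / 3, 1 / 3, 1 / 3]
pi_close = [1 / 3 - 0.2, 1 / 3 + 0.2, 1 / 3]   # shift mass left -> right
pi_far = [1 / 3 - 0.2, 1 / 3, 1 / 3 + 0.2]     # shift mass left -> pickup

cost_close = transport_cost(sinkhorn_plan(pi_close, pi, M, lam=10.0), M)
cost_far = transport_cost(sinkhorn_plan(pi_far, pi, M, lam=10.0), M)
kl_close, kl_far = kl(pi_close, pi), kl(pi_far, pi)
```

In this construction the two shifted policies are permutations of each other relative to a uniform reference, so their KL divergences coincide, whereas the transport cost of the far shift is several times larger.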

Wasserstein Policy Optimization/Sinkhorn Policy Optimization Framework
Directly optimize a nonparametric policy in the distribution space
Proposed WPO/SPO problem (optimistic DRO problems) [1]:
$\max_{\pi' \in D} \mathbb{E}_{s \sim \rho^{\pi}_{\upsilon},\, a \sim \pi'(\cdot|s)}[A^{\pi}(s, a)]$ where $D = \{\pi' \mid \mathbb{E}_{s \sim \rho^{\pi}_{\upsilon}}[d_W(\pi'(\cdot|s), \pi(\cdot|s))] \le \delta\}$. (4)
[1] Jun Song, Niao He, Lijun Ding, and Chaoyue Zhao. "Provably convergent policy optimization via metric-aware trust region methods". Transactions on Machine Learning Research (TMLR).

WPO Optimal Policy Update
Theorem 1
An optimal policy update of Wasserstein Policy Optimization is:
$\pi^*(a_i|s) = \sum_{j=1}^{N} \pi(a_j|s) f^*_s(i, j)$, (5)
where $f^*_s(i, j) = 1$ if $i = k^{\pi}_s(\beta^*, j)$ and $f^*_s(i, j) = 0$ otherwise.
Mathematical notations:
▶ $\beta^*$: an optimal solution to the dual formulation of (4)
▶ $k^{\pi}_s(\beta^*, j)$ - an arbitrary optimizer: $k^{\pi}_s(\beta^*, j) \in \arg\max_{k = 1, \ldots, N} \left\{ A^{\pi}(s, a_k) - \beta^* M_{kj} \right\}$
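Update (5) can be read as: each action $a_j$ hands all of its probability mass to the action that maximizes advantage minus $\beta$ times the transport cost. A minimal single-state sketch follows, in which $\beta$ is simply passed in rather than obtained from the dual problem. The advantages, the value of $\beta$ and the function name are assumptions, and ties are broken by taking the first maximizer.

```python
def wpo_update(pi, adv, M, beta):
    """One-state WPO update: move the mass of a_j to argmax_k adv[k] - beta * M[k][j]."""
    n = len(pi)
    new_pi = [0.0] * n
    for j in range(n):
        k = max(range(n), key=lambda i: adv[i] - beta * M[i][j])
        new_pi[k] += pi[j]   # f*(k, j) = 1, all other f*(i, j) = 0
    return new_pi


# Distances from the motivating example: left, right, pickup.
M = [[0.0, 1.0, 4.0],
     [1.0, 0.0, 4.0],
     [4.0, 4.0, 0.0]]
pi = [1 / 3, 1 / 3, 1 / 3]
adv = [0.0, 1.5, 2.0]        # hypothetical advantages

pi_mid = wpo_update(pi, adv, M, beta=1.0)     # left moves to the nearby action
pi_loose = wpo_update(pi, adv, M, beta=0.1)   # small beta: everything jumps to pickup
pi_tight = wpo_update(pi, adv, M, beta=10.0)  # large beta: no mass moves
```

A large $\beta$ corresponds to a tight trust region and a small $\beta$ to a loose one, which is why the same advantages produce three different updates here.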

SPO Optimal Policy Update
Theorem 2
An optimal policy update of Sinkhorn Policy Optimization is:
$\pi^*_{\lambda}(a_i|s) = \sum_{j=1}^{N} \pi(a_j|s) f^*_{s,\lambda}(i, j)$, (6)
where $f^*_{s,\lambda}(i, j) = \dfrac{\exp\left(\frac{\lambda}{\beta^*_{\lambda}} A^{\pi}(s, a_i) - \lambda D_{ij}\right)}{\sum_{k=1}^{N} \exp\left(\frac{\lambda}{\beta^*_{\lambda}} A^{\pi}(s, a_k) - \lambda D_{kj}\right)}$
Produces a smoother policy than WPO $\to$ further speeds up exploration
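Update (6) replaces the hard argmax of WPO with a softmax over target actions. The sketch below handles a single state and treats $D$ as the same action-distance matrix used in the motivating example, which is an assumption. $\beta$ and $\lambda$ are passed in directly, and the advantages and function name are illustrative. For large $\lambda$ the softmax concentrates on the maximizer of $A^{\pi}(s, a_i) - \beta D_{ij}$, so the result approaches a hard WPO-style assignment.

```python
import math


def spo_update(pi, adv, D, beta, lam):
    """One-state SPO update: softmax over i of (lam / beta) * adv[i] - lam * D[i][j]."""
    n = len(pi)
    new_pi = [0.0] * n
    for j in range(n):
        logits = [lam / beta * adv[i] - lam * D[i][j] for i in range(n)]
        m = max(logits)                       # stabilize the exponentials
        w = [math.exp(x - m) for x in logits]
        z = sum(w)
        for i in range(n):
            new_pi[i] += pi[j] * w[i] / z     # pi(a_j|s) * f(i, j)
    return new_pi


D = [[0.0, 1.0, 4.0],
     [1.0, 0.0, 4.0],
     [4.0, 4.0, 0.0]]
pi = [1 / 3, 1 / 3, 1 / 3]
adv = [0.0, 1.5, 2.0]        # hypothetical advantages

pi_smooth = spo_update(pi, adv, D, beta=1.0, lam=0.5)   # soft reallocation
pi_sharp = spo_update(pi, adv, D, beta=1.0, lam=50.0)   # close to a hard assignment
```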

Efficient Policy Update Strategies
$\beta_k$ as a general time-dependent control parameter
▶ Avoids solving the dual problem at each iteration
▶ Equivalent to the penalty version of the WPO/SPO problem: $\max_{\pi_{k+1}} \mathbb{E}_{s \sim \rho^{\pi_k}_{\upsilon},\, a \sim \pi_{k+1}(\cdot|s)}[A^{\pi_k}(s, a)] - \beta_k \mathbb{E}_{s \sim \rho^{\pi_k}_{\upsilon}}[d_W(\pi_{k+1}(\cdot|s), \pi_k(\cdot|s))]$.
▶ Set up as a decreasing sequence $\to$ less penalty as the policy is refined
Support sampling in continuous state/action space:
▶ Sample actions to approximate the target policy distribution $\pi_{k+1}$
▶ Sample states to perform the policy updates

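Theoretical Analysis of WPO
Performance improvement bound:
$J(\pi_{k+1}) \ge J(\pi_k) + \beta_k \mathbb{E}_{s \sim \rho^{\pi_{k+1}}_{\mu}} \sum_{j=1}^{N} \pi_k(a_j|s) \sum_{i \in \hat{K}^{\pi_k}_s(\beta_k, j)} f^k_s(i, j) M_{ij} - \frac{2\epsilon}{1 - \gamma}$,
where $\beta_k \ge 0$, and $\epsilon$ bounds the advantage estimation error: $\|\hat{A}^{\pi} - A^{\pi}\|_{\infty} \le \epsilon$.
Monotonic performance improvement: $J(\pi_{k+1}) \ge J(\pi_k)$ holds if $\epsilon = 0$.
Global convergence: $\lim_{k \to \infty} J(\pi_k) = J^{\star}$ holds if $\lim_{k \to \infty} \beta_k = 0$.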
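The monotonic improvement claim follows from the bound in one step. The short derivation below is a sketch that assumes the cost matrix entries and the weights $f^k_s(i, j)$ are nonnegative, which the slide does not state explicitly.

```latex
% Assumptions (not stated on the slide): M_{ij} \ge 0 and f^k_s(i,j) \ge 0.
\begin{aligned}
J(\pi_{k+1}) - J(\pi_k)
  &\ge \beta_k \, \mathbb{E}_{s \sim \rho^{\pi_{k+1}}_{\mu}}
       \sum_{j=1}^{N} \pi_k(a_j|s)
       \sum_{i \in \hat{K}^{\pi_k}_s(\beta_k, j)} f^k_s(i, j)\, M_{ij}
       - \frac{2\epsilon}{1 - \gamma} \\
  &= \beta_k \, \mathbb{E}_{s \sim \rho^{\pi_{k+1}}_{\mu}}
       \sum_{j=1}^{N} \pi_k(a_j|s)
       \sum_{i \in \hat{K}^{\pi_k}_s(\beta_k, j)} f^k_s(i, j)\, M_{ij}
       && (\epsilon = 0) \\
  &\ge 0
       && (\beta_k \ge 0,\ \pi_k \ge 0,\ f^k_s \ge 0,\ M_{ij} \ge 0).
\end{aligned}
```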

WPO/SPO Algorithm
Algorithm 1: WPO/SPO Algorithm
Input: number of iterations $K$, learning rate $\alpha$
Initialize policy $\pi_0$ and value network $V_{\psi_0}$ with random parameter $\psi_0$
for $k = 0, 1, 2, \ldots, K$ do
  Collect a set of trajectories $D_k$ on policy $\pi_k$
  For each timestep $t$ in each trajectory, compute total returns $G_t$ and estimate advantages $\hat{A}^{\pi_k}_t$
  Update value: $\psi_{k+1} \leftarrow \psi_k - \alpha \nabla_{\psi_k} (G_t - V_{\psi_k}(s_t))^2$
  Update policy with (5) or (6): $\pi_{k+1} \leftarrow F(\pi_k)$

Algorithm Details
Trajectories can be either complete or partial
▶ Complete: use the accumulated discounted reward as the return
  ⋆ $R_t = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k}$
▶ Partial: approximate the return using multi-step temporal difference (TD)
  ⋆ $\hat{R}_{t:t+n} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V(s_{t+n})$
Support various advantage estimation methods
▶ Monte Carlo advantage estimation: $\hat{A}^{\pi_k}_t = R_t - V_{\psi_k}(s_t)$
▶ Generalized Advantage Estimation (GAE) [Schulman et al., 2015]
  ⋆ Provides explicit control over the bias-variance trade-off
Value function is represented as a neural network
▶ $s \to V(s)$
▶ Network parameter $\psi$ - updated by gradient descent
▶ Reduces the computational burden of computing the advantage directly
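The return and advantage formulas above translate directly into a few lines of code. In this sketch the value estimates are plain numbers standing in for the value network, and the function names, rewards and values are illustrative assumptions.

```python
def discounted_returns(rewards, gamma):
    """Complete trajectory: R_t = sum_{k=0}^{T-t-1} gamma^k * r_{t+k}, for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # R_t = r_t + gamma * R_{t+1}
        returns[t] = running
    return returns


def n_step_return(rewards, values, t, n, gamma):
    """Partial trajectory: sum_{k<n} gamma^k * r_{t+k} + gamma^n * V(s_{t+n})."""
    partial = sum(gamma ** k * rewards[t + k] for k in range(n))
    return partial + gamma ** n * values[t + n]


def mc_advantages(rewards, values, gamma):
    """Monte Carlo advantage estimate: A_t = R_t - V(s_t)."""
    returns = discounted_returns(rewards, gamma)
    return [g - v for g, v in zip(returns, values)]


rewards = [1.0, 1.0, 1.0]
returns = discounted_returns(rewards, gamma=0.5)
two_step = n_step_return(rewards, values=[0.0, 0.0, 4.0], t=0, n=2, gamma=0.5)
advantages = mc_advantages(rewards, values=[1.0, 1.0, 1.0], gamma=0.5)
```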

Numerical Studies: Ablation Study
Table: Run time for different β settings
Runtime                           Taxi (s)   CartPole (s)
Setting 1 (optimal β)             1224       130
Setting 2 (optimal-then-decay)    648        63
Setting 3 (optimal-then-fix)      630        67
Setting 4 (decaying β)            522        44
Figure: (a) Different settings of β (b) Different choices of λ

Numerical Studies: Episode Rewards during Training
Tested on RL tasks across various domains:
▶ Tabular domain (discrete state and action), e.g., Taxi
▶ Locomotion tasks (continuous state, discrete action), e.g., CartPole
▶ MuJoCo tasks (continuous state and action), e.g., Hopper
Both WPO and SPO outperform the state of the art. WPO has better final performance. SPO has faster convergence.

Numerical Studies: Additional Comparison with KL Trust Regions
$N_A$: number of samples used to estimate the advantage function
Figure: (a) $N_A = 100$ (b) $N_A = 250$ (c) $N_A = 1000$
The Wasserstein metric is more robust to inaccurate advantage estimates caused by a lack of samples

Conclusion
The WPO/SPO framework can:
▶ improve final performance by allowing all admissible policies
▶ maintain stability by confining policy updates within the trust region
▶ improve sample efficiency by reducing the rejection rate of sampling through the relaxation of the parametric restriction
▶ increase the robustness against inaccuracies in advantage functions
