Slide 1

Provably Convergent Policy Optimization via Metric-aware Trust Region Methods

Chaoyue Zhao
Industrial and Systems Engineering, University of Washington
OT-DOM Workshop 2024

Slide 2

Reinforcement Learning

What is Reinforcement Learning (RL)?
▶ An agent acts in an uncertain environment
▶ The agent obtains rewards through interactions with the environment

Slide 3

Components of Reinforcement Learning

Model of the environment: Markov Decision Process (MDP)
▶ Defines how the environment reacts to actions
▶ Denoted by the tuple (S, A, P, r, γ):
  ⋆ S - state space
  ⋆ A - action space
  ⋆ r - reward function, r : S × A → R
  ⋆ P - transition probabilities, P : S × A × S → [0, 1]
  ⋆ γ - discount factor

Policy: the strategy of the RL agent
▶ Prescribes which action to take in a given state
▶ Different representations: stochastic policy π(a|s), deterministic policy π(s)

Slide 4

Components of Reinforcement Learning

Objective of RL - maximize the discounted return:
▶ J(π) = E_π[ Σ_{t=0}^{∞} γ^t r(s_t, a_t) ]

Functions that reflect the expected future rewards
▶ Q function
  ⋆ Q^π(s, a) = E_π[ Σ_{k=0}^{∞} γ^k r(s_{t+k+1}, a_{t+k+1}) | s_t = s, a_t = a ]
▶ V function
  ⋆ V^π(s) = E_{a∼π(·|s)}[ Q^π(s, a) ]
▶ Advantage function
  ⋆ A^π(s, a) = Q^π(s, a) − V^π(s)
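As a quick illustration of how these quantities relate, a small numpy sketch that recovers V and A from a tabular Q estimate and a stochastic policy; the array shapes and function name are assumptions for illustration:

import numpy as np

def value_and_advantage(Q, policy):
    # Q: (n_states, n_actions) estimate of Q^pi; policy: row-stochastic pi(a|s) of the same shape
    V = np.sum(policy * Q, axis=1)     # V^pi(s) = E_{a~pi(.|s)}[ Q^pi(s, a) ]
    A = Q - V[:, None]                 # A^pi(s, a) = Q^pi(s, a) - V^pi(s)
    return V, A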

Slide 5

Key Concepts of Reinforcement Learning

Model-based
▶ Know the model of the environment, e.g., AlphaZero [Silver et al., 2017]
▶ Learn the model of the environment, e.g., Dyna [Sutton, 1991]
▶ Example - Q value iteration:
  Q_{k+1}(s, a) = Σ_{s′} p(s′|s, a) ( r(s, a) + γ max_{a′} Q_k(s′, a′) )

Model-free
▶ Solve for the strategy without using or learning the environment model
▶ Example - Q-learning:
  Q_{k+1}(s, a) = Q_k(s, a) + α [ r(s, a) + γ max_{a′} Q_k(s′, a′) − Q_k(s, a) ]
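A minimal tabular Q-learning sketch of the model-free update above, assuming a Gymnasium-style environment with discrete states and actions (the environment interface and hyperparameters are illustrative assumptions, not part of the talk):

import numpy as np

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Tabular Q-learning: Q(s,a) <- Q(s,a) + alpha * [ r + gamma * max_a' Q(s',a') - Q(s,a) ]
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(num_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration around the current greedy policy
            a = env.action_space.sample() if np.random.rand() < epsilon else int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            target = r + (0.0 if terminated else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])   # model-free temporal-difference update
            s = s_next
    return Q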

Slide 6

Model-Free Reinforcement Learning

Value-based
▶ Only the value function is stored and learnt
▶ Policy is implicit - derived from the value function
▶ Examples: Q-learning [Watkins, 1989], DQN [Mnih et al., 2013], etc.

Policy-based
▶ Store and learn the policy directly
▶ Examples: REINFORCE [Williams, 1992], actor-critic [Konda and Tsitsiklis, 2000], deterministic policy gradients [Silver et al., 2014], deep deterministic policy gradients [Lillicrap et al., 2015], etc.

Slide 7

Policy Gradient

Policy gradient (PG) is a prominent RL method
▶ Model-free
▶ Policy-based
▶ Policy is modeled as π_θ(a|s)
  ⋆ A probability distribution controlled by the parameter θ
▶ Gradient ascent moves θ towards the highest return (see the sketch below)
  ⋆ i.e., θ ← θ + α ∇_θ J(π_θ)
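A minimal REINFORCE-style sketch of the gradient-ascent step θ ← θ + α ∇_θ J(π_θ); it assumes a PyTorch policy network whose forward pass returns a torch.distributions object, which is an illustrative choice rather than anything prescribed in the talk:

import torch

def reinforce_update(policy_net, optimizer, trajectory, gamma=0.99):
    # trajectory: list of (state, action, reward) tuples collected under pi_theta
    G, returns = 0.0, []
    for _, _, r in reversed(trajectory):
        G = r + gamma * G                  # discounted return-to-go
        returns.insert(0, G)
    loss = torch.zeros(())
    for (s, a, _), G_t in zip(trajectory, returns):
        dist = policy_net(torch.as_tensor(s, dtype=torch.float32))
        loss = loss - dist.log_prob(torch.as_tensor(a)) * G_t   # -log pi_theta(a|s) * return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()   # descent on the negative objective = ascent on J(pi_theta)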

Slide 8

One Limitation of Policy Gradient

Difficult to determine the right step size for the policy update
▶ Step size too small: slow convergence
▶ Step size too large: catastrophically bad policy updates

Solution: restrict the policy deviation by adding D(π_{θ_{t+1}} | π_{θ_t}) ≤ δ as a "trust region" constraint
▶ D(·|·) can be a divergence or metric that measures the difference between two policies
▶ Example - KL divergence: D_KL(P‖Q) := ∫_Ω log( dP/dQ ) dP
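A small sketch of checking such a trust-region constraint for discrete (categorical) policies; the function names, smoothing constant, and threshold value are illustrative assumptions:

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # D_KL(P || Q) = sum_a P(a) * log( P(a) / Q(a) ) for discrete distributions
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def inside_trust_region(pi_new, pi_old, delta=0.01):
    # accept a candidate update only if it does not deviate too far from the current policy
    return kl_divergence(pi_new, pi_old) <= delta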

Slide 9

Trust Region Based Policy Gradient

Control the size of the policy update

Related work:
▶ Kullback-Leibler (KL) divergence based trust region constraints: trust region policy optimization [Schulman et al., 2015], maximum a posteriori policy optimization [Abdolmaleki et al., 2018], advantage-weighted regression [Peng et al., 2019]
▶ Penalize the size of the policy update: natural policy gradient [Kakade, 2001], proximal policy optimization [Schulman et al., 2017]

Slide 10

Sample Inefficiency of Trust Region Based PG

RL agents may need an enormous number of sample points to keep refining the policy

Two main causes:
▶ Approximations made when solving the policy optimization
  ⋆ Policy updates are rejected by the line search process
▶ Inaccurate evaluations of the value functions
  ⋆ Slow down the learning speed

Slide 11

Suboptimality of Trust Region Based PG

The parametric policy assumption leads to suboptimality
▶ The policy is assumed to follow a particular parametric distribution, e.g., Gaussian [Schulman et al., 2015]
▶ Hard to predetermine the distribution class of the optimal policy
▶ Parametric distribution classes are not convex in the distribution space

Figure: Objective in the distribution space Π vs. the parameter space Θ [Tessler et al., 2019]

Slide 12

Our Proposal: Wasserstein and Sinkhorn Policy Optimization

An innovative trust region based PG method:
▶ Stability guaranteed
▶ No restrictive parametric policy assumption
▶ More robust to value function inaccuracies

A combination of:
▶ Optimistic distributionally robust optimization
▶ A trust region constructed with a suitable metric

Slide 13

Optimistic Distributionally Robust Optimization

Formulation: max_{P∈D} E_P[ Q(x, ξ) ]
▶ The distribution P of the random parameter ξ is not precisely known, but is assumed to belong to an ambiguity set (trust region) D
▶ The optimization problem seeks the most optimistic distribution within the trust region

All admissible policy distributions are considered
▶ Reduces the rejection rate of policy updates
  ⋆ Improves sample efficiency
▶ Opens up the possibility of converging to a better final policy

Slide 14

Mathematical Framework

Formulation

  max_{π′∈D} E_{s∼ρ^π_υ, a∼π′(·|s)}[ A^π(s, a) ]
  where D = {π′ | E_{s∼ρ^π_υ}[ d(π′(·|s), π(·|s)) ] ≤ δ}.   (1)

Mathematical notation
▶ A^π(s, a) - advantage function of policy π at state s and action a
▶ D - trust region/ambiguity set
▶ ρ^π_υ - unnormalized discounted visitation frequencies with initial state distribution υ:
  ρ^π_υ(s) = E_{s_0∼υ}[ Σ_{t=0}^{∞} γ^t P(s_t = s | s_0) ]
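A rough Monte Carlo sketch for estimating the unnormalized discounted visitation frequencies ρ^π_υ from sampled rollouts; the trajectory format (a list of state sequences, each started from s_0 ∼ υ) is an assumption for illustration:

import numpy as np

def discounted_visitation(trajectories, n_states, gamma=0.99):
    # rho^pi_upsilon(s) ~= average over rollouts of sum_t gamma^t * 1[s_t = s]
    rho = np.zeros(n_states)
    for states in trajectories:
        for t, s in enumerate(states):
            rho[s] += gamma ** t
    # average over starting states s_0 ~ upsilon; stays unnormalized in s
    return rho / max(len(trajectories), 1)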

Slide 15

Mathematical Framework

Formulation

  max_{π′∈D} E_{s∼ρ^π_υ, a∼π′(·|s)}[ A^π(s, a) ]
  where D = {π′ | E_{s∼ρ^π_υ}[ d(π′(·|s), π(·|s)) ] ≤ δ}.

Motivation of the objective:
▶ η(π′) = η(π) + E_{s∼ρ^{π′}_υ, a∼π′}[ A^π(s, a) ]
▶ η(π) - performance of policy π: η(π) = E_{s_0, a_0, s_1, ...}[ Σ_{t=0}^{∞} γ^t r(s_t, a_t) ]
▶ Approximation: use ρ^π_υ instead of ρ^{π′}_υ

Slide 16

Motivation for Using the Wasserstein Metric/Sinkhorn Divergence

Kullback-Leibler (KL) based trust regions are pervasively used to stabilize policy optimization in model-free RL
▶ Other distances, such as Wasserstein and Sinkhorn, are underexplored

Wasserstein has several advantages over KL:
▶ It accounts for the geometry of the underlying metric space [Panaretos et al., 2019]
▶ It allows distributions to have different or even non-overlapping supports

Slide 17

Motivating Example

Three actions: Left, Right, and Pickup, with d(left, right) = 1 and d(left, pickup) = d(right, pickup) = 4.

Figure: Motivating grid-world example
Figure: Wasserstein utilizes the geometric structure of the action space - (a) policy shift of a close action, (b) policy shift of a far action

Slide 18

Motivating Example (Cont.)

Figure: Demonstration of policy updates under different trust regions

The Wasserstein metric achieves better exploration than KL

Slide 19

Definitions of the Wasserstein and Sinkhorn Distances

Wasserstein metric:

  d_W(π′, π) = inf_{Q∈Π(π′,π)} ⟨Q, M⟩,   (2)

where Q ranges over couplings of π′ and π and M is the ground cost matrix over actions (e.g., the action distances in the motivating example).

The Sinkhorn divergence [Cuturi, 2013] provides a smooth approximation of the Wasserstein distance by adding an entropic regularizer:

  d_S(π′, π | λ) = inf_{Q∈Π(π′,π)} { ⟨Q, M⟩ − (1/λ) h(Q) },   (3)

where h(Q) = − Σ_{i=1}^{N} Σ_{j=1}^{N} Q_ij log Q_ij is the entropy term.
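A minimal numpy sketch of the Sinkhorn iterations for (3) between two discrete policies, using the action distances from the motivating example as the ground cost matrix M; this is an illustrative implementation, not the one used in the paper:

import numpy as np

def sinkhorn_distance(p, q, M, lam=10.0, n_iter=200):
    # Entropic OT (Cuturi, 2013): min_{Q in Pi(p, q)} <Q, M> - (1/lam) h(Q)
    K = np.exp(-lam * M)                 # Gibbs kernel induced by the ground cost M
    u = np.ones_like(p)
    for _ in range(n_iter):              # Sinkhorn fixed-point iterations on the scalings
        v = q / (K.T @ u)
        u = p / (K @ v)
    Q = np.diag(u) @ K @ np.diag(v)      # regularized optimal transport plan
    return float(np.sum(Q * M)), Q       # transport cost <Q, M> and the plan itself

# Example: two policies over {left, right, pickup} with d(left, right) = 1, d(., pickup) = 4
p = np.array([0.6, 0.3, 0.1])            # pi'(.|s)
q = np.array([0.2, 0.5, 0.3])            # pi(.|s)
M = np.array([[0., 1., 4.],
              [1., 0., 4.],
              [4., 4., 0.]])
cost, plan = sinkhorn_distance(p, q, M)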

Slide 20

Wasserstein Policy Optimization/Sinkhorn Policy Optimization Framework

Directly optimize a nonparametric policy in the distribution space

Proposed WPO/SPO problem (optimistic DRO problem) [1]:

  max_{π′∈D} E_{s∼ρ^π_υ, a∼π′(·|s)}[ A^π(s, a) ]
  where D = {π′ | E_{s∼ρ^π_υ}[ d_W(π′(·|s), π(·|s)) ] ≤ δ}.   (4)

[1] Jun Song, Niao He, Lijun Ding, and Chaoyue Zhao. "Provably convergent policy optimization via metric-aware trust region methods". Transactions on Machine Learning Research (TMLR).

Slide 21

WPO Optimal Policy Update

Theorem 1
An optimal policy update of Wasserstein Policy Optimization is

  π*(a_i | s) = Σ_{j=1}^{N} π(a_j | s) f*_s(i, j),   (5)

where f*_s(i, j) = 1 if i = k^π_s(β*, j) and f*_s(i, j) = 0 otherwise.

Mathematical notation:
▶ β* - an optimal solution to the dual formulation of (4)
▶ k^π_s(β*, j) - an arbitrary optimizer: k^π_s(β*, j) ∈ argmax_{k=1,...,N} { A^π(s, a_k) − β* M_kj }
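A sketch of the update in (5) for a single state with N discrete actions; pi_s, adv_s, and the use of a fixed β in place of the exact dual optimum β* are illustrative assumptions:

import numpy as np

def wpo_policy_update(pi_s, adv_s, M, beta):
    # Theorem 1 (sketch): each source action j sends all of its probability mass
    # to an index k(j) in argmax_k [ A^pi(s, a_k) - beta * M[k, j] ]
    new_pi = np.zeros(len(pi_s))
    for j in range(len(pi_s)):
        k = int(np.argmax(adv_s - beta * M[:, j]))   # greedy transport target for action a_j
        new_pi[k] += pi_s[j]                         # f*_s(k, j) = 1, all other entries 0
    return new_pi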

Slide 22

SPO Optimal Policy Update

Theorem 2
An optimal policy update of Sinkhorn Policy Optimization is

  π*_λ(a_i | s) = Σ_{j=1}^{N} π(a_j | s) f*_{s,λ}(i, j),   (6)

where

  f*_{s,λ}(i, j) = exp( (λ/β*_λ) A^π(s, a_i) − λ D_ij ) / Σ_{k=1}^{N} exp( (λ/β*_λ) A^π(s, a_k) − λ D_kj )

Produces a smoother policy than WPO → further speeds up exploration
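A matching sketch of (6), again with pi_s, adv_s, beta, and lam as illustrative stand-ins rather than the paper's implementation; as λ grows, the soft assignment approaches the hard WPO update above:

import numpy as np

def spo_policy_update(pi_s, adv_s, D, beta, lam):
    # Theorem 2 (sketch): column-wise softmax weights replace WPO's hard argmax
    logits = lam * (adv_s[:, None] / beta - D)   # entry (i, j): (lam/beta) * A^pi(s, a_i) - lam * D_ij
    F = np.exp(logits - logits.max(axis=0))      # stabilized column-wise softmax
    F /= F.sum(axis=0, keepdims=True)            # f*_{s,lam}(i, j): sums to 1 over i for every j
    return F @ pi_s                              # pi*_lam(a_i|s) = sum_j pi(a_j|s) f*_{s,lam}(i, j)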

Slide 23

Efficient Policy Update Strategies

Use β_k as a general time-dependent control parameter
▶ Avoids solving the dual problem at each iteration
▶ Equivalent to the penalty version of the WPO/SPO problem:

  max_{π_{k+1}} E_{s∼ρ^{π_k}_υ, a∼π_{k+1}(·|s)}[ A^{π_k}(s, a) ] − β_k E_{s∼ρ^{π_k}_υ}[ d_W(π_{k+1}(·|s), π_k(·|s)) ]

▶ Set β_k as a decreasing sequence → less penalty as the policy is learnt finer

Support sampling in continuous state/action spaces:
▶ Sample actions to approximate the target policy distribution π_{k+1}
▶ Sample states to perform the policy updates

Slide 24

Theoretical Analysis of WPO

Performance improvement bound:

  J(π_{k+1}) ≥ J(π_k) + β_k E_{s∼ρ^{π_{k+1}}_µ}[ Σ_{j=1}^{N} π_k(a_j | s) Σ_{i∈K̂^{π_k}_s(β_k, j)} f^k_s(i, j) M_ij ] − 2ϵ/(1 − γ),

where β_k ≥ 0 and ϵ bounds the advantage estimation error, ‖Â^π − A^π‖_∞ ≤ ϵ.

Monotonic performance improvement J(π_{k+1}) ≥ J(π_k) holds if ϵ = 0.

Global convergence lim_{k→∞} J(π_k) = J* holds if lim_{k→∞} β_k = 0.

Slide 25

WPO/SPO Algorithm

Algorithm 1: WPO/SPO
  Input: number of iterations K, learning rate α
  Initialize policy π_0 and value network V_{ψ_0} with random parameters ψ_0
  for k = 0, 1, 2, ..., K do
    Collect a set of trajectories D_k under policy π_k
    For each timestep t in each trajectory, compute the total return G_t and estimate the advantage Â^{π_k}_t
    Update the value network: ψ_{k+1} ← ψ_k − α ∇_{ψ_k} (G_t − V_{ψ_k}(s_t))^2
    Update the policy with (5) or (6): π_{k+1} ← F(π_k)
  end for
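A schematic tabular instantiation of Algorithm 1 for a small discrete task, reusing the wpo_policy_update sketch from Theorem 1; rollout is an assumed helper returning the (state, action, reward) lists of one batch of trajectories, and the tabular V and Q arrays stand in for the value network:

import numpy as np

def train_wpo(env, n_states, n_actions, M, K=100, gamma=0.99, alpha=0.1, beta0=1.0):
    policy = np.full((n_states, n_actions), 1.0 / n_actions)   # tabular pi_k(a|s)
    V = np.zeros(n_states)                                     # stands in for the value network V_psi
    Q = np.zeros((n_states, n_actions))                        # used only to form advantage estimates
    for k in range(K):
        beta_k = beta0 / (k + 1)                               # decreasing control parameter beta_k
        states, actions, rewards = rollout(env, policy)        # collect trajectories D_k (assumed helper)
        G, returns = 0.0, []
        for r in reversed(rewards):
            G = r + gamma * G                                  # total return G_t
            returns.insert(0, G)
        for s, a, G_t in zip(states, actions, returns):
            V[s] += alpha * (G_t - V[s])                       # value update toward the return
            Q[s, a] += alpha * (G_t - Q[s, a])                 # Monte Carlo Q estimate
        for s in set(states):
            adv = Q[s] - V[s]                                  # A_hat(s, .) = Q_hat(s, .) - V_hat(s)
            policy[s] = wpo_policy_update(policy[s], adv, M, beta_k)   # update via (5)
    return policy, V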

Slide 26

Algorithm Details

Trajectories can be either complete or partial
▶ Complete: use the accumulated discounted reward as the return
  ⋆ R_t = Σ_{k=0}^{T−t−1} γ^k r_{t+k}
▶ Partial: approximate the return using multi-step temporal difference (TD)
  ⋆ R̂_{t:t+n} = Σ_{k=0}^{n−1} γ^k r_{t+k} + γ^n V(s_{t+n})

Supports various advantage estimation methods (see the GAE sketch below)
▶ Monte Carlo advantage estimation: Â^{π_k}_t = R_t − V_{ψ_k}(s_t)
▶ Generalized Advantage Estimation (GAE) [Schulman et al., 2015]
  ⋆ Provides explicit control over the variance-bias trade-off

Value function is represented as a neural network
▶ s → V(s)
▶ Network parameters ψ - updated by gradient descent
▶ Reduces the computational burden of computing the advantage directly
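A short sketch of GAE for the partial-trajectory case, since the slide only names the method; the bootstrap-to-zero at the end of the arrays is an assumption that the trajectory terminates there:

import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    # Generalized Advantage Estimation: A_t = sum_l (gamma*lam)^l * delta_{t+l},
    # with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    # lam trades bias for variance: lam = 0 gives one-step TD, lam = 1 gives Monte Carlo.
    T = len(rewards)
    adv, gae = np.zeros(T), 0.0
    for t in reversed(range(T)):
        v_next = values[t + 1] if t + 1 < T else 0.0   # assumes the episode ends at T
        delta = rewards[t] + gamma * v_next - values[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv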

Slide 27

Numerical Studies: Ablation Study

Table: Run time for different β settings

  Setting                          Taxi (s)   CartPole (s)
  Setting 1 (optimal β)            1224       130
  Setting 2 (optimal-then-decay)   648        63
  Setting 3 (optimal-then-fix)     630        67
  Setting 4 (decaying β)           522        44

Figure: (a) Different settings of β, (b) different choices of λ

Slide 28

Numerical Studies: Episode Rewards during Training

Tested on RL tasks across various domains:
▶ Tabular domain (discrete state and action), e.g., Taxi
▶ Locomotion tasks (continuous state, discrete action), e.g., CartPole
▶ MuJoCo tasks (continuous state and action), e.g., Hopper

Both WPO and SPO outperform the state-of-the-art. WPO has better final performance; SPO has faster convergence.

Slide 29

Numerical Studies: Additional Comparison with KL Trust Regions

N_A: number of samples used to estimate the advantage function

Figure: (a) N_A = 100, (b) N_A = 250, (c) N_A = 1000

The Wasserstein metric is more robust to inaccurate advantage estimates caused by a lack of samples

Slide 30

Conclusion

The WPO/SPO framework can:
▶ improve final performance by allowing all admissible policies
▶ maintain stability by confining policy updates within the trust region
▶ improve sample efficiency by reducing the rejection rate of sampling through the relaxation of the parametric restriction
▶ increase robustness against inaccuracies in the advantage functions