Provably Convergent Policy Optimization via Metric-aware Trust Region Methods
Chaoyue Zhao (University of Washington, Seattle, USA)
WORKSHOP ON OPTIMAL TRANSPORT
FROM THEORY TO APPLICATIONS
INTERFACING DYNAMICAL SYSTEMS, OPTIMIZATION, AND MACHINE LEARNING
Venue: Humboldt University of Berlin, Dorotheenstraße 24
Markov Decision Process (MDP)
▶ Defines how the environment reacts to actions
▶ Denoted by the tuple (S, A, P, r, γ): S - state space, A - action space, r - reward function r: S × A → R, P - transition probabilities P: S × A × S → R, γ - discount factor
Policy: specifies the strategy of the RL agent
▶ Provides a guideline on which action to take in a given state
▶ Different representations: stochastic policy π(a|s), deterministic policy π(s)
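Illustration (not from the slides): the tuple (S, A, P, r, γ) maps directly onto a few arrays. A minimal sketch with a hypothetical two-state, two-action MDP:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP written as the tuple (S, A, P, r, gamma).
S = [0, 1]                      # state space
A = [0, 1]                      # action space
gamma = 0.9                     # discount factor

# r[s, a]: reward for taking action a in state s
r = np.array([[0.0, 1.0],
              [1.0, 0.0]])

# P[s, a, s']: probability of landing in s' after taking action a in state s
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
assert np.allclose(P.sum(axis=2), 1.0)   # each P[s, a, :] is a distribution

# A stochastic policy pi(a|s) is a row-stochastic matrix over actions.
pi = np.array([[0.5, 0.5],
               [0.9, 0.1]])
```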
Objective: maximize the expected discounted return
▶ J(π) = E_π[ ∑_{t=0}^∞ γ^t r(s_t, a_t) ]
Functions to reflect the expected future rewards
▶ Q function
  ⋆ Q^π(s, a) = E_π[ ∑_{k=0}^∞ γ^k r(s_{t+k+1}, a_{t+k+1}) | s_t = s, a_t = a ]
▶ V function
  ⋆ V^π(s) = E_{a∼π(·|s)}[ Q^π(s, a) ]
▶ Advantage function
  ⋆ A^π(s, a) = Q^π(s, a) − V^π(s)
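The identities V^π(s) = E_{a∼π(·|s)} Q^π(s, a) and A^π = Q^π − V^π translate directly into array operations. A minimal sketch with a hypothetical tabular Q-function and policy:

```python
import numpy as np

# Hypothetical tabular Q-function Q[s, a] and stochastic policy pi[s, a] = pi(a|s).
Q = np.array([[1.0, 3.0],
              [2.0, 0.5]])
pi = np.array([[0.5, 0.5],
               [0.8, 0.2]])

# V^pi(s) = E_{a ~ pi(.|s)} Q^pi(s, a)
V = (pi * Q).sum(axis=1)

# A^pi(s, a) = Q^pi(s, a) - V^pi(s)
Adv = Q - V[:, None]

print(V)    # V = [2.0, 1.7]
print(Adv)  # advantages; each row averages to zero under pi
```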
Model-based
▶ Use a known model of the environment, e.g., AlphaZero [Silver et al., 2017]
▶ Learn the model of the environment, e.g., Dyna [Sutton, 1991]
▶ Example: Q value iteration: Q_{k+1}(s, a) = ∑_{s′} p(s′|s, a) ( r(s, a) + γ max_{a′} Q_k(s′, a′) )
Model-free
▶ Solve for the strategy without using or learning the environment model
▶ Example: Q learning: Q_{k+1}(s, a) = Q_k(s, a) + α [ r(s, a) + γ max_{a′} Q_k(s′, a′) − Q_k(s, a) ]
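Both updates are one-line operations on a tabular Q. A minimal sketch, assuming a hypothetical random tabular MDP (P, r) and a hypothetical sampled transition for the model-free case:

```python
import numpy as np

# Hypothetical tabular MDP: P[s, a, s'], r[s, a], discount gamma, learning rate alpha.
nS, nA, gamma, alpha = 2, 2, 0.9, 0.1
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # random transition kernel
r = rng.uniform(size=(nS, nA))                  # random rewards

# Model-based: Q value iteration
# Q_{k+1}(s,a) = sum_{s'} P(s'|s,a) * (r(s,a) + gamma * max_{a'} Q_k(s',a'))
Q = np.zeros((nS, nA))
for _ in range(200):
    Q = r + gamma * P @ Q.max(axis=1)

# Model-free: one Q-learning update from a sampled transition (s, a, rew, s_next)
s, a, rew, s_next = 0, 1, 0.5, 1
Q[s, a] += alpha * (rew + gamma * Q[s_next].max() - Q[s, a])
```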
Value-based
▶ The value function is stored and learnt
▶ Policy is implicit - derived from the value function
▶ Examples: Q-learning [Watkins, 1989], DQN [Mnih et al., 2013], etc.
Policy-based
▶ Store and learn the policy directly
▶ Examples: REINFORCE [Williams, 1992], actor-critic [Konda and Tsitsiklis, 2000], deterministic policy gradients [Silver et al., 2014], deep deterministic policy gradients [Lillicrap et al., 2015], etc.
Policy gradient (PG) methods
▶ Model-free
▶ Policy-based
▶ Policy is modeled as π_θ(a|s)
  ⋆ A probability distribution controlled by parameter θ
▶ Gradient ascent is applied to move θ towards the highest return
  ⋆ i.e., θ ← θ + α ∇_θ J(π_θ)
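A minimal illustration of the ascent step θ ← θ + α ∇_θ J(π_θ), using a standard REINFORCE-style gradient estimate for a softmax policy on a hypothetical single-state (bandit) problem; this is a generic sketch, not the method proposed later in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
true_rewards = np.array([0.1, 0.5, 0.9])   # hypothetical expected reward per action
theta = np.zeros(3)                        # policy parameters (softmax logits)
alpha = 0.1

for _ in range(2000):
    pi = np.exp(theta - theta.max()); pi /= pi.sum()   # pi_theta(a)
    a = rng.choice(3, p=pi)
    G = true_rewards[a] + rng.normal(scale=0.1)        # sampled return
    # REINFORCE estimate of grad_theta J: G * grad_theta log pi(a),
    # where grad_theta log softmax = onehot(a) - pi
    grad_log_pi = -pi.copy(); grad_log_pi[a] += 1.0
    theta += alpha * G * grad_log_pi                   # theta <- theta + alpha * grad J

pi = np.exp(theta - theta.max()); pi /= pi.sum()
print(pi)   # most probability mass should end up on the highest-reward action
```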
Challenge: choosing the step size for the policy update
▶ Step size too small: slow convergence
▶ Step size too large: catastrophically bad policy updates
Solution: restrict the policy deviations - add D(π_{θ_{t+1}} | π_{θ_t}) ≤ δ as a "trust region" constraint
▶ D(·|·) can be some divergence or metric measuring the difference between two policies
▶ Example - KL divergence: D_KL(P|Q) := ∫_Ω log( dP/dQ ) dP
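For discrete action distributions, the KL term and the trust-region test are a few lines. A minimal sketch checking D_KL(π_new(·|s) | π_old(·|s)) ≤ δ at one hypothetical state:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P | Q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical old/new action distributions at one state, and trust-region radius delta.
pi_old = np.array([0.5, 0.3, 0.2])
pi_new = np.array([0.45, 0.35, 0.2])
delta = 0.01

d = kl_divergence(pi_new, pi_old)
print(d, d <= delta)   # accept the update only if it stays inside the trust region
```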
Trust region methods stabilize the policy update
Related work:
▶ Kullback-Leibler (KL) divergence based trust region constraints: trust region policy optimization [Schulman et al., 2015], maximum a posteriori policy optimization [Abdolmaleki et al., 2018], advantage-weighted regression [Peng et al., 2019]
▶ Penalize the size of the policy update: natural policy gradient [Kakade, 2001], proximal policy optimization [Schulman et al., 2017]
Sample inefficiency: existing trust region methods need to consume an enormous amount of sample points to keep on learning a finer policy
Two main causes:
▶ Approximations made when solving the policy optimization
  ⋆ Policy updates are rejected by the line search process
▶ Inaccurate evaluations of the value functions
  ⋆ Slow down the learning speed
The parametric policy assumption leads to suboptimality
▶ The policy is assumed to follow a particular parametric distribution, e.g., Gaussian [Schulman et al., 2015]
▶ Hard to predetermine the distribution class of the optimal policy
▶ Parametric distributions are not convex in the distribution space
Figure: Objective in distribution space Π / parameter space Θ [Tessler et al., 2019]
Goal: new trust region based PG methods with
▶ Stability guaranteed
▶ No restrictive parametric policy assumption
▶ More robustness to value function inaccuracies
A combination of:
▶ Optimistic Distributionally Robust Optimization
▶ A trust region constructed with a suitable metric
Optimistic Distributionally Robust Optimization (DRO)
▶ The distribution P of the random parameter ξ is not precisely known but is assumed to belong to an ambiguity set (trust region) D
▶ The optimization problem is set up to identify the optimistic policy within the trust region
All admissible policy distributions are considered
▶ Reduce the rejection rate of policy updates
  ⋆ Improve sample efficiency
▶ Open up the possibility of converging to a better final policy
max_{π′∈D} E_{s∼ρ^π_υ, a∼π′(·|s)} [ A^π(s, a) ],
where D = {π′ | E_{s∼ρ^π_υ}[ d(π′(·|s), π(·|s)) ] ≤ δ}. (1)
Mathematical notations
▶ A^π(s, a) - advantage function of policy π associated with state s and action a
▶ D - trust region/ambiguity set
▶ ρ^π_υ - unnormalized discounted visitation frequencies with initial state distribution υ: ρ^π_υ(s) = E_{s_0∼υ}[ ∑_{t=0}^∞ γ^t P(s_t = s | s_0) ]
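The discounted visitation frequencies can be estimated by accumulating γ^t weights per visited state over sampled trajectories. A minimal sketch, assuming a small discrete state space and hypothetical trajectories:

```python
import numpy as np

def discounted_visitation(trajectories, n_states, gamma):
    """Estimate rho^pi_upsilon(s) = E_{s0~upsilon}[ sum_t gamma^t P(s_t = s | s0) ]
    by averaging gamma^t indicator weights over sampled trajectories."""
    rho = np.zeros(n_states)
    for states in trajectories:               # each trajectory is a list of visited states
        for t, s in enumerate(states):
            rho[s] += gamma ** t
    return rho / len(trajectories)            # averaged over trajectories, unnormalized in s

# Hypothetical trajectories over a 3-state space
trajs = [[0, 1, 1, 2], [0, 2, 2, 2]]
print(discounted_visitation(trajs, n_states=3, gamma=0.9))
```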
KL divergence based trust regions are pervasively used to stabilize policy optimization in model-free RL
▶ Other distances such as Wasserstein and Sinkhorn are underexplored
Wasserstein has several advantages over KL:
▶ Consider the geometry of the metric space [Panaretos et al., 2019]
▶ Allow distributions to have different or even non-overlapping supports
Zhao, UW WPO/SPO March 2024 16 / 30
d(left, right) = 1, d(left, pickup) = d(right, pickup) = 4.
Figure: Motivating grid world example
Figure: Wasserstein utilizes the geometric feature of the action space - (a) policy shift of a close action, (b) policy shift of a far action
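The grid-world intuition can be checked numerically: shifting the same probability mass from "left" to the nearby "right" versus the distant "pickup" gives identical KL values but very different Wasserstein costs. A minimal sketch using the slide's ground distances; the policy shifts are hypothetical, and the LP below is the standard discrete optimal-transport formulation, not code from the paper:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_discrete(p, q, M):
    """1-Wasserstein distance between discrete distributions p, q
    with ground-cost matrix M, solved as a small transport LP."""
    n = len(p)
    c = M.reshape(-1)                             # cost of each transport variable f[i, j]
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0          # row sums of f equal p
        A_eq[n + i, i::n] = 1.0                   # column sums of f equal q
    b_eq = np.concatenate([p, q])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# Ground distances from the slide: d(left, right) = 1, d(left, pickup) = d(right, pickup) = 4
M = np.array([[0., 1., 4.],
              [1., 0., 4.],
              [4., 4., 0.]])
pi_old      = np.array([0.8, 0.1, 0.1])   # (left, right, pickup)
shift_close = np.array([0.6, 0.3, 0.1])   # 0.2 mass moved to the nearby action
shift_far   = np.array([0.6, 0.1, 0.3])   # 0.2 mass moved to the distant action

print(kl(shift_close, pi_old), kl(shift_far, pi_old))        # identical KL values
print(wasserstein_discrete(pi_old, shift_close, M),
      wasserstein_discrete(pi_old, shift_far, M))            # 0.2 vs 0.8
```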
Solve the policy optimization directly in the distribution space
Proposed WPO/SPO problem (optimistic DRO problems) [1]:
max_{π′∈D} E_{s∼ρ^π_υ, a∼π′(·|s)} [ A^π(s, a) ],
where D = {π′ | E_{s∼ρ^π_υ}[ d_W(π′(·|s), π(·|s)) ] ≤ δ}. (4)
[1] Jun Song, Niao He, Lijun Ding, and Chaoyue Zhao. "Provably convergent policy optimization via metric-aware trust region methods". Transactions on Machine Learning Research (TMLR).
The closed-form solution of Wasserstein Policy Optimization is:
π*(a_i | s) = ∑_{j=1}^N π(a_j | s) f*_s(i, j), (5)
where f*_s(i, j) = 1 if i = k^π_s(β*, j) and f*_s(i, j) = 0 otherwise.
Mathematical notations:
▶ β* - an optimal solution to the dual formulation of (4)
▶ k^π_s(β*, j) - an arbitrary optimizer: k^π_s(β*, j) ∈ argmax_{k=1,...,N} A^π(s, a_k) − β* M_{kj}
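Equation (5) can be implemented directly once the cost matrix M and β* are available. A minimal sketch at a single state, with hypothetical advantage values and β taken as given rather than obtained from the dual:

```python
import numpy as np

def wpo_closed_form_update(pi_s, adv_s, M, beta):
    """Closed-form WPO update at one state s (Eq. (5)):
    each source action a_j sends all of its probability mass to
    k(beta, j) in argmax_k A(s, a_k) - beta * M[k, j]."""
    N = len(pi_s)
    pi_new = np.zeros(N)
    for j in range(N):
        k = int(np.argmax(adv_s - beta * M[:, j]))   # k^pi_s(beta, j)
        pi_new[k] += pi_s[j]                         # f*_s(k, j) = 1, all other entries 0
    return pi_new

# Hypothetical inputs: current policy at s, advantage estimates, ground-cost matrix, beta*
pi_s  = np.array([0.5, 0.3, 0.2])
adv_s = np.array([-0.2, 0.4, 0.1])
M     = np.array([[0., 1., 4.],
                  [1., 0., 4.],
                  [4., 4., 0.]])
print(wpo_closed_form_update(pi_s, adv_s, M, beta=0.5))  # mass moves toward high-advantage, low-cost actions
```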
Practical implementation: treat β as a tunable parameter
▶ Avoid solving the dual problem at each iteration
▶ Equivalent to the penalty version of the WPO/SPO problem:
max_{π_{k+1}} E_{s∼ρ^{π_k}_υ, a∼π_{k+1}(·|s)} [ A^{π_k}(s, a) ] − β_k E_{s∼ρ^{π_k}_υ}[ d_W(π_{k+1}(·|s), π_k(·|s)) ]. (6)
▶ Set {β_k} up as a decreasing sequence → less penalty as the policy is learnt finer
Support sampling in continuous state/action spaces:
▶ Sample actions to approximate the target policy distribution π_{k+1}
▶ Sample states to perform the policy updates
Algorithm: WPO/SPO training loop
Input: number of iterations K, learning rate α
Initialize policy π_0 and value network V_{ψ_0} with random parameter ψ_0
for k = 0, 1, 2, ..., K do
  Collect a set of trajectories D_k on policy π_k
  For each timestep t in each trajectory, compute total returns G_t and estimate advantages Â^{π_k}_t
  Update value: ψ_{k+1} ← ψ_k − α ∇_{ψ_k} (G_t − V_{ψ_k}(s_t))²
  Update policy with (5) or (6): π_{k+1} ← F(π_k)
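A schematic Python skeleton of this loop, under heavy simplifying assumptions: a hypothetical small tabular MDP stands in for the environment, a value table stands in for the value network V_ψ, advantages are plain Monte Carlo estimates, and the update operator F is instantiated with the closed-form update (5) and a decreasing β_k as on the previous slide. This is a sketch, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, alpha, horizon, K = 2, 2, 0.9, 0.1, 20, 50
P = rng.dirichlet(np.ones(nS), size=(nS, nA))        # hypothetical dynamics P[s, a, s']
r = rng.uniform(size=(nS, nA))                       # hypothetical rewards r[s, a]
M = np.array([[0., 1.], [1., 0.]])                   # ground cost between actions
pi = np.full((nS, nA), 1.0 / nA)                     # initial policy pi_0
V = np.zeros(nS)                                     # tabular stand-in for V_psi

def rollout(pi):
    s, traj = rng.integers(nS), []
    for _ in range(horizon):
        a = rng.choice(nA, p=pi[s])
        traj.append((s, a, r[s, a]))
        s = rng.choice(nS, p=P[s, a])
    return traj

def wpo_update_state(pi_s, adv_s, beta):
    # closed-form WPO update (Eq. (5)) at a single state
    pi_new = np.zeros_like(pi_s)
    for j in range(len(pi_s)):
        k = int(np.argmax(adv_s - beta * M[:, j]))
        pi_new[k] += pi_s[j]
    return pi_new

for k in range(K):
    beta_k = 1.0 / (k + 1)                           # decreasing penalty schedule
    trajs = [rollout(pi) for _ in range(10)]         # collect trajectories D_k on pi_k
    adv = np.zeros((nS, nA))
    for traj in trajs:
        G = 0.0
        for s, a, rew in reversed(traj):             # total returns G_t, computed backwards
            G = rew + gamma * G
            adv[s, a] = G - V[s]                     # Monte Carlo advantage estimate
            V[s] -= alpha * 2 * (V[s] - G)           # gradient step on (G_t - V(s_t))^2
    for s in range(nS):                              # policy update pi_{k+1} <- F(pi_k)
        pi[s] = wpo_update_state(pi[s], adv[s], beta_k)
```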
Support both complete and partial trajectories
▶ Complete: use the accumulated discounted reward as the return
  ⋆ R_t = ∑_{k=0}^{T−t−1} γ^k r_{t+k}
▶ Partial: approximate the return using multi-step temporal difference (TD)
  ⋆ R̂_{t:t+n} = ∑_{k=0}^{n−1} γ^k r_{t+k} + γ^n V(s_{t+n})
Support various advantage estimation methods
▶ Monte Carlo advantage estimation: Â^{π_k}_t = R_t − V_{ψ_k}(s_t)
▶ Generalized Advantage Estimation (GAE) [Schulman et al., 2015]
  ⋆ Provides explicit control over the variance-bias trade-off
The value function is represented as a neural network
▶ s → V(s)
▶ Network parameter ψ - updated by gradient descent
▶ Reduces the computational burden of computing the advantage directly
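These estimators are short array computations. A minimal sketch with hypothetical reward and value arrays, using the standard GAE recursion from Schulman et al.: Â_t = ∑_l (γλ)^l δ_{t+l} with δ_t = r_t + γ V(s_{t+1}) − V(s_t):

```python
import numpy as np

gamma, lam, n = 0.99, 0.95, 3
rewards = np.array([1.0, 0.0, 0.5, 1.0, 0.2])       # hypothetical r_t along one trajectory
values  = np.array([0.8, 0.7, 0.9, 0.6, 0.3, 0.0])  # hypothetical V(s_t), incl. terminal state

# Complete trajectory: R_t = sum_{k=0}^{T-t-1} gamma^k r_{t+k}
T = len(rewards)
mc_returns = np.zeros(T)
running = 0.0
for t in reversed(range(T)):
    running = rewards[t] + gamma * running
    mc_returns[t] = running

# Partial trajectory: n-step TD return R_{t:t+n} = sum_{k<n} gamma^k r_{t+k} + gamma^n V(s_{t+n})
t = 0
td_n = sum(gamma ** k * rewards[t + k] for k in range(n)) + gamma ** n * values[t + n]

# Monte Carlo advantage: A_t = R_t - V(s_t)
mc_adv = mc_returns - values[:T]

# Generalized Advantage Estimation: A_t = sum_l (gamma*lam)^l delta_{t+l},
# with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
deltas = rewards + gamma * values[1:] - values[:-1]
gae = np.zeros(T)
acc = 0.0
for t in reversed(range(T)):
    acc = deltas[t] + gamma * lam * acc
    gae[t] = acc
```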
Experiments across various domains:
▶ Tabular domain (discrete state and action), e.g., Taxi
▶ Locomotion tasks (continuous state, discrete action), e.g., CartPole
▶ MuJoCo tasks (continuous state and action), e.g., Hopper
Both WPO and SPO outperform the state of the art. WPO has better final performance; SPO has faster convergence.
Robustness to the number of samples used to estimate the advantage function
Figure panels: (a) N_A = 100, (b) N_A = 250, (c) N_A = 1000
The Wasserstein metric is more robust to inaccurate advantage estimations caused by the lack of samples
Summary: WPO/SPO
▶ Consider all admissible policies
▶ Maintain stability by confining policy updates within the trust region
▶ Improve sample efficiency by reducing the rejection rate of sampling through the relaxation of the parametric restriction
▶ Increase the robustness against inaccuracies in advantage functions