Wuchen Li
September 30, 2022

# Transport Optimization Methods in Bayesian Sampling Problems

We present a class of information metric optimization methods for Bayesian inverse problems. Two examples of information metrics are studied, including the Fisher-Rao metric and the Wasserstein-2 metric. Focusing on the Wasserstein-2 metric, we introduce accelerated gradient and Newton flows. We also formulate their practical discrete-time particle algorithms. To handle the curse of dimensionality, we further introduce a projected Wasserstein gradient method. Numerical experiments, including PDE-constrained Bayesian inference and parameter estimation in COVID-19 modeling, demonstrate the effectiveness of the proposed methods. This is based on joint works with Yifei Wang and Peng Chen.


## Transcript

1. ### Transport Optimization Methods in Bayesian Sampling Problems

Wuchen Li, University of South Carolina. SIAM MDS 22, San Diego, Sep 30. Joint work with Peng Chen (Georgia Tech) and Yifei Wang (Stanford). Partially supported by AFOSR MURI FA9550-18-1-0502 and NSF RTG: 2038080.

3. ### Bayesian sampling problems

Main problem: Given a function $f(x)$, $x \in \Omega$, can we generate samples from $\pi(x) \propto e^{-f(x)}$, where $\pi$ is a probability density function?

Difficulties:
- Fast, efficient, accurate algorithms.
- High-dimensional sampling: $\Omega \subset \mathbb{R}^N$, $N \gg 3$.

5. ### A sampling algorithm

Consider Langevin dynamics:
$$dX_t = -\nabla f(X_t)\,dt + \sqrt{2}\,dB_t,$$
where $X_t \in \mathbb{R}^d$ and $B_t$ is the standard Brownian motion in $\mathbb{R}^d$. Denote $X_t \sim \rho_t$. The evolution of $\rho_t$ satisfies
$$\partial_t \rho_t = \nabla \cdot (\rho_t \nabla f) + \Delta \rho_t.$$
Its invariant distribution satisfies
$$\rho_\infty(x) = \pi(x) = \frac{1}{K} e^{-f(x)}, \qquad K = \int e^{-f(y)}\,dy.$$
The Euler-Maruyama discretization of Langevin dynamics gives the Langevin Monte Carlo method (LMC).
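
To make the connection concrete, here is a minimal sketch of LMC obtained from the Euler-Maruyama discretization above; the quadratic potential, step size, and iteration count are illustrative choices, not settings from the talk.

```python
import numpy as np

def langevin_monte_carlo(grad_f, x0, step=1e-2, n_steps=5000, rng=None):
    """Unadjusted Langevin: X_{k+1} = X_k - step * grad f(X_k) + sqrt(2 * step) * N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    samples = np.empty((n_steps,) + x.shape)
    for k in range(n_steps):
        x = x - step * grad_f(x) + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
        samples[k] = x
    return samples

# Example: pi(x) ∝ exp(-f(x)) with f(x) = ||x||^2 / 2, i.e. a standard Gaussian target.
samples = langevin_monte_carlo(grad_f=lambda x: x, x0=np.zeros(2))
print(samples[-1000:].mean(axis=0), samples[-1000:].var(axis=0))  # approximately 0 and 1
```
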
6. ### Optimization in probability space

Nowadays, sampling algorithms relate to variational problems in the probability density space
$$\mathcal{P}(\Omega) = \Big\{ \rho \in \mathcal{F}(\Omega) \colon \int_\Omega \rho\,dx = 1, \ \rho \geq 0 \Big\}.$$
Consider the optimization problem
$$\min_{\rho \in \mathcal{P}(\Omega)} E(\rho),$$
where $E(\rho)$ measures the closeness between $\rho$ and the target distribution $\pi$.
7. ### Gradient descent = LMC methods

Consider the KL divergence
$$E(\rho) = \int \rho(x) \log\frac{\rho(x)}{\pi(x)}\,dx, \qquad \text{with first variation } \frac{\delta E}{\delta\rho} = \log\rho - \log\pi + 1.$$
Observe that
$$\partial_t \rho_t = \nabla\cdot\Big(\rho_t \nabla \frac{\delta E}{\delta\rho}(\rho_t)\Big) = \nabla\cdot\big(\rho_t \nabla(\log\rho_t - \log\pi + 1)\big) = \Delta\rho_t - \nabla\cdot(\rho_t \nabla\log\pi) = \Delta\rho_t + \nabla\cdot(\rho_t \nabla f),$$
where we apply the fact that $\rho\,\nabla\log\rho = \nabla\rho$. In short, the LMC method in probability density space can be viewed as a gradient descent of the KL divergence w.r.t. the optimal transport metric.
8. ### First-order optimization algorithms

Sampling methods turn out to be first-order gradient descent methods in probability space. This viewpoint has many generalizations:
- Fisher-Rao metric and natural gradient methods (Amari, Ny, Zhang, et al.);
- Wasserstein metric and Wasserstein gradient descent (Jordan, Kinderlehrer, Otto, Villani, Carrillo, Slepcev, Liu, Lu, Gao, Ambrosio, Gigli, Savare, Gangbo, et al.);
- Stein metric and Stein variational gradient descent (Liu, Wang);
- Wasserstein-Kalman metric and ensemble Kalman sampling (Garbuno-Inigo, Hoffman, Li, Stuart);
- Wasserstein Newton's metric and Wasserstein Newton's flow (Wang, Li);
- Projected Wasserstein metric and projected Wasserstein gradient flow (Wang, Chen, Pilanci, Li);
- Entropy-entropy flux metric and flux-gradient flows (Li, Liu, Osher, Shu).
9. ### Accelerated sampling dynamics

Goal: Design and compute sampling dynamics in information metric spaces.

Questions:
- What is the "mean-field" accelerated Langevin dynamics?
- What is the "mean-field" Newton Langevin dynamics?
10. ### Review: Gradient flows

- Methods for optimization problems in Euclidean space are well studied:
$$\min_{x \in \mathbb{R}^n} f(x).$$
- The classical gradient descent method follows $x_{k+1} = x_k - s\,\nabla f(x_k)$. It can be formulated as a continuous-time gradient flow, $\dot{x} = -\nabla f(x)$.
11. ### Review: Accelerated gradient flows

- Nesterov's accelerated method is widely applied to accelerate vanilla gradient descent:
$$\ddot{x} + \alpha_t\,\dot{x} + \nabla f(x) = 0.$$
- This is equivalent to a damped Hamiltonian system
$$\begin{pmatrix} \dot{x} \\ \dot{p} \end{pmatrix} + \begin{pmatrix} 0 \\ \alpha_t\, p \end{pmatrix} - \begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix} \begin{pmatrix} \nabla_x H_E(x,p) \\ \nabla_p H_E(x,p) \end{pmatrix} = 0,$$
with the Hamiltonian $H_E(x,p) = \frac{1}{2}\|p\|_2^2 + f(x)$, and initial values $x(0) = x_0$ and $p(0) = 0$.
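
As a sanity check on this formulation, below is a minimal sketch (not from the talk) that integrates the damped Hamiltonian system with a semi-implicit Euler step for a quadratic objective; the damping schedule $\alpha_t = 3/t$, the step size, and the test function are illustrative assumptions.

```python
import numpy as np

def damped_hamiltonian_descent(grad_f, x0, step=1e-2, n_steps=5000):
    """Integrate x' = p, p' = -alpha_t * p - grad f(x) with semi-implicit Euler.
    Uses the convex-case damping alpha_t = 3/t."""
    x = np.array(x0, dtype=float)
    p = np.zeros_like(x)                                  # p(0) = 0
    for k in range(1, n_steps + 1):
        t = k * step
        p = p - step * ((3.0 / t) * p + grad_f(x))        # momentum update with damping 3/t
        x = x + step * p                                  # position update uses the new momentum
    return x

# Example: minimize f(x) = 0.5 * x^T A x with an ill-conditioned diagonal A.
A = np.diag([1.0, 100.0])
x_final = damped_hamiltonian_descent(lambda x: A @ x, x0=np.array([1.0, 1.0]))
print(x_final)  # close to the minimizer [0, 0]
```
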
12. ### Gradient flows

- The gradient flow for $E(\rho)$ in $(\mathcal{P}(\Omega), g_\rho)$ takes the form
$$\partial_t \rho_t = -G(\rho_t)^{-1} \frac{\delta E(\rho_t)}{\delta \rho_t}.$$
- The Fisher-Rao gradient flow is given by
$$\partial_t \rho_t = -\rho_t \Big( \frac{\delta E}{\delta \rho_t} - \int \frac{\delta E}{\delta \rho_t}\,\rho_t\,dx \Big).$$
- The Wasserstein gradient flow writes
$$\partial_t \rho_t = -\big(-\nabla\cdot(\rho_t\nabla)\big)\frac{\delta E}{\delta \rho_t} = \nabla\cdot\Big(\rho_t \nabla \frac{\delta E}{\delta \rho_t}\Big).$$
13. ### Hamiltonian flows in probability space

The Euler-Lagrange equation in the probability space can be formulated as a system in $(\rho_t, \Phi_t)$, i.e.,
$$\begin{cases} \partial_t \rho_t - G(\rho_t)^{-1}\Phi_t = 0, \\ \partial_t \Phi_t + \frac{1}{2}\frac{\delta}{\delta\rho_t}\Big(\int \Phi_t\, G(\rho_t)^{-1}\Phi_t\,dx\Big) + \frac{\delta E}{\delta \rho_t} = 0. \end{cases}$$
These are examples of dynamics in mean-field games, optimal transport, and information geometry.
14. ### Example: Fisher-Rao Hamiltonian flow

- The Fisher-Rao Hamiltonian flow follows
$$\begin{cases} \partial_t \rho_t - \Phi_t\rho_t + \Big(\int \Phi_t\rho_t\,dx\Big)\rho_t = 0, \\ \partial_t \Phi_t + \frac{1}{2}\Phi_t^2 - \Big(\int \rho_t\Phi_t\,dx\Big)\Phi_t + \frac{\delta E}{\delta\rho_t} = 0. \end{cases}$$
- The corresponding Hamiltonian is
$$H_F(\rho_t,\Phi_t) = \frac{1}{2}\bigg(\int \Phi_t^2\,\rho_t\,dx - \Big(\int \rho_t\Phi_t\,dx\Big)^2\bigg) + E(\rho_t).$$
15. ### Example: Wasserstein Hamiltonian flow

- The Wasserstein Hamiltonian flow writes
$$\begin{cases} \partial_t\rho_t + \nabla\cdot(\rho_t\nabla\Phi_t) = 0, \\ \partial_t\Phi_t + \frac{1}{2}\|\nabla\Phi_t\|^2 + \frac{\delta E}{\delta\rho_t} = 0. \end{cases}$$
- The corresponding Hamiltonian forms
$$H_W(\rho_t,\Phi_t) = \frac{1}{2}\int \|\nabla\Phi_t\|^2\,\rho_t\,dx + E(\rho_t).$$
16. ### Damped Hamiltonian flows in probability space

Consider
$$\begin{cases} \partial_t\rho_t - G(\rho_t)^{-1}\Phi_t = 0, \\ \partial_t\Phi_t + \alpha_t\Phi_t + \frac{1}{2}\frac{\delta}{\delta\rho_t}\Big(\int \Phi_t\,G(\rho_t)^{-1}\Phi_t\,dx\Big) + \frac{\delta E}{\delta\rho_t} = 0, \end{cases} \tag{AIG}$$
with initial values $\rho_0 = \rho^0$ and $\Phi_0 = 0$.

AIG: Accelerated Information Gradient flows.
17. ### Fisher-Rao AIG flow

The Fisher-Rao AIG flow writes
$$\begin{cases} \partial_t\rho_t - \Phi_t\rho_t + \Big(\int \Phi_t\rho_t\,dx\Big)\rho_t = 0, \\ \partial_t\Phi_t + \alpha_t\Phi_t + \frac{1}{2}\Phi_t^2 - \Big(\int \rho_t\Phi_t\,dx\Big)\Phi_t + \frac{\delta E}{\delta\rho_t} = 0. \end{cases} \tag{F-AIG}$$
18. ### Wasserstein AIG flow

The Wasserstein AIG flow (Carrillo, 2020; Amir, et al., 2019) satisfies
$$\begin{cases} \partial_t\rho_t + \nabla\cdot(\rho_t\nabla\Phi_t) = 0, \\ \partial_t\Phi_t + \alpha_t\Phi_t + \frac{1}{2}\|\nabla\Phi_t\|^2 + \frac{\delta E}{\delta\rho_t} = 0. \end{cases} \tag{W-AIG}$$
19. ### Covariance AIG flows

We consider AIG dynamics in Gaussian distributions.

Theorem. Suppose that $\rho^0, \rho^* \in \mathcal{N}^n_0$ and their covariance matrices are $\Sigma^0$ and $\Sigma^*$. $E(\Sigma)$ evaluates the KL divergence from $\rho$ to $\rho^*$. Here $(\Sigma_t, S_t)$ satisfies
$$\begin{cases} \dot{\Sigma}_t - (S_t\Sigma_t + \Sigma_t S_t) = 0, \\ \dot{S}_t + \alpha_t S_t + S_t^2 + 2\nabla_{\Sigma_t} E(\Sigma_t) = 0, \end{cases} \tag{W-AIG-G}$$
with initial values $\Sigma_0 = \Sigma^0$ and $S_0 = 0$. In particular, for any $t \geq 0$, $\Sigma_t$ is well-defined and stays positive definite.
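
A minimal sketch (not the talk's code) that integrates (W-AIG-G) with forward Euler; the target covariance, the damping $\alpha_t = 3/t$, and the explicit gradient $\nabla_\Sigma E(\Sigma) = \tfrac{1}{2}(\Sigma^{*-1} - \Sigma^{-1})$ for the zero-mean Gaussian KL divergence are illustrative assumptions.

```python
import numpy as np

def gaussian_kl(sigma, sigma_star):
    """KL(N(0, sigma) || N(0, sigma_star)) for zero-mean Gaussians."""
    n = sigma.shape[0]
    inv_star = np.linalg.inv(sigma_star)
    _, logdet = np.linalg.slogdet(sigma)
    _, logdet_star = np.linalg.slogdet(sigma_star)
    return 0.5 * (np.trace(inv_star @ sigma) - n + logdet_star - logdet)

def w_aig_gaussian(sigma_star, sigma0, dt=1e-3, n_steps=20000):
    """Forward-Euler integration of (W-AIG-G):
       dSigma/dt = S Sigma + Sigma S,  dS/dt = -alpha_t S - S^2 - 2 grad_Sigma E(Sigma),
       with alpha_t = 3/t and grad_Sigma E = 0.5 * (inv(Sigma_star) - inv(Sigma))."""
    sigma, S = sigma0.copy(), np.zeros_like(sigma0)       # S_0 = 0
    inv_star = np.linalg.inv(sigma_star)
    kl_history = []
    for k in range(1, n_steps + 1):
        t = k * dt
        grad = 0.5 * (inv_star - np.linalg.inv(sigma))
        S_new = S + dt * (-(3.0 / t) * S - S @ S - 2.0 * grad)
        sigma = sigma + dt * (S @ sigma + sigma @ S)      # uses the old S (forward Euler)
        S = S_new
        kl_history.append(gaussian_kl(sigma, sigma_star))
    return sigma, kl_history

sigma_star = np.diag([1.0, 10.0])
sigma_final, kl_history = w_aig_gaussian(sigma_star, sigma0=np.eye(2))
print(kl_history[0], kl_history[-1])   # the KL divergence should decay toward 0
```
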
20. ### Convergence rate of W-AIG flows

Theorem. Suppose that $E(\rho)$ satisfies $\mathrm{Hess}(\beta)$ for $\beta > 0$. The solution $\rho_t$ to (W-AIG) with $\alpha_t = 2\sqrt{\beta}$ satisfies
$$E(\rho_t) \leq C_0\, e^{-\sqrt{\beta}\,t} = O(e^{-\sqrt{\beta}\,t}).$$
If $E(\rho)$ only satisfies $\mathrm{Hess}(0)$, then the solution $\rho_t$ to (W-AIG) with $\alpha_t = 3/t$ satisfies
$$E(\rho_t) \leq C_0'\, t^{-2} = O(t^{-2}).$$
Here the constants $C_0, C_0'$ only depend on $\rho_0$.
21. ### Time scaling and convergence results

Convergence rate:
- A sufficient condition for $\mathrm{Hess}(\beta)$ is that $f(x)$ is $\beta$-strongly convex (Lott, Villani, Sturm, et al.).
- The W-AIG flow improves the $O(e^{-2\beta t})$ convergence rate of the gradient flow to $O(e^{-\sqrt{\beta}\,t})$, especially when $\beta$ is close to 0.

Convexity estimations:
- $\beta$ from the Wasserstein Hessian operator (Otto, Villani, et al.) = weak form of Bakry-Emery Gamma calculus.
- Modified Fisher-Rao in optimal transport space = weak form of Gamma calculus (Li, 2018).
- Modified Fisher-Rao in generalized optimal transport space = information Gamma calculus (Feng, Li, 2019-2022). Generalized tensors from (Arnold, Carlen, Ju, Erb, Baudoin, Garofalo).
22. ### W-AIG flows at the particle level

Proposition. Suppose that $X_t \sim \rho_t$ and $V_t = \nabla\Phi_t(X_t)$ are the position and the velocity of a particle at time $t$. Then, the differential equation of the particle system corresponding to (W-AIG) writes
$$\begin{cases} dX_t = V_t\,dt, \\ dV_t = -\alpha_t V_t\,dt - \nabla f(X_t)\,dt - \nabla\log\rho_t(X_t)\,dt. \end{cases}$$
23. ### Discrete-time algorithm for the W-AIG flow

For $i = 1, 2, \ldots, N$, the update rule follows
$$\begin{cases} V^i_{k+1} = \alpha_k V^i_k - \sqrt{\tau}\,\big(\nabla f(X^i_k) + \xi_k(X^i_k)\big), \\ X^i_{k+1} = X^i_k + \sqrt{\tau}\,V^i_{k+1}. \end{cases}$$
If $E(\rho)$ is $\mathrm{Hess}(\beta)$, then $\alpha_k = \frac{1-\sqrt{\beta\tau}}{1+\sqrt{\beta\tau}}$; if $E(\rho)$ is $\mathrm{Hess}(0)$ or $\beta$ is unknown, then $\alpha_k = \frac{k-1}{k+2}$. Here $\xi_k(x)$ is an approximation of $\nabla\log\rho_t(x)$.

Difficulties:
- The approximation of $\nabla\log\rho_t(x)$.
- The stiffness of AIG flows.
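
For orientation, a minimal sketch of this update on a particle system. The fixed-bandwidth Gaussian kernel score estimator below is a simplified stand-in for the Brownian-motion bandwidth selection described on the next slide, and the target, step size, and bandwidth are illustrative assumptions.

```python
import numpy as np

def kde_score(points, queries, h):
    """Estimate xi(x) ≈ grad log rho(x) from samples via a Gaussian kernel density estimate."""
    diff = queries[:, None, :] - points[None, :, :]            # diff[i, j] = queries[i] - points[j]
    w = np.exp(-np.sum(diff**2, axis=-1) / (2.0 * h**2))       # kernel weights, shape (M, N)
    num = -(w[:, :, None] * diff).sum(axis=1) / (h**2)         # sum_j w_ij * (-(x_i - X_j) / h^2)
    return num / w.sum(axis=1, keepdims=True)

def w_aig_step(X, V, k, grad_f, tau, h, beta=None):
    """One discrete W-AIG update for all particles."""
    if beta is not None:                                       # Hess(beta) case
        alpha = (1.0 - np.sqrt(beta * tau)) / (1.0 + np.sqrt(beta * tau))
    else:                                                      # Hess(0) / unknown case
        alpha = (k - 1.0) / (k + 2.0)
    xi = kde_score(X, X, h)                                    # approximation of grad log rho
    V_new = alpha * V - np.sqrt(tau) * (grad_f(X) + xi)
    X_new = X + np.sqrt(tau) * V_new
    return X_new, V_new

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) + 3.0                            # particles, initialized away from the target
V = np.zeros_like(X)
for k in range(1, 301):                                        # target pi ∝ exp(-||x||^2 / 2)
    X, V = w_aig_step(X, V, k, grad_f=lambda x: x, tau=1e-2, h=0.3)
print(X.mean(axis=0), X.var(axis=0))   # approximately 0 mean; variance near 1 (slightly shrunk by the bandwidth)
```
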
24. ### Learning density from Brownian motion

- Given the bandwidth $h$, particles $\{X^i_k\}_{i=1}^N$ and a step size $s$, we can compute two particle systems
$$Y^i_k(h) = X^i_k - s\,\xi_k(X^i_k; h), \qquad Z^i_k = X^i_k + \sqrt{2s}\,B^i, \qquad i = 1,\ldots,N,$$
where $B^i$ is the standard Brownian motion.
- Minimize $\mathrm{MMD}(\hat\rho_Y, \hat\rho_Z)$, the maximum mean discrepancy between the empirical distributions $\hat\rho_Y(x) = \sum_{i=1}^N \delta(x - Y^i_k(h))$ and $\hat\rho_Z(x) = \sum_{i=1}^N \delta(x - Z^i_k)$.
- Optimize over $h$, using the bandwidth $h_{k-1}$ from the last iteration as the initialization.
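
A small sketch of this bandwidth-selection objective. Here `xi_of(X, h)` stands for any score estimator parameterized by the bandwidth $h$ (e.g., a kernel estimate), the RBF kernel is one common choice for the MMD, and the grid search is a stand-in for the initialization-and-optimization over $h$ described above; all of these are assumptions, not the talk's exact procedure.

```python
import numpy as np

def rbf_mmd2(Y, Z, sigma=1.0):
    """Squared maximum mean discrepancy between the empirical measures of Y and Z (RBF kernel)."""
    def gram(A, B):
        d2 = np.sum((A[:, None, :] - B[None, :, :])**2, axis=-1)
        return np.exp(-d2 / (2.0 * sigma**2))
    return gram(Y, Y).mean() + gram(Z, Z).mean() - 2.0 * gram(Y, Z).mean()

def select_bandwidth(X, xi_of, s, h_grid, rng):
    """Pick the bandwidth h whose deterministic step Y = X - s * xi(X; h)
    best matches the Brownian step Z = X + sqrt(2s) * noise, measured in MMD."""
    Z = X + np.sqrt(2.0 * s) * rng.standard_normal(X.shape)
    best_h, best_val = None, np.inf
    for h in h_grid:
        Y = X - s * xi_of(X, h)
        val = rbf_mmd2(Y, Z)
        if val < best_val:
            best_h, best_val = h, val
    return best_h
```
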
25. ### Adaptive restart in density space

- The second equation in (W-AIG) is a Hamilton-Jacobi equation, which usually has strong stiffness. We propose an adaptive restart technique to deal with this problem. Consider
$$\varphi_k = -\sum_{i=1}^N \big\langle V^i_{k+1},\ \nabla f(X^i_k) + \xi_k(X^i_k) \big\rangle,$$
which can be viewed as a discrete-time approximation of $-\partial_t E(\rho_t)$.
- If $\varphi_k < 0$, then we restart the algorithm with initial values $X^i_0 = X^i_k$ and $V^i_0 = 0$.
- This essentially keeps $\partial_t E(\rho_t)$ negative along the trajectory.
- The numerical results show that the adaptive restart accelerates and stabilizes the discrete-time algorithm.
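
A short sketch of how this restart test could be wired into a particle update loop; the helper `w_aig_step` and its arguments are hypothetical placeholders for whatever update and score routines are in use.

```python
import numpy as np

def restart_test(V_new, grad_f_X, xi_X):
    """phi_k = -sum_i <V^i_{k+1}, grad f(X^i_k) + xi_k(X^i_k)>; restart when phi_k < 0,
    i.e. when the discrete estimate of d/dt E(rho_t) turns positive."""
    phi = -np.sum(V_new * (grad_f_X + xi_X))
    return phi < 0

# Inside the particle loop (schematic):
#   X_new, V_new = w_aig_step(X, V, k, grad_f, tau, h)   # hypothetical update routine
#   if restart_test(V_new, grad_f(X), xi):               # xi: the score estimate used in the step
#       V_new, k = np.zeros_like(V_new), 0                # reset momentum, restart the schedule
#   X, V, k = X_new, V_new, k + 1
```
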
26. ### AIG Algorithm

Algorithm 1: Particle implementation of the Wasserstein AIG flow.

Require: initial positions $\{X^i_0\}_{i=1}^N$, step size $\tau$, number of iterations $L$.
1. Set $k = 0$, $V^i_0 = 0$ for $i = 1, \ldots, N$. Set the bandwidth $h_0$ by MED.
2. For $l = 1, 2, \ldots, L$ do:
3. Compute $h_l$ based on BM: $h_l = \mathrm{BM}(h_{l-1}, \{X^i_k\}_{i=1}^N, \sqrt{\tau})$.
4. Calculate $\xi_k(X^i_k)$ with the bandwidth $h_l$.
5. Set $\alpha_k$ based on whether $E(\rho)$ is $\mathrm{Hess}(\beta)$ or $\mathrm{Hess}(0)$. Update $V^i_{k+1} = \alpha_k V^i_k - \sqrt{\tau}\,(\nabla f(X^i_k) + \xi_k(X^i_k))$, $X^i_{k+1} = X^i_k + \sqrt{\tau}\,V^i_{k+1}$.
6. If RESTART then:
7. Compute $\varphi_k = -\sum_{i=1}^N \langle V^i_{k+1}, \nabla f(X^i_k) + \xi_k(X^i_k)\rangle$.
8. If $\varphi_k < 0$, set $X^i_0 = X^i_k$, $V^i_0 = 0$ and $k = 0$; otherwise set $k = k+1$.
9. Else:
10. Set $k = k + 1$.
11. End if.
12. End for.
27. ### Gaussian examples: ODE level

- The initial value is set to be $\Sigma_0 = I$.
- The following figure presents the convergence of the KL divergence on two target distributions.

Figure: The acceleration of the W-AIG flow and the effect of adaptive restart (ODE level). The target density is a Gaussian distribution with zero mean on $\mathbb{R}^{100}$. Left: $\kappa \approx 3.8 \times 10^3$; Right: $\kappa \approx 4.0 \times 10^3$.
28. ### Gaussian examples: Particle level

- The initial distribution of samples follows $N(0, I)$ and the number of samples is $N = 600$.
- For a particle system $\{X^i_k\}_{i=1}^N$, we record the KL divergence $E(\hat\Sigma_k)$ using the empirical covariance matrix $\hat\Sigma_k$.

Figure: The acceleration of the W-AIG flow and the effect of adaptive restart (particle level). The setting of the target density is identical to the previous figure.
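
For reference, a small sketch of how the recorded quantity $E(\hat\Sigma_k)$ can be computed from the particles, assuming zero-mean Gaussians and the closed-form KL divergence.

```python
import numpy as np

def empirical_gaussian_kl(particles, sigma_star):
    """E(Sigma_hat) = KL(N(0, Sigma_hat) || N(0, Sigma_star)), with Sigma_hat the empirical covariance."""
    sigma_hat = np.cov(particles.T, bias=True)        # empirical covariance of the particle system
    n = sigma_hat.shape[0]
    inv_star = np.linalg.inv(sigma_star)
    _, logdet_hat = np.linalg.slogdet(sigma_hat)
    _, logdet_star = np.linalg.slogdet(sigma_star)
    return 0.5 * (np.trace(inv_star @ sigma_hat) - n + logdet_star - logdet_hat)
```
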
29. ### Bayesian logistic regression

Figure: Comparison of different methods on Bayesian logistic regression, averaged over 10 independent trials. The shaded areas show the variance over the 10 trials. Top: BM; Bottom: MED. Left: test accuracy; Right: test log-likelihood.
30. ### Projected Wasserstein gradient algorithms

Mean-field games / optimal transport + optimization + UQ methods.

34. ### Future works

Complex systems; Bayesian sampling; time discretization; spatial discretization; entropy-entropy flux metric; optimization.