Wuchen Li
September 30, 2022

# Transport Optimization Methods in Bayesian Sampling Problems

We present a class of information metric optimization methods for Bayesian inverse problems. Two examples of information metrics are studied, including the Fisher-Rao metric and the Wasserstein-2 metric. Focusing on the Wasserstein-2 metric, we introduce accelerated gradient and Newton flows. We also formulate their practical discrete-time particle algorithms. To handle the curse of dimensionality, we further introduce a projected Wasserstein gradient method. Numerical experiments, including PDE-constrained Bayesian inference and parameter estimation in COVID-19 modeling, demonstrate the effectiveness of the proposed methods. This is based on joint works with Yifei Wang and Peng Chen.


## Transcript

1. ### Transport Optimization Methods in Bayesian Sampling Problems

Wuchen Li, University of South Carolina. SIAM MDS 22, San Diego, Sep 30. Joint work with Peng Chen (Georgia Tech) and Yifei Wang (Stanford). Partially supported by AFOSR MURI FA9550-18-1-0502 and NSF RTG: 2038080.

3. ### Bayesian sampling problems

Main problem: Given a function $f(x)$, $x \in \Omega$, can we generate samples from $\pi(x) \propto e^{-f(x)}$, where $\pi$ is a probability density function?

Difficulties:
- Fast, efficient, accurate algorithms.
- High-dimensional sampling: $\Omega \subset \mathbb{R}^N$, $N \gg 3$.

5. ### A sampling algorithm

Consider Langevin dynamics:
$$dX_t = -\nabla f(X_t)\,dt + \sqrt{2}\,dB_t,$$
where $X_t \in \mathbb{R}^d$ and $B_t$ is the standard Brownian motion in $\mathbb{R}^d$. Denote $X_t \sim \rho_t$. The evolution of $\rho_t$ satisfies
$$\partial_t \rho_t = \nabla \cdot (\rho_t \nabla f) + \Delta \rho_t.$$
Its invariant distribution satisfies
$$\rho_\infty(x) = \pi(x) = \frac{1}{K} e^{-f(x)}, \qquad K = \int e^{-f(y)}\,dy.$$
The Euler-Maruyama discretization of Langevin dynamics gives the Langevin Monte Carlo method (LMC).
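
To make the connection concrete, here is a minimal sketch of LMC obtained from the Euler-Maruyama discretization above; the quadratic potential, step size, and iteration count are illustrative choices, not settings from the talk.

```python
import numpy as np

def langevin_monte_carlo(grad_f, x0, step=1e-2, n_steps=5000, rng=None):
    """Unadjusted Langevin: X_{k+1} = X_k - step * grad f(X_k) + sqrt(2 * step) * N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    samples = np.empty((n_steps,) + x.shape)
    for k in range(n_steps):
        x = x - step * grad_f(x) + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
        samples[k] = x
    return samples

# Example: pi(x) ∝ exp(-f(x)) with f(x) = ||x||^2 / 2, i.e. a standard Gaussian target.
samples = langevin_monte_carlo(grad_f=lambda x: x, x0=np.zeros(2))
print(samples[-1000:].mean(axis=0), samples[-1000:].var(axis=0))  # approximately 0 and 1
```
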
6. ### Optimization in probability space

Nowadays, sampling algorithms relate to variational problems in the probability density space
$$\mathcal{P}(\Omega) = \Big\{ \rho \in \mathcal{F}(\Omega) \colon \int_\Omega \rho\,dx = 1, \ \rho \geq 0 \Big\}.$$
Consider the optimization problem
$$\min_{\rho \in \mathcal{P}(\Omega)} E(\rho),$$
where $E(\rho)$ measures the closeness between $\rho$ and the target distribution $\pi$.
7. ### Gradient descent = LMC methods

Consider the KL divergence
$$E(\rho) = \int \rho(x) \log\frac{\rho(x)}{\pi(x)}\,dx, \qquad \text{with first variation } \frac{\delta E}{\delta\rho} = \log\rho - \log\pi + 1.$$
Observe that
$$\partial_t \rho_t = \nabla\cdot\Big(\rho_t \nabla \frac{\delta E}{\delta\rho}(\rho_t)\Big) = \nabla\cdot\big(\rho_t \nabla(\log\rho_t - \log\pi + 1)\big) = \Delta\rho_t - \nabla\cdot(\rho_t \nabla\log\pi) = \Delta\rho_t + \nabla\cdot(\rho_t \nabla f),$$
where we apply the fact that $\rho\,\nabla\log\rho = \nabla\rho$. In short, the LMC method in probability density space can be viewed as a gradient descent of the KL divergence w.r.t. the optimal transport metric.
8. ### First-order optimization algorithms

Sampling methods turn out to be first-order gradient descent methods in probability space. This viewpoint has many generalizations:
- Fisher-Rao metric and natural gradient methods (Amari, Ny, Zhang, et al.);
- Wasserstein metric and Wasserstein gradient descent (Jordan, Kinderlehrer, Otto, Villani, Carrillo, Slepcev, Liu, Lu, Gao, Ambrosio, Gigli, Savare, Gangbo, et al.);
- Stein metric and Stein variational gradient descent (Liu, Wang);
- Wasserstein-Kalman metric and ensemble Kalman sampling (Garbuno-Inigo, Hoffman, Li, Stuart);
- Wasserstein Newton's metric and Wasserstein Newton's flow (Wang, Li);
- Projected Wasserstein metric and projected Wasserstein gradient flow (Wang, Chen, Pilanci, Li);
- Entropy-entropy flux metric and flux-gradient flows (Li, Liu, Osher, Shu).
9. ### Accelerated sampling dynamics

Goal: Design and compute sampling dynamics in information metric spaces.

Questions:
- What is the "mean-field" accelerated Langevin dynamics?
- What is the "mean-field" Newton Langevin dynamics?
10. ### Review: Gradient flows

- Methods for optimization problems in Euclidean space are well studied:
$$\min_{x \in \mathbb{R}^n} f(x).$$
- The classical gradient descent method follows $x_{k+1} = x_k - s\,\nabla f(x_k)$. It can be formulated as a continuous-time gradient flow, $\dot{x} = -\nabla f(x)$.
11. ### Review: Accelerated gradient flows

- Nesterov's accelerated method is widely applied to accelerate vanilla gradient descent:
$$\ddot{x} + \alpha_t\,\dot{x} + \nabla f(x) = 0.$$
- This is equivalent to a damped Hamiltonian system
$$\begin{pmatrix} \dot{x} \\ \dot{p} \end{pmatrix} + \begin{pmatrix} 0 \\ \alpha_t\, p \end{pmatrix} - \begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix} \begin{pmatrix} \nabla_x H_E(x,p) \\ \nabla_p H_E(x,p) \end{pmatrix} = 0,$$
with the Hamiltonian $H_E(x,p) = \frac{1}{2}\|p\|_2^2 + f(x)$, and initial values $x(0) = x_0$ and $p(0) = 0$.
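
As a sanity check on this formulation, below is a minimal sketch (not from the talk) that integrates the damped Hamiltonian system with a semi-implicit Euler step for a quadratic objective; the damping schedule $\alpha_t = 3/t$, the step size, and the test function are illustrative assumptions.

```python
import numpy as np

def damped_hamiltonian_descent(grad_f, x0, step=1e-2, n_steps=5000):
    """Integrate x' = p, p' = -alpha_t * p - grad f(x) with semi-implicit Euler.
    Uses the convex-case damping alpha_t = 3/t."""
    x = np.array(x0, dtype=float)
    p = np.zeros_like(x)                                  # p(0) = 0
    for k in range(1, n_steps + 1):
        t = k * step
        p = p - step * ((3.0 / t) * p + grad_f(x))        # momentum update with damping 3/t
        x = x + step * p                                  # position update uses the new momentum
    return x

# Example: minimize f(x) = 0.5 * x^T A x with an ill-conditioned diagonal A.
A = np.diag([1.0, 100.0])
x_final = damped_hamiltonian_descent(lambda x: A @ x, x0=np.array([1.0, 1.0]))
print(x_final)  # close to the minimizer [0, 0]
```
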
12. ### Gradient flows

- The gradient flow for $E(\rho)$ in $(\mathcal{P}(\Omega), g_\rho)$ takes the form
$$\partial_t \rho_t = -G(\rho_t)^{-1} \frac{\delta E(\rho_t)}{\delta \rho_t}.$$
- The Fisher-Rao gradient flow is given by
$$\partial_t \rho_t = -\rho_t \Big( \frac{\delta E}{\delta \rho_t} - \int \frac{\delta E}{\delta \rho_t}\,\rho_t\,dx \Big).$$
- The Wasserstein gradient flow writes
$$\partial_t \rho_t = -\big(-\nabla\cdot(\rho_t\nabla)\big)\frac{\delta E}{\delta \rho_t} = \nabla\cdot\Big(\rho_t \nabla \frac{\delta E}{\delta \rho_t}\Big).$$
13. ### Hamiltonian flows in probability space

The Euler-Lagrange equation in the probability space can be formulated as a system in $(\rho_t, \Phi_t)$, i.e.,
$$\begin{cases} \partial_t \rho_t - G(\rho_t)^{-1}\Phi_t = 0, \\ \partial_t \Phi_t + \frac{1}{2}\frac{\delta}{\delta\rho_t}\Big(\int \Phi_t\, G(\rho_t)^{-1}\Phi_t\,dx\Big) + \frac{\delta E}{\delta \rho_t} = 0. \end{cases}$$
These are examples of dynamics in mean-field games, optimal transport, and information geometry.
14. ### Example: Fisher-Rao Hamiltonian flow

- The Fisher-Rao Hamiltonian flow follows
$$\begin{cases} \partial_t \rho_t - \Phi_t\rho_t + \Big(\int \Phi_t\rho_t\,dx\Big)\rho_t = 0, \\ \partial_t \Phi_t + \frac{1}{2}\Phi_t^2 - \Big(\int \rho_t\Phi_t\,dx\Big)\Phi_t + \frac{\delta E}{\delta\rho_t} = 0. \end{cases}$$
- The corresponding Hamiltonian is
$$H_F(\rho_t,\Phi_t) = \frac{1}{2}\bigg(\int \Phi_t^2\,\rho_t\,dx - \Big(\int \rho_t\Phi_t\,dx\Big)^2\bigg) + E(\rho_t).$$
15. ### Example: Wasserstein Hamiltonian flow

- The Wasserstein Hamiltonian flow writes
$$\begin{cases} \partial_t\rho_t + \nabla\cdot(\rho_t\nabla\Phi_t) = 0, \\ \partial_t\Phi_t + \frac{1}{2}\|\nabla\Phi_t\|^2 + \frac{\delta E}{\delta\rho_t} = 0. \end{cases}$$
- The corresponding Hamiltonian forms
$$H_W(\rho_t,\Phi_t) = \frac{1}{2}\int \|\nabla\Phi_t\|^2\,\rho_t\,dx + E(\rho_t).$$
16. ### Damped Hamiltonian flows in probability space

Consider
$$\begin{cases} \partial_t\rho_t - G(\rho_t)^{-1}\Phi_t = 0, \\ \partial_t\Phi_t + \alpha_t\Phi_t + \frac{1}{2}\frac{\delta}{\delta\rho_t}\Big(\int \Phi_t\,G(\rho_t)^{-1}\Phi_t\,dx\Big) + \frac{\delta E}{\delta\rho_t} = 0, \end{cases} \tag{AIG}$$
with initial values $\rho_0 = \rho^0$ and $\Phi_0 = 0$.

AIG: Accelerated Information Gradient flows.
17. ### Fisher-Rao AIG flow

The Fisher-Rao AIG flow writes
$$\begin{cases} \partial_t\rho_t - \Phi_t\rho_t + \Big(\int \Phi_t\rho_t\,dx\Big)\rho_t = 0, \\ \partial_t\Phi_t + \alpha_t\Phi_t + \frac{1}{2}\Phi_t^2 - \Big(\int \rho_t\Phi_t\,dx\Big)\Phi_t + \frac{\delta E}{\delta\rho_t} = 0. \end{cases} \tag{F-AIG}$$
18. ### Wasserstein AIG flow

The Wasserstein AIG flow (Carrillo, 2020; Amir, et al., 2019) satisfies
$$\begin{cases} \partial_t\rho_t + \nabla\cdot(\rho_t\nabla\Phi_t) = 0, \\ \partial_t\Phi_t + \alpha_t\Phi_t + \frac{1}{2}\|\nabla\Phi_t\|^2 + \frac{\delta E}{\delta\rho_t} = 0. \end{cases} \tag{W-AIG}$$
19. ### Covariance AIG flows

We consider AIG dynamics in Gaussian distributions.

Theorem. Suppose that $\rho^0, \rho^* \in \mathcal{N}^n_0$ and their covariance matrices are $\Sigma^0$ and $\Sigma^*$. $E(\Sigma)$ evaluates the KL divergence from $\rho$ to $\rho^*$. Here $(\Sigma_t, S_t)$ satisfies
$$\begin{cases} \dot{\Sigma}_t - (S_t\Sigma_t + \Sigma_t S_t) = 0, \\ \dot{S}_t + \alpha_t S_t + S_t^2 + 2\nabla_{\Sigma_t} E(\Sigma_t) = 0, \end{cases} \tag{W-AIG-G}$$
with initial values $\Sigma_0 = \Sigma^0$ and $S_0 = 0$. In particular, for any $t \geq 0$, $\Sigma_t$ is well-defined and stays positive definite.
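
A minimal sketch (not the talk's code) that integrates (W-AIG-G) with forward Euler; the target covariance, the damping $\alpha_t = 3/t$, and the explicit gradient $\nabla_\Sigma E(\Sigma) = \tfrac{1}{2}(\Sigma^{*-1} - \Sigma^{-1})$ for the zero-mean Gaussian KL divergence are illustrative assumptions.

```python
import numpy as np

def gaussian_kl(sigma, sigma_star):
    """KL(N(0, sigma) || N(0, sigma_star)) for zero-mean Gaussians."""
    n = sigma.shape[0]
    inv_star = np.linalg.inv(sigma_star)
    _, logdet = np.linalg.slogdet(sigma)
    _, logdet_star = np.linalg.slogdet(sigma_star)
    return 0.5 * (np.trace(inv_star @ sigma) - n + logdet_star - logdet)

def w_aig_gaussian(sigma_star, sigma0, dt=1e-3, n_steps=20000):
    """Forward-Euler integration of (W-AIG-G):
       dSigma/dt = S Sigma + Sigma S,  dS/dt = -alpha_t S - S^2 - 2 grad_Sigma E(Sigma),
       with alpha_t = 3/t and grad_Sigma E = 0.5 * (inv(Sigma_star) - inv(Sigma))."""
    sigma, S = sigma0.copy(), np.zeros_like(sigma0)       # S_0 = 0
    inv_star = np.linalg.inv(sigma_star)
    kl_history = []
    for k in range(1, n_steps + 1):
        t = k * dt
        grad = 0.5 * (inv_star - np.linalg.inv(sigma))
        S_new = S + dt * (-(3.0 / t) * S - S @ S - 2.0 * grad)
        sigma = sigma + dt * (S @ sigma + sigma @ S)      # uses the old S (forward Euler)
        S = S_new
        kl_history.append(gaussian_kl(sigma, sigma_star))
    return sigma, kl_history

sigma_star = np.diag([1.0, 10.0])
sigma_final, kl_history = w_aig_gaussian(sigma_star, sigma0=np.eye(2))
print(kl_history[0], kl_history[-1])   # the KL divergence should decay toward 0
```
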
20. ### Convergence rate of W-AIG flows

Theorem. Suppose that $E(\rho)$ satisfies $\mathrm{Hess}(\beta)$ for $\beta > 0$. The solution $\rho_t$ to (W-AIG) with $\alpha_t = 2\sqrt{\beta}$ satisfies
$$E(\rho_t) \leq C_0\, e^{-\sqrt{\beta}\,t} = O(e^{-\sqrt{\beta}\,t}).$$
If $E(\rho)$ only satisfies $\mathrm{Hess}(0)$, then the solution $\rho_t$ to (W-AIG) with $\alpha_t = 3/t$ satisfies
$$E(\rho_t) \leq C_0'\, t^{-2} = O(t^{-2}).$$
Here the constants $C_0, C_0'$ only depend on $\rho_0$.
21. ### Time scaling and convergence results

Convergence rate:
- A sufficient condition for $\mathrm{Hess}(\beta)$ is that $f(x)$ is $\beta$-strongly convex (Lott, Villani, Sturm, et al.).
- The W-AIG flow improves the $O(e^{-2\beta t})$ convergence rate of the gradient flow to $O(e^{-\sqrt{\beta}\,t})$, especially when $\beta$ is close to 0.

Convexity estimations:
- $\beta$ from the Wasserstein Hessian operator (Otto, Villani, et al.) = weak form of Bakry-Emery Gamma calculus.
- Modified Fisher-Rao in optimal transport space = weak form of Gamma calculus (Li, 2018).
- Modified Fisher-Rao in generalized optimal transport space = information Gamma calculus (Feng, Li, 2019-2022). Generalized tensors from (Arnold, Carlen, Ju, Erb, Baudoin, Garofalo).
22. ### W-AIG flows at the particle level

Proposition. Suppose that $X_t \sim \rho_t$ and $V_t = \nabla\Phi_t(X_t)$ are the position and the velocity of a particle at time $t$. Then, the differential equation of the particle system corresponding to (W-AIG) writes
$$\begin{cases} dX_t = V_t\,dt, \\ dV_t = -\alpha_t V_t\,dt - \nabla f(X_t)\,dt - \nabla\log\rho_t(X_t)\,dt. \end{cases}$$
23. ### Discrete-time algorithm for the W-AIG flow

For $i = 1, 2, \ldots, N$, the update rule follows
$$\begin{cases} V^i_{k+1} = \alpha_k V^i_k - \sqrt{\tau}\,\big(\nabla f(X^i_k) + \xi_k(X^i_k)\big), \\ X^i_{k+1} = X^i_k + \sqrt{\tau}\,V^i_{k+1}. \end{cases}$$
If $E(\rho)$ is $\mathrm{Hess}(\beta)$, then $\alpha_k = \frac{1-\sqrt{\beta\tau}}{1+\sqrt{\beta\tau}}$; if $E(\rho)$ is $\mathrm{Hess}(0)$ or $\beta$ is unknown, then $\alpha_k = \frac{k-1}{k+2}$. Here $\xi_k(x)$ is an approximation of $\nabla\log\rho_t(x)$.

Difficulties:
- The approximation of $\nabla\log\rho_t(x)$.
- The stiffness of AIG flows.
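
For orientation, a minimal sketch of this update on a particle system. The fixed-bandwidth Gaussian kernel score estimator below is a simplified stand-in for the Brownian-motion bandwidth selection described on the next slide, and the target, step size, and bandwidth are illustrative assumptions.

```python
import numpy as np

def kde_score(points, queries, h):
    """Estimate xi(x) ≈ grad log rho(x) from samples via a Gaussian kernel density estimate."""
    diff = queries[:, None, :] - points[None, :, :]            # diff[i, j] = queries[i] - points[j]
    w = np.exp(-np.sum(diff**2, axis=-1) / (2.0 * h**2))       # kernel weights, shape (M, N)
    num = -(w[:, :, None] * diff).sum(axis=1) / (h**2)         # sum_j w_ij * (-(x_i - X_j) / h^2)
    return num / w.sum(axis=1, keepdims=True)

def w_aig_step(X, V, k, grad_f, tau, h, beta=None):
    """One discrete W-AIG update for all particles."""
    if beta is not None:                                       # Hess(beta) case
        alpha = (1.0 - np.sqrt(beta * tau)) / (1.0 + np.sqrt(beta * tau))
    else:                                                      # Hess(0) / unknown case
        alpha = (k - 1.0) / (k + 2.0)
    xi = kde_score(X, X, h)                                    # approximation of grad log rho
    V_new = alpha * V - np.sqrt(tau) * (grad_f(X) + xi)
    X_new = X + np.sqrt(tau) * V_new
    return X_new, V_new

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) + 3.0                            # particles, initialized away from the target
V = np.zeros_like(X)
for k in range(1, 301):                                        # target pi ∝ exp(-||x||^2 / 2)
    X, V = w_aig_step(X, V, k, grad_f=lambda x: x, tau=1e-2, h=0.3)
print(X.mean(axis=0), X.var(axis=0))   # approximately 0 mean; variance near 1 (slightly shrunk by the bandwidth)
```
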
24. ### Learning density from Brownian motion

- Given the bandwidth $h$, particles $\{X^i_k\}_{i=1}^N$ and a step size $s$, we can compute two particle systems
$$Y^i_k(h) = X^i_k - s\,\xi_k(X^i_k; h), \qquad Z^i_k = X^i_k + \sqrt{2s}\,B^i, \qquad i = 1,\ldots,N,$$
where $B^i$ is the standard Brownian motion.
- Minimize $\mathrm{MMD}(\hat\rho_Y, \hat\rho_Z)$, the maximum mean discrepancy between the empirical distributions $\hat\rho_Y(x) = \sum_{i=1}^N \delta(x - Y^i_k(h))$ and $\hat\rho_Z(x) = \sum_{i=1}^N \delta(x - Z^i_k)$.
- Optimize over $h$, using the bandwidth $h_{k-1}$ from the last iteration as the initialization.
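
A small sketch of this bandwidth-selection objective. Here `xi_of(X, h)` stands for any score estimator parameterized by the bandwidth $h$ (e.g., a kernel estimate), the RBF kernel is one common choice for the MMD, and the grid search is a stand-in for the initialization-and-optimization over $h$ described above; all of these are assumptions, not the talk's exact procedure.

```python
import numpy as np

def rbf_mmd2(Y, Z, sigma=1.0):
    """Squared maximum mean discrepancy between the empirical measures of Y and Z (RBF kernel)."""
    def gram(A, B):
        d2 = np.sum((A[:, None, :] - B[None, :, :])**2, axis=-1)
        return np.exp(-d2 / (2.0 * sigma**2))
    return gram(Y, Y).mean() + gram(Z, Z).mean() - 2.0 * gram(Y, Z).mean()

def select_bandwidth(X, xi_of, s, h_grid, rng):
    """Pick the bandwidth h whose deterministic step Y = X - s * xi(X; h)
    best matches the Brownian step Z = X + sqrt(2s) * noise, measured in MMD."""
    Z = X + np.sqrt(2.0 * s) * rng.standard_normal(X.shape)
    best_h, best_val = None, np.inf
    for h in h_grid:
        Y = X - s * xi_of(X, h)
        val = rbf_mmd2(Y, Z)
        if val < best_val:
            best_h, best_val = h, val
    return best_h
```
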
25. ### Adaptive restart in density space

- The second equation in (W-AIG) is a Hamilton-Jacobi equation, which usually has strong stiffness. We propose an adaptive restart technique to deal with this problem. Consider
$$\varphi_k = -\sum_{i=1}^N \big\langle V^i_{k+1},\ \nabla f(X^i_k) + \xi_k(X^i_k) \big\rangle,$$
which can be viewed as a discrete-time approximation of $-\partial_t E(\rho_t)$.
- If $\varphi_k < 0$, then we restart the algorithm with initial values $X^i_0 = X^i_k$ and $V^i_0 = 0$.
- This essentially keeps $\partial_t E(\rho_t)$ negative along the trajectory.
- The numerical results show that the adaptive restart accelerates and stabilizes the discrete-time algorithm.
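
A short sketch of how this restart test could be wired into a particle update loop; the helper `w_aig_step` and its arguments are hypothetical placeholders for whatever update and score routines are in use.

```python
import numpy as np

def restart_test(V_new, grad_f_X, xi_X):
    """phi_k = -sum_i <V^i_{k+1}, grad f(X^i_k) + xi_k(X^i_k)>; restart when phi_k < 0,
    i.e. when the discrete estimate of d/dt E(rho_t) turns positive."""
    phi = -np.sum(V_new * (grad_f_X + xi_X))
    return phi < 0

# Inside the particle loop (schematic):
#   X_new, V_new = w_aig_step(X, V, k, grad_f, tau, h)   # hypothetical update routine
#   if restart_test(V_new, grad_f(X), xi):               # xi: the score estimate used in the step
#       V_new, k = np.zeros_like(V_new), 0                # reset momentum, restart the schedule
#   X, V, k = X_new, V_new, k + 1
```
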
26. ### AIG Algorithm

Algorithm 1: Particle implementation of the Wasserstein AIG flow.

Require: initial positions $\{X^i_0\}_{i=1}^N$, step size $\tau$, number of iterations $L$.
1. Set $k = 0$, $V^i_0 = 0$ for $i = 1, \ldots, N$. Set the bandwidth $h_0$ by MED.
2. For $l = 1, 2, \ldots, L$ do:
3. Compute $h_l$ based on BM: $h_l = \mathrm{BM}(h_{l-1}, \{X^i_k\}_{i=1}^N, \sqrt{\tau})$.
4. Calculate $\xi_k(X^i_k)$ with the bandwidth $h_l$.
5. Set $\alpha_k$ based on whether $E(\rho)$ is $\mathrm{Hess}(\beta)$ or $\mathrm{Hess}(0)$. Update $V^i_{k+1} = \alpha_k V^i_k - \sqrt{\tau}\,(\nabla f(X^i_k) + \xi_k(X^i_k))$, $X^i_{k+1} = X^i_k + \sqrt{\tau}\,V^i_{k+1}$.
6. If RESTART then:
7. Compute $\varphi_k = -\sum_{i=1}^N \langle V^i_{k+1}, \nabla f(X^i_k) + \xi_k(X^i_k)\rangle$.
8. If $\varphi_k < 0$, set $X^i_0 = X^i_k$, $V^i_0 = 0$ and $k = 0$; otherwise set $k = k+1$.
9. Else:
10. Set $k = k + 1$.
11. End if.
12. End for.
27. ### Gaussian examples: ODE level

- The initial value is set to be $\Sigma_0 = I$.
- The following figure presents the convergence of the KL divergence on two target distributions.

Figure: The acceleration of the W-AIG flow and the effect of adaptive restart (ODE level). The target density is a Gaussian distribution with zero mean on $\mathbb{R}^{100}$. Left: $\kappa \approx 3.8 \times 10^3$; Right: $\kappa \approx 4.0 \times 10^3$.
28. ### Gaussian examples: Particle level

- The initial distribution of samples follows $N(0, I)$ and the number of samples is $N = 600$.
- For a particle system $\{X^i_k\}_{i=1}^N$, we record the KL divergence $E(\hat\Sigma_k)$ using the empirical covariance matrix $\hat\Sigma_k$.

Figure: The acceleration of the W-AIG flow and the effect of adaptive restart (particle level). The setting of the target density is identical to the previous figure.
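
For reference, a small sketch of how the recorded quantity $E(\hat\Sigma_k)$ can be computed from the particles, assuming zero-mean Gaussians and the closed-form KL divergence.

```python
import numpy as np

def empirical_gaussian_kl(particles, sigma_star):
    """E(Sigma_hat) = KL(N(0, Sigma_hat) || N(0, Sigma_star)), with Sigma_hat the empirical covariance."""
    sigma_hat = np.cov(particles.T, bias=True)        # empirical covariance of the particle system
    n = sigma_hat.shape[0]
    inv_star = np.linalg.inv(sigma_star)
    _, logdet_hat = np.linalg.slogdet(sigma_hat)
    _, logdet_star = np.linalg.slogdet(sigma_star)
    return 0.5 * (np.trace(inv_star @ sigma_hat) - n + logdet_star - logdet_hat)
```
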
29. ### Bayesian logistic regression

Figure: Comparison of different methods on Bayesian logistic regression, averaged over 10 independent trials. The shaded areas show the variance over the 10 trials. Top: BM; Bottom: MED. Left: test accuracy; Right: test log-likelihood.
30. ### Projected Wasserstein gradient algorithms

Mean-field games / optimal transport + optimization + UQ methods.

34. ### Future works

Complex systems; Bayesian sampling; time discretization; spatial discretization; entropy-entropy flux metric; optimization.