Slide 1

Slide 1 text

Transport Optimization Methods in Bayesian Sampling Problems

Wuchen Li, University of South Carolina
SIAM MDS 22, San Diego, Sep 30

Joint work with Peng Chen (Georgia Tech) and Yifei Wang (Stanford). Partially supported by AFOSR MURI FA9550-18-1-0502 and NSF RTG: 2038080.

Slide 2

Slide 2 text

Sampling applications

Slide 3

Slide 3 text

Bayesian sampling problems

Main problem: Given a function $f(x)$, $x \in \Omega$, can we generate samples from $\pi(x) \propto e^{-f(x)}$, where $\pi$ is a probability density function?

Difficulties:
- Fast, efficient, accurate algorithms.
- High dimensional sampling: $\Omega \subset \mathbb{R}^N$, $N \gg 3$.

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

A sampling algorithm

Consider Langevin dynamics:
$$dX_t = -\nabla f(X_t)\,dt + \sqrt{2}\,dB_t,$$
where $X_t \in \mathbb{R}^d$ and $B_t$ is the standard Brownian motion in $\mathbb{R}^d$. Denote $X_t \sim \rho_t$. The evolution of $\rho_t$ satisfies
$$\partial_t \rho_t = \nabla \cdot (\rho_t \nabla f) + \Delta \rho_t.$$
Its invariant distribution satisfies
$$\rho_\infty(x) = \pi(x) = \frac{1}{K} e^{-f(x)}, \qquad K = \int e^{-f(y)}\,dy.$$
The Euler-Maruyama discretization of Langevin dynamics gives the Langevin Monte Carlo (LMC) method.
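For concreteness, here is a minimal sketch of LMC via the Euler-Maruyama discretization. The Gaussian target, step size, and iteration counts below are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def lmc_sample(grad_f, x0, step=1e-2, n_steps=5000, rng=None):
    """Langevin Monte Carlo: Euler-Maruyama discretization of
    dX_t = -grad f(X_t) dt + sqrt(2) dB_t."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - step * grad_f(x) + np.sqrt(2.0 * step) * noise
    return x

# Illustrative target: pi(x) proportional to exp(-|x|^2/2), i.e. f(x) = |x|^2/2.
grad_f = lambda x: x
samples = np.array([lmc_sample(grad_f, np.zeros(2)) for _ in range(200)])
print(samples.mean(axis=0), samples.var(axis=0))  # roughly 0 and 1
```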

Slide 6

Slide 6 text

Optimization in probability space

Nowadays, sampling algorithms relate to a variational problem in the probability density space
$$\mathcal{P}(\Omega) = \Big\{\rho \in \mathcal{F}(\Omega)\colon \int_\Omega \rho\,dx = 1,\ \rho \geq 0\Big\}.$$
Consider the optimization problem
$$\min_{\rho \in \mathcal{P}(\Omega)} E(\rho),$$
where $E(\rho)$ measures the closeness between $\rho$ and the target distribution $\pi$.

Slide 7

Slide 7 text

Gradient descent = LMC methods

Consider
$$E(\rho) = \int \rho(x) \log \frac{\rho(x)}{\pi(x)}\,dx.$$
Observe that
$$\partial_t \rho_t = \nabla \cdot \Big(\rho_t \nabla \frac{\delta}{\delta \rho} E(\rho_t)\Big) = \nabla \cdot \big(\rho_t \nabla(\log \rho_t - \log \pi + 1)\big) = \Delta \rho_t - \nabla \cdot (\rho_t \nabla \log \pi) = \Delta \rho_t + \nabla \cdot (\rho_t \nabla f),$$
where we apply the fact that $\rho \nabla \log \rho = \nabla \rho$. In short, the LMC method in probability density space can be viewed as a gradient descent of the KL divergence w.r.t. the optimal transport metric.

Slide 8

Slide 8 text

First-order optimization algorithms

Sampling methods turn out to be first-order gradient descent methods in probability space. This viewpoint has many generalizations:
- Fisher-Rao metric and natural gradient methods (Amari, Ny, Zhang, et al.);
- Wasserstein metric and Wasserstein gradient descent (Jordan, Kinderlehrer, Otto, Villani, Carrillo, Slepcev, Liu, Lu, Gao, Ambrosio, Gigli, Savare, Gangbo, et al.);
- Stein metric and Stein variational gradient descent (Liu, Wang);
- Wasserstein-Kalman metric and Ensemble Kalman sampling (Garbuno-Inigo, Hoffmann, Li, Stuart);
- Wasserstein Newton's metric and Wasserstein Newton's flow (Wang, Li);
- Projected Wasserstein metric and projected Wasserstein gradient flow (Wang, Chen, Pilanci, Li);
- Entropy-entropy flux metric and flux-gradient flows (Li, Liu, Osher, Shu).

Slide 9

Slide 9 text

Accelerated sampling dynamics

Goal: Design and compute sampling dynamics in information metric spaces.

Questions:
- What is the "mean-field" accelerated Langevin dynamics?
- What is the "mean-field" Newton Langevin dynamics?

Slide 10

Slide 10 text

Review: Gradient flows

- Methods for optimization problems in the Euclidean space have been well-studied:
$$\min_{x \in \mathbb{R}^n} f(x).$$
- The classical gradient descent method follows
$$x_{k+1} = x_k - s \nabla f(x_k).$$
It can be formulated as a continuous-time gradient flow,
$$\dot{x} = -\nabla f(x).$$

Slide 11

Slide 11 text

Review: Accelerated gradient flows

- Nesterov's accelerated method is widely applied to accelerate vanilla gradient descent:
$$\ddot{x} + \alpha_t \dot{x} + \nabla f(x) = 0.$$
- This is equivalent to a damped Hamiltonian system
$$\begin{pmatrix} \dot{x} \\ \dot{p} \end{pmatrix} + \begin{pmatrix} 0 \\ \alpha_t p \end{pmatrix} - \begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix} \begin{pmatrix} \nabla_x H^E(x, p) \\ \nabla_p H^E(x, p) \end{pmatrix} = 0,$$
with the Hamiltonian $H^E(x, p) = \frac{1}{2}\|p\|_2^2 + f(x)$ and initial values $x(0) = x_0$, $p(0) = 0$.
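A minimal numerical sketch of this damped Hamiltonian system is given below. The quadratic objective, step size, and damping choices are illustrative assumptions; the damping term is handled implicitly only for numerical stability of the sketch.

```python
import numpy as np

def accelerated_flow(grad_f, x0, dt=1e-3, T=10.0, alpha=None):
    """Integrate the damped Hamiltonian system
        dx/dt = p,    dp/dt = -alpha(t) p - grad_f(x),
    which is equivalent to Nesterov's ODE  x'' + alpha_t x' + grad f(x) = 0."""
    if alpha is None:
        alpha = lambda t: 3.0 / (t + 1e-8)   # alpha_t = 3/t (convex case)
    x = np.asarray(x0, dtype=float).copy()
    p = np.zeros_like(x)                      # p(0) = 0
    for k in range(int(T / dt)):
        t = k * dt
        p = (p - dt * grad_f(x)) / (1.0 + dt * alpha(t))  # implicit damping step
        x = x + dt * p
    return x

# Illustrative quadratic f(x) = 0.5 x^T A x (an assumption, not from the slides);
# f is 1-strongly convex, so a constant damping alpha_t = 2*sqrt(1) = 2 is natural.
A = np.diag([1.0, 10.0])
print(accelerated_flow(lambda x: A @ x, x0=[5.0, -5.0], alpha=lambda t: 2.0))
# output is close to the minimizer at the origin
```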

Slide 12

Slide 12 text

Gradient flows

- The gradient flow for $E(\rho)$ in $(\mathcal{P}(\Omega), g_\rho)$ takes the form
$$\partial_t \rho_t = -G(\rho_t)^{-1} \frac{\delta E(\rho_t)}{\delta \rho_t}.$$
- The Fisher-Rao gradient flow is given by
$$\partial_t \rho_t = -\rho_t \Big( \frac{\delta E}{\delta \rho_t} - \int \frac{\delta E}{\delta \rho_t}\, \rho_t\,dx \Big).$$
- The Wasserstein gradient flow writes
$$\partial_t \rho_t = -\big(-\nabla \cdot (\rho_t \nabla)\big)\Big(\frac{\delta E}{\delta \rho_t}\Big) = \nabla \cdot \Big(\rho_t \nabla \frac{\delta E}{\delta \rho_t}\Big).$$

Slide 13

Slide 13 text

Hamiltonian flows in probability space

The Euler-Lagrange equation in the probability space can be formulated as a system in $(\rho_t, \Phi_t)$, i.e.,
$$\begin{cases} \partial_t \rho_t - G(\rho_t)^{-1} \Phi_t = 0, \\ \partial_t \Phi_t + \dfrac{1}{2} \dfrac{\delta}{\delta \rho_t} \Big( \displaystyle\int \Phi_t\, G(\rho_t)^{-1} \Phi_t\,dx \Big) + \dfrac{\delta E}{\delta \rho_t} = 0. \end{cases}$$
These are examples of dynamics in mean-field games, optimal transport, and information geometry.

Slide 14

Slide 14 text

Example: Fisher-Rao Hamiltonian flow

- The Fisher-Rao Hamiltonian flow follows
$$\begin{cases} \partial_t \rho_t - \Phi_t \rho_t + \Big(\displaystyle\int \Phi_t \rho_t\,dx\Big)\rho_t = 0, \\ \partial_t \Phi_t + \dfrac{1}{2}\Phi_t^2 - \Big(\displaystyle\int \rho_t \Phi_t\,dx\Big)\Phi_t + \dfrac{\delta E}{\delta \rho_t} = 0. \end{cases}$$
- The corresponding Hamiltonian is
$$H_F(\rho_t, \Phi_t) = \frac{1}{2}\left( \int \Phi_t^2\, \rho_t\,dx - \Big(\int \rho_t \Phi_t\,dx\Big)^2 \right) + E(\rho_t).$$

Slide 15

Slide 15 text

Example: Wasserstein Hamiltonian flow

- The Wasserstein Hamiltonian flow writes
$$\begin{cases} \partial_t \rho_t + \nabla \cdot (\rho_t \nabla \Phi_t) = 0, \\ \partial_t \Phi_t + \dfrac{1}{2}\|\nabla \Phi_t\|^2 + \dfrac{\delta E}{\delta \rho_t} = 0. \end{cases}$$
- The corresponding Hamiltonian forms
$$H_W(\rho_t, \Phi_t) = \frac{1}{2}\int \|\nabla \Phi_t\|^2\, \rho_t\,dx + E(\rho_t).$$

Slide 16

Slide 16 text

Damped Hamiltonian flows in probability space

Consider
$$\begin{cases} \partial_t \rho_t - G(\rho_t)^{-1} \Phi_t = 0, \\ \partial_t \Phi_t + \alpha_t \Phi_t + \dfrac{1}{2} \dfrac{\delta}{\delta \rho_t}\Big( \displaystyle\int \Phi_t\, G(\rho_t)^{-1} \Phi_t\,dx \Big) + \dfrac{\delta E}{\delta \rho_t} = 0, \end{cases} \tag{AIG}$$
with initial values $\rho_0 = \rho^0$ and $\Phi_0 = 0$.

AIG: Accelerated Information Gradient flows.

Slide 17

Slide 17 text

Fisher-Rao AIG flow

The Fisher-Rao AIG flow writes
$$\begin{cases} \partial_t \rho_t - \Phi_t \rho_t + \Big(\displaystyle\int \Phi_t \rho_t\,dx\Big)\rho_t = 0, \\ \partial_t \Phi_t + \alpha_t \Phi_t + \dfrac{1}{2}\Phi_t^2 - \Big(\displaystyle\int \rho_t \Phi_t\,dx\Big)\Phi_t + \dfrac{\delta E}{\delta \rho_t} = 0. \end{cases} \tag{F-AIG}$$

Slide 18

Slide 18 text

Wasserstein AIG flow

The Wasserstein AIG flow (Carrillo, 2020; Amir, et al. 2019) satisfies
$$\begin{cases} \partial_t \rho_t + \nabla \cdot (\rho_t \nabla \Phi_t) = 0, \\ \partial_t \Phi_t + \alpha_t \Phi_t + \dfrac{1}{2}\|\nabla \Phi_t\|^2 + \dfrac{\delta E}{\delta \rho_t} = 0. \end{cases} \tag{W-AIG}$$

Slide 19

Slide 19 text

Covariance AIG flows

We consider AIG dynamics in Gaussian distributions.

Theorem. Suppose that $\rho^0, \rho^* \in \mathcal{N}_0^n$ and their covariance matrices are $\Sigma^0$ and $\Sigma^*$. $E(\Sigma)$ evaluates the KL divergence from $\rho$ to $\rho^*$. Here $(\Sigma_t, S_t)$ satisfies
$$\begin{cases} \dot{\Sigma}_t - (S_t \Sigma_t + \Sigma_t S_t) = 0, \\ \dot{S}_t + \alpha_t S_t + S_t^2 + 2\nabla_{\Sigma_t} E(\Sigma_t) = 0, \end{cases} \tag{W-AIG-G}$$
with initial values $\Sigma_0 = \Sigma^0$ and $S_0 = 0$. In particular, for any $t \geq 0$, $\Sigma_t$ is well-defined and stays positive definite.
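To make (W-AIG-G) concrete, here is a minimal forward-Euler sketch for zero-mean Gaussians. It assumes the standard Gaussian-KL gradient $\nabla_\Sigma E(\Sigma) = \frac{1}{2}\big(\Sigma^{*-1} - \Sigma^{-1}\big)$ and a constant damping $\alpha_t$; neither choice is specified on the slide.

```python
import numpy as np

def waig_gaussian(Sigma0, Sigma_star, alpha=2.0, dt=1e-3, T=20.0):
    """Forward-Euler sketch of (W-AIG-G) for zero-mean Gaussians:
        dSigma/dt = S Sigma + Sigma S,
        dS/dt     = -alpha S - S^2 - 2 grad_Sigma E(Sigma),
    with E(Sigma) = KL(N(0, Sigma) || N(0, Sigma*)) and
    grad_Sigma E = 0.5 * (inv(Sigma*) - inv(Sigma))  (assumed Gaussian-KL fact)."""
    Sigma = np.array(Sigma0, dtype=float)
    S = np.zeros_like(Sigma)
    inv_star = np.linalg.inv(Sigma_star)
    for _ in range(int(T / dt)):
        grad_E = 0.5 * (inv_star - np.linalg.inv(Sigma))
        dSigma = S @ Sigma + Sigma @ S
        dS = -alpha * S - S @ S - 2.0 * grad_E
        Sigma, S = Sigma + dt * dSigma, S + dt * dS
    return Sigma

Sigma_star = np.diag([2.0, 0.5])
print(waig_gaussian(np.eye(2), Sigma_star))  # approaches Sigma_star
```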

Slide 20

Slide 20 text

Convergence rate of W-AIG flows

Theorem. Suppose that $E(\rho)$ satisfies $\mathrm{Hess}(\lambda)$ for $\lambda > 0$. The solution $\rho_t$ to (W-AIG) with $\alpha_t = 2\sqrt{\lambda}$ satisfies
$$E(\rho_t) \leq C_0 e^{-\sqrt{\lambda}\, t} = O(e^{-\sqrt{\lambda}\, t}).$$
If $E(\rho)$ only satisfies $\mathrm{Hess}(0)$, then the solution $\rho_t$ to (W-AIG) with $\alpha_t = 3/t$ satisfies
$$E(\rho_t) \leq C_0'\, t^{-2} = O(t^{-2}).$$
Here the constants $C_0, C_0'$ only depend on $\rho^0$.

Slide 21

Slide 21 text

Time scaling and convergence results

Convergence rate
- A sufficient condition for $\mathrm{Hess}(\lambda)$ is that $f(x)$ is $\lambda$-strongly convex.
- The W-AIG flow improves the $O(e^{-2\lambda t})$ convergence rate of the gradient flow to $O(e^{-\sqrt{\lambda}\, t})$, especially when $\lambda$ is close to 0.

Convexity estimations
- $\lambda$ from the Wasserstein Hessian operator (Otto, Villani, Lott, Sturm, et al.) = weak form of Bakry-Emery Gamma calculus.
- Modified Fisher-Rao in optimal transport space = weak form of Gamma calculus (Li, 2018).
- Modified Fisher-Rao in generalized optimal transport space = Information Gamma calculus (Feng, Li, 2019-2022).

Generalized tensors from (Arnold, Carlen, Ju, Erb, Baudoin, Garofalo).

Slide 22

Slide 22 text

W-AIG flows at the particle level

Proposition. Suppose that $X_t \sim \rho_t$ and $V_t = \nabla \Phi_t(X_t)$ are the position and the velocity of a particle at time $t$. Then the differential equation of the particle system corresponding to (W-AIG) writes
$$\begin{cases} dX_t = V_t\,dt, \\ dV_t = -\alpha_t V_t\,dt - \nabla f(X_t)\,dt - \nabla \log \rho_t(X_t)\,dt. \end{cases}$$

Slide 23

Slide 23 text

Discrete-time algorithm for W-AIG flow

For $i = 1, 2, \ldots, N$, the update rule follows
$$\begin{cases} V_{k+1}^i = \alpha_k V_k^i - \sqrt{\tau}\big(\nabla f(X_k^i) + \xi_k(X_k^i)\big), \\ X_{k+1}^i = X_k^i + \sqrt{\tau}\, V_{k+1}^i. \end{cases}$$
If $E(\rho)$ is $\mathrm{Hess}(\lambda)$, then $\alpha_k = \frac{1 - \sqrt{\lambda\tau}}{1 + \sqrt{\lambda\tau}}$; if $E(\rho)$ is $\mathrm{Hess}(0)$ or $\lambda$ is unknown, then $\alpha_k = \frac{k-1}{k+2}$. Here $\xi_k(x)$ is an approximation of $\nabla \log \rho_t(x)$.

Difficulties
- The approximation of $\nabla \log \rho_t(x)$.
- The stiffness of AIG flows.

Slide 24

Slide 24 text

Learning density from Brownian motion

- Given the bandwidth $h$, $\{X_k^i\}_{i=1}^N$ and a step size $s$, we can compute two particle systems
$$Y_k^i(h) = X_k^i - s\,\xi_k(X_k^i; h), \qquad Z_k^i = X_k^i + \sqrt{2s}\, B^i, \qquad i = 1, \ldots, N,$$
where $B^i$ is the standard Brownian motion.
- Minimize $\mathrm{MMD}(\hat\rho_Y, \hat\rho_Z)$, the maximum mean discrepancy between the empirical distributions $\hat\rho_Y(x) = \frac{1}{N}\sum_{i=1}^N \delta(x - Y_k^i(h))$ and $\hat\rho_Z(x) = \frac{1}{N}\sum_{i=1}^N \delta(x - Z_k^i)$.
- Optimize over $h$, using the bandwidth $h_{k-1}$ from the last iteration as the initialization (see the sketch below).
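Below is a rough sketch of this bandwidth selection, assuming a Gaussian kernel density estimator for the score $\xi_k(\cdot\,; h)$, a Gaussian-kernel MMD, and a small grid search around the previous bandwidth; the slide does not fix any of these choices.

```python
import numpy as np

def kde_score(X, query, h):
    """Score xi(x; h) = grad log rho_hat(x) from a Gaussian KDE with bandwidth h."""
    diff = query[:, None, :] - X[None, :, :]           # (M, N, d)
    w = np.exp(-np.sum(diff**2, axis=-1) / (2.0 * h))  # (M, N)
    w /= w.sum(axis=1, keepdims=True)
    return -np.einsum('mn,mnd->md', w, diff) / h       # (M, d)

def mmd2(Y, Z, sigma=1.0):
    """Squared maximum mean discrepancy with a Gaussian kernel."""
    def k(A, B):
        d2 = np.sum((A[:, None, :] - B[None, :, :])**2, axis=-1)
        return np.exp(-d2 / (2.0 * sigma**2))
    return k(Y, Y).mean() + k(Z, Z).mean() - 2.0 * k(Y, Z).mean()

def bm_bandwidth(h_prev, X, s, rng, n_grid=7):
    """Select the bandwidth by matching a score step against a Brownian step:
       Y_i(h) = X_i - s * xi(X_i; h),   Z_i = X_i + sqrt(2 s) * B_i,
    then pick h near the previous one minimizing MMD(rho_Y, rho_Z)."""
    Z = X + np.sqrt(2.0 * s) * rng.standard_normal(X.shape)
    candidates = h_prev * np.logspace(-0.5, 0.5, n_grid)
    losses = [mmd2(X - s * kde_score(X, X, h), Z) for h in candidates]
    return candidates[int(np.argmin(losses))]

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
print(bm_bandwidth(1.0, X, s=0.01, rng=rng))
```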

Slide 25

Slide 25 text

Adaptive restart in density space

- The second equation in (W-AIG) is a Hamilton-Jacobi equation, which usually exhibits strong stiffness. We propose an adaptive restart technique to deal with this problem. Consider
$$\varphi_k = -\sum_{i=1}^N \big\langle V_{k+1}^i,\ \nabla f(X_k^i) + \xi_k(X_k^i) \big\rangle,$$
which can be viewed as a discrete-time approximation of $-\partial_t E(\rho_t)$.
- If $\varphi_k < 0$, then we restart the algorithm with initial values $X_0^i = X_k^i$ and $V_0^i = 0$.
- This essentially keeps $\partial_t E(\rho_t)$ negative along the trajectory.
- The numerical results show that the adaptive restart accelerates and stabilizes the discrete-time algorithm.

Slide 26

Slide 26 text

AIG Algorithm

Algorithm 1: Particle implementation of Wasserstein AIG flow
Require: initial positions $\{X_0^i\}_{i=1}^N$, step size $\tau$, number of iterations $L$.
1: Set $k = 0$, $V_0^i = 0$, $i = 1, \ldots, N$. Set the bandwidth $h_0$ by MED.
2: for $l = 1, 2, \ldots, L$ do
3:   Compute $h_l$ based on BM: $h_l = \mathrm{BM}(h_{l-1}, \{X_k^i\}_{i=1}^N, \sqrt{\tau})$.
4:   Calculate $\xi_k(X_k^i)$ with the bandwidth $h_l$.
5:   Set $\alpha_k$ based on whether $E(\rho)$ is $\mathrm{Hess}(\lambda)$ or $\mathrm{Hess}(0)$. Update
     $V_{k+1}^i = \alpha_k V_k^i - \sqrt{\tau}\big(\nabla f(X_k^i) + \xi_k(X_k^i)\big)$, $X_{k+1}^i = X_k^i + \sqrt{\tau}\, V_{k+1}^i$.
6:   if RESTART then
7:     Compute $\varphi_k = -\sum_{i=1}^N \langle V_{k+1}^i, \nabla f(X_k^i) + \xi_k(X_k^i)\rangle$.
8:     If $\varphi_k < 0$, set $X_0^i = X_k^i$, $V_0^i = 0$ and $k = 0$; otherwise set $k = k + 1$.
9:   else
10:    Set $k = k + 1$.
11:  end if
12: end for
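A self-contained Python sketch of Algorithm 1 follows. For simplicity it replaces the MED/BM bandwidth selection with a fixed kernel bandwidth and assumes a Gaussian KDE score estimator; the target, step size, and particle numbers are illustrative assumptions rather than the settings used on the slides.

```python
import numpy as np

def kde_score(X, h):
    """Approximate xi(x) = grad log rho(x) at the particles with a Gaussian KDE
    of bandwidth h (one possible estimator; the slides leave this choice open)."""
    diff = X[:, None, :] - X[None, :, :]                # (N, N, d)
    w = np.exp(-np.sum(diff**2, axis=-1) / (2.0 * h))   # (N, N)
    w /= w.sum(axis=1, keepdims=True)
    return -np.einsum('ij,ijd->id', w, diff) / h

def particle_waig(grad_f, X0, tau=1e-3, L=2000, lam=None, h=0.1, restart=True):
    """Sketch of Algorithm 1: particle Wasserstein AIG flow with adaptive restart."""
    X = np.array(X0, dtype=float)
    V = np.zeros_like(X)
    k = 0
    sqrt_tau = np.sqrt(tau)
    for _ in range(L):
        xi = kde_score(X, h)
        if lam is not None:                      # E is Hess(lambda)
            alpha_k = (1.0 - np.sqrt(lam * tau)) / (1.0 + np.sqrt(lam * tau))
        else:                                    # Hess(0) or lambda unknown
            alpha_k = (k - 1.0) / (k + 2.0)
        drift = grad_f(X) + xi
        V = alpha_k * V - sqrt_tau * drift
        X = X + sqrt_tau * V
        if restart:
            phi_k = -np.sum(V * drift)
            if phi_k < 0:                        # energy no longer decreasing: restart
                V = np.zeros_like(X)
                k = 0
            else:
                k += 1
        else:
            k += 1
    return X

# Illustrative target pi proportional to exp(-|x|^2/2), i.e. f(x) = |x|^2/2 (an assumption).
rng = np.random.default_rng(1)
X0 = rng.standard_normal((300, 2)) * 3.0 + 2.0
samples = particle_waig(lambda X: X, X0, lam=1.0)
print(samples.mean(axis=0), samples.std(axis=0))  # roughly 0 and 1 (KDE bias aside)
```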

Slide 27

Slide 27 text

Gaussian examples: ODE level

- The initial value is set to be $\Sigma^0 = I$.
- The following figure presents the convergence of the KL divergence on two target distributions with small/large L.

Figure: The acceleration of the W-AIG flow and the effect of adaptive restart (ODE level). The target density is a Gaussian distribution with zero mean on $\mathbb{R}^{100}$; the two panels correspond to the two target distributions.

Slide 28

Slide 28 text

Gaussian examples: Particle level

- The initial distribution of samples follows $N(0, I)$ and the number of samples is $N = 600$.
- For a particle system $\{X_k^i\}_{i=1}^N$, we record the KL divergence $E(\hat\Sigma_k)$ using the empirical covariance matrix $\hat\Sigma_k$.

Figure: The acceleration of the W-AIG flow and the effect of adaptive restart (particle level). The setting of the target density is identical to the previous figure.

Slide 29

Slide 29 text

Bayesian logistic regression

Figure: Comparison of different methods on Bayesian logistic regression, averaged over 10 independent trials. The shaded areas show the variance over the 10 trials. Top: BM; Bottom: MED. Left: Test accuracy; Right: Test log-likelihood.

Slide 30

Slide 30 text

Projected Wasserstein gradient algorithms

Mean field games / optimal transport + optimization + UQ methods

Slide 31

Slide 31 text

PWGD for COVID-19 dynamics

Slide 32

Slide 32 text

Neural network PWGD for PDE Bayesian inference

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

Future works

- Complex systems
- Bayesian sampling
- Time discretization
- Spatial discretization
- Entropy-entropy flux metric
- Optimization