Slide 1

Slide 1 text

Transport Optimization Methods in Bayesian Sampling Problems

Wuchen Li, University of South Carolina
SIAM MDS 22, San Diego, Sep 30

Joint work with Peng Chen (Georgia Tech) and Yifei Wang (Stanford). Partially supported by AFOSR MURI FA9550-18-1-0502 and NSF RTG: 2038080.

Slide 2

Slide 2 text

Sampling applications

Slide 3

Slide 3 text

Bayesian sampling problems

Main problem: Given a function $f(x)$, $x \in \Omega$, can we generate samples from $\pi(x) \propto e^{-f(x)}$, where $\pi$ is a probability density function?

Difficulties:
- Fast, efficient, accurate algorithms.
- High dimensional sampling: $\Omega \subset \mathbb{R}^N$, $N \gg 3$.

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

A sampling algorithm

Consider Langevin dynamics:
$$dX_t = -\nabla f(X_t)\,dt + \sqrt{2}\,dB_t,$$
where $X_t \in \mathbb{R}^d$ and $B_t$ is the standard Brownian motion in $\mathbb{R}^d$. Denote $X_t \sim \rho_t$. The evolution of $\rho_t$ satisfies
$$\partial_t \rho_t = \nabla \cdot (\rho_t \nabla f) + \Delta \rho_t.$$
Its invariant distribution satisfies
$$\rho_\infty(x) = \pi(x) = \frac{1}{K} e^{-f(x)}, \qquad K = \int e^{-f(y)}\,dy.$$
The Euler-Maruyama discretization of Langevin dynamics gives the Langevin Monte Carlo (LMC) method.
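For concreteness, here is a minimal sketch of LMC via the Euler-Maruyama discretization. The Gaussian target, step size, and iteration counts below are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def lmc_sample(grad_f, x0, step=1e-2, n_steps=5000, rng=None):
    """Langevin Monte Carlo: Euler-Maruyama discretization of
    dX_t = -grad f(X_t) dt + sqrt(2) dB_t."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - step * grad_f(x) + np.sqrt(2.0 * step) * noise
    return x

# Illustrative target: pi(x) proportional to exp(-|x|^2/2), i.e. f(x) = |x|^2/2.
grad_f = lambda x: x
samples = np.array([lmc_sample(grad_f, np.zeros(2)) for _ in range(200)])
print(samples.mean(axis=0), samples.var(axis=0))  # roughly 0 and 1
```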

Slide 6

Slide 6 text

Optimization in probability space

Nowadays, sampling algorithms relate to a variational problem in the probability density space
$$\mathcal{P}(\Omega) = \Big\{\rho \in \mathcal{F}(\Omega)\colon \int_\Omega \rho\,dx = 1,\ \rho \geq 0\Big\}.$$
Consider the optimization problem
$$\min_{\rho \in \mathcal{P}(\Omega)} E(\rho),$$
where $E(\rho)$ measures the closeness between $\rho$ and the target distribution $\pi$.

Slide 7

Slide 7 text

Gradient descent = LMC methods

Consider
$$E(\rho) = \int \rho(x) \log \frac{\rho(x)}{\pi(x)}\,dx.$$
Observe that
$$\partial_t \rho_t = \nabla \cdot \Big(\rho_t \nabla \frac{\delta}{\delta \rho} E(\rho_t)\Big) = \nabla \cdot \big(\rho_t \nabla(\log \rho_t - \log \pi + 1)\big) = \Delta \rho_t - \nabla \cdot (\rho_t \nabla \log \pi) = \Delta \rho_t + \nabla \cdot (\rho_t \nabla f),$$
where we apply the fact that $\rho \nabla \log \rho = \nabla \rho$. In short, the LMC method in probability density space can be viewed as a gradient descent of the KL divergence w.r.t. the optimal transport metric.

Slide 8

Slide 8 text

First-order optimization algorithms

Sampling methods turn out to be first-order gradient descent methods in probability space. This viewpoint has many generalizations:
- Fisher-Rao metric and natural gradient methods (Amari, Ny, Zhang, et al.);
- Wasserstein metric and Wasserstein gradient descent (Jordan, Kinderlehrer, Otto, Villani, Carrillo, Slepcev, Liu, Lu, Gao, Ambrosio, Gigli, Savare, Gangbo, et al.);
- Stein metric and Stein variational gradient descent (Liu, Wang);
- Wasserstein-Kalman metric and Ensemble Kalman sampling (Garbuno-Inigo, Hoffmann, Li, Stuart);
- Wasserstein Newton's metric and Wasserstein Newton's flow (Wang, Li);
- Projected Wasserstein metric and projected Wasserstein gradient flow (Wang, Chen, Pilanci, Li);
- Entropy-entropy flux metric and flux-gradient flows (Li, Liu, Osher, Shu).

Slide 9

Slide 9 text

Accelerated sampling dynamics

Goal: Design and compute sampling dynamics in information metric spaces.

Questions:
- What is the "mean-field" accelerated Langevin dynamics?
- What is the "mean-field" Newton Langevin dynamics?

Slide 10

Slide 10 text

Review: Gradient flows

- Methods for optimization problems in the Euclidean space have been well-studied:
$$\min_{x \in \mathbb{R}^n} f(x).$$
- The classical gradient descent method follows
$$x_{k+1} = x_k - s \nabla f(x_k).$$
It can be formulated as a continuous-time gradient flow,
$$\dot{x} = -\nabla f(x).$$

Slide 11

Slide 11 text

Review: Accelerated gradient flows

- Nesterov's accelerated method is widely applied to accelerate vanilla gradient descent:
$$\ddot{x} + \alpha_t \dot{x} + \nabla f(x) = 0.$$
- This is equivalent to a damped Hamiltonian system
$$\begin{pmatrix} \dot{x} \\ \dot{p} \end{pmatrix} + \begin{pmatrix} 0 \\ \alpha_t p \end{pmatrix} - \begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix} \begin{pmatrix} \nabla_x H^E(x, p) \\ \nabla_p H^E(x, p) \end{pmatrix} = 0,$$
with the Hamiltonian $H^E(x, p) = \frac{1}{2}\|p\|_2^2 + f(x)$ and initial values $x(0) = x_0$, $p(0) = 0$.
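A minimal numerical sketch of this damped Hamiltonian system is given below. The quadratic objective, step size, and damping choices are illustrative assumptions; the damping term is handled implicitly only for numerical stability of the sketch.

```python
import numpy as np

def accelerated_flow(grad_f, x0, dt=1e-3, T=10.0, alpha=None):
    """Integrate the damped Hamiltonian system
        dx/dt = p,    dp/dt = -alpha(t) p - grad_f(x),
    which is equivalent to Nesterov's ODE  x'' + alpha_t x' + grad f(x) = 0."""
    if alpha is None:
        alpha = lambda t: 3.0 / (t + 1e-8)   # alpha_t = 3/t (convex case)
    x = np.asarray(x0, dtype=float).copy()
    p = np.zeros_like(x)                      # p(0) = 0
    for k in range(int(T / dt)):
        t = k * dt
        p = (p - dt * grad_f(x)) / (1.0 + dt * alpha(t))  # implicit damping step
        x = x + dt * p
    return x

# Illustrative quadratic f(x) = 0.5 x^T A x (an assumption, not from the slides);
# f is 1-strongly convex, so a constant damping alpha_t = 2*sqrt(1) = 2 is natural.
A = np.diag([1.0, 10.0])
print(accelerated_flow(lambda x: A @ x, x0=[5.0, -5.0], alpha=lambda t: 2.0))
# output is close to the minimizer at the origin
```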

Slide 12

Slide 12 text

Gradient flows

- The gradient flow for $E(\rho)$ in $(\mathcal{P}(\Omega), g_\rho)$ takes the form
$$\partial_t \rho_t = -G(\rho_t)^{-1} \frac{\delta E(\rho_t)}{\delta \rho_t}.$$
- The Fisher-Rao gradient flow is given by
$$\partial_t \rho_t = -\rho_t \Big( \frac{\delta E}{\delta \rho_t} - \int \frac{\delta E}{\delta \rho_t}\, \rho_t\,dx \Big).$$
- The Wasserstein gradient flow writes
$$\partial_t \rho_t = -\big(-\nabla \cdot (\rho_t \nabla)\big)\Big(\frac{\delta E}{\delta \rho_t}\Big) = \nabla \cdot \Big(\rho_t \nabla \frac{\delta E}{\delta \rho_t}\Big).$$

Slide 13

Slide 13 text

Hamiltonian flows in probability space

The Euler-Lagrange equation in the probability space can be formulated as a system in $(\rho_t, \Phi_t)$, i.e.,
$$\begin{cases} \partial_t \rho_t - G(\rho_t)^{-1} \Phi_t = 0, \\ \partial_t \Phi_t + \dfrac{1}{2} \dfrac{\delta}{\delta \rho_t} \Big( \displaystyle\int \Phi_t\, G(\rho_t)^{-1} \Phi_t\,dx \Big) + \dfrac{\delta E}{\delta \rho_t} = 0. \end{cases}$$
These are examples of dynamics in mean-field games, optimal transport, and information geometry.

Slide 14

Slide 14 text

Example: Fisher-Rao Hamiltonian flow

- The Fisher-Rao Hamiltonian flow follows
$$\begin{cases} \partial_t \rho_t - \Phi_t \rho_t + \Big(\displaystyle\int \Phi_t \rho_t\,dx\Big)\rho_t = 0, \\ \partial_t \Phi_t + \dfrac{1}{2}\Phi_t^2 - \Big(\displaystyle\int \rho_t \Phi_t\,dx\Big)\Phi_t + \dfrac{\delta E}{\delta \rho_t} = 0. \end{cases}$$
- The corresponding Hamiltonian is
$$H_F(\rho_t, \Phi_t) = \frac{1}{2}\left( \int \Phi_t^2\, \rho_t\,dx - \Big(\int \rho_t \Phi_t\,dx\Big)^2 \right) + E(\rho_t).$$

Slide 15

Slide 15 text

Example: Wasserstein Hamiltonian flow

- The Wasserstein Hamiltonian flow writes
$$\begin{cases} \partial_t \rho_t + \nabla \cdot (\rho_t \nabla \Phi_t) = 0, \\ \partial_t \Phi_t + \dfrac{1}{2}\|\nabla \Phi_t\|^2 + \dfrac{\delta E}{\delta \rho_t} = 0. \end{cases}$$
- The corresponding Hamiltonian forms
$$H_W(\rho_t, \Phi_t) = \frac{1}{2}\int \|\nabla \Phi_t\|^2\, \rho_t\,dx + E(\rho_t).$$

Slide 16

Slide 16 text

Damped Hamiltonian flows in probability space

Consider
$$\begin{cases} \partial_t \rho_t - G(\rho_t)^{-1} \Phi_t = 0, \\ \partial_t \Phi_t + \alpha_t \Phi_t + \dfrac{1}{2} \dfrac{\delta}{\delta \rho_t}\Big( \displaystyle\int \Phi_t\, G(\rho_t)^{-1} \Phi_t\,dx \Big) + \dfrac{\delta E}{\delta \rho_t} = 0, \end{cases} \tag{AIG}$$
with initial values $\rho_0 = \rho^0$ and $\Phi_0 = 0$.

AIG: Accelerated Information Gradient flows.

Slide 17

Slide 17 text

Fisher-Rao AIG flow

The Fisher-Rao AIG flow writes
$$\begin{cases} \partial_t \rho_t - \Phi_t \rho_t + \Big(\displaystyle\int \Phi_t \rho_t\,dx\Big)\rho_t = 0, \\ \partial_t \Phi_t + \alpha_t \Phi_t + \dfrac{1}{2}\Phi_t^2 - \Big(\displaystyle\int \rho_t \Phi_t\,dx\Big)\Phi_t + \dfrac{\delta E}{\delta \rho_t} = 0. \end{cases} \tag{F-AIG}$$

Slide 18

Slide 18 text

Wasserstein AIG flow

The Wasserstein AIG flow (Carrillo, 2020; Amir, et al. 2019) satisfies
$$\begin{cases} \partial_t \rho_t + \nabla \cdot (\rho_t \nabla \Phi_t) = 0, \\ \partial_t \Phi_t + \alpha_t \Phi_t + \dfrac{1}{2}\|\nabla \Phi_t\|^2 + \dfrac{\delta E}{\delta \rho_t} = 0. \end{cases} \tag{W-AIG}$$

Slide 19

Slide 19 text

Covariance AIG flows

We consider AIG dynamics in Gaussian distributions.

Theorem. Suppose that $\rho^0, \rho^* \in \mathcal{N}_0^n$ and their covariance matrices are $\Sigma^0$ and $\Sigma^*$. $E(\Sigma)$ evaluates the KL divergence from $\rho$ to $\rho^*$. Here $(\Sigma_t, S_t)$ satisfies
$$\begin{cases} \dot{\Sigma}_t - (S_t \Sigma_t + \Sigma_t S_t) = 0, \\ \dot{S}_t + \alpha_t S_t + S_t^2 + 2\nabla_{\Sigma_t} E(\Sigma_t) = 0, \end{cases} \tag{W-AIG-G}$$
with initial values $\Sigma_0 = \Sigma^0$ and $S_0 = 0$. In particular, for any $t \geq 0$, $\Sigma_t$ is well-defined and stays positive definite.
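To make (W-AIG-G) concrete, here is a minimal forward-Euler sketch for zero-mean Gaussians. It assumes the standard Gaussian-KL gradient $\nabla_\Sigma E(\Sigma) = \frac{1}{2}\big(\Sigma^{*-1} - \Sigma^{-1}\big)$ and a constant damping $\alpha_t$; neither choice is specified on the slide.

```python
import numpy as np

def waig_gaussian(Sigma0, Sigma_star, alpha=2.0, dt=1e-3, T=20.0):
    """Forward-Euler sketch of (W-AIG-G) for zero-mean Gaussians:
        dSigma/dt = S Sigma + Sigma S,
        dS/dt     = -alpha S - S^2 - 2 grad_Sigma E(Sigma),
    with E(Sigma) = KL(N(0, Sigma) || N(0, Sigma*)) and
    grad_Sigma E = 0.5 * (inv(Sigma*) - inv(Sigma))  (assumed Gaussian-KL fact)."""
    Sigma = np.array(Sigma0, dtype=float)
    S = np.zeros_like(Sigma)
    inv_star = np.linalg.inv(Sigma_star)
    for _ in range(int(T / dt)):
        grad_E = 0.5 * (inv_star - np.linalg.inv(Sigma))
        dSigma = S @ Sigma + Sigma @ S
        dS = -alpha * S - S @ S - 2.0 * grad_E
        Sigma, S = Sigma + dt * dSigma, S + dt * dS
    return Sigma

Sigma_star = np.diag([2.0, 0.5])
print(waig_gaussian(np.eye(2), Sigma_star))  # approaches Sigma_star
```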

Slide 20

Slide 20 text

Convergence rate of W-AIG flows

Theorem. Suppose that $E(\rho)$ satisfies $\mathrm{Hess}(\lambda)$ for $\lambda > 0$. The solution $\rho_t$ to (W-AIG) with $\alpha_t = 2\sqrt{\lambda}$ satisfies
$$E(\rho_t) \leq C_0 e^{-\sqrt{\lambda}\, t} = O(e^{-\sqrt{\lambda}\, t}).$$
If $E(\rho)$ only satisfies $\mathrm{Hess}(0)$, then the solution $\rho_t$ to (W-AIG) with $\alpha_t = 3/t$ satisfies
$$E(\rho_t) \leq C_0'\, t^{-2} = O(t^{-2}).$$
Here the constants $C_0, C_0'$ only depend on $\rho^0$.

Slide 21

Slide 21 text

Time scaling and convergence results

Convergence rate
- A sufficient condition for $\mathrm{Hess}(\lambda)$ is that $f(x)$ is $\lambda$-strongly convex.
- The W-AIG flow improves the $O(e^{-2\lambda t})$ convergence rate of the gradient flow to $O(e^{-\sqrt{\lambda}\, t})$, especially when $\lambda$ is close to 0.

Convexity estimations
- $\lambda$ from the Wasserstein Hessian operator (Otto, Villani, Lott, Sturm, et al.) = weak form of Bakry-Emery Gamma calculus.
- Modified Fisher-Rao in optimal transport space = weak form of Gamma calculus (Li, 2018).
- Modified Fisher-Rao in generalized optimal transport space = Information Gamma calculus (Feng, Li, 2019-2022).

Generalized tensors from (Arnold, Carlen, Ju, Erb, Baudoin, Garofalo).

Slide 22

Slide 22 text

W-AIG flows at the particle level

Proposition. Suppose that $X_t \sim \rho_t$ and $V_t = \nabla \Phi_t(X_t)$ are the position and the velocity of a particle at time $t$. Then the differential equation of the particle system corresponding to (W-AIG) writes
$$\begin{cases} dX_t = V_t\,dt, \\ dV_t = -\alpha_t V_t\,dt - \nabla f(X_t)\,dt - \nabla \log \rho_t(X_t)\,dt. \end{cases}$$

Slide 23

Slide 23 text

Discrete-time algorithm for W-AIG flow

For $i = 1, 2, \ldots, N$, the update rule follows
$$\begin{cases} V_{k+1}^i = \alpha_k V_k^i - \sqrt{\tau}\big(\nabla f(X_k^i) + \xi_k(X_k^i)\big), \\ X_{k+1}^i = X_k^i + \sqrt{\tau}\, V_{k+1}^i. \end{cases}$$
If $E(\rho)$ is $\mathrm{Hess}(\lambda)$, then $\alpha_k = \frac{1 - \sqrt{\lambda\tau}}{1 + \sqrt{\lambda\tau}}$; if $E(\rho)$ is $\mathrm{Hess}(0)$ or $\lambda$ is unknown, then $\alpha_k = \frac{k-1}{k+2}$. Here $\xi_k(x)$ is an approximation of $\nabla \log \rho_t(x)$.

Difficulties
- The approximation of $\nabla \log \rho_t(x)$.
- The stiffness of AIG flows.

Slide 24

Slide 24 text

Learning density from Brownian motion

- Given the bandwidth $h$, $\{X_k^i\}_{i=1}^N$ and a step size $s$, we can compute two particle systems
$$Y_k^i(h) = X_k^i - s\,\xi_k(X_k^i; h), \qquad Z_k^i = X_k^i + \sqrt{2s}\, B^i, \qquad i = 1, \ldots, N,$$
where $B^i$ is the standard Brownian motion.
- Minimize $\mathrm{MMD}(\hat\rho_Y, \hat\rho_Z)$, the maximum mean discrepancy between the empirical distributions $\hat\rho_Y(x) = \frac{1}{N}\sum_{i=1}^N \delta(x - Y_k^i(h))$ and $\hat\rho_Z(x) = \frac{1}{N}\sum_{i=1}^N \delta(x - Z_k^i)$.
- Optimize over $h$, using the bandwidth $h_{k-1}$ from the last iteration as the initialization (see the sketch below).
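Below is a rough sketch of this bandwidth selection, assuming a Gaussian kernel density estimator for the score $\xi_k(\cdot\,; h)$, a Gaussian-kernel MMD, and a small grid search around the previous bandwidth; the slide does not fix any of these choices.

```python
import numpy as np

def kde_score(X, query, h):
    """Score xi(x; h) = grad log rho_hat(x) from a Gaussian KDE with bandwidth h."""
    diff = query[:, None, :] - X[None, :, :]           # (M, N, d)
    w = np.exp(-np.sum(diff**2, axis=-1) / (2.0 * h))  # (M, N)
    w /= w.sum(axis=1, keepdims=True)
    return -np.einsum('mn,mnd->md', w, diff) / h       # (M, d)

def mmd2(Y, Z, sigma=1.0):
    """Squared maximum mean discrepancy with a Gaussian kernel."""
    def k(A, B):
        d2 = np.sum((A[:, None, :] - B[None, :, :])**2, axis=-1)
        return np.exp(-d2 / (2.0 * sigma**2))
    return k(Y, Y).mean() + k(Z, Z).mean() - 2.0 * k(Y, Z).mean()

def bm_bandwidth(h_prev, X, s, rng, n_grid=7):
    """Select the bandwidth by matching a score step against a Brownian step:
       Y_i(h) = X_i - s * xi(X_i; h),   Z_i = X_i + sqrt(2 s) * B_i,
    then pick h near the previous one minimizing MMD(rho_Y, rho_Z)."""
    Z = X + np.sqrt(2.0 * s) * rng.standard_normal(X.shape)
    candidates = h_prev * np.logspace(-0.5, 0.5, n_grid)
    losses = [mmd2(X - s * kde_score(X, X, h), Z) for h in candidates]
    return candidates[int(np.argmin(losses))]

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
print(bm_bandwidth(1.0, X, s=0.01, rng=rng))
```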

Slide 25

Slide 25 text

Adaptive restart in density space

- The second equation in (W-AIG) is a Hamilton-Jacobi equation, which usually exhibits strong stiffness. We propose an adaptive restart technique to deal with this problem. Consider
$$\varphi_k = -\sum_{i=1}^N \big\langle V_{k+1}^i,\ \nabla f(X_k^i) + \xi_k(X_k^i) \big\rangle,$$
which can be viewed as a discrete-time approximation of $-\partial_t E(\rho_t)$.
- If $\varphi_k < 0$, then we restart the algorithm with initial values $X_0^i = X_k^i$ and $V_0^i = 0$.
- This essentially keeps $\partial_t E(\rho_t)$ negative along the trajectory.
- The numerical results show that the adaptive restart accelerates and stabilizes the discrete-time algorithm.

Slide 26

Slide 26 text

AIG Algorithm

Algorithm 1: Particle implementation of Wasserstein AIG flow
Require: initial positions $\{X_0^i\}_{i=1}^N$, step size $\tau$, number of iterations $L$.
1: Set $k = 0$, $V_0^i = 0$, $i = 1, \ldots, N$. Set the bandwidth $h_0$ by MED.
2: for $l = 1, 2, \ldots, L$ do
3:   Compute $h_l$ based on BM: $h_l = \mathrm{BM}(h_{l-1}, \{X_k^i\}_{i=1}^N, \sqrt{\tau})$.
4:   Calculate $\xi_k(X_k^i)$ with the bandwidth $h_l$.
5:   Set $\alpha_k$ based on whether $E(\rho)$ is $\mathrm{Hess}(\lambda)$ or $\mathrm{Hess}(0)$. Update
     $V_{k+1}^i = \alpha_k V_k^i - \sqrt{\tau}\big(\nabla f(X_k^i) + \xi_k(X_k^i)\big)$, $X_{k+1}^i = X_k^i + \sqrt{\tau}\, V_{k+1}^i$.
6:   if RESTART then
7:     Compute $\varphi_k = -\sum_{i=1}^N \langle V_{k+1}^i, \nabla f(X_k^i) + \xi_k(X_k^i)\rangle$.
8:     If $\varphi_k < 0$, set $X_0^i = X_k^i$, $V_0^i = 0$ and $k = 0$; otherwise set $k = k + 1$.
9:   else
10:    Set $k = k + 1$.
11:  end if
12: end for
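A self-contained Python sketch of Algorithm 1 follows. For simplicity it replaces the MED/BM bandwidth selection with a fixed kernel bandwidth and assumes a Gaussian KDE score estimator; the target, step size, and particle numbers are illustrative assumptions rather than the settings used on the slides.

```python
import numpy as np

def kde_score(X, h):
    """Approximate xi(x) = grad log rho(x) at the particles with a Gaussian KDE
    of bandwidth h (one possible estimator; the slides leave this choice open)."""
    diff = X[:, None, :] - X[None, :, :]                # (N, N, d)
    w = np.exp(-np.sum(diff**2, axis=-1) / (2.0 * h))   # (N, N)
    w /= w.sum(axis=1, keepdims=True)
    return -np.einsum('ij,ijd->id', w, diff) / h

def particle_waig(grad_f, X0, tau=1e-3, L=2000, lam=None, h=0.1, restart=True):
    """Sketch of Algorithm 1: particle Wasserstein AIG flow with adaptive restart."""
    X = np.array(X0, dtype=float)
    V = np.zeros_like(X)
    k = 0
    sqrt_tau = np.sqrt(tau)
    for _ in range(L):
        xi = kde_score(X, h)
        if lam is not None:                      # E is Hess(lambda)
            alpha_k = (1.0 - np.sqrt(lam * tau)) / (1.0 + np.sqrt(lam * tau))
        else:                                    # Hess(0) or lambda unknown
            alpha_k = (k - 1.0) / (k + 2.0)
        drift = grad_f(X) + xi
        V = alpha_k * V - sqrt_tau * drift
        X = X + sqrt_tau * V
        if restart:
            phi_k = -np.sum(V * drift)
            if phi_k < 0:                        # energy no longer decreasing: restart
                V = np.zeros_like(X)
                k = 0
            else:
                k += 1
        else:
            k += 1
    return X

# Illustrative target pi proportional to exp(-|x|^2/2), i.e. f(x) = |x|^2/2 (an assumption).
rng = np.random.default_rng(1)
X0 = rng.standard_normal((300, 2)) * 3.0 + 2.0
samples = particle_waig(lambda X: X, X0, lam=1.0)
print(samples.mean(axis=0), samples.std(axis=0))  # roughly 0 and 1 (KDE bias aside)
```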

Slide 27

Slide 27 text

Gaussian examples: ODE level

- The initial value is set to be $\Sigma^0 = I$.
- The following figure presents the convergence of the KL divergence on two target distributions with small/large L.

Figure: The acceleration of the W-AIG flow and the effect of adaptive restart (ODE level). The target density is a Gaussian distribution with zero mean on $\mathbb{R}^{100}$; the two panels correspond to the two target distributions.

Slide 28

Slide 28 text

Gaussian examples: Particle level

- The initial distribution of samples follows $N(0, I)$ and the number of samples is $N = 600$.
- For a particle system $\{X_k^i\}_{i=1}^N$, we record the KL divergence $E(\hat\Sigma_k)$ using the empirical covariance matrix $\hat\Sigma_k$.

Figure: The acceleration of the W-AIG flow and the effect of adaptive restart (particle level). The setting of the target density is identical to the previous figure.

Slide 29

Slide 29 text

Bayesian logistic regression

Figure: Comparison of different methods on Bayesian logistic regression, averaged over 10 independent trials. The shaded areas show the variance over the 10 trials. Top: BM; Bottom: MED. Left: Test accuracy; Right: Test log-likelihood.

Slide 30

Slide 30 text

Projected Wasserstein gradient algorithms

Mean field games / optimal transport + optimization + UQ methods

Slide 31

Slide 31 text

PWGD for COVID-19 dynamics

Slide 32

Slide 32 text

Neural network PWGD for PDE Bayesian inference

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

Future works

- Complex systems
- Bayesian sampling
- Time discretization
- Spatial discretization
- Entropy-entropy flux metric
- Optimization