
Transport Information Geometric Computations Year 1

Wuchen Li
August 20, 2023

Transcript

  1. Transport Information Geometric Computations. Wuchen Li, University of South Carolina, Columbia. AFOSR Comp Math review meeting, Aug 10, 2023. Supported by AFOSR YIP award No. FA9550-23-1-0087, starting from 05/01/2023.
  2. Current Collaborators. UCLA: S. Osher; Osher's group: S.T. Liu, S. Liu, M. Zhou, X. Zuo, T. Meng, Y. Liu, Y. Hong (visiting student); Q. Gu (Computer Science); U of SC: L. Lu, W. Dahmen, H. Wang, Z. Wang; UH/Princeton: Z. Han, V. Poor; Brown/ND: C. Shu, G. Fu; U Mass-Amherst: M. Katsoulakis' group, B. Zhang; Caltech: A. Stuart, F. Hoffmann, Y. Chen, et al.; UCI: T. Georgiou; UMN: L. Wang, W. Lee; Stanford: M. Pilanci, Y. Wang; Gatech: P. Chen; FSU/U-M: Q. Feng, B. Erhan; Duke/Purdue: J. Liu, Y. Gao; ...
  3. Information Geometry. The information matrix (a.k.a. Fisher information matrix, Fisher-Rao metric) plays important roles in estimation, information science, statistics, and machine learning. Statistics and information science: likelihood principle, Cramér-Rao bound, etc. Statistical physics: non-equilibrium thermodynamics (Sosuke Ito et al.). Population games via replicator dynamics (Shahshahani, Smith), with applications in biology, economics, cancer, epidemic propagation, etc. AI inference problems: f-divergence, Amari α-divergence, etc. Information-geometric optimization methods (KFAC, ADAM), with applications in AI: natural gradient (Amari); ADAM (Kingma et al., 2014); stochastic relaxation (Malagò); and many more in the book Information Geometry (Ay et al.).
  4. Learning: objective function. Given a data measure $p_{\mathrm{data}}(x) = \frac{1}{N}\sum_{i=1}^N \delta_{X_i}(x)$ and a parameterized model $p(x,\theta)$, most problems in AI or sampling take the form
    $\min_{p_\theta \in p(\Theta)} D_{\mathrm{KL}}(p_{\mathrm{data}}, p_\theta).$
    A typical choice is the Kullback-Leibler divergence (relative entropy; Gibbs free energy)
    $D_{\mathrm{KL}}(p_{\mathrm{data}}, p_\theta) = \int_\Omega p_{\mathrm{data}}(x)\,\log\frac{p_{\mathrm{data}}(x)}{p(x,\theta)}\,dx.$
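A minimal NumPy sketch (not from the deck) of this objective for a hypothetical one-dimensional Gaussian model $p(x;\mu,\sigma)$: up to the $\theta$-independent entropy of $p_{\mathrm{data}}$, minimizing $D_{\mathrm{KL}}(p_{\mathrm{data}}, p_\theta)$ over an empirical measure is the same as minimizing the negative average log-likelihood of the samples. The model choice and sample sizes below are illustrative.

```python
import numpy as np

def kl_objective(theta, samples):
    """Negative average log-likelihood of a 1D Gaussian model p(x; mu, sigma).

    For p_data = (1/N) sum_i delta_{X_i}, D_KL(p_data, p_theta) equals this
    quantity plus a theta-independent constant (the negative entropy of p_data),
    so both have the same minimizers in theta.
    """
    mu, sigma = theta
    log_p = -0.5 * ((samples - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
    return -np.mean(log_p)

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=2.0, size=5000)   # synthetic stand-in for p_data
print(kl_objective((1.0, 2.0), data), kl_objective((0.0, 1.0), data))
```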
  5. Learning: gradient direction. The natural gradient method refers to
    $\theta^{k+1} = \theta^k - h\,G_F(\theta^k)^{-1}\nabla_\theta D_{\mathrm{KL}}(p_{\mathrm{data}}, p_{\theta^k}),$
    where $h>0$ is a stepsize and $G_F(\theta)\in\mathbb{R}^{|\Theta|\times|\Theta|}$, with
    $G_F(\theta)_{ij} = \mathbb{E}_{X\sim p_\theta}\big[\nabla_{\theta_i}\log p(X,\theta)\cdot\nabla_{\theta_j}\log p(X,\theta)\big],$
    being the Fisher information matrix. Why the natural gradient and the Fisher information matrix? Parameterization invariance; preconditioning for KL-divergence-related learning problems; Fisher efficiency; online Cramér-Rao bound.
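A minimal sketch (not from the deck) of this update for the same hypothetical 1D Gaussian model: the Fisher matrix $G_F(\theta)$ is estimated by Monte Carlo from model samples, and the KL gradient from the data; the step size, sample counts, and function names are illustrative.

```python
import numpy as np

def score(theta, x):
    """Rows of ∇_θ log p(x; μ, σ) for a 1D Gaussian, shape (len(x), 2)."""
    mu, sigma = theta
    z = (x - mu) / sigma
    return np.stack([z / sigma, (z ** 2 - 1.0) / sigma], axis=1)

def natural_gradient_step(theta, data, h=0.1, n_model_samples=4000, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    # Euclidean gradient of D_KL(p_data, p_θ) = -E_{p_data}[∇_θ log p(X; θ)]
    grad = -score(theta, data).mean(axis=0)
    # Monte Carlo Fisher matrix G_F(θ) = E_{X~p_θ}[∇_θ log p ∇_θ log p^T]
    xs = rng.normal(theta[0], theta[1], size=n_model_samples)
    s = score(theta, xs)
    G = s.T @ s / n_model_samples
    # Natural gradient update: θ ← θ - h G_F(θ)^{-1} ∇_θ D_KL
    return np.asarray(theta) - h * np.linalg.solve(G, grad)

rng = np.random.default_rng(1)
data = rng.normal(2.0, 1.5, size=5000)
theta = np.array([0.0, 1.0])
for _ in range(50):
    theta = natural_gradient_step(theta, data, rng=rng)
print(theta)   # should approach (2.0, 1.5)
```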
  6. Optimal transport. In recent years, optimal transport (a.k.a. Earth mover's distance, Monge-Kantorovich problem, Wasserstein distance) has witnessed a lot of applications: theory (Brenier, Gangbo, Villani, Figalli, et al.); gradient flows (Otto, Villani, Maas, Mielke, et al.); proximal methods (Jordan, Kinderlehrer, Otto, et al.); physics: GENERICs/pre-GENERICs (Grmela and Öttinger, Duong, et al.), flux-gradient flows (Li, Liu, Osher); mean field games (Lasry, Lions, et al.); population games and evolutionary game theory (Degond, Liu, Li, et al.); image processing (Rubner et al. 2000, Tannenbaum, Georgiou, et al.); inverse problems (Stuart, Li, Osher, et al.); scientific computing (Benamou, Brenier, Carrillo, Oberman, Peyré, Osher, et al.); AI and machine learning: see NIPS, ICLR, ICML 2015-2023; minimal surfaces in Wasserstein spaces (Li, Georgiou); Wasserstein common noises (Li).
  7. Optimal transport density manifold. Optimal transport has an optimal control reformulation via the dual of the dual of linear programming:
    $D_W(p^0,p^1)^2 = \inf_{p_t}\Big\{\int_0^1 g_W(\partial_t p_t,\partial_t p_t)\,dt = \int_0^1\!\!\int_\Omega (\nabla\Phi_t,\nabla\Phi_t)\,p_t\,dx\,dt\Big\},$
    where the infimum runs over all vector fields $v_t = \nabla\Phi_t$, under the constraint of the continuity equation
    $\partial_t p_t + \nabla\cdot(p_t\nabla\Phi_t) = 0,\quad p_0 = p^0,\quad p_1 = p^1.$
    Here $(\mathcal{P}(\Omega), g_W)$ forms an infinite-dimensional Riemannian manifold (John D. Lafferty: The density manifold and configuration space quantization, 1988).
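For intuition (not from the deck): in one dimension this variational distance has the well-known closed form $D_W(p^0,p^1)^2 = \int_0^1 |F_0^{-1}(s) - F_1^{-1}(s)|^2\,ds$ in terms of quantile functions, which a short NumPy sketch can approximate from samples. The distributions and sample sizes below are illustrative.

```python
import numpy as np

def w2_distance_1d(x0, x1, n_quantiles=1000):
    """Approximate the 1D Wasserstein-2 distance between two sample sets.

    Uses D_W(p0, p1)^2 = ∫_0^1 |F0^{-1}(s) - F1^{-1}(s)|^2 ds on a uniform
    grid of quantile levels.
    """
    s = (np.arange(n_quantiles) + 0.5) / n_quantiles
    q0 = np.quantile(x0, s)
    q1 = np.quantile(x1, s)
    return np.sqrt(np.mean((q0 - q1) ** 2))

rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 1.0, size=20000)
x1 = rng.normal(3.0, 2.0, size=20000)
# For 1D Gaussians, W2 = sqrt((m0-m1)^2 + (s0-s1)^2) = sqrt(10) ≈ 3.162
print(w2_distance_1d(x0, x1))
```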
  8. Learning Wasserstein-type gradient directions. Goal: we work on optimal-transport-induced geometries (information matrices) and dynamics in probability models. They are used to design mathematically safe (fast, efficient/scalable, accurate) algorithms. Literature review: Wasserstein gradient flow: Otto (2001); natural gradient works efficiently in learning: Amari (1998); constrained gradient flows in Wasserstein spaces: Carlen, Gangbo (2003).
  9. TIG computations of complex systems. Goals: Fast: convexities in nonlinear optimization in probability spaces; Accurate: approximation in nonlinear probability models; Scalable/Efficient: computation of nonlinear PDEs and controls in complex systems.
  10. TIG directions. TIG geometric calculus: continuous domain (Lafferty, Lott, Li), discrete domain/complex networks (Li); TIG convexity analysis: generalized Ricci curvatures in interacting stochastic systems: continuous domain (Li, Feng, Erhan), discrete domain/complex networks (Li, Lu); TIG statistical formulations (Li, Amari, Zhao, Matsuda, Fabio, et al.); TIG AI scientific computing (Li, Osher, Liu, Lee, Hong, Wang, et al.); TIG optimization methods (Li, Osher, Stuart, Franca, Lin, Wang, Pilanci, Chen, Gu, et al.); TIG high-order structure-preserving computations (Li, Osher, Fu, et al.); TIG modeling in social systems, pandemics, engineering, and more (Li, Osher, Liu, Lee, Han, Gao, Liu, et al.); TIG inverse problems of complex systems (Li, Osher, S.T. Liu, S. Liu, X. Zhe, M. Zhou, et al.); TIG quantum mechanics (free probability) and computing (Li, Becker, Jekel, Shlyakhtenko).
  11. Part I: Transport information statistical formulations. Main results: Li, Transport information geometry: Riemannian calculus on probability simplex, 2018; Li, Zhao, Wasserstein information matrix, 2019; Li, Rubio, On a prior based on the Wasserstein information matrix, 2022; Amari, Matsuda, Wasserstein statistics in 1D location-scale model, March 2020; Amari, Matsuda, Information Geometry of Wasserstein Statistics on Shapes and Affine Deformations, 2023. Review paper: G. Khan, J. Zhang, When Optimal Transport Meets Information Geometry, 2022.
  12. Pull-back = computational schemes. Probability models: finite volume; finite element; kernel methods; Gaussian; exponential family; mixed Gaussian; generative models by deep neural network functions: neural ODE (explicit); generative adversarial networks (implicit).
  13. Information matrix. Definition (Statistical information matrix). Consider the density manifold $(\mathcal{P}(\mathcal{X}), g)$ with a metric tensor $g$, and a smoothly parameterized statistical model $p_\theta$ with parameter $\theta\in\Theta\subset\mathbb{R}^d$. Then the pull-back $G$ of $g$ onto the parameter space $\Theta$ is given by
    $G(\theta) = \big\langle \nabla_\theta p_\theta,\; g(p_\theta)\,\nabla_\theta p_\theta \big\rangle.$
    Denoting $G(\theta) = (G(\theta)_{ij})_{1\le i,j\le d}$, then
    $G(\theta)_{ij} = \int_{\mathcal{X}} \frac{\partial}{\partial\theta_i}p(x;\theta)\,\Big(g(p_\theta)\,\frac{\partial}{\partial\theta_j}p\Big)(x;\theta)\,dx.$
    Here we name $g$ the statistical metric and call $G$ the statistical information matrix.
  14. Score. Definition (Score function). Denote $\Phi_i: \mathcal{X}\times\Theta \to \mathbb{R}$, $i = 1,\dots,n$, satisfying
    $\Phi_i(x;\theta) = \Big(g(p)\,\frac{\partial}{\partial\theta_i}p\Big)(x;\theta).$
    They are the score functions associated with the statistical information matrix $G$ and are equivalence classes in $C(\mathcal{X})/\mathbb{R}$. The representatives of the equivalence classes are determined by the normalization condition
    $\mathbb{E}_{p_\theta}\,\Phi_i = 0,\quad i = 1,\dots,n.$
    Then the statistical information matrix satisfies
    $G(\theta)_{ij} = \int_{\mathcal{X}} \Phi_i(x;\theta)\,\big(g(p_\theta)^{-1}\Phi_j\big)(x;\theta)\,dx.$
  15. Examples: Fisher information matrix. Consider $g(p)^{-1} = p$. Then
    $G_F(\theta)_{ij} = \mathbb{E}_{X\sim p_\theta}\big[\Phi_i(X;\theta)\,\Phi_j(X;\theta)\big],$
    where $\Phi_k = \frac{1}{p_\theta}\nabla_{\theta_k}p_\theta$, $k\in\{i,j\}$. Using the fact $\frac{1}{p}\nabla_{\theta_k}p = \nabla_{\theta_k}\log p$, we obtain
    $G_F(\theta)_{ij} = \mathbb{E}_{X\sim p_\theta}\Big[\frac{\partial}{\partial\theta_i}\log p(X;\theta)\,\frac{\partial}{\partial\theta_j}\log p(X;\theta)\Big].$
    In the literature, $G_F(\theta)$ is known as the Fisher information matrix and $\nabla_\theta\log p(X;\theta)$ is named the score function.
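A quick numerical check (not from the deck) of this formula for the 1D Gaussian family $p(x;\mu,\sigma)$: a Monte Carlo estimate of $G_F$ should match the closed form $\mathrm{diag}(1/\sigma^2,\,2/\sigma^2)$ listed in the table on slide 22. Sample size and parameters are illustrative.

```python
import numpy as np

def fisher_matrix_gaussian(mu, sigma, n=200000, rng=None):
    """Monte Carlo estimate of G_F(θ)_ij = E_{X~p_θ}[∂_i log p · ∂_j log p], 1D Gaussian."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = rng.normal(mu, sigma, size=n)
    z = (x - mu) / sigma
    s = np.stack([z / sigma, (z ** 2 - 1.0) / sigma], axis=1)  # scores w.r.t. (μ, σ)
    return s.T @ s / n

print(fisher_matrix_gaussian(0.0, 2.0))   # ≈ [[1/σ², 0], [0, 2/σ²]] = [[0.25, 0], [0, 0.5]]
```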
  16. Examples: Wasserstein information matrix. Consider $g(p)^{-1} = -\nabla\cdot(p\nabla\,\cdot\,)$. Then
    $G_W(\theta)_{ij} = \mathbb{E}_{X\sim p_\theta}\big[\big(\nabla_x\Phi^W_i(X;\theta),\,\nabla_x\Phi^W_j(X;\theta)\big)\big],$
    where $-\nabla_x\cdot\big(p_\theta\,\nabla_x\Phi^W_k\big) = \nabla_{\theta_k}p_\theta$, $k\in\{i,j\}$. Here we call $G_W(\theta)$ the Wasserstein information matrix and name $\Phi^W$ the Wasserstein score function.
  17. Distance/divergence and information matrix. Specifically, given a smooth family of probability densities $p(x;\theta)$ and a perturbation $\delta\theta\in T_\theta\Theta$, consider the following Taylor expansions in terms of $\delta\theta$:
    $D_{\mathrm{KL}}\big(p(\theta)\,\|\,p(\theta+\delta\theta)\big) = \tfrac{1}{2}\,\delta\theta^{T}G_F(\theta)\,\delta\theta + o\big((\delta\theta)^2\big),$
    and
    $D_W\big(p(\theta+\delta\theta),\,p(\theta)\big)^2 = \delta\theta^{T}G_W(\theta)\,\delta\theta + o\big((\delta\theta)^2\big).$
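A numerical illustration (not from the deck) of both expansions for the 1D Gaussian family, using the closed-form Gaussian divergences and the entries $G_F(\mu,\sigma)=\mathrm{diag}(1/\sigma^2, 2/\sigma^2)$, $G_W(\mu,\sigma)=I$ from the table on slide 22; the perturbation size is illustrative.

```python
import numpy as np

def kl_gauss(m1, s1, m2, s2):
    """Closed-form KL divergence between 1D Gaussians N(m1, s1^2) and N(m2, s2^2)."""
    return np.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5

def w2_gauss_sq(m1, s1, m2, s2):
    """Closed-form squared Wasserstein-2 distance between 1D Gaussians."""
    return (m1 - m2) ** 2 + (s1 - s2) ** 2

mu, sigma = 1.0, 2.0
d = np.array([1e-3, -2e-3])                  # perturbation δθ = (δμ, δσ)
GF = np.diag([1 / sigma ** 2, 2 / sigma ** 2])
GW = np.eye(2)

# Each pair should agree up to o(|δθ|^2).
print(kl_gauss(mu, sigma, mu + d[0], sigma + d[1]), 0.5 * d @ GF @ d)
print(w2_gauss_sq(mu + d[0], sigma + d[1], mu, sigma), d @ GW @ d)
```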
  18. PDEs connecting Wasserstein and Fisher scores. The Wasserstein score functions $\Phi^W_i(x;\theta)$ satisfy the following Poisson equation:
    $\nabla_x\log p(x;\theta)\cdot\nabla_x\Phi^W_i(x;\theta) + \Delta_x\Phi^W_i(x;\theta) = -\frac{\partial}{\partial\theta_i}\log p(x;\theta).$
  19. One-dimensional sample space. If $\mathcal{X}\subset\mathbb{R}^1$, the Wasserstein score functions satisfy
    $\Phi^W_i(x;\theta) = -\int^x \frac{1}{p(z;\theta)}\,\frac{\partial}{\partial\theta_i}F(z;\theta)\,dz,$
    where $F(x;\theta) = \int^x p(y;\theta)\,dy$ is the cumulative distribution function. And the Wasserstein information matrix satisfies
    $G_W(\theta)_{ij} = \mathbb{E}_{X\sim p_\theta}\bigg[\frac{\frac{\partial}{\partial\theta_i}F(X;\theta)\,\frac{\partial}{\partial\theta_j}F(X;\theta)}{p(X;\theta)^2}\bigg].$
    The Wasserstein score function is the average of the cumulative Fisher score function, and the Wasserstein information matrix is the covariance of the density average of the cumulative Fisher score function. (Chen, Li, Wasserstein natural gradient in continuous sample space, 2018.)
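A small check (not from the deck) of this 1D formula for the Gaussian family $p(x;\mu,\sigma)$: computing $\mathbb{E}[\partial_i F\,\partial_j F / p^2]$ by quadrature, with finite-difference derivatives of the CDF, should return the identity matrix, as in the table on slide 22. Grid bounds and step sizes are illustrative.

```python
import numpy as np
from scipy.stats import norm

def wasserstein_info_matrix_gaussian(mu, sigma, eps=1e-5):
    """Quadrature estimate of G_W(θ)_ij = E_{p_θ}[∂_i F ∂_j F / p^2] for a 1D Gaussian."""
    x = np.linspace(mu - 8 * sigma, mu + 8 * sigma, 20001)
    p = norm.pdf(x, mu, sigma)
    # Finite-difference derivatives of the CDF F(x; μ, σ) in μ and σ.
    dF_dmu = (norm.cdf(x, mu + eps, sigma) - norm.cdf(x, mu - eps, sigma)) / (2 * eps)
    dF_dsigma = (norm.cdf(x, mu, sigma + eps) - norm.cdf(x, mu, sigma - eps)) / (2 * eps)
    dF = np.stack([dF_dmu, dF_dsigma])
    # G_W_ij = ∫ (∂_i F ∂_j F / p^2) p dx = ∫ ∂_i F ∂_j F / p dx
    return np.trapz(dF[:, None, :] * dF[None, :, :] / p, x, axis=-1)

print(wasserstein_info_matrix_gaussian(0.0, 1.5))   # ≈ identity matrix
```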
  20. Separability. If $p(x;\theta)$ is an independence model, i.e.
    $p(x,\theta) = \prod_{k=1}^n p_k(x_k;\theta),\quad x = (x_1,\cdots,x_n),$
    then there exists a set of one-dimensional functions $\Phi^{W,k}: \mathcal{X}_k\times\Theta_k\to\mathbb{R}$ such that
    $\Phi^W(x;\theta) = \sum_{k=1}^n \Phi^{W,k}(x_k;\theta).$
    In addition, the Wasserstein information matrix is separable:
    $G_W(\theta) = \sum_{k=1}^n G^k_W(\theta),\quad\text{where}\quad G^k_W(\theta) = \mathbb{E}_{X\sim p_\theta}\big[\big(\nabla_{x_k}\Phi^{W,k}(X;\theta),\,\nabla_{x_k}\Phi^{W,k}(X;\theta)\big)\big].$
  21. Analytic examples. Consider a one-dimensional generative model $p(\cdot;\theta) = f_{\theta\#}p_0(\cdot)$, with $p_0$ a given distribution. Then
    $G_W(\theta) = \int |\nabla_\theta f_\theta(x)|^2\,p_0(x)\,dx.$
    For a single ReLU function
    $f_\theta(x) = \sigma(x-\theta) = \begin{cases} 0, & x\le\theta,\\ x-\theta, & x>\theta,\end{cases}$
    we obtain $G_W(\theta) = F_0(\theta)$, with $F_0$ the cumulative distribution function of $p_0$. Figure: two examples of the push-forward family described above, with parameters $\theta_1 = 3$, $\theta_2 = 5$.
  22. Statistical information matrix. Wasserstein and Fisher information matrices for various probability families:
    Uniform, $p(x;a,b) = \frac{1}{b-a}\mathbf{1}_{(a,b)}(x)$: $G_W(a,b) = \frac{1}{3}\begin{pmatrix} 1 & \frac12 \\ \frac12 & 1\end{pmatrix}$; $G_F(a,b)$ not well-defined.
    Gaussian, $p(x;\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$: $G_W(\mu,\sigma) = \begin{pmatrix} 1 & 0\\ 0 & 1\end{pmatrix}$; $G_F(\mu,\sigma) = \begin{pmatrix} \frac{1}{\sigma^2} & 0\\ 0 & \frac{2}{\sigma^2}\end{pmatrix}$.
    Exponential, $p(x;m,\lambda) = \lambda e^{-\lambda(x-m)}$: $G_W(m,\lambda) = \begin{pmatrix} 1 & 0\\ 0 & \frac{2}{\lambda^4}\end{pmatrix}$; $G_F(m,\lambda)$ not well-defined.
    Laplacian, $p(x;m,\lambda) = \frac{\lambda}{2}e^{-\lambda|x-m|}$: $G_W(m,\lambda) = \begin{pmatrix} 1 & 0\\ 0 & \frac{2}{\lambda^4}\end{pmatrix}$; $G_F(m,\lambda) = \begin{pmatrix} \lambda^2 & 0\\ 0 & \frac{1}{\lambda^2}\end{pmatrix}$.
    Location-scale, $p(x;m,\sigma) = \frac{1}{\sigma}\,p\big(\frac{x-m}{\sigma}\big)$: $G_W(\sigma,m) = \begin{pmatrix} \frac{\mathbb{E}_{\sigma,m}x^2 - 2m\,\mathbb{E}_{\sigma,m}x + m^2}{\sigma^2} & 0\\ 0 & 1\end{pmatrix}$; $G_F(\sigma,m) = \begin{pmatrix} \frac{1}{\sigma^2}\Big(1 + \int_{\mathbb{R}}\big(\frac{(x-m)^2 p'^2}{\sigma^2 p} + \frac{(x-m)p'}{\sigma}\big)dx\Big) & \int_{\mathbb{R}}\frac{(x-m)p'^2}{\sigma^3 p}\,dx\\ \int_{\mathbb{R}}\frac{(x-m)p'^2}{\sigma^3 p}\,dx & \frac{1}{\sigma^2}\int_{\mathbb{R}}\frac{p'^2}{p}\,dx\end{pmatrix}$.
    Independent, $p(x,y;\theta) = p(x;\theta)\,p(y;\theta)$: $G_W(x,y;\theta) = G^1_W(x;\theta) + G^2_W(y;\theta)$; $G_F(x,y;\theta) = G^1_F(x;\theta) + G^2_F(y;\theta)$.
    ReLU push-forward, $p(x;\theta) = f_{\theta\#}p(x)$, $f_\theta$ $\theta$-parameterized ReLUs: $G_W(\theta) = F(\theta)$, with $F$ the CDF of $p(x)$; $G_F(\theta)$ not well-defined.
  23. Part II: Transport information proximal methods in generative AI. Lin, Li, Osher, et al., Wasserstein proximal of GANs, GSI, 2018; Li, Lin, et al., Affine proximal learning, GSI, 2019; Arbel, Gretton, Li, et al., Kernelized Wasserstein Natural Gradient, ICLR, spotlight, 2020; Liu, Li, et al., Neural parametric Fokker-Planck equations, SIAM Journal on Numerical Analysis, 2020.
  24. Wasserstein natural gradient. Given a loss function $\mathcal{F}:\mathcal{P}(\Omega)\to\mathbb{R}$ and a probability model $p(\cdot,\theta)$, the associated gradient flow on a Riemannian manifold is defined by
    $\frac{d\theta}{dt} = -\nabla_g\mathcal{F}(p(\cdot,\theta)).$
    Here $\nabla_g$ is the Riemannian gradient operator satisfying
    $g_\theta\big(\nabla_g\mathcal{F}(p(\cdot,\theta)),\,\xi\big) = \nabla_\theta\mathcal{F}(p(\cdot,\theta))\cdot\xi$
    for any tangent vector $\xi\in T_\theta\Theta$, where $\nabla_\theta$ represents the differential operator (Euclidean gradient).
  25. Proximal method on a probability space. Denote $G(p)^{-1} = -\Delta_p$, with the weighted elliptic operator $\Delta_p = \nabla\cdot(p\nabla\,\cdot\,)$. Consider the gradient flow
    $\dot p = -G(p)^{-1}\delta_p\mathcal{F}(p) = \nabla\cdot\big(p\nabla\delta_p\mathcal{F}(p)\big).$
    The proximal method refers to
    $p^{k+1} = \arg\min_{p\in\mathcal{P}}\ \mathcal{F}(p) + \frac{1}{2h}\,d_G(p, p^k)^2,$
    where $h>0$ is the step size and $d_G(x,y)$ is the distance function induced by $G$, known as the transport distance. This is known as the variational backward Euler method
    $p^{k+1} = p^k + h\,\nabla\cdot\big(p^{k+1}\nabla\delta_p\mathcal{F}(p^{k+1})\big) + o(h).$
    A linearization of these proximal methods has been formulated in (Li, Lu, Wang, 2019).
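For intuition (not from the deck): with $\mathcal{F}(p) = \int p\log p\,dx$, we have $\delta_p\mathcal{F} = \log p + 1$, and the gradient flow above becomes $\dot p = \nabla\cdot(p\nabla(\log p + 1)) = \Delta p$, the heat equation (Otto). A minimal explicit finite-difference sketch in 1D, with illustrative grid, step sizes, and initial density:

```python
import numpy as np

# Wasserstein gradient flow of the entropy F(p) = ∫ p log p dx reduces to the
# heat equation ∂_t p = Δp; explicit finite-difference sketch on a periodic 1D grid.
L, nx, dt, n_steps = 10.0, 200, 1e-3, 2000
x = np.linspace(-L / 2, L / 2, nx)
dx = x[1] - x[0]

# Initial density: a bimodal mixture, normalized on the grid.
p = np.exp(-(x - 1.5) ** 2) + np.exp(-(x + 1.5) ** 2)
p /= np.sum(p) * dx

for _ in range(n_steps):
    lap = (np.roll(p, 1) - 2 * p + np.roll(p, -1)) / dx ** 2   # periodic Laplacian
    p = p + dt * lap

print(np.sum(p) * dx)   # mass is conserved along the flow
```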
  26. Generative adversarial networks. For each parameter $\theta\in\mathbb{R}^d$ and a given neural-network-parameterized mapping function $g_\theta$, consider $p_\theta = g_{\theta\#}p(z)$.
  27. Natural proximal on probability models. Consider the gradient flow
    $\dot\theta = -G(\theta)^{-1}\nabla_\theta\mathcal{F}(p(\theta,\cdot)),$
    where $G(\theta) = \big(\nabla_\theta p,\,(-\Delta_p)^{-1}\nabla_\theta p\big)_{L^2}$. The update scheme follows
    $\theta^{k+1} = \arg\min_{\theta\in\Theta}\ \mathcal{F}(p_\theta) + \frac{1}{2h}\,D_W(\theta,\theta^k)^2,$
    where $\theta$ denotes the parameters of the generator, $\mathcal{F}(p_\theta)$ is the loss function, and $D_W$ is the Wasserstein metric. In practice [Lin, Li, Osher, et al. 2018], consider
    $\theta^{k+1} = \arg\min_{\theta\in\Theta}\ \mathcal{F}(p_\theta) + \frac{1}{B}\sum_{i=1}^{B}\frac{1}{2h}\,\big\|g_\theta(z_i) - g_{\theta^k}(z_i)\big\|^2,$
    where $g_\theta$ is the generator, $B$ is the batch size, and $z_i\sim p(z)$ are inputs to the generator.
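A schematic PyTorch sketch (not the authors' implementation) of the practical update above: each outer step freezes $g_{\theta^k}$ and runs a few inner optimizer steps on $\mathcal{F}(p_\theta)$ plus the relaxed proximal term $\frac{1}{2hB}\sum_i\|g_\theta(z_i)-g_{\theta^k}(z_i)\|^2$. The generator, loss function, latent sampler, and hyperparameters are placeholders.

```python
import copy
import torch

def wasserstein_proximal_step(generator, loss_fn, sample_z, h=0.1,
                              batch_size=64, inner_steps=5, lr=1e-4):
    """One relaxed Wasserstein proximal update of the generator parameters θ.

    Approximately solves θ^{k+1} = argmin_θ F(p_θ) + (1/(2hB)) Σ_i ||g_θ(z_i) - g_{θ^k}(z_i)||².
    `generator` maps latent z to samples of shape (B, d), `loss_fn(fake)` returns F(p_θ)
    on a batch, `sample_z(B)` draws B latent vectors; all three are placeholders.
    """
    frozen = copy.deepcopy(generator).eval()            # g_{θ^k}, kept fixed
    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    for _ in range(inner_steps):
        z = sample_z(batch_size)
        fake, anchor = generator(z), frozen(z).detach()
        prox = ((fake - anchor) ** 2).sum(dim=1).mean() / (2 * h)
        loss = loss_fn(fake) + prox
        opt.zero_grad()
        loss.backward()
        opt.step()
    return generator
```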
  28. Examples: Jensen-Shannon entropy loss. Figure: the relaxed Wasserstein proximal of GANs, on the CIFAR-10 (left) and CelebA (right) datasets.
  29. Part III: Transport information scientific computing and numerical analysis. Lee, Wang, Li, Deep JKO: time-implicit particle methods for general nonlinear gradient flows, ongoing; Zhao, Li, Scaling limit of Wasserstein Gaussian mixture models, preprint in preparation; G. Fu, S. Osher, W.C. Li, High order spatial discretization for variational time implicit schemes: Wasserstein gradient flows and reaction-diffusion systems, accepted in Journal of Computational Physics, July 2023.
  30. Part IV: Transport information optimization in MCMC and Bayesian inverse problems. A. Garbuno-Inigo, F. Hoffmann, W.C. Li, A.M. Stuart, Interacting Langevin Diffusions: Gradient Structure and Ensemble Kalman Sampler, SIAM Journal on Applied Dynamical Systems, 2019; Y.F. Wang, W.C. Li, Information Newton flow: second-order optimization method in probability space, Jan 2020; Y.F. Wang, W.C. Li, Accelerated Information Gradient Flow, accepted in Journal of Scientific Computing, Nov 2021; Y.F. Wang, P. Chen, W.C. Li, Projected Wasserstein gradient descent for high-dimensional Bayesian inference, SIAM/ASA Journal on Uncertainty Quantification, 2022; Y.F. Wang, P. Chen, M. Pilanci, W.C. Li, Optimal Neural Network Approximation of Wasserstein Gradient Direction via Convex Optimization, May 2022.
  31. Bayesian logistic regression. Figure: comparison of different methods on Bayesian logistic regression, averaged over 10 independent trials. The shaded areas show the variance over the 10 trials. Top: BM; Bottom: MED. Left: test accuracy; Right: test log-likelihood.
  32. Future works. Closing diagram linking complex systems, Bayesian sampling, time discretization, spatial discretization, entropy, entropy flux, metric, and optimization.