Slide 1

Transport Information Geometric Computations
Wuchen Li, University of South Carolina, Columbia
AFOSR Comp Math review meeting, Aug 10, 2023
Supported by AFOSR YIP award No. FA9550-23-1-0087, starting from 05/01/2023.

Slide 2

Current Collaborators
UCLA: S. Osher; Osher’s group: S.T. Liu, S. Liu, M. Zhou, X. Zuo, T. Meng, Y. Liu, Y. Hong (visiting student); Q. Gu (Computer Science);
U of SC: L. Lu, W. Dahmen, H. Wang, Z. Wang;
UH/Princeton: Z. Han, V. Poor;
Brown/ND: C. Shu, G. Fu;
U Mass-Amherst: M. Katsoulakis’ group, B. Zhang;
Caltech: A. Stuart, F. Hoffmann, Y. Chen, et al.;
UCI: T. Georgiou;
UMN: L. Wang, W. Lee;
Stanford: M. Pilanci, Y. Wang;
Gatech: P. Chen;
FSU/U-M: Q. Feng, B. Erhan;
Duke/Purdue: J. Liu, Y. Gao; ...

Slide 3

Machine learning applications

Slide 4

Image sampling problems
Figure: Results from Wasserstein Generative Adversarial Networks; the samples are drawn from image distributions.

Slide 5

Reversible diffusions

Slide 6

AI, Complex systems, Math

Slide 7

Computational stochastic processes
Underdamped Langevin dynamics (ULA); Regularized Wasserstein proximal (Li, Liu, Osher; Y, Osher, Li, 2023)

Slide 8

Transport, Entropy & Information, Geometry

Slide 9

Information Geometry
The information matrix (a.k.a. Fisher information matrix, Fisher–Rao metric) plays important roles in estimation, information science, statistics, and machine learning:
Statistics and information science: likelihood principle; Cramér–Rao bound, etc.
Statistical physics: non-equilibrium thermodynamics (Sosuke Ito et al.).
Population games via replicator dynamics (Shahshahani, Smith): applications in biology, economics, cancer, epidemic propagation, etc.
AI inference problems: f-divergence, Amari α-divergence, etc.
Information-geometric optimization methods (KFAC, ADAM), with applications in AI: natural gradient (Amari); ADAM (Kingma et al., 2014); stochastic relaxation (Malago); and many more in the book Information Geometry (Ay et al.).

Slide 10

Learning: Objective function
Given a data measure $p_{\mathrm{data}}(x) = \frac{1}{N}\sum_{i=1}^N \delta_{X_i}(x)$ and a parameterized model $p(x,\theta)$. Most problems in AI or sampling refer to
$$\min_{p_\theta \in p(\Theta)} D_{\mathrm{KL}}(p_{\mathrm{data}}, p_\theta).$$
One typical choice of $D_{\mathrm{KL}}$ is the Kullback–Leibler divergence (relative entropy; Gibbs free energy)
$$D_{\mathrm{KL}}(p_{\mathrm{data}}, p_\theta) = \int_\Omega p_{\mathrm{data}}(x)\,\log\frac{p_{\mathrm{data}}(x)}{p(x,\theta)}\,dx.$$
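
With the empirical data measure, the first term of the divergence does not depend on $\theta$, so minimizing the KL objective is the same as maximizing the average log-likelihood; a short formal expansion (standard, not spelled out on the slide):
$$D_{\mathrm{KL}}(p_{\mathrm{data}}, p_\theta) = \int_\Omega p_{\mathrm{data}}(x)\log p_{\mathrm{data}}(x)\,dx - \int_\Omega p_{\mathrm{data}}(x)\log p(x,\theta)\,dx = \mathrm{const} - \frac{1}{N}\sum_{i=1}^N \log p(X_i,\theta).$$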

Slide 11

Learning: Gradient direction
The natural gradient method refers to
$$\theta^{k+1} = \theta^k - h\,G_F(\theta^k)^{-1}\,\nabla_\theta D_{\mathrm{KL}}(p_{\mathrm{data}}, p_{\theta^k}),$$
where $h>0$ is a stepsize and $G_F(\theta)\in\mathbb{R}^{|\Theta|\times|\Theta|}$, with
$$G_F(\theta)_{ij} = \mathbb{E}_{X\sim p_\theta}\Big[\nabla_{\theta_i}\log p(X,\theta)\cdot\nabla_{\theta_j}\log p(X,\theta)\Big],$$
being the Fisher information matrix.
Why natural gradient and the Fisher information matrix? Parameterization invariance; preconditioning for KL-divergence related learning problems; Fisher efficiency; online Cramér–Rao bound.
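
A minimal numerical sketch of the update above (not from the talk): one natural-gradient step for a 1D Gaussian model $p(x;\theta)=\mathcal{N}(\mu,\sigma^2)$ with $\theta=(\mu,\log\sigma)$, where the Fisher matrix is estimated by Monte Carlo from outer products of the score; the function and variable names are illustrative only.

```python
# A sketch only: natural-gradient descent for a 1D Gaussian model p(x; theta),
# theta = (mu, log_sigma); the Fisher matrix is estimated by Monte Carlo.
import numpy as np

np.random.seed(0)

def score(x, mu, log_sigma):
    """Gradient of log N(mu, exp(log_sigma)^2) with respect to (mu, log_sigma)."""
    sigma = np.exp(log_sigma)
    return np.stack([(x - mu) / sigma**2,
                     (x - mu)**2 / sigma**2 - 1.0], axis=-1)

def natural_gradient_step(data, theta, h=0.1, n_mc=4096, eps=1e-6):
    mu, log_sigma = theta
    # Euclidean gradient of D_KL(p_data, p_theta): minus the average score over the data.
    grad = -score(data, mu, log_sigma).mean(axis=0)
    # Fisher matrix G_F(theta) = E_{X ~ p_theta}[score score^T], estimated by sampling p_theta.
    xs = mu + np.exp(log_sigma) * np.random.randn(n_mc)
    s = score(xs, mu, log_sigma)
    G_F = s.T @ s / n_mc + eps * np.eye(2)
    # theta^{k+1} = theta^k - h G_F(theta^k)^{-1} grad
    return theta - h * np.linalg.solve(G_F, grad)

data = 2.0 + 0.5 * np.random.randn(1000)
theta = np.array([0.0, 0.0])
for _ in range(200):
    theta = natural_gradient_step(data, theta)
print(theta)   # approaches (mu, log sigma) = (2.0, log 0.5)
```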

Slide 12

Optimal transport
In recent years, optimal transport (a.k.a. Earth mover's distance, Monge-Kantorovich problem, Wasserstein distance) has seen a wide range of applications:
Theory (Brenier, Gangbo, Villani, Figalli, et al.);
Gradient flows (Otto, Villani, Maas, Mielke, et al.);
Proximal methods (Jordan, Kinderlehrer, Otto, et al.);
Physics: GENERICs/pre-GENERICs (Grmela and Ottinger, Duong, et al.), flux-gradient flows (Li, Liu, Osher);
Mean field games (Lasry, Lions, et al.); population games; evolutionary game theory (Degond, Liu, Li, et al.);
Image processing (Rubner et al. 2000, Tannenbaum, Georgiou, et al.);
Inverse problems (Stuart, Li, Osher, et al.);
Scientific computing (Benamou, Brenier, Carrillo, Oberman, Peyre, Osher, et al.);
AI and machine learning: see NIPS, ICLR, ICML 2015–2023;
Minimal surfaces in Wasserstein spaces (Li, Georgiou); Wasserstein common noises (Li).

Slide 13

Optimal transport density manifold
Optimal transport has an optimal control reformulation via the dual of the dual of linear programming:
$$D_W(p^0,p^1)^2 = \inf_{p_t}\Big\{\int_0^1 g_W(\partial_t p_t,\partial_t p_t)\,dt = \int_0^1\!\!\int_\Omega (\nabla\Phi_t,\nabla\Phi_t)\,p_t\,dx\,dt\Big\},$$
where the infimum runs over all vector fields $v_t = \nabla\Phi_t$, under the constraint of the continuity equation:
$$\partial_t p_t + \nabla\cdot(p_t\nabla\Phi_t) = 0,\qquad p_0 = p^0,\quad p_1 = p^1.$$
Here $(\mathcal{P}(\Omega), g_W)$ forms an infinite-dimensional Riemannian manifold.¹
¹John D. Lafferty: The density manifold and configuration space quantization, 1988.
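
In one dimension the dynamic formulation above is rarely solved directly: the squared distance has the closed quantile form $D_W(p^0,p^1)^2=\int_0^1|F_0^{-1}(s)-F_1^{-1}(s)|^2\,ds$ (a standard fact, not stated on the slide). A minimal sketch checking this against the Gaussian closed form $(\mu_0-\mu_1)^2+(\sigma_0-\sigma_1)^2$:

```python
# A sketch only: 1D optimal transport via quantiles, checked against the
# Gaussian closed form (mu0 - mu1)^2 + (s0 - s1)^2.
import numpy as np
from scipy.stats import norm

mu0, s0, mu1, s1 = 0.0, 1.0, 2.0, 0.5
s = (np.arange(20_000) + 0.5) / 20_000                     # midpoint rule on (0, 1)
gap = norm.ppf(s, mu0, s0) - norm.ppf(s, mu1, s1)          # F0^{-1}(s) - F1^{-1}(s)
print(np.mean(gap**2), (mu0 - mu1)**2 + (s0 - s1)**2)      # both approximately 4.25
```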

Slide 14

Learning Wasserstein type gradient directions
Goal: We work on optimal transport induced geometries (information matrices) and dynamics in probability models. They are used to design mathematically safe algorithms: fast, efficient/scalable, and accurate.
Literature review:
Wasserstein gradient flow: Otto (2001);
Natural gradient works efficiently in learning: Amari (1998);
Constrained gradient flows in Wasserstein spaces: Carlen, Gangbo (2003).

Slide 15

TIG computations of complex systems
Goals:
Fast: convexities in nonlinear optimization in probability spaces;
Accurate: approximation in nonlinear probability models;
Scalable/Efficient: computation of nonlinear PDEs and controls in complex systems.

Slide 16

TIG directions
TIG geometric calculus: continuous domain (Lafferty, Lott, Li), discrete domain/complex networks (Li);
TIG convexity analysis: generalized Ricci curvatures in interacting stochastic systems: continuous domain (Li, Feng, Erhan), discrete domain/complex networks (Li, Lu);
TIG statistical formulations (Li, Amari, Zhao, Matsuda, Fabio, et al.);
TIG AI scientific computing (Li, Osher, Liu, Lee, Hong, Wang, et al.);
TIG optimization methods (Li, Osher, Stuart, Franca, Lin, Wang, Pilanci, Chen, Gu, et al.);
TIG high-order structure-preserving computations (Li, Osher, Fu, et al.);
TIG modeling in social systems, pandemics, engineering, and more (Li, Osher, Liu, Lee, Han, Gao, Liu, et al.);
TIG inverse problems of complex systems (Li, Osher, S.T. Liu, S. Liu, X. Zhe, M. Zhou, et al.);
TIG quantum mechanics (free probability) and computing (Li, Becker, Jekel, Shlyakhtenko).

Slide 17

Part I: Transport information statistical formulations
Main results:
Li, Transport information geometry: Riemannian calculus on probability simplex, 2018;
Li, Zhao, Wasserstein information matrix, 2019;
Li, Rubio, On a prior based on the Wasserstein information matrix, 2022;
Amari, Matsuda, Wasserstein statistics in 1D location-scale model, March 2020;
Amari, Matsuda, Information Geometry of Wasserstein Statistics on Shapes and Affine Deformations, 2023.
Review paper: G. Khan, J. Zhang, When Optimal Transport Meets Information Geometry, 2022.

Slide 18

Pull-back = computational schemes
Probability models: finite volume; finite element; kernel methods; Gaussian; exponential family; mixed Gaussian; generative models by deep neural network functions: neural ODEs (explicit); generative adversarial networks (implicit).

Slide 19

Information matrix
Definition (Statistical information matrix). Consider the density manifold $(\mathcal{P}(X), g)$ with a metric tensor $g$, and a smoothly parameterized statistical model $p_\theta$ with parameter $\theta\in\Theta\subset\mathbb{R}^d$. Then the pull-back $G$ of $g$ onto the parameter space $\Theta$ is given by
$$G(\theta) = \Big\langle \nabla_\theta p_\theta,\ g(p_\theta)\,\nabla_\theta p_\theta\Big\rangle.$$
Denote $G(\theta) = (G(\theta)_{ij})_{1\le i,j\le d}$; then
$$G(\theta)_{ij} = \int_X \frac{\partial}{\partial\theta_i}p(x;\theta)\,\Big(g(p_\theta)\frac{\partial}{\partial\theta_j}p\Big)(x;\theta)\,dx.$$
Here we name $g$ the statistical metric and call $G$ the statistical information matrix.

Slide 20

Score
Definition (Score function). Denote $\Phi_i: X\times\Theta\to\mathbb{R}$, $i = 1,\dots,n$, satisfying
$$\Phi_i(x;\theta) = \Big(g(p)\,\frac{\partial}{\partial\theta_i}p\Big)(x;\theta).$$
They are the score functions associated with the statistical information matrix $G$ and are equivalence classes in $C(X)/\mathbb{R}$. The representatives in the equivalence classes are determined by the following normalization condition:
$$\mathbb{E}_{p_\theta}\,\Phi_i = 0,\quad i = 1,\dots,n.$$
Then the statistical information matrix satisfies
$$G(\theta)_{ij} = \int_X \Phi_i(x;\theta)\,\Big(g(p_\theta)^{-1}\Phi_j\Big)(x;\theta)\,dx.$$

Slide 21

Examples: Fisher information matrix
Consider $g(p)^{-1} = p$. Then
$$G_F(\theta)_{ij} = \mathbb{E}_{X\sim p_\theta}\big[\Phi_i(X;\theta)\,\Phi_j(X;\theta)\big],$$
where $\Phi_k = \frac{1}{p_\theta}\nabla_{\theta_k}p_\theta$, $k\in\{i,j\}$. Notice the fact that $\frac{1}{p}\nabla_{\theta_k}p = \nabla_{\theta_k}\log p$; then
$$G_F(\theta)_{ij} = \mathbb{E}_{X\sim p_\theta}\Big[\frac{\partial}{\partial\theta_i}\log p(X;\theta)\,\frac{\partial}{\partial\theta_j}\log p(X;\theta)\Big].$$
In the literature, $G_F(\theta)$ is known as the Fisher information matrix and $\nabla_\theta\log p(X;\theta)$ is named the score function.

Slide 22

Examples: Wasserstein information matrix
Consider $g(p)^{-1} = -\nabla\cdot(p\nabla)$. Then
$$G_W(\theta)_{ij} = \mathbb{E}_{X\sim p_\theta}\big[\nabla_x\Phi^W_i(X;\theta)\cdot\nabla_x\Phi^W_j(X;\theta)\big],$$
where $-\nabla_x\cdot\big(p_\theta\,\nabla_x\Phi^W_k\big) = \nabla_{\theta_k}p_\theta$, $k\in\{i,j\}$. Here we call $G_W(\theta)$ the Wasserstein information matrix and name $\Phi^W$ the Wasserstein score function.

Slide 23

Distance/Divergence and information matrix
Specifically, given a smooth family of probability densities $p(x;\theta)$ and a perturbation $\Delta\theta\in T_\theta\Theta$, consider the following Taylor expansions in terms of $\Delta\theta$:
$$D_{\mathrm{KL}}(p(\theta)\,\|\,p(\theta+\Delta\theta)) = \frac{1}{2}\,\Delta\theta^{\mathsf T} G_F(\theta)\,\Delta\theta + o(\|\Delta\theta\|^2),$$
and
$$D_W(p(\theta+\Delta\theta), p(\theta))^2 = \Delta\theta^{\mathsf T} G_W(\theta)\,\Delta\theta + o(\|\Delta\theta\|^2).$$
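
A quick numerical sanity check of the first expansion (illustrative, not from the talk), using the closed-form KL divergence between two 1D Gaussians and the Fisher matrix $G_F(\mu,\sigma)=\mathrm{diag}(1/\sigma^2,\,2/\sigma^2)$ of the Gaussian family listed in the table later in the deck:

```python
# A sketch only: compare exact KL between nearby 1D Gaussians with the
# quadratic form (1/2) dtheta^T G_F(theta) dtheta, theta = (mu, sigma).
import numpy as np

def kl_gauss(mu0, s0, mu1, s1):
    """Closed-form KL( N(mu0, s0^2) || N(mu1, s1^2) )."""
    return np.log(s1 / s0) + (s0**2 + (mu0 - mu1)**2) / (2 * s1**2) - 0.5

mu, sigma = 1.0, 2.0
G_F = np.diag([1 / sigma**2, 2 / sigma**2])       # Fisher matrix of the Gaussian family
for eps in [1e-1, 1e-2, 1e-3]:
    d = eps * np.array([0.3, -0.7])               # perturbation Delta theta
    exact = kl_gauss(mu, sigma, mu + d[0], sigma + d[1])
    quadratic = 0.5 * d @ G_F @ d
    print(eps, exact, quadratic)                  # the two values agree up to o(|Delta theta|^2)
```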

Slide 24

PDEs connecting Wasserstein and Fisher scores
The Wasserstein score functions $\Phi^W_i(x;\theta)$ satisfy the following Poisson equation:
$$\nabla_x\log p(x;\theta)\cdot\nabla_x\Phi^W_i(x;\theta) + \Delta_x\Phi^W_i(x;\theta) = -\frac{\partial}{\partial\theta_i}\log p(x;\theta).$$
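
As a quick sanity check (not on the slide), take the Gaussian location family $p(x;\mu)=\mathcal{N}(\mu,\sigma^2)$ with fixed $\sigma$: the Wasserstein score is $\Phi^W_\mu(x)=x-\mu$, and
$$\nabla_x\log p\cdot\nabla_x\Phi^W_\mu + \Delta_x\Phi^W_\mu = -\frac{x-\mu}{\sigma^2}\cdot 1 + 0 = -\frac{\partial}{\partial\mu}\log p(x;\mu),$$
since $\frac{\partial}{\partial\mu}\log p(x;\mu) = \frac{x-\mu}{\sigma^2}$. This also recovers $G_W(\mu)=\mathbb{E}\,|\nabla_x\Phi^W_\mu|^2=1$, matching the Gaussian entry of the Wasserstein information matrix later in the deck.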

Slide 25

One-dimensional sample space
If $X\subset\mathbb{R}^1$, the Wasserstein score functions satisfy
$$\Phi^W_i(x;\theta) = -\int^x \frac{1}{p(z;\theta)}\,\frac{\partial}{\partial\theta_i}F(z;\theta)\,dz,$$
where $F(x;\theta) = \int_{-\infty}^x p(y;\theta)\,dy$ is the cumulative distribution function. And the Wasserstein information matrix² satisfies
$$G_W(\theta)_{ij} = \mathbb{E}_{X\sim p_\theta}\left[\frac{\frac{\partial}{\partial\theta_i}F(X;\theta)\,\frac{\partial}{\partial\theta_j}F(X;\theta)}{p(X;\theta)^2}\right].$$
The Wasserstein score function is the average of the cumulative Fisher score function, and the Wasserstein information matrix is the covariance of the density average of the cumulative Fisher score function.
²Chen, Li, Wasserstein natural gradient in continuous sample space, 2018.
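
A minimal Monte Carlo check of the matrix formula above (illustrative, not from the talk): for the Gaussian family $p(x;\mu,\sigma)$ one has $\partial_\mu F = -p$ and $\partial_\sigma F = -\frac{x-\mu}{\sigma}\,p$, so the formula should return the identity matrix, matching the Gaussian row of the information-matrix table later in the deck.

```python
# A sketch only: Monte Carlo evaluation of G_W for the Gaussian family via the
# 1D cumulative-distribution formula; the result should be the 2x2 identity.
import numpy as np
from scipy.stats import norm

mu, sigma = 1.0, 2.0
x = np.random.default_rng(0).normal(mu, sigma, size=200_000)    # X ~ p_theta
p = norm.pdf(x, loc=mu, scale=sigma)
dF = np.stack([-p, -p * (x - mu) / sigma])      # rows: dF/dmu and dF/dsigma at X
G_W = (dF[:, None, :] * dF[None, :, :] / p**2).mean(axis=-1)
print(np.round(G_W, 3))                         # approximately [[1, 0], [0, 1]]
```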

Slide 26

Separability
If $p(x;\theta)$ is an independence model, i.e.
$$p(x;\theta) = \prod_{k=1}^n p_k(x_k;\theta),\qquad x = (x_1,\dots,x_n),$$
then there exists a set of one-dimensional functions $\Phi^{W,k}: X_k\times\Theta_k\to\mathbb{R}$ such that
$$\Phi^W(x;\theta) = \sum_{k=1}^n \Phi^{W,k}(x_k;\theta).$$
In addition, the Wasserstein information matrix is separable:
$$G_W(\theta) = \sum_{k=1}^n G^k_W(\theta),\qquad\text{where}\quad G^k_W(\theta) = \mathbb{E}_{X\sim p_\theta}\big[\nabla_{x_k}\Phi^{W,k}(X;\theta)\cdot\nabla_{x_k}\Phi^{W,k}(X;\theta)\big].$$

Slide 27

Analytic examples
Consider a one-dimensional generative model: $p(\cdot;\theta) = f_{\theta\#}p_0(\cdot)$, with $p_0$ a given distribution,
$$G_W(\theta) = \int |\nabla_\theta f_\theta(x)|^2\,p_0(x)\,dx.$$
For a single ReLU function
$$f_\theta(x) = \sigma(x-\theta) = \begin{cases}0, & x\le\theta,\\ x-\theta, & x>\theta,\end{cases}$$
we have $G_W(\theta) = F_0(\theta)$, with $F_0$ the cumulative distribution function of $p_0$.
Figure: Two examples of the push-forward family described above, with parameters $\theta_1 = 3$, $\theta_2 = 5$.

Slide 28

Statistical information matrices for various probability families (Wasserstein vs. Fisher):

Uniform: $p(x;a,b) = \frac{1}{b-a}\mathbf{1}_{(a,b)}(x)$; $\ G_W(a,b) = \frac{1}{3}\begin{pmatrix}1 & \frac12\\ \frac12 & 1\end{pmatrix}$; $\ G_F(a,b)$ not well-defined.

Gaussian: $p(x;\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$; $\ G_W(\mu,\sigma) = \begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}$; $\ G_F(\mu,\sigma) = \begin{pmatrix}\frac{1}{\sigma^2} & 0\\ 0 & \frac{2}{\sigma^2}\end{pmatrix}$.

Exponential: $p(x;m,\lambda) = \lambda\,e^{-\lambda(x-m)}$; $\ G_W(m,\lambda) = \begin{pmatrix}1 & 0\\ 0 & \frac{2}{\lambda^4}\end{pmatrix}$; $\ G_F(m,\lambda)$ not well-defined.

Laplacian: $p(x;m,\lambda) = \frac{\lambda}{2}\,e^{-\lambda|x-m|}$; $\ G_W(m,\lambda) = \begin{pmatrix}1 & 0\\ 0 & \frac{2}{\lambda^4}\end{pmatrix}$; $\ G_F(m,\lambda) = \begin{pmatrix}\lambda^2 & 0\\ 0 & \frac{1}{\lambda^2}\end{pmatrix}$.

Location-scale: $p(x;m,\sigma) = \frac{1}{\sigma}\,p\!\big(\frac{x-m}{\sigma}\big)$; $\ G_W(\sigma,m) = \begin{pmatrix}\frac{\mathbb{E}_{\sigma,m}x^2 - 2m\,\mathbb{E}_{\sigma,m}x + m^2}{\sigma^2} & 0\\ 0 & 1\end{pmatrix}$; $\ G_F(\sigma,m) = \begin{pmatrix}\frac{1}{\sigma^2}\Big(1 + \int_{\mathbb{R}}\big(\frac{(x-m)^2 p'^2}{\sigma^2 p} + (x-m)p'\big)dx\Big) & \int_{\mathbb{R}}\frac{(x-m)p'^2}{\sigma^3 p}\,dx\\ \int_{\mathbb{R}}\frac{(x-m)p'^2}{\sigma^3 p}\,dx & \frac{1}{\sigma^2}\int_{\mathbb{R}}\frac{p'^2}{p}\,dx\end{pmatrix}$.

Independent: $p(x,y;\theta) = p(x;\theta)\,p(y;\theta)$; $\ G_W(x,y;\theta) = G^1_W(x;\theta) + G^2_W(y;\theta)$; $\ G_F(x,y;\theta) = G^1_F(x;\theta) + G^2_F(y;\theta)$.

ReLU push-forward: $p(x;\theta) = f_{\theta\#}p(x)$, with $f_\theta$ a $\theta$-parameterized ReLU; $\ G_W(\theta) = F(\theta)$, $F$ the cdf of $p(x)$; $\ G_F(\theta)$ not well-defined.

Table: Wasserstein and Fisher information matrices for various probability families.

Slide 29

Part II: Transport information proximal methods in generative AI
Lin, Li, Osher, et al., Wasserstein proximal of GANs, GSI, 2018;
Li, Lin, et al., Affine proximal learning, GSI, 2019;
Arbel, Gretton, Li, et al., Kernelized Wasserstein Natural Gradient, ICLR (spotlight), 2020;
Liu, Li, et al., Neural parametric Fokker-Planck equations, SIAM Journal on Numerical Analysis, 2020.

Slide 30

Wasserstein natural gradient
Given a loss function $\mathcal{F}: \mathcal{P}(\Omega)\to\mathbb{R}$ and a probability model $p(\cdot,\theta)$, the associated gradient flow on a Riemannian manifold is defined by
$$\frac{d\theta}{dt} = -\nabla_g\mathcal{F}(p(\cdot,\theta)).$$
Here $\nabla_g$ is the Riemannian gradient operator satisfying
$$g_\theta\big(\nabla_g\mathcal{F}(p(\cdot,\theta)),\,\xi\big) = \nabla_\theta\mathcal{F}(p(\cdot,\theta))\cdot\xi$$
for any tangent vector $\xi\in T_\theta\Theta$, where $\nabla_\theta$ represents the differential operator (Euclidean gradient).

Slide 31

Proximal method on a probability space
Denote $G(p)^{-1} = -\nabla\cdot(p\nabla)$. Consider the gradient flow:
$$\dot p = -G(p)^{-1}\,\delta_p\mathcal{F}(p) = \nabla\cdot\big(p\,\nabla\delta_p\mathcal{F}(p)\big).$$
The proximal method refers to
$$p^{k+1} = \arg\min_{p\in\mathcal{P}}\ \mathcal{F}(p) + \frac{1}{2h}\,d_G(p, p^k)^2,$$
where $h>0$ is the step size and $d_G(\cdot,\cdot)$ is the distance function induced by $G$, known as the transport distance. This is known as the variational backward Euler method:
$$p^{k+1} = p^k + h\,\nabla\cdot\big(p^{k+1}\nabla\delta_p\mathcal{F}(p^{k+1})\big) + o(h).$$
A linearization of proximal methods has been formulated by (Li, Lu, Wang, 2019).
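
A minimal sketch of the explicit (forward-in-time) counterpart of this flow (illustrative, not the variational scheme on the slide): with $\mathcal{F}(p)=\int p\log p\,dx$ one has $\delta_p\mathcal{F}=\log p+1$, so $\dot p=\nabla\cdot(p\nabla\delta_p\mathcal{F})=\Delta p$, and a finite-difference forward Euler step should conserve mass and decrease $\mathcal{F}$. Grid and step sizes below are arbitrary choices.

```python
# A sketch only: explicit version of the flow above with F(p) = \int p log p dx,
# for which delta F / delta p = log p + 1 and the flow reduces to the heat equation.
import numpy as np

n, L = 200, 1.0
dx = L / n
x = (np.arange(n) + 0.5) * dx
p = np.exp(-80 * (x - 0.3)**2) + 0.5 * np.exp(-80 * (x - 0.7)**2)
p /= p.sum() * dx                                     # normalize to a probability density

def free_energy(p):
    return np.sum(p * np.log(p + 1e-300)) * dx        # F(p) = integral of p log p

print(free_energy(p))
dt = 0.4 * dx**2                                      # stable explicit step for the heat flow
for _ in range(2000):
    lap = (np.roll(p, 1) - 2 * p + np.roll(p, -1)) / dx**2    # periodic Laplacian of p
    p = p + dt * lap                                  # forward Euler on dp/dt = Delta p
print(p.sum() * dx, free_energy(p))                   # mass stays 1; F(p) has decreased
```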

Slide 32

Generative Adversarial Networks
For each parameter $\theta\in\mathbb{R}^d$ and a given neural-network-parameterized mapping function $g_\theta$, consider $p_\theta = g_{\theta\#}p(z)$.

Slide 33

Natural proximal on probability models
Consider the gradient flow:
$$\dot\theta = -G(\theta)^{-1}\nabla_\theta\mathcal{F}(p(\theta,\cdot)),\qquad\text{where}\quad G(\theta) = \big(\nabla_\theta p,\ (-\Delta_{p_\theta})^{-1}\nabla_\theta p\big)_{L^2},\quad \Delta_{p_\theta} := \nabla\cdot(p_\theta\nabla).$$
The update scheme follows
$$\theta^{k+1} = \arg\min_{\theta\in\Theta}\ \mathcal{F}(p_\theta) + \frac{1}{2h}D_W(\theta,\theta^k)^2,$$
where $\theta$ denotes the parameters of the generator, $\mathcal{F}(p_\theta)$ is the loss function, and $D_W$ is the Wasserstein metric. In practice [Lin, Li, Osher, et al. 2018], consider
$$\theta^{k+1} = \arg\min_{\theta\in\Theta}\ \mathcal{F}(p_\theta) + \frac{1}{B}\sum_{i=1}^B \frac{1}{2h}\|g_\theta(z_i) - g_{\theta^k}(z_i)\|^2,$$
where $g_\theta$ is the generator, $B$ is the batch size, and $z_i\sim p(z)$ are inputs to the generator.
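
A minimal PyTorch-style sketch of one relaxed proximal update (an approximation that takes a single gradient step on the regularized objective rather than solving the inner argmin; `G`, `gan_loss`, and the optimizer are illustrative placeholders, not the authors' code):

```python
# A sketch only: one update on  F(p_theta) + (1/B) sum_i ||g_theta(z_i) - g_{theta_k}(z_i)||^2 / (2h).
import copy
import torch

def natural_proximal_step(G, gan_loss, z, optimizer, h=0.1):
    G_k = copy.deepcopy(G)                      # snapshot of the previous generator g_{theta_k}
    for p in G_k.parameters():
        p.requires_grad_(False)
    optimizer.zero_grad()
    with torch.no_grad():
        ref = G_k(z)                            # g_{theta_k}(z_i), treated as constants
    prox = ((G(z) - ref) ** 2).flatten(1).sum(dim=1).mean() / (2.0 * h)
    loss = gan_loss(G, z) + prox                # generator loss plus the Wasserstein proximal term
    loss.backward()
    optimizer.step()
    return float(loss)
```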

Slide 34

Examples: Jensen–Shannon entropy loss
Figure: The relaxed Wasserstein proximal of GANs on the CIFAR-10 (left) and CelebA (right) datasets.

Slide 35

Part III: Transport information scientific computing and numerical analysis
Lee, Wang, Li, Deep JKO: time-implicit particle methods for general nonlinear gradient flows, ongoing;
Zhao, Li, Scaling limit of Wasserstein Gaussian mixture models, preprint in preparation;
G. Fu, S. Osher, W.C. Li, High order spatial discretization for variational time implicit schemes: Wasserstein gradient flows and reaction-diffusion systems, accepted in Journal of Computational Physics, July 2023.

Slide 36

High-order finite element JKO computations of droplet dynamics (Fu, Osher, Li, 2023)

Slide 37

High-order finite element JKO computations of reaction-diffusion equations

Slide 38

Deep JKO for nonlinear PDEs

Slide 39

Part IV: Transport information optimization in MCMC and Bayesian inverse problems
A. Garbuno-Inigo, F. Hoffmann, W.C. Li, A.M. Stuart, Interacting Langevin Diffusions: Gradient Structure and Ensemble Kalman Sampler, SIAM Journal on Applied Dynamical Systems, 2019;
Y.F. Wang, W.C. Li, Information Newton flow: second-order optimization method in probability space, Jan 2020;
Y.F. Wang, W.C. Li, Accelerated Information Gradient flow, accepted in Journal of Scientific Computing, Nov 2021;
Y.F. Wang, P. Chen, W.C. Li, Projected Wasserstein gradient descent for high-dimensional Bayesian inference, SIAM/ASA Journal on Uncertainty Quantification, 2022;
Y.F. Wang, P. Chen, M. Pilanci, W.C. Li, Optimal Neural Network Approximation of Wasserstein Gradient Direction via Convex Optimization, May 2022.

Slide 40

Bayesian logistic regression
Figure: Comparison of different methods on Bayesian logistic regression, averaged over 10 independent trials. The shaded areas show the variance over the 10 trials. Top: BM; Bottom: MED. Left: test accuracy; Right: test log-likelihood.

Slide 41

Projected Wasserstein gradient algorithms
Mean field games + Optimal transport + Optimization + Bayesian inverse problems

Slide 42

PWGD for COVID-19 dynamics

Slide 43

Neural network PWGD for PDE Bayesian inference

Slide 44

Future works (diagram): Complex systems, Bayesian sampling, time discretization, spatial discretization, entropy, entropy flux, metric, optimization.

Slide 45

Upcoming conferences
Lectures: Variational and Information flows in Machine learning and Optimal transport

Slide 46

Discussions: Mean field Information Geometry