Transport information geometry: current and future

Wuchen Li
May 02, 2020

In this talk, I will give a brief overview of the newly emerged area of transport information geometry. We quickly review information geometry, which emphasizes the geometry of learning models and theoretical AI, built on invariance under transformations of the hypotheses and their parametrizations. We then review the geometry of optimal transport, which starts from the geometry of the data/sample space and connects to equations of mathematical physics. Applying the information-geometric viewpoint to optimal transport yields what we call transport information geometry. Several current developments and future plans will be presented, along with many open problems in this direction.


Transcript

  1. Transport information geometry: Current and Future. Wuchen Li (UCLA / U of SC), IPAM, April 6, 2020.
  2. Divergences on probability space.

  3. Applications of divergences.

  4. Information matrix. The information matrix (a.k.a. Fisher information matrix, Fisher–Rao metric) plays important roles in estimation, information science, statistics, and machine learning:
     - Statistics and information science: likelihood principle; Cramer–Rao bound, etc.
     - Statistical physics.
     - Population games (mean field games) via replicator dynamics (Shahshahani, Smith), with applications in biology, economics, cancer, epidemic propagation, etc.
     - AI inference problems: f-divergence, Amari $\alpha$-divergence, etc.
     - Information-geometric optimization methods (KFAC, ADAM), with applications in AI: natural gradient (Amari); ADAM (Kingma 2014); stochastic relaxation (Malago); and many more in the book Information Geometry (Ay et al.).
  5. Probability models.

  6. Learning: objective function. Given a data measure $\rho_{\mathrm{data}}(x) = \frac{1}{N}\sum_{i=1}^N \delta_{X_i}(x)$ and a parameterized model $\rho(x,\theta)$, most problems in AI or sampling refer to $\min_{\rho_\theta \in \rho(\Theta)} D(\rho_{\mathrm{data}}, \rho_\theta)$. One typical choice of $D$ is the Kullback–Leibler divergence (relative entropy): $D(\rho_{\mathrm{data}}, \rho_\theta) = \int_\Omega \rho_{\mathrm{data}}(x)\log\frac{\rho_{\mathrm{data}}(x)}{\rho(x,\theta)}\,dx$.
  7. Learning: gradient direction. The natural gradient method refers to $\theta^{k+1} = \theta^k - h\,G_F(\theta^k)^{-1}\nabla_\theta D(\rho_{\mathrm{data}},\rho_{\theta^k})$, where $h > 0$ is a stepsize and $G_F(\theta) = \mathbb{E}_{X\sim\rho_\theta}\big[(\nabla_\theta\log\rho(X,\theta))^T(\nabla_\theta\log\rho(X,\theta))\big]$ is the Fisher information matrix. Why the natural gradient and Fisher information matrix? Parameterization invariance; preconditioning for KL-divergence-related learning problems; Fisher efficiency; online Cramer–Rao bound.
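As a concrete illustration of the natural gradient update above, here is a minimal Python sketch (our construction, not from the talk) for a 1-d Gaussian model $p(x;\mu,\sigma)$, where both $G_F$ and the KL gradient have closed forms; the target parameters, step size, and iteration count are illustrative choices.

```python
import numpy as np

def fisher_gaussian(mu, sigma):
    # Closed-form Fisher information matrix of N(mu, sigma^2): diag(1/s^2, 2/s^2)
    return np.array([[1.0 / sigma**2, 0.0],
                     [0.0, 2.0 / sigma**2]])

def kl_grad(theta, data_mu, data_sigma):
    # Euclidean gradient of KL(p_data || p_theta) for Gaussians (closed form)
    mu, sigma = theta
    dmu = (mu - data_mu) / sigma**2
    dsigma = 1.0 / sigma - (data_sigma**2 + (data_mu - mu)**2) / sigma**3
    return np.array([dmu, dsigma])

def natural_gradient_step(theta, h, data_mu, data_sigma):
    # theta^{k+1} = theta^k - h * G_F(theta^k)^{-1} * grad_theta KL
    G = fisher_gaussian(theta[0], theta[1])
    return theta - h * np.linalg.solve(G, kl_grad(theta, data_mu, data_sigma))

theta = np.array([0.0, 2.0])        # initial (mu, sigma)
for _ in range(100):
    theta = natural_gradient_step(theta, 0.5, data_mu=1.0, data_sigma=1.0)
print(theta)  # converges to the data parameters (1.0, 1.0)
```

The preconditioning by $G_F^{-1}$ makes the step sizes in $\mu$ and $\sigma$ adapt to the local curvature of the KL divergence.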
  8. Optimal transport. In recent years, optimal transport (a.k.a. Earth mover's distance, Monge–Kantorovich problem, Wasserstein distance) has witnessed many applications in AI: theory (Brenier, Gangbo, Villani, Figalli, et al.); gradient flows (Otto, Villani, et al.); proximal methods (Jordan, Kinderlehrer, Otto, et al.); mean field games (Lasry, Lions, et al.); population games via Fokker–Planck equations (Degond et al. 2014, Li et al. 2016); image retrieval (Rubner et al. 2000); Wasserstein of Wasserstein loss (Dukler, Li, Lin, Montufar); inverse problems (Stuart, Zhou, et al.; Ding, Li, Yin, Osher); scientific computing (Benamou, Brenier, Carrillo, Oberman, Peyre, Solomon, Osher, Li, Yin, et al.); AI and machine learning: Wasserstein training of Boltzmann machines (Cuturi et al. 2015), learning with a Wasserstein loss (Frogner et al. 2015), Wasserstein GAN (Bottou et al. 2017); see NeurIPS, ICLR, ICML 2015–2020.
  9. Why optimal transport? Optimal transport provides a particular distance ($W$) among histograms, which relies on the distance on the sample space (ground cost $c$). E.g., denote $X_0\sim\rho_0 = \delta_{x_0}$, $X_1\sim\rho_1 = \delta_{x_1}$. Compare $W(\rho_0,\rho_1) = \inf_{\pi\in\Pi(\rho_0,\rho_1)}\mathbb{E}_{(X_0,X_1)\sim\pi}\,c(X_0,X_1) = c(x_0,x_1)$; versus $\mathrm{TV}(\rho_0,\rho_1) = \int_\Omega|\rho_0(x)-\rho_1(x)|\,dx = 2$; versus $\mathrm{KL}(\rho_0\|\rho_1) = \int_\Omega\rho_0(x)\log\frac{\rho_0(x)}{\rho_1(x)}\,dx = \infty$.
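The comparison on this slide can be checked numerically. The sketch below (our construction, using SciPy's `wasserstein_distance` for the 1-d case) discretizes two point masses and confirms that $W$ tracks the ground distance while TV saturates at 2; KL is infinite for disjoint supports, so it is only noted in a comment.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two point masses at x0 = 0 and x1 = 3.
x0, x1 = 0.0, 3.0
w = wasserstein_distance([x0], [x1])   # = |x0 - x1| = 3: W sees the ground cost

# Discretized densities on the two-point support {x0, x1}.
p = np.array([1.0, 0.0])               # delta at x0
q = np.array([0.0, 1.0])               # delta at x1
tv = np.abs(p - q).sum()               # = 2, regardless of how far apart x0, x1 are
# KL(p || q) = sum p * log(p/q) = +infinity here (disjoint supports).
print(w, tv)
```

Moving $x_1$ further away changes $W$ but leaves TV (and the infinite KL) unchanged, which is the point of the slide.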
  10. Learning: Wasserstein objective/loss function. Given a data distribution $\rho_0$ and a probability model $\rho_1(\theta)$, consider $\min_{\theta\in\Theta} W(\rho_0,\rho_1(\theta))$. This is a double minimization problem, i.e. $W(\rho_0,\rho_1(\theta)) = \min_{\pi\in\Pi(\rho_0,\rho_1(\theta))}\mathbb{E}_{(X_0,X_1)\sim\pi}\,c(X_0,X_1)$. Many applications, such as Wasserstein GAN and the Wasserstein loss, are built on this formulation.
  11. Learning: Wasserstein gradient direction. Main question: besides looking at the Wasserstein/transport distance, I propose to work on the optimal-transport-induced information matrices and geometric structures in probability models. Literature review: Wasserstein gradient flow: Otto (2001); natural gradient works efficiently in learning: Amari (1998); constrained gradient: Carlen, Gangbo (2003); linear programming: Wong (2017), Amari, Karakida, Oizumi (2017); Gaussian measures: Takatsu (2011), Malago et al. (2017), Modin (2017), Yongxin et al. (2017), Sanctis (2017); and many more.
  12. Scope of transport information.

  13. Problem formulation. Mapping formulation: Monge problem (1781), Monge–Ampère equation; static formulation: Kantorovich problem (1940), linear programming; dynamical formulation: density optimal control (Nelson, Lafferty, Benamou, Brenier, Otto, Villani, etc.). In this talk, we apply density optimal control to learning problems.
  14. Density manifold. Optimal transport has an optimal control reformulation via the dual of the dual of linear programming: $\inf_{\rho_t}\int_0^1 g_W(\partial_t\rho_t,\partial_t\rho_t)\,dt = \int_0^1\int_\Omega(\nabla\Phi_t,\nabla\Phi_t)\,\rho_t\,dx\,dt$, under the dynamical constraint, i.e. the continuity equation $\partial_t\rho_t + \nabla\cdot(\rho_t\nabla\Phi_t) = 0$, with the endpoints $\rho_0$, $\rho_1$ fixed. Here $(\mathcal{P}(\Omega), g_W)$ forms an infinite-dimensional Riemannian manifold (John D. Lafferty, The density manifold and configuration space quantization, 1988).
  15. Topics. (O) Transport information geometry and Gamma calculus (Li); (I) transport information matrix (Li, Zhao); (II) Wasserstein natural gradient/proximal (Li, Montufar, Chen, Lin, Abel, Osher, etc.); (III) neural Fokker–Planck equation (Li, Liu, Zha, Zhou); (IV) transport information optimization methods (Li, Wang).
  16. Part I: Transport information statistics. Main results: Li, Zhao, Wasserstein information matrix, 2019; Amari, Wasserstein statistics in 1D location-scale models, March 2020. Related studies: Wasserstein covariance (Petersen, Muller); Wasserstein minimum distance estimator (Bernton, Jacob, Gerber, Robert); statistical inference for generative models with maximum mean discrepancy (Briol, Barp, Duncan, Girolami).
  17. Information matrix and pull-back.

  18. Statistical information matrix. Definition (statistical information matrix). Consider the density manifold $(\mathcal{P}(\mathcal{X}), g)$ with a metric tensor $g$, and a smoothly parametrized statistical model $p_\theta$ with parameter $\theta\in\Theta\subset\mathbb{R}^d$. Then the pull-back $G$ of $g$ onto the parameter space $\Theta$ is given by $G(\theta) = \big\langle\nabla_\theta p_\theta,\; g(p_\theta)\nabla_\theta p_\theta\big\rangle$. Denote $G(\theta) = (G(\theta)_{ij})_{1\le i,j\le d}$; then $G(\theta)_{ij} = \int_{\mathcal{X}}\frac{\partial}{\partial\theta_i}p(x;\theta)\,\big(g(p_\theta)\frac{\partial}{\partial\theta_j}p\big)(x;\theta)\,dx$. Here we name $g$ the statistical metric and call $G$ the statistical information matrix.
  19. Statistical information matrix. Definition (score functions). Denote $\Phi_i : \mathcal{X}\times\Theta\to\mathbb{R}$, $i = 1,\dots,n$, satisfying $\Phi_i(x;\theta) = \big(g(p_\theta)\frac{\partial}{\partial\theta_i}p\big)(x;\theta)$. They are the score functions associated with the statistical information matrix $G$ and are equivalence classes in $C(\mathcal{X})/\mathbb{R}$. The representatives in the equivalence classes are determined by the normalization condition $\mathbb{E}_{p_\theta}\Phi_i = 0$, $i = 1,\dots,n$. Then the statistical information matrix satisfies $G(\theta)_{ij} = \int_{\mathcal{X}}\Phi_i(x;\theta)\,\big(g(p_\theta)^{-1}\Phi_j\big)(x;\theta)\,dx$.
  20. Examples: Fisher information matrix. Consider $g(p)^{-1} = p$. Then $G_F(\theta)_{ij} = \mathbb{E}_{X\sim p_\theta}\big[\Phi_i(X;\theta)\,\Phi_j(X;\theta)\big]$, where $\Phi_i = \frac{1}{p_\theta}\nabla_{\theta_i}p_\theta$ and $\Phi_j = \frac{1}{p_\theta}\nabla_{\theta_j}p_\theta$. Noticing the fact $\frac{1}{p}\nabla_\theta p = \nabla_\theta\log p$, we obtain $G_F(\theta)_{ij} = \mathbb{E}_{X\sim p_\theta}\big[\frac{\partial}{\partial\theta_i}\log p(X;\theta)\,\frac{\partial}{\partial\theta_j}\log p(X;\theta)\big]$. In the literature, $G_F(\theta)$ is known as the Fisher information matrix and $\nabla_\theta\log p(X;\theta)$ is named the score function.
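The expectation formula above is easy to check by Monte Carlo. The sketch below (our construction) estimates $G_F$ for the Gaussian family, where the closed form is $\mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)$; the sample size and seed are arbitrary.

```python
import numpy as np

# Monte Carlo check of G_F(theta)_{ij} = E[ d_i log p * d_j log p ] for N(mu, sigma^2).
rng = np.random.default_rng(0)
mu, sigma = 0.0, 2.0
x = rng.normal(mu, sigma, size=200_000)

# Score components: d/dmu log p = (x-mu)/sigma^2,
#                   d/dsigma log p = (x-mu)^2/sigma^3 - 1/sigma
s = np.stack([(x - mu) / sigma**2,
              (x - mu)**2 / sigma**3 - 1.0 / sigma])
G_mc = s @ s.T / x.size                       # empirical outer-product average
G_exact = np.array([[1 / sigma**2, 0.0],
                    [0.0, 2 / sigma**2]])
print(np.round(G_mc, 3))                      # close to G_exact
```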
  21. Examples: Wasserstein information matrix. Consider $g(p)^{-1} = \nabla\cdot(p\nabla)$: $G_W(\theta)_{ij} = \mathbb{E}_{X\sim p_\theta}\big[\langle\nabla_x\Phi_i^W(X;\theta),\,\nabla_x\Phi_j^W(X;\theta)\rangle\big]$, where $\nabla_x\cdot(p_\theta\nabla_x\Phi_i^W) = \nabla_{\theta_i}p_\theta$ and $\nabla_x\cdot(p_\theta\nabla_x\Phi_j^W) = \nabla_{\theta_j}p_\theta$. Here we call $G_W(\theta)$ the Wasserstein information matrix and name $\Phi^W$ the Wasserstein score function.
  22. Distance and information matrix. Specifically, given a smooth family of probability densities $p(x;\theta)$ and a perturbation $\delta\theta\in T_\theta\Theta$, consider the following Taylor expansions in terms of $\delta\theta$: $\mathrm{KL}(p(\theta)\,\|\,p(\theta+\delta\theta)) = \frac{1}{2}\delta\theta^T G_F(\theta)\,\delta\theta + o(|\delta\theta|^2)$, and $W_2(p(\theta+\delta\theta), p(\theta))^2 = \delta\theta^T G_W(\theta)\,\delta\theta + o(|\delta\theta|^2)$.
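The KL expansion above can be verified numerically for Gaussians, where KL has a closed form; the base point and the small perturbation below are our own illustrative choices.

```python
import numpy as np

# Check KL(p_theta || p_{theta + d}) ≈ 0.5 * d^T G_F(theta) d for small d, Gaussian family.
def kl_gauss(m0, s0, m1, s1):
    # Closed-form KL( N(m0, s0^2) || N(m1, s1^2) )
    return np.log(s1 / s0) + (s0**2 + (m0 - m1)**2) / (2 * s1**2) - 0.5

mu, sigma = 0.0, 1.5
G_F = np.array([[1 / sigma**2, 0.0],
                [0.0, 2 / sigma**2]])
d = np.array([1e-3, -2e-3])                   # small perturbation of (mu, sigma)
kl = kl_gauss(mu, sigma, mu + d[0], sigma + d[1])
quad = 0.5 * d @ G_F @ d
print(kl, quad)                               # agree to leading order in |d|^2
```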
  23. Poisson equation. The Wasserstein score functions $\Phi_i^W(x;\theta)$ satisfy the following Poisson equation: $\nabla_x\log p(x;\theta)\cdot\nabla_x\Phi_i^W(x;\theta) + \Delta_x\Phi_i^W(x;\theta) = \frac{\partial}{\partial\theta_i}\log p(x;\theta)$.
  24. Separability. If $p(x;\theta)$ is an independence model, i.e. $p(x;\theta) = \prod_{k=1}^n p_k(x_k;\theta)$ with $x = (x_1,\dots,x_n)$, then there exists a set of one-dimensional functions $\Phi^{W,k} : \mathcal{X}_k\times\Theta_k\to\mathbb{R}$ such that $\Phi^W(x;\theta) = \sum_{k=1}^n\Phi^{W,k}(x_k;\theta)$. In addition, the Wasserstein information matrix is separable: $G_W(\theta) = \sum_{k=1}^n G_W^k(\theta)$, where $G_W^k(\theta) = \mathbb{E}_{X\sim p_\theta}\big[\langle\nabla_{x_k}\Phi^{W,k}(X;\theta),\,\nabla_{x_k}\Phi^{W,k}(X;\theta)\rangle\big]$.
  25. One-dimensional sample space. If $\mathcal{X}\subset\mathbb{R}^1$, the Wasserstein score functions satisfy $\Phi_i^W(x;\theta) = \int^x\frac{1}{p(z;\theta)}\frac{\partial}{\partial\theta_i}F(z;\theta)\,dz$, where $F(x;\theta) = \int^x p(y;\theta)\,dy$ is the cumulative distribution function. And the Wasserstein information matrix satisfies $G_W(\theta)_{ij} = \mathbb{E}_{X\sim p_\theta}\Big[\frac{\frac{\partial}{\partial\theta_i}F(X;\theta)\,\frac{\partial}{\partial\theta_j}F(X;\theta)}{p(X;\theta)^2}\Big]$ (Chen, Li, Wasserstein natural gradient in continuous sample space, 2018).
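A numerical sanity check (our sketch) of this one-dimensional formula: for the Gaussian location-scale family, the table on a later slide predicts $G_W(\mu,\sigma) = I$, and simple quadrature of the CDF-based integrand reproduces this. Grid size and parameters are arbitrary choices.

```python
import numpy as np

# Quadrature check of G_W(theta)_{ij} = E[ d_i F * d_j F / p^2 ] for N(mu, sigma^2).
mu, sigma = 1.0, 2.0
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 20001)
dx = x[1] - x[0]
z = (x - mu) / sigma
phi = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)  # standard normal pdf
p = phi / sigma                               # density of N(mu, sigma^2)
dF_dmu = -phi / sigma                         # d/dmu of F = Phi((x-mu)/sigma)
dF_dsigma = -z * phi / sigma                  # d/dsigma of F

def gw(a, b):
    # E_p[(a/p)(b/p)] = integral of a*b/p dx, by Riemann sum on the grid
    return np.sum(a * b / p) * dx

G = np.array([[gw(dF_dmu, dF_dmu), gw(dF_dmu, dF_dsigma)],
              [gw(dF_dsigma, dF_dmu), gw(dF_dsigma, dF_dsigma)]])
print(np.round(G, 4))                         # ≈ identity matrix
```

Note that the additive normalization of the Wasserstein score does not enter $G_W$, since only derivatives of $\Phi^W$ appear.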
  26. Remark. The Wasserstein score function is the average of the cumulative Fisher score function, and the Wasserstein information matrix is the covariance of the density average of the cumulative Fisher score function.
  27. Analytic examples. Classical statistical models. Gaussian family: $p(x;\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, $G_W(\mu,\sigma) = \begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}$. Laplacian family: $p(x;m,\lambda) = \frac{\lambda}{2}e^{-\lambda|x-m|}$, $G_W(m,\lambda) = \begin{pmatrix}1 & 0\\ 0 & 2/\lambda^4\end{pmatrix}$.
  28. Analytic examples. Generative models. Continuous 1-d generative family: $p(\cdot;\theta) = f_{\theta\#}p_0(\cdot)$, with $p_0$ a given distribution; $G_W(\theta) = \int|\nabla_\theta f_\theta(x)|^2\,p_0(x)\,dx$, where the push-forward distribution is defined by $\int_A f_{\theta\#}p_0\,dx = \int_{f_\theta^{-1}(A)}p_0\,dx$.
  29. Analytic examples. Generative models with the ReLU family: $f_\theta(x) = \sigma(x-\theta) = \begin{cases}0, & x\le\theta,\\ x-\theta, & x>\theta,\end{cases}$ with $G_W(\theta) = F_0(\theta)$, where $F_0$ is the cumulative distribution function of $p_0$. Figure: two examples of the push-forward family described above, with parameters $\theta_1 = 3$, $\theta_2 = 5$.
  30. Statistical information matrix table. Wasserstein vs. Fisher information matrices for various probability families:
     - Uniform $p(x;a,b) = \frac{1}{b-a}\mathbf{1}_{(a,b)}(x)$: $G_W(a,b) = \frac{1}{3}\begin{pmatrix}1 & 1/2\\ 1/2 & 1\end{pmatrix}$; $G_F(a,b)$ not well-defined.
     - Gaussian $p(x;\mu,\sigma) = \frac{e^{-(x-\mu)^2/2\sigma^2}}{\sqrt{2\pi}\,\sigma}$: $G_W(\mu,\sigma) = \begin{pmatrix}1&0\\0&1\end{pmatrix}$; $G_F(\mu,\sigma) = \begin{pmatrix}1/\sigma^2&0\\0&2/\sigma^2\end{pmatrix}$.
     - Exponential $p(x;m,\lambda) = \lambda e^{-\lambda(x-m)}$: $G_W(m,\lambda) = \begin{pmatrix}1&0\\0&2/\lambda^4\end{pmatrix}$; $G_F(m,\lambda)$ not well-defined.
     - Laplacian $p(x;m,\lambda) = \frac{\lambda}{2}e^{-\lambda|x-m|}$: $G_W(m,\lambda) = \begin{pmatrix}1&0\\0&2/\lambda^4\end{pmatrix}$; $G_F(m,\lambda) = \begin{pmatrix}\lambda^2&0\\0&1/\lambda^2\end{pmatrix}$.
     - Location-scale $p(x;m,\sigma) = \frac{1}{\sigma}p\big(\frac{x-m}{\sigma}\big)$: $G_W(\sigma,m) = \begin{pmatrix}\frac{\mathbb{E}_{\sigma,m}x^2 - 2m\,\mathbb{E}_{\sigma,m}x + m^2}{\sigma^2} & 0\\ 0 & 1\end{pmatrix}$; $G_F(\sigma,m) = \begin{pmatrix}\frac{1}{\sigma^2}\big(1 + \int_{\mathbb{R}}\big(\frac{(x-m)^2 p'^2}{\sigma^2 p} + (x-m)p'\big)dx\big) & \int_{\mathbb{R}}\frac{(x-m)p'^2}{\sigma^3 p}\,dx\\ \int_{\mathbb{R}}\frac{(x-m)p'^2}{\sigma^3 p}\,dx & \frac{1}{\sigma^2}\int_{\mathbb{R}}\frac{p'^2}{p}\,dx\end{pmatrix}$.
     - Independent $p(x,y;\theta) = p(x;\theta)p(y;\theta)$: $G_W(x,y;\theta) = G_W^1(x;\theta) + G_W^2(y;\theta)$; $G_F(x,y;\theta) = G_F^1(x;\theta) + G_F^2(y;\theta)$.
     - ReLU push-forward $p(x;\theta) = f_{\theta\#}p(x)$, $f_\theta$ $\theta$-parameterized ReLUs: $G_W(\theta) = F(\theta)$, $F$ the cdf of $p(x)$; $G_F(\theta)$ not well-defined.
     Table: in this table, we present the Wasserstein and Fisher information matrices for various probability families.
  31. Classical (Fisher) statistics & Wasserstein statistics. We develop a parallel Wasserstein statistics following the classical statistics approach: covariance (inner product) → Wasserstein covariance; Cramer–Rao bound (cotangent space) → Wasserstein–Cramer–Rao bound; natural gradient efficiency (separability) → Wasserstein natural gradient efficiency.
  32. Wasserstein covariance. Definition (Wasserstein covariance). Given a statistical model $\Theta$, define the Wasserstein covariance $\mathrm{Cov}^W_\theta[T_1,T_2] = \mathbb{E}_{X\sim p_\theta}\big[\langle\nabla_x T_1(X),\,\nabla_x T_2(X)\rangle\big]$, where $T_1, T_2$ are random variables as functions of $X$ and the expectation is taken w.r.t. $X\sim p_\theta$. Denote the Wasserstein variance $\mathrm{Var}^W_\theta[T] = \mathbb{E}_{X\sim p_\theta}\big[\langle\nabla_x T(X),\,\nabla_x T(X)\rangle\big]$.
  33. Wasserstein–Cramer–Rao bound. Theorem (Wasserstein–Cramer–Rao inequality). Given any set of statistics $T = (T_1,\dots,T_n) : \mathcal{X}\to\mathbb{R}^n$, where $n$ is the number of statistics, define the two matrices $\mathrm{Cov}^W_\theta[T(x)]_{ij} = \mathrm{Cov}^W_\theta[T_i,T_j]$ and $\big(\nabla_\theta\mathbb{E}_{p_\theta}[T(x)]^T\big)_{ij} = \frac{\partial}{\partial\theta_j}\mathbb{E}_{p_\theta}[T_i(x)]$. Then $\mathrm{Cov}^W_\theta[T(x)]\;\succeq\;\nabla_\theta\mathbb{E}_{p_\theta}[T(x)]^T\,G_W(\theta)^{-1}\,\nabla_\theta\mathbb{E}_{p_\theta}[T(x)]$.
  34. Cramer–Rao bound: Fisher vs. Wasserstein. Gaussian: $G_W(\mu,\sigma) = \begin{pmatrix}1&0\\0&1\end{pmatrix}$, $G_F(\mu,\sigma) = \begin{pmatrix}1/\sigma^2&0\\0&2/\sigma^2\end{pmatrix}$. Laplacian: $G_W(m,\lambda) = \begin{pmatrix}1&0\\0&2/\lambda^4\end{pmatrix}$, $G_F$ not well-defined. Comparison: $G_W$ is well-defined for a wider range of families and can give a tighter bound on the variance of an estimator.
  35. Information functional inequalities. Comparison between Fisher and Wasserstein information matrices relates to famous information functional inequalities. Here we study them in parametric statistics. Dissipation of entropy & Fisher information functional: $\frac{d}{dt}H(\mu|\nu) = -\int_{\mathcal{X}}\big|\nabla_x\log\frac{\mu(x)}{\nu(x)}\big|^2\mu(x)\,dx = -I(\mu|\nu)$; in parameters, $\frac{d}{dt}\tilde H(p_\theta|p_{\theta^*}) = -\nabla_\theta\tilde H^T\,\tilde G_W^{-1}\,\nabla_\theta\tilde H = -\tilde I(p_\theta|p_{\theta^*})$. Log-Sobolev inequality (LSI): $H(\mu|\nu)\le\frac{1}{2\alpha}I(\mu|\nu)$ for $\mu\in\mathcal{P}(\mathcal{X})$; $\tilde H(p_\theta|p_{\theta^*})\le\frac{1}{2\alpha}\tilde I(p_\theta|p_{\theta^*})$ for $\theta\in\Theta$.
  36. Functional inequalities via Wasserstein and Fisher information matrices. Theorem (RIW condition; Li, Geometry of probability simplex via optimal transport, 2018; Li, Montufar, Ricci curvature for parameter statistics, 2018). The information matrix criterion for the LSI can be written as $G_F(\theta) + \nabla_\theta^2\int p_\theta\log\frac{p_\theta}{p_{\theta^*}} - \Gamma^W\nabla_\theta\tilde H(p_\theta|p_{\theta^*})\;\succeq\;2\alpha\,G_W(\theta)$, where $\Gamma^W$ is the Christoffel symbol of the Wasserstein statistical model $\Theta$; for the Poincaré inequality (PI) it reads $G_F(\theta) + \nabla_\theta^2\int p_\theta\log\frac{p_\theta}{p_{\theta^*}}\;\succeq\;2\alpha\,G_W(\theta)$.
  37. List of functional inequalities in families. Table: the Fisher information functional and the Log-Sobolev inequality LSI($\alpha$) for the Gaussian and Laplacian families. For the Gaussian family, the relative Fisher information functional $\tilde I(p_{\mu,\sigma}|p_*)$ is an explicit function of $\mu-\mu_*$, $\sigma$, and $\sigma_*$, and the LSI compares it with $2\alpha$ times the relative entropy; the Laplacian family admits analogous explicit expressions involving $e^{-\lambda|m-m_*|}$.
  38. List of functional inequalities in families (continued). Table: the RIW conditions for LSI($\alpha$) and PI($\alpha$) in the Gaussian and Laplacian families. Gaussian, LSI: $\begin{pmatrix}1/\sigma_*^2 & 0\\ 0 & 1/\sigma_*^2 + 1/\sigma^2\end{pmatrix}\succeq 2\alpha\begin{pmatrix}1&0\\0&1\end{pmatrix}$; Gaussian, PI: $\begin{pmatrix}1/\sigma_*^2 & 0\\ 0 & 2/\sigma_*^2\end{pmatrix}\succeq 2\alpha\begin{pmatrix}1&0\\0&1\end{pmatrix}$; Laplacian, PI: $\begin{pmatrix}\lambda_*^2 & 0\\ 0 & 1/\lambda_*^2\end{pmatrix}\succeq 2\alpha\begin{pmatrix}1&0\\0&2/\lambda_*^4\end{pmatrix}$; the Laplacian LSI condition takes an analogous explicit form involving $e^{-\lambda|m-m_*|}$.
  39. Wasserstein natural gradient. Given a loss function $F : \mathcal{P}(\Omega)\to\mathbb{R}$ and a probability model $\rho(\cdot,\theta)$, the associated gradient flow on a Riemannian manifold is defined by $\frac{d\theta}{dt} = -\nabla_g F(\rho(\cdot,\theta))$. Here $\nabla_g$ is the Riemannian gradient operator satisfying $g_\theta\big(\nabla_g F(\rho(\cdot,\theta)),\,\xi\big) = \nabla_\theta F(\rho(\cdot,\theta))\cdot\xi$ for any tangent vector $\xi\in T_\theta\Theta$, where $\nabla_\theta$ denotes the differential operator (Euclidean gradient).
  40. Wasserstein natural gradient. The gradient flow of the loss function $F(\rho(\cdot,\theta))$ in $(\Theta, g_\theta)$ satisfies $\frac{d\theta}{dt} = -G_W(\theta)^{-1}\nabla_\theta F(\rho(\cdot,\theta))$. If $\rho(\cdot,\theta) = \rho(x)$ is the identity map, then we recover the Wasserstein gradient flow in the full probability space: $\partial_t\rho_t = \nabla\cdot\big(\rho_t\nabla\tfrac{\delta}{\delta\rho}F(\rho_t)\big)$.
  41. Entropy production. The gradient flow of the entropy $H(\rho) = \int_{\mathbb{R}^d}\rho(x)\log\rho(x)\,dx$ w.r.t. the optimal transport distance is the heat equation: $\frac{\partial\rho}{\partial t} = \nabla\cdot(\rho\nabla\log\rho) = \Delta\rho$. Along the gradient flow, the following relation holds: $\frac{d}{dt}H(\rho) = \int_{\mathbb{R}^d}\log\rho\,\nabla\cdot(\rho\nabla\log\rho)\,dx = -\int_{\mathbb{R}^d}|\nabla\log\rho|^2\,\rho\,dx$. The last integral is known as the Fisher information (B. Roy Frieden, Science from Fisher Information: A Unification).
  42. Online natural gradient algorithm. We sample from the unknown distribution once in each step and use the sample $x_t$ to generate an estimator $\theta_{t+1}$: $\theta_{t+1} = \theta_t - \frac{1}{t}\nabla^W_\theta f(x_t,\theta_t)$, where $f$ is the loss function. To analyze the convergence of this algorithm, we define the Wasserstein covariance matrix $V_t = \mathbb{E}_{p_{\theta^*}}\big[\nabla_x(\theta_t-\theta^*)\cdot\nabla_x(\theta_t-\theta^*)^T\big]$, where $\theta^*$ is the optimal value. Definition (natural gradient efficiency): the Wasserstein natural gradient is asymptotically efficient if $V_t = \frac{1}{t}G_W^{-1}(\theta^*) + O\big(\frac{1}{t^2}\big)$.
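A toy simulation (our construction, not from the talk) of the $1/t$ efficiency claim in the simplest possible case: a Gaussian location model $N(\theta^*, 1)$, for which $G_W = 1$ and the online update with step $1/t$ reduces to the running sample mean, whose variance is exactly $1/t$.

```python
import numpy as np

# The recursion theta_t = theta_{t-1} - (theta_{t-1} - x_t)/t is, by induction,
# exactly the running mean of x_1..x_t (the t = 1 step forgets the initialization),
# so after T steps Var(theta_T) = 1/T = G_W^{-1}/T for the location model.
rng = np.random.default_rng(1)
theta_star, T, runs = 0.0, 2000, 500
x = rng.normal(theta_star, 1.0, size=(runs, T))   # one sample stream per run
theta_T = x.mean(axis=1)                          # closed form of the recursion
v_hat = np.mean((theta_T - theta_star) ** 2)      # empirical Wasserstein covariance
print(v_hat * T)                                  # ≈ G_W^{-1}(theta*) = 1
```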
  43. Covariance update. Theorem (variance updating equation of the Wasserstein natural gradient). For any function $f(x,\theta)$ satisfying $\mathbb{E}_{p_\theta}f(x,\theta) = 0$, consider the asymptotic behavior of the Wasserstein dynamics $\theta_{t+1} = \theta_t - \frac{1}{t}G_W^{-1}(\theta_t)f(x_t,\theta_t)$. That is, assume a priori $\mathbb{E}_{p_{\theta^*}}\big[(\theta_t-\theta^*)^2\big],\;\mathbb{E}_{p_{\theta^*}}\big[|\nabla_x(\theta_t-\theta^*)|^2\big] = o(1)$ for all $t$. Then the Wasserstein covariance matrix $V_t$ updates according to $V_{t+1} = V_t + \frac{1}{t^2}G_W^{-1}(\theta^*)\,\mathbb{E}_{p_{\theta^*}}\big[\nabla_x f(x_t,\theta^*)\cdot\nabla_x f(x_t,\theta^*)^T\big]\,G_W^{-1}(\theta^*) - \frac{2V_t}{t}\mathbb{E}_{p_{\theta^*}}\big[\nabla_\theta f(x_t,\theta^*)\big]\,G_W^{-1}(\theta^*) + o\big(\frac{V_t}{t}\big) + o\big(\frac{1}{t^2}\big)$.
  44. Wasserstein online efficiency. Corollary (Wasserstein natural gradient efficiency). For the dynamics $\theta_{t+1} = \theta_t - \frac{1}{t}G_W^{-1}(\theta_t)\Phi^W(x_t,\theta_t)$, the Wasserstein covariance updates according to $V_{t+1} = V_t + \frac{1}{t^2}G_W^{-1}(\theta^*) - \frac{2}{t}V_t + o\big(\frac{1}{t^2}\big) + o\big(\frac{V_t}{t}\big)$. Then the online Wasserstein natural gradient algorithm is Wasserstein efficient, that is, $V_t = \frac{1}{t}G_W^{-1}(\theta^*) + O\big(\frac{1}{t^2}\big)$.
  45. Poincaré online efficiency. Corollary (Poincaré efficiency). For the dynamics $\theta_{t+1} = \theta_t - \frac{1}{t}\nabla^W_\theta l(x_t,\theta_t)$, where $l(x_t,\theta_t) = \log p(x_t,\theta_t)$ is the log-likelihood function, the Wasserstein covariance updates according to $V_{t+1} = V_t + \frac{1}{t^2}G_W^{-1}(\theta^*)\,\mathcal{I}\,G_W^{-1}(\theta^*) - \frac{2}{t}V_t\,G_F(\theta^*)G_W^{-1}(\theta^*) + O\big(\frac{1}{t^3}\big) + o\big(\frac{V_t}{t}\big)$, where $\mathcal{I} = \mathbb{E}_{p_{\theta^*}}\big[\nabla_x(\nabla_\theta l(x_t,\theta^*))\cdot\nabla_x(\nabla_\theta l(x_t,\theta^*)^T)\big]$. Now suppose $\alpha = \sup\{a \mid G_F\succeq a\,G_W\}$. Then the dynamics is characterized by $V_t = O(t^{-2\alpha})$ if $2\alpha\le 1$, and $V_t = \frac{1}{t}\big(2G_F G_W^{-1} - \mathrm{Id}\big)^{-1}G_W^{-1}(\theta^*)\,\mathcal{I}\,G_W^{-1}(\theta^*) + O\big(\frac{1}{t^2}\big)$ if $2\alpha > 1$.
  46. Part II: Transport information proximal. Lin, Li, Montufar, Osher, Wasserstein proximal of GANs, 2019. Li, Lin, Montufar, Affine proximal learning, GSI, 2019. Li, Liu, Zha, Zhou, Parametric Fokker–Planck equations, GSI, 2019. Arbel, Gretton, Li, Montufar, Kernelized Wasserstein natural gradient, ICLR (spotlight), 2020. Liu, Li, Zha, Zhou, Neural parametric Fokker–Planck equations, 2020.
  47. Gradient operator. Given a function $F : M\to\mathbb{R}$ and a metric matrix function $G(x)\in\mathbb{R}^{n\times n}$, the gradient operator is defined by $\mathrm{grad}F(x) = G(x)^{-1}\nabla_x F(x)$, where $\nabla_x$ is the Euclidean or $L^2$ gradient w.r.t. $x$. If $G(x)^{-1} = I$, then $\mathrm{grad}F(x) = \nabla_x F(x)$; if $G(x)^{-1} = \mathrm{diag}(x_i)_{1\le i\le n}$, then $\mathrm{grad}F(x) = (x_i\nabla_{x_i}F(x))_{i=1}^n$; if $G(x)^{-1} = L(x)$, where $L$ is a graph Laplacian matrix and $x$ is the weight, then $\mathrm{grad}F(x) = L(x)\nabla_x F(x)$.
  48. Proximal method. Consider a Euclidean gradient flow $\dot x = -\nabla F(x)$. The proximal method refers to $x_{k+1} = \arg\min_{x\in\mathbb{R}^n} F(x) + \frac{1}{2h}\|x - x_k\|^2$, where $h > 0$ is the step size. This is known as the backward Euler method: $x_{k+1} = x_k - h\nabla F(x_{k+1})$.
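For a quadratic loss the proximal step has a closed form, which makes the backward-Euler identity above easy to see; a short sketch with an illustrative matrix $A$ of our choosing:

```python
import numpy as np

# Proximal step for F(x) = 0.5 * x^T A x: minimizing F(y) + ||y - x||^2/(2h)
# gives A y + (y - x)/h = 0, i.e. y = (I + h A)^{-1} x, which is exactly
# backward Euler for the gradient flow xdot = -A x.
A = np.array([[2.0, 0.0],
              [0.0, 10.0]])
h = 0.5
I = np.eye(2)
x = np.array([1.0, 1.0])
for _ in range(50):
    x = np.linalg.solve(I + h * A, x)   # prox step, stable for any h > 0
print(x)                                # decays toward the minimizer 0
```

Unlike forward Euler, this iteration is unconditionally stable, which is the usual motivation for proximal/implicit schemes.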
  49. Proximal method on a metric structure. Consider a gradient flow $\dot x = -G(x)^{-1}\nabla F(x)$. The proximal method refers to $x_{k+1} = \arg\min_{x\in\mathbb{R}^n} F(x) + \frac{1}{2h}d_G(x,x_k)^2$, where $h > 0$ is the step size and $d_G(x,y)$ is the distance function induced by $G$, i.e. $d_G(x,y) = \inf\big\{\int_0^1\sqrt{\dot\gamma(t)^T G(\gamma(t))\dot\gamma(t)}\,dt : \gamma(0) = x,\ \gamma(1) = y\big\}$. This is known as the variational backward Euler method: $x_{k+1} = x_k - hG(x_{k+1})^{-1}\nabla F(x_{k+1}) + o(h)$. Many other linearizations of the method have been studied.
  50. Proximal method on a probability space. Denote $G(\rho)^{-1} = -\nabla\cdot(\rho\nabla)$. Consider the gradient flow $\dot\rho = -G(\rho)^{-1}\frac{\delta}{\delta\rho}F(\rho) = \nabla\cdot\big(\rho\nabla\tfrac{\delta}{\delta\rho}F(\rho)\big)$. The proximal method refers to $\rho_{k+1} = \arg\min_{\rho\in\mathcal{P}} F(\rho) + \frac{1}{2h}d_G(\rho,\rho_k)^2$, where $h > 0$ is the step size and $d_G$ is the distance function induced by $G$, known as the transport distance. This is the variational backward Euler method $\rho_{k+1} = \rho_k + h\nabla\cdot\big(\rho_{k+1}\nabla\tfrac{\delta}{\delta\rho}F(\rho_{k+1})\big) + o(h)$. A linearization of the proximal method has been formulated by Li, Lu, Wang (2019).
  51. Generative adversarial networks.

  52. Generative probability models. For each parameter $\theta\in\mathbb{R}^d$ and a given neural-network-parameterized mapping function $g_\theta$, consider $\rho_\theta = g_{\theta\#}p(z)$.
  53. Natural proximal on probability models. Denote $G(\theta) = \big(\nabla_\theta\rho,\;(-\Delta_\rho)^{-1}\nabla_\theta\rho\big)_{L^2}$. Consider the gradient flow $\dot\theta = -G(\theta)^{-1}\nabla_\theta F(\rho(\theta,\cdot))$. The update scheme follows (Lin, Li, Osher, Montufar, Wasserstein proximal of GANs, 2018): $\theta_{k+1} = \arg\min_{\theta\in\Theta} F(\rho_\theta) + \frac{1}{2h}d_W(\theta,\theta_k)^2$, where $\theta$ are the parameters of the generator, $F(\rho_\theta)$ is the loss function, and $d_W$ is the Wasserstein metric.
  54. Computations. In practice, we approximate the Wasserstein metric to obtain the following update: $\theta_{k+1} = \arg\min_{\theta\in\Theta} F(\rho_\theta) + \frac{1}{B}\sum_{i=1}^B\frac{1}{2h}\|g_\theta(z_i) - g_{\theta_k}(z_i)\|^2$, where $g_\theta$ is the generator, $B$ is the batch size, and $z_i\sim p(z)$ are inputs to the generator.
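This is not the paper's implementation, but the structure of the update is easy to see in a scalar toy version (all names ours): a linear "generator" $g_\theta(z) = \theta z$, a stand-in quadratic loss playing the role of $F$, and the batch proximal penalty from the slide solved by an inner gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(theta, z):
    # stand-in for F(rho_theta): push generator outputs toward target mean 3
    return np.mean((theta * z - 3.0) ** 2)

def prox_step(theta_k, z, h=0.1, lr=0.05, inner=200):
    # argmin_theta  loss(theta) + (1/2h) * mean ||g_theta(z_i) - g_theta_k(z_i)||^2
    theta = theta_k
    for _ in range(inner):
        eps = 1e-5
        def obj(t):
            return loss(t, z) + np.mean((t * z - theta_k * z) ** 2) / (2 * h)
        g = (obj(theta + eps) - obj(theta - eps)) / (2 * eps)  # finite-diff gradient
        theta -= lr * g
    return theta

z = rng.normal(1.0, 0.1, size=256)   # fixed batch of generator inputs
theta = 0.0
for k in range(50):
    theta = prox_step(theta, z)      # outer proximal iterations
print(theta)                         # approaches ~3, the loss minimizer
```

The proximal penalty limits how far the generator's *outputs* move per step, which is the regularizing effect the method exploits.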
  55. Examples: Jensen–Shannon entropy loss. Figure: the relaxed Wasserstein proximal of GANs, on the CIFAR-10 (left) and CelebA (right) datasets.
  56. Examples: Wasserstein-1 loss. Figure: Wasserstein proximal of the Wasserstein-1 loss function on the CIFAR-10 dataset.
  57. Part III: Transport information Gaussian mixtures. Jiaxi Zhao, Wuchen Li, Scaling limit of transport Gaussian mixture models, 2020.
  58. Part IV: Transport information generative models. Shu Liu, Wuchen Li, Hongyuan Zha, Haomin Zhou, Neural parametric Fokker–Planck equations, 2020.
  59. Fokker–Planck equation. The Fokker–Planck equation (FPE) is a parabolic evolution equation: $\frac{\partial\rho}{\partial t} = \nabla\cdot(\rho\nabla V) + \beta\Delta\rho$, $\rho(\cdot,0) = \rho_0$. The equation plays fundamental roles in the probability, geometry, statistics, and machine learning communities, and it has many applications.
  60. Our goal. We are interested in solving high-dimensional Fokker–Planck equations. Difficulty: the curse of dimensionality. Existing tools for high-dimensional problems: deep learning. We are going to introduce deep learning techniques to numerically solve the FPE.
  61. Motivation. Dimension reduction: Fokker–Planck equation (infinite-dimensional ODE) → parametric Fokker–Planck equation (finite-dimensional ODE). Obtain an efficient, sampling-friendly method, i.e. we are able to draw samples from $\rho_t = \rho(x,t)$ for arbitrary $t$.
  62. Our treatment. Generative model in deep learning (Goodfellow et al., Generative Adversarial Nets, 2014; Arjovsky et al., Wasserstein Generative Adversarial Networks, 2017): $T_\theta$ is a deep neural network, $p$ a reference distribution. Find a suitable parameter $\theta$ such that $T_{\theta\#}p$ approximates the desired distribution $\rho$ well. Deep structure → dealing with high-dimensional problems; push-forward mechanism → sampling-friendly. We numerically solve the high-dimensional Fokker–Planck equation in a generative model: approximate $\rho_t$, the solution of the Fokker–Planck equation at time $t$, by $T_{\theta_t\#}p$; construct an ODE for $\theta_t$ and solve it numerically.
  63. Project the Fokker–Planck equation to the parameter space.

  64. Wasserstein manifold of probability distributions. The Wasserstein manifold of probability distributions on $\mathbb{R}^d$: $\mathcal{P} = \big\{\rho : \int\rho(x)\,dx = 1,\ \rho(x)\ge 0,\ \int|x|^2\rho(x)\,dx < \infty\big\}$. The tangent space of $\mathcal{P}$ at $\rho$ is characterized as $T_\rho\mathcal{P} = \big\{\dot\rho : \int\dot\rho(x)\,dx = 0\big\}$. (Jordan, Kinderlehrer, Otto, The variational formulation of the Fokker–Planck equation, SIAM J. Math. Anal., 1998.)
  65. Wasserstein manifold of probability distributions. We define the Wasserstein metric $g_W$ on $T\mathcal{P}$ as follows: for any $\rho\in\mathcal{P}$ and $\dot\rho_1, \dot\rho_2\in T_\rho\mathcal{P}$, $g_W(\rho)(\dot\rho_1,\dot\rho_2) = \int\nabla\phi_1(x)\cdot\nabla\phi_2(x)\,\rho(x)\,dx$, where $\phi_1, \phi_2$ solve $\dot\rho_1 = -\nabla\cdot(\rho\nabla\phi_1)$, $\dot\rho_2 = -\nabla\cdot(\rho\nabla\phi_2)$. Now $(\mathcal{P}, g_W)$ becomes a Riemannian manifold, which we call the Wasserstein manifold. The geodesic distance on $(\mathcal{P}, g_W)$ corresponds to the $L^2$-Wasserstein distance. (Jordan, Kinderlehrer, Otto, 1998.)
  66. Gradient flows on the Wasserstein manifold. Denote by $\mathrm{grad}_W$ the manifold gradient on $(\mathcal{P}, g_W)$. Consider a smooth functional $F : \mathcal{P}\to\mathbb{R}$; one can verify (Villani, Optimal Transport: Old and New, 2008): $\mathrm{grad}_W F(\rho) = g_W(\rho)^{-1}\big(\tfrac{\delta F}{\delta\rho}\big)(x) = -\nabla\cdot\big(\rho(x)\nabla\tfrac{\delta F}{\delta\rho(x)}\big)$, where $\frac{\delta F}{\delta\rho(x)}$ is the $L^2$ first variation at $x\in\mathbb{R}^d$. Then the gradient flow of $F$ is written as $\frac{\partial\rho}{\partial t} = -\mathrm{grad}_W F(\rho) = \nabla\cdot\big(\rho(x)\nabla\tfrac{\delta F}{\delta\rho(x)}\big)$.
  67. Fokker–Planck equation as gradient flow. For a given potential function $V$, consider the relative entropy functional $H$ (also known as a KL divergence): $H(\rho) = \int\rho(x)\log\Big(\frac{\rho(x)}{\frac{1}{Z}e^{-V(x)}}\Big)dx = \int V(x)\rho(x)\,dx + \int\rho(x)\log\rho(x)\,dx + \mathrm{const}$. Then $\nabla\big(\tfrac{\delta H}{\delta\rho}\big) = \nabla V + \nabla\log\rho$, and the gradient flow is written as $\frac{\partial\rho}{\partial t} = \nabla\cdot(\rho\nabla V) + \nabla\cdot(\rho\nabla\log\rho) = \nabla\cdot(\rho\nabla V) + \Delta\rho$. This is exactly the Fokker–Planck equation mentioned at the beginning.
  68. Parametric Fokker–Planck equation. Inspired by: Wasserstein geometry (Otto, The geometry of dissipative evolution equations: the porous medium equation, 2001) + information geometry (Amari, Natural gradient works efficiently in learning, 1998) → Wasserstein information geometry (Li, Montufar, Natural gradient via optimal transport, 2018). Sketch of our treatment: introduce parametric distributions via a generative model; derive the metric tensor on the parameter space; compute the Wasserstein gradient flow of the relative entropy in the parameter space.
  69. Parametric Fokker–Planck equation. Consider a class of invertible push-forward maps $\{T_\theta\}_{\theta\in\Theta}$ indexed by a parameter $\theta\in\Theta\subset\mathbb{R}^m$, $T_\theta : \mathbb{R}^d\to\mathbb{R}^d$. We obtain the family of parametric distributions $\mathcal{P}_\Theta = \{\rho_\theta = T_{\theta\#}p \mid \theta\in\Theta\}\subset(\mathcal{P}, g_W)$. Define $T_\# : \theta\mapsto T_{\theta\#}p = \rho_\theta$ and introduce the metric tensor $G = (T_\#)^*g_W$ on $\Theta$ by pulling back $g_W$ by $T_\#$. We call $(\Theta, G)$ the Wasserstein statistical manifold. For each $\theta\in\Theta$, $G(\theta)$ is a bilinear form on $T_\theta\Theta\simeq\mathbb{R}^m$.
  70. Parametric Fokker–Planck equation. Theorem 1. Let $G = (T_\#)^*g_W$; then $G(\theta)$ is an $m\times m$ non-negative definite symmetric matrix and can be computed as $G(\theta) = \int\nabla\psi(T_\theta(x))\,\nabla\psi(T_\theta(x))^T\,p(x)\,dx$, or, in entrywise form, $G_{ij}(\theta) = \int\nabla\psi_i(T_\theta(x))\cdot\nabla\psi_j(T_\theta(x))\,p(x)\,dx$, $1\le i,j\le m$. Here $\psi = (\psi_1,\dots,\psi_m)^T$ and $\nabla\psi$ is the $m\times d$ Jacobian matrix of $\psi$. For each $k = 1,2,\dots,m$, $\psi_k$ solves the equation $\nabla\cdot(\rho_\theta\nabla\psi_k(x)) = \nabla\cdot\big(\rho_\theta\,\partial_{\theta_k}T_\theta(T_\theta^{-1}(x))\big)$. (1)
  71. Parametric Fokker–Planck equation. Recall the relative entropy functional $H$; we consider $\mathcal{H} = H\circ T_\# : \Theta\to\mathbb{R}$. Then $\mathcal{H}(\theta) = H(\rho_\theta) = \int V(x)\rho_\theta(x)\,dx + \int\rho_\theta(x)\log\rho_\theta(x)\,dx = \int\big[V(T_\theta(x)) + \log\rho_\theta(T_\theta(x))\big]p(x)\,dx$. Since the Fokker–Planck equation is the gradient flow of $H$ on $(\mathcal{P}, g_W)$, a similar version on $(\Theta, G)$ is formulated as $\dot\theta = -G(\theta)^{-1}\nabla_\theta\mathcal{H}(\theta)$ (recall $\frac{\partial\rho}{\partial t} = -g_W(\rho)^{-1}\frac{\delta}{\delta\rho}H(\rho)$). This is an ODE in the parameter space $\Theta$; we call it the parametric Fokker–Planck equation.
  72. From the Fokker–Planck equation to particle dynamics. (In this diagram, notice that $\nabla\frac{\delta H(\rho)}{\delta\rho} = \nabla V + \nabla\log\rho$.)
  73. Particle version of the Fokker–Planck equation. Fokker–Planck equation: $\frac{\partial\rho}{\partial t} = \nabla\cdot(\rho\nabla V) + \Delta\rho$, $\rho(\cdot,0) = \rho_0$. Particle versions. Overdamped Langevin dynamics: $dX_t = -\nabla V(X_t)\,dt + \sqrt{2}\,dB_t$, $X_0\sim\rho_0$. Vlasov-type dynamics: $\frac{dX_t}{dt} = -\nabla V(X_t) - \nabla\log\rho_t(X_t)$, $X_0\sim\rho_0$, or equivalently $\frac{dX_t}{dt} = -\nabla\frac{\delta H(\rho_t)}{\delta\rho_t}(X_t)$, $X_0\sim\rho_0$. Here we denote by $\rho_t(\cdot)$ the density of the distribution of $X_t$.
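A minimal Euler–Maruyama sketch (our construction) of the overdamped Langevin dynamics above, with the illustrative choice $V(x) = x^2/2$, so the FPE steady state $\propto e^{-V}$ is $N(0,1)$; the initial distribution, step size, and particle count are arbitrary.

```python
import numpy as np

# Simulate dX_t = -grad V(X_t) dt + sqrt(2) dB_t with V(x) = x^2/2.
# The empirical distribution of the particles approaches N(0, 1).
rng = np.random.default_rng(2)
n, dt, steps = 20_000, 0.01, 1_000        # n particles, horizon steps*dt = 10
X = rng.uniform(-3.0, 3.0, size=n)        # X_0 ~ rho_0 (here: uniform)
for _ in range(steps):
    X += -X * dt + np.sqrt(2 * dt) * rng.normal(size=n)  # Euler–Maruyama step
print(X.mean(), X.var())                  # ≈ 0 and ≈ 1 (up to O(dt) bias)
```

Each particle evolves independently; the interacting (Vlasov-type) version would additionally require an estimate of $\nabla\log\rho_t$ from the particle cloud.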
  74. From the parametric Fokker–Planck equation to particle dynamics. (In this diagram, notice that $\nabla\frac{\delta H(\rho)}{\delta\rho} = \nabla V + \nabla\log\rho$.)
  75. Particle version of the parametric Fokker–Planck equation. Let $\{\theta_t\}$ be the solution of the ODE. Take $Z\sim p$ and consider $Y_t = T_{\theta_t}(Z)$; we know that $Y_t\sim\rho_{\theta_t}$ for any $t$. $\{Y_t\}$ satisfies the dynamics $\dot Y_t = \partial_\theta T_{\theta_t}(T_{\theta_t}^{-1}(Y_t))\,\dot\theta_t$, $Y_0\sim\rho_{\theta_0}$. It is then more insightful to consider the dynamics $\dot X_t = \nabla\psi_t(X_t)^T\dot\theta_t$, $X_0\sim\rho_{\theta_0}$. $X_t$ and $Y_t$ have the same distribution, and thus $X_t\sim\rho_{\theta_t}$ for all $t\ge 0$.
  76. Particle version of the parametric Fokker–Planck equation. Since $\dot\theta_t = -G(\theta_t)^{-1}\nabla_\theta\mathcal{H}(\theta_t)$, plugging in the definition of the metric tensor gives $\dot X_t = \int K_{\theta_t}(X_t,\eta)\big({-\nabla V(\eta)} - \nabla\log\rho_{\theta_t}(\eta)\big)\rho_{\theta_t}(\eta)\,d\eta$. Here we define the kernel function $K_\theta : \mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}^{d\times d}$, $K_\theta(x,\eta) = \nabla\psi(x)^T\Big(\int\nabla\psi(x)\nabla\psi(x)^T\rho_\theta(x)\,dx\Big)^{-1}\nabla\psi(\eta)$.
  77. Particle version of the parametric Fokker–Planck equation. The kernel $K_\theta(\cdot,\cdot)$ induces an orthogonal projection $\mathcal{K}_\theta$ defined on $L^2(\mathbb{R}^d;\mathbb{R}^d,\rho_\theta)$. The range of this projection is the subspace $\mathrm{span}\{\nabla\psi_1,\dots,\nabla\psi_m\}\subset L^2(\mathbb{R}^d;\mathbb{R}^d,\rho_\theta)$, where $\psi_1,\dots,\psi_m$ are the $m$ components of $\psi_t$. We may rewrite the dynamics as $\dot X_t = -\mathcal{K}_{\theta_t}(\nabla V + \nabla\log\rho_{\theta_t})(X_t)$, $X_0\sim\rho_{\theta_0}$, where $\rho_{\theta_t}$ is the probability density of $X_t$.
  78. Projected dynamics. (In this diagram, notice that $\nabla\frac{\delta H(\rho)}{\delta\rho} = \nabla V + \nabla\log\rho$.)
  79. Projected dynamics. We now compare: Vlasov-type dynamics (Fokker–Planck equation), $\dot{\tilde X}_t = -(\nabla V + \nabla\log\rho_t)(\tilde X_t)$, $\tilde X_0\sim\rho_0$, where $\rho_t$ is the probability density of $\tilde X_t$; and projected Vlasov-type dynamics (parametric Fokker–Planck equation), $\dot X_t = -\mathcal{K}_{\theta_t}(\nabla V + \nabla\log\rho_{\theta_t})(X_t)$, $X_0\sim\rho_{\theta_0}$, where $\rho_{\theta_t}$ is the probability density of $X_t$. From the particle viewpoint, the difference between $\rho_{\theta_t}$ and $\rho_t$ originates from the orthogonal projection $\mathcal{K}_{\theta_t}$ of the vector field that drives the Vlasov-type dynamics.
  80. Route map of our research. Our parametric Fokker–Planck equation can be treated as a neural Lagrangian method with an Eulerian view.
  81. Algorithm. When the dimension $d$ gets higher: endow $T_\theta$ with the structure of a deep neural network; our choice is the normalizing flow (Rezende, Mohamed, Variational inference with normalizing flows, ICML 2015), though there are certainly other choices. Classical schemes for ODEs are not suitable: direct computation of $G(\theta)$ could be expensive, and $G(\theta)^{-1}\nabla_\theta\mathcal{H}(\theta)$ cannot be efficiently computed in a deep learning framework. We have to seek new schemes. Since optimization is convenient to perform in a deep learning framework, one choice is to design a variational scheme for our parametric Fokker–Planck equation.
  82. Algorithm. Goal: propose a variational scheme for solving $\dot\theta_t = -G(\theta_t)^{-1}\nabla_\theta\mathcal{H}(\theta_t)$. Start with a semi-implicit scheme: $\frac{\theta_{k+1}-\theta_k}{h} = -G^{-1}(\theta_k)\nabla_\theta\mathcal{H}(\theta_{k+1})$. Associate it with a proximal-type algorithm: $\theta_{k+1} = \arg\min_\theta\,\langle\theta-\theta_k,\,G(\theta_k)(\theta-\theta_k)\rangle + 2h\mathcal{H}(\theta)$. We use the approximation $\langle\theta-\theta_k,\,G(\theta_k)(\theta-\theta_k)\rangle \approx \max_\varphi\Big\{\int_M\big(2\nabla\varphi(x)\cdot\big((T_\theta - T_{\theta_k})\circ T_{\theta_k}^{-1}(x)\big) - |\nabla\varphi(x)|^2\big)\rho_{\theta_k}(x)\,dx\Big\}$. Thus, we come up with the following variational saddle-point scheme: $\theta_{k+1} = \arg\min_\theta\max_\varphi\Big\{\int_M\big(2\nabla\varphi(x)\cdot\big((T_\theta - T_{\theta_k})\circ T_{\theta_k}^{-1}(x)\big) - |\nabla\varphi(x)|^2\big)\rho_{\theta_k}(x)\,dx + 2h\mathcal{H}(\theta)\Big\}$. In our implementation, we choose $\varphi$ as a ReLU network and optimize over its parameters.
  83. Algorithm 1 (solving the parametric FPE $\dot\theta = -G(\theta)^{-1}\nabla_\theta F(\theta)$):
     Initialize $\theta^0$.
     For $k = 0, 1, \dots, N$:
       Set $\theta\leftarrow\theta^k$.
       For $j = 1, 2, \dots, M$:
         Apply stochastic gradient descent to solve $\inf_\varphi\int|T_\theta(x) - T_{\theta^k}(x) - \nabla\varphi(T_{\theta^k}(x))|^2\,dp(x)$;
         update $\theta\leftarrow\theta - \mathrm{stepsize}\cdot\nabla_\theta J(\theta,\theta^k)$, where $J(\theta,\tilde\theta) = \int T_\theta(x)\cdot\nabla\varphi(T_{\tilde\theta}(x))\,dp(x) + hF(\theta)$.
       Set $\theta^{k+1}\leftarrow\theta$.
  84. Algorithm. Our algorithm is convenient in high-dimensional cases: all the integrals can be approximated by the Monte Carlo method, and derivatives w.r.t. the parameters of the networks can be computed via back-propagation. An interesting observation: our new scheme can be treated as a modified version of the JKO scheme (Jordan, Kinderlehrer, Otto, The variational formulation of the Fokker–Planck equation, SIAM J. Math. Anal., 1998) for the Wasserstein gradient flow of the functional $H$: $\rho_{k+1} = \arg\min_\rho\big\{\frac{1}{2h}W_2^2(\rho,\rho_k) + H(\rho)\big\}$, $\dot\rho = -\mathrm{grad}_W H(\rho)$. In our variational scheme, we replace $W_2^2(\rho,\rho_k)$ with $d^2(\rho_\theta,\rho_{\theta_k}) = \max_\varphi\big\{\int_M\big(2\nabla\varphi(x)\cdot\big((T_\theta - T_{\theta_k})\circ T_{\theta_k}^{-1}(x)\big) - |\nabla\varphi(x)|^2\big)\rho_{\theta_k}(x)\,dx\big\}$.
  85. Numerical examples

    Now we try the FPE with a more complicated potential in higher dimensions. Assume we are in R³⁰ with β = 1, and we set
        V(x) = (1/50) Σ_{i=1}^{30} ( x_i⁴ − 16 x_i² + 5 x_i ).
    Figure: plot of this function in two-dimensional space. 33 85
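The potential above is straightforward to implement; its gradient is also needed in practice (e.g. to evaluate ∇V in the drift or the free energy). As a sanity check, V(1, …, 1) = 30·(1 − 16 + 5)/50 = −6.

```python
import numpy as np

# The 30-dimensional potential from the slide:
#   V(x) = (1/50) * sum_i ( x_i^4 - 16 x_i^2 + 5 x_i )
def V(x):
    return np.sum(x**4 - 16 * x**2 + 5 * x, axis=-1) / 50.0

def grad_V(x):
    # componentwise derivative: (4 x_i^3 - 32 x_i + 5) / 50
    return (4 * x**3 - 32 * x + 5) / 50.0

x = np.ones(30)
print(V(x))        # 30 * (1 - 16 + 5) / 50 = -6.0
```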
  86. Numerical examples

    We set the time stepsize h = 0.005. Since we cannot plot sample points in 30-dimensional space, we project the samples in R³⁰ onto a two-dimensional subspace, namely the plane spanned by the 5th and 15th components. Here are the numerical results: 34 86
  92. Numerical examples

    Figure: density estimation of the sample points at the 360th iteration. 34 92
  93. Numerical examples

    We can inspect the graph of φ trained at the end of each outer iteration. Here are the graphs of φ at the 30th, 60th, 150th, 240th, 300th and 360th iterations.
    Figure: graph of φ at the k = 30th time step. Figure: graph of φ at the k = 60th time step. Figure: graph of φ at the k = 150th time step. 35 93
  94. Numerical examples

    Figure: graph of φ at the k = 240th time step. Figure: graph of φ at the k = 300th time step. Figure: graph of φ at the k = 360th time step. 36 94
  95. Numerical analysis

    Let {ρ_t} denote the solution to the original Fokker–Planck equation, and let {θ_t} denote the solution to the parametric Fokker–Planck equation θ̇_t = −G(θ_t)^{-1} ∇_θ H(θ_t), with ρ_{θ_t} = (T_{θ_t})_# p.
    - Convergence analysis: we provide an upper bound for D_KL(ρ_{θ_t} ‖ ρ*). Here ρ* = (1/Z) exp(−V/β) is the Gibbs distribution (the equilibrium solution of the Fokker–Planck equation).
    - Error analysis: we provide a uniform bound for W₂(ρ_{θ_t}, ρ_t). 37 95
  96. Numerical analysis

    Before presenting our numerical-analysis results, we first introduce a quantity that plays an essential role in them. Recalling ψ as defined in (1) of Theorem 1, we set
        δ₀ = sup_{θ∈Θ} min_{ξ∈T_θΘ} { ∫ |∇ψ(T_θ(x))ᵀ ξ − ∇(V + β log ρ_θ) ∘ T_θ(x)|² dp(x) }.
    One can treat δ₀ as a quantification of the approximation power of the push-forward family {T_θ}_{θ∈Θ}. 38 96
  97. Convergence analysis

    Theorem. Consider the Fokker–Planck equation with smooth V and ∇²V ⪰ λI (λ > 0) outside a finite ball. Suppose {θ_t} solves θ̇_t = −G(θ_t)^{-1} ∇_θ H(θ_t). Let ρ*(x) = (1/Z) e^{−V(x)/β} be the Gibbs distribution of the original Fokker–Planck equation. Then we have the inequality

        KL(ρ_{θ_t} ‖ ρ*) ≤ δ₀/(λ̃β²) · (1 − e^{−βλ̃t}) + KL(ρ_{θ₀} ‖ ρ*) · e^{−βλ̃t}.   (2)

    Here λ̃ > 0 is a constant originating from the log-Sobolev inequality involving ρ*, i.e.
        KL(ρ ‖ ρ*) ≤ (1/λ̃) I(ρ | ρ*)
    holds for any probability density ρ, where I(ρ | ρ*) = ∫ |∇ log(ρ/ρ*)|² ρ dx. 39 97
  98. Error analysis

    Theorem. Assume {θ_t} solves θ̇_t = −G(θ_t)^{-1} ∇_θ H(θ_t), and {ρ_t} solves the Fokker–Planck equation. Assume the Hessian of the potential function V is bounded below by a constant λ, i.e. ∇²V ⪰ λI. Then

        W₂(ρ_{θ_t}, ρ_t) ≤ (√δ₀ / λ)(1 − e^{−λt}) + e^{−λt} W₂(ρ_{θ₀}, ρ₀).   (3)

    When V is strictly convex, i.e. λ > 0, inequality (3) provides a uniformly small upper bound on the error W₂(ρ_{θ_t}, ρ_t). In many cases, however, λ < 0, and then the right-hand side of (3) increases to infinity as t → ∞. 40 98
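A bound of this shape arises from a Grönwall argument. The sketch below shows the mechanism only, assuming (not proving) the differential inequality in the first line; the talk's full argument is what justifies that inequality.

```latex
% Sketch: how a bound of the form (3) follows from Gronwall's inequality.
% Assume the error E(t) = W_2(\rho_{\theta_t}, \rho_t) satisfies
%   E'(t) \le -\lambda\, E(t) + \sqrt{\delta_0},
% i.e. contraction at rate \lambda plus an approximation error \sqrt{\delta_0}.
\begin{align*}
  \frac{d}{dt}\bigl( e^{\lambda t} E(t) \bigr)
    &= e^{\lambda t}\bigl( E'(t) + \lambda E(t) \bigr)
     \le e^{\lambda t}\sqrt{\delta_0}, \\
  e^{\lambda t} E(t) - E(0)
    &\le \frac{\sqrt{\delta_0}}{\lambda}\bigl( e^{\lambda t} - 1 \bigr), \\
  E(t)
    &\le \frac{\sqrt{\delta_0}}{\lambda}\bigl( 1 - e^{-\lambda t} \bigr)
       + e^{-\lambda t} E(0),
\end{align*}
% with E(0) = W_2(\rho_{\theta_0}, \rho_0).  For \lambda < 0 the factor
% (1 - e^{-\lambda t})/\lambda grows like e^{|\lambda| t}, consistent with
% the blow-up remark for nonconvex V.
```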
  99. Error analysis (improved version)

    Theorem. The W₂ error W₂(ρ_{θ_t}, ρ_t) at any time t > 0 can be uniformly bounded by a constant depending on E₀ = W₂(ρ_{θ₀}, ρ₀) and δ₀. To be more precise:
    1. When λ ≥ 0, the error W₂(ρ_{θ_t}, ρ_t) can be uniformly bounded by an O(E₀ + √δ₀) term;
    2. When λ < 0, the error W₂(ρ_{θ_t}, ρ_t) can be uniformly bounded by an O((E₀ + √δ₀)^{λ̃/(2|λ|+λ̃)}) term. 41 99
  100. Summary and Future Research

    Summary: solving the high-dimensional Fokker–Planck equation
    - Dimension reduction: PDE to ODE via implicit generative models
    - Variational approach for the semi-implicit Euler scheme update
    - Approximation and numerical error bounds

    Future research
    - Application to sampling techniques for time-varying problems
    - Application to other types of dissipative PDEs
    - Application to global optimization 42 100