Slide 1

Slide 1 text

Transport information geometry: Current and Future. Wuchen Li (UCLA / U of SC). IPAM, April 6, 2020.

Slide 2

Slide 2 text

Divergences on probability space

Slide 3

Slide 3 text

Applications of Divergences

Slide 4

Slide 4 text

Information matrix. The information matrix (a.k.a. the Fisher information matrix, or Fisher-Rao metric) plays important roles in estimation, information science, statistics and machine learning:
- Statistics and information science: likelihood principle; Cramer-Rao bound, etc.
- Statistical physics.
- Population games (mean field games) via replicator dynamics (Shahshahani, Smith), with applications in biology, economics, cancer, epidemic propagation, etc.
- AI inference problems: $f$-divergence, Amari $\alpha$-divergence, etc.
- Information-geometry optimization methods (KFAC, ADAM), with applications in AI: natural gradient (Amari); ADAM (Kingma 2014); stochastic relaxation (Malago); and many more in the book Information Geometry (Ay et al.).

Slide 5

Slide 5 text

Probability models

Slide 6

Slide 6 text

Learning: Objective function. Given a data measure $\rho_{\mathrm{data}}(x) = \frac{1}{N}\sum_{i=1}^N \delta_{X_i}(x)$ and a parameterized model $\rho(x,\theta)$, most problems in AI or sampling refer to
$$\min_{\rho_\theta \in \rho(\Theta)} D(\rho_{\mathrm{data}}, \rho_\theta).$$
One typical choice of $D$ is the Kullback-Leibler divergence (relative entropy)
$$D(\rho_{\mathrm{data}}, \rho_\theta) = \int_\Omega \rho_{\mathrm{data}}(x) \log\frac{\rho_{\mathrm{data}}(x)}{\rho(x,\theta)}\,dx.$$

Slide 7

Slide 7 text

Learning: Gradient direction. The natural gradient method refers to
$$\theta^{k+1} = \theta^k - h\, G_F(\theta^k)^{-1}\nabla_\theta D(\rho_{\mathrm{data}}, \rho_{\theta^k}),$$
where $h>0$ is a step size and
$$G_F(\theta) = \mathbb{E}_{X\sim\rho_\theta}\big[(\nabla_\theta \log\rho(X,\theta))^T(\nabla_\theta \log\rho(X,\theta))\big]$$
is the Fisher information matrix. Why the natural gradient and the Fisher information matrix? Parameterization invariance; pre-conditioners for KL-divergence-related learning problems; Fisher efficiency; online Cramer-Rao bound.
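As a concrete illustration, here is a minimal sketch (not from the slides) of the natural gradient update for a univariate Gaussian model, where the Fisher matrix has the closed form $\mathrm{diag}(1/\sigma^2, 2/\sigma^2)$ and the objective is the negative log-likelihood:

```python
# Minimal sketch: natural gradient descent for rho_theta = N(mu, sigma^2),
# fitted to samples by minimizing the average negative log-likelihood.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=5000)   # plays the role of rho_data

def grad_nll(theta, x):
    """Euclidean gradient of the average negative log-likelihood."""
    mu, sigma = theta
    d_mu = -np.mean(x - mu) / sigma**2
    d_sigma = 1.0 / sigma - np.mean((x - mu) ** 2) / sigma**3
    return np.array([d_mu, d_sigma])

def fisher(theta):
    """Closed-form Fisher information matrix of N(mu, sigma^2)."""
    _, sigma = theta
    return np.diag([1.0 / sigma**2, 2.0 / sigma**2])

theta = np.array([0.0, 1.0])   # initial (mu, sigma)
h = 0.5                        # step size
for k in range(100):
    theta = theta - h * np.linalg.solve(fisher(theta), grad_nll(theta, data))

print(theta)   # close to (2.0, 0.5)
```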

Slide 8

Slide 8 text

Optimal transport. In recent years, optimal transport (a.k.a. Earth mover's distance, the Monge-Kantorovich problem, Wasserstein distance) has witnessed many applications in AI:
- Theory (Brenier, Gangbo, Villani, Figalli, et al.);
- Gradient flows (Otto, Villani, et al.);
- Proximal methods (Jordan, Kinderlehrer, Otto, et al.);
- Mean field games (Lasry, Lions, et al.);
- Population games via Fokker-Planck equations (Degond et al. 2014, Li et al. 2016);
- Image retrieval (Rubner et al. 2000);
- Wasserstein of Wasserstein loss (Dukler, Li, Lin, Montufar);
- Inverse problems (Stuart, Zhou, et al.; Ding, Li, Yin, Osher);
- Scientific computing (Benamou, Brenier, Carrillo, Oberman, Peyre, Solomon, Osher, Li, Yin, et al.);
- AI and machine learning: Wasserstein training of Boltzmann machines (Cuturi et al. 2015); Learning from Wasserstein loss (Frogner et al. 2015); Wasserstein GAN (Bottou et al. 2017); see NIPS, ICLR, ICML 2015-2020.

Slide 9

Slide 9 text

Why optimal transport? Optimal transport provides a particular distance ($W$) among histograms, which relies on the distance on sample spaces (ground cost $c$). E.g., denote $X_0\sim\rho_0=\delta_{x_0}$, $X_1\sim\rho_1=\delta_{x_1}$. Compare
$$W(\rho_0,\rho_1)=\inf_{\pi\in\Pi(\rho_0,\rho_1)}\mathbb{E}_{(X_0,X_1)\sim\pi}\, c(X_0,X_1)=c(x_0,x_1);$$
vs.
$$\mathrm{TV}(\rho_0,\rho_1)=\int_\Omega |\rho_0(x)-\rho_1(x)|\,dx=2;$$
vs.
$$\mathrm{KL}(\rho_0\|\rho_1)=\int_\Omega \rho_0(x)\log\frac{\rho_0(x)}{\rho_1(x)}\,dx=\infty.$$
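A small numerical illustration of this comparison (not from the slides; the two point masses are smoothed by a small width eps so that KL stays finite on a grid):

```python
# For two unit masses at x0 and x1, W tracks the ground distance |x0 - x1|,
# while TV saturates at 2 and KL blows up as the smoothing width shrinks.
import numpy as np
from scipy.stats import wasserstein_distance, entropy

x0, x1, eps = 0.0, 3.0, 0.5
grid = np.linspace(-5, 10, 3001)
rho0 = np.exp(-(grid - x0) ** 2 / (2 * eps**2)); rho0 /= rho0.sum()
rho1 = np.exp(-(grid - x1) ** 2 / (2 * eps**2)); rho1 /= rho1.sum()

W1 = wasserstein_distance(grid, grid, u_weights=rho0, v_weights=rho1)
TV = np.abs(rho0 - rho1).sum()     # -> 2 as the supports separate
KL = entropy(rho0, rho1)           # ~ (x1 - x0)^2 / (2 eps^2), unbounded as eps -> 0

print(W1, TV, KL)                  # W1 ~ 3 = |x0 - x1|; TV ~ 2; KL ~ 18
```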

Slide 10

Slide 10 text

Learning: Wasserstein objective/loss function. Given a data distribution $\rho_0$ and a probability model $\rho_1(\theta)$, consider
$$\min_{\theta\in\Theta} W(\rho_0,\rho_1(\theta)).$$
This is a double minimization problem, i.e.
$$W(\rho_0,\rho_1(\theta))=\min_{\pi\in\Pi(\rho_0,\rho_1(\theta))}\mathbb{E}_{(X_0,X_1)\sim\pi}\, c(X_0,X_1).$$
Many applications, such as Wasserstein GAN and Wasserstein loss, are built on the above formulation.

Slide 11

Slide 11 text

Learning: Wasserstein gradient direction. Main question: besides looking at the Wasserstein/transport distance, I propose to work on the optimal-transport-induced information matrices and geometric structures in probability models. Literature review: Wasserstein gradient flow: Otto (2001); natural gradient works efficiently in learning: Amari (1998); constrained gradient: Carlen, Gangbo (2003); linear programming: Wong (2017), Amari, Karakida, Oizumi (2017); Gaussian measures: Takatsu (2011), Malago et al. (2017), Modin (2017), Yongxin et al. (2017), Sanctis (2017), and many more.

Slide 12

Slide 12 text

Scope of transport information

Slide 13

Slide 13 text

Problem formulation. Mapping formulation: Monge problem (1781): Monge-Ampère equation. Static formulation: Kantorovich problem (1940): linear programming. Dynamical formulation: density optimal control (Nelson, Lafferty, Benamou, Brenier, Otto, Villani, etc.). In this talk, we will apply density optimal control to learning problems.

Slide 14

Slide 14 text

Density manifold. Optimal transport has an optimal control reformulation via the dual of the dual of linear programming:
$$\inf_{\rho_t}\int_0^1 g_W(\partial_t\rho_t,\partial_t\rho_t)\,dt = \int_0^1\int_\Omega (\nabla\Phi_t,\nabla\Phi_t)\,\rho_t\,dx\,dt,$$
under the dynamical constraint, i.e. the continuity equation
$$\partial_t\rho_t + \nabla\cdot(\rho_t\nabla\Phi_t)=0,\qquad \rho_0=\rho^0,\ \rho_1=\rho^1.$$
Here $(\mathcal{P}(\Omega), g_W)$ forms an infinite-dimensional Riemannian manifold. (John D. Lafferty, The density manifold and configuration space quantization, 1988.)

Slide 15

Slide 15 text

Topics. (O) Transport information geometry and Gamma calculus (Li); (I) Transport information matrix (Li, Zhao); (II) Wasserstein natural gradient/proximal (Li, Montufar, Chen, Lin, Arbel, Osher, etc.); (III) Neural Fokker-Planck equation (Li, Liu, Zha, Zhou); (IV) Transport information optimization methods (Li, Wang).

Slide 16

Slide 16 text

Part I: Transport information statistics. Main results: Li, Zhao, Wasserstein information matrix, 2019; Amari, Wasserstein statistics in 1D location-scale model, March 2020. Related studies: Wasserstein covariance (Petersen, Muller); Wasserstein minimal distance estimator (Bernton, Jacob, Gerber, Robert); Statistical inference for generative models with maximum mean discrepancy (Briol, Barp, Duncan, Girolami).

Slide 17

Slide 17 text

Information matrix and pull-back

Slide 18

Slide 18 text

Statistical information matrix. Definition (Statistical information matrix). Consider the density manifold $(\mathcal{P}(\mathcal{X}), g)$ with a metric tensor $g$, and a smoothly parametrized statistical model $p_\theta$ with parameter $\theta\in\Theta\subset\mathbb{R}^d$. Then the pull-back $G$ of $g$ onto the parameter space $\Theta$ is given by
$$G(\theta) = \big\langle \nabla_\theta p_\theta,\ g(p_\theta)\nabla_\theta p_\theta \big\rangle.$$
Denote $G(\theta)=(G(\theta)_{ij})_{1\le i,j\le d}$; then
$$G(\theta)_{ij} = \int_{\mathcal{X}} \frac{\partial}{\partial\theta_i} p(x;\theta)\,\Big(g(p_\theta)\frac{\partial}{\partial\theta_j} p\Big)(x;\theta)\,dx.$$
Here we name $g$ the statistical metric and call $G$ the statistical information matrix.

Slide 19

Slide 19 text

Statistical information matrix. Definition (Score function). Denote $\Phi_i\colon \mathcal{X}\times\Theta\to\mathbb{R}$, $i=1,\ldots,n$, satisfying
$$\Phi_i(x;\theta) = \Big(g(p_\theta)\frac{\partial}{\partial\theta_i} p(x;\theta)\Big).$$
They are the score functions associated with the statistical information matrix $G$ and are equivalence classes in $C(\mathcal{X})/\mathbb{R}$. The representatives in the equivalence classes are determined by the normalization condition
$$\mathbb{E}_{p_\theta}\Phi_i = 0,\qquad i=1,\ldots,n.$$
Then the statistical information matrix satisfies
$$G(\theta)_{ij} = \int_{\mathcal{X}} \Phi_i(x;\theta)\,\big(g(p_\theta)^{-1}\Phi_j\big)(x;\theta)\,dx.$$

Slide 20

Slide 20 text

Examples: Fisher information matrix. Consider $g(p)^{-1} = p$. Then
$$G_F(\theta)_{ij} = \mathbb{E}_{X\sim p_\theta}\big[\Phi_i(X;\theta)\,\Phi_j(X;\theta)\big],$$
where $\Phi_i = \frac{1}{p_\theta}\nabla_{\theta_i} p_\theta$ and $\Phi_j = \frac{1}{p_\theta}\nabla_{\theta_j} p_\theta$. Noticing that $\frac{1}{p}\nabla_\theta p = \nabla_\theta\log p$, we have
$$G_F(\theta)_{ij} = \mathbb{E}_{X\sim p_\theta}\Big[\frac{\partial}{\partial\theta_i}\log p(X;\theta)\,\frac{\partial}{\partial\theta_j}\log p(X;\theta)\Big].$$
In the literature, $G_F(\theta)$ is known as the Fisher information matrix and $\nabla_\theta\log p(X;\theta)$ is named the score function.
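A small check of this formula (assumption: the univariate Gaussian family, whose Fisher matrix is $\mathrm{diag}(1/\sigma^2, 2/\sigma^2)$):

```python
# Monte Carlo estimate of G_F(theta) = E[score * score^T] for N(mu, sigma^2).
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=200_000)

# Score components d/dmu log p and d/dsigma log p.
s_mu = (x - mu) / sigma**2
s_sigma = (x - mu) ** 2 / sigma**3 - 1.0 / sigma
scores = np.stack([s_mu, s_sigma], axis=1)

G_F = scores.T @ scores / len(x)
print(G_F)                                   # approx [[0.25, 0], [0, 0.5]]
print(np.diag([1 / sigma**2, 2 / sigma**2])) # closed form
```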

Slide 21

Slide 21 text

Examples: Wasserstein information matrix. Consider $g(p)^{-1} = -\nabla\cdot(p\nabla)$. Then
$$G_W(\theta)_{ij} = \mathbb{E}_{X\sim p_\theta}\big[\langle \nabla_x\Phi^W_i(X;\theta),\ \nabla_x\Phi^W_j(X;\theta)\rangle\big],$$
where
$$-\nabla_x\cdot\big(p_\theta\nabla_x\Phi^W_i\big) = \nabla_{\theta_i} p_\theta \quad\text{and}\quad -\nabla_x\cdot\big(p_\theta\nabla_x\Phi^W_j\big) = \nabla_{\theta_j} p_\theta.$$
Here we call $G_W(\theta)$ the Wasserstein information matrix and name $\Phi^W$ the Wasserstein score function.

Slide 22

Slide 22 text

Distance and information matrix. Specifically, given a smooth family of probability densities $p(x;\theta)$ and a perturbation $\Delta\theta\in T_\theta\Theta$, consider the following Taylor expansions in terms of $\Delta\theta$:
$$\mathrm{KL}(p(\theta)\|p(\theta+\Delta\theta)) = \tfrac{1}{2}\Delta\theta^T G_F(\theta)\Delta\theta + o((\Delta\theta)^2),$$
and
$$W_2(p(\theta+\Delta\theta), p(\theta))^2 = \Delta\theta^T G_W(\theta)\Delta\theta + o((\Delta\theta)^2).$$

Slide 23

Slide 23 text

Poisson equation. The Wasserstein score functions $\Phi^W_i(x;\theta)$ satisfy the following Poisson equation:
$$\nabla_x\log p(x;\theta)\cdot\nabla_x\Phi^W_i(x;\theta) + \Delta_x\Phi^W_i(x;\theta) = -\frac{\partial}{\partial\theta_i}\log p(x;\theta).$$

Slide 24

Slide 24 text

Separability. If $p(x;\theta)$ is an independence model, i.e.
$$p(x;\theta) = \prod_{k=1}^n p_k(x_k;\theta),\qquad x=(x_1,\ldots,x_n),$$
then there exists a set of one-dimensional functions $\Phi^{W,k}\colon \mathcal{X}_k\times\Theta_k\to\mathbb{R}$ such that
$$\Phi^W(x;\theta) = \sum_{k=1}^n \Phi^{W,k}(x_k;\theta).$$
In addition, the Wasserstein information matrix is separable:
$$G_W(\theta) = \sum_{k=1}^n G^k_W(\theta),\qquad G^k_W(\theta) = \mathbb{E}_{X\sim p_\theta}\big[\langle\nabla_{x_k}\Phi^{W,k}(X;\theta),\ \nabla_{x_k}\Phi^{W,k}(X;\theta)\rangle\big].$$

Slide 25

Slide 25 text

One-dimensional sample space. If $\mathcal{X}\subset\mathbb{R}^1$, the Wasserstein score functions satisfy
$$\Phi^W_i(x;\theta) = -\int^x \frac{1}{p(z;\theta)}\frac{\partial}{\partial\theta_i}F(z;\theta)\,dz,$$
where $F(x;\theta) = \int^x p(y;\theta)\,dy$ is the cumulative distribution function. The Wasserstein information matrix then satisfies
$$G_W(\theta)_{ij} = \mathbb{E}_{X\sim p_\theta}\Big[\frac{\partial_{\theta_i}F(X;\theta)\,\partial_{\theta_j}F(X;\theta)}{p(X;\theta)^2}\Big].$$
(Chen, Li, Wasserstein natural gradient in continuous sample space, 2018.)
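A quick Monte Carlo check of the one-dimensional formula (assumption: the Gaussian family, where the closed-form CDF derivatives are $\partial_\mu F = -p(x)$ and $\partial_\sigma F = -\frac{x-\mu}{\sigma}p(x)$, so $G_W$ should come out as the identity):

```python
# G_W(theta)_ij = E[ dF/dtheta_i * dF/dtheta_j / p(X)^2 ] for N(mu, sigma^2).
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 0.5, 1.5
x = rng.normal(mu, sigma, size=200_000)
p = np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

dF_dmu = -p
dF_dsigma = -((x - mu) / sigma) * p
J = np.stack([dF_dmu, dF_dsigma], axis=1) / p[:, None]

G_W = J.T @ J / len(x)
print(G_W)    # approx the identity matrix, as on the slide
```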

Slide 26

Slide 26 text

Remark. The Wasserstein score function is the average of the cumulative Fisher score function, and the Wasserstein information matrix is the covariance of the density average of the cumulative Fisher score function.

Slide 27

Slide 27 text

Analytic examples. Classical statistical models.
Gaussian family: $p(x;\mu,\sigma) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, with $G_W(\mu,\sigma)=\begin{pmatrix}1&0\\0&1\end{pmatrix}$.
Laplacian family: $p(x;m,\lambda) = \frac{\lambda}{2}e^{-\lambda|x-m|}$, with $G_W(m,\lambda)=\begin{pmatrix}1&0\\0&1\end{pmatrix}$.

Slide 28

Slide 28 text

Analytic examples. Generative models. Continuous 1-d generative family: $p(\cdot;\theta) = f_{\theta\#}p_0(\cdot)$, with $p_0$ a given distribution and
$$G_W(\theta) = \int |\nabla_\theta f_\theta(x)|^2\, p_0(x)\,dx,$$
where the push-forward distribution is defined by
$$\int_A f_{\theta\#}p_0\,dx = \int_{f_\theta^{-1}(A)} p_0\,dx.$$

Slide 29

Slide 29 text

Analytic examples. Generative models with the ReLU family:
$$f_\theta(x) = \sigma(x-\theta) = \begin{cases} 0, & x\le\theta,\\ x-\theta, & x>\theta,\end{cases}\qquad G_W(\theta) = F_0(\theta),$$
with $F_0$ the cumulative distribution function of $p_0$.
Figure: two examples of the push-forward family described above, with parameters $\theta_1=3$, $\theta_2=5$.

Slide 30

Slide 30 text

Statistical information matrices by probability family (Wasserstein vs. Fisher):
- Uniform, $p(x;a,b) = \frac{1}{b-a}\mathbf{1}_{(a,b)}(x)$: $G_W(a,b)=\frac{1}{3}\begin{pmatrix}1&\frac12\\\frac12&1\end{pmatrix}$; $G_F(a,b)$ not well-defined.
- Gaussian, $p(x;\mu,\sigma)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$: $G_W(\mu,\sigma)=\begin{pmatrix}1&0\\0&1\end{pmatrix}$; $G_F(\mu,\sigma)=\mathrm{diag}\big(\frac{1}{\sigma^2},\frac{2}{\sigma^2}\big)$.
- Exponential, $p(x;m,\lambda)=\lambda e^{-\lambda(x-m)}$: $G_W(m,\lambda)=\mathrm{diag}\big(1,\frac{2}{\lambda^4}\big)$; $G_F(m,\lambda)$ not well-defined.
- Laplacian, $p(x;m,\lambda)=\frac{\lambda}{2}e^{-\lambda|x-m|}$: $G_W(m,\lambda)=\mathrm{diag}\big(1,\frac{2}{\lambda^4}\big)$; $G_F(m,\lambda)=\mathrm{diag}\big(\lambda^2,\frac{1}{\lambda^2}\big)$.
- Location-scale, $p(x;m,\sigma)=\frac{1}{\sigma}p\big(\frac{x-m}{\sigma}\big)$: $G_W(\sigma,m)=\mathrm{diag}\Big(\frac{\mathbb{E}_{\sigma,m}x^2-2m\,\mathbb{E}_{\sigma,m}x+m^2}{\sigma^2},\,1\Big)$; $G_F(\sigma,m)=\begin{pmatrix}\frac{1}{\sigma^2}\Big(1+\int_{\mathbb{R}}\big(\frac{(x-m)^2p'^2}{\sigma^2 p}+(x-m)p'\big)dx\Big) & \int_{\mathbb{R}}\frac{(x-m)p'^2}{\sigma^3 p}dx\\ \int_{\mathbb{R}}\frac{(x-m)p'^2}{\sigma^3 p}dx & \frac{1}{\sigma^2}\int_{\mathbb{R}}\frac{p'^2}{p}dx\end{pmatrix}$.
- Independent, $p(x,y;\theta)=p(x;\theta)p(y;\theta)$: $G_W(x,y;\theta)=G_W^1(x;\theta)+G_W^2(y;\theta)$; $G_F(x,y;\theta)=G_F^1(x;\theta)+G_F^2(y;\theta)$.
- ReLU push-forward, $p(x;\theta)=f_{\theta\#}p(x)$ with $f_\theta$ a family of $\theta$-parameterized ReLUs: $G_W(\theta)=F(\theta)$, $F$ the cdf of $p(x)$; $G_F(\theta)$ not well-defined.
Table: Wasserstein and Fisher information matrices for various probability families.

Slide 31

Slide 31 text

Classical (Fisher) statistics & Wasserstein statistics. We develop a parallel Wasserstein statistics following the classical statistics approach:
- covariance (inner product) → Wasserstein covariance;
- Cramer-Rao bound (cotangent space) → Wasserstein-Cramer-Rao bound;
- natural gradient efficiency (separability) → Wasserstein natural gradient efficiency.

Slide 32

Slide 32 text

Wasserstein covariance. Definition (Wasserstein covariance). Given a statistical model $\Theta$, denote the Wasserstein covariance as
$$\mathrm{Cov}^W_\theta[T_1,T_2] = \mathbb{E}_{X\sim p_\theta}\big[\nabla_x T_1(X)\cdot\nabla_x T_2(X)^T\big],$$
where $T_1, T_2$ are random variables as functions of $X$ and the expectation is taken w.r.t. $X\sim p_\theta$. Denote the Wasserstein variance
$$\mathrm{Var}^W_\theta[T] = \mathbb{E}_{X\sim p_\theta}\big[\nabla_x T(X)\cdot\nabla_x T(X)^T\big].$$

Slide 33

Slide 33 text

Wasserstein-Cramer-Rao bound. Theorem (Wasserstein-Cramer-Rao inequalities). Given any set of statistics $T=(T_1,\ldots,T_n)\colon\mathcal{X}\to\mathbb{R}^n$, where $n$ is the number of statistics, define the two matrices $\mathrm{Cov}^W_\theta[T(x)]$ and $\nabla_\theta\mathbb{E}_{p_\theta}[T(x)]^T$ by
$$\mathrm{Cov}^W_\theta[T(x)]_{ij} = \mathrm{Cov}^W_\theta[T_i,T_j],\qquad \big(\nabla_\theta\mathbb{E}_{p_\theta}[T(x)]^T\big)_{ij} = \frac{\partial}{\partial\theta_j}\mathbb{E}_{p_\theta}[T_i(x)].$$
Then
$$\mathrm{Cov}^W_\theta[T(x)] \succeq \nabla_\theta\mathbb{E}_{p_\theta}[T(x)]^T\, G_W(\theta)^{-1}\, \nabla_\theta\mathbb{E}_{p_\theta}[T(x)].$$

Slide 34

Slide 34 text

Cramer-Rao bound: Fisher vs. Wasserstein.
Gaussian: $G_W(\mu,\sigma)=\begin{pmatrix}1&0\\0&1\end{pmatrix}$, $G_F(\mu,\sigma)=\begin{pmatrix}\frac{1}{\sigma^2}&0\\0&\frac{2}{\sigma^2}\end{pmatrix}$.
Laplacian: $G_W(m,\lambda)=\begin{pmatrix}1&\frac{1}{\lambda^2}\\\frac{1}{\lambda^2}&\frac{2}{\lambda^4}\end{pmatrix}$, $G_F$ not well-defined.
Comparison: $G_W$ is well-defined for a wider range of families, and gives a tighter bound on the variance of an estimator.

Slide 35

Slide 35 text

Information functional inequalities. Comparison between Fisher and Wasserstein information matrices relates to famous information functional inequalities. Here we study them in parameter statistics.
Dissipation of entropy & Fisher information functional:
$$\frac{d}{dt}H(\mu|\nu) = -\int_{\mathcal{X}}\Big|\nabla_x\log\frac{\mu(x)}{\nu(x)}\Big|^2\mu(x)\,dx = -I(\mu|\nu),$$
$$\frac{d}{dt}\tilde H(p_\theta|p_{\theta^*}) = -\nabla_\theta\tilde H^T\, G_W^{-1}\,\nabla_\theta\tilde H = -\tilde I(p_\theta|p_{\theta^*}).$$
Log-Sobolev inequality (LSI):
$$H(\mu|\nu) \le \frac{1}{2\alpha}I(\mu|\nu),\ \ \mu\in\mathcal{P}(\mathcal{X});\qquad \tilde H(p_\theta|p_{\theta^*}) \le \frac{1}{2\alpha}\tilde I(p_\theta|p_{\theta^*}),\ \ \theta\in\Theta.$$

Slide 36

Slide 36 text

Functional inequalities via Wasserstein and Fisher information matrices. Theorem (RIW condition). The information-matrix criterion for LSI can be written as
$$G_F(\theta) + \int \nabla^2_\theta p_\theta \log\frac{p_\theta}{p_{\theta^*}}\,dx - \Gamma_W\nabla_\theta\tilde H(p_\theta|p_{\theta^*}) \succeq 2\alpha\, G_W(\theta),$$
where $\Gamma_W$ is the Christoffel symbol of the Wasserstein statistical model $\Theta$, while for PI it can be written as
$$G_F(\theta) + \int \nabla^2_\theta p_\theta \log\frac{p_\theta}{p_{\theta^*}}\,dx \succeq 2\alpha\, G_W(\theta).$$
(Li, Geometry of probability simplex via optimal transport, 2018; Li, Montufar, Ricci curvature for parameter statistics, 2018.)

Slide 37

Slide 37 text

List of functional inequalities by family.
Gaussian family. Fisher information functional:
$$\tilde I(p_{\mu,\sigma}) = \frac{1}{\sigma^2},\qquad \tilde I(p_{\mu,\sigma}|p_*) = \frac{(\mu-\mu_*)^2}{\sigma_*^4} + \Big(-\frac{1}{\sigma}+\frac{\sigma}{\sigma_*^2}\Big)^2.$$
Log-Sobolev inequality LSI($\alpha$):
$$\frac{1}{\sigma^2} \ge 2\alpha\Big(-\frac12\log 2\pi - \log\sigma - \frac12\Big),$$
$$\frac{(\mu-\mu_*)^2}{\sigma_*^4} + \Big(-\frac{1}{\sigma}+\frac{\sigma}{\sigma_*^2}\Big)^2 \ge 2\alpha\Big(-\log\sigma+\log\sigma_*-\frac12+\frac{\sigma^2}{2\sigma_*^2}+\frac{(\mu-\mu_*)^2}{2\sigma_*^2}\Big).$$
Laplacian family. Fisher information functional:
$$\tilde I(p_{\lambda,m}) = \frac{\lambda^2}{2},\qquad \tilde I(p_{m,\lambda}|p_*) = \lambda_*^2\big(1-e^{-\lambda|m-m_*|}\big)^2 + \frac12\Big((\lambda|m-m_*|+1)\frac{\lambda_*}{\lambda}e^{-\lambda|m-m_*|}-\lambda\Big)^2.$$
Log-Sobolev inequality LSI($\alpha$):
$$\lambda_*^2\big(1-e^{-\lambda|m-m_*|}\big)^2 + \frac12\Big((\lambda|m-m_*|+1)\frac{\lambda_*}{\lambda}e^{-\lambda|m-m_*|}-\lambda\Big)^2 \ge 2\alpha\Big(-1+\log\lambda-\log\lambda_*+\lambda_*|m-m_*|+\frac{\lambda_*}{\lambda}e^{-\lambda|m-m_*|}\Big).$$
Table: continuing the list, the Fisher information functional and Log-Sobolev inequality for various probability families.

Slide 38

Slide 38 text

List of functional inequalities by family (continued).
Gaussian family. RIW condition for LSI($\alpha$):
$$\begin{pmatrix}\frac{1}{\sigma_*^2}&0\\0&\frac{1}{\sigma_*^2}+\frac{1}{\sigma^2}\end{pmatrix} \succeq 2\alpha\begin{pmatrix}1&0\\0&1\end{pmatrix};\qquad \text{RIW condition for PI($\alpha$):}\quad \begin{pmatrix}\frac{1}{\sigma_*^2}&0\\0&\frac{2}{\sigma_*^2}\end{pmatrix} \succeq 2\alpha\begin{pmatrix}1&0\\0&1\end{pmatrix}.$$
Laplacian family. RIW condition for LSI($\alpha$):
$$\begin{pmatrix}\lambda\lambda_* e^{-\lambda|m-m_*|}&0\\0&\frac{1}{\lambda^2}+\frac{2\lambda_*(m_*-m)^2}{\lambda^3}e^{-\lambda|m-m_*|}\end{pmatrix} \succeq 2\alpha\begin{pmatrix}1&0\\0&\frac{2}{\lambda^4}\end{pmatrix};\qquad \text{RIW condition for PI($\alpha$):}\quad \begin{pmatrix}\lambda_*^2&0\\0&\frac{1}{\lambda_*^2}\end{pmatrix} \succeq 2\alpha\begin{pmatrix}1&0\\0&\frac{2}{\lambda_*^4}\end{pmatrix}.$$
Table: the RIW condition for LSI and PI in various probability families.

Slide 39

Slide 39 text

Wasserstein natural gradient. Given a loss function $F\colon\mathcal{P}(\Omega)\to\mathbb{R}$ and a probability model $\rho(\cdot,\theta)$, the associated gradient flow on a Riemannian manifold is defined by
$$\frac{d\theta}{dt} = -\nabla_g F(\rho(\cdot,\theta)).$$
Here $\nabla_g$ is the Riemannian gradient operator satisfying
$$g_\theta\big(\nabla_g F(\rho(\cdot,\theta)),\,\xi\big) = \nabla_\theta F(\rho(\cdot,\theta))\cdot\xi$$
for any tangent vector $\xi\in T_\theta\Theta$, where $\nabla_\theta$ represents the differential operator (Euclidean gradient).

Slide 40

Slide 40 text

Wasserstein natural gradient. The gradient flow of the loss function $F(\rho(\cdot,\theta))$ in $(\Theta, g_\theta)$ satisfies
$$\frac{d\theta}{dt} = -G_W(\theta)^{-1}\nabla_\theta F(\rho(\cdot,\theta)).$$
If $\rho(\cdot,\theta) = \rho(x)$ is an identity map, then we recover the Wasserstein gradient flow in the full probability space:
$$\partial_t\rho_t = \nabla\cdot\Big(\rho_t\nabla\frac{\delta}{\delta\rho}F(\rho_t)\Big).$$

Slide 41

Slide 41 text

Entropy production. The gradient flow of the entropy
$$H(\rho) = \int_{\mathbb{R}^d}\rho(x)\log\rho(x)\,dx$$
w.r.t. the optimal transport distance is the heat equation:
$$\frac{\partial\rho}{\partial t} = \nabla\cdot(\rho\nabla\log\rho) = \Delta\rho.$$
Along the gradient flow, the following relation holds:
$$\frac{d}{dt}H(\rho) = \int_{\mathbb{R}^d}\log\rho\ \nabla\cdot(\rho\nabla\log\rho)\,dx = -\int_{\mathbb{R}^d}(\nabla\log\rho)^2\rho\,dx.$$
Here the last term is known as the Fisher information. (B. Roy Frieden, Science from Fisher Information: A Unification.)
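A numerical sanity check of this dissipation identity (not from the slides): evolve the heat equation on a 1-d periodic grid and compare $\frac{d}{dt}H$ with minus the Fisher information.

```python
# dH/dt along the heat flow should roughly equal -int |d/dx log rho|^2 rho dx.
import numpy as np

L, n = 10.0, 400
dx, dt = L / n, 1e-4
x = np.linspace(-L / 2, L / 2, n, endpoint=False)
rho = np.exp(-(x + 1.0) ** 2) + 0.5 * np.exp(-4 * (x - 2.0) ** 2)
rho /= rho.sum() * dx

def entropy(r):
    return np.sum(r * np.log(r)) * dx

def fisher_info(r):
    dlog = np.gradient(np.log(r), dx)
    return np.sum(dlog**2 * r) * dx

for step in range(200):
    H_old = entropy(rho)
    lap = (np.roll(rho, 1) - 2 * rho + np.roll(rho, -1)) / dx**2  # periodic Laplacian
    rho = rho + dt * lap
    if step % 50 == 0:
        print((entropy(rho) - H_old) / dt, -fisher_info(rho))  # roughly equal
```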

Slide 42

Slide 42 text

Online natural gradient algorithm. We sample from the unknown distribution once in each step, and use the sample $x_t$ to generate an estimator $\theta_{t+1}$:
$$\theta_{t+1} = \theta_t - \frac{1}{t}\nabla^W_\theta f(x_t,\theta_t),$$
where $f$ is the loss function. To analyze the convergence of this algorithm, we define the Wasserstein covariance matrix $V_t$ by
$$V_t = \mathbb{E}_{p_{\theta^*}}\big[\nabla_x(\theta_t-\theta^*)\cdot\nabla_x(\theta_t-\theta^*)^T\big],$$
where $\theta^*$ is the optimal value. Definition (Natural gradient efficiency). The Wasserstein natural gradient is asymptotically efficient if
$$V_t = \frac{1}{t}G_W^{-1}(\theta^*) + O\Big(\frac{1}{t^2}\Big).$$
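A toy illustration of this efficiency statement (assumption: the Gaussian location family $N(\theta,1)$, for which $G_W(\theta)=1$; this is not the general algorithm from the slides): the online update with step size $1/t$ reduces to the running mean, and its variance decays like $G_W^{-1}(\theta^*)/t$.

```python
import numpy as np

rng = np.random.default_rng(3)
theta_star, T, runs = 1.0, 2000, 500
estimates = np.zeros(runs)
for r in range(runs):
    theta = 0.0
    for t in range(1, T + 1):
        x = rng.normal(theta_star, 1.0)
        theta = theta - (1.0 / t) * (theta - x)   # online step with G_W = 1
    estimates[r] = theta

print(np.var(estimates), 1.0 / T)   # both close to 5e-4 = G_W^{-1}(theta*) / T
```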

Slide 43

Slide 43 text

Covariance update. Theorem (Variance updating equation of the Wasserstein natural gradient). For any function $f(x,\theta)$ that satisfies $\mathbb{E}_{p_\theta}f(x,\theta)=0$, consider the asymptotic behavior of the Wasserstein dynamics
$$\theta_{t+1} = \theta_t - \frac{1}{t}G_W^{-1}(\theta_t)f(x_t,\theta_t).$$
That is, assume a priori
$$\mathbb{E}_{p_{\theta^*}}\big[(\theta_t-\theta^*)^2\big],\ \ \mathbb{E}_{p_{\theta^*}}\big[|\nabla_x(\theta_t-\theta^*)|^2\big] = o(1),\quad\forall t.$$
Then the Wasserstein covariance matrix $V_t$ updates according to
$$V_{t+1} = V_t + \frac{1}{t^2}G_W^{-1}(\theta^*)\,\mathbb{E}_{p_{\theta^*}}\big[\nabla_x f(x_t,\theta^*)\cdot\nabla_x f(x_t,\theta^*)^T\big]\,G_W^{-1}(\theta^*) - \frac{2V_t}{t}\,\mathbb{E}_{p_{\theta^*}}\big[\nabla_\theta f(x_t,\theta^*)\big]\,G_W^{-1}(\theta^*) + o\Big(\frac{V_t}{t}\Big) + o\Big(\frac{1}{t^2}\Big).$$

Slide 44

Slide 44 text

Wasserstein online efficiency. Corollary (Wasserstein natural gradient efficiency). For the dynamics
$$\theta_{t+1} = \theta_t - \frac{1}{t}G_W^{-1}(\theta_t)\,\Phi^W(x_t,\theta_t),$$
the Wasserstein covariance updates according to
$$V_{t+1} = V_t + \frac{1}{t^2}G_W^{-1}(\theta^*) - \frac{2}{t}V_t + o\Big(\frac{1}{t^2}\Big) + o\Big(\frac{V_t}{t}\Big).$$
Hence the online Wasserstein natural gradient algorithm is Wasserstein efficient, that is,
$$V_t = \frac{1}{t}G_W^{-1}(\theta^*) + O\Big(\frac{1}{t^2}\Big).$$

Slide 45

Slide 45 text

Poincare online efficiency. Corollary (Poincare efficiency). For the dynamics
$$\theta_{t+1} = \theta_t - \frac{1}{t}\nabla^W_\theta\, l(x_t,\theta_t),$$
where $l(x_t,\theta_t) = \log p(x_t,\theta_t)$ is the log-likelihood function, the Wasserstein covariance updates according to
$$V_{t+1} = V_t + \frac{1}{t^2}G_W^{-1}(\theta^*)\,\mathbb{E}_{p_{\theta^*}}\big[\nabla_x(\nabla_\theta l(x_t,\theta^*))\cdot\nabla_x(\nabla_\theta l(x_t,\theta^*))^T\big]\,G_W^{-1}(\theta^*) - \frac{2}{t}V_t\,G_F(\theta^*)G_W^{-1}(\theta^*) + O\Big(\frac{1}{t^3}\Big) + o\Big(\frac{V_t}{t}\Big).$$
Now suppose $\alpha = \sup\{a \mid G_F\succeq a\,G_W\}$. Then the dynamics is characterized by
$$V_t = \begin{cases} O\big(t^{-2\alpha}\big), & 2\alpha\le 1,\\[4pt] \dfrac{1}{t}\big(2G_F G_W^{-1} - \mathrm{Id}\big)^{-1} G_W^{-1}(\theta^*)\, I\, G_W^{-1}(\theta^*) + O\Big(\dfrac{1}{t^2}\Big), & 2\alpha>1,\end{cases}$$
where $I = \mathbb{E}_{p_{\theta^*}}\big[\nabla_x(\nabla_\theta l(x_t,\theta^*))\cdot\nabla_x(\nabla_\theta l(x_t,\theta^*))^T\big]$ and $\mathrm{Id}$ is the identity matrix.

Slide 46

Slide 46 text

Part II: Transport information proximal.
- Lin, Li, Montufar, Osher, Wasserstein proximal of GANs, 2019.
- Li, Lin, Montufar, Affine proximal learning, GSI, 2019.
- Li, Liu, Zha, Zhou, Parametric Fokker-Planck equations, GSI, 2019.
- Arbel, Gretton, Li, Guido Montufar, Kernelized Wasserstein Natural Gradient, ICLR, spotlight, 2020.
- Liu, Li, Zha, Zhou, Neural parametric Fokker-Planck equations, 2020.

Slide 47

Slide 47 text

Gradient operator. Given a function $F\colon M\to\mathbb{R}$ with a metric matrix function $G(x)\in\mathbb{R}^{n\times n}$, the gradient operator is defined by
$$\mathrm{grad}\,F(x) = G(x)^{-1}\nabla_x F(x),$$
where $\nabla_x$ is the Euclidean or $L^2$ gradient w.r.t. $x$.
- If $G(x)^{-1} = I$, then $\mathrm{grad}\,F(x) = \nabla_x F(x)$.
- If $G(x)^{-1} = \mathrm{diag}(x_i)_{1\le i\le n}$, then $\mathrm{grad}\,F(x) = (x_i\nabla_{x_i}F(x))_{i=1}^n$.
- If $G(x)^{-1} = L(x)$, where $L$ is a graph Laplacian matrix and $x$ is the weight, then $\mathrm{grad}\,F(x) = L(x)\nabla_x F(x)$.

Slide 48

Slide 48 text

Proximal method. Consider a Euclidean gradient flow $\dot x = -\nabla F(x)$. The proximal method refers to
$$x^{k+1} = \arg\min_{x\in\mathbb{R}^n}\ F(x) + \frac{1}{2h}\|x-x^k\|^2,$$
where $h>0$ is the step size. This is known as the backward Euler method
$$x^{k+1} = x^k - h\nabla F(x^{k+1}).$$
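For a quadratic loss the proximal step has a closed form, which makes the equivalence with backward Euler easy to verify; a small sketch (not from the slides):

```python
# For F(x) = 0.5 x^T A x, the proximal step argmin_x F(x) + ||x - x_k||^2/(2h)
# is x_{k+1} = (I + hA)^{-1} x_k, i.e. exactly x_{k+1} = x_k - h A x_{k+1}.
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
h = 0.1
x = np.array([1.0, -2.0])

def prox_step(xk):
    return np.linalg.solve(np.eye(2) + h * A, xk)

for k in range(5):
    x_next = prox_step(x)
    # backward Euler consistency check
    assert np.allclose(x_next, x - h * A @ x_next)
    x = x_next
print(x)   # contracts toward the minimizer x* = 0
```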

Slide 49

Slide 49 text

Proximal method on a metric structure. Consider the gradient flow $\dot x = -G(x)^{-1}\nabla F(x)$. The proximal method refers to
$$x^{k+1} = \arg\min_{x\in\mathbb{R}^n}\ F(x) + \frac{1}{2h}d_G(x,x^k)^2,$$
where $h>0$ is the step size and $d_G(x,y)$ is the distance function induced by $G$, i.e.
$$d_G(x,y) = \inf\Big\{\int_0^1\sqrt{\dot\gamma(t)^T G(\gamma(t))\dot\gamma(t)}\,dt\colon\ \gamma(0)=x,\ \gamma(1)=y\Big\}.$$
This is known as the variational backward Euler method
$$x^{k+1} = x^k - hG(x^{k+1})^{-1}\nabla F(x^{k+1}) + o(h).$$
Many other linearizations of this method have been studied.

Slide 50

Slide 50 text

Proximal method on a probability space. Denote $G(\rho)^{-1} = -\nabla\cdot(\rho\nabla)$. Consider the gradient flow
$$\dot\rho = -G(\rho)^{-1}\frac{\delta}{\delta\rho}F(\rho) = \nabla\cdot\Big(\rho\nabla\frac{\delta}{\delta\rho}F(\rho)\Big).$$
The proximal method refers to
$$\rho^{k+1} = \arg\min_{\rho\in\mathcal{P}}\ F(\rho) + \frac{1}{2h}d_G(\rho,\rho^k)^2,$$
where $h>0$ is the step size and $d_G$ is the distance function induced by $G$, known as the transport distance. This is the variational backward Euler method
$$\rho^{k+1} = \rho^k + h\nabla\cdot\Big(\rho^{k+1}\nabla\frac{\delta}{\delta\rho}F(\rho^{k+1})\Big) + o(h).$$
A linearization of the proximal method has been formulated by Li, Lu, Wang (2019).

Slide 51

Slide 51 text

Generative Adversarial Networks

Slide 52

Slide 52 text

Generative probability models. For each parameter $\theta\in\mathbb{R}^d$ and a given neural-network-parameterized mapping function $g_\theta$, consider $\rho_\theta = g_{\theta\#}p(z)$.

Slide 53

Slide 53 text

Natural proximal on probability models. Denote $G(\theta) = \big(\nabla_\theta\rho,\ (-\Delta_\rho)^{-1}\nabla_\theta\rho\big)_{L^2}$, with $\Delta_\rho := \nabla\cdot(\rho\nabla)$. Consider the gradient flow
$$\dot\theta = -G(\theta)^{-1}\nabla_\theta F(\rho(\theta,\cdot)).$$
The update scheme follows
$$\theta^{k+1} = \arg\min_{\theta\in\Theta}\ F(\rho_\theta) + \frac{1}{2h}d_W(\theta,\theta^k)^2,$$
where $\theta$ are the parameters of the generator, $F(\rho_\theta)$ is the loss function, and $d_W$ is the Wasserstein metric. (Lin, Li, Osher, Montufar, Wasserstein proximal of GANs, 2018.)

Slide 54

Slide 54 text

Computations. In practice, we approximate the Wasserstein metric to obtain the following update:
$$\theta^{k+1} = \arg\min_{\theta\in\Theta}\ F(\rho_\theta) + \frac{1}{B}\sum_{i=1}^{B}\frac{1}{2h}\|g_\theta(z_i) - g_{\theta^k}(z_i)\|^2,$$
where $g_\theta$ is the generator, $B$ is the batch size, and $z_i\sim p(z)$ are inputs to the generator.
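A schematic sketch of how such a proximal regularizer can be added to a generator update in PyTorch. This is only an illustration of the formula above, not the implementation from the cited paper; `generator`, `gan_loss`, and `sample_z` are hypothetical placeholders.

```python
import copy
import torch

def proximal_generator_update(generator, gan_loss, sample_z, h=0.1,
                              batch_size=64, inner_steps=10, lr=1e-4):
    """One outer proximal update: approximately solve
    argmin_theta F(rho_theta) + (1/B) sum_i ||g_theta(z_i) - g_theta_k(z_i)||^2 / (2h)."""
    frozen = copy.deepcopy(generator)          # g_{theta_k}, kept fixed
    for p in frozen.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    for _ in range(inner_steps):
        z = sample_z(batch_size)
        diff = (generator(z) - frozen(z)).flatten(start_dim=1)
        penalty = diff.pow(2).sum(dim=1).mean() / (2 * h)
        loss = gan_loss(generator, z) + penalty   # placeholder GAN loss F(rho_theta)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return generator
```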

Slide 55

Slide 55 text

Examples: Jensen-Shannon entropy loss. Figure: the relaxed Wasserstein proximal of GANs on the CIFAR-10 (left) and CelebA (right) datasets.

Slide 56

Slide 56 text

Examples: Wasserstein-1 loss. Figure: Wasserstein proximal of the Wasserstein-1 loss function on the CIFAR-10 dataset.

Slide 57

Slide 57 text

Part III: Transport information Gaussian mixtures. Jiaxi Zhao, Wuchen Li, Scaling limit of transport Gaussian mixture models, 2020.

Slide 58

Slide 58 text

Part IV: Transport information generative models. Shu Liu, Wuchen Li, Hongyuan Zha, Haomin Zhou, Neural parametric Fokker-Planck equations, 2020.

Slide 59

Slide 59 text

Fokker-Planck equation.
- The Fokker-Planck equation (FPE) is a parabolic evolution equation:
$$\frac{\partial\rho}{\partial t} = \nabla\cdot(\rho\nabla V) + \beta\Delta\rho,\qquad \rho(\cdot,0) = \rho_0.$$
The equation plays fundamental roles in probability, geometry, statistics, and also in the machine learning community.
- This equation has many applications.

Slide 60

Slide 60 text

Our goal. We are interested in solving high dimensional Fokker-Planck equations.
- Difficulty: curse of dimensionality.
- Existing tools for high dimensional problems: deep learning.
- We are going to introduce deep learning techniques to numerically solve the FPE.

Slide 61

Slide 61 text

Motivation.
- Dimension reduction: Fokker-Planck equation (infinite dimensional ODE) ⇒ parametric Fokker-Planck equation (finite dimensional ODE).
- Obtain an efficient sampling-friendly method, i.e. we are able to draw samples from $\rho_t = \rho(x,t)$ for arbitrary $t$.

Slide 62

Slide 62 text

Our treatment. Generative model in deep learning: $T_\theta$ is a deep neural network, $p$ is a reference distribution. Find a suitable parameter $\theta$ such that $T_{\theta\#}p$ approximates the desired distribution $\rho$ well.
- Deep structure ⇒ dealing with high dimensional problems;
- Pushforward mechanism ⇒ sampling-friendly.
We numerically solve the high dimensional Fokker-Planck equation in the generative model:
- Approximate $\rho_t$, the solution of the Fokker-Planck equation at time $t$, by $T_{\theta_t\#}p$;
- Construct an ODE for $\theta_t$ and numerically solve the ODE.
(Ian J. Goodfellow et al., Generative Adversarial Nets, 2014; Martin Arjovsky et al., Wasserstein Generative Adversarial Networks, 2016.)

Slide 63

Slide 63 text

Project the Fokker-Planck equation to the parameter space

Slide 64

Slide 64 text

Wasserstein manifold of probability distributions. The Wasserstein manifold of probability distributions defined on $\mathbb{R}^d$:
$$\mathcal{P} = \Big\{\rho\colon \int\rho(x)\,dx = 1,\ \rho(x)\ge 0,\ \int|x|^2\rho(x)\,dx<\infty\Big\}.$$
The tangent space of $\mathcal{P}$ at $\rho$ is characterized as
$$T_\rho\mathcal{P} = \Big\{\dot\rho\colon \int\dot\rho(x)\,dx = 0\Big\}.$$
(Richard Jordan, David Kinderlehrer, and Felix Otto, The variational formulation of the Fokker-Planck equation, SIAM Journal on Mathematical Analysis, 1998.)

Slide 65

Slide 65 text

Wasserstein manifold of probability distributions. We define the Wasserstein metric $g_W$ on $T\mathcal{P}$ as follows: for any $\rho\in\mathcal{P}$ and $\dot\rho_1,\dot\rho_2\in T_\rho\mathcal{P}$,
$$g_W(\rho)(\dot\rho_1,\dot\rho_2) = \int\nabla\Phi_1(x)\cdot\nabla\Phi_2(x)\,\rho(x)\,dx,$$
where $\Phi_1,\Phi_2$ solve
$$\dot\rho_1 = -\nabla\cdot(\rho\nabla\Phi_1),\qquad \dot\rho_2 = -\nabla\cdot(\rho\nabla\Phi_2).$$
Now $(\mathcal{P}, g_W)$ becomes a Riemannian manifold; we call it the Wasserstein manifold. The geodesic distance on $(\mathcal{P}, g_W)$ corresponds to the $L^2$-Wasserstein distance. (Jordan, Kinderlehrer, Otto, 1998.)

Slide 66

Slide 66 text

Gradient flows on the Wasserstein manifold. Denote by $\mathrm{grad}_W$ the manifold gradient on $(\mathcal{P}, g_W)$; it can be computed as follows. Consider a smooth functional $F\colon\mathcal{P}\to\mathbb{R}$; one can verify
$$\mathrm{grad}_W F(\rho) = g_W(\rho)^{-1}\Big(\frac{\delta F}{\delta\rho}\Big)(x) = -\nabla\cdot\Big(\rho(x)\nabla\frac{\delta}{\delta\rho(x)}F(\rho)\Big),$$
where $\frac{\delta}{\delta\rho(x)}$ is the $L^2$ first variation at the variable $x\in\mathbb{R}^d$. Then the gradient flow of $F$ is written as
$$\frac{\partial\rho}{\partial t} = -\mathrm{grad}_W F(\rho) = \nabla\cdot\Big(\rho(x)\nabla\frac{\delta}{\delta\rho(x)}F(\rho)\Big).$$
(Cédric Villani, Optimal transport: old and new, 2008.)

Slide 67

Slide 67 text

Fokker-Planck equation as a gradient flow. For a given potential function $V$, consider the relative entropy functional $H$ (also known as the KL divergence):
$$H(\rho) = \beta\int\rho(x)\log\Big(\frac{\rho(x)}{\frac{1}{Z}e^{-V(x)/\beta}}\Big)dx = \int V(x)\rho(x)\,dx + \beta\int\rho(x)\log\rho(x)\,dx + \text{Const}.$$
Then $\nabla\big(\frac{\delta H}{\delta\rho}\big) = \nabla V + \beta\nabla\log\rho$, and the gradient flow is written as
$$\frac{\partial\rho}{\partial t} = \nabla\cdot(\rho\nabla V) + \beta\nabla\cdot(\rho\nabla\log\rho) = \nabla\cdot(\rho\nabla V) + \beta\Delta\rho.$$
This is exactly the Fokker-Planck equation mentioned at the beginning.

Slide 68

Slide 68 text

Parametric Fokker-Planck equation.
- Inspired by: Wasserstein geometry + information geometry ⇒ Wasserstein information geometry.
- Sketch of our treatment: introduce parametric distributions via a generative model; derive the metric tensor on the parameter space; compute the Wasserstein gradient flow of the relative entropy in the parameter space.
(Felix Otto, The geometry of dissipative evolution equations: the porous medium equation, 2001; S. Amari, Natural Gradient Works Efficiently in Learning, 1998; Wuchen Li and Guido Montufar, Natural Gradient via Optimal Transport, 2018.)

Slide 69

Slide 69 text

Parametric Fokker-Planck equation.
- Consider a class of invertible push-forward maps $\{T_\theta\}_{\theta\in\Theta}$ indexed by a parameter $\theta\in\Theta\subset\mathbb{R}^m$, with $T_\theta\colon\mathbb{R}^d\to\mathbb{R}^d$.
- We obtain the family of parametric distributions $\mathcal{P}_\Theta = \{\rho_\theta = T_{\theta\#}p \mid \theta\in\Theta\}\subset(\mathcal{P}, g_W)$.
- Define $T_\#\colon\theta\mapsto T_{\theta\#}p = \rho_\theta$ and introduce the metric tensor $G = (T_\#)^*g_W$ on $\Theta$ by pulling back $g_W$ by $T_\#$.
- We call the parameter manifold $\Theta$ equipped with $G$ the Wasserstein statistical manifold.
- For each $\theta\in\Theta$, $G(\theta)$ is a bilinear form on $T_\theta\Theta\simeq\mathbb{R}^m$.

Slide 70

Slide 70 text

Parametric Fokker-Planck equation. Theorem 1. Let $G = (T_\#)^*g_W$; then $G(\theta)$ is an $m\times m$ non-negative definite symmetric matrix that can be computed as
$$G(\theta) = \int\nabla\psi(T_\theta(x))\,\nabla\psi(T_\theta(x))^T\,p(x)\,dx,$$
or in entrywise form
$$G_{ij}(\theta) = \int\nabla\psi_i(T_\theta(x))\cdot\nabla\psi_j(T_\theta(x))\,p(x)\,dx,\qquad 1\le i,j\le m.$$
Here $\psi = (\psi_1,\ldots,\psi_m)^T$ and $\nabla\psi$ is the $m\times d$ Jacobian matrix of $\psi$. For each $k = 1,2,\ldots,m$, $\psi_k$ solves the equation
$$\nabla\cdot\big(\rho_\theta\nabla\psi_k(x)\big) = \nabla\cdot\big(\rho_\theta\,\partial_{\theta_k}T_\theta(T_\theta^{-1}(x))\big).\qquad (1)$$

Slide 71

Slide 71 text

Parametric Fokker-Planck equation. Recall the relative entropy functional $H$; we consider $H = H\circ T_\#\colon\Theta\to\mathbb{R}$. Then
$$H(\theta) = H(\rho_\theta) = \int V(x)\rho_\theta(x)\,dx + \beta\int\rho_\theta(x)\log\rho_\theta(x)\,dx = \int\big[V(T_\theta(x)) + \beta\log\rho_\theta(T_\theta(x))\big]p(x)\,dx.$$
Since the Fokker-Planck equation is the gradient flow of $H$ on $(\mathcal{P}, g_W)$, a similar version on $(\Theta, G)$ is formulated as
$$\dot\theta = -G(\theta)^{-1}\nabla_\theta H(\theta).\qquad\Big(\text{Recall: }\ \frac{\partial\rho}{\partial t} = -g_W(\rho)^{-1}\frac{\delta}{\delta\rho(x)}H(\rho)\Big)$$
This is an ODE in the parameter space $\Theta$; we call it the parametric Fokker-Planck equation.

Slide 72

Slide 72 text

From the Fokker-Planck equation to particle dynamics. (In this diagram, notice that $\nabla\frac{\delta H(\rho)}{\delta\rho} = \nabla V + \beta\nabla\log\rho$.)

Slide 73

Slide 73 text

Particle version of the Fokker-Planck equation. Fokker-Planck equation:
$$\frac{\partial\rho}{\partial t} = \nabla\cdot(\rho\nabla V) + \beta\Delta\rho,\qquad \rho(\cdot,0)=\rho_0.$$
Particle versions:
- Overdamped Langevin dynamics:
$$dX_t = -\nabla V(X_t)\,dt + \sqrt{2\beta}\,dB_t,\qquad X_0\sim\rho_0.$$
- Vlasov-type dynamics:
$$\frac{dX_t}{dt} = -\nabla V(X_t) - \beta\nabla\log\rho_t(X_t),\qquad X_0\sim\rho_0,$$
or equivalently
$$\frac{dX_t}{dt} = -\nabla\frac{\delta H(\rho_t)}{\delta\rho_t}(X_t),\qquad X_0\sim\rho_0.$$
Here we denote by $\rho_t(\cdot)$ the density of the distribution of $X_t$.
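A minimal Euler-Maruyama sketch of the overdamped Langevin dynamics above (not from the slides; `grad_V` is a placeholder for the gradient of whichever potential is used):

```python
import numpy as np

def langevin_samples(grad_V, x0, beta=1.0, dt=1e-3, n_steps=10_000, seed=0):
    """Simulate dX = -grad V(X) dt + sqrt(2 beta) dB; the particle density
    solves the Fokker-Planck equation above and approaches (1/Z) exp(-V/beta)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)             # shape (n_particles, d)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - dt * grad_V(x) + np.sqrt(2.0 * beta * dt) * noise
    return x

# Example: 1-d double-well potential V(x) = (x^2 - 1)^2, grad V = 4x(x^2 - 1).
samples = langevin_samples(lambda x: 4 * x * (x**2 - 1),
                           x0=np.zeros((5000, 1)))
```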

Slide 74

Slide 74 text

From the parametric Fokker-Planck equation to particle dynamics. (In this diagram, notice that $\nabla\frac{\delta H(\rho)}{\delta\rho} = \nabla V + \beta\nabla\log\rho$.)

Slide 75

Slide 75 text

Particle version of the parametric Fokker-Planck equation.
- Let $\{\theta_t\}$ be the solution of the ODE.
- Take $Z\sim p$ and consider $Y_t = T_{\theta_t}(Z)$; we know that $Y_t\sim\rho_{\theta_t}$ for any $t$. $\{Y_t\}$ satisfies the dynamics
$$\dot Y_t = \partial_\theta T_{\theta_t}\big(T_{\theta_t}^{-1}(Y_t)\big)\,\dot\theta_t,\qquad Y_0\sim\rho_{\theta_0}.$$
- It is then more insightful to consider the following dynamics:
$$\dot X_t = \nabla\psi_t(X_t)^T\dot\theta_t,\qquad X_0\sim\rho_{\theta_0}.$$
- $X_t$ and $Y_t$ have the same distribution, and thus $X_t\sim\rho_{\theta_t}$ for all $t\ge 0$.

Slide 76

Slide 76 text

Particle version of the parametric Fokker-Planck equation.
- Since $\dot\theta_t = -G(\theta_t)^{-1}\nabla_\theta H(\theta_t)$, plugging in the definition of the metric tensor gives
$$\dot X_t = \int K_{\theta_t}(X_t,\eta)\big(-\nabla V(\eta) - \beta\nabla\log\rho_{\theta_t}(\eta)\big)\rho_{\theta_t}(\eta)\,d\eta.$$
- Here we define the kernel function $K_\theta\colon\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}^{d\times d}$,
$$K_\theta(x,\eta) = \nabla\psi^T(x)\Big(\int\nabla\psi(x)\nabla\psi(x)^T\rho_\theta(x)\,dx\Big)^{-1}\nabla\psi(\eta).$$

Slide 77

Slide 77 text

Particle version of the parametric Fokker-Planck equation.
- The kernel $K_\theta(\cdot,\cdot)$ induces an orthogonal projection $\mathcal{K}_\theta$ defined on $L^2(\mathbb{R}^d;\mathbb{R}^d,\rho_\theta)$. The range of this projection is the subspace
$$\mathrm{span}\{\nabla\psi_1,\ldots,\nabla\psi_m\}\subset L^2(\mathbb{R}^d;\mathbb{R}^d,\rho_\theta).$$
Here $\psi_1,\ldots,\psi_m$ are the $m$ components of $\psi$.
- We may rewrite the dynamics as
$$\dot X_t = -\mathcal{K}_{\theta_t}\big(\nabla V + \beta\nabla\log\rho_{\theta_t}\big)(X_t),\qquad X_0\sim\rho_{\theta_0},$$
where $\rho_{\theta_t}$ is the probability density of $X_t$.

Slide 78

Slide 78 text

Projected dynamics. (In this diagram, notice that $\nabla\frac{\delta H(\rho)}{\delta\rho} = \nabla V + \beta\nabla\log\rho$.)

Slide 79

Slide 79 text

Projected dynamics.
- We now compare:
Vlasov-type dynamics (Fokker-Planck equation):
$$\dot{\tilde X}_t = -\big(\nabla V + \beta\nabla\log\rho_t\big)(\tilde X_t),\qquad \tilde X_0\sim\rho_0,$$
where $\rho_t$ is the probability density of $\tilde X_t$;
Projected Vlasov-type dynamics (parametric Fokker-Planck equation):
$$\dot X_t = -\mathcal{K}_{\theta_t}\big(\nabla V + \beta\nabla\log\rho_{\theta_t}\big)(X_t),\qquad X_0\sim\rho_{\theta_0},$$
where $\rho_{\theta_t}$ is the probability density of $X_t$.
- From the particle viewpoint, the difference between $\rho_{\theta_t}$ and $\rho_t$ originates from the orthogonal projection $\mathcal{K}_{\theta_t}$ of the vector field that drives the Vlasov-type dynamics.

Slide 80

Slide 80 text

Route map of our research. Our parametric Fokker-Planck equation can be treated as a neural Lagrangian method with an Eulerian view.

Slide 81

Slide 81 text

Algorithm. When the dimension $d$ gets higher:
- Endow $T_\theta$ with the structure of deep neural networks. Our choice: normalizing flows (there are definitely other choices); a generic sketch of such a push-forward map is given below.
- Classical schemes for the ODE are not suitable: direct computation of $G(\theta)$ could be expensive; $G(\theta)^{-1}\nabla_\theta H(\theta)$ cannot be efficiently computed under a deep learning framework.
- We have to seek new schemes. Since optimization is convenient to perform under a deep learning framework, one choice is to design a variational scheme for our parametric Fokker-Planck equation.
(Danilo Jimenez Rezende and Shakir Mohamed, Variational inference with normalizing flows, ICML, 2015.)
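A generic sketch of an invertible push-forward map $T_\theta$ built from affine coupling layers (RealNVP-style). This is only one possible choice of normalizing flow, shown for illustration; it is not the architecture used in the cited work, and in practice the transformed coordinates would be permuted between layers.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Invertible map: keeps the first half of the coordinates and applies an
    affine transform to the second half, conditioned on the first half."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.d = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)))

    def forward(self, x):                        # z -> T(z)
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.net(x1).chunk(2, dim=1)
        y2 = x2 * torch.exp(torch.tanh(s)) + t   # tanh keeps the scale bounded
        return torch.cat([x1, y2], dim=1)

    def inverse(self, y):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        s, t = self.net(y1).chunk(2, dim=1)
        x2 = (y2 - t) * torch.exp(-torch.tanh(s))
        return torch.cat([y1, x2], dim=1)

# T_theta as a composition of coupling layers; samples of rho_theta = T_theta # p
# are obtained by pushing a reference Gaussian through the map.
flow = nn.Sequential(AffineCoupling(4), AffineCoupling(4))
z = torch.randn(128, 4)
samples = flow(z)
```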

Slide 82

Slide 82 text

Algorithm. Goal: propose a variational scheme for solving $\dot\theta_t = -G(\theta_t)^{-1}\nabla_\theta H(\theta_t)$.
- Start with a semi-implicit scheme:
$$\frac{\theta^{k+1}-\theta^k}{h} = -G^{-1}(\theta^k)\nabla_\theta H(\theta^{k+1}).$$
- Associate it with a proximal-type algorithm:
$$\theta^{k+1} = \operatorname*{argmin}_\theta\ \big\{\langle\theta-\theta^k,\ G(\theta^k)(\theta-\theta^k)\rangle + 2hH(\theta)\big\}.$$
- We use the following approximation:
$$\langle\theta-\theta^k,\ G(\theta^k)(\theta-\theta^k)\rangle \approx \max_\psi\Big\{\int_M\Big(2\nabla\psi(x)\cdot\big((T_\theta-T_{\theta^k})\circ T_{\theta^k}^{-1}(x)\big) - |\nabla\psi(x)|^2\Big)\rho_{\theta^k}(x)\,dx\Big\}.$$
- Thus we arrive at the following variational saddle scheme:
$$\theta^{k+1} = \operatorname*{argmin}_\theta\max_\psi\Big\{\int_M\Big(2\nabla\psi(x)\cdot\big((T_\theta-T_{\theta^k})\circ T_{\theta^k}^{-1}(x)\big) - |\nabla\psi(x)|^2\Big)\rho_{\theta^k}(x)\,dx + 2hH(\theta)\Big\}.$$
In our implementation, we choose $\psi$ as a ReLU network and optimize over its parameters.

Slide 83

Slide 83 text

Algorithm. Algorithm 1: solving the parametric FPE $\dot\theta = -G(\theta)^{-1}\nabla_\theta F(\theta)$.
Initialize $\theta^0$.
For $k = 0, 1, \ldots, N$:
  Set $\theta \leftarrow \theta^k$.
  For $j = 1, 2, \ldots, M$:
    Apply stochastic gradient descent to solve
$$\inf_\psi\Big\{\int\big|T_\theta(x) - T_{\theta^k}(x) - \nabla\psi(T_{\theta^k}(x))\big|^2\,dp(x)\Big\};$$
    Update $\theta \leftarrow \theta - \text{stepsize}\cdot\nabla_\theta J(\theta,\theta^k)$, where $J(\theta,\tilde\theta)$ is defined as
$$J(\theta,\tilde\theta) = \int T_\theta(x)\cdot\nabla\psi(T_{\tilde\theta}(x))\,dp(x) + hF(\theta).$$
  Set $\theta^{k+1} \leftarrow \theta$.

Slide 84

Slide 84 text

Algorithm.
- Our algorithm is convenient in high dimensional cases: all the integrals can be approximated by the Monte Carlo method; derivatives w.r.t. the parameters of the networks can be computed via back propagation.
- An interesting observation: our new scheme can be treated as a modified version of the JKO scheme for the Wasserstein gradient flow of the functional $H$:
$$\rho_{k+1} = \operatorname*{argmin}_\rho\Big\{\frac{1}{2h}W_2^2(\rho,\rho_k) + H(\rho)\Big\}\quad\Longleftrightarrow\quad \dot\rho = -\mathrm{grad}_W H(\rho).$$
- In our variational scheme, we replace $W_2^2(\rho,\rho_k)$ with
$$d^2(\rho_\theta,\rho_{\theta_k}) = \max_\psi\Big\{\int_M\Big(2\nabla\psi(x)\cdot\big((T_\theta-T_{\theta_k})\circ T_{\theta_k}^{-1}(x)\big) - |\nabla\psi(x)|^2\Big)\rho_{\theta_k}(x)\,dx\Big\}.$$
(Richard Jordan, David Kinderlehrer, and Felix Otto, The variational formulation of the Fokker-Planck equation, SIAM Journal on Mathematical Analysis, 1998.)

Slide 85

Slide 85 text

Numerical examples. Now we try the FPE with a more complicated potential in higher dimensions. Assume we are in $\mathbb{R}^{30}$, $\beta = 1$, and we set
$$V(x) = \frac{1}{50}\sum_{i=1}^{30}\big(x_i^4 - 16x_i^2 + 5x_i\big).$$
Figure: a plot of this potential in 2-dimensional space.
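For reference, the potential as reconstructed above and its gradient, written out in NumPy (usable, for example, with the Langevin sketch given earlier):

```python
import numpy as np

def V(x):
    """V(x) = (1/50) * sum_i (x_i^4 - 16 x_i^2 + 5 x_i); x has shape (..., 30)."""
    return np.sum(x**4 - 16 * x**2 + 5 * x, axis=-1) / 50.0

def grad_V(x):
    """Gradient of V, computed componentwise."""
    return (4 * x**3 - 32 * x + 5) / 50.0
```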

Slide 86

Slide 86 text

Numerical examples. We let the time step size $h = 0.005$. Since we cannot plot sample points in 30-dimensional space, we project the samples in $\mathbb{R}^{30}$ to a two-dimensional subspace, say the plane spanned by the 5th and 15th components. Here are the numerical results:

Slide 87

Slide 87 text


Slide 88

Slide 88 text


Slide 89

Slide 89 text


Slide 90

Slide 90 text


Slide 91

Slide 91 text


Slide 92

Slide 92 text

Numerical examples. Figure: density estimation of the sample points at the 360th iteration.

Slide 93

Slide 93 text

Numerical examples. We can verify the graph of $\psi$ trained at the end of each outer iteration. Here are the graphs of $\psi$ at the 30th, 60th, 150th, 240th, 300th and 360th iterations. Figure: graph of $\psi$ at the $k=30$th time step. Figure: graph of $\psi$ at the $k=60$th time step. Figure: graph of $\psi$ at the $k=150$th time step.

Slide 94

Slide 94 text

Numerical examples (continued). Figure: graph of $\psi$ at the $k=240$th time step. Figure: graph of $\psi$ at the $k=300$th time step. Figure: graph of $\psi$ at the $k=360$th time step.

Slide 95

Slide 95 text

Numerical analysis. Denote by $\{\rho_t\}$ the solution to the original Fokker-Planck equation, by $\{\theta_t\}$ the solution to the parametric Fokker-Planck equation $\dot\theta_t = -G(\theta_t)^{-1}\nabla_\theta H(\theta_t)$, and set $\rho_{\theta_t} = T_{\theta_t\#}p$.
- Convergence analysis: we provide an upper bound for $D_{\mathrm{KL}}(\rho_{\theta_t}\|\rho_*)$. Here $\rho_* = \frac{1}{Z}\exp(-V/\beta)$ is the Gibbs distribution (equilibrium solution of the Fokker-Planck equation).
- Error analysis: we provide a uniform bound for $W_2(\rho_{\theta_t},\rho_t)$.

Slide 96

Slide 96 text

Numerical analysis. Before we introduce our results on numerical analysis, we first introduce an important quantity that plays an essential role in our results. Recall $\psi$ as defined in (1) in Theorem 1; we set
$$\delta_0 = \sup_{\theta\in\Theta}\ \min_{\xi\in T_\theta\Theta}\ \Big\{\int\big|\nabla\psi(T_\theta(x))^T\xi - \nabla\big(V + \beta\log\rho_\theta\big)\circ T_\theta(x)\big|^2\,dp(x)\Big\}.$$
One can treat $\delta_0$ as a quantification of the approximation power of the push-forward family $\{T_\theta\}_{\theta\in\Theta}$.

Slide 97

Slide 97 text

Convergence analysis. Theorem. Consider the Fokker-Planck equation with smooth $V$ and $\nabla^2 V\succeq 0$ outside a finite ball. Suppose $\{\theta_t\}$ solves $\dot\theta_t = -G(\theta_t)^{-1}\nabla_\theta H(\theta_t)$. Let $\rho_*(x) = \frac{1}{Z}e^{-V(x)/\beta}$ be the Gibbs distribution of the original Fokker-Planck equation. Then we have the inequality
$$\mathrm{KL}(\rho_{\theta_t}\|\rho_*) \le \frac{\delta_0}{\tilde\lambda\beta^2}\big(1 - e^{-\tilde\lambda\beta t}\big) + \mathrm{KL}(\rho_{\theta_0}\|\rho_*)\,e^{-\tilde\lambda\beta t}.\qquad (2)$$
Here $\tilde\lambda>0$ is a constant originating from the Log-Sobolev inequality involving $\rho_*$, i.e.
$$\mathrm{KL}(\rho\|\rho_*) \le \frac{1}{\tilde\lambda}I(\rho|\rho_*)$$
holds for any probability density $\rho$, where $I(\rho|\rho_*) = \int\big|\nabla\log\big(\tfrac{\rho}{\rho_*}\big)\big|^2\rho\,dx$.

Slide 98

Slide 98 text

Error analysis. Theorem. Assume $\{\theta_t\}$ solves $\dot\theta_t = -G(\theta_t)^{-1}\nabla_\theta H(\theta_t)$ and $\{\rho_t\}$ solves the Fokker-Planck equation. Assume that the Hessian of the potential function $V$ is bounded below by a constant $\lambda$, i.e. $\nabla^2 V\succeq\lambda I$. Then
$$W_2(\rho_{\theta_t},\rho_t) \le \frac{\sqrt{\delta_0}}{\lambda}\big(1 - e^{-\lambda t}\big) + e^{-\lambda t}\,W_2(\rho_{\theta_0},\rho_0).\qquad (3)$$
When $V$ is strictly convex, i.e. $\lambda>0$, inequality (3) provides a uniformly small upper bound for the error $W_2(\rho_{\theta_t},\rho_t)$; however, in many cases $\lambda<0$, and then the right hand side of (3) increases to infinity as $t\to\infty$.

Slide 99

Slide 99 text

Error analysis (improved version). Theorem. The $W_2$ error $W_2(\rho_{\theta_t},\rho_t)$ at any time $t>0$ can be uniformly bounded by a constant depending on $E_0 = W_2(\rho_{\theta_0},\rho_0)$ and $\delta_0$. To be more precise:
1. When $\lambda\ge 0$, the error $W_2(\rho_{\theta_t},\rho_t)$ can be uniformly bounded by an $O(E_0 + \sqrt{\delta_0})$ term;
2. When $\lambda<0$, the error $W_2(\rho_{\theta_t},\rho_t)$ can be uniformly bounded by an $O\big((E_0 + \sqrt{\delta_0})^{\frac{\tilde\lambda\beta}{2|\lambda|+\tilde\lambda\beta}}\big)$ term.

Slide 100

Slide 100 text

Summary and future research.
Summary: solving high-dimensional Fokker-Planck equations.
- Dimension reduction: PDE to ODE via implicit generative models.
- Variational approach for the semi-implicit Euler scheme update.
- Approximation and numerical error bounds.
Future research:
- Application to sampling techniques for time-varying problems.
- Application to other types of dissipative PDEs.
- Application to global optimization.

Slide 101

Slide 101 text
