Transport information geometry: current and future

Wuchen Li
May 02, 2020

In this talk, I will give a brief overview of the newly emerged area of transport information geometry. We quickly review information geometry, which emphasizes the geometry of learning models and theoretical AI, built on invariance under transformations of the hypotheses and their parametrizations. We then review the geometry of optimal transport, which starts from the geometry of the data/sample space and connects to equations of mathematical physics. Applying the information-geometric viewpoint to optimal transport yields what we call transport information geometry. Several current developments and future plans will be presented, along with many open problems in this direction.


Transcript

  1. Transport information geometry: Current and Future. Wuchen Li (UCLA / U of SC), IPAM, April 6, 2020.
  2. Divergences on probability space.

  3. Applications of divergences.

  4. Information matrix. The information matrix (a.k.a. Fisher information matrix, Fisher–Rao metric) plays important roles in estimation, information science, statistics, and machine learning:
     - Statistics and information science: likelihood principle; Cramer–Rao bound, etc.
     - Statistical physics.
     - Population games (mean field games) via replicator dynamics (Shahshahani, Smith), with applications in biology, economics, cancer, epidemic propagation, etc.
     - AI inference problems: f-divergence, Amari $\alpha$-divergence, etc.
     - Information-geometric optimization methods (KFAC, ADAM), with applications in AI: natural gradient (Amari); ADAM (Kingma 2014); stochastic relaxation (Malago); and many more in the book Information Geometry (Ay et al.).
  5. Probability models.

  6. Learning: objective function. Given a data measure $\rho_{\mathrm{data}}(x) = \frac{1}{N}\sum_{i=1}^N \delta_{X_i}(x)$ and a parameterized model $\rho(x,\theta)$, most problems in AI or sampling refer to $\min_{\rho_\theta \in \rho(\Theta)} D(\rho_{\mathrm{data}}, \rho_\theta)$. One typical choice of $D$ is the Kullback–Leibler divergence (relative entropy): $D(\rho_{\mathrm{data}}, \rho_\theta) = \int_\Omega \rho_{\mathrm{data}}(x)\log\frac{\rho_{\mathrm{data}}(x)}{\rho(x,\theta)}\,dx$.
  7. Learning: gradient direction. The natural gradient method refers to $\theta^{k+1} = \theta^k - h\,G_F(\theta^k)^{-1}\nabla_\theta D(\rho_{\mathrm{data}},\rho_{\theta^k})$, where $h > 0$ is a stepsize and $G_F(\theta) = \mathbb{E}_{X\sim\rho_\theta}\big[(\nabla_\theta\log\rho(X,\theta))^T(\nabla_\theta\log\rho(X,\theta))\big]$ is the Fisher information matrix. Why the natural gradient and Fisher information matrix? Parameterization invariance; preconditioning for KL-divergence-related learning problems; Fisher efficiency; online Cramer–Rao bound.
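As a concrete illustration of the natural gradient update above, here is a minimal Python sketch (our construction, not from the talk) for a 1-d Gaussian model $p(x;\mu,\sigma)$, where both $G_F$ and the KL gradient have closed forms; the target parameters, step size, and iteration count are illustrative choices.

```python
import numpy as np

def fisher_gaussian(mu, sigma):
    # Closed-form Fisher information matrix of N(mu, sigma^2): diag(1/s^2, 2/s^2)
    return np.array([[1.0 / sigma**2, 0.0],
                     [0.0, 2.0 / sigma**2]])

def kl_grad(theta, data_mu, data_sigma):
    # Euclidean gradient of KL(p_data || p_theta) for Gaussians (closed form)
    mu, sigma = theta
    dmu = (mu - data_mu) / sigma**2
    dsigma = 1.0 / sigma - (data_sigma**2 + (data_mu - mu)**2) / sigma**3
    return np.array([dmu, dsigma])

def natural_gradient_step(theta, h, data_mu, data_sigma):
    # theta^{k+1} = theta^k - h * G_F(theta^k)^{-1} * grad_theta KL
    G = fisher_gaussian(theta[0], theta[1])
    return theta - h * np.linalg.solve(G, kl_grad(theta, data_mu, data_sigma))

theta = np.array([0.0, 2.0])        # initial (mu, sigma)
for _ in range(100):
    theta = natural_gradient_step(theta, 0.5, data_mu=1.0, data_sigma=1.0)
print(theta)  # converges to the data parameters (1.0, 1.0)
```

The preconditioning by $G_F^{-1}$ makes the step sizes in $\mu$ and $\sigma$ adapt to the local curvature of the KL divergence.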
  8. Optimal transport. In recent years, optimal transport (a.k.a. Earth mover's distance, Monge–Kantorovich problem, Wasserstein distance) has witnessed many applications in AI: theory (Brenier, Gangbo, Villani, Figalli, et al.); gradient flows (Otto, Villani, et al.); proximal methods (Jordan, Kinderlehrer, Otto, et al.); mean field games (Lasry, Lions, et al.); population games via Fokker–Planck equations (Degond et al. 2014, Li et al. 2016); image retrieval (Rubner et al. 2000); Wasserstein of Wasserstein loss (Dukler, Li, Lin, Montufar); inverse problems (Stuart, Zhou, et al.; Ding, Li, Yin, Osher); scientific computing (Benamou, Brenier, Carrillo, Oberman, Peyre, Solomon, Osher, Li, Yin, et al.); AI and machine learning: Wasserstein training of Boltzmann machines (Cuturi et al. 2015), learning with a Wasserstein loss (Frogner et al. 2015), Wasserstein GAN (Bottou et al. 2017); see NeurIPS, ICLR, ICML 2015–2020.
  9. Why optimal transport? Optimal transport provides a particular distance ($W$) among histograms, which relies on the distance on the sample space (ground cost $c$). E.g., denote $X_0\sim\rho_0 = \delta_{x_0}$, $X_1\sim\rho_1 = \delta_{x_1}$. Compare $W(\rho_0,\rho_1) = \inf_{\pi\in\Pi(\rho_0,\rho_1)}\mathbb{E}_{(X_0,X_1)\sim\pi}\,c(X_0,X_1) = c(x_0,x_1)$; versus $\mathrm{TV}(\rho_0,\rho_1) = \int_\Omega|\rho_0(x)-\rho_1(x)|\,dx = 2$; versus $\mathrm{KL}(\rho_0\|\rho_1) = \int_\Omega\rho_0(x)\log\frac{\rho_0(x)}{\rho_1(x)}\,dx = \infty$.
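The comparison on this slide can be checked numerically. The sketch below (our construction, using SciPy's `wasserstein_distance` for the 1-d case) discretizes two point masses and confirms that $W$ tracks the ground distance while TV saturates at 2; KL is infinite for disjoint supports, so it is only noted in a comment.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two point masses at x0 = 0 and x1 = 3.
x0, x1 = 0.0, 3.0
w = wasserstein_distance([x0], [x1])   # = |x0 - x1| = 3: W sees the ground cost

# Discretized densities on the two-point support {x0, x1}.
p = np.array([1.0, 0.0])               # delta at x0
q = np.array([0.0, 1.0])               # delta at x1
tv = np.abs(p - q).sum()               # = 2, regardless of how far apart x0, x1 are
# KL(p || q) = sum p * log(p/q) = +infinity here (disjoint supports).
print(w, tv)
```

Moving $x_1$ further away changes $W$ but leaves TV (and the infinite KL) unchanged, which is the point of the slide.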
  10. Learning: Wasserstein objective/loss function. Given a data distribution $\rho_0$ and a probability model $\rho_1(\theta)$, consider $\min_{\theta\in\Theta} W(\rho_0,\rho_1(\theta))$. This is a double minimization problem, i.e. $W(\rho_0,\rho_1(\theta)) = \min_{\pi\in\Pi(\rho_0,\rho_1(\theta))}\mathbb{E}_{(X_0,X_1)\sim\pi}\,c(X_0,X_1)$. Many applications, such as Wasserstein GAN and the Wasserstein loss, are built on this formulation.
  11. Learning: Wasserstein gradient direction. Main question: besides looking at the Wasserstein/transport distance, I propose to work on the optimal-transport-induced information matrices and geometric structures in probability models. Literature review: Wasserstein gradient flow: Otto (2001); natural gradient works efficiently in learning: Amari (1998); constrained gradient: Carlen, Gangbo (2003); linear programming: Wong (2017), Amari, Karakida, Oizumi (2017); Gaussian measures: Takatsu (2011), Malago et al. (2017), Modin (2017), Yongxin et al. (2017), Sanctis (2017); and many more.
  12. Scope of transport information.

  13. Problem formulation. Mapping formulation: Monge problem (1781), Monge–Ampère equation; static formulation: Kantorovich problem (1940), linear programming; dynamical formulation: density optimal control (Nelson, Lafferty, Benamou, Brenier, Otto, Villani, etc.). In this talk, we apply density optimal control to learning problems.
  14. Density manifold. Optimal transport has an optimal control reformulation via the dual of the dual of linear programming: $\inf_{\rho_t}\int_0^1 g_W(\partial_t\rho_t,\partial_t\rho_t)\,dt = \int_0^1\int_\Omega(\nabla\Phi_t,\nabla\Phi_t)\,\rho_t\,dx\,dt$, under the dynamical constraint, i.e. the continuity equation $\partial_t\rho_t + \nabla\cdot(\rho_t\nabla\Phi_t) = 0$, with the endpoints $\rho_0$, $\rho_1$ fixed. Here $(\mathcal{P}(\Omega), g_W)$ forms an infinite-dimensional Riemannian manifold (John D. Lafferty, The density manifold and configuration space quantization, 1988).
  15. Topics. (O) Transport information geometry and Gamma calculus (Li); (I) transport information matrix (Li, Zhao); (II) Wasserstein natural gradient/proximal (Li, Montufar, Chen, Lin, Abel, Osher, etc.); (III) neural Fokker–Planck equation (Li, Liu, Zha, Zhou); (IV) transport information optimization methods (Li, Wang).
  16. Part I: Transport information statistics. Main results: Li, Zhao, Wasserstein information matrix, 2019; Amari, Wasserstein statistics in 1D location-scale models, March 2020. Related studies: Wasserstein covariance (Petersen, Muller); Wasserstein minimum distance estimator (Bernton, Jacob, Gerber, Robert); statistical inference for generative models with maximum mean discrepancy (Briol, Barp, Duncan, Girolami).
  17. Information matrix and pull-back.

  18. Statistical information matrix. Definition (statistical information matrix). Consider the density manifold $(\mathcal{P}(\mathcal{X}), g)$ with a metric tensor $g$, and a smoothly parametrized statistical model $p_\theta$ with parameter $\theta\in\Theta\subset\mathbb{R}^d$. Then the pull-back $G$ of $g$ onto the parameter space $\Theta$ is given by $G(\theta) = \big\langle\nabla_\theta p_\theta,\; g(p_\theta)\nabla_\theta p_\theta\big\rangle$. Denote $G(\theta) = (G(\theta)_{ij})_{1\le i,j\le d}$; then $G(\theta)_{ij} = \int_{\mathcal{X}}\frac{\partial}{\partial\theta_i}p(x;\theta)\,\big(g(p_\theta)\frac{\partial}{\partial\theta_j}p\big)(x;\theta)\,dx$. Here we name $g$ the statistical metric and call $G$ the statistical information matrix.
  19. Statistical information matrix. Definition (score functions). Denote $\Phi_i : \mathcal{X}\times\Theta\to\mathbb{R}$, $i = 1,\dots,n$, satisfying $\Phi_i(x;\theta) = \big(g(p_\theta)\frac{\partial}{\partial\theta_i}p\big)(x;\theta)$. They are the score functions associated with the statistical information matrix $G$ and are equivalence classes in $C(\mathcal{X})/\mathbb{R}$. The representatives in the equivalence classes are determined by the normalization condition $\mathbb{E}_{p_\theta}\Phi_i = 0$, $i = 1,\dots,n$. Then the statistical information matrix satisfies $G(\theta)_{ij} = \int_{\mathcal{X}}\Phi_i(x;\theta)\,\big(g(p_\theta)^{-1}\Phi_j\big)(x;\theta)\,dx$.
  20. Examples: Fisher information matrix. Consider $g(p)^{-1} = p$. Then $G_F(\theta)_{ij} = \mathbb{E}_{X\sim p_\theta}\big[\Phi_i(X;\theta)\,\Phi_j(X;\theta)\big]$, where $\Phi_i = \frac{1}{p_\theta}\nabla_{\theta_i}p_\theta$ and $\Phi_j = \frac{1}{p_\theta}\nabla_{\theta_j}p_\theta$. Noticing the fact $\frac{1}{p}\nabla_\theta p = \nabla_\theta\log p$, we obtain $G_F(\theta)_{ij} = \mathbb{E}_{X\sim p_\theta}\big[\frac{\partial}{\partial\theta_i}\log p(X;\theta)\,\frac{\partial}{\partial\theta_j}\log p(X;\theta)\big]$. In the literature, $G_F(\theta)$ is known as the Fisher information matrix and $\nabla_\theta\log p(X;\theta)$ is named the score function.
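The expectation formula above is easy to check by Monte Carlo. The sketch below (our construction) estimates $G_F$ for the Gaussian family, where the closed form is $\mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)$; the sample size and seed are arbitrary.

```python
import numpy as np

# Monte Carlo check of G_F(theta)_{ij} = E[ d_i log p * d_j log p ] for N(mu, sigma^2).
rng = np.random.default_rng(0)
mu, sigma = 0.0, 2.0
x = rng.normal(mu, sigma, size=200_000)

# Score components: d/dmu log p = (x-mu)/sigma^2,
#                   d/dsigma log p = (x-mu)^2/sigma^3 - 1/sigma
s = np.stack([(x - mu) / sigma**2,
              (x - mu)**2 / sigma**3 - 1.0 / sigma])
G_mc = s @ s.T / x.size                       # empirical outer-product average
G_exact = np.array([[1 / sigma**2, 0.0],
                    [0.0, 2 / sigma**2]])
print(np.round(G_mc, 3))                      # close to G_exact
```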
  21. Examples: Wasserstein information matrix. Consider $g(p)^{-1} = \nabla\cdot(p\nabla)$: $G_W(\theta)_{ij} = \mathbb{E}_{X\sim p_\theta}\big[\langle\nabla_x\Phi_i^W(X;\theta),\,\nabla_x\Phi_j^W(X;\theta)\rangle\big]$, where $\nabla_x\cdot(p_\theta\nabla_x\Phi_i^W) = \nabla_{\theta_i}p_\theta$ and $\nabla_x\cdot(p_\theta\nabla_x\Phi_j^W) = \nabla_{\theta_j}p_\theta$. Here we call $G_W(\theta)$ the Wasserstein information matrix and name $\Phi^W$ the Wasserstein score function.
  22. Distance and information matrix. Specifically, given a smooth family of probability densities $p(x;\theta)$ and a perturbation $\delta\theta\in T_\theta\Theta$, consider the following Taylor expansions in terms of $\delta\theta$: $\mathrm{KL}(p(\theta)\,\|\,p(\theta+\delta\theta)) = \frac{1}{2}\delta\theta^T G_F(\theta)\,\delta\theta + o(|\delta\theta|^2)$, and $W_2(p(\theta+\delta\theta), p(\theta))^2 = \delta\theta^T G_W(\theta)\,\delta\theta + o(|\delta\theta|^2)$.
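The KL expansion above can be verified numerically for Gaussians, where KL has a closed form; the base point and the small perturbation below are our own illustrative choices.

```python
import numpy as np

# Check KL(p_theta || p_{theta + d}) ≈ 0.5 * d^T G_F(theta) d for small d, Gaussian family.
def kl_gauss(m0, s0, m1, s1):
    # Closed-form KL( N(m0, s0^2) || N(m1, s1^2) )
    return np.log(s1 / s0) + (s0**2 + (m0 - m1)**2) / (2 * s1**2) - 0.5

mu, sigma = 0.0, 1.5
G_F = np.array([[1 / sigma**2, 0.0],
                [0.0, 2 / sigma**2]])
d = np.array([1e-3, -2e-3])                   # small perturbation of (mu, sigma)
kl = kl_gauss(mu, sigma, mu + d[0], sigma + d[1])
quad = 0.5 * d @ G_F @ d
print(kl, quad)                               # agree to leading order in |d|^2
```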
  23. Poisson equation. The Wasserstein score functions $\Phi_i^W(x;\theta)$ satisfy the following Poisson equation: $\nabla_x\log p(x;\theta)\cdot\nabla_x\Phi_i^W(x;\theta) + \Delta_x\Phi_i^W(x;\theta) = \frac{\partial}{\partial\theta_i}\log p(x;\theta)$.
  24. Separability. If $p(x;\theta)$ is an independence model, i.e. $p(x;\theta) = \prod_{k=1}^n p_k(x_k;\theta)$ with $x = (x_1,\dots,x_n)$, then there exists a set of one-dimensional functions $\Phi^{W,k} : \mathcal{X}_k\times\Theta_k\to\mathbb{R}$ such that $\Phi^W(x;\theta) = \sum_{k=1}^n\Phi^{W,k}(x_k;\theta)$. In addition, the Wasserstein information matrix is separable: $G_W(\theta) = \sum_{k=1}^n G_W^k(\theta)$, where $G_W^k(\theta) = \mathbb{E}_{X\sim p_\theta}\big[\langle\nabla_{x_k}\Phi^{W,k}(X;\theta),\,\nabla_{x_k}\Phi^{W,k}(X;\theta)\rangle\big]$.
  25. One-dimensional sample space. If $\mathcal{X}\subset\mathbb{R}^1$, the Wasserstein score functions satisfy $\Phi_i^W(x;\theta) = \int^x\frac{1}{p(z;\theta)}\frac{\partial}{\partial\theta_i}F(z;\theta)\,dz$, where $F(x;\theta) = \int^x p(y;\theta)\,dy$ is the cumulative distribution function. And the Wasserstein information matrix satisfies $G_W(\theta)_{ij} = \mathbb{E}_{X\sim p_\theta}\Big[\frac{\frac{\partial}{\partial\theta_i}F(X;\theta)\,\frac{\partial}{\partial\theta_j}F(X;\theta)}{p(X;\theta)^2}\Big]$ (Chen, Li, Wasserstein natural gradient in continuous sample space, 2018).
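A numerical sanity check (our sketch) of this one-dimensional formula: for the Gaussian location-scale family, the table on a later slide predicts $G_W(\mu,\sigma) = I$, and simple quadrature of the CDF-based integrand reproduces this. Grid size and parameters are arbitrary choices.

```python
import numpy as np

# Quadrature check of G_W(theta)_{ij} = E[ d_i F * d_j F / p^2 ] for N(mu, sigma^2).
mu, sigma = 1.0, 2.0
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 20001)
dx = x[1] - x[0]
z = (x - mu) / sigma
phi = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)  # standard normal pdf
p = phi / sigma                               # density of N(mu, sigma^2)
dF_dmu = -phi / sigma                         # d/dmu of F = Phi((x-mu)/sigma)
dF_dsigma = -z * phi / sigma                  # d/dsigma of F

def gw(a, b):
    # E_p[(a/p)(b/p)] = integral of a*b/p dx, by Riemann sum on the grid
    return np.sum(a * b / p) * dx

G = np.array([[gw(dF_dmu, dF_dmu), gw(dF_dmu, dF_dsigma)],
              [gw(dF_dsigma, dF_dmu), gw(dF_dsigma, dF_dsigma)]])
print(np.round(G, 4))                         # ≈ identity matrix
```

Note that the additive normalization of the Wasserstein score does not enter $G_W$, since only derivatives of $\Phi^W$ appear.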
  26. Remark. The Wasserstein score function is the average of the cumulative Fisher score function, and the Wasserstein information matrix is the covariance of the density average of the cumulative Fisher score function.
  27. Analytic examples. Classical statistical models. Gaussian family: $p(x;\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, $G_W(\mu,\sigma) = \begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}$. Laplacian family: $p(x;m,\lambda) = \frac{\lambda}{2}e^{-\lambda|x-m|}$, $G_W(m,\lambda) = \begin{pmatrix}1 & 0\\ 0 & 2/\lambda^4\end{pmatrix}$.
  28. Analytic examples. Generative models. Continuous 1-d generative family: $p(\cdot;\theta) = f_{\theta\#}p_0(\cdot)$, with $p_0$ a given distribution; $G_W(\theta) = \int|\nabla_\theta f_\theta(x)|^2\,p_0(x)\,dx$, where the push-forward distribution is defined by $\int_A f_{\theta\#}p_0\,dx = \int_{f_\theta^{-1}(A)}p_0\,dx$.
  29. Analytic examples. Generative models with the ReLU family: $f_\theta(x) = \sigma(x-\theta) = \begin{cases}0, & x\le\theta,\\ x-\theta, & x>\theta,\end{cases}$ with $G_W(\theta) = F_0(\theta)$, where $F_0$ is the cumulative distribution function of $p_0$. Figure: two examples of the push-forward family described above, with parameters $\theta_1 = 3$, $\theta_2 = 5$.
  30. Statistical information matrix table. Wasserstein vs. Fisher information matrices for various probability families:
     - Uniform $p(x;a,b) = \frac{1}{b-a}\mathbf{1}_{(a,b)}(x)$: $G_W(a,b) = \frac{1}{3}\begin{pmatrix}1 & 1/2\\ 1/2 & 1\end{pmatrix}$; $G_F(a,b)$ not well-defined.
     - Gaussian $p(x;\mu,\sigma) = \frac{e^{-(x-\mu)^2/2\sigma^2}}{\sqrt{2\pi}\,\sigma}$: $G_W(\mu,\sigma) = \begin{pmatrix}1&0\\0&1\end{pmatrix}$; $G_F(\mu,\sigma) = \begin{pmatrix}1/\sigma^2&0\\0&2/\sigma^2\end{pmatrix}$.
     - Exponential $p(x;m,\lambda) = \lambda e^{-\lambda(x-m)}$: $G_W(m,\lambda) = \begin{pmatrix}1&0\\0&2/\lambda^4\end{pmatrix}$; $G_F(m,\lambda)$ not well-defined.
     - Laplacian $p(x;m,\lambda) = \frac{\lambda}{2}e^{-\lambda|x-m|}$: $G_W(m,\lambda) = \begin{pmatrix}1&0\\0&2/\lambda^4\end{pmatrix}$; $G_F(m,\lambda) = \begin{pmatrix}\lambda^2&0\\0&1/\lambda^2\end{pmatrix}$.
     - Location-scale $p(x;m,\sigma) = \frac{1}{\sigma}p\big(\frac{x-m}{\sigma}\big)$: $G_W(\sigma,m) = \begin{pmatrix}\frac{\mathbb{E}_{\sigma,m}x^2 - 2m\,\mathbb{E}_{\sigma,m}x + m^2}{\sigma^2} & 0\\ 0 & 1\end{pmatrix}$; $G_F(\sigma,m) = \begin{pmatrix}\frac{1}{\sigma^2}\big(1 + \int_{\mathbb{R}}\big(\frac{(x-m)^2 p'^2}{\sigma^2 p} + (x-m)p'\big)dx\big) & \int_{\mathbb{R}}\frac{(x-m)p'^2}{\sigma^3 p}\,dx\\ \int_{\mathbb{R}}\frac{(x-m)p'^2}{\sigma^3 p}\,dx & \frac{1}{\sigma^2}\int_{\mathbb{R}}\frac{p'^2}{p}\,dx\end{pmatrix}$.
     - Independent $p(x,y;\theta) = p(x;\theta)p(y;\theta)$: $G_W(x,y;\theta) = G_W^1(x;\theta) + G_W^2(y;\theta)$; $G_F(x,y;\theta) = G_F^1(x;\theta) + G_F^2(y;\theta)$.
     - ReLU push-forward $p(x;\theta) = f_{\theta\#}p(x)$, $f_\theta$ $\theta$-parameterized ReLUs: $G_W(\theta) = F(\theta)$, $F$ the cdf of $p(x)$; $G_F(\theta)$ not well-defined.
     Table: in this table, we present the Wasserstein and Fisher information matrices for various probability families.
  31. Classical (Fisher) statistics & Wasserstein statistics. We develop a parallel Wasserstein statistics following the classical statistics approach: covariance (inner product) → Wasserstein covariance; Cramer–Rao bound (cotangent space) → Wasserstein–Cramer–Rao bound; natural gradient efficiency (separability) → Wasserstein natural gradient efficiency.
  32. Wasserstein covariance. Definition (Wasserstein covariance). Given a statistical model $\Theta$, define the Wasserstein covariance $\mathrm{Cov}^W_\theta[T_1,T_2] = \mathbb{E}_{X\sim p_\theta}\big[\langle\nabla_x T_1(X),\,\nabla_x T_2(X)\rangle\big]$, where $T_1, T_2$ are random variables as functions of $X$ and the expectation is taken w.r.t. $X\sim p_\theta$. Denote the Wasserstein variance $\mathrm{Var}^W_\theta[T] = \mathbb{E}_{X\sim p_\theta}\big[\langle\nabla_x T(X),\,\nabla_x T(X)\rangle\big]$.
  33. Wasserstein–Cramer–Rao bound. Theorem (Wasserstein–Cramer–Rao inequality). Given any set of statistics $T = (T_1,\dots,T_n) : \mathcal{X}\to\mathbb{R}^n$, where $n$ is the number of statistics, define the two matrices $\mathrm{Cov}^W_\theta[T(x)]_{ij} = \mathrm{Cov}^W_\theta[T_i,T_j]$ and $\big(\nabla_\theta\mathbb{E}_{p_\theta}[T(x)]^T\big)_{ij} = \frac{\partial}{\partial\theta_j}\mathbb{E}_{p_\theta}[T_i(x)]$. Then $\mathrm{Cov}^W_\theta[T(x)]\;\succeq\;\nabla_\theta\mathbb{E}_{p_\theta}[T(x)]^T\,G_W(\theta)^{-1}\,\nabla_\theta\mathbb{E}_{p_\theta}[T(x)]$.
  34. Cramer–Rao bound: Fisher vs. Wasserstein. Gaussian: $G_W(\mu,\sigma) = \begin{pmatrix}1&0\\0&1\end{pmatrix}$, $G_F(\mu,\sigma) = \begin{pmatrix}1/\sigma^2&0\\0&2/\sigma^2\end{pmatrix}$. Laplacian: $G_W(m,\lambda) = \begin{pmatrix}1&0\\0&2/\lambda^4\end{pmatrix}$, $G_F$ not well-defined. Comparison: $G_W$ is well-defined for a wider range of families and can give a tighter bound on the variance of an estimator.
  35. Information functional inequalities. Comparison between Fisher and Wasserstein information matrices relates to famous information functional inequalities. Here we study them in parametric statistics. Dissipation of entropy & Fisher information functional: $\frac{d}{dt}H(\mu|\nu) = -\int_{\mathcal{X}}\big|\nabla_x\log\frac{\mu(x)}{\nu(x)}\big|^2\mu(x)\,dx = -I(\mu|\nu)$; in parameters, $\frac{d}{dt}\tilde H(p_\theta|p_{\theta^*}) = -\nabla_\theta\tilde H^T\,\tilde G_W^{-1}\,\nabla_\theta\tilde H = -\tilde I(p_\theta|p_{\theta^*})$. Log-Sobolev inequality (LSI): $H(\mu|\nu)\le\frac{1}{2\alpha}I(\mu|\nu)$ for $\mu\in\mathcal{P}(\mathcal{X})$; $\tilde H(p_\theta|p_{\theta^*})\le\frac{1}{2\alpha}\tilde I(p_\theta|p_{\theta^*})$ for $\theta\in\Theta$.
  36. Functional inequalities via Wasserstein and Fisher information matrices. Theorem (RIW condition; Li, Geometry of probability simplex via optimal transport, 2018; Li, Montufar, Ricci curvature for parameter statistics, 2018). The information matrix criterion for the LSI can be written as $G_F(\theta) + \nabla_\theta^2\int p_\theta\log\frac{p_\theta}{p_{\theta^*}} - \Gamma^W\nabla_\theta\tilde H(p_\theta|p_{\theta^*})\;\succeq\;2\alpha\,G_W(\theta)$, where $\Gamma^W$ is the Christoffel symbol of the Wasserstein statistical model $\Theta$; for the Poincaré inequality (PI) it reads $G_F(\theta) + \nabla_\theta^2\int p_\theta\log\frac{p_\theta}{p_{\theta^*}}\;\succeq\;2\alpha\,G_W(\theta)$.
  37. List of functional inequalities in families. Table: the Fisher information functional and the Log-Sobolev inequality LSI($\alpha$) for the Gaussian and Laplacian families. For the Gaussian family, the relative Fisher information functional $\tilde I(p_{\mu,\sigma}|p_*)$ is an explicit function of $\mu-\mu_*$, $\sigma$, and $\sigma_*$, and the LSI compares it with $2\alpha$ times the relative entropy; the Laplacian family admits analogous explicit expressions involving $e^{-\lambda|m-m_*|}$.
  38. List of functional inequalities in families (continued). Table: the RIW conditions for LSI($\alpha$) and PI($\alpha$) in the Gaussian and Laplacian families. Gaussian, LSI: $\begin{pmatrix}1/\sigma_*^2 & 0\\ 0 & 1/\sigma_*^2 + 1/\sigma^2\end{pmatrix}\succeq 2\alpha\begin{pmatrix}1&0\\0&1\end{pmatrix}$; Gaussian, PI: $\begin{pmatrix}1/\sigma_*^2 & 0\\ 0 & 2/\sigma_*^2\end{pmatrix}\succeq 2\alpha\begin{pmatrix}1&0\\0&1\end{pmatrix}$; Laplacian, PI: $\begin{pmatrix}\lambda_*^2 & 0\\ 0 & 1/\lambda_*^2\end{pmatrix}\succeq 2\alpha\begin{pmatrix}1&0\\0&2/\lambda_*^4\end{pmatrix}$; the Laplacian LSI condition takes an analogous explicit form involving $e^{-\lambda|m-m_*|}$.
  39. Wasserstein natural gradient. Given a loss function $F : \mathcal{P}(\Omega)\to\mathbb{R}$ and a probability model $\rho(\cdot,\theta)$, the associated gradient flow on a Riemannian manifold is defined by $\frac{d\theta}{dt} = -\nabla_g F(\rho(\cdot,\theta))$. Here $\nabla_g$ is the Riemannian gradient operator satisfying $g_\theta\big(\nabla_g F(\rho(\cdot,\theta)),\,\xi\big) = \nabla_\theta F(\rho(\cdot,\theta))\cdot\xi$ for any tangent vector $\xi\in T_\theta\Theta$, where $\nabla_\theta$ denotes the differential operator (Euclidean gradient).
  40. Wasserstein natural gradient. The gradient flow of the loss function $F(\rho(\cdot,\theta))$ in $(\Theta, g_\theta)$ satisfies $\frac{d\theta}{dt} = -G_W(\theta)^{-1}\nabla_\theta F(\rho(\cdot,\theta))$. If $\rho(\cdot,\theta) = \rho(x)$ is the identity map, then we recover the Wasserstein gradient flow in the full probability space: $\partial_t\rho_t = \nabla\cdot\big(\rho_t\nabla\tfrac{\delta}{\delta\rho}F(\rho_t)\big)$.
  41. Entropy production. The gradient flow of the entropy $H(\rho) = \int_{\mathbb{R}^d}\rho(x)\log\rho(x)\,dx$ w.r.t. the optimal transport distance is the heat equation: $\frac{\partial\rho}{\partial t} = \nabla\cdot(\rho\nabla\log\rho) = \Delta\rho$. Along the gradient flow, the following relation holds: $\frac{d}{dt}H(\rho) = \int_{\mathbb{R}^d}\log\rho\,\nabla\cdot(\rho\nabla\log\rho)\,dx = -\int_{\mathbb{R}^d}|\nabla\log\rho|^2\,\rho\,dx$. The last integral is known as the Fisher information (B. Roy Frieden, Science from Fisher Information: A Unification).
  42. Online natural gradient algorithm. We sample from the unknown distribution once in each step and use the sample $x_t$ to generate an estimator $\theta_{t+1}$: $\theta_{t+1} = \theta_t - \frac{1}{t}\nabla^W_\theta f(x_t,\theta_t)$, where $f$ is the loss function. To analyze the convergence of this algorithm, we define the Wasserstein covariance matrix $V_t = \mathbb{E}_{p_{\theta^*}}\big[\nabla_x(\theta_t-\theta^*)\cdot\nabla_x(\theta_t-\theta^*)^T\big]$, where $\theta^*$ is the optimal value. Definition (natural gradient efficiency): the Wasserstein natural gradient is asymptotically efficient if $V_t = \frac{1}{t}G_W^{-1}(\theta^*) + O\big(\frac{1}{t^2}\big)$.
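A toy simulation (our construction, not from the talk) of the $1/t$ efficiency claim in the simplest possible case: a Gaussian location model $N(\theta^*, 1)$, for which $G_W = 1$ and the online update with step $1/t$ reduces to the running sample mean, whose variance is exactly $1/t$.

```python
import numpy as np

# The recursion theta_t = theta_{t-1} - (theta_{t-1} - x_t)/t is, by induction,
# exactly the running mean of x_1..x_t (the t = 1 step forgets the initialization),
# so after T steps Var(theta_T) = 1/T = G_W^{-1}/T for the location model.
rng = np.random.default_rng(1)
theta_star, T, runs = 0.0, 2000, 500
x = rng.normal(theta_star, 1.0, size=(runs, T))   # one sample stream per run
theta_T = x.mean(axis=1)                          # closed form of the recursion
v_hat = np.mean((theta_T - theta_star) ** 2)      # empirical Wasserstein covariance
print(v_hat * T)                                  # ≈ G_W^{-1}(theta*) = 1
```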
  43. Covariance update. Theorem (variance updating equation of the Wasserstein natural gradient). For any function $f(x,\theta)$ satisfying $\mathbb{E}_{p_\theta}f(x,\theta) = 0$, consider the asymptotic behavior of the Wasserstein dynamics $\theta_{t+1} = \theta_t - \frac{1}{t}G_W^{-1}(\theta_t)f(x_t,\theta_t)$. That is, assume a priori $\mathbb{E}_{p_{\theta^*}}\big[(\theta_t-\theta^*)^2\big],\;\mathbb{E}_{p_{\theta^*}}\big[|\nabla_x(\theta_t-\theta^*)|^2\big] = o(1)$ for all $t$. Then the Wasserstein covariance matrix $V_t$ updates according to $V_{t+1} = V_t + \frac{1}{t^2}G_W^{-1}(\theta^*)\,\mathbb{E}_{p_{\theta^*}}\big[\nabla_x f(x_t,\theta^*)\cdot\nabla_x f(x_t,\theta^*)^T\big]\,G_W^{-1}(\theta^*) - \frac{2V_t}{t}\mathbb{E}_{p_{\theta^*}}\big[\nabla_\theta f(x_t,\theta^*)\big]\,G_W^{-1}(\theta^*) + o\big(\frac{V_t}{t}\big) + o\big(\frac{1}{t^2}\big)$.
  44. Wasserstein online efficiency. Corollary (Wasserstein natural gradient efficiency). For the dynamics $\theta_{t+1} = \theta_t - \frac{1}{t}G_W^{-1}(\theta_t)\Phi^W(x_t,\theta_t)$, the Wasserstein covariance updates according to $V_{t+1} = V_t + \frac{1}{t^2}G_W^{-1}(\theta^*) - \frac{2}{t}V_t + o\big(\frac{1}{t^2}\big) + o\big(\frac{V_t}{t}\big)$. Then the online Wasserstein natural gradient algorithm is Wasserstein efficient, that is, $V_t = \frac{1}{t}G_W^{-1}(\theta^*) + O\big(\frac{1}{t^2}\big)$.
  45. Poincaré online efficiency. Corollary (Poincaré efficiency). For the dynamics $\theta_{t+1} = \theta_t - \frac{1}{t}\nabla^W_\theta l(x_t,\theta_t)$, where $l(x_t,\theta_t) = \log p(x_t,\theta_t)$ is the log-likelihood function, the Wasserstein covariance updates according to $V_{t+1} = V_t + \frac{1}{t^2}G_W^{-1}(\theta^*)\,\mathcal{I}\,G_W^{-1}(\theta^*) - \frac{2}{t}V_t\,G_F(\theta^*)G_W^{-1}(\theta^*) + O\big(\frac{1}{t^3}\big) + o\big(\frac{V_t}{t}\big)$, where $\mathcal{I} = \mathbb{E}_{p_{\theta^*}}\big[\nabla_x(\nabla_\theta l(x_t,\theta^*))\cdot\nabla_x(\nabla_\theta l(x_t,\theta^*)^T)\big]$. Now suppose $\alpha = \sup\{a \mid G_F\succeq a\,G_W\}$. Then the dynamics is characterized by $V_t = O(t^{-2\alpha})$ if $2\alpha\le 1$, and $V_t = \frac{1}{t}\big(2G_F G_W^{-1} - \mathrm{Id}\big)^{-1}G_W^{-1}(\theta^*)\,\mathcal{I}\,G_W^{-1}(\theta^*) + O\big(\frac{1}{t^2}\big)$ if $2\alpha > 1$.
  46. Part II: Transport information proximal. Lin, Li, Montufar, Osher, Wasserstein proximal of GANs, 2019. Li, Lin, Montufar, Affine proximal learning, GSI, 2019. Li, Liu, Zha, Zhou, Parametric Fokker–Planck equations, GSI, 2019. Arbel, Gretton, Li, Montufar, Kernelized Wasserstein natural gradient, ICLR (spotlight), 2020. Liu, Li, Zha, Zhou, Neural parametric Fokker–Planck equations, 2020.
  47. Gradient operator. Given a function $F : M\to\mathbb{R}$ and a metric matrix function $G(x)\in\mathbb{R}^{n\times n}$, the gradient operator is defined by $\mathrm{grad}F(x) = G(x)^{-1}\nabla_x F(x)$, where $\nabla_x$ is the Euclidean or $L^2$ gradient w.r.t. $x$. If $G(x)^{-1} = I$, then $\mathrm{grad}F(x) = \nabla_x F(x)$; if $G(x)^{-1} = \mathrm{diag}(x_i)_{1\le i\le n}$, then $\mathrm{grad}F(x) = (x_i\nabla_{x_i}F(x))_{i=1}^n$; if $G(x)^{-1} = L(x)$, where $L$ is a graph Laplacian matrix and $x$ is the weight, then $\mathrm{grad}F(x) = L(x)\nabla_x F(x)$.
  48. Proximal method. Consider a Euclidean gradient flow $\dot x = -\nabla F(x)$. The proximal method refers to $x_{k+1} = \arg\min_{x\in\mathbb{R}^n} F(x) + \frac{1}{2h}\|x - x_k\|^2$, where $h > 0$ is the step size. This is known as the backward Euler method: $x_{k+1} = x_k - h\nabla F(x_{k+1})$.
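For a quadratic loss the proximal step has a closed form, which makes the backward-Euler identity above easy to see; a short sketch with an illustrative matrix $A$ of our choosing:

```python
import numpy as np

# Proximal step for F(x) = 0.5 * x^T A x: minimizing F(y) + ||y - x||^2/(2h)
# gives A y + (y - x)/h = 0, i.e. y = (I + h A)^{-1} x, which is exactly
# backward Euler for the gradient flow xdot = -A x.
A = np.array([[2.0, 0.0],
              [0.0, 10.0]])
h = 0.5
I = np.eye(2)
x = np.array([1.0, 1.0])
for _ in range(50):
    x = np.linalg.solve(I + h * A, x)   # prox step, stable for any h > 0
print(x)                                # decays toward the minimizer 0
```

Unlike forward Euler, this iteration is unconditionally stable, which is the usual motivation for proximal/implicit schemes.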
  49. Proximal method on a metric structure. Consider a gradient flow $\dot x = -G(x)^{-1}\nabla F(x)$. The proximal method refers to $x_{k+1} = \arg\min_{x\in\mathbb{R}^n} F(x) + \frac{1}{2h}d_G(x,x_k)^2$, where $h > 0$ is the step size and $d_G(x,y)$ is the distance function induced by $G$, i.e. $d_G(x,y) = \inf\big\{\int_0^1\sqrt{\dot\gamma(t)^T G(\gamma(t))\dot\gamma(t)}\,dt : \gamma(0) = x,\ \gamma(1) = y\big\}$. This is known as the variational backward Euler method: $x_{k+1} = x_k - hG(x_{k+1})^{-1}\nabla F(x_{k+1}) + o(h)$. Many other linearizations of the method have been studied.
  50. Proximal method on a probability space. Denote $G(\rho)^{-1} = -\nabla\cdot(\rho\nabla)$. Consider the gradient flow $\dot\rho = -G(\rho)^{-1}\frac{\delta}{\delta\rho}F(\rho) = \nabla\cdot\big(\rho\nabla\tfrac{\delta}{\delta\rho}F(\rho)\big)$. The proximal method refers to $\rho_{k+1} = \arg\min_{\rho\in\mathcal{P}} F(\rho) + \frac{1}{2h}d_G(\rho,\rho_k)^2$, where $h > 0$ is the step size and $d_G$ is the distance function induced by $G$, known as the transport distance. This is the variational backward Euler method $\rho_{k+1} = \rho_k + h\nabla\cdot\big(\rho_{k+1}\nabla\tfrac{\delta}{\delta\rho}F(\rho_{k+1})\big) + o(h)$. A linearization of the proximal method has been formulated by Li, Lu, Wang (2019).
  51. Generative adversarial networks.

  52. Generative probability models. For each parameter $\theta\in\mathbb{R}^d$ and a given neural-network-parameterized mapping function $g_\theta$, consider $\rho_\theta = g_{\theta\#}p(z)$.
  53. Natural proximal on probability models. Denote $G(\theta) = \big(\nabla_\theta\rho,\;(-\Delta_\rho)^{-1}\nabla_\theta\rho\big)_{L^2}$. Consider the gradient flow $\dot\theta = -G(\theta)^{-1}\nabla_\theta F(\rho(\theta,\cdot))$. The update scheme follows (Lin, Li, Osher, Montufar, Wasserstein proximal of GANs, 2018): $\theta_{k+1} = \arg\min_{\theta\in\Theta} F(\rho_\theta) + \frac{1}{2h}d_W(\theta,\theta_k)^2$, where $\theta$ are the parameters of the generator, $F(\rho_\theta)$ is the loss function, and $d_W$ is the Wasserstein metric.
  54. Computations. In practice, we approximate the Wasserstein metric to obtain the following update: $\theta_{k+1} = \arg\min_{\theta\in\Theta} F(\rho_\theta) + \frac{1}{B}\sum_{i=1}^B\frac{1}{2h}\|g_\theta(z_i) - g_{\theta_k}(z_i)\|^2$, where $g_\theta$ is the generator, $B$ is the batch size, and $z_i\sim p(z)$ are inputs to the generator.
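This is not the paper's implementation, but the structure of the update is easy to see in a scalar toy version (all names ours): a linear "generator" $g_\theta(z) = \theta z$, a stand-in quadratic loss playing the role of $F$, and the batch proximal penalty from the slide solved by an inner gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(theta, z):
    # stand-in for F(rho_theta): push generator outputs toward target mean 3
    return np.mean((theta * z - 3.0) ** 2)

def prox_step(theta_k, z, h=0.1, lr=0.05, inner=200):
    # argmin_theta  loss(theta) + (1/2h) * mean ||g_theta(z_i) - g_theta_k(z_i)||^2
    theta = theta_k
    for _ in range(inner):
        eps = 1e-5
        def obj(t):
            return loss(t, z) + np.mean((t * z - theta_k * z) ** 2) / (2 * h)
        g = (obj(theta + eps) - obj(theta - eps)) / (2 * eps)  # finite-diff gradient
        theta -= lr * g
    return theta

z = rng.normal(1.0, 0.1, size=256)   # fixed batch of generator inputs
theta = 0.0
for k in range(50):
    theta = prox_step(theta, z)      # outer proximal iterations
print(theta)                         # approaches ~3, the loss minimizer
```

The proximal penalty limits how far the generator's *outputs* move per step, which is the regularizing effect the method exploits.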
  55. Examples: Jensen–Shannon entropy loss. Figure: the relaxed Wasserstein proximal of GANs, on the CIFAR-10 (left) and CelebA (right) datasets.
  56. Examples: Wasserstein-1 loss. Figure: Wasserstein proximal of the Wasserstein-1 loss function on the CIFAR-10 dataset.
  57. Part III: Transport information Gaussian mixtures. Jiaxi Zhao, Wuchen Li, Scaling limit of transport Gaussian mixture models, 2020.
  58. Part IV: Transport information generative models. Shu Liu, Wuchen Li, Hongyuan Zha, Haomin Zhou, Neural parametric Fokker–Planck equations, 2020.
  59. Fokker–Planck equation. The Fokker–Planck equation (FPE) is a parabolic evolution equation: $\frac{\partial\rho}{\partial t} = \nabla\cdot(\rho\nabla V) + \beta\Delta\rho$, $\rho(\cdot,0) = \rho_0$. The equation plays fundamental roles in the probability, geometry, statistics, and machine learning communities, and it has many applications.
  60. Our goal. We are interested in solving high-dimensional Fokker–Planck equations. Difficulty: the curse of dimensionality. Existing tools for high-dimensional problems: deep learning. We are going to introduce deep learning techniques to numerically solve the FPE.
  61. Motivation. Dimension reduction: Fokker–Planck equation (infinite-dimensional ODE) → parametric Fokker–Planck equation (finite-dimensional ODE). Obtain an efficient, sampling-friendly method, i.e. we are able to draw samples from $\rho_t = \rho(x,t)$ for arbitrary $t$.
  62. Our treatment. Generative model in deep learning (Goodfellow et al., Generative Adversarial Nets, 2014; Arjovsky et al., Wasserstein Generative Adversarial Networks, 2017): $T_\theta$ is a deep neural network, $p$ a reference distribution. Find a suitable parameter $\theta$ such that $T_{\theta\#}p$ approximates the desired distribution $\rho$ well. Deep structure → dealing with high-dimensional problems; push-forward mechanism → sampling-friendly. We numerically solve the high-dimensional Fokker–Planck equation in a generative model: approximate $\rho_t$, the solution of the Fokker–Planck equation at time $t$, by $T_{\theta_t\#}p$; construct an ODE for $\theta_t$ and solve it numerically.
  63. Project the Fokker–Planck equation to the parameter space.

  64. Wasserstein manifold of probability distributions. The Wasserstein manifold of probability distributions on $\mathbb{R}^d$: $\mathcal{P} = \big\{\rho : \int\rho(x)\,dx = 1,\ \rho(x)\ge 0,\ \int|x|^2\rho(x)\,dx < \infty\big\}$. The tangent space of $\mathcal{P}$ at $\rho$ is characterized as $T_\rho\mathcal{P} = \big\{\dot\rho : \int\dot\rho(x)\,dx = 0\big\}$. (Jordan, Kinderlehrer, Otto, The variational formulation of the Fokker–Planck equation, SIAM J. Math. Anal., 1998.)
  65. Wasserstein manifold of probability distributions. We define the Wasserstein metric $g_W$ on $T\mathcal{P}$ as follows: for any $\rho\in\mathcal{P}$ and $\dot\rho_1, \dot\rho_2\in T_\rho\mathcal{P}$, $g_W(\rho)(\dot\rho_1,\dot\rho_2) = \int\nabla\phi_1(x)\cdot\nabla\phi_2(x)\,\rho(x)\,dx$, where $\phi_1, \phi_2$ solve $\dot\rho_1 = -\nabla\cdot(\rho\nabla\phi_1)$, $\dot\rho_2 = -\nabla\cdot(\rho\nabla\phi_2)$. Now $(\mathcal{P}, g_W)$ becomes a Riemannian manifold, which we call the Wasserstein manifold. The geodesic distance on $(\mathcal{P}, g_W)$ corresponds to the $L^2$-Wasserstein distance. (Jordan, Kinderlehrer, Otto, 1998.)
  66. Gradient flows on the Wasserstein manifold. Denote by $\mathrm{grad}_W$ the manifold gradient on $(\mathcal{P}, g_W)$. Consider a smooth functional $F : \mathcal{P}\to\mathbb{R}$; one can verify (Villani, Optimal Transport: Old and New, 2008): $\mathrm{grad}_W F(\rho) = g_W(\rho)^{-1}\big(\tfrac{\delta F}{\delta\rho}\big)(x) = -\nabla\cdot\big(\rho(x)\nabla\tfrac{\delta F}{\delta\rho(x)}\big)$, where $\frac{\delta F}{\delta\rho(x)}$ is the $L^2$ first variation at $x\in\mathbb{R}^d$. Then the gradient flow of $F$ is written as $\frac{\partial\rho}{\partial t} = -\mathrm{grad}_W F(\rho) = \nabla\cdot\big(\rho(x)\nabla\tfrac{\delta F}{\delta\rho(x)}\big)$.
  67. Fokker–Planck equation as gradient flow. For a given potential function $V$, consider the relative entropy functional $H$ (also known as a KL divergence): $H(\rho) = \int\rho(x)\log\Big(\frac{\rho(x)}{\frac{1}{Z}e^{-V(x)}}\Big)dx = \int V(x)\rho(x)\,dx + \int\rho(x)\log\rho(x)\,dx + \mathrm{const}$. Then $\nabla\big(\tfrac{\delta H}{\delta\rho}\big) = \nabla V + \nabla\log\rho$, and the gradient flow is written as $\frac{\partial\rho}{\partial t} = \nabla\cdot(\rho\nabla V) + \nabla\cdot(\rho\nabla\log\rho) = \nabla\cdot(\rho\nabla V) + \Delta\rho$. This is exactly the Fokker–Planck equation mentioned at the beginning.
  68. Parametric Fokker–Planck equation. Inspired by: Wasserstein geometry (Otto, The geometry of dissipative evolution equations: the porous medium equation, 2001) + information geometry (Amari, Natural gradient works efficiently in learning, 1998) → Wasserstein information geometry (Li, Montufar, Natural gradient via optimal transport, 2018). Sketch of our treatment: introduce parametric distributions via a generative model; derive the metric tensor on the parameter space; compute the Wasserstein gradient flow of the relative entropy in the parameter space.
  69. Parametric Fokker–Planck equation. Consider a class of invertible push-forward maps $\{T_\theta\}_{\theta\in\Theta}$ indexed by a parameter $\theta\in\Theta\subset\mathbb{R}^m$, $T_\theta : \mathbb{R}^d\to\mathbb{R}^d$. We obtain the family of parametric distributions $\mathcal{P}_\Theta = \{\rho_\theta = T_{\theta\#}p \mid \theta\in\Theta\}\subset(\mathcal{P}, g_W)$. Define $T_\# : \theta\mapsto T_{\theta\#}p = \rho_\theta$ and introduce the metric tensor $G = (T_\#)^*g_W$ on $\Theta$ by pulling back $g_W$ by $T_\#$. We call $(\Theta, G)$ the Wasserstein statistical manifold. For each $\theta\in\Theta$, $G(\theta)$ is a bilinear form on $T_\theta\Theta\simeq\mathbb{R}^m$.
  70. Parametric Fokker–Planck equation. Theorem 1. Let $G = (T_\#)^*g_W$; then $G(\theta)$ is an $m\times m$ non-negative definite symmetric matrix and can be computed as $G(\theta) = \int\nabla\psi(T_\theta(x))\,\nabla\psi(T_\theta(x))^T\,p(x)\,dx$, or, in entrywise form, $G_{ij}(\theta) = \int\nabla\psi_i(T_\theta(x))\cdot\nabla\psi_j(T_\theta(x))\,p(x)\,dx$, $1\le i,j\le m$. Here $\psi = (\psi_1,\dots,\psi_m)^T$ and $\nabla\psi$ is the $m\times d$ Jacobian matrix of $\psi$. For each $k = 1,2,\dots,m$, $\psi_k$ solves the equation $\nabla\cdot(\rho_\theta\nabla\psi_k(x)) = \nabla\cdot\big(\rho_\theta\,\partial_{\theta_k}T_\theta(T_\theta^{-1}(x))\big)$. (1)
  71. Parametric Fokker–Planck equation. Recall the relative entropy functional $H$; we consider $\mathcal{H} = H\circ T_\# : \Theta\to\mathbb{R}$. Then $\mathcal{H}(\theta) = H(\rho_\theta) = \int V(x)\rho_\theta(x)\,dx + \int\rho_\theta(x)\log\rho_\theta(x)\,dx = \int\big[V(T_\theta(x)) + \log\rho_\theta(T_\theta(x))\big]p(x)\,dx$. Since the Fokker–Planck equation is the gradient flow of $H$ on $(\mathcal{P}, g_W)$, a similar version on $(\Theta, G)$ is formulated as $\dot\theta = -G(\theta)^{-1}\nabla_\theta\mathcal{H}(\theta)$ (recall $\frac{\partial\rho}{\partial t} = -g_W(\rho)^{-1}\frac{\delta}{\delta\rho}H(\rho)$). This is an ODE in the parameter space $\Theta$; we call it the parametric Fokker–Planck equation.
  72. From the Fokker–Planck equation to particle dynamics. (In this diagram, notice that $\nabla\frac{\delta H(\rho)}{\delta\rho} = \nabla V + \nabla\log\rho$.)
  73. Particle version of the Fokker–Planck equation. Fokker–Planck equation: $\frac{\partial\rho}{\partial t} = \nabla\cdot(\rho\nabla V) + \Delta\rho$, $\rho(\cdot,0) = \rho_0$. Particle versions. Overdamped Langevin dynamics: $dX_t = -\nabla V(X_t)\,dt + \sqrt{2}\,dB_t$, $X_0\sim\rho_0$. Vlasov-type dynamics: $\frac{dX_t}{dt} = -\nabla V(X_t) - \nabla\log\rho_t(X_t)$, $X_0\sim\rho_0$, or equivalently $\frac{dX_t}{dt} = -\nabla\frac{\delta H(\rho_t)}{\delta\rho_t}(X_t)$, $X_0\sim\rho_0$. Here we denote by $\rho_t(\cdot)$ the density of the distribution of $X_t$.
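A minimal Euler–Maruyama sketch (our construction) of the overdamped Langevin dynamics above, with the illustrative choice $V(x) = x^2/2$, so the FPE steady state $\propto e^{-V}$ is $N(0,1)$; the initial distribution, step size, and particle count are arbitrary.

```python
import numpy as np

# Simulate dX_t = -grad V(X_t) dt + sqrt(2) dB_t with V(x) = x^2/2.
# The empirical distribution of the particles approaches N(0, 1).
rng = np.random.default_rng(2)
n, dt, steps = 20_000, 0.01, 1_000        # n particles, horizon steps*dt = 10
X = rng.uniform(-3.0, 3.0, size=n)        # X_0 ~ rho_0 (here: uniform)
for _ in range(steps):
    X += -X * dt + np.sqrt(2 * dt) * rng.normal(size=n)  # Euler–Maruyama step
print(X.mean(), X.var())                  # ≈ 0 and ≈ 1 (up to O(dt) bias)
```

Each particle evolves independently; the interacting (Vlasov-type) version would additionally require an estimate of $\nabla\log\rho_t$ from the particle cloud.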
  74. From the parametric Fokker–Planck equation to particle dynamics. (In this diagram, notice that $\nabla\frac{\delta H(\rho)}{\delta\rho} = \nabla V + \nabla\log\rho$.)
  75. Particle version of the parametric Fokker–Planck equation. Let $\{\theta_t\}$ be the solution of the ODE. Take $Z\sim p$ and consider $Y_t = T_{\theta_t}(Z)$; we know that $Y_t\sim\rho_{\theta_t}$ for any $t$. $\{Y_t\}$ satisfies the dynamics $\dot Y_t = \partial_\theta T_{\theta_t}(T_{\theta_t}^{-1}(Y_t))\,\dot\theta_t$, $Y_0\sim\rho_{\theta_0}$. It is then more insightful to consider the dynamics $\dot X_t = \nabla\psi_t(X_t)^T\dot\theta_t$, $X_0\sim\rho_{\theta_0}$. $X_t$ and $Y_t$ have the same distribution, and thus $X_t\sim\rho_{\theta_t}$ for all $t\ge 0$.
  76. Particle version of the parametric Fokker–Planck equation. Since $\dot\theta_t = -G(\theta_t)^{-1}\nabla_\theta\mathcal{H}(\theta_t)$, plugging in the definition of the metric tensor gives $\dot X_t = \int K_{\theta_t}(X_t,\eta)\big({-\nabla V(\eta)} - \nabla\log\rho_{\theta_t}(\eta)\big)\rho_{\theta_t}(\eta)\,d\eta$. Here we define the kernel function $K_\theta : \mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}^{d\times d}$, $K_\theta(x,\eta) = \nabla\psi(x)^T\Big(\int\nabla\psi(x)\nabla\psi(x)^T\rho_\theta(x)\,dx\Big)^{-1}\nabla\psi(\eta)$.
  77. Particle version of the parametric Fokker–Planck equation. The kernel $K_\theta(\cdot,\cdot)$ induces an orthogonal projection $\mathcal{K}_\theta$ defined on $L^2(\mathbb{R}^d;\mathbb{R}^d,\rho_\theta)$. The range of this projection is the subspace $\mathrm{span}\{\nabla\psi_1,\dots,\nabla\psi_m\}\subset L^2(\mathbb{R}^d;\mathbb{R}^d,\rho_\theta)$, where $\psi_1,\dots,\psi_m$ are the $m$ components of $\psi_t$. We may rewrite the dynamics as $\dot X_t = -\mathcal{K}_{\theta_t}(\nabla V + \nabla\log\rho_{\theta_t})(X_t)$, $X_0\sim\rho_{\theta_0}$, where $\rho_{\theta_t}$ is the probability density of $X_t$.
  78. Projected dynamics. (In this diagram, notice that $\nabla\frac{\delta H(\rho)}{\delta\rho} = \nabla V + \nabla\log\rho$.)
  79. Projected dynamics. We now compare: Vlasov-type dynamics (Fokker–Planck equation), $\dot{\tilde X}_t = -(\nabla V + \nabla\log\rho_t)(\tilde X_t)$, $\tilde X_0\sim\rho_0$, where $\rho_t$ is the probability density of $\tilde X_t$; and projected Vlasov-type dynamics (parametric Fokker–Planck equation), $\dot X_t = -\mathcal{K}_{\theta_t}(\nabla V + \nabla\log\rho_{\theta_t})(X_t)$, $X_0\sim\rho_{\theta_0}$, where $\rho_{\theta_t}$ is the probability density of $X_t$. From the particle viewpoint, the difference between $\rho_{\theta_t}$ and $\rho_t$ originates from the orthogonal projection $\mathcal{K}_{\theta_t}$ of the vector field that drives the Vlasov-type dynamics.
  80. Route map of our research. Our parametric Fokker–Planck equation can be treated as a neural Lagrangian method with an Eulerian view.
  81. Algorithm. When the dimension $d$ gets higher: endow $T_\theta$ with the structure of a deep neural network; our choice is the normalizing flow (Rezende, Mohamed, Variational inference with normalizing flows, ICML 2015), though there are certainly other choices. Classical schemes for ODEs are not suitable: direct computation of $G(\theta)$ could be expensive, and $G(\theta)^{-1}\nabla_\theta\mathcal{H}(\theta)$ cannot be efficiently computed in a deep learning framework. We have to seek new schemes. Since optimization is convenient to perform in a deep learning framework, one choice is to design a variational scheme for our parametric Fokker–Planck equation.
  82. Algorithm. Goal: propose a variational scheme for solving $\dot\theta_t = -G(\theta_t)^{-1}\nabla_\theta\mathcal{H}(\theta_t)$. Start with a semi-implicit scheme: $\frac{\theta_{k+1}-\theta_k}{h} = -G^{-1}(\theta_k)\nabla_\theta\mathcal{H}(\theta_{k+1})$. Associate it with a proximal-type algorithm: $\theta_{k+1} = \arg\min_\theta\,\langle\theta-\theta_k,\,G(\theta_k)(\theta-\theta_k)\rangle + 2h\mathcal{H}(\theta)$. We use the approximation $\langle\theta-\theta_k,\,G(\theta_k)(\theta-\theta_k)\rangle \approx \max_\varphi\Big\{\int_M\big(2\nabla\varphi(x)\cdot\big((T_\theta - T_{\theta_k})\circ T_{\theta_k}^{-1}(x)\big) - |\nabla\varphi(x)|^2\big)\rho_{\theta_k}(x)\,dx\Big\}$. Thus, we come up with the following variational saddle-point scheme: $\theta_{k+1} = \arg\min_\theta\max_\varphi\Big\{\int_M\big(2\nabla\varphi(x)\cdot\big((T_\theta - T_{\theta_k})\circ T_{\theta_k}^{-1}(x)\big) - |\nabla\varphi(x)|^2\big)\rho_{\theta_k}(x)\,dx + 2h\mathcal{H}(\theta)\Big\}$. In our implementation, we choose $\varphi$ as a ReLU network and optimize over its parameters.
  83. Algorithm 1 (solving the parametric FPE $\dot\theta = -G(\theta)^{-1}\nabla_\theta F(\theta)$):
     Initialize $\theta^0$.
     For $k = 0, 1, \dots, N$:
       Set $\theta\leftarrow\theta^k$.
       For $j = 1, 2, \dots, M$:
         Apply stochastic gradient descent to solve $\inf_\varphi\int|T_\theta(x) - T_{\theta^k}(x) - \nabla\varphi(T_{\theta^k}(x))|^2\,dp(x)$;
         update $\theta\leftarrow\theta - \mathrm{stepsize}\cdot\nabla_\theta J(\theta,\theta^k)$, where $J(\theta,\tilde\theta) = \int T_\theta(x)\cdot\nabla\varphi(T_{\tilde\theta}(x))\,dp(x) + hF(\theta)$.
       Set $\theta^{k+1}\leftarrow\theta$.
  84. Algorithm. Our algorithm is convenient in high-dimensional cases: all the integrals can be approximated by the Monte Carlo method, and derivatives w.r.t. the parameters of the networks can be computed via back-propagation. An interesting observation: our new scheme can be treated as a modified version of the JKO scheme (Jordan, Kinderlehrer, Otto, The variational formulation of the Fokker–Planck equation, SIAM J. Math. Anal., 1998) for the Wasserstein gradient flow of the functional $H$: $\rho_{k+1} = \arg\min_\rho\big\{\frac{1}{2h}W_2^2(\rho,\rho_k) + H(\rho)\big\}$, $\dot\rho = -\mathrm{grad}_W H(\rho)$. In our variational scheme, we replace $W_2^2(\rho,\rho_k)$ with $d^2(\rho_\theta,\rho_{\theta_k}) = \max_\varphi\big\{\int_M\big(2\nabla\varphi(x)\cdot\big((T_\theta - T_{\theta_k})\circ T_{\theta_k}^{-1}(x)\big) - |\nabla\varphi(x)|^2\big)\rho_{\theta_k}(x)\,dx\big\}$.
  85. Numerical examples

    Now we try the FPE with a more complicated potential in higher dimensions. Assume we are in R³⁰ with β = 1, and we set
        V(x) = (1/50) Σ_{i=1}^{30} ( x_i⁴ − 16 x_i² + 5 x_i ).
    Figure: plot of this function in two-dimensional space. 33 85
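The potential above is straightforward to implement; its gradient is also needed in practice (e.g. to evaluate ∇V in the drift or the free energy). As a sanity check, V(1, …, 1) = 30·(1 − 16 + 5)/50 = −6.

```python
import numpy as np

# The 30-dimensional potential from the slide:
#   V(x) = (1/50) * sum_i ( x_i^4 - 16 x_i^2 + 5 x_i )
def V(x):
    return np.sum(x**4 - 16 * x**2 + 5 * x, axis=-1) / 50.0

def grad_V(x):
    # componentwise derivative: (4 x_i^3 - 32 x_i + 5) / 50
    return (4 * x**3 - 32 * x + 5) / 50.0

x = np.ones(30)
print(V(x))        # 30 * (1 - 16 + 5) / 50 = -6.0
```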
  86. Numerical examples

    We set the time stepsize h = 0.005. Since we cannot plot sample points in 30-dimensional space, we project the samples in R³⁰ onto a two-dimensional subspace, namely the plane spanned by the 5th and 15th components. Here are the numerical results: 34 86
  92. Numerical examples

    Figure: density estimation of the sample points at the 360th iteration. 34 92
  93. Numerical examples

    We can inspect the graph of φ trained at the end of each outer iteration. Here are the graphs of φ at the 30th, 60th, 150th, 240th, 300th and 360th iterations.
    Figure: graph of φ at the k = 30th time step. Figure: graph of φ at the k = 60th time step. Figure: graph of φ at the k = 150th time step. 35 93
  94. Numerical examples

    Figure: graph of φ at the k = 240th time step. Figure: graph of φ at the k = 300th time step. Figure: graph of φ at the k = 360th time step. 36 94
  95. Numerical analysis

    Let {ρ_t} denote the solution to the original Fokker–Planck equation, and let {θ_t} denote the solution to the parametric Fokker–Planck equation θ̇_t = −G(θ_t)^{-1} ∇_θ H(θ_t), with ρ_{θ_t} = (T_{θ_t})_# p.
    - Convergence analysis: we provide an upper bound for D_KL(ρ_{θ_t} ‖ ρ*). Here ρ* = (1/Z) exp(−V/β) is the Gibbs distribution (the equilibrium solution of the Fokker–Planck equation).
    - Error analysis: we provide a uniform bound for W₂(ρ_{θ_t}, ρ_t). 37 95
  96. Numerical analysis

    Before presenting our numerical-analysis results, we first introduce a quantity that plays an essential role in them. Recalling ψ as defined in (1) of Theorem 1, we set
        δ₀ = sup_{θ∈Θ} min_{ξ∈T_θΘ} { ∫ |∇ψ(T_θ(x))ᵀ ξ − ∇(V + β log ρ_θ) ∘ T_θ(x)|² dp(x) }.
    One can treat δ₀ as a quantification of the approximation power of the push-forward family {T_θ}_{θ∈Θ}. 38 96
  97. Convergence analysis

    Theorem. Consider the Fokker–Planck equation with smooth V and ∇²V ⪰ λI (λ > 0) outside a finite ball. Suppose {θ_t} solves θ̇_t = −G(θ_t)^{-1} ∇_θ H(θ_t). Let ρ*(x) = (1/Z) e^{−V(x)/β} be the Gibbs distribution of the original Fokker–Planck equation. Then we have the inequality

        KL(ρ_{θ_t} ‖ ρ*) ≤ δ₀/(λ̃β²) · (1 − e^{−βλ̃t}) + KL(ρ_{θ₀} ‖ ρ*) · e^{−βλ̃t}.   (2)

    Here λ̃ > 0 is a constant originating from the log-Sobolev inequality involving ρ*, i.e.
        KL(ρ ‖ ρ*) ≤ (1/λ̃) I(ρ | ρ*)
    holds for any probability density ρ, where I(ρ | ρ*) = ∫ |∇ log(ρ/ρ*)|² ρ dx. 39 97
  98. Error analysis

    Theorem. Assume {θ_t} solves θ̇_t = −G(θ_t)^{-1} ∇_θ H(θ_t), and {ρ_t} solves the Fokker–Planck equation. Assume the Hessian of the potential function V is bounded below by a constant λ, i.e. ∇²V ⪰ λI. Then

        W₂(ρ_{θ_t}, ρ_t) ≤ (√δ₀ / λ)(1 − e^{−λt}) + e^{−λt} W₂(ρ_{θ₀}, ρ₀).   (3)

    When V is strictly convex, i.e. λ > 0, inequality (3) provides a uniformly small upper bound on the error W₂(ρ_{θ_t}, ρ_t). In many cases, however, λ < 0, and then the right-hand side of (3) increases to infinity as t → ∞. 40 98
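A bound of this shape arises from a Grönwall argument. The sketch below shows the mechanism only, assuming (not proving) the differential inequality in the first line; the talk's full argument is what justifies that inequality.

```latex
% Sketch: how a bound of the form (3) follows from Gronwall's inequality.
% Assume the error E(t) = W_2(\rho_{\theta_t}, \rho_t) satisfies
%   E'(t) \le -\lambda\, E(t) + \sqrt{\delta_0},
% i.e. contraction at rate \lambda plus an approximation error \sqrt{\delta_0}.
\begin{align*}
  \frac{d}{dt}\bigl( e^{\lambda t} E(t) \bigr)
    &= e^{\lambda t}\bigl( E'(t) + \lambda E(t) \bigr)
     \le e^{\lambda t}\sqrt{\delta_0}, \\
  e^{\lambda t} E(t) - E(0)
    &\le \frac{\sqrt{\delta_0}}{\lambda}\bigl( e^{\lambda t} - 1 \bigr), \\
  E(t)
    &\le \frac{\sqrt{\delta_0}}{\lambda}\bigl( 1 - e^{-\lambda t} \bigr)
       + e^{-\lambda t} E(0),
\end{align*}
% with E(0) = W_2(\rho_{\theta_0}, \rho_0).  For \lambda < 0 the factor
% (1 - e^{-\lambda t})/\lambda grows like e^{|\lambda| t}, consistent with
% the blow-up remark for nonconvex V.
```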
  99. Error analysis (improved version)

    Theorem. The W₂ error W₂(ρ_{θ_t}, ρ_t) at any time t > 0 can be uniformly bounded by a constant depending on E₀ = W₂(ρ_{θ₀}, ρ₀) and δ₀. To be more precise:
    1. When λ ≥ 0, the error W₂(ρ_{θ_t}, ρ_t) can be uniformly bounded by an O(E₀ + √δ₀) term;
    2. When λ < 0, the error W₂(ρ_{θ_t}, ρ_t) can be uniformly bounded by an O((E₀ + √δ₀)^{λ̃/(2|λ|+λ̃)}) term. 41 99
  100. Summary and Future Research

    Summary: solving the high-dimensional Fokker–Planck equation
    - Dimension reduction: PDE to ODE via implicit generative models
    - Variational approach for the semi-implicit Euler scheme update
    - Approximation and numerical error bounds

    Future research
    - Application to sampling techniques for time-varying problems
    - Application to other types of dissipative PDEs
    - Application to global optimization 42 100