Slide 1

Transport information scientific computing
Wuchen Li, University of South Carolina
Aug 27

Slide 2

Learning

Slide 3

Learning applications

Slide 4

Density flows: Fokker-Planck equation
The Fokker-Planck equation (FPE) is a parabolic evolution equation:
∂ρ(t, x)/∂t = ∇·(ρ(t, x)∇V(x)) + βΔρ(t, x),   ρ(0, x) = ρ_0(x),
where x ∈ R^d and β > 0 is a given constant coefficient. The FPE has several properties, such as (i) conservation of mass; (ii) entropy dissipation.
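
As a concrete illustration of the equation and its mass-conservation property, here is a minimal NumPy sketch that integrates a one-dimensional FPE with an explicit finite-difference step. The potential V(x) = x²/2, the grid, and the step sizes are illustrative assumptions, not choices from the talk.

```python
import numpy as np

# Minimal 1D Fokker-Planck solver: d rho/dt = d/dx(rho * V'(x)) + beta * d^2 rho/dx^2.
# Illustrative choices (not from the slides): V(x) = x^2/2, uniform grid, explicit Euler in time.
beta = 1.0
x = np.linspace(-5.0, 5.0, 201)
dx = x[1] - x[0]
dt = 0.2 * dx**2 / beta          # small step for explicit stability
V_prime = x                      # V(x) = x^2/2  =>  V'(x) = x

rho = np.exp(-(x - 2.0)**2)      # initial density: an off-center Gaussian bump
rho /= np.trapz(rho, x)          # normalize so the total mass is 1

def step(rho):
    drift = np.gradient(rho * V_prime, dx)        # divergence of the drift flux rho * V'
    lap = np.zeros_like(rho)
    lap[1:-1] = (rho[2:] - 2 * rho[1:-1] + rho[:-2]) / dx**2   # Laplacian of rho
    return rho + dt * (drift + beta * lap)

for _ in range(2000):
    rho = step(rho)

print("total mass ~", np.trapz(rho, x))   # stays close to 1 (conservation of mass)
```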

Slide 5

Related works
In computing FPEs, many numerical methods have been considered. Traditional methods:
(i) Discretization: finite volume, finite difference, finite element, etc.
(ii) Variational: optimal transport gradient flows, in both Eulerian and Lagrangian formulations (Carrillo, Craig, Wang, Wei, Li, Cances, Chow, et al.).
(iii) Tools: primal-dual methods (Li, Osher, et al.).
Curse of dimensionality: with traditional numerical methods, solving high-dimensional FPEs is not simple whenever d ≥ 3. This is known as the curse of dimensionality. In this talk, we aim to provide a new scheme based on deep learning variational methods.

Slide 6

Overcoming the curse of dimensionality
Many methods have been proposed for the Hamilton-Jacobi equation: Hopf-Lax formula (Darbon, Osher, et al.); path space (Weinan E et al.); Lagrangian methods (Lars, Halder, et al.); inverse problems (L. Yang and G. E. Karniadakis et al.).

Slide 7

Goal
Main question: can we apply learning-based basis functions (neural networks) to approximate the solution of the FPE in high-dimensional spaces?
Answer:
1. Dimension reduction via neural networks:
Fokker-Planck equation (PDE)
⇓
Neural parametric Fokker-Planck equation (ODE)
2. Approximate the solution of the FPE in data-poor situations, i.e., introduce sampling-friendly variational schemes.

Slide 8

Learning + Density flows
Natural gradient:
(i) Amari, Natural gradient works efficiently in learning; Ny et al., Information geometry;
(ii) ADAM, KFAC, etc.;
(iii) Li, Transport information geometry I, II.
Learning:
(i) Li, Guido, Chen, Natural gradient via optimal transport I, II; Ricci curvature of parametric statistics via optimal transport, IG;
(ii) Lin, Li, Osher, Guido, Wasserstein proximal of GANs; Natural affine proximal learning, GSI;
(iii) Arbel, Gretton, Li, Guido, Kernelized Wasserstein natural gradient, ICLR, 2020;
(iv) Lars, Osher, Li, Levon, Samy, A machine learning framework for mean field games, PNAS, 2020;
(v) Lin, Samy, Li, Levon, Osher, APAC-Net: neural primal-dual methods for mean field games, 2020;
(vi) Liu, Li, Zha, Zhou, Neural parametric Fokker-Planck equations.

Slide 9

Neural network + Density flows
Computational neural networks [1, 2]:
Deep structure ⇒ dealing with high-dimensional problems;
Pushforward mechanism ⇒ sampling-friendly.
[1] Ian J. Goodfellow et al. Generative Adversarial Nets. 2014.
[2] Martin Arjovsky et al. Wasserstein Generative Adversarial Networks. 2016.

Slide 10

Optimal transport
Mapping formulation: Monge problem (1781): Monge-Ampère equation;
Static formulation: Kantorovich problem (1940): linear programming;
Dynamical formulation: density optimal control (Nelson, Lafferty, Benamou, Brenier, Otto, Villani, etc.).
In this talk, we apply density optimal control to data-driven scientific computing methods.

Slide 11

Density control + Applications

Slide 12

“Least squares” in fluid dynamics
Optimal transport has an optimal control formulation as follows:
inf_{ρ_t} ∫_0^1 g_W(∂_t ρ_t, ∂_t ρ_t) dt = ∫_0^1 ∫_Ω (∇Φ_t, ∇Φ_t) ρ_t dx dt,
under the dynamical constraint, i.e. the continuity equation:
∂_t ρ_t + ∇·(ρ_t ∇Φ_t) = 0,   ρ_0 = ρ^0,   ρ_1 = ρ^1.
Here, (P(Ω), g_W) forms an infinite-dimensional Riemannian manifold [1].
[1] John D. Lafferty: The density manifold and configuration space quantization, 1988.

Slide 13

Optimal transport gradient
Denote by grad_W the gradient operator on (P, g_W). Consider a smooth functional F: P → R; then the optimal transport gradient operator satisfies
grad_W F(ρ) = g_W(ρ)^{-1} (δ/δρ)F(x) = −∇·(ρ(x) ∇ (δ/δρ(x)) F(ρ)),
where δ/δρ(x) is the L² first variation at the variable x ∈ R^d.
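
For instance (a standard worked example, stated here for completeness rather than taken from the slide), the potential energy functional has first variation V(x), so its optimal transport gradient reproduces the drift term of the FPE:

```latex
\[
\mathcal{F}(\rho) = \int V(x)\,\rho(x)\,dx,
\qquad
\frac{\delta}{\delta\rho(x)}\mathcal{F}(\rho) = V(x),
\qquad
\operatorname{grad}_W \mathcal{F}(\rho) = -\nabla\cdot\bigl(\rho(x)\,\nabla V(x)\bigr).
\]
% Its gradient flow  drho/dt = -grad_W F(rho) = div(rho grad V)  is exactly the drift part of the FPE.
```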

Slide 14

Optimal transport gradient flows
Consider the Kullback–Leibler (KL) divergence:
H(ρ) = β D_KL(ρ ‖ e^{−V/β}) = ∫ V(x)ρ(x) dx + β ∫ ρ(x) log ρ(x) dx + Const.
The optimal transport gradient flow of the KL divergence satisfies
∂ρ/∂t = −β g_W(ρ)^{-1} (δD_KL/δρ)(x) = ∇·(ρ∇V) + β∇·(ρ∇ log ρ) = ∇·(ρ∇V) + βΔρ,
where ρ∇ log ρ = ∇ρ.
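
A one-line check of the diffusion term (a standard computation, not on the slide): the first variation of ∫ ρ log ρ dx is log ρ + 1, so the entropy part of H contributes

```latex
\[
\frac{\partial\rho}{\partial t}\Big|_{\text{entropy}}
= \nabla\cdot\bigl(\rho\,\nabla\,\beta(\log\rho+1)\bigr)
= \beta\,\nabla\cdot(\rho\,\nabla\log\rho)
= \beta\,\nabla\cdot\nabla\rho
= \beta\,\Delta\rho ,
\]
% matching the diffusion term of the Fokker-Planck equation above.
```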

Slide 15

Generative Adversarial Networks
For each parameter θ ∈ Θ and a given neural-network-parameterized mapping function T_θ, consider the pushforward density ρ_θ = T_θ# p(z).

Slide 16

Generative probability models
Consider a class of invertible push-forward maps {T_θ}_{θ∈Θ}, indexed by a parameter θ ∈ Θ ⊂ R^m, with T_θ: R^d → R^d. We obtain a family of parametric distributions P_Θ = {ρ_θ = T_θ# p | θ ∈ Θ} ⊂ (P, g_W).
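
A minimal sketch of the push-forward mechanism: draw z ~ p and map it through a parametric T_θ, which both generates samples from ρ_θ and keeps the scheme sampling-friendly. The small residual map below and its sizes are illustrative assumptions (it is not the talk's network, and true invertibility requires a constrained architecture such as the normalizing flow on a later slide).

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden = 2, 16

# Hypothetical parameters of a small two-layer residual map T_theta: R^d -> R^d.
theta = {
    "W1": 0.1 * rng.standard_normal((hidden, d)), "b1": np.zeros(hidden),
    "W2": 0.1 * rng.standard_normal((d, hidden)), "b2": np.zeros(d),
}

def T(theta, z):
    """Push-forward map T_theta applied row-wise to samples z of shape [n, d]."""
    h = np.tanh(z @ theta["W1"].T + theta["b1"])
    return z + h @ theta["W2"].T + theta["b2"]   # residual form keeps T close to the identity

# Sampling from rho_theta = T_theta # p: draw z ~ p (here p = standard Gaussian), then map.
z = rng.standard_normal((1000, d))
samples = T(theta, z)
print(samples.mean(axis=0))   # Monte Carlo estimate of the mean of rho_theta
```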

Slide 17

Neural information matrix
Here G_W(θ) ∈ R^{m×m} is a matrix [2] defined by
G_{W,ij}(θ) = ∫ ∇Φ_i(x) · ∇Φ_j(x) ρ(x, θ) dx,   where −∇·(ρ_θ ∇Φ_k(x)) = ∇_{θ_k} ρ_θ, for k = i, j.
In terms of generative models, the matrix takes the form
G_{W,ij}(θ) = ∫ ∇ψ_i(T_θ(x)) · ∇ψ_j(T_θ(x)) p(x) dx,   1 ≤ i, j ≤ m,
where ∇·(ρ_θ ∇ψ_k(x)) = ∇·(ρ_θ ∂_{θ_k} T_θ(T_θ^{-1}(x))).
[2] Li, Zhao, Wasserstein information matrix, 2019.
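
Once the potentials ψ_k (or at least their gradients ∇ψ_k) are available, the generative-model form of G_W(θ) is a plain Monte Carlo average. A hedged sketch, where grad_psi is a hypothetical list of callables returning ∇ψ_k at a point:

```python
import numpy as np

def wasserstein_info_matrix(grad_psi, T_theta, p_samples):
    """Monte Carlo estimate of G_W(theta)_{ij} = E_{x~p}[ grad psi_i(T(x)) . grad psi_j(T(x)) ].

    grad_psi  : list of m callables, grad_psi[k](y) -> array of shape [d]  (assumed given)
    T_theta   : callable mapping a sample x in R^d to T_theta(x) in R^d
    p_samples : array of shape [n, d], i.i.d. samples from the reference density p
    """
    n = p_samples.shape[0]
    # Gradients evaluated at the pushed-forward samples, stacked to shape [m, n, d].
    grads = np.stack([[gp(T_theta(x)) for x in p_samples] for gp in grad_psi])
    # G_ij = (1/n) * sum_s grads[i, s, :] . grads[j, s, :]
    return np.einsum('isd,jsd->ij', grads, grads) / n
```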

Slide 18

Neural parametric Fokker-Planck equation
Consider H = H ∘ T_#: Θ → R; then
H(θ) = H(ρ_θ) = ∫ [V(T_θ(x)) + β log ρ_θ(T_θ(x))] p(x) dx.
Writing the gradient flow of H(θ) in (Θ, G_W(θ)), we derive
θ̇ = −G_W(θ)^{-1} ∇_θ H(θ).
We call the above ODE the neural parametric Fokker-Planck equation. It is the neural approximation of the FPE
∂ρ/∂t = −g_W(ρ)^{-1} (δ/δρ) H(ρ).
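
In principle, once G_W(θ) and ∇_θ H(θ) are available, the parametric equation is just an ODE in R^m. A hedged forward-Euler sketch (the callables G and grad_H are hypothetical placeholders; the talk instead advocates the variational proximal scheme on the following slides, precisely to avoid forming and inverting G_W):

```python
import numpy as np

def solve_parametric_fpe(theta0, G, grad_H, h=1e-3, num_steps=1000):
    """Forward-Euler integration of  d theta/dt = -G_W(theta)^{-1} grad_theta H(theta).

    theta0  : initial parameter, array of shape [m]
    G       : callable, G(theta) -> metric matrix of shape [m, m]   (assumed given)
    grad_H  : callable, grad_H(theta) -> gradient of shape [m]      (assumed given)
    """
    theta = np.array(theta0, dtype=float)
    for _ in range(num_steps):
        # Solve G_W(theta) v = grad H(theta) instead of forming the inverse explicitly.
        v = np.linalg.solve(G(theta), grad_H(theta))
        theta = theta - h * v
    return theta
```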

Slide 19

Algorithm
We endow T_θ with the structure of a deep neural network. In particular, we choose a normalizing flow [9]:
T_θ(z) = T_{1,θ} ∘ T_{2,θ} ∘ ··· ∘ T_{N,θ}(z).
Difficulties: (i) direct computation of G_W(θ) could be expensive; (ii) G_W(θ)^{-1} may not be efficiently computed within a deep learning framework. Hence we seek a variational proximal scheme to handle them.
[9] D. Rezende and S. Mohamed. Variational inference with normalizing flows. ICML, 2015.
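
A minimal sketch of the composed, invertible push-forward structure; the elementwise affine layer is only an illustrative placeholder, not the specific flow layers used in the talk.

```python
import numpy as np

def make_affine_layer(scale, shift):
    """One invertible elementwise layer T_k(z) = exp(scale) * z + shift (placeholder flow layer)."""
    forward = lambda z: np.exp(scale) * z + shift
    inverse = lambda y: (y - shift) * np.exp(-scale)
    return forward, inverse

rng = np.random.default_rng(0)
d, N = 2, 4
layers = [make_affine_layer(0.1 * rng.standard_normal(d), 0.1 * rng.standard_normal(d))
          for _ in range(N)]

def T(z):
    for fwd, _ in reversed(layers):   # apply T_N first and T_1 last, matching T_1 o ... o T_N
        z = fwd(z)
    return z

def T_inv(y):
    for _, inv in layers:             # invert the layers in the opposite order
        y = inv(y)
    return y

z = rng.standard_normal((5, d))
print(np.allclose(T_inv(T(z)), z))    # True: the composed map is invertible
```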

Slide 20

Review: Proximal method
Consider a Euclidean gradient flow in (R^d, I): ẋ = −∇F(x). The proximal method refers to
x^{k+1} = argmin_{x ∈ R^d} F(x) + (1/2h) ‖x − x^k‖²,
where h > 0 is the step size. This is known as the backward Euler method
x^{k+1} = x^k − h∇F(x^{k+1}).
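
A hedged sketch of one proximal step using scipy.optimize.minimize; the quadratic F below is only an illustration. For a quadratic objective the proximal point can be checked against the backward-Euler formula in closed form.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative objective F(x) = 0.5 * x^T A x with A symmetric positive definite.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
F = lambda x: 0.5 * x @ A @ x

def proximal_step(F, x_k, h):
    """One proximal step: argmin_x F(x) + (1/(2h)) * ||x - x_k||^2."""
    obj = lambda x: F(x) + np.sum((x - x_k) ** 2) / (2 * h)
    return minimize(obj, x_k).x

x_k, h = np.array([1.0, -1.0]), 0.1
x_next = proximal_step(F, x_k, h)

# For quadratic F, the proximal step equals backward Euler: x_{k+1} = (I + h A)^{-1} x_k.
print(np.allclose(x_next, np.linalg.solve(np.eye(2) + h * A, x_k), atol=1e-5))
```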

Slide 21

Review: Proximal method on a metric structure
Consider a gradient flow in (R^d, G): ẋ = −G(x)^{-1}∇F(x). The proximal method refers to
x^{k+1} = argmin_{x ∈ R^d} F(x) + (1/2h) d_G(x, x^k)²,
where h > 0 is the step size, and d_G(x, y) is the distance function induced by G, i.e.
d_G(x, y)² = inf_γ { ∫_0^1 γ̇(t)^T G(γ(t)) γ̇(t) dt : γ(0) = x, γ(1) = y }.
This is known as the variational backward Euler method
x^{k+1} = x^k − hG(x^{k+1})^{-1}∇F(x^{k+1}) + o(h).

Slide 22

Primal-dual scheme for FPE
In terms of optimal transport, the proximal method, also named the Jordan–Kinderlehrer–Otto (JKO) scheme, takes the form
ρ^{k+1} = argmin_ρ F(ρ) + (1/2h) d_W(ρ, ρ^k)².
A suitable linearization of the metric can be considered; see Li, Wang, JCP (2020).

Slide 23

AI Discretization
Goal: propose a variational scheme for solving gradient flows in (Θ, G_W(θ)):
θ̇_t = −G_W(θ_t)^{-1} ∇_θ H(θ_t).
Consider a linearized proximal algorithm:
θ^{k+1} = argmin_θ { ⟨θ − θ^k, G_W(θ^k)(θ − θ^k)⟩ + 2hH(θ) }.

Slide 24

GAN = Neural proximal dual methods
Consider the following approximation:
⟨θ − θ^k, G_W(θ^k)(θ − θ^k)⟩ ≈ max_ψ ∫_M ( 2∇ψ(T_{θ^k}(x)) · (T_θ(x) − T_{θ^k}(x)) − |∇ψ(T_{θ^k}(x))|² ) p(x) dx.
Hence the scheme takes the form
θ^{k+1} = argmin_θ max_ψ ∫_M ( 2∇ψ(T_{θ^k}(x)) · (T_θ(x) − T_{θ^k}(x)) − |∇ψ(T_{θ^k}(x))|² ) p(x) dx + 2hH(θ).
Here we choose ψ as a ReLU network and optimize over its parameters. We write the saddle-point scheme in terms of finite samples and carry out the optimization step, as in GANs.

Slide 25

Algorithm
Algorithm 1: Solving the parametric FPE θ̇ = −G(θ)^{-1} ∇_θ F(θ)
Initialize θ_0
for k = 0, 1, ..., N do
  θ ← θ_k
  for j = 1, 2, ..., M do
    Apply stochastic gradient descent to solve: inf_ψ ∫ |T_θ(x) − T_{θ_k}(x) − ∇ψ(T_{θ_k}(x))|² p(x) dx
    θ ← θ − stepsize · ∇_θ J(θ, θ_k). Here J(θ, θ̃) is defined as J(θ, θ̃) = ∫ T_θ(x) · ∇ψ(T_θ̃(x)) p(x) dx + hF(θ)
  end for
  θ_{k+1} ← θ
end for
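
A hedged PyTorch sketch of Algorithm 1's inner loop: fit ψ by the least-squares problem above, then take a gradient step on J(θ, θ_k). The network sizes, sample counts, learning rates, and the placeholder free-energy term F below are illustrative assumptions, not the talk's actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn

d, h = 2, 0.005
T = nn.Sequential(nn.Linear(d, 32), nn.Tanh(), nn.Linear(32, d))    # push-forward map T_theta
psi = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))  # dual potential psi (ReLU net)
opt_T = torch.optim.Adam(T.parameters(), lr=1e-3)
opt_psi = torch.optim.Adam(psi.parameters(), lr=1e-3)

def grad_psi(y):
    """Gradient of the scalar network psi with respect to its input, evaluated row-wise."""
    y = y.detach().clone().requires_grad_(True)
    return torch.autograd.grad(psi(y).sum(), y, create_graph=True)[0]

def F(y):
    """Placeholder free energy: only the potential part E[V(T(x))] with V(y) = |y|^2 / 2."""
    return 0.5 * (y ** 2).sum(dim=1).mean()

for k in range(10):                      # outer proximal (JKO-type) steps
    x = torch.randn(256, d)              # samples from the reference density p
    with torch.no_grad():
        Tk_x = T(x)                      # T_{theta_k}(x), frozen during this outer step
    for j in range(50):                  # inner stochastic gradient iterations
        # (a) fit psi:  min_psi  E | T_theta(x) - T_{theta_k}(x) - grad psi(T_{theta_k}(x)) |^2
        loss_psi = ((T(x).detach() - Tk_x - grad_psi(Tk_x)) ** 2).sum(dim=1).mean()
        opt_psi.zero_grad(); loss_psi.backward(); opt_psi.step()
        # (b) update theta:  theta <- theta - stepsize * grad_theta J(theta, theta_k)
        J = (T(x) * grad_psi(Tk_x).detach()).sum(dim=1).mean() + h * F(T(x))
        opt_T.zero_grad(); J.backward(); opt_T.step()
```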

Slide 26

Numerical examples
We compute the FPE with a complicated potential in higher dimensions. In R^30, with β = 1, we set
V(x) = (1/50) Σ_{i=1}^{30} (x_i^4 − 16x_i^2 + 5x_i).
(The slide shows a plot of this function in two-dimensional space.)
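
For concreteness, the test potential as a function of a sample x ∈ R^30 (a direct transcription of the formula above):

```python
import numpy as np

def V(x):
    """Test potential V(x) = (1/50) * sum_i (x_i^4 - 16 x_i^2 + 5 x_i), for x of shape [30]."""
    return (x**4 - 16 * x**2 + 5 * x).sum() / 50.0

x = np.zeros(30)
print(V(x))   # 0.0 at the origin; the potential has a tilted double-well shape along each coordinate
```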

Slide 27

Numerical examples
Let the time step size be h = 0.005. We project the samples in R^30 onto a two-dimensional subspace, say the plane spanned by the 5th and 15th components. (The numerical results are shown on the slide.)

Slide 28

Numerical examples
We can inspect the graph of ψ trained at the end of each outer iteration; the slides show the graphs of ψ at the 30th, 60th, 150th, 240th, 300th and 360th iterations.
Figure: graph of ψ at the k = 30th time step. Figure: graph of ψ at the k = 60th time step. Figure: graph of ψ at the k = 150th time step.

Slide 29

Numerical examples
Continuing, here are the graphs of ψ at the later outer iterations.
Figure: graph of ψ at the k = 240th time step. Figure: graph of ψ at the k = 300th time step. Figure: graph of ψ at the k = 360th time step.

Slide 30

Neural numerical analysis
Denote by {ρ_t} the solution to the original FPE, and suppose {θ_t} is the solution to the parametric FPE, with ρ_{θ_t} = T_{θ_t}# p.
1. Convergence analysis: we provide an upper bound for D_KL(ρ_{θ_t} ‖ ρ*). Here ρ* = (1/Z_β) exp(−V/β) is the Gibbs distribution (the equilibrium solution of the Fokker-Planck equation).
2. Error analysis: we provide a uniform bound for W_2(ρ_{θ_t}, ρ_t).

Slide 31

Neural numerical analysis
Consider
δ_0 = sup_{θ∈Θ} min_{ξ∈T_θΘ} ∫ |(∇Ψ(T_θ(x)))^T ξ − ∇(V + β log ρ_θ) ∘ T_θ(x)|² p(x) dx.
One can treat δ_0 as a quantification of the approximation power of the push-forward family {T_θ}_{θ∈Θ}.

Slide 32

Convergence analysis
Theorem. Consider the Fokker-Planck equation with smooth V and ∇²V ⪰ 0 outside a finite ball. Suppose {θ_t} solves θ̇_t = −G_W(θ_t)^{-1} ∇_θ H(θ_t). Let ρ*(x) = (1/Z_β) e^{−V(x)/β} be the Gibbs distribution of the original Fokker-Planck equation. Then we have the inequality
KL(ρ_{θ_t} ‖ ρ*) ≤ (δ_0 / (λ̃_β β²)) (1 − e^{−β λ̃_β t}) + KL(ρ_{θ_0} ‖ ρ*) e^{−β λ̃_β t}.
Here λ̃_β > 0 is a constant originating from the log-Sobolev inequality involving ρ*, i.e.
KL(ρ ‖ ρ*) ≤ (1/λ̃_β) I(ρ | ρ*)
holds for any probability density ρ, where I(ρ | ρ*) = ∫ |∇ log(ρ/ρ*)|² ρ dx.

Slide 33

Future works
(i) (Scientific computing) Solve general transport-related kinetic PDEs, including mean field games, Schrödinger equations, Schrödinger bridge problems, etc.;
(ii) (Sampling optimization) Design transport information PDEs for sampling problems;
(iii) (Estimation) Apply statistical efficiency to the estimation of neural networks via the transport information matrix.