# Neural Fokker-Planck equation

In this talk, we develop and analyze numerical methods for high dimensional Fokker-Planck equations by leveraging generative models from deep learning. Our starting point is a formulation of the Fokker-Planck equation as a system of ordinary differential equations (ODEs) on finite-dimensional parameter space with the parameters inherited from generative models such as normalizing flows. We call such ODEs neural parametric Fokker-Planck equation. The fact that the Fokker-Planck equation can be viewed as the L2-Wasserstein gradient flow of Kullback-Leibler (KL) divergence allows us to derive the ODEs as the constrained L2-Wasserstein gradient flow of KL divergence on the set of probability densities generated by neural networks. For numerical computation, we design a variational semi-implicit scheme for the time discretization of the proposed ODE. Such an algorithm is sampling-based, which can readily handle Fokker-Planck equations in higher dimensional spaces. Moreover, we also establish bounds for the asymptotic convergence analysis of the neural parametric Fokker-Planck equation as well as its error analysis for both the continuous and discrete (forward-Euler time discretization) versions. Several numerical examples are provided to illustrate the performance of the proposed algorithms and analysis.

May 06, 2020

## Transcript

Aug 27

4. ### Density ﬂows: Fokker-Planck equation Fokker Planck equation (FPE) is a

parabolic evolution equation: ∂ρ(t, x) ∂t = ∇ · (ρ(t, x)∇V (x)) + β∆ρ(t, x), ρ(0, x) = ρ0 (x), where x ∈ Rd and β > 0 is a given coeﬃcient constant. There are several properties of FPE, such as (i) Conservation of Mass; (ii) Entropy dissipation. 4
5. ### Related works In computing FPEs, many numerical methods have been

considered. Traditional methods (i) Discretization: Finite volume; Finite diﬀerence; Finite element e.t.c. (ii) Variation: Optimal transport gradient ﬂows, in both Eulerian and Lagrangian formulations. (Carrilo, Craig, Wang, Wei, Li, Cances, Chow, et.al.) (iii) Tools: Primal dual methods (Li, Osher, et.al.). Curse of dimensionality In traditional numerical methods, solving high dimensional FPE is not simple, whenever d ≥ 3. This is known as the curse of dimensionality. In this talk, we are aiming to provide a new scheme by deep learning variational methods. 5
6. ### Overcome curse of dimensionality Many methods have been proposed for

the Hamilton-Jacobi equation: Hopf-Lax formula: Darbon, Osher, et.al; Path space: Weinan E et.al; Lagrangian Methods: Lars, Halder, et.al; Inverse problems: L. Yang and G. E. Karniadakis et.al. 6
7. ### Goal Main question: Can we apply the learning-related basis functions

(Neural network) to approximate the solution of FPE in high dimensional spaces? Answer: 1. Dimension reduction via Neural networks: Fokker-Planck equation (PDE) ⇓ Neural parametric Fokker-Planck equation (ODE) 2. Approximate a solution of FPE in data-poor situations. I.e. introduce sampling friendly variational schemes. 7
8. ### Learning+Density ﬂows Natural gradient: (i) Amari, Natural gradient works eﬃcient

in learning; Ny et.al. Information geometry; (ii) ADAM, KFAC etc; (iii) Li, Transport information geometry I, II. Learning: (i) Li, Guido, Chen, Natural gradient via optimal transport I, II, Ricci curvature of parametric statistics via optimal transport, IG; (ii) Lin, Li, Osher, Guido, Wasserstein proximal of GANs; Natural aﬃne proximal learning, GSI; (iii) Abel, Gretton, Li, Guido, Kernelized Wasserstein Natural Gradient, ICLR, 2020; (iv) Lars, Osher, Li, Levon, Samy, A machine learning framework for mean ﬁeld games, PNAS, 2020; (v) Lin, Samy, Li, Levon, Osher, APAC net: Neural primal dual methods for mean ﬁeld games, 2020; (vi) Liu, Li, Zha, Zhou, Neural parametric Fokker-Planck equations. 8
9. ### Neural network+Density ﬂows Computational neural networks12: Deep structure ⇒ Dealing

with high dimensional problem; Pushforward mechanism ⇒ Sampling-friendly. 1Ian J. Goodfellow et al. Generative Adversarial Nets. 2014 2Martin Arjovsky et al. Wasserstein Generative Adversarial Networks. 2016 9
10. ### Optimal transport Mapping formulation: Monge problem (1781): Monge-Amp´ ere equation

; Statical formulation: Kantorovich problem (1940): Linear programming ; Dynamical formulation: Density optimal control (Nelson, Laﬀerty, Benamou, Brenier, Otto, Villani, etc). In this talk, we will apply density optimal control into data-driven scientiﬁc computing methods. 10

12. ### “Least square” in ﬂuid dynamics Optimal transport has an optimal

control formulation as follows: inf ρt 1 0 gW (∂t ρt , ∂t ρt )dt = 1 0 Ω (∇Φt , ∇Φt )ρt dxdt, under the dynamical constraint, i.e. continuity equation: ∂t ρt + ∇ · (ρt ∇Φt ) = 0, ρ0 = ρ0, ρ1 = ρ1. Here, (P(Ω), gW ) forms an inﬁnite-dimensional Riemannian manifold1. 1John D. Laﬀerty: the density manifold and conﬁguration space quantization, 1988. 12

gW ). Consider a smooth functional F: P → R, then the optimal transport gradient operator satisﬁes gradW F(ρ) =gW (ρ)−1 δ δρ F (x) = − ∇ · (ρ(x)∇ δ δρ(x) F(ρ)), where δ δρ(x) is the L2 ﬁrst variation at variable x ∈ Rd. 13
14. ### Optimal transport gradient ﬂows Consider the Kullback–Leibler (KL) divergence: H(ρ)

=βDKL (ρ e− V β ) = V (x)ρ(x)dx + β ρ(x) log ρ(x)dx + Const. The optimal transport gradient ﬂow of KL divergence satisﬁes: ∂ρ ∂t = − βgW (ρ)−1 δDKL δρ (x) =∇ · (ρ∇V ) + β∇ · (ρ∇ log ρ)) =∇ · (ρ∇V ) + β∆ρ, where ρ∇ log ρ = ∇ρ. 14
15. ### Generative Adversary Networks For each parameter θ ∈ Rd and

given neural network parameterized mapping function Tθ , consider ρθ = Tθ #p(z). 15
16. ### Generative probability models Consider a class of invertible push-forward maps

{Tθ }θ∈Θ indexed by parameter θ ∈ Θ ⊂ Rm Tθ : Rd → Rd. We obtain a family of parametric distributions PΘ = ρθ = Tθ# p | θ ∈ Θ ⊂ (P, gW ). 16
17. ### Neural information matrix Here GW (θ) ∈ Rm×m is an

matrix2 deﬁned by GW,ij (θ) = ∇Φi (x), ∇Φj (x)ρ(x, θ)dx, where −∇ · (ρθ ∇Φk (x)) = ∇θk ρθ , where k = i, j. In term of generative models, the matrix forms GW,ij (θ) = ∇ψi (Tθ (x)) · ∇ψj (Tθ (x)) p(x) dx, 1 ≤ i, j ≤ m. where ∇ · (ρθ ∇ψk (x)) = ∇ · (ρθ ∂θk Tθ (T−1 θ (x))). 2Li, Zhao, Wasserstein information matrix, 2019. 17
18. ### Neural parametric Fokker-Planck equation Consider H = H ◦ T#

: Θ → R, then H(θ) = H(ρθ ) = [V (Tθ (x)) + β log ρθ (Tθ (x))]p(x)dx. Writing the gradient ﬂow of H(θ) in (Θ, GW (θ)), we derive ˙ θ = −GW (θ)−1∇θ H(θ). We call the above ODE Neural Fokker-Planck equation. This is the neural approximation of FPE ∂ρ ∂t = −gW (ρ)−1 δ δρ H(ρ). 18
19. ### Algorithm We endow Tθ with the structure of deep neural

networks. In particular, we choose the normalizing ﬂow9: Tθ (z) = T1θ ◦ T2θ ◦ · · · ◦ TNθ (z). Diﬃculties (i) Direct computation of GW (θ) could be expensive; (ii) And GW (θ)−1 may not be eﬃciently computed under deep learning framework; Here we have to seek a variational proximal scheme to handle them. 9D. Rezende and S. Mohamed. Variational inference with normalizing ﬂows. ICML, 2015 19
20. ### Review: Proximal method Consider an Euclidean gradient ﬂow in (Rd,

I): ˙ x = −∇F(x). The Proximal method refers to xk+1 = arg min x∈Rn F(x) + 1 2h x − xk 2, where h > 0 is the step-size. This is known as the backward Euler method xk+1 = xk − h∇F(xk+1 ). 20
21. ### Review: Proximal method on a metric structure Consider a gradient

ﬂow in (Rd, G): ˙ x = −G(x)−1∇F(x). The proximal method refers to xk+1 = arg min x∈Rn F(x) + 1 2h dG (x, xk )2, where h > 0 is the step-size, and dG (x, y) is the distance function induced by G, i.e. dG (x, y)2 = inf γ 1 0 ˙ γ(t)T G(γ(t))˙ γ(t)dt: γ(0) = x, γ(1) = y . This is known as the variational backward Euler method xk+1 = xk − hG(xk+1 )−1∇F(xk+1 ) + o(h). 21
22. ### Primal-dual scheme for FPE In term of optimal transport, the

proximal method, also named Jordan–Kinderlehrer–Otto (JKO) scheme, forms ρk+1 = arg min ρ F(ρ) + 1 2h dW (ρ, ρk )2 In other words, A suitable linearization on the metric can be considered; see Li, Wang, JCP, (2020). 22
23. ### AI Discretization Goal: Propose a variational scheme for solving gradient

ﬂows in (Θ, GW (θ)): ˙ θt = −GW (θt )−1∇θ H(θt ). Consider a linearized proximal algorithm: θk+1 = argmin θ { θ − θk , GW (θk )(θ − θk ) + 2hH(θ)} . 23
24. ### GAN=Neural proximal dual methods Consider the following approximation: θ −

θk , GW (θk )(θ − θk ) ≈ max ψ M (2∇ψ(Tθk ) · ((Tθ − Tθk ) − |∇ψ(Tθk )|2)p(x)dx Hence the scheme forms θk+1 =argmin θ max ψ M 2∇ψ(Tθk (x)) · (Tθ − Tθk ) − |∇ψ(Tθk (x))|2 p(x)dx + 2hH(θ) . Here we choose ψ as ReLU network and optimize over its parameters. Write the saddle point scheme in terms of ﬁnite samples and carry out the optimization step, as in GAN. 24
25. ### Algorithm Algorithm 1 Solving parametric FPE ˙ θ = −G(θ)−1∇θ

F(θ) Initialize θ0 for k = 0, 1, ..., N do θ ← θk ; for j = 1, 2, ..., M do Apply Stochastic gradient descent to solve: inf ψ |Tθ(x) − Tθk (x) − ∇ψ(Tθk (x))|2 p(x)dx θ ← θ − stepsize · ∇θJ(θ, θk). Here J(θ, ˜ θ) is deﬁned as: J(θ, ˜ θ) = Tθ(x) · ∇ψ(T˜ θ (x)) p(x)dx + hF(θ) end for θk+1 ← θ end for 25
26. ### Numerical examples We compute FPE with a complicated potential in

higher dimensions. In R30, β = 1, we set: V (x) = 1 50 30 i=1 x4 i − 16x2 i + 5xi Here is a plot of such function in 2 dimensional space: 26
27. ### Numerical examples Let the time stepsize be h = 0.005.

We project the samples in R30 to a two dimensional subspace, say the plane spanned by the 5th and 15th components. Here are the numerical results: 27
28. ### Numerical examples We can verify the graph of ψ trained

at the end of each outer iteration. Here are the graphs of ψ at 30th, 60th, 150th, 240th, 300th and 360th iteration: Figure: Graph of ψ at k = 30th time step Figure: Graph of ψ at k = 60th time step Figure: Graph of ψ at k = 150th time step 28
29. ### Numerical examples We can verify the graph of ψ trained

at the end of each outer iteration. Here are the graphs of ψ at 30th, 60th, 150th, 240th, 300th and 360th iteration: Figure: Graph of ψ at k = 240th time step Figure: Graph of ψ at k = 300th time step Figure: Graph of ψ at k = 360th time step 29
30. ### Neural numerical analysis Denote {ρt } as the solution to

the original FPE. Suppose {θt } as the solution to parametric FPE with ρθt = Tθt# p. 1 Convergence analysis: we will provide upper bound for DKL (ρθt ρ∗ ). Here ρ∗ = 1 Zβ exp(−V β ) is the Gibbs distribution (equilibrium solution of Fokker-Planck equation). 2 Error analysis: we provide uniform bound for W2 (ρθt , ρt ). 30
31. ### Neural numerical analysis Consider δ0 = sup θ∈Θ min ξ∈TθΘ

|(∇Ψ(Tθ (x))T ξ − ∇ (V + β log ρθ ) ◦ Tθ (x)|2p(x)dx One can treat δ0 as a quantiﬁcation of the approximation power of push-forward family {Tθ }θ∈Θ . 31
32. ### Convergence analysis Theorem Consider Fokker-Planck equation with smooth V and

∇2V 0 outside a ﬁnite ball. Suppose {θt } solves ˙ θt = −GW (θt )−1∇θ H(θt ). Let ρ∗ (x) = 1 Zβ e−V (x)/β be the Gibbs distribution of original Fokker-Planck equation. Then we have the inequality: KL(ρθt ρ∗ ) ≤ δ0 ˜ λβ β2 (1 − e−β˜ λβ t) + KL(ρθ0 ρ∗ )e−β˜ λβ t. Here ˜ λβ > 0 is a constant number originating from the Log-Sobolev inequality involving ρ∗ , i.e. KL(ρ ρ∗ ) ≤ 1 ˜ λβ I(ρ|ρ∗ ) holds for any probability density ρ. Here I(ρ|ρ∗ ) = ∇ log ρ ρ∗ 2 ρ dx. 32
33. ### Future works (i) (Scientiﬁc computing) Solve general transport related kinetic

PDEs, including mean ﬁeld games, Schr¨ odinger equations, Schr¨ odinger bridge problems, e.t.c; (ii) (Sampling optimization) Design transport information PDEs on sampling problems; (iii) (Estimation) Apply statistical eﬃciency, estimation of Neural networks via transport information matrix. 33