Neural Fokker-Planck equation

Neural Fokker-Planck equation

In this talk, we develop and analyze numerical methods for high dimensional Fokker-Planck equations by leveraging generative models from deep learning. Our starting point is a formulation of the Fokker-Planck equation as a system of ordinary differential equations (ODEs) on finite-dimensional parameter space with the parameters inherited from generative models such as normalizing flows. We call such ODEs neural parametric Fokker-Planck equation. The fact that the Fokker-Planck equation can be viewed as the L2-Wasserstein gradient flow of Kullback-Leibler (KL) divergence allows us to derive the ODEs as the constrained L2-Wasserstein gradient flow of KL divergence on the set of probability densities generated by neural networks. For numerical computation, we design a variational semi-implicit scheme for the time discretization of the proposed ODE. Such an algorithm is sampling-based, which can readily handle Fokker-Planck equations in higher dimensional spaces. Moreover, we also establish bounds for the asymptotic convergence analysis of the neural parametric Fokker-Planck equation as well as its error analysis for both the continuous and discrete (forward-Euler time discretization) versions. Several numerical examples are provided to illustrate the performance of the proposed algorithms and analysis.


Wuchen Li

May 06, 2020


  1. Transport information scientific computing Wuchen Li University of South Carolina

    Aug 27
  2. Learning 2

  3. Learning applications 3

  4. Density flows: Fokker-Planck equation Fokker Planck equation (FPE) is a

    parabolic evolution equation: ∂ρ(t, x) ∂t = ∇ · (ρ(t, x)∇V (x)) + β∆ρ(t, x), ρ(0, x) = ρ0 (x), where x ∈ Rd and β > 0 is a given coefficient constant. There are several properties of FPE, such as (i) Conservation of Mass; (ii) Entropy dissipation. 4
  5. Related works In computing FPEs, many numerical methods have been

    considered. Traditional methods (i) Discretization: Finite volume; Finite difference; Finite element e.t.c. (ii) Variation: Optimal transport gradient flows, in both Eulerian and Lagrangian formulations. (Carrilo, Craig, Wang, Wei, Li, Cances, Chow, (iii) Tools: Primal dual methods (Li, Osher, Curse of dimensionality In traditional numerical methods, solving high dimensional FPE is not simple, whenever d ≥ 3. This is known as the curse of dimensionality. In this talk, we are aiming to provide a new scheme by deep learning variational methods. 5
  6. Overcome curse of dimensionality Many methods have been proposed for

    the Hamilton-Jacobi equation: Hopf-Lax formula: Darbon, Osher,; Path space: Weinan E; Lagrangian Methods: Lars, Halder,; Inverse problems: L. Yang and G. E. Karniadakis 6
  7. Goal Main question: Can we apply the learning-related basis functions

    (Neural network) to approximate the solution of FPE in high dimensional spaces? Answer: 1. Dimension reduction via Neural networks: Fokker-Planck equation (PDE) ⇓ Neural parametric Fokker-Planck equation (ODE) 2. Approximate a solution of FPE in data-poor situations. I.e. introduce sampling friendly variational schemes. 7
  8. Learning+Density flows Natural gradient: (i) Amari, Natural gradient works efficient

    in learning; Ny Information geometry; (ii) ADAM, KFAC etc; (iii) Li, Transport information geometry I, II. Learning: (i) Li, Guido, Chen, Natural gradient via optimal transport I, II, Ricci curvature of parametric statistics via optimal transport, IG; (ii) Lin, Li, Osher, Guido, Wasserstein proximal of GANs; Natural affine proximal learning, GSI; (iii) Abel, Gretton, Li, Guido, Kernelized Wasserstein Natural Gradient, ICLR, 2020; (iv) Lars, Osher, Li, Levon, Samy, A machine learning framework for mean field games, PNAS, 2020; (v) Lin, Samy, Li, Levon, Osher, APAC net: Neural primal dual methods for mean field games, 2020; (vi) Liu, Li, Zha, Zhou, Neural parametric Fokker-Planck equations. 8
  9. Neural network+Density flows Computational neural networks12: Deep structure ⇒ Dealing

    with high dimensional problem; Pushforward mechanism ⇒ Sampling-friendly. 1Ian J. Goodfellow et al. Generative Adversarial Nets. 2014 2Martin Arjovsky et al. Wasserstein Generative Adversarial Networks. 2016 9
  10. Optimal transport Mapping formulation: Monge problem (1781): Monge-Amp´ ere equation

    ; Statical formulation: Kantorovich problem (1940): Linear programming ; Dynamical formulation: Density optimal control (Nelson, Lafferty, Benamou, Brenier, Otto, Villani, etc). In this talk, we will apply density optimal control into data-driven scientific computing methods. 10
  11. Density control+Applications 11

  12. “Least square” in fluid dynamics Optimal transport has an optimal

    control formulation as follows: inf ρt 1 0 gW (∂t ρt , ∂t ρt )dt = 1 0 Ω (∇Φt , ∇Φt )ρt dxdt, under the dynamical constraint, i.e. continuity equation: ∂t ρt + ∇ · (ρt ∇Φt ) = 0, ρ0 = ρ0, ρ1 = ρ1. Here, (P(Ω), gW ) forms an infinite-dimensional Riemannian manifold1. 1John D. Lafferty: the density manifold and configuration space quantization, 1988. 12
  13. Optimal transport gradient Denote gradW the gradient operator on (P,

    gW ). Consider a smooth functional F: P → R, then the optimal transport gradient operator satisfies gradW F(ρ) =gW (ρ)−1 δ δρ F (x) = − ∇ · (ρ(x)∇ δ δρ(x) F(ρ)), where δ δρ(x) is the L2 first variation at variable x ∈ Rd. 13
  14. Optimal transport gradient flows Consider the Kullback–Leibler (KL) divergence: H(ρ)

    =βDKL (ρ e− V β ) = V (x)ρ(x)dx + β ρ(x) log ρ(x)dx + Const. The optimal transport gradient flow of KL divergence satisfies: ∂ρ ∂t = − βgW (ρ)−1 δDKL δρ (x) =∇ · (ρ∇V ) + β∇ · (ρ∇ log ρ)) =∇ · (ρ∇V ) + β∆ρ, where ρ∇ log ρ = ∇ρ. 14
  15. Generative Adversary Networks For each parameter θ ∈ Rd and

    given neural network parameterized mapping function Tθ , consider ρθ = Tθ #p(z). 15
  16. Generative probability models Consider a class of invertible push-forward maps

    {Tθ }θ∈Θ indexed by parameter θ ∈ Θ ⊂ Rm Tθ : Rd → Rd. We obtain a family of parametric distributions PΘ = ρθ = Tθ# p | θ ∈ Θ ⊂ (P, gW ). 16
  17. Neural information matrix Here GW (θ) ∈ Rm×m is an

    matrix2 defined by GW,ij (θ) = ∇Φi (x), ∇Φj (x)ρ(x, θ)dx, where −∇ · (ρθ ∇Φk (x)) = ∇θk ρθ , where k = i, j. In term of generative models, the matrix forms GW,ij (θ) = ∇ψi (Tθ (x)) · ∇ψj (Tθ (x)) p(x) dx, 1 ≤ i, j ≤ m. where ∇ · (ρθ ∇ψk (x)) = ∇ · (ρθ ∂θk Tθ (T−1 θ (x))). 2Li, Zhao, Wasserstein information matrix, 2019. 17
  18. Neural parametric Fokker-Planck equation Consider H = H ◦ T#

    : Θ → R, then H(θ) = H(ρθ ) = [V (Tθ (x)) + β log ρθ (Tθ (x))]p(x)dx. Writing the gradient flow of H(θ) in (Θ, GW (θ)), we derive ˙ θ = −GW (θ)−1∇θ H(θ). We call the above ODE Neural Fokker-Planck equation. This is the neural approximation of FPE ∂ρ ∂t = −gW (ρ)−1 δ δρ H(ρ). 18
  19. Algorithm We endow Tθ with the structure of deep neural

    networks. In particular, we choose the normalizing flow9: Tθ (z) = T1θ ◦ T2θ ◦ · · · ◦ TNθ (z). Difficulties (i) Direct computation of GW (θ) could be expensive; (ii) And GW (θ)−1 may not be efficiently computed under deep learning framework; Here we have to seek a variational proximal scheme to handle them. 9D. Rezende and S. Mohamed. Variational inference with normalizing flows. ICML, 2015 19
  20. Review: Proximal method Consider an Euclidean gradient flow in (Rd,

    I): ˙ x = −∇F(x). The Proximal method refers to xk+1 = arg min x∈Rn F(x) + 1 2h x − xk 2, where h > 0 is the step-size. This is known as the backward Euler method xk+1 = xk − h∇F(xk+1 ). 20
  21. Review: Proximal method on a metric structure Consider a gradient

    flow in (Rd, G): ˙ x = −G(x)−1∇F(x). The proximal method refers to xk+1 = arg min x∈Rn F(x) + 1 2h dG (x, xk )2, where h > 0 is the step-size, and dG (x, y) is the distance function induced by G, i.e. dG (x, y)2 = inf γ 1 0 ˙ γ(t)T G(γ(t))˙ γ(t)dt: γ(0) = x, γ(1) = y . This is known as the variational backward Euler method xk+1 = xk − hG(xk+1 )−1∇F(xk+1 ) + o(h). 21
  22. Primal-dual scheme for FPE In term of optimal transport, the

    proximal method, also named Jordan–Kinderlehrer–Otto (JKO) scheme, forms ρk+1 = arg min ρ F(ρ) + 1 2h dW (ρ, ρk )2 In other words, A suitable linearization on the metric can be considered; see Li, Wang, JCP, (2020). 22
  23. AI Discretization Goal: Propose a variational scheme for solving gradient

    flows in (Θ, GW (θ)): ˙ θt = −GW (θt )−1∇θ H(θt ). Consider a linearized proximal algorithm: θk+1 = argmin θ { θ − θk , GW (θk )(θ − θk ) + 2hH(θ)} . 23
  24. GAN=Neural proximal dual methods Consider the following approximation: θ −

    θk , GW (θk )(θ − θk ) ≈ max ψ M (2∇ψ(Tθk ) · ((Tθ − Tθk ) − |∇ψ(Tθk )|2)p(x)dx Hence the scheme forms θk+1 =argmin θ max ψ M 2∇ψ(Tθk (x)) · (Tθ − Tθk ) − |∇ψ(Tθk (x))|2 p(x)dx + 2hH(θ) . Here we choose ψ as ReLU network and optimize over its parameters. Write the saddle point scheme in terms of finite samples and carry out the optimization step, as in GAN. 24
  25. Algorithm Algorithm 1 Solving parametric FPE ˙ θ = −G(θ)−1∇θ

    F(θ) Initialize θ0 for k = 0, 1, ..., N do θ ← θk ; for j = 1, 2, ..., M do Apply Stochastic gradient descent to solve: inf ψ |Tθ(x) − Tθk (x) − ∇ψ(Tθk (x))|2 p(x)dx θ ← θ − stepsize · ∇θJ(θ, θk). Here J(θ, ˜ θ) is defined as: J(θ, ˜ θ) = Tθ(x) · ∇ψ(T˜ θ (x)) p(x)dx + hF(θ) end for θk+1 ← θ end for 25
  26. Numerical examples We compute FPE with a complicated potential in

    higher dimensions. In R30, β = 1, we set: V (x) = 1 50 30 i=1 x4 i − 16x2 i + 5xi Here is a plot of such function in 2 dimensional space: 26
  27. Numerical examples Let the time stepsize be h = 0.005.

    We project the samples in R30 to a two dimensional subspace, say the plane spanned by the 5th and 15th components. Here are the numerical results: 27
  28. Numerical examples We can verify the graph of ψ trained

    at the end of each outer iteration. Here are the graphs of ψ at 30th, 60th, 150th, 240th, 300th and 360th iteration: Figure: Graph of ψ at k = 30th time step Figure: Graph of ψ at k = 60th time step Figure: Graph of ψ at k = 150th time step 28
  29. Numerical examples We can verify the graph of ψ trained

    at the end of each outer iteration. Here are the graphs of ψ at 30th, 60th, 150th, 240th, 300th and 360th iteration: Figure: Graph of ψ at k = 240th time step Figure: Graph of ψ at k = 300th time step Figure: Graph of ψ at k = 360th time step 29
  30. Neural numerical analysis Denote {ρt } as the solution to

    the original FPE. Suppose {θt } as the solution to parametric FPE with ρθt = Tθt# p. 1 Convergence analysis: we will provide upper bound for DKL (ρθt ρ∗ ). Here ρ∗ = 1 Zβ exp(−V β ) is the Gibbs distribution (equilibrium solution of Fokker-Planck equation). 2 Error analysis: we provide uniform bound for W2 (ρθt , ρt ). 30
  31. Neural numerical analysis Consider δ0 = sup θ∈Θ min ξ∈TθΘ

    |(∇Ψ(Tθ (x))T ξ − ∇ (V + β log ρθ ) ◦ Tθ (x)|2p(x)dx One can treat δ0 as a quantification of the approximation power of push-forward family {Tθ }θ∈Θ . 31
  32. Convergence analysis Theorem Consider Fokker-Planck equation with smooth V and

    ∇2V 0 outside a finite ball. Suppose {θt } solves ˙ θt = −GW (θt )−1∇θ H(θt ). Let ρ∗ (x) = 1 Zβ e−V (x)/β be the Gibbs distribution of original Fokker-Planck equation. Then we have the inequality: KL(ρθt ρ∗ ) ≤ δ0 ˜ λβ β2 (1 − e−β˜ λβ t) + KL(ρθ0 ρ∗ )e−β˜ λβ t. Here ˜ λβ > 0 is a constant number originating from the Log-Sobolev inequality involving ρ∗ , i.e. KL(ρ ρ∗ ) ≤ 1 ˜ λβ I(ρ|ρ∗ ) holds for any probability density ρ. Here I(ρ|ρ∗ ) = ∇ log ρ ρ∗ 2 ρ dx. 32
  33. Future works (i) (Scientific computing) Solve general transport related kinetic

    PDEs, including mean field games, Schr¨ odinger equations, Schr¨ odinger bridge problems, e.t.c; (ii) (Sampling optimization) Design transport information PDEs on sampling problems; (iii) (Estimation) Apply statistical efficiency, estimation of Neural networks via transport information matrix. 33