Neural Fokker-Planck equation

In this talk, we develop and analyze numerical methods for high-dimensional Fokker-Planck equations by leveraging generative models from deep learning. Our starting point is a formulation of the Fokker-Planck equation as a system of ordinary differential equations (ODEs) on a finite-dimensional parameter space, with the parameters inherited from generative models such as normalizing flows. We call such ODEs neural parametric Fokker-Planck equations. The fact that the Fokker-Planck equation can be viewed as the L2-Wasserstein gradient flow of the Kullback-Leibler (KL) divergence allows us to derive the ODEs as the constrained L2-Wasserstein gradient flow of the KL divergence on the set of probability densities generated by neural networks. For numerical computation, we design a variational semi-implicit scheme for the time discretization of the proposed ODEs. Such an algorithm is sampling-based and can readily handle Fokker-Planck equations in higher-dimensional spaces. Moreover, we establish asymptotic convergence bounds for the neural parametric Fokker-Planck equation, as well as error bounds for both its continuous and discrete (forward-Euler time discretization) versions. Several numerical examples are provided to illustrate the performance of the proposed algorithms and analysis.


Wuchen Li

May 06, 2020

Transcript

  1. Transport information scientific computing

    Wuchen Li, University of South Carolina. Aug 27
  2. Learning

  3. Learning applications

  4. Density flows: Fokker-Planck equation

    The Fokker-Planck equation (FPE) is a parabolic evolution equation: $\frac{\partial \rho(t,x)}{\partial t} = \nabla \cdot (\rho(t,x)\nabla V(x)) + \beta \Delta \rho(t,x)$, $\rho(0,x) = \rho_0(x)$, where $x \in \mathbb{R}^d$ and $\beta > 0$ is a given constant coefficient. The FPE has several notable properties, such as (i) conservation of mass; (ii) entropy dissipation.
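
    The FPE above is also the evolution equation for the law of the overdamped Langevin dynamics $dX_t = -\nabla V(X_t)\,dt + \sqrt{2\beta}\,dW_t$, which suggests a particle-based view. Below is a minimal, illustrative Python sketch (not from the talk) that approximates the FPE solution by evolving particles with the Euler-Maruyama scheme; the potential and all parameters are placeholders.

```python
import numpy as np

def euler_maruyama(grad_V, x0, beta=1.0, dt=1e-3, n_steps=1000, seed=0):
    """Evolve particles x0 (shape [N, d]); their empirical law tracks the FPE."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x += -grad_V(x) * dt + np.sqrt(2.0 * beta * dt) * noise
    return x

# Illustrative quadratic potential V(x) = |x|^2 / 2, so grad V(x) = x.
samples = euler_maruyama(lambda x: x, np.random.randn(5000, 2))
```
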
  5. Related works

    In computing FPEs, many numerical methods have been considered. Traditional methods: (i) Discretization: finite volume, finite difference, finite element, etc. (ii) Variational: optimal transport gradient flows, in both Eulerian and Lagrangian formulations (Carrillo, Craig, Wang, Wei, Li, Cances, Chow, et al.). (iii) Tools: primal-dual methods (Li, Osher, et al.). Curse of dimensionality: with traditional numerical methods, solving high-dimensional FPEs is not simple whenever d ≥ 3; this is known as the curse of dimensionality. In this talk, we aim to provide a new scheme based on deep-learning variational methods.
  6. Overcoming the curse of dimensionality

    Many methods have been proposed for the Hamilton-Jacobi equation: Hopf-Lax formula (Darbon, Osher, et al.); path space (Weinan E et al.); Lagrangian methods (Lars, Halder, et al.); inverse problems (L. Yang and G. E. Karniadakis et al.).
  7. Goal

    Main question: can we apply learning-related basis functions (neural networks) to approximate the solution of the FPE in high-dimensional spaces? Answer: 1. Dimension reduction via neural networks: Fokker-Planck equation (PDE) ⇓ neural parametric Fokker-Planck equation (ODE). 2. Approximate a solution of the FPE in data-poor situations, i.e. introduce sampling-friendly variational schemes.
  8. Learning + Density flows

    Natural gradient: (i) Amari, Natural gradient works efficiently in learning; Ny et al., Information geometry; (ii) Adam, K-FAC, etc.; (iii) Li, Transport information geometry I, II. Learning: (i) Li, Guido, Chen, Natural gradient via optimal transport I, II; Ricci curvature of parametric statistics via optimal transport, IG; (ii) Lin, Li, Osher, Guido, Wasserstein proximal of GANs; Natural affine proximal learning, GSI; (iii) Arbel, Gretton, Li, Guido, Kernelized Wasserstein natural gradient, ICLR, 2020; (iv) Lars, Osher, Li, Levon, Samy, A machine learning framework for mean field games, PNAS, 2020; (v) Lin, Samy, Li, Levon, Osher, APAC-Net: neural primal-dual methods for mean field games, 2020; (vi) Liu, Li, Zha, Zhou, Neural parametric Fokker-Planck equations.
  9. Neural network + Density flows

    Computational neural networks [1,2]: deep structure ⇒ dealing with high-dimensional problems; pushforward mechanism ⇒ sampling-friendly. [1] Ian J. Goodfellow et al., Generative Adversarial Nets, 2014. [2] Martin Arjovsky et al., Wasserstein Generative Adversarial Networks, 2016.
  10. Optimal transport

    Mapping formulation: Monge problem (1781): Monge-Ampère equation; static formulation: Kantorovich problem (1940): linear programming; dynamical formulation: density optimal control (Nelson, Lafferty, Benamou, Brenier, Otto, Villani, etc.). In this talk, we apply density optimal control to data-driven scientific computing methods.
  11. Density control+Applications

  12. “Least square” in fluid dynamics

    Optimal transport has an optimal control formulation as follows: $\inf_{\rho_t}\int_0^1 g_W(\partial_t\rho_t, \partial_t\rho_t)\,dt = \int_0^1\int_\Omega (\nabla\Phi_t, \nabla\Phi_t)\,\rho_t\,dx\,dt$, under the dynamical constraint, i.e. the continuity equation: $\partial_t\rho_t + \nabla\cdot(\rho_t\nabla\Phi_t) = 0$, $\rho_0 = \rho^0$, $\rho_1 = \rho^1$. Here, $(\mathcal{P}(\Omega), g_W)$ forms an infinite-dimensional Riemannian manifold [1]. [1] John D. Lafferty, The density manifold and configuration space quantization, 1988.
  13. Optimal transport gradient

    Denote by $\mathrm{grad}_W$ the gradient operator on $(\mathcal{P}, g_W)$. Consider a smooth functional $F\colon \mathcal{P}\to\mathbb{R}$; then the optimal transport gradient operator satisfies $\mathrm{grad}_W F(\rho) = g_W(\rho)^{-1}\frac{\delta}{\delta\rho}F(x) = -\nabla\cdot\big(\rho(x)\nabla\frac{\delta}{\delta\rho(x)}F(\rho)\big)$, where $\frac{\delta}{\delta\rho(x)}$ is the $L^2$ first variation at the variable $x\in\mathbb{R}^d$.
  14. Optimal transport gradient flows

    Consider the Kullback–Leibler (KL) divergence: $H(\rho) = \beta D_{\mathrm{KL}}(\rho\,\|\,e^{-V/\beta}) = \int V(x)\rho(x)\,dx + \beta\int\rho(x)\log\rho(x)\,dx + \mathrm{Const}$. The optimal transport gradient flow of the KL divergence satisfies: $\frac{\partial\rho}{\partial t} = -\beta\, g_W(\rho)^{-1}\frac{\delta D_{\mathrm{KL}}}{\delta\rho}(x) = \nabla\cdot(\rho\nabla V) + \beta\nabla\cdot(\rho\nabla\log\rho) = \nabla\cdot(\rho\nabla V) + \beta\Delta\rho$, where $\rho\nabla\log\rho = \nabla\rho$.
  15. Generative Adversarial Networks

    For each parameter $\theta\in\Theta$ and a given neural-network-parameterized mapping $T_\theta$, consider $\rho_\theta = T_{\theta\#}p(z)$.
  16. Generative probability models

    Consider a class of invertible push-forward maps $\{T_\theta\}_{\theta\in\Theta}$, indexed by a parameter $\theta\in\Theta\subset\mathbb{R}^m$, with $T_\theta\colon\mathbb{R}^d\to\mathbb{R}^d$. We obtain a family of parametric distributions $P_\Theta = \{\rho_\theta = T_{\theta\#}p \mid \theta\in\Theta\}\subset(\mathcal{P}, g_W)$.
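
    The pushforward mechanism is what makes these models sampling-friendly: drawing $z\sim p$ and evaluating $x = T_\theta(z)$ yields samples from $\rho_\theta$ without ever discretizing space. A minimal Python sketch follows, using a toy invertible affine map in place of the neural network $T_\theta$; the map and its parameters are purely illustrative.

```python
import numpy as np

def sample_pushforward(theta, n, d, seed=0):
    """Draw samples from rho_theta = T_theta # p for a toy affine map T(z) = a*z + b."""
    a, b = theta                     # illustrative scalar parameters, a != 0
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))  # reference samples z ~ p = N(0, I)
    return a * z + b                 # x ~ rho_theta

def log_density(theta, x):
    """Change-of-variables density: log p(T^{-1}(x)) - log |det dT/dz|."""
    a, b = theta
    d = x.shape[1]
    z = (x - b) / a
    log_p = -0.5 * np.sum(z**2, axis=1) - 0.5 * d * np.log(2 * np.pi)
    return log_p - d * np.log(abs(a))

x = sample_pushforward((2.0, 1.0), n=1000, d=3)
```
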
  17. Neural information matrix

    Here $G_W(\theta)\in\mathbb{R}^{m\times m}$ is a matrix [2] defined by $G_{W,ij}(\theta) = \int\langle\nabla\Phi_i(x), \nabla\Phi_j(x)\rangle\,\rho(x,\theta)\,dx$, where $-\nabla\cdot(\rho_\theta\nabla\Phi_k(x)) = \nabla_{\theta_k}\rho_\theta$ for $k = i, j$. In terms of generative models, the matrix takes the form $G_{W,ij}(\theta) = \int\nabla\psi_i(T_\theta(x))\cdot\nabla\psi_j(T_\theta(x))\,p(x)\,dx$, $1\le i, j\le m$, where $\nabla\cdot(\rho_\theta\nabla\psi_k(x)) = \nabla\cdot(\rho_\theta\,\partial_{\theta_k}T_\theta(T_\theta^{-1}(x)))$. [2] Li, Zhao, Wasserstein information matrix, 2019.
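
    The second formula is sampling-friendly: once the potentials $\psi_k$ are available, each entry of $G_W(\theta)$ is an expectation over the reference measure $p$. The Python sketch below is a Monte Carlo estimator under the (strong) assumption that callables returning $\nabla\psi_k$ are already given; solving for the $\psi_k$ is the hard part and is handled variationally later in the talk. All names here are hypothetical.

```python
import numpy as np

def estimate_GW(grad_psi, T_theta, ref_samples):
    """Monte Carlo estimate of G_W(theta).

    grad_psi: list of m callables, each mapping an [N, d] array of points
              y = T_theta(x) to the [N, d] array grad psi_k(y).
    T_theta:  callable pushforward map, [N, d] -> [N, d].
    ref_samples: [N, d] samples from the reference measure p.
    """
    y = T_theta(ref_samples)
    m = len(grad_psi)
    grads = [gp(y) for gp in grad_psi]           # each of shape [N, d]
    G = np.zeros((m, m))
    for i in range(m):
        for j in range(i, m):
            G[i, j] = np.mean(np.sum(grads[i] * grads[j], axis=1))
            G[j, i] = G[i, j]
    return G
```
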
  18. Neural parametric Fokker-Planck equation

    Consider $H = H\circ T_\#\colon\Theta\to\mathbb{R}$; then $H(\theta) = H(\rho_\theta) = \int[V(T_\theta(x)) + \beta\log\rho_\theta(T_\theta(x))]\,p(x)\,dx$. Writing the gradient flow of $H(\theta)$ in $(\Theta, G_W(\theta))$, we derive $\dot\theta = -G_W(\theta)^{-1}\nabla_\theta H(\theta)$. We call the above ODE the neural parametric Fokker-Planck equation. It is the neural approximation of the FPE $\frac{\partial\rho}{\partial t} = -g_W(\rho)^{-1}\frac{\delta}{\delta\rho}H(\rho)$.
  19. Algorithm

    We endow $T_\theta$ with the structure of a deep neural network. In particular, we choose the normalizing flow [9]: $T_\theta(z) = T_{1,\theta}\circ T_{2,\theta}\circ\cdots\circ T_{N,\theta}(z)$. Difficulties: (i) direct computation of $G_W(\theta)$ could be expensive; (ii) $G_W(\theta)^{-1}$ may not be efficiently computable within a deep learning framework. We therefore seek a variational proximal scheme to handle these difficulties. [9] D. Rezende and S. Mohamed, Variational inference with normalizing flows, ICML, 2015.
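
    For concreteness, here is a minimal Python sketch of such a composed pushforward map, using planar layers $f(z) = z + u\,\tanh(w^\top z + b)$ in the spirit of Rezende and Mohamed (2015); the talk's actual architecture and parameterization may differ, and the initialization below is illustrative.

```python
import numpy as np

class PlanarLayer:
    """One planar-flow layer f(z) = z + u * tanh(w.z + b)."""
    def __init__(self, d, seed):
        rng = np.random.default_rng(seed)
        self.u = 0.1 * rng.standard_normal(d)
        self.w = 0.1 * rng.standard_normal(d)
        self.b = 0.0

    def __call__(self, z):                      # z: [N, d]
        return z + np.tanh(z @ self.w + self.b)[:, None] * self.u

class NormalizingFlow:
    """Composition of planar layers, playing the role of T_theta."""
    def __init__(self, d, n_layers):
        self.layers = [PlanarLayer(d, seed=k) for k in range(n_layers)]

    def __call__(self, z):
        for layer in self.layers:               # apply the layers in sequence
            z = layer(z)
        return z

flow = NormalizingFlow(d=2, n_layers=8)
samples = flow(np.random.randn(1000, 2))        # samples from rho_theta
```
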
  20. Review: Proximal method

    Consider a Euclidean gradient flow in $(\mathbb{R}^d, I)$: $\dot x = -\nabla F(x)$. The proximal method refers to $x_{k+1} = \arg\min_{x} F(x) + \frac{1}{2h}\|x - x_k\|^2$, where $h > 0$ is the step size. This is known as the backward Euler method $x_{k+1} = x_k - h\nabla F(x_{k+1})$.
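
    As a quick illustration, the following Python sketch carries out one proximal (backward-Euler) step by minimizing the regularized objective with plain gradient descent; the quadratic test function, step size, and inner-loop settings are illustrative only.

```python
import numpy as np

def proximal_step(grad_F, x_k, h=0.1, inner_steps=200, lr=0.05):
    """One step x_{k+1} = argmin_x F(x) + ||x - x_k||^2 / (2h), via gradient descent."""
    x = x_k.copy()
    for _ in range(inner_steps):
        g = grad_F(x) + (x - x_k) / h   # gradient of the proximal objective
        x -= lr * g
    return x

# Example: F(x) = 0.5*||x||^2, so grad F(x) = x; the exact answer is x_k / (1 + h).
x_next = proximal_step(lambda x: x, np.array([1.0, -2.0]), h=0.1)
```
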
  21. Review: Proximal method on a metric structure

    Consider a gradient flow in $(\mathbb{R}^d, G)$: $\dot x = -G(x)^{-1}\nabla F(x)$. The proximal method refers to $x_{k+1} = \arg\min_{x} F(x) + \frac{1}{2h} d_G(x, x_k)^2$, where $h > 0$ is the step size and $d_G(x, y)$ is the distance function induced by $G$, i.e. $d_G(x, y)^2 = \inf_\gamma\big\{\int_0^1\dot\gamma(t)^T G(\gamma(t))\dot\gamma(t)\,dt : \gamma(0) = x,\ \gamma(1) = y\big\}$. This is known as the variational backward Euler method $x_{k+1} = x_k - hG(x_{k+1})^{-1}\nabla F(x_{k+1}) + o(h)$.
  22. Primal-dual scheme for FPE

    In terms of optimal transport, the proximal method, also named the Jordan–Kinderlehrer–Otto (JKO) scheme, takes the form $\rho_{k+1} = \arg\min_\rho F(\rho) + \frac{1}{2h}d_W(\rho, \rho_k)^2$. Moreover, a suitable linearization of the metric can be considered; see Li, Wang, JCP (2020).
  23. AI Discretization

    Goal: propose a variational scheme for solving gradient flows in $(\Theta, G_W(\theta))$: $\dot\theta_t = -G_W(\theta_t)^{-1}\nabla_\theta H(\theta_t)$. Consider a linearized proximal algorithm: $\theta_{k+1} = \arg\min_\theta\{\langle\theta - \theta_k, G_W(\theta_k)(\theta - \theta_k)\rangle + 2hH(\theta)\}$.
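
    To see what this update does, the Python sketch below performs one linearized proximal step with an explicitly given toy metric $G$ and energy $H$, solving the inner minimization by gradient descent; for small $h$ it approximates the natural-gradient step $\theta_k - hG^{-1}\nabla_\theta H(\theta_k)$. All quantities are illustrative stand-ins, not the neural setting of the talk.

```python
import numpy as np

def linearized_proximal_step(grad_H, G_k, theta_k, h=0.05, inner_steps=300, lr=0.01):
    """One step of theta_{k+1} = argmin <theta-theta_k, G_k (theta-theta_k)> + 2h H(theta)."""
    theta = theta_k.copy()
    for _ in range(inner_steps):
        g = 2.0 * G_k @ (theta - theta_k) + 2.0 * h * grad_H(theta)
        theta -= lr * g
    return theta

# Toy check: H(theta) = 0.5*||theta||^2 and G = diag(1, 4); the exact minimizer
# (G + hI)^{-1} G theta_k is close to the natural-gradient step theta_k - h G^{-1} theta_k.
G = np.diag([1.0, 4.0])
theta1 = linearized_proximal_step(lambda t: t, G, np.array([1.0, 1.0]))
```
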
  24. GAN = Neural proximal dual methods

    Consider the following approximation: $\langle\theta - \theta_k, G_W(\theta_k)(\theta - \theta_k)\rangle \approx \max_\psi\int_M\big(2\nabla\psi(T_{\theta_k}(x))\cdot(T_\theta(x) - T_{\theta_k}(x)) - |\nabla\psi(T_{\theta_k}(x))|^2\big)\,p(x)\,dx$. Hence the scheme takes the form $\theta_{k+1} = \arg\min_\theta\max_\psi\int_M\big(2\nabla\psi(T_{\theta_k}(x))\cdot(T_\theta(x) - T_{\theta_k}(x)) - |\nabla\psi(T_{\theta_k}(x))|^2\big)\,p(x)\,dx + 2hH(\theta)$. Here we choose $\psi$ as a ReLU network and optimize over its parameters. We write the saddle-point scheme in terms of finite samples and carry out the optimization, as in a GAN.
  25. Algorithm

    Algorithm 1: Solving the parametric FPE $\dot\theta = -G(\theta)^{-1}\nabla_\theta F(\theta)$
    Initialize $\theta_0$
    for k = 0, 1, ..., N do
      $\theta\leftarrow\theta_k$
      for j = 1, 2, ..., M do
        Apply stochastic gradient descent to solve $\inf_\psi\int|T_\theta(x) - T_{\theta_k}(x) - \nabla\psi(T_{\theta_k}(x))|^2\,p(x)\,dx$
        $\theta\leftarrow\theta - \mathrm{stepsize}\cdot\nabla_\theta J(\theta, \theta_k)$, where $J(\theta, \tilde\theta) = \int T_\theta(x)\cdot\nabla\psi(T_{\tilde\theta}(x))\,p(x)\,dx + hF(\theta)$
      end for
      $\theta_{k+1}\leftarrow\theta$
    end for
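
    The following is a schematic PyTorch sketch of one outer step of Algorithm 1, not the authors' code: it assumes `T` (the current flow), `T_k` (a frozen copy at $\theta_k$), and `psi` are `torch.nn.Module`s, that `F_theta` returns a Monte Carlo estimate of $F(\theta)$ from pushed-forward samples, and that `p_sampler(n)` draws reference samples. Network sizes, sample sizes, and learning rates are illustrative.

```python
import torch

def grad_psi(psi, y):
    """Gradient of the scalar potential psi with respect to its input y."""
    y = y.requires_grad_(True)
    return torch.autograd.grad(psi(y).sum(), y, create_graph=True)[0]

def outer_step(T, T_k, psi, F_theta, p_sampler, h=0.005,
               inner_iters=10, lr_psi=1e-3, lr_theta=1e-3, batch=512):
    """One outer step theta_k -> theta_{k+1} of the sampled proximal scheme."""
    opt_psi = torch.optim.Adam(psi.parameters(), lr=lr_psi)
    opt_theta = torch.optim.Adam(T.parameters(), lr=lr_theta)
    for _ in range(inner_iters):
        x = p_sampler(batch)
        y_k = T_k(x).detach()
        # Inner problem: fit grad(psi) to the displacement T_theta - T_theta_k.
        loss_psi = ((T(x).detach() - y_k - grad_psi(psi, y_k)) ** 2).sum(dim=1).mean()
        opt_psi.zero_grad(); loss_psi.backward(); opt_psi.step()
        # Outer objective J(theta, theta_k), with psi frozen.
        J = (T(x) * grad_psi(psi, y_k).detach()).sum(dim=1).mean() + h * F_theta(T(x))
        opt_theta.zero_grad(); J.backward(); opt_theta.step()
    return T
```
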
  26. Numerical examples

    We compute the FPE with a complicated potential in higher dimensions. In $\mathbb{R}^{30}$, with $\beta = 1$, we set $V(x) = \frac{1}{50}\sum_{i=1}^{30}(x_i^4 - 16x_i^2 + 5x_i)$. Here is a plot of such a function in two-dimensional space:
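
    For reproducibility, a direct Python transcription of this potential and its gradient (used, e.g., in the energy $H(\theta)$ and in particle-based baselines) is given below; the vectorized shape conventions are an implementation choice.

```python
import numpy as np

def V(x):
    """V(x) = (1/50) * sum_i (x_i^4 - 16 x_i^2 + 5 x_i); x has shape [N, 30]."""
    return np.sum(x**4 - 16.0 * x**2 + 5.0 * x, axis=-1) / 50.0

def grad_V(x):
    """Componentwise gradient of V."""
    return (4.0 * x**3 - 32.0 * x + 5.0) / 50.0
```
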
  27. Numerical examples

    Let the time step size be h = 0.005. We project the samples in $\mathbb{R}^{30}$ onto a two-dimensional subspace, say the plane spanned by the 5th and 15th components. Here are the numerical results:
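
    A minimal Python sketch of this visualization step, assuming `samples` is an [N, 30] array of pushed-forward particles (the names and plotting choices are illustrative):

```python
import matplotlib.pyplot as plt

def plot_projection(samples, i=4, j=14):
    """Scatter the projection onto the plane of the 5th and 15th coordinates."""
    plt.scatter(samples[:, i], samples[:, j], s=2, alpha=0.4)
    plt.xlabel(f"x_{i + 1}")
    plt.ylabel(f"x_{j + 1}")
    plt.show()
```
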
  28. Numerical examples

    We can examine the graph of ψ trained at the end of each outer iteration. Here are the graphs of ψ at the 30th, 60th, 150th, 240th, 300th, and 360th iterations. Figures: graph of ψ at the k = 30th, 60th, and 150th time steps.
  29. Numerical examples

    We can examine the graph of ψ trained at the end of each outer iteration. Figures: graph of ψ at the k = 240th, 300th, and 360th time steps.
  30. Neural numerical analysis

    Denote by $\{\rho_t\}$ the solution to the original FPE, and let $\{\theta_t\}$ be the solution to the parametric FPE with $\rho_{\theta_t} = T_{\theta_t\#}p$. (1) Convergence analysis: we provide an upper bound for $D_{\mathrm{KL}}(\rho_{\theta_t}\,\|\,\rho^*)$, where $\rho^* = \frac{1}{Z_\beta}\exp(-\frac{V}{\beta})$ is the Gibbs distribution (equilibrium solution of the Fokker-Planck equation). (2) Error analysis: we provide a uniform bound for $W_2(\rho_{\theta_t}, \rho_t)$.
  31. Neural numerical analysis

    Consider $\delta_0 = \sup_{\theta\in\Theta}\min_{\xi\in T_\theta\Theta}\int\big|\nabla\Psi(T_\theta(x))^T\xi - \nabla(V + \beta\log\rho_\theta)\circ T_\theta(x)\big|^2\,p(x)\,dx$. One can treat $\delta_0$ as a quantification of the approximation power of the push-forward family $\{T_\theta\}_{\theta\in\Theta}$.
  32. Convergence analysis

    Theorem. Consider the Fokker-Planck equation with smooth $V$ and $\nabla^2 V\succeq 0$ outside a finite ball. Suppose $\{\theta_t\}$ solves $\dot\theta_t = -G_W(\theta_t)^{-1}\nabla_\theta H(\theta_t)$. Let $\rho^*(x) = \frac{1}{Z_\beta}e^{-V(x)/\beta}$ be the Gibbs distribution of the original Fokker-Planck equation. Then we have the inequality: $\mathrm{KL}(\rho_{\theta_t}\,\|\,\rho^*) \le \frac{\delta_0}{\tilde\lambda_\beta\beta^2}\big(1 - e^{-\beta\tilde\lambda_\beta t}\big) + \mathrm{KL}(\rho_{\theta_0}\,\|\,\rho^*)\,e^{-\beta\tilde\lambda_\beta t}$. Here $\tilde\lambda_\beta > 0$ is a constant originating from the log-Sobolev inequality for $\rho^*$, i.e. $\mathrm{KL}(\rho\,\|\,\rho^*)\le\frac{1}{\tilde\lambda_\beta}I(\rho|\rho^*)$ holds for any probability density $\rho$, where $I(\rho|\rho^*) = \int\big|\nabla\log\frac{\rho}{\rho^*}\big|^2\rho\,dx$.
  33. Future work

    (i) (Scientific computing) Solve general transport-related kinetic PDEs, including mean field games, Schrödinger equations, Schrödinger bridge problems, etc.; (ii) (Sampling optimization) Design transport information PDEs for sampling problems; (iii) (Estimation) Study statistical efficiency and estimation of neural networks via the transport information matrix.