Neural Fokker-Planck equation

In this talk, we develop and analyze numerical methods for high dimensional Fokker-Planck equations by leveraging generative models from deep learning. Our starting point is a formulation of the Fokker-Planck equation as a system of ordinary differential equations (ODEs) on a finite-dimensional parameter space, with the parameters inherited from generative models such as normalizing flows. We call such ODEs neural parametric Fokker-Planck equations. The fact that the Fokker-Planck equation can be viewed as the L2-Wasserstein gradient flow of the Kullback-Leibler (KL) divergence allows us to derive the ODEs as the constrained L2-Wasserstein gradient flow of the KL divergence on the set of probability densities generated by neural networks. For numerical computation, we design a variational semi-implicit scheme for the time discretization of the proposed ODE. The algorithm is sampling-based and can readily handle Fokker-Planck equations in higher dimensional spaces. Moreover, we establish asymptotic convergence and error bounds for the neural parametric Fokker-Planck equation, for both the continuous and the discrete (forward-Euler time discretization) versions. Several numerical examples are provided to illustrate the performance of the proposed algorithms and analysis.

Wuchen Li

May 06, 2020

Transcript

  1. Transport information geometric computation: Neural Fokker-Planck equations Wuchen Li UCLA/University

    of South Carolina Shanghai Jiao Tong University, Data Science Seminar. This talk is based on joint work with Shu Liu (GT), Hongyuan Zha (GT) and Haomin Zhou (GT).
  2. Information and Learning

  3. Learning applications

  4. Learning Given a data measure $\rho_{\mathrm{data}}(x) = \frac{1}{N}\sum_{i=1}^{N}\delta_{X_i}(x)$ and a

    parameterized model $\rho(x,\theta)$, most problems in AI or sampling reduce to $\min_{\rho_\theta\in\rho(\Theta)} D(\rho_{\mathrm{data}},\rho_\theta)$. One typical choice of $D$ is the Kullback–Leibler divergence (relative entropy) $D(\rho_{\mathrm{data}},\rho_\theta) = \int_\Omega \rho_{\mathrm{data}}(x)\log\frac{\rho_{\mathrm{data}}(x)}{\rho(x,\theta)}\,dx$. One can design a gradient descent method to solve the parameter optimization problem.
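As a minimal illustration of this gradient descent (a toy example, not from the slides): for a one-dimensional Gaussian model ρ(x, θ) = N(θ, 1), minimizing D(ρ_data, ρ_θ) over θ is equivalent to maximum likelihood, and the gradient has a closed form.

```python
import numpy as np

# Gradient descent on theta -> D(rho_data, rho_theta) for a toy Gaussian model
# rho(x, theta) = N(theta, 1). Up to a theta-independent constant, the KL
# divergence is the negative log-likelihood, whose gradient is theta - mean(X_i).

rng = np.random.default_rng(0)
X = rng.normal(1.7, 1.0, size=500)      # data samples X_i defining rho_data

theta, lr = 0.0, 0.1
for _ in range(200):
    grad = theta - X.mean()             # d/dtheta of E_data[-log rho(x, theta)]
    theta -= lr * grad

print(theta, X.mean())                  # gradient descent recovers the sample mean (the MLE)
```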
  5. Fokker-Planck equation (i) The Fokker-Planck equation (FPE) is a parabolic

    evolution equation: $\frac{\partial\rho(t,x)}{\partial t} = \nabla\cdot(\rho(t,x)\nabla V(x)) + \beta\Delta\rho(t,x)$, $\rho(0,x) = \rho_0(x)$, where $x\in\mathbb{R}^d$ and $\beta > 0$ is a given constant diffusion coefficient. (ii) The FPE describes the evolution of the probability density of a stochastic differential equation known as the Langevin dynamics: denoting $X_t\sim\rho_t$, then $\dot X_t = -\nabla V(X_t) + \sqrt{2\beta}\,\dot B_t$, $X_0\sim\rho_0$, where $B_t$ is the standard Brownian motion in $\mathbb{R}^d$. (iii) The stationary solution of the FPE is the Gibbs distribution $\rho^*(x) = \frac{1}{Z}e^{-V(x)/\beta}$, where $Z = \int e^{-V(y)/\beta}\,dy$ is a normalization constant.
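The Langevin dynamics above suggests a direct sampling view of the FPE. Below is a minimal Euler–Maruyama sketch (the double-well potential is an illustrative choice, not one from the slides); the empirical distribution of the particles approximates ρ_t and converges toward the Gibbs density.

```python
import numpy as np

# Euler-Maruyama discretization of the Langevin SDE
#   dX_t = -grad V(X_t) dt + sqrt(2 * beta) dB_t,
# whose density rho_t solves the Fokker-Planck equation above.

rng = np.random.default_rng(1)
beta, dt, n_steps = 0.5, 1e-3, 20_000

def grad_V(x):                          # V(x) = (|x|^2 - 1)^2 / 4, a double well in R^2
    return (np.sum(x**2, axis=-1, keepdims=True) - 1.0) * x

X = rng.normal(0.0, 0.1, size=(5000, 2))            # particles drawn from rho_0
for _ in range(n_steps):
    X += -grad_V(X) * dt + np.sqrt(2 * beta * dt) * rng.standard_normal(X.shape)

# The empirical measure of X now approximates the Gibbs density ~ exp(-V / beta):
print(np.mean(np.linalg.norm(X, axis=1)))           # mass concentrates near the ring |x| = 1
```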
  6. Related works In computing FPEs, many numerical methods have been

    considered. Traditional methods: (i) Discretization: finite volume, finite difference, finite element, etc. (ii) Variational: optimal transport gradient flows, in both Eulerian and Lagrangian formulations (Carrillo, Craig, Wang, Wei, Li, Cances, et al.). (iii) Tools: primal-dual methods (Li, Osher, et al.). Curse of dimensionality: in traditional numerical methods, solving high dimensional FPEs is not simple whenever d ≥ 3; this is known as the curse of dimensionality. In this talk, we aim to provide a new scheme based on deep-learning variational methods.
  7. Overcome the curse of dimensionality Many methods have been proposed for

    the Hamilton-Jacobi equation: Hopf-Lax formula: Darbon, Osher; Path space: Weinan E et al.; Lagrangian methods: Lars, Halder, et al.; Inverse problems: L. Yang and G. E. Karniadakis et al.
  8. Goal Main question: (a) Can we apply learning-related basis

    functions (neural networks) to approximate the solution of the FPE in high dimensional spaces? (b) Is there a systematic way to connect classical numerical methods for PDEs with neural networks? Answer: 1. Dimension reduction via neural networks: Fokker-Planck equation (infinite dimensional ODE) ⇓ parametric Fokker-Planck equation (finite dimensional ODE). 2. Approximate the solution of the FPE in data-poor situations, i.e., introduce sampling-friendly variational schemes.
  9. Transport information geometric computation

  10. Learning and Gradient flows Natural gradient: (i) Amari, Natural gradient

    works efficiently in learning; Ny et al., Information geometry; (ii) Adam, K-FAC, etc.; (iii) Li, Transport information geometry I, II. Learning: (i) Li, Guido, Chen, Natural gradient via optimal transport I, II; Ricci curvature of parametric statistics via optimal transport, IG; (ii) Lin, Li, Osher, Guido, Wasserstein proximal of GANs; Natural affine proximal learning, GSI; (iii) Arbel, Gretton, Li, Guido, Kernelized Wasserstein natural gradient, ICLR, 2020; (iv) Lars, Osher, Li, Levon, Samy, A machine learning framework for mean field games, PNAS, 2020; (v) Li, Liu, Zha, Zhou, Parametric Fokker-Planck equations, GSI; (vi) Liu, Li, Zha, Zhou, Neural parametric Fokker-Planck equations.
  11. Neural network + FPE Neural networks in generative models¹,²: Deep structure ⇒

    dealing with high dimensional problems; Pushforward mechanism ⇒ sampling-friendly. Concretely: approximate ρt by a pushforward map; construct an ODE for the parameters and numerically solve the ODE. ¹Ian J. Goodfellow et al., Generative Adversarial Nets, 2014. ²Martin Arjovsky et al., Wasserstein Generative Adversarial Networks, 2016.
  12. Route map

  13. Optimal transport Mapping formulation: Monge problem (1781): Monge-Ampère equation;

    Static formulation: Kantorovich problem (1940): linear programming; Dynamical formulation: density optimal control (Nelson, Lafferty, Benamou, Brenier, Otto, Villani, etc.). In this talk, we will apply density optimal control to data-driven scientific computing methods.
  14. Variational structure Optimal transport has an optimal control formulation as

    follows: $\inf_{\rho_t}\int_0^1 g_W(\partial_t\rho_t,\partial_t\rho_t)\,dt = \int_0^1\int_\Omega (\nabla\Phi_t,\nabla\Phi_t)\,\rho_t\,dx\,dt$, under the dynamical constraint, i.e. continuity equation: $\partial_t\rho_t + \nabla\cdot(\rho_t\nabla\Phi_t) = 0$, $\rho_0 = \rho^0$, $\rho_1 = \rho^1$. Here, $(\mathcal{P}(\Omega), g_W)$ forms an infinite-dimensional Riemannian manifold¹. ¹John D. Lafferty: the density manifold and configuration space quantization, 1988.
  15. Optimal transport gradient Denote $\mathrm{grad}_W$ the gradient operator on $(\mathcal{P}, g_W)$.

    Consider a smooth functional $F\colon\mathcal{P}\to\mathbb{R}$; then the optimal transport gradient operator satisfies $\mathrm{grad}_W F(\rho) = g_W(\rho)^{-1}\frac{\delta}{\delta\rho}F(x) = -\nabla\cdot\big(\rho(x)\nabla\frac{\delta}{\delta\rho(x)}F(\rho)\big)$, where $\frac{\delta}{\delta\rho(x)}$ is the $L^2$ first variation at the variable $x\in\mathbb{R}^d$.
  16. Optimal transport gradient flows Consider the Kullback–Leibler (KL) divergence:

    $\mathcal{H}(\rho) = \beta D_{\mathrm{KL}}(\rho\,\|\,e^{-V/\beta}) = \int V(x)\rho(x)\,dx + \beta\int\rho(x)\log\rho(x)\,dx + \mathrm{Const}$. The optimal transport gradient flow of the KL divergence satisfies $\frac{\partial\rho}{\partial t} = -\beta g_W(\rho)^{-1}\frac{\delta D_{\mathrm{KL}}}{\delta\rho}(x) = \nabla\cdot(\rho\nabla V) + \beta\nabla\cdot(\rho\nabla\log\rho) = \nabla\cdot(\rho\nabla V) + \beta\Delta\rho$, where $\rho\nabla\log\rho = \nabla\rho$. This corresponds to a variational structure of the FPE. In fact, many other PDEs satisfy this variational formulation, such as aggregation equations and the quantum drift-diffusion equation arising in kinetic theory.
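For completeness, the first-variation step behind this identity can be spelled out (a standard computation, consistent with the formulas on the previous slides):

```latex
\frac{\delta \mathcal{H}}{\delta \rho}(x) = V(x) + \beta\bigl(\log\rho(x) + 1\bigr),
\qquad
\frac{\partial \rho}{\partial t}
  = -\operatorname{grad}_W \mathcal{H}(\rho)
  = \nabla\cdot\Bigl(\rho\,\nabla\bigl(V + \beta\log\rho + \beta\bigr)\Bigr)
  = \nabla\cdot(\rho\nabla V) + \beta\Delta\rho,
```

since the constant β has vanishing gradient and ρ∇log ρ = ∇ρ.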
  17. Generative Adversarial Networks For each parameter $\theta\in\mathbb{R}^d$ and a

    given neural-network-parameterized mapping function $T_\theta$, consider $\rho_\theta = T_{\theta\#}\,p(z)$.
  18. Generative probability models Consider a class of invertible push-forward maps

    $\{T_\theta\}_{\theta\in\Theta}$ indexed by parameter $\theta\in\Theta\subset\mathbb{R}^m$, $T_\theta\colon\mathbb{R}^d\to\mathbb{R}^d$. We obtain a family of parametric distributions $\mathcal{P}_\Theta = \{\rho_\theta = T_{\theta\#}\,p \mid \theta\in\Theta\} \subset (\mathcal{P}, g_W)$.
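A minimal numerical illustration of the push-forward construction, with a toy affine map standing in for the neural network T_θ (an illustrative choice; a normalizing flow plays this role in the actual method):

```python
import numpy as np

# Sampling from a push-forward density rho_theta = T_theta # p: draw z from the
# reference density p and apply T_theta. Here theta = (A, b) is a toy affine map.

rng = np.random.default_rng(2)
A = np.array([[1.0, 0.5],
              [0.0, 0.8]])              # assumed invertible
b = np.array([2.0, -1.0])
T = lambda z: z @ A.T + b

z = rng.standard_normal((10_000, 2))    # z ~ p = N(0, I)
x = T(z)                                # x ~ rho_theta = T_theta # p
print(x.mean(axis=0))                   # approximately b
print(np.cov(x.T))                      # approximately A @ A.T
```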
  19. Neural transport information matrix Here $G_W(\theta)\in\mathbb{R}^{m\times m}$ is

    a matrix² defined by $G_{W,ij}(\theta) = \int \langle\nabla\Phi_i(x),\nabla\Phi_j(x)\rangle\,\rho(x,\theta)\,dx$, where $-\nabla\cdot(\rho_\theta\nabla\Phi_k(x)) = \nabla_{\theta_k}\rho_\theta$ for $k = i, j$. In terms of generative models, the matrix takes the form $G_{W,ij}(\theta) = \int \nabla\psi_i(T_\theta(x))\cdot\nabla\psi_j(T_\theta(x))\,p(x)\,dx$, $1\le i,j\le m$, where $\nabla\cdot(\rho_\theta\nabla\psi_k(x)) = \nabla\cdot\big(\rho_\theta\,\partial_{\theta_k}T_\theta(T_\theta^{-1}(x))\big)$. ²Li, Zhao, Wasserstein information matrix, 2019.
  20. Neural parametric Fokker-Planck Equation Consider $H = \mathcal{H}\circ T_\#\colon\Theta\to\mathbb{R}$;

    then $H(\theta) = \mathcal{H}(\rho_\theta) = \int\big[V(T_\theta(x)) + \beta\log\rho_\theta(T_\theta(x))\big]\,p(x)\,dx$. Writing the gradient flow of $H(\theta)$ in $(\Theta, G_W(\theta))$, we derive $\dot\theta = -G_W(\theta)^{-1}\nabla_\theta H(\theta)$. We call the above ODE the neural Fokker-Planck equation. This is the neural approximation of the FPE $\frac{\partial\rho}{\partial t} = -g_W(\rho)^{-1}\frac{\delta}{\delta\rho}\mathcal{H}(\rho)$.
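The parametric ODE can already be integrated with an explicit scheme when G_W and ∇H are available. Here is a toy forward-Euler sketch for a one-parameter translation family T_θ(z) = z + θ (an illustrative choice, not the flows used in the talk): for such a location family the Wasserstein information matrix is the identity and the entropy term of H is θ-independent, so ∇_θ H(θ) = E_p[V'(z + θ)].

```python
import numpy as np

# Forward-Euler integration of theta_dot = -G_W(theta)^{-1} grad_theta H(theta)
# for the translation family T_theta(z) = z + theta, reference p = N(0, 1),
# potential V(x) = x^2 / 2 and beta = 1.

rng = np.random.default_rng(3)
z = rng.standard_normal(10_000)                 # samples from the reference density p

def grad_H(theta):
    return np.mean(z + theta)                   # Monte Carlo estimate of E_p[V'(z + theta)]

def G_W(theta):
    return 1.0                                  # Wasserstein metric of a location family

h, theta = 0.05, 3.0
for _ in range(100):
    theta -= h * grad_H(theta) / G_W(theta)     # one forward-Euler step

print(theta)                                    # decays toward 0, the free-energy minimizer
```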
  21. Algorithm We endow $T_\theta$ with the structure of deep neural

    networks. In particular, we choose the normalizing flow⁹: $T_\theta(z) = T^1_\theta\circ T^2_\theta\circ\cdots\circ T^N_\theta(z)$. Difficulties: (i) direct computation of $G_W(\theta)$ could be expensive; (ii) $G_W(\theta)^{-1}$ may not be efficiently computable within a deep learning framework. We therefore seek a variational proximal scheme to handle these difficulties. ⁹D. Rezende and S. Mohamed, Variational inference with normalizing flows, ICML, 2015.
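A minimal sketch of such a composed push-forward map, built from planar-flow layers z ↦ z + u·tanh(w·z + b) in PyTorch (layer and variable names are illustrative; the usual constraint on (u, w) that guarantees invertibility is omitted here):

```python
import torch

class PlanarLayer(torch.nn.Module):
    """One planar-flow layer z -> z + u * tanh(w . z + b)."""
    def __init__(self, dim):
        super().__init__()
        self.u = torch.nn.Parameter(0.1 * torch.randn(dim))
        self.w = torch.nn.Parameter(0.1 * torch.randn(dim))
        self.b = torch.nn.Parameter(torch.zeros(1))

    def forward(self, z):                         # z: (batch, dim)
        return z + self.u * torch.tanh(z @ self.w + self.b).unsqueeze(-1)

class PushForward(torch.nn.Module):
    """Composition T_theta = T^1 ∘ T^2 ∘ ... ∘ T^N of planar layers."""
    def __init__(self, dim, n_layers):
        super().__init__()
        self.layers = torch.nn.ModuleList(PlanarLayer(dim) for _ in range(n_layers))

    def forward(self, z):
        for layer in self.layers:
            z = layer(z)
        return z

# Sampling from rho_theta = T_theta # p: draw z ~ N(0, I) and push it forward.
T = PushForward(dim=2, n_layers=4)
samples = T(torch.randn(1024, 2))
```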
  22. Review: Proximal method Consider an Euclidean gradient flow in $(\mathbb{R}^d, I)$:

    $\dot x = -\nabla F(x)$. The proximal method refers to $x_{k+1} = \arg\min_{x\in\mathbb{R}^d}\, F(x) + \frac{1}{2h}\|x - x_k\|^2$, where $h > 0$ is the step size. This is known as the backward Euler method: $x_{k+1} = x_k - h\nabla F(x_{k+1})$.
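A minimal numerical sketch of this backward-Euler/proximal step, on a hypothetical smooth objective, with the inner minimization handed to a generic optimizer:

```python
import numpy as np
from scipy.optimize import minimize

# Proximal (backward Euler) step: x_{k+1} = argmin_x F(x) + ||x - x_k||^2 / (2h).

def F(x):
    return 0.5 * np.sum(x ** 2) + np.sum(np.cos(x))      # an arbitrary smooth test objective

def proximal_step(x_k, h):
    objective = lambda x: F(x) + np.sum((x - x_k) ** 2) / (2.0 * h)
    return minimize(objective, x_k).x                    # warm-start the inner solver at x_k

x, h = np.array([2.0, -1.5]), 0.1
for _ in range(50):
    x = proximal_step(x, h)                              # each step is an implicit update
print(x)
```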
  23. Review: Proximal method on a metric structure Consider a gradient

    flow in $(\mathbb{R}^d, G)$: $\dot x = -G(x)^{-1}\nabla F(x)$. The proximal method refers to $x_{k+1} = \arg\min_{x\in\mathbb{R}^d}\, F(x) + \frac{1}{2h} d_G(x, x_k)^2$, where $h > 0$ is the step size and $d_G(x,y)$ is the distance function induced by $G$, i.e. $d_G(x,y)^2 = \inf_\gamma\big\{\int_0^1 \dot\gamma(t)^T G(\gamma(t))\,\dot\gamma(t)\,dt : \gamma(0) = x,\ \gamma(1) = y\big\}$. This is known as the variational backward Euler method: $x_{k+1} = x_k - h\,G(x_{k+1})^{-1}\nabla F(x_{k+1}) + o(h)$.
  24. Proximal scheme for FPE In terms of optimal transport, the

    proximal method, also named the Jordan–Kinderlehrer–Otto (JKO) scheme, takes the form $\rho_{k+1} = \arg\min_\rho\, F(\rho) + \frac{1}{2h} d_W(\rho, \rho_k)^2$. A suitable linearization of the metric can also be considered; see Li, Wang, JCP (2020).
  25. Algorithm Goal: Propose a variational scheme for solving gradient flows

    in $(\Theta, G_W(\theta))$: $\dot\theta_t = -G_W(\theta_t)^{-1}\nabla_\theta H(\theta_t)$. Consider a linearized proximal algorithm: $\theta_{k+1} = \arg\min_\theta\,\big\{\langle\theta - \theta_k,\, G_W(\theta_k)(\theta - \theta_k)\rangle + 2h\,H(\theta)\big\}$.
  26. GAN Proximal scheme Consider the following approximation:

    $\langle\theta - \theta_k,\, G_W(\theta_k)(\theta - \theta_k)\rangle \approx \max_\psi \int_M \big(2\nabla\psi(x)\cdot\big((T_\theta - T_{\theta_k})\circ T_{\theta_k}^{-1}(x)\big) - |\nabla\psi(x)|^2\big)\,\rho_{\theta_k}(x)\,dx$. The variational saddle scheme satisfies $\theta_{k+1} = \arg\min_\theta\max_\psi\Big\{\int_M \big(2\nabla\psi(x)\cdot\big((T_\theta - T_{\theta_k})\circ T_{\theta_k}^{-1}(x)\big) - |\nabla\psi(x)|^2\big)\,\rho_{\theta_k}(x)\,dx + 2h\,H(\theta)\Big\}$. Here we choose $\psi$ as a ReLU network and optimize over its parameters. We write the saddle-point scheme in terms of finite samples and carry out the optimization steps, as in GANs.
  27. Algorithm Algorithm 1 Solving parametric FPE $\dot\theta = -G(\theta)^{-1}\nabla_\theta F(\theta)$

    Initialize $\theta_0$
    for $k = 0, 1, \ldots, N$ do
      $\theta \leftarrow \theta_k$
      for $j = 1, 2, \ldots, M$ do
        Apply stochastic gradient descent to solve $\inf_\psi \int |T_\theta(x) - T_{\theta_k}(x) - \nabla\psi(T_{\theta_k}(x))|^2\,p(x)\,dx$
        $\theta \leftarrow \theta - \text{stepsize}\cdot\nabla_\theta J(\theta, \theta_k)$, where $J(\theta, \tilde\theta) = \int T_\theta(x)\cdot\nabla\psi(T_{\tilde\theta}(x))\,p(x)\,dx + h\,F(\theta)$
      end for
      $\theta_{k+1} \leftarrow \theta$
    end for
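Below is a hedged PyTorch sketch of one outer step of this algorithm. It assumes a push-forward module T (for instance the PushForward flow sketched earlier) and a frozen copy T_old of it at step k; the MLP for ψ, the optimizers, and the free-energy estimator are illustrative choices, not the authors' exact implementation, and the β-entropy term of the free energy is omitted for brevity.

```python
import torch

dim, h = 2, 0.05
V = lambda x: 0.5 * (x ** 2).sum(dim=-1)          # toy potential, for illustration only

psi = torch.nn.Sequential(torch.nn.Linear(dim, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 1))

def grad_psi(y):
    # gradient of the scalar field psi at the points y, kept differentiable in psi's parameters
    y = y.detach().requires_grad_(True)
    g, = torch.autograd.grad(psi(y).sum(), y, create_graph=True)
    return g

def free_energy(T, z):
    # Monte Carlo estimate of E_p[V(T(z))]; the full scheme also needs beta * E[log rho_theta]
    return V(T(z)).mean()

def outer_step(T, T_old, z, inner_iters=50):
    opt_psi = torch.optim.Adam(psi.parameters(), lr=1e-3)
    opt_T = torch.optim.Adam(T.parameters(), lr=1e-3)
    y_old = T_old(z).detach()
    for _ in range(inner_iters):
        # inner problem: fit grad(psi) to the displacement T_theta - T_theta_k
        loss_psi = ((T(z).detach() - y_old - grad_psi(y_old)) ** 2).sum(-1).mean()
        opt_psi.zero_grad(); loss_psi.backward(); opt_psi.step()
        # outer objective J(theta, theta_k), followed by one gradient step in theta
        J = (T(z) * grad_psi(y_old).detach()).sum(-1).mean() + h * free_energy(T, z)
        opt_T.zero_grad(); J.backward(); opt_T.step()

# usage sketch: T = PushForward(dim, 4); import copy; T_old = copy.deepcopy(T)
#               outer_step(T, T_old, torch.randn(1024, dim)); then refresh T_old and repeat
```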
  28. Numerical examples We compute the FPE with a complicated potential in

    higher dimensions. In $\mathbb{R}^{30}$, with $\beta = 1$, we set $V(x) = \frac{1}{50}\sum_{i=1}^{30}\big(x_i^4 - 16x_i^2 + 5x_i\big)$. Here is a plot of this function in two-dimensional space:
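The test potential and its gradient in code (the gradient is an added convenience, for use in particle dynamics):

```python
import numpy as np

# V(x) = (1/50) * sum_i (x_i^4 - 16 x_i^2 + 5 x_i) on R^30, with beta = 1.

def V(x):
    return np.sum(x ** 4 - 16.0 * x ** 2 + 5.0 * x, axis=-1) / 50.0

def grad_V(x):
    return (4.0 * x ** 3 - 32.0 * x + 5.0) / 50.0

x = np.random.default_rng(0).standard_normal((5, 30))   # five test points in R^30
print(V(x))
```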
  29. Numerical examples Let the time step size be h = 0.005.

    We project the samples in $\mathbb{R}^{30}$ onto a two-dimensional subspace, say the plane spanned by the 5th and 15th components. Here are the numerical results:
  35. Numerical examples With the same time step size h = 0.005 and the same

    two-dimensional projection, here is the final result. Figure: Density estimation of the sample points at the 360th iteration
  36. Numerical examples We can examine the graph of ψ trained

    at the end of each outer iteration. Here are the graphs of ψ at the 30th, 60th, 150th, 240th, 300th and 360th iterations: Figure: Graph of ψ at the k = 30th time step Figure: Graph of ψ at the k = 60th time step Figure: Graph of ψ at the k = 150th time step
  37. Numerical examples (continued) Here are the graphs of ψ at the later

    iterations: Figure: Graph of ψ at the k = 240th time step Figure: Graph of ψ at the k = 300th time step Figure: Graph of ψ at the k = 360th time step
  38. Neural numerical analysis Denote by $\{\rho_t\}$ the solution to

    the original FPE and by $\{\theta_t\}$ the solution to the parametric FPE, with $\rho_{\theta_t} = T_{\theta_t\#}\,p$. 1. Convergence analysis: we provide an upper bound for $D_{\mathrm{KL}}(\rho_{\theta_t}\,\|\,\rho^*)$. Here $\rho^* = \frac{1}{Z_\beta}\exp(-\frac{V}{\beta})$ is the Gibbs distribution (the equilibrium solution of the Fokker-Planck equation). 2. Error analysis: we provide a uniform bound for $W_2(\rho_{\theta_t}, \rho_t)$.
  39. Neural numerical analysis Before we introduce our results on numerical

    analysis, we first introduce an important quantity that plays an essential role in our results. Recalling $\Psi$ as defined in (19) of Theorem 1, we set $\delta_0 = \sup_{\theta\in\Theta}\min_{\xi\in T_\theta\Theta}\int\big|\nabla\Psi(T_\theta(x))^T\xi - \nabla(V + \beta\log\rho_\theta)\circ T_\theta(x)\big|^2\,dp(x)$. One can treat $\delta_0$ as a quantification of the approximation power of the push-forward family $\{T_\theta\}_{\theta\in\Theta}$.
  40. Convergence analysis Theorem: Consider the Fokker-Planck equation with smooth $V$ and

    $\nabla^2 V \succ 0$ outside a finite ball. Suppose $\{\theta_t\}$ solves $\dot\theta_t = -G_W(\theta_t)^{-1}\nabla_\theta H(\theta_t)$. Let $\rho^*(x) = \frac{1}{Z_\beta}e^{-V(x)/\beta}$ be the Gibbs distribution of the original Fokker-Planck equation. Then we have the inequality $\mathrm{KL}(\rho_{\theta_t}\,\|\,\rho^*) \le \frac{\delta_0}{\tilde\lambda_\beta\,\beta^2}\big(1 - e^{-\beta\tilde\lambda_\beta t}\big) + \mathrm{KL}(\rho_{\theta_0}\,\|\,\rho^*)\,e^{-\beta\tilde\lambda_\beta t}$. Here $\tilde\lambda_\beta > 0$ is a constant originating from the log-Sobolev inequality involving $\rho^*$, i.e. $\mathrm{KL}(\rho\,\|\,\rho^*) \le \frac{1}{\tilde\lambda_\beta} I(\rho\,|\,\rho^*)$ holds for any probability density $\rho$, where $I(\rho\,|\,\rho^*) = \int\big|\nabla\log\frac{\rho}{\rho^*}\big|^2\rho\,dx$.
  41. Error analysis Theorem: Assume $\{\theta_t\}$ solves $\dot\theta_t = -G_W(\theta_t)^{-1}\nabla_\theta H(\theta_t)$

    and $\{\rho_t\}$ solves the Fokker-Planck equation. Assume that the Hessian of the potential function $V$ is bounded below by a constant $\lambda$, i.e. $\nabla^2 V \succeq \lambda I$. Then we have $W_2(\rho_{\theta_t}, \rho_t) \le \frac{\sqrt{\delta_0}}{\lambda}\big(1 - e^{-\lambda t}\big) + e^{-\lambda t}\,W_2(\rho_{\theta_0}, \rho_0)$. When $V$ is strictly convex, i.e. $\lambda > 0$, this inequality provides a uniformly small upper bound for the error $W_2(\rho_{\theta_t}, \rho_t)$; however, in many cases $\lambda < 0$, and then the right-hand side will increase to infinity as $t\to\infty$.
  42. Revisit Langevin dynamics by Kernel and Neural networks

  43. Neural network particle dynamics Fokker-Planck equation:

    $\frac{\partial\rho}{\partial t} = \nabla\cdot(\rho\nabla V) + \beta\Delta\rho$, $\rho(\cdot, 0) = \rho_0$. Particle version: overdamped Langevin dynamics: $dX_t = -\nabla V(X_t)\,dt + \sqrt{2\beta}\,dB_t$, $X_0\sim\rho_0$. Vlasov-type dynamics: $\frac{dX_t}{dt} = -\nabla V(X_t) - \beta\,\nabla\log\rho_t(X_t)$, $X_0\sim\rho_0$. Here $X_t\sim\rho_t$.
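To make the contrast concrete, here is a one-dimensional toy comparison of the two particle versions (an illustration, assuming V(x) = x²/2 and β = 1): the stochastic Langevin particles and the deterministic Vlasov-type particles, where ∇log ρ_t is replaced by a Gaussian kernel-density estimate.

```python
import numpy as np

rng = np.random.default_rng(4)
beta, dt, n_steps, N = 1.0, 0.01, 500, 500
grad_V = lambda x: x                                   # V(x) = x^2 / 2

def kde_score(x, bandwidth=0.3):
    # grad log rho at each particle, with rho the Gaussian KDE of the particle cloud
    diff = x[:, None] - x[None, :]                     # pairwise differences, shape (N, N)
    w = np.exp(-0.5 * (diff / bandwidth) ** 2)
    return -(w * diff).sum(axis=1) / (bandwidth ** 2 * w.sum(axis=1))

x_sde = rng.normal(3.0, 0.1, N)                        # both clouds start from the same rho_0
x_ode = x_sde.copy()
for _ in range(n_steps):
    x_sde += -grad_V(x_sde) * dt + np.sqrt(2 * beta * dt) * rng.standard_normal(N)
    x_ode += (-grad_V(x_ode) - beta * kde_score(x_ode)) * dt

print(x_sde.mean(), x_sde.var())                       # both approach the Gibbs law N(0, beta)
print(x_ode.mean(), x_ode.var())
```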
  44. Particle's Fokker-Planck Equation Since $\dot\theta_t = -G_W(\theta_t)^{-1}\nabla_\theta H(\theta_t)$,

    plugging in the definition of the metric tensor, we obtain $\dot X_t = \int K_{\theta_t}(X_t, \eta)\big(-\nabla V(\eta) - \beta\nabla\log\rho_{\theta_t}(\eta)\big)\rho_{\theta_t}(\eta)\,d\eta$. Here we define the kernel function $K_\theta\colon\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}^{d\times d}$, $K_\theta(x, \eta) = \nabla\Psi^T(x)\Big(\int\nabla\Psi(x)\nabla\Psi(x)^T\rho_\theta(x)\,dx\Big)^{-1}\nabla\Psi(\eta)$.
  45. Neural Vlasov Dynamics Vlasov-type dynamics (Neural FPE):

    $\dot X_t = -K_{\theta_t}(\nabla V + \beta\nabla\log\rho_{\theta_t})(X_t)$, $X_0\sim\rho_{\theta_0}$; here $\rho_{\theta_t}$ is the probability density of $X_t$. Vlasov-type dynamics (FPE): $\dot{\tilde X}_t = -(\nabla V + \beta\nabla\log\rho_t)(\tilde X_t)$, $\tilde X_0\sim\rho_0$; here $\rho_t$ is the probability density of $\tilde X_t$.
  46. Future works Solve other types of kinetic PDEs; study the

    convergence behavior of the neural parametric Wasserstein space; combine with transport information PDEs to work on sampling problems.