Neural Fokker-Planck equation

In this talk, we develop and analyze numerical methods for high dimensional Fokker-Planck equations by leveraging generative models from deep learning. Our starting point is a formulation of the Fokker-Planck equation as a system of ordinary differential equations (ODEs) on a finite-dimensional parameter space, with the parameters inherited from generative models such as normalizing flows. We call such ODEs neural parametric Fokker-Planck equations. The fact that the Fokker-Planck equation can be viewed as the L2-Wasserstein gradient flow of the Kullback-Leibler (KL) divergence allows us to derive the ODEs as the constrained L2-Wasserstein gradient flow of the KL divergence on the set of probability densities generated by neural networks. For numerical computation, we design a variational semi-implicit scheme for the time discretization of the proposed ODE. The algorithm is sampling-based and can readily handle Fokker-Planck equations in higher dimensional spaces. Moreover, we establish asymptotic convergence and error bounds for the neural parametric Fokker-Planck equation, for both the continuous and the discrete (forward-Euler time discretization) versions. Several numerical examples are provided to illustrate the performance of the proposed algorithms and analysis.

Wuchen Li

May 06, 2020

Transcript

  1. Transport information geometric computation: Neural Fokker-Planck equations Wuchen Li UCLA/University

    of South Carolina Shanghai Jiao Tong University, Data Science Seminar. This talk is based on joint work with Shu Liu (GT), Hongyuan Zha (GT) and Haomin Zhou (GT).
  2. Information and Learning

  3. Learning applications

  4. Learning Given a data measure $\rho_{\mathrm{data}}(x) = \frac{1}{N}\sum_{i=1}^{N}\delta_{X_i}(x)$ and a

    parameterized model $\rho(x,\theta)$, most problems in AI or sampling reduce to $\min_{\rho_\theta\in\rho(\Theta)} D(\rho_{\mathrm{data}},\rho_\theta)$. One typical choice of $D$ is the Kullback–Leibler divergence (relative entropy) $D(\rho_{\mathrm{data}},\rho_\theta) = \int_\Omega \rho_{\mathrm{data}}(x)\log\frac{\rho_{\mathrm{data}}(x)}{\rho(x,\theta)}\,dx$. One can design a gradient descent method to solve the parameter optimization problem.
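As a minimal illustration of this gradient descent (a toy example, not from the slides): for a one-dimensional Gaussian model ρ(x, θ) = N(θ, 1), minimizing D(ρ_data, ρ_θ) over θ is equivalent to maximum likelihood, and the gradient has a closed form.

```python
import numpy as np

# Gradient descent on theta -> D(rho_data, rho_theta) for a toy Gaussian model
# rho(x, theta) = N(theta, 1). Up to a theta-independent constant, the KL
# divergence is the negative log-likelihood, whose gradient is theta - mean(X_i).

rng = np.random.default_rng(0)
X = rng.normal(1.7, 1.0, size=500)      # data samples X_i defining rho_data

theta, lr = 0.0, 0.1
for _ in range(200):
    grad = theta - X.mean()             # d/dtheta of E_data[-log rho(x, theta)]
    theta -= lr * grad

print(theta, X.mean())                  # gradient descent recovers the sample mean (the MLE)
```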
  5. Fokker-Planck equation (i) The Fokker-Planck equation (FPE) is a parabolic

    evolution equation: $\frac{\partial\rho(t,x)}{\partial t} = \nabla\cdot(\rho(t,x)\nabla V(x)) + \beta\Delta\rho(t,x)$, $\rho(0,x) = \rho_0(x)$, where $x\in\mathbb{R}^d$ and $\beta > 0$ is a given constant diffusion coefficient. (ii) The FPE describes the evolution of the probability density of a stochastic differential equation known as the Langevin dynamics: denoting $X_t\sim\rho_t$, then $\dot X_t = -\nabla V(X_t) + \sqrt{2\beta}\,\dot B_t$, $X_0\sim\rho_0$, where $B_t$ is the standard Brownian motion in $\mathbb{R}^d$. (iii) The stationary solution of the FPE is the Gibbs distribution $\rho^*(x) = \frac{1}{Z}e^{-V(x)/\beta}$, where $Z = \int e^{-V(y)/\beta}\,dy$ is a normalization constant.
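The Langevin dynamics above suggests a direct sampling view of the FPE. Below is a minimal Euler–Maruyama sketch (the double-well potential is an illustrative choice, not one from the slides); the empirical distribution of the particles approximates ρ_t and converges toward the Gibbs density.

```python
import numpy as np

# Euler-Maruyama discretization of the Langevin SDE
#   dX_t = -grad V(X_t) dt + sqrt(2 * beta) dB_t,
# whose density rho_t solves the Fokker-Planck equation above.

rng = np.random.default_rng(1)
beta, dt, n_steps = 0.5, 1e-3, 20_000

def grad_V(x):                          # V(x) = (|x|^2 - 1)^2 / 4, a double well in R^2
    return (np.sum(x**2, axis=-1, keepdims=True) - 1.0) * x

X = rng.normal(0.0, 0.1, size=(5000, 2))            # particles drawn from rho_0
for _ in range(n_steps):
    X += -grad_V(X) * dt + np.sqrt(2 * beta * dt) * rng.standard_normal(X.shape)

# The empirical measure of X now approximates the Gibbs density ~ exp(-V / beta):
print(np.mean(np.linalg.norm(X, axis=1)))           # mass concentrates near the ring |x| = 1
```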
  6. Related works In computing FPEs, many numerical methods have been

    considered. Traditional methods: (i) Discretization: finite volume, finite difference, finite element, etc. (ii) Variational: optimal transport gradient flows, in both Eulerian and Lagrangian formulations (Carrillo, Craig, Wang, Wei, Li, Cances, et al.). (iii) Tools: primal-dual methods (Li, Osher, et al.). Curse of dimensionality: in traditional numerical methods, solving high dimensional FPEs is not simple whenever d ≥ 3; this is known as the curse of dimensionality. In this talk, we aim to provide a new scheme based on deep-learning variational methods.
  7. Overcome the curse of dimensionality Many methods have been proposed for

    the Hamilton-Jacobi equation: Hopf-Lax formula: Darbon, Osher; Path space: Weinan E et al.; Lagrangian methods: Lars, Halder, et al.; Inverse problems: L. Yang and G. E. Karniadakis et al.
  8. Goal Main question: (a) Can we apply learning-related basis

    functions (neural networks) to approximate the solution of the FPE in high dimensional spaces? (b) Is there a systematic way to connect classical numerical methods for PDEs with neural networks? Answer: 1. Dimension reduction via neural networks: Fokker-Planck equation (infinite dimensional ODE) ⇓ parametric Fokker-Planck equation (finite dimensional ODE). 2. Approximate the solution of the FPE in data-poor situations, i.e., introduce sampling-friendly variational schemes.
  9. Transport information geometric computation

  10. Learning and Gradient flows Natural gradient: (i) Amari, Natural gradient

    works efficiently in learning; Ny et al., Information geometry; (ii) Adam, K-FAC, etc.; (iii) Li, Transport information geometry I, II. Learning: (i) Li, Guido, Chen, Natural gradient via optimal transport I, II; Ricci curvature of parametric statistics via optimal transport, IG; (ii) Lin, Li, Osher, Guido, Wasserstein proximal of GANs; Natural affine proximal learning, GSI; (iii) Arbel, Gretton, Li, Guido, Kernelized Wasserstein natural gradient, ICLR, 2020; (iv) Lars, Osher, Li, Levon, Samy, A machine learning framework for mean field games, PNAS, 2020; (v) Li, Liu, Zha, Zhou, Parametric Fokker-Planck equations, GSI; (vi) Liu, Li, Zha, Zhou, Neural parametric Fokker-Planck equations.
  11. Neural network + FPE Neural networks in generative models¹,²: Deep structure ⇒

    dealing with high dimensional problems; Pushforward mechanism ⇒ sampling-friendly. Concretely: approximate ρt by a pushforward map; construct an ODE for the parameters and numerically solve the ODE. ¹Ian J. Goodfellow et al., Generative Adversarial Nets, 2014. ²Martin Arjovsky et al., Wasserstein Generative Adversarial Networks, 2016.
  12. Route map

  13. Optimal transport Mapping formulation: Monge problem (1781): Monge-Ampère equation;

    Static formulation: Kantorovich problem (1940): linear programming; Dynamical formulation: density optimal control (Nelson, Lafferty, Benamou, Brenier, Otto, Villani, etc.). In this talk, we will apply density optimal control to data-driven scientific computing methods.
  14. Variational structure Optimal transport has an optimal control formulation as

    follows: $\inf_{\rho_t}\int_0^1 g_W(\partial_t\rho_t,\partial_t\rho_t)\,dt = \int_0^1\int_\Omega (\nabla\Phi_t,\nabla\Phi_t)\,\rho_t\,dx\,dt$, under the dynamical constraint, i.e. continuity equation: $\partial_t\rho_t + \nabla\cdot(\rho_t\nabla\Phi_t) = 0$, $\rho_0 = \rho^0$, $\rho_1 = \rho^1$. Here, $(\mathcal{P}(\Omega), g_W)$ forms an infinite-dimensional Riemannian manifold¹. ¹John D. Lafferty: the density manifold and configuration space quantization, 1988.
  15. Optimal transport gradient Denote $\mathrm{grad}_W$ the gradient operator on $(\mathcal{P}, g_W)$.

    Consider a smooth functional $F\colon\mathcal{P}\to\mathbb{R}$; then the optimal transport gradient operator satisfies $\mathrm{grad}_W F(\rho) = g_W(\rho)^{-1}\frac{\delta}{\delta\rho}F(x) = -\nabla\cdot\big(\rho(x)\nabla\frac{\delta}{\delta\rho(x)}F(\rho)\big)$, where $\frac{\delta}{\delta\rho(x)}$ is the $L^2$ first variation at the variable $x\in\mathbb{R}^d$.
  16. Optimal transport gradient flows Consider the Kullback–Leibler (KL) divergence:

    $\mathcal{H}(\rho) = \beta D_{\mathrm{KL}}(\rho\,\|\,e^{-V/\beta}) = \int V(x)\rho(x)\,dx + \beta\int\rho(x)\log\rho(x)\,dx + \mathrm{Const}$. The optimal transport gradient flow of the KL divergence satisfies $\frac{\partial\rho}{\partial t} = -\beta g_W(\rho)^{-1}\frac{\delta D_{\mathrm{KL}}}{\delta\rho}(x) = \nabla\cdot(\rho\nabla V) + \beta\nabla\cdot(\rho\nabla\log\rho) = \nabla\cdot(\rho\nabla V) + \beta\Delta\rho$, where $\rho\nabla\log\rho = \nabla\rho$. This corresponds to a variational structure of the FPE. In fact, many other PDEs satisfy this variational formulation, such as aggregation equations and the quantum drift-diffusion equation arising in kinetic theory.
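For completeness, the first-variation step behind this identity can be spelled out (a standard computation, consistent with the formulas on the previous slides):

```latex
\frac{\delta \mathcal{H}}{\delta \rho}(x) = V(x) + \beta\bigl(\log\rho(x) + 1\bigr),
\qquad
\frac{\partial \rho}{\partial t}
  = -\operatorname{grad}_W \mathcal{H}(\rho)
  = \nabla\cdot\Bigl(\rho\,\nabla\bigl(V + \beta\log\rho + \beta\bigr)\Bigr)
  = \nabla\cdot(\rho\nabla V) + \beta\Delta\rho,
```

since the constant β has vanishing gradient and ρ∇log ρ = ∇ρ.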
  17. Generative Adversarial Networks For each parameter $\theta\in\mathbb{R}^d$ and a

    given neural-network-parameterized mapping function $T_\theta$, consider $\rho_\theta = T_{\theta\#}\,p(z)$.
  18. Generative probability models Consider a class of invertible push-forward maps

    $\{T_\theta\}_{\theta\in\Theta}$ indexed by parameter $\theta\in\Theta\subset\mathbb{R}^m$, $T_\theta\colon\mathbb{R}^d\to\mathbb{R}^d$. We obtain a family of parametric distributions $\mathcal{P}_\Theta = \{\rho_\theta = T_{\theta\#}\,p \mid \theta\in\Theta\} \subset (\mathcal{P}, g_W)$.
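A minimal numerical illustration of the push-forward construction, with a toy affine map standing in for the neural network T_θ (an illustrative choice; a normalizing flow plays this role in the actual method):

```python
import numpy as np

# Sampling from a push-forward density rho_theta = T_theta # p: draw z from the
# reference density p and apply T_theta. Here theta = (A, b) is a toy affine map.

rng = np.random.default_rng(2)
A = np.array([[1.0, 0.5],
              [0.0, 0.8]])              # assumed invertible
b = np.array([2.0, -1.0])
T = lambda z: z @ A.T + b

z = rng.standard_normal((10_000, 2))    # z ~ p = N(0, I)
x = T(z)                                # x ~ rho_theta = T_theta # p
print(x.mean(axis=0))                   # approximately b
print(np.cov(x.T))                      # approximately A @ A.T
```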
  19. Neural transport information matrix Here $G_W(\theta)\in\mathbb{R}^{m\times m}$ is

    a matrix² defined by $G_{W,ij}(\theta) = \int \langle\nabla\Phi_i(x),\nabla\Phi_j(x)\rangle\,\rho(x,\theta)\,dx$, where $-\nabla\cdot(\rho_\theta\nabla\Phi_k(x)) = \nabla_{\theta_k}\rho_\theta$ for $k = i, j$. In terms of generative models, the matrix takes the form $G_{W,ij}(\theta) = \int \nabla\psi_i(T_\theta(x))\cdot\nabla\psi_j(T_\theta(x))\,p(x)\,dx$, $1\le i,j\le m$, where $\nabla\cdot(\rho_\theta\nabla\psi_k(x)) = \nabla\cdot\big(\rho_\theta\,\partial_{\theta_k}T_\theta(T_\theta^{-1}(x))\big)$. ²Li, Zhao, Wasserstein information matrix, 2019.
  20. Neural parametric Fokker-Planck Equation Consider $H = \mathcal{H}\circ T_\#\colon\Theta\to\mathbb{R}$;

    then $H(\theta) = \mathcal{H}(\rho_\theta) = \int\big[V(T_\theta(x)) + \beta\log\rho_\theta(T_\theta(x))\big]\,p(x)\,dx$. Writing the gradient flow of $H(\theta)$ in $(\Theta, G_W(\theta))$, we derive $\dot\theta = -G_W(\theta)^{-1}\nabla_\theta H(\theta)$. We call the above ODE the neural Fokker-Planck equation. This is the neural approximation of the FPE $\frac{\partial\rho}{\partial t} = -g_W(\rho)^{-1}\frac{\delta}{\delta\rho}\mathcal{H}(\rho)$.
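The parametric ODE can already be integrated with an explicit scheme when G_W and ∇H are available. Here is a toy forward-Euler sketch for a one-parameter translation family T_θ(z) = z + θ (an illustrative choice, not the flows used in the talk): for such a location family the Wasserstein information matrix is the identity and the entropy term of H is θ-independent, so ∇_θ H(θ) = E_p[V'(z + θ)].

```python
import numpy as np

# Forward-Euler integration of theta_dot = -G_W(theta)^{-1} grad_theta H(theta)
# for the translation family T_theta(z) = z + theta, reference p = N(0, 1),
# potential V(x) = x^2 / 2 and beta = 1.

rng = np.random.default_rng(3)
z = rng.standard_normal(10_000)                 # samples from the reference density p

def grad_H(theta):
    return np.mean(z + theta)                   # Monte Carlo estimate of E_p[V'(z + theta)]

def G_W(theta):
    return 1.0                                  # Wasserstein metric of a location family

h, theta = 0.05, 3.0
for _ in range(100):
    theta -= h * grad_H(theta) / G_W(theta)     # one forward-Euler step

print(theta)                                    # decays toward 0, the free-energy minimizer
```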
  21. Algorithm We endow $T_\theta$ with the structure of deep neural

    networks. In particular, we choose the normalizing flow⁹: $T_\theta(z) = T^1_\theta\circ T^2_\theta\circ\cdots\circ T^N_\theta(z)$. Difficulties: (i) direct computation of $G_W(\theta)$ could be expensive; (ii) $G_W(\theta)^{-1}$ may not be efficiently computable within a deep learning framework. We therefore seek a variational proximal scheme to handle these difficulties. ⁹D. Rezende and S. Mohamed, Variational inference with normalizing flows, ICML, 2015.
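A minimal sketch of such a composed push-forward map, built from planar-flow layers z ↦ z + u·tanh(w·z + b) in PyTorch (layer and variable names are illustrative; the usual constraint on (u, w) that guarantees invertibility is omitted here):

```python
import torch

class PlanarLayer(torch.nn.Module):
    """One planar-flow layer z -> z + u * tanh(w . z + b)."""
    def __init__(self, dim):
        super().__init__()
        self.u = torch.nn.Parameter(0.1 * torch.randn(dim))
        self.w = torch.nn.Parameter(0.1 * torch.randn(dim))
        self.b = torch.nn.Parameter(torch.zeros(1))

    def forward(self, z):                         # z: (batch, dim)
        return z + self.u * torch.tanh(z @ self.w + self.b).unsqueeze(-1)

class PushForward(torch.nn.Module):
    """Composition T_theta = T^1 ∘ T^2 ∘ ... ∘ T^N of planar layers."""
    def __init__(self, dim, n_layers):
        super().__init__()
        self.layers = torch.nn.ModuleList(PlanarLayer(dim) for _ in range(n_layers))

    def forward(self, z):
        for layer in self.layers:
            z = layer(z)
        return z

# Sampling from rho_theta = T_theta # p: draw z ~ N(0, I) and push it forward.
T = PushForward(dim=2, n_layers=4)
samples = T(torch.randn(1024, 2))
```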
  22. Review: Proximal method Consider an Euclidean gradient flow in $(\mathbb{R}^d, I)$:

    $\dot x = -\nabla F(x)$. The proximal method refers to $x_{k+1} = \arg\min_{x\in\mathbb{R}^d}\, F(x) + \frac{1}{2h}\|x - x_k\|^2$, where $h > 0$ is the step size. This is known as the backward Euler method: $x_{k+1} = x_k - h\nabla F(x_{k+1})$.
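A minimal numerical sketch of this backward-Euler/proximal step, on a hypothetical smooth objective, with the inner minimization handed to a generic optimizer:

```python
import numpy as np
from scipy.optimize import minimize

# Proximal (backward Euler) step: x_{k+1} = argmin_x F(x) + ||x - x_k||^2 / (2h).

def F(x):
    return 0.5 * np.sum(x ** 2) + np.sum(np.cos(x))      # an arbitrary smooth test objective

def proximal_step(x_k, h):
    objective = lambda x: F(x) + np.sum((x - x_k) ** 2) / (2.0 * h)
    return minimize(objective, x_k).x                    # warm-start the inner solver at x_k

x, h = np.array([2.0, -1.5]), 0.1
for _ in range(50):
    x = proximal_step(x, h)                              # each step is an implicit update
print(x)
```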
  23. Review: Proximal method on a metric structure Consider a gradient

    flow in $(\mathbb{R}^d, G)$: $\dot x = -G(x)^{-1}\nabla F(x)$. The proximal method refers to $x_{k+1} = \arg\min_{x\in\mathbb{R}^d}\, F(x) + \frac{1}{2h} d_G(x, x_k)^2$, where $h > 0$ is the step size and $d_G(x,y)$ is the distance function induced by $G$, i.e. $d_G(x,y)^2 = \inf_\gamma\big\{\int_0^1 \dot\gamma(t)^T G(\gamma(t))\,\dot\gamma(t)\,dt : \gamma(0) = x,\ \gamma(1) = y\big\}$. This is known as the variational backward Euler method: $x_{k+1} = x_k - h\,G(x_{k+1})^{-1}\nabla F(x_{k+1}) + o(h)$.
  24. Proximal scheme for FPE In terms of optimal transport, the

    proximal method, also named the Jordan–Kinderlehrer–Otto (JKO) scheme, takes the form $\rho_{k+1} = \arg\min_\rho\, F(\rho) + \frac{1}{2h} d_W(\rho, \rho_k)^2$. A suitable linearization of the metric can also be considered; see Li, Wang, JCP (2020).
  25. Algorithm Goal: Propose a variational scheme for solving gradient flows

    in $(\Theta, G_W(\theta))$: $\dot\theta_t = -G_W(\theta_t)^{-1}\nabla_\theta H(\theta_t)$. Consider a linearized proximal algorithm: $\theta_{k+1} = \arg\min_\theta\,\big\{\langle\theta - \theta_k,\, G_W(\theta_k)(\theta - \theta_k)\rangle + 2h\,H(\theta)\big\}$.
  26. GAN Proximal scheme Consider the following approximation:

    $\langle\theta - \theta_k,\, G_W(\theta_k)(\theta - \theta_k)\rangle \approx \max_\psi \int_M \big(2\nabla\psi(x)\cdot\big((T_\theta - T_{\theta_k})\circ T_{\theta_k}^{-1}(x)\big) - |\nabla\psi(x)|^2\big)\,\rho_{\theta_k}(x)\,dx$. The variational saddle scheme satisfies $\theta_{k+1} = \arg\min_\theta\max_\psi\Big\{\int_M \big(2\nabla\psi(x)\cdot\big((T_\theta - T_{\theta_k})\circ T_{\theta_k}^{-1}(x)\big) - |\nabla\psi(x)|^2\big)\,\rho_{\theta_k}(x)\,dx + 2h\,H(\theta)\Big\}$. Here we choose $\psi$ as a ReLU network and optimize over its parameters. We write the saddle-point scheme in terms of finite samples and carry out the optimization steps, as in GANs.
  27. Algorithm Algorithm 1 Solving parametric FPE $\dot\theta = -G(\theta)^{-1}\nabla_\theta F(\theta)$

    Initialize $\theta_0$
    for $k = 0, 1, \ldots, N$ do
      $\theta \leftarrow \theta_k$
      for $j = 1, 2, \ldots, M$ do
        Apply stochastic gradient descent to solve $\inf_\psi \int |T_\theta(x) - T_{\theta_k}(x) - \nabla\psi(T_{\theta_k}(x))|^2\,p(x)\,dx$
        $\theta \leftarrow \theta - \text{stepsize}\cdot\nabla_\theta J(\theta, \theta_k)$, where $J(\theta, \tilde\theta) = \int T_\theta(x)\cdot\nabla\psi(T_{\tilde\theta}(x))\,p(x)\,dx + h\,F(\theta)$
      end for
      $\theta_{k+1} \leftarrow \theta$
    end for
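Below is a hedged PyTorch sketch of one outer step of this algorithm. It assumes a push-forward module T (for instance the PushForward flow sketched earlier) and a frozen copy T_old of it at step k; the MLP for ψ, the optimizers, and the free-energy estimator are illustrative choices, not the authors' exact implementation, and the β-entropy term of the free energy is omitted for brevity.

```python
import torch

dim, h = 2, 0.05
V = lambda x: 0.5 * (x ** 2).sum(dim=-1)          # toy potential, for illustration only

psi = torch.nn.Sequential(torch.nn.Linear(dim, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 1))

def grad_psi(y):
    # gradient of the scalar field psi at the points y, kept differentiable in psi's parameters
    y = y.detach().requires_grad_(True)
    g, = torch.autograd.grad(psi(y).sum(), y, create_graph=True)
    return g

def free_energy(T, z):
    # Monte Carlo estimate of E_p[V(T(z))]; the full scheme also needs beta * E[log rho_theta]
    return V(T(z)).mean()

def outer_step(T, T_old, z, inner_iters=50):
    opt_psi = torch.optim.Adam(psi.parameters(), lr=1e-3)
    opt_T = torch.optim.Adam(T.parameters(), lr=1e-3)
    y_old = T_old(z).detach()
    for _ in range(inner_iters):
        # inner problem: fit grad(psi) to the displacement T_theta - T_theta_k
        loss_psi = ((T(z).detach() - y_old - grad_psi(y_old)) ** 2).sum(-1).mean()
        opt_psi.zero_grad(); loss_psi.backward(); opt_psi.step()
        # outer objective J(theta, theta_k), followed by one gradient step in theta
        J = (T(z) * grad_psi(y_old).detach()).sum(-1).mean() + h * free_energy(T, z)
        opt_T.zero_grad(); J.backward(); opt_T.step()

# usage sketch: T = PushForward(dim, 4); import copy; T_old = copy.deepcopy(T)
#               outer_step(T, T_old, torch.randn(1024, dim)); then refresh T_old and repeat
```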
  28. Numerical examples We compute the FPE with a complicated potential in

    higher dimensions. In $\mathbb{R}^{30}$, with $\beta = 1$, we set $V(x) = \frac{1}{50}\sum_{i=1}^{30}\big(x_i^4 - 16x_i^2 + 5x_i\big)$. Here is a plot of this function in two-dimensional space:
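The test potential and its gradient in code (the gradient is an added convenience, for use in particle dynamics):

```python
import numpy as np

# V(x) = (1/50) * sum_i (x_i^4 - 16 x_i^2 + 5 x_i) on R^30, with beta = 1.

def V(x):
    return np.sum(x ** 4 - 16.0 * x ** 2 + 5.0 * x, axis=-1) / 50.0

def grad_V(x):
    return (4.0 * x ** 3 - 32.0 * x + 5.0) / 50.0

x = np.random.default_rng(0).standard_normal((5, 30))   # five test points in R^30
print(V(x))
```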
  29. Numerical examples Let the time step size be h = 0.005.

    We project the samples in $\mathbb{R}^{30}$ onto a two-dimensional subspace, say the plane spanned by the 5th and 15th components. Here are the numerical results:
  35. Numerical examples With the same time step size h = 0.005 and the same

    two-dimensional projection, here is the final result. Figure: Density estimation of the sample points at the 360th iteration
  36. Numerical examples We can examine the graph of ψ trained

    at the end of each outer iteration. Here are the graphs of ψ at the 30th, 60th, 150th, 240th, 300th and 360th iterations: Figure: Graph of ψ at the k = 30th time step Figure: Graph of ψ at the k = 60th time step Figure: Graph of ψ at the k = 150th time step
  37. Numerical examples (continued) Here are the graphs of ψ at the later

    iterations: Figure: Graph of ψ at the k = 240th time step Figure: Graph of ψ at the k = 300th time step Figure: Graph of ψ at the k = 360th time step
  38. Neural numerical analysis Denote by $\{\rho_t\}$ the solution to

    the original FPE and by $\{\theta_t\}$ the solution to the parametric FPE, with $\rho_{\theta_t} = T_{\theta_t\#}\,p$. 1. Convergence analysis: we provide an upper bound for $D_{\mathrm{KL}}(\rho_{\theta_t}\,\|\,\rho^*)$. Here $\rho^* = \frac{1}{Z_\beta}\exp(-\frac{V}{\beta})$ is the Gibbs distribution (the equilibrium solution of the Fokker-Planck equation). 2. Error analysis: we provide a uniform bound for $W_2(\rho_{\theta_t}, \rho_t)$.
  39. Neural numerical analysis Before we introduce our results on numerical

    analysis, we first introduce an important quantity that plays an essential role in our results. Recalling $\Psi$ as defined in (19) of Theorem 1, we set $\delta_0 = \sup_{\theta\in\Theta}\min_{\xi\in T_\theta\Theta}\int\big|\nabla\Psi(T_\theta(x))^T\xi - \nabla(V + \beta\log\rho_\theta)\circ T_\theta(x)\big|^2\,dp(x)$. One can treat $\delta_0$ as a quantification of the approximation power of the push-forward family $\{T_\theta\}_{\theta\in\Theta}$.
  40. Convergence analysis Theorem: Consider the Fokker-Planck equation with smooth $V$ and

    $\nabla^2 V \succ 0$ outside a finite ball. Suppose $\{\theta_t\}$ solves $\dot\theta_t = -G_W(\theta_t)^{-1}\nabla_\theta H(\theta_t)$. Let $\rho^*(x) = \frac{1}{Z_\beta}e^{-V(x)/\beta}$ be the Gibbs distribution of the original Fokker-Planck equation. Then we have the inequality $\mathrm{KL}(\rho_{\theta_t}\,\|\,\rho^*) \le \frac{\delta_0}{\tilde\lambda_\beta\,\beta^2}\big(1 - e^{-\beta\tilde\lambda_\beta t}\big) + \mathrm{KL}(\rho_{\theta_0}\,\|\,\rho^*)\,e^{-\beta\tilde\lambda_\beta t}$. Here $\tilde\lambda_\beta > 0$ is a constant originating from the log-Sobolev inequality involving $\rho^*$, i.e. $\mathrm{KL}(\rho\,\|\,\rho^*) \le \frac{1}{\tilde\lambda_\beta} I(\rho\,|\,\rho^*)$ holds for any probability density $\rho$, where $I(\rho\,|\,\rho^*) = \int\big|\nabla\log\frac{\rho}{\rho^*}\big|^2\rho\,dx$.
  41. Error analysis Theorem: Assume $\{\theta_t\}$ solves $\dot\theta_t = -G_W(\theta_t)^{-1}\nabla_\theta H(\theta_t)$

    and $\{\rho_t\}$ solves the Fokker-Planck equation. Assume that the Hessian of the potential function $V$ is bounded below by a constant $\lambda$, i.e. $\nabla^2 V \succeq \lambda I$. Then we have $W_2(\rho_{\theta_t}, \rho_t) \le \frac{\sqrt{\delta_0}}{\lambda}\big(1 - e^{-\lambda t}\big) + e^{-\lambda t}\,W_2(\rho_{\theta_0}, \rho_0)$. When $V$ is strictly convex, i.e. $\lambda > 0$, this inequality provides a uniformly small upper bound for the error $W_2(\rho_{\theta_t}, \rho_t)$; however, in many cases $\lambda < 0$, and then the right-hand side will increase to infinity as $t\to\infty$.
  42. Revisit Langevin dynamics by Kernel and Neural networks

  43. Neural network particle dynamics Fokker-Planck equation:

    $\frac{\partial\rho}{\partial t} = \nabla\cdot(\rho\nabla V) + \beta\Delta\rho$, $\rho(\cdot, 0) = \rho_0$. Particle version: overdamped Langevin dynamics: $dX_t = -\nabla V(X_t)\,dt + \sqrt{2\beta}\,dB_t$, $X_0\sim\rho_0$. Vlasov-type dynamics: $\frac{dX_t}{dt} = -\nabla V(X_t) - \beta\,\nabla\log\rho_t(X_t)$, $X_0\sim\rho_0$. Here $X_t\sim\rho_t$.
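To make the contrast concrete, here is a one-dimensional toy comparison of the two particle versions (an illustration, assuming V(x) = x²/2 and β = 1): the stochastic Langevin particles and the deterministic Vlasov-type particles, where ∇log ρ_t is replaced by a Gaussian kernel-density estimate.

```python
import numpy as np

rng = np.random.default_rng(4)
beta, dt, n_steps, N = 1.0, 0.01, 500, 500
grad_V = lambda x: x                                   # V(x) = x^2 / 2

def kde_score(x, bandwidth=0.3):
    # grad log rho at each particle, with rho the Gaussian KDE of the particle cloud
    diff = x[:, None] - x[None, :]                     # pairwise differences, shape (N, N)
    w = np.exp(-0.5 * (diff / bandwidth) ** 2)
    return -(w * diff).sum(axis=1) / (bandwidth ** 2 * w.sum(axis=1))

x_sde = rng.normal(3.0, 0.1, N)                        # both clouds start from the same rho_0
x_ode = x_sde.copy()
for _ in range(n_steps):
    x_sde += -grad_V(x_sde) * dt + np.sqrt(2 * beta * dt) * rng.standard_normal(N)
    x_ode += (-grad_V(x_ode) - beta * kde_score(x_ode)) * dt

print(x_sde.mean(), x_sde.var())                       # both approach the Gibbs law N(0, beta)
print(x_ode.mean(), x_ode.var())
```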
  44. Particle's Fokker-Planck Equation Since $\dot\theta_t = -G_W(\theta_t)^{-1}\nabla_\theta H(\theta_t)$,

    plugging in the definition of the metric tensor, we obtain $\dot X_t = \int K_{\theta_t}(X_t, \eta)\big(-\nabla V(\eta) - \beta\nabla\log\rho_{\theta_t}(\eta)\big)\rho_{\theta_t}(\eta)\,d\eta$. Here we define the kernel function $K_\theta\colon\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}^{d\times d}$, $K_\theta(x, \eta) = \nabla\Psi^T(x)\Big(\int\nabla\Psi(x)\nabla\Psi(x)^T\rho_\theta(x)\,dx\Big)^{-1}\nabla\Psi(\eta)$.
  45. Neural Vlasov Dynamics Vlasov-type dynamics (Neural FPE):

    $\dot X_t = -K_{\theta_t}(\nabla V + \beta\nabla\log\rho_{\theta_t})(X_t)$, $X_0\sim\rho_{\theta_0}$; here $\rho_{\theta_t}$ is the probability density of $X_t$. Vlasov-type dynamics (FPE): $\dot{\tilde X}_t = -(\nabla V + \beta\nabla\log\rho_t)(\tilde X_t)$, $\tilde X_0\sim\rho_0$; here $\rho_t$ is the probability density of $\tilde X_t$.
  46. Future works Solve other types of kinetic PDEs; study the

    convergence behavior of the neural parametric Wasserstein space; combine with transport information PDEs to work on sampling problems.