Wuchen Li (University of South Carolina, Columbia, USA) Information Gamma Calculus: Convexity Analysis for Stochastic Differential Equations

Information Gamma calculus: Convexity analysis for stochastic diﬀerential equations Wuchen
Li University of South Carolina, Columbia Humboldt University of Berlin, OT Berlin 2024 March 15, 2024. 1

Collaborators 2

AI sampling problems 3

Sampling images from distributions Figure: Results from Wasserstein Generative adversary
networks. They are samples from image distributions. 4

Sampling problems Main problem Given a function V : Ω
→ R, the problem is to sample from π(x) = 1 Z e−V (x), where x ∈ Ω is a discrete or continuous sampling (state) space, π is a probability density function, and Z is a normalization constant. Diﬃculties Fast, eﬃcient, accurate algorithms; High dimensional sampling: Ω ⊂ RN , N >> 3. E.g., Image distributions Ω = R64∗64; Function V = − log π − log Z, is often unknown or with intractable formulations. For simplicity of talk, we assume V or π is analytical known. 5

Stochastic differential equations Consider a stochastic differential equation (SDE) by
˙ Xt = b(Xt ) + √ 2a(Xt ) ˙ Bt , where (n, m) ∈ N, Xt ∈ Rn+m, b ∈ Rn+m is a drift vector function, a(Xt ) ∈ R(n+m)×n is a diffusion matrix function, and Bt ∈ Rn is a standard n dimensional Brownian motion. For a diffusion matrix a and drift vector field b, assume that it has an invariant measure of shape π, how can we analyze the convergence speed of stochastic dynamics towards its invariant distribution? This is crucial in designing AI type sampling algorithms. 6

Transport, Entropy and Information, Geometry 7

Example: Langevin dynamics We review a classical example. Consider an
overdamped Langevin dynamics in Rn: ˙ Xt = −∇V (Xt ) + √ 2 ˙ Bt , where V is a given potential function. Let ρ(t, x) be the probability density function of Xt , which satisﬁes the following Fokker-Planck equation ∂t ρ(t, x) = ∇ · (ρ(t, x)∇V (x)) + ∆ρ(t, x). Here π(x) := 1 Z e−V (x), Z = e−V (y)dy < ∞, is the invariant distribution of the SDE. 8

Lyapunov methods To study the dynamical behavior of ρ, we
apply a global Lyapunov functional: DKL (ρt π) = ρt (x)log ρt (x) π(x) dx. Along the Fokker-Planck equation, the first order dissipation satisfies d dt DKL (ρt π) = − ∇x log ρt (x) π(x) 2ρt dx. And the second order dissipation satisfies d2 dt2 DKL (ρt π) = 2 ∇2 xx log ρt π 2 F −∇2 xx log π(∇x log ρt π , ∇x log ρt π ) ρt dx, where · F is a matrix Frobenius norm. In literature, DKL is named the Kullback–Leibler divergence (relative entropy, also free energy in statistical physics community) and I = − d dt DKL is called the relative Fisher information functional. 9

Lyapunov constant Suppose there exists a “Lyapunov constant” λ >
0, such that −∇2 xx log π(x) λI. Then d2 dt2 DKL (ρt π) ≥ −2λ d dt DKL (ρt π). By integrating on the time variable, one can prove the exponential convergence below: DKL (ρt π) ≤ e−2λtDKL (ρ0 π). 10

Literature There are several approaches to establish the Lyapunov constant
for gradient dynamics. Log-Sobolev inequality (Gross); Iterative Gamma calculus (Bakry, Emery, Baudoin, Garofalo et.al.); Entropy dissipation and Hypocoercivity (Arnold, Carlen, Carrilo, Villani, Mohout, Jungel, Markowich, Toscani, et.al.); Optimal transport, displacement convexity and Hessian operators in density space. (McCann, Ambrosio, Villani, Otto, Gangbo, Mielke, Maas, Liero, et.al.); Transport Lyapunov functional (Renesse, Strum et.al.). How can we obtain Polyak-Lojasiewicz (PL) conditions in Wasserstein type spaces? 11

Convergence analysis Recall that ˙ Xt = b(Xt ) +
√ 2a(Xt ) ˙ Bt , where b can be a non-gradient drift vector, and a is a degenerate matrix. Its Fokker-Planck equation satisﬁes ∂t ρ(t, x) = −∇ · (ρ(t, x)b(x)) + n+m i=1 n+m j=1 ∂2 ∂xi ∂xj (a(x)a(x)T)ij ρ(t, x) . Assume that there exists an invariant distribution π ∈ C∞(Rn+m) with an explicit formulation. How fast does ρ converge to the invariant distribution π? 12

Goals In this talk, we mainly consider the entropy dissipation
analysis for non-gradient and degenerate stochastic dynamical systems. Main diﬃculties. Degeneracy of diﬀusion matrix; Non-gradient drift vectors. Our analysis is based on the construction of auxiliary second order operators in generalized optimal transport spaces. 13

Optimal transport distances The optimal transport has a variational formulation
(Benamou-Brenier 2000): D(ρ0, ρ1)2 := inf v 1 0 EXt∼ρt v(t, Xt ) 2 dt, where E is the expectation operator and the infimum runs over all vector fields vt , such that ˙ Xt = v(t, Xt ), X0 ∼ ρ0, X1 ∼ ρ1. Under this metric, the probability set has a metric structure1. 1John D. Lafferty: the density manifold and configuration space quantization, 1988. 14

Optimal transport metrics Informally speaking, the optimal transport metric refers
to the following bilinear form: ˙ ρ1 , G(ρ) ˙ ρ2 = ( ˙ ρ1 , (−∇ · (ρ∇))−1 ˙ ρ2 )dx. In other words, denote ˙ ρi = −∇ · (ρ∇φi ), i = 1, 2, then φ1 , G(ρ)−1φ2 = (φ1 , −∇ · (ρ∇)φ2 )dx = (∇φ1 , ∇φ2 )ρdx, where ρ ∈ P(Ω), ˙ ρi is the tangent vector in P(Ω), i.e. ˙ ρi dx = 0, and φi ∈ C∞(Ω) are cotangent vectors in P(Ω) at the point ρ. 15

Optimal transport first and second order calculus Given a metric
space (P, G), there is an optimal transport type calculus method, namely the first order operator (gradient) and the second order operator (Hessian) in the probability density space. They belong to transport information geometric analysis (TIGA). See [Li, TIG, 2018.] Shortly, we design a TIGA method to study non-gradient drift and degenerate diffusion coefficient type stochastic dynamical systems. 16

Optimal transport gradient operators The Wasserstein gradient flow of an
energy functional F(ρ) leads to ∂t ρ = − G(ρ)−1 δ δρ F(ρ) = ∇ · (ρ∇ δ δρ F(ρ)), where δ δρ is the L2 first variation operator. Example If F(ρ) = DKL (ρ π) = ρ(x)log ρ(x) π(x) dx, then the gradient flow satisfies the gradient-drift Fokker-Planck equation ∂ρ ∂t =∇ · (ρ∇log ρ π ) =∇ · (ρ∇ log ρ) − ∇ · (ρ∇ log π) =∆ρ + ∇ · (ρ∇V ). Here the trick is that ρ∇ log ρ = ∇ρ, ∇ log π = −∇V. 17

First and second order optimal transport calculuses In this way,
one can study the ﬁrst order entropy dissipation by d dt DKL (ρt π) = log ρt π ∇ · (ρ∇log ρt π )dx = − ∇log ρt π 2ρdx. Similarly, we study the second order entropy dissipation by d2 dt2 DKL (ρt π) =2 HessW DKL (ρt π)(log ρt π , log ρt π )ρt dx =2 Γ2 (log ρt π , log ρt π )ρt dx, where HessW DKL (ρt |π) := Γ2 is a bilinear form, which can be deﬁned by the optimal transport second order operator. 18

Convergence analysis of non-gradient ﬂows? Consider a general Fokker-Planck equation
as ∂t ρt = −∇ · (ρt b) + n+m i=1 n+m j=1 ∂2 ∂xi ∂xj (a(x)a(x)T)ij ρt . Suppose there exists a smooth invariant distribution π, such that −∇ · (πb) + n+m i=1 n+m j=1 ∂2 ∂xi ∂xj (a(x)a(x)T)ij π = 0. To study the convergence behavior of ρ towards π, we need to understand and estimate optimal transport type operators on both gradient and non-gradient directions. 19

Decomposition: Gradient and Non-gradient (ﬂux) One can rewrite the Fokker-Planck
equation by ∂t ρ(t, x) = ∇ · (ρ(t, x)a(x)a(x)T∇ log ρ(t, x) π(x) ) + ∇ · (ρ(t, x)γ(x)), Gradient Non-gradient (Flux) where γ(x) :=a(x)a(x)T∇ log π(x) − b(x) + n+m j=1 ∂ ∂xj (a(x)a(x)T)ij 1≤i≤n+m , and ∇ · (π(x)γ(x)) = 0. This is an example of generalized entropy-entropy ﬂux-metric condition. See examples in [Li-Liu-Osher, Controlling conservation laws I, II], and GENERIC or pre-GENERIC systems [Grmela and Ottinger]. 20

Orthogonalizations Proposition d dt DKL (ρt π) = − (∇x
log ρt π , a(x)a(x)T∇ log ρt π )ρt dx := −Ia (ρt ). In many situations, aaT-weighted Fisher information functional Ia (ρt ) is not enough to estimate the convergence of Fokker-Planck equation. 21

Main result: Structure condition Assumption: for any i ∈ {1,
· · · , n} and k ∈ {1, · · · , m}, we assume zT k ∇aT i ∈ Span{aT 1 , · · · , aT n }. Examples a is a constant vector; a is a matrix function deﬁned by a = a(x1 , · · · , xn ), z ∈ span{en+1 , · · · , en+m }, where ei is the i-th Euclidean basis function. 22

Main result: Entropy dissipation [Feng and Li, 2020-2022] Under the
assumption, for any β ∈ R and a given vector function z, deﬁne matrix functions as R = Ra + Rz + Rπ − MΛ + βRIa + (1 − β)Rγa + Rγz . If there exists a constant λ > 0, such that R λ(aaT + zzT), then the following decay results hold: DKL (ρt π) ≤ 1 2λ e−2λtIa,z (ρ0 π), where ρt is the solution of Fokker-Planck equation. 23

Tensors 24

Comparisons (i) If γ = 0 and m = 0:
[Bakry-Emery, 1985]. (ii) If γ = 0 and m = 0: [Baudoin-Garofalo, 2017], [F.-Li, 2019]. (iii) If β = 0 and m = 0, [Arnold-Carlen-Ju, 2000, 2008]. (iv) If a, z are constants and β = 0, [Arnold-Erb, 2014][Baudoin-Gordina-Herzog, 2019]. (v) If β = 1, m = 0 and a = I, [Arnold-Carlen]; [F.-Li, 2020]. 25

Main idea of proof Step 1: We ﬁrst compute the
dissipation of the weighted Fisher information functional (a.k.a. gradient norms in generalized OT space). d dt Ia,z (ρt π) = −2 Γ2,a,z,γ (f, f)ρt dx, where f = log ρt π . Step 2: We next decompose the weak form of information Gamma calculus (a.k.a. Hessian matrices in generalized OT space). Γ2,a,z,γ (f, f)ρt dx = Hessβ f 2 F + R(∇f, ∇f) ρt dx. Step 3: From the information Bochner’s formula, we establish the convergence result. In other words, if R λ(aaT + zzT), then d dt Ia,z (ρt π) ≤ −2λIa,z (ρt π). From the Gronwall’s inequality, we can prove that the weighted Fisher information decays, so as the KL divergence decay. 26

Details of proof Deﬁne Ia,z (ρ π) = Rn+m ∇
log ρ π , (aaT + zzT)∇ log ρ π ρdx. Consider − 1 2 d dt Ia,z (ρt ) = Γ2 (f, f)ρt dxdx · · · (I) + Γz,π 2 (f, f)ρt dx · · · (II) + ΓIa,z (f, f)ρt dx · · · (III) where f = log ρt π , and Γ2 , Γz 2 , Γγ are designed bilinear forms, coming from the second order calculus in density space. (i) If a is non-degenerate, then (II) = 0; (ii) If b is a gradient vector ﬁeld, then (III) = 0. 27

Detailed approach For any f ∈ C∞(Rn+m), the generator of
Itˆ o SDE satisﬁes Lf = Lf − γ, ∇f , where Lf = ∇ · (aaT∇f) + aaT∇ log π, ∇f . For a given matrix function a ∈ R(n+m)×n, we construct a matrix function z ∈ R(n+m)×m, and deﬁne a z-direction generator by Lz f = ∇ · (zzT∇f) + zzT∇ log π, ∇f . 28

Global computation=Information gamma operators Deﬁne Gamma one bilinear forms by
Γ1 (f, f) = aT∇f, aT∇f Rn , Γz 1 (f, f) = zT∇f, zT∇f Rm . Deﬁne Gamma two bilinear forms by (i) Gamma two operator: Γ2 (f, f) = 1 2 LΓ1 (f, f) − Γ1 (Lf, f). (ii) Generalized Gamma z operator: Γz,π 2 (f, f) = 1 2 LΓz 1 (f, f) − Γz 1 (Lf, f) + divπ z Γ1,∇(aaT) (f, f) − divπ a Γ1,∇(zzT) (f, f) . (iii) Irreversible Gamma operator: ΓIa,z (f, f) = (Lf + Lz f) ∇f, γ − 1 2 ∇ Γ1 (f, f) + Γz 1 (f, f) , γ . 29

Local calculation=Information Bochner’s formula For any f = log ρt
π ∈ C∞(Rn+m, R) and any β ∈ R, under the assumption, we derive − 1 2 d dt Ia,z (ρt π) = Γ2 (f, f) + Γz,π 2 (f, f) + ΓIa,z (f, f) ρt dx = Hessβ f 2 + R(∇f, ∇f) ρt dx ≥ R(∇f, ∇f)ρt dx. Clearly, if R λ(aaT + zzT), then − 1 2 d dt Ia,z (ρt π) ≥ λIa,z (ρt π). From Grownwall’s inequality, we prove the Lyapunov convergence result. 30

Example I: Non-degenerate Langevin dynamics Consider a stochastic process: dXt
= − a(Xt )a(Xt )T∇V (Xt ) + n j=1 ∂ ∂Xj (a(Xt )a(Xt )T)ij 1≤i≤n − γ(Xt ) dt + √ 2a(Xt )dBt , where Xt ∈ Rn, Bt is a standard n dimensional Brownian motion, V ∈ C2(Rn; R) is a potential function with ∇ · (e−V (x)γ(x)) = 0, and the diﬀusion matrix a is a smooth diagonal matrix with a(x) = diag(aii (xi ))1≤i≤n . Its invariant density satisﬁes π(x) = 1 Z e−V (x), Z = e−V (y)dy. 31

Example I: Hessian matrix The Hessian matrix R has the
following form: R(x) = Ra + βRIa + (1 − β)Rγa (x), where                    Ra,ii = a3 ii ∂xi aii ∂xi V + a4 ii ∂2 xixi V − a3 ii ∂2 xixi aii , Ra,ij = a2 ii a2 jj ∂2 xixj V, RIa,ii = γi [aii ∂xi aii − (aii )2∂xi V ], RIa,ij = 1 2 [γj (2aii ∂xi aii − (aii )2∂xi V ) + γi (2ajj ∂xj ajj − (ajj )2∂xj V )], Rγa,ii = γi aii ∂xi aii − ∂xi γi (aii )2, Rγa,ij = −1 2 [∂xi γj (ajj )2 + ∂xj γi (aii )2]. 32

Example I: One dimensional case Let n = 1 and
m = 0. R = a3a V + a4V − a3a + βγ(aa − a2V ) + (1 − β)(γaa − a2γ ), where and represent the ﬁrst and second-order derivatives w.r.t. x. If a = 1 and γ = 0, then R = V (x); If a = 1 and γ = 0, β = 0, then R = V (x) − γ (x); If a = 1 and γ = 0, β = 1, then R = V (x) − γ(x)V (x); If a = 1 and γ = 0, β ∈ R, then R = V (x) − βγ(x)V (x) − (1 − β)γ (x). 33

Example II: Underdamped Langevin dynamics Consider a underdamped Langevin dynamic
by dxt =vt dt dvt =(−T(xt )vt − ∇x U(xt ))dt + 2T(xt )dBt . It can be viewed as, Yt = (xt , vt ), dYt = b(Yt )dt + √ 2a(Yt )dBt , with drift vectors and diﬀusion coeﬃcient satisfying b = v −T(x)v − ∇U(x) , a = 0 T(x) . Its invariant measure has a closed form, π(x, v) = 1 Z e−H(x,v), H(x, v) = v 2 2 + U(x). 34

Example II: Decomposition Write a vector ﬁeld γ ∈ R2
as γ(x, v) = 0 −rv − v −rv − ∇x U = J∇x,v H(x, v), where J = 0 −1 1 0 . Clearly, ∇x,v · π(x, v)γ(x, v) = 1 Z ∇x,v · e−H(x,v) J∇x,v H(x, v) = 0. 35

Example II: Hessian matrix 36

II: Constant diﬀusion -1 -0.5 0 0.5 1 x 1
-1 -0.5 0 0.5 1 x 2 -1 -0.5 0 0.5 1 1.5 Smallest eignvalue: 0.094 0.095 1 1 0.096 0.097 Smallest eignvalue: 0.098 0.099 0.5 0.5 0.1 x 2 x 1 0 0 -0.5 -0.5 -1 -1 Figure: T=1, U(x) = x2/2; Left β = 0 [Arnold-Erb]; Right β = 0.1; z = (1, 0.1)T. 37

II: Variable diﬀusion -0.2 -0.15 1 1 -0.1 -0.05 Smallest
eignvalue: 0.9 0.9 0 0.05 0.8 0.8 x 1 x 2 0.7 0.7 0.6 0.6 0.5 0.5 -0.04 -0.02 1 1 0 0.02 0.04 Smallest eignvalue: 0.06 0.9 0.9 0.08 0.1 0.8 0.8 x 1 x 2 0.7 0.7 0.6 0.6 0.5 0.5 Figure: U(x) = xc−x c(c−1) , T(x) = (∇2 x U(x))−1, c=2.5, z = 1 0.1 . (Left: β = 0; Right: β = 0.6.) 38

Example III: SDEs on Lie groups There are some other
convergence analysis for SDEs in Lie groups, including the Heisenberg group (quantum SDEs), the displacement group, and the Martinet flat sub-Riemannian structure. Consider dXt = − a(Xt )a(Xt )T∇V (Xt ) + n j=1 ∂ ∂Xj (a(Xt )a(Xt )T)ij 1≤i≤n dt + √ 2a(Xt )dBt , where a(x) ∈ C∞(Rn+m; R(n+m)×n) and π ∈ C∞(Rn+m; R). They are gradient flow examples with degenerate diffusion matrix coefficients. We still manage to derive suitable Hessian matrix conditions. See details in Feng-Li, Entropy dissipation for degenerate stochastic differential equations via sub-Riemannian density manifold, 88 pages, Entropy, 2023. 39

Example III.1: Heisenberg group If R λ(aaT + zzT), then
DKL (ρt π) ≤ 1 2λ e−2λtIa,z (ρ0 π), where ρt is the solution of Fokker-Planck equation. 40

Example III.2: Displacement group If R λ(aaT + zzT), then
DKL (ρt π) ≤ 1 2λ e−2λtIa,z (ρ0 π), where ρt is the solution of Fokker-Planck equation. 41

Example IV: Mean field underdamped Langevin dynamics There are some
other Hessian matrix examples for mean field SDEs, whose Kolmogorov forward equation satisfies the weakly self-consistent Vlasov-Fokker-Planck equation. See analysis of overdamped case in [Carrillo-McCann-Villani, 2003]. We now consider dxt = vt dt dvt = −vt dt − ( Td×Rd ∇x W(xt , y)f(t, y, v)dvdy + ∇x U(xt ))dt + √ 2dBt , where (xt , vt ) ∈ Ω = Td × Rd presents an identical particle’s position and velocity, Td is a d dimensional torus representing a position domain, and Bt is a standard Brownian motion in Rd. Here f is the probability density of both spatial and velocity variables (x, v). ∂t f + v · ∇x f − ( Td×Rd ∇x W(x, y)f(t, y, v)dvdy + ∇x U(x)) · ∇v f = ∇v · (fv) + ∇v · (∇v f). 42

Example IV: Decomposition Similarly, we rewrite the above weakly self-consistent
Vlasov-Fokker-Planck equation in the following equivalent form: ∂t f = ∇x,v · (fγ) + ∇x,v · (faaT∇x,v δ δf E(f)), where E(f) = Ω f(x, v) log f(x, v)dxdv + Ω 1 2 v 2f(x, v)dxdv + 1 2 Ω×Ω W(x, y)f(x, v)f(y, v)dxdvdydv + Ω U(x)f(x, v)dxdv, with γ = J∇x,v [ δ δf E(f)], and J = 0 −Id Id 0 2d×2d . 43

Example IV: Mean ﬁeld Hessian matrix 44

Example IV: Free energy dissipation 45

Example IV We apply a similar method in this paper
and derive the Hessian matrix condition for the mean ﬁeld PDE. See details about choices of W and z with convergence rates λ in Bayraktar-Feng-Li, Exponential Entropy dissipation for weakly self-consistent Vlasov-Fokker-Planck equations, Journal of Nonlinear Science, 2024. 46

Example V: Time dependent stochastic differential equations Consider Ito type
stochastic differential equations (SDEs) in Rn+m as follows: dXt = b(t, Xt )dt + √ 2a(t, Xt )dBt . For m, n ∈ Z+ , we assume that a(t, x) ∈ C∞(R+ × Rn+m; R(n+m)×n) is a time dependent diffusion matrix, b(t, x) ∈ C∞(R+ × Rn+m; Rn+m) is a time dependent vector field, and Bt is a standard Rn-valued Brownian motion. 47

Example V: Time dependent entropy dissipation Here π(t, x) is
a selected time dependent reference density function. We also deﬁne γ(t, x) := a(t, x)a(t, x)T ∇ log π(x) − b(x) + ∇ · a(t, x)a(t, x)T . Due to this general setting, we may not have ∇ · (π(t, x)γ(t, x)) = 0. 48

Example V: Time dependent Hessian matrix 49

Example V We have shown some time dependent SDE analysis,
and conducted some related numerical simulations. Typical examples are: Time dependent overdamped dynamics (simulated annealing); Time dependent irreversible Langevin dynamics; Time dependent underdamped Langevin dynamics. See details in Feng-Zuo-Li, Fisher information dissipation for time inhomogeneous stochastic diﬀerential equations, arXiv:2402.01036, 2024. The construction of fast and stable stochastic algorithm is a more challenging future work! 50

TIGA for Stochastic Analysis & Algorithms Optimal choices of auxiliary
directions for capturing sharp convergence rates; Design fast and scalable AI type sampling algorithms with robust convergence guarantees; Flux-gradient flow functional inequalities; Convergence analysis of generalized flux-gradient flows and Generalized time-dependent mean field information Gamma calculus. 51

Wuchen Li (University of South Carolina, Columb...

Wuchen Li (University of South Carolina, Columbia, USA) Information Gamma Calculus: Convexity Analysis for Stochastic Differential Equations

More Decks by Jia-Jie Zhu

Featured

Transcript