Jia-Jie Zhu
March 27, 2024
160

# Wuchen Li (University of South Carolina, Columbia, USA) Information Gamma Calculus: Convexity Analysis for Stochastic Differential Equations

WORKSHOP ON OPTIMAL TRANSPORT
FROM THEORY TO APPLICATIONS
INTERFACING DYNAMICAL SYSTEMS, OPTIMIZATION, AND MACHINE LEARNING
Venue: Humboldt University of Berlin, Dorotheenstraße 24

Berlin, Germany. March 11th - 15th, 2024

March 27, 2024

## Transcript

1. ### Information Gamma calculus: Convexity analysis for stochastic diﬀerential equations Wuchen

Li University of South Carolina, Columbia Humboldt University of Berlin, OT Berlin 2024 March 15, 2024. 1

4. ### Sampling images from distributions Figure: Results from Wasserstein Generative adversary

networks. They are samples from image distributions. 4
5. ### Sampling problems Main problem Given a function V : Ω

→ R, the problem is to sample from π(x) = 1 Z e−V (x), where x ∈ Ω is a discrete or continuous sampling (state) space, π is a probability density function, and Z is a normalization constant. Diﬃculties Fast, eﬃcient, accurate algorithms; High dimensional sampling: Ω ⊂ RN , N >> 3. E.g., Image distributions Ω = R64∗64; Function V = − log π − log Z, is often unknown or with intractable formulations. For simplicity of talk, we assume V or π is analytical known. 5
6. ### Stochastic diﬀerential equations Consider a stochastic diﬀerential equation (SDE) by

˙ Xt = b(Xt ) + √ 2a(Xt ) ˙ Bt , where (n, m) ∈ N, Xt ∈ Rn+m, b ∈ Rn+m is a drift vector function, a(Xt ) ∈ R(n+m)×n is a diﬀusion matrix function, and Bt ∈ Rn is a standard n dimensional Brownian motion. For a diﬀusion matrix a and drift vector ﬁeld b, assume that it has an invariant measure of shape π, how can we analyze the convergence speed of stochastic dynamics towards its invariant distribution? This is crucial in designing AI type sampling algorithms. 6

8. ### Example: Langevin dynamics We review a classical example. Consider an

overdamped Langevin dynamics in Rn: ˙ Xt = −∇V (Xt ) + √ 2 ˙ Bt , where V is a given potential function. Let ρ(t, x) be the probability density function of Xt , which satisﬁes the following Fokker-Planck equation ∂t ρ(t, x) = ∇ · (ρ(t, x)∇V (x)) + ∆ρ(t, x). Here π(x) := 1 Z e−V (x), Z = e−V (y)dy < ∞, is the invariant distribution of the SDE. 8
9. ### Lyapunov methods To study the dynamical behavior of ρ, we

apply a global Lyapunov functional: DKL (ρt π) = ρt (x)log ρt (x) π(x) dx. Along the Fokker-Planck equation, the ﬁrst order dissipation satisﬁes d dt DKL (ρt π) = − ∇x log ρt (x) π(x) 2ρt dx. And the second order dissipation satisﬁes d2 dt2 DKL (ρt π) = 2 ∇2 xx log ρt π 2 F −∇2 xx log π(∇x log ρt π , ∇x log ρt π ) ρt dx, where · F is a matrix Frobenius norm. In literature, DKL is named the Kullback–Leibler divergence (relative entropy, also free energy in statistical physics community) and I = − d dt DKL is called the relative Fisher information functional. 9
10. ### Lyapunov constant Suppose there exists a “Lyapunov constant” λ >

0, such that −∇2 xx log π(x) λI. Then d2 dt2 DKL (ρt π) ≥ −2λ d dt DKL (ρt π). By integrating on the time variable, one can prove the exponential convergence below: DKL (ρt π) ≤ e−2λtDKL (ρ0 π). 10
11. ### Literature There are several approaches to establish the Lyapunov constant

for gradient dynamics. Log-Sobolev inequality (Gross); Iterative Gamma calculus (Bakry, Emery, Baudoin, Garofalo et.al.); Entropy dissipation and Hypocoercivity (Arnold, Carlen, Carrilo, Villani, Mohout, Jungel, Markowich, Toscani, et.al.); Optimal transport, displacement convexity and Hessian operators in density space. (McCann, Ambrosio, Villani, Otto, Gangbo, Mielke, Maas, Liero, et.al.); Transport Lyapunov functional (Renesse, Strum et.al.). How can we obtain Polyak-Lojasiewicz (PL) conditions in Wasserstein type spaces? 11
12. ### Convergence analysis Recall that ˙ Xt = b(Xt ) +

√ 2a(Xt ) ˙ Bt , where b can be a non-gradient drift vector, and a is a degenerate matrix. Its Fokker-Planck equation satisﬁes ∂t ρ(t, x) = −∇ · (ρ(t, x)b(x)) + n+m i=1 n+m j=1 ∂2 ∂xi ∂xj (a(x)a(x)T)ij ρ(t, x) . Assume that there exists an invariant distribution π ∈ C∞(Rn+m) with an explicit formulation. How fast does ρ converge to the invariant distribution π? 12
13. ### Goals In this talk, we mainly consider the entropy dissipation

analysis for non-gradient and degenerate stochastic dynamical systems. Main diﬃculties. Degeneracy of diﬀusion matrix; Non-gradient drift vectors. Our analysis is based on the construction of auxiliary second order operators in generalized optimal transport spaces. 13
14. ### Optimal transport distances The optimal transport has a variational formulation

(Benamou-Brenier 2000): D(ρ0, ρ1)2 := inf v 1 0 EXt∼ρt v(t, Xt ) 2 dt, where E is the expectation operator and the inﬁmum runs over all vector ﬁelds vt , such that ˙ Xt = v(t, Xt ), X0 ∼ ρ0, X1 ∼ ρ1. Under this metric, the probability set has a metric structure1. 1John D. Laﬀerty: the density manifold and conﬁguration space quantization, 1988. 14
15. ### Optimal transport metrics Informally speaking, the optimal transport metric refers

to the following bilinear form: ˙ ρ1 , G(ρ) ˙ ρ2 = ( ˙ ρ1 , (−∇ · (ρ∇))−1 ˙ ρ2 )dx. In other words, denote ˙ ρi = −∇ · (ρ∇φi ), i = 1, 2, then φ1 , G(ρ)−1φ2 = (φ1 , −∇ · (ρ∇)φ2 )dx = (∇φ1 , ∇φ2 )ρdx, where ρ ∈ P(Ω), ˙ ρi is the tangent vector in P(Ω), i.e. ˙ ρi dx = 0, and φi ∈ C∞(Ω) are cotangent vectors in P(Ω) at the point ρ. 15
16. ### Optimal transport ﬁrst and second order calculus Given a metric

space (P, G), there is an optimal transport type calculus method, namely the ﬁrst order operator (gradient) and the second order operator (Hessian) in the probability density space. They belong to transport information geometric analysis (TIGA). See [Li, TIG, 2018.] Shortly, we design a TIGA method to study non-gradient drift and degenerate diﬀusion coeﬃcient type stochastic dynamical systems. 16
17. ### Optimal transport gradient operators The Wasserstein gradient ﬂow of an

energy functional F(ρ) leads to ∂t ρ = − G(ρ)−1 δ δρ F(ρ) = ∇ · (ρ∇ δ δρ F(ρ)), where δ δρ is the L2 ﬁrst variation operator. Example If F(ρ) = DKL (ρ π) = ρ(x)log ρ(x) π(x) dx, then the gradient ﬂow satisﬁes the gradient-drift Fokker-Planck equation ∂ρ ∂t =∇ · (ρ∇log ρ π ) =∇ · (ρ∇ log ρ) − ∇ · (ρ∇ log π) =∆ρ + ∇ · (ρ∇V ). Here the trick is that ρ∇ log ρ = ∇ρ, ∇ log π = −∇V. 17
18. ### First and second order optimal transport calculuses In this way,

one can study the ﬁrst order entropy dissipation by d dt DKL (ρt π) = log ρt π ∇ · (ρ∇log ρt π )dx = − ∇log ρt π 2ρdx. Similarly, we study the second order entropy dissipation by d2 dt2 DKL (ρt π) =2 HessW DKL (ρt π)(log ρt π , log ρt π )ρt dx =2 Γ2 (log ρt π , log ρt π )ρt dx, where HessW DKL (ρt |π) := Γ2 is a bilinear form, which can be deﬁned by the optimal transport second order operator. 18
19. ### Convergence analysis of non-gradient ﬂows? Consider a general Fokker-Planck equation

as ∂t ρt = −∇ · (ρt b) + n+m i=1 n+m j=1 ∂2 ∂xi ∂xj (a(x)a(x)T)ij ρt . Suppose there exists a smooth invariant distribution π, such that −∇ · (πb) + n+m i=1 n+m j=1 ∂2 ∂xi ∂xj (a(x)a(x)T)ij π = 0. To study the convergence behavior of ρ towards π, we need to understand and estimate optimal transport type operators on both gradient and non-gradient directions. 19
20. ### Decomposition: Gradient and Non-gradient (ﬂux) One can rewrite the Fokker-Planck

equation by ∂t ρ(t, x) = ∇ · (ρ(t, x)a(x)a(x)T∇ log ρ(t, x) π(x) ) + ∇ · (ρ(t, x)γ(x)), Gradient Non-gradient (Flux) where γ(x) :=a(x)a(x)T∇ log π(x) − b(x) + n+m j=1 ∂ ∂xj (a(x)a(x)T)ij 1≤i≤n+m , and ∇ · (π(x)γ(x)) = 0. This is an example of generalized entropy-entropy ﬂux-metric condition. See examples in [Li-Liu-Osher, Controlling conservation laws I, II], and GENERIC or pre-GENERIC systems [Grmela and Ottinger]. 20
21. ### Orthogonalizations Proposition d dt DKL (ρt π) = − (∇x

log ρt π , a(x)a(x)T∇ log ρt π )ρt dx := −Ia (ρt ). In many situations, aaT-weighted Fisher information functional Ia (ρt ) is not enough to estimate the convergence of Fokker-Planck equation. 21
22. ### Main result: Structure condition Assumption: for any i ∈ {1,

· · · , n} and k ∈ {1, · · · , m}, we assume zT k ∇aT i ∈ Span{aT 1 , · · · , aT n }. Examples a is a constant vector; a is a matrix function deﬁned by a = a(x1 , · · · , xn ), z ∈ span{en+1 , · · · , en+m }, where ei is the i-th Euclidean basis function. 22
23. ### Main result: Entropy dissipation [Feng and Li, 2020-2022] Under the

assumption, for any β ∈ R and a given vector function z, deﬁne matrix functions as R = Ra + Rz + Rπ − MΛ + βRIa + (1 − β)Rγa + Rγz . If there exists a constant λ > 0, such that R λ(aaT + zzT), then the following decay results hold: DKL (ρt π) ≤ 1 2λ e−2λtIa,z (ρ0 π), where ρt is the solution of Fokker-Planck equation. 23

25. ### Comparisons (i) If γ = 0 and m = 0:

[Bakry-Emery, 1985]. (ii) If γ = 0 and m = 0: [Baudoin-Garofalo, 2017], [F.-Li, 2019]. (iii) If β = 0 and m = 0, [Arnold-Carlen-Ju, 2000, 2008]. (iv) If a, z are constants and β = 0, [Arnold-Erb, 2014][Baudoin-Gordina-Herzog, 2019]. (v) If β = 1, m = 0 and a = I, [Arnold-Carlen]; [F.-Li, 2020]. 25
26. ### Main idea of proof Step 1: We ﬁrst compute the

dissipation of the weighted Fisher information functional (a.k.a. gradient norms in generalized OT space). d dt Ia,z (ρt π) = −2 Γ2,a,z,γ (f, f)ρt dx, where f = log ρt π . Step 2: We next decompose the weak form of information Gamma calculus (a.k.a. Hessian matrices in generalized OT space). Γ2,a,z,γ (f, f)ρt dx = Hessβ f 2 F + R(∇f, ∇f) ρt dx. Step 3: From the information Bochner’s formula, we establish the convergence result. In other words, if R λ(aaT + zzT), then d dt Ia,z (ρt π) ≤ −2λIa,z (ρt π). From the Gronwall’s inequality, we can prove that the weighted Fisher information decays, so as the KL divergence decay. 26
27. ### Details of proof Deﬁne Ia,z (ρ π) = Rn+m ∇

log ρ π , (aaT + zzT)∇ log ρ π ρdx. Consider − 1 2 d dt Ia,z (ρt ) = Γ2 (f, f)ρt dxdx · · · (I) + Γz,π 2 (f, f)ρt dx · · · (II) + ΓIa,z (f, f)ρt dx · · · (III) where f = log ρt π , and Γ2 , Γz 2 , Γγ are designed bilinear forms, coming from the second order calculus in density space. (i) If a is non-degenerate, then (II) = 0; (ii) If b is a gradient vector ﬁeld, then (III) = 0. 27
28. ### Detailed approach For any f ∈ C∞(Rn+m), the generator of

Itˆ o SDE satisﬁes Lf = Lf − γ, ∇f , where Lf = ∇ · (aaT∇f) + aaT∇ log π, ∇f . For a given matrix function a ∈ R(n+m)×n, we construct a matrix function z ∈ R(n+m)×m, and deﬁne a z-direction generator by Lz f = ∇ · (zzT∇f) + zzT∇ log π, ∇f . 28
29. ### Global computation=Information gamma operators Deﬁne Gamma one bilinear forms by

Γ1 (f, f) = aT∇f, aT∇f Rn , Γz 1 (f, f) = zT∇f, zT∇f Rm . Deﬁne Gamma two bilinear forms by (i) Gamma two operator: Γ2 (f, f) = 1 2 LΓ1 (f, f) − Γ1 (Lf, f). (ii) Generalized Gamma z operator: Γz,π 2 (f, f) = 1 2 LΓz 1 (f, f) − Γz 1 (Lf, f) + divπ z Γ1,∇(aaT) (f, f) − divπ a Γ1,∇(zzT) (f, f) . (iii) Irreversible Gamma operator: ΓIa,z (f, f) = (Lf + Lz f) ∇f, γ − 1 2 ∇ Γ1 (f, f) + Γz 1 (f, f) , γ . 29
30. ### Local calculation=Information Bochner’s formula For any f = log ρt

π ∈ C∞(Rn+m, R) and any β ∈ R, under the assumption, we derive − 1 2 d dt Ia,z (ρt π) = Γ2 (f, f) + Γz,π 2 (f, f) + ΓIa,z (f, f) ρt dx = Hessβ f 2 + R(∇f, ∇f) ρt dx ≥ R(∇f, ∇f)ρt dx. Clearly, if R λ(aaT + zzT), then − 1 2 d dt Ia,z (ρt π) ≥ λIa,z (ρt π). From Grownwall’s inequality, we prove the Lyapunov convergence result. 30
31. ### Example I: Non-degenerate Langevin dynamics Consider a stochastic process: dXt

= − a(Xt )a(Xt )T∇V (Xt ) + n j=1 ∂ ∂Xj (a(Xt )a(Xt )T)ij 1≤i≤n − γ(Xt ) dt + √ 2a(Xt )dBt , where Xt ∈ Rn, Bt is a standard n dimensional Brownian motion, V ∈ C2(Rn; R) is a potential function with ∇ · (e−V (x)γ(x)) = 0, and the diﬀusion matrix a is a smooth diagonal matrix with a(x) = diag(aii (xi ))1≤i≤n . Its invariant density satisﬁes π(x) = 1 Z e−V (x), Z = e−V (y)dy. 31
32. ### Example I: Hessian matrix The Hessian matrix R has the

following form: R(x) = Ra + βRIa + (1 − β)Rγa (x), where                    Ra,ii = a3 ii ∂xi aii ∂xi V + a4 ii ∂2 xixi V − a3 ii ∂2 xixi aii , Ra,ij = a2 ii a2 jj ∂2 xixj V, RIa,ii = γi [aii ∂xi aii − (aii )2∂xi V ], RIa,ij = 1 2 [γj (2aii ∂xi aii − (aii )2∂xi V ) + γi (2ajj ∂xj ajj − (ajj )2∂xj V )], Rγa,ii = γi aii ∂xi aii − ∂xi γi (aii )2, Rγa,ij = −1 2 [∂xi γj (ajj )2 + ∂xj γi (aii )2]. 32
33. ### Example I: One dimensional case Let n = 1 and

m = 0. R = a3a V + a4V − a3a + βγ(aa − a2V ) + (1 − β)(γaa − a2γ ), where and represent the ﬁrst and second-order derivatives w.r.t. x. If a = 1 and γ = 0, then R = V (x); If a = 1 and γ = 0, β = 0, then R = V (x) − γ (x); If a = 1 and γ = 0, β = 1, then R = V (x) − γ(x)V (x); If a = 1 and γ = 0, β ∈ R, then R = V (x) − βγ(x)V (x) − (1 − β)γ (x). 33
34. ### Example II: Underdamped Langevin dynamics Consider a underdamped Langevin dynamic

by dxt =vt dt dvt =(−T(xt )vt − ∇x U(xt ))dt + 2T(xt )dBt . It can be viewed as, Yt = (xt , vt ), dYt = b(Yt )dt + √ 2a(Yt )dBt , with drift vectors and diﬀusion coeﬃcient satisfying b = v −T(x)v − ∇U(x) , a = 0 T(x) . Its invariant measure has a closed form, π(x, v) = 1 Z e−H(x,v), H(x, v) = v 2 2 + U(x). 34
35. ### Example II: Decomposition Write a vector ﬁeld γ ∈ R2

as γ(x, v) = 0 −rv − v −rv − ∇x U = J∇x,v H(x, v), where J = 0 −1 1 0 . Clearly, ∇x,v · π(x, v)γ(x, v) = 1 Z ∇x,v · e−H(x,v) J∇x,v H(x, v) = 0. 35

37. ### II: Constant diﬀusion -1 -0.5 0 0.5 1 x 1

-1 -0.5 0 0.5 1 x 2 -1 -0.5 0 0.5 1 1.5 Smallest eignvalue: 0.094 0.095 1 1 0.096 0.097 Smallest eignvalue: 0.098 0.099 0.5 0.5 0.1 x 2 x 1 0 0 -0.5 -0.5 -1 -1 Figure: T=1, U(x) = x2/2; Left β = 0 [Arnold-Erb]; Right β = 0.1; z = (1, 0.1)T. 37
38. ### II: Variable diﬀusion -0.2 -0.15 1 1 -0.1 -0.05 Smallest

eignvalue: 0.9 0.9 0 0.05 0.8 0.8 x 1 x 2 0.7 0.7 0.6 0.6 0.5 0.5 -0.04 -0.02 1 1 0 0.02 0.04 Smallest eignvalue: 0.06 0.9 0.9 0.08 0.1 0.8 0.8 x 1 x 2 0.7 0.7 0.6 0.6 0.5 0.5 Figure: U(x) = xc−x c(c−1) , T(x) = (∇2 x U(x))−1, c=2.5, z = 1 0.1 . (Left: β = 0; Right: β = 0.6.) 38
39. ### Example III: SDEs on Lie groups There are some other

convergence analysis for SDEs in Lie groups, including the Heisenberg group (quantum SDEs), the displacement group, and the Martinet ﬂat sub-Riemannian structure. Consider dXt = − a(Xt )a(Xt )T∇V (Xt ) + n j=1 ∂ ∂Xj (a(Xt )a(Xt )T)ij 1≤i≤n dt + √ 2a(Xt )dBt , where a(x) ∈ C∞(Rn+m; R(n+m)×n) and π ∈ C∞(Rn+m; R). They are gradient ﬂow examples with degenerate diﬀusion matrix coeﬃcients. We still manage to derive suitable Hessian matrix conditions. See details in Feng-Li, Entropy dissipation for degenerate stochastic diﬀerential equations via sub-Riemannian density manifold, 88 pages, Entropy, 2023. 39
40. ### Example III.1: Heisenberg group If R λ(aaT + zzT), then

DKL (ρt π) ≤ 1 2λ e−2λtIa,z (ρ0 π), where ρt is the solution of Fokker-Planck equation. 40
41. ### Example III.2: Displacement group If R λ(aaT + zzT), then

DKL (ρt π) ≤ 1 2λ e−2λtIa,z (ρ0 π), where ρt is the solution of Fokker-Planck equation. 41
42. ### Example IV: Mean ﬁeld underdamped Langevin dynamics There are some

other Hessian matrix examples for mean ﬁeld SDEs, whose Kolmogorov forward equation satisﬁes the weakly self-consistent Vlasov-Fokker-Planck equation. See analysis of overdamped case in [Carrillo-McCann-Villani, 2003]. We now consider dxt = vt dt dvt = −vt dt − ( Td×Rd ∇x W(xt , y)f(t, y, v)dvdy + ∇x U(xt ))dt + √ 2dBt , where (xt , vt ) ∈ Ω = Td × Rd presents an identical particle’s position and velocity, Td is a d dimensional torus representing a position domain, and Bt is a standard Brownian motion in Rd. Here f is the probability density of both spatial and velocity variables (x, v). ∂t f + v · ∇x f − ( Td×Rd ∇x W(x, y)f(t, y, v)dvdy + ∇x U(x)) · ∇v f = ∇v · (fv) + ∇v · (∇v f). 42
43. ### Example IV: Decomposition Similarly, we rewrite the above weakly self-consistent

Vlasov-Fokker-Planck equation in the following equivalent form: ∂t f = ∇x,v · (fγ) + ∇x,v · (faaT∇x,v δ δf E(f)), where E(f) = Ω f(x, v) log f(x, v)dxdv + Ω 1 2 v 2f(x, v)dxdv + 1 2 Ω×Ω W(x, y)f(x, v)f(y, v)dxdvdydv + Ω U(x)f(x, v)dxdv, with γ = J∇x,v [ δ δf E(f)], and J = 0 −Id Id 0 2d×2d . 43

46. ### Example IV We apply a similar method in this paper

and derive the Hessian matrix condition for the mean ﬁeld PDE. See details about choices of W and z with convergence rates λ in Bayraktar-Feng-Li, Exponential Entropy dissipation for weakly self-consistent Vlasov-Fokker-Planck equations, Journal of Nonlinear Science, 2024. 46
47. ### Example V: Time dependent stochastic diﬀerential equations Consider Ito type

stochastic diﬀerential equations (SDEs) in Rn+m as follows: dXt = b(t, Xt )dt + √ 2a(t, Xt )dBt . For m, n ∈ Z+ , we assume that a(t, x) ∈ C∞(R+ × Rn+m; R(n+m)×n) is a time dependent diﬀusion matrix, b(t, x) ∈ C∞(R+ × Rn+m; Rn+m) is a time dependent vector ﬁeld, and Bt is a standard Rn-valued Brownian motion. 47
48. ### Example V: Time dependent entropy dissipation Here π(t, x) is

a selected time dependent reference density function. We also deﬁne γ(t, x) := a(t, x)a(t, x)T ∇ log π(x) − b(x) + ∇ · a(t, x)a(t, x)T . Due to this general setting, we may not have ∇ · (π(t, x)γ(t, x)) = 0. 48

50. ### Example V We have shown some time dependent SDE analysis,

and conducted some related numerical simulations. Typical examples are: Time dependent overdamped dynamics (simulated annealing); Time dependent irreversible Langevin dynamics; Time dependent underdamped Langevin dynamics. See details in Feng-Zuo-Li, Fisher information dissipation for time inhomogeneous stochastic diﬀerential equations, arXiv:2402.01036, 2024. The construction of fast and stable stochastic algorithm is a more challenging future work! 50
51. ### TIGA for Stochastic Analysis & Algorithms Optimal choices of auxiliary

directions for capturing sharp convergence rates; Design fast and scalable AI type sampling algorithms with robust convergence guarantees; Flux-gradient ﬂow functional inequalities; Convergence analysis of generalized ﬂux-gradient ﬂows and Generalized time-dependent mean ﬁeld information Gamma calculus. 51