Slide 1

Information Gamma calculus: Convexity analysis for stochastic differential equations

Wuchen Li, University of South Carolina, July 31.

Based on joint work with Qi Feng (FSU). Supported by an AFOSR YIP award.

Slide 2

AI sampling problems

Slide 3

Sampling images from distributions

Figure: Results from Wasserstein generative adversarial networks; the images shown are samples from image distributions.

Slide 4

Reversible diffusions

Slide 5

Sampling problems

Main problem. Given a function $V : \Omega \to \mathbb{R}$, the problem is to sample from
$$\pi(x) = \frac{1}{Z} e^{-V(x)},$$
where $x \in \Omega$ is a point in a discrete or continuous sampling (state) space, $\pi$ is a probability density function, and $Z$ is a normalization constant.

Difficulties:
- Fast, efficient, accurate algorithms;
- High dimensional sampling: $\Omega \subset \mathbb{R}^N$ with $N \gg 3$, e.g., image distributions with $\Omega = \mathbb{R}^{64 \times 64}$;
- The function $V = -\log \pi - \log Z$ is often unknown or has an intractable formulation.

For simplicity, in this talk we assume $V$ (equivalently $\pi$) is analytically known.
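As a concrete toy instance (an illustration, not from the slides), the sketch below defines an unnormalized double-well target in Python; the constant $Z$ is deliberately left uncomputed, which is exactly the situation sampling algorithms must handle.

```python
import numpy as np

def V(x):
    """Illustrative double-well potential V(x) = (x^2 - 1)^2."""
    return (x**2 - 1.0)**2

def unnormalized_density(x):
    """pi(x) up to the unknown normalization constant Z."""
    return np.exp(-V(x))

# Z = int exp(-V(x)) dx is already nontrivial in 1D and intractable
# in high dimensions; samplers are designed to avoid computing it.
```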

Slide 6

Stochastic differential equations

Consider a stochastic differential equation (SDE)
$$\dot{X}_t = b(X_t) + \sqrt{2}\, a(X_t)\, \dot{B}_t,$$
where $n, m \in \mathbb{N}$, $X_t \in \mathbb{R}^{n+m}$, $b \in \mathbb{R}^{n+m}$ is a drift vector field, $a(X_t) \in \mathbb{R}^{(n+m) \times n}$ is a diffusion matrix function, and $B_t$ is a standard $n$-dimensional Brownian motion.

Given a diffusion matrix $a$ and a drift vector field $b$, assume the dynamics has an invariant measure $\pi$. How can we analyze the convergence speed of the stochastic dynamics towards its invariant distribution? This is crucial in designing AI-type sampling algorithms.
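To make the object concrete, here is a minimal Euler-Maruyama discretization of this SDE (a hedged sketch: the step size, horizon, and the specific $b$ and $a$ a user plugs in are all illustrative assumptions, not from the slides).

```python
import numpy as np

def euler_maruyama(b, a, x0, h=1e-3, n_steps=10_000, rng=None):
    """Simulate dX = b(X) dt + sqrt(2) a(X) dB by the Euler-Maruyama scheme.

    b : R^d -> R^d drift; a : R^d -> R^{d x k} diffusion matrix.
    Returns the state after n_steps steps of size h.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        ax = a(x)
        dB = rng.normal(scale=np.sqrt(h), size=ax.shape[1])
        x = x + h * b(x) + np.sqrt(2.0) * ax @ dB
    return x
```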

Slide 7

Transport, Entropy, and Information Geometry

Slide 8

Example: Langevin dynamics

We review a classical example. Consider the overdamped Langevin dynamics in $\mathbb{R}^n$:
$$\dot{X}_t = -\nabla V(X_t) + \sqrt{2}\, \dot{B}_t,$$
where $V$ is a given potential function. Let $\rho(t, x)$ be the probability density function of $X_t$; it satisfies the Fokker-Planck equation
$$\partial_t \rho(t, x) = \nabla \cdot \big( \rho(t, x) \nabla V(x) \big) + \Delta \rho(t, x).$$
Here
$$\pi(x) := \frac{1}{Z} e^{-V(x)}, \qquad Z = \int e^{-V(y)}\, dy < \infty,$$
is the invariant distribution of the SDE.
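Discretizing this dynamics in time yields the unadjusted Langevin algorithm; below is a hedged sketch for the special case of a standard Gaussian target (the quadratic potential, step size, and sample count are illustrative assumptions).

```python
import numpy as np

def grad_V(x):
    # Assumed potential V(x) = |x|^2 / 2, so pi is the standard Gaussian.
    return x

def langevin_sampler(n_samples=5_000, h=1e-2, dim=2, rng=None):
    """Unadjusted Langevin: X_{k+1} = X_k - h grad V(X_k) + sqrt(2h) xi_k."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.zeros(dim)
    samples = np.empty((n_samples, dim))
    for k in range(n_samples):
        x = x - h * grad_V(x) + np.sqrt(2.0 * h) * rng.normal(size=dim)
        samples[k] = x
    return samples
```

For small h the iterates approximately sample from $\pi$; a Metropolis correction (MALA) would remove the discretization bias.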

Slide 9

Lyapunov methods

To study the dynamical behavior of $\rho$, we apply a global Lyapunov functional:
$$D_{KL}(\rho_t \| \pi) = \int \rho_t(x) \log \frac{\rho_t(x)}{\pi(x)}\, dx.$$
Along the Fokker-Planck equation, the first order dissipation satisfies
$$\frac{d}{dt} D_{KL}(\rho_t \| \pi) = -\int \Big\| \nabla_x \log \frac{\rho_t(x)}{\pi(x)} \Big\|^2 \rho_t\, dx.$$
And the second order dissipation satisfies
$$\frac{d^2}{dt^2} D_{KL}(\rho_t \| \pi) = 2 \int \Big[ \Big\| \nabla^2_{xx} \log \frac{\rho_t}{\pi} \Big\|_F^2 - \nabla^2_{xx} \log \pi \Big( \nabla_x \log \frac{\rho_t}{\pi},\ \nabla_x \log \frac{\rho_t}{\pi} \Big) \Big]\, \rho_t\, dx,$$
where $\| \cdot \|_F$ is the matrix Frobenius norm. In the literature, $D_{KL}$ is called the Kullback–Leibler divergence (relative entropy; also the free energy in the statistical physics community), and $I = -\frac{d}{dt} D_{KL}$ is called the relative Fisher information functional.
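For readability, here is the short computation behind the first order dissipation, using the Fokker-Planck equation in the form $\partial_t \rho_t = \nabla \cdot \big( \rho_t \nabla \log \frac{\rho_t}{\pi} \big)$, mass conservation, and one integration by parts:
$$\frac{d}{dt} D_{KL}(\rho_t \| \pi) = \int \partial_t \rho_t \log \frac{\rho_t}{\pi}\, dx = \int \nabla \cdot \Big( \rho_t \nabla \log \frac{\rho_t}{\pi} \Big) \log \frac{\rho_t}{\pi}\, dx = -\int \Big\| \nabla \log \frac{\rho_t}{\pi} \Big\|^2 \rho_t\, dx.$$
(The extra term $\int \partial_t \rho_t\, dx$ vanishes since total mass is conserved.)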

Slide 10

Lyapunov constant

Suppose there exists a "Lyapunov constant" $\lambda > 0$ such that
$$-\nabla^2_{xx} \log \pi(x) \succeq \lambda\, \mathbb{I}.$$
Then
$$\frac{d^2}{dt^2} D_{KL}(\rho_t \| \pi) \geq -2\lambda\, \frac{d}{dt} D_{KL}(\rho_t \| \pi).$$
Integrating in the time variable, one proves exponential convergence:
$$D_{KL}(\rho_t \| \pi) \leq e^{-2\lambda t}\, D_{KL}(\rho_0 \| \pi).$$
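A numerical sanity check of this rate, under the assumption $V(x) = x^2/2$ (so $-\nabla^2 \log \pi = 1$ and $\lambda = 1$): the SDE is then an Ornstein-Uhlenbeck process, a Gaussian initial law stays Gaussian with mean $m_t = m_0 e^{-t}$ and variance $\sigma_t^2 = 1 + (\sigma_0^2 - 1) e^{-2t}$, and the KL divergence between Gaussians is explicit.

```python
import numpy as np

def kl_gauss(m, s2):
    """KL( N(m, s2) || N(0, 1) ) in one dimension."""
    return 0.5 * (s2 + m**2 - 1.0 - np.log(s2))

m0, s20 = 2.0, 0.25
for t in [0.0, 0.5, 1.0, 2.0]:
    mt = m0 * np.exp(-t)
    s2t = 1.0 + (s20 - 1.0) * np.exp(-2.0 * t)
    bound = np.exp(-2.0 * t) * kl_gauss(m0, s20)
    print(f"t={t}: KL={kl_gauss(mt, s2t):.4f} <= bound={bound:.4f}")
```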

Slide 11

Literature

There are several equivalent approaches to establishing the Lyapunov constant for gradient dynamics:
- Log-Sobolev inequalities (Gross);
- Iterative Gamma calculus (Bakry, Émery, Baudoin, Garofalo, et al.);
- Entropy dissipation and hypocoercivity (Arnold, Carlen, Carrillo, Villani, Mouhot, Jüngel, Markowich, Toscani, et al.);
- Optimal transport, displacement convexity, and Hessian operators in density space (McCann, Ambrosio, Villani, Otto, Gangbo, et al.);
- Transport Lyapunov functionals (von Renesse, Sturm, et al.).

Slide 12

Convergence analysis

Recall that
$$\dot{X}_t = b(X_t) + \sqrt{2}\, a(X_t)\, \dot{B}_t,$$
where $b$ can be a non-gradient drift vector and $a$ can be a degenerate matrix. The Fokker-Planck equation is
$$\partial_t \rho(t, x) = -\nabla \cdot \big( \rho(t, x)\, b(x) \big) + \sum_{i=1}^{n+m} \sum_{j=1}^{n+m} \frac{\partial^2}{\partial x_i \partial x_j} \Big( (a(x) a(x)^T)_{ij}\, \rho(t, x) \Big).$$
Assume that there exists an invariant distribution $\pi \in C^\infty(\mathbb{R}^{n+m})$ with an explicit formulation. How fast does $\rho$ converge to the invariant distribution $\pi$?

Slide 13

Goals

In this talk, we mainly consider entropy dissipation analysis for non-gradient and degenerate stochastic dynamical systems.

Main difficulties:
- Degeneracy of the diffusion matrix;
- Non-gradient drift vectors.

Our analysis is based on the construction of auxiliary second order operators in generalized optimal transport spaces.

Slide 14

Optimal transport distances

Optimal transport has a variational formulation (Benamou-Brenier 2000):
$$D(\rho_0, \rho_1)^2 := \inf_{v} \int_0^1 \mathbb{E}_{X_t \sim \rho_t} \| v(t, X_t) \|^2\, dt,$$
where $\mathbb{E}$ is the expectation operator and the infimum runs over all vector fields $v_t$ such that
$$\dot{X}_t = v(t, X_t), \qquad X_0 \sim \rho_0, \quad X_1 \sim \rho_1.$$
Under this metric, the probability set has a metric structure.¹

¹ John D. Lafferty: The density manifold and configuration space quantization, 1988.
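Between one dimensional Gaussians this distance is explicit, $D(\rho_0, \rho_1)^2 = (m_0 - m_1)^2 + (\sigma_0 - \sigma_1)^2$, which gives a cheap sanity check (a hedged sketch; it uses the fact that in 1D the monotone quantile coupling is optimal):

```python
import numpy as np
from scipy.stats import norm

def w2_gaussians(m0, s0, m1, s1):
    """Closed-form squared 2-Wasserstein distance between 1D Gaussians."""
    return (m0 - m1)**2 + (s0 - s1)**2

# Monte Carlo estimate via the monotone (quantile) coupling, optimal in 1D.
rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)
x0 = norm.ppf(u, loc=0.0, scale=1.0)
x1 = norm.ppf(u, loc=2.0, scale=0.5)
print(w2_gaussians(0.0, 1.0, 2.0, 0.5), np.mean((x0 - x1)**2))  # both ~4.25
```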

Slide 15

Optimal transport metrics

Informally speaking, the optimal transport metric refers to the following bilinear form:
$$\langle \dot{\rho}_1,\ G(\rho)\, \dot{\rho}_2 \rangle = \int \dot{\rho}_1 \big( -\nabla \cdot (\rho \nabla) \big)^{-1} \dot{\rho}_2\, dx.$$
In other words, writing $\dot{\rho}_i = -\nabla \cdot (\rho \nabla \phi_i)$, $i = 1, 2$, then
$$\langle \phi_1,\ G(\rho)^{-1} \phi_2 \rangle = \int \phi_1 \big( -\nabla \cdot (\rho \nabla \phi_2) \big)\, dx = \int \big( \nabla \phi_1, \nabla \phi_2 \big)\, \rho\, dx,$$
where $\rho \in \mathcal{P}(\Omega)$, $\dot{\rho}_i$ is a tangent vector in $\mathcal{P}(\Omega)$, i.e. $\int \dot{\rho}_i\, dx = 0$, and $\phi_i \in C^\infty(\Omega)$ are cotangent vectors in $\mathcal{P}(\Omega)$ at the point $\rho$.

Slide 16

Optimal transport first and second order calculus

Given a metric space $(\mathcal{P}, G)$, there is an optimal transport type calculus, namely a first order operator (gradient) and a second order operator (Hessian) in the probability density space. These belong to transport information geometric analysis (TIGA); see [Li, TIG, 2018].

In short, we design a TIGA method to study stochastic dynamical systems with non-gradient drifts and degenerate diffusion coefficients.

Slide 17

Optimal transport gradient operators

The Wasserstein gradient flow of an energy functional $\mathcal{F}(\rho)$ reads
$$\partial_t \rho = -G(\rho)^{-1} \frac{\delta}{\delta \rho} \mathcal{F}(\rho) = \nabla \cdot \Big( \rho \nabla \frac{\delta}{\delta \rho} \mathcal{F}(\rho) \Big),$$
where $\frac{\delta}{\delta \rho}$ is the $L^2$ first variation operator.

Example. If $\mathcal{F}(\rho) = D_{KL}(\rho \| \pi) = \int \rho(x) \log \frac{\rho(x)}{\pi(x)}\, dx$, then the gradient flow is the gradient-drift Fokker-Planck equation
$$\frac{\partial \rho}{\partial t} = \nabla \cdot \Big( \rho \nabla \log \frac{\rho}{\pi} \Big) = \nabla \cdot (\rho \nabla \log \rho) - \nabla \cdot (\rho \nabla \log \pi) = \Delta \rho + \nabla \cdot (\rho \nabla V).$$
Here the trick is that $\rho \nabla \log \rho = \nabla \rho$ and $\nabla \log \pi = -\nabla V$.
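A one-line check of the first variation used above (up to an additive constant, which drops out under the divergence):
$$\frac{\delta}{\delta \rho} D_{KL}(\rho \| \pi) = \frac{\delta}{\delta \rho} \int \rho \log \frac{\rho}{\pi}\, dx = \log \frac{\rho}{\pi} + 1, \qquad \nabla \cdot \Big( \rho \nabla \big( \log \tfrac{\rho}{\pi} + 1 \big) \Big) = \nabla \cdot \Big( \rho \nabla \log \frac{\rho}{\pi} \Big).$$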

Slide 18

First and second order calculus

In this way, one can study the first order entropy dissipation:
$$\frac{d}{dt} D_{KL}(\rho_t \| \pi) = \int \log \frac{\rho_t}{\pi}\, \nabla \cdot \Big( \rho \nabla \log \frac{\rho_t}{\pi} \Big)\, dx = -\int \Big\| \nabla \log \frac{\rho_t}{\pi} \Big\|^2 \rho\, dx.$$
Similarly, we study the second order entropy dissipation:
$$\frac{d^2}{dt^2} D_{KL}(\rho_t \| \pi) = 2 \int_\Omega \mathrm{Hess}_W D_{KL}(\rho_t \| \pi) \Big( \log \frac{\rho_t}{\pi},\ \log \frac{\rho_t}{\pi} \Big)\, \rho_t\, dx = 2 \int_\Omega \Gamma_2 \Big( \log \frac{\rho_t}{\pi},\ \log \frac{\rho_t}{\pi} \Big)\, \rho_t\, dx,$$
where $\mathrm{Hess}_W D_{KL}(\rho_t \| \pi) =: \Gamma_2$ is a bilinear form, which can be defined by the optimal transport second order operator.

Slide 19

Convergence analysis of non-gradient flows?

Consider a general Fokker-Planck equation
$$\partial_t \rho = -\nabla \cdot (\rho b) + \sum_{i=1}^{n+m} \sum_{j=1}^{n+m} \frac{\partial^2}{\partial x_i \partial x_j} \Big( (a(x) a(x)^T)_{ij}\, \rho \Big).$$
Suppose there exists a smooth invariant distribution $\pi$ such that
$$-\nabla \cdot (\pi b) + \sum_{i=1}^{n+m} \sum_{j=1}^{n+m} \frac{\partial^2}{\partial x_i \partial x_j} \Big( (a(x) a(x)^T)_{ij}\, \pi \Big) = 0.$$
To study the convergence behavior of $\rho$ towards $\pi$, we need to understand and estimate optimal transport type operators in both gradient and non-gradient directions.

Slide 20

Decomposition: Gradient and non-gradient (flux)

One can rewrite the Fokker-Planck equation as
$$\partial_t \rho(t, x) = \underbrace{\nabla \cdot \Big( \rho(t, x)\, a(x) a(x)^T \nabla \log \frac{\rho(t, x)}{\pi(x)} \Big)}_{\text{gradient}} + \underbrace{\nabla \cdot \big( \rho(t, x)\, \gamma(x) \big)}_{\text{non-gradient (flux)}},$$
where
$$\gamma(x) := a(x) a(x)^T \nabla \log \pi(x) - b(x) + \Big( \sum_{j=1}^{n+m} \frac{\partial}{\partial x_j} (a(x) a(x)^T)_{ij} \Big)_{1 \leq i \leq n+m},$$
and $\nabla \cdot \big( \pi(x) \gamma(x) \big) = 0$. This is an example of a generalized entropy-entropy flux-metric condition. See examples in [Li-Liu-Osher, Controlling conservation laws I, II], and GENERIC or pre-GENERIC systems [Grmela and Öttinger].
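As a quick sanity check (a standard special case, not spelled out on the slide): for the overdamped Langevin dynamics with $a = \mathbb{I}$ and $b = -\nabla V$, the third term vanishes since $a a^T$ is constant, and the flux is
$$\gamma = a a^T \nabla \log \pi - b = \nabla \log \pi + \nabla V = -\nabla V + \nabla V = 0,$$
so the decomposition reduces to the pure gradient flow of the earlier Langevin example.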

Slide 21

Main result: Structure condition

Assumption: for any $i \in \{1, \cdots, n\}$ and $k \in \{1, \cdots, m\}$, we assume
$$z_k^T \nabla a_i^T \in \mathrm{Span}\{a_1^T, \cdots, a_n^T\}.$$

Examples:
- $a$ is constant;
- $a$ is a matrix function of the form $a = a(x_1, \cdots, x_n)$ with $z \in \mathrm{span}\{e_{n+1}, \cdots, e_{n+m}\}$, where $e_i$ is the $i$-th Euclidean basis vector.

Slide 22

Main result: Entropy dissipation [F. and Li, 2020-2022]

Under the assumption, for any $\beta \in \mathbb{R}$ and a given vector function $z$, define the matrix function
$$\mathcal{R} = \mathcal{R}_a + \mathcal{R}_z + \mathcal{R}_\pi - \mathcal{M}_\Lambda + \beta \mathcal{R}_{I_a} + (1 - \beta) \mathcal{R}_{\gamma_a} + \mathcal{R}_{\gamma_z}.$$
If there exists a constant $\lambda > 0$ such that $\mathcal{R} \succeq \lambda (a a^T + z z^T)$, then the following decay result holds:
$$D_{KL}(\rho_t \| \pi) \leq \frac{1}{2\lambda} e^{-2\lambda t}\, I_{a,z}(\rho_0 \| \pi),$$
where $\rho_t$ is the solution of the Fokker-Planck equation.

Slide 23

Tensors

Slide 24

Comparisons

(i) If $\gamma = 0$ and $m = 0$: [Bakry-Émery, 1985].
(ii) If $\gamma = 0$ and $m \neq 0$: [Baudoin-Garofalo, 2017], [F.-Li, 2019].
(iii) If $\beta = 0$ and $m = 0$: [Arnold-Carlen-Ju, 2000, 2008].
(iv) If $a$, $z$ are constants and $\beta = 0$: [Arnold-Erb, 2014], [Baudoin-Gordina-Herzog, 2019].
(v) If $\beta = 1$, $m = 0$ and $a = \mathbb{I}$: [Arnold-Carlen], [F.-Li, 2020].

Slide 25

Main idea of proof

Step 1: Compute the dissipation of the weighted Fisher information functional (a.k.a. gradient norms in the generalized OT space):
$$\frac{d}{dt} I_{a,z}(\rho_t \| \pi) = -2 \int \Gamma_{2,a,z,\gamma}(f, f)\, \rho_t\, dx, \qquad f = \log \frac{\rho_t}{\pi}.$$

Step 2: Decompose the weak form of the information Gamma calculus (a.k.a. Hessian matrices in the generalized OT space):
$$\int \Gamma_{2,a,z,\gamma}(f, f)\, \rho_t\, dx = \int \Big[ \| \mathrm{Hess}_\beta f \|_F^2 + \mathcal{R}(\nabla f, \nabla f) \Big]\, \rho_t\, dx.$$

Step 3: From the information Bochner formula, establish the convergence result: if $\mathcal{R} \succeq \lambda (a a^T + z z^T)$, then
$$\frac{d}{dt} I_{a,z}(\rho_t \| \pi) \leq -2\lambda\, I_{a,z}(\rho_t \| \pi).$$
By Gronwall's inequality, the weighted Fisher information decays exponentially, and hence so does the KL divergence (spelled out below).
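To connect Step 3 with the decay stated in the main theorem, one can spell out the Gronwall step (a routine reconstruction, assuming $D_{KL}(\rho_t \| \pi) \to 0$ as $t \to \infty$): Gronwall's inequality gives
$$I_{a,z}(\rho_t \| \pi) \leq e^{-2\lambda t}\, I_{a,z}(\rho_0 \| \pi),$$
and since the flux term does not contribute to the entropy dissipation (because $\nabla \cdot (\pi \gamma) = 0$), we have $-\frac{d}{dt} D_{KL}(\rho_t \| \pi) = I_a(\rho_t \| \pi) \leq I_{a,z}(\rho_t \| \pi)$, hence
$$D_{KL}(\rho_t \| \pi) = \int_t^\infty I_a(\rho_s \| \pi)\, ds \leq \int_t^\infty e^{-2\lambda s}\, I_{a,z}(\rho_0 \| \pi)\, ds = \frac{1}{2\lambda} e^{-2\lambda t}\, I_{a,z}(\rho_0 \| \pi).$$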

Slide 26

Details of proof

Define
$$I_{a,z}(\rho \| \pi) = \int_{\mathbb{R}^{n+m}} \Big\langle \nabla \log \frac{\rho}{\pi},\ (a a^T + z z^T)\, \nabla \log \frac{\rho}{\pi} \Big\rangle\, \rho\, dx.$$
Consider
$$-\frac{1}{2} \frac{d}{dt} I_{a,z}(\rho_t) = \underbrace{\int \Gamma_2(f, f)\, \rho_t\, dx}_{\mathrm{(I)}} + \underbrace{\int \Gamma_2^{z,\pi}(f, f)\, \rho_t\, dx}_{\mathrm{(II)}} + \underbrace{\int \Gamma_{I_{a,z}}(f, f)\, \rho_t\, dx}_{\mathrm{(III)}},$$
where $f = \log \frac{\rho}{\pi}$, and $\Gamma_2$, $\Gamma_2^{z,\pi}$, $\Gamma_{I_{a,z}}$ are designed bilinear forms, coming from the second order calculus in density space.
(i) If $a$ is non-degenerate, then $\mathrm{(II)} = 0$;
(ii) If $b$ is a gradient vector field, then $\mathrm{(III)} = 0$.

Slide 27

Detailed approach

For any $f \in C^\infty(\mathbb{R}^{n+m})$, the generator of the Itô SDE satisfies
$$\mathcal{L} f = L f - \langle \gamma, \nabla f \rangle, \qquad \text{where} \quad L f = \nabla \cdot (a a^T \nabla f) + \langle a a^T \nabla \log \pi, \nabla f \rangle.$$
For a given matrix function $a \in \mathbb{R}^{(n+m) \times n}$, we construct a matrix function $z \in \mathbb{R}^{(n+m) \times m}$ and define a $z$-direction generator by
$$L_z f = \nabla \cdot (z z^T \nabla f) + \langle z z^T \nabla \log \pi, \nabla f \rangle.$$

Slide 28

Global computation = information Gamma operators

Define Gamma one bilinear forms by
$$\Gamma_1(f, f) = \langle a^T \nabla f,\ a^T \nabla f \rangle_{\mathbb{R}^n}, \qquad \Gamma_1^z(f, f) = \langle z^T \nabla f,\ z^T \nabla f \rangle_{\mathbb{R}^m}.$$
Define Gamma two bilinear forms by:
(i) Gamma two operator:
$$\Gamma_2(f, f) = \frac{1}{2} L \Gamma_1(f, f) - \Gamma_1(L f, f).$$
(ii) Generalized Gamma $z$ operator:
$$\Gamma_2^{z,\pi}(f, f) = \frac{1}{2} L \Gamma_1^z(f, f) - \Gamma_1^z(L f, f) + \mathrm{div}^z_\pi \big( \Gamma_{1, \nabla(a a^T)}(f, f) \big) - \mathrm{div}^a_\pi \big( \Gamma_{1, \nabla(z z^T)}(f, f) \big).$$
(iii) Irreversible Gamma operator:
$$\Gamma_{I_{a,z}}(f, f) = (L f + L_z f)\, \langle \nabla f, \gamma \rangle - \frac{1}{2} \Big\langle \nabla \big( \Gamma_1(f, f) + \Gamma_1^z(f, f) \big),\ \gamma \Big\rangle.$$
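For orientation, in the classical non-degenerate gradient case $a = \mathbb{I}$, $z = 0$, $\gamma = 0$, the generator is $L f = \Delta f - \langle \nabla V, \nabla f \rangle$, and the Gamma two operator reduces, via Bochner's formula, to the Bakry-Émery form (a standard computation, recorded here for reference):
$$\Gamma_1(f, f) = \| \nabla f \|^2, \qquad \Gamma_2(f, f) = \frac{1}{2} L \| \nabla f \|^2 - \langle \nabla L f, \nabla f \rangle = \| \nabla^2 f \|_F^2 + \nabla^2 V(\nabla f, \nabla f),$$
matching the integrand of the second order dissipation for the overdamped Langevin dynamics.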

Slide 29

Local calculation = information Bochner's formula

For any $f = \log \frac{\rho_t}{\pi} \in C^\infty(\mathbb{R}^{n+m}; \mathbb{R})$ and any $\beta \in \mathbb{R}$, under the structure assumption we derive
$$-\frac{1}{2} \frac{d}{dt} I_{a,z}(\rho_t \| \pi) = \int \Big[ \Gamma_2(f, f) + \Gamma_2^{z,\pi}(f, f) + \Gamma_{I_{a,z}}(f, f) \Big]\, \rho_t\, dx = \int \Big[ \| \mathrm{Hess}_\beta f \|^2 + \mathcal{R}(\nabla f, \nabla f) \Big]\, \rho_t\, dx \geq \int \mathcal{R}(\nabla f, \nabla f)\, \rho_t\, dx.$$
Clearly, if $\mathcal{R} \succeq \lambda (a a^T + z z^T)$, then
$$-\frac{1}{2} \frac{d}{dt} I_{a,z}(\rho_t \| \pi) \geq \lambda\, I_{a,z}(\rho_t \| \pi).$$
From Gronwall's inequality, we prove the Lyapunov convergence result.

Slide 30

Example I: Non-degenerate Langevin dynamics

Consider the stochastic process
$$dX_t = \Big[ -a(X_t) a(X_t)^T \nabla V(X_t) + \Big( \sum_{j=1}^{n} \frac{\partial}{\partial x_j} (a(X_t) a(X_t)^T)_{ij} \Big)_{1 \leq i \leq n} - \gamma(X_t) \Big]\, dt + \sqrt{2}\, a(X_t)\, dB_t,$$
where $X_t \in \mathbb{R}^n$, $B_t$ is a standard $n$-dimensional Brownian motion, $V \in C^2(\mathbb{R}^n; \mathbb{R})$ is a potential function with $\nabla \cdot (e^{-V(x)} \gamma(x)) = 0$, and the diffusion matrix $a$ is a smooth diagonal matrix with $a(x) = \mathrm{diag}(a_{ii}(x_i))_{1 \leq i \leq n}$. Its invariant density satisfies
$$\pi(x) = \frac{1}{Z} e^{-V(x)}, \qquad Z = \int e^{-V(y)}\, dy.$$

Slide 31

Example I: Hessian matrix

The Hessian matrix $\mathcal{R}$ has the following form:
$$\mathcal{R}(x) = \mathcal{R}_a + \beta \mathcal{R}_{I_a} + (1 - \beta) \mathcal{R}_{\gamma_a}(x),$$
where
$$\begin{cases}
\mathcal{R}_{a,ii} = a_{ii}^3\, \partial_{x_i} a_{ii}\, \partial_{x_i} V + a_{ii}^4\, \partial^2_{x_i x_i} V - a_{ii}^3\, \partial^2_{x_i x_i} a_{ii}, \\
\mathcal{R}_{a,ij} = a_{ii}^2 a_{jj}^2\, \partial^2_{x_i x_j} V, \\
\mathcal{R}_{I_a,ii} = \gamma_i \big[ a_{ii}\, \partial_{x_i} a_{ii} - a_{ii}^2\, \partial_{x_i} V \big], \\
\mathcal{R}_{I_a,ij} = \frac{1}{2} \big[ \gamma_j \big( 2 a_{ii}\, \partial_{x_i} a_{ii} - a_{ii}^2\, \partial_{x_i} V \big) + \gamma_i \big( 2 a_{jj}\, \partial_{x_j} a_{jj} - a_{jj}^2\, \partial_{x_j} V \big) \big], \\
\mathcal{R}_{\gamma_a,ii} = \gamma_i\, a_{ii}\, \partial_{x_i} a_{ii} - \partial_{x_i} \gamma_i\, a_{ii}^2, \\
\mathcal{R}_{\gamma_a,ij} = -\frac{1}{2} \big[ \partial_{x_i} \gamma_j\, a_{jj}^2 + \partial_{x_j} \gamma_i\, a_{ii}^2 \big].
\end{cases}$$

Slide 32

Example I: One dimensional case

Let $n = 1$ and $m = 0$. Then
$$\mathcal{R} = a^3 a' V' + a^4 V'' - a^3 a'' + \beta \gamma (a a' - a^2 V') + (1 - \beta)(\gamma a a' - a^2 \gamma'),$$
where $'$ and $''$ represent the first and second order derivatives w.r.t. $x$.
- If $a = 1$ and $\gamma = 0$, then $\mathcal{R} = V''(x)$;
- If $a = 1$, $\gamma \neq 0$, and $\beta = 0$, then $\mathcal{R} = V''(x) - \gamma'(x)$;
- If $a = 1$, $\gamma \neq 0$, and $\beta = 1$, then $\mathcal{R} = V''(x) - \gamma(x) V'(x)$;
- If $a = 1$, $\gamma \neq 0$, and $\beta \in \mathbb{R}$, then $\mathcal{R} = V''(x) - \beta \gamma(x) V'(x) - (1 - \beta) \gamma'(x)$.
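A small symbolic check of the one dimensional reduction (a hedged sketch using sympy; it only verifies the $a = 1$ specialization listed above):

```python
import sympy as sp

x, beta = sp.symbols('x beta')
V = sp.Function('V')(x)
g = sp.Function('gamma')(x)

def R(a):
    """One dimensional Hessian matrix R for diffusion a, potential V, flux g."""
    return (a**3 * sp.diff(a, x) * sp.diff(V, x)
            + a**4 * sp.diff(V, x, 2)
            - a**3 * sp.diff(a, x, 2)
            + beta * g * (a * sp.diff(a, x) - a**2 * sp.diff(V, x))
            + (1 - beta) * (g * a * sp.diff(a, x) - a**2 * sp.diff(g, x)))

# Specialize to a = 1: expect V'' - beta*gamma*V' - (1 - beta)*gamma'.
expected = sp.diff(V, x, 2) - beta * g * sp.diff(V, x) - (1 - beta) * sp.diff(g, x)
print(sp.simplify(R(sp.Integer(1)) - expected))  # prints 0
```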

Slide 33

Example II: Underdamped Langevin dynamics

Consider the underdamped Langevin dynamics
$$dx_t = v_t\, dt, \qquad dv_t = \big( -T(x_t) v_t - \nabla_x U(x_t) \big)\, dt + \sqrt{2 T(x_t)}\, dB_t.$$
It can be viewed as $Y_t = (x_t, v_t)$ with
$$dY_t = b(Y_t)\, dt + \sqrt{2}\, a(Y_t)\, dB_t, \qquad b = \begin{pmatrix} v \\ -T(x) v - \nabla U(x) \end{pmatrix}, \quad a = \begin{pmatrix} 0 \\ \sqrt{T(x)} \end{pmatrix},$$
so that $a a^T = \mathrm{diag}(0, T(x))$. Its invariant measure has the closed form
$$\pi(x, v) = \frac{1}{Z} e^{-H(x,v)}, \qquad H(x, v) = \frac{\| v \|^2}{2} + U(x).$$
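A hedged Euler discretization of the constant-friction case $T \equiv 1$ (a sketch; the step size and the quadratic potential are illustrative assumptions):

```python
import numpy as np

def underdamped_langevin(grad_U, x0, v0, h=1e-2, n_steps=20_000, rng=None):
    """Euler scheme for dx = v dt, dv = (-v - grad U(x)) dt + sqrt(2) dB,
    whose invariant density is proportional to exp(-|v|^2/2 - U(x))."""
    rng = np.random.default_rng() if rng is None else rng
    x, v = np.asarray(x0, float), np.asarray(v0, float)
    traj = np.empty((n_steps,) + x.shape)
    for k in range(n_steps):
        dB = rng.normal(scale=np.sqrt(h), size=v.shape)
        x = x + h * v
        v = v - h * (v + grad_U(x)) + np.sqrt(2.0) * dB
        traj[k] = x
    return traj

# Example: U(x) = |x|^2 / 2, so positions are asymptotically N(0, 1).
samples = underdamped_langevin(lambda x: x, np.zeros(1), np.zeros(1))
print(samples[10_000:].mean(), samples[10_000:].var())
```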

Slide 34

Example II: Decomposition

Write the vector field $\gamma$ as
$$\gamma(x, v) = \begin{pmatrix} 0 \\ -T(x) v \end{pmatrix} - \begin{pmatrix} v \\ -T(x) v - \nabla_x U \end{pmatrix} = J \nabla_{x,v} H(x, v), \qquad J = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}.$$
Clearly,
$$\nabla_{x,v} \cdot \big( \pi(x, v)\, \gamma(x, v) \big) = \frac{1}{Z} \nabla_{x,v} \cdot \big( e^{-H(x,v)}\, J \nabla_{x,v} H(x, v) \big) = 0.$$
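A quick symbolic verification of $\nabla_{x,v} \cdot (\pi \gamma) = 0$ in one spatial dimension (a sketch using sympy; the potential $U$ is left generic):

```python
import sympy as sp

x, v = sp.symbols('x v')
U = sp.Function('U')(x)
H = v**2 / 2 + U

# gamma = J grad H with J = [[0, -1], [1, 0]]: gamma = (-v, U'(x)).
gamma = (-sp.diff(H, v), sp.diff(H, x))
pi = sp.exp(-H)  # unnormalized invariant density

div = sp.diff(pi * gamma[0], x) + sp.diff(pi * gamma[1], v)
print(sp.simplify(div))  # prints 0
```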

Slide 35

Example II: Hessian matrix

Slide 36

II: Constant diffusion

Figure: surface plots of the smallest eigenvalue of $\mathcal{R}$ over $(x_1, x_2)$, with $T = 1$, $U(x) = x^2/2$; left: $\beta = 0$ [Arnold-Erb]; right: $\beta = 0.1$; $z = (1, 0.1)^T$. (Axis data omitted.)

Slide 37

II: Variable diffusion

Figure: surface plots of the smallest eigenvalue of $\mathcal{R}$, with $U(x) = \frac{x^c - x}{c(c-1)}$, $T(x) = (\nabla_x^2 U(x))^{-1}$, $c = 2.5$, $z = (1, 0.1)^T$. (Left: $\beta = 0$; right: $\beta = 0.6$. Axis data omitted.)

Slide 38

Example III: SDEs on Lie groups

There are further convergence analyses for SDEs on Lie groups, including the Heisenberg group (quantum SDEs), the displacement group, and the Martinet flat sub-Riemannian structure. Consider
$$dX_t = \Big[ -a(X_t) a(X_t)^T \nabla V(X_t) + \Big( \sum_{j=1}^{n} \frac{\partial}{\partial x_j} (a(X_t) a(X_t)^T)_{ij} \Big)_{1 \leq i \leq n} \Big]\, dt + \sqrt{2}\, a(X_t)\, dB_t,$$
where $a(x) \in C^\infty(\mathbb{R}^{n+m}; \mathbb{R}^{(n+m) \times n})$ and $\pi \in C^\infty(\mathbb{R}^{n+m}; \mathbb{R})$. These are gradient flow examples with degenerate diffusion matrix coefficients. We still manage to derive suitable Hessian matrix conditions. See details in [F.-Li, Entropy dissipation for degenerate stochastic differential equations via sub-Riemannian density manifold, Entropy, 2023; 88 pages].

Slide 39

Example III.1: Heisenberg group

If $\mathcal{R} \succeq \lambda (a a^T + z z^T)$, then
$$D_{KL}(\rho_t \| \pi) \leq \frac{1}{2\lambda} e^{-2\lambda t}\, I_{a,z}(\rho_0 \| \pi),$$
where $\rho_t$ is the solution of the Fokker-Planck equation.

Slide 40

Example III.2: Displacement group

If $\mathcal{R} \succeq \lambda (a a^T + z z^T)$, then
$$D_{KL}(\rho_t \| \pi) \leq \frac{1}{2\lambda} e^{-2\lambda t}\, I_{a,z}(\rho_0 \| \pi),$$
where $\rho_t$ is the solution of the Fokker-Planck equation.

Slide 41

Example IV: Mean field underdamped Langevin dynamics

There are further Hessian matrix examples for mean field SDEs, whose Kolmogorov forward equation is the weakly self-consistent Vlasov-Fokker-Planck equation. See the analysis of the overdamped case in [Carrillo-McCann-Villani, 2003]. Consider
$$dx_t = v_t\, dt, \qquad dv_t = -v_t\, dt - \Big( \int_{\mathbb{T}^d \times \mathbb{R}^d} \nabla_x W(x_t, y)\, f(t, y, v)\, dv\, dy + \nabla_x U(x_t) \Big)\, dt + \sqrt{2}\, dB_t,$$
where $(x_t, v_t) \in \Omega = \mathbb{T}^d \times \mathbb{R}^d$ represents an identical particle's position and velocity, $\mathbb{T}^d$ is a $d$-dimensional torus representing the position domain, and $B_t$ is a standard Brownian motion in $\mathbb{R}^d$. Here $f$ is the probability density in both spatial and velocity variables $(x, v)$:
$$\partial_t f + v \cdot \nabla_x f - \Big( \int_{\mathbb{T}^d \times \mathbb{R}^d} \nabla_x W(x, y)\, f(t, y, v)\, dv\, dy + \nabla_x U(x) \Big) \cdot \nabla_v f = \nabla_v \cdot (f v) + \nabla_v \cdot (\nabla_v f).$$
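The mean field dynamics suggests an interacting particle approximation, replacing the integral against $f$ by an empirical average over $N$ particles; below is a hedged sketch (the pair potential $W$, the confinement $U$, and working on the real line rather than the torus are all illustrative assumptions):

```python
import numpy as np

def step(x, v, grad_U, grad_W, h, rng):
    """One Euler step for N interacting underdamped Langevin particles.

    The mean field force int grad_x W(x, y) f(t, y, v) dy dv is replaced
    by the empirical average over all particles.
    """
    diffs = x[:, None] - x[None, :]           # pairwise x_i - x_j
    interaction = grad_W(diffs).mean(axis=1)  # empirical mean field force
    dB = rng.normal(scale=np.sqrt(h), size=v.shape)
    x_new = x + h * v
    v_new = v - h * (v + interaction + grad_U(x)) + np.sqrt(2.0) * dB
    return x_new, v_new

# Illustrative choices: quadratic confinement U(x) = x^2/2 and attractive
# quadratic interaction W(x, y) = 0.1 (x - y)^2 / 2.
rng = np.random.default_rng(0)
x, v = rng.normal(size=200), np.zeros(200)
for _ in range(5_000):
    x, v = step(x, v, grad_U=lambda q: q, grad_W=lambda d: 0.1 * d,
                h=1e-2, rng=rng)
print(x.mean(), x.var())
```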

Slide 42

Example IV: Decomposition

Similarly, we rewrite the above weakly self-consistent Vlasov-Fokker-Planck equation in the equivalent form
$$\partial_t f = \nabla_{x,v} \cdot (f \gamma) + \nabla_{x,v} \cdot \Big( f\, a a^T \nabla_{x,v} \frac{\delta}{\delta f} \mathcal{E}(f) \Big),$$
where
$$\mathcal{E}(f) = \int_\Omega f \log f\, dx\, dv + \int_\Omega \frac{\| v \|^2}{2} f\, dx\, dv + \frac{1}{2} \int_{\Omega \times \Omega} W(x, y)\, f(x, v)\, f(y, \tilde{v})\, dx\, dv\, dy\, d\tilde{v} + \int_\Omega U(x)\, f\, dx\, dv,$$
with
$$\gamma = J \nabla_{x,v} \Big[ \frac{\delta}{\delta f} \mathcal{E}(f) \Big], \qquad J = \begin{pmatrix} 0 & -\mathbb{I}_d \\ \mathbb{I}_d & 0 \end{pmatrix}_{2d \times 2d}.$$

Slide 43

Example IV: Mean field Hessian matrix

Slide 44

Example IV: Free energy dissipation

Slide 45

Example IV

We apply a similar method here and derive the Hessian matrix condition for the mean field PDE. See details on the choices of $W$ and $z$, with convergence rate $\lambda$, in [B.-F.-Li, Exponential entropy dissipation for weakly self-consistent Vlasov-Fokker-Planck equations, arXiv:2204.12049, 2022].

Slide 46

TIGA for Stochastic Analysis & Algorithms

- Optimal choices of auxiliary directions for capturing sharp convergence rates;
- Design of fast and scalable AI-type sampling algorithms with robust convergence guarantees;
- Flux-gradient flow functional inequalities;
- Convergence analysis of generalized flux-gradient flows, and generalized mean field information Gamma calculus.