Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Wuchen Li (University of South Carolina, Columb...

Jia-Jie Zhu
March 27, 2024
210

Wuchen Li (University of South Carolina, Columbia, USA) Information Gamma Calculus: Convexity Analysis for Stochastic Differential Equations

WORKSHOP ON OPTIMAL TRANSPORT
FROM THEORY TO APPLICATIONS
INTERFACING DYNAMICAL SYSTEMS, OPTIMIZATION, AND MACHINE LEARNING
Venue: Humboldt University of Berlin, Dorotheenstraße 24

Berlin, Germany. March 11th - 15th, 2024

Jia-Jie Zhu

March 27, 2024
Tweet

More Decks by Jia-Jie Zhu

Transcript

  1. Information Gamma calculus: Convexity analysis for stochastic differential equations Wuchen

    Li University of South Carolina, Columbia Humboldt University of Berlin, OT Berlin 2024 March 15, 2024. 1
  2. Sampling problems Main problem Given a function V : Ω

    → R, the problem is to sample from π(x) = 1 Z e−V (x), where x ∈ Ω is a discrete or continuous sampling (state) space, π is a probability density function, and Z is a normalization constant. Difficulties Fast, efficient, accurate algorithms; High dimensional sampling: Ω ⊂ RN , N >> 3. E.g., Image distributions Ω = R64∗64; Function V = − log π − log Z, is often unknown or with intractable formulations. For simplicity of talk, we assume V or π is analytical known. 5
  3. Stochastic differential equations Consider a stochastic differential equation (SDE) by

    ˙ Xt = b(Xt ) + √ 2a(Xt ) ˙ Bt , where (n, m) ∈ N, Xt ∈ Rn+m, b ∈ Rn+m is a drift vector function, a(Xt ) ∈ R(n+m)×n is a diffusion matrix function, and Bt ∈ Rn is a standard n dimensional Brownian motion. For a diffusion matrix a and drift vector field b, assume that it has an invariant measure of shape π, how can we analyze the convergence speed of stochastic dynamics towards its invariant distribution? This is crucial in designing AI type sampling algorithms. 6
  4. Example: Langevin dynamics We review a classical example. Consider an

    overdamped Langevin dynamics in Rn: ˙ Xt = −∇V (Xt ) + √ 2 ˙ Bt , where V is a given potential function. Let ρ(t, x) be the probability density function of Xt , which satisfies the following Fokker-Planck equation ∂t ρ(t, x) = ∇ · (ρ(t, x)∇V (x)) + ∆ρ(t, x). Here π(x) := 1 Z e−V (x), Z = e−V (y)dy < ∞, is the invariant distribution of the SDE. 8
  5. Lyapunov methods To study the dynamical behavior of ρ, we

    apply a global Lyapunov functional: DKL (ρt π) = ρt (x)log ρt (x) π(x) dx. Along the Fokker-Planck equation, the first order dissipation satisfies d dt DKL (ρt π) = − ∇x log ρt (x) π(x) 2ρt dx. And the second order dissipation satisfies d2 dt2 DKL (ρt π) = 2 ∇2 xx log ρt π 2 F −∇2 xx log π(∇x log ρt π , ∇x log ρt π ) ρt dx, where · F is a matrix Frobenius norm. In literature, DKL is named the Kullback–Leibler divergence (relative entropy, also free energy in statistical physics community) and I = − d dt DKL is called the relative Fisher information functional. 9
  6. Lyapunov constant Suppose there exists a “Lyapunov constant” λ >

    0, such that −∇2 xx log π(x) λI. Then d2 dt2 DKL (ρt π) ≥ −2λ d dt DKL (ρt π). By integrating on the time variable, one can prove the exponential convergence below: DKL (ρt π) ≤ e−2λtDKL (ρ0 π). 10
  7. Literature There are several approaches to establish the Lyapunov constant

    for gradient dynamics. Log-Sobolev inequality (Gross); Iterative Gamma calculus (Bakry, Emery, Baudoin, Garofalo et.al.); Entropy dissipation and Hypocoercivity (Arnold, Carlen, Carrilo, Villani, Mohout, Jungel, Markowich, Toscani, et.al.); Optimal transport, displacement convexity and Hessian operators in density space. (McCann, Ambrosio, Villani, Otto, Gangbo, Mielke, Maas, Liero, et.al.); Transport Lyapunov functional (Renesse, Strum et.al.). How can we obtain Polyak-Lojasiewicz (PL) conditions in Wasserstein type spaces? 11
  8. Convergence analysis Recall that ˙ Xt = b(Xt ) +

    √ 2a(Xt ) ˙ Bt , where b can be a non-gradient drift vector, and a is a degenerate matrix. Its Fokker-Planck equation satisfies ∂t ρ(t, x) = −∇ · (ρ(t, x)b(x)) + n+m i=1 n+m j=1 ∂2 ∂xi ∂xj (a(x)a(x)T)ij ρ(t, x) . Assume that there exists an invariant distribution π ∈ C∞(Rn+m) with an explicit formulation. How fast does ρ converge to the invariant distribution π? 12
  9. Goals In this talk, we mainly consider the entropy dissipation

    analysis for non-gradient and degenerate stochastic dynamical systems. Main difficulties. Degeneracy of diffusion matrix; Non-gradient drift vectors. Our analysis is based on the construction of auxiliary second order operators in generalized optimal transport spaces. 13
  10. Optimal transport distances The optimal transport has a variational formulation

    (Benamou-Brenier 2000): D(ρ0, ρ1)2 := inf v 1 0 EXt∼ρt v(t, Xt ) 2 dt, where E is the expectation operator and the infimum runs over all vector fields vt , such that ˙ Xt = v(t, Xt ), X0 ∼ ρ0, X1 ∼ ρ1. Under this metric, the probability set has a metric structure1. 1John D. Lafferty: the density manifold and configuration space quantization, 1988. 14
  11. Optimal transport metrics Informally speaking, the optimal transport metric refers

    to the following bilinear form: ˙ ρ1 , G(ρ) ˙ ρ2 = ( ˙ ρ1 , (−∇ · (ρ∇))−1 ˙ ρ2 )dx. In other words, denote ˙ ρi = −∇ · (ρ∇φi ), i = 1, 2, then φ1 , G(ρ)−1φ2 = (φ1 , −∇ · (ρ∇)φ2 )dx = (∇φ1 , ∇φ2 )ρdx, where ρ ∈ P(Ω), ˙ ρi is the tangent vector in P(Ω), i.e. ˙ ρi dx = 0, and φi ∈ C∞(Ω) are cotangent vectors in P(Ω) at the point ρ. 15
  12. Optimal transport first and second order calculus Given a metric

    space (P, G), there is an optimal transport type calculus method, namely the first order operator (gradient) and the second order operator (Hessian) in the probability density space. They belong to transport information geometric analysis (TIGA). See [Li, TIG, 2018.] Shortly, we design a TIGA method to study non-gradient drift and degenerate diffusion coefficient type stochastic dynamical systems. 16
  13. Optimal transport gradient operators The Wasserstein gradient flow of an

    energy functional F(ρ) leads to ∂t ρ = − G(ρ)−1 δ δρ F(ρ) = ∇ · (ρ∇ δ δρ F(ρ)), where δ δρ is the L2 first variation operator. Example If F(ρ) = DKL (ρ π) = ρ(x)log ρ(x) π(x) dx, then the gradient flow satisfies the gradient-drift Fokker-Planck equation ∂ρ ∂t =∇ · (ρ∇log ρ π ) =∇ · (ρ∇ log ρ) − ∇ · (ρ∇ log π) =∆ρ + ∇ · (ρ∇V ). Here the trick is that ρ∇ log ρ = ∇ρ, ∇ log π = −∇V. 17
  14. First and second order optimal transport calculuses In this way,

    one can study the first order entropy dissipation by d dt DKL (ρt π) = log ρt π ∇ · (ρ∇log ρt π )dx = − ∇log ρt π 2ρdx. Similarly, we study the second order entropy dissipation by d2 dt2 DKL (ρt π) =2 HessW DKL (ρt π)(log ρt π , log ρt π )ρt dx =2 Γ2 (log ρt π , log ρt π )ρt dx, where HessW DKL (ρt |π) := Γ2 is a bilinear form, which can be defined by the optimal transport second order operator. 18
  15. Convergence analysis of non-gradient flows? Consider a general Fokker-Planck equation

    as ∂t ρt = −∇ · (ρt b) + n+m i=1 n+m j=1 ∂2 ∂xi ∂xj (a(x)a(x)T)ij ρt . Suppose there exists a smooth invariant distribution π, such that −∇ · (πb) + n+m i=1 n+m j=1 ∂2 ∂xi ∂xj (a(x)a(x)T)ij π = 0. To study the convergence behavior of ρ towards π, we need to understand and estimate optimal transport type operators on both gradient and non-gradient directions. 19
  16. Decomposition: Gradient and Non-gradient (flux) One can rewrite the Fokker-Planck

    equation by ∂t ρ(t, x) = ∇ · (ρ(t, x)a(x)a(x)T∇ log ρ(t, x) π(x) ) + ∇ · (ρ(t, x)γ(x)), Gradient Non-gradient (Flux) where γ(x) :=a(x)a(x)T∇ log π(x) − b(x) + n+m j=1 ∂ ∂xj (a(x)a(x)T)ij 1≤i≤n+m , and ∇ · (π(x)γ(x)) = 0. This is an example of generalized entropy-entropy flux-metric condition. See examples in [Li-Liu-Osher, Controlling conservation laws I, II], and GENERIC or pre-GENERIC systems [Grmela and Ottinger]. 20
  17. Orthogonalizations Proposition d dt DKL (ρt π) = − (∇x

    log ρt π , a(x)a(x)T∇ log ρt π )ρt dx := −Ia (ρt ). In many situations, aaT-weighted Fisher information functional Ia (ρt ) is not enough to estimate the convergence of Fokker-Planck equation. 21
  18. Main result: Structure condition Assumption: for any i ∈ {1,

    · · · , n} and k ∈ {1, · · · , m}, we assume zT k ∇aT i ∈ Span{aT 1 , · · · , aT n }. Examples a is a constant vector; a is a matrix function defined by a = a(x1 , · · · , xn ), z ∈ span{en+1 , · · · , en+m }, where ei is the i-th Euclidean basis function. 22
  19. Main result: Entropy dissipation [Feng and Li, 2020-2022] Under the

    assumption, for any β ∈ R and a given vector function z, define matrix functions as R = Ra + Rz + Rπ − MΛ + βRIa + (1 − β)Rγa + Rγz . If there exists a constant λ > 0, such that R λ(aaT + zzT), then the following decay results hold: DKL (ρt π) ≤ 1 2λ e−2λtIa,z (ρ0 π), where ρt is the solution of Fokker-Planck equation. 23
  20. Comparisons (i) If γ = 0 and m = 0:

    [Bakry-Emery, 1985]. (ii) If γ = 0 and m = 0: [Baudoin-Garofalo, 2017], [F.-Li, 2019]. (iii) If β = 0 and m = 0, [Arnold-Carlen-Ju, 2000, 2008]. (iv) If a, z are constants and β = 0, [Arnold-Erb, 2014][Baudoin-Gordina-Herzog, 2019]. (v) If β = 1, m = 0 and a = I, [Arnold-Carlen]; [F.-Li, 2020]. 25
  21. Main idea of proof Step 1: We first compute the

    dissipation of the weighted Fisher information functional (a.k.a. gradient norms in generalized OT space). d dt Ia,z (ρt π) = −2 Γ2,a,z,γ (f, f)ρt dx, where f = log ρt π . Step 2: We next decompose the weak form of information Gamma calculus (a.k.a. Hessian matrices in generalized OT space). Γ2,a,z,γ (f, f)ρt dx = Hessβ f 2 F + R(∇f, ∇f) ρt dx. Step 3: From the information Bochner’s formula, we establish the convergence result. In other words, if R λ(aaT + zzT), then d dt Ia,z (ρt π) ≤ −2λIa,z (ρt π). From the Gronwall’s inequality, we can prove that the weighted Fisher information decays, so as the KL divergence decay. 26
  22. Details of proof Define Ia,z (ρ π) = Rn+m ∇

    log ρ π , (aaT + zzT)∇ log ρ π ρdx. Consider − 1 2 d dt Ia,z (ρt ) = Γ2 (f, f)ρt dxdx · · · (I) + Γz,π 2 (f, f)ρt dx · · · (II) + ΓIa,z (f, f)ρt dx · · · (III) where f = log ρt π , and Γ2 , Γz 2 , Γγ are designed bilinear forms, coming from the second order calculus in density space. (i) If a is non-degenerate, then (II) = 0; (ii) If b is a gradient vector field, then (III) = 0. 27
  23. Detailed approach For any f ∈ C∞(Rn+m), the generator of

    Itˆ o SDE satisfies Lf = Lf − γ, ∇f , where Lf = ∇ · (aaT∇f) + aaT∇ log π, ∇f . For a given matrix function a ∈ R(n+m)×n, we construct a matrix function z ∈ R(n+m)×m, and define a z-direction generator by Lz f = ∇ · (zzT∇f) + zzT∇ log π, ∇f . 28
  24. Global computation=Information gamma operators Define Gamma one bilinear forms by

    Γ1 (f, f) = aT∇f, aT∇f Rn , Γz 1 (f, f) = zT∇f, zT∇f Rm . Define Gamma two bilinear forms by (i) Gamma two operator: Γ2 (f, f) = 1 2 LΓ1 (f, f) − Γ1 (Lf, f). (ii) Generalized Gamma z operator: Γz,π 2 (f, f) = 1 2 LΓz 1 (f, f) − Γz 1 (Lf, f) + divπ z Γ1,∇(aaT) (f, f) − divπ a Γ1,∇(zzT) (f, f) . (iii) Irreversible Gamma operator: ΓIa,z (f, f) = (Lf + Lz f) ∇f, γ − 1 2 ∇ Γ1 (f, f) + Γz 1 (f, f) , γ . 29
  25. Local calculation=Information Bochner’s formula For any f = log ρt

    π ∈ C∞(Rn+m, R) and any β ∈ R, under the assumption, we derive − 1 2 d dt Ia,z (ρt π) = Γ2 (f, f) + Γz,π 2 (f, f) + ΓIa,z (f, f) ρt dx = Hessβ f 2 + R(∇f, ∇f) ρt dx ≥ R(∇f, ∇f)ρt dx. Clearly, if R λ(aaT + zzT), then − 1 2 d dt Ia,z (ρt π) ≥ λIa,z (ρt π). From Grownwall’s inequality, we prove the Lyapunov convergence result. 30
  26. Example I: Non-degenerate Langevin dynamics Consider a stochastic process: dXt

    = − a(Xt )a(Xt )T∇V (Xt ) + n j=1 ∂ ∂Xj (a(Xt )a(Xt )T)ij 1≤i≤n − γ(Xt ) dt + √ 2a(Xt )dBt , where Xt ∈ Rn, Bt is a standard n dimensional Brownian motion, V ∈ C2(Rn; R) is a potential function with ∇ · (e−V (x)γ(x)) = 0, and the diffusion matrix a is a smooth diagonal matrix with a(x) = diag(aii (xi ))1≤i≤n . Its invariant density satisfies π(x) = 1 Z e−V (x), Z = e−V (y)dy. 31
  27. Example I: Hessian matrix The Hessian matrix R has the

    following form: R(x) = Ra + βRIa + (1 − β)Rγa (x), where                    Ra,ii = a3 ii ∂xi aii ∂xi V + a4 ii ∂2 xixi V − a3 ii ∂2 xixi aii , Ra,ij = a2 ii a2 jj ∂2 xixj V, RIa,ii = γi [aii ∂xi aii − (aii )2∂xi V ], RIa,ij = 1 2 [γj (2aii ∂xi aii − (aii )2∂xi V ) + γi (2ajj ∂xj ajj − (ajj )2∂xj V )], Rγa,ii = γi aii ∂xi aii − ∂xi γi (aii )2, Rγa,ij = −1 2 [∂xi γj (ajj )2 + ∂xj γi (aii )2]. 32
  28. Example I: One dimensional case Let n = 1 and

    m = 0. R = a3a V + a4V − a3a + βγ(aa − a2V ) + (1 − β)(γaa − a2γ ), where and represent the first and second-order derivatives w.r.t. x. If a = 1 and γ = 0, then R = V (x); If a = 1 and γ = 0, β = 0, then R = V (x) − γ (x); If a = 1 and γ = 0, β = 1, then R = V (x) − γ(x)V (x); If a = 1 and γ = 0, β ∈ R, then R = V (x) − βγ(x)V (x) − (1 − β)γ (x). 33
  29. Example II: Underdamped Langevin dynamics Consider a underdamped Langevin dynamic

    by dxt =vt dt dvt =(−T(xt )vt − ∇x U(xt ))dt + 2T(xt )dBt . It can be viewed as, Yt = (xt , vt ), dYt = b(Yt )dt + √ 2a(Yt )dBt , with drift vectors and diffusion coefficient satisfying b = v −T(x)v − ∇U(x) , a = 0 T(x) . Its invariant measure has a closed form, π(x, v) = 1 Z e−H(x,v), H(x, v) = v 2 2 + U(x). 34
  30. Example II: Decomposition Write a vector field γ ∈ R2

    as γ(x, v) = 0 −rv − v −rv − ∇x U = J∇x,v H(x, v), where J = 0 −1 1 0 . Clearly, ∇x,v · π(x, v)γ(x, v) = 1 Z ∇x,v · e−H(x,v) J∇x,v H(x, v) = 0. 35
  31. II: Constant diffusion -1 -0.5 0 0.5 1 x 1

    -1 -0.5 0 0.5 1 x 2 -1 -0.5 0 0.5 1 1.5 Smallest eignvalue: 0.094 0.095 1 1 0.096 0.097 Smallest eignvalue: 0.098 0.099 0.5 0.5 0.1 x 2 x 1 0 0 -0.5 -0.5 -1 -1 Figure: T=1, U(x) = x2/2; Left β = 0 [Arnold-Erb]; Right β = 0.1; z = (1, 0.1)T. 37
  32. II: Variable diffusion -0.2 -0.15 1 1 -0.1 -0.05 Smallest

    eignvalue: 0.9 0.9 0 0.05 0.8 0.8 x 1 x 2 0.7 0.7 0.6 0.6 0.5 0.5 -0.04 -0.02 1 1 0 0.02 0.04 Smallest eignvalue: 0.06 0.9 0.9 0.08 0.1 0.8 0.8 x 1 x 2 0.7 0.7 0.6 0.6 0.5 0.5 Figure: U(x) = xc−x c(c−1) , T(x) = (∇2 x U(x))−1, c=2.5, z = 1 0.1 . (Left: β = 0; Right: β = 0.6.) 38
  33. Example III: SDEs on Lie groups There are some other

    convergence analysis for SDEs in Lie groups, including the Heisenberg group (quantum SDEs), the displacement group, and the Martinet flat sub-Riemannian structure. Consider dXt = − a(Xt )a(Xt )T∇V (Xt ) + n j=1 ∂ ∂Xj (a(Xt )a(Xt )T)ij 1≤i≤n dt + √ 2a(Xt )dBt , where a(x) ∈ C∞(Rn+m; R(n+m)×n) and π ∈ C∞(Rn+m; R). They are gradient flow examples with degenerate diffusion matrix coefficients. We still manage to derive suitable Hessian matrix conditions. See details in Feng-Li, Entropy dissipation for degenerate stochastic differential equations via sub-Riemannian density manifold, 88 pages, Entropy, 2023. 39
  34. Example III.1: Heisenberg group If R λ(aaT + zzT), then

    DKL (ρt π) ≤ 1 2λ e−2λtIa,z (ρ0 π), where ρt is the solution of Fokker-Planck equation. 40
  35. Example III.2: Displacement group If R λ(aaT + zzT), then

    DKL (ρt π) ≤ 1 2λ e−2λtIa,z (ρ0 π), where ρt is the solution of Fokker-Planck equation. 41
  36. Example IV: Mean field underdamped Langevin dynamics There are some

    other Hessian matrix examples for mean field SDEs, whose Kolmogorov forward equation satisfies the weakly self-consistent Vlasov-Fokker-Planck equation. See analysis of overdamped case in [Carrillo-McCann-Villani, 2003]. We now consider dxt = vt dt dvt = −vt dt − ( Td×Rd ∇x W(xt , y)f(t, y, v)dvdy + ∇x U(xt ))dt + √ 2dBt , where (xt , vt ) ∈ Ω = Td × Rd presents an identical particle’s position and velocity, Td is a d dimensional torus representing a position domain, and Bt is a standard Brownian motion in Rd. Here f is the probability density of both spatial and velocity variables (x, v). ∂t f + v · ∇x f − ( Td×Rd ∇x W(x, y)f(t, y, v)dvdy + ∇x U(x)) · ∇v f = ∇v · (fv) + ∇v · (∇v f). 42
  37. Example IV: Decomposition Similarly, we rewrite the above weakly self-consistent

    Vlasov-Fokker-Planck equation in the following equivalent form: ∂t f = ∇x,v · (fγ) + ∇x,v · (faaT∇x,v δ δf E(f)), where E(f) = Ω f(x, v) log f(x, v)dxdv + Ω 1 2 v 2f(x, v)dxdv + 1 2 Ω×Ω W(x, y)f(x, v)f(y, v)dxdvdydv + Ω U(x)f(x, v)dxdv, with γ = J∇x,v [ δ δf E(f)], and J = 0 −Id Id 0 2d×2d . 43
  38. Example IV We apply a similar method in this paper

    and derive the Hessian matrix condition for the mean field PDE. See details about choices of W and z with convergence rates λ in Bayraktar-Feng-Li, Exponential Entropy dissipation for weakly self-consistent Vlasov-Fokker-Planck equations, Journal of Nonlinear Science, 2024. 46
  39. Example V: Time dependent stochastic differential equations Consider Ito type

    stochastic differential equations (SDEs) in Rn+m as follows: dXt = b(t, Xt )dt + √ 2a(t, Xt )dBt . For m, n ∈ Z+ , we assume that a(t, x) ∈ C∞(R+ × Rn+m; R(n+m)×n) is a time dependent diffusion matrix, b(t, x) ∈ C∞(R+ × Rn+m; Rn+m) is a time dependent vector field, and Bt is a standard Rn-valued Brownian motion. 47
  40. Example V: Time dependent entropy dissipation Here π(t, x) is

    a selected time dependent reference density function. We also define γ(t, x) := a(t, x)a(t, x)T ∇ log π(x) − b(x) + ∇ · a(t, x)a(t, x)T . Due to this general setting, we may not have ∇ · (π(t, x)γ(t, x)) = 0. 48
  41. Example V We have shown some time dependent SDE analysis,

    and conducted some related numerical simulations. Typical examples are: Time dependent overdamped dynamics (simulated annealing); Time dependent irreversible Langevin dynamics; Time dependent underdamped Langevin dynamics. See details in Feng-Zuo-Li, Fisher information dissipation for time inhomogeneous stochastic differential equations, arXiv:2402.01036, 2024. The construction of fast and stable stochastic algorithm is a more challenging future work! 50
  42. TIGA for Stochastic Analysis & Algorithms Optimal choices of auxiliary

    directions for capturing sharp convergence rates; Design fast and scalable AI type sampling algorithms with robust convergence guarantees; Flux-gradient flow functional inequalities; Convergence analysis of generalized flux-gradient flows and Generalized time-dependent mean field information Gamma calculus. 51