
Entropy dissipation via information Gamma calculus

Wuchen Li
April 06, 2021


In this talk, we present the convergence behavior of some non-gradient degenerate stochastic differential equations towards their invariant distributions. Our method extends the connection between Gamma calculus and Hessian operators in the Wasserstein space. In detail, we apply Lyapunov methods in the space of probabilities, where the Lyapunov functional is chosen as the relative Fisher information. We derive the Fisher information induced Gamma calculus to handle non-gradient drift vector fields and degenerate diffusion matrices. Several examples are provided for non-reversible Langevin dynamics, sub-Riemannian diffusion processes, and variable-dependent underdamped Langevin dynamics.




Transcript

  1. Entropy dissipation via information Gamma calculus. Wuchen Li, University of South Carolina. Analysis Seminar, CMUC, April 9th. This is based on a joint work with Qi Feng (USC).
  2. Stochastic differential equations. Consider a stochastic differential equation (SDE)
     $\dot X_t = b(X_t) + \sqrt{2}\, a(X_t)\,\dot B_t,$
     where $(n, m) \in \mathbb{N}$, $X_t \in \mathbb{R}^{n+m}$, $b \in \mathbb{R}^{n+m}$ is a drift vector function, $a(X_t) \in \mathbb{R}^{(n+m)\times n}$ is a diffusion matrix function, and $B_t \in \mathbb{R}^n$ is a standard Brownian motion. SDEs of this form are widely used in practice: mathematical physics equations; protein folding; designing Markov chain Monte Carlo algorithms.
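To make the setup above concrete, here is a minimal Euler–Maruyama sketch for integrating $\dot X_t = b(X_t) + \sqrt{2}\,a(X_t)\dot B_t$; the specific drift, diffusion matrix, step size, and horizon below are illustrative placeholders, not choices from the talk.

```python
import numpy as np

def euler_maruyama(b, a, x0, dt=1e-3, n_steps=10_000, rng=None):
    """Integrate dX_t = b(X_t) dt + sqrt(2) a(X_t) dB_t with Euler-Maruyama."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n_steps):
        diff = np.atleast_2d(a(x))                      # (n+m) x n diffusion matrix
        dB = rng.normal(scale=np.sqrt(dt), size=diff.shape[1])
        x = x + b(x) * dt + np.sqrt(2.0) * diff @ dB
        path.append(x.copy())
    return np.array(path)

# Illustrative placeholder (not from the talk): a 2D non-gradient drift with a
# degenerate diffusion matrix acting only on the second coordinate (n = m = 1).
b = lambda x: np.array([x[1], -x[0] - x[1]])
a = lambda x: np.array([[0.0], [1.0]])
path = euler_maruyama(b, a, x0=[1.0, 0.0], rng=np.random.default_rng(0))
print(path[-1])
```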
  3. Example: Langevin dynamics. We review a classical example. We start with a gradient drift-diffusion process
     $\dot X_t = -\nabla V(X_t) + \sqrt{2}\,\dot B_t,$
     where $V$ is a given potential function. Let $\rho(t, x)$ be the probability density function of $X_t$, which satisfies the Fokker-Planck equation
     $\partial_t \rho(t, x) = \nabla \cdot \big(\rho(t, x)\nabla V(x)\big) + \Delta \rho(t, x).$
     Here $\pi(x) := \frac{1}{Z} e^{-V(x)}$, where $Z = \int e^{-V}\,dx < \infty$, is the invariant distribution of the SDE. The main question: how fast does $\rho(t, x)$ converge to the invariant distribution $\pi$?
  4. Lyapunov methods. To study the dynamical behavior of $\rho$, we apply a Lyapunov functional
     $D_{KL}(\rho_t \| \pi) = \int \rho_t(x) \log \frac{\rho_t(x)}{\pi(x)}\,dx.$
     Along the Fokker-Planck equation, the first order dissipation satisfies
     $\frac{d}{dt} D_{KL}(\rho_t \| \pi) = -\int \Big|\nabla_x \log \frac{\rho_t(x)}{\pi(x)}\Big|^2 \rho_t\,dx,$
     and the second order dissipation satisfies
     $\frac{d^2}{dt^2} D_{KL}(\rho_t \| \pi) = 2\int \Big[\big\|\nabla^2_{xx} \log \tfrac{\rho_t}{\pi}\big\|_F^2 - \nabla^2_{xx} \log \pi\big(\nabla_x \log \tfrac{\rho_t}{\pi}, \nabla_x \log \tfrac{\rho_t}{\pi}\big)\Big] \rho_t\,dx,$
     where $\|\cdot\|_F$ is the matrix Frobenius norm. In the literature, $D_{KL}$ is named the Kullback–Leibler divergence (relative entropy) and $I = -\frac{d}{dt} D_{KL}$ is called the relative Fisher information functional.
  5. Lyapunov constant. Suppose there exists a "Lyapunov constant" $\lambda > 0$ such that
     $-\nabla^2_{xx} \log \pi(x) \succeq \lambda I.$
     Then
     $\frac{d^2}{dt^2} D_{KL}(\rho_t \| \pi) \geq -2\lambda \frac{d}{dt} D_{KL}(\rho_t \| \pi).$
     By integrating in the time variable, one can prove the exponential convergence
     $D_{KL}(\rho_t \| \pi) \leq e^{-2\lambda t} D_{KL}(\rho_0 \| \pi).$
     As a by-product, one can show the log-Sobolev inequality
     $D_{KL}(\rho \| \pi) \leq \frac{1}{2\lambda} I(\rho \| \pi).$
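The exponential decay bound can be checked in closed form in the Gaussian case. The sketch below is my own illustration: with $V(x) = x^2/2$ the Langevin SDE is the Ornstein–Uhlenbeck process, $-\nabla^2_{xx}\log\pi = 1$ gives the Lyapunov constant $\lambda = 1$, and a Gaussian initial law stays Gaussian, so both sides of $D_{KL}(\rho_t\|\pi) \leq e^{-2\lambda t} D_{KL}(\rho_0\|\pi)$ are explicit.

```python
import numpy as np

def kl_gaussian(m, var):
    """KL( N(m, var) || N(0, 1) ) in closed form."""
    return 0.5 * (var + m**2 - 1.0 - np.log(var))

# For V(x) = x^2/2 the Langevin SDE is the OU process dX = -X dt + sqrt(2) dB:
# a Gaussian initial law N(m0, var0) stays Gaussian with explicit moments,
# and the Lyapunov constant is lam = 1 since -Hess log pi = 1.
m0, var0, lam = 2.0, 4.0, 1.0
t = np.linspace(0.0, 3.0, 7)
m_t = m0 * np.exp(-t)
var_t = 1.0 + (var0 - 1.0) * np.exp(-2.0 * t)

kl_t = kl_gaussian(m_t, var_t)
bound = np.exp(-2.0 * lam * t) * kl_gaussian(m0, var0)
print(np.column_stack([t, kl_t, bound]))
print("decay bound holds:", bool(np.all(kl_t <= bound + 1e-12)))
```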
  6. Literature. There are several equivalent approaches to establishing the Lyapunov constant for gradient dynamics: log-Sobolev inequality (Gross); iterative Gamma calculus (Bakry, Émery, Baudoin, Garofalo et al.); entropy dissipation (Arnold, Carlen, Carrillo, Mouhot, Jüngel, Markowich, Toscani et al.); optimal transport, displacement convexity and Hessian operators in density space (McCann, Ambrosio, Villani, Otto, Gangbo et al.); transport Lyapunov functionals (von Renesse, Sturm et al.).
  7. Problem. Recall that
     $\dot X_t = b(X_t) + \sqrt{2}\, a(X_t)\,\dot B_t,$
     where $b$ can be a non-gradient drift vector and $a$ is a degenerate matrix. Its (hypoelliptic) Fokker-Planck equation satisfies
     $\partial_t \rho = -\nabla \cdot (\rho b) + \sum_{i=1}^{n+m} \sum_{j=1}^{n+m} \frac{\partial^2}{\partial x_i \partial x_j} \Big(\big(a(x)a(x)^{\mathsf T}\big)_{ij}\, \rho\Big).$
     Assume that there exists an invariant distribution $\pi$ with a given explicit formulation. The major problem: how fast does $\rho$ converge to the invariant distribution $\pi$?
  8. Goals. In this talk, we mainly consider the entropy dissipation for perturbed-gradient dynamical systems. Main difficulties: degeneracy of the diffusion matrix; non-gradient drift vectors. Our method is based on an extended second order calculus in a generalized optimal transport space.
  9. Review: Optimal transport space. The optimal transport has a variational formulation (Benamou-Brenier 2000):
     $D(\rho_0, \rho_1)^2 := \inf_{v} \int_0^1 \mathbb{E}_{X_t \sim \rho_t} \|v(t, X_t)\|^2\,dt,$
     where $\mathbb{E}$ is the expectation operator and the infimum runs over all vector fields $v_t$ such that $\dot X_t = v(t, X_t)$, $X_0 \sim \rho_0$, $X_1 \sim \rho_1$. Under this metric, the probability set has a metric structure [John D. Lafferty: The density manifold and configuration space quantization, 1988].
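In one dimension the Benamou–Brenier value coincides with the squared Wasserstein-2 distance, which can be evaluated through quantile functions. The following sketch (with two illustrative Gaussian marginals of my own choosing) compares the numerical value with the known closed form for Gaussians.

```python
import numpy as np
from scipy.stats import norm

# 1D illustration: D(rho0, rho1)^2 equals the squared Wasserstein-2 distance,
# which in one dimension is the L^2 distance between quantile functions.
u = np.linspace(5e-5, 1 - 5e-5, 20_000)          # quantile levels in (0, 1)
q0 = norm.ppf(u, loc=0.0, scale=1.0)             # quantile function of rho0 = N(0, 1)
q1 = norm.ppf(u, loc=2.0, scale=0.5)             # quantile function of rho1 = N(2, 0.25)
w2_sq = np.mean((q0 - q1)**2)                    # ~ integral over u in (0, 1)
print(f"D(rho0, rho1)^2 ~ {w2_sq:.3f}   (Gaussian closed form: 2^2 + 0.5^2 = 4.25)")
```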
  10. Review: Optimal transport metric. Informally speaking, the optimal transport metric refers to the following bilinear form:
     $\langle \dot\rho_1, G(\rho)\,\dot\rho_2 \rangle = \int \dot\rho_1\, (-\Delta_\rho)^{-1} \dot\rho_2\,dx, \quad \text{where } \Delta_\rho := \nabla\cdot(\rho\nabla).$
     In other words, denote $\dot\rho_i = -\nabla\cdot(\rho\nabla\phi_i)$, $i = 1, 2$; then
     $\langle \phi_1, G(\rho)^{-1}\phi_2 \rangle = \int \phi_1\, \big(-\nabla\cdot(\rho\nabla)\big)\phi_2\,dx = \int (\nabla\phi_1, \nabla\phi_2)\,\rho\,dx,$
     where $\rho \in \mathcal{P}(\Omega)$, $\dot\rho_i$ are tangent vectors in $\mathcal{P}(\Omega)$, i.e. $\int \dot\rho_i\,dx = 0$, and $\phi_i \in C^\infty(\Omega)$ are cotangent vectors in $\mathcal{P}(\Omega)$ at the point $\rho$.
  11. Review: Optimal transport gradient flows. The Wasserstein gradient flow of an energy functional $\mathcal{F}(\rho)$ leads to
     $\partial_t \rho = -G(\rho)^{-1} \frac{\delta}{\delta\rho}\mathcal{F}(\rho) = \nabla\cdot\Big(\rho\nabla\frac{\delta}{\delta\rho}\mathcal{F}(\rho)\Big).$
     Example: if $\mathcal{F}(\rho) = \int F(x)\rho(x)\,dx$, then the gradient flow satisfies
     $\partial_t \rho = \nabla\cdot\big(\rho\nabla F(x)\big).$
  12. Entropy dissipation revisited. The gradient flow of the KL divergence
     $D_{KL}(\rho \| \pi) = \int \rho(x)\log\frac{\rho(x)}{\pi(x)}\,dx$
     w.r.t. the optimal transport metric satisfies the Fokker-Planck equation
     $\frac{\partial \rho}{\partial t} = \nabla\cdot\Big(\rho\nabla\log\frac{\rho}{\pi}\Big).$
     Here the major trick is that $\rho\nabla\log\rho = \nabla\rho$.
  13. Entropy dissipation revisited. In this way, one can study the first order entropy dissipation:
     $\frac{d}{dt} D_{KL}(\rho_t \| \pi) = \int \log\frac{\rho_t}{\pi}\,\nabla\cdot\Big(\rho\nabla\log\frac{\rho_t}{\pi}\Big)\,dx = -\int \Big|\nabla\log\frac{\rho_t}{\pi}\Big|^2 \rho\,dx = -I(\rho_t \| \pi).$
     Similarly, we study the second order entropy dissipation:
     $\frac{d}{dt} I(\rho_t \| \pi) = -2\int_\Omega \Gamma_2\Big(\log\frac{\rho_t}{\pi}, \log\frac{\rho_t}{\pi}\Big)\rho_t\,dx,$
     where $\Gamma_2$ is a bilinear form, which can be defined by the optimal transport second order operator.
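A quick finite-difference check of the first-order dissipation identity. The sketch below solves the 1D Fokker–Planck equation $\partial_t\rho = \nabla\cdot(\rho\nabla\log\frac{\rho}{\pi})$ on a grid (using the trick $\rho\nabla\log\rho = \nabla\rho$ from the previous slide) and compares the discrete rate of change of $D_{KL}$ with $-I(\rho_t\|\pi)$; the potential, grid, and initial density are my own illustrative choices.

```python
import numpy as np

# 1D Fokker-Planck  d_t rho = d_x( rho d_x log(rho/pi) ),  pi ~ exp(-V),
# checking the first-order dissipation  d/dt D_KL(rho_t || pi) ~ -I(rho_t || pi).
x, dx = np.linspace(-6.0, 6.0, 601, retstep=True)
Vp = x                                               # V(x) = x^2/2, so grad V = x
pi = np.exp(-x**2 / 2); pi /= pi.sum() * dx
rho = np.exp(-(x - 2.0)**2); rho /= rho.sum() * dx   # initial density, off-center

kl = lambda r: np.sum(r * np.log(r / pi)) * dx
fisher = lambda r: np.sum((np.gradient(r, dx) / r + Vp)**2 * r) * dx

dt = 1e-4
for step in range(1, 2001):
    # rho * d_x log(rho/pi) = d_x rho + rho * grad V   (the "major trick" above)
    flux = np.gradient(rho, dx) + rho * Vp
    kl_old = kl(rho)
    rho = np.clip(rho + dt * np.gradient(flux, dx), 1e-300, None)
    if step % 500 == 0:
        print(f"step {step}: d/dt KL ~ {(kl(rho) - kl_old) / dt:+.3f}   -I ~ {-fisher(rho):+.3f}")
```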
  14. Lyapunov methods for degenerate non-gradient flows. Consider a perturbed gradient flow
     $\dot\rho = -G(\rho)^{-1}\frac{\delta}{\delta\rho} D_{KL}(\rho_t \| \pi) + f\Big(\frac{\rho_t}{\pi}\Big),$
     where $f$ is a given function generated by the non-gradient drift vector field. How can we study the convergence behavior of $\rho$?
  15. Motivation: Decomposition. Assume $\pi$ is the invariant measure, given by an explicit formulation. We decompose the Fokker-Planck equation as
     $\partial_t \rho(t, x) = \nabla\cdot\Big(\rho(t, x)\,a(x)a(x)^{\mathsf T}\nabla\log\frac{\rho(t, x)}{\pi(x)}\Big) \;(\text{gradient direction}) \;+\; \nabla\cdot\big(\rho(t, x)\,\gamma(x)\big) \;(\text{perturbed direction}),$
     where
     $\gamma(x) := a(x)a(x)^{\mathsf T}\nabla\log\pi(x) - b(x) + \Big(\sum_{j=1}^{n+m}\frac{\partial}{\partial x_j}\big(a(x)a(x)^{\mathsf T}\big)_{ij}\Big)_{1\leq i\leq n+m},$
     and $\nabla\cdot\big(\pi(x)\gamma(x)\big) = 0$.
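The decomposition and the divergence-free property $\nabla\cdot(\pi\gamma) = 0$ can be verified symbolically. The sketch below does this for the underdamped Langevin example that appears later in the talk (slide 23), with $U$ and $T$ left as generic functions of $x$; it is an illustration I added, not part of the slides.

```python
import sympy as sp

# Symbolic check for the underdamped Langevin example (slide 23):
# gamma := a a^T grad(log pi) - b + row-wise divergence of a a^T,
# and the perturbation is divergence free w.r.t. pi: div(pi * gamma) = 0.
x, v = sp.symbols('x v', real=True)
U = sp.Function('U')(x)
T = sp.Function('T')(x)

log_pi = -(v**2 / 2 + U)                 # pi ~ exp(-H), H = v^2/2 + U(x)
pi = sp.exp(log_pi)
b = sp.Matrix([v, -T * v - sp.diff(U, x)])
a = sp.Matrix([0, sp.sqrt(T)])           # so sqrt(2) a dB_t reproduces sqrt(2 T(x)) dB_t
A = a * a.T                              # diffusion matrix a a^T
X = [x, v]

grad_log_pi = sp.Matrix([sp.diff(log_pi, s) for s in X])
div_A = sp.Matrix([sum(sp.diff(A[i, j], X[j]) for j in range(2)) for i in range(2)])
gamma = A * grad_log_pi - b + div_A

print(sp.simplify(gamma.T))                                              # -> [-v, U'(x)]
print(sp.simplify(sum(sp.diff(pi * gamma[i], X[i]) for i in range(2))))  # -> 0
```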
  16. Main result: Structure condition. Assumption: for any $i \in \{1, \cdots, n\}$ and $k \in \{1, \cdots, m\}$, we assume
     $z_k^{\mathsf T}\nabla a_i^{\mathsf T} \in \mathrm{Span}\{a_1^{\mathsf T}, \cdots, a_n^{\mathsf T}\}.$
     Examples: $a$ is a constant vector; $a$ is a matrix function defined by $a = a(x_1, \cdots, x_n)$ with $z \in \mathrm{span}\{e_{n+1}, \cdots, e_{n+m}\}$, where $e_i$ is the $i$-th Euclidean basis vector.
  17. Main result: Entropy dissipation [F. and Li, 2021]. Under the assumption, for any $\beta \in \mathbb{R}$ and a given vector function $z$, define the matrix function
     $R = R_a + R_z + R_\pi - M_\Lambda + \beta R_{I_a} + (1 - \beta)R_{\gamma a} + R_{\gamma z}.$
     If there exists a constant $\lambda > 0$ such that $R \succeq \lambda(aa^{\mathsf T} + zz^{\mathsf T})$, then the following decay result holds:
     $D_{KL}(\rho_t \| \pi) \leq \frac{1}{2\lambda} e^{-2\lambda t}\, I_{a,z}(\rho_0 \| \pi),$
     where $\rho_t$ is the solution of the Fokker-Planck equation.
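In practice, once the matrix $R$ from the paper is assembled at a point, checking $R \succeq \lambda(aa^{\mathsf T} + zz^{\mathsf T})$ and extracting the best rate $\lambda$ is a generalized eigenvalue problem, which is what the "smallest eigenvalue" plots later in the deck report. The sketch below shows that reduction only; the numerical matrix $R$ and the vectors $a$, $z$ are made-up placeholders, since the slides do not reproduce the explicit entries of $R$.

```python
import numpy as np
from scipy.linalg import eigh

def lyapunov_rate(R, a, z):
    """Largest lambda with R >= lambda (a a^T + z z^T), computed as the smallest
    generalized eigenvalue; requires a a^T + z z^T to be positive definite."""
    S = a @ a.T + z @ z.T
    return eigh(R, S, eigvals_only=True)[0]

# Placeholder values (not from the paper): a single point evaluation with
# a = (0, 1)^T, z = (1, 0.1)^T, and a made-up symmetric matrix R.
a = np.array([[0.0], [1.0]])
z = np.array([[1.0], [0.1]])
R = np.array([[0.30, 0.05],
              [0.05, 0.40]])
print(f"lambda ~ {lyapunov_rate(R, a, z):.3f}")
```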
  18. Comparisons. (i) If $\gamma = 0$ and $m = 0$: [Bakry-Émery, 1985]. (ii) If $\gamma = 0$ and $m \neq 0$: [Baudoin-Garofalo, 2017], [F.-Li, 2019]. (iii) If $\beta = 0$ and $m = 0$: [Arnold-Carlen-Ju, 2000, 2008]. (iv) If $a$, $z$ are constants and $\beta = 0$: [Arnold-Erb, 2014], [Baudoin-Gordina-Herzog, 2019]. (v) If $\beta = 1$, $m = 0$ and $a = I$: [Arnold-Carlen]; [F.-Li, 2020].
  19. Idea of proof. Define
     $I_{a,z}(\rho \| \pi) = \int_{\mathbb{R}^{n+m}} \Big\langle \nabla\log\frac{\rho}{\pi},\ (aa^{\mathsf T} + zz^{\mathsf T})\nabla\log\frac{\rho}{\pi}\Big\rangle\,\rho\,dx.$
     Consider
     $-\frac12\frac{d}{dt} I_{a,z}(\rho_t) = \mathrm{(I)} + \mathrm{(II)} + \mathrm{(III)},$
     where $\mathrm{(I)} = \int \Gamma_2(f, f)\,\rho_t\,dx$, $\mathrm{(II)} = \int \Gamma^{z,\pi}_2(f, f)\,\rho_t\,dx$, $\mathrm{(III)} = \int \Gamma_{I_{a,z}}(f, f)\,\rho_t\,dx$, with $f = \log\frac{\rho}{\pi}$, and $\Gamma_2$, $\Gamma^{z,\pi}_2$, $\Gamma_{I_{a,z}}$ are designed bilinear forms coming from the second order calculation in density space. (i) If $a$ is non-degenerate, then $\mathrm{(II)} = 0$; (ii) if $b$ is a gradient vector field, then $\mathrm{(III)} = 0$.
  20. Detailed approach. For any $f \in C^\infty(\mathbb{R}^{n+m})$, the generator of the Itô SDE satisfies
     $\widetilde L f = L f - \langle \gamma, \nabla f\rangle, \quad\text{where}\quad L f = \nabla\cdot(aa^{\mathsf T}\nabla f) + \langle aa^{\mathsf T}\nabla\log\pi, \nabla f\rangle.$
     For a given matrix function $a \in \mathbb{R}^{(n+m)\times n}$, we construct a matrix function $z \in \mathbb{R}^{(n+m)\times m}$, and define a $z$-direction generator by
     $L_z f = \nabla\cdot(zz^{\mathsf T}\nabla f) + \langle zz^{\mathsf T}\nabla\log\pi, \nabla f\rangle.$
  21. Global in space computation = Gamma operators. Define Gamma one bilinear forms by
     $\Gamma_1(f, f) = \langle a^{\mathsf T}\nabla f, a^{\mathsf T}\nabla f\rangle_{\mathbb{R}^n}, \qquad \Gamma^z_1(f, f) = \langle z^{\mathsf T}\nabla f, z^{\mathsf T}\nabla f\rangle_{\mathbb{R}^m}.$
     Define Gamma two bilinear forms by: (i) Gamma two operator:
     $\Gamma_2(f, f) = \frac12 L\Gamma_1(f, f) - \Gamma_1(Lf, f).$
     (ii) Generalized Gamma $z$ operator:
     $\Gamma^{z,\pi}_2(f, f) = \frac12 L\Gamma^z_1(f, f) - \Gamma^z_1(Lf, f) + \mathrm{div}^\pi_z\big(\Gamma_{1,\nabla(aa^{\mathsf T})}(f, f)\big) - \mathrm{div}^\pi_a\big(\Gamma_{1,\nabla(zz^{\mathsf T})}(f, f)\big).$
     (iii) Irreversible Gamma operator:
     $\Gamma_{I_{a,z}}(f, f) = (Lf + L_z f)\,\langle\nabla f, \gamma\rangle - \frac12\Big\langle\nabla\big(\Gamma_1(f, f) + \Gamma^z_1(f, f)\big), \gamma\Big\rangle.$
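As a sanity check on the Gamma calculus, in the one-dimensional, non-degenerate, gradient case ($a = 1$, $m = 0$, $\pi \propto e^{-V}$) the iterated operator $\Gamma_2$ should reduce to the classical Bochner/Bakry–Émery identity $(f'')^2 + V''\,(f')^2$. The symbolic sketch below verifies this; it is my own illustration of the definitions, not code from the paper.

```python
import sympy as sp

# 1D symbolic check (a = 1, m = 0, so z plays no role):
#   Gamma_1(f, g) = f' g',   L f = f'' + (log pi)' f' = f'' - V' f',
#   Gamma_2(f, f) = 1/2 L Gamma_1(f, f) - Gamma_1(L f, f)
# should reproduce Bochner's identity (f'')^2 + V'' (f')^2.
x = sp.symbols('x', real=True)
f = sp.Function('f')(x)
V = sp.Function('V')(x)

L = lambda g: sp.diff(g, x, 2) - sp.diff(V, x) * sp.diff(g, x)
Gamma1 = lambda g, h: sp.diff(g, x) * sp.diff(h, x)

Gamma2 = sp.Rational(1, 2) * L(Gamma1(f, f)) - Gamma1(L(f), f)
bochner = sp.diff(f, x, 2)**2 + sp.diff(V, x, 2) * sp.diff(f, x)**2
print(sp.simplify(Gamma2 - bochner))     # expected: 0
```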
  22. Local in space calculation = Bochner's formula. For any $f = \log\frac{\rho}{\pi} \in C^\infty(\mathbb{R}^{n+m}, \mathbb{R})$ and any $\beta \in \mathbb{R}$, under the assumption, we derive that
     $-\frac12\frac{d}{dt} I_{a,z}(\rho \| \pi) = \int \big[\Gamma_2(f, f) + \Gamma^{z,\pi}_2(f, f) + \Gamma_{I_{a,z}}(f, f)\big]\,\rho\,dx = \int \big[\|\mathrm{Hess}_\beta f\|^2 + R(\nabla f, \nabla f)\big]\,\rho\,dx.$
     Clearly, if $R \succeq \lambda I$, we derive a Lyapunov constant $\lambda$ for the convergence rate.
  23. Example. Consider an underdamped Langevin dynamics
     $dx_t = v_t\,dt, \qquad dv_t = \big(-T(x_t)v_t - \nabla_x U(x_t)\big)\,dt + \sqrt{2T(x_t)}\,dB_t. \quad (1)$
     It can be viewed as $Y_t = (x_t, v_t)$, $dY_t = b(Y_t)\,dt + \sqrt{2}\,a(Y_t)\,dB_t$, with
     $b = \begin{pmatrix} v \\ -T(x)v - \nabla U(x)\end{pmatrix}, \qquad a = \begin{pmatrix} 0 \\ \sqrt{T(x)}\end{pmatrix}.$
     Its invariant measure has a closed form:
     $\pi(x, v) = \frac{1}{Z} e^{-H(x,v)}, \qquad H(x, v) = \frac{\|v\|^2}{2} + U(x).$
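A minimal simulation sketch of the dynamics (1) for the constant-diffusion case shown on the next slide ($T = 1$, $U(x) = x^2/2$), whose invariant measure is the standard Gaussian in $(x, v)$; particle number, step size, and horizon are my own choices.

```python
import numpy as np

# Euler-Maruyama simulation of the dynamics (1) in the constant-diffusion case:
# T = 1, U(x) = x^2/2, so pi is the standard Gaussian in (x, v).
rng = np.random.default_rng(1)
n, dt, n_steps = 20_000, 1e-3, 10_000
x = np.full(n, 2.0)                     # start away from equilibrium
v = np.zeros(n)
for _ in range(n_steps):
    x, v = (x + v * dt,
            v + (-v - x) * dt + np.sqrt(2.0 * dt) * rng.normal(size=n))
print(f"Var(x) ~ {x.var():.3f}, Var(v) ~ {v.var():.3f}, Cov(x, v) ~ {np.cov(x, v)[0, 1]:+.3f}")
print("targets under pi:  Var(x) = Var(v) = 1, Cov(x, v) = 0")
```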
  24. Constant diffusion. Figure: smallest eigenvalue plotted over $(x_1, x_2)$; $T = 1$, $U(x) = x^2/2$. Left: $\beta = 0$ [Arnold-Erb]; right: $\beta = 0.1$; $z = (1, 0.1)^{\mathsf T}$.
  25. Variable diffusion. Figure: smallest eigenvalue plotted over $(x_1, x_2)$; $U(x) = \frac{x^c - x}{c(c-1)}$, $T(x) = \big(\nabla^2_x U(x)\big)^{-1}$, $c = 2.5$, $z = (1, 0.1)^{\mathsf T}$. Left: $\beta = 0$; right: $\beta = 0.6$.