130

# Information Newton flow: second order method in probability space

Markov chain Monte Carlo (MCMC) methods nowadays play essential roles in machine learning, Bayesian sampling problems, and inverse problems. To accelerate the MCMC methods, we formulate a high order optimization framework for accelerated them. It can be viewed as Newton’s flows in probability space with information metrics, named information Newton’s flows. Here two information metrics are considered, including both the Fisher-Rao metric and the Wasserstein-2 metric. Several examples of information Newton’s flows for learning objective/loss functions are provided, such as Kullback-Leibler (KL) divergence, Maximum mean discrepancy (MMD), and cross-entropy. The asymptotic convergence results of proposed Newton’s methods are provided. A known fact is that classical MCMC methods, such as overdamped Langevin dynamics, correspond to Wasserstein gradient flows of KL divergence. Extending this fact to Wasserstein Newton’s flows of KL divergence, we derive Newton’s Langevin dynamics. We provide examples of Newton’s Langevin dynamics in both one-dimensional space and Gaussian families. For the numerical implementation, we design sampling efficient variational methods to approximate Wasserstein Newton’s directions. Several numerical examples in Gaussian families and Bayesian logistic regression are shown to demonstrate the effectiveness of the proposed method.

## Wuchen Li

February 13, 2020

## Transcript

1. ### Information Newton ﬂow Wuchen Li This is a joint work

with Yifei Wang. 1

3. ### Bayesian inference A powerful tool in – Modeling complex data

– Quantifying uncertainty Of great interests in inverse problems, information science, physics and scientiﬁc computing Main problem – Given a prior distribution, generate samples from posterior distributions – Generate samples from an intractable distribution ρ∗(x) ∝ exp(−f(x)). 3
4. ### Application of Bayesian inference Figure: Prediction of future oil production.

Upper panels: typical draws from the prior (left) and posterior (right). Lower panels: uncertainty in oil production under the prior (left) and posterior (right). 4
5. ### Langevin MCMC How to generate these samples? Classical Langevin MCMC

algorithms evolves samples following over-damped Langevin dynamics (OLD) dXt = −∇f(Xt )dt + √ 2dBt , where Bt is the standard Brownian motion. Denote ρt as the density function of the distribution of Xt . The evolution of ρt satisﬁes the Fokker-Planck equation ∂t ρt = ∇ · (ρt ∇f) + ∆ρt . 5
6. ### Optimization in probability space Consider the optimization problem in probability

space min ρ∈P(Ω) E(ρ), where P(Ω) = ρ ∈ F(Ω): Ω ρdx = 1, ρ ≥ 0. KL divergence: E(ρ) = DKL (ρ ρ∗) = (f + log ρ)ρdx. 6
7. ### Information metrics in probability space Tangent space Tρ P(Ω) =

σ ∈ F(Ω) : σdx = 0. . Cotangent space T∗ ρ P(Ω) – Equivalent to F(Ω)/R Metric tensor G(ρ) : Tρ P(Ω) → T∗ ρ P(Ω) Metric: inner product in tangent space gρ (σ1 , σ2 ) = σ1 G(ρ)σ2 dx = Φ1 G(ρ)−1Φ2 dx, Here Φi is the solution to σi = G(ρ)−1Φi , i = 1, 2. 7
8. ### Example: Wasserstein metric Inverse of Wasserstein metric tensor GW (ρ)−1Φ

= −∇ · (ρ∇Φ), Φ ∈ T∗ ρ P(Ω). Wasserstein metric: for σ1 , σ2 ∈ Tρ P(Ω), gW ρ (σ1 , σ2 ) = ρ ∇Φ1 , ∇Φ2 dx, where Φi is the solution to σi = −∇ · (ρ∇Φi ), i = 1, 2. 8
9. ### Gradient ﬂows Gradient ﬂow for E(ρ) in (P(Ω), gρ )

∂t ρt = −G(ρt )−1 δE(ρt ) δρt . Example: Wasserstein gradient ﬂow of KL divergence ∂t ρt = − gradW DKL (ρt ρ∗) =GW (ρt )−1 δ δρt DKL (ρt ρ∗) =∇ · (ρt ∇(f + log ρt + 1)) =∇ · (ρt ∇f) + ∆ρt . – This is the Fokker Planck equation. – Langevin MCMC is a gradient descent method! 9
10. ### Lagrangian Langevin dynamics A perspective of ﬂuid dynamics Lagrangian formulation

of Wasserstein gradient ﬂow: Lagrangian Langevin Dynamics (LLD) dXt = −∇f(Xt )dt − ∇ log ρt (Xt )dt. The evolution of corresponding density follows the Wasserstein gradient ﬂow. – Langevin MCMC (unadjusted Langevin algorithm) — OLD – The Particle-based Variational Inference (ParVI) methods1 — LLD 1Chang Liu, Jingwei Zhuo, Pengyu Cheng, Ruiyi Zhang, and Jun Zhu. Understanding and accel- erating particle-based variational inference. In International Conference on Machine Learning, pages 40824092, 2019. 10
11. ### First-order sampling algorithm Various sampling methods are ﬁrst-order methods in

probability space based on various metrics. Stein variational gradient descent (Stein metric)1: ˙ Xt = (−K(Xt , y)∇f(y) + ∇x K(Xt , y))ρ(y)dy Ensemble Kalman sampling (Wasserstein-Kalman metric)2: ˙ Xt = −C(ρt )∇f(Xt ) + 2C(ρt ) ˙ Bt where C(ρ) = (x − m(ρ)) ⊗ (x − m(ρ))ρ(x)dx and m(ρ) = xρ(x)dx 1Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances in neural information processing systems, 2016. 2Alfredo Garbuno-Inigo, Franca Hoﬀmann, Wuchen Li, and Andrew M Stuart. Interacting langevin diﬀusions: Gradient structure and ensemble kalman sampler. arXiv preprint arXiv:1903.08866, 2019. 11
12. ### Overview of sampling algorithms Wasserstein metric + Gradient descent method

= Langevin MCMC Stein metric + Gradient descent method = Stein variational gradient descent Wasserstein-Kalman metric + Gradient descent method = Ensemble Kalman sampling Wasserstein metric + Newton’s method = ? 12
13. ### Goal: Newton’s method Fundamental second-order method to accelerate optimization computations

For optimization problems in probability space – What is the Newton’s method in probability space? – What is the ”Newton’s” Langevin dynamics? 13
14. ### Theorem (Wasserstein Newton’s ﬂow of KL divergence) For a density

ρ∗(x) ∝ exp(−f(x)) , where f is a given function, denote the KL divergence between ρ and ρ∗ by DKL (ρ ρ∗) = ρ log ρ e−f dx − log Z, where Z = exp(−f(x))dx. Then the Wasserstein Newton’s ﬂow of KL divergence follows ∂t ρt + ∇ · (ρt ∇ΦNewton t ) = 0, where ΦNewton t satisﬁes the following equation ∇2 : (ρt ∇2Φt ) − ∇ · (ρt ∇2f∇Φt ) − ∇ · (ρt ∇f) − ∆ρt = 0. 14
15. ### Newton’s Langevin dynamics Theorem Consider the Newton’s Langevin dynamics dXt

= ∇ΦNewton t (Xt )dt, where ΦNewton t (x) follows Wasserstein Newton’s direction equation: ∇2 : (ρt ∇2Φt ) − ∇ · (ρt ∇2f∇Φt ) − ∇ · (ρt ∇f) − ∆ρt = 0. Here X0 follows an initial distribution ρ0 and ρt is the distribution of Xt . Then, ρt satisﬁes Wasserstein Newton’s ﬂow with an initial value ρ0 = ρ0. 15

17. ### Literature reviews on second-order methods Optimization problems on Riemannian manifold1

Discrete probability simplex with Fisher-Rao metric and exponential family models2 Second-order methods for the Stein variational gradient descent direction3 4 Newton-type MCMC method (HAMCMC)5 1Steven T Smith. Optimization techniques on Riemannian manifolds. Fields institute communications, 1994. 2Luigi Malag´ o and Giovanni Pistone. Combinatorial optimization with information geometry: The newton method. Entropy, 2014. 3Gianluca Detommaso, Tiangang Cui, Youssef Marzouk, Alessio Spantini, and Robert Scheichl. A Stein variational Newton method. In Advances in Neural Information Processing Systems, 2018. 4Peng Chen, Keyi Wu, Joshua Chen, Tom OLeary-Roseberry, and Omar Ghattas. Projected Stein variational Newton: A fast and scalable bayesian inference method in high dimensions. In Advances in Neural Information Processing Systems, 2019. 5Umut Simsekli, Roland Badeau, Taylan Cemgil, and Ga¨ el Richard. Stochastic quasi-Newton Langevin Monte Carlo. In International Conference on Machine Learning (ICML), 2016. 17
18. ### Connections and diﬀerences with HAMCMC HAMCMC approximates the dynamics of

dXt = −(∇2f(Xt ))−1 (∇f(Xt ) + Γ(Xt )) dt + 2∇2f(Xt )−1dBt , where Γi (x) = j=1 ∂ ∂xj ∇2f(Xt ) −1 i,j . The evolution of ρt ∂t ρt = ∇ · (ρt vt ), where vt satisﬁes −∇ · (ρt ∇2fvt ) − ∇ · (ρt ∇f) − ∆ρt = 0. Wasserstein Newton’s direction equation ∇2 : (ρt ∇vNewton t ) − ∇ · (ρt ∇2fvNewton t ) − ∇ · (ρt ∇f) − ∆ρt = 0. vt = vNewton t ! 18
19. ### Proposition (Wasserstein Newton’s ﬂows in Gaussian families) Suppose that ρ0,

ρ∗ are Gaussian distributions with zero means and their covariance matrices are Σ0 and Σ∗. E(Σ) evaluates the KL divergence from ρ to ρ∗: E(Σ) = 1 2 tr(Σ(Σ∗)−1) − n − log det Σ(Σ∗)−1 . Let (Σt, St) satisfy ˙ Σt − 2(SΣt + ΣSt) = 0, 2ΣtSt(Σ∗)−1 + 2(Σ∗)−1StΣt + 4St = −(Σt(Σ∗)−1 + (Σ∗)−1Σt − 2I). with initial values Σt|t=0 = Σ0 and St|t=0 = 0. Thus, for any t ≥ 0, Σt is well-deﬁned and stays positive deﬁnite. We denote ρt(x) = (2π)−n/2 det(Σt) exp − 1 2 xT Σ−1 t x , Φt(x) = xT Stx + C(t), where C(t) = −t + 1 2 t 0 log det(Σs(Σ∗)−1)ds. Then, ρt and Φt follow the Wasserstein Newton’s ﬂow with initial values ρt|t=0 = ρ0 and Φt|t=0 = 0. 19
20. ### Proposition (Newton’s Langevin dynamics in 1D Gaussian families) Assume that

f(x) = (2Σ∗)−1(x − µ∗)2, where Σ∗ > 0 and µ∗ are given. Suppose that the particle system X0 follows the Gaussian distribution. Then Xt follows a Gaussian distribution with mean µt and variance Σt . The corresponding NLD satisﬁes dXt = Σ∗ − Σ Σ∗ + Σt Xt − 2Σ∗ Σ∗ + Σt µt + µ∗ dt. And the evolution of µt and Σt satisﬁes dµt = (−µt + µ∗)dt, dΣt = 2 Σ∗ − Σt Σ∗ + Σt Σt dt. The explicit solutions of µt and Σt satisfy µt = e−t(µ0 − µ∗) + µ∗, Σt = Σ∗ + (Σ0 − Σ∗)e−t e−2t(Σ0 − Σ∗)2 4Σ2 0 + 1 Σ0Σ∗ . 20
21. ### Dynamics Particle Mean and variance NLD dXt = Σ∗−Σt Σ∗+Σt

Xt − 2Σ∗ Σ∗+Σt µt + µ∗ dt µt = µ∗ + e−t(µ0 − µ∗) Σt−Σ∗ Σ0−Σ∗ = e−2t(Σ0−Σ∗) 2Σ0 +e−t e−2t(Σ0−Σ∗)2 4Σ2 0 + Σ∗ Σ0 OLD dXt = −(Σ∗)−1(Xt − µ∗)dt + √ 2dBt µt = µ∗ + e−(Σ∗)−1t(µ0 − µ∗) LLD dXt = −(Σ∗)−1(Xt − µ∗)dt + Σ−1 t (Xt − µt)dt Σt = Σ∗ + e−2(Σ∗)−1t(Σ0 − Σ∗) HAMCMC dXt = −(Xt − µ∗)dt + √ 2Σ∗dBt. µt = µ∗ + e−t(µ0 − µ∗) Σt = Σ∗ + e−2t(Σ0 − Σ∗) Table: Comparison among diﬀerent Langevin dynamics on 1D Gaussian family. 21
22. ### Information Newton’s method General update rule of Information Newton’s method

ρk+1 = Expρk (αk Φk ), HE (ρk )Φk + G(ρk )−1 δE δρk = 0. Expρk (·) is the exponential map at ρk . 22
23. ### Riemannian structure of probability space Deﬁne the distance D(ρ0 ,

ρ1 ) D(ρ0, ρ1)2 = inf ˆ ρs,s∈[0,1] 1 0 ∂s ˆ ρsG(ˆ ρs)−1∂s ˆ ρsdxds : ˆ ρs|s=0 = ρ0, ˆ ρs|s=1 = ρ1 . Denote the inner product on cotangent space T∗ ρ P(Ω) by Φ1 , Φ2 ρ = Φ1 G(ρ)−1Φ2 dx, Φ1 , Φ2 ∈ T∗ ρ P(Ω), and Φ 2 ρ = Φ, Φ ρ . 23
24. ### Parallelism and high-order derivative Deﬁnition (Parallelism) We say that τ

: T∗ ρ0 P(Ω) → T∗ ρ1 P(Ω) is a parallelism from ρ0 to ρ1 , if for all Φ1 , Φ2 ∈ Tρ0 P(Ω), it follows Φ1 , Φ2 ρ0 = τΦ1 , τΦ2 ρ1 . ∇nE(ρ) is a n-form on the cotangent space T∗ ρ P(Ω) ∇nE(ρ)(Φ1, . . . , Φn) = ∂ ∂s ∇n−1E(Expρ (sΦn))(τsΦ1, . . . , τsΦn−1) s=0 , where τs is the parallelism from ρ to Expρ (sΦn ). 24
25. ### Assumption for convergence analysis Assumption Assume that there exists ,

δ1 , δ2 , δ3 > 0, such that for all ρ satisfying D(ρ, ρ∗) < , it follows ∇2E(ρ)(Φ1 , Φ1 ) ≥ δ1 Φ1 2 ρ , ∇2E(ρ)(Φ1 , Φ1 ) ≤ δ2 Φ1 2 ρ , |∇3E(ρ)(Φ1 , Φ1 , Φ2 )| ≤ δ3 Φ1 2 ρ Φ2 ρ , holds for all Φ1 , Φ2 ∈ T∗ ρ P(Ω). 25
26. ### Convergence analysis Theorem Suppose that the assumption holds, ρk satisﬁes

D(ρk , ρ∗) < and the step size τk = 1. Then, we have D(ρk+1 , ρ∗) = O(D(ρk , ρ∗)2). 26
27. ### Sketch of proof Denote Tk = Exp−1 ρk (ρ∗). Proposition

Suppose that the assumption holds. Let τ be the parallelism from ρk to ρk+1 . There exists a unique Rk ∈ T∗ ρk P(Ω) such that Tk = τ−1Tk+1 + Φk + Rk . Then, we have Tk+1 ρk+1 ≤ δ3 δ1 Tk 2 ρk + δ2 δ1 Rk ρk . 27
28. ### Sketch of proof Lemma For all Ψ ∈ T ∗

ρk P(Ω), it follows ΨG(ρk )−1Rk dx = O( Ψ ρk Tk 2 ρk ). Taking Ψ = Rk in Lemma yields Rk ρk = O( Tk 2 ρk ). Because the geodesic curve has constant speed, Tk 2 ρk = D(ρk , ρ∗)2. As a result, we have D(ρk+1 , ρ∗) ≤ δ2 δ1 D(ρk , ρ∗)2 + δ3 δ1 Rk ρk = O(D(ρk , ρ∗)2). 28
29. ### Implementation in probability space The distribution {xi k }N i=1

follows ρk (x). Update each particle by xi k+1 = xi k + αk ∇Φk (xi k ), i = 1, 2 . . . N. Φk is the solution to Wasserstein Newton’s direction equation. How to compute Φk based on {xi k }N i=1 ? 29
30. ### Compute Wasserstein Newton’s direction Proposition Suppose that H : T∗

ρ P(Ω) → Tρ P is a linear self-adjoint operator and H is positive deﬁnite. Let u ∈ Tρ P. Then the minimizer of variational problem min Φ∈T ∗ ρ P(Ω) J(Φ) = (ΦHΦ − 2uΦ) dx, satisﬁes HΦ = u, where Φ ∈ T∗ ρ P(Ω). 30
31. ### Compute Wasserstein Newton’s direction For strongly convex f Equivalent to

optimizing the variational problem: min Φ∈T ∗ ρk P(Ω) J(Φ) = ∇2Φ 2 F + ∇Φ 2 ∇2f + 2 ∇f + ∇ log ρk, ∇Φ ρkdx. Here we denote v 2 A = vT Av. For general f, suppose that ∇2f(x) + I is strictly convex for x ∈ Ω. Consider a regularized problem min Φ∈T ∗ ρk P(Ω) J(Φ) = ∇2Φ 2 F + ∇Φ 2 ∇2f+ I + 2 ∇f + ∇ log ρk, ∇Φ ρkdx. 31
32. ### Aﬃne Wasserstein Newton’s method Φ(x) takes the form Φ(x) =

1 2 xSx + bT x, where S = diag(s) ∈ Rn×n is a diagonal matrix. The variational problem turns to be min S∈Sn,b∈Rn J(S, b) = tr(S2)+ 1 N N i=1 Sxi k + b 2 ∇2f(xi k )+ I + 2 Sxi k + b, vi k . 32
33. ### Aﬃne Wasserstein Newton’s method Rewrite the objective function to be

J(s, b) = s 2 + 1 N N i=1 tr((diag(xi k )s + b)T (∇2f(xi k ) + I)(diag(xi k )s + b)) + 2 diag(xi k )s + b, vi k = s b T Hk s b + 2 s b T uk. where Hk = I + 1 N N i=1 diag(xi k)(∇2f(xi k) + I) diag(xi k) 1 N N i=1 diag(xi k)(∇2f(xi k) + I) 1 N N i=1(∇2f(xi k) + I) diag(xi k) 1 N N i=1(∇2f(xi k) + I) , uk = 1 N N i=1 diag(xi k)vi k 1 N N i=1 vi k . 33
34. ### Algorithm 1 Aﬃne Wasserstein Newton’s method Require: initial positions {xi

0}N i=1, ≥ 0, step sizes αk, maximum iteration K. 1: Set k = 0. 2: while k < K and the convergence criterion is not met do 3: Compute vi k = ∇f(xi k) + ξk(xi k). Here ξk is an approximation of ∇ log ρk. 4: Calculate Hk by Hk = I + 1 N N i=1 diag(xi k)(∇2f(xi k) + I) diag(xi k) 1 N N i=1 diag(xi k)(∇2f(xi k) + I) 1 N N i=1(∇2f(xi k) + I) diag(xi k) 1 N N i=1(∇2f(xi k) + I) , and formulate uk by uk = 1 N N i=1 diag(xi k)vi k 1 N N i=1 vi k . 5: Compute sk and bk by sk bk = −(Hk) −1 uk. 6: Update particle positions by x i k+1 = x i k + αk(diag(sk)x i k + bk). 7: Set k = k + 1. 8: end while 34
35. ### Modiﬁed aﬃne Wasserstein Newton’s method What about symmetric matrix S?

Too diﬃcult! Consider the modiﬁed problem: min S∈Sn,b∈Rn ˆ J(S, b) = tr(S2)+ 1 N N i=1 Sxi k + b 2 Fk + 2 Sxi k + b, vi k . Here Fk = 1 N N i=1 ∇2f(xi k ) + I. The optimal solution follows bk = − 1 N N i=1 (Sk xi k + F−1 k vi k ), and Sk is the solution to the quadratic matrix optimization problem min S∈Sn 2 tr(S2) + tr(S ¯ Σk SFk ) + tr(S ¯ Fk S ¯ Σk ) − 2 tr( ¯ Tk S). Here ¯ Σk is an empirical covariance matrix of xi k and ¯ Tk is an empirical covariance matrix between xi k and vi k . 35
36. ### Algorithm 2 Modiﬁed aﬃne Wasserstein Newton’s method Require: initial positions

{xi 1 }N i=1 , ≥ 0, step sizes αk, maximum iteration K. 1: Set k = 1 and S0 = I. 2: while k ≤ K and the convergence criterion is not met do 3: Calculate vi k = ∇f(xi k ) + ξk(xi k ). Here ξk is an approximation of ∇ log ρk. 4: Compute the averaged Hessian with regularization by Fk = I + 1 N N i=1 ∇2f(xi k ). 5: Compute the means ¯ xk = 1 N N i=1 xi k and ¯ vk = 1 N N i=1 vi k . 6: Let ˜ xi k = xi k − ¯ xk and ˜ vi k = vi k − ¯ vk. 7: Compute ¯ Σk = 1 N N i=1 ˜ xi k (˜ xi k )T and ¯ Tk = 1 2N N i=1 ˜ vi k (˜ xi k )T + ˜ xi k (˜ vi k )T . 8: Solve Sk from the problem via the CG method using the initial guess Sk−1. 9: Update particle positions by xi k+1 = xi k + αk(Sk ˜ xi k − F−1 k ¯ vk). 10: end while 36
37. ### Hybrid method Overdamped Langevin dynamics as gradient direction: xi k+1

= xi k + αk (Sk xi k + bk ) + 2λk αk Bk , where Bk ∼ N(0, I). Lagrangian Langevin dynamics as gradient direction: xi k+1 = xi k + αk (Sk xi k + bk − λk vi k ). Hybrid Langvien dynamics dXt = (∇Φ −Newton t − λt ∇f)dt + 2λt dBt , 37

40. ### Gaussian families Figure: Comparison among WNewton, WGF, AIG and HALLD

in Gaussian families. The conditional number κ = 2 × 104. For Newton and HALLD 0.5, the markers are marked for every 5 iterations. 40
41. ### Bayesian logistic regression Figure: Comparison of diﬀerent methods on Bayesian

logistic regression, averaged over 10 independent trials. The shaded areas show the variance over 10 trials. Left: Test accuracy; Right: Test log-likelihood. 41
42. ### Conclusion and discussion Design high-order optimization methods for Bayesian sampling,

machine learning and inverse problems. High-order derivatives from information metrics are very useful! Sampling eﬃcient Quasi-Newton’s method for information metrics Other eﬃcient method to approximate the Wasserstein Newton’s direction Information Newton’s ﬂow in probability models 42