Slide 1

Information Newton flow
Wuchen Li
Joint work with Yifei Wang.

Slide 2

Distance, divergence and optimization

Slide 3

Bayesian inference

A powerful tool for
– Modeling complex data
– Quantifying uncertainty

Of great interest in inverse problems, information science, physics and scientific computing.

Main problem
– Given a prior distribution, generate samples from the posterior distribution.
– Generate samples from an intractable distribution ρ*(x) ∝ exp(−f(x)).

Slide 4

Application of Bayesian inference

Figure: Prediction of future oil production. Upper panels: typical draws from the prior (left) and posterior (right). Lower panels: uncertainty in oil production under the prior (left) and posterior (right).

Slide 5

Langevin MCMC

How to generate these samples? Classical Langevin MCMC algorithms evolve samples according to the over-damped Langevin dynamics (OLD)

    dX_t = -\nabla f(X_t)\,dt + \sqrt{2}\,dB_t,

where B_t is a standard Brownian motion. Denote by ρ_t the density of the distribution of X_t. The evolution of ρ_t satisfies the Fokker-Planck equation

    \partial_t \rho_t = \nabla\cdot(\rho_t \nabla f) + \Delta \rho_t.
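To make the time discretization concrete, here is a minimal sketch of the unadjusted Langevin algorithm for a generic target exp(−f). The quadratic potential, step size and particle count below are illustrative assumptions, not choices made on the slides.

```python
import numpy as np

def grad_f(x):
    # Hypothetical potential f(x) = 0.5 * ||x||^2, so rho*(x) ∝ exp(-f(x)) is standard Gaussian.
    return x

def ula_sample(n_particles=1000, dim=2, step=1e-2, n_iters=2000, seed=0):
    """Unadjusted Langevin algorithm: Euler-Maruyama discretization of the OLD."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_particles, dim))          # initial samples
    for _ in range(n_iters):
        noise = rng.normal(size=x.shape)
        x = x - step * grad_f(x) + np.sqrt(2.0 * step) * noise
    return x

samples = ula_sample()
print(samples.mean(axis=0), samples.var(axis=0))     # roughly 0 and 1 for the Gaussian target
```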

Slide 6

Optimization in probability space

Consider the optimization problem in probability space

    \min_{\rho \in \mathcal{P}(\Omega)} E(\rho),

where

    \mathcal{P}(\Omega) = \Big\{ \rho \in \mathcal{F}(\Omega) : \int_\Omega \rho\,dx = 1,\ \rho \ge 0 \Big\}.

KL divergence:

    E(\rho) = D_{\mathrm{KL}}(\rho\,\|\,\rho^*) = \int (f + \log\rho)\,\rho\,dx.
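As a quick check of this identity (a worked expansion added here, not on the original slide): writing ρ*(x) = exp(−f(x))/Z shows that the energy above equals the KL divergence up to the additive constant log Z, which does not affect the minimizer.

```latex
\begin{aligned}
D_{\mathrm{KL}}(\rho\,\|\,\rho^*)
&= \int \rho \log\frac{\rho}{\rho^*}\,dx
 = \int \rho \big(\log\rho + f + \log Z\big)\,dx \\
&= \int (f + \log\rho)\,\rho\,dx + \log Z,
\qquad Z = \int e^{-f(x)}\,dx .
\end{aligned}
```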

Slide 7

Information metrics in probability space

Tangent space:

    T_\rho\mathcal{P}(\Omega) = \Big\{ \sigma \in \mathcal{F}(\Omega) : \int \sigma\,dx = 0 \Big\}.

Cotangent space T^*_ρP(Ω):
– Equivalent to \mathcal{F}(\Omega)/\mathbb{R}.

Metric tensor G(ρ) : T_ρP(Ω) → T^*_ρP(Ω).

Metric: inner product on the tangent space

    g_\rho(\sigma_1, \sigma_2) = \int \sigma_1\, G(\rho)\,\sigma_2\,dx = \int \Phi_1\, G(\rho)^{-1}\Phi_2\,dx,

where Φ_i is the solution to σ_i = G(ρ)^{-1}Φ_i, i = 1, 2.

Slide 8

Example: Wasserstein metric

Inverse of the Wasserstein metric tensor:

    G_W(\rho)^{-1}\Phi = -\nabla\cdot(\rho\nabla\Phi), \qquad \Phi \in T^*_\rho\mathcal{P}(\Omega).

Wasserstein metric: for σ_1, σ_2 ∈ T_ρP(Ω),

    g^W_\rho(\sigma_1, \sigma_2) = \int \rho\,\langle\nabla\Phi_1, \nabla\Phi_2\rangle\,dx,

where Φ_i is the solution to σ_i = −∇·(ρ∇Φ_i), i = 1, 2.
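A one-line check, added here for completeness and assuming suitable boundary conditions on Ω: integrating by parts shows that this expression agrees with the general metric formula of the previous slide.

```latex
g^W_\rho(\sigma_1,\sigma_2)
= \int \Phi_1\, G_W(\rho)^{-1}\Phi_2\,dx
= \int \Phi_1\,\big(-\nabla\cdot(\rho\nabla\Phi_2)\big)\,dx
= \int \rho\,\langle\nabla\Phi_1,\nabla\Phi_2\rangle\,dx .
```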

Slide 9

Gradient flows

Gradient flow of E(ρ) in (P(Ω), g_ρ):

    \partial_t \rho_t = -G(\rho_t)^{-1}\,\frac{\delta E(\rho_t)}{\delta\rho_t}.

Example: Wasserstein gradient flow of the KL divergence

    \partial_t \rho_t
    = -\operatorname{grad}_W D_{\mathrm{KL}}(\rho_t\,\|\,\rho^*)
    = -G_W(\rho_t)^{-1}\frac{\delta}{\delta\rho_t} D_{\mathrm{KL}}(\rho_t\,\|\,\rho^*)
    = \nabla\cdot\big(\rho_t\nabla(f + \log\rho_t + 1)\big)
    = \nabla\cdot(\rho_t\nabla f) + \Delta\rho_t.

– This is the Fokker-Planck equation.
– Langevin MCMC is a gradient descent method!

Slide 10

Lagrangian Langevin dynamics

A fluid-dynamics perspective. Lagrangian formulation of the Wasserstein gradient flow: the Lagrangian Langevin dynamics (LLD)

    dX_t = -\nabla f(X_t)\,dt - \nabla\log\rho_t(X_t)\,dt.

The corresponding density evolves by the Wasserstein gradient flow. (A particle-level sketch follows below.)

– Langevin MCMC (unadjusted Langevin algorithm) — OLD
– Particle-based variational inference (ParVI) methods [1] — LLD

[1] Chang Liu, Jingwei Zhuo, Pengyu Cheng, Ruiyi Zhang, and Jun Zhu. Understanding and accelerating particle-based variational inference. In International Conference on Machine Learning, pages 4082–4092, 2019.
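A minimal particle sketch of the LLD, under explicit assumptions: the target is a hypothetical standard Gaussian, and ∇log ρ_t is approximated with a plain Gaussian kernel density estimate whose bandwidth is picked by hand. The ParVI methods in the reference above use more careful score/kernel estimators; this is only meant to show the deterministic structure of the update.

```python
import numpy as np

def grad_f(x):
    # Hypothetical target: f(x) = 0.5 * ||x||^2 (standard Gaussian), for illustration only.
    return x

def kde_score(x, bandwidth=0.3):
    """Estimate grad log rho at each particle with a Gaussian kernel density estimate."""
    diff = x[:, None, :] - x[None, :, :]                      # (N, N, d): x_i - x_j
    sq = np.sum(diff**2, axis=-1)
    w = np.exp(-sq / (2.0 * bandwidth**2))                    # kernel weights
    dens = w.sum(axis=1)                                      # unnormalized density at x_i
    grad_dens = -(w[:, :, None] * diff).sum(axis=1) / bandwidth**2
    return grad_dens / dens[:, None]                          # grad log rho ≈ grad rho / rho

def lld_step(x, step=5e-2):
    """One explicit Euler step of the (deterministic) Lagrangian Langevin dynamics."""
    return x - step * (grad_f(x) + kde_score(x))

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, size=(300, 2))    # particles started away from the target
for _ in range(400):
    x = lld_step(x)
print(x.mean(axis=0), x.var(axis=0))
```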

Slide 11

First-order sampling algorithms

Various sampling methods are first-order methods in probability space, based on different metrics.

Stein variational gradient descent (Stein metric) [1]:

    \dot X_t = \int \big(-K(X_t, y)\nabla f(y) + \nabla_x K(X_t, y)\big)\rho(y)\,dy.

Ensemble Kalman sampling (Wasserstein-Kalman metric) [2]:

    \dot X_t = -C(\rho_t)\nabla f(X_t) + \sqrt{2C(\rho_t)}\,\dot B_t,

where C(ρ) = ∫ (x − m(ρ)) ⊗ (x − m(ρ)) ρ(x)dx and m(ρ) = ∫ x ρ(x)dx. (A short SVGD sketch follows below.)

[1] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, 2016.
[2] Alfredo Garbuno-Inigo, Franca Hoffmann, Wuchen Li, and Andrew M. Stuart. Interacting Langevin diffusions: Gradient structure and ensemble Kalman sampler. arXiv preprint arXiv:1903.08866, 2019.
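A minimal sketch of the standard RBF-kernel SVGD update of Liu and Wang, with a toy Gaussian target; the fixed bandwidth and step size are assumptions (the reference uses a median-heuristic bandwidth and adaptive step sizes).

```python
import numpy as np

def grad_f(x):
    # Hypothetical target: f(x) = 0.5 * ||x||^2 (standard Gaussian), for illustration only.
    return x

def svgd_step(x, step=1e-1, bandwidth=0.5):
    """One SVGD update with an RBF kernel: attraction along -grad f plus kernel repulsion."""
    diff = x[:, None, :] - x[None, :, :]                   # (N, N, d): x_i - x_j
    sq = np.sum(diff**2, axis=-1)
    k = np.exp(-sq / (2.0 * bandwidth**2))                 # K(x_i, x_j)
    attract = -k @ grad_f(x)                               # sum_j K(x_i, x_j) * (-grad f(x_j))
    repulse = (k[:, :, None] * diff).sum(axis=1) / bandwidth**2   # sum_j grad_{x_j} K(x_j, x_i)
    return x + step * (attract + repulse) / x.shape[0]

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, size=(200, 2))
for _ in range(500):
    x = svgd_step(x)
print(x.mean(axis=0), x.var(axis=0))
```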

Slide 12

Overview of sampling algorithms

Wasserstein metric + gradient descent method = Langevin MCMC
Stein metric + gradient descent method = Stein variational gradient descent
Wasserstein-Kalman metric + gradient descent method = Ensemble Kalman sampling
Wasserstein metric + Newton's method = ?

Slide 13

Goal: Newton's method

A fundamental second-order method to accelerate optimization computations.

For optimization problems in probability space:
– What is Newton's method in probability space?
– What is the "Newton's" Langevin dynamics?

Slide 14

Theorem (Wasserstein Newton's flow of KL divergence)

For a density ρ*(x) ∝ exp(−f(x)), where f is a given function, denote the KL divergence between ρ and ρ* by

    D_{\mathrm{KL}}(\rho\,\|\,\rho^*) = \int \rho \log\frac{\rho}{e^{-f}}\,dx - \log Z,
    \qquad Z = \int \exp(-f(x))\,dx.

Then the Wasserstein Newton's flow of the KL divergence follows

    \partial_t\rho_t + \nabla\cdot(\rho_t\nabla\Phi_t^{\mathrm{Newton}}) = 0,

where \Phi_t^{\mathrm{Newton}} satisfies the equation

    \nabla^2 : (\rho_t\nabla^2\Phi_t) - \nabla\cdot(\rho_t\nabla^2 f\,\nabla\Phi_t)
    - \nabla\cdot(\rho_t\nabla f) - \Delta\rho_t = 0.
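One way to read this equation, added here as a reasoning aid using only objects already on these slides: the last two terms are (minus) the Wasserstein gradient of the KL divergence from Slide 9, while the first two terms are the Wasserstein Hessian applied to Φ_t, whose quadratic form reappears in the variational problem of Slide 31. The equation is therefore Newton's condition Hess·Φ = −grad:

```latex
\underbrace{\nabla^2:(\rho_t\nabla^2\Phi_t) - \nabla\cdot(\rho_t\nabla^2 f\,\nabla\Phi_t)}_{\text{Hessian of } D_{\mathrm{KL}} \text{ applied to } \Phi_t}
\;=\;
-\underbrace{\big(-\nabla\cdot(\rho_t\nabla f) - \Delta\rho_t\big)}_{\operatorname{grad}_W D_{\mathrm{KL}}(\rho_t\,\|\,\rho^*)} .
```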

Slide 15

Newton's Langevin dynamics

Theorem. Consider the Newton's Langevin dynamics

    dX_t = \nabla\Phi_t^{\mathrm{Newton}}(X_t)\,dt,

where \Phi_t^{\mathrm{Newton}}(x) follows the Wasserstein Newton's direction equation

    \nabla^2 : (\rho_t\nabla^2\Phi_t) - \nabla\cdot(\rho_t\nabla^2 f\,\nabla\Phi_t)
    - \nabla\cdot(\rho_t\nabla f) - \Delta\rho_t = 0.

Here X_0 follows an initial distribution ρ_0 and ρ_t is the distribution of X_t. Then ρ_t satisfies the Wasserstein Newton's flow with initial value ρ_0.

Slide 16


Slide 17

Literature review on second-order methods

Optimization problems on Riemannian manifolds [1]
Discrete probability simplex with Fisher-Rao metric and exponential family models [2]
Second-order methods for the Stein variational gradient descent direction [3][4]
Newton-type MCMC method (HAMCMC) [5]

[1] Steven T. Smith. Optimization techniques on Riemannian manifolds. Fields Institute Communications, 1994.
[2] Luigi Malagò and Giovanni Pistone. Combinatorial optimization with information geometry: The Newton method. Entropy, 2014.
[3] Gianluca Detommaso, Tiangang Cui, Youssef Marzouk, Alessio Spantini, and Robert Scheichl. A Stein variational Newton method. In Advances in Neural Information Processing Systems, 2018.
[4] Peng Chen, Keyi Wu, Joshua Chen, Tom O'Leary-Roseberry, and Omar Ghattas. Projected Stein variational Newton: A fast and scalable Bayesian inference method in high dimensions. In Advances in Neural Information Processing Systems, 2019.
[5] Umut Simsekli, Roland Badeau, Taylan Cemgil, and Gaël Richard. Stochastic quasi-Newton Langevin Monte Carlo. In International Conference on Machine Learning (ICML), 2016.

Slide 18

Connections and differences with HAMCMC

HAMCMC approximates the dynamics

    dX_t = -(\nabla^2 f(X_t))^{-1}\big(\nabla f(X_t) + \Gamma(X_t)\big)\,dt
           + \sqrt{2(\nabla^2 f(X_t))^{-1}}\,dB_t,

where \Gamma_i(x) = \sum_{j=1}^n \frac{\partial}{\partial x_j}\big[(\nabla^2 f(x))^{-1}\big]_{i,j}.

The evolution of ρ_t:

    \partial_t\rho_t = \nabla\cdot(\rho_t v_t),

where v_t satisfies

    -\nabla\cdot(\rho_t\nabla^2 f\,v_t) - \nabla\cdot(\rho_t\nabla f) - \Delta\rho_t = 0.

Wasserstein Newton's direction equation:

    \nabla^2:(\rho_t\nabla v_t^{\mathrm{Newton}}) - \nabla\cdot(\rho_t\nabla^2 f\,v_t^{\mathrm{Newton}})
    - \nabla\cdot(\rho_t\nabla f) - \Delta\rho_t = 0.

In general, v_t ≠ v_t^{Newton}!

Slide 19

Proposition (Wasserstein Newton's flows in Gaussian families)

Suppose that ρ_0, ρ* are Gaussian distributions with zero means and covariance matrices Σ_0 and Σ*. The KL divergence from ρ to ρ* evaluates to

    E(\Sigma) = \tfrac12\Big(\operatorname{tr}\big(\Sigma(\Sigma^*)^{-1}\big) - n - \log\det\big(\Sigma(\Sigma^*)^{-1}\big)\Big).

Let (Σ_t, S_t) satisfy

    \dot\Sigma_t - 2(S_t\Sigma_t + \Sigma_t S_t) = 0,
    \qquad
    2\Sigma_t S_t(\Sigma^*)^{-1} + 2(\Sigma^*)^{-1} S_t\Sigma_t + 4S_t
      = -\big(\Sigma_t(\Sigma^*)^{-1} + (\Sigma^*)^{-1}\Sigma_t - 2I\big),

with initial values Σ_t|_{t=0} = Σ_0 and S_t|_{t=0} = 0. Then, for any t ≥ 0, Σ_t is well-defined and stays positive definite. Denote

    \rho_t(x) = (2\pi)^{-n/2}\det(\Sigma_t)^{-1/2}\exp\big(-\tfrac12 x^T\Sigma_t^{-1}x\big),
    \qquad
    \Phi_t(x) = x^T S_t x + C(t),

where C(t) = -t + \tfrac12\int_0^t \log\det\big(\Sigma_s(\Sigma^*)^{-1}\big)\,ds. Then ρ_t and Φ_t follow the Wasserstein Newton's flow with initial values ρ_t|_{t=0} = ρ_0 and Φ_t|_{t=0} = 0.
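A numerical sketch of this finite-dimensional system, under stated assumptions: the linear matrix equation for S_t is solved at each step by vectorizing it with Kronecker products, and Σ_t is advanced by forward Euler. The function name, step size, and the 2×2 example matrices are illustrative, not from the slides.

```python
import numpy as np

def newton_flow_gaussian(Sigma0, SigmaStar, h=1e-3, n_steps=5000):
    """Forward-Euler integration of the (Sigma_t, S_t) system for zero-mean Gaussians."""
    n = Sigma0.shape[0]
    Sigma = Sigma0.copy()
    P = np.linalg.inv(SigmaStar)                     # (Sigma*)^{-1}
    I = np.eye(n)
    for _ in range(n_steps):
        # Solve 2 Sigma S P + 2 P S Sigma + 4 S = -(Sigma P + P Sigma - 2 I) for S.
        A = 2.0 * np.kron(P, Sigma) + 2.0 * np.kron(Sigma, P) + 4.0 * np.eye(n * n)
        rhs = -(Sigma @ P + P @ Sigma - 2.0 * I)
        S = np.linalg.solve(A, rhs.reshape(-1)).reshape(n, n)
        S = 0.5 * (S + S.T)                          # keep S symmetric against round-off
        Sigma = Sigma + h * 2.0 * (S @ Sigma + Sigma @ S)
    return Sigma

Sigma0 = np.diag([4.0, 0.5])
SigmaStar = np.diag([1.0, 2.0])
print(newton_flow_gaussian(Sigma0, SigmaStar))       # should approach SigmaStar
```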

Slide 20

Proposition (Newton's Langevin dynamics in 1D Gaussian families)

Assume that f(x) = (2Σ*)^{-1}(x − μ*)², where Σ* > 0 and μ* are given. Suppose that X_0 follows a Gaussian distribution with mean μ_0 and variance Σ_0. Then X_t follows a Gaussian distribution with mean μ_t and variance Σ_t. The corresponding NLD satisfies

    dX_t = \Big(\frac{\Sigma^* - \Sigma_t}{\Sigma^* + \Sigma_t}\,X_t
              - \frac{2\Sigma^*}{\Sigma^* + \Sigma_t}\,\mu_t + \mu^*\Big)\,dt,

and the evolution of μ_t and Σ_t satisfies

    d\mu_t = (-\mu_t + \mu^*)\,dt,
    \qquad
    d\Sigma_t = 2\,\frac{\Sigma^* - \Sigma_t}{\Sigma^* + \Sigma_t}\,\Sigma_t\,dt.

The explicit solutions are

    \mu_t = e^{-t}(\mu_0 - \mu^*) + \mu^*,
    \qquad
    \Sigma_t = \Sigma^* + (\Sigma_0 - \Sigma^*)\Big[\frac{e^{-2t}(\Sigma_0 - \Sigma^*)}{2\Sigma_0}
        + e^{-t}\sqrt{\frac{e^{-2t}(\Sigma_0 - \Sigma^*)^2}{4\Sigma_0^2} + \frac{\Sigma^*}{\Sigma_0}}\Big].
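A quick numerical sanity check of these formulas (added here; the parameter values and step size are arbitrary): integrate the mean/variance ODEs with forward Euler and compare against the closed-form expressions.

```python
import numpy as np

mu_star, Sigma_star = 1.0, 2.0        # target mean and variance (illustrative values)
mu0, Sigma0 = -3.0, 0.25              # initial mean and variance

def closed_form(t):
    mu = np.exp(-t) * (mu0 - mu_star) + mu_star
    d = Sigma0 - Sigma_star
    Sigma = Sigma_star + d * (np.exp(-2 * t) * d / (2 * Sigma0)
            + np.exp(-t) * np.sqrt(np.exp(-2 * t) * d**2 / (4 * Sigma0**2) + Sigma_star / Sigma0))
    return mu, Sigma

# Forward Euler for d mu = (-mu + mu*) dt and d Sigma = 2 (Sigma* - Sigma)/(Sigma* + Sigma) Sigma dt
h, T = 1e-4, 3.0
mu, Sigma = mu0, Sigma0
for _ in range(int(T / h)):
    mu += h * (-mu + mu_star)
    Sigma += h * 2.0 * (Sigma_star - Sigma) / (Sigma_star + Sigma) * Sigma
print("Euler:      ", mu, Sigma)
print("Closed form:", *closed_form(T))   # the two lines should agree closely
```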

Slide 21

Table: Comparison among different Langevin dynamics on a 1D Gaussian family.

NLD
    Particle: dX_t = ( (Σ* − Σ_t)/(Σ* + Σ_t) · X_t − 2Σ*/(Σ* + Σ_t) · μ_t + μ* ) dt
    Mean and variance: μ_t = μ* + e^{-t}(μ_0 − μ*);
        (Σ_t − Σ*)/(Σ_0 − Σ*) = e^{-2t}(Σ_0 − Σ*)/(2Σ_0) + e^{-t} √( e^{-2t}(Σ_0 − Σ*)²/(4Σ_0²) + Σ*/Σ_0 )

OLD
    Particle: dX_t = −(Σ*)^{-1}(X_t − μ*) dt + √2 dB_t
    Mean and variance: μ_t = μ* + e^{-(Σ*)^{-1} t}(μ_0 − μ*); Σ_t = Σ* + e^{-2(Σ*)^{-1} t}(Σ_0 − Σ*)

LLD
    Particle: dX_t = −(Σ*)^{-1}(X_t − μ*) dt + Σ_t^{-1}(X_t − μ_t) dt
    Mean and variance: same as OLD (LLD shares the Fokker-Planck equation of OLD)

HAMCMC
    Particle: dX_t = −(X_t − μ*) dt + √(2Σ*) dB_t
    Mean and variance: μ_t = μ* + e^{-t}(μ_0 − μ*); Σ_t = Σ* + e^{-2t}(Σ_0 − Σ*)

Slide 22

Information Newton's method

General update rule of the information Newton's method:

    \rho_{k+1} = \operatorname{Exp}_{\rho_k}(\alpha_k\Phi_k),
    \qquad
    H_E(\rho_k)\Phi_k + G(\rho_k)^{-1}\frac{\delta E}{\delta\rho_k} = 0.

Here Exp_{ρ_k}(·) is the exponential map at ρ_k and H_E(ρ_k) is the Hessian operator of E at ρ_k.

Slide 23

Riemannian structure of the probability space

Define the distance D(ρ_0, ρ_1) by

    D(\rho_0,\rho_1)^2 = \inf_{\hat\rho_s,\, s\in[0,1]}
      \Big\{ \int_0^1\!\!\int \partial_s\hat\rho_s\, G(\hat\rho_s)\,\partial_s\hat\rho_s\,dx\,ds :
      \hat\rho_s|_{s=0} = \rho_0,\ \hat\rho_s|_{s=1} = \rho_1 \Big\}.

Denote the inner product on the cotangent space T^*_ρP(Ω) by

    \langle\Phi_1, \Phi_2\rangle_\rho = \int \Phi_1\, G(\rho)^{-1}\Phi_2\,dx,
    \qquad \Phi_1, \Phi_2 \in T^*_\rho\mathcal{P}(\Omega),

and \|\Phi\|_\rho^2 = \langle\Phi, \Phi\rangle_\rho.

Slide 24

Parallelism and high-order derivatives

Definition (Parallelism). We say that τ : T^*_{ρ_0}P(Ω) → T^*_{ρ_1}P(Ω) is a parallelism from ρ_0 to ρ_1 if, for all Φ_1, Φ_2 ∈ T^*_{ρ_0}P(Ω),

    \langle\Phi_1, \Phi_2\rangle_{\rho_0} = \langle\tau\Phi_1, \tau\Phi_2\rangle_{\rho_1}.

∇^n E(ρ) is an n-form on the cotangent space T^*_ρP(Ω):

    \nabla^n E(\rho)(\Phi_1,\dots,\Phi_n)
    = \frac{\partial}{\partial s}\Big[\nabla^{n-1}E\big(\operatorname{Exp}_\rho(s\Phi_n)\big)(\tau_s\Phi_1,\dots,\tau_s\Phi_{n-1})\Big]\Big|_{s=0},

where τ_s is the parallelism from ρ to Exp_ρ(sΦ_n).

Slide 25

Assumption for the convergence analysis

Assumption. There exist ε, δ_1, δ_2, δ_3 > 0 such that for all ρ satisfying D(ρ, ρ*) < ε,

    \delta_1\|\Phi_1\|_\rho^2 \le \nabla^2 E(\rho)(\Phi_1, \Phi_1) \le \delta_2\|\Phi_1\|_\rho^2,
    \qquad
    |\nabla^3 E(\rho)(\Phi_1, \Phi_1, \Phi_2)| \le \delta_3\|\Phi_1\|_\rho^2\,\|\Phi_2\|_\rho,

for all Φ_1, Φ_2 ∈ T^*_ρP(Ω).

Slide 26

Convergence analysis

Theorem. Suppose that the assumption holds, ρ_k satisfies D(ρ_k, ρ*) < ε and the step size is α_k = 1. Then we have

    D(\rho_{k+1}, \rho^*) = O\big(D(\rho_k, \rho^*)^2\big).

Slide 27

Sketch of proof

Denote T_k = Exp^{-1}_{ρ_k}(ρ*).

Proposition. Suppose that the assumption holds. Let τ be the parallelism from ρ_k to ρ_{k+1}. There exists a unique R_k ∈ T^*_{ρ_k}P(Ω) such that

    T_k = \tau^{-1}T_{k+1} + \Phi_k + R_k.

Then we have

    \|T_{k+1}\|_{\rho_{k+1}} \le \frac{\delta_3}{\delta_1}\|T_k\|_{\rho_k}^2 + \frac{\delta_2}{\delta_1}\|R_k\|_{\rho_k}.

Slide 28

Sketch of proof

Lemma. For all Ψ ∈ T^*_{ρ_k}P(Ω), it follows that

    \int \Psi\, G(\rho_k)^{-1}R_k\,dx = O\big(\|\Psi\|_{\rho_k}\,\|T_k\|_{\rho_k}^2\big).

Taking Ψ = R_k in the lemma yields \|R_k\|_{\rho_k} = O(\|T_k\|_{\rho_k}^2). Because the geodesic curve has constant speed, \|T_k\|_{\rho_k}^2 = D(\rho_k, \rho^*)^2. As a result, we have

    D(\rho_{k+1}, \rho^*) \le \frac{\delta_3}{\delta_1}D(\rho_k, \rho^*)^2 + \frac{\delta_2}{\delta_1}\|R_k\|_{\rho_k} = O\big(D(\rho_k, \rho^*)^2\big).

Slide 29

Implementation in probability space

The particles {x_k^i}_{i=1}^N follow the distribution ρ_k(x). Update each particle by

    x_{k+1}^i = x_k^i + \alpha_k\nabla\Phi_k(x_k^i), \qquad i = 1, 2, \dots, N,

where Φ_k is the solution to the Wasserstein Newton's direction equation. How to compute Φ_k based on {x_k^i}_{i=1}^N?

Slide 30

Compute the Wasserstein Newton's direction

Proposition. Suppose that H : T^*_ρP(Ω) → T_ρP(Ω) is a linear, self-adjoint and positive definite operator, and let u ∈ T_ρP(Ω). Then the minimizer of the variational problem

    \min_{\Phi\in T^*_\rho\mathcal{P}(\Omega)} J(\Phi) = \int \big(\Phi H\Phi - 2u\Phi\big)\,dx

satisfies HΦ = u, where Φ ∈ T^*_ρP(Ω).

Slide 31

Compute the Wasserstein Newton's direction

For strongly convex f, this is equivalent to the variational problem

    \min_{\Phi\in T^*_{\rho_k}\mathcal{P}(\Omega)} J(\Phi)
    = \int \Big( \|\nabla^2\Phi\|_F^2 + \|\nabla\Phi\|_{\nabla^2 f}^2
      + 2\langle\nabla f + \nabla\log\rho_k,\ \nabla\Phi\rangle \Big)\rho_k\,dx.

Here \|v\|_A^2 = v^T A v. For general f, suppose that ∇²f(x) + εI is positive definite for x ∈ Ω, and consider the regularized problem

    \min_{\Phi\in T^*_{\rho_k}\mathcal{P}(\Omega)} J(\Phi)
    = \int \Big( \|\nabla^2\Phi\|_F^2 + \|\nabla\Phi\|_{\nabla^2 f + \varepsilon I}^2
      + 2\langle\nabla f + \nabla\log\rho_k,\ \nabla\Phi\rangle \Big)\rho_k\,dx.

Slide 32

Affine Wasserstein Newton's method

Restrict Φ(x) to the affine form

    \Phi(x) = \tfrac12 x^T S x + b^T x,

where S = diag(s) ∈ R^{n×n} is a diagonal matrix. The variational problem becomes

    \min_{S=\operatorname{diag}(s),\, b\in\mathbb{R}^n} J(S, b)
    = \operatorname{tr}(S^2) + \frac1N\sum_{i=1}^N \Big( \|S x_k^i + b\|_{\nabla^2 f(x_k^i)+\varepsilon I}^2
      + 2\langle S x_k^i + b,\ v_k^i\rangle \Big),

where v_k^i approximates ∇f(x_k^i) + ∇log ρ_k(x_k^i) (see Algorithm 1 below).

Slide 33

Affine Wasserstein Newton's method

Rewrite the objective function as

    J(s, b) = \|s\|^2 + \frac1N\sum_{i=1}^N \Big(
        (\operatorname{diag}(x_k^i)s + b)^T(\nabla^2 f(x_k^i)+\varepsilon I)(\operatorname{diag}(x_k^i)s + b)
        + 2\big\langle \operatorname{diag}(x_k^i)s + b,\ v_k^i\big\rangle \Big)
      = \begin{pmatrix} s \\ b\end{pmatrix}^T H_k \begin{pmatrix} s \\ b\end{pmatrix}
        + 2\begin{pmatrix} s \\ b\end{pmatrix}^T u_k ,

where

    H_k = \begin{pmatrix}
        I + \frac1N\sum_{i=1}^N \operatorname{diag}(x_k^i)(\nabla^2 f(x_k^i)+\varepsilon I)\operatorname{diag}(x_k^i)
          & \frac1N\sum_{i=1}^N \operatorname{diag}(x_k^i)(\nabla^2 f(x_k^i)+\varepsilon I) \\
        \frac1N\sum_{i=1}^N (\nabla^2 f(x_k^i)+\varepsilon I)\operatorname{diag}(x_k^i)
          & \frac1N\sum_{i=1}^N (\nabla^2 f(x_k^i)+\varepsilon I)
      \end{pmatrix},
    \qquad
    u_k = \begin{pmatrix} \frac1N\sum_{i=1}^N \operatorname{diag}(x_k^i)\,v_k^i \\ \frac1N\sum_{i=1}^N v_k^i \end{pmatrix}.

Slide 34

Algorithm 1: Affine Wasserstein Newton's method

Require: initial positions {x_0^i}_{i=1}^N, ε ≥ 0, step sizes α_k, maximum iteration number K.
1: Set k = 0.
2: while k < K and the convergence criterion is not met do
3:   Compute v_k^i = ∇f(x_k^i) + ξ_k(x_k^i). Here ξ_k is an approximation of ∇log ρ_k.
4:   Calculate H_k and u_k as defined on the previous slide.
5:   Compute s_k and b_k by (s_k; b_k) = −H_k^{-1} u_k.
6:   Update the particle positions by x_{k+1}^i = x_k^i + α_k(diag(s_k) x_k^i + b_k).
7:   Set k = k + 1.
8: end while
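A compact NumPy sketch of one step of this algorithm under simplifying assumptions: grad_f and hess_f describe a hypothetical Gaussian-like target, and ξ_k is taken to be the crude kernel density score estimate sketched earlier (the slides leave the choice of ξ_k open). It is meant to show the linear-algebra structure, not a tuned implementation.

```python
import numpy as np

def grad_f(x):       # hypothetical target: f(x) = 0.5 * ||x||^2
    return x

def hess_f(x):       # Hessian of the hypothetical target at a single point
    return np.eye(x.shape[-1])

def kde_score(x, bandwidth=0.3):
    """Crude estimate xi_k of grad log rho_k from the particles (Gaussian KDE)."""
    diff = x[:, None, :] - x[None, :, :]
    w = np.exp(-np.sum(diff**2, axis=-1) / (2 * bandwidth**2))
    grad_dens = -(w[:, :, None] * diff).sum(axis=1) / bandwidth**2
    return grad_dens / w.sum(axis=1)[:, None]

def affine_wnewton_step(x, alpha=0.5, eps=1e-2):
    """One step of the affine Wasserstein Newton method (Algorithm 1 sketch)."""
    N, n = x.shape
    v = grad_f(x) + kde_score(x)                       # v_k^i
    Hss = np.eye(n); Hsb = np.zeros((n, n)); Hbb = np.zeros((n, n))
    us = np.zeros(n); ub = np.zeros(n)
    for i in range(N):
        D = np.diag(x[i])
        A = hess_f(x[i]) + eps * np.eye(n)             # regularized Hessian at x_k^i
        Hss += D @ A @ D / N
        Hsb += D @ A / N
        Hbb += A / N
        us += D @ v[i] / N
        ub += v[i] / N
    H = np.block([[Hss, Hsb], [Hsb.T, Hbb]])
    u = np.concatenate([us, ub])
    sb = -np.linalg.solve(H, u)                        # (s_k, b_k) = -H_k^{-1} u_k
    s, b = sb[:n], sb[n:]
    return x + alpha * (x * s + b)                     # diag(s_k) x_k^i + b_k, row-wise

rng = np.random.default_rng(2)
x = rng.normal(loc=4.0, size=(300, 2))
for _ in range(30):
    x = affine_wnewton_step(x)
print(x.mean(axis=0), x.var(axis=0))                   # ideally near zero mean, unit variance
```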

Slide 35

Modified affine Wasserstein Newton's method

What about a full symmetric matrix S? Directly optimizing over it is too difficult. Consider the modified problem

    \min_{S\in\mathcal{S}^n,\, b\in\mathbb{R}^n} \hat J(S, b)
    = \operatorname{tr}(S^2) + \frac1N\sum_{i=1}^N \Big( \|S x_k^i + b\|_{F_k}^2 + 2\langle S x_k^i + b,\ v_k^i\rangle \Big),

where F_k = \frac1N\sum_{i=1}^N \nabla^2 f(x_k^i) + \varepsilon I. The optimal b follows

    b_k = -\frac1N\sum_{i=1}^N \big( S_k x_k^i + F_k^{-1} v_k^i \big),

and S_k is the solution to the quadratic matrix optimization problem

    \min_{S\in\mathcal{S}^n}\ 2\operatorname{tr}(S^2) + \operatorname{tr}(S\bar\Sigma_k S F_k)
      + \operatorname{tr}(S F_k S\bar\Sigma_k) - 2\operatorname{tr}(\bar T_k S).

Here \bar\Sigma_k is the empirical covariance matrix of {x_k^i} and \bar T_k is the empirical cross-covariance matrix between x_k^i and v_k^i.

Slide 36

Algorithm 2: Modified affine Wasserstein Newton's method

Require: initial positions {x_1^i}_{i=1}^N, ε ≥ 0, step sizes α_k, maximum iteration number K.
1: Set k = 1 and S_0 = I.
2: while k ≤ K and the convergence criterion is not met do
3:   Calculate v_k^i = ∇f(x_k^i) + ξ_k(x_k^i). Here ξ_k is an approximation of ∇log ρ_k.
4:   Compute the averaged Hessian with regularization, F_k = εI + (1/N)Σ_{i=1}^N ∇²f(x_k^i).
5:   Compute the means x̄_k = (1/N)Σ_{i=1}^N x_k^i and v̄_k = (1/N)Σ_{i=1}^N v_k^i.
6:   Let x̃_k^i = x_k^i − x̄_k and ṽ_k^i = v_k^i − v̄_k.
7:   Compute Σ̄_k = (1/N)Σ_{i=1}^N x̃_k^i (x̃_k^i)^T and T̄_k = (1/(2N))Σ_{i=1}^N ( ṽ_k^i (x̃_k^i)^T + x̃_k^i (ṽ_k^i)^T ).
8:   Solve S_k from the quadratic matrix problem via the CG method, using S_{k−1} as the initial guess.
9:   Update the particle positions by x_{k+1}^i = x_k^i + α_k(S_k x̃_k^i − F_k^{−1} v̄_k).
10:  Set k = k + 1.
11: end while
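For step 8, differentiating the quadratic matrix objective from the previous slide gives the linear equation 2S + F_k S Σ̄_k + Σ̄_k S F_k = T̄_k, which is self-adjoint and positive definite in S, so conjugate gradient applies to the vectorized system. Below is a minimal SciPy sketch; the random matrices are stand-ins with the right structure, not quantities from the slides.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def solve_S(F, Sigma_bar, T_bar, S_init=None):
    """Solve 2 S + F S Sigma_bar + Sigma_bar S F = T_bar for symmetric S via CG."""
    n = F.shape[0]

    def matvec(s):
        S = s.reshape(n, n)
        return (2.0 * S + F @ S @ Sigma_bar + Sigma_bar @ S @ F).reshape(-1)

    A = LinearOperator((n * n, n * n), matvec=matvec)
    x0 = None if S_init is None else S_init.reshape(-1)
    s, info = cg(A, T_bar.reshape(-1), x0=x0)
    S = s.reshape(n, n)
    return 0.5 * (S + S.T)            # symmetrize against round-off

# Random stand-ins: symmetric positive definite F, a covariance-like Sigma_bar, symmetric T_bar.
rng = np.random.default_rng(3)
n = 5
B = rng.normal(size=(n, n)); F = B @ B.T / n + np.eye(n)
C = rng.normal(size=(n, n)); Sigma_bar = C @ C.T / n
T_bar = rng.normal(size=(n, n)); T_bar = 0.5 * (T_bar + T_bar.T)
S = solve_S(F, Sigma_bar, T_bar)
print(np.linalg.norm(2 * S + F @ S @ Sigma_bar + Sigma_bar @ S @ F - T_bar))  # should be ~0
```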

Slide 37

Slide 37 text

Hybrid method Overdamped Langevin dynamics as gradient direction: xi k+1 = xi k + αk (Sk xi k + bk ) + 2λk αk Bk , where Bk ∼ N(0, I). Lagrangian Langevin dynamics as gradient direction: xi k+1 = xi k + αk (Sk xi k + bk − λk vi k ). Hybrid Langvien dynamics dXt = (∇Φ −Newton t − λt ∇f)dt + 2λt dBt , 37
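A minimal sketch of the first (OLD-based) hybrid update, reusing an affine Newton direction (s_k, b_k) computed as in Algorithm 1; λ_k and α_k below are illustrative constants, not values from the slides.

```python
import numpy as np

def hybrid_old_step(x, s, b, alpha=0.5, lam=0.1, rng=None):
    """Hybrid update: affine Wasserstein Newton direction plus overdamped Langevin noise,
    i.e. x_{k+1} = x_k + alpha*(diag(s) x_k + b) + sqrt(2*lam*alpha)*B_k with B_k ~ N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    newton = x * s + b                 # diag(s) x_k + b, applied to each particle (row)
    noise = rng.normal(size=x.shape)
    return x + alpha * newton + np.sqrt(2.0 * lam * alpha) * noise
```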

Slide 38

1D toy example

Slide 39

2D toy example

Slide 40

Gaussian families

Figure: Comparison among WNewton, WGF, AIG and HALLD in Gaussian families. The condition number is κ = 2 × 10^4. For Newton and HALLD 0.5, markers are placed every 5 iterations.

Slide 41

Bayesian logistic regression

Figure: Comparison of different methods on Bayesian logistic regression, averaged over 10 independent trials. The shaded areas show the variance over the 10 trials. Left: test accuracy. Right: test log-likelihood.

Slide 42

Conclusion and discussion

– Design high-order optimization methods for Bayesian sampling, machine learning and inverse problems.
– High-order derivatives from information metrics are very useful!
– Sampling-efficient quasi-Newton methods for information metrics.
– Other efficient methods to approximate the Wasserstein Newton's direction.
– Information Newton's flow in probability models.