Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Information Newton flow: second order method in probability space

7a507f364fce7547f94b9a5b4a072c87?s=47 Wuchen Li
February 13, 2020

Information Newton flow: second order method in probability space

Markov chain Monte Carlo (MCMC) methods nowadays play essential roles in machine learning, Bayesian sampling problems, and inverse problems. To accelerate the MCMC methods, we formulate a high order optimization framework for accelerated them. It can be viewed as Newton’s flows in probability space with information metrics, named information Newton’s flows. Here two information metrics are considered, including both the Fisher-Rao metric and the Wasserstein-2 metric. Several examples of information Newton’s flows for learning objective/loss functions are provided, such as Kullback-Leibler (KL) divergence, Maximum mean discrepancy (MMD), and cross-entropy. The asymptotic convergence results of proposed Newton’s methods are provided. A known fact is that classical MCMC methods, such as overdamped Langevin dynamics, correspond to Wasserstein gradient flows of KL divergence. Extending this fact to Wasserstein Newton’s flows of KL divergence, we derive Newton’s Langevin dynamics. We provide examples of Newton’s Langevin dynamics in both one-dimensional space and Gaussian families. For the numerical implementation, we design sampling efficient variational methods to approximate Wasserstein Newton’s directions. Several numerical examples in Gaussian families and Bayesian logistic regression are shown to demonstrate the effectiveness of the proposed method.

7a507f364fce7547f94b9a5b4a072c87?s=128

Wuchen Li

February 13, 2020
Tweet

Transcript

  1. Information Newton flow Wuchen Li This is a joint work

    with Yifei Wang. 1
  2. Distance, divergence and optimization 2

  3. Bayesian inference A powerful tool in – Modeling complex data

    – Quantifying uncertainty Of great interests in inverse problems, information science, physics and scientific computing Main problem – Given a prior distribution, generate samples from posterior distributions – Generate samples from an intractable distribution ρ∗(x) ∝ exp(−f(x)). 3
  4. Application of Bayesian inference Figure: Prediction of future oil production.

    Upper panels: typical draws from the prior (left) and posterior (right). Lower panels: uncertainty in oil production under the prior (left) and posterior (right). 4
  5. Langevin MCMC How to generate these samples? Classical Langevin MCMC

    algorithms evolves samples following over-damped Langevin dynamics (OLD) dXt = −∇f(Xt )dt + √ 2dBt , where Bt is the standard Brownian motion. Denote ρt as the density function of the distribution of Xt . The evolution of ρt satisfies the Fokker-Planck equation ∂t ρt = ∇ · (ρt ∇f) + ∆ρt . 5
  6. Optimization in probability space Consider the optimization problem in probability

    space min ρ∈P(Ω) E(ρ), where P(Ω) = ρ ∈ F(Ω): Ω ρdx = 1, ρ ≥ 0. KL divergence: E(ρ) = DKL (ρ ρ∗) = (f + log ρ)ρdx. 6
  7. Information metrics in probability space Tangent space Tρ P(Ω) =

    σ ∈ F(Ω) : σdx = 0. . Cotangent space T∗ ρ P(Ω) – Equivalent to F(Ω)/R Metric tensor G(ρ) : Tρ P(Ω) → T∗ ρ P(Ω) Metric: inner product in tangent space gρ (σ1 , σ2 ) = σ1 G(ρ)σ2 dx = Φ1 G(ρ)−1Φ2 dx, Here Φi is the solution to σi = G(ρ)−1Φi , i = 1, 2. 7
  8. Example: Wasserstein metric Inverse of Wasserstein metric tensor GW (ρ)−1Φ

    = −∇ · (ρ∇Φ), Φ ∈ T∗ ρ P(Ω). Wasserstein metric: for σ1 , σ2 ∈ Tρ P(Ω), gW ρ (σ1 , σ2 ) = ρ ∇Φ1 , ∇Φ2 dx, where Φi is the solution to σi = −∇ · (ρ∇Φi ), i = 1, 2. 8
  9. Gradient flows Gradient flow for E(ρ) in (P(Ω), gρ )

    ∂t ρt = −G(ρt )−1 δE(ρt ) δρt . Example: Wasserstein gradient flow of KL divergence ∂t ρt = − gradW DKL (ρt ρ∗) =GW (ρt )−1 δ δρt DKL (ρt ρ∗) =∇ · (ρt ∇(f + log ρt + 1)) =∇ · (ρt ∇f) + ∆ρt . – This is the Fokker Planck equation. – Langevin MCMC is a gradient descent method! 9
  10. Lagrangian Langevin dynamics A perspective of fluid dynamics Lagrangian formulation

    of Wasserstein gradient flow: Lagrangian Langevin Dynamics (LLD) dXt = −∇f(Xt )dt − ∇ log ρt (Xt )dt. The evolution of corresponding density follows the Wasserstein gradient flow. – Langevin MCMC (unadjusted Langevin algorithm) — OLD – The Particle-based Variational Inference (ParVI) methods1 — LLD 1Chang Liu, Jingwei Zhuo, Pengyu Cheng, Ruiyi Zhang, and Jun Zhu. Understanding and accel- erating particle-based variational inference. In International Conference on Machine Learning, pages 40824092, 2019. 10
  11. First-order sampling algorithm Various sampling methods are first-order methods in

    probability space based on various metrics. Stein variational gradient descent (Stein metric)1: ˙ Xt = (−K(Xt , y)∇f(y) + ∇x K(Xt , y))ρ(y)dy Ensemble Kalman sampling (Wasserstein-Kalman metric)2: ˙ Xt = −C(ρt )∇f(Xt ) + 2C(ρt ) ˙ Bt where C(ρ) = (x − m(ρ)) ⊗ (x − m(ρ))ρ(x)dx and m(ρ) = xρ(x)dx 1Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances in neural information processing systems, 2016. 2Alfredo Garbuno-Inigo, Franca Hoffmann, Wuchen Li, and Andrew M Stuart. Interacting langevin diffusions: Gradient structure and ensemble kalman sampler. arXiv preprint arXiv:1903.08866, 2019. 11
  12. Overview of sampling algorithms Wasserstein metric + Gradient descent method

    = Langevin MCMC Stein metric + Gradient descent method = Stein variational gradient descent Wasserstein-Kalman metric + Gradient descent method = Ensemble Kalman sampling Wasserstein metric + Newton’s method = ? 12
  13. Goal: Newton’s method Fundamental second-order method to accelerate optimization computations

    For optimization problems in probability space – What is the Newton’s method in probability space? – What is the ”Newton’s” Langevin dynamics? 13
  14. Theorem (Wasserstein Newton’s flow of KL divergence) For a density

    ρ∗(x) ∝ exp(−f(x)) , where f is a given function, denote the KL divergence between ρ and ρ∗ by DKL (ρ ρ∗) = ρ log ρ e−f dx − log Z, where Z = exp(−f(x))dx. Then the Wasserstein Newton’s flow of KL divergence follows ∂t ρt + ∇ · (ρt ∇ΦNewton t ) = 0, where ΦNewton t satisfies the following equation ∇2 : (ρt ∇2Φt ) − ∇ · (ρt ∇2f∇Φt ) − ∇ · (ρt ∇f) − ∆ρt = 0. 14
  15. Newton’s Langevin dynamics Theorem Consider the Newton’s Langevin dynamics dXt

    = ∇ΦNewton t (Xt )dt, where ΦNewton t (x) follows Wasserstein Newton’s direction equation: ∇2 : (ρt ∇2Φt ) − ∇ · (ρt ∇2f∇Φt ) − ∇ · (ρt ∇f) − ∆ρt = 0. Here X0 follows an initial distribution ρ0 and ρt is the distribution of Xt . Then, ρt satisfies Wasserstein Newton’s flow with an initial value ρ0 = ρ0. 15
  16. 16

  17. Literature reviews on second-order methods Optimization problems on Riemannian manifold1

    Discrete probability simplex with Fisher-Rao metric and exponential family models2 Second-order methods for the Stein variational gradient descent direction3 4 Newton-type MCMC method (HAMCMC)5 1Steven T Smith. Optimization techniques on Riemannian manifolds. Fields institute communications, 1994. 2Luigi Malag´ o and Giovanni Pistone. Combinatorial optimization with information geometry: The newton method. Entropy, 2014. 3Gianluca Detommaso, Tiangang Cui, Youssef Marzouk, Alessio Spantini, and Robert Scheichl. A Stein variational Newton method. In Advances in Neural Information Processing Systems, 2018. 4Peng Chen, Keyi Wu, Joshua Chen, Tom OLeary-Roseberry, and Omar Ghattas. Projected Stein variational Newton: A fast and scalable bayesian inference method in high dimensions. In Advances in Neural Information Processing Systems, 2019. 5Umut Simsekli, Roland Badeau, Taylan Cemgil, and Ga¨ el Richard. Stochastic quasi-Newton Langevin Monte Carlo. In International Conference on Machine Learning (ICML), 2016. 17
  18. Connections and differences with HAMCMC HAMCMC approximates the dynamics of

    dXt = −(∇2f(Xt ))−1 (∇f(Xt ) + Γ(Xt )) dt + 2∇2f(Xt )−1dBt , where Γi (x) = j=1 ∂ ∂xj ∇2f(Xt ) −1 i,j . The evolution of ρt ∂t ρt = ∇ · (ρt vt ), where vt satisfies −∇ · (ρt ∇2fvt ) − ∇ · (ρt ∇f) − ∆ρt = 0. Wasserstein Newton’s direction equation ∇2 : (ρt ∇vNewton t ) − ∇ · (ρt ∇2fvNewton t ) − ∇ · (ρt ∇f) − ∆ρt = 0. vt = vNewton t ! 18
  19. Proposition (Wasserstein Newton’s flows in Gaussian families) Suppose that ρ0,

    ρ∗ are Gaussian distributions with zero means and their covariance matrices are Σ0 and Σ∗. E(Σ) evaluates the KL divergence from ρ to ρ∗: E(Σ) = 1 2 tr(Σ(Σ∗)−1) − n − log det Σ(Σ∗)−1 . Let (Σt, St) satisfy ˙ Σt − 2(SΣt + ΣSt) = 0, 2ΣtSt(Σ∗)−1 + 2(Σ∗)−1StΣt + 4St = −(Σt(Σ∗)−1 + (Σ∗)−1Σt − 2I). with initial values Σt|t=0 = Σ0 and St|t=0 = 0. Thus, for any t ≥ 0, Σt is well-defined and stays positive definite. We denote ρt(x) = (2π)−n/2 det(Σt) exp − 1 2 xT Σ−1 t x , Φt(x) = xT Stx + C(t), where C(t) = −t + 1 2 t 0 log det(Σs(Σ∗)−1)ds. Then, ρt and Φt follow the Wasserstein Newton’s flow with initial values ρt|t=0 = ρ0 and Φt|t=0 = 0. 19
  20. Proposition (Newton’s Langevin dynamics in 1D Gaussian families) Assume that

    f(x) = (2Σ∗)−1(x − µ∗)2, where Σ∗ > 0 and µ∗ are given. Suppose that the particle system X0 follows the Gaussian distribution. Then Xt follows a Gaussian distribution with mean µt and variance Σt . The corresponding NLD satisfies dXt = Σ∗ − Σ Σ∗ + Σt Xt − 2Σ∗ Σ∗ + Σt µt + µ∗ dt. And the evolution of µt and Σt satisfies dµt = (−µt + µ∗)dt, dΣt = 2 Σ∗ − Σt Σ∗ + Σt Σt dt. The explicit solutions of µt and Σt satisfy µt = e−t(µ0 − µ∗) + µ∗, Σt = Σ∗ + (Σ0 − Σ∗)e−t e−2t(Σ0 − Σ∗)2 4Σ2 0 + 1 Σ0Σ∗ . 20
  21. Dynamics Particle Mean and variance NLD dXt = Σ∗−Σt Σ∗+Σt

    Xt − 2Σ∗ Σ∗+Σt µt + µ∗ dt µt = µ∗ + e−t(µ0 − µ∗) Σt−Σ∗ Σ0−Σ∗ = e−2t(Σ0−Σ∗) 2Σ0 +e−t e−2t(Σ0−Σ∗)2 4Σ2 0 + Σ∗ Σ0 OLD dXt = −(Σ∗)−1(Xt − µ∗)dt + √ 2dBt µt = µ∗ + e−(Σ∗)−1t(µ0 − µ∗) LLD dXt = −(Σ∗)−1(Xt − µ∗)dt + Σ−1 t (Xt − µt)dt Σt = Σ∗ + e−2(Σ∗)−1t(Σ0 − Σ∗) HAMCMC dXt = −(Xt − µ∗)dt + √ 2Σ∗dBt. µt = µ∗ + e−t(µ0 − µ∗) Σt = Σ∗ + e−2t(Σ0 − Σ∗) Table: Comparison among different Langevin dynamics on 1D Gaussian family. 21
  22. Information Newton’s method General update rule of Information Newton’s method

    ρk+1 = Expρk (αk Φk ), HE (ρk )Φk + G(ρk )−1 δE δρk = 0. Expρk (·) is the exponential map at ρk . 22
  23. Riemannian structure of probability space Define the distance D(ρ0 ,

    ρ1 ) D(ρ0, ρ1)2 = inf ˆ ρs,s∈[0,1] 1 0 ∂s ˆ ρsG(ˆ ρs)−1∂s ˆ ρsdxds : ˆ ρs|s=0 = ρ0, ˆ ρs|s=1 = ρ1 . Denote the inner product on cotangent space T∗ ρ P(Ω) by Φ1 , Φ2 ρ = Φ1 G(ρ)−1Φ2 dx, Φ1 , Φ2 ∈ T∗ ρ P(Ω), and Φ 2 ρ = Φ, Φ ρ . 23
  24. Parallelism and high-order derivative Definition (Parallelism) We say that τ

    : T∗ ρ0 P(Ω) → T∗ ρ1 P(Ω) is a parallelism from ρ0 to ρ1 , if for all Φ1 , Φ2 ∈ Tρ0 P(Ω), it follows Φ1 , Φ2 ρ0 = τΦ1 , τΦ2 ρ1 . ∇nE(ρ) is a n-form on the cotangent space T∗ ρ P(Ω) ∇nE(ρ)(Φ1, . . . , Φn) = ∂ ∂s ∇n−1E(Expρ (sΦn))(τsΦ1, . . . , τsΦn−1) s=0 , where τs is the parallelism from ρ to Expρ (sΦn ). 24
  25. Assumption for convergence analysis Assumption Assume that there exists ,

    δ1 , δ2 , δ3 > 0, such that for all ρ satisfying D(ρ, ρ∗) < , it follows ∇2E(ρ)(Φ1 , Φ1 ) ≥ δ1 Φ1 2 ρ , ∇2E(ρ)(Φ1 , Φ1 ) ≤ δ2 Φ1 2 ρ , |∇3E(ρ)(Φ1 , Φ1 , Φ2 )| ≤ δ3 Φ1 2 ρ Φ2 ρ , holds for all Φ1 , Φ2 ∈ T∗ ρ P(Ω). 25
  26. Convergence analysis Theorem Suppose that the assumption holds, ρk satisfies

    D(ρk , ρ∗) < and the step size τk = 1. Then, we have D(ρk+1 , ρ∗) = O(D(ρk , ρ∗)2). 26
  27. Sketch of proof Denote Tk = Exp−1 ρk (ρ∗). Proposition

    Suppose that the assumption holds. Let τ be the parallelism from ρk to ρk+1 . There exists a unique Rk ∈ T∗ ρk P(Ω) such that Tk = τ−1Tk+1 + Φk + Rk . Then, we have Tk+1 ρk+1 ≤ δ3 δ1 Tk 2 ρk + δ2 δ1 Rk ρk . 27
  28. Sketch of proof Lemma For all Ψ ∈ T ∗

    ρk P(Ω), it follows ΨG(ρk )−1Rk dx = O( Ψ ρk Tk 2 ρk ). Taking Ψ = Rk in Lemma yields Rk ρk = O( Tk 2 ρk ). Because the geodesic curve has constant speed, Tk 2 ρk = D(ρk , ρ∗)2. As a result, we have D(ρk+1 , ρ∗) ≤ δ2 δ1 D(ρk , ρ∗)2 + δ3 δ1 Rk ρk = O(D(ρk , ρ∗)2). 28
  29. Implementation in probability space The distribution {xi k }N i=1

    follows ρk (x). Update each particle by xi k+1 = xi k + αk ∇Φk (xi k ), i = 1, 2 . . . N. Φk is the solution to Wasserstein Newton’s direction equation. How to compute Φk based on {xi k }N i=1 ? 29
  30. Compute Wasserstein Newton’s direction Proposition Suppose that H : T∗

    ρ P(Ω) → Tρ P is a linear self-adjoint operator and H is positive definite. Let u ∈ Tρ P. Then the minimizer of variational problem min Φ∈T ∗ ρ P(Ω) J(Φ) = (ΦHΦ − 2uΦ) dx, satisfies HΦ = u, where Φ ∈ T∗ ρ P(Ω). 30
  31. Compute Wasserstein Newton’s direction For strongly convex f Equivalent to

    optimizing the variational problem: min Φ∈T ∗ ρk P(Ω) J(Φ) = ∇2Φ 2 F + ∇Φ 2 ∇2f + 2 ∇f + ∇ log ρk, ∇Φ ρkdx. Here we denote v 2 A = vT Av. For general f, suppose that ∇2f(x) + I is strictly convex for x ∈ Ω. Consider a regularized problem min Φ∈T ∗ ρk P(Ω) J(Φ) = ∇2Φ 2 F + ∇Φ 2 ∇2f+ I + 2 ∇f + ∇ log ρk, ∇Φ ρkdx. 31
  32. Affine Wasserstein Newton’s method Φ(x) takes the form Φ(x) =

    1 2 xSx + bT x, where S = diag(s) ∈ Rn×n is a diagonal matrix. The variational problem turns to be min S∈Sn,b∈Rn J(S, b) = tr(S2)+ 1 N N i=1 Sxi k + b 2 ∇2f(xi k )+ I + 2 Sxi k + b, vi k . 32
  33. Affine Wasserstein Newton’s method Rewrite the objective function to be

    J(s, b) = s 2 + 1 N N i=1 tr((diag(xi k )s + b)T (∇2f(xi k ) + I)(diag(xi k )s + b)) + 2 diag(xi k )s + b, vi k = s b T Hk s b + 2 s b T uk. where Hk = I + 1 N N i=1 diag(xi k)(∇2f(xi k) + I) diag(xi k) 1 N N i=1 diag(xi k)(∇2f(xi k) + I) 1 N N i=1(∇2f(xi k) + I) diag(xi k) 1 N N i=1(∇2f(xi k) + I) , uk = 1 N N i=1 diag(xi k)vi k 1 N N i=1 vi k . 33
  34. Algorithm 1 Affine Wasserstein Newton’s method Require: initial positions {xi

    0}N i=1, ≥ 0, step sizes αk, maximum iteration K. 1: Set k = 0. 2: while k < K and the convergence criterion is not met do 3: Compute vi k = ∇f(xi k) + ξk(xi k). Here ξk is an approximation of ∇ log ρk. 4: Calculate Hk by Hk = I + 1 N N i=1 diag(xi k)(∇2f(xi k) + I) diag(xi k) 1 N N i=1 diag(xi k)(∇2f(xi k) + I) 1 N N i=1(∇2f(xi k) + I) diag(xi k) 1 N N i=1(∇2f(xi k) + I) , and formulate uk by uk = 1 N N i=1 diag(xi k)vi k 1 N N i=1 vi k . 5: Compute sk and bk by sk bk = −(Hk) −1 uk. 6: Update particle positions by x i k+1 = x i k + αk(diag(sk)x i k + bk). 7: Set k = k + 1. 8: end while 34
  35. Modified affine Wasserstein Newton’s method What about symmetric matrix S?

    Too difficult! Consider the modified problem: min S∈Sn,b∈Rn ˆ J(S, b) = tr(S2)+ 1 N N i=1 Sxi k + b 2 Fk + 2 Sxi k + b, vi k . Here Fk = 1 N N i=1 ∇2f(xi k ) + I. The optimal solution follows bk = − 1 N N i=1 (Sk xi k + F−1 k vi k ), and Sk is the solution to the quadratic matrix optimization problem min S∈Sn 2 tr(S2) + tr(S ¯ Σk SFk ) + tr(S ¯ Fk S ¯ Σk ) − 2 tr( ¯ Tk S). Here ¯ Σk is an empirical covariance matrix of xi k and ¯ Tk is an empirical covariance matrix between xi k and vi k . 35
  36. Algorithm 2 Modified affine Wasserstein Newton’s method Require: initial positions

    {xi 1 }N i=1 , ≥ 0, step sizes αk, maximum iteration K. 1: Set k = 1 and S0 = I. 2: while k ≤ K and the convergence criterion is not met do 3: Calculate vi k = ∇f(xi k ) + ξk(xi k ). Here ξk is an approximation of ∇ log ρk. 4: Compute the averaged Hessian with regularization by Fk = I + 1 N N i=1 ∇2f(xi k ). 5: Compute the means ¯ xk = 1 N N i=1 xi k and ¯ vk = 1 N N i=1 vi k . 6: Let ˜ xi k = xi k − ¯ xk and ˜ vi k = vi k − ¯ vk. 7: Compute ¯ Σk = 1 N N i=1 ˜ xi k (˜ xi k )T and ¯ Tk = 1 2N N i=1 ˜ vi k (˜ xi k )T + ˜ xi k (˜ vi k )T . 8: Solve Sk from the problem via the CG method using the initial guess Sk−1. 9: Update particle positions by xi k+1 = xi k + αk(Sk ˜ xi k − F−1 k ¯ vk). 10: end while 36
  37. Hybrid method Overdamped Langevin dynamics as gradient direction: xi k+1

    = xi k + αk (Sk xi k + bk ) + 2λk αk Bk , where Bk ∼ N(0, I). Lagrangian Langevin dynamics as gradient direction: xi k+1 = xi k + αk (Sk xi k + bk − λk vi k ). Hybrid Langvien dynamics dXt = (∇Φ −Newton t − λt ∇f)dt + 2λt dBt , 37
  38. 1D Toy example 38

  39. 2D toy example 39

  40. Gaussian families Figure: Comparison among WNewton, WGF, AIG and HALLD

    in Gaussian families. The conditional number κ = 2 × 104. For Newton and HALLD 0.5, the markers are marked for every 5 iterations. 40
  41. Bayesian logistic regression Figure: Comparison of different methods on Bayesian

    logistic regression, averaged over 10 independent trials. The shaded areas show the variance over 10 trials. Left: Test accuracy; Right: Test log-likelihood. 41
  42. Conclusion and discussion Design high-order optimization methods for Bayesian sampling,

    machine learning and inverse problems. High-order derivatives from information metrics are very useful! Sampling efficient Quasi-Newton’s method for information metrics Other efficient method to approximate the Wasserstein Newton’s direction Information Newton’s flow in probability models 42