Accelerated Information Gradient flow

Wuchen Li
September 22, 2019

We present a systematic framework for Nesterov's accelerated gradient flows on the space of probability densities embedded with information metrics. Two metrics are considered: the Fisher-Rao metric and the Wasserstein-2 metric. For the Wasserstein-2 metric, we prove convergence properties of the accelerated gradient flows and introduce their formulations in Gaussian families. Furthermore, we propose a practical discrete-time particle algorithm with an adaptive restart technique. We also formulate a novel bandwidth selection method, which learns the Wasserstein-2 gradient direction from Brownian-motion samples. Experimental results, including Bayesian inference, show the strength of the proposed method compared with the state of the art.


Transcript

  1. Accelerated Information Gradient flow

    Wuchen Li, UCLA. This is a joint work with Yifei Wang.
  2. Distance and Optimization

  3. Problem background

    Recently, optimization problems over the space of probability distributions and probability models have attracted increasing attention from the machine learning community. Examples include: variational inference, Bayesian inference, Generative Adversarial Networks (GAN), and policy optimization. Figure: Results from Wasserstein GAN.
  4. Optimization problem in learning

    In learning, many problems can be formulated as an optimization problem over the probability space, $\min_{\rho \in \mathcal{P}(\Omega)} E(\rho)$. Here $\Omega$ is a region in $\mathbb{R}^n$ and $E(\rho)$ is a divergence or metric loss functional between $\rho$ and a target density $\rho^* \in \mathcal{P}(\Omega)$. For example, the KL divergence from $\rho$ to $\rho^*$:
    $E(\rho) = D_{\mathrm{KL}}(\rho \| \rho^*) = \int \log\frac{\rho}{\rho^*}\, \rho\, dx.$
  5. Classical methods

    Gradient descent methods with sampling-efficient properties play essential roles in solving these optimization problems.
    – The gradient descent direction often relies on the information metric over the probability space.
    – This direction naturally reflects the change of the loss function with respect to the metric.
    In the literature, two important information metrics, the Fisher-Rao metric and the Wasserstein-2 (in short, Wasserstein) metric, are of great interest.
    – Fisher-Rao gradient: Adam and K-FAC.
    – Wasserstein gradient: Markov chain Monte Carlo (MCMC) methods, particle-based variational inference (ParVI) methods, and GANs.
  6. Optimization problems in the Euclidean space

    Methods for optimization problems in the Euclidean space have been well studied: $\min_{x \in \mathbb{R}^n} f(x)$. The classical gradient descent method follows $x_{k+1} = x_k - s \nabla f(x_k)$. It can be formulated as a continuous-time gradient flow, $\dot{x} = -\nabla f(x)$.
  7. Accelerated gradient flows in the Euclidean space

    Nesterov's accelerated method is widely applied to accelerate vanilla gradient descent. Su et al.¹ show that it corresponds to the so-called accelerated gradient flow $\ddot{x} + \alpha_t \dot{x} + \nabla f(x) = 0$. This is equivalent to a damped Hamiltonian system
    $\begin{pmatrix} \dot{x} \\ \dot{p} \end{pmatrix} + \begin{pmatrix} 0 \\ \alpha_t p \end{pmatrix} - \begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix} \begin{pmatrix} \nabla_x H_E(x, p) \\ \nabla_p H_E(x, p) \end{pmatrix} = 0, \qquad H_E(x, p) = \frac{1}{2}\|p\|_2^2 + f(x),$
    with initial values $x(0) = x_0$ and $p(0) = 0$. The choice of $\alpha_t$ depends on the properties of $f(x)$. If $f(x)$ is $\beta$-strongly convex, then $\alpha_t = 2\sqrt{\beta}$; if $f(x)$ is convex, then $\alpha_t = 3/t$.
    ¹ Weijie Su et al. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. Journal of Machine Learning Research, 2016.
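The Euclidean flow above is easy to simulate. The following is a minimal sketch (not code from the talk): it integrates $\ddot{x} + \alpha_t \dot{x} + \nabla f(x) = 0$ with semi-implicit Euler for a $\beta$-strongly convex quadratic, using the constant damping $\alpha_t = 2\sqrt{\beta}$. Function names, step size, and horizon are illustrative choices.

```python
import numpy as np

def accelerated_flow(grad_f, x0, beta, dt=1e-3, T=15.0):
    """Integrate x'' + 2*sqrt(beta)*x' + grad_f(x) = 0 by semi-implicit Euler."""
    x = np.array(x0, dtype=float)
    p = np.zeros_like(x)                     # p(0) = 0, as in the Hamiltonian form
    alpha = 2.0 * np.sqrt(beta)
    for _ in range(int(T / dt)):
        p -= dt * (alpha * p + grad_f(x))    # damped momentum update
        x += dt * p                          # position update
    return x

# f(x) = 0.5 * beta * ||x||^2 is beta-strongly convex with minimizer 0.
beta = 0.5
x_final = accelerated_flow(lambda x: beta * x, np.array([5.0, -3.0]), beta)
```

With $\alpha_t = 2\sqrt{\beta}$ the trajectory contracts at the rate $e^{-\sqrt{\beta} t}$ (up to a polynomial factor), so `x_final` is close to the minimizer.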
  8. Accelerated gradient flows in learning?

    Does there exist a counterpart of the accelerated gradient flow in the probability space under information metrics? Several previous works explore accelerated methods under the Wasserstein metric.
    – Liu et al. (ICML 2019): an acceleration framework for ParVI methods based on manifold optimization.
    – Taghvaei and Mehta (ICML 2019): an accelerated flow from an optimal control perspective.
    – Cheng et al. and Ma et al.: underdamped Langevin dynamics.
  9. Contributions

    A unified framework of accelerated gradient flows in the probability space embedded with information metrics: Accelerated Information Gradient (AIG) flows.
    In theory,
    – Verify the existence of the solution to AIG flows in Gaussian families.
    – Rigorously prove the convergence rate of AIG flows and remove an unnecessary assumption in Taghvaei and Mehta (ICML 2019).
    In numerics,
    – Propose a novel kernel selection method (the BM method).
    – Introduce an adaptive restart technique to handle the stiffness of AIG flows.
  10. Metric

    Definition (Metric in the probability space). A metric tensor $G(\rho) : T_\rho \mathcal{P}(\Omega) \to T_\rho^* \mathcal{P}(\Omega)$ is an invertible mapping from the tangent space at $\rho$ to the cotangent space at $\rho$. This metric tensor defines the metric (inner product) on the tangent space $T_\rho \mathcal{P}(\Omega)$. Namely, for $\sigma_1, \sigma_2 \in T_\rho \mathcal{P}(\Omega)$, we define
    $g_\rho(\sigma_1, \sigma_2) = \int \sigma_1 G(\rho) \sigma_2\, dx = \int \Phi_1 G(\rho)^{-1} \Phi_2\, dx,$
    where $\Phi_i$ is the solution to $\sigma_i = G(\rho)^{-1} \Phi_i$, $i = 1, 2$.
  11. Example: Fisher-Rao metric

    The Fisher-Rao metric comes from Information Geometry. The inverse of the Fisher-Rao metric tensor is defined by
    $G_F(\rho)^{-1} \Phi = \rho \left( \Phi - \int \Phi \rho\, dx \right), \qquad \Phi \in T_\rho^* \mathcal{P}(\Omega).$
    For $\sigma_1, \sigma_2 \in T_\rho \mathcal{P}(\Omega)$, the Fisher-Rao metric on the tangent space follows
    $g_\rho^F(\sigma_1, \sigma_2) = \int \Phi_1 \Phi_2 \rho\, dx - \left( \int \Phi_1 \rho\, dx \right) \left( \int \Phi_2 \rho\, dx \right),$
    where $\Phi_i$ is the solution to $\sigma_i = \rho \left( \Phi_i - \int \Phi_i \rho\, dx \right)$, $i = 1, 2$.
  12. Example: Wasserstein metric

    The Wasserstein metric comes from Optimal Transport¹. The inverse of the Wasserstein metric tensor is defined by
    $G_W(\rho)^{-1} \Phi = -\nabla \cdot (\rho \nabla \Phi), \qquad \Phi \in T_\rho^* \mathcal{P}(\Omega).$
    The Wasserstein metric on the tangent space is given by
    $g_\rho^W(\sigma_1, \sigma_2) = \int \rho\, \langle \nabla \Phi_1, \nabla \Phi_2 \rangle\, dx, \qquad \sigma_1, \sigma_2 \in T_\rho \mathcal{P}(\Omega),$
    where $\Phi_i$ is the solution to $\sigma_i = -\nabla \cdot (\rho \nabla \Phi_i)$, $i = 1, 2$.
    ¹ Cédric Villani. Topics in Optimal Transportation. American Mathematical Society, 2003.
  13. Gradient flows

    The gradient flow for $E(\rho)$ in $(\mathcal{P}(\Omega), g_\rho)$ takes the form
    $\partial_t \rho_t = -G(\rho_t)^{-1} \frac{\delta E(\rho_t)}{\delta \rho_t}.$
    The Fisher-Rao gradient flow is given by
    $\partial_t \rho_t = -\rho_t \left( \frac{\delta E}{\delta \rho_t} - \int \frac{\delta E}{\delta \rho_t} \rho_t\, dx \right).$
    The Wasserstein gradient flow writes
    $\partial_t \rho_t = \nabla \cdot \left( \rho_t \nabla \frac{\delta E}{\delta \rho_t} \right).$
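For intuition: when $E(\rho) = D_{\mathrm{KL}}(\rho \| \rho^*)$ with $\rho^* \propto e^{-f}$, the Wasserstein gradient flow is the Fokker-Planck equation, and its standard particle discretization is the unadjusted Langevin scheme. The sketch below is my own illustration (not code from the talk); the step size and sample count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def langevin(grad_f, X, tau=1e-2, steps=2000):
    """Particle discretization of the Wasserstein gradient flow of the KL
    divergence: X_{k+1} = X_k - tau * grad_f(X_k) + sqrt(2 tau) * noise."""
    for _ in range(steps):
        X = X - tau * grad_f(X) + np.sqrt(2 * tau) * rng.standard_normal(X.shape)
    return X

# Target rho* = N(0, 1): f(x) = x^2 / 2, so grad_f(x) = x.
X = langevin(lambda x: x, rng.standard_normal((2000, 1)) * 3.0)
```

After a long enough run, the empirical mean and variance of the particles approximate those of the target $N(0, 1)$ (up to the small discretization bias of the scheme).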
  14. Hamiltonian flows

    The Euler-Lagrange equation in the probability space can be formulated as a system in $(\rho_t, \Phi_t)$, i.e.,
    $\partial_t \rho_t - G(\rho_t)^{-1} \Phi_t = 0, \qquad \partial_t \Phi_t + \frac{1}{2} \frac{\delta}{\delta \rho_t} \int \Phi_t G(\rho_t)^{-1} \Phi_t\, dx + \frac{\delta E}{\delta \rho_t} = 0,$
    where $\Phi_t$ is determined up to a spatially constant shift. The above PDE system is the Hamiltonian flow
    $\partial_t \begin{pmatrix} \rho_t \\ \Phi_t \end{pmatrix} - \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} \begin{pmatrix} \frac{\delta}{\delta \rho_t} H(\rho_t, \Phi_t) \\ \frac{\delta}{\delta \Phi_t} H(\rho_t, \Phi_t) \end{pmatrix} = 0,$
    with respect to the Hamiltonian
    $H(\rho_t, \Phi_t) = \frac{1}{2} \int \Phi_t G(\rho_t)^{-1} \Phi_t\, dx + E(\rho_t).$
  15. Fisher-Rao Hamiltonian flow

    The Fisher-Rao Hamiltonian flow follows
    $\partial_t \rho_t - \left( \Phi_t - \int \Phi_t \rho_t\, dx \right) \rho_t = 0, \qquad \partial_t \Phi_t + \frac{1}{2} \Phi_t^2 - \left( \int \rho_t \Phi_t\, dx \right) \Phi_t + \frac{\delta E}{\delta \rho_t} = 0.$
    The corresponding Hamiltonian is
    $H_F(\rho_t, \Phi_t) = \frac{1}{2} \left( \int \Phi_t^2 \rho_t\, dx - \left( \int \rho_t \Phi_t\, dx \right)^2 \right) + E(\rho_t).$
  16. Wasserstein Hamiltonian flow

    The Wasserstein Hamiltonian flow writes
    $\partial_t \rho_t + \nabla \cdot (\rho_t \nabla \Phi_t) = 0, \qquad \partial_t \Phi_t + \frac{1}{2} \|\nabla \Phi_t\|^2 + \frac{\delta E}{\delta \rho_t} = 0.$
    The corresponding Hamiltonian is
    $H_W(\rho_t, \Phi_t) = \frac{1}{2} \int \|\nabla \Phi_t\|^2 \rho_t\, dx + E(\rho_t).$
    This is identical to the Wasserstein Hamiltonian flow introduced by Chow et al.¹
    ¹ Chow et al. Wasserstein Hamiltonian flows. Journal of Differential Equations, 2019.
  17. Accelerated Information Gradient flow

    Let $\alpha_t \geq 0$ be a scalar function of $t$. We add a damping term $\alpha_t \Phi_t$ to the Hamiltonian flow:
    $\partial_t \begin{pmatrix} \rho_t \\ \Phi_t \end{pmatrix} + \begin{pmatrix} 0 \\ \alpha_t \Phi_t \end{pmatrix} - \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} \begin{pmatrix} \frac{\delta}{\delta \rho_t} H(\rho_t, \Phi_t) \\ \frac{\delta}{\delta \Phi_t} H(\rho_t, \Phi_t) \end{pmatrix} = 0.$
    This renders the Accelerated Information Gradient (AIG) flow
    $\partial_t \rho_t - G(\rho_t)^{-1} \Phi_t = 0, \qquad \partial_t \Phi_t + \alpha_t \Phi_t + \frac{1}{2} \frac{\delta}{\delta \rho_t} \int \Phi_t G(\rho_t)^{-1} \Phi_t\, dx + \frac{\delta E}{\delta \rho_t} = 0, \qquad \text{(AIG)}$
    with initial values $\rho_0 = \rho^0$ and $\Phi_0 = 0$.
  18. The choice of $\alpha_t$

    The choice of $\alpha_t$ depends on the geodesic convexity of $E(\rho)$, which has an equivalent definition as follows.
    Definition. For a functional $E(\rho)$ defined on the probability space, we say that $E(\rho)$ has a $\beta$-positive Hessian (in short, Hess($\beta$)) w.r.t. the metric $g_\rho$ if there exists a constant $\beta \geq 0$ such that for any $\rho \in \mathcal{P}(\Omega)$ and any $\sigma \in T_\rho \mathcal{P}(\Omega)$, we have
    $g_\rho(\mathrm{Hess}\, E(\rho) \sigma, \sigma) \geq \beta\, g_\rho(\sigma, \sigma).$
    Here $\mathrm{Hess}$ is the Hessian operator w.r.t. $g_\rho$. If $E(\rho)$ is Hess($\beta$) for $\beta > 0$, then $\alpha_t = 2\sqrt{\beta}$. If $E(\rho)$ is Hess($0$), then $\alpha_t = 3/t$.
  19. Fisher-Rao AIG flow

    The Fisher-Rao AIG flow writes
    $\partial_t \rho_t - \left( \Phi_t - \int \Phi_t \rho_t\, dx \right) \rho_t = 0, \qquad \partial_t \Phi_t + \alpha_t \Phi_t + \frac{1}{2} \Phi_t^2 - \left( \int \rho_t \Phi_t\, dx \right) \Phi_t + \frac{\delta E}{\delta \rho_t} = 0. \qquad \text{(F-AIG)}$
  20. Wasserstein AIG flow

    The Wasserstein AIG flow in the Eulerian formulation follows
    $\partial_t \rho_t + \nabla \cdot (\rho_t \nabla \Phi_t) = 0, \qquad \partial_t \Phi_t + \alpha_t \Phi_t + \frac{1}{2} \|\nabla \Phi_t\|^2 + \frac{\delta E}{\delta \rho_t} = 0. \qquad \text{(W-AIG)}$
    The Wasserstein AIG flow in the Lagrangian formulation writes
    $dX_t = V_t\, dt, \qquad dV_t = -\alpha_t V_t\, dt - \nabla \frac{\delta E}{\delta \rho_t}(X_t)\, dt.$
    This is a particular type of initial-value mean-field game system.
  21. Examples in Gaussian families

    Theorem. Suppose that $\rho^0, \rho^* \in \mathcal{N}_0^n$ are centered Gaussians with covariance matrices $\Sigma^0$ and $\Sigma^*$, and $E(\Sigma)$ evaluates the KL divergence from $\rho$ to $\rho^*$. The solution to (W-AIG) with initial value $\rho_0 = \rho^0$ takes the form
    $\rho_t(x) = (2\pi)^{-n/2} \det(\Sigma_t)^{-1/2} \exp\left( -\frac{1}{2} x^T \Sigma_t^{-1} x \right), \qquad \Phi_t(x) = \frac{1}{2} x^T S_t x + C(t),$
    where $C(t) = -t + \frac{1}{2} \int_0^t \log\det(\Sigma_s (\Sigma^*)^{-1})\, ds$. Here $(\Sigma_t, S_t)$ satisfies
    $\dot{\Sigma}_t - (S_t \Sigma_t + \Sigma_t S_t) = 0, \qquad \dot{S}_t + \alpha_t S_t + S_t^2 + 2\nabla_{\Sigma_t} E(\Sigma_t) = 0, \qquad \text{(W-AIG-G)}$
    with initial values $\Sigma_0 = \Sigma^0$ and $S_0 = 0$. In particular, for any $t \geq 0$, $\Sigma_t$ is well-defined and stays positive definite.
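The (W-AIG-G) system can be integrated directly. The sketch below is my own illustration (not code from the talk): it uses forward Euler and assumes $E$ is the KL divergence between centered Gaussians, for which $2\nabla_\Sigma E(\Sigma) = (\Sigma^*)^{-1} - \Sigma^{-1}$; the choice $\beta = 1/\lambda_{\max}(\Sigma^*)$ is the strong-convexity constant of the corresponding potential.

```python
import numpy as np

def w_aig_gaussian(Sigma0, Sigma_star, beta, dt=1e-3, T=20.0):
    """Forward-Euler integration of (W-AIG-G) with alpha_t = 2*sqrt(beta):
       dSigma/dt = S Sigma + Sigma S,  dS/dt = -alpha S - S^2 - 2 grad_Sigma E."""
    n = Sigma0.shape[0]
    Sigma, S = Sigma0.copy(), np.zeros((n, n))
    alpha = 2.0 * np.sqrt(beta)
    P_star = np.linalg.inv(Sigma_star)
    for _ in range(int(T / dt)):
        grad2 = P_star - np.linalg.inv(Sigma)   # 2 * grad_Sigma E for the KL energy
        Sigma, S = (Sigma + dt * (S @ Sigma + Sigma @ S),
                    S + dt * (-alpha * S - S @ S - grad2))
    return Sigma

Sigma_star = np.diag([1.0, 4.0])
Sigma_T = w_aig_gaussian(np.eye(2), Sigma_star, beta=0.25)
```

Consistent with the theorem, `Sigma_T` stays positive definite along the flow and converges to `Sigma_star`.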
  22. Convergence rate of W-AIG flows

    Theorem. Suppose that $E(\rho)$ satisfies Hess($\beta$) for $\beta > 0$. The solution $\rho_t$ to (W-AIG) with $\alpha_t = 2\sqrt{\beta}$ satisfies
    $E(\rho_t) \leq C_0 e^{-\sqrt{\beta} t} = O(e^{-\sqrt{\beta} t}).$
    If $E(\rho)$ only satisfies Hess($0$), then the solution $\rho_t$ to (W-AIG) with $\alpha_t = 3/t$ satisfies
    $E(\rho_t) \leq C_0' t^{-2} = O(t^{-2}).$
    Here the constants $C_0, C_0'$ only depend on $\rho^0$.
  23. Remarks

    Hess($\beta$) is equivalent to $\beta$-geodesic convexity in the probability space w.r.t. $g_\rho$. For the Wasserstein metric, it is also known as $\beta$-displacement convexity. Consider the case where $E(\rho)$ is the KL divergence and the target density takes the form $\rho^* \propto \exp(-f(x))$.
    – A sufficient condition for Hess($\beta$) is that $f(x)$ is $\beta$-strongly convex.
    – The W-AIG flow improves the $O(e^{-2\beta t})$ convergence rate of the gradient flow to $O(e^{-\sqrt{\beta} t})$, especially when $\beta$ is close to $0$.
  24. Sketch of the proof

    Given $\rho_t$, we can find the optimal transport map $T_t$ from $\rho_t$ to $\rho^*$.
    Proposition. Denote the geodesic curve $\gamma(s)$ that connects $\rho_t$ and $\rho^*$ by $\gamma(s) = (s T_t + (1 - s)\mathrm{Id})_{\#} \rho_t$, $s \in [0, 1]$. Here $\mathrm{Id}$ is the identity mapping from $\mathbb{R}^n$ to itself. Then $\dot{\gamma}(0)$ corresponds to the tangent vector $-\nabla \cdot (\rho_t(x)(T_t(x) - x)) \in T_{\rho_t} \mathcal{P}(\Omega)$.
  25. Sketch of the proof

    We first consider the case where $E(\rho)$ satisfies Hess($\beta$) for $\beta > 0$. Construct the following Lyapunov function:
    $\mathcal{E}(t) = \frac{e^{\sqrt{\beta} t}}{2} \int \left\| -\sqrt{\beta}\,(T_t(x) - x) + \nabla \Phi_t(x) \right\|^2 \rho_t(x)\, dx + e^{\sqrt{\beta} t} \left( E(\rho_t) - E(\rho^*) \right).$
    Proposition. Suppose that $E(\rho)$ satisfies Hess($\beta$) for $\beta > 0$ and $\rho_t$ is the solution to (W-AIG) with $\alpha_t = 2\sqrt{\beta}$. Then $\mathcal{E}(t)$ defined above satisfies $\dot{\mathcal{E}}(t) \leq 0$. As a result,
    $E(\rho_t) \leq e^{-\sqrt{\beta} t} \mathcal{E}(t) \leq e^{-\sqrt{\beta} t} \mathcal{E}(0) = O(e^{-\sqrt{\beta} t}).$
  26. Sketch of the proof

    We now consider the case where $E(\rho)$ satisfies Hess($0$). Construct the following Lyapunov function:
    $\mathcal{E}(t) = \frac{1}{2} \int \left\| -(T_t(x) - x) + \frac{t}{2} \nabla \Phi_t(x) \right\|^2 \rho_t(x)\, dx + \frac{t^2}{4} \left( E(\rho_t) - E(\rho^*) \right).$
    Proposition. Suppose that $E(\rho)$ satisfies Hess($0$) and $\rho_t$ is the solution to (W-AIG) with $\alpha_t = 3/t$. Then $\mathcal{E}(t)$ defined above satisfies $\dot{\mathcal{E}}(t) \leq 0$. Hence,
    $E(\rho_t) \leq \frac{4}{t^2} \mathcal{E}(t) \leq \frac{4}{t^2} \mathcal{E}(0) = O(t^{-2}).$
  27. Sketch of the proof

    In the proofs of the previous two propositions, we show the following relationship:
    $\int \left\langle \nabla \Phi_t,\ \partial_t T_t + \nabla T_t \nabla \Phi_t \right\rangle \rho_t\, dx \geq 0.$
    Equality holds only in the one-dimensional case. This is based on the Hodge decomposition. Taghvaei and Mehta incorrectly assume that this quantity is $0$.
  28. W-AIG flows at the particle level

    Proposition. Suppose that $X_t \sim \rho_t$ and $V_t = \nabla \Phi_t(X_t)$ are the position and the velocity of a particle at time $t$. Then the differential equation of the particle system corresponding to (W-AIG) writes
    $dX_t = V_t\, dt, \qquad dV_t = -\alpha_t V_t\, dt - \nabla f(X_t)\, dt - \nabla \log \rho_t(X_t)\, dt.$
  29. Discrete-time algorithm for the W-AIG flow

    For $i = 1, 2, \ldots, N$, the update rule follows
    $V_{k+1}^i = \alpha_k V_k^i - \sqrt{\tau} \left( \nabla f(X_k^i) + \xi_k(X_k^i) \right), \qquad X_{k+1}^i = X_k^i + \sqrt{\tau}\, V_{k+1}^i.$
    – If $E(\rho)$ is Hess($\beta$), then $\alpha_k = \frac{1 - \sqrt{\beta \tau}}{1 + \sqrt{\beta \tau}}$; if $E(\rho)$ is Hess($0$) or $\beta$ is unknown, then $\alpha_k = \frac{k-1}{k+2}$.
    – Here $\xi_k(x)$ is an approximation of $\nabla \log \rho_t(x)$.
    Two difficulties remain:
    – The approximation of $\nabla \log \rho_t(x)$.
    – The stiffness of AIG flows.
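A runnable sketch of this update rule (my own illustration, not the talk's code): instead of a KDE estimate, the score $\xi_k$ is taken from a Gaussian fit to the particles, which is exact when the particle distribution stays Gaussian.

```python
import numpy as np

def aig_particles(grad_f, X, tau=1e-2, iters=400, beta=None):
    """Discrete-time W-AIG updates: V <- a_k V - sqrt(tau)(grad f + xi),
    X <- X + sqrt(tau) V.  Here xi is the Gaussian-fit score
    xi(x) = -Cov^{-1}(x - mean), exact for Gaussian particle clouds."""
    N, n = X.shape
    V = np.zeros_like(X)
    st = np.sqrt(tau)
    for k in range(1, iters + 1):
        if beta is None:
            a = (k - 1) / (k + 2)                            # Hess(0) / unknown beta
        else:
            a = (1 - np.sqrt(beta * tau)) / (1 + np.sqrt(beta * tau))
        m = X.mean(axis=0)
        C = np.cov(X.T).reshape(n, n)
        xi = -(X - m) @ np.linalg.inv(C).T                   # score of fitted Gaussian
        V = a * V - st * (grad_f(X) + xi)
        X = X + st * V
    return X

rng = np.random.default_rng(1)
# Target rho* = N(0, I_2): f(x) = ||x||^2 / 2 is 1-strongly convex.
X = aig_particles(lambda x: x, rng.standard_normal((500, 2)) + 4.0, beta=1.0)
```

The particle cloud, initialized far from the target, is driven back to empirical mean $0$ and covariance $I$.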
  30. Review of the choices of $\xi_k$

    If $X_k^i$ follows a Gaussian distribution, then $\xi_k(x) = -\Sigma_k^{-1}(x - m_k)$, where $m_k$ and $\Sigma_k$ are the mean and the covariance of $\{X_k^i\}_{i=1}^N$. For the non-Gaussian case, we use kernel density estimation (KDE¹): $\xi_k$ writes
    $\xi_k(x) = \nabla \log \tilde{\rho}_k(x) = \frac{\sum_{i=1}^N \nabla_x K(x, X_k^i)}{\sum_{i=1}^N K(x, X_k^i)}.$
    A common choice of $K(x, y)$ is a Gaussian kernel with bandwidth $h$,
    $K_G(x, y; h) = (2\pi h)^{-n/2} \exp\left( -\|x - y\|^2 / (2h) \right).$
    ¹ Radhey S. Singh. Improvement on some known nonparametric uniformly consistent estimators of derivatives of a density. The Annals of Statistics, pp. 394-399, 1977.
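A sketch of the KDE score estimate with the Gaussian kernel (illustrative, with a hand-picked bandwidth; note the normalizing constant of $K_G$ cancels in the ratio):

```python
import numpy as np

def kde_score(x, particles, h):
    """xi(x) = sum_i grad_x K(x, X_i) / sum_i K(x, X_i) with the Gaussian
    kernel K(x, y; h) = (2 pi h)^{-n/2} exp(-||x - y||^2 / (2h))."""
    diff = x - particles                            # (N, n)
    w = np.exp(-np.sum(diff**2, axis=1) / (2 * h))  # unnormalized kernel weights
    grad_w = -(diff / h) * w[:, None]               # grad_x of each kernel term
    return grad_w.sum(axis=0) / w.sum()

# Sanity check on samples from N(0, 1): the KDE-smoothed density is N(0, 1 + h),
# so the estimated score should be about -x / (1 + h).
rng = np.random.default_rng(2)
P = rng.standard_normal((4000, 1))
s = kde_score(np.array([0.0]), P, h=0.5)   # expected near 0
s2 = kde_score(np.array([2.0]), P, h=0.5)  # expected near -2 / 1.5
```

The bandwidth `h = 0.5` is an arbitrary demonstration value; the next slides are precisely about choosing it well.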
  31. Selection of the bandwidth

    SVGD¹ uses a median (MED) method to choose the bandwidth, i.e.,
    $h_k = \frac{1}{2 \log(N+1)}\, \mathrm{median}\left\{ \|X_k^i - X_k^j\|^2 \right\}_{i,j=1}^N.$
    Liu et al.² propose a Heat Equation (HE) method to adaptively adjust the bandwidth. Motivated by the HE method, we introduce the Brownian motion (BM) method to adaptively learn the bandwidth.
    ¹ Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2378-2386, 2016.
    ² Chang Liu, Jingwei Zhuo, Pengyu Cheng, Ruiyi Zhang, Jun Zhu, and Lawrence Carin. Accelerated first-order methods on the Wasserstein space for Bayesian inference. arXiv preprint arXiv:1807.01750, 2018.
  32. The BM method

    Given the bandwidth $h$, $\{X_k^i\}_{i=1}^N$ and a step size $s$, we can compute two particle systems
    $Y_k^i(h) = X_k^i - s\, \xi_k(X_k^i; h), \qquad Z_k^i = X_k^i + \sqrt{2s}\, B^i, \qquad i = 1, \ldots, N,$
    where $B^i$ is a standard Brownian motion. Minimize $\mathrm{MMD}(\hat{\rho}_Y, \hat{\rho}_Z)$, the maximum mean discrepancy between the empirical distributions $\hat{\rho}_Y(x) = \frac{1}{N} \sum_{i=1}^N \delta(x - Y_k^i(h))$ and $\hat{\rho}_Z(x) = \frac{1}{N} \sum_{i=1}^N \delta(x - Z_k^i)$. Optimize over $h$, using the bandwidth $h_{k-1}$ from the last iteration as the initialization.
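A sketch of the BM method under two stated assumptions of mine: the slides do not specify the MMD kernel, so an RBF kernel is used here, and the optimization over $h$ is replaced by a simple grid search rather than a warm-started continuous optimizer.

```python
import numpy as np

def kde_score_all(X, h):
    """KDE score xi(x; h) evaluated at every particle (Gaussian kernel)."""
    diff = X[:, None, :] - X[None, :, :]             # (N, N, n)
    W = np.exp(-np.sum(diff**2, axis=2) / (2 * h))   # kernel weights
    G = -(diff / h) * W[:, :, None]                  # gradients of each kernel term
    return G.sum(axis=1) / W.sum(axis=1, keepdims=True)

def mmd2(Y, Z, sigma=1.0):
    """Biased squared MMD between two samples, with an RBF kernel
    (kernel choice is an assumption; the slides leave it unspecified)."""
    def k(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-d2 / (2 * sigma**2))
    return k(Y, Y).mean() + k(Z, Z).mean() - 2 * k(Y, Z).mean()

def bm_bandwidth(X, s, grid, rng):
    """BM method (grid-search variant): pick h so that the deterministic update
    Y(h) = X - s * xi(X; h) best matches the Brownian proxy Z = X + sqrt(2s) B."""
    Z = X + np.sqrt(2 * s) * rng.standard_normal(X.shape)
    return min(grid, key=lambda h: mmd2(X - s * kde_score_all(X, h), Z))

rng = np.random.default_rng(3)
P = rng.standard_normal((300, 1))
grid = [0.05, 0.2, 0.5, 1.0, 2.0]
h_sel = bm_bandwidth(P, s=0.05, grid=grid, rng=rng)
```

The matching target makes sense because, for a well-estimated score, the deterministic step $X - s\,\xi$ reproduces the same heat-equation contraction/expansion as the Brownian step $X + \sqrt{2s}\,B$.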
  33. Adaptive restart

    The second equation in (W-AIG) is a Hamilton-Jacobi equation, which is typically stiff. We propose an adaptive restart technique to deal with this problem. Consider
    $\varphi_k = -\sum_{i=1}^N \left\langle V_{k+1}^i,\ \nabla f(X_k^i) + \xi_k(X_k^i) \right\rangle,$
    which can be viewed as a discrete-time approximation of $-\partial_t E(\rho_t)$. If $\varphi_k < 0$, then we restart the algorithm with initial values $X_0^i = X_k^i$ and $V_0^i = 0$. This essentially keeps $\partial_t E(\rho_t)$ negative along the trajectory. Numerical results show that adaptive restart accelerates and stabilizes the discrete-time algorithm.
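The restart test itself is a one-line check in the particle implementation. A minimal sketch (the function name and the toy vectors are mine):

```python
import numpy as np

def should_restart(V_next, grad_total):
    """Restart check: phi_k = -sum_i <V_{k+1}^i, grad f(X_k^i) + xi_k(X_k^i)>.
    phi_k approximates -d/dt E(rho_t); restart (reset velocities to zero) when
    phi_k < 0, i.e. when following the momentum would increase the energy."""
    phi = -np.sum(V_next * grad_total)
    return bool(phi < 0)

# Velocity pointing against the total gradient decreases E: keep the momentum.
no_restart_case = should_restart(np.array([[1.0]]), np.array([[-2.0]]))  # phi = 2
# Velocity pointing along the total gradient increases E: restart.
restart_case = should_restart(np.array([[1.0]]), np.array([[2.0]]))      # phi = -2
```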
  34. Overall algorithm

    Algorithm 1: Particle implementation of the Wasserstein AIG flow
    Require: initial positions $\{X_0^i\}_{i=1}^N$, step size $\tau$, number of iterations $L$.
    1: Set $k = 0$, $V_0^i = 0$, $i = 1, \ldots, N$. Set the bandwidth $h_0$ by MED.
    2: for $l = 1, 2, \ldots, L$ do
    3:   Compute $h_l$ based on BM: $h_l = \mathrm{BM}(h_{l-1}, \{X_k^i\}_{i=1}^N, \sqrt{\tau})$.
    4:   Calculate $\xi_k(X_k^i)$ with the bandwidth $h_l$.
    5:   Set $\alpha_k$ based on whether $E(\rho)$ is Hess($\beta$) or Hess($0$). Update
         $V_{k+1}^i = \alpha_k V_k^i - \sqrt{\tau}(\nabla f(X_k^i) + \xi_k(X_k^i)), \qquad X_{k+1}^i = X_k^i + \sqrt{\tau}\, V_{k+1}^i.$
    6:   if RESTART then
    7:     Compute $\varphi_k = -\sum_{i=1}^N \langle V_{k+1}^i, \nabla f(X_k^i) + \xi_k(X_k^i) \rangle$.
    8:     If $\varphi_k < 0$, set $X_0^i = X_k^i$, $V_0^i = 0$ and $k = 0$; otherwise set $k = k + 1$.
    9:   else
    10:    Set $k = k + 1$.
    11:  end if
    12: end for
  35. Toy example

    The target density $\rho^*$ is a toy bimodal distribution¹. We compare two types of particle implementations of the Wasserstein gradient flow of the KL divergence:
    $X_{k+1}^i = X_k^i - \tau \nabla f(X_k^i) + \sqrt{2\tau}\, B_k^i, \qquad X_{k+1}^i = X_k^i - \tau \nabla f(X_k^i) - \tau \xi_k(X_k^i).$
    Here $B_k^i \sim N(0, 1)$ is a standard Gaussian increment and $\xi_k$ is computed by KDE. The first method is known as the MCMC method. The second method is called the ParVI method, whose bandwidth $h$ is selected by MED/HE/BM respectively.
    ¹ Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
  36. Toy example (figure)

  37. Gaussian examples: ODE level

    The initial value is set to $\Sigma_0 = I$. The following figure presents the convergence of the KL divergence for two target distributions with small/large $L$.
    Figure: The acceleration of the W-AIG flow and the effect of adaptive restart (ODE level). The target density is a Gaussian distribution with zero mean on $\mathbb{R}^{100}$. Left: $L = 1$, $\kappa \approx 3.8 \times 10^3$. Right: $\beta = 1$, $\kappa \approx 4.0 \times 10^3$.
  38. Gaussian examples: particle level

    The initial distribution of samples follows $N(0, I)$ and the number of samples is $N = 600$. For a particle system $\{X_k^i\}_{i=1}^N$, we record the KL divergence $E(\hat{\Sigma}_k)$ using the empirical covariance matrix $\hat{\Sigma}_k$.
    Figure: The acceleration of the W-AIG flow and the effect of adaptive restart (particle level). The setting of the target density is identical to the previous figure.
  39. Bayesian logistic regression

    We perform the standard Bayesian logistic regression experiment on the Covertype dataset, following the same settings as Liu et al.¹. We compare our methods with MCMC, SVGD, WNAG² and WNes³. We select the bandwidth using either the MED method or the proposed BM method. The BM method is not applied to SVGD.
    ¹ Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2378-2386, 2016.
    ² Chang Liu, Jingwei Zhuo, Pengyu Cheng, Ruiyi Zhang, Jun Zhu, and Lawrence Carin. Accelerated first-order methods on the Wasserstein space for Bayesian inference. arXiv preprint arXiv:1807.01750, 2018.
    ³ Chang Liu, Jingwei Zhuo, Pengyu Cheng, Ruiyi Zhang, and Jun Zhu. Understanding and accelerating particle-based variational inference. In International Conference on Machine Learning, pp. 4082-4092, 2019.
  40. Bayesian logistic regression

    Figure: Comparison of different methods on Bayesian logistic regression, averaged over 10 independent trials. The shaded areas show the variance over the 10 trials. Top: BM. Bottom: MED. Left: test accuracy. Right: test log-likelihood.
  41. Future works

    Apply AIG flows to Hessian transport metrics. Investigate the application of our results in matrix optimization. Systematically explain the stiffness of AIG flows and the effect of the adaptive restart. Apply our results to general information metrics, especially the Fisher-Rao metric. Study related sampling-efficient optimization methods and discrete-time algorithms on probability models.