Slide 1

Accelerated Information Gradient flow
Wuchen Li, UCLA
This is joint work with Yifei Wang.

Slide 2

Distance and Optimization

Slide 3

Problem background
Recently, optimization problems on the space of probability distributions and on probability models have attracted increasing attention from the machine learning community. Examples include variational inference, Bayesian inference, Generative Adversarial Networks (GANs), and policy optimization.
Figure: Results from Wasserstein GAN.

Slide 4

Optimization problem in learning
In learning, many problems can be formulated as an optimization problem in the probability space,
$$\min_{\rho \in \mathcal{P}(\Omega)} E(\rho).$$
Here $\Omega$ is a region in $\mathbb{R}^n$ and $E(\rho)$ is a divergence or metric loss functional between $\rho$ and a target density $\rho^* \in \mathcal{P}(\Omega)$.
– The KL divergence from $\rho$ to $\rho^*$: $E(\rho) = D_{\mathrm{KL}}(\rho \,\|\, \rho^*) = \int \log\frac{\rho}{\rho^*}\, \rho\, dx.$
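For concreteness, here is a minimal numerical check of the KL loss above on a 1D grid; the two Gaussian densities are hypothetical choices for illustration, not from the slides.

```python
import numpy as np

# Minimal sketch: approximate D_KL(rho || rho*) = ∫ log(rho/rho*) rho dx on a 1D grid.
# The two Gaussian densities below are hypothetical choices for illustration.
x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def gaussian(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

rho = gaussian(x, 0.0, 1.0)        # current density rho
rho_star = gaussian(x, 1.0, 2.0)   # target density rho*

kl = np.sum(np.log(rho / rho_star) * rho) * dx
print(kl)   # close to the closed-form value log(2) + 2/8 - 1/2 ≈ 0.443
```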

Slide 5

Classical methods
Gradient descent methods with sampling-efficient properties play an essential role in solving these optimization problems.
– The gradient descent direction often relies on an information metric over the probability space.
– This direction naturally reflects the change of the loss functional with respect to the metric.
In the literature, two information metrics are of particular interest: the Fisher-Rao metric and the Wasserstein-2 (in short, Wasserstein) metric.
– Fisher-Rao gradient: Adam and K-FAC.
– Wasserstein gradient: Markov chain Monte Carlo (MCMC) methods, particle-based variational inference (ParVI) methods, and GANs.

Slide 6

Optimization problems in the Euclidean space
Methods for optimization problems in the Euclidean space have been well studied:
$$\min_{x \in \mathbb{R}^n} f(x).$$
The classical gradient descent method follows
$$x_{k+1} = x_k - s \nabla f(x_k).$$
In continuous time, it corresponds to the gradient flow
$$\dot{x} = -\nabla f(x).$$
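A minimal sketch of the discrete update above, on a hypothetical strongly convex quadratic objective:

```python
import numpy as np

# Minimal sketch of gradient descent x_{k+1} = x_k - s * grad f(x_k)
# on a hypothetical quadratic f(x) = 0.5 * x^T A x.
A = np.diag([1.0, 10.0])          # assumed test problem
grad_f = lambda x: A @ x

x = np.array([5.0, 5.0])
s = 0.05                          # step size
for k in range(200):
    x = x - s * grad_f(x)
print(x)                          # converges to the minimizer x* = 0
```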

Slide 7

Accelerated gradient flows in the Euclidean space
Nesterov's accelerated method is widely applied to accelerate vanilla gradient descent. Su et al.¹ show that it corresponds to the so-called accelerated gradient flow
$$\ddot{x} + \alpha_t \dot{x} + \nabla f(x) = 0.$$
This is equivalent to a damped Hamiltonian system
$$\begin{pmatrix} \dot{x} \\ \dot{p} \end{pmatrix} + \begin{pmatrix} 0 \\ \alpha_t p \end{pmatrix} - \begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix} \begin{pmatrix} \nabla_x H_E(x, p) \\ \nabla_p H_E(x, p) \end{pmatrix} = 0, \quad H_E(x, p) = \frac{1}{2}\|p\|_2^2 + f(x),$$
with initial values $x(0) = x_0$ and $p(0) = 0$.
The choice of $\alpha_t$ depends on the property of $f(x)$: if $f(x)$ is $\beta$-strongly convex, then $\alpha_t = 2\sqrt{\beta}$; if $f(x)$ is convex, then $\alpha_t = 3/t$.
¹ Weijie Su et al. A differential equation for modeling Nesterov accelerated gradient method: Theory and insights. Journal of Machine Learning Research, 2016.
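A minimal sketch of integrating the accelerated gradient flow with a simple semi-implicit Euler scheme, on the same hypothetical quadratic as above; the damping uses the strongly convex choice of $\alpha_t$.

```python
import numpy as np

# Minimal sketch: integrate the accelerated gradient flow
#   x'' + alpha_t * x' + grad f(x) = 0
# with a semi-implicit Euler scheme, on a hypothetical strongly convex quadratic.
A = np.diag([1.0, 10.0])
grad_f = lambda x: A @ x
beta = 1.0                         # strong convexity constant (smallest eigenvalue of A)
alpha = 2.0 * np.sqrt(beta)        # damping for the strongly convex case

x = np.array([5.0, 5.0])
p = np.zeros(2)                    # p(0) = 0
dt = 0.01
for k in range(5000):
    p = p - dt * (alpha * p + grad_f(x))   # momentum update
    x = x + dt * p                          # position update
print(x)   # approaches the minimizer 0 at rate O(exp(-sqrt(beta) * t))
```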

Slide 8

Accelerated gradient flows in learning?
Does there exist a counterpart of the accelerated gradient flow in the probability space under information metrics? Several previous works explore accelerated methods under the Wasserstein metric.
– Liu et al. (ICML 2019): an acceleration framework for ParVI methods based on manifold optimization.
– Taghvaei and Mehta (ICML 2019): an accelerated flow from an optimal control perspective.
– Cheng et al. and Ma et al.: underdamped Langevin dynamics.

Slide 9

Contributions
A unified framework of accelerated gradient flows in the probability space equipped with information metrics: Accelerated Information Gradient (AIG) flows.
In theory,
– verify the existence of solutions to AIG flows in Gaussian families;
– rigorously prove the convergence rate of AIG flows and remove an unnecessary assumption in Taghvaei and Mehta (ICML 2019).
In numerics,
– propose a novel kernel selection method (the BM method);
– introduce an adaptive restart technique to handle the stiffness of AIG flows.

Slide 10

Metric
Definition (Metric in the probability space). A metric tensor $G(\rho) : T_\rho \mathcal{P}(\Omega) \to T^*_\rho \mathcal{P}(\Omega)$ is an invertible mapping from the tangent space at $\rho$ to the cotangent space at $\rho$. This metric tensor defines the metric (inner product) on the tangent space $T_\rho \mathcal{P}(\Omega)$. Namely, for $\sigma_1, \sigma_2 \in T_\rho \mathcal{P}(\Omega)$, we define
$$g_\rho(\sigma_1, \sigma_2) = \int \sigma_1\, G(\rho)\, \sigma_2\, dx = \int \Phi_1\, G(\rho)^{-1} \Phi_2\, dx,$$
where $\Phi_i$ is the solution to $\sigma_i = G(\rho)^{-1} \Phi_i$, $i = 1, 2$.

Slide 11

Example: Fisher-Rao metric
The Fisher-Rao metric comes from information geometry. The inverse of the Fisher-Rao metric tensor is defined by
$$G_F(\rho)^{-1} \Phi = \rho\left(\Phi - \int \Phi\, \rho\, dx\right), \quad \Phi \in T^*_\rho \mathcal{P}(\Omega).$$
For $\sigma_1, \sigma_2 \in T_\rho \mathcal{P}(\Omega)$, the Fisher-Rao metric on the tangent space follows
$$g^F_\rho(\sigma_1, \sigma_2) = \int \Phi_1 \Phi_2\, \rho\, dx - \left(\int \Phi_1\, \rho\, dx\right)\left(\int \Phi_2\, \rho\, dx\right),$$
where $\Phi_i$ is the solution to $\sigma_i = \rho\left(\Phi_i - \int \Phi_i\, \rho\, dx\right)$, $i = 1, 2$.

Slide 12

Example: Wasserstein metric
The Wasserstein metric comes from optimal transport¹. The inverse of the Wasserstein metric tensor is defined by
$$G_W(\rho)^{-1} \Phi = -\nabla \cdot (\rho \nabla \Phi), \quad \Phi \in T^*_\rho \mathcal{P}(\Omega).$$
The Wasserstein metric on the tangent space is given by
$$g^W_\rho(\sigma_1, \sigma_2) = \int \rho\, \langle \nabla \Phi_1, \nabla \Phi_2 \rangle\, dx, \quad \sigma_1, \sigma_2 \in T_\rho \mathcal{P}(\Omega),$$
where $\Phi_i$ is the solution to $\sigma_i = -\nabla \cdot (\rho \nabla \Phi_i)$, $i = 1, 2$.
¹ Cédric Villani. Topics in Optimal Transportation. American Mathematical Society, 2003.

Slide 13

Gradient flows
The gradient flow for $E(\rho)$ in $(\mathcal{P}(\Omega), g_\rho)$ takes the form
$$\partial_t \rho_t = -G(\rho_t)^{-1} \frac{\delta E(\rho_t)}{\delta \rho_t}.$$
The Fisher-Rao gradient flow is given by
$$\partial_t \rho_t = -\rho_t \left(\frac{\delta E}{\delta \rho_t} - \int \frac{\delta E}{\delta \rho_t}\, \rho_t\, dx\right).$$
The Wasserstein gradient flow writes
$$\partial_t \rho_t = \nabla \cdot \left(\rho_t \nabla \frac{\delta E}{\delta \rho_t}\right).$$

Slide 14

Hamiltonian flows
The Euler-Lagrange equation in the probability space can be formulated as a system in $(\rho_t, \Phi_t)$, i.e.,
$$\begin{cases} \partial_t \rho_t - G(\rho_t)^{-1} \Phi_t = 0, \\ \partial_t \Phi_t + \dfrac{1}{2} \dfrac{\delta}{\delta \rho_t} \displaystyle\int \Phi_t\, G(\rho_t)^{-1} \Phi_t\, dx + \dfrac{\delta E}{\delta \rho_t} = 0, \end{cases}$$
where $\Phi_t$ is determined up to a spatially constant shift. The above PDE system is the Hamiltonian flow
$$\partial_t \begin{pmatrix} \rho_t \\ \Phi_t \end{pmatrix} - \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} \begin{pmatrix} \frac{\delta}{\delta \rho_t} \mathcal{H}(\rho_t, \Phi_t) \\ \frac{\delta}{\delta \Phi_t} \mathcal{H}(\rho_t, \Phi_t) \end{pmatrix} = 0,$$
with respect to the Hamiltonian
$$\mathcal{H}(\rho_t, \Phi_t) = \frac{1}{2} \int \Phi_t\, G(\rho_t)^{-1} \Phi_t\, dx + E(\rho_t).$$

Slide 15

Fisher-Rao Hamiltonian flow
The Fisher-Rao Hamiltonian flow follows
$$\begin{cases} \partial_t \rho_t - \rho_t\left(\Phi_t - \displaystyle\int \Phi_t\, \rho_t\, dx\right) = 0, \\ \partial_t \Phi_t + \dfrac{1}{2} \Phi_t^2 - \left(\displaystyle\int \rho_t \Phi_t\, dx\right) \Phi_t + \dfrac{\delta E}{\delta \rho_t} = 0. \end{cases}$$
The corresponding Hamiltonian is
$$\mathcal{H}_F(\rho_t, \Phi_t) = \frac{1}{2}\left(\int \Phi_t^2\, \rho_t\, dx - \left(\int \rho_t \Phi_t\, dx\right)^2\right) + E(\rho_t).$$

Slide 16

Wasserstein Hamiltonian flow
The Wasserstein Hamiltonian flow writes
$$\begin{cases} \partial_t \rho_t + \nabla \cdot (\rho_t \nabla \Phi_t) = 0, \\ \partial_t \Phi_t + \dfrac{1}{2} \|\nabla \Phi_t\|^2 + \dfrac{\delta E}{\delta \rho_t} = 0. \end{cases}$$
The corresponding Hamiltonian is
$$\mathcal{H}_W(\rho_t, \Phi_t) = \frac{1}{2} \int \|\nabla \Phi_t\|^2\, \rho_t\, dx + E(\rho_t).$$
This is identical to the Wasserstein Hamiltonian flow introduced by Chow et al.¹
¹ Chow et al. Wasserstein Hamiltonian flows. Journal of Differential Equations, 2019.

Slide 17

Accelerated Information Gradient flow
Let $\alpha_t \ge 0$ be a scalar function of $t$. We add a damping term $\alpha_t \Phi_t$ to the Hamiltonian flow:
$$\partial_t \begin{pmatrix} \rho_t \\ \Phi_t \end{pmatrix} + \begin{pmatrix} 0 \\ \alpha_t \Phi_t \end{pmatrix} - \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} \begin{pmatrix} \frac{\delta}{\delta \rho_t} \mathcal{H}(\rho_t, \Phi_t) \\ \frac{\delta}{\delta \Phi_t} \mathcal{H}(\rho_t, \Phi_t) \end{pmatrix} = 0.$$
This renders the Accelerated Information Gradient (AIG) flow
$$\begin{cases} \partial_t \rho_t - G(\rho_t)^{-1} \Phi_t = 0, \\ \partial_t \Phi_t + \alpha_t \Phi_t + \dfrac{1}{2} \dfrac{\delta}{\delta \rho_t} \displaystyle\int \Phi_t\, G(\rho_t)^{-1} \Phi_t\, dx + \dfrac{\delta E}{\delta \rho_t} = 0, \end{cases} \tag{AIG}$$
with initial values $\rho_{t=0} = \rho^0$ and $\Phi_{t=0} = 0$.

Slide 18

The choice of αt
The choice of $\alpha_t$ depends on the geodesic convexity of $E(\rho)$, which has the following equivalent definition.
Definition. For a functional $E(\rho)$ defined on the probability space, we say that $E(\rho)$ has a $\beta$-positive Hessian (in short, Hess($\beta$)) w.r.t. the metric $g_\rho$ if there exists a constant $\beta \ge 0$ such that for any $\rho \in \mathcal{P}(\Omega)$ and any $\sigma \in T_\rho \mathcal{P}(\Omega)$, we have
$$g_\rho(\mathrm{Hess}\, E(\rho)\, \sigma, \sigma) \ge \beta\, g_\rho(\sigma, \sigma).$$
Here Hess is the Hessian operator w.r.t. $g_\rho$.
If $E(\rho)$ is Hess($\beta$) for $\beta > 0$, then $\alpha_t = 2\sqrt{\beta}$. If $E(\rho)$ is Hess(0), then $\alpha_t = 3/t$.

Slide 19

Fisher-Rao AIG flow
The Fisher-Rao AIG flow writes
$$\begin{cases} \partial_t \rho_t - \left(\Phi_t - \displaystyle\int \Phi_t\, \rho_t\, dx\right) \rho_t = 0, \\ \partial_t \Phi_t + \alpha_t \Phi_t + \dfrac{1}{2} \Phi_t^2 - \left(\displaystyle\int \rho_t \Phi_t\, dx\right) \Phi_t + \dfrac{\delta E}{\delta \rho_t} = 0. \end{cases} \tag{F-AIG}$$

Slide 20

Wasserstein AIG flow
The Wasserstein AIG flow in the Eulerian formulation follows
$$\begin{cases} \partial_t \rho_t + \nabla \cdot (\rho_t \nabla \Phi_t) = 0, \\ \partial_t \Phi_t + \alpha_t \Phi_t + \dfrac{1}{2} \|\nabla \Phi_t\|^2 + \dfrac{\delta E}{\delta \rho_t} = 0. \end{cases} \tag{W-AIG}$$
The Wasserstein AIG flow in the Lagrangian formulation writes
$$\begin{cases} dX_t = V_t\, dt, \\ dV_t = -\alpha_t V_t\, dt - \nabla \dfrac{\delta E}{\delta \rho_t}(X_t)\, dt. \end{cases}$$
This is a particular type of initial-value mean-field game system.

Slide 21

Examples in Gaussian families
Theorem. Suppose that $\rho^0, \rho^* \in \mathcal{N}^n_0$ with covariance matrices $\Sigma^0$ and $\Sigma^*$, and that $E(\Sigma)$ evaluates the KL divergence from $\rho$ to $\rho^*$. The solution to (W-AIG) with initial value $\rho_{t=0} = \rho^0$ takes the form
$$\rho_t(x) = (2\pi)^{-n/2} \det(\Sigma_t)^{-1/2} \exp\left(-\frac{1}{2} x^T \Sigma_t^{-1} x\right), \quad \Phi_t(x) = \frac{1}{2} x^T S_t x + C(t),$$
where $C(t) = -t + \frac{1}{2} \int_0^t \log\det\left(\Sigma_s (\Sigma^*)^{-1}\right) ds$. Here $(\Sigma_t, S_t)$ satisfies
$$\dot{\Sigma}_t - (S_t \Sigma_t + \Sigma_t S_t) = 0, \quad \dot{S}_t + \alpha_t S_t + S_t^2 + 2\nabla_{\Sigma_t} E(\Sigma_t) = 0, \tag{W-AIG-G}$$
with initial values $\Sigma_0 = \Sigma^0$ and $S_0 = 0$. In particular, for any $t \ge 0$, $\Sigma_t$ is well defined and stays positive definite.
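A minimal sketch of integrating (W-AIG-G) with forward Euler on a hypothetical 2x2 example. The formula for $\nabla_\Sigma E$ below is a standard identity for the KL divergence between zero-mean Gaussians and is an assumption of the sketch, not stated on the slide.

```python
import numpy as np

# Minimal sketch: forward-Euler integration of (W-AIG-G) for a hypothetical 2x2 example.
# Assumption (not on the slide): for E(Sigma) = KL(N(0, Sigma) || N(0, Sigma*)),
# the Euclidean gradient is grad_Sigma E = 0.5 * (inv(Sigma*) - inv(Sigma)).
Sigma_star = np.diag([1.0, 4.0])
Sigma = np.eye(2)                  # Sigma_0 = I
S = np.zeros((2, 2))               # S_0 = 0
beta = 0.25                        # assumed Hess(beta) constant: smallest eigenvalue of inv(Sigma*)
alpha = 2.0 * np.sqrt(beta)
dt = 1e-3

def grad_E(Sigma):
    return 0.5 * (np.linalg.inv(Sigma_star) - np.linalg.inv(Sigma))

for k in range(20000):
    dSigma = S @ Sigma + Sigma @ S
    dS = -(alpha * S + S @ S + 2.0 * grad_E(Sigma))
    Sigma = Sigma + dt * dSigma
    S = S + dt * dS
print(Sigma)   # approaches Sigma_star
```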

Slide 22

Convergence rate of W-AIG flows
Theorem. Suppose that $E(\rho)$ satisfies Hess($\beta$) for $\beta > 0$. The solution $\rho_t$ to (W-AIG) with $\alpha_t = 2\sqrt{\beta}$ satisfies
$$E(\rho_t) \le C_0\, e^{-\sqrt{\beta} t} = O(e^{-\sqrt{\beta} t}).$$
If $E(\rho)$ only satisfies Hess(0), then the solution $\rho_t$ to (W-AIG) with $\alpha_t = 3/t$ satisfies
$$E(\rho_t) \le C_0'\, t^{-2} = O(t^{-2}).$$
Here the constants $C_0$ and $C_0'$ depend only on $\rho^0$.

Slide 23

Remarks
Hess($\beta$) is equivalent to $\beta$-geodesic convexity in the probability space w.r.t. $g_\rho$. For the Wasserstein metric, this is also known as $\beta$-displacement convexity.
Consider the case where $E(\rho)$ is the KL divergence and the target density takes the form $\rho^* \propto \exp(-f(x))$.
– A sufficient condition for Hess($\beta$) is that $f(x)$ is $\beta$-strongly convex.
– The W-AIG flow improves the $O(e^{-2\beta t})$ convergence rate of the gradient flow to $O(e^{-\sqrt{\beta} t})$, especially when $\beta$ is close to 0.

Slide 24

Sketch of the proof
Given $\rho_t$, we can find the optimal transport map $T_t$ from $\rho_t$ to $\rho^*$.
Proposition. Denote the geodesic curve $\gamma(s)$ that connects $\rho_t$ and $\rho^*$ by
$$\gamma(s) = (s T_t + (1 - s)\,\mathrm{Id})_\# \rho_t, \quad s \in [0, 1],$$
where $\mathrm{Id}$ is the identity mapping from $\mathbb{R}^n$ to itself. Then $\dot{\gamma}(0)$ corresponds to the tangent vector
$$-\nabla \cdot \left(\rho_t(x)\,(T_t(x) - x)\right) \in T_{\rho_t} \mathcal{P}(\Omega).$$

Slide 25

Sketch of the proof
We first consider the case where $E(\rho)$ satisfies Hess($\beta$) for $\beta > 0$. Construct the following Lyapunov function:
$$\mathcal{E}(t) = \frac{e^{\sqrt{\beta} t}}{2} \int \left\| -\sqrt{\beta}\,(T_t(x) - x) + \nabla \Phi_t(x) \right\|^2 \rho_t(x)\, dx + e^{\sqrt{\beta} t}\left(E(\rho_t) - E(\rho^*)\right).$$
Proposition. Suppose that $E(\rho)$ satisfies Hess($\beta$) for $\beta > 0$ and $\rho_t$ is the solution to (W-AIG) with $\alpha_t = 2\sqrt{\beta}$. Then $\mathcal{E}(t)$ defined above satisfies $\dot{\mathcal{E}}(t) \le 0$. As a result,
$$E(\rho_t) \le e^{-\sqrt{\beta} t}\, \mathcal{E}(t) \le e^{-\sqrt{\beta} t}\, \mathcal{E}(0) = O(e^{-\sqrt{\beta} t}).$$

Slide 26

Sketch of the proof
We now consider the case where $E(\rho)$ satisfies Hess(0). Construct the following Lyapunov function:
$$\mathcal{E}(t) = \frac{1}{2} \int \left\| -(T_t(x) - x) + \frac{t}{2} \nabla \Phi_t(x) \right\|^2 \rho_t(x)\, dx + \frac{t^2}{4}\left(E(\rho_t) - E(\rho^*)\right).$$
Proposition. Suppose that $E(\rho)$ satisfies Hess(0) and $\rho_t$ is the solution to (W-AIG) with $\alpha_t = 3/t$. Then $\mathcal{E}(t)$ defined above satisfies $\dot{\mathcal{E}}(t) \le 0$. Hence,
$$E(\rho_t) \le \frac{4}{t^2}\, \mathcal{E}(t) \le \frac{4}{t^2}\, \mathcal{E}(0) = O(t^{-2}).$$

Slide 27

Sketch of the proof
In the proof of the previous two propositions, we show the following relationship:
$$\int \left\langle \nabla \Phi_t,\ \partial_t T_t + \nabla T_t \nabla \Phi_t \right\rangle \rho_t\, dx \ge 0.$$
The equality holds only in the one-dimensional case. This result is based on the Hodge decomposition. Taghvaei and Mehta incorrectly assume that this quantity is 0.

Slide 28

W-AIG flows at the particle level
Proposition. Suppose that $X_t \sim \rho_t$ and $V_t = \nabla \Phi_t(X_t)$ are the position and the velocity of a particle at time $t$. Then the differential equation of the particle system corresponding to (W-AIG) writes
$$dX_t = V_t\, dt, \quad dV_t = -\alpha_t V_t\, dt - \nabla f(X_t)\, dt - \nabla \log \rho_t(X_t)\, dt.$$

Slide 29

Discrete-time algorithm for the W-AIG flow
For $i = 1, 2, \ldots, N$, the update rule follows
$$V^i_{k+1} = \alpha_k V^i_k - \sqrt{\tau}\left(\nabla f(X^i_k) + \xi_k(X^i_k)\right), \quad X^i_{k+1} = X^i_k + \sqrt{\tau}\, V^i_{k+1}.$$
– If $E(\rho)$ is Hess($\beta$), then $\alpha_k = \frac{1 - \sqrt{\beta\tau}}{1 + \sqrt{\beta\tau}}$; if $E(\rho)$ is Hess(0) or $\beta$ is unknown, then $\alpha_k = \frac{k-1}{k+2}$.
– Here $\xi_k(x)$ is an approximation of $\nabla \log \rho_t(x)$.
Two difficulties:
– the approximation of $\nabla \log \rho_t(x)$;
– the stiffness of AIG flows.
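A minimal sketch of this update on a hypothetical Gaussian target, so that $\nabla f$ is explicit; for illustration, $\xi_k$ is replaced by the Gaussian score computed from the empirical covariance (the slides estimate it by KDE in the non-Gaussian case).

```python
import numpy as np

# Minimal sketch of the discrete W-AIG update on a hypothetical 2D Gaussian target
# rho* ∝ exp(-f(x)), f(x) = 0.5 * x^T A x. For illustration, xi_k is the score of a
# Gaussian fit to the particles; the slides use a KDE estimate in general.
np.random.seed(0)
A = np.diag([1.0, 4.0])
grad_f = lambda X: X @ A

N, tau, beta = 200, 1e-2, 1.0
alpha = (1 - np.sqrt(beta * tau)) / (1 + np.sqrt(beta * tau))
X = np.random.randn(N, 2) + 3.0       # initial particles
V = np.zeros_like(X)

def xi(X):
    # Gaussian approximation of grad log rho_t(x) from the empirical mean and covariance
    Xc = X - X.mean(axis=0)
    Sigma = Xc.T @ Xc / len(X)
    return -Xc @ np.linalg.inv(Sigma)

for k in range(500):
    V = alpha * V - np.sqrt(tau) * (grad_f(X) + xi(X))
    X = X + np.sqrt(tau) * V
print(X.mean(axis=0), np.cov(X.T))    # mean near 0, covariance near inv(A)
```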

Slide 30

Review of the choices of ξk
If $X^i_k$ follows a Gaussian distribution, then $\xi_k(x) = -\Sigma_k^{-1}(x - m_k)$, where $m_k$ and $\Sigma_k$ are the mean and the covariance of $\{X^i_k\}_{i=1}^N$.
For the non-Gaussian case, we use kernel density estimation (KDE¹). Then $\xi_k$ writes
$$\xi_k(x) = \nabla \log \tilde{\rho}_k(x) = \frac{\sum_{i=1}^N \nabla_x K(x, X^i_k)}{\sum_{i=1}^N K(x, X^i_k)}.$$
A common choice of $K(x, y)$ is a Gaussian kernel with bandwidth $h$,
$$K_G(x, y; h) = (2\pi h)^{-n/2} \exp\left(-\|x - y\|^2 / (2h)\right).$$
¹ Radhey S. Singh. Improvement on some known nonparametric uniformly consistent estimators of derivatives of a density. The Annals of Statistics, pp. 394-399, 1977.
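A minimal sketch of the KDE score estimate above with the Gaussian kernel $K_G$; the bandwidth $h$ is a free parameter here.

```python
import numpy as np

# Minimal sketch of the KDE score estimate xi_k(x) = grad log rho_tilde(x)
# with the Gaussian kernel K_G(x, y; h); the bandwidth h is a free parameter.
def kde_score(query, particles, h):
    # query: (M, n), particles: (N, n)
    diff = query[:, None, :] - particles[None, :, :]          # (M, N, n)
    sqdist = np.sum(diff ** 2, axis=-1)                        # (M, N)
    K = np.exp(-sqdist / (2.0 * h))                            # kernel weights; the (2*pi*h)^(-n/2) factor cancels
    grad_K = -diff / h * K[:, :, None]                         # grad_x K(x, X^i) = -(x - X^i)/h * K
    return grad_K.sum(axis=1) / K.sum(axis=1, keepdims=True)   # (M, n)

# Usage: estimated score from samples of a standard Gaussian.
X = np.random.randn(500, 2)
print(kde_score(X[:5], X, h=0.5))   # close to -X[:5] / (1 + h): KDE smoothing shrinks the score
```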

Slide 31

Selection of the bandwidth
SVGD¹ uses a median (MED) method to choose the bandwidth, i.e.,
$$h_k = \frac{1}{2 \log(N+1)}\, \mathrm{median}\left\{\|X^i_k - X^j_k\|^2\right\}_{i,j=1}^N.$$
Liu et al.² propose a Heat Equation (HE) method to adaptively adjust the bandwidth. Motivated by the HE method, we introduce the Brownian motion (BM) method to adaptively learn the bandwidth.
¹ Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2378-2386, 2016.
² Chang Liu, Jingwei Zhuo, Pengyu Cheng, Ruiyi Zhang, Jun Zhu, and Lawrence Carin. Accelerated first-order methods on the Wasserstein space for Bayesian inference. arXiv preprint arXiv:1807.01750, 2018.
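A short sketch of the MED rule above:

```python
import numpy as np

# Minimal sketch of the MED bandwidth rule h = median(||X^i - X^j||^2) / (2 log(N + 1)).
def med_bandwidth(X):
    sqdist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.median(sqdist) / (2.0 * np.log(len(X) + 1))

X = np.random.randn(300, 2)
print(med_bandwidth(X))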

Slide 32

The BM method
Given the bandwidth $h$, particles $\{X^i_k\}_{i=1}^N$, and a step size $s$, we can compute two particle systems
$$Y^i_k(h) = X^i_k - s\, \xi_k(X^i_k; h), \quad Z^i_k = X^i_k + \sqrt{2s}\, B^i, \quad i = 1, \ldots, N,$$
where $B^i$ is a standard Brownian motion.
Minimize $\mathrm{MMD}(\hat{\rho}_Y, \hat{\rho}_Z)$, the maximum mean discrepancy between the empirical distributions $\hat{\rho}_Y(x) = \frac{1}{N}\sum_{i=1}^N \delta(x - Y^i_k(h))$ and $\hat{\rho}_Z(x) = \frac{1}{N}\sum_{i=1}^N \delta(x - Z^i_k)$.
Optimize over $h$, using the bandwidth $h_{k-1}$ from the previous iteration as the initialization.
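A minimal sketch of this bandwidth selection, reusing the kde_score helper from the KDE sketch above. The RBF kernel for the MMD and the local grid search over $h$ are hypothetical implementation choices; the slides only specify minimizing the MMD over $h$ starting from the previous bandwidth.

```python
import numpy as np

# Minimal sketch of the BM bandwidth selection: match the deterministic score step
# Y = X - s * xi(X; h) against a Brownian step Z = X + sqrt(2s) * noise, in MMD.
def mmd2(Y, Z, sigma=1.0):
    def k(a, b):
        d = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-d / (2.0 * sigma ** 2))
    return k(Y, Y).mean() + k(Z, Z).mean() - 2.0 * k(Y, Z).mean()

def bm_bandwidth(h_prev, X, s, kde_score):
    Z = X + np.sqrt(2.0 * s) * np.random.randn(*X.shape)
    candidates = h_prev * np.array([0.5, 0.75, 1.0, 1.5, 2.0])   # local grid around h_prev
    losses = [mmd2(X - s * kde_score(X, X, h), Z) for h in candidates]
    return candidates[int(np.argmin(losses))]

# Usage (with the kde_score sketch above): h = bm_bandwidth(0.5, X, 1e-2, kde_score)
```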

Slide 33

Adaptive restart
The second equation in (W-AIG) is a Hamilton-Jacobi equation, which is usually strongly stiff. We propose an adaptive restart technique to deal with this problem. Consider
$$\varphi_k = -\sum_{i=1}^N \left\langle V^i_{k+1},\ \nabla f(X^i_k) + \xi_k(X^i_k) \right\rangle,$$
which can be viewed as a discrete-time approximation of $-\partial_t E(\rho_t)$. If $\varphi_k < 0$, then we restart the algorithm with initial values $X^i_0 = X^i_k$ and $V^i_0 = 0$. This essentially keeps $\partial_t E(\rho_t)$ negative along the trajectory. The numerical results show that adaptive restart accelerates and stabilizes the discrete-time algorithm.
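A short sketch of the restart test inside the particle loop; the names follow the earlier update sketch and are hypothetical.

```python
import numpy as np

# Minimal sketch of the adaptive restart test: phi_k approximates -d/dt E(rho_t);
# if it turns negative, the momentum is reset. V_next, grad, score are the
# quantities from the update sketch above (hypothetical names).
def restart_needed(V_next, grad, score):
    phi = -np.sum(V_next * (grad + score))
    return phi < 0.0

# Inside the particle loop:
#   if restart_needed(V, grad_f(X), xi(X)):
#       V = np.zeros_like(V)   # reset momentum (and the iteration counter k)
```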

Slide 34

Overall algorithm
Algorithm 1: Particle implementation of the Wasserstein AIG flow
Require: initial positions $\{X^i_0\}_{i=1}^N$, step size $\tau$, number of iterations $L$.
1: Set $k = 0$, $V^i_0 = 0$, $i = 1, \ldots, N$. Set the bandwidth $h_0$ by MED.
2: for $l = 1, 2, \ldots, L$ do
3:   Compute $h_l$ based on BM: $h_l = \mathrm{BM}(h_{l-1}, \{X^i_k\}_{i=1}^N, \sqrt{\tau})$.
4:   Calculate $\xi_k(X^i_k)$ with the bandwidth $h_l$.
5:   Set $\alpha_k$ based on whether $E(\rho)$ is Hess($\beta$) or Hess(0). Update
     $$V^i_{k+1} = \alpha_k V^i_k - \sqrt{\tau}\left(\nabla f(X^i_k) + \xi_k(X^i_k)\right), \quad X^i_{k+1} = X^i_k + \sqrt{\tau}\, V^i_{k+1}.$$
6:   if RESTART then
7:     Compute $\varphi_k = -\sum_{i=1}^N \langle V^i_{k+1}, \nabla f(X^i_k) + \xi_k(X^i_k) \rangle$.
8:     If $\varphi_k < 0$, set $X^i_0 = X^i_k$, $V^i_0 = 0$ and $k = 0$; otherwise set $k = k + 1$.
9:   else
10:    Set $k = k + 1$.
11:  end if
12: end for
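A compact, self-contained sketch of Algorithm 1 on a hypothetical 2D Gaussian target. For brevity the bandwidth is re-selected by MED at every iteration instead of by the BM method, so this is an approximation of the algorithm on the slide, not its exact implementation.

```python
import numpy as np

# Compact sketch of Algorithm 1 on a hypothetical Gaussian target rho* ∝ exp(-f),
# f(x) = 0.5 * x^T A x. Bandwidth chosen by MED each iteration (BM is substituted
# by MED here for brevity), with the adaptive restart test included.
np.random.seed(0)
A = np.diag([1.0, 4.0])
grad_f = lambda X: X @ A

def med_bandwidth(X):
    sqd = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.median(sqd) / (2.0 * np.log(len(X) + 1))

def score(X, h):                                  # KDE estimate of grad log rho_t
    diff = X[:, None, :] - X[None, :, :]
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * h))
    grad_K = -diff / h * K[:, :, None]
    return grad_K.sum(axis=1) / K.sum(axis=1, keepdims=True)

def aig_particles(X0, tau=1e-2, L=1000, beta=None, restart=True):
    X, V, k = X0.copy(), np.zeros_like(X0), 0
    for _ in range(L):
        h = med_bandwidth(X)
        g = grad_f(X) + score(X, h)
        if beta is not None:                      # Hess(beta) case
            alpha = (1 - np.sqrt(beta * tau)) / (1 + np.sqrt(beta * tau))
        else:                                     # Hess(0) or beta unknown
            alpha = (k - 1) / (k + 2) if k >= 1 else 0.0
        V = alpha * V - np.sqrt(tau) * g
        X = X + np.sqrt(tau) * V
        if restart and -np.sum(V * g) < 0:
            V, k = np.zeros_like(V), 0            # adaptive restart
        else:
            k += 1
    return X

X = aig_particles(np.random.randn(400, 2) + 3.0, beta=1.0)
print(X.mean(axis=0), np.cov(X.T))   # roughly N(0, inv(A)); KDE smoothing introduces some bias
```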

Slide 35

Toy example
The target density $\rho^*$ is a toy bimodal distribution¹. We compare two types of particle implementations of the Wasserstein gradient flow of the KL divergence:
$$X^i_{k+1} = X^i_k - \tau \nabla f(X^i_k) + \sqrt{2\tau}\, B^i_k, \qquad X^i_{k+1} = X^i_k - \tau \nabla f(X^i_k) - \tau\, \xi_k(X^i_k).$$
Here $B^i_k \sim \mathcal{N}(0, I)$ is a standard Brownian increment and $\xi_k$ is computed by KDE. The first method is known as the MCMC (Langevin) method. The second method is called the ParVI method, whose bandwidth $h$ is selected by MED, HE, or BM respectively.
¹ Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
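A minimal sketch of the two update rules above on a hypothetical 1D bimodal target; the slides use the 2D toy density of Rezende and Mohamed, and a simple 1D Gaussian mixture is substituted here for brevity.

```python
import numpy as np

# Minimal sketch of the two particle updates (Langevin MCMC vs. ParVI) on a
# hypothetical 1D bimodal target (a mixture of two unit-variance Gaussians at ±2).
np.random.seed(0)

def neg_log_density(x):                      # f(x) = -log rho*(x)
    return -np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

def grad_f(x, eps=1e-4):                     # finite-difference gradient, for simplicity
    return (neg_log_density(x + eps) - neg_log_density(x - eps)) / (2 * eps)

def kde_score(x, h):
    diff = x[:, None] - x[None, :]
    K = np.exp(-diff ** 2 / (2 * h))
    return (-diff / h * K).sum(axis=1) / K.sum(axis=1)

tau, N = 0.05, 500
X_mcmc = np.random.randn(N)
X_parvi = np.random.randn(N)
for k in range(2000):
    X_mcmc = X_mcmc - tau * grad_f(X_mcmc) + np.sqrt(2 * tau) * np.random.randn(N)
    X_parvi = X_parvi - tau * grad_f(X_parvi) - tau * kde_score(X_parvi, h=0.3)
print(X_mcmc.mean(), X_parvi.mean())         # both particle clouds spread over the two modes near ±2
```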

Slide 36

Toy example
Figure: numerical results on the toy example.

Slide 37

Gaussian examples: ODE level
The initial value is set to $\Sigma_0 = I$. The following figure presents the convergence of the KL divergence on two target distributions with small/large $L$.
Figure: The acceleration of the W-AIG flow and the effect of adaptive restart (ODE level). The target density is a Gaussian distribution with zero mean on $\mathbb{R}^{100}$. Left: $L = 1$, $\kappa \approx 3.8 \times 10^3$. Right: $\beta = 1$, $\kappa \approx 4.0 \times 10^3$.

Slide 38

Gaussian examples: Particle level
The initial distribution of samples follows $\mathcal{N}(0, I)$ and the number of samples is $N = 600$. For a particle system $\{X^i_k\}_{i=1}^N$, we record the KL divergence $E(\hat{\Sigma}_k)$ using the empirical covariance matrix $\hat{\Sigma}_k$.
Figure: The acceleration of the W-AIG flow and the effect of adaptive restart (particle level). The setting of the target density is identical to the previous figure.

Slide 39

Bayesian logistic regression
We perform the standard Bayesian logistic regression experiment on the Covertype dataset, following the same settings as Liu et al.¹ We compare our methods with MCMC, SVGD, WNAG², and WNes³. We select the bandwidth using either the MED method or the proposed BM method; the BM method is not applied to SVGD.
¹ Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2378-2386, 2016.
² Chang Liu, Jingwei Zhuo, Pengyu Cheng, Ruiyi Zhang, Jun Zhu, and Lawrence Carin. Accelerated first-order methods on the Wasserstein space for Bayesian inference. arXiv preprint arXiv:1807.01750, 2018.
³ Chang Liu, Jingwei Zhuo, Pengyu Cheng, Ruiyi Zhang, and Jun Zhu. Understanding and accelerating particle-based variational inference. In International Conference on Machine Learning, pp. 4082-4092, 2019.

Slide 40

Bayesian logistic regression
Figure: Comparison of different methods on Bayesian logistic regression, averaged over 10 independent trials. The shaded areas show the variance over the 10 trials. Top: BM; bottom: MED. Left: test accuracy; right: test log-likelihood.

Slide 41

Future works
– Apply AIG flows to Hessian transport metrics.
– Investigate the application of our results in matrix optimization.
– Systematically explain the stiffness of AIG flows and the effect of adaptive restart.
– Apply our results to general information metrics, especially the Fisher-Rao metric.
– Study related sampling-efficient optimization methods and discrete-time algorithms on probability models.