
Learning via Wasserstein Information Geometry


Recently, optimal transport has found many applications in machine learning. In this talk, we introduce dynamical optimal transport on machine learning models. We propose to study these models as Riemannian manifolds equipped with a Wasserstein metric, which we call Wasserstein information geometry. Various developments, especially the Fokker-Planck equation and mean-field games on learning models, will be introduced. The entropy production of Shannon entropy in AI models will be established. Many numerical examples, including the restricted Boltzmann machine and the generative adversarial network, will be presented.

Wuchen Li

May 22, 2018



Transcript

  1. Information geometry Information geometry (a.k.a. the Fisher-Rao metric) plays important roles

    in information science, statistics and machine learning: population games via replicator dynamics (Shahshahani, Smith); reinforcement learning (Sutton and Barto; AlphaGo; Montufar); machine learning: natural gradient (Amari), ADAM (Kingma 2014), stochastic relaxation (Malago); and many more in the book Information Geometry (Ay et al.).
  2. Learning problems Given a data measure $\rho_{\mathrm{data}}(x) = \frac{1}{N}\sum_{i=1}^{N}\delta_{X_i}(x)$

    and a parameterized model $\rho(x,\theta)$, machine learning problems often take the form $\min_{\rho_\theta\in\rho(\Theta)} D(\rho_{\mathrm{data}},\rho_\theta)$. One typical choice of $D$ is the Kullback–Leibler divergence (relative entropy), $D(\rho_{\mathrm{data}},\rho_\theta)=\int_\Omega \rho_{\mathrm{data}}(x)\log\frac{\rho_{\mathrm{data}}(x)}{\rho(x,\theta)}\,dx$. A short sketch of this objective follows below.
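Since $\rho_{\mathrm{data}}$ is an empirical measure, this KL objective equals a $\theta$-independent constant minus the average log-likelihood, so minimizing it is maximum likelihood estimation. A minimal sketch, assuming a hypothetical one-dimensional Gaussian model $\rho(x,\theta)$ with $\theta=(\mu,\sigma)$:

```python
import numpy as np

def kl_objective(theta, data):
    """KL(rho_data, rho_theta) up to the (theta-independent) entropy of rho_data,
    i.e. the average negative log-likelihood of a Gaussian model."""
    mu, sigma = theta
    log_rho = -0.5 * ((data - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))
    return -np.mean(log_rho)
```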
  3. Natural gradient The natural gradient method refers to

    $\theta_{k+1} = \theta_k - h\, G_F(\theta)^{-1}\nabla_\theta D(\rho_{\mathrm{data}},\rho_\theta)$, where $h>0$ is a stepsize and $G_F(\theta) = \mathbb{E}_{\rho_\theta}\big[(\nabla_\theta\log\rho(X,\theta))^T(\nabla_\theta\log\rho(X,\theta))\big]$ is the Fisher-Rao metric tensor. Why natural gradient? It is parameterization invariant and provides a preconditioner for KL-divergence-based learning problems; a one-step sketch follows below.
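A minimal sketch of one natural gradient step, again assuming a hypothetical one-dimensional Gaussian model with $\theta=(\mu,\sigma)$; $G_F(\theta)$ is estimated by Monte Carlo to illustrate the general recipe (a closed form exists in this toy case):

```python
import numpy as np

def gaussian_score(x, mu, sigma):
    # Score function grad_theta log rho(x, theta) for theta = (mu, sigma).
    return np.stack([(x - mu) / sigma**2,
                     ((x - mu)**2 - sigma**2) / sigma**3], axis=-1)

def fisher_rao_metric(mu, sigma, n_samples=10_000, seed=0):
    # Monte Carlo estimate of G_F(theta) = E_{rho_theta}[score score^T].
    x = np.random.default_rng(seed).normal(mu, sigma, size=n_samples)
    s = gaussian_score(x, mu, sigma)            # (n_samples, 2)
    return s.T @ s / n_samples                  # (2, 2)

def natural_gradient_step(theta, euclidean_grad, h=0.1):
    # theta_{k+1} = theta_k - h G_F(theta_k)^{-1} grad_theta D(rho_data, rho_theta).
    G = fisher_rao_metric(*theta)
    return np.asarray(theta) - h * np.linalg.solve(G, euclidean_grad)
```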
  4. Optimal transport In recent years, optimal transport (a.k.a. Earth Mover's

    distance, Monge-Kantorovich problem, Wasserstein metric) has seen many applications: population games via Fokker-Planck equations (Degond et al. 2014, Li et al. 2016); mean field games (Lasry, Lions, Gangbo); machine learning: Wasserstein training of Boltzmann machines (Cuturi et al. 2015), Learning with a Wasserstein Loss (Frogner et al. 2015), Wasserstein GAN (Bottou et al. 2017), and many more in NIPS 2015, 2016, 2017.
  5. Why Optimal transport? Optimal transport provides a particular distance (W)

    among histograms which relies on the distance on the sample space (the ground cost $c$). E.g., let $X_0\sim\rho_0=\delta_{x_0}$ and $X_1\sim\rho_1=\delta_{x_1}$, and compare $W(\rho_0,\rho_1)=\inf_{\pi\in\Pi(\rho_0,\rho_1)}\mathbb{E}_\pi c(X_0,X_1)=c(x_0,x_1)$; versus $TV(\rho_0,\rho_1)=\int_\Omega|\rho_0(x)-\rho_1(x)|\,dx=2$; versus $KL(\rho_0\|\rho_1)=\int_\Omega\rho_0(x)\log\frac{\rho_0(x)}{\rho_1(x)}\,dx=\infty$. A numerical check follows below.
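A quick numerical check of this comparison, using SciPy's one-dimensional Wasserstein distance with ground cost $c(x_0,x_1)=|x_0-x_1|$ (the two point locations below are made up for illustration):

```python
from scipy.stats import wasserstein_distance

x0, x1 = 0.0, 3.0   # two Dirac masses delta_{x0} and delta_{x1}

# The Wasserstein distance sees the ground cost: W(delta_{x0}, delta_{x1}) = |x0 - x1|.
print(wasserstein_distance([x0], [x1]))   # -> 3.0

# TV and KL only compare densities pointwise: for any x0 != x1 the supports are
# disjoint, so TV = 2 and KL = +infinity, however near or far the two points are.
```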
  6. Wasserstein Loss function Given a data distribution $\rho_0$ and a probability

    model $\rho_1(\theta)$, consider $\min_{\theta\in\Theta} W(\rho_0,\rho_1(\theta))$. This is a double minimization problem, i.e. $W(\rho_0,\rho_1(\theta))=\min_{\pi\in\Pi(\rho_0,\rho_1(\theta))}\mathbb{E}_\pi c(X_0,X_1)$. Many applications, such as Wasserstein GAN and the Wasserstein loss, are built on the above formulation.
  7. Goals Main question: instead of looking only at the loss function,

    we propose the optimal-transport-induced gradient operator on probability models and study its properties for machine learning problems. Motivations: F. Otto, Information Geometry and its Applications (IGAIA), 2017; S. Amari, Geometric Science of Information (GSI), 2017. Related studies: constrained gradient: Carlen, Gangbo (2003); linear programming: Wong (2017), Amari, Karakida, Oizumi (2017); Gaussian measures: Takatsu (2011), Malago et al. (2017), Modin (2017), Yongxin et al. (2017), Sanctis (2017), and many more.
  8. Problem formulation Mapping formulation: Monge problem (1781): Monge-Ampère equation;

    Static formulation: Kantorovich problem (1940): linear programming; Dynamical formulation: density optimal control (Nelson, Lafferty, Gangbo, Otto, Villani, Chow, Zhou, Osher). In this talk, we apply density optimal control to learning problems.
  9. Density manifold Optimal transport has an optimal control reformulation via

    the dual of the dual of linear programming: $\inf_{\rho_t}\int_0^1 g_W(\partial_t\rho_t,\partial_t\rho_t)\,dt=\int_0^1\int_\Omega(\nabla\Phi_t,\nabla\Phi_t)\rho_t\,dx\,dt$, under the dynamical constraint, i.e. the continuity equation $\partial_t\rho_t+\nabla\cdot(\rho_t\nabla\Phi_t)=0$, $\rho_0=\rho^0$, $\rho_1=\rho^1$. Here $(\mathcal{P}(\Omega), g_W)$ forms an infinite-dimensional Riemannian manifold. (Footnote: John D. Lafferty, The density manifold and configuration space quantization, 1988.)
  10. Density submanifold Consider a Riemannian metric $g_\theta$ on $\Theta$ as

    the pull-back of $g_W$ on $\mathcal{P}(\Omega)$. In other words, for $\xi,\eta\in T_\theta\Theta$, $g_\theta(\xi,\eta)=g_W(d_\theta\rho(\xi),d_\theta\rho(\eta))$, where $d_\theta\rho(\cdot,\xi)=\langle\nabla_\theta\rho(\cdot,\theta),\xi\rangle$ and $d_\theta\rho(\cdot,\eta)=\langle\nabla_\theta\rho(\cdot,\theta),\eta\rangle$, with $\langle\cdot,\cdot\rangle$ the Euclidean inner product in $\mathbb{R}^d$.
  11. Wasserstein statistical manifold Denote $g_\theta(\xi,\eta)=\xi^T G_W(\theta)\eta$,

    where $G_W(\theta)\in\mathbb{R}^{d\times d}$. Definition (Wasserstein metric tensor). Define the Wasserstein metric tensor $G_W(\theta)$ on $T_\theta\Theta$ by $G_W(\theta)_{ij}=\int_\Omega\rho(x,\theta)\,\nabla_x\Phi_i(x)\cdot\nabla_x\Phi_j(x)\,dx$, where $\frac{\partial}{\partial\theta_i}\rho(x,\theta)=-\nabla_x\cdot(\rho(x,\theta)\nabla_x\Phi_i(x))$. This inner product $g_\theta$ is consistent with the restriction of the Wasserstein metric $g_W$ to $\rho(\Theta)$. For this reason, we call $(\rho(\Theta),g_\theta)$ the Wasserstein statistical manifold.
  12. Wasserstein natural gradient Given a loss function

    $F:\mathcal{P}(\Omega)\to\mathbb{R}$ and a probability model $\rho(\cdot,\theta)$, the associated gradient flow on the Riemannian manifold is defined by $\frac{d\theta}{dt}=-\nabla_g F(\rho(\cdot,\theta))$. Here $\nabla_g$ is the Riemannian gradient operator satisfying $g_\theta(\nabla_g F(\rho(\cdot,\theta)),\xi)=d_\theta F(\rho(\cdot,\theta))\cdot\xi$ for any tangent vector $\xi\in T_\theta\Theta$, where $d_\theta$ denotes the differential operator (Euclidean gradient).
  13. Wasserstein natural gradient The gradient flow of the loss function

    $F(\rho(\cdot,\theta))$ in $(\Theta,g_\theta)$ satisfies $\frac{d\theta}{dt}=-G_W(\theta)^{-1}\nabla_\theta F(\rho(\cdot,\theta))$. If $\rho(\cdot,\theta)=\rho(x)$ is the identity map, then we recover the standard Wasserstein gradient flow $\partial_t\rho_t=\nabla\cdot\big(\rho_t\nabla\frac{\delta}{\delta\rho}F(\rho_t)\big)$.
  14. Example I: Hierarchical log-linear models For an inclusion-closed set

    $S$ of subsets of $\{1,\ldots,n\}$, the hierarchical model $\mathcal{E}_S$ for $n$ binary variables is the set of distributions of the form $p_x(\theta)=\frac{1}{Z(\theta)}\exp\big(\sum_{\lambda\in S}\theta_\lambda\phi_\lambda(x)\big)$, $x\in\{0,1\}^n$, for all possible choices of parameters $\theta_\lambda\in\mathbb{R}$, $\lambda\in S$. Here the $\phi_\lambda$ are real-valued functions with $\phi_\lambda(x)=\phi_\lambda(y)$ whenever $x_i=y_i$ for all $i\in\lambda$. A small construction sketch follows below.
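A minimal construction sketch, assuming the common (but here hypothetical) choice of monomial statistics $\phi_\lambda(x)=\prod_{i\in\lambda}x_i$, which indeed depends only on the coordinates in $\lambda$:

```python
import itertools
import numpy as np

def hierarchical_log_linear(theta, S, n):
    """Tabulate p_x(theta) = exp(sum_{lambda in S} theta_lambda phi_lambda(x)) / Z(theta)
    over all x in {0,1}^n, with phi_lambda(x) = prod_{i in lambda} x_i (assumed)."""
    states = np.array(list(itertools.product([0, 1], repeat=n)))
    energy = np.zeros(len(states))
    for lam, th in zip(S, theta):
        energy += th * states[:, list(lam)].prod(axis=1)
    p = np.exp(energy)
    return states, p / p.sum()          # division by p.sum() is the 1/Z(theta)

# Example: a pairwise, inclusion-closed model on n = 3 binary variables.
S = [(0,), (1,), (2,), (0, 1), (1, 2)]
states, p = hierarchical_log_linear(np.zeros(len(S)), S, n=3)   # uniform at theta = 0
```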
  15. Maximum likelihood via Wasserstein gradient descent Consider

    $\theta_{k+1}=\theta_k-h\,G(\theta_k)^{-1}\nabla\,\mathrm{KL}(q\,\|\,p_{\theta_k})$, where $G(\theta)$ is one of: $I$ (Euclidean metric tensor), $G_F(\theta)$ (Fisher-Rao metric tensor), or $G_W(\theta)$ (Wasserstein metric tensor); a one-step sketch follows below. Figure: Compared results for the Wasserstein natural gradient, the Fisher-Rao natural gradient and the Euclidean gradient.
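A one-step sketch of this iteration; `grad_kl` is the Euclidean gradient of $\mathrm{KL}(q\|p_\theta)$, and `G_F` / `G_W` stand for model-specific metric tensors supplied by the caller (e.g. routines like the sketches elsewhere in this transcript):

```python
import numpy as np

def preconditioned_step(theta, grad_kl, metric="wasserstein", h=0.05,
                        G_F=None, G_W=None):
    """One update theta_{k+1} = theta_k - h G(theta_k)^{-1} grad KL(q || p_theta),
    with G chosen as the identity, the Fisher-Rao tensor, or the Wasserstein tensor."""
    if metric == "euclidean":
        return theta - h * grad_kl
    G = G_F if metric == "fisher" else G_W
    return theta - h * np.linalg.solve(G, grad_kl)
```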
  16. Example II: One-dimensional case In a one-dimensional sample space,

    the Wasserstein metric tensor has a non-trivial explicit form: $G_W(\theta)=\int_{\mathbb{R}}\frac{1}{\rho(x,\theta)}(\nabla_\theta F(x,\theta))^T\nabla_\theta F(x,\theta)\,dx$, where $F(x,\theta)=\int_{-\infty}^x\rho(y,\theta)\,dy$ is the cumulative distribution function. Compare it with the Fisher-Rao metric tensor $G_F(\theta)=\int_{\mathbb{R}}\frac{1}{\rho(x,\theta)}(\nabla_\theta\rho(x,\theta))^T\nabla_\theta\rho(x,\theta)\,dx$. A quadrature sketch of both tensors follows below.
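A numerical sketch of both tensors on a uniform grid, assuming a user-supplied density callable `density(x, theta)` (hypothetical name); the CDF is accumulated with a cumulative sum and $\theta$-gradients use central finite differences:

```python
import numpy as np

def metric_tensors_1d(density, theta, x_grid, eps=1e-5):
    """Approximate G_W(theta) = int (grad_theta F)^T (grad_theta F) / rho dx and
    G_F(theta) = int (grad_theta rho)^T (grad_theta rho) / rho dx on a uniform grid,
    where F(x, theta) is the CDF of rho(., theta)."""
    theta = np.asarray(theta, dtype=float)
    d, dx = len(theta), x_grid[1] - x_grid[0]
    rho = density(x_grid, theta)
    cdf = lambda x, th: np.cumsum(density(x, th)) * dx      # F(x, theta) ~ sum rho dx

    def grad_theta(f):
        cols = []
        for i in range(d):
            e = np.zeros(d); e[i] = eps
            cols.append((f(x_grid, theta + e) - f(x_grid, theta - e)) / (2 * eps))
        return np.stack(cols, axis=1)                        # (len(x_grid), d)

    dF, drho = grad_theta(cdf), grad_theta(density)
    G_W = (dF.T * (1.0 / rho)) @ dF * dx
    G_F = (drho.T * (1.0 / rho)) @ drho * dx
    return G_W, G_F
```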
  17. Mixture model Consider the mixture model $a\,N(\mu_1,\sigma_1^2)$

    $+\,(1-a)\,N(\mu_2,\sigma_2^2)$ with density $\rho(x,\theta)=\frac{a}{\sigma_1\sqrt{2\pi}}e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}}+\frac{1-a}{\sigma_2\sqrt{2\pi}}e^{-\frac{(x-\mu_2)^2}{2\sigma_2^2}}$, where $\theta=(a,\mu_1,\sigma_1^2,\mu_2,\sigma_2^2)$ and $a\in[0,1]$. Figure: Geodesics between Gaussian mixtures; left: in the submanifold; right: in the whole space (axes: position x, time t, density ρ).
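A density callable matching this model (parameterized here with standard deviations rather than variances, purely for convenience), which can be plugged directly into the `metric_tensors_1d` sketch above:

```python
import numpy as np

def mixture_density(x, theta):
    """rho(x, theta) for theta = (a, mu1, sigma1, mu2, sigma2)."""
    a, mu1, s1, mu2, s2 = theta
    gauss = lambda m, s: np.exp(-(x - m)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
    return a * gauss(mu1, s1) + (1 - a) * gauss(mu2, s2)

# Example usage with the quadrature sketch above:
x_grid = np.linspace(-15, 15, 3001)
# G_W, G_F = metric_tensors_1d(mixture_density, [0.4, -3.0, 1.0, 3.0, 1.5], x_grid)
```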
  18. Wasserstein fitting problems Consider the fitting problem

    $\min_\theta W\big(\rho(\cdot,\theta),\frac{1}{N}\sum_{i=1}^N\delta_{x_i}\big)$. We run the following iterative algorithms to solve the optimization problem (a plain-GD sketch follows below): Gradient descent (GD): $\theta_{n+1}=\theta_n-h\,\nabla_\theta(\frac{1}{2}W^2)|_{\theta_n}$; Wasserstein GD: $\theta_{n+1}=\theta_n-h\,G_W(\theta_n)^{-1}\nabla_\theta(\frac{1}{2}W^2)|_{\theta_n}$; Fisher-Rao GD: $\theta_{n+1}=\theta_n-h\,G_F(\theta_n)^{-1}\nabla_\theta(\frac{1}{2}W^2)|_{\theta_n}$.
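A rough sketch of the plain-GD variant in one dimension, assuming a hypothetical sampler `sample_model(theta, n, rng)` for $\rho(\cdot,\theta)$. The squared 1-D $W_2$ between equal-size empirical measures is computed by matching sorted samples, and the gradient uses central finite differences with common random numbers to tame sampling noise; preconditioning the gradient with $G_W(\theta_n)^{-1}$ or $G_F(\theta_n)^{-1}$ gives the other two variants.

```python
import numpy as np

def w2_squared(a, b):
    # Squared 2-Wasserstein distance between equal-size 1-D empirical measures:
    # the optimal coupling in 1-D matches sorted samples.
    return np.mean((np.sort(a) - np.sort(b)) ** 2)

def fit_wasserstein_gd(sample_model, theta, data, h=0.1, iters=200, eps=1e-3):
    theta = np.asarray(theta, dtype=float)
    for k in range(iters):
        grad = np.zeros_like(theta)
        for i in range(len(theta)):
            e = np.zeros_like(theta); e[i] = eps
            rng_p, rng_m = np.random.default_rng(k), np.random.default_rng(k)  # common random numbers
            fp = 0.5 * w2_squared(sample_model(theta + e, len(data), rng_p), data)
            fm = 0.5 * w2_squared(sample_model(theta - e, len(data), rng_m), data)
            grad[i] = (fp - fm) / (2 * eps)
        theta = theta - h * grad          # Wasserstein / Fisher-Rao GD: precondition here
    return theta
```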
  19. Preconditioner

    Figure: Objective value vs. number of iterations for Gaussian mixture distribution fitting with the Wasserstein model, comparing GD, GD with diagonal preconditioning, Wasserstein GD, hybrid Wasserstein GD and Fisher-Rao GD.
  20. Example III: Generative Adversarial Networks For each parameter $\theta\in\mathbb{R}^d$

    and a given neural-network-parameterized mapping function $g_\theta$, consider the pushforward measure $\rho_\theta=g_\theta\#p(z)$. A sampling sketch follows below.
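A minimal sampling sketch of the pushforward, assuming any `torch.nn.Module` generator mapping latent noise to samples (the names and the standard-Gaussian latent are illustrative):

```python
import torch

def sample_pushforward(generator, n, latent_dim):
    # Draw z ~ p(z) (taken here to be standard Gaussian) and push it forward
    # through the generator: outputs are samples from rho_theta = g_theta # p(z).
    z = torch.randn(n, latent_dim)
    return generator(z)
```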
  21. Wasserstein natural proximal The update scheme is

    $\theta_{k+1}=\arg\min_{\theta\in\Theta}F(\rho_\theta)+\frac{1}{2h}d_W(\theta,\theta_k)^2$, where $\theta$ denotes the parameters of the generator, $F(\rho_\theta)$ is the loss function, and $d_W$ is the Wasserstein metric. In practice, we approximate the Wasserstein metric to obtain the update $\theta_{k+1}=\arg\min_{\theta\in\Theta}F(\rho_\theta)+\frac{1}{B}\sum_{i=1}^B\frac{1}{2h}\|g_\theta(z_i)-g_{\theta_k}(z_i)\|^2$, where $g_\theta$ is the generator, $B$ is the batch size, and the $z_i$ are inputs to the generator. A training-loss sketch follows below.
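A training-loss sketch of this relaxed proximal step in PyTorch, assuming `loss_F` is whatever generator loss the GAN uses and `prev_generator` is a frozen copy of the generator holding $\theta_k$ (all names are illustrative, not the paper's code):

```python
import copy
import torch

def proximal_generator_loss(generator, prev_generator, z_batch, loss_F, h=0.1):
    """F(rho_theta) + (1/B) sum_i ||g_theta(z_i) - g_{theta_k}(z_i)||^2 / (2h)."""
    fake = generator(z_batch)
    with torch.no_grad():                       # theta_k is held fixed
        fake_prev = prev_generator(z_batch)
    prox = ((fake - fake_prev) ** 2).flatten(1).sum(dim=1).mean() / (2.0 * h)
    return loss_F(fake) + prox

# Typical use: snapshot prev_generator = copy.deepcopy(generator) before each
# generator update, then backpropagate through the loss above.
```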
  22. Examples Figure: The relaxed Wasserstein proximal of GANs on the

    CIFAR-10 (left) and CelebA (right) datasets.
  23. Discussions In this talk, we demonstrate the possibility of applying

    dynamical optimal transport to probability models. Many interesting questions await, including but not limited to: natural proximal learning; scientific computing via Wasserstein information geometry; and more.
  24. Main references Wuchen Li. Geometry of probability simplex via optimal

    transport, 2018. Wuchen Li and Guido Montufar. Natural gradient via optimal transport, Information Geometry, 2018. Yifan Chen and Wuchen Li. Natural gradient in Wasserstein statistical manifold, 2018. Alex Lin, Wuchen Li, Stanley Osher and Guido Montufar. Wasserstein proximal of GANs, 2018.
  25. Conferences ICIAM 2019 on Optimal Transport for Nonlinear Problems; Geometric

    Science of Information 2019, special session on Wasserstein Information Geometry; IPAM: Mean Field Games, 2020.