Slide 1

Learning via Wasserstein Information Geometry
Wuchen Li
Applied Mathematics and Statistics Youth Forum, 2018

Slide 2

History of distance

Slide 3

Information geometry
Information geometry (a.k.a. the Fisher-Rao metric) plays important roles in information science, statistics and machine learning: population games via replicator dynamics (Shahshahani, Smith); reinforcement learning (Sutton and Barto; AlphaGo; Montufar); machine learning: natural gradient (Amari), ADAM (Kingma 2014), stochastic relaxation (Malago); and many more in the book Information Geometry (Ay et al.).

Slide 4

Learning problems
Given a data measure $\rho_{\mathrm{data}}(x) = \frac{1}{N}\sum_{i=1}^{N}\delta_{X_i}(x)$ and a parameterized model $\rho(x,\theta)$, machine learning problems often refer to
$$\min_{\rho_\theta\in\rho(\Theta)} D(\rho_{\mathrm{data}},\rho_\theta).$$
One typical choice of $D$ is the Kullback–Leibler divergence (relative entropy)
$$D(\rho_{\mathrm{data}},\rho_\theta) = \int_\Omega \rho_{\mathrm{data}}(x)\log\frac{\rho_{\mathrm{data}}(x)}{\rho(x,\theta)}\,dx.$$
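For an empirical data measure, minimizing this KL divergence over θ is the same as maximizing the log-likelihood, since the entropy of ρ_data does not depend on θ. A minimal sketch of that objective (not from the slides), assuming a hypothetical one-dimensional Gaussian model ρ(x, θ) with θ = (μ, σ):

```python
import numpy as np

def neg_log_likelihood(theta, data):
    """Up to a theta-independent constant, KL(rho_data || rho_theta) equals this
    negative log-likelihood, so both objectives share the same minimizers."""
    mu, sigma = theta
    log_rho = -0.5 * ((data - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
    return -np.mean(log_rho)
```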

Slide 5

Natural gradient
The natural gradient method refers to
$$\theta^{k+1} = \theta^{k} - h\,G_F(\theta)^{-1}\nabla_\theta D(\rho_{\mathrm{data}},\rho_\theta),$$
where $h>0$ is a step size and
$$G_F(\theta) = \mathbb{E}_{\rho_\theta}\big[(\nabla_\theta\log\rho(X,\theta))^{T}(\nabla_\theta\log\rho(X,\theta))\big]$$
is the Fisher-Rao metric tensor.
Why natural gradient? It is parameterization invariant and serves as a preconditioner for KL-divergence-related learning problems.
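A minimal sketch of one natural gradient step (not from the slides), assuming the Fisher-Rao tensor is estimated from Monte Carlo samples of the score function; the function name and the small regularizer are illustrative choices:

```python
import numpy as np

def natural_gradient_step(theta, grad_loss, score_samples, h=0.1, eps=1e-6):
    """One Fisher-Rao natural gradient step (sketch).

    score_samples: shape (M, d), rows are grad_theta log rho(X_m, theta) for X_m ~ rho_theta.
    eps regularizes the empirical Fisher matrix before solving."""
    G_F = score_samples.T @ score_samples / score_samples.shape[0]
    G_F += eps * np.eye(theta.size)
    return theta - h * np.linalg.solve(G_F, grad_loss)
```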

Slide 6

Optimal transport
In recent years, optimal transport (a.k.a. the Earth mover's distance, the Monge-Kantorovich problem, or the Wasserstein metric) has seen many applications: population games via Fokker-Planck equations (Degond et al. 2014, Li et al. 2016); mean field games (Lasry, Lions, Gangbo); machine learning: Wasserstein training of Boltzmann machines (Cuturi et al. 2015), learning with a Wasserstein loss (Frogner et al. 2015), Wasserstein GAN (Bottou et al. 2017); and many more in NIPS 2015, 2016, 2017.

Slide 7

Why optimal transport?
Optimal transport provides a particular distance $W$ among histograms, which relies on the distance on the sample space (ground cost $c$). E.g., denote $X_0\sim\rho_0=\delta_{x_0}$, $X_1\sim\rho_1=\delta_{x_1}$. Compare
$$W(\rho_0,\rho_1) = \inf_{\pi\in\Pi(\rho_0,\rho_1)}\mathbb{E}_{\pi}\,c(X_0,X_1) = c(x_0,x_1);$$
versus
$$\mathrm{TV}(\rho_0,\rho_1) = \int_\Omega|\rho_0(x)-\rho_1(x)|\,dx = 2;$$
versus
$$\mathrm{KL}(\rho_0\,\|\,\rho_1) = \int_\Omega\rho_0(x)\log\frac{\rho_0(x)}{\rho_1(x)}\,dx = \infty.$$
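A quick numerical illustration of this contrast (my addition, not on the slide), using SciPy's one-dimensional Wasserstein distance between samples; the point locations are arbitrary:

```python
from scipy.stats import wasserstein_distance

x0, x1 = 0.0, 3.0
# W1 between two point masses equals the ground distance |x0 - x1| and shrinks as x1 -> x0,
w = wasserstein_distance([x0], [x1])   # 3.0
# whereas TV(delta_x0, delta_x1) stays at 2 and KL(delta_x0 || delta_x1) stays infinite
# for every x1 != x0: only W reflects the geometry of the sample space.
print(w)
```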

Slide 8

Wasserstein loss function
Given a data distribution $\rho_0$ and a probability model $\rho_1(\theta)$, consider
$$\min_{\theta\in\Theta} W(\rho_0,\rho_1(\theta)).$$
This is a double minimization problem, i.e.
$$W(\rho_0,\rho_1(\theta)) = \min_{\pi\in\Pi(\rho_0,\rho_1(\theta))}\mathbb{E}_{\pi}\,c(X_0,X_1).$$
Many applications, such as Wasserstein GAN and the Wasserstein loss, are built on the above formulation.

Slide 9

Goals
Main question: instead of looking only at the loss function, we propose the optimal-transport-induced gradient operator in probability models and study its properties for machine learning problems.
Motivations: F. Otto, Information Geometry and its Applications (IGAIA), 2017; S. Amari, Geometric Science of Information (GSI), 2017.
Related studies: constrained gradient: Carlen, Gangbo (2003); linear programming: Wong (2017), Amari, Karakida, Oizumi (2017); Gaussian measures: Takatsu (2011), Malago et al. (2017), Modin (2017), Yongxin et al. (2017), Sanctis (2017); and many more.

Slide 10

Problem formulation
Mapping formulation: Monge problem (1781): Monge-Ampère equation;
Static formulation: Kantorovich problem (1940): linear programming;
Dynamical formulation: density optimal control (Nelson, Lafferty, Gangbo, Otto, Villani, Chow, Zhou, Osher).
In this talk, we apply density optimal control to learning problems.

Slide 11

Dynamical optimal transport

Slide 12

Density manifold
Optimal transport has an optimal control reformulation, obtained as the dual of the dual of the linear programming problem:
$$\inf_{\rho_t}\int_0^1 g_W(\partial_t\rho_t,\partial_t\rho_t)\,dt = \int_0^1\int_\Omega(\nabla\Phi_t,\nabla\Phi_t)\,\rho_t\,dx\,dt,$$
under the dynamical constraint, i.e. the continuity equation:
$$\partial_t\rho_t + \nabla\cdot(\rho_t\nabla\Phi_t) = 0,\quad \rho_0 = \rho^0,\quad \rho_1 = \rho^1.$$
Here $(\mathcal{P}(\Omega), g_W)$ forms an infinite-dimensional Riemannian manifold.¹
¹John D. Lafferty: The density manifold and configuration space quantization, 1988.

Slide 13

Density submanifold
Consider a Riemannian metric $g_\theta$ on $\Theta$ as the pull-back of $g_W$ on $\mathcal{P}(\Omega)$. In other words, denote $\xi,\eta\in T_\theta\Theta$; then
$$g_\theta(\xi,\eta) = g_W(d_\theta\rho(\xi), d_\theta\rho(\eta)),$$
where $d_\theta\rho(\cdot,\xi) = \langle\nabla_\theta\rho(\cdot,\theta),\xi\rangle$ and $d_\theta\rho(\cdot,\eta) = \langle\nabla_\theta\rho(\cdot,\theta),\eta\rangle$, with $\langle\cdot,\cdot\rangle$ the Euclidean inner product in $\mathbb{R}^d$.

Slide 14

Wasserstein statistical manifold
Denote $g_\theta(\xi,\eta) = \xi^{T} G_W(\theta)\eta$, where $G_W(\theta)\in\mathbb{R}^{d\times d}$.
Definition (Wasserstein metric tensor). Define the Wasserstein metric tensor $G_W(\theta)$ on $T_\theta\Theta$ by
$$G_W(\theta)_{ij} = \int_\Omega \rho(x,\theta)\,\nabla_x\Phi_i(x)\cdot\nabla_x\Phi_j(x)\,dx,$$
where $\frac{\partial}{\partial\theta_i}\rho(x,\theta) = -\nabla_x\cdot(\rho(x,\theta)\nabla_x\Phi_i(x))$.
This inner product $g_\theta$ is consistent with the restriction of the Wasserstein metric $g_W$ to $\rho(\Theta)$. For this reason, we call $(\rho(\Theta), g_\theta)$ the Wasserstein statistical manifold.
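To make the definition concrete, here is a minimal finite-difference sketch of assembling G_W(θ) on a one-dimensional grid; the helper name, the discretization, and the pseudoinverse of the weighted Laplacian are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def wasserstein_metric_1d(rho, drho_dtheta, dx):
    """Discrete Wasserstein metric tensor G_W(theta) on a uniform 1D grid (sketch).

    rho:          shape (n,), model density rho(x_k, theta) at the grid nodes.
    drho_dtheta:  shape (n, d), columns d rho / d theta_i at the nodes (each sums to ~0).
    Solves -d/dx(rho dPhi_i/dx) = d rho/d theta_i weakly via a weighted graph Laplacian,
    then returns G_W[i, j] = sum over edges of rho_edge * dPhi_i * dPhi_j * dx."""
    n, d = drho_dtheta.shape
    D = (np.eye(n, k=1) - np.eye(n))[:-1] / dx        # node-to-edge difference operator
    rho_edge = 0.5 * (rho[:-1] + rho[1:])             # density averaged onto edges
    L = D.T @ np.diag(rho_edge) @ D * dx              # weighted Laplacian -d/dx(rho d/dx)
    B = drho_dtheta * dx                              # discrete measures d rho / d theta_i
    return B.T @ np.linalg.pinv(L) @ B                # G_W(theta), shape (d, d)
```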

Slide 15

Wasserstein natural gradient
Given a loss function $F:\mathcal{P}(\Omega)\to\mathbb{R}$ and a probability model $\rho(\cdot,\theta)$, the associated gradient flow on the Riemannian manifold $(\Theta, g_\theta)$ is defined by
$$\frac{d\theta}{dt} = -\nabla_g F(\rho(\cdot,\theta)).$$
Here $\nabla_g$ is the Riemannian gradient operator satisfying
$$g_\theta(\nabla_g F(\rho(\cdot,\theta)),\xi) = d_\theta F(\rho(\cdot,\theta))\cdot\xi$$
for any tangent vector $\xi\in T_\theta\Theta$, where $d_\theta$ represents the differential operator (Euclidean gradient).

Slide 16

Wasserstein natural gradient
The gradient flow of the loss function $F(\rho(\cdot,\theta))$ in $(\Theta, g_\theta)$ satisfies
$$\frac{d\theta}{dt} = -G_W(\theta)^{-1}\nabla_\theta F(\rho(\cdot,\theta)).$$
If $\rho(\cdot,\theta) = \rho(x)$ is an identity map, then we recover the standard Wasserstein gradient flow:
$$\partial_t\rho_t = \nabla\cdot\Big(\rho_t\nabla\frac{\delta}{\delta\rho}F(\rho_t)\Big).$$
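For intuition, a small forward-Euler sketch of the unparameterized flow ∂_t ρ = ∇·(ρ∇ δF/δρ) on a one-dimensional grid (my addition); taking F(ρ) = ∫ ρ log ρ as a test case, whose first variation is log ρ + 1, the flow reduces to the heat equation:

```python
import numpy as np

def wasserstein_flow_step(rho, dx, dt, dF_drho):
    """One explicit Euler step of d rho/dt = d/dx( rho * d/dx dF/drho ) in 1D (sketch)."""
    phi = dF_drho(rho)                                        # first variation at the nodes
    flux = 0.5 * (rho[:-1] + rho[1:]) * np.diff(phi) / dx     # rho * grad(dF/drho) on edges
    div = np.diff(np.concatenate(([0.0], flux, [0.0]))) / dx  # divergence, no-flux boundary
    return rho + dt * div

x = np.linspace(-3, 3, 61)
rho = np.exp(-x ** 2); rho /= rho.sum() * 0.1                 # normalized density, dx = 0.1
rho_next = wasserstein_flow_step(rho, dx=0.1, dt=1e-3,
                                 dF_drho=lambda r: np.log(r + 1e-12) + 1)
```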

Slide 17

Why Wasserstein gradient?

Slide 18

Example I: Hierarchical log-linear models
For an inclusion-closed set $S$ of subsets of $\{1,\dots,n\}$, the hierarchical model $\mathcal{E}_S$ for $n$ binary variables is the set of distributions of the form
$$p_x(\theta) = \frac{1}{Z(\theta)}\exp\Big(\sum_{\lambda\in S}\theta_\lambda\phi_\lambda(x)\Big),\quad x\in\{0,1\}^n,$$
for all possible choices of parameters $\theta_\lambda\in\mathbb{R}$, $\lambda\in S$. Here the $\phi_\lambda$ are real-valued functions with $\phi_\lambda(x) = \phi_\lambda(y)$ whenever $x_i = y_i$ for all $i\in\lambda$.
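A tiny sketch of evaluating such a model (not from the slides); the feature set below, singletons plus one pairwise interaction on n = 3 binary variables, is a hypothetical choice of S used only for illustration:

```python
import numpy as np
from itertools import product

def log_linear_probs(theta, features, n=3):
    """p_x(theta) proportional to exp(sum_lambda theta_lambda * phi_lambda(x)), x in {0,1}^n (sketch)."""
    states = list(product([0, 1], repeat=n))
    energies = np.array([theta @ features(x) for x in states])
    w = np.exp(energies - energies.max())        # numerically stabilized exponentials
    return w / w.sum()                           # division by Z(theta)

# Hypothetical features: x1, x2, x3 and the interaction x1*x2.
feats = lambda x: np.array([x[0], x[1], x[2], x[0] * x[1]], dtype=float)
p = log_linear_probs(np.array([0.5, -0.2, 0.1, 1.0]), feats)
```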

Slide 19

Maximal likelihood via Wasserstein gradient descent
Consider
$$\theta^{k+1} = \theta^{k} - h\,G(\theta^{k})^{-1}\nabla_\theta\,\mathrm{KL}(q\,\|\,p_{\theta^{k}}),$$
where
$$G(\theta) = \begin{cases} I & \text{Euclidean metric tensor;}\\ G_F(\theta) & \text{Fisher-Rao metric tensor;}\\ G_W(\theta) & \text{Wasserstein metric tensor.}\end{cases}$$
Figure: Compared results for the Wasserstein natural gradient, the Fisher-Rao natural gradient, and the Euclidean gradient.
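A minimal sketch of the update with a selectable preconditioner (my own scaffolding; G_F and G_W would be assembled as in the earlier sketches):

```python
import numpy as np

def preconditioned_step(theta, grad_kl, metric, h=0.1, G_F=None, G_W=None):
    """One step theta <- theta - h * G(theta)^{-1} grad KL(q || p_theta) (sketch)."""
    if metric == "euclidean":
        return theta - h * grad_kl
    G = G_F if metric == "fisher" else G_W
    return theta - h * np.linalg.solve(G, grad_kl)
```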

Slide 20

Example II: One-dimensional case
In a one-dimensional sample space, the Wasserstein metric tensor has a non-trivial explicit form:
$$G_W(\theta) = \int_{\mathbb{R}} \frac{1}{\rho(x,\theta)}\,(\nabla_\theta F(x,\theta))^{T}\,\nabla_\theta F(x,\theta)\,dx,$$
where $F(x,\theta) = \int_{-\infty}^{x}\rho(y,\theta)\,dy$ is the cumulative distribution function. Compare it with the Fisher-Rao metric tensor:
$$G_F(\theta) = \int_{\mathbb{R}} \frac{1}{\rho(x,\theta)}\,(\nabla_\theta\rho(x,\theta))^{T}\,\nabla_\theta\rho(x,\theta)\,dx.$$
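A minimal sketch of this closed form on a uniform grid (not from the slides); the gradients of the CDF with respect to θ are assumed to be supplied, e.g. by accumulating ∂ρ/∂θ:

```python
import numpy as np

def wasserstein_metric_1d_cdf(rho, dF_dtheta, dx, eps=1e-12):
    """G_W(theta) = integral of (1/rho) (grad_theta F)^T (grad_theta F) dx in 1D (sketch).

    rho:        shape (n,), density on a uniform grid with spacing dx.
    dF_dtheta:  shape (n, d), grad_theta of the CDF at the grid points,
                e.g. np.cumsum(drho_dtheta, axis=0) * dx."""
    weights = dx / (rho + eps)                    # 1/rho weighted by the grid spacing
    return dF_dtheta.T @ (weights[:, None] * dF_dtheta)
```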

Slide 21

Mixture model
Consider the mixture model $a\,N(\mu_1,\sigma_1) + (1-a)\,N(\mu_2,\sigma_2)$ with density function
$$\rho(x,\theta) = \frac{a}{\sigma_1\sqrt{2\pi}}\,e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}} + \frac{1-a}{\sigma_2\sqrt{2\pi}}\,e^{-\frac{(x-\mu_2)^2}{2\sigma_2^2}},$$
where $\theta = (a,\mu_1,\sigma_1^2,\mu_2,\sigma_2^2)$ and $a\in[0,1]$.
Figure: Geodesic of Gaussian mixtures; left: in the submanifold; right: in the whole space.
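For reference, the mixture density written out in code (a direct transcription of the formula above; the parameterization here uses standard deviations rather than variances):

```python
import numpy as np

def mixture_density(x, theta):
    """rho(x, theta) for the two-component Gaussian mixture, theta = (a, mu1, sigma1, mu2, sigma2)."""
    a, mu1, s1, mu2, s2 = theta
    g1 = np.exp(-(x - mu1) ** 2 / (2 * s1 ** 2)) / (s1 * np.sqrt(2 * np.pi))
    g2 = np.exp(-(x - mu2) ** 2 / (2 * s2 ** 2)) / (s2 * np.sqrt(2 * np.pi))
    return a * g1 + (1 - a) * g2
```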

Slide 22

Wasserstein fitting problems
Consider the fitting problem
$$\min_\theta\, W\Big(\rho(\cdot,\theta),\ \frac{1}{N}\sum_{i=1}^{N}\delta_{x_i}\Big).$$
We perform the following iterative algorithms to solve the optimization problem:
Gradient descent (GD): $\theta^{n+1} = \theta^{n} - h\,\nabla_\theta(\tfrac{1}{2}W^2)|_{\theta^n}$;
Wasserstein GD: $\theta^{n+1} = \theta^{n} - h\,G_W(\theta^{n})^{-1}\nabla_\theta(\tfrac{1}{2}W^2)|_{\theta^n}$;
Fisher-Rao GD: $\theta^{n+1} = \theta^{n} - h\,G_F(\theta^{n})^{-1}\nabla_\theta(\tfrac{1}{2}W^2)|_{\theta^n}$.

Slide 23

Preconditioner
Figure: Objective value versus number of iterations for Gaussian mixture distribution fitting with the Wasserstein model; compared methods: GD, GD with diagonal preconditioning, Wasserstein GD, hybrid Wasserstein GD, and Fisher-Rao GD.

Slide 24

Example III: Generative Adversarial Networks
For each parameter $\theta\in\mathbb{R}^d$ and a given neural-network-parameterized mapping function $g_\theta$, consider the pushforward model $\rho_\theta = g_\theta \# p(z)$.
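A minimal PyTorch-style sketch of such a pushforward model (the network sizes are arbitrary placeholders):

```python
import torch

# g_theta: a toy generator; rho_theta = g_theta # p(z) is realized by sampling z and mapping it.
g_theta = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2))
z = torch.randn(256, 8)      # z ~ p(z), a standard Gaussian latent
samples = g_theta(z)         # samples approximately distributed according to rho_theta
```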

Slide 25

Wasserstein natural proximal
The update scheme follows
$$\theta^{k+1} = \arg\min_{\theta\in\Theta}\ F(\rho_\theta) + \frac{1}{2h}\,d_W(\theta,\theta^{k})^2,$$
where $\theta$ denotes the parameters of the generator, $F(\rho_\theta)$ is the loss function, and $d_W$ is the Wasserstein metric. In practice, we approximate the Wasserstein metric to obtain the following update:
$$\theta^{k+1} = \arg\min_{\theta\in\Theta}\ F(\rho_\theta) + \frac{1}{B}\sum_{i=1}^{B}\frac{1}{2h}\,\|g_\theta(z_i) - g_{\theta^{k}}(z_i)\|^2,$$
where $g_\theta$ is the generator, $B$ is the batch size, and the $z_i$ are inputs to the generator.
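A minimal PyTorch-style sketch of the sample-based proximal penalty (the function name and the frozen copy of the previous generator are my own scaffolding); at each outer step one would minimize F(ρθ) plus this term:

```python
import torch

def wasserstein_proximal_penalty(generator, generator_prev, z_batch, h=0.1):
    """(1/B) * sum_i (1/(2h)) * ||g_theta(z_i) - g_theta_k(z_i)||^2 (sketch).

    generator_prev is a frozen copy of the generator at the previous outer iterate theta_k."""
    fake = generator(z_batch)
    with torch.no_grad():
        fake_prev = generator_prev(z_batch)
    diff = (fake - fake_prev).flatten(start_dim=1)
    return diff.pow(2).sum(dim=1).mean() / (2.0 * h)
```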

Slide 26

Examples
Figure: The relaxed Wasserstein proximal of GANs on the CIFAR-10 (left) and CelebA (right) datasets.

Slide 27

Discussions
In this talk, we demonstrated the possibility of applying dynamical optimal transport to probability models. Many interesting questions await us, including but not limited to: natural proximal learning; scientific computing via Wasserstein information geometry; and more.

Slide 28

Main references
Wuchen Li. Geometry of probability simplex via optimal transport, 2018.
Wuchen Li and Guido Montufar. Natural gradient via optimal transport, Information Geometry, 2018.
Yifan Chen and Wuchen Li. Natural gradient in Wasserstein statistical manifold, 2018.
Alex Lin, Wuchen Li, Stanley Osher and Guido Montufar. Wasserstein proximal of GANs, 2018.

Slide 29

Conferences
ICIAM 2019 on Optimal Transport for Nonlinear Problems; Geometric Science of Information 2019 special session: Wasserstein Information Geometry; IPAM: Mean Field Games 2020.