
Learning via Wasserstein Information Geometry


Recently, optimal transport has found many applications in machine learning. In this talk, we introduce dynamical optimal transport on machine learning models. We propose to study these models as Riemannian manifolds equipped with a Wasserstein metric, which we call Wasserstein information geometry. Various developments, especially the Fokker-Planck equation and mean-field games on learning models, will be introduced. The entropy production of Shannon entropy in AI models will be established. Many numerical examples, including the restricted Boltzmann machine and the generative adversarial network, will be presented.

Wuchen Li

May 22, 2018



Transcript

  1. Information geometry Information geometry (a.k.a. the Fisher-Rao metric) plays important roles

    in information science, statistics and machine learning: population games via replicator dynamics (Shahshahani, Smith); reinforcement learning (Sutton and Barto; AlphaGo; Montufar); machine learning: natural gradient (Amari), ADAM (Kingma 2014), stochastic relaxation (Malago); and many more in the book Information Geometry (Ay et al.).
  2. Learning problems Given a data measure $\rho_{\mathrm{data}}(x) = \frac{1}{N}\sum_{i=1}^{N}\delta_{X_i}(x)$

    and a parameterized model $\rho(x,\theta)$, machine learning problems often take the form $\min_{\rho_\theta\in\rho(\Theta)} D(\rho_{\mathrm{data}},\rho_\theta)$. One typical choice of $D$ is the Kullback–Leibler divergence (relative entropy), $D(\rho_{\mathrm{data}},\rho_\theta)=\int_\Omega \rho_{\mathrm{data}}(x)\log\frac{\rho_{\mathrm{data}}(x)}{\rho(x,\theta)}\,dx$. A short sketch of this objective follows below.
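Since $\rho_{\mathrm{data}}$ is an empirical measure, this KL objective equals a $\theta$-independent constant minus the average log-likelihood, so minimizing it is maximum likelihood estimation. A minimal sketch, assuming a hypothetical one-dimensional Gaussian model $\rho(x,\theta)$ with $\theta=(\mu,\sigma)$:

```python
import numpy as np

def kl_objective(theta, data):
    """KL(rho_data, rho_theta) up to the (theta-independent) entropy of rho_data,
    i.e. the average negative log-likelihood of a Gaussian model."""
    mu, sigma = theta
    log_rho = -0.5 * ((data - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))
    return -np.mean(log_rho)
```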
  3. Natural gradient The natural gradient method refers to

    $\theta_{k+1} = \theta_k - h\, G_F(\theta)^{-1}\nabla_\theta D(\rho_{\mathrm{data}},\rho_\theta)$, where $h>0$ is a stepsize and $G_F(\theta) = \mathbb{E}_{\rho_\theta}\big[(\nabla_\theta\log\rho(X,\theta))^T(\nabla_\theta\log\rho(X,\theta))\big]$ is the Fisher-Rao metric tensor. Why natural gradient? It is parameterization invariant and provides a preconditioner for KL-divergence-based learning problems; a one-step sketch follows below.
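A minimal sketch of one natural gradient step, again assuming a hypothetical one-dimensional Gaussian model with $\theta=(\mu,\sigma)$; $G_F(\theta)$ is estimated by Monte Carlo to illustrate the general recipe (a closed form exists in this toy case):

```python
import numpy as np

def gaussian_score(x, mu, sigma):
    # Score function grad_theta log rho(x, theta) for theta = (mu, sigma).
    return np.stack([(x - mu) / sigma**2,
                     ((x - mu)**2 - sigma**2) / sigma**3], axis=-1)

def fisher_rao_metric(mu, sigma, n_samples=10_000, seed=0):
    # Monte Carlo estimate of G_F(theta) = E_{rho_theta}[score score^T].
    x = np.random.default_rng(seed).normal(mu, sigma, size=n_samples)
    s = gaussian_score(x, mu, sigma)            # (n_samples, 2)
    return s.T @ s / n_samples                  # (2, 2)

def natural_gradient_step(theta, euclidean_grad, h=0.1):
    # theta_{k+1} = theta_k - h G_F(theta_k)^{-1} grad_theta D(rho_data, rho_theta).
    G = fisher_rao_metric(*theta)
    return np.asarray(theta) - h * np.linalg.solve(G, euclidean_grad)
```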
  4. Optimal transport In recent years, optimal transport (a.k.a. Earth Mover's

    distance, Monge-Kantorovich problem, Wasserstein metric) has seen many applications: population games via Fokker-Planck equations (Degond et al. 2014, Li et al. 2016); mean field games (Lasry, Lions, Gangbo); machine learning: Wasserstein training of Boltzmann machines (Cuturi et al. 2015), Learning with a Wasserstein Loss (Frogner et al. 2015), Wasserstein GAN (Bottou et al. 2017), and many more in NIPS 2015, 2016, 2017.
  5. Why Optimal transport? Optimal transport provides a particular distance (W)

    among histograms which relies on the distance on the sample space (the ground cost $c$). E.g., let $X_0\sim\rho_0=\delta_{x_0}$ and $X_1\sim\rho_1=\delta_{x_1}$, and compare $W(\rho_0,\rho_1)=\inf_{\pi\in\Pi(\rho_0,\rho_1)}\mathbb{E}_\pi c(X_0,X_1)=c(x_0,x_1)$; versus $TV(\rho_0,\rho_1)=\int_\Omega|\rho_0(x)-\rho_1(x)|\,dx=2$; versus $KL(\rho_0\|\rho_1)=\int_\Omega\rho_0(x)\log\frac{\rho_0(x)}{\rho_1(x)}\,dx=\infty$. A numerical check follows below.
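A quick numerical check of this comparison, using SciPy's one-dimensional Wasserstein distance with ground cost $c(x_0,x_1)=|x_0-x_1|$ (the two point locations below are made up for illustration):

```python
from scipy.stats import wasserstein_distance

x0, x1 = 0.0, 3.0   # two Dirac masses delta_{x0} and delta_{x1}

# The Wasserstein distance sees the ground cost: W(delta_{x0}, delta_{x1}) = |x0 - x1|.
print(wasserstein_distance([x0], [x1]))   # -> 3.0

# TV and KL only compare densities pointwise: for any x0 != x1 the supports are
# disjoint, so TV = 2 and KL = +infinity, however near or far the two points are.
```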
  6. Wasserstein Loss function Given a data distribution $\rho_0$ and a probability

    model $\rho_1(\theta)$, consider $\min_{\theta\in\Theta} W(\rho_0,\rho_1(\theta))$. This is a double minimization problem, i.e. $W(\rho_0,\rho_1(\theta))=\min_{\pi\in\Pi(\rho_0,\rho_1(\theta))}\mathbb{E}_\pi c(X_0,X_1)$. Many applications, such as Wasserstein GAN and the Wasserstein loss, are built on the above formulation.
  7. Goals Main question: instead of looking only at the loss function,

    we propose the optimal-transport-induced gradient operator on probability models and study its properties for machine learning problems. Motivations: F. Otto, Information Geometry and its Applications (IGAIA), 2017; S. Amari, Geometric Science of Information (GSI), 2017. Related studies: constrained gradient: Carlen, Gangbo (2003); linear programming: Wong (2017), Amari, Karakida, Oizumi (2017); Gaussian measures: Takatsu (2011), Malago et al. (2017), Modin (2017), Yongxin et al. (2017), Sanctis (2017), and many more.
  8. Problem formulation Mapping formulation: Monge problem (1781): Monge-Ampère equation;

    Static formulation: Kantorovich problem (1940): linear programming; Dynamical formulation: density optimal control (Nelson, Lafferty, Gangbo, Otto, Villani, Chow, Zhou, Osher). In this talk, we apply density optimal control to learning problems.
  9. Density manifold Optimal transport has an optimal control reformulation via

    the dual of the dual of linear programming: $\inf_{\rho_t}\int_0^1 g_W(\partial_t\rho_t,\partial_t\rho_t)\,dt=\int_0^1\int_\Omega(\nabla\Phi_t,\nabla\Phi_t)\rho_t\,dx\,dt$, under the dynamical constraint, i.e. the continuity equation $\partial_t\rho_t+\nabla\cdot(\rho_t\nabla\Phi_t)=0$, $\rho_0=\rho^0$, $\rho_1=\rho^1$. Here $(\mathcal{P}(\Omega), g_W)$ forms an infinite-dimensional Riemannian manifold. (Footnote: John D. Lafferty, The density manifold and configuration space quantization, 1988.)
  10. Density submanifold Consider a Riemannian metric $g_\theta$ on $\Theta$ as

    the pull-back of $g_W$ on $\mathcal{P}(\Omega)$. In other words, for $\xi,\eta\in T_\theta\Theta$, $g_\theta(\xi,\eta)=g_W(d_\theta\rho(\xi),d_\theta\rho(\eta))$, where $d_\theta\rho(\cdot,\xi)=\langle\nabla_\theta\rho(\cdot,\theta),\xi\rangle$ and $d_\theta\rho(\cdot,\eta)=\langle\nabla_\theta\rho(\cdot,\theta),\eta\rangle$, with $\langle\cdot,\cdot\rangle$ the Euclidean inner product in $\mathbb{R}^d$.
  11. Wasserstein statistical manifold Denote $g_\theta(\xi,\eta)=\xi^T G_W(\theta)\eta$,

    where $G_W(\theta)\in\mathbb{R}^{d\times d}$. Definition (Wasserstein metric tensor). Define the Wasserstein metric tensor $G_W(\theta)$ on $T_\theta\Theta$ by $G_W(\theta)_{ij}=\int_\Omega\rho(x,\theta)\,\nabla_x\Phi_i(x)\cdot\nabla_x\Phi_j(x)\,dx$, where $\frac{\partial}{\partial\theta_i}\rho(x,\theta)=-\nabla_x\cdot(\rho(x,\theta)\nabla_x\Phi_i(x))$. This inner product $g_\theta$ is consistent with the restriction of the Wasserstein metric $g_W$ to $\rho(\Theta)$. For this reason, we call $(\rho(\Theta),g_\theta)$ the Wasserstein statistical manifold.
  12. Wasserstein natural gradient Given a loss function

    $F:\mathcal{P}(\Omega)\to\mathbb{R}$ and a probability model $\rho(\cdot,\theta)$, the associated gradient flow on the Riemannian manifold is defined by $\frac{d\theta}{dt}=-\nabla_g F(\rho(\cdot,\theta))$. Here $\nabla_g$ is the Riemannian gradient operator satisfying $g_\theta(\nabla_g F(\rho(\cdot,\theta)),\xi)=d_\theta F(\rho(\cdot,\theta))\cdot\xi$ for any tangent vector $\xi\in T_\theta\Theta$, where $d_\theta$ denotes the differential operator (Euclidean gradient).
  13. Wasserstein natural gradient The gradient flow of the loss function

    $F(\rho(\cdot,\theta))$ in $(\Theta,g_\theta)$ satisfies $\frac{d\theta}{dt}=-G_W(\theta)^{-1}\nabla_\theta F(\rho(\cdot,\theta))$. If $\rho(\cdot,\theta)=\rho(x)$ is the identity map, then we recover the standard Wasserstein gradient flow $\partial_t\rho_t=\nabla\cdot\big(\rho_t\nabla\frac{\delta}{\delta\rho}F(\rho_t)\big)$.
  14. Example I: Hierarchical log-linear models For an inclusion-closed set

    $S$ of subsets of $\{1,\ldots,n\}$, the hierarchical model $\mathcal{E}_S$ for $n$ binary variables is the set of distributions of the form $p_x(\theta)=\frac{1}{Z(\theta)}\exp\big(\sum_{\lambda\in S}\theta_\lambda\phi_\lambda(x)\big)$, $x\in\{0,1\}^n$, for all possible choices of parameters $\theta_\lambda\in\mathbb{R}$, $\lambda\in S$. Here the $\phi_\lambda$ are real-valued functions with $\phi_\lambda(x)=\phi_\lambda(y)$ whenever $x_i=y_i$ for all $i\in\lambda$. A small construction sketch follows below.
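A minimal construction sketch, assuming the common (but here hypothetical) choice of monomial statistics $\phi_\lambda(x)=\prod_{i\in\lambda}x_i$, which indeed depends only on the coordinates in $\lambda$:

```python
import itertools
import numpy as np

def hierarchical_log_linear(theta, S, n):
    """Tabulate p_x(theta) = exp(sum_{lambda in S} theta_lambda phi_lambda(x)) / Z(theta)
    over all x in {0,1}^n, with phi_lambda(x) = prod_{i in lambda} x_i (assumed)."""
    states = np.array(list(itertools.product([0, 1], repeat=n)))
    energy = np.zeros(len(states))
    for lam, th in zip(S, theta):
        energy += th * states[:, list(lam)].prod(axis=1)
    p = np.exp(energy)
    return states, p / p.sum()          # division by p.sum() is the 1/Z(theta)

# Example: a pairwise, inclusion-closed model on n = 3 binary variables.
S = [(0,), (1,), (2,), (0, 1), (1, 2)]
states, p = hierarchical_log_linear(np.zeros(len(S)), S, n=3)   # uniform at theta = 0
```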
  15. Maximum likelihood via Wasserstein gradient descent Consider

    $\theta_{k+1}=\theta_k-h\,G(\theta_k)^{-1}\nabla\,\mathrm{KL}(q\,\|\,p_{\theta_k})$, where $G(\theta)$ is one of: $I$ (Euclidean metric tensor), $G_F(\theta)$ (Fisher-Rao metric tensor), or $G_W(\theta)$ (Wasserstein metric tensor); a one-step sketch follows below. Figure: Compared results for the Wasserstein natural gradient, the Fisher-Rao natural gradient and the Euclidean gradient.
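A one-step sketch of this iteration; `grad_kl` is the Euclidean gradient of $\mathrm{KL}(q\|p_\theta)$, and `G_F` / `G_W` stand for model-specific metric tensors supplied by the caller (e.g. routines like the sketches elsewhere in this transcript):

```python
import numpy as np

def preconditioned_step(theta, grad_kl, metric="wasserstein", h=0.05,
                        G_F=None, G_W=None):
    """One update theta_{k+1} = theta_k - h G(theta_k)^{-1} grad KL(q || p_theta),
    with G chosen as the identity, the Fisher-Rao tensor, or the Wasserstein tensor."""
    if metric == "euclidean":
        return theta - h * grad_kl
    G = G_F if metric == "fisher" else G_W
    return theta - h * np.linalg.solve(G, grad_kl)
```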
  16. Example II: One-dimensional case In a one-dimensional sample space,

    the Wasserstein metric tensor has a non-trivial explicit form: $G_W(\theta)=\int_{\mathbb{R}}\frac{1}{\rho(x,\theta)}(\nabla_\theta F(x,\theta))^T\nabla_\theta F(x,\theta)\,dx$, where $F(x,\theta)=\int_{-\infty}^x\rho(y,\theta)\,dy$ is the cumulative distribution function. Compare it with the Fisher-Rao metric tensor $G_F(\theta)=\int_{\mathbb{R}}\frac{1}{\rho(x,\theta)}(\nabla_\theta\rho(x,\theta))^T\nabla_\theta\rho(x,\theta)\,dx$. A quadrature sketch of both tensors follows below.
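A numerical sketch of both tensors on a uniform grid, assuming a user-supplied density callable `density(x, theta)` (hypothetical name); the CDF is accumulated with a cumulative sum and $\theta$-gradients use central finite differences:

```python
import numpy as np

def metric_tensors_1d(density, theta, x_grid, eps=1e-5):
    """Approximate G_W(theta) = int (grad_theta F)^T (grad_theta F) / rho dx and
    G_F(theta) = int (grad_theta rho)^T (grad_theta rho) / rho dx on a uniform grid,
    where F(x, theta) is the CDF of rho(., theta)."""
    theta = np.asarray(theta, dtype=float)
    d, dx = len(theta), x_grid[1] - x_grid[0]
    rho = density(x_grid, theta)
    cdf = lambda x, th: np.cumsum(density(x, th)) * dx      # F(x, theta) ~ sum rho dx

    def grad_theta(f):
        cols = []
        for i in range(d):
            e = np.zeros(d); e[i] = eps
            cols.append((f(x_grid, theta + e) - f(x_grid, theta - e)) / (2 * eps))
        return np.stack(cols, axis=1)                        # (len(x_grid), d)

    dF, drho = grad_theta(cdf), grad_theta(density)
    G_W = (dF.T * (1.0 / rho)) @ dF * dx
    G_F = (drho.T * (1.0 / rho)) @ drho * dx
    return G_W, G_F
```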
  17. Mixture model Consider the mixture model $a\,N(\mu_1,\sigma_1^2)$

    $+\,(1-a)\,N(\mu_2,\sigma_2^2)$ with density $\rho(x,\theta)=\frac{a}{\sigma_1\sqrt{2\pi}}e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}}+\frac{1-a}{\sigma_2\sqrt{2\pi}}e^{-\frac{(x-\mu_2)^2}{2\sigma_2^2}}$, where $\theta=(a,\mu_1,\sigma_1^2,\mu_2,\sigma_2^2)$ and $a\in[0,1]$. Figure: Geodesics between Gaussian mixtures; left: in the submanifold; right: in the whole space (axes: position x, time t, density ρ).
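A density callable matching this model (parameterized here with standard deviations rather than variances, purely for convenience), which can be plugged directly into the `metric_tensors_1d` sketch above:

```python
import numpy as np

def mixture_density(x, theta):
    """rho(x, theta) for theta = (a, mu1, sigma1, mu2, sigma2)."""
    a, mu1, s1, mu2, s2 = theta
    gauss = lambda m, s: np.exp(-(x - m)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
    return a * gauss(mu1, s1) + (1 - a) * gauss(mu2, s2)

# Example usage with the quadrature sketch above:
x_grid = np.linspace(-15, 15, 3001)
# G_W, G_F = metric_tensors_1d(mixture_density, [0.4, -3.0, 1.0, 3.0, 1.5], x_grid)
```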
  18. Wasserstein fitting problems Consider the fitting problem

    $\min_\theta W\big(\rho(\cdot,\theta),\frac{1}{N}\sum_{i=1}^N\delta_{x_i}\big)$. We run the following iterative algorithms to solve the optimization problem (a plain-GD sketch follows below): Gradient descent (GD): $\theta_{n+1}=\theta_n-h\,\nabla_\theta(\frac{1}{2}W^2)|_{\theta_n}$; Wasserstein GD: $\theta_{n+1}=\theta_n-h\,G_W(\theta_n)^{-1}\nabla_\theta(\frac{1}{2}W^2)|_{\theta_n}$; Fisher-Rao GD: $\theta_{n+1}=\theta_n-h\,G_F(\theta_n)^{-1}\nabla_\theta(\frac{1}{2}W^2)|_{\theta_n}$.
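A rough sketch of the plain-GD variant in one dimension, assuming a hypothetical sampler `sample_model(theta, n, rng)` for $\rho(\cdot,\theta)$. The squared 1-D $W_2$ between equal-size empirical measures is computed by matching sorted samples, and the gradient uses central finite differences with common random numbers to tame sampling noise; preconditioning the gradient with $G_W(\theta_n)^{-1}$ or $G_F(\theta_n)^{-1}$ gives the other two variants.

```python
import numpy as np

def w2_squared(a, b):
    # Squared 2-Wasserstein distance between equal-size 1-D empirical measures:
    # the optimal coupling in 1-D matches sorted samples.
    return np.mean((np.sort(a) - np.sort(b)) ** 2)

def fit_wasserstein_gd(sample_model, theta, data, h=0.1, iters=200, eps=1e-3):
    theta = np.asarray(theta, dtype=float)
    for k in range(iters):
        grad = np.zeros_like(theta)
        for i in range(len(theta)):
            e = np.zeros_like(theta); e[i] = eps
            rng_p, rng_m = np.random.default_rng(k), np.random.default_rng(k)  # common random numbers
            fp = 0.5 * w2_squared(sample_model(theta + e, len(data), rng_p), data)
            fm = 0.5 * w2_squared(sample_model(theta - e, len(data), rng_m), data)
            grad[i] = (fp - fm) / (2 * eps)
        theta = theta - h * grad          # Wasserstein / Fisher-Rao GD: precondition here
    return theta
```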
  19. Preconditioner

    Figure: Objective value vs. number of iterations for Gaussian mixture distribution fitting with the Wasserstein model, comparing GD, GD with diagonal preconditioning, Wasserstein GD, hybrid Wasserstein GD and Fisher-Rao GD.
  20. Example III: Generative Adversarial Networks For each parameter $\theta\in\mathbb{R}^d$

    and a given neural-network-parameterized mapping function $g_\theta$, consider the pushforward measure $\rho_\theta=g_\theta\#p(z)$. A sampling sketch follows below.
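A minimal sampling sketch of the pushforward, assuming any `torch.nn.Module` generator mapping latent noise to samples (the names and the standard-Gaussian latent are illustrative):

```python
import torch

def sample_pushforward(generator, n, latent_dim):
    # Draw z ~ p(z) (taken here to be standard Gaussian) and push it forward
    # through the generator: outputs are samples from rho_theta = g_theta # p(z).
    z = torch.randn(n, latent_dim)
    return generator(z)
```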
  21. Wasserstein natural proximal The update scheme is

    $\theta_{k+1}=\arg\min_{\theta\in\Theta}F(\rho_\theta)+\frac{1}{2h}d_W(\theta,\theta_k)^2$, where $\theta$ denotes the parameters of the generator, $F(\rho_\theta)$ is the loss function, and $d_W$ is the Wasserstein metric. In practice, we approximate the Wasserstein metric to obtain the update $\theta_{k+1}=\arg\min_{\theta\in\Theta}F(\rho_\theta)+\frac{1}{B}\sum_{i=1}^B\frac{1}{2h}\|g_\theta(z_i)-g_{\theta_k}(z_i)\|^2$, where $g_\theta$ is the generator, $B$ is the batch size, and the $z_i$ are inputs to the generator. A training-loss sketch follows below.
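A training-loss sketch of this relaxed proximal step in PyTorch, assuming `loss_F` is whatever generator loss the GAN uses and `prev_generator` is a frozen copy of the generator holding $\theta_k$ (all names are illustrative, not the paper's code):

```python
import copy
import torch

def proximal_generator_loss(generator, prev_generator, z_batch, loss_F, h=0.1):
    """F(rho_theta) + (1/B) sum_i ||g_theta(z_i) - g_{theta_k}(z_i)||^2 / (2h)."""
    fake = generator(z_batch)
    with torch.no_grad():                       # theta_k is held fixed
        fake_prev = prev_generator(z_batch)
    prox = ((fake - fake_prev) ** 2).flatten(1).sum(dim=1).mean() / (2.0 * h)
    return loss_F(fake) + prox

# Typical use: snapshot prev_generator = copy.deepcopy(generator) before each
# generator update, then backpropagate through the loss above.
```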
  22. Examples Figure: The relaxed Wasserstein proximal of GANs on the

    CIFAR-10 (left) and CelebA (right) datasets.
  23. Discussions In this talk, we demonstrate the possibility of applying

    dynamical optimal transport to probability models. Many interesting questions await, including but not limited to: natural proximal learning; scientific computing via Wasserstein information geometry; and more.
  24. Main references Wuchen Li. Geometry of probability simplex via optimal

    transport, 2018. Wuchen Li and Guido Montufar. Natural gradient via optimal transport, Information Geometry, 2018. Yifan Chen and Wuchen Li. Natural gradient in Wasserstein statistical manifold, 2018. Alex Lin, Wuchen Li, Stanley Osher and Guido Montufar. Wasserstein proximal of GANs, 2018.
  25. Conferences ICIAM 2019 on Optimal Transport for Nonlinear Problems; Geometric

    Science of Information 2019, special session on Wasserstein Information Geometry; IPAM: Mean Field Games, 2020.