Slide 1

Slide 1 text

Transport information Hessian distances
Wuchen Li
University of South Carolina
Divergence Statistics
Geometric Science of Information

Slide 2

Slide 2 text

History of Statistical Divergences

Slide 3

Slide 3 text

Examples in Euclidean space

Given $X, Y \in \mathbb{R}_+$, consider a divergence function between them, $D\colon \mathbb{R}_+ \times \mathbb{R}_+ \to \mathbb{R}_+$. Several examples are given below.

Squared Euclidean distance: $D(X\|Y) = (X - Y)^2$;
Kullback-Leibler (KL) divergence: $D(X\|Y) = X \log\frac{X}{Y}$;
Squared Hellinger distance: $D(X\|Y) = 4(\sqrt{X} - \sqrt{Y})^2$.

We briefly review them in $L^2$ space, and we plan to build their counterparts in optimal transport (Wasserstein) space.
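The three pointwise divergences listed on this slide can be evaluated directly. The following is a minimal sketch; the function names are illustrative and not from the slides, and the inputs are assumed to be strictly positive.

```python
import math

def squared_euclidean(x, y):
    """D(X||Y) = (X - Y)^2."""
    return (x - y) ** 2

def kl_pointwise(x, y):
    """D(X||Y) = X log(X / Y), for x, y > 0."""
    return x * math.log(x / y)

def squared_hellinger(x, y):
    """D(X||Y) = 4 (sqrt(X) - sqrt(Y))^2, for x, y >= 0."""
    return 4.0 * (math.sqrt(x) - math.sqrt(y)) ** 2

# All three vanish when the two arguments coincide.
for d in (squared_euclidean, kl_pointwise, squared_hellinger):
    print(d.__name__, d(4.0, 1.0), d(2.0, 2.0))
```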

Slide 4

Slide 4 text

KL divergence

One important example of a divergence functional is the KL divergence:

$$D_{\mathrm{KL}}(p\|q) = \int_\Omega p(x) \log\frac{p(x)}{q(x)}\,dx.$$

The KL divergence has several properties.

Nonsymmetry: $D_{\mathrm{KL}}(p\|q) \neq D_{\mathrm{KL}}(q\|p)$ in general;
Separable;
Convexity in both variables $p$ and $q$.
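The nonsymmetry of the KL divergence is easy to see numerically. The sketch below assumes densities sampled on a uniform grid with cell width `dx`; the helper name and the two-cell example are illustrative, not from the slides.

```python
import math

def kl_divergence(p, q, dx=1.0):
    """Riemann-sum approximation of int p log(p / q) dx on a uniform grid.

    Cells with p = 0 contribute nothing; q is assumed positive wherever p is.
    """
    return dx * sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # D_KL(p || q)
print(kl_divergence(q, p))  # D_KL(q || p), a different value
```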

Slide 5

Slide 5 text

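Hessian distance

In particular, there is a Hessian metric for the KL divergence. Observe that, for a perturbation $\dot q$ with $\int_\Omega \dot q(x)\,dx = 0$,

$$D_{\mathrm{KL}}(q + \dot q\,\|\,q) = \frac{1}{2}\, g_q(\dot q, \dot q) + o(\|\dot q\|_{L^2}^2),$$

where the notation

$$g_q(\dot q, \dot q) = \int_\Omega \frac{|\dot q(x)|^2}{q(x)}\,dx$$

represents the Hessian operator of the negative entropy $\int_\Omega q(x) \log q(x)\,dx$ in $L^2$ space. Here $g_q(\cdot, \cdot)$ is a Hessian metric, also named the Fisher-Rao information metric.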

Slide 6

Slide 6 text

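Hessian distance

The Hessian metric of the KL divergence induces the distance function below.

$$
\begin{aligned}
D_H(p, q)^2
&= \inf_{\gamma\colon [0,1]\times\Omega\to\mathbb{R}} \Big\{ \int_0^1 g_{\gamma_t}(\partial_t \gamma_t, \partial_t \gamma_t)\,dt \;:\; \gamma_0 = p,\ \gamma_1 = q \Big\} \\
&= \inf_{\gamma\colon [0,1]\times\Omega\to\mathbb{R}} \Big\{ \int_0^1\!\!\int_\Omega \frac{|\partial_t \gamma(t, x)|^2}{\gamma(t, x)}\,dx\,dt \;:\; \gamma_0 = p,\ \gamma_1 = q \Big\} \\
&= \inf_{\gamma\colon [0,1]\times\Omega\to\mathbb{R}} \Big\{ \int_0^1\!\!\int_\Omega \big|2\,\partial_t \sqrt{\gamma(t, x)}\big|^2\,dx\,dt \;:\; \gamma_0 = p,\ \gamma_1 = q \Big\} \\
&= 4 \int_\Omega \big(\sqrt{p(x)} - \sqrt{q(x)}\big)^2\,dx.
\end{aligned}
$$

Here $D_H$ is named the Hellinger distance.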
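The closed form $4\int_\Omega (\sqrt{p} - \sqrt{q})^2\,dx$ can be checked numerically. The sketch below is an illustration under stated assumptions, not the author's computation: densities are piecewise constant and strictly positive on a uniform grid, the candidate minimizer is the path that is a straight line in $\sqrt{\gamma}$, and a straight line in $\gamma$ itself is used as a comparison path. All function names are hypothetical.

```python
import math

def hellinger_squared(p, q, dx):
    """Closed form 4 * int (sqrt(p) - sqrt(q))^2 dx on a uniform grid."""
    return 4.0 * dx * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

def sqrt_path_energy(p, q, dx, steps=100):
    """Energy int_0^1 int |d_t gamma|^2 / gamma dx dt along the path
    gamma_t = ((1 - t) sqrt(p) + t sqrt(q))^2.

    The time derivative is a finite difference over each step and gamma
    is evaluated at the step midpoint.
    """
    dt = 1.0 / steps
    total = 0.0
    for k in range(steps):
        t0, t1 = k * dt, (k + 1) * dt
        tm = 0.5 * (t0 + t1)
        for a, b in zip(p, q):
            ra, rb = math.sqrt(a), math.sqrt(b)
            g0 = ((1 - t0) * ra + t0 * rb) ** 2
            g1 = ((1 - t1) * ra + t1 * rb) ** 2
            gm = ((1 - tm) * ra + tm * rb) ** 2
            total += ((g1 - g0) / dt) ** 2 / gm * dx * dt
    return total

def linear_path_energy(p, q, dx):
    """Energy of gamma_t = (1 - t) p + t q, with the t-integral done exactly:
    int_0^1 (b - a)^2 / (a + t (b - a)) dt = (b - a) log(b / a)."""
    return dx * sum((b - a) * math.log(b / a) for a, b in zip(p, q))

p = [0.5, 1.5]  # two cells of width 0.5, total mass 1
q = [1.2, 0.8]
dx = 0.5
print(hellinger_squared(p, q, dx), sqrt_path_energy(p, q, dx), linear_path_energy(p, q, dx))
```

In this example the square-root path reproduces the closed form, while the path that is linear in the density has a slightly larger energy, as expected for a non-optimal path.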

Slide 7

Slide 7 text

Optimal transport

What is the optimal way to move or transport a mountain with shape $X$ and density $q(x)$ to another shape $Y$ with density $p(y)$? That is,

$$\mathrm{Dist}_T(p, q)^2 = \inf_{T\colon \Omega\to\Omega} \Big\{ \int_\Omega \|T(x) - x\|^2\, q(x)\,dx \;:\; T_\# q = p \Big\}.$$

The problem was first introduced by Monge in 1781 and relaxed by Kantorovich in 1942. It introduces a metric function on the set of probability densities, named the optimal transport distance, Wasserstein metric, or Earth Mover’s distance (Ambrosio, Gangbo, McCann, Benamou, Brenier, Villani, Otto, Figalli, et al.). Nowadays, optimal transport distances have been shown to be useful in inference problems and inverse problems (Poggio, Peyré, Yunan, Engquist, Arjovsky, Osher, et al.).

Slide 8

Slide 8 text

Goals

We plan to design Hessian distances of information entropies in Wasserstein space.

Natural questions
(i) What are Hessian distances in Wasserstein space?
(ii) What is the “Hellinger” distance in Wasserstein space?

Related studies
Amari, Karakida, Oizumi, Cuturi; Guo, Hong, Yang; Leonard Wong, Yang, Zhang; Ay, Felice.

Slide 9

Slide 9 text

Optimal transport distance

In a one-dimensional sample space, the optimal transport distance has the following closed-form formulations.

$$\mathrm{Dist}_T(p, q)^2 = \int_\Omega |T(x) - x|^2\, q(x)\,dx,$$

where $T$ is a monotone mapping function such that $p(T(x))\,T'(x) = q(x)$. Substituting $y = F_q(x)$, so that $dy = q(x)\,dx$ and $T(x) = F_p^{-1}(y)$, gives

$$\mathrm{Dist}_T(p, q)^2 = \int_0^1 |F_p^{-1}(y) - F_q^{-1}(y)|^2\,dy,$$

where $F_p$, $F_q$ are the cumulative distribution functions of $p$, $q$, respectively. From now on, we call $F_p^{-1}$ the transport coordinates.
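The inverse CDF formula lends itself to a direct quadrature over $y \in [0, 1]$. The sketch below assumes the two quantile functions are available as Python callables; the function name and the uniform-density example are illustrative, not from the slides.

```python
def w2_squared_1d(inv_cdf_p, inv_cdf_q, n=1000):
    """Midpoint-rule approximation of int_0^1 |F_p^{-1}(y) - F_q^{-1}(y)|^2 dy."""
    h = 1.0 / n
    total = 0.0
    for k in range(n):
        y = (k + 0.5) * h
        total += (inv_cdf_p(y) - inv_cdf_q(y)) ** 2
    return total * h

# p uniform on [0, 1]: F_p^{-1}(y) = y.  q uniform on [1, 3]: F_q^{-1}(y) = 1 + 2 y.
# Exact value: int_0^1 (1 + y)^2 dy = 7/3.
approx = w2_squared_1d(lambda y: y, lambda y: 1.0 + 2.0 * y)
print(approx)
```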

Slide 10

Slide 10 text

Hessian metric of entropy in optimal transport space

Consider the $f$-entropy defined by

$$\mathcal{F}(p) = \int_\Omega f(p(x))\,dx.$$

The Hessian metric of the $f$-entropy in optimal transport space satisfies

$$g^T_p(\dot p, \dot p) = \int_\Omega f''(p)\,|\nabla^2 \phi|^2\, p(x)^2\,dx,$$

where $\dot p = -\nabla\cdot(p \nabla\phi)$.

Slide 11

Slide 11 text

Transport Hessian distances

Denote a one-dimensional function $h\colon \Omega \to \mathbb{R}$ by

$$h(y) = \int_1^y \sqrt{f''\Big(\frac{1}{z}\Big)}\,\frac{1}{z^{3/2}}\,dz.$$

Theorem
The squared transport Hessian distance of the $f$-entropy has the following formulations.

(i) Inverse CDF formulation:

$$\mathrm{Dist}_{TH}(p, q)^2 = \int_0^1 \big|h(\nabla_y F_p^{-1}(y)) - h(\nabla_y F_q^{-1}(y))\big|^2\,dy.$$

(ii) Mapping formulation:

$$\mathrm{Dist}_{TH}(p, q)^2 = \int_\Omega \Big|h\Big(\frac{\nabla_x T(x)}{q(x)}\Big) - h\Big(\frac{1}{q(x)}\Big)\Big|^2 q(x)\,dx,$$

where $T$ is an optimal transport mapping function such that $T_\# q = p$ and $T(x) = F_p^{-1}(F_q(x))$.
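The inverse CDF formulation can be sketched numerically. This is a rough illustration under assumptions that go beyond the slide: the definition of $h$ is read as $h(y) = \int_1^y \sqrt{f''(1/z)}\, z^{-3/2}\,dz$, the user supplies $f''$ and the derivatives of the two quantile functions as callables, and both integrals use a plain midpoint rule. All function names are hypothetical.

```python
import math

def h_from_f2(f2, y, n=200):
    """Midpoint rule for h(y) = int_1^y sqrt(f2(1/z)) * z**(-1.5) dz,
    where f2 is the second derivative f'' (assumed reading of the slide)."""
    step = (y - 1.0) / n  # negative when y < 1, zero when y == 1
    total = 0.0
    for k in range(n):
        z = 1.0 + (k + 0.5) * step
        total += math.sqrt(f2(1.0 / z)) * z ** (-1.5)
    return total * step

def transport_hessian_sq(f2, dquant_p, dquant_q, n=100):
    """Midpoint rule for int_0^1 (h(X_p'(y)) - h(X_q'(y)))^2 dy, where
    dquant_p and dquant_q are the derivatives of the quantile functions."""
    dy = 1.0 / n
    total = 0.0
    for k in range(n):
        y = (k + 0.5) * dy
        diff = h_from_f2(f2, dquant_p(y)) - h_from_f2(f2, dquant_q(y))
        total += diff ** 2
    return total * dy

# f(p) = p log p has f''(p) = 1/p.
# p uniform on [0, 1]: quantile derivative 1.  q uniform on [0, 2]: quantile derivative 2.
dist_sq = transport_hessian_sq(lambda p: 1.0 / p, lambda y: 1.0, lambda y: 2.0)
print(dist_sq, math.log(2.0) ** 2)
```

For this choice of $f$ the integrand of $h$ reduces to $1/z$, so the two printed values should agree up to quadrature error.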

Slide 12

Slide 12 text

Transport Hellinger distances

If $f(p) = p \log p$, then $f''(p) = 1/p$ and $h(z) = \log z$. Hence

$$\mathrm{Dist}_{TH}(p, q)^2 = \int_\Omega |\log \nabla_x T(x)|^2\, q(x)\,dx = \int_0^1 \big|\log \nabla_y F_p^{-1}(y) - \log \nabla_y F_q^{-1}(y)\big|^2\,dy.$$

In short, the transport Hellinger distance is the distance induced by the Hessian metric of entropy in Wasserstein space.
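As a worked example (the Gaussian setting is an assumption made here for illustration and does not appear on the slides), take $q = \mathcal{N}(m_q, \sigma_q^2)$ and $p = \mathcal{N}(m_p, \sigma_p^2)$ on $\Omega = \mathbb{R}$. The monotone map between two Gaussians is affine, so the mapping formulation can be evaluated in closed form:

```latex
\begin{aligned}
T(x) &= F_p^{-1}(F_q(x)) = m_p + \frac{\sigma_p}{\sigma_q}\,(x - m_q),
\qquad \nabla_x T(x) = \frac{\sigma_p}{\sigma_q}, \\
\mathrm{Dist}_{TH}(p, q)^2
&= \int_{\mathbb{R}} \Big|\log\frac{\sigma_p}{\sigma_q}\Big|^2 q(x)\,dx
= \Big(\log\frac{\sigma_p}{\sigma_q}\Big)^2 .
\end{aligned}
```

In this example the distance depends only on the ratio of standard deviations, so two Gaussians that differ by a translation are at distance zero.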

Slide 13

Slide 13 text

One Dimension: TKL vs KL divergence

Similarly, we can extend the study of transport Hessian distances to transport Bregman divergences.

Transport KL divergence:

$$D_{\mathrm{TKL}}(p\|q) := \int_0^1 \Big( \frac{\nabla_y F_p^{-1}(y)}{\nabla_y F_q^{-1}(y)} - \log\frac{\nabla_y F_p^{-1}(y)}{\nabla_y F_q^{-1}(y)} - 1 \Big)\,dy.$$

KL divergence:

$$D_{\mathrm{KL}}(p\|q) = \int_\Omega \nabla_x F_p(x) \log\frac{\nabla_x F_p(x)}{\nabla_x F_q(x)}\,dx.$$

Here $F_p(x) = \int^x p(s)\,ds$ and $F_q(x) = \int^x q(s)\,ds$ are the cumulative distributions of the probability densities $p$, $q$, respectively.
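To compare the two divergences concretely, the sketch below specializes both formulas to one-dimensional Gaussians, which is an assumption made here for illustration. For Gaussians the quantile function is $F^{-1}(y) = m + \sigma\,\Phi^{-1}(y)$, so the ratio of quantile derivatives is the constant $\sigma_p / \sigma_q$ and the TKL integrand does not depend on $y$. The KL expression is the standard closed form for two Gaussians. Function names are hypothetical.

```python
import math

def tkl_gaussian(sigma_p, sigma_q):
    """Transport KL divergence between N(m_p, sigma_p^2) and N(m_q, sigma_q^2).

    The quantile-derivative ratio is the constant r = sigma_p / sigma_q,
    so the integral over [0, 1] equals r - log(r) - 1; the means drop out.
    """
    r = sigma_p / sigma_q
    return r - math.log(r) - 1.0

def kl_gaussian(m_p, sigma_p, m_q, sigma_q):
    """Standard closed form of D_KL(N(m_p, sigma_p^2) || N(m_q, sigma_q^2))."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (m_p - m_q) ** 2) / (2.0 * sigma_q ** 2)
            - 0.5)

# Same variance, shifted mean: TKL vanishes, KL does not.
print(tkl_gaussian(1.0, 1.0), kl_gaussian(0.0, 1.0, 3.0, 1.0))
# Different variances: both are positive.
print(tkl_gaussian(2.0, 1.0), kl_gaussian(0.0, 2.0, 0.0, 1.0))
```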