Transport information Hessian distances

Wuchen Li, University of South Carolina
Divergence Statistics, Geometric Science of Information.

History of Statistical Divergences

Examples in Euclidean space
Given $X, Y \in \mathbb{R}_+$, consider a divergence function between them, $D\colon \mathbb{R}_+ \times \mathbb{R}_+ \to \mathbb{R}_+$. Several examples are given below.
Squared Euclidean distance: $D(X\|Y) = (X - Y)^2$;
Kullback-Leibler (KL) divergence: $D(X\|Y) = X \log\frac{X}{Y}$;
Squared Hellinger distance: $D(X\|Y) = 4(\sqrt{X} - \sqrt{Y})^2$.
We briefly review their analogues for densities in $L^2$ space, and then build their counterparts in optimal transport (Wasserstein) space.

KL divergence
One important example of a divergence functional is the KL divergence:
$$D_{\mathrm{KL}}(p\|q) = \int_\Omega p(x) \log\frac{p(x)}{q(x)}\,dx.$$
The KL divergence has several useful properties:
Nonsymmetry: $D_{\mathrm{KL}}(p\|q) \neq D_{\mathrm{KL}}(q\|p)$ in general;
Separability;
Convexity in both variables $p$ and $q$.
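As a minimal numerical sketch of the nonsymmetry property, the snippet below discretizes two Gaussian densities on a uniform grid and approximates the KL integral by a Riemann sum. The Gaussian parameters, the grid, and the helper names `gaussian_pdf` and `kl_divergence` are illustrative assumptions, not part of the talk.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated on the array x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def kl_divergence(p, q, dx):
    """Riemann-sum approximation of the integral of p log(p/q) on a uniform grid."""
    return float(np.sum(p * np.log(p / q)) * dx)

# Uniform grid wide enough that both densities are negligible at the ends.
x = np.linspace(-15.0, 15.0, 20001)
dx = x[1] - x[0]
p = gaussian_pdf(x, mu=1.0, sigma=2.0)
q = gaussian_pdf(x, mu=0.0, sigma=1.0)

kl_pq = kl_divergence(p, q, dx)  # D_KL(p || q)
kl_qp = kl_divergence(q, p, dx)  # D_KL(q || p)
print(kl_pq, kl_qp)
```

For this pair of Gaussians the closed-form values are $2 - \log 2$ and $\log 2 - \tfrac14$, so swapping the arguments changes the divergence.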

Hessian distance
In particular, there is a Hessian metric associated with the KL divergence. For a perturbation $\dot q$ with $\int_\Omega \dot q(x)\,dx = 0$, observe that
$$D_{\mathrm{KL}}(q + \dot q\|q) = \frac{1}{2}\, g_q(\dot q, \dot q) + o(\|\dot q\|_{L^2}^2),$$
where
$$g_q(\dot q, \dot q) = \int_\Omega \frac{|\dot q(x)|^2}{q(x)}\,dx$$
represents the Hessian operator of the negative entropy $\int_\Omega q(x)\log q(x)\,dx$ in $L^2$ space. Here $g_q(\cdot,\cdot)$ is a Hessian metric, also called the Fisher-Rao information metric.
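The expansion can be checked with a formal second-order Taylor argument. The derivation below is added for illustration and assumes $\int_\Omega \dot q\,dx = 0$, so that $q + \dot q$ keeps unit total mass.

```latex
\begin{aligned}
D_{\mathrm{KL}}(q+\dot q\,\|\,q)
&= \int_\Omega (q+\dot q)\,\log\Big(1+\frac{\dot q}{q}\Big)\,dx \\
&= \int_\Omega (q+\dot q)\Big(\frac{\dot q}{q}-\frac{\dot q^{2}}{2q^{2}}\Big)\,dx + o(\|\dot q\|_{L^2}^2) \\
&= \int_\Omega \dot q\,dx + \frac12\int_\Omega \frac{\dot q^{2}}{q}\,dx + o(\|\dot q\|_{L^2}^2) \\
&= \frac12\int_\Omega \frac{|\dot q(x)|^{2}}{q(x)}\,dx + o(\|\dot q\|_{L^2}^2).
\end{aligned}
```

The quadratic term is one half of the Hessian quadratic form $\int_\Omega |\dot q|^2/q\,dx$, which is the usual factor in a second-order Taylor expansion.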

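Hessian distance
The Hessian metric of the KL divergence induces the following distance function:
$$\begin{aligned}
D_H(p,q)^2 &= \inf_{\gamma\colon [0,1]\times\Omega\to\mathbb{R}} \Big\{ \int_0^1 g_{\gamma_t}(\partial_t\gamma_t, \partial_t\gamma_t)\,dt : \gamma_0 = p,\ \gamma_1 = q \Big\} \\
&= \inf_{\gamma\colon [0,1]\times\Omega\to\mathbb{R}} \Big\{ \int_0^1\!\!\int_\Omega \frac{|\partial_t\gamma(t,x)|^2}{\gamma(t,x)}\,dx\,dt : \gamma_0 = p,\ \gamma_1 = q \Big\} \\
&= \inf_{\gamma\colon [0,1]\times\Omega\to\mathbb{R}} \Big\{ \int_0^1\!\!\int_\Omega \big|2\,\partial_t\sqrt{\gamma(t,x)}\big|^2\,dx\,dt : \gamma_0 = p,\ \gamma_1 = q \Big\} \\
&= 4\int_\Omega \big(\sqrt{p(x)} - \sqrt{q(x)}\big)^2\,dx.
\end{aligned}$$
Here $D_H$ is called the Hellinger distance.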
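The last two equalities rest on a change of variables that the slide leaves implicit. A brief sketch, using the auxiliary symbol $u$ (not in the original) for the square-root density:

```latex
u(t,x) := \sqrt{\gamma(t,x)}
\quad\Longrightarrow\quad
\frac{|\partial_t \gamma|^2}{\gamma}
= \frac{|2u\,\partial_t u|^2}{u^2}
= |2\,\partial_t u|^2 .
% The action is then 4 times the flat L^2 energy of the path t -> u_t,
% which is minimized by the straight line between the endpoints:
u_t = (1-t)\sqrt{p} + t\sqrt{q},
\qquad
4\int_0^1\!\!\int_\Omega |\partial_t u_t|^2\,dx\,dt
= 4\int_\Omega \big(\sqrt{p(x)}-\sqrt{q(x)}\big)^2\,dx .
```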

Optimal transport
What is the optimal way to move, or transport, a mountain of shape $X$ with density $q(x)$ to another shape $Y$ with density $p(y)$? That is,
$$\mathrm{Dist}_T(p,q)^2 = \inf_{T\colon \Omega\to\Omega} \Big\{ \int_\Omega \|T(x) - x\|^2\, q(x)\,dx : T_\# q = p \Big\}.$$
The problem was first introduced by Monge in 1781 and relaxed by Kantorovich in the 1940s. It defines a metric on the set of probability densities, known as the optimal transport distance, Wasserstein metric, or Earth Mover’s distance (Ambrosio, Gangbo, McCann, Benamou, Brenier, Villani, Otto, Figalli, et al.). Optimal transport distances have since proved useful in inference problems and inverse problems (Poggio, Peyré, Yunan, Engquist, Arjovsky, Osher, et al.).

Goals
We plan to design Hessian distances of information entropies in Wasserstein space.
Natural questions:
(i) What are Hessian distances in Wasserstein space?
(ii) What is the “Hellinger” distance in Wasserstein space?
Related studies: Amari, Karakida, Oizumi, Cuturi; Guo, Hong, Yang; Leonard Wong, Yang, Zhang; Ay, Felice.

Optimal transport distance
In a one-dimensional sample space, the optimal transport distance has the following closed-form formulations:
$$\mathrm{Dist}_T(p,q)^2 = \int_\Omega |T(x) - x|^2\, q(x)\,dx,$$
where $T$ is a monotone mapping function such that $p(T(x))\,T'(x) = q(x)$. With the change of variables $y = F_q(x)$,
$$\mathrm{Dist}_T(p,q)^2 = \int_0^1 |F_p^{-1}(y) - F_q^{-1}(y)|^2\,dy,$$
where $F_p$, $F_q$ are the cumulative distribution functions of $p$, $q$, respectively. From now on, we call $F_p^{-1}$ the transport coordinates.
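The inverse CDF formula is easy to evaluate numerically. The sketch below is an illustration under stated assumptions: it uses two Gaussian densities (chosen here only because the standard library provides their quantile functions), a midpoint rule on $(0,1)$, and a hypothetical helper name `w2_squared_1d`.

```python
import numpy as np
from statistics import NormalDist

def w2_squared_1d(inv_cdf_p, inv_cdf_q, n=20000):
    """Approximate the integral over (0, 1) of |F_p^{-1}(y) - F_q^{-1}(y)|^2.

    Midpoints are used so the endpoints y = 0 and y = 1, where the
    quantile functions of unbounded distributions diverge, are never evaluated.
    """
    y = (np.arange(n) + 0.5) / n
    xp = np.array([inv_cdf_p(v) for v in y])  # transport coordinates of p
    xq = np.array([inv_cdf_q(v) for v in y])  # transport coordinates of q
    return float(np.mean((xp - xq) ** 2))

p = NormalDist(mu=1.0, sigma=2.0)
q = NormalDist(mu=0.0, sigma=1.0)

approx = w2_squared_1d(p.inv_cdf, q.inv_cdf)
# Known closed form for one-dimensional Gaussians:
# (mu_p - mu_q)^2 + (sigma_p - sigma_q)^2.
exact = (p.mean - q.mean) ** 2 + (p.stdev - q.stdev) ** 2
print(approx, exact)
```

The same function works for any pair of distributions whose quantile functions are available as callables.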

Hessian metric of entropy in optimal transport space
Consider the $f$-entropy
$$\mathcal{F}(p) = \int_\Omega f(p(x))\,dx.$$
The Hessian metric of the $f$-entropy in optimal transport space satisfies
$$g^T_p(\dot p, \dot p) = \int_\Omega f''(p(x))\,|\nabla^2\phi(x)|^2\, p(x)^2\,dx,$$
where $\dot p = -\nabla\cdot(p\nabla\phi)$.
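For instance, taking $f(p) = p\log p$ (the case treated later in the talk) gives $f''(p) = 1/p$, and the metric reduces to a weighted $L^2$ norm of the second derivative of the potential:

```latex
g^T_p(\dot p,\dot p)
= \int_\Omega \frac{1}{p(x)}\,|\nabla^2\phi(x)|^2\,p(x)^2\,dx
= \int_\Omega |\nabla^2\phi(x)|^2\,p(x)\,dx,
\qquad \dot p = -\nabla\cdot(p\nabla\phi).
```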

Transport Hessian distances
Define a one-dimensional function $h\colon \Omega \to \mathbb{R}$ by
$$h(y) = \int_1^y \sqrt{f''\Big(\frac{1}{z}\Big)}\,\frac{1}{z^{3/2}}\,dz.$$
Theorem. The squared transport Hessian distance of the $f$-entropy has the following formulations.
(i) Inverse CDF formulation:
$$\mathrm{Dist}_{TH}(p,q)^2 = \int_0^1 \big|h(\nabla_y F_p^{-1}(y)) - h(\nabla_y F_q^{-1}(y))\big|^2\,dy.$$
(ii) Mapping formulation:
$$\mathrm{Dist}_{TH}(p,q)^2 = \int_\Omega \Big|h\Big(\frac{\nabla_x T(x)}{q(x)}\Big) - h\Big(\frac{1}{q(x)}\Big)\Big|^2 q(x)\,dx,$$
where $T$ is the optimal transport mapping function such that $T_\# q = p$ and $T(x) = F_p^{-1}(F_q(x))$.
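The two formulations are related by the substitution $y = F_q(x)$. A short sketch of that step, using only the definitions in the theorem:

```latex
y = F_q(x), \qquad dy = q(x)\,dx, \qquad
\nabla_y F_q^{-1}(y) = \frac{1}{q(x)}, \qquad
\nabla_y F_p^{-1}(y) = \nabla_y\big[T(F_q^{-1}(y))\big] = \frac{\nabla_x T(x)}{q(x)},
% so the integral over y in (0, 1) becomes an integral over x in \Omega:
\int_0^1 \big|h(\nabla_y F_p^{-1}(y)) - h(\nabla_y F_q^{-1}(y))\big|^2\,dy
= \int_\Omega \Big|h\Big(\frac{\nabla_x T(x)}{q(x)}\Big) - h\Big(\frac{1}{q(x)}\Big)\Big|^2 q(x)\,dx .
```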

Transport Hellinger distances
If $f(p) = p\log p$, then $h(z) = \log z$. Hence
$$\mathrm{Dist}_{TH}(p,q)^2 = \int_\Omega |\log \nabla_x T(x)|^2\, q(x)\,dx = \int_0^1 \big|\log \nabla_y F_p^{-1}(y) - \log \nabla_y F_q^{-1}(y)\big|^2\,dy.$$
In short, the transport Hellinger distance is the distance induced by the Hessian metric of the entropy in Wasserstein space.
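A small numerical sketch of the inverse CDF form follows. It assumes two Gaussian densities (an illustrative choice, not from the talk) and uses the identity $\nabla_y F^{-1}(y) = 1 / p(F^{-1}(y))$; the helper names `quantile_derivative` and `transport_hellinger_squared` are hypothetical. For Gaussians the ratio of quantile derivatives is the constant $\sigma_p/\sigma_q$, so the squared distance should equal $(\log(\sigma_p/\sigma_q))^2$.

```python
import math
from statistics import NormalDist

def quantile_derivative(dist, y):
    """d/dy F^{-1}(y) = 1 / p(F^{-1}(y)) for a distribution with density p."""
    return 1.0 / dist.pdf(dist.inv_cdf(y))

def transport_hellinger_squared(dist_p, dist_q, n=2000):
    """Midpoint-rule approximation of the inverse CDF formulation on (0, 1)."""
    total = 0.0
    for i in range(n):
        y = (i + 0.5) / n
        diff = (math.log(quantile_derivative(dist_p, y))
                - math.log(quantile_derivative(dist_q, y)))
        total += diff ** 2
    return total / n

p = NormalDist(mu=1.0, sigma=2.0)
q = NormalDist(mu=0.0, sigma=1.0)

approx = transport_hellinger_squared(p, q)
closed_form = math.log(p.stdev / q.stdev) ** 2  # Gaussian special case only
print(approx, closed_form)
```

In this example the result depends only on the ratio of standard deviations, not on the means.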

One Dimension: TKL vs KL divergence
Similarly, we can extend the study of transport Hessian distances to transport Bregman divergences.
Transport KL divergence:
$$D_{\mathrm{TKL}}(p\|q) := \int_0^1 \Big( \frac{\nabla_y F_p^{-1}(y)}{\nabla_y F_q^{-1}(y)} - \log\frac{\nabla_y F_p^{-1}(y)}{\nabla_y F_q^{-1}(y)} - 1 \Big)\,dy.$$
KL divergence:
$$D_{\mathrm{KL}}(p\|q) = \int_\Omega \nabla_x F_p(x) \log\frac{\nabla_x F_p(x)}{\nabla_x F_q(x)}\,dx.$$
Here $F_p(x) = \int^x p(s)\,ds$ and $F_q(x) = \int^x q(s)\,ds$ are the cumulative distribution functions of the probability densities $p$, $q$, respectively.
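To make the contrast concrete, the sketch below evaluates the transport KL divergence by a midpoint rule and compares it with the standard closed-form KL divergence between Gaussians. The Gaussian test densities and the helper names `transport_kl` and `gaussian_kl` are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def transport_kl(dist_p, dist_q, n=2000):
    """Midpoint-rule approximation of D_TKL(p || q) over y in (0, 1)."""
    total = 0.0
    for i in range(n):
        y = (i + 0.5) / n
        # Ratio of quantile derivatives; each derivative is 1 / density(quantile).
        ratio = dist_q.pdf(dist_q.inv_cdf(y)) / dist_p.pdf(dist_p.inv_cdf(y))
        total += ratio - math.log(ratio) - 1.0
    return total / n

def gaussian_kl(dist_p, dist_q):
    """Closed-form KL divergence between two one-dimensional Gaussians."""
    return (math.log(dist_q.stdev / dist_p.stdev)
            + (dist_p.variance + (dist_p.mean - dist_q.mean) ** 2)
            / (2.0 * dist_q.variance)
            - 0.5)

q = NormalDist(mu=0.0, sigma=1.0)
p = NormalDist(mu=1.0, sigma=2.0)
p_shifted = NormalDist(mu=5.0, sigma=2.0)  # same shape as p, translated

tkl = transport_kl(p, q)
tkl_shifted = transport_kl(p_shifted, q)
kl = gaussian_kl(p, q)
kl_shifted = gaussian_kl(p_shifted, q)
print(tkl, tkl_shifted, kl, kl_shifted)
```

Because the transport KL divergence only involves derivatives of the quantile functions, translating $p$ leaves it unchanged in this example, whereas the KL divergence grows with the shift.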
