positive diagonal matrices $L \in \mathbb{R}^m_{++}$, $R \in \mathbb{R}^n_{++}$ s.t. $(LAR)\mathbf{1}_n = \mathbf{1}_m/m$ and $(LAR)^\top \mathbf{1}_m = \mathbf{1}_n/n$

Applications
• Markov chain estimation [Sinkhorn, 1964]
• Contingency table analysis [Morioka and Tsuda]
• Optimal transport [Peyré and Cuturi]
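$= \mathbf{1}_n/n$.

$A^{(0)} = A$
$A^{(2k+1)} = \frac{1}{m}\operatorname{Diag}(A^{(2k)}\mathbf{1}_n)^{-1}A^{(2k)}$,
$A^{(2k+2)} = \frac{1}{n}A^{(2k+1)}\operatorname{Diag}((A^{(2k+1)})^\top\mathbf{1}_m)^{-1}$.

Theorem (Sinkhorn (1964))
If $A$ is a positive matrix, then there exists a solution and $A^{(k)}$ converges to a solution.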
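The alternating normalization above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `sinkhorn`, the fixed iteration count, and the random test matrix are assumptions made for the example.

```python
import numpy as np

def sinkhorn(A, num_iters=500):
    """Alternately rescale rows and columns of a positive matrix A.

    Row step:    every row sum becomes 1/m.
    Column step: every column sum becomes 1/n.
    """
    A = np.asarray(A, dtype=float).copy()
    m, n = A.shape
    for _ in range(num_iters):
        A = A / (m * A.sum(axis=1, keepdims=True))   # row normalization
        A = A / (n * A.sum(axis=0, keepdims=True))   # column normalization
    return A

# Hypothetical test input: a random entrywise positive 3 x 4 matrix.
rng = np.random.default_rng(0)
A0 = rng.uniform(0.5, 2.0, size=(3, 4))
B = sinkhorn(A0)
print(B.sum(axis=1))  # row sums, close to 1/3
print(B.sum(axis=0))  # column sums, equal to 1/4
```

A fixed iteration count keeps the sketch short; in practice one would stop once the row sums are within a tolerance of $1/m$.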
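$\sum_{i,j} B_{ij} \log \frac{B_{ij}}{A_{ij}}$

Theorem (Csiszár)
Sinkhorn's iterate $A^{(k)}$ satisfies
$A^{(2k+1)} = \operatorname{argmin}\{D_{\mathrm{KL}}(B \,\|\, A^{(2k)}) : B\mathbf{1}_n = \mathbf{1}_m/m\}$,
$A^{(2k+2)} = \operatorname{argmin}\{D_{\mathrm{KL}}(B \,\|\, A^{(2k+1)}) : B^\top\mathbf{1}_m = \mathbf{1}_n/n\}$.

In information geometry, this is an alternating e-projection w.r.t. the Fisher metric.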
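The row-step half of this statement can be checked numerically: among matrices with row sums $1/m$, the row-normalized matrix should have the smallest KL divergence to $A$. The sketch below compares it against randomly sampled feasible matrices; the helper names `kl` and `project_rows` and the sampling scheme are assumptions of the example, and a random search only illustrates the claim, it does not prove it.

```python
import numpy as np

def kl(B, A):
    """D_KL(B || A) = sum_ij B_ij log(B_ij / A_ij) for positive matrices."""
    return float(np.sum(B * np.log(B / A)))

def project_rows(A):
    """Rescale each row of A so that it sums to 1/m (one Sinkhorn row step)."""
    m = A.shape[0]
    return A / (m * A.sum(axis=1, keepdims=True))

rng = np.random.default_rng(1)
A = rng.uniform(0.5, 2.0, size=(3, 4))
B_star = project_rows(A)

# Compare against random positive matrices with the same row sums.
gaps = []
for _ in range(1000):
    B = project_rows(rng.uniform(0.1, 1.0, size=A.shape))
    gaps.append(kl(B, A) - kl(B_star, A))
min_gap = min(gaps)
print(min_gap)  # nonnegative up to rounding
```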
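$\mathbb{C}^{m\times m}$ is completely positive (CP) if $\Phi(X) = \sum_{i=1}^{k} A_i X A_i^\dagger$ for some $A_1, \ldots, A_k \in \mathbb{C}^{m\times n}$.
• The dual map of the above CP map is $\Phi^*(X) = \sum_{i=1}^{k} A_i^\dagger X A_i$
• For nonsingular Hermitian matrices $L$, $R$, the scaled map $\Phi_{L,R}$ is $\Phi_{L,R}(X) = L\,\Phi(R^\dagger X R)\,L^\dagger$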
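As a small illustration of these definitions, the sketch below stores a CP map as a list of Kraus operators $A_i$. It checks that $\Phi^*$ is the adjoint of $\Phi$ under the trace inner product, and that scaling by $(L, R)$ amounts to replacing each $A_i$ by $L A_i R^\dagger$. The function names and the random inputs are assumptions of the example.

```python
import numpy as np

def cp_map(kraus, X):
    """Phi(X) = sum_i A_i X A_i^dagger."""
    return sum(A @ X @ A.conj().T for A in kraus)

def cp_dual(kraus, Y):
    """Phi^*(Y) = sum_i A_i^dagger Y A_i."""
    return sum(A.conj().T @ Y @ A for A in kraus)

def scale_kraus(kraus, L, R):
    """Kraus operators of Phi_{L,R}(X) = L Phi(R^dagger X R) L^dagger."""
    return [L @ A @ R.conj().T for A in kraus]

rng = np.random.default_rng(0)
m, n, k = 2, 3, 4

def rand_c(r, c):
    return rng.normal(size=(r, c)) + 1j * rng.normal(size=(r, c))

def rand_hermitian(d):
    H = rand_c(d, d)
    return H + H.conj().T + 3 * np.eye(d)

kraus = [rand_c(m, n) for _ in range(k)]
X, Y = rand_c(n, n), rand_c(m, m)
L, R = rand_hermitian(m), rand_hermitian(n)

# Adjoint relation: tr(Y^dagger Phi(X)) = tr(Phi^*(Y)^dagger X).
lhs = np.trace(Y.conj().T @ cp_map(kraus, X))
rhs = np.trace(cp_dual(kraus, Y).conj().T @ X)

# Two ways of evaluating the scaled map.
scaled_direct = L @ cp_map(kraus, R.conj().T @ X @ R) @ L.conj().T
scaled_kraus = cp_map(scale_kraus(kraus, L, R), X)
```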
Output: nonsingular Hermitian matrices $L$, $R$ s.t. $\Phi_{L,R}(I_n) = I_m/m$ and $\Phi^*_{L,R}(I_m) = I_n/n$.

Note: Constants changed from Gurvits' original "doubly stochastic" formulation.
$I_n/n$.

$\Phi^{(0)} = \Phi$
$\Phi^{(2k+1)} = (\Phi^{(2k)})_{L, I_n}$ where $L = \frac{1}{\sqrt{m}}\,\Phi^{(2k)}(I_n)^{-1/2}$,
$\Phi^{(2k+2)} = (\Phi^{(2k+1)})_{I_m, R}$ where $R = \frac{1}{\sqrt{n}}\,(\Phi^{(2k+1)})^*(I_m)^{-1/2}$.

Under reasonable conditions, $\Phi^{(k)}$ converges to a solution [Gurvits].

Can we view this as "alternating e-projection"?
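The operator Sinkhorn iteration can be sketched on Kraus operators directly: a left scaling by $L$ replaces each $A_i$ by $L A_i$, and a right scaling by a Hermitian $R$ replaces each $A_i$ by $A_i R$. The code below is a minimal sketch under assumptions that are not from the slides: a fixed iteration count, random Kraus operators, and enough of them that $\Phi(I_n)$ and $\Phi^*(I_m)$ stay positive definite.

```python
import numpy as np

def inv_sqrt(P):
    """P^{-1/2} for a Hermitian positive definite matrix P."""
    w, V = np.linalg.eigh(P)
    return (V / np.sqrt(w)) @ V.conj().T

def operator_sinkhorn(kraus, num_iters=500):
    kraus = [np.asarray(A, dtype=complex) for A in kraus]
    m, n = kraus[0].shape
    for _ in range(num_iters):
        # Left step: L = Phi(I_n)^{-1/2} / sqrt(m) gives Phi_{L,I}(I_n) = I_m / m.
        phi_I = sum(A @ A.conj().T for A in kraus)
        L = inv_sqrt(phi_I) / np.sqrt(m)
        kraus = [L @ A for A in kraus]
        # Right step: R = Phi^*(I_m)^{-1/2} / sqrt(n) gives Phi^*_{I,R}(I_m) = I_n / n.
        phi_star_I = sum(A.conj().T @ A for A in kraus)
        R = inv_sqrt(phi_star_I) / np.sqrt(n)
        kraus = [A @ R for A in kraus]  # R is Hermitian, so R^dagger = R
    return kraus

# Hypothetical input: 8 random complex 2 x 3 Kraus operators.
rng = np.random.default_rng(0)
m, n, k = 2, 3, 8
kraus0 = [rng.normal(size=(m, n)) + 1j * rng.normal(size=(m, n)) for _ in range(k)]
scaled = operator_sinkhorn(kraus0)

phi_I = sum(A @ A.conj().T for A in scaled)        # should be close to I_m / m
phi_star_I = sum(A.conj().T @ A for A in scaled)   # should be close to I_n / n
```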
] Nonmetrical dual connections play a central role (i.e., connections different from the Levi-Civita connection)

Key components
• metric tensor $g$
• two affine connections $\nabla^{(m)}$, $\nabla^{(e)}$ → induce two kinds of geodesics (m/e-geodesics)
$\ldots, p_n) : p_k > 0,\ \sum_{k=1}^{n} p_k = 1\} \subset \mathbb{R}^n_{++}$
Probability simplex of dimension $n-1$.

We will introduce a Riemannian structure on $S_{n-1}$ with
• metric tensor $g$
• affine connections $\nabla^{(m)}$, $\nabla^{(e)}$
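$= (p_1, \ldots, p_{n-1})$
e-coordinate (exponential): $\theta = (\log(p_1/p_n), \ldots, \log(p_{n-1}/p_n))$

Fisher metric
$g_p(X, Y) = \sum_i X(\log p_i)\,Y(p_i)$ ($X, Y \in T_p(S_{n-1})$: tangent vectors)
$= \sum_i X^{(e)}_i Y^{(m)}_i$, where $X = \sum_i X^{(e)}_i \left(\frac{\partial}{\partial\theta_i}\right)_p$, $Y = \sum_i Y^{(m)}_i \left(\frac{\partial}{\partial\eta_i}\right)_p$, with $\theta$ the e-coordinate and $\eta$ the m-coordinate

Note: The Fisher metric is the unique metric satisfying natural statistical invariance (Cencov's theorem).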
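The two expressions for the Fisher metric can be compared numerically. In the sketch below a tangent vector is stored as a length-$n$ array with zero sum, so that $X(p_i) = X_i$ and $X(\log p_i) = X_i/p_i$. The e-coordinate components are obtained by differentiating $\log(p_i/p_n)$ along $X$. The names `theta`/`eta` for the e/m-coordinates and the random test point are assumptions of the example.

```python
import numpy as np

def m_coords(p):
    """m-coordinate: the first n-1 probabilities."""
    return p[:-1]

def e_coords(p):
    """e-coordinate: log-ratios against the last probability."""
    return np.log(p[:-1] / p[-1])

def fisher(p, X, Y):
    """g_p(X, Y) = sum_i X(log p_i) Y(p_i) = sum_i X_i Y_i / p_i."""
    return float(np.sum(X * Y / p))

rng = np.random.default_rng(0)
n = 5
p = rng.dirichlet(np.ones(n))
X = rng.normal(size=n); X -= X.mean()   # tangent vectors sum to zero
Y = rng.normal(size=n); Y -= Y.mean()

# Components of X in the e-coordinate basis: derivative of log(p_i / p_n) along X.
X_e = X[:-1] / p[:-1] - X[-1] / p[-1]
# Components of Y in the m-coordinate basis: derivative of p_i along Y.
Y_m = Y[:-1]

g_direct = fisher(p, X, Y)
g_dual = float(X_e @ Y_m)
print(g_direct, g_dual)  # the two values agree
```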
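and submanifolds
$\Pi_1 = \{A \in \mathbb{R}^{m\times n}_{++} \mid A\mathbf{1}_n = m^{-1}\mathbf{1}_m\}$,
$\Pi_2 = \{A \in \mathbb{R}^{m\times n}_{++} \mid A^\top\mathbf{1}_m = n^{-1}\mathbf{1}_n\}$.
(These submanifolds are m-autoparallel.)

[Figure: iterates $A^{(2k)}$ and $A^{(2k+1)}$ relative to $\Pi_1$ and $\Pi_2$]

Theorem
If $A \in \mathbb{R}^{m\times n}_{++}$, then each iterate of the Sinkhorn algorithm is an e-projection: the e-geodesic from $A^{(2k)}$ to $A^{(2k+1)}$ (resp. from $A^{(2k+1)}$ to $A^{(2k+2)}$) is orthogonal to $\Pi_1$ (resp. $\Pi_2$) w.r.t. the Fisher metric.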
$\sum_i p_i \log p_i - p_i$ be negative entropy. Then,
$\theta = \nabla\Psi(\eta)$, $\eta = \nabla\Psi^*(\theta)$ (Legendre transform)
$g = \operatorname{Hess}(\Psi)$ (Hessian)

One can define the canonical divergence as
$D(p \,\|\, q) = \Psi(p) - \Psi(q) - \langle \operatorname{grad}\Psi(q),\, p - q \rangle$
(in our case, it is KL)

Fact
E-projection onto m-autoparallel submanifolds can be done via canonical divergence minimization.
→ information-geometric proof of Csiszár's theorem
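The claim that the canonical divergence of negative entropy is KL can be verified directly. The sketch below evaluates the Bregman form $\Psi(p) - \Psi(q) - \langle \operatorname{grad}\Psi(q), p - q\rangle$ and compares it with the generalized KL divergence $\sum_i p_i \log(p_i/q_i) - p_i + q_i$, which reduces to the usual KL when both vectors sum to one. Function names and test vectors are assumptions of the example.

```python
import numpy as np

def neg_entropy(p):
    """Psi(p) = sum_i p_i log p_i - p_i."""
    return float(np.sum(p * np.log(p) - p))

def grad_neg_entropy(p):
    """Gradient of Psi: d/dp_i (p_i log p_i - p_i) = log p_i."""
    return np.log(p)

def bregman(p, q):
    """Canonical (Bregman) divergence generated by Psi."""
    return neg_entropy(p) - neg_entropy(q) - float(grad_neg_entropy(q) @ (p - q))

def generalized_kl(p, q):
    return float(np.sum(p * np.log(p / q) - p + q))

rng = np.random.default_rng(0)
p = rng.uniform(0.1, 2.0, size=6)
q = rng.uniform(0.1, 2.0, size=6)
d_bregman = bregman(p, q)
d_gkl = generalized_kl(p, q)

# On the probability simplex the extra terms -p + q cancel.
pn, qn = p / p.sum(), q / q.sum()
d_simplex = bregman(pn, qn)
d_kl = float(np.sum(pn * np.log(pn / qn)))
```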
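[Gurvits and Samorodnitsky; Idel, 2016]
$\operatorname{cap}(A) = \inf_{x > 0} \left( \frac{\prod_{i=1}^{n} (Ax)_i}{\prod_{i=1}^{n} x_i} \right)^{1/n}$

Capacity can be used as a "potential" for Sinkhorn:
$-\log\operatorname{cap}(A) - \log n = \min_{B \in \Pi_1 \cap \Pi_2} D_{\mathrm{KL}}(B \,\|\, A)$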
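The link between capacity and KL divergence can be checked numerically for a square positive matrix. The sketch assumes the normalization $\operatorname{cap}(A) = \inf_{x>0} \big(\prod_i (Ax)_i / \prod_i x_i\big)^{1/n}$ and marginals $\mathbf{1}_n/n$; under that assumption the constant works out to $-\log\operatorname{cap}(A) - \log n = D_{\mathrm{KL}}(A^* \,\|\, A)$, and a different normalization of the capacity would shift the constant. The infimum is evaluated at the right scaling vector found by Sinkhorn, which is where the stationarity condition of the capacity objective holds.

```python
import numpy as np

def sinkhorn_scalings(A, num_iters=2000):
    """Return vectors l, r such that diag(l) A diag(r) has marginals 1/n."""
    n = A.shape[0]
    l, r = np.ones(n), np.ones(n)
    for _ in range(num_iters):
        l = 1.0 / (n * (A @ r))     # row sums of diag(l) A diag(r) become 1/n
        r = 1.0 / (n * (A.T @ l))   # column sums become 1/n
    return l, r

def log_cap_objective(A, x):
    """log of (prod_i (Ax)_i / prod_i x_i)^(1/n)."""
    n = A.shape[0]
    return float(np.sum(np.log(A @ x)) - np.sum(np.log(x))) / n

rng = np.random.default_rng(0)
n = 4
A = rng.uniform(0.5, 2.0, size=(n, n))
l, r = sinkhorn_scalings(A)
A_star = l[:, None] * A * r[None, :]

kl_min = float(np.sum(A_star * np.log(A_star / A)))
log_cap = log_cap_objective(A, r)

# Random positive x should never beat the value attained at r.
random_values = [log_cap_objective(A, rng.uniform(0.1, 5.0, size=n)) for _ in range(500)]
print(-log_cap - np.log(n), kl_min)  # the two numbers agree
```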
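from $A_1$ to $A_2$ and the e-geodesic from $A_2$ to $A_3$ are orthogonal at $A_2$ w.r.t. the Fisher metric, then
$D_{\mathrm{KL}}(A_1 \,\|\, A_3) = D_{\mathrm{KL}}(A_1 \,\|\, A_2) + D_{\mathrm{KL}}(A_2 \,\|\, A_3)$

[Figure: right-angled triangle with vertices $A_1$, $A_2$, $A_3$]

Theorem (Csiszár)
The Sinkhorn algorithm converges to the e-projection $A^*$ of $A$ onto $\Pi_1 \cap \Pi_2$:
$D_{\mathrm{KL}}(A^* \,\|\, A) = \min_{B \in \Pi_1 \cap \Pi_2} D_{\mathrm{KL}}(B \,\|\, A)$
$D_{\mathrm{KL}}(A^* \,\|\, A) = D_{\mathrm{KL}}(A^* \,\|\, A^{(K)}) + \sum_{k=1}^{K} D_{\mathrm{KL}}(A^{(k)} \,\|\, A^{(k-1)})$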
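The telescoping identity in the theorem follows from applying the Pythagorean relation once per Sinkhorn step, and it can be observed numerically. The sketch below stores the iterates $A^{(0)}, A^{(1)}, \ldots$, uses a late iterate as a stand-in for the limit $A^*$, and compares both sides for a small $K$. The helper names, the choice $K = 6$, and the random input are assumptions of the example.

```python
import numpy as np

def kl(B, A):
    return float(np.sum(B * np.log(B / A)))

def sinkhorn_iterates(A, num_steps):
    """Return [A^(0), ..., A^(num_steps)]; odd steps fix rows, even steps fix columns."""
    m, n = A.shape
    iterates = [A]
    for k in range(num_steps):
        cur = iterates[-1]
        if k % 2 == 0:
            cur = cur / (m * cur.sum(axis=1, keepdims=True))  # produces A^(2j+1)
        else:
            cur = cur / (n * cur.sum(axis=0, keepdims=True))  # produces A^(2j+2)
        iterates.append(cur)
    return iterates

rng = np.random.default_rng(0)
A = rng.uniform(0.5, 2.0, size=(3, 4))
iterates = sinkhorn_iterates(A, 2000)
A_star = iterates[-1]   # numerical stand-in for the limit

K = 6
lhs = kl(A_star, A)
rhs = kl(A_star, iterates[K]) + sum(kl(iterates[k], iterates[k - 1]) for k in range(1, K + 1))
print(lhs, rhs)  # the two values agree
```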
we move to the manifold of PD matrices. Then apply quantum information geometry on the PD manifold.

               matrix scaling                                              operator scaling
manifold       $\{p \in \mathbb{R}^N_{++} : p_i > 0,\ \sum_i p_i = 1\}$    $\{\rho \in \mathbb{C}^{N\times N} : \rho \succ O,\ \operatorname{tr}\rho = 1\}$
metric         Fisher                                                      SLD
divergence     KL                                                          ???
dually flat?   YES                                                         NO
the Fisher metric is the only monotone metric (Cencov's theorem).
• However, in quantum information geometry, monotone metrics are not unique.
• Monotone metrics are characterized by operator monotone functions [Petz, 1996]
• Each monotone metric induces its own e-connection.

Symmetric Logarithmic Derivative (SLD) metric
$g^S_\rho(X, Y) = \operatorname{tr}(L^S_X Y)$, where $X = \frac{1}{2}(\rho L^S_X + L^S_X \rho)$ (Lyapunov equation)
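For a positive definite $\rho$ the Lyapunov equation has a closed-form solution in the eigenbasis of $\rho$: if $\rho = V \operatorname{diag}(w) V^\dagger$, then the entries of $V^\dagger L^S_X V$ are $2 (V^\dagger X V)_{ij} / (w_i + w_j)$. The sketch below uses this to evaluate the SLD metric and checks that it reduces to the Fisher metric when everything is diagonal. Function names and test matrices are assumptions of the example.

```python
import numpy as np

def sld(rho, X):
    """Solve X = (rho L + L rho) / 2 for L, using the eigenbasis of rho."""
    w, V = np.linalg.eigh(rho)
    X_tilde = V.conj().T @ X @ V
    L_tilde = 2.0 * X_tilde / (w[:, None] + w[None, :])
    return V @ L_tilde @ V.conj().T

def sld_metric(rho, X, Y):
    """g^S_rho(X, Y) = tr(L_X Y); real for Hermitian X and Y."""
    return float(np.real(np.trace(sld(rho, X) @ Y)))

rng = np.random.default_rng(0)
d = 3

def rand_hermitian():
    H = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    return H + H.conj().T

G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
rho = G @ G.conj().T + np.eye(d)
rho = rho / np.trace(rho).real          # a random positive definite density matrix

X, Y = rand_hermitian(), rand_hermitian()
L_X = sld(rho, X)
residual = (rho @ L_X + L_X @ rho) / 2 - X

# Commuting (diagonal) case: the SLD metric equals sum_i x_i y_i / p_i.
p = np.array([0.2, 0.3, 0.5])
x = np.array([0.1, -0.3, 0.2])
y = np.array([-0.2, 0.1, 0.1])
g_diag = sld_metric(np.diag(p), np.diag(x), np.diag(y))
g_fisher = float(np.sum(x * y / p))
```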
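m-geodesic: $\rho(t) = (1 - t)\rho_0 + t\rho_1$
e-geodesic: $\rho(t) \propto K^t \rho_0 K^t$, where $K = \rho_0^{-1} \# \rho_1$ is the matrix geometric mean

[Figure: iterates $\rho_{2k}$ and $\rho_{2k+1}$ relative to $\Pi_1$ and $\Pi_2$]

Theorem
If $\rho_0 \succ O$, then each iterate of the operator Sinkhorn algorithm is the unique e-projection w.r.t. the SLD metric: the e-geodesic from $\rho_{2k}$ to $\rho_{2k+1}$ (resp. from $\rho_{2k+1}$ to $\rho_{2k+2}$) is orthogonal to $\Pi_1$ (resp. $\Pi_2$) w.r.t. the SLD metric.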
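The two geodesics can be sketched for density matrices as follows. The e-geodesic uses $K = \rho_0^{-1} \# \rho_1$, which satisfies $K \rho_0 K = \rho_1$, so the curve starts at $\rho_0$ and ends at $\rho_1$. Matrix powers are computed by eigendecomposition; the helper names, the trace normalization of the curve, and the random endpoints are assumptions of the example.

```python
import numpy as np

def mpow(P, t):
    """P^t for a Hermitian positive definite matrix P."""
    P = (P + P.conj().T) / 2            # remove rounding asymmetry
    w, V = np.linalg.eigh(P)
    return (V * w**t) @ V.conj().T

def geometric_mean(A, B):
    """A # B = A^{1/2} (A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}."""
    A_half, A_inv_half = mpow(A, 0.5), mpow(A, -0.5)
    return A_half @ mpow(A_inv_half @ B @ A_inv_half, 0.5) @ A_half

def m_geodesic(rho0, rho1, t):
    return (1 - t) * rho0 + t * rho1

def e_geodesic(rho0, rho1, t):
    K = geometric_mean(np.linalg.inv(rho0), rho1)
    rho = mpow(K, t) @ rho0 @ mpow(K, t)
    return rho / np.trace(rho).real     # normalize to unit trace

def random_density(rng, d):
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    P = G @ G.conj().T + np.eye(d)
    return P / np.trace(P).real

rng = np.random.default_rng(0)
rho0, rho1 = random_density(rng, 3), random_density(rng, 3)
start, mid, end = (e_geodesic(rho0, rho1, t) for t in (0.0, 0.5, 1.0))
```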
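$\rho(t) = K^t \rho_{2k} K^t$, $K = \rho_{2k}^{-1} \# \rho_{2k+1} = I \otimes \Phi_{2k}(I)^{-1/2}$
• The e-representation $L^S$ of $\dot\rho(1)$ satisfies the Lyapunov equation:
$\frac{1}{2}(L^S \rho_{2k+1} + \rho_{2k+1} L^S) = \dot\rho(1) = (\log K)\rho_{2k+1} + \rho_{2k+1}(\log K)$
• Since the solution of the Lyapunov equation is unique, $L^S = 2\log K = -I \otimes \log \Phi_{2k}(I)$
• Therefore, $\dot\rho(1)$ is orthogonal to $\Pi_1$ w.r.t. the SLD metric.
• Uniqueness is shown similarly (not from the generalized Pythagorean theorem, but from the uniqueness of solutions of matrix equations)

[Figure: $\Pi_1$, $\rho_{2k}$, $\rho_{2k+1}$, and the tangent vector $\dot\rho(1)$]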
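The key step of this argument is generic: for any positive definite $\rho_0$, $\rho_1$ and $K = \rho_0^{-1} \# \rho_1$, the curve $\rho(t) = K^t \rho_0 K^t$ has $\dot\rho(1) = (\log K)\rho_1 + \rho_1(\log K)$, so the solution of $\frac{1}{2}(L\rho_1 + \rho_1 L) = \dot\rho(1)$ is $L = 2\log K$. The sketch below checks both facts numerically on random matrices. It does not use the Choi structure $I \otimes \Phi_{2k}(I)^{-1/2}$, and all helper names are assumptions of the example.

```python
import numpy as np

def eig_fun(P, f):
    """Apply a scalar function f to a Hermitian positive definite matrix P."""
    P = (P + P.conj().T) / 2
    w, V = np.linalg.eigh(P)
    return (V * f(w)) @ V.conj().T

def geometric_mean(A, B):
    A_half = eig_fun(A, np.sqrt)
    A_inv_half = eig_fun(A, lambda w: 1.0 / np.sqrt(w))
    return A_half @ eig_fun(A_inv_half @ B @ A_inv_half, np.sqrt) @ A_half

def sld(rho, X):
    """Solve X = (rho L + L rho) / 2 for L."""
    w, V = np.linalg.eigh(rho)
    X_tilde = V.conj().T @ X @ V
    return V @ (2.0 * X_tilde / (w[:, None] + w[None, :])) @ V.conj().T

def random_pd(rng, d):
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    P = G @ G.conj().T + np.eye(d)
    return P / np.trace(P).real

rng = np.random.default_rng(0)
rho0, rho1 = random_pd(rng, 3), random_pd(rng, 3)
K = geometric_mean(np.linalg.inv(rho0), rho1)
log_K = eig_fun(K, np.log)

def curve(t):
    Kt = eig_fun(K, lambda w: w**t)
    return Kt @ rho0 @ Kt

rho_dot = log_K @ rho1 + rho1 @ log_K                   # claimed derivative at t = 1
h = 1e-5
rho_dot_fd = (curve(1 + h) - curve(1 - h)) / (2 * h)    # finite-difference check
L_S = sld(rho1, rho_dot)
```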
$n$.

Capacity [Gurvits]
$\operatorname{cap}(\Phi) = \inf_{X \succ O} \left( \frac{\det \Phi(X)}{\det X} \right)^{1/n}$
Key tool for studying operator scaling [Gurvits; Garg et al.; Allen-Zhu et al., 2018]

Q. Is there a "divergence" $D$ s.t.
$-\log\operatorname{cap}(\Phi) - \log n = \min_{\rho^* \in \Pi_1 \cap \Pi_2} D(\rho^* \,\|\, \mathrm{CH}(\Phi))$
as in matrix scaling?
$D(\rho \,\|\, \sigma) = \operatorname{tr}[\rho(\log\rho - \log\sigma)]$
• arises from the dually flat structure with the Bogoliubov–Kubo–Mori metric, which corresponds to $\Psi(\rho) = \operatorname{tr}(\rho\log\rho - \rho)$.

However, numerical experiments show that this does not coincide with the operator Sinkhorn iteration...
Actually, the SLD metric is NOT dually flat!

Still open!
PD matrices.

Future work
• Divergence characterization?
• Convergence analysis based on the SLD metric? Optimization algorithm?
• Operator scaling applications for quantum information theory?
geometric mean is defined as
$A \# B = A^{1/2}(A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}$.

Properties
• $A \# B \succ O$
• $A \# B = B \# A$
• $A \# B$ is the unique PD solution of the algebraic Riccati equation $X A^{-1} X = B$.
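The definition and the three listed properties can be checked numerically. This is a minimal sketch using eigendecompositions for the matrix square roots; the helper names and the random positive definite test matrices are assumptions of the example.

```python
import numpy as np

def mpow(P, t):
    """P^t for a Hermitian positive definite matrix P."""
    P = (P + P.conj().T) / 2            # remove rounding asymmetry
    w, V = np.linalg.eigh(P)
    return (V * w**t) @ V.conj().T

def geometric_mean(A, B):
    """A # B = A^{1/2} (A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}."""
    A_half, A_inv_half = mpow(A, 0.5), mpow(A, -0.5)
    return A_half @ mpow(A_inv_half @ B @ A_inv_half, 0.5) @ A_half

def random_pd(rng, d):
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    return G @ G.conj().T + np.eye(d)

rng = np.random.default_rng(0)
A, B = random_pd(rng, 4), random_pd(rng, 4)
X = geometric_mean(A, B)

min_eig = np.linalg.eigvalsh((X + X.conj().T) / 2).min()   # positivity
sym_err = np.abs(X - geometric_mean(B, A)).max()            # A # B = B # A
riccati_err = np.abs(X @ np.linalg.inv(A) @ X - B).max()    # X A^{-1} X = B
print(min_eig, sym_err, riccati_err)
```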