Slide 1

Information Geometry of Operator Scaling
Part II: Information Geometry and Scaling

Takeru Matsuda (UTokyo, RIKEN)
Tasuku Soma (UTokyo)

July 8

Slide 2

Overview

Slide 3

Matrix scaling

Input: nonnegative matrix A ∈ R^{m×n}_+
Output: positive diagonal matrices L ∈ R^m_{++}, R ∈ R^n_{++} s.t.
  (LAR)𝟙_n = 𝟙_m/m and (LAR)^⊤𝟙_m = 𝟙_n/n

Applications
• Markov chain estimation [Sinkhorn, 1964]
• Contingency table analysis [Morioka and Tsuda]
• Optimal transport [Peyré and Cuturi, 2019]

Slide 4

Sinkhorn algorithm [Sinkhorn, 1964]

W.l.o.g. assume that A^⊤𝟙_m = 𝟙_n/n.

A^(0) = A,
A^(2k+1) = (1/m) Diag(A^(2k)𝟙_n)^{−1} A^(2k),
A^(2k+2) = (1/n) A^(2k+1) Diag((A^(2k+1))^⊤𝟙_m)^{−1}.

Theorem (Sinkhorn (1964))
If A is a positive matrix, then a solution exists and A^(k) converges to a solution.
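For reference, a minimal NumPy sketch of this iteration (the function name, iteration count, and random test input are illustrative choices, not from the slides):

```python
import numpy as np

def sinkhorn(A, iters=200):
    """Alternate row and column normalization so that row sums
    approach 1/m and column sums approach 1/n."""
    m, n = A.shape
    B = A.astype(float)
    for _ in range(iters):
        B = B / (m * B.sum(axis=1, keepdims=True))   # A^(2k+1): rows -> 1/m
        B = B / (n * B.sum(axis=0, keepdims=True))   # A^(2k+2): cols -> 1/n
    return B

A = np.random.rand(3, 4) + 0.1        # positive matrix
B = sinkhorn(A)
print(B.sum(axis=1), B.sum(axis=0))   # ~1/3 each, ~1/4 each
```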

Slide 5

Sinkhorn = alternating e-projection

Kullback–Leibler divergence: D_KL(B ‖ A) = Σ_{i,j} B_ij log(B_ij / A_ij)

Theorem (Csiszár (1975))
Sinkhorn's iterates A^(k) satisfy
A^(2k+1) = argmin{ D_KL(B ‖ A^(2k)) : B𝟙_n = 𝟙_m/m },
A^(2k+2) = argmin{ D_KL(B ‖ A^(2k+1)) : B^⊤𝟙_m = 𝟙_n/n }.

In information geometry, this is alternating e-projection w.r.t. the Fisher metric.
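The row step of this KL characterization can be sanity-checked numerically; a small sketch with helper names of my choosing:

```python
import numpy as np

def kl(B, A):
    return np.sum(B * np.log(B / A))

rng = np.random.default_rng(0)
m, n = 3, 4
A = rng.random((m, n)) + 0.1
B_star = A / (m * A.sum(axis=1, keepdims=True))  # Sinkhorn row step

# Any other feasible B (row sums 1/m) has a larger KL divergence from A.
for _ in range(5):
    C = rng.random((m, n)) + 0.1
    C = C / (m * C.sum(axis=1, keepdims=True))
    assert kl(B_star, A) <= kl(C, A) + 1e-12
print("row step attains the minimum among sampled feasible points")
```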

Slide 6

Operator scaling

• A linear map Φ : C^{n×n} → C^{m×m} is completely positive (CP) if
  Φ(X) = Σ_{i=1}^k A_i X A_i†
  for some A_1, ..., A_k ∈ C^{m×n}.
• The dual map of the above CP map is Φ*(X) = Σ_{i=1}^k A_i† X A_i.
• For nonsingular Hermitian matrices L, R, the scaled map Φ_{L,R} is
  Φ_{L,R}(X) = L Φ(R† X R) L†.

Slide 7

Operator scaling

Input: CP map Φ : C^{n×n} → C^{m×m}
Output: nonsingular Hermitian matrices L, R s.t.
  Φ_{L,R}(I_n) = I_m/m and Φ*_{L,R}(I_m) = I_n/n.

Note: the constants are changed from Gurvits' original "doubly stochastic" formulation.

Slide 8

Operator Sinkhorn algorithm [Gurvits, 2004]

W.l.o.g. assume that Φ*(I_m) = I_n/n.

Φ^(0) = Φ,
Φ^(2k+1) = (Φ^(2k))_{L,I_n} where L = (1/√m) Φ^(2k)(I_n)^{−1/2},
Φ^(2k+2) = (Φ^(2k+1))_{I_m,R} where R = (1/√n) (Φ^(2k+1))*(I_m)^{−1/2}.

Under reasonable conditions, Φ^(k) converges to a solution [Gurvits, 2004].

Can we view this as "alternating e-projection"?
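A minimal sketch of the iteration on Kraus operators, assuming a generic CP map for which Gurvits' convergence conditions hold (function names and the fixed iteration count are illustrative):

```python
import numpy as np

def apply_cp(As, X):
    """Phi(X) = sum_i A_i X A_i^dagger for Kraus operators As."""
    return sum(A @ X @ A.conj().T for A in As)

def inv_sqrt(M):
    """Inverse square root of a Hermitian PD matrix."""
    w, U = np.linalg.eigh(M)
    return (U / np.sqrt(w)) @ U.conj().T

def operator_sinkhorn(As, iters=100):
    m, n = As[0].shape
    for _ in range(iters):
        L = inv_sqrt(apply_cp(As, np.eye(n))) / np.sqrt(m)
        As = [L @ A for A in As]                  # left scaling Phi_{L,I}
        R = inv_sqrt(apply_cp([A.conj().T for A in As], np.eye(m))) / np.sqrt(n)
        As = [A @ R for A in As]                  # right scaling (R is Hermitian)
    return As

As = [np.random.randn(3, 3) + 1j * np.random.randn(3, 3) for _ in range(2)]
Bs = operator_sinkhorn(As)
print(np.round(apply_cp(Bs, np.eye(3)), 6))       # should be close to I/3
```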

Slide 9

Our result

Theorem (Matsuda and Soma)
Operator Sinkhorn is alternating e-projection w.r.t. the symmetric logarithmic derivative (SLD) metric on positive definite matrices.

• Quantum generalization of [Csiszár, 1975]

Slide 10

Information geometry of matrix scaling

Slide 11

Information geometry

Statistical theory using differential geometry [Amari and Nagaoka, 2000]
Non-metric dual connections play a central role (i.e., they differ from the Levi-Civita connection).

Key components
• metric tensor g
• two affine connections ∇^(m), ∇^(e) → induce two geodesics (m/e-geodesics)

Slide 12

Information geometry on S_{n−1}

S_{n−1} = { p = (p_1, ..., p_n) : p_k > 0, Σ_{k=1}^n p_k = 1 } ⊂ R^n_{++}

The probability simplex of dimension n − 1. We will introduce a Riemannian structure on S_{n−1} with
• metric tensor g
• affine connections ∇^(m), ∇^(e)

Slide 13

Fisher metric

Two coordinate systems on S_{n−1}:
• m-coordinate (mixture): (p_1, ..., p_{n−1})
• e-coordinate (exponential): (log(p_1/p_n), ..., log(p_{n−1}/p_n))

Fisher metric:
g(X, Y)_p = Σ_i X(log p_i) Y(p_i)   (X, Y ∈ T_p(S_{n−1}): tangent vectors)
          = Σ_i X_i^(e) Y_i^(m),
where X = Σ_i X_i^(e) ∂_i^(e)|_p and Y = Σ_i Y_i^(m) ∂_i^(m)|_p.

Note: the Fisher metric is the unique metric satisfying natural statistical invariance (Cencov's theorem).

Slide 14

Dual connections

Take ∇^(m), ∇^(e) s.t. the Christoffel symbols vanish in the m- and e-coordinates, respectively.

m-geodesic: γ(t) = (1 − t)p + tq
e-geodesic: γ(t) ∝ exp((1 − t) log p + t log q)

They are dual connections:
Zg(X, Y) = g(∇^(m)_Z X, Y) + g(X, ∇^(e)_Z Y)

Cf. the Levi-Civita connection ∇^LC is self-dual:
Zg(X, Y) = g(∇^LC_Z X, Y) + g(X, ∇^LC_Z Y)
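Both geodesics are one-liners on the simplex; a small sketch:

```python
import numpy as np

def m_geodesic(p, q, t):
    """Mixture geodesic: a straight line in the simplex."""
    return (1 - t) * p + t * q

def e_geodesic(p, q, t):
    """Exponential geodesic: a straight line in log-coordinates, renormalized."""
    r = np.exp((1 - t) * np.log(p) + t * np.log(q))
    return r / r.sum()

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])
print(m_geodesic(p, q, 0.5), e_geodesic(p, q, 0.5))
```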

Slide 15

Sinkhorn as alternating e-projection

Now consider S_{N−1} (N = mn) and the submanifolds
Π₁ = { A ∈ R^{m×n}_{++} : A𝟙_n = m^{−1}𝟙_m },
Π₂ = { A ∈ R^{m×n}_{++} : A^⊤𝟙_m = n^{−1}𝟙_n }.
(These submanifolds are m-autoparallel.)

Theorem
If A ∈ R^{m×n}_{++}, then the iterates of the Sinkhorn algorithm are e-projections: the e-geodesic from A^(2k) to A^(2k+1) (from A^(2k+1) to A^(2k+2)) is orthogonal to Π₁ (Π₂) w.r.t. the Fisher metric.

Slide 16

Dually flat structure of S_{n−1} and KL-divergence

Let Ψ(p) = Σ_i (p_i log p_i − p_i) be the negative entropy. Then
• η = ∇Ψ(θ), θ = ∇Ψ*(η)   (Legendre transform between the m/e-coordinates)
• g = Hess(Ψ)   (Hessian)

One can define the canonical divergence as
D(p ‖ q) = Ψ(p) − Ψ(q) − ⟨∇Ψ(q), p − q⟩
(in our case, it is KL).

Fact: e-projection onto an m-autoparallel submanifold can be done via canonical divergence minimization.
→ information-geometric proof of Csiszár (1975)
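A quick numerical check that the Bregman divergence of this potential is exactly the KL divergence (helper names are mine):

```python
import numpy as np

def psi(p):
    """Negative entropy potential from the slide."""
    return np.sum(p * np.log(p) - p)

def bregman(p, q):
    """D(p||q) = psi(p) - psi(q) - <grad psi(q), p - q>, with grad psi(q) = log q."""
    return psi(p) - psi(q) - np.log(q) @ (p - q)

rng = np.random.default_rng(1)
p = rng.random(5); p /= p.sum()
q = rng.random(5); q /= q.sum()
print(np.isclose(bregman(p, q), np.sum(p * np.log(p / q))))   # True
```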

Slide 17

KL-divergence and capacity

Consider the case m = n.

Capacity [Gurvits and Samorodnitsky; Idel, 2016]
cap(A) = inf_{x > 0} ( ∏_{i=1}^n (Ax)_i / ∏_{i=1}^n x_i )^{1/n}

Capacity can be used as a "potential" for Sinkhorn:
−log cap(A) + log(1/n) = min_{B ∈ Π₁∩Π₂} D_KL(B ‖ A)
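Assuming the capacity as reconstructed above, log cap(A) is a convex minimization in u = log x and can be approximated by plain gradient descent; a rough sketch with ad hoc step size and iteration count:

```python
import numpy as np

def log_cap(A, iters=20000, lr=0.05):
    """Approximate log cap(A) = inf over x > 0 of
    (1/n) * [sum_i log (Ax)_i - sum_i log x_i], via descent in u = log x."""
    n = A.shape[0]
    u = np.zeros(n)
    for _ in range(iters):
        x = np.exp(u)
        grad = (A.T @ (1.0 / (A @ x))) * x - 1.0   # gradient of the bracket in u
        u -= lr * grad / n
    x = np.exp(u)
    return (np.sum(np.log(A @ x)) - np.sum(np.log(x))) / n

A = np.random.rand(3, 3) + 0.1
print(log_cap(A))
```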

Slide 18

Convergence of Sinkhorn algorithm

Generalized Pythagorean theorem
If the m-geodesic from A₁ to A₂ and the e-geodesic from A₂ to A₃ are orthogonal at A₂ w.r.t. the Fisher metric, then
D_KL(A₁ ‖ A₃) = D_KL(A₁ ‖ A₂) + D_KL(A₂ ‖ A₃).

Theorem (Csiszár (1975))
The Sinkhorn algorithm converges to the e-projection A* of A onto Π₁ ∩ Π₂:
D_KL(A* ‖ A) = min_{B ∈ Π₁∩Π₂} D_KL(B ‖ A),
D_KL(A* ‖ A) = D_KL(A* ‖ A^(K)) + Σ_{k=1}^K D_KL(A^(k) ‖ A^(k−1)).
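The Pythagorean relation for a single Sinkhorn row step can be verified numerically (a sketch):

```python
import numpy as np

def kl(B, A):
    return np.sum(B * np.log(B / A))

rng = np.random.default_rng(3)
m, n = 3, 4
A3 = rng.random((m, n)) + 0.1                    # current iterate
A2 = A3 / (m * A3.sum(axis=1, keepdims=True))    # its e-projection onto Pi_1
A1 = rng.random((m, n)) + 0.1
A1 /= m * A1.sum(axis=1, keepdims=True)          # arbitrary point of Pi_1
print(np.isclose(kl(A1, A3), kl(A1, A2) + kl(A2, A3)))   # True
```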

Slide 19

Quantum information geometry of operator scaling

Slide 20

Information geometry of operator scaling

Idea: using the Choi representation, move to the manifold of PD matrices, and then apply quantum information geometry on the PD manifold.

              matrix scaling                       operator scaling
manifold      { p ∈ R^N_{++} : Σ_i p_i = 1 }      { ρ ∈ C^{N×N} : ρ ≻ O, tr ρ = 1 }
metric        Fisher                              SLD
divergence    KL                                  ???
dually flat?  YES                                 NO

Slide 21

Partial trace

For a partitioned matrix A = (A_ij)_{i,j=1}^n with matrix blocks A_ij, the partial traces are defined as
tr₁(A) = Σ_{i=1}^n A_ii   (the sum of the diagonal blocks),
tr₂(A) = (tr A_ij)_{i,j=1}^n   (the matrix of blockwise traces).

Slide 22

Choi representation [Choi, 1975]

CH(Φ) = Σ_{i,j=1}^n E_ij ⊗ Φ(E_ij), i.e., the block matrix whose (i, j) block is Φ(E_ij).

Facts:
• Φ ↦ CH(Φ) gives an isomorphism between (Hermitian-preserving) linear maps and Hermitian matrices
• CH(Φ) ⪰ O ⟺ Φ is CP
• CH(Φ_{L,R}) = (R† ⊗ L) CH(Φ) (R ⊗ L†)
• tr₁ CH(Φ) = Φ(I_n), tr₂ CH(Φ) = Φ*(I_m)
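A direct construction of the Choi matrix from Kraus operators, following the block layout above (a sketch):

```python
import numpy as np

def choi(As, n):
    """CH(Phi) as the block matrix with (i, j) block Phi(E_ij)."""
    m = As[0].shape[0]
    C = np.zeros((n * m, n * m), dtype=complex)
    for i in range(n):
        for j in range(n):
            Eij = np.zeros((n, n)); Eij[i, j] = 1.0
            C[i*m:(i+1)*m, j*m:(j+1)*m] = sum(A @ Eij @ A.conj().T for A in As)
    return C

As = [np.random.randn(2, 3) for _ in range(2)]   # Phi : C^{3x3} -> C^{2x2}
C = choi(As, 3)
print(np.allclose(C, C.conj().T))                # Hermitian
print(np.linalg.eigvalsh(C).min() >= -1e-12)     # PSD, since Phi is CP
```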

Slide 23

Operator Sinkhorn in Choi representation

We assume that CH(Φ) is PD. Consider
S(C^{mn}) = { ρ ∈ C^{mn×mn} : ρ ⪰ O, tr ρ = 1 }   (density matrices)
and
Π₁ = { ρ ≻ O : tr₁(ρ) = I/m } ⊂ S(C^{mn}),
Π₂ = { ρ ≻ O : tr₂(ρ) = I/n } ⊂ S(C^{mn}).

Putting ρ_k := CH(Φ^(k)), the iterates of operator Sinkhorn are
ρ_{2k+1} = (1/m) (I ⊗ Φ^(2k)(I)^{−1/2}) ρ_{2k} (I ⊗ Φ^(2k)(I)^{−1/2}) ∈ Π₁,
ρ_{2k+2} = (1/n) ((Φ^(2k+1))*(I)^{−1/2} ⊗ I) ρ_{2k+1} ((Φ^(2k+1))*(I)^{−1/2} ⊗ I) ∈ Π₂.

Slide 24

Symmetric logarithmic derivative (SLD) metric

• In classical information geometry, the Fisher metric is the only monotone metric (Cencov's theorem).
• In quantum information geometry, however, monotone metrics are not unique.
• Monotone metrics are characterized by operator monotone functions [Petz, 1996].
• Each monotone metric induces its own e-connection.

SLD metric:
g^S(X, Y) = tr(L^S_X Y), where X = (1/2)(L^S_X ρ + ρ L^S_X)   (Lyapunov equation)
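The Lyapunov equation has a closed-form solution in the eigenbasis of ρ, which gives a direct way to evaluate the SLD metric; a sketch:

```python
import numpy as np

def sld(rho, X):
    """Solve X = (L rho + rho L) / 2 for Hermitian L."""
    lam, U = np.linalg.eigh(rho)
    Xt = U.conj().T @ X @ U
    Lt = 2.0 * Xt / (lam[:, None] + lam[None, :])   # L_ab = 2 X_ab / (lam_a + lam_b)
    return U @ Lt @ U.conj().T

def sld_metric(rho, X, Y):
    """g^S_rho(X, Y) = tr(L_X Y)."""
    return np.trace(sld(rho, X) @ Y).real
```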

Slide 25

Operator Sinkhorn = alternating e-projection

One can introduce m/e-connections s.t.
m-geodesic: ρ(t) = (1 − t)ρ₀ + tρ₁
e-geodesic: ρ(t) ∝ K^t ρ₀ K^t, where K = ρ₀^{−1} # ρ₁   (# is the matrix geometric mean)

Theorem
If ρ₀ ≻ O, then the iterates of the operator Sinkhorn algorithm are the unique e-projections w.r.t. the SLD metric: the e-geodesic from ρ_{2k} to ρ_{2k+1} (from ρ_{2k+1} to ρ_{2k+2}) is orthogonal to Π₁ (Π₂) w.r.t. the SLD metric.
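A sketch of the e-geodesic between two PD density matrices via the matrix geometric mean (the helper names and the trace renormalization are my choices):

```python
import numpy as np

def mpow(M, t):
    """Fractional power of a Hermitian PD matrix."""
    w, U = np.linalg.eigh(M)
    return (U * w**t) @ U.conj().T

def geo_mean(A, B):
    """A # B = A^{1/2} (A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}."""
    Ah, Ahi = mpow(A, 0.5), mpow(A, -0.5)
    return Ah @ mpow(Ahi @ B @ Ahi, 0.5) @ Ah

def e_geodesic(rho0, rho1, t):
    K = geo_mean(mpow(rho0, -1.0), rho1)   # K rho0 K = rho1
    Kt = mpow(K, t)
    g = Kt @ rho0 @ Kt
    return g / np.trace(g).real            # renormalize to trace one

rng = np.random.default_rng(4)
X = rng.standard_normal((3, 3)); rho0 = X @ X.T + np.eye(3); rho0 /= np.trace(rho0)
Y = rng.standard_normal((3, 3)); rho1 = Y @ Y.T + np.eye(3); rho1 /= np.trace(rho1)
print(np.allclose(e_geodesic(rho0, rho1, 1.0), rho1))   # endpoint check
```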

Slide 26

Proof sketch

• The e-geodesic from ρ_{2k} to ρ_{2k+1} is ρ(t) = K^t ρ_{2k} K^t, where
  K = ρ_{2k}^{−1} # ρ_{2k+1} = (1/√m) I ⊗ Φ^(2k)(I)^{−1/2}.
• The e-representation L^S of ρ̇(1) satisfies the Lyapunov equation:
  (1/2)(L^S ρ_{2k+1} + ρ_{2k+1} L^S) = ρ̇(1) = (log K) ρ_{2k+1} + ρ_{2k+1} (log K).
• Since the solution of the Lyapunov equation is unique,
  L^S = 2 log K = −I ⊗ log Φ^(2k)(I) − (log m) I.
• Therefore, ρ̇(1) is orthogonal to Π₁ w.r.t. the SLD metric: every tangent vector X of Π₁ satisfies tr₁ X = O, hence tr(L^S X) = 0.
• Uniqueness is shown similarly (not from the generalized Pythagorean theorem, but from the uniqueness of solutions of matrix equations).

Slide 27

Is there a divergence for capacity?

Consider the case m = n.

Capacity [Gurvits, 2004]
cap(Φ) = inf_{X ≻ O} ( det Φ(X) / det X )^{1/n}

Key tool for studying operator scaling [Gurvits, 2004; Garg et al.; Allen-Zhu et al., 2018]

Q. Is there a "divergence" D s.t.
−log cap(Φ) + log(1/n) = min_{ρ* ∈ Π₁∩Π₂} D(ρ* ‖ CH(Φ))
as in matrix scaling?

Slide 28

Is there a divergence for capacity?

Naive idea: Umegaki relative entropy
D(ρ ‖ σ) = tr[ρ (log ρ − log σ)]
• arises from the dually flat structure with the Bogoliubov–Kubo–Mori metric, which corresponds to Ψ(ρ) = tr(ρ log ρ − ρ).

However, numerical experiments show that this does not coincide with the operator Sinkhorn iteration...

Actually, the SLD metric is NOT dually flat!

Still open!
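For completeness, the Umegaki relative entropy for PD inputs in code (a sketch):

```python
import numpy as np

def logm(M):
    """Matrix logarithm of a Hermitian PD matrix."""
    w, U = np.linalg.eigh(M)
    return (U * np.log(w)) @ U.conj().T

def umegaki(rho, sigma):
    """D(rho || sigma) = tr[rho (log rho - log sigma)]."""
    return np.trace(rho @ (logm(rho) - logm(sigma))).real
```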

Slide 29

Numerical example

Generated random density matrices ρ and compared
• −log cap(ρ)
• D(ρ* ‖ ρ)   (ρ*: limit of Sinkhorn)

[Scatter plot: Umegaki relative entropy vs. −log cap]

Slide 30

Summary

Slide 31

Summary

Operator Sinkhorn is alternating e-projection w.r.t. the SLD metric on PD matrices.

Future work
• Divergence characterization?
• Convergence analysis based on the SLD metric? Optimization algorithms?
• Applications of operator scaling in quantum information theory?

Slide 32

Summary

Operator Sinkhorn is alternating e-projection w.r.t. the SLD metric on PD matrices.

Future work
• Divergence characterization?
• Convergence analysis based on the SLD metric? Optimization algorithms?
• Applications of operator scaling in quantum information theory?

Thank you!

Slide 33

Matrix geometric mean

For PD matrices A, B ≻ O, the matrix geometric mean is defined as
A # B = A^{1/2} (A^{−1/2} B A^{−1/2})^{1/2} A^{1/2}.

Properties
• A # B ≻ O
• A # B = B # A
• A # B is the unique PD solution of the algebraic Riccati equation X A^{−1} X = B.
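These properties are easy to confirm numerically; a sketch:

```python
import numpy as np

def mpow(M, t):
    """Fractional power of a Hermitian PD matrix."""
    w, U = np.linalg.eigh(M)
    return (U * w**t) @ U.conj().T

def geo_mean(A, B):
    """A # B = A^{1/2} (A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}."""
    Ah, Ahi = mpow(A, 0.5), mpow(A, -0.5)
    return Ah @ mpow(Ahi @ B @ Ahi, 0.5) @ Ah

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 4)); A = X @ X.T + np.eye(4)
Y = rng.standard_normal((4, 4)); B = Y @ Y.T + np.eye(4)
G = geo_mean(A, B)
print(np.allclose(G @ np.linalg.inv(A) @ G, B))   # Riccati: X A^{-1} X = B
print(np.allclose(G, geo_mean(B, A)))             # symmetry A # B = B # A
```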