
Information Geometry of Operator Scaling Part II: information geometry and scaling

Tasuku Soma
June 17, 2020


Transcript

  1. Information Geometry of Operator Scaling Part II: information geometry and scaling

     Takeru Matsuda (UTokyo, RIKEN)   Tasuku Soma (UTokyo)
     July 8, 2020
  2. Matrix scaling

     Input: nonnegative matrix A ∈ R^{m×n}_+
     Output: positive diagonal matrices L ∈ R^{m×m}_{++}, R ∈ R^{n×n}_{++} s.t.

         (LAR) 1_n = 1_m/m   and   (LAR)^⊤ 1_m = 1_n/n

     Applications
     • Markov chain estimation [Sinkhorn, 1964]
     • Contingency table analysis [Morioka and Tsuda]
     • Optimal transport [Peyré and Cuturi, 2019]
  3. Sinkhorn algorithm [Sinkhorn, 1964]

     W.l.o.g. assume that A^⊤ 1_m = 1_n/n.

         A^{(0)} = A,
         A^{(2k+1)} = (1/m) Diag(A^{(2k)} 1_n)^{-1} A^{(2k)},
         A^{(2k+2)} = (1/n) A^{(2k+1)} Diag((A^{(2k+1)})^⊤ 1_m)^{-1}.

     Theorem (Sinkhorn (1964)). If A is a positive matrix, then there exists a solution and A^{(k)} converges to a solution.
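A minimal NumPy sketch of this iteration (my own illustration, not code from the talk; names like `sinkhorn` are assumptions), using the 1/m, 1/n normalization of these slides:

```python
import numpy as np

def sinkhorn(A, iters=500):
    """Alternately rescale rows and columns of a positive matrix A
    so that row sums become 1/m and column sums become 1/n."""
    m, n = A.shape
    B = A / A.sum()
    for _ in range(iters):
        B = B / (m * B.sum(axis=1, keepdims=True))  # B <- (1/m) Diag(B 1_n)^{-1} B
        B = B / (n * B.sum(axis=0, keepdims=True))  # B <- (1/n) B Diag(B^T 1_m)^{-1}
    return B

rng = np.random.default_rng(0)
A = rng.random((3, 4)) + 0.1
B = sinkhorn(A)
print(B.sum(axis=1))  # ~ [1/3, 1/3, 1/3]
print(B.sum(axis=0))  # ~ [1/4, 1/4, 1/4, 1/4]
```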
  4. Sinkhorn = alternating e-projection

     Kullback–Leibler divergence:

         D_KL(B || A) = Σ_{i,j} B_ij log(B_ij / A_ij)

     Theorem (Csiszár (1975)). Sinkhorn's iterates satisfy

         A^{(2k+1)} = argmin { D_KL(B || A^{(2k)}) : B 1_n = 1_m/m },
         A^{(2k+2)} = argmin { D_KL(B || A^{(2k+1)}) : B^⊤ 1_m = 1_n/n }.

     In information geometry, this is alternating e-projection w.r.t. the Fisher metric.
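A quick numerical sanity check of this characterization (my own script): perturbing the row-rescaled matrix within the constraint set {B : B 1_n = 1_m/m} can only increase the KL divergence to A.

```python
import numpy as np

def kl(B, A):
    return np.sum(B * np.log(B / A))

rng = np.random.default_rng(1)
m, n = 3, 4
A = rng.random((m, n)) + 0.1
A /= A.sum()

B = A / (m * A.sum(axis=1, keepdims=True))  # one Sinkhorn row step

V = rng.normal(size=(m, n))                 # feasible direction: rows sum to zero
V -= V.mean(axis=1, keepdims=True)
V /= np.abs(V).max()

for eps in (1e-3, 1e-4):
    assert kl(B + eps * V, A) > kl(B, A)    # B minimizes KL on {B 1_n = 1_m/m}
print("row step is the e-projection; KL =", kl(B, A))
```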
  5. Operator scaling

     • A linear map Φ : C^{n×n} → C^{m×m} is completely positive (CP) if

           Φ(X) = Σ_{i=1}^k A_i X A_i†

       for some A_1, ..., A_k ∈ C^{m×n}.
     • The dual map of the above CP map is

           Φ*(X) = Σ_{i=1}^k A_i† X A_i.

     • For nonsingular Hermitian matrices L, R, the scaled map Φ_{L,R} is

           Φ_{L,R}(X) = L Φ(R† X R) L†.
  6. Operator scaling

     Input: CP map Φ : C^{n×n} → C^{m×m}
     Output: nonsingular Hermitian matrices L, R s.t.

         Φ_{L,R}(I_n) = I_m/m   and   Φ*_{L,R}(I_m) = I_n/n.

     Note: constants changed from Gurvits' original "doubly stochastic" formulation.
  7. Operator Sinkhorn algorithm [Gurvits, 2004]

     W.l.o.g. assume that Φ*(I_m) = I_n/n.

         Φ^{(0)} = Φ,
         Φ^{(2k+1)} = (Φ^{(2k)})_{L,I_n}   where L = (1/√m) Φ^{(2k)}(I_n)^{-1/2},
         Φ^{(2k+2)} = (Φ^{(2k+1)})_{I_m,R}   where R = (1/√n) (Φ^{(2k+1)})*(I_m)^{-1/2}.

     Under reasonable conditions, Φ^{(k)} converges to a solution [Gurvits, 2004].
     Can we view this as "alternating e-projection"?
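A runnable sketch of this iteration acting on the Kraus operators (my own illustration; `operator_sinkhorn` and the helper names are assumptions, not from the talk):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power as mpow

def phi(As, X):        # Phi(X) = sum_i A_i X A_i^dagger
    return sum(A @ X @ A.conj().T for A in As)

def phi_dual(As, X):   # Phi*(X) = sum_i A_i^dagger X A_i
    return sum(A.conj().T @ X @ A for A in As)

def operator_sinkhorn(As, iters=300):
    """Scale a CP map given by Kraus operators As so that
    Phi(I_n) = I_m/m and Phi*(I_m) = I_n/n."""
    m, n = As[0].shape
    for _ in range(iters):
        L = mpow(phi(As, np.eye(n)), -0.5) / np.sqrt(m)       # L = (1/sqrt(m)) Phi(I)^{-1/2}
        As = [L @ A for A in As]                              # Phi <- Phi_{L, I}
        R = mpow(phi_dual(As, np.eye(m)), -0.5) / np.sqrt(n)  # R = (1/sqrt(n)) Phi*(I)^{-1/2}
        As = [A @ R for A in As]                              # Phi <- Phi_{I, R}
    return As

rng = np.random.default_rng(2)
m = n = 3
As = [rng.normal(size=(m, n)) + 1j * rng.normal(size=(m, n)) for _ in range(4)]
As = operator_sinkhorn(As)
print(np.allclose(phi(As, np.eye(n)), np.eye(m) / m))       # True
print(np.allclose(phi_dual(As, np.eye(m)), np.eye(n) / n))  # True
```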
  8. Our result

     Theorem (Matsuda and S., 2020). Operator Sinkhorn is alternating e-projection w.r.t. the symmetric logarithmic derivative (SLD) metric on positive definite matrices.

     • Quantum generalization of [Csiszár, 1975]
  9. Information geometry

     Statistical theory using differential geometry [Amari and Nagaoka, 2000]
     Non-metrical dual connections play a central role (i.e., they differ from the Levi-Civita connection).

     Key components:
         metric tensor g
         two affine connections ∇^{(m)}, ∇^{(e)}
     → these induce two kinds of geodesics (m/e-geodesics)
  10. Information geometry on S^{n−1}

      S^{n−1} = { p = (p_1, ..., p_n) : p_k > 0, Σ_{k=1}^n p_k = 1 } ⊂ R^n_{++}

      Probability simplex of dimension n − 1.
      We will introduce a Riemannian structure on S^{n−1} with
          metric tensor g
          affine connections ∇^{(m)}, ∇^{(e)}
  11. Fisher metric

      Two coordinate systems on S^{n−1}:
          m-coordinate (mixture): θ^{(m)} = (p_1, ..., p_{n−1})
          e-coordinate (exponential): θ^{(e)} = (log(p_1/p_n), ..., log(p_{n−1}/p_n))

      Fisher metric:

          g(X, Y)_p = Σ_i X(log p_i) Y(p_i)   (X, Y ∈ T_p(S^{n−1}): tangent vectors)
                    = Σ_i X^{(e)}_i Y^{(m)}_i,

      where X = Σ_i X^{(e)}_i (∂/∂θ^{(e)}_i)_p and Y = Σ_i Y^{(m)}_i (∂/∂θ^{(m)}_i)_p.

      Note: the Fisher metric is the unique metric satisfying natural statistical invariance (Cencov's theorem).
  12. Dual connections

      Take ∇^{(m)}, ∇^{(e)} s.t. the Christoffel symbols in the m-/e-coordinates vanish, respectively.

          m-geodesic: γ(t) = (1 − t) p + t q
          e-geodesic: γ(t) ∝ exp((1 − t) log p + t log q)

      [Figure: m- and e-geodesics joining p and q]

      They are dual connections:

          Z g(X, Y) = g(∇^{(m)}_Z X, Y) + g(X, ∇^{(e)}_Z Y)

      Cf. the Levi-Civita connection ∇^{LC} is self-dual:

          Z g(X, Y) = g(∇^{LC}_Z X, Y) + g(X, ∇^{LC}_Z Y)
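The two geodesics are easy to play with numerically. A small sketch of mine (function names are assumptions):

```python
import numpy as np

def m_geodesic(p, q, t):
    """Mixture geodesic: straight line in the simplex."""
    return (1 - t) * p + t * q

def e_geodesic(p, q, t):
    """Exponential geodesic: exp((1 - t) log p + t log q), renormalized."""
    g = np.exp((1 - t) * np.log(p) + t * np.log(q))
    return g / g.sum()

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])
for t in (0.0, 0.5, 1.0):
    print(t, m_geodesic(p, q, t), e_geodesic(p, q, t))
```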
  13. Sinkhorn as alternating e-projection

      Now, consider S^{N−1} (N = mn) and the submanifolds

          Π1 = { A ∈ R^{m×n}_{++} | A 1_n = m^{-1} 1_m },
          Π2 = { A ∈ R^{m×n}_{++} | A^⊤ 1_m = n^{-1} 1_n }.

      (These submanifolds are m-autoparallel.)

      [Figure: alternating e-projections between Π1 and Π2]

      Theorem. If A ∈ R^{m×n}_{++}, then each iterate of the Sinkhorn algorithm is an e-projection: the e-geodesic from A^{(2k)} to A^{(2k+1)} (from A^{(2k+1)} to A^{(2k+2)}) is orthogonal to Π1 (Π2) w.r.t. the Fisher metric.
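The orthogonality claim can be checked numerically: flatten the matrices into points of S^{N−1}, take the velocity of the e-geodesic at its endpoint, and pair it with tangent vectors of Π1 (rows summing to zero) under the Fisher metric g_B(U, V) = Σ U_ij V_ij / B_ij. A sketch of mine:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 4
A = rng.random((m, n)) + 0.1
A /= A.sum()                                   # a point of S^{N-1}, N = mn

B = A / (m * A.sum(axis=1, keepdims=True))     # e-projection of A onto Pi_1

# velocity of the e-geodesic t -> exp((1-t) log A + t log B) (normalized) at t = 1
d = np.log(B) - np.log(A)
T = B * d - B * np.sum(B * d)

# random tangent vector of Pi_1 at B: rows sum to zero
V = rng.normal(size=(m, n))
V -= V.mean(axis=1, keepdims=True)

print(np.sum(T * V / B))   # Fisher pairing g_B(T, V) ~ 0
```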
  14. Dually flat structure of S^{n−1} and KL-divergence

      Let Ψ(p) = Σ_i (p_i log p_i − p_i) be the negative entropy. Then

          θ^{(e)} = grad Ψ(θ^{(m)}),   θ^{(m)} = grad Ψ*(θ^{(e)})   (Legendre transform)
          g = Hess(Ψ)   (Hessian metric)

      One can define the canonical divergence as

          D(p || q) = Ψ(p) − Ψ(q) − ⟨grad Ψ(q), p − q⟩

      (in our case, it is KL).

      Fact: e-projection onto an m-autoparallel submanifold can be done via canonical divergence minimization.
      → information-geometric proof of Csiszár (1975)
  15. KL-divergence and capacity

      Consider the case of m = n.

      Capacity [Gurvits and Samorodnitsky; Idel, 2016]:

          cap(A) = inf_{x > 0} ( ∏_{i=1}^n (Ax)_i / ∏_{i=1}^n x_i )^{1/n}

      Capacity can be used as a "potential" for Sinkhorn:

          − log cap(A) + log(1/n) = min_{B ∈ Π1 ∩ Π2} D_KL(B || A)
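A numerical check of this identity (my own script; the exact normalization of cap and the additive constant are my reading of the garbled slide, so treat the formula as a sketch). Since log(∏_i (A e^y)_i / ∏_i e^{y_i}) is convex in y = log x, the infimum can be computed with a generic optimizer:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n = 4
A = rng.random((n, n)) + 0.1
A /= A.sum()

# log( prod_i (A e^y)_i / prod_i e^{y_i} ) is convex in y = log x
f = lambda y: np.sum(np.log(A @ np.exp(y))) - np.sum(y)
log_cap = minimize(f, np.zeros(n)).fun / n   # log cap(A), with the 1/n power

# Sinkhorn limit = e-projection of A onto Pi_1 and Pi_2
B = A.copy()
for _ in range(2000):
    B = B / (n * B.sum(axis=1, keepdims=True))
    B = B / (n * B.sum(axis=0, keepdims=True))
min_kl = np.sum(B * np.log(B / A))

print(-log_cap + np.log(1 / n))   # -log cap(A) + log(1/n)
print(min_kl)                     # should agree up to optimizer tolerance
```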
  16. Convergence of Sinkhorn algorithm

      Generalized Pythagorean theorem. If the m-geodesic from A_1 to A_2 and the e-geodesic from A_2 to A_3 are orthogonal at A_2 w.r.t. the Fisher metric, then

          D_KL(A_1 || A_3) = D_KL(A_1 || A_2) + D_KL(A_2 || A_3).

      [Figure: the "right triangle" A_1, A_2, A_3]

      Theorem (Csiszár (1975)). The Sinkhorn algorithm converges to the e-projection A* of A onto Π1 ∩ Π2:

          D_KL(A* || A) = min_{B ∈ Π1 ∩ Π2} D_KL(B || A),
          D_KL(A* || A) = D_KL(A* || A^{(K)}) + Σ_{k=1}^K D_KL(A^{(k)} || A^{(k−1)}).
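The telescoping identity is easy to verify numerically, treating each row or column rescaling as one half-step iterate (a sketch of mine; the long-run iterate stands in for A*):

```python
import numpy as np

def kl(B, A):
    return np.sum(B * np.log(B / A))

rng = np.random.default_rng(5)
n = 4
A = rng.random((n, n)) + 0.1
A /= A.sum()

iterates = [A]                      # half-step iterates A^(0), A^(1), A^(2), ...
B = A.copy()
for _ in range(30):
    B = B / (n * B.sum(axis=1, keepdims=True)); iterates.append(B)
    B = B / (n * B.sum(axis=0, keepdims=True)); iterates.append(B)
Astar = iterates[-1]                # long-run iterate stands in for A*

K = 10
lhs = kl(Astar, A)
rhs = kl(Astar, iterates[K]) + sum(kl(iterates[k], iterates[k - 1]) for k in range(1, K + 1))
print(lhs, rhs)                     # agree, by telescoping the Pythagorean theorem
```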
  17. Information geometry of operator scaling

      Idea: using the Choi representation, we move to the manifold of PD matrices, and then apply quantum information geometry on that manifold.

                        matrix scaling                          operator scaling
      manifold          p ∈ R^N_{++}: p_i > 0, Σ_i p_i = 1      ρ ∈ C^{N×N}: ρ ≻ O, tr ρ = 1
      metric            Fisher                                  SLD
      divergence        KL                                      ???
      dually flat?      YES                                     NO
  18. Partial trace

      For a partitioned matrix

          A = ( A_11 ··· A_1n
                 ⋮    ⋱    ⋮
                A_n1 ··· A_nn )   (each block A_ij is m × m),

      the partial traces are defined as

          tr_1 A = Σ_{i=1}^n A_ii   (sum of the diagonal blocks),
          tr_2 A = ( tr A_ij )_{i,j=1}^n   (entrywise traces of the blocks).
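Both operations are one reshape away in NumPy (a sketch of mine; `tr1`/`tr2` are assumed names):

```python
import numpy as np

def tr1(M, n, m):
    """Sum of the n diagonal m x m blocks of an (nm) x (nm) matrix."""
    return M.reshape(n, m, n, m).trace(axis1=0, axis2=2)

def tr2(M, n, m):
    """n x n matrix of the traces of the m x m blocks."""
    return M.reshape(n, m, n, m).trace(axis1=1, axis2=3)

rng = np.random.default_rng(6)
n, m = 3, 2
X, Y = rng.normal(size=(n, n)), rng.normal(size=(m, m))
M = np.kron(X, Y)
print(np.allclose(tr1(M, n, m), np.trace(X) * Y))   # True
print(np.allclose(tr2(M, n, m), np.trace(Y) * X))   # True
```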
  19. Choi representation [Choi, 1975]

          CH(Φ) = Σ_{i,j=1}^n E_ij ⊗ Φ(E_ij)
                = ( Φ(E_11) ··· Φ(E_1n)
                      ⋮     ⋱      ⋮
                    Φ(E_n1) ··· Φ(E_nn) )

      Facts:
      • CH is an isomorphism between linear maps and Hermitian matrices
      • CH(Φ) ⪰ O ⟺ Φ is CP
      • CH(Φ_{L,R}) = (R† ⊗ L) CH(Φ) (R ⊗ L†)
      • tr_1 CH(Φ) = Φ(I_n),   tr_2 CH(Φ) = Φ*(I_m)
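A sketch verifying these facts for a random CP map (mine, not from the talk). Real Kraus operators are used to sidestep transpose/conjugation conventions, which vary between Choi-matrix definitions:

```python
import numpy as np

def tr1(M, n, m): return M.reshape(n, m, n, m).trace(axis1=0, axis2=2)
def tr2(M, n, m): return M.reshape(n, m, n, m).trace(axis1=1, axis2=3)

def choi(As, n, m):
    """CH(Phi) = sum_{i,j} E_ij (x) Phi(E_ij), with Phi(X) = sum_k A_k X A_k^T."""
    C = np.zeros((n * m, n * m))
    for i in range(n):
        for j in range(n):
            Eij = np.zeros((n, n)); Eij[i, j] = 1.0
            C += np.kron(Eij, sum(A @ Eij @ A.T for A in As))
    return C

rng = np.random.default_rng(7)
n, m = 3, 2
As = [rng.normal(size=(m, n)) for _ in range(3)]   # real Kraus operators
C = choi(As, n, m)
print(np.all(np.linalg.eigvalsh(C) > -1e-9))                 # Phi CP => CH(Phi) >= O
print(np.allclose(tr1(C, n, m), sum(A @ A.T for A in As)))   # tr_1 CH = Phi(I_n)
print(np.allclose(tr2(C, n, m), sum(A.T @ A for A in As)))   # tr_2 CH = Phi*(I_m)
```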
  20. Operator Sinkhorn in Choi representation

      We assume that CH(Φ) is PD. Consider

          S(C^{mn}) = { ρ ∈ C^{mn×mn} : ρ ≻ O, tr ρ = 1 }   (density matrices)

      and

          Π1 = { ρ ≻ O | tr_1(ρ) = I_m/m } ⊂ S(C^{mn}),
          Π2 = { ρ ≻ O | tr_2(ρ) = I_n/n } ⊂ S(C^{mn}).

      Putting ρ_k := CH(Φ^{(k)}), the iterates of operator Sinkhorn are:

          ρ_{2k+1} = (1/m) (I ⊗ Φ^{(2k)}(I)^{-1/2}) ρ_{2k} (I ⊗ Φ^{(2k)}(I)^{-1/2}) ∈ Π1,
          ρ_{2k+2} = (1/n) ((Φ^{(2k+1)})*(I)^{-1/2} ⊗ I) ρ_{2k+1} ((Φ^{(2k+1)})*(I)^{-1/2} ⊗ I) ∈ Π2.
  21. Symmetric logarithmic derivative (SLD) metric

      • In classical information geometry, the Fisher metric is the only monotone metric (Cencov's theorem).
      • However, in quantum information geometry, monotone metrics are not unique.
      • Monotone metrics are characterized by operator monotone functions [Petz, 1996].
      • Each monotone metric induces its own e-connection.

      Symmetric logarithmic derivative (SLD) metric:

          g^S(X, Y) = tr(L^S_X Y),   where X = (1/2)(L^S_X ρ + ρ L^S_X)   (Lyapunov equation)
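The Lyapunov equation is a Sylvester equation, so the SLD pairing can be computed directly (a sketch of mine; `sld`/`sld_metric` are assumed names):

```python
import numpy as np
from scipy.linalg import solve_sylvester

def sld(rho, X):
    """Solve the Lyapunov equation X = (1/2)(L rho + rho L) for L = L^S_X."""
    return solve_sylvester(rho / 2, rho / 2, X)

def sld_metric(rho, X, Y):
    """g^S_rho(X, Y) = tr(L^S_X Y)."""
    return np.trace(sld(rho, X) @ Y).real

rng = np.random.default_rng(8)
d = 4
G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
rho = G @ G.conj().T + 0.1 * np.eye(d)
rho /= np.trace(rho).real                  # random density matrix

H = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
X = H + H.conj().T                         # Hermitian tangent directions,
Y = 1j * (H - H.conj().T)                  # made traceless below
X -= np.trace(X).real / d * np.eye(d)
Y -= np.trace(Y).real / d * np.eye(d)

print(sld_metric(rho, X, Y), sld_metric(rho, Y, X))   # symmetric bilinear form
```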
  22. Operator Sinkhorn = alternating e-projection

      One can introduce m/e-connections s.t.

          m-geodesic: γ(t) = (1 − t) ρ_0 + t ρ_1,
          e-geodesic: γ(t) ∝ K^t ρ_0 K^t,   where K = ρ_0^{-1} # ρ_1

      (# is the matrix geometric mean).

      [Figure: alternating e-projections between Π1 and Π2]

      Theorem. If CH(Φ) ≻ O, then each iterate of the operator Sinkhorn algorithm is the unique e-projection w.r.t. the SLD metric: the e-geodesic from ρ_{2k} to ρ_{2k+1} (from ρ_{2k+1} to ρ_{2k+2}) is orthogonal to Π1 (Π2) w.r.t. the SLD metric.
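A sketch of the SLD e-geodesic between two density matrices (mine; it checks only that the endpoints are recovered):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power as mpow

def geo_mean(A, B):
    """A # B = A^{1/2} (A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}."""
    Ah, Aih = mpow(A, 0.5), mpow(A, -0.5)
    return Ah @ mpow(Aih @ B @ Aih, 0.5) @ Ah

def e_geodesic(rho0, rho1, t):
    """SLD e-geodesic K^t rho0 K^t (normalized), with K = rho0^{-1} # rho1."""
    K = geo_mean(np.linalg.inv(rho0), rho1)
    Kt = mpow(K, t)
    g = Kt @ rho0 @ Kt
    return g / np.trace(g).real

def rand_density(rng, d):
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = G @ G.conj().T + 0.1 * np.eye(d)
    return rho / np.trace(rho).real

rng = np.random.default_rng(9)
rho0, rho1 = rand_density(rng, 3), rand_density(rng, 3)
print(np.allclose(e_geodesic(rho0, rho1, 0.0), rho0))  # True: gamma(0) = rho0
print(np.allclose(e_geodesic(rho0, rho1, 1.0), rho1))  # True: K rho0 K = rho1 (Riccati)
```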
  23. Proof sketch

      • The e-geodesic from ρ_{2k} to ρ_{2k+1} is

            γ(t) = K^t ρ_{2k} K^t,   K = ρ_{2k}^{-1} # ρ_{2k+1} = (1/√m) I ⊗ Φ^{(2k)}(I)^{-1/2}.

      • The e-representation L^S of γ̇(1) satisfies the Lyapunov equation:

            (1/2)(L^S ρ_{2k+1} + ρ_{2k+1} L^S) = γ̇(1) = (log K) ρ_{2k+1} + ρ_{2k+1} (log K).

      • Since the solution of the Lyapunov equation is unique,

            L^S = 2 log K = −I ⊗ log Φ^{(2k)}(I) − (log m) I.

      • Therefore, γ̇(1) is orthogonal to Π1 w.r.t. the SLD metric (the first term pairs with tr_1 of a tangent vector, which vanishes; the second is a multiple of the identity, which pairs with the trace).
      • Uniqueness is shown similarly (not from the generalized Pythagorean theorem, but from the uniqueness of solutions of matrix equations).

      [Figure: e-geodesic γ(t) meeting Π1 orthogonally at ρ_{2k+1}]
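This orthogonality can be confirmed numerically: after one half-step in the Choi picture, tr(L^S X) vanishes for every Hermitian X with tr_1 X = O. A self-contained sketch (mine, with real Kraus operators for simplicity):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power as mpow, logm

def tr1(M, n, m): return M.reshape(n, m, n, m).trace(axis1=0, axis2=2)

rng = np.random.default_rng(10)
n = m = 3
As = [rng.normal(size=(m, n)) for _ in range(n * m)]   # enough Kraus ops for a PD Choi matrix
rho0 = np.zeros((n * m, n * m))                        # rho_0 = CH(Phi), normalized below
for i in range(n):
    for j in range(n):
        Eij = np.zeros((n, n)); Eij[i, j] = 1.0
        rho0 += np.kron(Eij, sum(A @ Eij @ A.T for A in As))
rho0 /= np.trace(rho0)

# one operator Sinkhorn half-step in the Choi picture
PhiI = tr1(rho0, n, m)                                  # Phi(I_n)
K = np.kron(np.eye(n), mpow(PhiI, -0.5)) / np.sqrt(m)   # K = (1/sqrt(m)) I (x) Phi(I)^{-1/2}
rho1 = K @ rho0 @ K
LS = 2 * logm(K)                                        # e-representation of the velocity

# random tangent vector X of Pi_1 at rho1: Hermitian with tr_1 X = O
H = rng.normal(size=(n * m, n * m))
X = H + H.T
X -= np.kron(np.eye(n), tr1(X, n, m)) / n

print(np.allclose(tr1(rho1, n, m), np.eye(m) / m))  # rho1 lands in Pi_1
print(abs(np.trace(LS @ X)))                        # ~ 0: e-geodesic _|_ Pi_1
```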
  24. Is there a divergence for capacity?

      Consider the case of m = n.

      Capacity [Gurvits, 2004]:

          cap(Φ) = inf_{X ≻ O} ( det Φ(X) / det X )^{1/n}

      Key tool for studying operator scaling [Gurvits, 2004; Garg et al.; Allen-Zhu et al., 2018]

      Q. Is there a "divergence" D s.t.

          − log cap(Φ) + log(1/n) = min_{ρ ∈ Π1 ∩ Π2} D(ρ || CH(Φ))

      as in matrix scaling?
  25. Is there a divergence for capacity?

      Naive idea: Umegaki relative entropy

          D(ρ || σ) = tr[ρ (log ρ − log σ)]

      • It arises from the dually flat structure with the Bogoliubov–Kubo–Mori metric, which corresponds to Ψ(ρ) = tr(ρ log ρ − ρ).

      However, numerical experiments show that it does not match the operator Sinkhorn iteration...
      In fact, the SLD metric is NOT dually flat!
      Still open!
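Computing the Umegaki relative entropy is a one-liner with a matrix logarithm (a sketch of mine):

```python
import numpy as np
from scipy.linalg import logm

def umegaki(rho, sigma):
    """D(rho || sigma) = tr[rho (log rho - log sigma)]."""
    return np.trace(rho @ (logm(rho) - logm(sigma))).real

def rand_density(rng, d):
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = G @ G.conj().T + 0.1 * np.eye(d)
    return rho / np.trace(rho).real

rng = np.random.default_rng(11)
rho, sigma = rand_density(rng, 3), rand_density(rng, 3)
print(umegaki(rho, sigma))   # >= 0, and = 0 iff rho == sigma
print(umegaki(rho, rho))     # ~ 0
```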
  26. Numerical example

      Generated random density matrices and compared
      • − log cap(ρ)
      • D(ρ* || ρ)   (ρ*: limit of Sinkhorn)

      [Scatter plot: Umegaki relative entropy D(ρ* || ρ) vs. − log cap]
  27. Summary

      Operator Sinkhorn is alternating e-projection w.r.t. the SLD metric on PD matrices.

      Future work
      • Divergence characterization?
      • Convergence analysis based on the SLD metric? Optimization algorithms?
      • Applications of operator scaling in quantum information theory?
  28. Summary

      Operator Sinkhorn is alternating e-projection w.r.t. the SLD metric on PD matrices.

      Future work
      • Divergence characterization?
      • Convergence analysis based on the SLD metric? Optimization algorithms?
      • Applications of operator scaling in quantum information theory?

      Thank you!
  29. Matrix geometric mean

      For PD matrices A, B ≻ O, the matrix geometric mean is defined as

          A # B = A^{1/2} (A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}.

      Properties
      • A # B ≻ O
      • A # B = B # A
      • A # B is the unique PD solution of the algebraic Riccati equation X A^{-1} X = B.
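A direct implementation with fractional matrix powers, checking the listed properties (a sketch of mine):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power as mpow

def geo_mean(A, B):
    """A # B = A^{1/2} (A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}."""
    Ah, Aih = mpow(A, 0.5), mpow(A, -0.5)
    return Ah @ mpow(Aih @ B @ Aih, 0.5) @ Ah

rng = np.random.default_rng(12)
d = 4
G1, G2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
A = G1 @ G1.T + np.eye(d)                        # random PD matrices
B = G2 @ G2.T + np.eye(d)

X = geo_mean(A, B)
print(np.all(np.linalg.eigvalsh(X) > 0))         # A # B is PD
print(np.allclose(X, geo_mean(B, A)))            # A # B = B # A
print(np.allclose(X @ np.linalg.inv(A) @ X, B))  # X A^{-1} X = B (Riccati)
```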