Slide 1

Slide 1 text

Information Geometry of Operator Scaling
PART I: survey of scaling problems
Takeru Matsuda (UTokyo, RIKEN)
Tasuku Soma (UTokyo)
June

Slide 2

Slide 2 text

Talk overview
Part I: survey of scaling problems
• What is a scaling problem?
• Matrix scaling and operator scaling, with applications
Part II: information geometry and scaling
• My recent work (with Takeru Matsuda)
• Operator scaling and quantum information geometry

Slide 3

Slide 3 text

Summary of Part I
• Scaling problems are linear algebraic problems arising in surprisingly many fields.
• Matrix scaling: optimal transport, statistics, machine learning, combinatorial optimization, ...
• Operator scaling: combinatorial optimization, computational complexity, noncommutative algebra, analysis, ...
• Both can be solved by a simple alternating algorithm (Sinkhorn)

Slide 4

Slide 4 text

Matrix scaling

Slide 5

Slide 5 text

Matrix scaling
Input: nonnegative matrix A ∈ R_+^{m×n}
Output: positive diagonal matrices L ∈ R^{m×m}, R ∈ R^{n×n} s.t.
(LAR) 1_n = 1_m/m and (LAR)^⊤ 1_m = 1_n/n
Applications
• Markov chain estimation [Richard Sinkhorn, 1964]
• Contingency table analysis [Morioka and Tsuda]
• Optimal transport [Peyré and Cuturi, 2019]

Slide 6

Slide 6 text

Sinkhorn theorem
Theorem (Richard Sinkhorn (1964)). For a positive matrix A ∈ R^{m×n}, there exists a solution (L, R) to the matrix scaling problem.
Can we find it by an efficient algorithm?

Slide 7

Slide 7 text

Sinkhorn algorithm [Richard Sinkhorn, 1964]
W.l.o.g. assume that A^⊤ 1_m = 1_n/n.
A^(0) = A
A^(2k+1) = (1/m) Diag(A^(2k) 1_n)^{−1} A^(2k)   (normalize rows)
A^(2k+2) = (1/n) A^(2k+1) Diag((A^(2k+1))^⊤ 1_m)^{−1}   (normalize columns)
Theorem (Richard Sinkhorn (1964)). If A is a positive matrix, then a solution exists and A^(k) converges to a solution.
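The alternating normalization above can be sketched in a few lines of plain Python. This is a minimal illustration, not an optimized implementation; the sample matrix and iteration count are arbitrary choices.

```python
# Minimal sketch of the Sinkhorn algorithm for matrix scaling:
# alternately normalize rows to sum to 1/m and columns to sum to 1/n.

def sinkhorn(A, iters=200):
    """Return the scaled matrix after `iters` rounds of row/column normalization."""
    m, n = len(A), len(A[0])
    A = [row[:] for row in A]
    for _ in range(iters):
        # Row step: A <- (1/m) Diag(A 1_n)^{-1} A
        for i in range(m):
            s = sum(A[i])
            A[i] = [a / (m * s) for a in A[i]]
        # Column step: A <- (1/n) A Diag(A^T 1_m)^{-1}
        for j in range(n):
            s = sum(A[i][j] for i in range(m))
            for i in range(m):
                A[i][j] /= n * s
    return A

A = [[1.0, 2.0], [3.0, 4.0]]
S = sinkhorn(A)
print([sum(row) for row in S])                             # row sums -> 1/2
print([sum(S[i][j] for i in range(2)) for j in range(2)])  # column sums -> 1/2
```

Since A is positive, Sinkhorn's theorem guarantees convergence, and both marginals approach the uniform targets.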

Slide 8

Slide 8 text

Example
(Sinkhorn iterate A^(k) on a sample matrix; numeric entries lost in extraction)

Slide 9

Slide 9 text

Example
(Sinkhorn iterate A^(k) on a sample matrix; numeric entries lost in extraction)

Slide 10

Slide 10 text

Example
(Sinkhorn iterate A^(k) on a sample matrix; numeric entries lost in extraction)

Slide 11

Slide 11 text

Example
(Sinkhorn iterate A^(k) on a sample matrix; numeric entries lost in extraction)

Slide 12

Slide 12 text

Example
(a later Sinkhorn iterate A^(k) on the sample matrix; numeric entries lost in extraction)

Slide 13

Slide 13 text

Matrix scaling: History
1930s Kruithof: telephone traffic forecasting
1940s Deming, Stephan: statistics
1960s Stone: economics (RAS method)
1964 Sinkhorn: formulation of “matrix scaling”
1967 Sinkhorn–Knopp: algorithm
1970s Csiszár: information theory
2000s Wigderson et al.: computer science
2010s Cuturi: machine learning

Slide 14

Slide 14 text

Application: Transportation problem
(figure: bipartite diagram of suppliers and demanders; the edge from supplier i to demander j has cost C_ij; numeric supply/demand labels lost in extraction)

Slide 15

Slide 15 text

Application: Transportation problem
(figure: a feasible transport plan on the bipartite diagram; numeric labels lost in extraction)
cost = Σ_{i,j} C_ij · (amount shipped from i to j)

Slide 16

Slide 16 text

Application: Transportation problem
(figure: a feasible transport plan on the bipartite diagram; numeric labels lost in extraction)
cost = Σ_{i,j} C_ij · (amount shipped from i to j)
Find the minimum-cost transportation

Slide 17

Slide 17 text

Application: Transportation problem
C ∈ R_+^{m×n}: cost matrix
min_{P ≥ O} ⟨C, P⟩ s.t. P 1_n = 1_m/m, P^⊤ 1_m = 1_n/n
(figure: bipartite diagram with supplies 1/m, demands 1/n, edge costs C_ij)

Slide 18

Slide 18 text

Application: Transportation problem
C ∈ R_+^{m×n}: cost matrix
min_{P ≥ O} ⟨C, P⟩ s.t. P 1_n = 1_m/m, P^⊤ 1_m = 1_n/n
Entropic regularization [Wilson, 1969]
min_{P ≥ O} ⟨C, P⟩ − ε H(P), where H(P) = −Σ_{i,j} P_ij log P_ij, s.t. P 1_n = 1_m/m, P^⊤ 1_m = 1_n/n
→ The optimal solution is unique and of the form P = LAR, where L, R are nonnegative diagonal and A_ij = exp(−C_ij/ε): matrix scaling!
• Heavily used in ML (Wasserstein distance) [Peyré and Cuturi, 2019]
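The reduction to matrix scaling can be checked directly: build the Gibbs kernel A_ij = exp(−C_ij/ε) and Sinkhorn-scale it to the uniform marginals. A toy sketch in plain Python; the cost matrix, ε, and iteration count are arbitrary choices for illustration.

```python
# Sketch: entropic-regularized optimal transport via matrix scaling.
# Build A_ij = exp(-C_ij / eps), then alternately normalize rows to 1/m
# and columns to 1/n; the scaled matrix is the optimal regularized plan P.
import math

C = [[0.0, 1.0], [1.0, 0.0]]  # toy cost matrix (m = n = 2)
eps = 0.1                     # regularization strength (smaller -> closer to the LP)
m, n = len(C), len(C[0])

P = [[math.exp(-c / eps) for c in row] for row in C]
for _ in range(500):
    for i in range(m):                       # rows -> 1/m
        s = sum(P[i])
        P[i] = [p / (m * s) for p in P[i]]
    for j in range(n):                       # columns -> 1/n
        s = sum(P[i][j] for i in range(m))
        for i in range(m):
            P[i][j] /= n * s

cost = sum(C[i][j] * P[i][j] for i in range(m) for j in range(n))
print(cost)  # small: almost all mass sits on the zero-cost diagonal
```

With small ε the plan concentrates on the cheap entries, approximating the unregularized transportation optimum.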

Slide 19

Slide 19 text

Approximate scaling
Input: nonnegative matrix A ∈ R_+^{m×n}, ε > 0
Find: positive diagonal matrices L_ε ∈ R^{m×m}, R_ε ∈ R^{n×n} s.t.
‖(L_ε A R_ε) 1_n − 1_m/m‖ < ε and ‖(L_ε A R_ε)^⊤ 1_m − 1_n/n‖ < ε
• A is approximately scalable ⇐⇒ (def) a solution exists for every ε > 0.

Slide 20

Slide 20 text

Approximate scaling and bipartite matching
A: a 3×3 nonnegative matrix (numeric entries lost in extraction) whose nonzero pattern defines a bipartite support graph between rows {a, b, c} and columns; the support graph shown has a perfect matching.
Theorem (R. Sinkhorn and Knopp (1967)). A is approximately scalable ⇐⇒ the support graph has a perfect matching.
Linear algebra ←→ Graph theory
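The graph-theoretic side of the criterion is easy to test directly: build the support graph from the nonzero pattern and search for a perfect matching. A sketch using the standard augmenting-path bipartite matching; the sample matrices are made up.

```python
# Sketch: decide approximate scalability of a square nonnegative matrix via
# the Sinkhorn-Knopp criterion: the bipartite support graph (edge i-j iff
# A_ij > 0) must have a perfect matching. Kuhn's augmenting-path algorithm.

def has_perfect_matching(A):
    n = len(A)
    match = [-1] * n  # match[j] = row currently matched to column j

    def augment(i, seen):
        for j in range(n):
            if A[i][j] > 0 and j not in seen:
                seen.add(j)
                # Column j is free, or its row can be rerouted elsewhere.
                if match[j] == -1 or augment(match[j], seen):
                    match[j] = i
                    return True
        return False

    return all(augment(i, set()) for i in range(n))

# Nonzero pattern with a perfect matching -> approximately scalable
print(has_perfect_matching([[1, 1, 0], [0, 1, 0], [0, 1, 1]]))  # True
# All nonzeros of rows 1-2 in one column -> no perfect matching -> not scalable
print(has_perfect_matching([[0, 1, 0], [0, 1, 0], [0, 1, 1]]))  # False
```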

Slide 21

Slide 21 text

Operator scaling

Slide 22

Slide 22 text

Operator scaling
Noncommutative/quantum generalization of matrix scaling.
Applications
• Edmonds problem [Gurvits, 2004; Ivanyos, Qiao, and Subrahmanyam, 2017; Ivanyos, Qiao, and Subrahmanyam, 2018; Garg et al., 2016]
• Brascamp–Lieb inequalities [Garg et al., 2018]
• Quantum Schrödinger bridge [Georgiou and Pavon, 2015]
• Multivariate scatter estimation [Franks and Moitra, 2020]
• Computational invariant theory [Allen-Zhu et al., 2018]

Slide 23

Slide 23 text

Operator scaling
• A linear map Φ : C^{n×n} → C^{m×m} is completely positive (CP) if Φ(X) = Σ_{i=1}^k A_i X A_i† for some A_1, ..., A_k ∈ C^{m×n}.
• The dual map of the above CP map is Φ*(X) = Σ_{i=1}^k A_i† X A_i.
• For nonsingular Hermitian matrices L, R, the scaled map Φ_{L,R} is Φ_{L,R}(X) = L Φ(R† X R) L†.

Slide 24

Slide 24 text

Operator scaling
Input: CP map Φ : C^{n×n} → C^{m×m}
Output: nonsingular Hermitian matrices L, R s.t.
Φ_{L,R}(I_n) = I_m/m and Φ*_{L,R}(I_m) = I_n/n.
Note: constants changed from Gurvits’ original “doubly stochastic” formulation

Slide 25

Slide 25 text

Approximate scaling
Input: CP map Φ : C^{n×n} → C^{m×m}, ε > 0
Find: nonsingular Hermitian matrices L_ε, R_ε s.t.
‖Φ_{L_ε,R_ε}(I_n) − I_m/m‖ < ε and ‖Φ*_{L_ε,R_ε}(I_m) − I_n/n‖ < ε.
• Φ is approximately scalable ⇐⇒ (def) a solution exists for every ε > 0.

Slide 26

Slide 26 text

Matrix scaling ⊆ Operator scaling
For A ∈ R_+^{m×n}, define A_ij = √(a_ij) e_i e_j† for each i, j and the CP map
Φ(X) = Σ_{i,j} A_ij X A_ij† = Σ_{i,j} a_ij e_i e_j† X e_j e_i†.
Then,
Φ(I) = Σ_{i,j} a_ij e_i e_j† e_j e_i† = Σ_i (Σ_j a_ij) e_i e_i† = Diag(A 1_n)
Φ*(I) = Σ_{i,j} a_ij e_j e_i† e_i e_j† = Σ_j (Σ_i a_ij) e_j e_j† = Diag(A^⊤ 1_m)
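This calculation is easy to verify numerically. The sketch below uses the simplified diagonal form derived on the slide (since e_j† X e_j = X_jj, Φ(X) = Σ_ij a_ij X_jj e_i e_i†) rather than explicit Kraus operators; the 2×3 matrix is an arbitrary example.

```python
# Sketch: verify that the CP map built from A_ij = sqrt(a_ij) e_i e_j^T
# satisfies Phi(I) = Diag(A 1_n) and Phi*(I) = Diag(A^T 1_m).

a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # arbitrary nonnegative 2x3 matrix
m, n = len(a), len(a[0])

def phi(X):
    """Phi(X) = sum_ij a_ij (e_j^T X e_j) e_i e_i^T  (an m x m diagonal matrix)."""
    return [[sum(a[i][j] * X[j][j] for j in range(n)) if i == k else 0.0
             for k in range(m)] for i in range(m)]

def phi_dual(Y):
    """Phi*(Y) = sum_ij a_ij (e_i^T Y e_i) e_j e_j^T  (an n x n diagonal matrix)."""
    return [[sum(a[i][j] * Y[i][i] for i in range(m)) if j == k else 0.0
             for k in range(n)] for j in range(n)]

I_n = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
I_m = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]

print([phi(I_n)[i][i] for i in range(m)])       # row sums of a: [6.0, 15.0]
print([phi_dual(I_m)[j][j] for j in range(n)])  # column sums of a: [5.0, 7.0, 9.0]
```

So operator scaling of this Φ reduces exactly to matrix scaling of a.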

Slide 27

Slide 27 text

Operator Sinkhorn algorithm [Gurvits, 2004]
W.l.o.g. assume that Φ*(I_m) = I_n/n.
Φ^(0) = Φ
Φ^(2k+1) = (Φ^(2k))_{L, I_n} where L = (1/√m) (Φ^(2k)(I_n))^{−1/2},
Φ^(2k+2) = (Φ^(2k+1))_{I_m, R} where R = (1/√n) ((Φ^(2k+1))*(I_m))^{−1/2}.
Theorem (Gurvits (2004)). If Φ is approximately scalable, then Φ^(k) converges to a solution.

Slide 28

Slide 28 text

Application: Edmonds problem
Given: A = x_1 A_1 + ··· + x_k A_k (A_1, ..., A_k: matrices, x_1, ..., x_k: scalar variables)
Determine: is det(A) = 0 as a polynomial?
• If one can use randomness, it is easy! Can we do it deterministically? → P vs BPP (major open problem in complexity theory)
• Deep connection to combinatorial optimization and complexity theory [Edmonds, 1967; Lovász, 1989; Murota, 2000]
• For some wide classes of A (and over C), one can solve it by operator scaling [Gurvits, 2004]
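The "easy with randomness" claim is the standard Schwartz–Zippel test: substitute random scalars for the x_i and evaluate the determinant; if det(A) is a nonzero polynomial, a random substitution from a large set is nonzero with high probability. A minimal sketch in exact rational arithmetic; the example matrices are made up.

```python
# Sketch: randomized test for the Edmonds problem via Schwartz-Zippel.
import random
from fractions import Fraction

def det(M):
    """Determinant by Gaussian elimination (exact with Fractions)."""
    M = [row[:] for row in M]
    n = len(M)
    sign = 1
    for c in range(n):
        pivot = next((r for r in range(c, n) if M[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != c:
            M[c], M[pivot] = M[pivot], M[c]
            sign = -sign
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n):
                M[r][j] -= f * M[c][j]
    result = Fraction(sign)
    for i in range(n):
        result *= M[i][i]
    return result

def probably_nonsingular(As, trials=10, field_size=10**6):
    """Test whether det(x_1 A_1 + ... + x_k A_k) is not identically zero."""
    n = len(As[0])
    for _ in range(trials):
        xs = [Fraction(random.randint(1, field_size)) for _ in As]
        A = [[sum(x * Ai[r][c] for x, Ai in zip(xs, As)) for c in range(n)]
             for r in range(n)]
        if det(A) != 0:
            return True   # certificate: det is a nonzero polynomial
    return False          # likely identically zero

# x1*E11 + x2*E22: det = x1*x2, generically nonsingular
A1 = [[Fraction(1), Fraction(0)], [Fraction(0), Fraction(0)]]
A2 = [[Fraction(0), Fraction(0)], [Fraction(0), Fraction(1)]]
print(probably_nonsingular([A1, A2]))  # True

# x1*E11 + x2*E12: second row identically zero, so det = 0
A3 = [[Fraction(0), Fraction(1)], [Fraction(0), Fraction(0)]]
print(probably_nonsingular([A1, A3]))  # False
```

Derandomizing this test is exactly the P vs BPP question the slide alludes to.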

Slide 29

Slide 29 text

Application: Edmonds problem
A: a 3×3 matrix whose entries are single variables x_i or zero (numeric layout lost in extraction); its nonzero pattern defines a bipartite support graph with a perfect matching.
Theorem (Edmonds (1967)). Assume that each of A_1, ..., A_k has exactly one nonzero entry. Then, det(A) ≠ 0 ⇐⇒ the support graph has a perfect matching.
Linear algebra ←→ Graph theory

Slide 30

Slide 30 text

Application: Edmonds problem
Theorem (Lovász’s weak duality). For A = x_1 A_1 + ··· + x_k A_k,
rank A ≤ min { n − dim U + dim(Σ_i A_i U) : U ≤ C^n subspace }
• If A_1, ..., A_k are rank-one, equality holds. Furthermore, one can compute rank A and a minimizer U by linear matroid intersection.
• Lovász (1989) also gave several families of A s.t. one can compute rank A via combinatorial optimization.
Linear algebra ←→ Combinatorial optimization

Slide 31

Slide 31 text

Application: Edmonds problem
Theorem (Gurvits (2004)). For A = x_1 A_1 + ··· + x_k A_k, consider the CP map Φ(X) = Σ_{i=1}^k A_i X A_i†. Then,
[∃X ≻ O s.t. rank Φ(X) < rank X] ⇒ det A = 0.
Furthermore, the former condition can be checked efficiently by operator scaling.
• Gurvits (2004) gave several classes of A s.t. the converse is also true.

Slide 32

Slide 32 text

Noncommutative rank
Lovász’s weak duality has deep connections to algebra...
• The noncommutative rank (nc-rank) is the rank of A = A_1 x_1 + ··· + A_k x_k, where the x_i are noncommutative variables (i.e., x_i x_j ≠ x_j x_i)
• Highly nontrivial to define! (needs deep algebra, e.g. free skew fields, matrix rings, ...)
Theorem (Amitsur (1966), Cohn, Gurvits (2004), and Garg et al. (2016)). nc-rank A = min{ n − dim U + dim(Σ_i A_i U) : U ≤ C^n subspace }. Furthermore, the nc-rank can be computed via operator scaling.
The noncommutative Edmonds problem is in P!!

Slide 33

Slide 33 text

Application: Brascamp–Lieb inequalities
B = (B_1, ..., B_k): tuple of matrices, B_i: n_i × n, p = (p_1, ..., p_k) ∈ R_{++}^k
Theorem (Brascamp and Lieb (1976)). There exists C ∈ (0, +∞] s.t.
∫_{R^n} Π_{i=1}^k (f_i(B_i x))^{p_i} dx ≤ C · Π_{i=1}^k ( ∫_{R^{n_i}} f_i(x_i) dx_i )^{p_i}
for all integrable nonnegative functions f_1, ..., f_k.
Includes Hölder, Loomis–Whitney, etc.

Slide 34

Slide 34 text

BL datum and operator scaling
Given B = (B_1, ..., B_k) and p, let BL(B, p) be the smallest C s.t. the BL inequality holds.
Theorem (Lieb (1990)).
BL(B, p) = sup_{X_1,...,X_k ≻ O} ( Π_{i=1}^k det(X_i)^{p_i} / det(Σ_{i=1}^k p_i B_i† X_i B_i) )^{1/2}
BL(B, p) can be computed by operator scaling [Garg et al., 2018]

Slide 35

Slide 35 text

Summary

Slide 36

Slide 36 text

Summary of Part I
• Scaling problems are linear algebraic problems arising in surprisingly many fields.
• Matrix scaling: optimal transport, statistics, machine learning, combinatorial optimization, ...
• Operator scaling: combinatorial optimization, computational complexity, noncommutative algebra, analysis, ...
• Both can be solved by a simple alternating algorithm (Sinkhorn)