The low-rank basis problem for a matrix subspace

Tasuku Soma
ISMP 2015
July 12, 2015


Transcript

  1. The low-rank basis problem for a matrix subspace
     Tasuku Soma (Univ. Tokyo)
     Joint work with: Yuji Nakatsukasa (Univ. Tokyo), André Uschmajew (Univ. Bonn) 1 / 29

  3. The low-rank basis problem
     Low-rank basis problem: for a matrix subspace M ⊆ R^{m×n} spanned by M_1, ..., M_d ∈ R^{m×n},
         minimize   rank(X_1) + · · · + rank(X_d)
         subject to span{X_1, ..., X_d} = M.
     • Generalizes the sparse basis problem:
         minimize   ‖x_1‖_0 + · · · + ‖x_d‖_0
         subject to span{x_1, ..., x_d} = S ⊆ R^N.
     • Matrix singular values play the role of vector nonzero elements 4 / 29

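A tiny numpy illustration (not from the talk) of why the problem is nontrivial: the subspace is usually handed to us through a generic basis, whose members all have the maximal rank attainable in M even though a low-rank basis exists.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 20, 10, 3

# Hidden low-rank basis with ranks 1, 2, 3.
low_rank = [rng.standard_normal((m, r)) @ rng.standard_normal((r, n)) for r in (1, 2, 3)]

# A generic basis of the same subspace: a random invertible mix of the low-rank one.
C = rng.standard_normal((d, d))
generic = [sum(C[k, l] * low_rank[l] for l in range(d)) for k in range(d)]

def sum_of_ranks(basis, tol=1e-10):
    return sum(np.linalg.matrix_rank(X, tol=tol) for X in basis)

print(sum_of_ranks(low_rank))  # 6 = 1 + 2 + 3, the minimum
print(sum_of_ranks(generic))   # typically 18: each mixed matrix has the maximal rank in M
```
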
  5. Scope
     basis problems:          low-rank basis   |  sparse basis (Coleman-Pothen 86)
     single-element problems: low-rank matrix  |  sparse vector (Qu-Sun-Wright 14)
     • The sparse vector problem is NP-hard [Coleman-Pothen 1986]
     • Related studies: dictionary learning [Sun-Qu-Wright 14], sparse PCA [Spielman-Wang-Wright], [Demanet-Hand 14] 5 / 29

  6. Applications
     • memory-efficient representation of a matrix subspace
     • matrix compression beyond SVD
     • dictionary learning
     • string theory: rank-deficient matrix in a rectangular subspace
     • image separation
     • accurate eigenvector computation
     • maximum-rank completion (discrete mathematics)
     • ... 6 / 29

  7. Abstract greedy algorithm
     Algorithm 1: Greedy meta-algorithm for computing a low-rank basis
       Input: subspace M ⊆ R^{m×n} of dimension d.
       Output: basis B = {X*_1, ..., X*_d} of M.
       Initialize B = ∅.
       for ℓ = 1, ..., d do
         Find X*_ℓ ∈ M of lowest possible rank s.t. B ∪ {X*_ℓ} is linearly independent.
         B ← B ∪ {X*_ℓ}
     • If each step is successful, this finds the required basis! 8 / 29

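A minimal Python sketch of this meta-algorithm (illustrative, not the authors' implementation). The oracle `lowest_rank_element` is an assumed stand-in for the Phase I/II machinery described later; it is expected to close over a representation of M and return a lowest-rank element keeping the collected set linearly independent.

```python
import numpy as np

def is_linearly_independent(mats, tol=1e-10):
    """Check linear independence of matrices via their vectorizations."""
    V = np.column_stack([X.ravel() for X in mats])
    return np.linalg.matrix_rank(V, tol=tol) == len(mats)

def greedy_low_rank_basis(d, lowest_rank_element):
    """Greedy meta-algorithm: repeatedly ask the oracle for a lowest-rank element of M
    such that the collected set stays linearly independent."""
    B = []
    for _ in range(d):
        X = lowest_rank_element(B)   # the hard step (NP-hard in general)
        assert is_linearly_independent(B + [X])
        B.append(X)
    return B
```
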
  9. Greedy algorithm: lemma
     Lemma. X*_1, ..., X*_d: output of the greedy algorithm. For any ℓ ∈ {1, ..., d} and
     lin. indep. {X_1, ..., X_ℓ} ⊆ M with rank(X_1) ≤ · · · ≤ rank(X_ℓ),
         rank(X_i) ≥ rank(X*_i) for i = 1, ..., ℓ.
     Proof. If rank(X_ℓ) < rank(X*_ℓ), then rank(X_i) < rank(X*_ℓ) for i ≤ ℓ. But since one X_i
     must be linearly independent from X*_1, ..., X*_{ℓ−1}, this contradicts the choice of X*_ℓ.
     (Adaptation of a standard argument from matroid theory.) 9 / 29

  10. Greedy algorithm: justification
      Theorem. X*_1, ..., X*_d: output of the greedy algorithm. Then a lin. indep.
      {X_1, ..., X_ℓ} ⊆ M is of minimal rank iff rank(X_i) = rank(X*_i) for i = 1, ..., ℓ.
      In particular, {X*_1, ..., X*_ℓ} is of minimal rank.
      • Analogous result for the sparse basis problem in [Coleman, Pothen 1986] 10 / 29

  12. The single matrix problem
          minimize   rank(X)
          subject to X ∈ M \ {0}.
      • NP-hard of course (since the sparse vector problem is)
      Nuclear norm heuristic (‖A‖_* := Σ_i σ_i(A)):
          minimize   ‖X‖_*
          subject to X ∈ M, ‖X‖_F = 1.
      NOT a convex relaxation, due to the non-convex constraint ‖X‖_F = 1. 11 / 29

  13. Algorithm Outline (for the single matrix problem)
      Phase I: rank estimate
          Y = S_τ(X),   X = P_M(Y) / ‖P_M(Y)‖_F
        iterate until rank(Y) converges
      Phase II: alternating projection
          Y = T_r(X),   X = P_M(Y) / ‖P_M(Y)‖_F
        with estimated r = rank(Y) 12 / 29

  16. Shrinkage operator
      Shrinkage operator (soft thresholding) for X = UΣV^T:
          S_τ(X) = U S_τ(Σ) V^T,   S_τ(Σ) = diag(σ_1 − τ, ..., σ_rank(X) − τ)_+
      Fixed-point iteration
          Y = S_τ(X),   X = P_M(Y) / ‖P_M(Y)‖_F
      Interpretation [Cai, Candès, Shen 2010], [Qu, Sun, Wright @NIPS 2014]:
      block coordinate descent (a.k.a. alternating direction) for
          minimize_{X,Y}  τ‖Y‖_* + (1/2)‖Y − X‖²_F
          subject to      X ∈ M,  ‖X‖_F = 1.
      [Qu, Sun, Wright @NIPS 2014]: analogous method for the sparsest vector. 13 / 29

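A minimal Python/numpy sketch of this Phase I iteration, under assumptions of its own (not the talk's code): M is represented by an (m·n × d) matrix Q whose orthonormal columns are vectorized basis matrices, so P_M(Y) is Q Q^T vec(Y) reshaped; the threshold τ, starting point, and stopping rule are illustrative choices.

```python
import numpy as np

def shrink(X, tau):
    """Soft-threshold the singular values: S_tau(X) = U diag((sigma_i - tau)_+) V^T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def project_onto_M(Y, Q, shape):
    """P_M(Y), with Q an (m*n x d) matrix whose orthonormal columns span vec(M)."""
    return (Q @ (Q.T @ Y.ravel())).reshape(shape)

def phase1_rank_estimate(Q, shape, tau=0.5, max_iter=500, seed=0):
    """Fixed-point iteration Y = S_tau(X), X = P_M(Y)/||P_M(Y)||_F; returns X and rank(Y)."""
    rng = np.random.default_rng(seed)
    X = project_onto_M(rng.standard_normal(shape), Q, shape)
    X /= np.linalg.norm(X)
    for _ in range(max_iter):
        Y = shrink(X, tau)
        P = project_onto_M(Y, Q, shape)
        nrm = np.linalg.norm(P)
        if nrm < 1e-14:        # tau too aggressive: everything was shrunk away
            break
        X = P / nrm
    return X, np.linalg.matrix_rank(Y, tol=1e-8)
```
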
  17. The use as a rank estimator
          Y = S_τ(X),   X = P_M(Y) / ‖P_M(Y)‖_F
      • The fixed point of Y would be a matrix of low rank r, which is close to, but not in, M if r > 1.
      • Otherwise, it would be a fixed point of Y = S_τ(Y) / ‖S_τ(Y)‖_F, which can hold only for
        rank-one matrices.
      • The fixed point of X usually has full rank, with "too large" σ_i.
      ⇒ Need further improvement, but accept r as the rank estimate. 14 / 29

  20. Obtaining solution: truncation operator
      Truncation operator (hard thresholding) for X = UΣV^T:
          T_r(X) = U T_r(Σ) V^T,   T_r(Σ) = diag(σ_1, ..., σ_r, 0, ..., 0)
      Fixed-point iteration
          Y = T_r(X),   X = P_M(Y) / ‖P_M(Y)‖_F
      Interpretation: alternating projection method for finding
          X* ∈ {X ∈ M : ‖X‖_F = 1} ∩ {Y : rank(Y) ≤ r}. 16 / 29

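A matching sketch of Phase II (again illustrative, using the same vectorized representation of M as in the Phase I sketch above; the rank r and the starting matrix X0 come from Phase I).

```python
import numpy as np

def truncate(X, r):
    """Hard-threshold to the r leading singular values: T_r(X)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s[r:] = 0.0
    return (U * s) @ Vt

def phase2_alternating_projection(Q, shape, r, X0, max_iter=1000, tol=1e-14):
    """Alternate Y = T_r(X) and X = P_M(Y)/||P_M(Y)||_F, with Q an (m*n x d)
    orthonormal basis of vec(M) and X0 the matrix returned by Phase I."""
    X = X0
    for _ in range(max_iter):
        Y = truncate(X, r)
        P = (Q @ (Q.T @ Y.ravel())).reshape(shape)   # P_M(Y)
        X_new = P / np.linalg.norm(P)
        if np.linalg.norm(X_new - X) < tol:
            return X_new
        X = X_new
    return X
```
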
  21. Greedy algorithm: pseudocode
      Algorithm 2: Greedy algorithm for computing a low-rank basis
        Input: basis M_1, ..., M_d ∈ R^{m×n} for M
        Output: low-rank basis X_1, ..., X_d of M
        for ℓ = 1, ..., d do
          Phase I on X_ℓ, obtain rank estimate r.
          Phase II on X_ℓ with rank r, obtain X_ℓ ∈ M of rank r.
      • To force linear independence, restarting is sometimes necessary:
        X_ℓ is always initialized and restarted in span{X_1, ..., X_{ℓ−1}}^⊥ ∩ M.
      • The Phase I output X_ℓ is the Phase II input 17 / 29

  22. Observed convergence (single initial guess)
      [Convergence plot] m = 20, n = 10, d = 5, exact ranks: (1, 2, 3, 4, 5).
      • ranks recovered in “wrong” order (2, 1, 5, 3, 4) 19 / 29

  24. Local convergence of Phase II
      R_r := {X : rank(X) = r},   B := {X ∈ M : ‖X‖_F = 1}
      T_{X*}(N): tangent space of the manifold N at X*
      Theorem. Assume X* ∈ R_r ∩ B (so rank(X*) = r, with r the rank input to Phase II) and
      T_{X*}R_r ∩ T_{X*}B = {0}. Then Phase II is locally linearly convergent:
          ‖X_new − X*‖_F ≲ cos θ · ‖X − X*‖_F
      • Follows from a meta-theorem on alternating projections in nonlinear optimization
        [Lewis, Luke, Malick 2009]
      • We provide a “direct” linear algebra proof
      • The assumption holds if X* is an isolated rank-r matrix in M 21 / 29

  25. Local convergence: intuition
      [Two sketches of the tangent spaces T_{X*}B and T_{X*}R_r meeting at X*, one with
       cos θ ≈ 1/√2 and one with cos θ ≈ 0.9]
          ‖X_new − X*‖_F ≤ cos θ · ‖X − X*‖_F + O(‖X − X*‖²_F)
      θ ∈ (0, π/2]: subspace angle between T_{X*}B and T_{X*}R_r,
          cos θ = max_{X ∈ T_{X*}B, Y ∈ T_{X*}R_r} |⟨X, Y⟩_F| / (‖X‖_F ‖Y‖_F). 22 / 29

  26. Results for synthetic data
      Table: m = n = 20, d = 5, random initial guess.
        exact ranks         av. sum(ranks)   av. Phase I err (iter)   av. Phase II err (iter)
        (1, 1, 1, 1, 1)      5.05            2.59e-14 (55.7)          7.03e-15 (0.4)
        (2, 2, 2, 2, 2)     10.02            4.04e-03 (58.4)          1.04e-14 (9.11)
        (1, 2, 3, 4, 5)     15.05            6.20e-03 (60.3)          1.38e-14 (15.8)
        (5, 5, 5, 10, 10)   35.42            1.27e-02 (64.9)          9.37e-14 (50.1)
        (5, 5, 10, 10, 15)  44.59            2.14e-02 (66.6)          3.96e-05 (107)

      Table: Five random initial guesses.
        exact ranks         av. sum(ranks)   av. Phase I err (iter)   av. Phase II err (iter)
        (1, 1, 1, 1, 1)      5.00            6.77e-15 (709)           6.75e-15 (0.4)
        (2, 2, 2, 2, 2)     10.00            4.04e-03 (393)           9.57e-15 (9.0)
        (1, 2, 3, 4, 5)     15.00            5.82e-03 (390)           1.37e-14 (18.5)
        (5, 5, 5, 10, 10)   35.00            1.23e-02 (550)           3.07e-14 (55.8)
        (5, 5, 10, 10, 15)  44.20            2.06e-02 (829)           8.96e-06 (227) 24 / 29

  27. Link to tensor decomposition
      Rank-one basis: M = span{a_1 b_1^T, ..., a_d b_d^T}
      If M_1, ..., M_d is any basis, then
          M_k = Σ_{ℓ=1}^d c_{k,ℓ} a_ℓ b_ℓ^T   (k = 1, ..., d)
          ⟺  T = Σ_{ℓ=1}^d a_ℓ ∘ b_ℓ ∘ c_ℓ,
      where T is the third-order tensor with slices M_k.  ⇒ rank(T) = d.
      Suggests finding a rank-one basis using CP decomposition algorithms
      (ALS, generalized eigenvalue, ...), but CP is not enough for the higher-rank case 26 / 29

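A small numpy check of this correspondence (illustrative, with made-up dimensions): stacking the M_k as slices of a third-order tensor reproduces the CP form Σ_ℓ a_ℓ ∘ b_ℓ ∘ c_ℓ.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, d = 6, 5, 3

A = rng.standard_normal((m, d))     # columns a_l
B = rng.standard_normal((n, d))     # columns b_l
C = rng.standard_normal((d, d))     # coefficients c_{k,l}

# Basis matrices M_k = sum_l c_{k,l} a_l b_l^T, stacked as slices T[:, :, k].
M_slices = [sum(C[k, l] * np.outer(A[:, l], B[:, l]) for l in range(d)) for k in range(d)]
T = np.stack(M_slices, axis=2)

# CP form: T = sum_l a_l (outer) b_l (outer) c_l, where c_l is the l-th column of C.
T_cp = np.einsum('il,jl,kl->ijk', A, B, C)
print(np.allclose(T, T_cp))         # True: the slices encode a CP decomposition with d terms
```
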
  28. Comparison with CP for rank-1 basis
      [Two plots: runtime (s) and error vs. matrix size n, for Phase I, Phase II, and Tensorlab]
      d = 10, varying m = n between 50 and 500. 27 / 29

  29. Comparison with CP for rank-1 basis
      [Two plots: runtime (s) and error vs. dimension d, for Phase I, Phase II, and Tensorlab]
      m = n = 10, varying d between 2 and 20. 28 / 29

  30. Summary and outlook
      Summary
      • low-rank basis problem
      • lots of applications
      • NP-hard
      • introduced a practical greedy algorithm
      Future work
      • further probabilistic analysis of Phase I
      • finding a low-rank tensor basis 29 / 29

  31. Application: matrix compression
      • classical SVD compression: achieves ≈ r/n compression, A ≈ U Σ V^T
      • further, suppose m = st and reshape the i-th column of U,
        u_i = (u_{i,1}, u_{i,2}, ..., u_{i,st})^T, into an s × t matrix:
            [ u_{i,1}  u_{i,s+1}  ···  u_{i,s(t−1)+1} ]
            [   ...       ...              ...        ]  ≈  U_{u,i} Σ_{u,i} V_{u,i}^T
            [ u_{i,s}  u_{i,2s}   ···  u_{i,st}       ]
      • Compression: ≈ r_1 r_2 / n^2, a “squared” reduction 30 / 29

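A numpy sketch of the reshaping idea (illustrative; the column-major matricization and the helper name compress_column are assumptions, not the talk's code).

```python
import numpy as np

def compress_column(u, s, t, r2):
    """Matricize a length-(s*t) vector column-major into an s x t matrix (so u_1..u_s
    form the first column, as on the slide) and keep a rank-r2 truncated SVD of it."""
    Mu = u.reshape((s, t), order='F')
    U, sig, Vt = np.linalg.svd(Mu, full_matrices=False)
    return U[:, :r2], sig[:r2], Vt[:r2, :]   # ~ r2*(s + t) numbers instead of s*t

# Toy check: a column that is exactly rank one after matricization.
s, t = 8, 6
u = np.kron(np.cos(np.arange(t)), np.sin(1.0 + np.arange(s)))   # vec of an outer product
Uc, sig, Vt = compress_column(u, s, t, r2=1)
print(np.allclose((Uc * sig) @ Vt, u.reshape((s, t), order='F')))   # True
```
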
  32. Finding the right “basis” for storage reduction
      U = [U_1, ..., U_r]; ideally, each column U_i = [u_{i,1}, u_{i,2}, ..., u_{i,st}]^T has
      low-rank structure when matricized:
            [ u_{i,1}  u_{i,s+1}  ···  u_{i,s(t−1)+1} ]
            [   ...       ...              ...        ]  ≈  U_{u,i} V_{u,i}^T    for i = 1, ..., r
            [ u_{i,s}  u_{i,2s}   ···  u_{i,st}       ]
      • More realistically: ∃ Q ∈ R^{r×r} s.t. UQ has this property
        ⇒ finding a low-rank basis for the matrix subspace spanned by mat(U_1), mat(U_2), ..., mat(U_r) 31 / 29

  33. Compressible matrices
      • FFT matrix (e^{2πij/n}): each mat(column) is rank-1
      • any circulant matrix eigenspace
      • graph Laplacian eigenvectors:
        • rank 2: ladder, circular
        • rank 3: binary tree, cycle, path, wheel
        • rank 4: lollipop
        • rank 5: barbell
        • ...
      ⇒ explanation is an open problem; fast FFT-like algorithms? 32 / 29

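A quick numpy check of the first bullet (illustrative; the n = s·t split and column-major reshape are assumptions): every matricized DFT column has numerical rank one.

```python
import numpy as np

s, t = 8, 8
n = s * t
idx = np.arange(n)
F = np.exp(2j * np.pi * np.outer(idx, idx) / n)   # DFT matrix, F[j, k] = e^{2*pi*i*j*k/n}

# Matricize each column (column-major, s x t) and check its numerical rank.
ranks = {np.linalg.matrix_rank(F[:, k].reshape((s, t), order='F'), tol=1e-8) for k in range(n)}
print(ranks)   # {1}: every matricized DFT column is (numerically) rank one
```
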
  34. Eigenvectors of multiple eigenvalues
          A [x_1, x_2, ..., x_k] = λ [x_1, x_2, ..., x_k]
      • an eigenvector x with Ax = λx is not determined uniquely + non-differentiable
      • numerical practice: content with computing span(x_1, x_2)
      • extreme example: I, any vector is an eigenvector!
      • but perhaps the standard unit vectors {e_1, e_2, ..., e_n} (each with a single entry 1)
        are a “good” set of eigenvectors
      • why? low-rank, low-memory! 33 / 29

  35. Computing eigenvectors of multiple eigenvalues
      F: FFT matrix,
          A = (1/n) F diag(1 + ε, 1 + ε, 1 + ε, 1 + ε, 1 + ε, 6, ..., n²) F*,   ε = O(10⁻¹⁰)
      • cluster of five eigenvalues ≈ 1
      • “exact”, low-rank eigenvectors: the first five columns of F

        Table: Eigenvector accuracy, before and after finding the low-rank basis.
                        v1        v2        v3        v4        v5        memory
          eig           4.2e-01   1.2e+00   1.4e+00   1.4e+00   1.5e+00   O(n²)
          eig + Alg. 2  1.2e-12   1.2e-12   1.2e-12   1.2e-12   2.7e-14   O(n)

      • MATLAB’s eig fails to find accurate eigenvectors.
      • Accurate eigenvectors are obtained by finding a low-rank basis. 34 / 29

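A rough numpy sketch of this experimental setup (not the talk's code: numpy.linalg.eigh stands in for MATLAB's eig, the matrix size, the well-separated eigenvalues, and the matricization shape are stand-in assumptions). It shows why a generic eigensolver loses the low-rank structure inside the cluster.

```python
import numpy as np

# A Hermitian matrix with a cluster of five eigenvalues near 1 whose "exact"
# eigenvectors are the first five DFT columns.
s = t = 8
n = s * t
idx = np.arange(n)
F = np.exp(2j * np.pi * np.outer(idx, idx) / n)            # unnormalized DFT matrix, F F* = n I
eps = 1e-10
evals = np.concatenate([np.full(5, 1.0 + eps),             # the cluster
                        np.linspace(6.0, float(n) ** 2, n - 5)])  # well separated (stand-ins)
A = (F * evals) @ F.conj().T / n                           # A = (1/n) F diag(evals) F*

def matricized_rank(v):
    return np.linalg.matrix_rank(v.reshape((s, t), order='F'), tol=1e-6)

w, V = np.linalg.eigh(A)                                   # generic dense eigensolver
cluster = V[:, :5]                                         # ascending order: the cluster comes first
print([matricized_rank(cluster[:, i]) for i in range(5)])  # typically > 1: mixed inside the cluster
print([matricized_rank(F[:, i] / np.sqrt(n)) for i in range(5)])  # [1, 1, 1, 1, 1]
```
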
  36. Restart
      Algorithm 3: Restart for linear independence
        Input: orthogonal projection Q_{ℓ−1} onto M ∩ span(X*_1, ..., X*_{ℓ−1})^⊥,
               current matrix X_ℓ, and tolerance restarttol > 0.
        Output: eventually replaced X_ℓ.
        if ‖Q_{ℓ−1}(X_ℓ)‖_F < restarttol (e.g. 0.01) then
          replace X_ℓ by a random element in Range(Q_{ℓ−1})
          X_ℓ ← X_ℓ / ‖X_ℓ‖_F
      • Linear dependence is monitored by the projected norm ‖Q_{ℓ−1}(X_ℓ)‖_F 35 / 29

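A small Python sketch of this restart test (illustrative; Q_{ℓ−1} is represented here as an explicit projector acting on vectorized matrices, and the helper name is hypothetical).

```python
import numpy as np

def restart_if_dependent(X, P_perp, shape, restarttol=0.01, rng=None):
    """Restart test (sketch): P_perp is the (m*n x m*n) orthogonal projector onto
    M intersected with span{X*_1, ..., X*_{l-1}}^perp, acting on vectorized matrices.
    A small projected norm means X is nearly dependent on the matrices found so far,
    so X is replaced by a random unit-norm element of Range(P_perp)."""
    if rng is None:
        rng = np.random.default_rng()
    if np.linalg.norm(P_perp @ X.ravel()) < restarttol:
        X = (P_perp @ rng.standard_normal(P_perp.shape[0])).reshape(shape)
        X = X / np.linalg.norm(X)
    return X
```
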
  37. Linear algebra convergence proof
      T_{X*}R_r = { [U* U*^⊥] [ A  B ; C  0 ] [V* V*^⊥]^T : A, B, C arbitrary }.          (1)
      X − X* = E + O(‖X − X*‖²_F) with E ∈ T_{X*}B. Write E = [U* U*^⊥] [ A  B ; C  D ] [V* V*^⊥]^T.
      By (1), ‖D‖²_F ≥ sin²θ · ‖E‖²_F.
      ∃ F, G orthogonal s.t. X = F^T [ Σ* + A  0 ; 0  D ] G^T + O(‖E‖²_F).
      ‖T_r(X) − X*‖_F = ‖ [ A + Σ*  0 ; 0  0 ] − [ Σ*  −B ; −C  CΣ*^{−1}B ] ‖_F + O(‖E‖²_F)
                      = ‖ [ A  B ; C  0 ] ‖_F + O(‖E‖²_F).
      So
      ‖T_r(X) − X*‖_F / ‖X − X*‖_F = ( √(‖E‖²_F − ‖D‖²_F) + O(‖X − X*‖²_F) ) / ( ‖E‖_F + O(‖X − X*‖²_F) )
                                   ≤ √(1 − sin²θ) + O(‖X − X*‖²_F) = cos θ + O(‖X − X*‖²_F) 36 / 29

  38. Partial result for Phase I
      Corollary. If r = 1, then already Phase I is locally linearly convergent.
      Proof. In a neighborhood of a rank-one matrix, shrinkage and truncation are the same
      up to normalization.
      • In general, Phase I “sieves out” the non-dominant components to reveal the rank 37 / 29

  39. String theory problem
      Given A_1, ..., A_k ∈ R^{m×n}, find c_i ∈ R s.t.
          rank(c_1 A_1 + c_2 A_2 + · · · + c_k A_k) < n
          [A_1 A_2 ... A_k] Q = 0,   Q = [q_1, ..., q_{nk−m}]
      • finding a null vector that is rank-one when matricized
      • ⇒ lowest-rank problem from mat(q_1), ..., mat(q_{nk−m}) ∈ R^{n×k} (succeeds when the lowest rank is 1)
      • slow (or non-)convergence in practice
      • NP-hard? probably... but unsure (since it is a special case) 38 / 29

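A numpy sketch of the reduction step only (illustrative; finding a rank-one element in the span of the matricized null vectors is then exactly the single-matrix problem above).

```python
import numpy as np

def matricized_null_basis(A_list):
    """Stack [A_1 ... A_k] (m x nk), take a basis q_1, ..., q_{nk-m} of its null space,
    and matricize each q as an n x k matrix whose i-th column is the block multiplying A_i.
    A rank-one element x c^T in the span of these matrices gives sum_i c_i A_i x = 0,
    i.e. rank(c_1 A_1 + ... + c_k A_k) < n."""
    m, n = A_list[0].shape
    k = len(A_list)
    stacked = np.hstack(A_list)                   # m x (n*k)
    _, sv, Vt = np.linalg.svd(stacked)            # null space = trailing right singular vectors
    rank = int(np.sum(sv > 1e-10 * sv[0]))
    return [q.reshape((n, k), order='F') for q in Vt[rank:]]

# Toy usage: random A_i with m = n*k - 3, so the null space is 3-dimensional.
rng = np.random.default_rng(2)
n, k = 4, 3
A_list = [rng.standard_normal((n * k - 3, n)) for _ in range(k)]
mats = matricized_null_basis(A_list)
print(len(mats), mats[0].shape)                   # 3 (4, 3): candidate subspace to search
```
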