
PCA with network/graph constraints


Daisuke Yoneoka

November 14, 2023

On PCA with network/graph constraints
Daisuke Yoneoka
October 15, 2018

1 Introduction and notations

We are interested in principal component analysis (PCA) with network constraints between features and/or samples. Given a data matrix $X \in \mathbb{R}^{p\times n}$, classical PCA can be formulated as learning a projection $V \in \mathbb{R}^{n\times d}$ (principal directions) of $X$ onto a $d$-dimensional linear space spanned by an orthonormal basis $U \in \mathbb{R}^{p\times d}$ (principal components); the product $L = UV^T \in \mathbb{R}^{p\times n}$ can be viewed as a low-rank approximation of $X$. PCA finds the optimal $U$ and $V$ by solving

$$\min_{U,V} \|X - UV^T\|_F^2 \quad \text{s.t.}\quad U^T U = I. \tag{1}$$

PCA provides an embedding of the data on a linear manifold. In contrast, Laplacian embedding provides an embedding of the data on a non-linear manifold and thus can preserve geometrical relationships [1]. Assume there is a graph $G = (S, E)$ with vertex set $S$ and pairwise edges $E$. From the graph we can compute the adjacency matrix $W$, which encodes the weights of the edges $E$. The graph Laplacian matrix $\Psi$ and the normalized graph Laplacian matrix $\Psi_{\mathrm{norm}}$ are then defined as $\Psi = D - W$ and $\Psi_{\mathrm{norm}} = D^{-1/2}(D - W)D^{-1/2} = I - D^{-1/2}WD^{-1/2}$, respectively, where $D$ is the degree matrix defined as $D = \operatorname{diag}(d_{ii})$ with $d_{ii} = \sum_j W_{ij}$. The Laplacian embedding finds $Q \in \mathbb{R}^{n\times d}$, the embedded coordinates of the $n$ data points, by solving

$$\min_{Q} \sum_{i,j=1}^{n} \|q_i - q_j\|^2 W_{ij} = \operatorname{tr}\{Q^T(D - W)Q\} = \operatorname{tr}\{Q^T \Psi Q\} \quad \text{s.t.}\quad Q^T Q = I, \tag{2}$$

where $W \in \mathbb{R}^{n\times n}$ is the adjacency matrix of a graph with $n$ nodes. Throughout this review, $\|\cdot\|_F$, $\|\cdot\|_*$, $\|\cdot\|_1$, and $\|\cdot\|_2$ denote the Frobenius, nuclear, $\ell_1$, and $\ell_2$ norms, respectively.
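As a concrete illustration of the notation above, the following minimal sketch (plain NumPy; the function name and the use of a dense adjacency matrix are illustrative choices, not taken from any of the reviewed papers) builds the unnormalized and normalized graph Laplacians from an adjacency matrix $W$.

```python
import numpy as np

def graph_laplacians(W):
    """Return the unnormalized and normalized graph Laplacians of a
    symmetric, non-negative adjacency matrix W (dense, n x n)."""
    d = W.sum(axis=1)                       # degrees d_ii = sum_j W_ij
    D = np.diag(d)
    Psi = D - W                             # Psi = D - W
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    Psi_norm = np.eye(W.shape[0]) - d_inv_sqrt @ W @ d_inv_sqrt
    return Psi, Psi_norm
```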
2 Graph-Laplacian PCA: Closed-form Solution and Robustness [5]

This is the first article to incorporate the data $X$ and the graph $G$ into PCA. The authors propose graph-Laplacian PCA (gLPCA) and a robust version of gLPCA (RgLPCA). They focus on the adjacency matrix $W \in \mathbb{R}^{n\times n}$ (or, equivalently, $\Psi$) containing the edge weights of a graph with $n$ nodes.

2.1 Objective function for gLPCA

$$\min_{U,V} \|X - UV^T\|_F^2 + \alpha \operatorname{tr}(V^T \Psi V) \quad \text{s.t.}\quad V^T V = I, \tag{3}$$

where $\alpha \ge 0$ is a parameter balancing the two parts.

2.2 Optimization for gLPCA

gLPCA has a closed-form solution:

Theorem 1. The optimal solution $(U, V)$ of gLPCA is given by
$$V^* = (v_1, \ldots, v_d), \tag{4}$$
$$U^* = XV^*, \tag{5}$$
where $v_1, \ldots, v_d$ are the eigenvectors corresponding to the $d$ smallest eigenvalues of the matrix
$$G_\alpha = -X^T X + \alpha\Psi. \tag{6}$$
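A minimal sketch of the closed-form solution in Theorem 1 (plain NumPy; the function name and the dense eigendecomposition are illustrative choices, not from the paper): form $G_\alpha$, take the eigenvectors of its $d$ smallest eigenvalues as $V$, and set $U = XV$.

```python
import numpy as np

def glpca(X, Psi, alpha, d):
    """Closed-form gLPCA (Theorem 1): X is p x n, Psi is the n x n graph
    Laplacian, alpha >= 0 balances reconstruction and graph smoothness."""
    G = -X.T @ X + alpha * Psi              # G_alpha = -X^T X + alpha * Psi
    evals, evecs = np.linalg.eigh(G)        # eigenvalues in ascending order
    V = evecs[:, :d]                        # eigenvectors of the d smallest eigenvalues
    U = X @ V                               # U* = X V*
    return U, V
```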
2.3 Objective function for RgLPCA

$$\min_{U,V} \|X - UV^T\|_{2,1} + \alpha \operatorname{tr}(V^T \Psi V) \quad \text{s.t.}\quad V^T V = I, \tag{7}$$

where $\|A\|_{2,1} = \sum_{j=1}^{n} \sqrt{\sum_{i=1}^{p} A_{ij}^2}$ [2].

2.4 Optimization for RgLPCA

RgLPCA is solved with an updating algorithm based on the Augmented Lagrange Multiplier (ALM) method. ALM rewrites (7) as

$$\min_{U,V,E} \|E\|_{2,1} + \operatorname{tr}\{C^T(E - X + UV^T)\} + \frac{\mu}{2}\|E - X + UV^T\|_F^2 + \alpha \operatorname{tr}\{V^T \Psi V\} \quad \text{s.t.}\quad E = X - UV^T,\; V^T V = I, \tag{8}$$

where $C$ is the Lagrange multiplier and $\mu$ is the penalty parameter. The algorithm is divided into two parts:

1. Solving the sub-problems. First, solve for $U, V$ while fixing $E$. After some calculation, (8) becomes
$$\min_{U,V} \frac{\mu}{2}\left\|E - X + UV^T + \frac{C}{\mu}\right\|_F^2 + \alpha \operatorname{tr}\{V^T \Psi V\} \quad (\text{since } \|A\|_F^2 = \operatorname{tr}(AA^T)), \tag{9}$$
which can be solved by Theorem 1. Second, solve for $E$ while fixing $U, V$; (8) becomes
$$\min_{E} \|E\|_{2,1} + \frac{\mu}{2}\left\|E - X + UV^T + \frac{C}{\mu}\right\|_F^2. \tag{10}$$
Let $e_i$ and $a_i$ be the $i$th columns of $E$ and $X - UV^T - \frac{C}{\mu}$, respectively. Then (10) decomposes into $n$ independent problems
$$\min_{e_i} \|e_i\|_2 + \frac{\mu}{2}\|e_i - a_i\|_2^2, \tag{11}$$
whose solution is known to be
$$e_i = \max\!\left(1 - \frac{1}{\mu\|a_i\|_2},\, 0\right)a_i \tag{12}$$
(a code sketch of this column-wise shrinkage appears at the end of this section).

2. Updating parameters. $C$ and $\mu$ are updated as follows:
$$C = C + \mu(E - X + UV^T), \tag{13}$$
$$\mu = \rho\mu, \tag{14}$$
where $\rho > 1$ is a step-size parameter.

2.5 What's new?

• This is the first article that advocates PCA accounting for the graph Laplacian.
• In summary, the idea is to add to standard PCA a (manifold/graph) regularization term, $\operatorname{tr}\{V^T \Psi V\}$, which encourages graph smoothness of the principal components $V$.
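A minimal sketch of the column-wise shrinkage (12) used in the RgLPCA E-update (the function name is an illustrative choice, not the paper's):

```python
import numpy as np

def l21_shrink(A, mu):
    """Solve min_E ||E||_{2,1} + (mu/2)||E - A||_F^2 column-wise, i.e. apply
    e_i = max(1 - 1/(mu * ||a_i||_2), 0) * a_i to every column a_i of A."""
    norms = np.linalg.norm(A, axis=0)                   # column norms ||a_i||_2
    scale = np.maximum(1.0 - 1.0 / (mu * np.maximum(norms, 1e-12)), 0.0)
    return A * scale                                    # broadcast over columns
```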
3 Robust Principal Component Analysis on Graphs [7]

This work generalizes the robust PCA framework by incorporating graph-based data similarity. Note that the classical PCA formulation is susceptible to errors and outliers in the data because of its quadratic term; robust PCA addresses this by using a sparsity-inducing penalty term. The first idea for robust PCA was proposed by Candès et al. [2].

3.1 Objective function for Robust PCA on Graphs

$$\min_{L,S} \|L\|_* + \lambda\|S\|_1 + \gamma \operatorname{tr}(L\Psi L^T) \quad \text{s.t.}\quad X = L + S, \tag{15}$$

where $S \in \mathbb{R}^{p\times n}$ is the sparse matrix, and $\lambda$ and $\gamma$ control the amount of sparsity of $S$ and the smoothness of $L$ on the graph, respectively.

3.2 Optimization for Robust PCA on Graphs

The authors propose using the Alternating Direction Method of Multipliers (ADMM), a variant of the standard augmented Lagrangian method. Using the augmented Lagrangian, (15) can be rewritten as

$$\min_{L,S,W} \|L\|_* + \lambda\|S\|_1 + \gamma \operatorname{tr}(W\Psi W^T) + \operatorname{tr}(Z_1^T(X - L - S)) + \frac{r_1}{2}\|X - L - S\|_F^2 + \operatorname{tr}(Z_2^T(W - L)) + \frac{r_2}{2}\|W - L\|_F^2 \tag{16}$$
$$\text{s.t.}\quad X = L + S,\; L = W,$$

where $Z_i \in \mathbb{R}^{p\times n}\ (i = 1, 2)$ are the Lagrange multipliers. For $Z_1$ and $Z_2$, the augmented Lagrangian gives the iteration scheme
$$Z_1^{k+1} = Z_1^k + r_1(X - L^{k+1} - S^{k+1}), \tag{17}$$
$$Z_2^{k+1} = Z_2^k + r_2(W^{k+1} - L^{k+1}). \tag{18}$$

To update $L$, $S$ and $W$, they use the proximity operator of $f$, given by the minimization problem
$$\operatorname{prox}_f(x) = \operatorname*{argmin}_y\, f(y) + \frac{1}{2}\|x - y\|_2^2, \tag{19}$$
where $f \in \mathcal{F}(\mathbb{R}^n)$, the class of lower semi-continuous convex functions from $\mathbb{R}^n$ to $]-\infty, +\infty]$ with $\operatorname{dom}(f) \neq \emptyset$. The updates for $L$, $S$ and $W$ at the $(k+1)$th iteration can then be formulated using $\operatorname{prox}_f$.
Update L. Solve for $L$ while fixing the other terms. That is,
$$L^{k+1} = \operatorname*{argmin}_L\, \|L\|_* + \operatorname{tr}(Z_1^T(X - L - S)) + \frac{r_1}{2}\|X - L - S\|_F^2 + \operatorname{tr}(Z_2^T(W - L)) + \frac{r_2}{2}\|W - L\|_F^2$$
$$= \operatorname*{argmin}_L\, \|L\|_* + \frac{r_1 + r_2}{2}\left\|L - \frac{r_1 H_1^k + r_2 H_2^k}{r_1 + r_2}\right\|_F^2 = \operatorname{prox}_{\frac{2}{r_1+r_2}\|\cdot\|_*}\!\left(\frac{r_1 H_1^k + r_2 H_2^k}{r_1 + r_2}\right) = P\,\Omega_{1/r}(\Sigma)\,Q^T, \tag{20}$$
where $H_1^k = X - S^k + Z_1^k/r_1$, $H_2^k = W^k + Z_2^k/r_2$, $\Omega_\tau(x)$ is the element-wise soft-thresholding operator $\Omega_\tau(x) = \operatorname{sgn}(x)\max(|x| - \tau, 0)$, $r = (r_1 + r_2)/2$, and $\frac{r_1 H_1^k + r_2 H_2^k}{r_1 + r_2} = P\Sigma Q^T$ is a singular value decomposition.

Update S. In a similar way, solve for $S$ while fixing the other terms:
$$S^{k+1} = \operatorname{prox}_{\frac{\lambda}{r_1}\|\cdot\|_1}\!\left(X - L^{k+1} + \frac{Z_1^k}{r_1}\right) = \Omega_{\frac{\lambda}{r_1}}\!\left(X - L^{k+1} + \frac{Z_1^k}{r_1}\right). \tag{21}$$

Update W. For updating $W$, (16) is a smooth function of $W$, so standard methods (such as the conjugate gradient method) can be used to find the solution:
$$W^{k+1} = r_2\left(L^{k+1} - \frac{Z_2^k}{r_2}\right)(\gamma\Psi + r_2 I)^{-1}. \tag{22}$$

A sketch of one full iteration of these updates appears at the end of this section.

3.3 What's new?

• Instead of learning the principal directions $U$ and principal components $V$ separately, it learns their product $L = UV^T$.
• It assumes smoothness of $L$ instead of assuming smoothness of the principal component matrix $V$.
• It is a strictly convex problem, so a unique global minimum can be obtained with standard methods such as ADMM.
• As with classical robust PCA, it does not require the rank of $L$ to be specified, because the nuclear norm enables automatic selection of an appropriate rank (though this depends on the choice of $\lambda$ and $\gamma$).
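To make the updates (17)-(22) concrete, here is a minimal sketch of one ADMM iteration in NumPy. The function names and the dense linear solve for the W-step are illustrative choices (the paper only gives the update formulas), so treat this as a sketch rather than the authors' implementation.

```python
import numpy as np

def soft_threshold(A, tau):
    """Element-wise soft-thresholding Omega_tau(A) = sgn(A) * max(|A| - tau, 0)."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def svt(M, tau):
    """Singular value thresholding: prox of tau * nuclear norm at M."""
    P, sigma, Qt = np.linalg.svd(M, full_matrices=False)
    return P @ np.diag(np.maximum(sigma - tau, 0.0)) @ Qt

def admm_step(X, L, S, W, Z1, Z2, Psi, lam, gamma, r1, r2):
    """One iteration of the updates (20)-(22) followed by (17)-(18)."""
    H1 = X - S + Z1 / r1
    H2 = W + Z2 / r2
    M = (r1 * H1 + r2 * H2) / (r1 + r2)
    L = svt(M, 2.0 / (r1 + r2))                         # update (20)
    S = soft_threshold(X - L + Z1 / r1, lam / r1)       # update (21)
    n = Psi.shape[0]
    A = gamma * Psi + r2 * np.eye(n)
    W = r2 * np.linalg.solve(A.T, (L - Z2 / r2).T).T    # update (22): right-multiply by A^{-1}
    Z1 = Z1 + r1 * (X - L - S)                          # update (17)
    Z2 = Z2 + r2 * (W - L)                              # update (18)
    return L, S, W, Z1, Z2
```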
4 Low-rank matrix approximation with manifold regularization [11]

Manifold regularized Matrix Factorization (MMF) exploits an orthonormality constraint on the principal directions $U$ to obtain a low-rank approximation $L = UV^T$. Unlike standard Nonnegative Matrix Factorization (NMF), it relaxes the nonnegativity restriction on $U$ and $V$ but imposes the orthonormality constraint on $U$.

4.1 Objective function for MMF

$$\min_{U,V} f(U, V) = \|X - UV^T\|_F^2 + \alpha \operatorname{tr}(V^T \Psi V) \quad \text{s.t.}\quad U^T U = I, \tag{23}$$

where $U$ is an orthonormal matrix whose columns form a principal orthogonal basis of the data space, and $V$ can be viewed as a projection of the data into the low-dimensional space. In addition, extensions of this model with an ensemble of graph and hypergraph regularization terms have been proposed by Tao et al. [10] and Jin et al. [6] by extending (23) as

$$\min_{U,V,\alpha} f(U, V, \alpha) = \|X - UV^T\|_F^2 + \operatorname{tr}\!\left(V^T \Big(\sum_k \alpha_k \Psi_k\Big) V\right) + \beta\|\alpha\|_2^2 \quad \text{s.t.}\quad U^T U = I,\; \mathbf{1}^T\alpha = 1. \tag{24}$$

4.2 Optimization for MMF

The iterative algorithm is divided into two parts (a code sketch follows at the end of this section):

1. Updating V. First, solve for $V$ while fixing $U$. The optimal solution follows by setting the derivative of (23) with respect to $V$ to zero:
$$\frac{\partial f(U, V)}{\partial V} = 2\{(I + \alpha\Psi)V - X^T U\} = 0, \tag{25}$$
which holds iff $V = (I + \alpha\Psi)^{-1}X^T U$. Thus, update $V \leftarrow (I + \alpha\Psi)^{-1}X^T U$.

2. Updating U. Solve for $U$ while fixing $V$. (23) can be rewritten as
$$\min_{U} \|X - UV^T\|_F^2 \quad \text{s.t.}\quad U^T U = I \;\Leftrightarrow\; \min_{U} -2\operatorname{tr}(U^T XV) \quad \text{s.t.}\quad U^T U = I, \tag{26}$$
which is equivalent to maximizing $\operatorname{tr}(U^T XV)$; this can be represented in terms of the singular values of $XV$. Let $XV = G\Sigma S^T$ be the SVD of $XV$, with $\Sigma$ a diagonal matrix of positive diagonal entries.
We have
$$\operatorname{tr}(U^T XV) = \operatorname{tr}(U^T G\Sigma S^T) = \operatorname{tr}(S^T U^T G\,\Sigma) = \sum_{i=1}^{d} (S^T U^T G)_{ii}\,\Sigma_{ii}. \tag{27}$$
Because $S^T U^T G$ is an orthogonal matrix, its diagonal entries are at most 1 and $\operatorname{tr}(S^T U^T G) \le d$; hence the maximum of $\operatorname{tr}(U^T XV)$ is achieved when $S^T U^T G = I$, i.e., $U = GS^T$. In summary, update $U \leftarrow GS^T$, where $XV = G\Sigma S^T$ is the SVD of $XV$.

4.3 What's new?

• This model has globally optimal solutions that can be computed directly (in the small-sample case) or iteratively (in the large-sample case).
• This model puts the orthonormality constraint on the principal directions $U$ rather than on $V$ as in [5].
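A minimal sketch of the MMF alternating updates of Section 4.2, under the reading that the V-step solves $(I + \alpha\Psi)V = X^T U$ and the U-step uses the SVD of $XV$; the function name, random initialization, and fixed iteration count are illustrative choices, not the paper's.

```python
import numpy as np

def mmf(X, Psi, alpha, d, n_iter=100):
    """Alternate the V-update V = (I + alpha*Psi)^{-1} X^T U and the
    U-update U = G S^T, where X V = G Sigma S^T is the thin SVD of X V."""
    p, n = X.shape
    rng = np.random.default_rng(0)
    U, _ = np.linalg.qr(rng.standard_normal((p, d)))    # random orthonormal start
    A = np.eye(n) + alpha * Psi                         # (I + alpha * Psi)
    for _ in range(n_iter):
        V = np.linalg.solve(A, X.T @ U)                 # V-update
        G, _, St = np.linalg.svd(X @ V, full_matrices=False)
        U = G @ St                                      # U-update: U = G S^T
    return U, V
```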
5 Joint L1/2-Norm Constraint and Graph-Laplacian PCA Method for Feature Extraction [3]

Jiang et al. proposed robust graph-Laplacian PCA (RgLPCA) using the $L_{2,1}$ norm [5]. However, the main effect of the $L_{2,1}$ norm is to introduce row-wise sparseness, and this effect is not very pronounced. Instead, Feng et al. replace it with the $L_{1/2}$ norm (the resulting method is called $L_{1/2}$ gLPCA).

5.1 Objective function for L1/2 gLPCA

$$\min_{U,V} \|X - UV^T\|_{1/2}^{1/2} + \alpha \operatorname{tr}(V^T \Psi V) \quad \text{s.t.}\quad V^T V = I, \tag{28}$$

where $\|A\|_{1/2}^{1/2} = \sum_{j=1}^{n}\sum_{i=1}^{p} |a_{ij}|^{1/2}$.

5.2 Optimization for L1/2 gLPCA

As with RgLPCA [5], the Augmented Lagrange Multiplier (ALM) method is used. ALM rewrites (28) as

$$\min_{U,V,E} \|E\|_{1/2}^{1/2} + \operatorname{tr}\{C^T(E - X + UV^T)\} + \frac{\mu}{2}\|E - X + UV^T\|_F^2 + \alpha \operatorname{tr}\{V^T \Psi V\} \quad \text{s.t.}\quad E = X - UV^T,\; V^T V = I, \tag{29}$$

where $C$ is the Lagrange multiplier and $\mu$ is the penalty parameter. The algorithm is divided into three parts:

1. Updating E. First, solve for $E$ while fixing $U, V$. After some calculation, (29) becomes
$$\min_{E} \|E\|_{1/2}^{1/2} + \frac{\mu}{2}\left\|E - X + UV^T + \frac{C}{\mu}\right\|_F^2. \tag{30}$$
This can be viewed as the proximal operator of the $L_{1/2}$ norm, for which the following lemma can be used:

Lemma 1. The proximal operator of the $L_{1/2}$ norm,
$$\operatorname{prox}_{\lambda\|\cdot\|_{1/2}^{1/2}}(A) = \operatorname*{argmin}_{X \in \mathbb{R}^{p\times n}} \|X - A\|_F^2 + \lambda\|X\|_{1/2}^{1/2},$$
is given by
$$U\operatorname{diag}(H_\lambda(\sigma))V^T, \tag{31}$$
where $A = U\operatorname{diag}(\sigma)V^T$ is the SVD of $A$ with singular values $\sigma = (\sigma_1, \ldots, \sigma_n)$, $H_\lambda(\sigma) = (h_\lambda(\sigma_1), \ldots, h_\lambda(\sigma_n))^T$, and $h_\lambda(\sigma_i)\ (i = 1, \ldots, n)$ is the half-thresholding operator
$$h_\lambda(\sigma_i) = \begin{cases} \dfrac{2}{3}\sigma_i\left(1 + \cos\!\left(\dfrac{2\pi}{3} - \dfrac{2}{3}\varphi_\lambda(\sigma_i)\right)\right), & \text{if } |\sigma_i| > \dfrac{54^{1/3}}{4}\lambda^{2/3},\\[2pt] 0, & \text{otherwise}, \end{cases} \tag{32}$$
where $\varphi_\lambda(\sigma_i) = \arccos\!\left(\frac{\lambda}{8}\left(\frac{|\sigma_i|}{3}\right)^{-3/2}\right)$ (a code sketch of $h_\lambda$ follows at the end of this subsection).

2. Updating U and V. The same procedure as for gLPCA can be used. First, solve for $U$ while fixing the others:
$$U = \left(X - E - \frac{C}{\mu}\right)V. \tag{33}$$
Then solve for $V$ while fixing the others; the update is given by solving
$$\min_{V} \operatorname{tr}\left\{V^T\left(-\left(X - E - \frac{C}{\mu}\right)^T\left(X - E - \frac{C}{\mu}\right) + \frac{2\alpha}{\mu}\Psi\right)V\right\} \quad \text{s.t.}\quad V^T V = I. \tag{34}$$

3. Updating C and µ. $C$ and $\mu$ are updated as follows:
$$C = C + \mu(E - X + UV^T), \tag{35}$$
$$\mu = \rho\mu, \tag{36}$$
where $\rho > 1$ is a step-size parameter.
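A minimal sketch of the half-thresholding operator $h_\lambda$ from Lemma 1. The threshold constant $\frac{54^{1/3}}{4}\lambda^{2/3}$ and the $-3/2$ exponent follow the standard half-thresholding formula, since the printed version of (32) is garbled; treat them as a reconstruction rather than the paper's exact statement.

```python
import numpy as np

def half_threshold(sigma, lam):
    """Half-thresholding operator h_lambda of (32), applied element-wise."""
    sigma = np.asarray(sigma, dtype=float)
    s = np.maximum(np.abs(sigma), 1e-12)                # avoid division by zero below
    thresh = (54.0 ** (1.0 / 3.0)) / 4.0 * lam ** (2.0 / 3.0)
    phi = np.arccos(np.clip(lam / 8.0 * (s / 3.0) ** (-1.5), -1.0, 1.0))
    shrunk = (2.0 / 3.0) * sigma * (1.0 + np.cos(2.0 * np.pi / 3.0 - 2.0 * phi / 3.0))
    return np.where(np.abs(sigma) > thresh, shrunk, 0.0)
```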
  9. 5.3 What’s new? • Extend RgLPCA by using L1/2 norm

    instead of L2,1 . The optimization algorithm is similar with RgLCPA. 6 Fast Robust PCA on graph [8] It introduces Fast Robust PCA on Graph (FRPCAG) by extending RgLCPA [5]. 6.1 Objective function for FRPCAG min L,S L 1 + λ1 tr(LΨ1 LT ) + λ2 tr(LT Ψ2 L) s.t. X = L + S (37) where L ∈ Rp×n is the low-rank noiseless matrix, S ∈ Rp×n is the sparse matrix, and Ψ1 ∈ Rn×n and Ψ2 ∈ Rp×p are the graph Laplacian of graph connecting the different samples (column of X) and different features (row of X) of X, respectively. 6.2 Optimization for FRPCAG It uses the Fast Iterative Soft Thresholding Algorithm (FASTA) to solve (37). Let define h : RN → R a convex function with the proximity operator proxλh (y) = argminx 1 2 x−y 2 F +λh(x). Our goal is to minimize h(L) + g(L) = L 1 + λ1 tr(LΨ1 LT ) + λ2 tr(LT Ψ2 L), which is done with proximal splitting method. In this case, the proximal operator of the h(L) = L 1 is the l1 soft-thresholding given by the element-wise operations: proxλh (L) = X + sgn(L − X) ◦ max(|L − X| − λ, 0), (38) where ◦ is the Hadamard product. Also, the gradient of g(L) = λ1 tr(LΨ1 LT ) + λ2 tr(LT Ψ2 L)) is given by ∇g (U) = 2(γ1 LΨ1 + γ2 Ψ2 L). By using these notations, the update procedure for U is given by Lj = proxλjh (Yj − λj ∇g (Yj )) tj+1 = 1 + 1 + 4t2 j 2 (39) Yj+1 = Uj + tj − 1 tj + 1 (Uj − Uj−1 ), where λj is the jth step size. If Yj+1 − Yj 2 F < Yj 2 F where is the stopping tolerance, the algorithm is stopped. 9
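A minimal sketch of the FISTA iteration (39) for the FRPCAG objective as written above; the fixed step size, iteration cap, and function names are illustrative choices rather than the paper's.

```python
import numpy as np

def frpcag(X, Psi1, Psi2, lam1, lam2, step=1e-3, tol=1e-6, max_iter=500):
    """FISTA for min_L ||X - L||_1 + lam1*tr(L Psi1 L^T) + lam2*tr(L^T Psi2 L)."""
    def prox_h(L, lam):
        # prox of lam * ||X - .||_1: soft-threshold L around X, see (38)
        return X + np.sign(L - X) * np.maximum(np.abs(L - X) - lam, 0.0)

    def grad_g(L):
        # gradient of the two graph-smoothness terms
        return 2.0 * (lam1 * L @ Psi1 + lam2 * Psi2 @ L)

    L_prev = X.copy()
    Y = X.copy()
    t = 1.0
    for _ in range(max_iter):
        L = prox_h(Y - step * grad_g(Y), step)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        Y_next = L + (t - 1.0) / t_next * (L - L_prev)
        if np.linalg.norm(Y_next - Y) ** 2 < tol * np.linalg.norm(Y) ** 2:
            Y = Y_next
            break
        L_prev, Y, t = L, Y_next, t_next
    return L, X - L                 # low-rank part L and sparse part S = X - L
```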
  10. 6.3 What’s new? • It targets directly the recovery of

    the low-rank matrix U. • It uses the graph smooth assumption both between the samples and between the features. • It is convex, but non-smooth. But, it can be solved efficiently (linear time in the number of samples) with the FISTA algorithm. • It is parallelizable and scalable for large datasets because it requires only the multiplica- tion of two sparse matrix and element-wise soft-thresholding operations. 7 Nonlinear dimensionality reduction on graphs [9] It introduces an approach to graph-adaptive nonlinear dimensionality reduction which can be viewed as the extension of the kernel PCA with the graph regularisation. They call it graph- adaptive nonlinear Kernel PCA (GRAD KPCA). 7.1 Objective function for GRAD KPCA min B Kx − BT B 2 F + γ tr(BΨBT ) s.t BBT = Id (40) where Kx = XT X ∈ Rp×p is a Gram matrix and B = UX ∈ Rd×p. In addition, instead of directly using Ψ, we can use a family of graph kernels r∗(Ψ) = UG r∗(Λ)UG where r∗() is a non-decreasing scalar function of the eigenvalues of Ψ and UG is the eigenvectors of Ψ. By using r∗(Ψ) as a kernel matrix, we can rewrite (40) as min B Kx − BT B 2 F + γ tr(Br∗(Ψ)BT ) s.t BBT = Id . (41) 7.2 Optimization for GRAD KPCA As kernel PCA can be written as a trace minimization problem, (40) can be reduced to min B − tr(BKx BT ) + γ tr(BΨBT ) s.t BBT = Id ⇔ min B − tr B(Kx − γΨ)BT s.t BBT = Id (42) The optimal solution of B of (42) is given by the d largest eigenvalues and corresponding eigenvectors of Kx − γΨ = ¯ V Λ¯ V . Then, we can get a closed form solution as B = ¯ V T d , which is the sub-matrix of ¯ V formed by columns corresponding the d largest eigenvalues. 10
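A minimal sketch of the closed-form solution of (42) (the function name is an illustrative choice): take the eigenvectors of the $d$ largest eigenvalues of $K_x - \gamma\Psi$.

```python
import numpy as np

def grad_kpca(X, Psi, gamma, d):
    """Closed-form solution of (42): B is the transpose of the eigenvectors
    of K_x - gamma*Psi associated with the d largest eigenvalues."""
    Kx = X.T @ X                            # Gram matrix of the n samples
    M = Kx - gamma * Psi
    evals, evecs = np.linalg.eigh(M)        # ascending eigenvalues
    B = evecs[:, -d:][:, ::-1].T            # d x n, largest eigenvalues first
    return B
```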
  11. 7.3 What’s new? • It extends the concept of kernel

    PCA by using graph regularisation. 8 Locality Preserving Projections [4] Locality Preserving Projections (LPP) is not a PCA based method, but uses similar idea as PCA and provide useful insight to interpret PCA. Note that Section 3.4 (Geometrical Justification) provides good explanation about why you should use the Laplacian matrix, which can be viewed as analogy to the Laplace Beltrami operator on compact Riemannian manifolds [1]. 8.1 Objective function for LPP min y i,j (yi − yj )2Wij s.t. yT Dy = 1 (43) where y = (y1 , . . . , yd )T is the map that maps the graph G to a line so that connected points stay as close together as possible and D is the degree matrix defined as D = diag(dii ) and dii = j Wij . Let a be a transformation vector y = aT X. By simple calculation, the objective function can be reduced to min a i,j (aT xi − aT xj )2Wij s.t. aXDXT a = 1 ⇔ min a i aT xi Di ixT i aT − i,j aT xi Wij xT j a s.t. aXDXT a = 1 ⇔ min a aT X(DW )XT a = aXΨXT a s.t. aXDXT a = 1 (44) In addition, suppose that the Euclidean space Rn is mapped to a Hilbert space H through a nonlinear mapping function φ : Rn → H. We consider the kernel function K(xi , xj ) = ψ(xi )T ψ(xj ) on the Hilbert space H. After simple algebra, kernel version of (44) can be given by the following eigenvector problem: KΨKb = λKDKb. (45) 8.2 Optimization for LPP The optimal solution of a that minimizes the objective function is given by the minimum eigenvalue solution to the generalized eigenvalue problem: XΨXT a = λXDXT a. (46) 11
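A minimal sketch of solving the generalized eigenvalue problem (46) with SciPy; the function name and the small regularization added to $XDX^T$ are illustrative choices, not part of the original method.

```python
import numpy as np
from scipy.linalg import eigh

def lpp(X, W, d=1, reg=1e-9):
    """Solve X Psi X^T a = lambda X D X^T a and return the d eigenvectors
    with the smallest eigenvalues (the LPP projection directions)."""
    D = np.diag(W.sum(axis=1))
    Psi = D - W
    A = X @ Psi @ X.T
    B = X @ D @ X.T + reg * np.eye(X.shape[0])   # regularize for positive definiteness
    evals, evecs = eigh(A, B)                    # generalized symmetric eigenproblem, ascending
    return evecs[:, :d]
```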
  12. 8.3 What’s new? • It imposes a new restriction aXDXT

    a = 1: the matrix D provides a natural measure on the data points. The bigger the value Dii is (corresponding to yi ), the more ”important” is yi . • It is linear. This makes it fast and suitable for practical big data. • It may be conducted in the original or the reproducing kernel Hilbert space (RKHS) into which data points are mapped. 12
References

[1] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, pages 585–591, 2002.
[2] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.
[3] Chun-Mei Feng, Ying-Lian Gao, Jin-Xing Liu, Juan Wang, Dong-Qin Wang, and Chang-Gang Wen. Joint L1/2-norm constraint and graph-Laplacian PCA method for feature extraction. BioMed Research International, 2017, 2017.
[4] Xiaofei He and Partha Niyogi. Locality preserving projections. In Advances in Neural Information Processing Systems, pages 153–160, 2004.
[5] Bo Jiang, Chris Ding, Bin Luo, and Jin Tang. Graph-Laplacian PCA: Closed-form solution and robustness. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3492–3498, 2013.
[6] Taisong Jin, Jun Yu, Jane You, Kun Zeng, Cuihua Li, and Zhengtao Yu. Low-rank matrix factorization with multiple hypergraph regularizer. Pattern Recognition, 48(3):1011–1022, 2015.
[7] Nauman Shahid, Vassilis Kalofolias, Xavier Bresson, Michael Bronstein, and Pierre Vandergheynst. Robust principal component analysis on graphs. In Proceedings of the IEEE International Conference on Computer Vision, pages 2812–2820, 2015.
[8] Nauman Shahid, Nathanael Perraudin, Vassilis Kalofolias, Gilles Puy, and Pierre Vandergheynst. Fast robust PCA on graphs. IEEE Journal of Selected Topics in Signal Processing, 10(4):740–756, 2016.
[9] Yanning Shen, Panagiotis A. Traganitis, and Georgios B. Giannakis. Nonlinear dimensionality reduction on graphs. In 2017 IEEE 7th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 1–5. IEEE, 2017.
[10] Liang Tao, Horace H. S. Ip, Yinglin Wang, and Xin Shu. Low rank approximation with sparse integration of multiple manifolds for data representation. Applied Intelligence, 42(3):430–446, 2015.
[11] Zhenyue Zhang and Keke Zhao. Low-rank matrix approximation with manifold regularization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(7):1717–1729, 2013.