Decomposition Based Unsupervised Feature Extraction Applied to Bioinformatics
Y-h. Taguchi, Department of Physics, Chuo University, Tokyo 112-8551, Japan
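HOSVD (Higher Order Singular Value Decomposition)
Extension to tensor…

$$x_{ijk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3)\, u_{l_1 i}\, u_{l_2 j}\, u_{l_3 k}$$

Example
N: number of genes (i)
M: number of samples (j)
K: number of tissues (k)
$x_{ijk}$: gene expression

[Figure: the N × M × K tensor $x_{ijk}$ decomposed into the core tensor G (L_1 × L_2 × L_3) and the factors $u_{l_1 i}$, $u_{l_2 j}$, $u_{l_3 k}$]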
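The HOSVD above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the helper names `unfold`, `hosvd`, and `reconstruct`, the toy tensor sizes, and the random data are all hypothetical, and no truncation of L_1, L_2, L_3 is applied, so the reconstruction is exact.

```python
import numpy as np

def unfold(x, mode):
    """Matricize tensor x so that rows are indexed by the given mode."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def hosvd(x):
    """Return the core tensor G and one factor matrix per mode.

    Column l of factors[n] is the l-th left singular vector of the
    mode-n unfolding, i.e. u_{l i}, u_{l j}, u_{l k} in the slide's notation.
    """
    factors = []
    for mode in range(x.ndim):
        u, _, _ = np.linalg.svd(unfold(x, mode), full_matrices=False)
        factors.append(u)
    core = x
    for mode, u in enumerate(factors):
        # Contract the current mode with u^T and put the new axis back in place.
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, mode)), 0, mode)
    return core, factors

def reconstruct(core, factors):
    """Evaluate sum_{l1 l2 l3} G(l1 l2 l3) u_{l1 i} u_{l2 j} u_{l3 k}."""
    x = core
    for mode, u in enumerate(factors):
        x = np.moveaxis(np.tensordot(u, x, axes=(1, mode)), 0, mode)
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 4, 3))  # toy sizes: N = 20 genes, M = 4 samples, K = 3 tissues
G, (u_gene, u_sample, u_tissue) = hosvd(x)
x_hat = reconstruct(G, [u_gene, u_sample, u_tissue])
```

Keeping only the leading columns of each factor matrix (and the matching block of `G`) would give the truncated approximation written with ≃ above.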
$k(x_{ij}, x_{ij'})$: non-negative definite

Radial basis function (RBF) kernel:
$$k(x_{ij}, x_{ij'}) = \exp\left(-\alpha \sum_i (x_{ij} - x_{ij'})^2\right)$$

Polynomial kernel:
$$k(x_{ij}, x_{ij'}) = \left(1 + \sum_i x_{ij} x_{ij'}\right)^d$$

$k(x_{ij}, x_{ij'})$ → diagonalization
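As a rough sketch of the two kernels and the diagonalization step, the following NumPy code builds the M × M kernel matrices over sample pairs (j, j') and diagonalizes the RBF one. The function names, the values of `alpha` and `d`, and the random data are assumptions made for illustration only.

```python
import numpy as np

def rbf_kernel(x, alpha):
    """k(x_ij, x_ij') = exp(-alpha * sum_i (x_ij - x_ij')^2) for all sample pairs."""
    sq = np.sum(x ** 2, axis=0)                      # squared norm of each sample column
    d2 = sq[:, None] + sq[None, :] - 2.0 * x.T @ x   # pairwise squared distances
    return np.exp(-alpha * np.maximum(d2, 0.0))      # clip tiny negatives from round-off

def polynomial_kernel(x, d):
    """k(x_ij, x_ij') = (1 + sum_i x_ij x_ij')^d for all sample pairs."""
    return (1.0 + x.T @ x) ** d

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 6))        # toy data: N = 50 genes, M = 6 samples
k_rbf = rbf_kernel(x, alpha=0.01)   # alpha chosen arbitrarily
k_poly = polynomial_kernel(x, d=2)  # d chosen arbitrarily

# Diagonalization: eigenvalues in ascending order, eigenvectors in columns.
eigvals, eigvecs = np.linalg.eigh(k_rbf)
```

Because the kernel is non-negative definite, `eigvals` contains no negative values beyond round-off, and the columns of `eigvecs` play the role of sample-side latent vectors.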
Kernel Tensor decomposition

$$x_{jkj'k'} = \sum_i x_{ijk}\, x_{ij'k'}$$

$$x_{jkj'k'} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} \sum_{l_4=1}^{L_4} G(l_1 l_2 l_3 l_4)\, u_{l_1 j}\, u_{l_2 k}\, u_{l_3 j'}\, u_{l_4 k'}$$

[Figure: the N × M × K tensors $x_{ijk}$ and $x_{ij'k'}$ are multiplied over i to give $x_{jkj'k'}$, which is decomposed into the core tensor G (L_1 × L_2 × L_3 × L_4) and the factors $u_{l_1 j}$, $u_{l_2 k}$, $u_{l_3 j'}$, $u_{l_4 k'}$]

https://doi.org/10.1101/2020.10.09.333195
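The linear kernel tensor, x_{jkj'k'} = Σ_i x_{ijk} x_{ij'k'}, is a single contraction over the gene index. The sketch below uses toy sizes and random data (both assumptions) and, as one simple way to obtain the j-side vectors u_{l_1 j}, takes the left singular vectors of the mode-j unfolding.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 30, 4, 3                  # toy sizes, chosen for illustration
x = rng.normal(size=(N, M, K))      # toy x_ijk

# Sum over the gene index i; the result has shape (M, K, M, K).
x_kernel = np.einsum("ijk,ilm->jklm", x, x)

# The same numbers arranged as an (M*K) x (M*K) Gram matrix of the flattened samples.
flat = x.reshape(N, M * K)
gram = flat.T @ flat

# Left singular vectors of the mode-j unfolding: column l1 holds u_{l1 j}.
u_j, s_j, _ = np.linalg.svd(x_kernel.reshape(M, -1), full_matrices=False)
```

Note that the gene dimension N no longer appears in `x_kernel`, so the size of the object being decomposed depends only on M and K.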
$k(x_{ijk}, x_{ij'k'})$: non-negative definite

Radial basis function kernel:
$$k(x_{ijk}, x_{ij'k'}) = \exp\left(-\alpha \sum_i (x_{ijk} - x_{ij'k'})^2\right)$$

Polynomial kernel:
$$k(x_{ijk}, x_{ij'k'}) = \left(1 + \sum_i x_{ijk} x_{ij'k'}\right)^d$$

$k(x_{ijk}, x_{ij'k'})$ → tensor decomposition
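For the tensor case the kernels take the same form, evaluated between every pair of (j, k) and (j', k') slices. A minimal sketch, assuming hypothetical function names, arbitrary `alpha` and `d`, and random toy data:

```python
import numpy as np

def rbf_kernel_tensor(x, alpha):
    """k(x_ijk, x_ij'k') = exp(-alpha * sum_i (x_ijk - x_ij'k')^2), shape (M, K, M, K)."""
    n, m, k = x.shape
    flat = x.reshape(n, m * k)                       # column index is j * K + k
    sq = np.sum(flat ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * flat.T @ flat
    return np.exp(-alpha * np.maximum(d2, 0.0)).reshape(m, k, m, k)

def polynomial_kernel_tensor(x, d):
    """k(x_ijk, x_ij'k') = (1 + sum_i x_ijk x_ij'k')^d, shape (M, K, M, K)."""
    return (1.0 + np.einsum("ijk,ilm->jklm", x, x)) ** d

rng = np.random.default_rng(0)
x = rng.normal(size=(30, 4, 3))     # toy sizes: N = 30, M = 4, K = 3
alpha = 0.01                        # arbitrary choice
k_rbf = rbf_kernel_tensor(x, alpha)
k_poly = polynomial_kernel_tensor(x, d=2)
```

Either four-way array can then be passed to a tensor decomposition such as HOSVD in place of the linear kernel tensor.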
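μ: mean, σ: standard deviation

$$x_{ijk} \in \mathbb{R}^{N \times M \times M}, \qquad x_{ijk} \sim \begin{cases} N(\mu, 3) & j, k \le M/2,\ i \le N_1 \ll N \\ N(0, 3) & \text{otherwise} \end{cases}$$

M ≪ N

[Figure: N genes × 1st M experiments × 2nd M experiments, giving M^2 samples]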
[Figure labels: experiments; zero mean; non-zero mean]
i ≤ N_1: distinct between j, k ≤ M/2 and others
i > N_1: no distinction
Task: Can we get latent vectors coincident with the distinction?
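The synthetic setting can be sketched as below. The sizes, the value of μ, and the reading of "3" as a standard deviation are assumptions (μ is chosen large so that the toy signal is easy to see), and the Pearson correlation with a 0/1 label is just one simple way to check whether a latent vector coincides with the distinction.

```python
import numpy as np

rng = np.random.default_rng(0)
N, N1, M = 200, 20, 6        # toy sizes with N1 << N and M << N
mu, sigma = 10.0, 3.0        # assumed values; "3" is read here as the standard deviation
half = M // 2

# N(0, sigma) everywhere, shifted to N(mu, sigma) for i <= N1 and j, k <= M/2.
x = rng.normal(0.0, sigma, size=(N, M, M))
x[:N1, :half, :half] += mu

# Linear kernel tensor x_{jkj'k'} = sum_i x_ijk x_ij'k', then the j-side latent vectors.
x_kernel = np.einsum("ijk,ilm->jklm", x, x)
u_j, _, _ = np.linalg.svd(x_kernel.reshape(M, -1), full_matrices=False)

# Distinction along j: 1 for j <= M/2, 0 otherwise.
label = (np.arange(M) < half).astype(float)
corr_first = np.corrcoef(u_j[:, 0], label)[0, 1]   # sign of a singular vector is arbitrary
```

With a signal this strong the leading latent vector tracks the j ≤ M/2 block closely; smaller μ or N_1 makes the task correspondingly harder.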
Correlation coefficients between the distinction and the latent variables (smaller is better):
0.043 0.043 0.039 0.039
KTD with the linear kernel is equivalent to KTD with the RBF kernel.
$$x_{ijkm} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} \sum_{l_4=1}^{L_4} G(l_1 l_2 l_3 l_4)\, u_{l_1 j}\, u_{l_2 k}\, u_{l_3 m}\, u_{l_4 i}$$

$u_{l_1 j}$: $l_1$th cell line dependence
$u_{l_2 k}$: $l_2$th SARS-CoV-2 infection (YES/NO) dependence
$u_{l_3 m}$: $l_3$th biological replicate dependence
$u_{l_4 i}$: $l_4$th gene dependence
G: weights

Purpose: identification of $l_1, l_2, l_3$ that are independent of cell lines and replicates ($u_{l_1 j}$ and $u_{l_3 m}$ are constant, independent of j and m) but dependent upon SARS-CoV-2 infection ($u_{l_2 1} = -u_{l_2 2}$).
… = 1
Which $l_4$ has the largest |G|?
$u_{5i}$ ($l_4 = 5$) represents gene expression profiles that are independent of cell lines and biological replicates but altered by SARS-CoV-2 infection.
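The selection procedure described above (find l_1, l_2, l_3 with the desired dependence, then pick the l_4 with the largest |G|) can be illustrated on toy data. Everything specific here is an assumption: the tensor sizes, the simulated effect on the first 10 genes, the helper names, and the simple heuristics used to decide which vectors are "constant" or "sign-flipping". It is a sketch of the idea, not the author's pipeline.

```python
import numpy as np

def unfold(x, mode):
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def hosvd(x):
    """Core tensor and per-mode factor matrices (columns are singular vectors)."""
    factors = [np.linalg.svd(unfold(x, m), full_matrices=False)[0] for m in range(x.ndim)]
    core = x
    for m, u in enumerate(factors):
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, m)), 0, m)
    return core, factors

rng = np.random.default_rng(0)
N, J, K, R = 100, 5, 2, 3    # toy sizes: genes, cell lines, infection NO/YES, replicates
effect = np.zeros(N)
effect[:10] = 3.0            # hypothetical: only the first 10 genes respond to infection
sign = np.array([1.0, -1.0])

# Axes are (i, j, k, m): baseline 5, infection effect with opposite signs, small noise.
x = 5.0 + effect[:, None, None, None] * sign[None, None, :, None]
x = x + rng.normal(0.0, 0.1, size=(N, J, K, R))

G, (u_gene, u_cell, u_inf, u_rep) = hosvd(x)

# Heuristics: a unit vector with the largest |sum| is the most constant one;
# the infection vector with the smallest |u_1 + u_2| is closest to u_1 = -u_2.
l1 = int(np.argmax(np.abs(u_cell.sum(axis=0))))
l3 = int(np.argmax(np.abs(u_rep.sum(axis=0))))
l2 = int(np.argmin(np.abs(u_inf[0] + u_inf[1])))

# Which l4 has the largest |G| for the chosen (l1, l2, l3)? Core axes follow (i, j, k, m).
l4 = int(np.argmax(np.abs(G[:, l1, l2, l3])))
top_genes = np.argsort(-np.abs(u_gene[:, l4]))[:10]
```

In this toy example the selected gene vector is not the leading one, because the leading vector is dominated by the infection-independent baseline; genes are then ranked by the magnitude of their component in the selected vector. All indices in the code are zero-based, unlike the one-based l_4 = 5 on the slide.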
2 infects seem to be detected.
↓
Drug repositioning will be possible by identifying compounds that affect the selected 163 genes.
↓
Fortunately, we have databases that list genes whose expression is altered by drug treatments.