UNSUPERVISED FEATURE EXTRACTION
Y-h. Taguchi
Department of Physics, Chuo University, Tokyo, Japan
Published in Knowledge-Based Systems (IF=5.9)
https://doi.org/10.1016/j.knosys.2021.106834
large number of variables (= p) based on a small number of samples (= n) (the so-called “large p small n” problem) → difficult because:
Statistical tests:
• Small n → P-values are not small enough (not significant enough).
• Large p → strong multiple-comparison correction (corrected P-values take larger values).
→ No significant P-values at all.
“large p small n” → overfitting.
A selection over-optimized toward one specific small set of n samples results in “sample-specific variable selection” → a different set of variables will be selected if another small set of n samples is used.
[Figure: synthetic data set; legend: Gaussian with zero mean / Gaussian with non-zero mean]
$M^2$ samples per variable
• $i \le N_1$: distinct between $j, k \le M/2$ and the others
• $i > N_1$: no distinction
Task: can we identify the $N_1$ variables correctly?
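The synthetic data set described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `make_synthetic`, the seed, and the reading that the non-zero mean applies to the block where both j and k are at most M/2 are mine; the default values N = 1000, N_1 = 10, M = 6, μ = 2, σ = 1 are the ones quoted elsewhere in these slides.

```python
import numpy as np

def make_synthetic(N=1000, N1=10, M=6, mu=2.0, sigma=1.0, seed=0):
    """Return an N x M x M tensor x[i, j, k] of Gaussian values.

    Assumption: only for the first N1 variables, the block j, k < M/2
    has mean mu; every other entry has zero mean.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, sigma, size=(N, M, M))
    half = M // 2
    x[:N1, :half, :half] += mu  # shift the "distinct" block
    return x

x = make_synthetic()
print(x.shape)
```

Each variable i then has M × M = 36 samples, of which 9 are shifted for the first N_1 variables.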
to test if it is distinct between two classes (i.e., $j, k \le M/2$ vs. the others)
• Computed P-values are corrected for multiple comparisons by the Benjamini-Hochberg method.
• Variables with corrected P-values < 0.05 are selected.
[Figure: j-k plane of size M × M, divided at M/2 along each axis]
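The Benjamini-Hochberg correction used for the selection step can be sketched in a few lines of NumPy. The function name `benjamini_hochberg` and the five example P-values are illustrative assumptions, not values from the study.

```python
import numpy as np

def benjamini_hochberg(p):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the r-th smallest P-value by n / r.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest P-value.
    adj_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adj = np.empty(n)
    adj[order] = np.clip(adj_sorted, 0.0, 1.0)
    return adj

p = [0.001, 0.008, 0.039, 0.041, 0.6]  # made-up raw P-values
adj = benjamini_hochberg(p)
selected = adj < 0.05
print(adj, selected)
```

In this toy input, the third and fourth P-values are below 0.05 before correction but not after it, which is the effect of the multiple-comparison correction on borderline variables.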
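            i > N_1   i ≤ N_1
P > 0.05     989.3      3.4
P ≤ 0.05       0.7      6.6

$N = 10^3$, $N_1 = 10$, $M = 6$, Gaussian distribution with μ (mean) = 2, σ (SD) = 1.
Averaged over 100 independent trials.

Confusion matrix:
                Fact N   Fact P
Prediction N      TN       FN
Prediction P      FP       TP

Ideal result:
                Fact N   Fact P
Prediction N      990       0
Prediction P        0      10

Matthews correlation coefficient (MCC):
$$\mathrm{MCC} = \frac{TP \times TN - FN \times FP}{\sqrt{(TN+FP)(FN+TP)(TN+FN)(FP+TP)}} \sim 0.77$$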
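As a quick check of the MCC value, the coefficient can be computed directly from the averaged confusion matrix. The helper name `mcc` is an assumption; the four input numbers are the averaged counts from this slide, and the denominator uses the standard square-root form of the coefficient.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tn + fp) * (fn + tp) * (tn + fn) * (fp + tp))
    # Return 0 when a row or column of the matrix is empty.
    return (tp * tn - fn * fp) / denom if denom > 0 else 0.0

# Averaged counts from the slide: TN = 989.3, FN = 3.4, FP = 0.7, TP = 6.6
print(round(mcc(tp=6.6, tn=989.3, fp=0.7, fn=3.4), 2))  # 0.77
# A perfect selection (TN = 990, TP = 10) gives MCC = 1.
print(mcc(tp=10, tn=990, fp=0, fn=0))  # 1.0
```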
            i > N_1   i ≤ N_1
P > 0.05     989.4      2.4
P ≤ 0.05       0.6      7.6
MCC ~ 0.84

Random Forest ($N_1 = 10$ given, since no P-values are computed)
            i > N_1   i ≤ N_1
P > 0.05     988.2      1.8
P ≤ 0.05       1.8      8.2
MCC ~ 0.81
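HOSVD (Higher Order Singular Value Decomposition)
Extension to tensors
[Figure: N × M × K tensor $x_{ijk}$ decomposed into an $L_1 \times L_2 \times L_3$ core tensor G and singular value vectors]
$$x_{ijk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3)\, u_{l_1 i}\, u_{l_2 j}\, u_{l_3 k}$$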
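A minimal NumPy sketch of HOSVD for a three-mode tensor is given here for illustration. The function name `hosvd` and the toy tensor size are assumptions, and this is the textbook construction (SVD of each mode unfolding, then projection onto the singular vectors), not necessarily the implementation used in the study. In the code, `us[0][i, l]` plays the role of $u_{l_1 i}$, and likewise for the other modes.

```python
import numpy as np

def hosvd(x):
    """HOSVD of a 3-mode tensor: returns the core G and factor matrices."""
    us = []
    for mode in range(3):
        # Unfold the tensor so that `mode` indexes the rows.
        unfolded = np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)
        u, _, _ = np.linalg.svd(unfolded, full_matrices=False)
        us.append(u)  # columns are the singular value vectors
    # Core tensor: project x onto the singular vectors of every mode.
    g = np.einsum('ijk,ia,jb,kc->abc', x, us[0], us[1], us[2])
    return g, us

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 4, 3))  # toy N x M x K tensor
g, us = hosvd(x)
# Without truncation, the sum over all terms reproduces x exactly.
recon = np.einsum('abc,ia,jb,kc->ijk', g, us[0], us[1], us[2])
print(np.allclose(recon, x))
```

Keeping only the leading columns of each factor matrix gives the truncated sums up to $L_1$, $L_2$, $L_3$ in the formula above.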
Assuming that $u_{1i}$ obeys a Gaussian distribution (null hypothesis), P-values are attributed to individual variables i using the $\chi^2$ distribution.
[Figure: $-\log_{10} P_i$ plotted for individual variables i; label: $i \le N_1$]
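The P-value assignment can be sketched as below. The assumptions here are mine: one degree of freedom for the χ² distribution, σ estimated as the standard deviation of the vector unless supplied, and the helper name `chi2_pvalues`. For one degree of freedom the upper tail probability has the closed form erfc(sqrt(x/2)), so only the standard library is needed.

```python
import math
import numpy as np

def chi2_pvalues(u, sigma=None):
    """P_i = P[chi^2 > (u_i / sigma)^2], assuming 1 degree of freedom."""
    u = np.asarray(u, dtype=float)
    if sigma is None:
        sigma = u.std()  # assumption: sigma estimated from u itself
    z2 = (u / sigma) ** 2
    # For 1 d.o.f., P[chi^2 > z2] = erfc(sqrt(z2 / 2)).
    return np.array([math.erfc(math.sqrt(v / 2.0)) for v in z2])

p = chi2_pvalues([0.0, 1.959964, 5.0], sigma=1.0)
print(p)
```

Components far from zero relative to σ receive small P-values, which can then be passed to a multiple-comparison correction.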
unsupervised FE (feature extraction)”, which is described in detail in my recently published book:
Unsupervised Feature Extraction Applied to Bioinformatics, 2020, Springer International.
based unsupervised FE,
1) It is well suited to feature selection in “large p small n” problems.
2) In contrast to conventional feature selection methods (e.g., lasso and random forest), no knowledge about the number of variables to select is required. Variables can be selected using P-values, as in conventional statistical tests.
test), it works in “large p small n” problems at least comparably with conventional feature selection methods that require the number of variables to be selected.
4) TD based unsupervised FE is an unsupervised method, since it does not require knowledge about classes or labeling when the singular value vectors ($u_{l_1 i}$, $u_{l_2 j}$, $u_{l_3 k}$) are generated.
MCC ~ 0.88
t test: MCC ~ 0.77
$$x_{ijk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3)\, u_{l_1 i}\, u_{l_2 j}\, u_{l_3 k}$$
$$\simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} \sum_{l_4=1}^{L_4} G(l_1 l_2 l_3 l_4)\, u_{l_1 j}\, u_{l_2 k}\, u_{l_3 m}\, u_{l_4 i}$$
• $u_{l_1 j}$: $l_1$th dependence upon cell lines
• $u_{l_2 k}$: $l_2$th dependence upon with and without SARS-CoV-2 infection
• $u_{l_3 m}$: $l_3$th dependence upon biological replicates
• $u_{l_4 i}$: $l_4$th dependence upon genes
• G: weight of individual terms
independent of cell lines and biological replicates ($u_{l_1 j}$ and $u_{l_3 m}$ take constant values regardless of j and m) and dependent upon the presence or absence of SARS-CoV-2 infection ($u_{l_2 1} = -u_{l_2 2}$)
Heavy “large p small n” problem
Number of variables (= p): 21797 ~ $10^4$
Number of samples (= n): 5 × 2 × 3 = 30 ~ 10
p/n ~ $10^3$
[Figure panels: cell lines; with and without SARS-CoV-2 infection; biological replicates]
Independent of cell lines and biological replicates, but dependent upon SARS-CoV-2 infection.
but dependent upon SARS-CoV-2 infection is associated with $u_{5i}$ ($l_4 = 5$)
$$P_i = P_{\chi^2}\left[ > \left( \frac{u_{5i}}{\sigma_5} \right)^2 \right]$$
Computed P-values are corrected for multiple comparisons by the Benjamini-Hochberg method.
163 genes with corrected P-values < 0.01 are selected among 21,797 genes.
we do not know how many genes should be selected, so lasso and random forest are useless. Instead, we employed SAM and limma, which are algorithms specific to gene selection (adjusted P-values are used).

                          t test               SAM                 limma
                      P>0.01   P≤0.01    P>0.01   P≤0.01    P>0.01   P≤0.01
Calu3                  21754       43     21797        0       335     3789
NHBE                   21797        0     21797        0       342     3906
A549  MOI 0.2          21797        0     21797        0       319     4391
      MOI 2.0          21472      325     21797        0       208     4169
      ACE2 expressed   21796        1     21797        0       182     4245
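$$\sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} \sum_{l_4=1}^{L_4} G(l_1 l_2 l_3 l_4)\, u_{l_1 j}\, u_{l_2 k}\, u_{l_3 j'}\, u_{l_4 k'}$$
[Figure: $x_{jkj'k'}$ decomposed into an $L_1 \times L_2 \times L_3 \times L_4$ core tensor G and singular value vectors $u_{l_1 j}$, $u_{l_2 k}$, $u_{l_3 j'}$, $u_{l_4 k'}$]
Kernel trick: $x_{jkj'k'} \to k(x_{ijk}, x_{ij'k'})$: non-negative definite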
Radial basis function kernel:
$$k(x_{ijk}, x_{ij'k'}) = \exp\left( -\alpha \sum_i \left( x_{ijk} - x_{ij'k'} \right)^2 \right)$$
Polynomial kernel:
$$k(x_{ijk}, x_{ij'k'}) = \left( 1 + \sum_i x_{ijk}\, x_{ij'k'} \right)^d$$
$k(x_{ijk}, x_{ij'k'})$ → tensor decomposition
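The two kernels can be evaluated for every pair of samples (j, k) and (j', k') at once, giving a four-mode tensor indexed by (j, k, j', k') that can then be passed to tensor decomposition. The sketch below is illustrative only; the function names, the toy tensor size, and the parameter values α = 0.1 and d = 2 are assumptions.

```python
import numpy as np

def rbf_kernel_tensor(x, alpha):
    """exp(-alpha * sum_i (x_ijk - x_ij'k')^2) as an (M, K, M, K) tensor."""
    n, m, q = x.shape
    flat = x.reshape(n, m * q)  # one column per sample (j, k)
    sq = (flat ** 2).sum(axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * flat.T @ flat
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative round-off
    return np.exp(-alpha * d2).reshape(m, q, m, q)

def poly_kernel_tensor(x, d):
    """(1 + sum_i x_ijk * x_ij'k')^d as an (M, K, M, K) tensor."""
    n, m, q = x.shape
    flat = x.reshape(n, m * q)
    return ((1.0 + flat.T @ flat) ** d).reshape(m, q, m, q)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3, 2))  # toy tensor: 5 features, 3 x 2 samples
k_rbf = rbf_kernel_tensor(x, alpha=0.1)
k_poly = poly_kernel_tensor(x, d=2)
print(k_rbf.shape, k_poly.shape)
```

Note that the feature index i is summed out inside the kernel, so the resulting tensor no longer has a feature mode.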
$u_{l_1 j}$, $u_{l_2 k}$
$$u_{l_1 i} \propto \sum_{jk} x_{ijk}\, u_{l_1 j}\, u_{l_2 k}$$
$$P_i = P_{\chi^2}\left[ > \left( \frac{u_{l_1 i}}{\sigma_{l_1}} \right)^2 \right]$$
Computed P-values are corrected for multiple comparisons by the Benjamini-Hochberg method.
Features with corrected P-values < 0.01 are selected.
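The projection that recovers a feature-wise vector from the sample-wise singular value vectors is a single tensor contraction. In this sketch the function name `project_features` and the unit-length normalization are assumptions (the slide only states proportionality), and the input values are made up so that the result is easy to verify by hand.

```python
import numpy as np

def project_features(x, u_j, u_k):
    """u_i proportional to sum_{jk} x_ijk * u_j * u_k (unit length)."""
    u_i = np.einsum('ijk,j,k->i', x, u_j, u_k)
    # Assumption: normalize to unit length; requires a non-zero projection.
    return u_i / np.linalg.norm(u_i)

# Toy example: only the (j, k) = (0, 1) sample carries signal.
x = np.zeros((3, 2, 2))
x[:, 0, 1] = [3.0, 4.0, 0.0]
u_i = project_features(x, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
print(u_i)  # [0.6 0.8 0. ]
```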
Recompute $x_{jkj'k'}$
$x_{jkj'k'}$ → TD → $u_{l_1 j} \times u_{l_2 k}$
Estimate the coincidence between $u_{l_1 j}$, $u_{l_2 k}$ and the classification of (k, j)
Rank the features i based upon the amount of decreased coincidence
[Figure: $u_{l_1 j} \times u_{l_2 k}$]
selections in “large p small n” problems.
It works comparably with conventional feature selection methods (lasso, random forest) and gives P-values, which lasso and random forest cannot.
TD based unsupervised FE could select human genes related to SARS-CoV-2 infection even when other conventional gene selection methods (t test, SAM, limma) do not work well.
(KTD) based unsupervised FE could even outperform TD based unsupervised FE when it was applied to the identification of human genes related to SARS-CoV-2 infection.
More advanced KTD based unsupervised FE methods are expected to be developed to tackle a wider range of problems, including genomic science/bioinformatics.