Slide 1

Slide 1 text

1 MATHEMATICAL FORMULATION AND APPLICATION OF KERNEL TENSOR DECOMPOSITION BASED UNSUPERVISED FEATURE EXTRACTION. Y-h. Taguchi, Department of Physics, Chuo University, Tokyo, Japan. Published in Knowledge-Based Systems (IF=5.9). https://doi.org/10.1016/j.knosys.2021.106834

Slide 2

Slide 2 text

2 Purpose: identification of a small number of critical variables within a large number of variables (=p) based upon a small number of samples (=n) (the so-called “large p small n” problem). → Difficult because:
Statistical test: small n → P-values not small enough (not significant enough).
Large p → strong multiple-comparison correction (corrected P-values take larger values) → no significant P-values at all.

Slide 3

Slide 3 text

3 More advanced machine learning approaches (e.g., lasso, random forest): “large p small n” → overfitting. Selection too optimized toward one specific small set of n samples results in “sample-specific variable selection” → a different set of variables will be selected if another small set of samples (n) is used.

Slide 4

Slide 4 text

4 Try a synthetic example with p >> n, i.e. p/n ~ 10².

Slide 5

Slide 5 text

5 Synthetic data: N variables × M × M measurements (M² samples per variable, indexed by j, k ≤ M).
i ≤ N1: Gaussian with non-zero mean for j, k ≤ M/2 and zero mean otherwise (i.e., distinct between j, k ≤ M/2 and the others).
i > N1: no distinction (zero-mean Gaussian throughout).
Task: can we identify the N1 variables correctly?

Slide 6

Slide 6 text

6 Strategy 1
● Apply a t test to each variable to test whether it is distinct between the two classes (i.e., j, k ≤ M/2 vs others).
● Computed P-values are corrected for multiple comparisons by the Benjamini-Hochberg method.
● Variables with corrected P-values < 0.05 are selected.
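Strategy 1 can be sketched in a few lines of Python. This is a minimal sketch with synthetic data matching the slide's setup; the Benjamini-Hochberg helper is a hand-rolled illustration, not the code used in the paper.

```python
import numpy as np
from scipy import stats

def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values (illustrative implementation)."""
    p = np.asarray(p)
    n = len(p)
    order = np.argsort(p)
    adj = p[order] * n / np.arange(1, n + 1)
    adj = np.minimum.accumulate(adj[::-1])[::-1]  # enforce monotonicity
    out = np.empty(n)
    out[order] = np.minimum(adj, 1.0)
    return out

rng = np.random.default_rng(0)
N, N1, M = 1000, 10, 6          # parameters from slide 7
mu, sigma = 2.0, 1.0

# x[i, j, k]: for i < N1 the block j, k < M/2 has a non-zero mean
x = rng.normal(0.0, sigma, size=(N, M, M))
x[:N1, : M // 2, : M // 2] += mu

mask = np.zeros((M, M), dtype=bool)  # class labels over the M*M samples
mask[: M // 2, : M // 2] = True

# t test per variable, then BH correction and selection at 0.05
p = np.array([stats.ttest_ind(xi[mask], xi[~mask]).pvalue for xi in x])
selected = np.flatnonzero(bh_adjust(p) < 0.05)
```

Most selected indices should fall below N1, with occasional false negatives, consistent with the averaged counts on slide 7.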

Slide 7

Slide 7 text

7 Result (N=10³, N1=10, M=6, Gaussian dist. with μ(mean)=2, σ(SD)=1; averaged over 100 independent trials):

              i > N1   i ≤ N1
P > 0.05       989.3      3.4    (TN, FN)
P ≤ 0.05         0.7      6.6    (FP, TP)

(A perfect answer would be TN=990, FN=0, FP=0, TP=10.)

Matthews correlation coefficient (MCC):
MCC = (TP×TN − FN×FP) / √((TN+FP)(FN+TP)(TN+FN)(FP+TP)) ~ 0.77
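The MCC value can be checked directly from the averaged counts on the slide (plain arithmetic, no assumptions beyond those numbers):

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    num = tp * tn - fn * fp
    den = math.sqrt((tn + fp) * (fn + tp) * (tn + fn) * (fp + tp))
    return num / den

# averaged counts from slide 7 (t test + BH at 0.05)
print(round(mcc(tp=6.6, tn=989.3, fp=0.7, fn=3.4), 2))  # → 0.77
```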

Slide 8

Slide 8 text

8 Lasso (N1=10 given, since there are no P-value computations):

               i > N1   i ≤ N1
not selected    989.4      2.4
selected          0.6      7.6    MCC ~ 0.84

Random forest (N1=10 given, since there are no P-value computations):

               i > N1   i ≤ N1
not selected    988.2      1.8
selected          1.8      8.2    MCC ~ 0.81

Slide 9

Slide 9 text

9 Singular value decomposition (SVD): x_{ij} (N × M) is approximated as
x_{ij} ≃ Σ_{l=1}^{L} u_{li} λ_l v_{lj}
(u: N × L left singular vectors, λ_l: singular values, v: L × M right singular vectors)
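In NumPy the truncated expansion looks like this (a generic sketch with random data, not the paper's code; the shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, L = 100, 20, 5
x = rng.normal(size=(N, M))

# numpy returns U (N x r), s (r), Vt (r x M) with x = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(x, full_matrices=False)

# rank-L approximation x_ij ≈ sum_{l<L} U[i,l] * s[l] * Vt[l,j]
x_L = (U[:, :L] * s[:L]) @ Vt[:L]
```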

Slide 10

Slide 10 text

10 HOSVD (Higher Order Singular Value Decomposition): extension of SVD to tensors. For x_{ijk} (N × M × K):
x_{ijk} ≃ Σ_{l1=1}^{L1} Σ_{l2=1}^{L2} Σ_{l3=1}^{L3} G(l1,l2,l3) u_{l1 i} u_{l2 j} u_{l3 k}
(G: L1 × L2 × L3 core tensor; u_{l1 i}, u_{l2 j}, u_{l3 k}: singular value vectors of the individual modes)
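A minimal HOSVD of a 3-way tensor can be sketched with mode-wise SVDs (illustrative only; the shapes and data are arbitrary):

```python
import numpy as np

def hosvd3(x):
    """HOSVD sketch: x[i,j,k] = sum G(l1,l2,l3) U1[i,l1] U2[j,l2] U3[k,l3].
    Each factor matrix is the left singular matrix of one mode unfolding."""
    U1, _, _ = np.linalg.svd(x.reshape(x.shape[0], -1),
                             full_matrices=False)
    U2, _, _ = np.linalg.svd(np.moveaxis(x, 1, 0).reshape(x.shape[1], -1),
                             full_matrices=False)
    U3, _, _ = np.linalg.svd(np.moveaxis(x, 2, 0).reshape(x.shape[2], -1),
                             full_matrices=False)
    # core tensor G = x projected onto the factor matrices
    G = np.einsum('ijk,ia,jb,kc->abc', x, U1, U2, U3)
    return G, U1, U2, U3

rng = np.random.default_rng(0)
x = rng.normal(size=(30, 6, 6))
G, U1, U2, U3 = hosvd3(x)
# with no truncation the reconstruction is exact
x_rec = np.einsum('abc,ia,jb,kc->ijk', G, U1, U2, U3)
```

Truncating the sums over l1, l2, l3 gives the low-rank approximation shown on the slide.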

Slide 11

Slide 11 text

11 Apply HOSVD to the synthetic data (N variables, M × M measurements; i ≤ N1 distinct between j, k ≤ M/2 and the others):
x_{ijk} ≃ Σ_{l1=1}^{L1} Σ_{l2=1}^{L2} Σ_{l3=1}^{L3} G(l1,l2,l3) u_{l1 i} u_{l2 j} u_{l3 k}

Slide 12

Slide 12 text

12 [Plots of u_{1j}, u_{1k}, and u_{1i}: components with i ≤ N1 stand out in u_{1i}]

Slide 13

Slide 13 text

13 [Enlarged plot of u_{1i}: components with i ≤ N1 stand out]

Slide 14

Slide 14 text

14 Assuming that the u_{1i} obey a Gaussian (null hypothesis), P-values are attributed to individual variables i using the χ² distribution:
P_i = P_{χ²}[> (u_{1i}/σ1)²]
[Plot of −log10 P_i: components with i ≤ N1 stand out]
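The χ²-based P-value attribution can be sketched as follows (the vector, its length, and the shift of 3.0 are synthetic stand-ins for u_{1i}, not values from the slides):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
# synthetic stand-in for the singular value vector over variables i
u1 = rng.normal(size=1000)
u1[:10] += 3.0                   # a few outlying components

sigma1 = np.std(u1)              # null hypothesis: u_1i ~ N(0, sigma1^2)
# P_i = P_chi2[ > (u_1i / sigma1)^2 ] with one degree of freedom
p = chi2.sf((u1 / sigma1) ** 2, df=1)
```

The squared standardized component follows a χ² distribution with one degree of freedom under the Gaussian null, so `chi2.sf` gives the upper-tail probability used as P_i.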

Slide 15

Slide 15 text

15 Variables with adjusted P_i < 0.05 are selected:

              i > N1   i ≤ N1
P > 0.05       989.9      2.2
P ≤ 0.05         0.1      7.8

MCC ~ 0.88 (cf. t test MCC ~ 0.77, lasso MCC ~ 0.84, random forest MCC ~ 0.81)

Slide 16

Slide 16 text

16 We named this strategy “TD (tensor decomposition) based unsupervised FE (feature extraction)”; it is described in detail in my recently published book: Unsupervised Feature Extraction Applied to Bioinformatics, Springer International, 2020.

Slide 17

Slide 17 text

17 Advantages of TD based unsupervised FE:
1) It is well suited to feature selection in the “large p small n” problem.
2) In contrast to conventional feature selection methods (e.g., lasso and random forest), no prior knowledge of the number of variables to select is required; variables can be selected using P-values, as in a conventional statistical test.

Slide 18

Slide 18 text

18 3) In contrast to conventional statistical tests (e.g., the t test), it works in “large p small n” problems, at least comparably with conventional feature selection methods that require the number of selected variables to be specified (MCC ~ 0.88 vs t test MCC ~ 0.77).
4) TD based unsupervised FE is an unsupervised method, since it does not require knowledge about classes or labeling when the singular value vectors (u_{l1 i}, u_{l2 j}, u_{l3 k}) are generated:
x_{ijk} ≃ Σ_{l1=1}^{L1} Σ_{l2=1}^{L2} Σ_{l3=1}^{L3} G(l1,l2,l3) u_{l1 i} u_{l2 j} u_{l3 k}

Slide 19

Slide 19 text

19 Application to a real example

Slide 20

Slide 20 text

20

Slide 21

Slide 21 text

21 Data set GSE147507: gene expression of human lung cell lines with/without SARS-CoV-2 infection.
i: genes (21,797)
j: cell lines — j=1: Calu3, j=2: NHBE, j=3: A549 MOI 0.2, j=4: A549 MOI 2.0, j=5: A549 ACE2-expressed (MOI: multiplicity of infection)
k: k=1: mock, k=2: SARS-CoV-2 infected
m: three biological replicates

Slide 22

Slide 22 text

22 x_{ijkm} ∈ ℝ^{21797×5×2×3}
x_{ijkm} ≃ Σ_{l1=1}^{L1} Σ_{l2=1}^{L2} Σ_{l3=1}^{L3} Σ_{l4=1}^{L4} G(l1,l2,l3,l4) u_{l1 j} u_{l2 k} u_{l3 m} u_{l4 i}
u_{l1 j}: l1-th cell-line dependence
u_{l2 k}: l2-th dependence on with/without SARS-CoV-2 infection
u_{l3 m}: l3-th dependence upon biological replicates
u_{l4 i}: l4-th gene dependence
G: weight of the individual terms

Slide 23

Slide 23 text

23 Purpose: identification of l1, l2, l3 independent of cell lines and biological replicates (u_{l1 j}, u_{l3 m} constant regardless of j, m) and dependent upon with or without SARS-CoV-2 infection (u_{l2 1} = −u_{l2 2}).
A heavy “large p small n” problem:
number of variables (=p): 21797 ~ 10⁴
number of samples (=n): 5 × 2 × 3 = 30 ~ 10
p/n ~ 10³

Slide 24

Slide 24 text

24 [Plots of u_{1j} (cell lines), u_{2k} (with/without SARS-CoV-2 infection), and u_{1m} (biological replicates)] l1=1, l2=2, l3=1: independent of cell lines and biological replicates, but dependent upon SARS-CoV-2 infection.

Slide 25

Slide 25 text

25 With l1=1, l2=2, l3=1 fixed: for which l4 is |G(l1, l2, l3, l4)| the largest?

Slide 26

Slide 26 text

26 Gene expression independent of cell lines and biological replicates, but dependent upon SARS-CoV-2 infection, is associated with u_{5i} (l4 = 5):
P_i = P_{χ²}[> (u_{5i}/σ5)²]
Computed P-values are corrected for multiple comparisons by the Benjamini-Hochberg method. 163 genes with corrected P-values < 0.01 are selected among 21,797 genes.

Slide 27

Slide 27 text

27 Multiple hits with known SARS-CoV-2 interacting human genes

Slide 28

Slide 28 text

28 Comparisons with conventional methods: since we do not know how many genes should be selected, lasso and random forest are useless here. Instead we employed SAM and limma, which are gene-selection-specific algorithms (adjusted P-values are used).

                      t test           SAM             limma
                  P>0.01  P≤0.01  P>0.01  P≤0.01  P>0.01  P≤0.01
Calu3              21754      43   21797       0     335    3789
NHBE               21797       0   21797       0     342    3906
A549 MOI 0.2       21797       0   21797       0     319    4391
A549 MOI 2.0       21472     325   21797       0     208    4169
A549 ACE2 expr.    21796       1   21797       0     182    4245

Slide 29

Slide 29 text

29 Kernelization of TD based unsupervised FE

Slide 30

Slide 30 text

30 Published in Knowledge-based systems (IF=5.9) https://doi.org/10.1016/j.knosys.2021.106834

Slide 31

Slide 31 text

31 Kernel tensor decomposition: instead of decomposing x_{ijk} (N × M × K) directly, decompose the sample-side tensor obtained by contracting over the variables:
x_{jk j'k'} = Σ_i x_{ijk} x_{ij'k'}    (linear kernel)
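The linear kernel is a single contraction over the variable index i; a sketch with arbitrary shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 1000, 6, 6
x = rng.normal(size=(N, M, K))

# linear kernel: x_{jkj'k'} = sum_i x_{ijk} x_{ij'k'}
kernel = np.einsum('ijk,iJK->jkJK', x, x)   # shape (M, K, M, K)
```

The result is symmetric under swapping (j, k) with (j', k'), as a kernel must be.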

Slide 32

Slide 32 text

32 x_{jk j'k'} ≃ Σ_{l1=1}^{L1} Σ_{l2=1}^{L2} Σ_{l3=1}^{L3} Σ_{l4=1}^{L4} G(l1,l2,l3,l4) u_{l1 j} u_{l2 k} u_{l3 j'} u_{l4 k'}
Kernel trick: replace x_{jk j'k'} with any non-negative definite kernel k(x_{ijk}, x_{ij'k'}).

Slide 33

Slide 33 text

33 Radial basis function (RBF) kernel: k(x_{ijk}, x_{ij'k'}) = exp(−α Σ_i (x_{ijk} − x_{ij'k'})²)
Polynomial kernel: k(x_{ijk}, x_{ij'k'}) = (1 + Σ_i x_{ijk} x_{ij'k'})^d
Then apply tensor decomposition to k(x_{ijk}, x_{ij'k'}).
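Both kernels can be computed over the flattened (j, k) sample axes; a sketch in which α, d, and the data shapes are arbitrary illustration values:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 500, 6, 6
x = rng.normal(size=(N, M, K))
alpha, d = 0.01, 2

# flatten (j, k) so each column is one sample over the variables i
xs = x.reshape(N, M * K)                          # (N, M*K)

# RBF: exp(-alpha * sum_i (x_ijk - x_ij'k')^2)
sq = ((xs[:, :, None] - xs[:, None, :]) ** 2).sum(axis=0)
rbf = np.exp(-alpha * sq).reshape(M, K, M, K)

# polynomial: (1 + sum_i x_ijk x_ij'k')^d
poly = ((1.0 + xs.T @ xs) ** d).reshape(M, K, M, K)
```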

Slide 34

Slide 34 text

34 Synthetic example: Swiss roll. x_{ijk} ∈ ℝ^{1000×3×10} (i: 1000 points (=n); j: 3 spatial dimensions (=p); k: 10 Swiss rolls).

Slide 35

Slide 35 text

35 SVD applied to single Swiss Roll

Slide 36

Slide 36 text

36 TD applied to a bundle of 10 Swiss Rolls

Slide 37

Slide 37 text

37 Kernel TD (with RBF) applied to a bundle of 10 Swiss Rolls

Slide 38

Slide 38 text

38 Feature selection
Linear kernel: TD of x_{jk j'k'} → u_{l1 j}, u_{l2 k}, then project back onto the variables:
u_{l1 i} ∝ Σ_{jk} x_{ijk} u_{l1 j} u_{l2 k}
P_i = P_{χ²}[> (u_{l1 i}/σ_{l1})²]
Computed P-values are corrected for multiple comparisons by the Benjamini-Hochberg method. Features with corrected P-values < 0.01 are selected.
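For the linear kernel, the projection back onto the variables and the P-value computation can be sketched as follows. The leading eigenvector of the flattened kernel is used here as a stand-in for the product u_{l1 j} u_{l2 k}; this is an illustration, not the paper's exact procedure.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
N, M, K = 1000, 6, 6
x = rng.normal(size=(N, M, K))

# linear kernel flattened to a (M*K) x (M*K) symmetric matrix
kernel = np.einsum('ijk,iJK->jkJK', x, x).reshape(M * K, M * K)
w, v = np.linalg.eigh(kernel)
u_jk = v[:, -1].reshape(M, K)   # leading factor over (j, k) — stand-in
                                # for u_{l1 j} u_{l2 k}

# project back onto variables: u_{l1 i} ∝ sum_{jk} x_ijk u_{l1 j} u_{l2 k}
u_i = np.einsum('ijk,jk->i', x, u_jk)

# chi-squared P-values attributed to variables, as on slide 14
p = chi2.sf((u_i / u_i.std()) ** 2, df=1)
```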

Slide 39

Slide 39 text

39 RBF and polynomial kernels (no direct projection back onto variables):
● Exclude a specific variable i.
● Recompute x_{jk j'k'} and apply TD → u_{l1 j}, u_{l2 k}.
● Estimate the coincidence between u_{l1 j}, u_{l2 k} and the classification of (j, k).
● Rank i based upon the amount by which the coincidence decreases.
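The steps above can be sketched as a leave-one-out loop. The coincidence score, the RBF bandwidth, and the synthetic data here are all hypothetical stand-ins for the procedure described on the slide.

```python
import numpy as np

def leading_factor(x, alpha=0.01):
    """Leading eigenvector of the (centered) RBF kernel over the samples.
    alpha is a hypothetical bandwidth, not a value from the slides."""
    n = x.shape[1]
    sq = ((x[:, :, None] - x[:, None, :]) ** 2).sum(axis=0)
    k = np.exp(-alpha * sq)
    H = np.eye(n) - 1.0 / n          # centering matrix, as in kernel PCA
    _, v = np.linalg.eigh(H @ k @ H)
    return v[:, -1]

def coincidence(u, labels):
    """Stand-in coincidence score: |correlation| with the class labels."""
    return abs(np.corrcoef(u, labels)[0, 1])

rng = np.random.default_rng(0)
N, S = 50, 20                        # variables, flattened (j, k) samples
labels = np.repeat([0.0, 1.0], S // 2)
x = rng.normal(size=(N, S))
x[:5, labels == 1] += 3.0            # first 5 variables carry the signal

base = coincidence(leading_factor(x), labels)
# rank each variable by how much its removal decreases the coincidence
drop = np.array([
    base - coincidence(leading_factor(np.delete(x, i, axis=0)), labels)
    for i in range(N)
])
ranking = np.argsort(drop)[::-1]     # largest decrease ranked first
```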

Slide 40

Slide 40 text

40 Application to the SARS-CoV-2 data set: apply the RBF kernel and select the 163 top-ranked genes. [Comparison of the gene sets selected by TD and KTD]

Slide 41

Slide 41 text

41 Conclusions
TD based unsupervised FE is specialized for feature selection in “large p small n” problems. It works comparably with conventional feature selection methods (lasso, random forest) and can give us P-values, which lasso and random forest cannot. TD based unsupervised FE could select human genes related to SARS-CoV-2 infection even when other conventional gene selection methods (t test, SAM, limma) did not work well.

Slide 42

Slide 42 text

42 TD based unsupervised FE was successfully “kernelized”. Kernel TD (KTD) based unsupervised FE could even outperform TD based unsupervised FE when applied to the identification of human genes related to SARS-CoV-2 infection. More advanced KTD based unsupervised FE methods are expected to be developed to attack a wider range of problems, including genomic science and bioinformatics.