Slide 1

Slide 1 text

Kernel tensor decomposition based unsupervised feature extraction applied to bioinformatics
Y-h. Taguchi
Department of Physics, Chuo University, Tokyo, Japan

Slide 2

Slide 2 text

Purpose: Feature selection with “large p small n”
In bioinformatics… n<

Slide 3

Slide 3 text

Why is feature selection with “large p small n” difficult?

Conventional approach:
Apply a statistical test to each individual feature
↓
Attribute a P-value to each feature in order to reject the null hypothesis (e.g., that the expression of the gene is identical between healthy controls and patients)
↓
Correct the P-values to account for the large p (a P-value as small as 1/p can occur by chance)
↓
Select the features whose adjusted P-values are below a threshold (e.g., <0.05)

[Figure: D(1-P) plotted against 1-(P-value), axis from 0 to 1]
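The correction step in the conventional approach can be illustrated with the Benjamini-Hochberg procedure. This is a minimal sketch in plain Python; the function name `benjamini_hochberg` and the toy P-values are assumptions made for illustration, not part of the slides.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy input (hypothetical): four raw P-values, one per feature.
raw = [0.01, 0.04, 0.03, 0.005]
selected = [i for i, q in enumerate(benjamini_hochberg(raw)) if q < 0.05]
```

Each raw P-value is multiplied by m/rank, so the larger p is, the smaller a raw P-value must be to survive the threshold.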

Slide 4

Slide 4 text

Why is feature selection with “large p small n” difficult?

Small n (# of samples) → P-values are not small enough
Large p (# of features) → P-values are heavily corrected (they become larger, i.e., less significant)
Adjusted P-values can hardly fall below the threshold.

Advanced approach?
LASSO (adds a penalty term to restrict the number of features selected)
↓
Lack of stability (biologically, the selected features must be stable regardless of which samples are considered, as long as the target, e.g., a specific disease, is the same)

Slide 5

Slide 5 text

Our approach:
Find sample vectors coincident with the desired property (e.g., the distinction between healthy controls (HC) and patients (PA))
↓
Select features whose contribution to the selected sample vectors is large enough (i.e., large enough to reject the null hypothesis, e.g., that the projections obey a Gaussian distribution)

The individual p features (genes) can be represented as a cloud of points in the space spanned by the n sample vectors (e.g., the gene expression of the samples).

[Figure: cloud of points with axes Sample 1 (PA), Sample 2 (HC), ..., Sample n (HC); a sample vector and the projection onto it are marked]

Slide 6

Slide 6 text

How can we derive sample vectors? → Tensor decomposition

Reason: measurements are often associated with multiple characteristics, e.g., “healthy controls vs. patients” (j) + “tissue specificity” (k), which can be represented as a tensor, $x_{ijk}$, that represents the expression of the i-th gene.

$$x_{ijk} = \sum_{l_1} \sum_{l_2} \sum_{l_3} G(l_1 l_2 l_3)\, u_{l_1 i}\, u_{l_2 j}\, u_{l_3 k}$$

[Figure: tensor $x_{ijk}$ with axes Genes, HC vs PA, and Tissue, decomposed into G, $u_{l_1 i}$, $u_{l_2 j}$, and $u_{l_3 k}$; the sample vectors and the projection onto them are marked]
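One way to obtain a decomposition of this form is the higher-order singular value decomposition (HOSVD). The slides do not name the algorithm, so HOSVD is an assumption here, and the helper names `unfold`, `hosvd`, and `reconstruct` are hypothetical. The sketch uses NumPy on a random toy tensor.

```python
import numpy as np


def unfold(x, mode):
    """Matricize tensor x so that axis `mode` indexes the rows."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)


def hosvd(x):
    """Return (core, factors); column l of factors[m] is the vector u_l of mode m."""
    factors = []
    for mode in range(x.ndim):
        u, _, _ = np.linalg.svd(unfold(x, mode), full_matrices=False)
        factors.append(u)
    core = x
    for mode, u in enumerate(factors):
        # Contract axis `mode` of the tensor with u^T, then restore the axis order.
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, mode)), 0, mode)
    return core, factors


def reconstruct(core, factors):
    """Rebuild x from the core tensor G and the factor matrices."""
    x = core
    for mode, u in enumerate(factors):
        x = np.moveaxis(np.tensordot(u, x, axes=(1, mode)), 0, mode)
    return x


# Toy tensor (hypothetical sizes): 50 genes, 3 values of j, 2 values of k.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3, 2))
core, factors = hosvd(x)
```

Without truncation the reconstruction is exact; keeping only the leading columns of each factor matrix gives the approximate form with smaller $L_1$, $L_2$, $L_3$.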

Slide 7

Slide 7 text

Application to a real example
Drug repositioning for COVID-19

Slide 8

Slide 8 text


Slide 9

Slide 9 text

Data set: GSE147507
Gene expression of human lung cell lines with/without SARS-CoV-2 infection.
i: genes (21797)
j: cell lines; j=1: Calu3, j=2: NHBE, j=3: A549 (MOI 0.2), j=4: A549 (MOI 2.0), j=5: A549 (ACE2 expressed)
(MOI: multiplicity of infection)
k: k=1: mock, k=2: SARS-CoV-2 infected
m: three biological replicates

Slide 10

Slide 10 text

$$x_{ijkm} \in \mathbb{R}^{21797 \times 5 \times 2 \times 3}$$

$$x_{ijkm} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} \sum_{l_4=1}^{L_4} G(l_1 l_2 l_3 l_4)\, u_{l_1 j}\, u_{l_2 k}\, u_{l_3 m}\, u_{l_4 i}$$

$u_{l_1 j}$: $l_1$-th dependence upon cell lines
$u_{l_2 k}$: $l_2$-th dependence upon with and without SARS-CoV-2 infection
$u_{l_3 m}$: $l_3$-th dependence upon biological replicates
$u_{l_4 i}$: $l_4$-th dependence upon genes
G: weight of the individual terms

Slide 11

Slide 11 text

Purpose: identify $l_1, l_2, l_3$ that are independent of cell lines and biological replicates ($u_{l_1 j}$ and $u_{l_3 m}$ take constant values regardless of $j$ and $m$) and dependent upon whether or not the cells are infected with SARS-CoV-2 ($u_{l_2 1} = -u_{l_2 2}$).

Heavy “large p small n” problem
Number of variables (= p): 21797 ~ $10^4$
Number of samples (= n): 5 × 2 × 3 = 30 ~ 10
p/n ~ $10^3$

Slide 12

Slide 12 text

[Figure: panels for $l_1 = 1$ (cell lines), $l_2 = 2$ (with and without SARS-CoV-2 infection), and $l_3 = 1$ (biological replicate)]

Independent of cell lines and biological replicates, but dependent upon SARS-CoV-2 infection.

Slide 13

Slide 13 text

For $l_1 = 1$, $l_2 = 2$, $l_3 = 1$: for which $l_4$ is $|G(l_1 l_2 l_3 l_4)|$ the largest?
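Answering this question amounts to a lookup along the last axis of the core tensor. The sketch below assumes the core tensor is stored as a NumPy array indexed as G[l1, l2, l3, l4] with 0-based indices; the function name and the toy values are hypothetical.

```python
import numpy as np


def dominant_gene_component(core, l1, l2, l3):
    """Return the 1-based l4 that maximizes |G(l1, l2, l3, l4)| for 1-based l1, l2, l3."""
    weights = np.abs(core[l1 - 1, l2 - 1, l3 - 1, :])
    return int(np.argmax(weights)) + 1


# Toy core tensor (made-up values, not the data from the slides).
toy_core = np.zeros((5, 2, 3, 6))
toy_core[0, 1, 0, 1] = 2.0
toy_core[0, 1, 0, 2] = -3.0
l4 = dominant_gene_component(toy_core, l1=1, l2=2, l3=1)
```

The absolute value matters because the sign of a singular vector is arbitrary, so a large negative weight is as relevant as a large positive one.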

Slide 14

Slide 14 text

Gene expression that is independent of cell lines and biological replicates, but dependent upon SARS-CoV-2 infection, is associated with $u_{5i}$ ($l_4 = 5$).

$$P_i = P_{\chi^2}\left[ > \left( \frac{u_{5i}}{\sigma_5} \right)^2 \right]$$

The computed P-values are corrected for multiple comparisons by the Benjamini-Hochberg method.
163 genes with corrected P-values <0.01 are selected among the 21,797 genes.
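The P-value formula above can be sketched with the standard library alone. Two assumptions are made here that the slide does not state: the chi-squared distribution has one degree of freedom, and $\sigma$ is the population standard deviation of the vector. For one degree of freedom the upper tail has the closed form $P(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2})$. The function names are hypothetical.

```python
import math
import statistics


def chi2_upper_tail_1dof(x):
    """P(chi-squared with 1 degree of freedom > x), via the complementary error function."""
    return math.erfc(math.sqrt(x / 2.0))


def p_values(u):
    """P_i = P_chi2[> (u_i / sigma)^2] for every component u_i of the vector u."""
    sigma = statistics.pstdev(u)  # assumption: population standard deviation
    return [chi2_upper_tail_1dof((ui / sigma) ** 2) for ui in u]


# Toy vector (hypothetical): two components stand out from the rest.
toy_u = [1.0, -1.0, 0.0, 0.0]
toy_p = p_values(toy_u)
```

Components near zero get P-values near 1, while components far out in either tail get small P-values, which is what allows outlying genes to be selected.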

Slide 15

Slide 15 text

Multiple hits with known SARS-CoV-2 interacting human genes

Slide 16

Slide 16 text

Comparisons with conventional methods:
Since we do not know how many genes should be selected, LASSO and random forest are useless. Instead, we employed SAM and limma, which are algorithms specific to gene selection (adjusted P-values are used).

| Cell line           | t test P>0.01 | t test P≦0.01 | SAM P>0.01 | SAM P≦0.01 | limma P>0.01 | limma P≦0.01 |
|---------------------|---------------|---------------|------------|------------|--------------|--------------|
| Calu3               | 21754         | 43            | 21797      | 0          | 335          | 3789         |
| NHBE                | 21797         | 0             | 21797      | 0          | 342          | 3906         |
| A549 MOI 0.2        | 21797         | 0             | 21797      | 0          | 319          | 4391         |
| A549 MOI 2.0        | 21472         | 325           | 21797      | 0          | 208          | 4169         |
| A549 ACE2 expressed | 21796         | 1             | 21797      | 0          | 182          | 4245         |

Slide 17

Slide 17 text

Comparisons with DESeq2:

| Cell line           | DESeq2 P>0.01 | DESeq2 P≦0.01 |
|---------------------|---------------|---------------|
| Calu3               | 7278          | 16432         |
| NHBE                | 23383         | 327           |
| A549 MOI 0.2        | 7858          | 15852         |
| A549 MOI 2.0        | 16279         | 7431          |
| A549 ACE2 expressed | 16201         | 7509          |

After the publication of our paper, we found that the paper [*] that originally studied this GEO data set had been published (when we did this study, only the GEO data set was provided and no papers had been published). That paper includes DESeq2 results. DESeq2 behaves similarly to limma: it detected most of the genes as DEGs, whereas it identified only a limited number of DEGs for the NHBE cell line.

[*] Daniel Blanco-Melo et al., Imbalanced Host Response to SARS-CoV-2 Drives Development of COVID-19. Cell, 2020; 181(5): 1036-1045.e9. doi: 10.1016/j.cell.2020.04.026.

Slide 18

Slide 18 text

Kernelization of TD based unsupervised FE
This can reduce the data size: $\mathbb{R}^{p \times n_1 \times n_2} \to \mathbb{R}^{n_1 \times n_2 \times n_1 \times n_2}$, $n_1, n_2$ <

Slide 19

Slide 19 text

Published in Knowledge-Based Systems (IF=5.9)
https://doi.org/10.1016/j.knosys.2021.106834

Slide 20

Slide 20 text

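Kernel tensor decomposition

[Figure: the tensor $x_{ijk}$ (i=1,…,p; j=1,…,$n_1$; k=1,…,$n_2$), shown with its decomposition into G ($L_1$, $L_2$, $L_3$), $u_{l_1 i}$, $u_{l_2 j}$, and $u_{l_3 k}$, is multiplied by $x_{ij'k'}$ (same index ranges)]

$$x_{jkj'k'} = \sum_i x_{ijk}\, x_{ij'k'}$$ (linear kernel)

This can reduce the data size: $\mathbb{R}^{p \times n_1 \times n_2} \to \mathbb{R}^{n_1 \times n_2 \times n_1 \times n_2}$, $n_1, n_2$ <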
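The linear kernel above is a single contraction over the feature index i. A minimal NumPy sketch follows; the function name `linear_kernel_tensor` and the toy sizes are assumptions for illustration.

```python
import numpy as np


def linear_kernel_tensor(x):
    """Map x of shape (p, n1, n2) to K with K[j, k, j', k'] = sum_i x[i, j, k] * x[i, j', k']."""
    return np.einsum("ijk,ilm->jklm", x, x)


# Toy data (hypothetical sizes): p = 1000 features, n1 = 5, n2 = 2.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 5, 2))
kernel = linear_kernel_tensor(x)
```

With these toy sizes the input holds 1000 × 5 × 2 = 10,000 values, whereas the kernel tensor holds 5 × 2 × 5 × 2 = 100, which is the size reduction the slide refers to.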

Slide 21

Slide 21 text

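$$x_{jkj'k'} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} \sum_{l_4=1}^{L_4} G(l_1 l_2 l_3 l_4)\, u_{l_1 j}\, u_{l_2 k}\, u_{l_3 j'}\, u_{l_4 k'}$$

$$x_{jkj'k'} \in \mathbb{R}^{n_1 \times n_2 \times n_1 \times n_2}$$

[Figure: G ($L_1$, $L_2$, $L_3$, $L_4$) connected to $u_{l_1 j}$, $u_{l_2 k}$, $u_{l_3 j'}$, and $u_{l_4 k'}$]

Kernel trick
$x_{jkj'k'} \to k(x_{ijk}, x_{ij'k'})$: non-negative definite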

Slide 22

Slide 22 text

Radial basis function kernel:
$$k(x_{ijk}, x_{ij'k'}) = \exp\left( -\alpha \sum_i (x_{ijk} - x_{ij'k'})^2 \right)$$

Polynomial kernel:
$$k(x_{ijk}, x_{ij'k'}) = \left( 1 + \sum_i x_{ijk}\, x_{ij'k'} \right)^d$$

$k(x_{ijk}, x_{ij'k'})$ → tensor decomposition
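Both kernels can be computed by flattening the sample indices (j, k) into one axis and working with the resulting Gram matrix. The following is a minimal NumPy sketch; the function names and the toy parameter values are assumptions, and `alpha` and `d` correspond to $\alpha$ and $d$ in the formulas above.

```python
import numpy as np


def rbf_kernel_tensor(x, alpha):
    """K[j, k, j', k'] = exp(-alpha * sum_i (x[i, j, k] - x[i, j', k'])**2)."""
    p, n1, n2 = x.shape
    flat = x.reshape(p, n1 * n2)  # each column is one sample (j, k)
    sq_norm = np.sum(flat ** 2, axis=0)
    sq_dist = sq_norm[:, None] + sq_norm[None, :] - 2.0 * (flat.T @ flat)
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative rounding errors
    return np.exp(-alpha * sq_dist).reshape(n1, n2, n1, n2)


def polynomial_kernel_tensor(x, d):
    """K[j, k, j', k'] = (1 + sum_i x[i, j, k] * x[i, j', k'])**d."""
    p, n1, n2 = x.shape
    flat = x.reshape(p, n1 * n2)
    return ((1.0 + flat.T @ flat) ** d).reshape(n1, n2, n1, n2)


# Toy data (hypothetical sizes and parameters).
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 3, 2))
k_rbf = rbf_kernel_tensor(x, alpha=0.1)
k_poly = polynomial_kernel_tensor(x, d=2)
```

Either tensor can then be passed to the tensor decomposition in place of the linear kernel.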

Slide 23

Slide 23 text

Feature selection

Linear kernel:
$x_{jkj'k'}$ → (TD) → $u_{l_1 j}$, $u_{l_2 k}$

$$u_{l_1 i} \propto \sum_{jk} x_{ijk}\, u_{l_1 j}\, u_{l_2 k}$$

$$P_i = P_{\chi^2}\left[ > \left( \frac{u_{l_1 i}}{\sigma_{l_1}} \right)^2 \right]$$

The computed P-values are corrected for multiple comparisons by the Benjamini-Hochberg method.
Features with corrected P-values <0.01 are selected.
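The projection from sample vectors back to features can be sketched as follows. The sample vectors are taken here as left singular vectors of the unfolded kernel tensor, which is an assumption about how the TD is carried out; the function names `mode_vectors` and `project_features` are hypothetical.

```python
import numpy as np


def mode_vectors(kernel, mode):
    """Left singular vectors of the kernel tensor unfolded along axis `mode`."""
    unfolded = np.moveaxis(kernel, mode, 0).reshape(kernel.shape[mode], -1)
    u, _, _ = np.linalg.svd(unfolded, full_matrices=False)
    return u  # column l is the l-th vector of this mode


def project_features(x, u_j, u_k):
    """Feature scores proportional to sum_{j,k} x[i, j, k] * u_j[j] * u_k[k]."""
    return np.einsum("ijk,j,k->i", x, u_j, u_k)


# Toy rank-one data (hypothetical): constant over j, sign flip over k.
rng = np.random.default_rng(0)
a = rng.normal(size=7)
b = np.ones(3) / np.sqrt(3.0)
c = np.array([1.0, -1.0]) / np.sqrt(2.0)
x = np.einsum("i,j,k->ijk", a, b, c)

kernel = np.einsum("ijk,ilm->jklm", x, x)  # linear kernel
u_j = mode_vectors(kernel, 0)[:, 0]
u_k = mode_vectors(kernel, 1)[:, 0]
scores = project_features(x, u_j, u_k)
```

In this toy case the scores recover the gene-wise pattern `a` up to an overall sign, which is arbitrary in any singular vector. That arbitrariness is harmless because the P-value depends only on the squared score.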

Slide 24

Slide 24 text

RBF, polynomial kernels

Exclude a specific feature i
↓
Recompute $x_{jkj'k'}$
↓
$x_{jkj'k'}$ → (TD) → $u_{l_1 j} \times u_{l_2 k}$
↓
Estimate the coincidence between $u_{l_1 j}$, $u_{l_2 k}$ and the classification of (k, j)
↓
Rank the features i by the amount of decrease in coincidence
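For the RBF kernel, the recomputation step need not start from scratch, because the kernel factorizes over features. A short derivation, where the superscript $(-i)$ is notation introduced here for the kernel with feature $i$ excluded:

$$k^{(-i)}_{jkj'k'} = \exp\left( -\alpha \sum_{i' \neq i} (x_{i'jk} - x_{i'j'k'})^2 \right) = k(x_{ijk}, x_{ij'k'}) \, \exp\left( \alpha\, (x_{ijk} - x_{ij'k'})^2 \right)$$

The polynomial kernel does not factorize in this way, but the sum inside it can be updated by subtracting the single term $x_{ijk}\, x_{ij'k'}$ before raising to the power $d$.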

Slide 25

Slide 25 text

Application to SARS-CoV-2 data set
Apply the RBF kernel and select the 163 top-ranked genes.

[Figure: panels labeled TD and KTD]

Slide 26

Slide 26 text

Application to integration of multiomics data sets

Slide 27

Slide 27 text


Slide 28

Slide 28 text

Integrate four omics profiles that share the same samples, $n_1$, $n_2$, with distinct numbers of features, $p_k$

Methylation: $p_k$ = 687582
Gene expression: $p_k$ = 35829
Proteome 1: $p_k$ = 1588
Proteome 2: $p_k$ = 1588

Each profile has size $p_k \times n_1 \times n_2$, with $n_1 = 5$, $n_2 = 15$.

Slide 29

Slide 29 text

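[Figure: tensors in $\mathbb{R}^{p_k \times n_1 \times n_2}$, $\mathbb{R}^{n_1 \times n_2 \times n_1 \times n_2}$, and $\mathbb{R}^{n_1 \times n_2 \times n_1 \times n_2 \times K}$]

$$u_{l_1 i} \propto \sum_{jk} x_{ijk}\, u_{l_1 j}\, u_{l_2 k}$$

$$P_i = P_{\chi^2}\left[ > \left( \frac{u_{l_1 i}}{\sigma_{l_1}} \right)^2 \right]$$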
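Because every kernel tensor has the same shape regardless of $p_k$, the omics profiles can be combined after kernelization. The sketch below assumes that the K linear kernels are stacked along a fifth axis, which is how the $n_1 \times n_2 \times n_1 \times n_2 \times K$ shape on this slide is read here; the function name and the toy sizes are hypothetical.

```python
import numpy as np


def stacked_linear_kernels(profiles):
    """Stack the linear kernels of K profiles, each of shape (p_k, n1, n2), along a last axis."""
    kernels = [np.einsum("ijk,ilm->jklm", x, x) for x in profiles]
    return np.stack(kernels, axis=-1)  # shape (n1, n2, n1, n2, K)


# Toy profiles (hypothetical sizes): same samples, different numbers of features.
rng = np.random.default_rng(0)
profiles = [rng.normal(size=(40, 2, 3)), rng.normal(size=(7, 2, 3))]
stacked = stacked_linear_kernels(profiles)
```

The feature axis disappears inside each kernel, so profiles with very different feature counts contribute slices of identical size.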

Slide 30

Slide 30 text

1335 genes associated with 2077 methylation probes

Slide 31

Slide 31 text

Conclusions
We have invented kernel tensor decomposition based unsupervised feature extraction.
It can reduce the required data size from p × n to n × n. Since p >> n, this is a large reduction.
It can integrate multiomics data sets that have distinct numbers of features, provided they share samples.