Y-h. Taguchi
October 30, 2019

Tensor decomposition based unsupervised feature extraction applied to bioinformatics

Presentation at PreMed'19
http://sharmalab.bwh.harvard.edu/premed19/
31st Oct 2019, OIST

Transcript

1. Tensor decomposition based unsupervised feature extraction applied to bioinformatics

Y-h. Taguchi, Department of Physics, Chuo University, Tokyo 112-8551, Japan. This presentation is available
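2. Singular value decomposition

Example: $x_{ij}$ is gene expression, where $N$ is the number of genes ($i$) and $M$ is the number of samples ($j$). The $N \times M$ matrix $x_{ij}$ is approximated by the product of an $N \times L$ matrix (the transpose of $u_{li}$), an $L \times L$ diagonal matrix holding the $\lambda_l$, and an $L \times M$ matrix $v_{lj}$:

$$x_{ij} \simeq \sum_{l=1}^{L} u_{li} \lambda_l v_{lj}$$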
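A minimal sketch of the decomposition on this slide, assuming NumPy and a small random matrix standing in for real expression data. The sizes and variable names are illustrative and do not come from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 6, 4                      # toy sizes: N genes, M samples (assumed)
x = rng.normal(size=(N, M))      # stand-in for the expression matrix x_ij

# Thin SVD: x = U diag(lam) Vt, with U of shape (N, L) and Vt of shape (L, M)
U, lam, Vt = np.linalg.svd(x, full_matrices=False)
u = U.T                          # u[l, i] plays the role of u_li
v = Vt                           # v[l, j] plays the role of v_lj

# Rebuild x_ij = sum_l u_li * lam_l * v_lj
x_rebuilt = np.einsum("li,l,lj->ij", u, lam, v)
max_abs_error = float(np.max(np.abs(x - x_rebuilt)))
```

With all $L = \min(N, M)$ components kept, the reconstruction is exact up to rounding; truncating to fewer components gives the approximation written on the slide.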
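3. Interpretation

For some specific $l$, the sample singular value vector $v_{lj}$ ($j$: samples) distinguishes healthy controls from patients. In the corresponding gene singular value vector $u_{li}$ ($i$: genes), the genes at the two extremes are DEGs (differentially expressed genes): one extreme corresponds to healthy controls < patients, the other to healthy controls > patients.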
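4. Extension to tensor: HOSVD (Higher Order Singular Value Decomposition)

Example: $x_{ijk}$ is gene expression, where $N$ is the number of genes ($i$), $M$ is the number of samples ($j$) and $K$ is the number of tissues ($k$). The $N \times M \times K$ tensor $x_{ijk}$ is approximated with an $L_1 \times L_2 \times L_3$ core tensor $G$ and the singular value vectors $u_{l_1 i}$, $u_{l_2 j}$, $u_{l_3 k}$:

$$x_{ijk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3) u_{l_1 i} u_{l_2 j} u_{l_3 k}$$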
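The HOSVD formula can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the helper names `unfold` and `hosvd` are hypothetical, the tensor is random toy data, and each factor matrix is taken from the SVD of the corresponding mode unfolding.

```python
import numpy as np

def unfold(t, mode):
    """Matricize tensor t so that axis `mode` indexes the rows."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def hosvd(t):
    """Return the core tensor G and one factor matrix per mode.

    Column l of factor m is the l-th singular value vector of mode m.
    """
    factors = [np.linalg.svd(unfold(t, m), full_matrices=False)[0]
               for m in range(t.ndim)]
    core = t
    for m, u in enumerate(factors):
        # Contract mode m of the core with the transpose of factor m
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, m)), 0, m)
    return core, factors

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4, 3))           # toy N x M x K tensor (assumed sizes)
G, (u1, u2, u3) = hosvd(x)

# x_ijk = sum over l1, l2, l3 of G(l1 l2 l3) u_{l1 i} u_{l2 j} u_{l3 k}
x_rebuilt = np.einsum("abc,ia,jb,kc->ijk", G, u1, u2, u3)
max_abs_error = float(np.max(np.abs(x - x_rebuilt)))
```

Here `u1[i, l1]` corresponds to $u_{l_1 i}$ on the slide. Because no components are truncated, the reconstruction matches the input up to rounding.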
5. Interpretation

For some specific $l_2$, $u_{l_2 j}$ ($j$: samples) distinguishes healthy controls from patients. For some specific $l_3$, $u_{l_3 k}$ ($k$: tissues) shows tissue specific expression.
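6. $i$: genes. With $l_2$ and $l_3$ fixed, take the specific $l_1$ with max $|G(l_1 l_2 l_3)|$. Genes at the extremes of $u_{l_1 i}$ are tDEGs (tissue specific differentially expressed genes). If $G(l_1 l_2 l_3) > 0$, one extreme corresponds to healthy controls < patients and the other to healthy controls > patients.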
7. Integrated analysis of multiple matrices and/or tensors

$x_{ij}$: expression of gene $i$ of sample $j$. $x_{kj}$: methylation of region $k$ of sample $j$. The tensor is defined as $x_{ijk} \equiv x_{ij} \times x_{kj}$ and decomposed with an $L_1 \times L_2 \times L_3$ core tensor $G$:

$$x_{ijk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3) u_{l_1 i} u_{l_2 j} u_{l_3 k}$$
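Building the product tensor from two matrices that share the sample index can be sketched as below. This assumes NumPy and random toy data; the sizes and array names are illustrative, and the axis order (gene, sample, region) is a choice made here rather than something the slide specifies.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, M = 5, 3, 4                        # toy sizes: genes, regions, samples (assumed)
x_gene = rng.normal(size=(N, M))         # x_ij: expression of gene i in sample j
x_meth = rng.normal(size=(K, M))         # x_kj: methylation of region k in sample j

# x_ijk = x_ij * x_kj, with axes ordered (gene i, sample j, region k)
x = np.einsum("ij,kj->ijk", x_gene, x_meth)
```

The sample index $j$ is shared rather than summed, so the result keeps one axis per gene, sample and region.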

l2
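9. For gene expression ($i$: genes): with $l_2$ fixed, take the specific $l_1$, $l_3$ with max $|G(l_1 l_2 l_3)|$. Genes at the extremes of $u_{l_1 i}$ are DEGs (differentially expressed genes). If $G(l_1 l_2 l_3) > 0$, one extreme corresponds to healthy controls < patients and the other to healthy controls > patients.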
10. For methylation ($k$: regions): regions at the extremes of $u_{l_3 k}$ are DMRs (differentially methylated regions). One extreme corresponds to healthy controls < patients and the other to healthy controls > patients.
11. Application example No.1

“Multiomics Data Analysis Using Tensor Decomposition Based Unsupervised Feature Extraction –Comparison with DIABLO–”, Y-h. Taguchi, in De-Shuang Huang, Vitoantonio Bevilacqua, Prashan Premaratne (Eds.), Intelligent Computing Theories and Application, 15th International Conference, ICIC 2019, Nanchang, China, August 3–6, 2019, Proceedings, Part I, pp. 565-574. https://doi.org/10.1007/978-3-030-26763-6_54 Preprint: https://doi.org/10.1101/591867
12. Data taken from the mixOmics package in Bioconductor, https://bioconductor.org/packages/release/bioc/html/mixOmics.html

mRNA: 150 samples ⨉ 200 mRNAs
miRNA: 150 samples ⨉ 184 miRNAs
proteomics: 150 samples ⨉ 142 proteins
Three cell lines: Basal 45, Her2 30, LumA 75
13. $x_{ij}$: expression of the $i$th mRNA of the $j$th sample. $x_{kj}$: expression of the $k$th miRNA of the $j$th sample. $x_{pj}$: expression of the $p$th protein of the $j$th sample.

Tensor: $x_{ikpj} = x_{ij} \cdot x_{kj} \cdot x_{pj}$. Apply tensor decomposition (the tensor version of singular value decomposition):

$$x_{ikpj} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} \sum_{l_4=1}^{L_4} G(l_1 l_2 l_3 l_4) u_{l_1 i} u_{l_2 k} u_{l_3 p} u_{l_4 j}$$

$u_{l_1 i}$: mRNA, $u_{l_2 k}$: miRNA, $u_{l_3 p}$: proteome, $u_{l_4 j}$: sample.
14. $u_{1j}$ and $u_{4j}$ for Basal, Her2 and LumA samples. Linear discriminant analysis with leave-one-out cross validation, error 6.5%:

Predicted \ Real    Basal   Her2   LumA
Basal                  42      4      0
Her2                    2     25      2
LumA                    1      1     73
15. Descending order of $|G(l_1, l_2, l_3, l_4)|$ with $l_4 = 1, 4$: $1 \le l_1 \le 2$ (mRNA), $1 \le l_2 \le 2$ (miRNA), $1 \le l_3 \le 4$ (proteome).
16. Selecting the 10 top-ranked mRNAs, miRNAs and proteins based upon the squared sum of singular value vectors
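One way to read the ranking step is sketched below. It assumes that "squared sum" means the sum of $u_{li}^2$ over the chosen components $l$, which is an interpretation made here. The function name `top_features` and the toy data are hypothetical.

```python
import numpy as np

def top_features(u, components, k=10):
    """Rank features by the sum of squared entries over the chosen components.

    u: array of shape (L, n_features); u[l, i] plays the role of u_li.
    components: indices l to include (0-based here).
    Returns the indices of the k highest-scoring features, best first.
    """
    score = np.sum(u[list(components), :] ** 2, axis=0)
    return np.argsort(score)[::-1][:k]

rng = np.random.default_rng(2)
u_toy = rng.normal(size=(4, 200))        # toy stand-in for mRNA singular value vectors
u_toy[0, 7] = 10.0                       # plant one obviously large entry
selected = top_features(u_toy, components=[0, 1], k=10)
```

In the toy data the planted feature (index 7) comes out first; the same call would be repeated for the miRNA and protein vectors.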
17. Discrimination performances using selected features (figure: Basal, Her2 and LumA; mRNA, miRNA and protein).
18. Comparisons with DIABLO implemented in mixOmics

Discrimination performances using generated features (figure: errors, with axis ticks from 0.05 to 0.15, versus number of components generated).

20. Pros and cons of TD based unsupervised FE

Pros: fast (because of no optimization), robust (independent of label information), unsupervised (no need to construct a model in advance).
Cons: no remedy if it does not work; needs more memory, 150 ⨉ (200+184+142) vs 150 ⨉ 200 ⨉ 184 ⨉ 142.
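The memory comparison can be made concrete with a few lines of arithmetic. The sketch assumes 8 bytes per value (64-bit floats), which the slide does not state.

```python
n_samples, n_mrna, n_mirna, n_protein = 150, 200, 184, 142

# Three separate matrices versus one four-way product tensor
matrix_entries = n_samples * (n_mrna + n_mirna + n_protein)
tensor_entries = n_samples * n_mrna * n_mirna * n_protein

bytes_per_value = 8  # assumption: 64-bit floats
matrix_bytes = matrix_entries * bytes_per_value
tensor_bytes = tensor_entries * bytes_per_value

print(matrix_entries, tensor_entries)           # 78900 783840000
print(round(tensor_bytes / 1e9, 2), "GB")       # 6.27 GB
```

Under that assumption the tensor holds roughly ten thousand times as many values as the three matrices together.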
21. Application example No.2

“Tensor Decomposition-Based Unsupervised Feature Extraction Applied to Single-Cell Gene Expression Analysis”, Y-h. Taguchi and Turki Turki, Frontiers in Genetics, Volume 10, Article 864, 2019. doi: 10.3389/fgene.2019.00864
22. Data set: GSE76381, scRNA-seq of human and mouse midbrain development

H: $x_{ij} \in \mathbb{R}^{19531 \times 1977}$. Mouse: $x_{ik} \in \mathbb{R}^{24378 \times 1907}$. $i$: genes; $j$, $k$: cells. Purpose of the analysis: selection of genes associated with midbrain development commonly between human and mouse.
23. Cell numbers and time points

Human (w: week): 6w: 287 cells, 7w: 131 cells, 8w: 331 cells, 9w: 322 cells, 10w: 509 cells, 11w: 397 cells; in total, 1977 cells.
Mouse: E11.5: 349 cells, E12.5: 350 cells, E13.5: 345 cells, E14.5: 308 cells, E15.5: 356 cells, E18.5: 142 cells, unknown: 57 cells; in total, 1907 cells.
24. Tensor decomposition

The tensor is generated from the product of cells, using 13,384 genes common to human and mouse: $x_{ijk} = x_{ij} \times x_{ik} \in \mathbb{R}^{13384 \times 1977 \times 1907}$ ($i$: genes; $j$, $k$: cells). Size reduction is needed because the tensor is too large: $x_{jk} = \sum_i x_{ijk}$. $x_{jk}$ is decomposed by singular value decomposition, giving $v_{lj}$, the $l$th human cell singular value vectors, and $v_{lk}$, the $l$th mouse cell singular value vectors.
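The size reduction works because $\sum_i x_{ij} x_{ik}$ is an ordinary matrix product, so the three-way tensor never has to be stored. A minimal NumPy sketch with toy sizes and random data (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n_genes, n_human, n_mouse = 6, 5, 4      # toy sizes (the talk uses 13384, 1977, 1907)
x_human = rng.normal(size=(n_genes, n_human))   # x_ij
x_mouse = rng.normal(size=(n_genes, n_mouse))   # x_ik

# x_jk = sum_i x_ij * x_ik, computed without ever forming x_ijk
x_jk = x_human.T @ x_mouse

# Check against the explicit (small) tensor summed over genes
x_ijk = np.einsum("ij,ik->ijk", x_human, x_mouse)
same = bool(np.allclose(x_jk, x_ijk.sum(axis=0)))

# SVD of x_jk: rows of v_human are v_lj, rows of v_mouse are v_lk
U, lam, Vt = np.linalg.svd(x_jk, full_matrices=False)
v_human = U.T
v_mouse = Vt
```

The explicit tensor is built here only to verify the shortcut on toy data; at the sizes quoted on the slide only the $1977 \times 1907$ matrix would be formed.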
25. $v_{lj} = a_l + \sum_t b_{lt} \delta_{jt}$ and $v_{lk} = a'_l + \sum_t b'_{lt} \delta_{kt}$, where $\delta_{jt}$ ($\delta_{kt}$) is 1 when cell $j$ ($k$) is measured at $t$ and 0 otherwise. $v_{lj}$ and $v_{lk}$ with any kind of time dependence are selected with categorical regression (ANOVA).
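Regressing a singular value vector on time-point indicators is a one-way ANOVA, so a vector shows time dependence when its between-group variance is large relative to its within-group variance. The sketch below computes only the F statistic with NumPy. The function name `anova_f` and the numbers are made up for illustration, and turning F into a P-value would additionally need the F distribution (for example `scipy.stats.f_oneway`).

```python
import numpy as np

def anova_f(values, labels):
    """One-way ANOVA F statistic for `values` grouped by categorical `labels`."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    groups = [values[labels == g] for g in np.unique(labels)]
    grand_mean = values.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Toy v_lj for nine cells measured at three time points (made-up numbers)
v = [1, 2, 3, 2, 3, 4, 5, 6, 7]
t = ["6w", "6w", "6w", "7w", "7w", "7w", "8w", "8w", "8w"]
f_stat = anova_f(v, t)
```

For this toy input the group means are 2, 3 and 6, giving a between-group mean square of 13 and a within-group mean square of 1.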
26. How common are the selected singular value vectors?

Singular value vectors associated with adjusted P-values less than 0.01 are selected. (Venn diagram of human and mouse; counts shown: 12, 23, 32.)
27. $u_{li}$ are generated from $v_{lj}$ and $v_{lk}$

$u_{li}^{(j)} = \sum_j v_{lj} x_{ij}$: $l$th human gene singular value vectors. $u_{li}^{(k)} = \sum_k v_{lk} x_{ik}$: $l$th mouse gene singular value vectors. P-values are attributed to gene singular value vectors by the $\chi^2$ distribution and corrected by the BH criterion; genes associated with adjusted P-values less than 0.01 are selected.
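The projection that produces the gene singular value vectors is a single matrix product. A minimal NumPy sketch with random toy data, where the sizes and names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n_genes, n_cells, L = 6, 5, 3            # toy sizes (assumed)
x = rng.normal(size=(n_genes, n_cells))  # x_ij: gene i, cell j
v = rng.normal(size=(L, n_cells))        # v[l, j]: selected cell singular value vectors

# u_li = sum_j v_lj * x_ij
u = v @ x.T                              # shape (L, n_genes)
```

The same line applied to the mouse matrix and the $v_{lk}$ would give the mouse gene singular value vectors.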
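28. P-values by $\chi^2$ distribution: $P_i = P\left[ > \sum_l \left( \frac{u_{li}}{\sigma} \right)^2 \right]$. Genes with Benjamini-Hochberg corrected $P < 0.01$ are selected. (Figure: histogram $P(p)$ of $1-p$ over the range 0 to 1. Venn diagram of selected genes for human and mouse; counts shown: 151, 200, 305.)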
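The Benjamini-Hochberg step can be sketched as follows, taking raw P-values as input. The P-values here are made up, the function name `bh_adjust` is hypothetical, and the $\chi^2$ step that would produce real P-values is left out.

```python
import numpy as np

def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working from the largest P-value downwards
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

p_toy = [0.001, 0.008, 0.039, 0.041, 0.6]   # made-up P-values
p_adj = bh_adjust(p_toy)
selected = np.flatnonzero(p_adj < 0.01)     # indices passing the 0.01 threshold
```

With these toy values the adjusted P-values are 0.005, 0.02, 0.05125, 0.05125 and 0.6, so only the first entry passes the 0.01 threshold used on the slide.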
29. Validation: the selected genes were uploaded to Enrichr (enrichment server).

“Allen Brain Atlas”, top ranked five terms: for both human and mouse, four out of the top five are related to hypothalamus, which belongs to the midbrain.
30. Summary

We can select biologically reasonable genes with unsupervised methods using TD for multi-omics data analysis as well as RNA-seq data analysis. I have published a monograph from Springer. I would be happy if you could buy it, although it is extremely expensive.