Tensor decomposition based unsupervised feature extraction applied to bioinformatics

1 Tensor decomposition based unsupervised feature extraction applied to bioinformatics
Y-h. Taguchi Department of Physics, Chuo University, Tokyo 112-8551, Japan. This presentation is available

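2 Singular value decomposition
$$x_{ij} \simeq \sum_{l=1}^{L} u_{li} \lambda_l v_{lj}$$
The N ⨉ M matrix $x_{ij}$ is approximated by the product of $(u_{li})^T$ (N ⨉ L), the diagonal matrix of $\lambda_l$ (L ⨉ L) and $v_{lj}$ (L ⨉ M).
N: number of genes (i); M: number of samples (j); $x_{ij}$: gene expression (example)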
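A minimal NumPy sketch of the rank-L approximation on this slide. The matrix is synthetic and the sizes are toy values chosen for illustration; neither comes from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, L = 50, 8, 3            # genes, samples, retained components (toy sizes)
x = rng.normal(size=(N, M))   # synthetic stand-in for gene expression x_ij

# NumPy returns x = U @ diag(s) @ Vt with singular values in descending order.
U, s, Vt = np.linalg.svd(x, full_matrices=False)

u = U[:, :L].T   # u[l, i]: l-th gene singular value vector
lam = s[:L]      # lambda_l
v = Vt[:L, :]    # v[l, j]: l-th sample singular value vector

# x_ij ~ sum_l u_li * lambda_l * v_lj
x_approx = np.einsum("li,l,lj->ij", u, lam, v)
rel_error = np.linalg.norm(x - x_approx) / np.linalg.norm(x)
```

Keeping all M components instead of L reproduces x exactly; truncation trades accuracy for a smaller set of vectors to interpret.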

3 Interpretation
For some specific l, $v_{lj}$ (j: samples) separates healthy controls from patients.
$u_{li}$ (i: genes) then identifies DEGs (Differentially Expressed Genes): on one side, DEGs with healthy controls < patients; on the other side, DEGs with healthy controls > patients.

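4 Extension to tensor: HOSVD (Higher Order Singular Value Decomposition)
$$x_{ijk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3) u_{l_1 i} u_{l_2 j} u_{l_3 k}$$
The N ⨉ M ⨉ K tensor $x_{ijk}$ is approximated by the core tensor G ($L_1$ ⨉ $L_2$ ⨉ $L_3$) and the singular value vectors $u_{l_1 i}$, $u_{l_2 j}$, $u_{l_3 k}$.
N: number of genes (i); M: number of samples (j); K: number of tissues (k); $x_{ijk}$: gene expression (example)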
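A minimal sketch of HOSVD in NumPy, assuming the standard construction: one SVD per mode unfolding, then projection of the tensor onto the resulting singular vectors to obtain the core tensor G. The helper names `unfold` and `hosvd` and the toy tensor are illustrative, not taken from the talk.

```python
import numpy as np

def unfold(t, mode):
    """Matricize tensor t so that `mode` indexes the rows."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def hosvd(t):
    """Return the core tensor G and one factor matrix per mode (no truncation)."""
    factors = []
    for mode in range(t.ndim):
        U, _, _ = np.linalg.svd(unfold(t, mode), full_matrices=False)
        factors.append(U)  # columns are the singular value vectors of this mode
    G = t
    for mode, U in enumerate(factors):
        # Project mode `mode` onto its singular vectors (mode product with U^T).
        G = np.moveaxis(np.tensordot(U.T, G, axes=(1, mode)), 0, mode)
    return G, factors

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4, 3))   # synthetic x_ijk: genes x samples x tissues
G, (u1, u2, u3) = hosvd(x)

# x_ijk = sum_{l1,l2,l3} G(l1 l2 l3) u_{l1 i} u_{l2 j} u_{l3 k}
# Here u1[i, l1] stores the slide's u_{l1 i} with the indices transposed.
x_rebuilt = np.einsum("abc,ia,jb,kc->ijk", G, u1, u2, u3)
```

Without truncation the reconstruction is exact; restricting each index to its first few values gives the approximation written on the slide.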

5 Interpretation
For some specific $l_2$, $u_{l_2 j}$ (j: samples) separates healthy controls from patients.
For some specific $l_3$, $u_{l_3 k}$ (k: tissues) shows tissue-specific expression.

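6 With $l_2$ and $l_3$ fixed, take the specific $l_1$ with max $|G(l_1 l_2 l_3)|$.
$u_{l_1 i}$ (i: genes) then identifies tDEGs (tissue-specific Differentially Expressed Genes). If $G(l_1 l_2 l_3) > 0$: on one side, tDEGs with healthy controls < patients; on the other side, tDEGs with healthy controls > patients.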
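The choice of $l_1$ described here amounts to an argmax over one fibre of the core tensor. A small sketch, assuming a synthetic G and hypothetical 0-based values for the fixed $l_2$ and $l_3$:

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(4, 3, 2))   # synthetic core tensor G[l1, l2, l3]

l2, l3 = 1, 0                    # hypothetical fixed indices (0-based)

# l1 with the largest |G(l1 l2 l3)| for the fixed l2, l3
l1 = int(np.argmax(np.abs(G[:, l2, l3])))

# The sign of this element decides how the two sides of u_{l1 i} are read.
sign = float(np.sign(G[l1, l2, l3]))
```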

7 Integrated analysis of multiple matrices and/or tensors
$x_{ij}$: expression of gene i of sample j; $x_{kj}$: methylation of region k of sample j
$$x_{ijk} \equiv x_{ij} \times x_{kj}$$
$$x_{ijk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3) u_{l_1 i} u_{l_2 j} u_{l_3 k}$$
$x_{ijk}$ is an N ⨉ M ⨉ K tensor; G is $L_1$ ⨉ $L_2$ ⨉ $L_3$.
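A sketch of how the product tensor can be formed from two matrices that share the sample index j. The matrices and sizes are synthetic placeholders, and the names `expr` and `meth` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 5, 4, 3                  # genes, samples, methylation regions (toy sizes)
expr = rng.normal(size=(N, M))     # x_ij: expression of gene i in sample j
meth = rng.normal(size=(K, M))     # x_kj: methylation of region k in sample j

# x_ijk = x_ij * x_kj: j is kept as a tensor index, not summed over.
x = np.einsum("ij,kj->ijk", expr, meth)
```

The result has shape (N, M, K), matching the index order i, j, k used in the decomposition formula.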

8 Interpretation
For some specific $l_2$, $u_{l_2 j}$ (j: samples) separates healthy controls from patients.

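9 For gene expression: with $l_2$ fixed, take the specific $l_1$, $l_3$ with max $|G(l_1 l_2 l_3)|$.
$u_{l_1 i}$ (i: genes) then identifies DEGs (Differentially Expressed Genes). If $G(l_1 l_2 l_3) > 0$: on one side, DEGs with healthy controls < patients; on the other side, DEGs with healthy controls > patients.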

10 For methylation: $u_{l_3 k}$ (k: regions) identifies DMRs (Differentially Methylated Regions): on one side, DMRs with healthy controls < patients; on the other side, DMRs with healthy controls > patients.

11 Application example No.1
“Multiomics Data Analysis Using Tensor Decomposition Based Unsupervised Feature Extraction –Comparison with DIABLO–”, Y-h. Taguchi, in De-Shuang Huang, Vitoantonio Bevilacqua, Prashan Premaratne (Eds.), Intelligent Computing Theories and Application, 15th International Conference, ICIC 2019, Nanchang, China, August 3–6, 2019, Proceedings, Part I, pp. 565-574.
https://doi.org/10.1007/978-3-030-26763-6_54
Preprint: https://doi.org/10.1101/591867

12 $mRNA: 150 samples ⨉ 200 mRNAs
$miRNA: 150 samples ⨉ 184 miRNAs
$proteomics: 150 samples ⨉ 142 proteins
Three cell lines: Basal 45, Her2 30, LumA 75
Taken from the mixOmics package in Bioconductor: https://bioconductor.org/packages/release/bioc/html/mixOmics.html

13 $x_{ij}$: expression of ith mRNA of jth sample
$x_{kj}$: expression of kth miRNA of jth sample
$x_{pj}$: expression of pth protein of jth sample
Tensor: $x_{ikpj} = x_{ij} \cdot x_{kj} \cdot x_{pj}$
Apply tensor decomposition (the tensor version of singular value decomposition):
$$x_{ikpj} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} \sum_{l_4=1}^{L_4} G(l_1 l_2 l_3 l_4) u_{l_1 i} u_{l_2 k} u_{l_3 p} u_{l_4 j}$$
$u_{l_1 i}$: mRNA; $u_{l_2 k}$: miRNA; $u_{l_3 p}$: proteome; $u_{l_4 j}$: sample

14 Linear discriminant analysis with $u_{1j}$ and $u_{4j}$, leave-one-out cross-validation. Error: 6.5%

                  Real
Predicted    Basal   Her2   LumA
Basal           42      4      0
Her2             2     25      2
LumA             1      1     73

15 Descending order of $|G(l_1, l_2, l_3, l_4)|$ with $l_4 = 1, 4$
1 ≦ $l_1$ ≦ 2: mRNA; 1 ≦ $l_2$ ≦ 2: miRNA; 1 ≦ $l_3$ ≦ 4: proteome

16 Selecting the 10 top-ranked mRNAs, miRNAs and proteins based upon the squared sum of singular value vectors
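One way to read this ranking, sketched in NumPy: each feature is scored by the sum over the selected l of its squared singular value vector components, and the ten largest scores are kept. That reading of "squared sum", the synthetic vectors and the choice of two components are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=(2, 200))    # synthetic u[l, i]: 2 selected l, 200 mRNAs

# Score of feature i: sum over selected l of u_li squared
score = (u ** 2).sum(axis=0)

# Indices of the 10 features with the largest scores, best first
top10 = np.argsort(score)[::-1][:10]
```

The same scoring would be applied separately to the miRNA and protein singular value vectors.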

17 Discrimination performances using selected features
[Figure: panels for mRNA, miRNA and protein; classes Basal, Her2, LumA]

18 Discrimination performances using generated features: comparisons with DIABLO implemented in mixOmics
[Figure: errors (axis ticks 0.05, 0.10, 0.15) against number of components generated]

19 Discrimination performances using selected features

20 Pros and cons of TD based unsupervised FE
Pros:
Fast (because of no optimization)
Robust (independent of label information)
Unsupervised (no need to construct a model in advance)
Cons:
No remedy if it does not work
Needs more memory: 150 ⨉ (200+184+142) vs 150 ⨉ 200 ⨉ 184 ⨉ 142
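The memory comparison can be worked out directly. The sketch below counts entries for the three separate matrices and for the four-mode tensor; the 8 bytes per entry (float64 storage) is an assumption, not a figure from the talk.

```python
n_samples, n_mrna, n_mirna, n_protein = 150, 200, 184, 142

# Three separate matrices sharing the sample axis
matrix_entries = n_samples * (n_mrna + n_mirna + n_protein)

# One four-mode tensor x_ikpj
tensor_entries = n_samples * n_mrna * n_mirna * n_protein

bytes_per_entry = 8  # assumption: float64 storage
tensor_gb = tensor_entries * bytes_per_entry / 1e9
ratio = tensor_entries / matrix_entries
```

Under that assumption the tensor has 783,840,000 entries (about 6.3 GB) against 78,900 entries for the matrices, roughly a 10,000-fold increase.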

21 Application example No.2
“Tensor Decomposition-Based Unsupervised Feature Extraction Applied to Single-Cell Gene Expression Analysis”, Y-h. Taguchi and Turki Turki, Frontiers in Genetics, Volume 10, Article 864, 2019. doi: 10.3389/fgene.2019.00864

22 Data set: GSE76381, scRNA-seq of human and mouse midbrain development
Human: $x_{ij} \in \mathbb{R}^{19531 \times 1977}$; Mouse: $x_{ik} \in \mathbb{R}^{24378 \times 1907}$
i: genes; j, k: cells
Purpose of the analysis: selection of genes associated with midbrain development common to human and mouse

23 Cell numbers and time points
Human: 6w: 287 cells, 7w: 131 cells, 8w: 331 cells, 9w: 322 cells, 10w: 509 cells, 11w: 397 cells; in total, 1977 cells (w: week)
Mouse: E11.5: 349 cells, E12.5: 350 cells, E13.5: 345 cells, E14.5: 308 cells, E15.5: 356 cells, E18.5: 142 cells, unknown: 57 cells; in total, 1907 cells

24 Tensor decomposition: the tensor is generated from the product of cells using the 13,384 genes common to human and mouse
$$x_{ijk} = x_{ij} \times x_{ik} \in \mathbb{R}^{13384 \times 1977 \times 1907}$$
i: genes; j, k: cells
Size reduction is needed because the tensor is too large:
$$x_{jk} = \sum_i x_{ijk}$$
$x_{jk}$ is decomposed by singular value decomposition.
$v_{lj}$: lth human cell singular value vector; $v_{lk}$: lth mouse cell singular value vector
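Because $x_{jk} = \sum_i x_{ij} x_{ik}$, the reduced matrix is a plain matrix product and the full tensor (about 5 ⨉ 10^10 entries at the sizes on this slide) never has to be built. A sketch with synthetic data and toy sizes, both assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_human, n_mouse = 30, 7, 5              # toy sizes
x_human = rng.normal(size=(n_genes, n_human))     # x_ij
x_mouse = rng.normal(size=(n_genes, n_mouse))     # x_ik

# x_jk = sum_i x_ij * x_ik, computed without forming x_ijk
x_jk = x_human.T @ x_mouse

U, s, Vt = np.linalg.svd(x_jk, full_matrices=False)
v_human = U.T    # v_human[l, j]: l-th human cell singular value vector
v_mouse = Vt     # v_mouse[l, k]: l-th mouse cell singular value vector
```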

25 Categorical regression (ANOVA):
$$v_{lj} = a_l + \sum_t b_{lt} \delta_{jt}$$
$$v_{lk} = a'_l + \sum_t b'_{lt} \delta_{kt}$$
$\delta_{jt}$, $\delta_{kt}$: 1 when cell j (k) is measured at time t, 0 otherwise
$v_{lj}$ and $v_{lk}$ with any kind of time dependence are selected.
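Regressing a vector on time-point indicators is equivalent to one-way ANOVA with time point as the group. The sketch below computes only the F statistic; the P-value would follow from the F distribution with the matching degrees of freedom. The helper `anova_f` and the synthetic vectors are illustrative, not the author's code.

```python
import numpy as np

def anova_f(values, groups):
    """One-way ANOVA F statistic for `values` grouped by the labels in `groups`."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    grand_mean = values.mean()
    # Variation of the group means around the grand mean
    ss_between = sum(
        (groups == g).sum() * (values[groups == g].mean() - grand_mean) ** 2
        for g in labels
    )
    # Variation of the values around their own group mean
    ss_within = sum(
        ((values[groups == g] - values[groups == g].mean()) ** 2).sum()
        for g in labels
    )
    df_between = len(labels) - 1
    df_within = len(values) - len(labels)
    return (ss_between / df_between) / (ss_within / df_within)

rng = np.random.default_rng(0)
t_label = np.repeat(np.arange(3), 20)   # 3 time points, 20 cells each
v_time = np.array([0.0, 1.0, 2.0])[t_label] + 0.1 * rng.normal(size=t_label.size)
v_flat = rng.normal(size=t_label.size)  # no time dependence

f_time = anova_f(v_time, t_label)
f_flat = anova_f(v_flat, t_label)
```

A vector that tracks the time points yields a much larger F than one that does not, which is the basis for keeping it.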

26 How common are the selected singular value vectors?
Singular value vectors associated with adjusted P-values less than 0.01 are selected.
[Venn diagram: human and mouse; region counts 12, 23, 32]

27 $u_{li}$ are generated from $v_{lj}$ and $v_{lk}$
$$u_{li}^{(j)} = \sum_j v_{lj} x_{ij}$$
$$u_{li}^{(k)} = \sum_k v_{lk} x_{ik}$$
$u_{li}^{(j)}$: lth human gene singular value vector; $u_{li}^{(k)}$: lth mouse gene singular value vector
P-values are attributed to gene singular value vectors by the χ2 distribution and corrected by the BH criterion; genes associated with adjusted P-values less than 0.01 are selected.

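28 P-values by χ2 distribution:
$$P_i = P_{\chi^2}\left[ > \sum_l \left( \frac{u_{li}}{\sigma} \right)^2 \right]$$
Genes with Benjamini-Hochberg corrected P < 0.01 are selected.
[Figure: P(p) plotted against 1-p, from 0 to 1]
[Venn diagram of selected genes: Human and Mouse; region counts 151, 200, 305]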
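A sketch of the gene selection step under stated assumptions: σ is taken as the standard deviation of each singular value vector, the number of summed l gives the degrees of freedom, and the data are synthetic with five planted outlier genes. The helpers `chi2_sf` and `benjamini_hochberg` are written here for illustration; the slides do not show the author's code.

```python
import math
import numpy as np

def chi2_sf(x, df):
    """P[chi-squared with integer df >= 1 exceeds x], using only the stdlib."""
    half = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)              # Q(1, x/2)
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))   # Q(1/2, x/2)
    while a < df / 2.0:
        # Recurrence: Q(a + 1, y) = Q(a, y) + y^a e^{-y} / Gamma(a + 1)
        if half > 0:
            q += math.exp(a * math.log(half) - half - math.lgamma(a + 1.0))
        a += 1.0
    return q

def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest P-value
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

rng = np.random.default_rng(0)
u = rng.normal(size=(2, 1000))   # synthetic u[l, i] for two selected l
u[:, :5] = 10.0                  # five planted outlier genes

sigma = u.std(axis=1, keepdims=True)     # assumption: per-vector standard deviation
stat = ((u / sigma) ** 2).sum(axis=0)    # sum_l (u_li / sigma)^2
p = np.array([chi2_sf(s, df=u.shape[0]) for s in stat])
selected = benjamini_hochberg(p) < 0.01
```

Genes whose components are far from the bulk of the distribution get very small P-values and survive the correction.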

29 Validation: uploaded to Enrichr (enrichment server), “Allen Brain Atlas”, top-ranked five terms
For both human and mouse, four out of the top five are related to the hypothalamus, which belongs to the midbrain.

30 Summary
We can select biologically reasonable genes with unsupervised methods using TD for multi-omics data analysis as well as RNA-seq data analysis.
I have published a monograph from Springer. I would be happy if you could buy it, although it is extremely expensive.
