Slide 1

Slide 1 text

Tensor decomposition based unsupervised feature extraction applied to bioinformatics
Y-h. Taguchi
Department of Physics, Chuo University, Tokyo 112-8551, Japan.
This presentation is available

Slide 2

Slide 2 text

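Singular value decomposition

$x_{ij} \simeq \sum_{l=1}^{L} u_{li} \lambda_l v_{lj}$

In matrix form, the N × M matrix $x_{ij}$ is approximated by the product of $(u_{li})^T$ (N × L), the diagonal matrix of $\lambda_l$ (L × L), and $v_{lj}$ (L × M).

Example: N: number of genes (i); M: number of samples (j); $x_{ij}$: gene expression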
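
The decomposition above can be tried on a small synthetic matrix. This is a minimal sketch assuming NumPy; the sizes, the random data, and the variable names are illustrative and not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 100, 12                 # hypothetical: 100 genes, 12 samples
x = rng.normal(size=(N, M))    # synthetic stand-in for gene expression x_ij

# Thin SVD: x = U @ diag(lam) @ Vt, with U: N x L, Vt: L x M, L = min(N, M)
U, lam, Vt = np.linalg.svd(x, full_matrices=False)

# u[l, i] and v[l, j] follow the slide's index order (component index first)
u = U.T
v = Vt

# Rebuild x_ij = sum_l u_li * lam_l * v_lj
x_rebuilt = np.einsum("li,l,lj->ij", u, lam, v)
max_abs_error = np.abs(x - x_rebuilt).max()
```

Keeping all L = min(N, M) components reproduces the matrix up to rounding; truncating to fewer components gives the approximation written with ≃.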

Slide 3

Slide 3 text

Interpretation

j: samples (healthy controls, patients): $v_{lj}$
i: genes: $u_{li}$

For some specific $l$:
DEG (Differentially Expressed Genes): healthy controls < patients
DEG: healthy controls > patients

Slide 4

Slide 4 text

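Extension to tensor: HOSVD (Higher Order Singular Value Decomposition)

$x_{ijk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3) u_{l_1 i} u_{l_2 j} u_{l_3 k}$

The N × M × K tensor $x_{ijk}$ is approximated by the $L_1 \times L_2 \times L_3$ core tensor $G$ multiplied by $u_{l_1 i}$, $u_{l_2 j}$ and $u_{l_3 k}$.

Example: N: number of genes (i); M: number of samples (j); K: number of tissues (k); $x_{ijk}$: gene expression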
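
HOSVD can be sketched by taking the SVD of each mode unfolding and projecting the tensor onto the resulting bases. The code below is one common way to compute it, assuming NumPy; the helper names `unfold` and `hosvd` and the toy sizes are illustrative, not the implementation used for the results in this talk.

```python
import numpy as np

def unfold(t, mode):
    """Matricize tensor t so that `mode` becomes the row index."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def hosvd(t):
    """Return the core tensor G and one factor matrix per mode.

    Column l of factors[m] is the l-th singular value vector of mode m.
    """
    factors = []
    for mode in range(t.ndim):
        U, _, _ = np.linalg.svd(unfold(t, mode), full_matrices=False)
        factors.append(U)
    # Core: multiply every mode of t by the transpose of its factor matrix
    core = t
    for mode, U in enumerate(factors):
        core = np.moveaxis(np.tensordot(U.T, core, axes=(1, mode)), 0, mode)
    return core, factors

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4, 3))   # hypothetical genes x samples x tissues
G, (U1, U2, U3) = hosvd(x)

# x_ijk = sum G(l1 l2 l3) u_{l1 i} u_{l2 j} u_{l3 k}
x_rebuilt = np.einsum("abc,ia,jb,kc->ijk", G, U1, U2, U3)
```

Here `U1[i, l1]` plays the role of $u_{l_1 i}$, and likewise for the other modes.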

Slide 5

Slide 5 text

Interpretation

j: samples (healthy controls, patients): $u_{l_2 j}$, for some specific $l_2$
k: tissues (tissue specific expression): $u_{l_3 k}$, for some specific $l_3$

Slide 6

Slide 6 text

i: genes: $u_{l_1 i}$, for some specific $l_1$ with max $|G(l_1 l_2 l_3)|$ ($l_2$, $l_3$ fixed)

tDEG: tissue specific Differentially Expressed Genes

If $G(l_1 l_2 l_3) > 0$:
tDEG: healthy controls < patients
tDEG: healthy controls > patients
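
Once $l_2$ and $l_3$ are fixed, picking $l_1$ is a lookup in the core tensor. A minimal sketch, assuming NumPy and a random stand-in for $G$; the chosen indices are hypothetical and 0-based.

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.normal(size=(5, 4, 3))   # hypothetical core tensor G(l1, l2, l3)
l2, l3 = 1, 0                    # hypothetical components fixed by inspection (0-based)

# l1 with the largest |G(l1, l2, l3)| for the fixed l2 and l3
l1 = int(np.argmax(np.abs(G[:, l2, l3])))

# Sign of the selected core element, used when reading the direction of the genes
sign = np.sign(G[l1, l2, l3])
```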

Slide 7

Slide 7 text

Integrated analysis of multiple matrices and/or tensors

$x_{ij}$: expression of gene i of sample j
$x_{kj}$: methylation of region k of sample j

$x_{ijk} \equiv x_{ij} \times x_{kj}$

$x_{ijk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3) u_{l_1 i} u_{l_2 j} u_{l_3 k}$

Here $x_{ijk}$ is an N × M × K tensor and $G$ is an $L_1 \times L_2 \times L_3$ core tensor.
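
The product tensor shares the sample index j between the two matrices. A minimal sketch of building it, assuming NumPy; the sizes and array names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, K = 8, 5, 6                   # hypothetical: genes, samples, regions
x_expr = rng.normal(size=(N, M))    # x_ij: expression of gene i in sample j
x_meth = rng.normal(size=(K, M))    # x_kj: methylation of region k in sample j

# x_ijk = x_ij * x_kj, with the shared sample index j kept as the second mode
x_tensor = np.einsum("ij,kj->ijk", x_expr, x_meth)
```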

Slide 8

Slide 8 text

Interpretation

j: samples (healthy controls, patients): $u_{l_2 j}$, for some specific $l_2$

Slide 9

Slide 9 text

For gene expression

i: genes: $u_{l_1 i}$
For some specific $l_1$, $l_3$ with max $|G(l_1 l_2 l_3)|$ ($l_2$ fixed)

DEG: Differentially Expressed Genes

If $G(l_1 l_2 l_3) > 0$:
DEG: healthy controls < patients
DEG: healthy controls > patients

Slide 10

Slide 10 text

For methylation

k: regions: $u_{l_3 k}$

DMR: Differentially Methylated Regions
DMR: healthy controls < patients
DMR: healthy controls > patients

Slide 11

Slide 11 text

Application example No.1

“Multiomics Data Analysis Using Tensor Decomposition Based Unsupervised Feature Extraction –Comparison with DIABLO–”
Y-h. Taguchi, in De-Shuang Huang, Vitoantonio Bevilacqua, Prashan Premaratne (Eds.), Intelligent Computing Theories and Application, 15th International Conference, ICIC 2019, Nanchang, China, August 3–6, 2019, Proceedings, Part I, pp. 565-574
https://doi.org/10.1007/978-3-030-26763-6_54
Preprint: https://doi.org/10.1101/591867

Slide 12

Slide 12 text

Data set
mRNA: 150 samples × 200 mRNAs
miRNA: 150 samples × 184 miRNAs
proteomics: 150 samples × 142 proteins

Three cell lines
Basal: 45, Her2: 30, LumA: 75

Taken from mixOmics package in Bioconductor
https://bioconductor.org/packages/release/bioc/html/mixOmics.html

Slide 13

Slide 13 text

$x_{ij}$: expression of ith mRNA of jth sample
$x_{kj}$: expression of kth miRNA of jth sample
$x_{pj}$: expression of pth protein of jth sample

Tensor: $x_{ikpj} = x_{ij} \cdot x_{kj} \cdot x_{pj}$

Apply tensor decomposition (tensor version of singular value decomposition):

$x_{ikpj} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} \sum_{l_4=1}^{L_4} G(l_1 l_2 l_3 l_4) u_{l_1 i} u_{l_2 k} u_{l_3 p} u_{l_4 j}$

$u_{l_1 i}$: mRNA, $u_{l_2 k}$: miRNA, $u_{l_3 p}$: proteome, $u_{l_4 j}$: sample

Slide 14

Slide 14 text

Sample singular value vectors: $u_{1j}$, $u_{4j}$
Linear discriminant analysis, Leave One Out Cross Validation

Predicted \ Real   Basal   Her2   LumA
Basal                 42      4      0
Her2                   2     25      2
LumA                   1      1     73

Error 6.5%

Slide 15

Slide 15 text

Descending order of $|G(l_1, l_2, l_3, l_4)|$ with $l_4 = 1, 4$
$1 \leq l_1 \leq 2$: mRNA
$1 \leq l_2 \leq 2$: miRNA
$1 \leq l_3 \leq 4$: proteome

Slide 16

Slide 16 text

Selecting the 10 top ranked mRNAs, miRNAs and proteins based upon the squared sum of singular value vectors
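
The ranking step can be sketched as follows, assuming NumPy. The function name `top_features`, the synthetic vectors, and the planted feature are hypothetical; only the scoring rule (sum of squared singular value vectors over the selected components) comes from the slide.

```python
import numpy as np

def top_features(u, components, n_top=10):
    """Rank features by the sum of squared singular value vectors.

    u: array of shape (L, n_features), u[l, i] = u_{l i}
    components: 0-based indices l of the selected singular value vectors
    """
    score = (u[list(components), :] ** 2).sum(axis=0)
    order = np.argsort(score)[::-1]      # indices sorted by descending score
    return order[:n_top], score

rng = np.random.default_rng(3)
u_mrna = rng.normal(size=(4, 200))       # synthetic stand-in for u_{l1 i}
u_mrna[0, 17] = 10.0                     # plant one clearly dominant feature
selected, score = top_features(u_mrna, components=[0, 1])
```

The same function would be applied separately to the mRNA, miRNA and protein vectors, each with its own selected components.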

Slide 17

Slide 17 text

Discrimination performances using selected features
(Figure: Basal, Her2 and LumA; mRNA, miRNA and protein)

Slide 18

Slide 18 text

Comparisons with DIABLO implemented in mixOmics
Discrimination performances using generated features
(Figure: errors versus number of components generated)

Slide 19

Slide 19 text

Discrimination performances using selected features

Slide 20

Slide 20 text

Pros and cons of TD based unsupervised FE

Pros:
Fast (because of no optimization)
Robust (independent of label information)
Unsupervised (no need to construct a model in advance)

Cons:
No remedy if it does not work
Needs more memory: 150 × (200 + 184 + 142) vs 150 × 200 × 184 × 142
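
The memory comparison is plain arithmetic: about 7.9 × 10^4 matrix elements against about 7.8 × 10^8 tensor elements. The sketch below works it out; the 8-byte element size is an assumption (double precision storage), not a figure from the slides.

```python
n_samples, n_mrna, n_mirna, n_protein = 150, 200, 184, 142

# Elements held in the three separate matrices
matrix_elements = n_samples * (n_mrna + n_mirna + n_protein)

# Elements held in the four-mode tensor x_ikpj
tensor_elements = n_samples * n_mrna * n_mirna * n_protein

ratio = tensor_elements / matrix_elements
bytes_float64 = tensor_elements * 8   # assumption: 8 bytes per element
```

Under that assumption the tensor alone takes roughly 6.3 GB, close to ten thousand times the size of the input matrices.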

Slide 21

Slide 21 text

Application example No.2

“Tensor Decomposition-Based Unsupervised Feature Extraction Applied to Single-Cell Gene Expression Analysis”
Y-h. Taguchi and Turki Turki, Frontiers in Genetics, Volume 10, Article 864, 2019.
doi: 10.3389/fgene.2019.00864

Slide 22

Slide 22 text

Data set: GSE76381, scRNA-seq of human and mouse midbrain development

H: $x_{ij} \in \mathbb{R}^{19531 \times 1977}$
Mouse: $x_{ik} \in \mathbb{R}^{24378 \times 1907}$
i: genes; j, k: cells

Purpose of the analysis: selection of genes associated with midbrain development that are common to human and mouse

Slide 23

Slide 23 text

Cell numbers and time points

H: 6w: 287 cells, 7w: 131 cells, 8w: 331 cells, 9w: 322 cells, 10w: 509 cells, 11w: 397 cells; in total, 1977 cells (w: week)

Mouse: E11.5: 349 cells, E12.5: 350 cells, E13.5: 345 cells, E14.5: 308 cells, E15.5: 356 cells, E18.5: 142 cells, unknown: 57 cells; in total, 1907 cells

Slide 24

Slide 24 text

Tensor decomposition: the tensor is generated from the product of cells using 13,384 common genes between human and mouse

$x_{ijk} = x_{ij} \times x_{ik} \in \mathbb{R}^{13384 \times 1977 \times 1907}$
i: genes; j, k: cells

Size reduction is needed because the tensor is too large:

$x_{jk} = \sum_i x_{ijk}$

$x_{jk}$ is decomposed by singular value decomposition.
$v_{lj}$: lth human cell singular value vectors
$v_{lk}$: lth mouse cell singular value vectors
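
Because $x_{jk} = \sum_i x_{ij} x_{ik}$, the reduced matrix is an ordinary matrix product and the full tensor never has to be stored. A minimal sketch assuming NumPy, with toy sizes and random data standing in for the real expression matrices:

```python
import numpy as np

rng = np.random.default_rng(4)
n_genes, n_human, n_mouse = 50, 7, 6             # hypothetical small sizes
x_human = rng.normal(size=(n_genes, n_human))    # x_ij over the common genes
x_mouse = rng.normal(size=(n_genes, n_mouse))    # x_ik over the same genes

# x_jk = sum_i x_ij * x_ik, computed as a matrix product (no 3-mode tensor needed)
x_jk = x_human.T @ x_mouse

# SVD of x_jk: rows of v_human are v_lj, rows of v_mouse are v_lk
U, lam, Vt = np.linalg.svd(x_jk, full_matrices=False)
v_human = U.T
v_mouse = Vt

# Cross-check against the explicit tensor, feasible only at this toy size
x_ijk = np.einsum("ij,ik->ijk", x_human, x_mouse)
```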

Slide 25

Slide 25 text

$v_{lj} = a_l + \sum_t b_{lt} \delta_{jt}$
$v_{lk} = a'_l + \sum_t b'_{lt} \delta_{kt}$

$\delta_{jt}$, $\delta_{kt}$: 1 when cell j (k) is measured at time point t, 0 otherwise

$v_{lj}$ and $v_{lk}$ with any kind of time dependence are selected with categorical regression (ANOVA).
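
Regression on time point indicators is equivalent to one-way ANOVA with time point as the grouping factor. The sketch below computes the F statistic only, assuming NumPy; the function name `anova_f` and the synthetic vectors are illustrative, and turning F into a P-value additionally needs the F distribution, which is left out here.

```python
import numpy as np

def anova_f(values, labels):
    """One-way ANOVA F statistic of `values` grouped by `labels`."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    groups = [values[labels == g] for g in np.unique(labels)]
    grand_mean = values.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

rng = np.random.default_rng(5)
time_point = np.repeat(np.arange(6), 30)   # hypothetical: 6 time points, 30 cells each
v_flat = rng.normal(size=time_point.size)                      # no time dependence
v_timed = 0.5 * time_point + rng.normal(size=time_point.size)  # depends on time
f_flat = anova_f(v_flat, time_point)
f_timed = anova_f(v_timed, time_point)
```

A singular value vector that varies across time points gives a much larger F than one that does not, which is what the selection relies on.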

Slide 26

Slide 26 text

How common are the selected singular value vectors?

Singular value vectors associated with adjusted P-values less than 0.01 are selected.
(Venn diagram of human and mouse: 12, 23, 32)

Slide 27

Slide 27 text

$u_{li}$ are generated from $v_{lj}$ and $v_{lk}$:

$u_{li}^{(j)} = \sum_j v_{lj} x_{ij}$: lth human gene singular value vectors
$u_{li}^{(k)} = \sum_k v_{lk} x_{ik}$: lth mouse gene singular value vectors

P-values are attributed to gene singular value vectors by the χ² distribution and corrected by the BH criterion; genes associated with adjusted P-values less than 0.01 are selected.
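
The projection from cell to gene singular value vectors is a single matrix product. A minimal sketch assuming NumPy, with random stand-ins for the expression matrix and the selected cell vectors; the same line applies to human ($v_{lj}$, $x_{ij}$) and mouse ($v_{lk}$, $x_{ik}$).

```python
import numpy as np

rng = np.random.default_rng(6)
n_genes, n_cells, L = 40, 9, 3               # hypothetical sizes
x = rng.normal(size=(n_genes, n_cells))      # x_ij (or x_ik for mouse)
v = rng.normal(size=(L, n_cells))            # selected v_lj (or v_lk)

# u_li = sum_j v_lj * x_ij, giving an array of shape (L, n_genes)
u = v @ x.T
```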

Slide 28

Slide 28 text

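P-values by χ² distribution:

$P_i = P_{\chi^2}\left[ > \sum_l \left( \frac{u_{li}}{\sigma} \right)^2 \right]$

Benjamini-Hochberg corrected P < 0.01
(Figure: histogram P(p) of 1 - p, from 0 to 1)

Selected genes
(Venn diagram of human and mouse: 151, 200, 305)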
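
The Benjamini-Hochberg step can be sketched as below, assuming NumPy and assuming the raw P-values have already been computed from the χ² distribution. The function name `bh_adjust` and the example P-values are hypothetical.

```python
import numpy as np

def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest P-value downwards, then cap at 1
    ranked = np.minimum(np.minimum.accumulate(ranked[::-1])[::-1], 1.0)
    adjusted = np.empty(n)
    adjusted[order] = ranked
    return adjusted

p_values = np.array([0.0001, 0.004, 0.03, 0.5])   # hypothetical raw P-values
adjusted = bh_adjust(p_values)
selected = np.flatnonzero(adjusted < 0.01)        # indices passing the 0.01 threshold
```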

Slide 29

Slide 29 text

Validation: uploaded to Enrichr (enrichment server)

“Allen Brain Atlas”, top ranked five terms

For both human and mouse, four out of the top five terms are related to hypothalamus, which belongs to midbrain.

Slide 30

Slide 30 text

Summary

We can select biologically reasonable genes with unsupervised methods using TD for multi-omics data analysis as well as RNA-seq data analysis.

I have published a monograph with Springer. I would be happy if you could buy it, although it is extremely expensive.