Tensor decomposition based unsupervised feature extraction applied to bioinformatics

Y-h. Taguchi
October 30, 2019

Presentation at PreMed'19
http://sharmalab.bwh.harvard.edu/premed19/
31st Oct 2019, OIST


Transcript

  1. Tensor decomposition based unsupervised feature extraction applied to bioinformatics

    Y-h. Taguchi, Department of Physics, Chuo University, Tokyo 112-8551, Japan. This presentation is available
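  2. Singular value decomposition

    $x_{ij} \simeq \sum_{l=1}^{L} u_{li} \lambda_l v_{lj}$

    The $N \times M$ matrix $x_{ij}$ is approximated by the product of $(u_{li})^T$ ($N \times L$), the diagonal matrix of $\lambda_l$ ($L \times L$) and $v_{lj}$ ($L \times M$).

    Example: $x_{ij}$ is gene expression, with $N$ the number of genes ($i$) and $M$ the number of samples ($j$).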
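The rank-$L$ approximation on this slide can be sketched with NumPy. This is a minimal illustration on synthetic data, not the analysis from the talk: the matrix sizes, the random input and the variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, L = 100, 12, 3            # genes, samples, retained components (toy sizes)
x = rng.normal(size=(N, M))     # synthetic stand-in for gene expression x_ij

# Thin SVD: x = U @ diag(lam) @ Vt, singular values in descending order.
# U[:, l] plays the role of u_li and Vt[l, :] the role of v_lj.
U, lam, Vt = np.linalg.svd(x, full_matrices=False)

# Rank-L approximation: x_ij ~ sum_l u_li * lam_l * v_lj
x_approx = (U[:, :L] * lam[:L]) @ Vt[:L, :]

# Relative Frobenius error of the truncated reconstruction.
rel_err = np.linalg.norm(x - x_approx) / np.linalg.norm(x)
```

With random input the error stays large; real expression data with a few dominant patterns would be expected to compress much better.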
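  3. Interpretation

    For some specific $l$, the sample singular value vector $v_{lj}$ ($j$: samples) distinguishes healthy controls from patients. The corresponding gene singular value vector $u_{li}$ ($i$: genes) then identifies differentially expressed genes (DEGs): genes at one extreme of $u_{li}$ are DEGs with healthy controls < patients, and genes at the other extreme are DEGs with healthy controls > patients.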
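  4. Extension to tensor: HOSVD (Higher Order Singular Value Decomposition)

    $x_{ijk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3) u_{l_1 i} u_{l_2 j} u_{l_3 k}$

    The $N \times M \times K$ tensor $x_{ijk}$ is approximated by a core tensor $G$ of size $L_1 \times L_2 \times L_3$ and the singular value vectors $u_{l_1 i}$, $u_{l_2 j}$ and $u_{l_3 k}$.

    Example: $x_{ijk}$ is gene expression, with $N$ the number of genes ($i$), $M$ the number of samples ($j$) and $K$ the number of tissues ($k$).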
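A compact way to see what HOSVD computes is to take the SVD of each mode unfolding and then project the tensor onto the resulting bases. The sketch below does this with NumPy on a small random tensor; the helper names `unfold` and `hosvd` are hypothetical, and no truncation is applied, so the reconstruction is exact.

```python
import numpy as np

def unfold(t, mode):
    """Mode-n unfolding: bring axis `mode` to the front, flatten the rest."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def hosvd(t):
    """Return the core tensor G and one factor matrix per mode."""
    factors = []
    for mode in range(t.ndim):
        u, _, _ = np.linalg.svd(unfold(t, mode), full_matrices=False)
        factors.append(u)                      # columns are singular value vectors
    core = t
    for mode, u in enumerate(factors):
        # Multiply mode `mode` of the tensor by u^T.
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, mode)), 0, mode)
    return core, factors

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4, 3))                 # toy x_ijk: genes x samples x tissues
G, (u1, u2, u3) = hosvd(x)

# x_ijk = sum_{l1,l2,l3} G(l1,l2,l3) u_{l1 i} u_{l2 j} u_{l3 k}
x_rebuilt = np.einsum("abc,ia,jb,kc->ijk", G, u1, u2, u3)

# Index (l1, l2, l3) of the core element with the largest absolute value.
l_max = np.unravel_index(np.argmax(np.abs(G)), G.shape)
```

Here `u1[i, l1]` corresponds to $u_{l_1 i}$, and similarly for the other modes.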
  5. Interpretation

    For some specific $l_2$, $u_{l_2 j}$ ($j$: samples) distinguishes healthy controls from patients. For some specific $l_3$, $u_{l_3 k}$ ($k$: tissues) shows tissue-specific expression.
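  6. tDEG: tissue-specific differentially expressed genes

    With $l_2$ and $l_3$ fixed, take the specific $l_1$ with the maximum $|G(l_1 l_2 l_3)|$. The gene singular value vector $u_{l_1 i}$ ($i$: genes) then identifies tDEGs. If $G(l_1 l_2 l_3) > 0$, genes at one extreme of $u_{l_1 i}$ are tDEGs with healthy controls < patients and genes at the other extreme are tDEGs with healthy controls > patients.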
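  7. Integrated analysis of multiple matrices and/or tensors

    $x_{ij}$: expression of gene $i$ of sample $j$. $x_{kj}$: methylation of region $k$ of sample $j$.

    $x_{ijk} \equiv x_{ij} \times x_{kj}$

    $x_{ijk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3) u_{l_1 i} u_{l_2 j} u_{l_3 k}$

    As before, the $N \times M \times K$ tensor $x_{ijk}$ is approximated by a core tensor $G$ of size $L_1 \times L_2 \times L_3$ and the singular value vectors $u_{l_1 i}$, $u_{l_2 j}$ and $u_{l_3 k}$.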
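Building the product tensor from two matrices that share the sample index is a one-line operation. The following is a toy sketch with synthetic data; the array names `expr` and `meth` and the sizes are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 5, 4, 3                    # genes, samples, regions (toy sizes)
expr = rng.normal(size=(N, M))       # x_ij: expression of gene i of sample j
meth = rng.normal(size=(K, M))       # x_kj: methylation of region k of sample j

# x_ijk = x_ij * x_kj. The shared sample index j is kept, not summed over,
# so the result has shape (N, M, K) to match the HOSVD index order.
x = np.einsum("ij,kj->ijk", expr, meth)
```

The resulting array can be passed to any HOSVD routine exactly like a tensor that was measured directly.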
  8. Interpretation

    For some specific $l_2$, $u_{l_2 j}$ ($j$: samples) distinguishes healthy controls from patients.
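  9. DEG: differentially expressed genes (for gene expression)

    With $l_2$ fixed, take the specific $l_1$, $l_3$ with the maximum $|G(l_1 l_2 l_3)|$. The gene singular value vector $u_{l_1 i}$ ($i$: genes) then identifies DEGs. If $G(l_1 l_2 l_3) > 0$, genes at one extreme of $u_{l_1 i}$ are DEGs with healthy controls < patients and genes at the other extreme are DEGs with healthy controls > patients.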
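  10. DMR: differentially methylated regions (for methylation)

    The region singular value vector $u_{l_3 k}$ ($k$: regions) identifies DMRs: regions at one extreme of $u_{l_3 k}$ are DMRs with healthy controls < patients and regions at the other extreme are DMRs with healthy controls > patients.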
  11. Application example No.1

    “Multiomics Data Analysis Using Tensor Decomposition Based Unsupervised Feature Extraction –Comparison with DIABLO–”, Y-h. Taguchi, in De-Shuang Huang, Vitoantonio Bevilacqua, Prashan Premaratne (Eds.), Intelligent Computing Theories and Application, 15th International Conference, ICIC 2019, Nanchang, China, August 3–6, 2019, Proceedings, Part I, pp. 565-574. https://doi.org/10.1007/978-3-030-26763-6_54 Preprint: https://doi.org/10.1101/591867
  12. Data taken from the mixOmics package in Bioconductor (https://bioconductor.org/packages/release/bioc/html/mixOmics.html)

    $mRNA: 150 samples ⨉ 200 mRNAs
    $miRNA: 150 samples ⨉ 184 miRNAs
    $proteomics: 150 samples ⨉ 142 proteins

    Three cell lines: Basal 45, Her2 30, LumA 75
  13. $x_{ij}$: expression of the $i$th mRNA of the $j$th sample. $x_{kj}$: expression of the $k$th miRNA of the $j$th sample. $x_{pj}$: expression of the $p$th protein of the $j$th sample.

    Tensor: $x_{ikpj} = x_{ij} \cdot x_{kj} \cdot x_{pj}$

    Apply tensor decomposition (the tensor version of singular value decomposition):

    $x_{ikpj} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} \sum_{l_4=1}^{L_4} G(l_1 l_2 l_3 l_4) u_{l_1 i} u_{l_2 k} u_{l_3 p} u_{l_4 j}$

    $u_{l_1 i}$: mRNA, $u_{l_2 k}$: miRNA, $u_{l_3 p}$: proteome, $u_{l_4 j}$: sample
  14. Linear discriminant analysis with $u_{1j}$ and $u_{4j}$, leave-one-out cross-validation

                     Real
    Predicted    Basal  Her2  LumA
    Basal           42     4     0
    Her2             2    25     2
    LumA             1     1    73

    Error: 6.5%
  15. Descending order of $|G(l_1, l_2, l_3, l_4)|$ with $l_4 = 1, 4$

    Selected components: $1 \le l_1 \le 2$ (mRNA), $1 \le l_2 \le 2$ (miRNA), $1 \le l_3 \le 4$ (proteome)
  16. Selecting the 10 top-ranked mRNAs, miRNAs and proteins based upon the squared sum of singular value vectors
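Ranking features by the squared sum of their singular value vector entries can be sketched as follows. This is an illustrative reading of the slide, not the original code: the function name `top_features` is hypothetical, and the factor matrix is assumed to be stored with one row per feature and one column per component.

```python
import numpy as np

def top_features(u, components, k=10):
    """Rank features by sum_l u_li^2 over the chosen components.

    u          : array of shape (n_features, n_components), u[i, l] = u_li
    components : indices l of the components selected beforehand
    k          : number of top-ranked features to return
    """
    scores = np.sum(u[:, components] ** 2, axis=1)
    order = np.argsort(scores)[::-1]           # largest score first
    return order[:k], scores

# Toy factor matrix: features 5 and 11 load on the selected components,
# feature 3 loads only on a component that is not selected.
u = np.zeros((20, 3))
u[5, 0] = 2.0
u[11, 1] = 1.5
u[3, 2] = 9.0
top, scores = top_features(u, components=[0, 1], k=2)
```

In this toy case `top` contains features 5 and 11, while feature 3 is ignored because its large entry sits in an unselected component.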
  17. Discrimination performances using selected features (figure: Basal, Her2 and LumA samples for mRNA, miRNA and protein)
  18. Discrimination performances using generated features: comparisons with DIABLO implemented in mixOmics (figure: errors, with axis ticks from 0.05 to 0.15, against the number of components generated)
  19. Discrimination performances using selected features

  20. Pros and cons of TD based unsupervised FE

    Pros: fast (because of no optimization); robust (independent of label information); unsupervised (no need to construct a model in advance).

    Cons: no remedy if it does not work; needs more memory: 150 ⨉ (200+184+142) vs 150 ⨉ 200 ⨉ 184 ⨉ 142.
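The memory comparison on this slide is plain arithmetic and can be worked through directly. The element counts come from the slide; the 8 bytes per value is an assumption (double-precision storage).

```python
n_samples, n_mrna, n_mirna, n_protein = 150, 200, 184, 142

# Three separate matrices sharing the sample axis.
separate = n_samples * (n_mrna + n_mirna + n_protein)   # 78,900 values

# One four-mode product tensor x_ikpj.
tensor = n_samples * n_mrna * n_mirna * n_protein        # 783,840,000 values

bytes_per_value = 8                                      # assumption: float64
tensor_gb = tensor * bytes_per_value / 1e9               # about 6.27 GB
ratio = tensor / separate                                # roughly 10,000 times more
```

Under that assumption the full tensor needs several gigabytes, compared with well under a megabyte for the three matrices.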
  21. Application example No.2

    “Tensor Decomposition-Based Unsupervised Feature Extraction Applied to Single-Cell Gene Expression Analysis”, Y-h. Taguchi and Turki Turki, Frontiers in Genetics, Volume 10, Article 864, 2019. doi: 10.3389/fgene.2019.00864
  22. Data set: GSE76381, scRNA-seq of human and mouse mid brain development

    Human: $x_{ij} \in \mathbb{R}^{19531 \times 1977}$. Mouse: $x_{ik} \in \mathbb{R}^{24378 \times 1907}$. ($i$: genes; $j$, $k$: cells)

    Purpose of the analysis: selection of genes associated with mid brain development in both human and mouse.
  23. Cell numbers and time points

    Human: 6w: 287 cells, 7w: 131 cells, 8w: 331 cells, 9w: 322 cells, 10w: 509 cells, 11w: 397 cells; in total, 1977 cells (w: week).

    Mouse: E11.5: 349 cells, E12.5: 350 cells, E13.5: 345 cells, E14.5: 308 cells, E15.5: 356 cells, E18.5: 142 cells, unknown: 57 cells; in total, 1907 cells.
  24. Tensor decomposition: the tensor is generated from the product of cells, using 13,384 genes common to human and mouse

    $x_{ijk} = x_{ij} \times x_{ik} \in \mathbb{R}^{13384 \times 1977 \times 1907}$ ($i$: genes; $j$, $k$: cells)

    Size reduction is needed because the tensor is too large:

    $x_{jk} = \sum_i x_{ijk}$

    $x_{jk}$ is decomposed by singular value decomposition, which gives $v_{lj}$, the $l$th human cell singular value vectors, and $v_{lk}$, the $l$th mouse cell singular value vectors.
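Because $x_{jk} = \sum_i x_{ij} x_{ik}$ is just a matrix product, the reduced matrix can be computed without ever forming the three-mode tensor. The sketch below checks this on toy sizes with synthetic data; the array names and dimensions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_human, n_mouse = 50, 8, 6          # toy sizes, not the real 13384 x 1977 x 1907
x_h = rng.normal(size=(n_genes, n_human))     # x_ij: human cells
x_m = rng.normal(size=(n_genes, n_mouse))     # x_ik: mouse cells

# x_jk = sum_i x_ij x_ik, computed directly as a matrix product.
x_jk = x_h.T @ x_m

# For toy sizes the full tensor x_ijk can still be built to confirm the identity.
tensor = np.einsum("ij,ik->ijk", x_h, x_m)
matches = np.allclose(tensor.sum(axis=0), x_jk)

# SVD of the reduced matrix: V_h[:, l] is v_lj (human), V_m_t[l, :] is v_lk (mouse).
V_h, s, V_m_t = np.linalg.svd(x_jk, full_matrices=False)
```

At the real data sizes only `x_jk` (1977 by 1907) would need to be held in memory, which is the point of the reduction.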
  25. $v_{lj} = a_l + \sum_t b_{lt} \delta_{jt}$

    $v_{lk} = a'_l + \sum_t b'_{lt} \delta_{kt}$

    $\delta_{jt}$, $\delta_{kt}$: 1 when cell $j$ or $k$ is measured at time $t$, 0 otherwise. The $v_{lj}$ and $v_{lk}$ with any kind of time dependence are selected with categorical regression (ANOVA).
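Categorical regression on time point is equivalent to one-way ANOVA, whose test statistic compares between-group and within-group variance. The sketch below computes only that F statistic with NumPy; turning it into a P-value needs the F distribution, which is left out here. The function name `one_way_f` and the toy values are assumptions.

```python
import numpy as np

def one_way_f(values, groups):
    """One-way ANOVA F statistic for `values` grouped by the labels in `groups`."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    grand_mean = values.mean()
    ss_between = 0.0
    ss_within = 0.0
    for g in labels:
        member = values[groups == g]
        ss_between += member.size * (member.mean() - grand_mean) ** 2
        ss_within += np.sum((member - member.mean()) ** 2)
    df_between = labels.size - 1
    df_within = values.size - labels.size
    return (ss_between / df_between) / (ss_within / df_within)

# Toy singular value vector over six cells measured at two time points.
v_l = [1.0, 2.0, 3.0, 2.0, 3.0, 4.0]
time_point = [0, 0, 0, 1, 1, 1]
f_stat = one_way_f(v_l, time_point)
```

A large F statistic for a given $l$ indicates that $v_{lj}$ (or $v_{lk}$) varies with time point more than expected from the within-time-point scatter.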
  26. How many of the selected singular value vectors are common?

    Singular value vectors associated with adjusted P-values less than 0.01 are selected (figure: Venn diagram of human and mouse singular value vectors, with region counts 12, 23 and 32).
  27. $u_{li}$ are generated from $v_{lj}$ and $v_{lk}$

    $u_{li}^{(j)} = \sum_j v_{lj} x_{ij}$: $l$th human gene singular value vectors

    $u_{li}^{(k)} = \sum_k v_{lk} x_{ik}$: $l$th mouse gene singular value vectors

    P-values are attributed to gene singular value vectors by the $\chi^2$ distribution and corrected by the BH criterion; genes associated with adjusted P-values less than 0.01 are selected.
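The selection step can be sketched end to end: compute a $\chi^2$ upper-tail P-value per gene from its standardized loadings, adjust with Benjamini-Hochberg, and keep genes below 0.01. This is a hedged illustration on synthetic data. The helper names are hypothetical, $\sigma$ is assumed to be the standard deviation of each selected component, and the $\chi^2$ tail uses a simple recurrence that is adequate only for moderate arguments.

```python
import math
import numpy as np

def chi2_sf(x, dof):
    """Upper tail P[chi^2_dof > x] via the standard recurrence over dof."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if dof % 2 == 0:
        q, k = math.exp(-half), 2                   # closed form for dof = 2
    else:
        q, k = math.erfc(math.sqrt(half)), 1        # closed form for dof = 1
    while k < dof:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q

def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]   # enforce monotonicity
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

def select_genes(u, threshold=0.01):
    """u[i, l] = u_li for the selected components l; returns selected gene indices."""
    z2 = np.sum((u / u.std(axis=0)) ** 2, axis=1)   # assumption: sigma is per component
    p = np.array([chi2_sf(v, u.shape[1]) for v in z2])
    adjusted = benjamini_hochberg(p)
    return np.flatnonzero(adjusted < threshold), adjusted

rng = np.random.default_rng(0)
u = rng.normal(size=(1000, 2))      # synthetic gene singular value vectors
u[:5] += 20.0                       # five genes with outlying loadings
selected, adjusted = select_genes(u)
```

On this toy input only the five outlying genes pass the threshold, which mirrors the idea that selected genes are outliers relative to a Gaussian null for $u_{li}$.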
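  28. P-values by $\chi^2$ distribution

    $P_i = P_{\chi^2}\left[ > \sum_l \left( \frac{u_{li}}{\sigma} \right)^2 \right]$

    Genes with Benjamini-Hochberg corrected $P < 0.01$ are selected (figure: $P(p)$ against $1-p$ on the range 0 to 1).

    Selected genes (figure: Venn diagram of human and mouse genes, with region counts 151, 200 and 305).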
  29. Validation: selected genes were uploaded to Enrichr (enrichment server), category “Allen Brain Atlas”, top-ranked five terms

    For both human and mouse, four out of the top five terms are related to the hypothalamus, which belongs to the mid brain.
  30. Summary

    We can select biologically reasonable genes with unsupervised methods using TD, for multi-omics data analysis as well as RNA-seq data analysis. I have published a monograph with Springer. I would be happy if you could buy it, although it is extremely expensive.