Application of Tensor Decomposition based Unsupervised Feature Extraction to Single Cell RNA-seq Data Analysis

Keynote at IEEE ICBCB2020
http://www.icbcb.org
16th -18th May, 2020
Taiyuan, China

Y-h. Taguchi

May 18, 2020

Transcript

  1. Application of Tensor Decomposition based Unsupervised Feature Extraction to Single Cell RNA-seq Data Analysis

    Y-h. Taguchi, Department of Physics, Chuo University, Tokyo, Japan. IEEE ICBCB2020, 18th May, 2020, Taiyuan, China
  2. The method used in this presentation is fully described in my book, published by Springer International in Sep. 2019.

    I would be glad if the audience bought it and learned how to apply this method to their own research!
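  3. Singular value decomposition

    The N ⨉ M matrix x_ij is approximated by the product of an N ⨉ L matrix (u_li)^T, an L ⨉ L matrix of λ_l, and an L ⨉ M matrix v_lj:

    $x_{ij} \simeq \sum_{l=1}^{L} u_{li} \lambda_l v_{lj}$

    Example: N: number of genes (i), M: number of samples (j), x_ij: gene expression.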
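The rank-L approximation on this slide can be sketched with NumPy as follows. This is only an illustration on synthetic data: the matrix sizes, the random input, and the variable names are assumptions, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, L = 100, 6, 2            # genes, samples, retained components (hypothetical sizes)
x = rng.normal(size=(N, M))    # x[i, j]: expression of gene i in sample j (synthetic)

# NumPy returns U (N x r), the singular values (r,), and V^T (r x M), with r = min(N, M).
U, lam, Vt = np.linalg.svd(x, full_matrices=False)

# Rank-L approximation: x_ij ~ sum_{l<=L} u_li * lambda_l * v_lj
x_approx = (U[:, :L] * lam[:L]) @ Vt[:L, :]

# Keeping all components reproduces x up to rounding error.
x_full = (U * lam) @ Vt
```

Here `U[:, l]` plays the role of u_li (genes) and `Vt[l, :]` the role of v_lj (samples).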
  4. Interpretation

    For some specific l, v_lj (j: samples) distinguishes healthy controls from patients. The corresponding u_li (i: genes) then marks the DEGs (differentially expressed genes): those with healthy controls < patients and those with healthy controls > patients.
  5. Extension to tensors: HOSVD (Higher Order Singular Value Decomposition)

    The N ⨉ M ⨉ K tensor x_ijk is decomposed into a core tensor G (L1 ⨉ L2 ⨉ L3) and the singular value vectors u_{l1 i}, u_{l2 j}, u_{l3 k}:

    $x_{ijk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3) \, u_{l_1 i} u_{l_2 j} u_{l_3 k}$

    Example: N: number of genes (i), M: number of samples (j), K: number of tissues (k), x_ijk: gene expression.
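A minimal HOSVD sketch in NumPy, assuming the textbook construction (one SVD per mode unfolding, then projection onto the factors); the helper names `unfold` and `hosvd` and the tensor sizes are hypothetical and not taken from the talk. Truncation to L1, L2, L3 would correspond to keeping only the leading columns of each factor matrix.

```python
import numpy as np

def unfold(t, mode):
    """Matricize tensor t so that `mode` indexes the rows."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def hosvd(t):
    """Return the core tensor G and one orthogonal factor matrix per mode."""
    factors = [np.linalg.svd(unfold(t, m), full_matrices=False)[0]
               for m in range(t.ndim)]
    core = t
    for m, u in enumerate(factors):
        # Contract mode m of the core with u^T, then put the new axis back at position m.
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, m)), 0, m)
    return core, factors

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4, 3))          # x_ijk: genes x samples x tissues (synthetic)
G, (U1, U2, U3) = hosvd(x)

# x_ijk = sum_{l1 l2 l3} G(l1 l2 l3) u_{l1 i} u_{l2 j} u_{l3 k}
x_rebuilt = np.einsum('abc,ia,jb,kc->ijk', G, U1, U2, U3)
```

Because no iteration is involved, the cost is dominated by one SVD per mode.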
  6. Interpretation

    For some specific l2, u_{l2 j} (j: samples) distinguishes healthy controls from patients. For some specific l3, u_{l3 k} (k: tissues) shows tissue specific expression.
  7. tDEG: tissue specific Differentially Expressed Genes

    With l2 and l3 fixed, take the specific l1 with max |G(l1l2l3)|. The corresponding u_{l1 i} (i: genes) marks the tDEGs: those with healthy controls < patients and those with healthy controls > patients, where the directions are as labeled if G(l1l2l3) > 0.
  8. Integrated analysis of multiple matrices and/or tensors

    x_ij: expression of gene i of sample j. x_kj: methylation of region k of sample j. The N ⨉ M ⨉ K tensor is defined as

    $x_{ijk} \equiv x_{ij} \times x_{kj}$

    and decomposed into a core tensor G (L1 ⨉ L2 ⨉ L3) and the singular value vectors u_{l1 i}, u_{l2 j}, u_{l3 k}:

    $x_{ijk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3) \, u_{l_1 i} u_{l_2 j} u_{l_3 k}$
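The product tensor defined on this slide can be built in one line with `numpy.einsum`. The sketch below uses small synthetic matrices; the sizes and the names `expr` and `meth` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, M = 8, 5, 4                    # genes, regions, samples (hypothetical sizes)
expr = rng.normal(size=(N, M))       # x_ij: expression of gene i in sample j (synthetic)
meth = rng.normal(size=(K, M))       # x_kj: methylation of region k in sample j (synthetic)

# x_ijk = x_ij * x_kj: the shared sample index j is kept, not summed over.
x = np.einsum('ij,kj->ijk', expr, meth)
```

The resulting array has shape (N, M, K), matching the index order i, j, k used in the decomposition.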
  9. For gene expression

    With l2 fixed, take the specific l1, l3 with max |G(l1l2l3)|. The corresponding u_{l1 i} (i: genes) marks the DEGs (Differentially Expressed Genes): those with healthy controls < patients and those with healthy controls > patients, where the directions are as labeled if G(l1l2l3) > 0.
  10. For methylation

    u_{l3 k} (k: regions) marks the DMRs (Differentially Methylated Regions): those with healthy controls < patients and those with healthy controls > patients.
  11. Application example No.1

    Neurological Disorder Drug Discovery from Gene Expression with Tensor Decomposition. Author(s): Y-h. Taguchi*, Turki Turki. Journal Name: Current Pharmaceutical Design, Volume 25, Issue 43, 2019. OPEN ACCESS

    Inference of drugs effective for Alzheimer's disease from single cell gene expression of mouse brain (without drug treated gene expression)
  12. Data & experiments (mice)

    GEO: GSE127892. Two genotypes (APP_NL-F-G and C57Bl/6), two tissues (Cortex and Hippocampus), four ages (3, 6, 12, and 21 weeks), two sexes (male and female), four 96-well plates (the wells give the number of cells). Aim: understanding Alzheimer's disease
  13. Tensor $x_{i j_1 j_2 j_3 j_4 j_5 j_6}$ represents gene expression of the ith gene of the

    j1th cell (well), j2th genotype (j2 = 1: APP_NL-F-G and j2 = 2: C57Bl/6), j3th tissue (j3 = 1: Cortex and j3 = 2: Hippocampus), j4th age (j4 = 1: three weeks, j4 = 2: six weeks, j4 = 3: twelve weeks, and j4 = 4: twenty one weeks), j5th sex (j5 = 1: female and j5 = 2: male), and j6th plate.
  14. $x_{i j_1 j_2 j_3 j_4 j_5 j_6} = \sum_{l_1 l_2 l_3 l_4 l_5 l_6 l_7} G(l_1, l_2, l_3, l_4, l_5, l_6, l_7) \, u_{l_1 j_1} u_{l_2 j_2} u_{l_3 j_3} u_{l_4 j_4} u_{l_5 j_5} u_{l_6 j_6} u_{l_7 i}$

    (A) u_{l1 j1}: 96 wells (cells), l1 = 1
    (B) u_{l2 j2}: genotype APP_NL-F-G vs C57Bl/6, l2 = 1
    (C) u_{l3 j3}: Cortex vs Hippocampus, l3 = 1
    (D) u_{l4 j4}: 3, 6, 12, 21 weeks, l4 = 2
    (E) u_{l5 j5}: female vs male, l5 = 1
    (F) u_{l6 j6}: 4 plates, l6 = 1
    → l7 = 2 with G(1,1,1,2,1,1,l7) (the largest absolute value)
  15. Attributing P-values to genes

    $P_i = P_{\chi^2}\left[ > \left( \frac{u_{2i}}{\sigma_2} \right)^2 \right]$

    After correcting P-values by the BH criterion, 401 genes with corrected P_i < 0.01 are selected. → Evaluate how much these overlap with genes affected by known Alzheimer's drug treatments. The 401 genes are uploaded to Enrichr.
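The gene selection step on this slide (chi-squared P-values followed by Benjamini-Hochberg correction) might be sketched as below. Several points are assumptions rather than details from the talk: σ is estimated as the plain standard deviation of u over genes, one degree of freedom is used so that the upper tail reduces to `erfc`, and the data are synthetic with five planted outliers. The function names are hypothetical.

```python
import math
import numpy as np

def chi2_pvalues(u, sigma=None):
    """Upper-tail chi-squared (1 d.o.f.) P-values of (u_i / sigma)^2."""
    u = np.asarray(u, dtype=float)
    if sigma is None:
        sigma = u.std()          # assumption: sigma estimated from all genes
    z = np.abs(u / sigma)
    # For one degree of freedom, P[chi2 > z^2] = erfc(z / sqrt(2)).
    return np.array([math.erfc(v / math.sqrt(2.0)) for v in z])

def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted P-values (step-up procedure)."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest P-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

rng = np.random.default_rng(0)
u = rng.normal(size=1000)        # synthetic stand-in for u_{2i}
u[:5] = 10.0                     # five planted outlier "genes"
p_adj = benjamini_hochberg(chi2_pvalues(u))
selected = np.flatnonzero(p_adj < 0.01)
```

Genes whose component is far from the bulk of the distribution receive small adjusted P-values and end up in `selected`.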
  16. Top ranked 10 compounds listed in the “LINCS L1000 Chem Pert up” category in Enrichr.

    Overlap is that between the selected 401 genes and the genes selected in individual experiments. The table on the slide marks known Alzheimer's drugs.
  17. Top ranked 10 compounds listed in the “DrugMatrix” category in Enrichr.

    Overlap is that between the selected 401 genes and the genes selected in individual experiments. The table on the slide marks known Alzheimer's drugs.
  18. Top ranked 10 compounds listed in the “Drug Perturbations from GEO up” category in Enrichr.

    Overlap is that between the selected 401 genes and the genes selected in individual experiments. The table on the slide marks known Alzheimer's drugs.
  19. “Drugs → Gene Sets” is now easily performed by DrugEnrichr.

    Try it! https://amp.pharm.mssm.edu/DrugEnrichr/
  20. Comparison with other methods

    Although we tried to apply other methods, the following two methods did not converge within 24 hours. (The present method converged in 10 hours.)

    The first alternative method: CP decomposition (orthogonal tensor decomposition). The N ⨉ M ⨉ K tensor x_ijk is expressed as a sum of products of three vectors, u_{l1 i} u_{l2 j} u_{l3 k} + u_{l1 i} u_{l2 j} u_{l3 k} + ···
  21. The second alternative method: coupled matrix and tensor factorization (CMTF) (supervised TD)

    CP decomposition + penalty term. v_i, v_j, v_k, ...: various target vectors, e.g., time dependence, genotype dependence, cell dependence, plate dependence.
  22. Thus, we estimated CPU time using smaller scale model data.

    The model data are a 3-mode tensor generated by summation of two products of three identical vectors, $v_i v_j v_k + v'_i v'_j v'_k$. Two models, Model 1 and Model 2, are shown, each with the vectors v_i, v_j, v_k and v'_i, v'_j, v'_k.
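One possible reading of this model data, sketched in NumPy: each term is the outer product of a single vector with itself along all three modes. The vector length, the random entries, and the interpretation of "three identical vectors" are assumptions; the slide does not give the actual sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10                                   # hypothetical length of each mode
v = rng.normal(size=n)                   # first vector (synthetic)
v_prime = rng.normal(size=n)             # second vector (synthetic)

# x_ijk = v_i v_j v_k + v'_i v'_j v'_k
x = (np.einsum('i,j,k->ijk', v, v, v)
     + np.einsum('i,j,k->ijk', v_prime, v_prime, v_prime))
```

A tensor built this way has exactly two rank-one terms, so it is a convenient small test case for timing decomposition methods.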
  23. Figure labels: present method, ⨉10, ⨉10.

    Our method is the only method applicable to large scale data, since only our method does not require iteration!
  24. Application example No.2

    Tensor Decomposition-Based Unsupervised Feature Extraction Applied to Single-Cell Gene Expression Analysis. Y-h. Taguchi and Turki Turki. Frontiers in Genetics, Volume 10, Article 864, 2019. doi: 10.3389/fgene.2019.00864 OPEN ACCESS
  25. Data set: GSE76381, scRNA-seq of human and mouse midbrain development

    Human: $x_{ij} \in \mathbb{R}^{19531 \times 1977}$. Mouse: $x_{ik} \in \mathbb{R}^{24378 \times 1907}$. i: genes; j, k: cells.

    Purpose of the analysis: selection of genes associated with midbrain development commonly between human and mouse
  26. Cell numbers and time points

    Human: 6w: 287 cells, 7w: 131 cells, 8w: 331 cells, 9w: 322 cells, 10w: 509 cells, 11w: 397 cells, in total 1977 cells (w: week).

    Mouse: E11.5: 349 cells, E12.5: 350 cells, E13.5: 345 cells, E14.5: 308 cells, E15.5: 356 cells, E18.5: 142 cells, unknown: 57 cells, in total 1907 cells.
  27. Tensor decomposition: the tensor is generated from the product of cells using the 13,384 genes common between human and mouse

    $x_{ijk} = x_{ij} \times x_{ik} \in \mathbb{R}^{13384 \times 1977 \times 1907}$, i: genes; j, k: cells.

    Size reduction is needed because the tensor is too huge:

    $x_{jk} = \sum_i x_{ijk}$

    x_jk is decomposed by singular value decomposition. v_lj: lth human cell singular value vectors. v_lk: lth mouse cell singular value vectors.
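Since x_jk = Σ_i x_ij x_ik is simply a matrix product, the full tensor never has to be stored. The following sketch illustrates this on small synthetic matrices; the sizes and variable names are assumptions, and the explicit tensor is formed only to check the identity on this toy example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_human, n_mouse = 50, 7, 6               # hypothetical sizes
x_human = rng.normal(size=(n_genes, n_human))      # x_ij (synthetic)
x_mouse = rng.normal(size=(n_genes, n_mouse))      # x_ik (synthetic)

# x_jk = sum_i x_ij x_ik, computed without forming x_ijk.
x_jk = x_human.T @ x_mouse

# Explicit tensor, feasible only because this example is tiny.
x_ijk = np.einsum('ij,ik->ijk', x_human, x_mouse)
same = np.allclose(x_jk, x_ijk.sum(axis=0))

# SVD of the reduced matrix gives the cell singular value vectors.
U, s, Vt = np.linalg.svd(x_jk, full_matrices=False)
v_human = U        # v_human[:, l] corresponds to v_lj
v_mouse = Vt.T     # v_mouse[:, l] corresponds to v_lk
```

For the real data this replaces a 13384 ⨉ 1977 ⨉ 1907 array with a 1977 ⨉ 1907 matrix.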
  28. $v_{lj} = a_l + \sum_t b_{lt} \delta_{jt}$, $v_{lk} = a'_l + \sum_t b'_{lt} \delta_{kt}$

    δ_jt, δ_kt: 1 when cell j or k is measured at time t, 0 otherwise. v_lj and v_lk with any kind of time dependence are selected with categorical regression (ANOVA).
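Regression on the indicator variables δ_jt is equivalent to a one-way ANOVA with the time point as the factor. A minimal sketch of the F statistic for one singular value vector is given below; the function name and the toy numbers are hypothetical, and the conversion to a P-value (for example with `scipy.stats.f.sf`) is left out to keep the block dependency-free.

```python
import numpy as np

def anova_f(values, groups):
    """One-way ANOVA F statistic of `values` grouped by the labels in `groups`."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    grand_mean = values.mean()
    ss_between = 0.0
    ss_within = 0.0
    for g in labels:
        member = values[groups == g]
        ss_between += member.size * (member.mean() - grand_mean) ** 2
        ss_within += ((member - member.mean()) ** 2).sum()
    df_between = labels.size - 1
    df_within = values.size - labels.size
    return (ss_between / df_between) / (ss_within / df_within)

# Toy example: nine cells measured at three time points.
v_l = [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
time_point = [0, 0, 0, 1, 1, 1, 2, 2, 2]
f_stat = anova_f(v_l, time_point)
```

A large F statistic indicates that the vector differs between time points, which is the "any kind of time dependence" criterion on the slide.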
  29. How many of the selected singular value vectors are common?

    Singular value vectors associated with adjusted P-values less than 0.01 are selected. Venn diagram: 23 (human), 12 (mouse), 32 (common).
  30. u_li are generated from v_lj and v_lk

    $u_{li}^{(j)} = \sum_j v_{lj} x_{ij}$: lth human gene singular value vectors. $u_{li}^{(k)} = \sum_k v_{lk} x_{ik}$: lth mouse gene singular value vectors.

    P-values are attributed to gene singular value vectors by the χ2 distribution and corrected by the BH criterion. Genes associated with adjusted P-values less than 0.01 are selected.
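The projection on this slide is a single matrix product. The sketch below also forms a per-gene statistic Σ_l (u_li/σ_l)², under the assumption that σ_l is the standard deviation of u_li over genes; the sizes, the random placeholder for the selected v_lj, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_cells, n_selected = 30, 8, 3          # hypothetical sizes
x = rng.normal(size=(n_genes, n_cells))          # x_ij (synthetic)
v = rng.normal(size=(n_cells, n_selected))       # selected v_lj, one column per l (placeholder)

# u_li = sum_j v_lj x_ij, for every gene i and every selected l at once.
u = x @ v

# Per-gene statistic sum_l (u_li / sigma_l)^2; sigma_l is an assumed estimate.
sigma = u.std(axis=0)
stat = ((u / sigma) ** 2).sum(axis=1)
```

Under this reading, P-values would come from the upper tail of a χ2 distribution whose degrees of freedom equal the number of selected l, which is itself an assumption here.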
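  31. P-values by χ2 distribution

    $P_i = P_{\chi^2}\left[ > \sum_l \left( \frac{u_{li}}{\sigma_l} \right)^2 \right]$

    Figure: histogram P(p) against 1 − p (from 0 to 1), with the Benjamini-Hochberg corrected P < 0.01 region marked. Selected genes: 151 (human), 200 (mouse), 305 (common).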
  32. Validation: uploaded to Enrichr (enrichment server)

    “Allen Brain Atlas”, top ranked five terms. For both human and mouse, four out of the top five are related to the hypothalamus, which belongs to the midbrain.
  33. Comparisons with other methods

    Highly variable genes: selected genes 144 (human), 127 (mouse), 44 (common). There is less overlap between human and mouse, and no biological terms related to brain are enriched.

    More comparisons are available in the following paper: Y-h. Taguchi, ICIC2018 (2018) “Principal Component Analysis-Based Unsupervised Feature Extraction Applied to Single-Cell Gene Expression Analysis” https://doi.org/10.1007/978-3-319-95933-7_90
  34. Conclusions

    Tensor decomposition based unsupervised feature extraction is applicable to massive single cell RNA-seq data and is capable of selecting biologically reasonable genes. Since it is an unsupervised method, it is easy to use and is applicable to a wide range of scRNA-seq data sets.