Principal component analysis and tensor decomposition based unsupervised feature extraction applied to single cell RNA-seq

Y-h. Taguchi
September 24, 2019

Presentation at RNA frontier meeting, 24th Sep 2019.
https://sites.google.com/keio.jp/rnafrontier2019
IBM Amagi Homestead (English Version)

Transcript

  1. Principal component analysis and tensor decomposition based unsupervised feature extraction applied to single cell RNA-seq

    Y-h. Taguchi, Department of Physics, Chuo University

    Papers:

    Principal component analysis (PCA): Y-h. Taguchi, Principal Component Analysis-Based Unsupervised Feature Extraction Applied to Single-Cell Gene Expression Analysis. ICIC 2018. (2018) Lecture Notes in Computer Science, vol 10955. Springer, Cham https://doi.org/10.1007/978-3-319-95933-7_90 Preprint: https://doi.org/10.1101/312892

    Tensor decomposition (TD): Y-h. Taguchi and Turki Turki, Tensor Decomposition-Based Unsupervised Feature Extraction Applied to Single-Cell Gene Expression Analysis, Front. Genet., (2019) Vol.10, p864 https://doi.org/10.3389/fgene.2019.00864
  2. What is a tensor? What are PCA and TD?

    What is a tensor? An extension of a matrix.
    Matrix: genes i × humans j (patients vs. healthy controls): $x_{ij}$
    Tensor: genes i × humans j (patients vs. healthy controls) × tissues k: $x_{ijk}$

    What are PCA and TD? They decompose a matrix or a tensor into vectors.

    [Figure: an N × M matrix (genes i × humans j) written as the product of a gene vector and a human vector; a tensor (genes i × humans j × tissues k) written as the product of gene, human, and tissue vectors.]

    By decomposing into vectors, we obtain "gene", "human", and "tissue" vectors and can attach "meaning" to them. In practice, we need not a single vector but a set of vectors.
  3. What is PCA based unsupervised FE?

    Input: an N × M matrix X of numerical values (N features, M samples); the samples belong to multiple categorical classes.

    In contrast to the usual usage of PCA, not samples but features are embedded into a Q-dimensional space.

    [Figure: PCA of X. Panels: PC loadings (samples) and PC scores (features), both on PC1 vs. PC2 axes; "+" markers in the PC score panel; annotation "No distinction between classes".]
  4. Synthetic example

    [Figure: data matrix split into 10 samples + 10 samples and 90 features + 10 features, with blocks labelled N(0,1/2), N(m,1/2), and [N(m,1/2)+N(0,1/2)]/2; PC1 vs. PC2 plot in which "+" marks the top 10 outliers, for m=2.]

    N(μ, 1/2): normal distribution with mean μ and SD 1/2.

    Accuracy (100 trials): 89.5% (m=2), 52.6% (m=1)

    Thus, extracting outliers selects features that are distinct between the two classes in an unsupervised way.
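
The outlier-selection idea in the synthetic example can be sketched in a few lines of NumPy. This is a simplified illustration, not the exact design on the slide: it assumes 90 pure-noise features and 10 features shifted by m in the first class only, and it does not use the mixed [N(m,1/2)+N(0,1/2)]/2 block, so it is not expected to reproduce the accuracies quoted above. The function names `make_synthetic` and `top_outlier_features` are hypothetical.

```python
import numpy as np

def make_synthetic(m, rng, n_noise=90, n_signal=10, n_per_class=10):
    """Assumed layout: rows are features, columns are samples (two classes)."""
    sd = 0.5
    n_samples = 2 * n_per_class
    X = rng.normal(0.0, sd, size=(n_noise + n_signal, n_samples))
    # Shift the last n_signal features by m in the first class only.
    X[n_noise:, :n_per_class] += m
    return X

def top_outlier_features(X, n_select=10, n_components=2):
    """Embed features (not samples) with PCA and return the most outlying ones."""
    # Center each sample (column) so that the features are the embedded points.
    Xc = X - X.mean(axis=0, keepdims=True)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]  # PC scores, one row per feature
    dist2 = (scores ** 2).sum(axis=1)                # squared distance from the origin
    return np.sort(np.argsort(dist2)[-n_select:])

rng = np.random.default_rng(0)
X = make_synthetic(m=2.0, rng=rng)
selected = top_outlier_features(X)
# Fraction of selected features that are truly class-dependent (indices 90..99).
accuracy = np.mean(selected >= 90)
print(selected, accuracy)
```

No class labels are passed to `top_outlier_features`; the labels are used only afterwards, to score the selection.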
  5. Data set

    GSE76381: scRNA-seq of human and mouse midbrain development

    Human: $x_{ij} \in \mathbb{R}^{19531 \times 1977}$
    Mouse: $x_{ik} \in \mathbb{R}^{24378 \times 1907}$
    i: genes; j, k: cells

    Cell numbers and time points:
    Human: 6w: 287 cells, 7w: 131 cells, 8w: 331 cells, 9w: 322 cells, 10w: 509 cells, 11w: 397 cells; in total, 1977 cells (w: week)
    Mouse: E11.5: 349 cells, E12.5: 350 cells, E13.5: 345 cells, E14.5: 308 cells, E15.5: 356 cells, E18.5: 142 cells, unknown: 57 cells; in total, 1907 cells
  6. Gene selection

    Genes without expression are discarded. Each cell is standardized to mean 0 and variance 1.

    With PCA, PC scores $u_{li}$ are attributed to genes and PC loadings $v_{lj}$ are attributed to samples (this differs from the usual usage).

    The $u_{li}$ are assumed to obey a multivariate Gaussian distribution:

    $P_i = P_{\chi^2}\left[ > \sum_{l=1}^{L} \left( \frac{u_{li}}{\sigma_l} \right)^2 \right]$

    $P_i$ is corrected by the Benjamini-Hochberg procedure. Genes with corrected $P_i < 0.01$ are selected.

    cf. Presentation No. O-17, "Examination of FDR cutoff levels for gene selection", Kota Fujisawa and Ryuta Miyata

    [Venn diagram of selected genes, Human (L=2) vs. Mouse (L=3); counts shown: 63, 65, 53.]
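
The selection rule on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's code: the function names `bh_adjust` and `pca_fe_select` are hypothetical, the first L components are used (as the sum over l = 1..L suggests), and $\sigma_l$ is taken to be the standard deviation of $u_{li}$ over genes, which the slide does not spell out.

```python
import numpy as np
from scipy.stats import chi2

def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest P-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted

def pca_fe_select(X, n_components, alpha=0.01):
    """X: genes x cells, with unexpressed genes already removed.

    Returns the indices of selected genes and the adjusted P-values.
    """
    # Standardize each cell (column) to mean 0 and variance 1.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    u = U[:, :n_components] * s[:n_components]   # PC scores u_li, one row per gene
    sigma = u.std(axis=0)                        # assumption: sigma_l = SD of u_li over genes
    stat = ((u / sigma) ** 2).sum(axis=1)        # sum_l (u_li / sigma_l)^2
    p = chi2.sf(stat, df=n_components)           # upper-tail chi-squared probability
    p_adj = bh_adjust(p)
    return np.flatnonzero(p_adj < alpha), p_adj
```

For example, `pca_fe_select(X_human, n_components=2)` would correspond to the human setting L=2, where `X_human` stands for a genes × cells expression matrix.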
  7. Validation

    The selected genes were uploaded to Enrichr (an enrichment server).

    "MGI Mammalian Phenotype 2017"

    Top five ranked terms: Cerebral cortex, Axon, Dentate gyrus, Hippocampus, and Olfactory bulb → all are brain related.

    Other enrichment analyses are omitted here.
  8. Tensor decomposition

    The tensor is generated from the product over cells:

    $x_{ijk} = x_{ij} \times x_{ik} \in \mathbb{R}^{13384 \times 1977 \times 1907}$

    i: genes; j, k: cells

    Size reduction is needed because the tensor is too large:

    $x_{jk} = \sum_i x_{ijk}$

    $x_{jk}$ is singular value decomposed:
    $v_{lj}$: lth human cell singular value vector
    $v_{lk}$: lth mouse cell singular value vector

    $v_{lj}$ and $v_{lk}$ with any kind of time dependence are selected with categorical regression (ANOVA):

    $v_{lj} = a_l + \sum_t b_{lt} \delta_{jt}$
    $v_{lk} = a'_l + \sum_t b'_{lt} \delta_{kt}$

    $\delta_{jt}$, $\delta_{kt}$: 1 when cell j (or k) is measured at time t, otherwise 0
  9. How many of the selected singular value vectors are common?

    [Venn diagram of selected singular value vectors, Human vs. Mouse; counts shown: 12, 23, 32.]

    $u_{li}$ are generated from $v_{lj}$ and $v_{lk}$:

    $u_{li}^{(j)} = \sum_j v_{lj} x_{ij}$: lth human gene singular value vector
    $u_{li}^{(k)} = \sum_k v_{lk} x_{ik}$: lth mouse gene singular value vector

    P-values are attributed to the gene singular value vectors using the $\chi^2$ distribution and corrected by the BH criterion; genes with corrected P < 0.01 are selected.
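
The size reduction and the steps on the two slides above can be sketched as follows. This is a hedged illustration, not the author's implementation: it assumes both expression matrices are restricted to the same genes in the same row order, it uses `scipy.stats.f_oneway` for the one-way ANOVA (equivalent to the F-test of the categorical regression on time points), and all function names are hypothetical. Since $x_{jk} = \sum_i x_{ij} x_{ik}$, the summed tensor is just a matrix product and the three-way tensor never needs to be stored.

```python
import numpy as np
from scipy.stats import f_oneway

def summed_tensor(X_h, X_m):
    """x_jk = sum_i x_ij * x_ik for X_h (genes x human cells), X_m (genes x mouse cells)."""
    return X_h.T @ X_m

def cell_singular_vectors(X_h, X_m, n_vectors):
    """SVD of x_jk; returns v_lj (human cells), v_lk (mouse cells) and singular values."""
    U, s, Vt = np.linalg.svd(summed_tensor(X_h, X_m), full_matrices=False)
    return U[:, :n_vectors].T, Vt[:n_vectors], s[:n_vectors]

def anova_pvalues(v, time_labels):
    """One-way ANOVA P-value of each singular value vector against time points."""
    time_labels = np.asarray(time_labels)
    groups = np.unique(time_labels)
    return np.array([
        f_oneway(*[row[time_labels == g] for g in groups]).pvalue
        for row in v
    ])

def gene_singular_vectors(X, v):
    """u_li = sum_j v_lj * x_ij; returns an array of shape (n_vectors, n_genes)."""
    return v @ X.T
```

A possible workflow would be to compute `anova_pvalues` for the human and mouse cell vectors separately, keep the l with significant time dependence in both, and then pass the corresponding rows to `gene_singular_vectors` before the $\chi^2$-based gene selection.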
  10. Selected genes

    [Venn diagram of selected genes, Human vs. Mouse; counts shown: 151, 200, 305.]

    Validation: the selected genes were uploaded to Enrichr (an enrichment server).

    "Allen Brain Atlas"

    Top five ranked terms: Hypothalamus ∈ midbrain

    Other enrichment analyses are omitted here.
  11. Summary

    We can select biologically reasonable genes with unsupervised methods using PCA and TD.

    Because labels are lacking or scarce in scRNA-seq, these methods are well suited to it.

    I have published a monograph from Springer. I would be happy if you could buy it, although it is extremely expensive.