
Principal Component Analysis-Based Unsupervised Feature Extraction Applied to Single Cell Gene Expression Analysis

Y-h. Taguchi
August 15, 2018

Presentation at ICIC2018
http://ic-ic.tongji.edu.cn/2018/index.htm
papers
https://doi.org/10.1101/312892 (preprint)
https://doi.org/10.1007/978-3-319-95933-7_90

Transcript

  1. Principal Component Analysis-Based Unsupervised Feature Extraction Applied to Single Cell Gene Expression Analysis

    Y-h. Taguchi, Department of Physics, Chuo University, Tokyo, Japan.
  2. Introduction

    By definition, single cell (sc) RNA-seq data sets are unlabeled. Thus, clustering (e.g., tSNE) is inevitable. However, to cluster well, a limited number of genes must often be selected. Because the samples are unlabeled, conventional gene selection procedures based on the t test and/or fold change analysis cannot be employed, since these analyses require the samples to be classified into two groups.
  3. Some popular unsupervised gene selection procedures

    1. Highly Variable Genes: Genes with larger variance over single cells are selected.
    2. Bimodal Genes: Genes must at least not be unimodal, since a unimodal distribution is unlikely to distinguish between multiple classes.
    3. dpFeature: A more sophisticated method that includes clustering.
    4. Principal component analysis based unsupervised feature extraction: the proposed method.
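As a rough illustration of the first criterion above, the sketch below ranks genes by their raw variance across cells and keeps the top k. This is only the simplest reading of "genes with larger variance"; the function name `top_variance_genes`, the genes × cells layout, and the cutoff `k` are assumptions for illustration, and practical highly-variable-gene procedures usually also model the mean-variance relationship.

```python
import numpy as np

def top_variance_genes(X, k):
    """Return indices of the k genes with the largest variance across cells.

    X is assumed to be a genes x cells matrix of expression values.
    """
    variances = X.var(axis=1)                 # one variance per gene (row)
    return np.argsort(variances)[::-1][:k]    # indices sorted by decreasing variance

# Toy example: gene 1 varies most, gene 2 a little, gene 0 not at all.
X = np.array([
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 10.0, 0.0, 10.0],
    [1.0, 2.0, 1.0, 2.0],
])
selected = top_variance_genes(X, 2)
```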
  4. What is PCA based unsupervised FE?

    In contrast to the usual usage of PCA, not samples but features are embedded into a Q dimensional space.
    [Figure: N × M matrix X (numerical values; N features, M samples from categorical multiclasses) → PCA → PC scores for features (PC1 vs. PC2) and PC loadings for samples (PC1); no distinction between classes.]
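The embedding of features rather than samples can be sketched with a singular value decomposition. This is a minimal sketch, not the author's implementation: the function name `feature_pc_scores` is hypothetical, and standardising each sample (column) over the features before the decomposition is an assumption that the slide does not state.

```python
import numpy as np

def feature_pc_scores(X, n_components=2):
    """Embed the N features (rows of X) into n_components dimensions.

    X is an N features x M samples matrix. Returns
    (scores, loadings): scores is N x Q (one point per feature),
    loadings is M x Q (one weight per sample and component).
    """
    # Assumption: standardise every sample (column) over the features.
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # PC scores, attributed to features
    loadings = Vt[:n_components].T                     # PC loadings, attributed to samples
    return scores, loadings

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))          # 50 features, 8 samples (toy data)
scores, loadings = feature_pc_scores(X, n_components=2)
```

No class labels enter the computation at any point, which is what makes the procedure unsupervised.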
  5. Synthetic example

    [Figure: data matrix of 100 features (90 + 10) × 20 samples (10 + 10), with blocks drawn from N(0,1/2), N(m,1/2) and [N(m,1/2)+N(0,1/2)]/2, where N(μ,1/2) is a normal distribution with mean μ and SD 1/2; scatter plot of PC1 vs. PC2 for m=2, with + marking the top 10 outliers.]
    Thus, extracting outliers selects features distinct between two classes in an unsupervised way.
    Accuracy (100 trials): 89.5% (m=2), 52.6% (m=1).
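A simulation in the spirit of this synthetic example might look like the sketch below. Several details are assumptions, because the transcript does not preserve the figure layout: the 10 class-distinct features are taken to follow N(0,1/2) in the first 10 samples and N(m,1/2) in the other 10; the 90 remaining features are drawn entry by entry from an equal mixture of the two distributions; outliers are the 10 features farthest from the origin in the (PC1, PC2) plane; and accuracy is the fraction of selected features that are truly class-distinct. The function name `simulate_accuracy` is hypothetical, and the sketch is not expected to reproduce the reported percentages exactly.

```python
import numpy as np

def simulate_accuracy(m, n_trials=100, seed=0):
    """Mean fraction of the top-10 outlier features that are class-distinct."""
    rng = np.random.default_rng(seed)
    n_noise, n_signal, n_per_class, sd = 90, 10, 10, 0.5
    hits = 0
    for _ in range(n_trials):
        # Assumption: 90 features, each entry from an equal mixture of
        # N(0, 1/2) and N(m, 1/2), with no relation to the classes.
        centres = m * rng.integers(0, 2, size=(n_noise, 2 * n_per_class))
        noise = rng.normal(centres, sd)
        # Assumption: 10 features with N(0, 1/2) in class 1 and N(m, 1/2) in class 2.
        signal = np.hstack([
            rng.normal(0.0, sd, size=(n_signal, n_per_class)),
            rng.normal(m, sd, size=(n_signal, n_per_class)),
        ])
        X = np.vstack([noise, signal])            # 100 features x 20 samples
        # Assumption: standardise each sample (column) over the features.
        Xc = (X - X.mean(axis=0)) / X.std(axis=0)
        U, s, _ = np.linalg.svd(Xc, full_matrices=False)
        scores = U[:, :2] * s[:2]                 # PC1 and PC2 scores per feature
        dist = np.sum(scores ** 2, axis=1)        # squared distance from the origin
        top = np.argsort(dist)[-n_signal:]        # the 10 most outlying features
        hits += np.sum(top >= n_noise)            # rows 90..99 are class-distinct
    return hits / (n_trials * n_signal)

accuracy_m2 = simulate_accuracy(2, n_trials=30, seed=1)
accuracy_m0 = simulate_accuracy(0, n_trials=30, seed=1)
```

With m=0 the two classes are identical, so the selection should be close to random; comparing that baseline with m=2 shows how much of the class structure the outlier criterion picks up without ever seeing the labels.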
  6. Gene expression profiles

    GEO ID GSE76381.
    Human: embryo ventral midbrain cells between 6 and 11 weeks of gestation.
    Mouse: ventral midbrain cells at six developmental stages between E11.5 and E18.5, Th+ neurons at P19–P27, and FACS-sorted putative dopaminergic neurons at P28–P56 from Slc6a3-Cre/tdTomato mice.
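  7. Results

    Overlap between the genes selected for human and for mouse (overlap ratio = shared / (human only + shared + mouse only)):
    PCA: human only 63, shared 53, mouse only 65; 53/(63+53+65) = 53/181 = 0.29. Highest overlap!
    Highly Variable Genes: human only 124, shared 44, mouse only 127; 44/(124+44+127) = 44/295 = 0.15.
    dpFeature (Human: 13775, Mouse: 13362; top 200): human only 124, shared 76, mouse only 124; 76/(124+76+124) = 76/324 = 0.23.
    Bimodal genes (Human: 11344, Mouse: 10849; top 200): human only 179, shared 22, mouse only 179; 22/(179+22+179) = 22/380 = 0.06.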
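The ratios on this slide are the number of shared genes divided by the size of the union of the human and mouse gene sets (the Jaccard index). The sketch below recomputes them from the Venn-diagram counts in the order they appear on the slide; the function name `overlap_ratio` is an illustrative choice.

```python
def overlap_ratio(only_a, shared, only_b):
    """Shared genes divided by the size of the union of both gene sets."""
    return shared / (only_a + shared + only_b)

# (human only, shared, mouse only), in the order the counts appear on the slide.
venn_counts = [
    (63, 53, 65),
    (124, 44, 127),
    (124, 76, 124),
    (179, 22, 179),
]

ratios = [round(overlap_ratio(*counts), 2) for counts in venn_counts]
print(ratios)
```

Normalising by the union keeps the comparison fair even though the methods select different numbers of genes.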
  8. Biological validation: “MGI Mammalian Phenotype 2017” in Enrichr

    PCA: top four terms are brain related. ← Best method!
    Highly Variable Genes: no brain related terms in top five.
    Bimodal genes: no significantly enriched terms.
    dpFeature: only the fifth one among the top five is brain related.
  9. “Jensen TISSUES” by Enrichr

    PCA: [table not captured in the transcript]
    Highly Variable Genes: no brain related terms in top five.
    Bimodal genes:
    Term                   Overlap   P-value   Adjusted P-value
    Human Embryonic_brain  150/4936  6.42E-51  8.15E-49
    Mouse Embryonic_brain  122/4936  8.01E-28  2.36E-26
  10. “Jensen TISSUES” by Enrichr (continued)

    dpFeature:
    Term                   Overlap   P-value   Adjusted P-value
    Human Embryonic_brain  110/4936  3.32E-20  1.14E-18
    Mouse Embryonic_brain  122/4936  3.32E-20  1.14E-18

    Although PCA based unsupervised FE could not outperform either bimodal genes or dpFeature, the ratios are still comparable.

    # of genes overlapped / # of genes uploaded
               Human         Mouse
    PCA        71/116=0.61   75/118=0.64
    Bimodal    150/200=0.75  122/200=0.61
    dpFeature  110/200=0.55  122/200=0.61
  11. Conclusions

    Four unsupervised feature selection methods were applied to human and mouse brain development scRNA-seq data. The proposed method (PCA based unsupervised FE) achieved:
    1. The highest overlap ratio of genes identified between human and mouse.
    2. The most significant enrichment in “MGI Mammalian Phenotype 2017” in Enrichr.
    3. Comparable enrichment of Embryonic_brain in “Jensen TISSUES” by Enrichr.
    More biological validations are available in the paper.
    https://doi.org/10.1101/312892 (preprint)
    https://doi.org/10.1007/978-3-319-95933-7_90