Slide 1

Slide 1 text

Principal Component Analysis-Based Unsupervised Feature Extraction Applied to Single Cell Gene Expression Analysis
Y-h. Taguchi
Department of Physics, Chuo University, Tokyo, Japan

Slide 2

Slide 2 text

Introduction

By definition, single cell (sc) RNA-seq data sets are unlabeled. Thus, clustering (e.g., tSNE) is unavoidable. However, for clustering to perform well, a limited number of genes often must be selected first. Because the samples are unlabeled, conventional gene selection procedures based upon the t test and/or fold change analysis cannot be employed, since these analyses cannot be performed without classifying samples into two groups.

Slide 3

Slide 3 text

Some popular unsupervised gene selection procedures

1. Highly Variable Genes: Genes with larger variance over single cells are selected.
2. Bimodal Genes: Selected genes must, at least, not be unimodal, since a unimodal distribution is unlikely to distinguish between multiple classes.
3. dpFeature: A more sophisticated method that includes clustering.
4. Principal component analysis based unsupervised feature extraction (PCA based unsupervised FE) ← proposed method.
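
The simplest of these procedures, variance-based selection, can be sketched as follows. This is only a minimal illustration, assuming an expression matrix with genes in rows and cells in columns; the function name `top_variable_genes` and the toy data are hypothetical, and practical implementations often adjust the variance for mean expression, which is not done here.

```python
import numpy as np

def top_variable_genes(expr, k):
    """Return row indices of the k genes with the largest variance across cells.

    expr: 2-D array, genes in rows, single cells in columns (assumed layout).
    """
    variances = expr.var(axis=1)          # variance of each gene over cells
    order = np.argsort(variances)[::-1]   # gene indices, largest variance first
    return order[:k]

# Toy data (hypothetical): 50 genes x 30 cells, first five genes made more variable.
rng = np.random.default_rng(0)
expr = rng.normal(0.0, 1.0, size=(50, 30))
expr[:5] *= 4.0
selected = top_variable_genes(expr, 5)
print(sorted(selected.tolist()))
```

No class labels enter the computation, which is what makes the procedure usable on unlabeled single cell data.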

Slide 4

Slide 4 text

What is PCA based unsupervised FE?

In contrast to the usual usage of PCA, not samples but features are embedded into a Q dimensional space.

[Figure: an N × M matrix X (numerical values; N features, M samples with categorical multiclasses) is passed to PCA, giving PC scores (scatter plot, PC1 vs. PC2) and PC loadings (PC1 across samples). No distinction between classes.]
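
A minimal sketch of embedding features rather than samples is given below. It is an illustration under stated assumptions, not the author's implementation: the decomposition is done with a singular value decomposition, each sample column is centered over features (an assumed preprocessing choice), and the name `embed_features` is hypothetical.

```python
import numpy as np

def embed_features(X, Q=2):
    """Embed the N features of an N x M matrix X into a Q dimensional space.

    Returns (scores, loadings): scores is N x Q (one point per feature),
    loadings is M x Q (one weight per sample for each component).
    """
    # Assumption: center each sample (column) over the features.
    Xc = X - X.mean(axis=0, keepdims=True)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :Q] * s[:Q]   # PC scores attributed to features
    loadings = Vt[:Q].T         # PC loadings attributed to samples
    return scores, loadings

# Toy matrix (hypothetical): N = 100 features, M = 20 samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
scores, loadings = embed_features(X, Q=2)
print(scores.shape, loadings.shape)
```

The sample labels are never passed to the function, so the embedding makes no distinction between classes.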

Slide 5

Slide 5 text

Synthetic example

Data: 100 features (90 + 10) × 20 samples (10 + 10), with entries drawn from N(0,1/2), N(m,1/2), and [N(m,1/2)+N(0,1/2)]/2, where N(μ,1/2) is the normal distribution with mean μ and SD 1/2.

[Figure: PC scores, PC1 vs. PC2, for m=2; +: top 10 outliers]

Accuracy (100 trials): 89.5% (m=2), 52.6% (m=1)

Thus, extracting outliers selects features distinct between two classes in an unsupervised way.
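
An experiment of this kind can be sketched as below. The layout is a simplification and an assumption: 10 of the 100 features have mean m in the first 10 samples only, and the mixture block [N(m,1/2)+N(0,1/2)]/2 of the slide is omitted. Outliers are taken as the 10 features farthest from the origin in the PC1-PC2 plane, which is also an assumed rule, so the accuracies printed here are not expected to reproduce the figures on the slide.

```python
import numpy as np

def one_trial(m, rng, n_noise=90, n_signal=10, n_per_class=10, sd=0.5):
    """Fraction of the n_signal class-dependent features recovered as outliers."""
    n_feat = n_noise + n_signal
    X = rng.normal(0.0, sd, size=(n_feat, 2 * n_per_class))
    # ASSUMPTION: the first n_signal features have mean m in the first class only.
    X[:n_signal, :n_per_class] += m
    # ASSUMPTION: center each sample (column) over the features before PCA.
    Xc = X - X.mean(axis=0, keepdims=True)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :2] * s[:2]                  # features embedded in (PC1, PC2)
    dist2 = (scores ** 2).sum(axis=1)          # squared distance from the origin
    top = np.argsort(dist2)[::-1][:n_signal]   # the n_signal strongest outliers
    return float(np.mean(top < n_signal))

def mean_accuracy(m, trials=100, seed=0):
    """Average recovery over independent trials."""
    rng = np.random.default_rng(seed)
    return float(np.mean([one_trial(m, rng) for _ in range(trials)]))

acc_m2 = mean_accuracy(2.0)
acc_m1 = mean_accuracy(1.0)
print(acc_m2, acc_m1)
```

The class membership of the samples is used only to generate the data, never to select the features.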

Slide 6

Slide 6 text

Gene expression profiles

GEO ID GSE76381.
Human: embryo ventral midbrain cells between 6 and 11 weeks of gestation.
Mouse: ventral midbrain cells at six developmental stages between E11.5 and E18.5, Th+ neurons at P19–P27, and FACS-sorted putative dopaminergic neurons at P28–P56 from Slc6a3-Cre/tdTomato mice.

Slide 7

Slide 7 text

Results

Overlap of selected genes between human and mouse (ratio = common / total)

Method                 Human only  Common  Mouse only  Total             Ratio
PCA                    63          53      65          63+53+65=181      0.29  ← Highest overlap!
Highly Variable Genes  124         44      127         124+44+127=295    0.15
dpFeature              124         76      124         124+76+124=324    0.23
Bimodal genes          179         22      179         179+22+179=380    0.06

dpFeature: Human: 13775, Mouse: 13362, Top 200
Bimodal genes: Human: 11344, Mouse: 10849, Top 200
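
The ratios on this slide can be rechecked with a few lines of arithmetic. The sketch assumes each triple of counts is read as (human only, common, mouse only) and that the ratio is the common count divided by the total; the names `venn` and `overlap_ratio` are hypothetical.

```python
# Counts read from the slide as (human only, common, mouse only); assumed reading.
venn = {
    "PCA": (63, 53, 65),
    "Highly Variable Genes": (124, 44, 127),
    "dpFeature": (124, 76, 124),
    "Bimodal genes": (179, 22, 179),
}

def overlap_ratio(human_only, common, mouse_only):
    """Common genes divided by all genes selected in either species."""
    return common / (human_only + common + mouse_only)

ratios = {name: round(overlap_ratio(*counts), 2) for name, counts in venn.items()}
best = max(ratios, key=ratios.get)   # method with the highest overlap ratio
print(ratios, best)
```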

Slide 8

Slide 8 text

Biological validation

“MGI Mammalian Phenotype 2017” in Enrichr

PCA: Top four terms are brain related. ← Best method!
Highly Variable Genes: No brain related terms in top five.
Bimodal genes: No significantly enriched terms.
dpFeature: Only the fifth one among the top five is brain related.

Slide 9

Slide 9 text

“Jensen TISSUES” by Enrichr

PCA: [table]
Highly Variable Genes: No brain related terms in top five.

Bimodal genes
       Term             Overlap    P-value   Adjusted P-value
Human  Embryonic_brain  150/4936   6.42E-51  8.15E-49
Mouse  Embryonic_brain  122/4936   8.01E-28  2.36E-26

Slide 10

Slide 10 text

dpFeature
       Term             Overlap    P-value   Adjusted P-value
Human  Embryonic_brain  110/4936   3.32E-20  1.14E-18
Mouse  Embryonic_brain  122/4936   3.32E-20  1.14E-18

Although PCA based unsupervised FE could not outperform either bimodal genes or dpFeature, the ratios are still comparable.

# of genes overlapped / # of genes uploaded
           Human          Mouse
PCA        71/116=0.61    75/118=0.64
Bimodal    150/200=0.75   122/200=0.61
dpFeature  110/200=0.55   122/200=0.61

Slide 11

Slide 11 text

Conclusions

Four unsupervised feature selection methods were applied to human and mouse brain development scRNA-seq data. The proposed method (PCA based unsupervised FE) achieved:

1. The highest overlap ratio of genes identified between human and mouse.
2. The most significant enrichment in “MGI Mammalian Phenotype 2017” in Enrichr.
3. Comparable enrichment of Embryonic_brain in “Jensen TISSUES” by Enrichr.

More biological validations are available in the paper.
https://doi.org/10.1101/312892 (preprint)
https://doi.org/10.1007/978-3-319-95933-7_90