Slide 1

Slide 1 text

Principal component analysis and tensor decomposition based unsupervised feature extraction applied to single cell RNA-seq
Y-h. Taguchi, Department of Physics, Chuo University
Papers:
Principal component analysis (PCA): Y-h. Taguchi, Principal Component Analysis-Based Unsupervised Feature Extraction Applied to Single-Cell Gene Expression Analysis. ICIC 2018. (2018) Lecture Notes in Computer Science, vol 10955. Springer, Cham. https://doi.org/10.1007/978-3-319-95933-7_90 Preprint: https://doi.org/10.1101/312892
Tensor decomposition (TD): Y-h. Taguchi and Turki Turki, Tensor Decomposition-Based Unsupervised Feature Extraction Applied to Single-Cell Gene Expression Analysis, Front. Genet. (2019) Vol. 10, p. 864. https://doi.org/10.3389/fgene.2019.00864

Slide 2

Slide 2 text

What is PCA and TD? Decompose a matrix or a tensor into vectors.
What is a tensor? An extension of a matrix.
Matrix: genes i × humans j (patients vs. healthy controls): $x_{ij}$
Tensor: genes i × humans j (patients vs. healthy controls) × tissues k: $x_{ijk}$
[Figure: an N × M matrix (genes i × humans j) written as the product of a gene vector and a human vector; a three-mode tensor (genes i × humans j × tissues k) written as the product of a gene vector, a human vector, and a tissue vector.]
By decomposing into vectors, we obtain "gene", "human", and "tissue" vectors and can read a "meaning" from them. In practice, we need not a single vector but a set of vectors.
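The idea of decomposing a matrix into sets of vectors can be sketched with a singular value decomposition in NumPy. This is only an illustration: the toy sizes, the random data, and the `tissue` vector are hypothetical and do not come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes (not those of any real data set).
N, M, K = 6, 4, 3                     # genes, humans, tissues
X = rng.normal(size=(N, M))           # matrix x_ij: genes i x humans j

# SVD writes X as a sum over l of s_l * (gene vector l) x (human vector l).
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Rebuild X from the set of gene vectors U[:, l] and human vectors Vt[l].
X_rebuilt = sum(s[l] * np.outer(U[:, l], Vt[l]) for l in range(len(s)))

# A three-mode tensor x_ijk built from one gene, one human, and one
# (hypothetical) tissue vector; a tensor decomposition goes the other way.
tissue = rng.normal(size=K)
T = np.einsum("i,j,k->ijk", U[:, 0], Vt[0], tissue)
```

A single pair of vectors captures only one pattern, which is why a set of vectors (all l) is needed to recover the matrix.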

Slide 3

Slide 3 text

What is PCA based unsupervised FE?
Input: an N × M matrix X (numerical values) with N features and M samples; the samples belong to categorical multiple classes.
In contrast to the usual usage of PCA, not samples but features are embedded into a Q-dimensional space.
[Figure: PCA applied to X. Panels: PC loadings (PC1 over samples) and PC scores (PC1 vs. PC2), with some points marked "+". Label: "No distinction between classes".]

Slide 4

Slide 4 text

Synthetic example
Data: 10 samples + 10 samples (two classes); 90 features and 10 features.
Distributions shown: N(0, 1/2), N(m, 1/2), [N(m, 1/2) + N(0, 1/2)]/2, where N(μ, 1/2) is the normal distribution with mean μ and SD 1/2.
[Figure: PC scores of the features, PC1 vs. PC2, for m = 2. +: top 10 outliers.]
Thus, extracting outliers selects features that are distinct between the two classes in an unsupervised way.
Accuracy (100 trials): 89.5% (m = 2), 52.6% (m = 1)
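A minimal sketch of this kind of experiment follows. It rests on a simplifying assumption that is not taken from the slide: the 10 class-dependent features have mean m in the first 10 samples and mean 0 in the other 10, and all remaining values are drawn from N(0, 1/2). The function name `top_outlier_features` is hypothetical, and the sketch is not an attempt to reproduce the reported accuracies.

```python
import numpy as np

def top_outlier_features(X, n_select=10, n_pc=2):
    """Embed features (rows of X) with PCA and return the indices of the
    n_select rows farthest from the origin in the first n_pc PC scores."""
    Xc = X - X.mean(axis=0, keepdims=True)        # mean 0 per sample (column)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_pc] * s[:n_pc]               # PC scores attributed to features
    dist2 = np.sum(scores ** 2, axis=1)           # squared distance from the origin
    return np.argsort(dist2)[::-1][:n_select]

rng = np.random.default_rng(0)
m, sd = 2.0, 0.5
X = rng.normal(0.0, sd, size=(100, 20))           # 100 features x 20 samples
# ASSUMPTION (not from the slide): features 0..9 are shifted by m
# in the first class (samples 0..9) only.
X[:10, :10] += m

selected = top_outlier_features(X)
accuracy = np.isin(selected, np.arange(10)).mean()
print(f"fraction of selected features that are class-dependent: {accuracy:.2f}")
```

No class labels enter `top_outlier_features`; the labels are used only afterwards to score the selection.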

Slide 5

Slide 5 text

Data set: GSE76381, scRNA-seq of human and mouse midbrain development
Human: $x_{ij} \in \mathbb{R}^{19531 \times 1977}$
Mouse: $x_{ik} \in \mathbb{R}^{24378 \times 1907}$
i: genes; j, k: cells
Cell numbers and time points
Human: 6w: 287 cells, 7w: 131 cells, 8w: 331 cells, 9w: 322 cells, 10w: 509 cells, 11w: 397 cells; in total, 1977 cells (w: week)
Mouse: E11.5: 349 cells, E12.5: 350 cells, E13.5: 345 cells, E14.5: 308 cells, E15.5: 356 cells, E18.5: 142 cells, unknown: 57 cells; in total, 1907 cells

Slide 6

Slide 6 text

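Gene selection
Genes without expression are discarded. Expression is standardized to mean 0 and variance 1 per cell.
By PCA, PC scores $u_{li}$ are attributed to genes and PC loadings $v_{lj}$ are attributed to samples (this differs from the usual usage).
The $u_{li}$ are assumed to obey a multivariate Gaussian distribution, so a P-value is attributed to each gene i as
$$P_i = P_{\chi^2}\left[ > \sum_{l=1}^{L} \left( \frac{u_{li}}{\sigma_l} \right)^2 \right]$$
$P_i$ is corrected by the Benjamini-Hochberg criterion. Genes with corrected $P_i < 0.01$ are selected.
cf. Presentation No. O-17, "Examination of FDR cutoff levels for gene selection", 藤澤孝太, 宮田龍太
[Figure: Venn diagram of the selected genes for human (L = 2) and mouse (L = 3); counts shown: 63, 65, 53.]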
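The selection rule can be sketched as follows. Two things here are assumptions rather than statements from the slide: $\sigma_l$ is taken to be the standard deviation of $u_{li}$ over genes, and the function names `bh_adjust` and `select_genes` are hypothetical. The toy scores are random numbers with five planted outliers.

```python
import numpy as np
from scipy.stats import chi2

def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest P-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted

def select_genes(u, threshold=0.01):
    """u: array of shape (N genes, L components) holding the PC scores u_li."""
    sigma = u.std(axis=0)                      # ASSUMPTION: sigma_l = SD of u_li over genes
    stat = np.sum((u / sigma) ** 2, axis=1)    # sum over l of (u_li / sigma_l)^2
    p = chi2.sf(stat, df=u.shape[1])           # upper tail of chi-squared with L d.o.f.
    return np.flatnonzero(bh_adjust(p) < threshold)

rng = np.random.default_rng(0)
u = rng.normal(size=(1000, 2))                 # toy PC scores, L = 2
u[:5] = 10.0                                   # five planted outlier genes
selected = select_genes(u)
```

Because the outliers inflate `sigma`, the test is conservative for the remaining genes, which is in keeping with selecting only clear outliers.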

Slide 7

Slide 7 text

Validation: the selected genes were uploaded to Enrichr (an enrichment analysis server).
"MGI Mammalian Phenotype 2017"
Top five ranked terms: Cerebral cortex, Axon, Dentate gyrus, Hippocampus, and Olfactory bulb → all are brain related.
Other enrichment analyses are omitted here.

Slide 8

Slide 8 text

Tensor decomposition
The tensor is generated from the product of the human and mouse cell expression:
$$x_{ijk} = x_{ij} \times x_{ik} \in \mathbb{R}^{13384 \times 1977 \times 1907}$$
i: genes; j, k: cells
Size reduction is needed because the tensor is too large:
$$x_{jk} = \sum_i x_{ijk}$$
$x_{jk}$ is singular value decomposed.
$v_{lj}$: lth human cell singular value vectors
$v_{lk}$: lth mouse cell singular value vectors
$v_{lj}$ and $v_{lk}$ with any kind of time dependence are selected with categorical regression (ANOVA):
$$v_{lj} = a_l + \sum_t b_{lt} \delta_{jt}$$
$$v_{lk} = a'_l + \sum_t b'_{lt} \delta_{kt}$$
$\delta_{jt}$, $\delta_{kt}$: 1 when cell j (or k) is measured at time point t, otherwise 0.
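The size reduction and the test for time dependence can be sketched as below. The toy sizes, the random data, and the time-point labels `t_h` and `t_m` are hypothetical, and a one-way ANOVA from SciPy stands in for the categorical regression; the slide does not say which implementation or which significance threshold was used.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# Hypothetical toy sizes, far smaller than the real data.
n_genes, n_human, n_mouse = 200, 30, 24
X_h = rng.normal(size=(n_genes, n_human))      # x_ij: genes x human cells
X_m = rng.normal(size=(n_genes, n_mouse))      # x_ik: genes x mouse cells
t_h = np.repeat(np.arange(3), 10)              # hypothetical time point of each human cell
t_m = np.repeat(np.arange(3), 8)               # hypothetical time point of each mouse cell

# x_jk = sum_i x_ijk = sum_i x_ij * x_ik, without ever forming x_ijk.
X_jk = X_h.T @ X_m

# SVD of x_jk: columns of V_h are v_lj, columns of V_m are v_lk.
U, s, Vt = np.linalg.svd(X_jk, full_matrices=False)
V_h, V_m = U, Vt.T

def anova_p(v, labels):
    """P-value of a one-way ANOVA of v grouped by time-point label."""
    groups = [v[labels == t] for t in np.unique(labels)]
    return f_oneway(*groups).pvalue

p_h = np.array([anova_p(V_h[:, l], t_h) for l in range(len(s))])
p_m = np.array([anova_p(V_m[:, l], t_m) for l in range(len(s))])
```

Summing over genes first means only a cells × cells matrix is decomposed, which is what makes the problem tractable at the sizes quoted above.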

Slide 9

Slide 9 text

How many of the selected singular value vectors are common to human and mouse?
[Figure: Venn diagram of the selected singular value vectors, human vs. mouse; counts shown: 12, 23, 32.]
$u_{li}$ are generated from $v_{lj}$ and $v_{lk}$:
$$u_{li}^{(j)} = \sum_j v_{lj} x_{ij}$$
lth human gene singular value vectors
$$u_{li}^{(k)} = \sum_k v_{lk} x_{ik}$$
lth mouse gene singular value vectors
P-values are attributed to genes from the gene singular value vectors by the χ2 distribution and corrected by the BH criterion; genes with corrected P < 0.01 are selected.
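The projection from cell singular value vectors back to genes is a pair of matrix products. A minimal sketch on hypothetical random data (the sizes and variable names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy sizes.
n_genes, n_human, n_mouse = 50, 12, 9
X_h = rng.normal(size=(n_genes, n_human))      # x_ij
X_m = rng.normal(size=(n_genes, n_mouse))      # x_ik

# Cell singular value vectors from the SVD of x_jk = sum_i x_ij * x_ik.
U, s, Vt = np.linalg.svd(X_h.T @ X_m, full_matrices=False)
V_h, V_m = U, Vt.T                             # columns: v_lj and v_lk

# Gene singular value vectors: one column per l.
U_gene_h = X_h @ V_h                           # u_li^(j) = sum_j v_lj * x_ij
U_gene_m = X_m @ V_m                           # u_li^(k) = sum_k v_lk * x_ik
```

With this construction, the human and mouse gene vectors that share the same l have inner product equal to the lth singular value, and those with different l are orthogonal, so each l pairs one human pattern with one mouse pattern.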

Slide 10

Slide 10 text

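Selected genes
[Figure: Venn diagram of the selected genes, human vs. mouse; counts shown: 151, 200, 305.]
Validation: the selected genes were uploaded to Enrichr (an enrichment analysis server).
"Allen Brain Atlas"
Top five ranked terms: Hypothalamus ∈ midbrain
Other enrichment analyses are omitted here.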

Slide 11

Slide 11 text

Summary
We can select biologically reasonable genes with unsupervised methods using PCA and TD.
Because scRNA-seq data have few or no labels, these methods are well suited to it.
I have published a monograph with Springer. I would be happy if you could buy it, although it is extremely expensive.