Principal component analysis (PCA) and tensor decomposition (TD) based unsupervised feature extraction (FE) applied to bioinformatics analysis

Y-h. Taguchi, Department of Physics, Chuo University, Tokyo, Japan

Introduction
Difficulties of bioinformatics (genomic) analysis → "large p, small n" problem, i.e., the number of genes is ~10^4 while the number of samples is ~10.

Difficulties of conventional methods
・Regression: # of samples < # of variables = no results.
・Supervised learning: overfitting.
・Statistical tests: no positive hits, because P-values are adjusted for multiple comparisons (e.g., with 10,000 variables, P = 0.0001 can occur by chance).
→ A new methodology is required.
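The multiple-comparison arithmetic can be checked directly. This is a minimal sketch, assuming 10,000 independent tests that all follow the null hypothesis; the independence is an assumption made here for illustration.

```python
# Assumption: 10,000 independent tests, all under the null hypothesis.
n_tests = 10_000
alpha = 0.0001

# Expected number of variables reaching P <= alpha purely by chance.
expected_false_hits = n_tests * alpha

# Probability that at least one variable reaches P <= alpha by chance.
p_at_least_one = 1.0 - (1.0 - alpha) ** n_tests
```

Under these assumptions about one variable is expected to reach P = 0.0001 by chance, which is why such a raw P-value does not survive the adjustment.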

Usage of embedding method → dimension reduction
Single data set → principal component analysis (PCA)

[Figure: PCA applied to an N × M matrix X (N features × M samples, numerical values; samples belong to categorical multiclasses). PC loadings are attributed to samples and PC scores to features. Panel notes: "No distinction between classes"; "Embedding features instead of samples into lower dimensions".]

Synthetic example
Data: 10 samples + 10 samples; 90 features + 10 features. Entries are drawn from N(0), N(μ), or [N(μ)+N(0)]/2, where N(μ) denotes a normal distribution with mean μ and SD 1/2.
[Figure: PC1 vs. PC2 scatter plot; +: top 10 outliers.]
Thus, extracting outliers selects features distinct between the two classes in an unsupervised way.
Accuracy (100 trials): 89.5% and 52.6%.

Multiple data sets?
Integrated analysis of multiple data → tensor decomposition (TD)
matrix → tensor

Tensor decomposition
$x_{ij} \times x_{il} \rightarrow x_{ijl}$
$x_{ijl} = x_{ij} \times x_{il} \simeq \sum_{k_1,k_2,k_3} G(k_1,k_2,k_3)\, x_{ik_1} x_{jk_2} x_{lk_3}$
i: sample, j: gene expression, l: methylation
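The product tensor and its decomposition into a core tensor G and three sets of vectors can be sketched with NumPy. The slide does not name the algorithm; this sketch assumes a higher-order SVD (singular vectors of each mode unfolding), and the matrix sizes and the helper name `mode_vectors` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, L = 6, 8, 7                       # small illustrative sizes (assumed)
A = rng.normal(size=(I, J))             # x_ij: samples x gene expression
B = rng.normal(size=(I, L))             # x_il: samples x methylation

# x_ijl = x_ij * x_il
X = np.einsum("ij,il->ijl", A, B)

def mode_vectors(T, mode):
    """Left singular vectors of the unfolding of T along axis `mode`."""
    unfolded = np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)
    U, _, _ = np.linalg.svd(unfolded, full_matrices=False)
    return U

# Columns correspond to x_ik1, x_jk2, x_lk3.
U1, U2, U3 = (mode_vectors(X, m) for m in range(3))

# Core tensor G(k1, k2, k3): project X onto the three sets of singular vectors.
G = np.einsum("ijl,ia,jb,lc->abc", X, U1, U2, U3)

# With all components kept, the decomposition reproduces X.
X_rec = np.einsum("abc,ia,jb,lc->ijl", G, U1, U2, U3)
```

Keeping only the leading components in each mode turns the equality into the approximation written above.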

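Demonstration using synthetic data set
[Figure: two matrices, each with 50 samples and 1000 features; labels: "+20% noise" (block of 50), "100% noise", "No correlations". Their product gives a 50 × 1000 × 1000 tensor, to which tensor decomposition is applied.]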

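[Figure: vectors obtained by the decomposition. Samples: $x_{ik_1}$ for $k_1 = 1, 2, 3$ ($1 \le i \le 50$); gene expression: $x_{jk_2}$ for $k_2 = 1, 2$ ($1 \le j \le 1000$); methylation: $x_{lk_3}$ for $k_3 = 1, 2$ ($1 \le l \le 1000$).]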

Advantages as multiview data analysis tools
・No weights are required to integrate multiple views.
・Completely unsupervised learning (no model building using prior knowledge).
・Smaller computational resources, because of linearity.
Disadvantages
・Tends to require more memory. Solution: summing up $\sum_i x_{ij} \times x_{il}$ results in a j × l matrix that can be converted back (explanation omitted).
・If no features or samples are shared, the result is a four-mode tensor.
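The memory-saving sum over samples can be sketched as below. The matrix product `A.T @ B` gives the j × l matrix without forming the three-mode tensor. The slide omits how the result is "converted back", so the projection step at the end is an assumption made here, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, L = 5, 40, 30                     # illustrative sizes (assumed)
A = rng.normal(size=(I, J))             # x_ij
B = rng.normal(size=(I, L))             # x_il

# Sum over samples without ever forming the I x J x L tensor.
S = A.T @ B                             # S[j, l] = sum_i x_ij * x_il

# Check against the explicit tensor (feasible only because the sizes are tiny).
S_explicit = np.einsum("ij,il->ijl", A, B).sum(axis=0)

# SVD of the J x L matrix gives feature-side vectors for both views.
U, s, Vt = np.linalg.svd(S, full_matrices=False)

# Assumed "conversion back": project each view onto its singular vectors
# to obtain sample-side vectors.
samples_from_A = A @ U                  # I x K, with K = min(J, L)
samples_from_B = B @ Vt.T               # I x K
```

The explicit tensor needs I × J × L numbers, while the summed matrix needs only J × L.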

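Feature extraction
Real data are not separated so clearly. → Assume a Gaussian distribution and detect outliers:
$P_i = P_{\chi^2}\left[ > \sum_k \left( \frac{x_{ik}}{\sigma} \right)^2 \right]$
P-values are computed by the χ2 distribution; features with Benjamini-Hochberg corrected P < 0.01 are selected.
[Figure: P(p) vs. 1-p.]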

PCA as well as TD based unsupervised FE can extract
features with orders very well.

Application of PCA to detect periodic motion

How to identify outlier genes (P-value computation)
Assume that PC scores obey a Gaussian distribution (the null hypothesis also used in probabilistic PCA)
→ P-values are attributed to genes using the χ2 distribution
→ P-values are adjusted by Benjamini-Hochberg
→ Outlier genes: adjusted P-values < 0.01 or 0.05
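The steps above can be sketched for a single PC score vector. This sketch makes several assumptions: only one component is used (so the χ2 distribution has one degree of freedom and its tail probability can be written with `math.erfc`), the standard deviation is estimated from the scores themselves, and the function names are hypothetical. Summing over several components would need a χ2 distribution with more degrees of freedom.

```python
import math
import numpy as np

def chi2_sf_1dof(x):
    """P[chi-squared with 1 degree of freedom > x]."""
    return math.erfc(math.sqrt(x / 2.0))

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest P-value downwards, cap at 1.
    adjusted_sorted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    adjusted = np.empty(n)
    adjusted[order] = adjusted_sorted
    return adjusted

def select_outliers(score, threshold=0.01):
    """Indices whose BH-adjusted P-value is below `threshold`."""
    score = np.asarray(score, dtype=float)
    z2 = (score / score.std()) ** 2     # assumed: sigma estimated from the scores
    pvals = np.array([chi2_sf_1dof(v) for v in z2])
    return np.flatnonzero(benjamini_hochberg(pvals) < threshold)

# Usage: 1000 null scores with five strong outliers placed at indices 0 to 4.
rng = np.random.default_rng(0)
scores = rng.normal(size=1000)
scores[:5] = 10.0
selected = select_outliers(scores)
```

Because the threshold is applied to adjusted P-values, the number of selected genes is decided by the data rather than fixed in advance.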

Real data: Identification of cell cycle regulated genes of budding yeast
Synchronization is required.
Strategy 1: food restriction (metabolic cycle)
[Figure: scatter plots of PC1 to PC4 loadings (time). Numbers are winding numbers around the center.]
Consider PC2/PC3.
Are the genes selected with PC2/PC3 biologically feasible?
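The winding number of a pair of PC loadings around the center can be computed from the accumulated angle of the trajectory. This is a minimal sketch; the function name, the choice of the origin as the center, and the test trajectory are assumptions made here, not taken from the slides.

```python
import numpy as np

def winding_number(x, y, center=(0.0, 0.0)):
    """Number of counterclockwise turns of the path (x, y) around `center`."""
    angle = np.arctan2(np.asarray(y) - center[1], np.asarray(x) - center[0])
    step = np.diff(angle)
    # Wrap each angular step into [-pi, pi) so that crossing the branch cut
    # of arctan2 does not count as a jump.
    step = (step + np.pi) % (2.0 * np.pi) - np.pi
    return step.sum() / (2.0 * np.pi)

# Usage: a trajectory that circles the origin twice over the time course.
t = np.linspace(0.0, 4.0 * np.pi, 200)
w = winding_number(np.cos(t), np.sin(t))
```

Applied to two loading vectors ordered by time, a value close to a nonzero integer indicates that the pair traces a closed cycle, with no functional form or period assumed.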

[Figure: PC scores (genes) and PC loadings (time). Black, red, and green points are the selected genes (P < 0.01).]
Ribosome, mitochondria, cell division → match with the original paper.
The profiles differ from a sinusoidal wave!

REACTOME (selected by PC1 to PC4)
PCA based unsupervised FE is better than all fittings using sinusoidal, rectangular, and triangular waves in selecting biologically feasible genes. Since PC2 differs from PC3, no periodic fitting can work.
[Figure: fitting vs. biological feasibility.]

Take home messages:
・Gene expression profiles are periodic, but not sinusoidal. Thus, sinusoidal regression might cause artifacts (but there is no way to know the true functional form a priori!).
・A limit cycle can be identified without assuming functional forms or a period.
・Three biologically important gene clusters can be identified in an unsupervised way.

Synchronization via temperature sensitive mutant
Cyclebase: integration of eight experiments
[Figure: typical PC loadings.]
Limit cycle: PC2/PC3 → outlier genes selected using PC2/PC3

Another example, with disturbed periodic motions
Limit cycle: PC2/PC4 → outlier genes selected using PC2/PC4

Results: for seven out of eight experiments, 100 to 200 genes are selected (P < 0.05). 37 genes are selected in common in six out of the seven experiments → high consistency

PCA based unsupervised FE is highly robust and stable!

Applications of TD to multiomics data (breast cancer)
[Figure: mRNA and miRNA expression matrices sharing samples (sample1 to sample5), with groups A and B; labels: "active", "expression", "interaction".]
$x_{ij} \times x_{il}$; i: 161 samples, j: 13393 mRNAs, l: 755 miRNAs (8 groups)

Selection of $x_{ik_1}$ distinct between symptoms
[Figure: $x_{ik_1}$ for $k_1 = 1, \dots, 5$, with P-values.]
$1 \le k_1 \le 5$ are symptom dependent.

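[Figure: core tensor $G(k_1,k_2,k_3)$ for $1 \le k_1, k_2, k_3 \le 5$ ($k_1$: sample, $k_2$: mRNA, $k_3$: miRNA), with larger and smaller G marked; ranges shown: $1 \le k_2 \le 5$, $1 \le k_3 \le 2$.]
For $x_{jk_2}$ and $x_{lk_3}$: assume a Gaussian distribution and detect outliers, with P-values computed by the χ2 distribution and Benjamini-Hochberg corrected P < 0.01.
Selected: 7 miRNAs out of 755 and 427 mRNAs out of 13393.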
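The "larger G / smaller G" comparison suggests ranking the (k2, k3) pairs by the size of the core tensor entries for a sample-side component k1 of interest. The sketch below assumes ranking by absolute value; that criterion, the function name, and the toy core tensor are assumptions made here for illustration, not taken from the slides.

```python
import numpy as np

def top_core_pairs(G, k1, n_top=3):
    """(k2, k3) index pairs with the largest |G[k1, k2, k3]|, largest first."""
    slab = np.abs(G[k1])
    flat_order = np.argsort(-slab, axis=None)[:n_top]
    return [tuple(int(v) for v in np.unravel_index(f, slab.shape))
            for f in flat_order]

# Usage with a toy core tensor (indices are 0-based here).
G = np.zeros((5, 5, 2))
G[0, 1, 0] = -9.0
G[0, 3, 1] = 4.0
pairs = top_core_pairs(G, k1=0, n_top=2)
```

The selected k2 and k3 then indicate which mRNA-side and miRNA-side vectors to pass to the outlier detection step.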

mRNA → evaluation by MSigDB → highly overlapped with breast cancer genes
・SMID BREAST CANCER LUMINAL B DN
・SMID BREAST CANCER BASAL DN
・DOANE BREAST CANCER ESR1 UP
・SMID BREAST CANCER RELAPSE IN BONE DN
・SMID BREAST CANCER NORMAL LIKE UP
・FARMER BREAST CANCER BASAL VS LULMINAL
・BREAST CANCER UP
・SMID BREAST CANCER BASAL UP
・SMID BREAST CANCER LUMINAL B UP
・TURASHVILI BREAST DUCTAL CARCINOMA VS DUCTAL NORMAL DN

miRNA → evaluation by DIANA-miRPath

Integrated analysis of mRNA and miRNA using TD based unsupervised
FE can identify biologically highly significant genes

Summary for PCA
・Embedding variables instead of samples into lower dimensions enables us to select variables of interest in an unsupervised manner.
・For the synthetic data set, variables distinct between two classes can be selected without labeling information.
・For the real data set (cell division cycle), PCA based unsupervised FE can identify genes with periodic motion without knowledge of either the functional form or the period.

Summary for TD
・For feature selection in multi-view data, I propose applying tensor decomposition to a tensor generated as a product of matrices and then selecting the features with BH corrected P-values < 0.01, computed from a χ2 distribution assumed for a mode.
・For the synthetic data set, seemingly uncorrelated variables embedded in noise are decomposed into the original orthogonal vectors after the correlated variables are identified.
・For the multiomics data set, a small fraction (a few %) of mutually correlated and biologically reasonable miRNAs and mRNAs are identified among a huge number of mRNAs and miRNAs.
