Slide 1

Slide 1 text

Principal component analysis (PCA) and tensor decomposition (TD) based unsupervised feature extraction (FE) applied to bioinformatics analysis
Y-h. Taguchi
Department of Physics, Chuo University, Tokyo, Japan
Presentation ORCID (papers)

Slide 2

Slide 2 text

Introduction
Difficulties of bioinformatics (genomic) analysis → “large p, small n” problem, i.e., number of genes ~ $10^4$ while number of samples ~ 10

Slide 3

Slide 3 text

Difficulties of conventional methods
・Regression: # of samples < # of variables → no results
・Supervised learning: overfitting
・Statistical tests: no positive hits because P-values are adjusted for multiple comparisons (i.e., if the number of variables is 10,000, P = 0.0001 can happen by accident)
→ A new methodology is required.
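The multiple-comparison point can be checked with a little arithmetic. The sketch below assumes independent tests under the null hypothesis, which the slide does not state; the variable names and the Bonferroni-style threshold are illustrative choices, not the adjustment used in the talk.

```python
n_tests = 10_000      # number of variables (genes), as on the slide
p_threshold = 1e-4    # raw P-value that looks "significant" in isolation

# Expected number of variables reaching P <= p_threshold purely by chance
expected_false_hits = n_tests * p_threshold

# Probability that at least one variable reaches it by chance
# (assumption: the tests are independent)
prob_at_least_one = 1.0 - (1.0 - p_threshold) ** n_tests

# Bonferroni-style per-test threshold keeping the family-wise error at 0.05
bonferroni_threshold = 0.05 / n_tests

print(expected_false_hits, round(prob_at_least_one, 3), bonferroni_threshold)
```

With about one chance hit expected at P = 0.0001, a raw P-value of that size carries little evidence on its own, which is why the adjusted thresholds become so strict.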

Slide 4

Slide 4 text

Usage of embedding method → dimension reduction
Single data → principal component analysis (PCA)

Slide 5

Slide 5 text

[Figure: N × M matrix X (numerical values) with N features and M samples carrying categorical multiclass labels. PCA yields PC scores attached to features (scatter plot of PC1 vs. PC2, each + is a feature) and PC loadings attached to samples.]
No distinction between classes is used.
Embedding features instead of samples into lower dimension.
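A minimal sketch of this embedding, assuming NumPy and an SVD-based PCA. The function name `embed_features`, the per-sample centring, and the random data are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def embed_features(X):
    """PCA on an N x M matrix X (N features, M samples).

    Returns (scores, loadings): scores is N x R (one row per feature),
    loadings is M x R (one row per sample), with R = min(N, M).
    """
    # Assumption: each sample (column) is centred before the decomposition.
    Xc = X - X.mean(axis=0, keepdims=True)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s        # features embedded in PC space
    loadings = Vt.T       # contribution of each sample to each PC
    return scores, loadings

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))      # illustrative data: 100 features, 20 samples
scores, loadings = embed_features(X)
print(scores.shape, loadings.shape)
```

The class labels never enter the computation; they are only used afterwards to interpret the loadings.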

Slide 6

Slide 6 text

Synthetic example
[Figure: data matrix of 10 samples + 10 samples by 90 features + 10 features, with entries drawn from N(0), N(μ), and [N(μ) + N(0)]/2, where N is the normal distribution (μ: mean, ½: SD); scatter plot of PC1 vs. PC2 with + marking the top 10 outliers]
Thus, extracting outliers selects features distinct between two classes in an unsupervised way.
Accuracy (100 trials): 89.5% (…) and 52.6% (…)
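The synthetic experiment can be imitated along the following lines. This is a sketch under stated assumptions: the value of μ, the exact layout of the matrix, and the use of PC1 are guesses for illustration, so the result is not expected to reproduce the accuracies quoted on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_per_class, n_distinct = 100, 10, 10
mu, sd = 2.0, 0.5    # mu is an assumed value; SD = 1/2 follows the slide legend

# All features start as N(0, sd) noise in both classes (10 + 10 samples).
X = rng.normal(0.0, sd, size=(n_features, 2 * n_per_class))
# Assumption: the first 10 features are shifted to mean mu in the first class only.
X[:n_distinct, :n_per_class] += mu

# Embed features: centre each sample, then take the PC1 scores of the features.
Xc = X - X.mean(axis=0, keepdims=True)
U, s, _ = np.linalg.svd(Xc, full_matrices=False)
pc1_scores = U[:, 0] * s[0]

# Top 10 outliers along PC1, measured from the median score.
distance = np.abs(pc1_scores - np.median(pc1_scores))
selected = np.argsort(distance)[-n_distinct:]

# Fraction of selected features that are truly distinct between the classes.
accuracy = np.isin(selected, np.arange(n_distinct)).mean()
print(sorted(selected.tolist()), accuracy)
```

No class label is passed to the selection step; the labels are only used to score the result.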

Slide 7

Slide 7 text

Multiple data sets? Integrated analysis of multiple data → tensor decomposition (TD)
matrix → tensor

Slide 8

Slide 8 text

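[Figure: the matrices $x_{ij}$ (gene expression) and $x_{il}$ (methylation) are multiplied into the tensor $x_{ijl}$, which tensor decomposition splits into the core tensor $G$ and the vectors $x_{ik_1}$, $x_{jk_2}$, $x_{lk_3}$]
$x_{ijl} = x_{ij} \times x_{il} \simeq \sum_{k_1,k_2,k_3} G(k_1,k_2,k_3)\, x_{ik_1} x_{jk_2} x_{lk_3}$
i: sample, j: gene expression, l: methylation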
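The product tensor and its decomposition can be sketched with NumPy. The slides do not name the decomposition algorithm; higher-order SVD (HOSVD) is used here as one common Tucker-type choice, and the helper name `hosvd` and the matrix sizes are illustrative assumptions.

```python
import numpy as np

def hosvd(T):
    """Full HOSVD of a 3-mode tensor.

    Returns the core tensor G and one orthogonal factor matrix per mode, so that
    T[i,j,l] = sum_{k1,k2,k3} G[k1,k2,k3] U1[i,k1] U2[j,k2] U3[l,k3].
    """
    factors = []
    for mode in range(T.ndim):
        # Unfold the tensor along `mode` and keep the left singular vectors.
        unfolded = np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)
        U, _, _ = np.linalg.svd(unfolded, full_matrices=False)
        factors.append(U)
    # Core tensor: project every mode onto its singular vectors.
    G = np.einsum("ijl,ia,jb,lc->abc", T, *factors)
    return G, factors

rng = np.random.default_rng(0)
x_ij = rng.normal(size=(5, 8))   # samples x gene expression (illustrative sizes)
x_il = rng.normal(size=(5, 6))   # samples x methylation

# x_ijl = x_ij * x_il, sharing the sample index i
x_ijl = np.einsum("ij,il->ijl", x_ij, x_il)

G, (x_ik1, x_jk2, x_lk3) = hosvd(x_ijl)
reconstructed = np.einsum("abc,ia,jb,lc->ijl", G, x_ik1, x_jk2, x_lk3)
print(G.shape, np.allclose(reconstructed, x_ijl))
```

Keeping all components makes the reconstruction exact; in practice only the leading components of each mode are inspected.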

Slide 9

Slide 9 text

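Demonstration using synthetic data set
[Figure: two matrices of 50 samples × 1000 features; in each, 50 features carry a signal with +20% noise and the remaining features are 100% noise (no correlations). The two matrices are combined into a 50 × 1000 × 1000 tensor → tensor decomposition]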

Slide 10

Slide 10 text

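[Figure: vectors obtained by the decomposition: $x_{ik_1}$ for samples ($1 \le i \le 50$; $k_1 = 1, 2, 3$), $x_{jk_2}$ for gene expression ($1 \le j \le 1000$; $k_2 = 1, 2$), and $x_{lk_3}$ for methylation ($1 \le l \le 1000$; $k_3 = 1, 2$)]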

Slide 11

Slide 11 text

Advantages as multi-view data analysis tools
・No weights required to integrate multiple views
・Completely unsupervised learning (no model building using prior knowledge)
・Smaller computational resources because of linearity
Disadvantages
・Tendency to require more memory. Solution: summing up over i, $\sum_i x_{ij} \times x_{il}$, results in a j × l matrix that can be converted back (explanation omitted).
・No shared features or samples results in a four-mode tensor.
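The memory-saving step can be sketched as follows. The slide omits how the summed matrix is "converted back", so the projection used below is only one plausible reading and is an assumption; the sizes and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes, n_sites = 50, 200, 300     # illustrative sizes
x_ij = rng.normal(size=(n_samples, n_genes))
x_il = rng.normal(size=(n_samples, n_sites))

# Summing x_ij * x_il over i gives a j x l matrix: no 3-mode tensor is stored.
x_jl = x_ij.T @ x_il

# SVD of the summed matrix gives vectors attached to genes (j) and sites (l).
U, s, Vt = np.linalg.svd(x_jl, full_matrices=False)
gene_vectors, site_vectors = U, Vt.T

# Assumed way of "converting back": project each original matrix onto the
# corresponding vectors to obtain sample vectors for each view.
sample_vectors_from_genes = x_ij @ gene_vectors
sample_vectors_from_sites = x_il @ site_vectors

# Memory ratio between the full tensor and the summed matrix
tensor_entries = n_samples * n_genes * n_sites
matrix_entries = n_genes * n_sites
print(x_jl.shape, tensor_entries // matrix_entries)
```

The saving factor equals the number of samples, since the sample mode is the one summed away.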

Slide 12

Slide 12 text

Feature extraction
No real data are separated this well → assume Gaussian, detect outliers
$P_i = P\left[ > \sum_k \left( \frac{x_{ik}}{\sigma} \right)^2 \right]$ (P-values by χ² distribution)
Benjamini-Hochberg corrected P < 0.01
[Figure: histogram P(p) of 1 − p]
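The outlier test can be sketched with NumPy and SciPy. The function names, estimating σ per component from the scores themselves, and the toy data are assumptions made for illustration; only the χ² P-value and the Benjamini-Hochberg threshold come from the slide.

```python
import numpy as np
from scipy.stats import chi2

def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted P-values."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest P-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted

def select_outliers(scores, threshold=0.01):
    """Flag rows of `scores` (features x components) that are chi-squared outliers."""
    scores = np.asarray(scores, dtype=float)
    if scores.ndim == 1:
        scores = scores[:, None]
    sigma = scores.std(axis=0)      # assumption: sigma estimated per component
    stat = ((scores / sigma) ** 2).sum(axis=1)
    # Upper-tail probability under a chi-squared null, one degree of freedom per component.
    p = chi2.sf(stat, df=scores.shape[1])
    adjusted = benjamini_hochberg(p)
    return adjusted < threshold, adjusted

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)      # toy scores for 1000 features
scores[:5] = 10.0                   # five planted outliers
selected, adjusted = select_outliers(scores)
print(np.flatnonzero(selected))
```

Because σ is estimated from all scores, strong outliers inflate it slightly, which makes the test conservative.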

Slide 13

Slide 13 text

PCA as well as TD based unsupervised FE can extract features with orders very well.

Slide 14

Slide 14 text

Application of PCA to detect periodic motion

Slide 15

Slide 15 text

How to identify outlier genes (P-value computation)
Assuming that PC scores obey a Gaussian distribution (null hypothesis, also used for probabilistic PCA)
→ P-values attributed to genes using the χ² distribution
→ P-values adjusted by Benjamini–Hochberg
→ Outlier genes: adjusted P-values < 0.01 or 0.05

Slide 16

Slide 16 text

Real data: identification of cell cycle regulated genes of budding yeast
Synchronization is required. Strategy 1: food restriction (metabolic cycle)
[Figure: scatter plots of PC1 to PC4 loadings (time); numbers are winding numbers around the center]
→ Consider PC2/PC3. Are genes selected with PC2/PC3 biologically feasible?
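A winding number around the center of a pair of PC loadings can be computed as sketched below. The function name, the use of the centroid as the center, and the toy trajectory are assumptions for illustration; the slides do not say how the numbers in the figure were obtained.

```python
import numpy as np

def winding_number(x, y):
    """Net number of turns of the trajectory (x[t], y[t]) around its centroid."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    angles = np.arctan2(y - y.mean(), x - x.mean())
    # np.unwrap removes the 2*pi jumps so the angle accumulates over time.
    unwrapped = np.unwrap(angles)
    return (unwrapped[-1] - unwrapped[0]) / (2.0 * np.pi)

# Toy loadings: periodic but not sinusoidal, three cycles over the time course.
t = np.linspace(0.0, 6.0 * np.pi, 200)
pc_a = np.cos(t)
pc_b = 0.5 * np.sin(t) + 0.2 * np.sin(2.0 * t)
turns = winding_number(pc_a, pc_b)
print(round(turns, 3))
```

The count depends only on the trajectory circling the center, so no functional form or period has to be assumed.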

Slide 17

Slide 17 text

[Figure: PC scores (genes) and PC loadings (time). Black, red, and green points are the selected genes (P < 0.01), forming ribosome, mitochondria, and cell division clusters]
→ Match with the original paper
Differ from sinusoidal wave!

Slide 18

Slide 18 text

REACTOME (selected by PC1 to PC4)
[Figure: biological feasibility of the genes selected by each fitting method]
PCA-based unsupervised FE is better than all fittings using sinusoidal, rectangular, and triangular waves in selecting biologically feasible genes. Since PC2 differs from PC3, no periodic fitting can work.

Slide 19

Slide 19 text

Take home messages:
・Gene expression profiles are periodic, but not sinusoidal. Thus, sinusoidal regression might cause artifacts (but there is no way to know the true functional form a priori!).
・The limit cycle can be identified without assuming functional forms or a period.
・Three biologically important gene clusters can be identified in an unsupervised way.

Slide 20

Slide 20 text

Synchronization via temperature sensitive mutant
Cyclebase: integration of eight experiments
[Figure: typical PC loadings]
Limit cycle: PC2/PC3 → outlier genes selected using PC2/PC3

Slide 21

Slide 21 text

Another example with disturbed periodic motions
Limit cycle: PC2/PC4 → outlier genes selected using PC2/PC4

Slide 22

Slide 22 text

Results: for seven out of eight experiments, 100 to 200 genes are selected (P < 0.05). 37 genes are commonly selected in six out of the seven experiments → high consistency

Slide 23

Slide 23 text

PCA based unsupervised FE is highly robust and stable!

Slide 24

Slide 24 text

Application of TD to multi-omics data (breast cancer)
[Figure: mRNA and miRNA expression for samples 1 to 5 in groups A and B; the interaction between active mRNA and active miRNA expression is represented by the product $x_{ij} \times x_{il}$]
i: 161 samples (8 groups), j: 13393 mRNAs, l: 755 miRNAs

Slide 25

Slide 25 text

Selection of $x_{ik_1}$ distinct between symptoms
[Figure: $x_{ik_1}$ for $k_1 = 1, 2, 3, 4, 5$ across sample groups, with P-values]
$1 \le k_1 \le 5$ are symptom dependent.

Slide 26

Slide 26 text

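[Figure: core tensor $G(k_1, k_2, k_3)$ for $1 \le k_1, k_2, k_3 \le 5$ ($k_1$: sample, $k_2$: mRNA, $k_3$: miRNA), marking larger and smaller $G$]
Selected components: $x_{jk_2}$ with $1 \le k_2 \le 5$ and $x_{lk_3}$ with $1 \le k_3 \le 2$
Assume Gaussian → P-values by χ² distribution → detect outliers with Benjamini-Hochberg corrected P < 0.01
Selected: 7 miRNAs out of 755 and 427 mRNAs out of 13393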
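One way to read "larger G" is to rank the (k2, k3) pairs by the magnitude of the core tensor for the symptom-dependent k1. The slide does not spell out the rule, so the summed absolute value used below, the function name, and the toy core tensor are assumptions for illustration. Indices in the code are zero-based.

```python
import numpy as np

def rank_pairs_by_core(G, k1_selected):
    """Rank (k2, k3) pairs by the summed |G[k1, k2, k3]| over the selected k1."""
    weight = np.abs(G[list(k1_selected)]).sum(axis=0)      # shape (K2, K3)
    order = np.argsort(weight, axis=None)[::-1]            # largest first
    k2, k3 = np.unravel_index(order, weight.shape)
    return list(zip(k2.tolist(), k3.tolist())), weight

# Toy core tensor with two dominant entries for the pair (k2, k3) = (1, 0).
G = np.zeros((5, 5, 5))
G[0, 1, 0] = 3.0
G[2, 1, 0] = -2.0
G[1, 3, 4] = 0.5
pairs, weight = rank_pairs_by_core(G, k1_selected=[0, 1, 2])
print(pairs[:2])
```

The vectors $x_{jk_2}$ and $x_{lk_3}$ belonging to the top-ranked pairs would then be passed to the outlier test.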

Slide 27

Slide 27 text

mRNA → evaluation by MSigDB → highly overlapped with breast cancer genes
・SMID BREAST CANCER LUMINAL B DN
・SMID BREAST CANCER BASAL DN
・DOANE BREAST CANCER ESR1 UP
・SMID BREAST CANCER RELAPSE IN BONE DN
・SMID BREAST CANCER NORMAL LIKE UP
・FARMER BREAST CANCER BASAL VS LULMINAL
・BREAST CANCER UP
・SMID BREAST CANCER BASAL UP
・SMID BREAST CANCER LUMINAL B UP
・TURASHVILI BREAST DUCTAL CARCINOMA VS DUCTAL NORMAL DN

Slide 28

Slide 28 text

miRNA → evaluation by DIANA-miRPath

Slide 29

Slide 29 text

Integrated analysis of mRNA and miRNA using TD based unsupervised FE can identify biologically highly significant genes

Slide 30

Slide 30 text

Summary for PCA
・Embedding variables instead of samples into lower dimension enables us to select variables of interest in an unsupervised manner.
・For the synthetic data set, variables distinct between two classes can be selected without labeling information.
・For the real data set (cell division cycle), PCA-based unsupervised FE can identify genes with periodic motion without knowledge of either functional form or period.

Slide 31

Slide 31 text

Summary for TD
・As a feature selection method for multi-view data, after applying tensor decomposition to a tensor generated by the product of matrices, I propose to select features associated with BH-corrected P-values < 0.01, computed by the χ² distribution assumed for a mode.
・As for the synthetic data set, apparently uncorrelated variables embedded in noise are decomposed into the original orthogonal vectors after identifying correlated variables.
・As for the multi-omics data set, a few (a few %) inter-correlated and biologically reasonable miRNAs and mRNAs are identified among a huge number of mRNAs and miRNAs.

Slide 32

Slide 32 text

Presentation ORCID (papers)