Y-h. Taguchi
July 05, 2018

# Principal component analysis (PCA) and tensor decomposition (TD) based unsupervised feature extraction (FE) applied to bioinformatics analysis

Presentation at International Conference on Biological Information and Biomedical Engineering, Shanghai, 2017/7/6-8.
http://www.icbibe.org/KS.aspx?#3


## Transcript

1. ### Principal component analysis (PCA) and tensor decomposition (TD) based unsupervised feature extraction (FE) applied to bioinformatics analysis

Y-h. Taguchi, Department of Physics, Chuo University, Tokyo, Japan
2. ### Introduction: difficulties of bioinformatics (genomic) analysis

→ “large p small n” problem, i.e., number of genes ~ $10^4$ while number of samples ~ 10
3. ### Difficulties of conventional methods

・Regression: # of samples < # of variables = no results
・Supervised learning: overfitting
・Statistical tests: no positive hits because of adjusted P-values considering multiple comparisons (i.e., if the number of variables is 10,000, P = 0.0001 can happen by accident)

→ A new methodology is required.
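
As a rough worked version of the multiple-comparison remark above, assume (for illustration only) that all $m = 10^4$ tests are independent and follow the null hypothesis. The expected number of variables reaching $P \le 10^{-4}$ purely by chance is then

$$E[\#\{P \le 10^{-4}\}] = m \times 10^{-4} = 10^4 \times 10^{-4} = 1,$$

so a single hit at that level is what chance alone would produce. A Bonferroni-style threshold at the 0.05 level would be $0.05 / 10^4 = 5 \times 10^{-6}$.
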
4. ### Usage of embedding method → dimension reduction

Single data → principal component analysis (PCA)
5. ### PCA on an N × M matrix X

[Figure: N × M matrix X (numerical values) with N features and M samples in categorical multiclasses. PCA gives PC loadings (samples) and PC scores (features), each plotted as PC1 vs. PC2.]

No distinction between classes. Embedding features instead of samples into lower dim.
6. ### Synthetic example

[Figure: 10 samples + 10 samples; 90 features and 10 features drawn from N(0), N(μ) and [N(μ)+N(0)]/2, where N is the normal distribution with mean μ and SD ½. PC1 vs. PC2 scatter plot; +: top 10 outliers.]

Accuracy (100 trials): 89.5% (…), 52.6% (…)

Thus, extracting outliers selects features distinct between two classes in an unsupervised way.
7. ### Multiple data sets?

Integrated analysis of multiple data → tensor decomposition (TD)

matrix → tensor
8. ### Tensor decomposition

$$x_{ijl} = x_{ij} \times x_{il} \simeq \sum_{k_1,k_2,k_3} G(k_1,k_2,k_3)\, x_{ik_1} x_{jk_2} x_{lk_3}$$

i: sample, j: gene expression, l: methylation
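
A minimal sketch of this construction, assuming a higher-order SVD (HOSVD) is used as the decomposition (the slide does not name the algorithm). The helper names `unfold` and `hosvd`, the matrix sizes, and the random data are hypothetical.

```python
import numpy as np

def unfold(T, mode):
    """Matricize a tensor: rows index the chosen mode, columns everything else."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd(T):
    """HOSVD of a 3-mode tensor: one orthogonal factor per mode plus a core G."""
    factors = [np.linalg.svd(unfold(T, m), full_matrices=False)[0] for m in range(3)]
    G = np.einsum("ijl,ia,jb,lc->abc", T, *factors)
    return G, factors

rng = np.random.default_rng(0)
X_expr = rng.normal(size=(5, 6))    # x_ij: samples x gene expression (toy sizes)
X_meth = rng.normal(size=(5, 7))    # x_il: samples x methylation (toy sizes)

# x_ijl = x_ij * x_il, with the sample index i shared by both matrices.
T = np.einsum("ij,il->ijl", X_expr, X_meth)

G, (U1, U2, U3) = hosvd(T)          # U1: x_{i k1}, U2: x_{j k2}, U3: x_{l k3}
T_rec = np.einsum("abc,ia,jb,lc->ijl", G, U1, U2, U3)
print(np.allclose(T, T_rec))
```

With all components kept the reconstruction is exact; the approximation sign on the slide corresponds to truncating the sum to a few components.
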
9. ### Demonstration using synthetic data set

[Figure: two matrices with 50 samples and 1000 features each; a block of 50 carries +20% noise, the rest is 100% noise with no correlations. The two are combined into a 50 × 1000 × 1000 tensor → tensor decomposition.]
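10. ### $x_{ik_1}$, $x_{jk_2}$ and $x_{lk_3}$ for the synthetic data set

[Figure panels: $x_{ik_1}$ for $k_1 = 1, 2, 3$ ($1 \le i \le 50$, samples); $x_{jk_2}$ for $k_2 = 1, 2$ ($1 \le j \le 1000$, gene expression); $x_{lk_3}$ for $k_3 = 1, 2$ ($1 \le l \le 1000$, methylation).]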
11. ### Advantages as multi-view data analysis tools

・No weights required to integrate multiple views
・Completely unsupervised learning (no model building using prior knowledge)
・Smaller computational resources because of linearity

Disadvantages:

・Tendency to require more memory. Solution: summing over i, $\sum_i x_{ij} \times x_{il}$, results in a j × l matrix that can be converted back (explanation omitted).
・No shared features or samples results in a four-mode tensor.
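
The memory-saving step can be illustrated as below. It only checks that summing the product tensor over the sample index equals an ordinary matrix product, so the three-mode tensor never has to be stored; how the result is "converted back" is omitted on the slide and is not attempted here. Sizes and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X_expr = rng.normal(size=(8, 30))   # x_ij: 8 samples x 30 features (toy sizes)
X_meth = rng.normal(size=(8, 40))   # x_il: 8 samples x 40 features (toy sizes)

# sum_i x_ij * x_il is just X_expr^T X_meth: a 30 x 40 matrix.
M = X_expr.T @ X_meth

# For comparison only: the full 8 x 30 x 40 tensor, summed over the sample mode.
T = np.einsum("ij,il->ijl", X_expr, X_meth)
same = np.allclose(M, T.sum(axis=0))

# The small matrix can then be decomposed by an ordinary SVD.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
print(same, M.shape)
```

For the breast cancer data set described later (13393 mRNAs and 755 miRNAs), the summed matrix has 13393 × 755 entries, 161 times fewer than the full tensor over 161 samples.
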
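12. ### Feature extraction

No real data are separated so well. Assume a Gaussian distribution and detect outliers:

$$P_i = P_{\chi^2}\left[ > \sum_k \left( \frac{x_{ik}}{\sigma} \right)^2 \right]$$

where $P_{\chi^2}[> x]$ is the upper-tail probability of the χ² distribution. Features with Benjamini-Hochberg corrected P < 0.01 are selected.

[Figure: P(p) plotted against 1 − p.]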
13. ### PCA as well as TD based unsupervised FE can extract features with orders very well.

15. ### How to identify outlier genes (P-value computation)

Assuming that the PC score obeys a Gaussian distribution (null hypothesis also used for probabilistic PCA)
→ P-values are attributed to genes using the χ² distribution
→ P-values are adjusted by Benjamini–Hochberg
→ Outlier genes: adjusted P-values < 0.01 or 0.05
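
The three steps above can be sketched as follows. This is an illustrative implementation under stated assumptions: σ is estimated as the per-component standard deviation of the scores, the degrees of freedom equal the number of components used, and the function names `benjamini_hochberg` and `select_outliers` as well as the toy data are hypothetical. It relies on `scipy.stats.chi2.sf` for the upper tail of the χ² distribution.

```python
import numpy as np
from scipy.stats import chi2

def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]   # enforce monotonicity
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

def select_outliers(scores, alpha=0.01):
    """scores: (n_genes,) or (n_genes, n_components) array of PC scores."""
    scores = np.asarray(scores, dtype=float)
    if scores.ndim == 1:
        scores = scores[:, None]
    sigma = scores.std(axis=0)                    # assumed estimate of sigma
    stat = ((scores / sigma) ** 2).sum(axis=1)    # sum_k (x_ik / sigma)^2
    p = chi2.sf(stat, df=scores.shape[1])         # upper-tail chi-squared P-value
    p_adj = benjamini_hochberg(p)
    return np.flatnonzero(p_adj < alpha), p_adj

# Toy data: 1000 Gaussian "genes" on two components, 5 planted outliers.
rng = np.random.default_rng(0)
scores = rng.normal(size=(1000, 2))
scores[:5] = 10.0
selected, p_adj = select_outliers(scores)
print(selected)
```

Passing two columns corresponds to selecting genes with a pair of components, such as PC2/PC3 in the yeast example that follows.
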
16. ### Real data: identification of cell cycle regulated genes of budding yeast

Synchronization is required. Strategy 1: food restriction (metabolic cycle).

[Figure: scatter plots of PC1 to PC4 loadings (time); numbers are winding numbers around the center.]

Consider PC2/PC3. Are genes selected with PC2/PC3 biologically feasible?
17. ### PC scores (genes) and PC loadings (time)

Black, red, and green are selected genes (P < 0.01): ribosome, mitochondria, cell division → match with the original paper.

Differ from sinusoidal wave!
18. ### REACTOME (selected by PC1 to PC4)

PCA based unsupervised FE is better than all fittings using sinusoidal, rectangular and triangular waves in selecting biologically feasible genes. Since PC2 differs from PC3, no periodic fitting can work.

[Figure: fitting vs. biological feasibility.]
19. ### Take home messages

・Gene expression profiles are periodic, but not sinusoidal. Thus, sinusoidal regression might cause artifacts (but there is no way to assume the true functional form a priori!).
・Limit cycle can be identified without functional forms or period.
・Three biologically important gene clusters can be identified in an unsupervised way.
20. ### Synchronization via temperature sensitive mutant

Cyclebase: integration of eight experiments

[Figure: typical PC loadings. Limit cycle: PC2/PC3.]

→ Outlier genes selected using PC2/PC3
21. ### Another example with disturbed periodic motions

[Figure: limit cycle: PC2/PC4.]

→ Outlier genes selected using PC2/PC4
22. ### Results

For seven out of eight experiments, 100 to 200 genes are selected (P < 0.05). 37 genes are selected commonly among six out of seven experiments → high consistency

24. ### Applications of TD to multi-omics data (breast cancer)

[Figure: mRNA and miRNA profiles for sample1 to sample5 in A group and B group (labels: active expression, active interaction).]

$x_{ij} \times x_{il}$, i: 161 samples, j: 13393 mRNA, l: 755 miRNA (8 groups)
25. ### Selection of $x_{ik_1}$ distinct between symptoms

[Figure panels: $k_1 = 1, 2, 3, 4, 5$; P-value.]

$1 \le k_1 \le 5$ are symptom dependent
26. ### $G(k_1, k_2, k_3)$

[Figure: $G(k_1, k_2, k_3)$ for $1 \le k_1, k_2, k_3 \le 5$, with $k_1$: sample, $k_2$: mRNA, $k_3$: miRNA; larger G vs. smaller G; $1 \le k_2 \le 5$, $1 \le k_3 \le 2$.]

For $x_{jk_2}$ and $x_{lk_3}$: assume Gaussian and detect outliers; P-values by χ² dist, Benjamini-Hochberg corrected P < 0.01.

Selected: 7 miRNAs out of 755 miRNAs, 427 mRNAs out of 13393 mRNAs
27. ### mRNA evaluation by MSigDB → highly overlapped with breast cancer genes

・SMID BREAST CANCER LUMINAL B DN
・SMID BREAST CANCER BASAL DN
・DOANE BREAST CANCER ESR1 UP
・SMID BREAST CANCER RELAPSE IN BONE DN
・SMID BREAST CANCER NORMAL LIKE UP
・FARMER BREAST CANCER BASAL VS LULMINAL
・BREAST CANCER UP
・SMID BREAST CANCER BASAL UP
・SMID BREAST CANCER LUMINAL B UP
・TURASHVILI BREAST DUCTAL CARCINOMA VS DUCTAL NORMAL DN

29. ### Integrated analysis of mRNA and miRNA using TD based unsupervised FE can identify biologically highly significant genes
30. ### Summary for PCA

・Embedding variables instead of samples into a lower dimension enables us to select variables of interest in an unsupervised manner.
・For the synthetic data set, variables distinct between two classes can be selected without labeling information.
・For the real data set (cell division cycle), PCA based unsupervised FE can identify genes with periodic motion without knowledge of either functional form or period.
31. ### Summary for TD

・As a feature selection method for multi-view data, after applying tensor decomposition to a tensor generated by the product of matrices, I propose to select features associated with BH-corrected P-values < 0.01 computed by the χ² dist assumed for a mode.
・For the synthetic data set, apparently uncorrelated variables embedded in noise are decomposed into the original orthogonal vectors after identifying correlated variables.
・For the multi-omics data set, a few (a few %) inter-correlated and biologically reasonable miRNAs and mRNAs are identified among a huge number of mRNAs and miRNAs.