Slide 1

Slide 1 text

Projection in genomic analysis: A theoretical basis to rationalize tensor decomposition and principal component analysis as feature selection tools
Y-h. Taguchi, Department of Physics, Chuo University, Tokyo, Japan.
Turki Turki, King Abdulaziz University, Jeddah, Saudi Arabia.
This was rejected by Conference Journal Track, but can be read as a preprint.
BioRxiv doi: https://doi.org/10.1101/2020.10.02.324616

Slide 2

Slide 2 text

Table of Contents
Purpose of this study
Projection pursuit
PCA based unsupervised FE
TD based unsupervised FE
Comparison with PP
Rationalization of null hypothesis (Gaussian distribution)

Slide 3

Slide 3 text

I have published a book on this topic with Springer International. I would be glad if the audience would buy it and learn my method.
Y-h. Taguchi, Unsupervised Feature Extraction Applied to Bioinformatics --- A PCA and TD Based Approach --- Springer International (2020)

Slide 4

Slide 4 text

The purpose of this preprint is to rationalize the proposed method, principal component analysis (PCA) and tensor decomposition (TD) based unsupervised feature selection, which is described in detail in the book mentioned on the previous slide.

Slide 5

Slide 5 text

$y \in \mathbb{R}^{M}$: teacher data (e.g., labeling); $M$: the number of samples
vs
$X \in \mathbb{R}^{N \times M}$: given data; $N$: the number of features attributed to the $M$ samples
How can we make use of $X$ to explain $y$?

Slide 6

Slide 6 text

One strategy: projection pursuit (PP)
$b = y X^{T} \in \mathbb{R}^{N}$
$b$ can be used to weight which features are important; e.g., the $i$th feature with a larger absolute value of $b_i$ is regarded as more important.
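
The projection above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `pp_weights` and the toy data are made up, and NumPy is assumed to be available.

```python
import numpy as np

def pp_weights(X, y):
    """Return b_i = sum_j x_ij y_j for X of shape (N, M) and y of shape (M,)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    return X @ y

# Toy data (made up): 3 features, 4 samples; the first two samples are labeled -1.
X_toy = np.array([[1.0, 1.0, 3.0, 3.0],   # higher in the second class
                  [2.0, 2.0, 2.0, 2.0],   # constant across samples
                  [5.0, 4.0, 1.0, 0.0]])  # higher in the first class
y_toy = np.array([-1.0, -1.0, 1.0, 1.0])

b_toy = pp_weights(X_toy, y_toy)
# Rank features by |b_i|, most important first.
ranking = np.argsort(-np.abs(b_toy))
```

Because the toy labels sum to zero, the constant feature receives a weight of zero and falls to the bottom of the ranking.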

Slide 7

Slide 7 text

PCA based unsupervised FE
Generate the $N \times N$ matrix: $X X^{T} \in \mathbb{R}^{N \times N}$
Obtain the eigenvector $u_l$ attributed to feature $i$: $X X^{T} u_l = \lambda_l u_l$, $u_l \in \mathbb{R}^{N}$
Compute the eigenvector $v_l$ attributed to sample $j$: $v_l = X^{T} u_l \in \mathbb{R}^{M}$
Identify which $v_l$ is biologically interesting
Attribute P values to feature $i$, assuming that $u_l$ obeys a Gaussian distribution: $P_i = P_{\chi^2}\left[ > \left( \frac{u_{li}}{\sigma_l} \right)^{2} \right]$
The $P_i$ are corrected by the Benjamini-Hochberg criterion, and features associated with corrected $P_i < 0.01$ are selected.
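
The linear-algebra steps of this slide can be sketched as follows. This is an illustrative sketch under assumptions: the function name `pca_feature_vectors`, the random demo data, and the choice `l = 0` are hypothetical, and $\sigma_l$ is taken to be the standard deviation of $u_l$ over features.

```python
import numpy as np

def pca_feature_vectors(X):
    """Eigen-decompose X X^T for X of shape (N, M).

    Returns eigenvalues in descending order, U whose column l is u_l
    (length N), and V whose column l is v_l = X^T u_l (length M).
    """
    X = np.asarray(X, dtype=float)
    gram = X @ X.T                       # N x N matrix
    eigvals, U = np.linalg.eigh(gram)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    eigvals, U = eigvals[order], U[:, order]
    V = X.T @ U                          # column l is v_l = X^T u_l
    return eigvals, U, V

# Demo on random data (made up): N = 50 features, M = 6 samples.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 6))
eigvals, U, V = pca_feature_vectors(X_demo)

l = 0  # hypothetical choice; in practice v_l is inspected for biological relevance
# Squared standardized component (u_li / sigma_l)^2 that enters the chi-squared test.
z2 = (U[:, l] / U[:, l].std()) ** 2
```

Choosing which `l` to use is the one step that is not automatic: it depends on which sample-space vector $v_l$ shows the pattern of interest.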

Slide 8

Slide 8 text

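TD based unsupervised FE
$X \in \mathbb{R}^{N \times M \times K}$
$x_{ijk} = \sum_{l_1=1}^{N} \sum_{l_2=1}^{M} \sum_{l_3=1}^{K} G(l_1 l_2 l_3) \, u_{l_1 i} \, u_{l_2 j} \, u_{l_3 k}$
$G \in \mathbb{R}^{N \times M \times K}$, $u_{l_1 i} \in \mathbb{R}^{N \times N}$, $u_{l_2 j} \in \mathbb{R}^{M \times M}$, $u_{l_3 k} \in \mathbb{R}^{K \times K}$
$x_{ijk}$: the $i$th feature attributed to samples with the $j$th and $k$th experimental conditions
$N$: number of features; $M$, $K$: numbers of conditions (samples)
Identify biologically interesting $l_2$, $l_3$ and find the $l_1$ that shares $G(l_1, l_2, l_3)$ of large absolute value with the identified $l_2$, $l_3$.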
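
A decomposition of this form can be obtained by higher-order singular value decomposition (HOSVD). The sketch below assumes HOSVD is the intended TD and uses only NumPy; the function name `hosvd`, the random demo tensor, and the choice `l2 = l3 = 0` are hypothetical.

```python
import numpy as np

def hosvd(X):
    """HOSVD of a 3-mode tensor X of shape (N, M, K).

    Returns the core tensor G and factor matrices U1, U2, U3, where
    U1[i, l1] corresponds to u_{l1 i} on the slide (likewise U2, U3).
    """
    factors = []
    for mode in range(3):
        # Unfold the tensor along `mode` and take its left singular vectors.
        unfolded = np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)
        U, _, _ = np.linalg.svd(unfolded, full_matrices=False)
        factors.append(U)
    U1, U2, U3 = factors
    # Core tensor: project X onto the singular vectors of every mode.
    G = np.einsum("ijk,ia,jb,kc->abc", X, U1, U2, U3)
    return G, U1, U2, U3

# Demo on a random tensor (made up): N = 8, M = 4, K = 3.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(8, 4, 3))
G, U1, U2, U3 = hosvd(X_demo)

# Rebuild x_ijk from the core tensor and the singular vectors.
X_rebuilt = np.einsum("abc,ia,jb,kc->ijk", G, U1, U2, U3)

l2, l3 = 0, 0  # hypothetical "biologically interesting" components
# l1 with the largest |G(l1, l2, l3)| for the chosen l2, l3.
l1 = int(np.argmax(np.abs(G[:, l2, l3])))
```

The feature-side vector `U1[:, l1]` is then the one to which P values are attributed.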

Slide 9

Slide 9 text

$P_i = P_{\chi^2}\left[ > \left( \frac{u_{l_1 i}}{\sigma_{l_1}} \right)^{2} \right]$
Attribute P values to feature $i$, assuming that $u_{l_1}$ obeys a Gaussian distribution. The $P_i$ are corrected by the Benjamini-Hochberg criterion, and features associated with corrected $P_i < 0.01$ are selected.
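
The P-value attribution and the Benjamini-Hochberg correction can be sketched with the standard library alone. Two assumptions are made here and are not stated on the slide: the chi-squared distribution has one degree of freedom (a single component is tested), and $\sigma_{l_1}$ is the standard deviation of the component over features. All function names and the toy vector are hypothetical.

```python
import math

def chi2_sf_1dof(z2):
    """P[chi^2 > z2] for one degree of freedom (assumed), via erfc."""
    return math.erfc(math.sqrt(z2 / 2.0))

def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):           # walk from the largest P downwards
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def select_features(u, threshold=0.01):
    """Indices i whose adjusted P-value for (u_i / sigma)^2 is below threshold."""
    n = len(u)
    mean = sum(u) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in u) / n)
    pvals = [chi2_sf_1dof((x / sigma) ** 2) for x in u]
    adjusted = benjamini_hochberg(pvals)
    return [i for i, p in enumerate(adjusted) if p < threshold]

# Toy component (made up): 199 small entries and one clear outlier at index 199.
u_toy = [0.01 * (-1) ** i for i in range(199)] + [5.0]
selected = select_features(u_toy)
```

Only the outlying entry survives the correction, which is the intended behavior: features are selected as outliers relative to the assumed Gaussian null.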

Slide 10

Slide 10 text

Applying TD based unsupervised FE to cancer data sets
Integration of two cancer data sets
TCGA: M = 324 (253 tumor, 71 normal), mRNA and miRNA
GEO: M = 34 (17 tumor, 17 normal), mRNA and miRNA

Slide 11

Slide 11 text

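$x_{ij} \in \mathbb{R}^{N \times M}$ ($N$ mRNAs), $x_{kj} \in \mathbb{R}^{K \times M}$ ($K$ miRNAs)
$x_{ik} = \sum_{j} x_{ij} x_{kj} \in \mathbb{R}^{N \times K}$
$x_{ik} = \sum_{l=1}^{\min(N, K)} \lambda_l u_{li} u_{lk}$
$v_{lj}^{\mathrm{mRNA}} = \sum_{i} x_{ij} u_{li}$, $v_{lj}^{\mathrm{miRNA}} = \sum_{k} x_{kj} u_{lk}$
$P_i = P_{\chi^2}\left[ > \left( \frac{u_{li}}{\sigma_l} \right)^{2} \right]$, $P_k = P_{\chi^2}\left[ > \left( \frac{u_{lk}}{\sigma_l} \right)^{2} \right]$
72 mRNAs and 11 miRNAs are selected.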
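
The integration of the mRNA and miRNA profiles can be sketched as a singular value decomposition of their product over samples. The function name `integrate_and_decompose` and the random demo matrices are hypothetical; only NumPy is assumed.

```python
import numpy as np

def integrate_and_decompose(X_mrna, X_mirna):
    """X_mrna: (N, M) and X_mirna: (K, M), measured on the same M samples.

    Returns singular values lambda_l, feature-side vectors for mRNAs and
    miRNAs (column l is u_l), and the sample-side projections v_l.
    """
    X_ik = X_mrna @ X_mirna.T                    # (N, K): sum over samples j
    U_mrna, lam, Vt = np.linalg.svd(X_ik, full_matrices=False)
    U_mirna = Vt.T                               # column l is u_l over miRNAs
    V_mrna = X_mrna.T @ U_mrna                   # v_lj^mRNA  = sum_i x_ij u_li
    V_mirna = X_mirna.T @ U_mirna                # v_lj^miRNA = sum_k x_kj u_lk
    return lam, U_mrna, U_mirna, V_mrna, V_mirna

# Demo on random data (made up): N = 20 mRNAs, K = 7 miRNAs, M = 5 samples.
rng = np.random.default_rng(0)
X_mrna_demo = rng.normal(size=(20, 5))
X_mirna_demo = rng.normal(size=(7, 5))
lam, U_mrna, U_mirna, V_mrna, V_mirna = integrate_and_decompose(
    X_mrna_demo, X_mirna_demo
)
```

Each column of `V_mrna` and `V_mirna` is a per-sample profile, so the component `l` whose profiles separate tumors from normal tissue in both would be the one used for the P values.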

Slide 12

Slide 12 text

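Comparison with PP
$y_j = -\frac{M}{M_T}$, $1 \le j \le M_T$
$y_j = \frac{M}{M_N}$, $M_T < j \le M$
$b_i = \sum_{j} x_{ij} y_j$, $b_k = \sum_{j} x_{kj} y_j$
$P_i = P_{\chi^2}\left[ > \left( \frac{b_i}{\sigma_b} \right)^{2} \right]$, $P_k = P_{\chi^2}\left[ > \left( \frac{b_k}{\sigma_b} \right)^{2} \right]$
$M_T$: number of tumors; $M_N$: number of normal kidneys
73 mRNAs and 18 miRNAs are selected.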
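
The label construction can be sketched as below, assuming the labels read $y_j = -M/M_T$ for tumors and $y_j = M/M_N$ for normal samples, with tumors listed first. The function names are hypothetical; the sample counts 253 and 71 are the TCGA numbers given earlier in the slides.

```python
def balanced_labels(m_tumor, m_normal):
    """Labels y_j = -M/M_T for tumors, then M/M_N for normals (assumed form)."""
    m = m_tumor + m_normal
    return [-m / m_tumor] * m_tumor + [m / m_normal] * m_normal

def project(rows, y):
    """b_i = sum_j x_ij y_j for each feature row."""
    return [sum(x * yj for x, yj in zip(row, y)) for row in rows]

# Labels for the TCGA sample counts from the slides: 253 tumors, 71 normals.
y_tcga = balanced_labels(253, 71)

# Toy check (made up): 2 tumors, 3 normals.
y_toy = balanced_labels(2, 3)
rows_toy = [
    [1.0, 1.0, 1.0, 1.0, 1.0],   # constant feature
    [0.0, 0.0, 1.0, 1.0, 1.0],   # expressed only in normal samples
]
b_toy = project(rows_toy, y_toy)
```

With this scaling the labels sum to zero even when the classes are unbalanced, so a feature that is constant across samples gets $b_i = 0$.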

Slide 13

Slide 13 text

Q-Q plot (panels: mRNA, miRNA)
$P_i$ and $P_k$ obey the same distribution between PP and TD based unsupervised FE.

Slide 14

Slide 14 text

Confusion matrices (panels: mRNA, miRNA)
PP and TD based unsupervised FE select almost the same mRNAs and miRNAs.
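
A confusion matrix between two selections can be computed as sketched below. The function name and the toy index lists are hypothetical and are not the selections reported in the slides.

```python
def confusion_matrix(selected_a, selected_b, n_features):
    """2 x 2 table comparing two feature selections over n_features features.

    Rows: selected / not selected by method A.
    Columns: selected / not selected by method B.
    """
    a, b = set(selected_a), set(selected_b)
    both = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    neither = n_features - both - only_a - only_b
    return [[both, only_a], [only_b, neither]]

# Toy example (made up): 10 features, two overlapping selections.
table = confusion_matrix([0, 1, 2], [1, 2, 3], n_features=10)
```

Large diagonal entries relative to the off-diagonal ones indicate that the two methods select nearly the same features.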

Slide 15

Slide 15 text

Rationalization of null hypothesis (Gaussian distribution)

Slide 16

Slide 16 text

The true null distribution was generated with shuffled miRNA, and the P-values attributed to $u_{li}^{\mathrm{miRNA}}$ were computed.
Figure panels: all miRNAs; top 500 miRNAs (axes labeled 1-P).

Slide 17

Slide 17 text

The true null distribution was generated with shuffled mRNA, and the P-values attributed to $u_{li}^{\mathrm{mRNA}}$ were computed.
Figure panels: all mRNAs; top 3000 mRNAs (axes labeled 1-P).
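
A shuffling-based null distribution can be sketched as follows. The slides do not say how the shuffling was done, so this sketch makes its own assumptions: feature order is permuted independently within each sample, the leading singular vector is recomputed each time, and empirical P-values compare absolute values against the pooled null. All names, the number of shuffles, and the demo data are hypothetical.

```python
import numpy as np

def leading_feature_vector(X):
    """First left singular vector of X (N features x M samples)."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, 0]

def shuffled_null_pvalues(X, n_shuffles=20, seed=0):
    """Empirical P-values for |u_i| against a null built by shuffling X."""
    rng = np.random.default_rng(seed)
    observed = np.abs(leading_feature_vector(X))
    null = []
    for _ in range(n_shuffles):
        # Assumption: permute the feature order independently in each sample.
        X_shuffled = np.column_stack([rng.permutation(col) for col in X.T])
        null.append(np.abs(leading_feature_vector(X_shuffled)))
    null = np.sort(np.concatenate(null))
    # Number of null values >= each observed value.
    n_ge = null.size - np.searchsorted(null, observed, side="left")
    # Add-one smoothing keeps every empirical P-value strictly positive.
    return (n_ge + 1) / (null.size + 1)

# Demo on random data (made up): N = 100 features, M = 6 samples.
rng_demo = np.random.default_rng(1)
X_demo = rng_demo.normal(size=(100, 6))
p_empirical = shuffled_null_pvalues(X_demo)
```

Absolute values are used because the sign of a singular vector is arbitrary, so the shuffled and observed vectors are only comparable in magnitude.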

Slide 18

Slide 18 text

Confusion matrices (panels: mRNA, miRNA)
The null distribution and TD based unsupervised FE select almost the same mRNAs and miRNAs, although the threshold values differ.

Slide 19

Slide 19 text

Conclusion
TD based unsupervised FE is equivalent to PP.
Although the null hypothesis of a Gaussian distribution is not fulfilled, the resulting selection empirically coincides with the one obtained from the null distribution generated by shuffling. Only the threshold P values differ (0.01 for TD based unsupervised FE and 0.1 for the null distribution).