Projection in genomic analysis: A theoretical basis to rationalize tensor decomposition and principal component analysis as feature selection tools

Y-h. Taguchi, Department of Physics, Chuo University, Tokyo, Japan.
Turki Turki, King Abdulaziz University, Jeddah, Saudi Arabia.
This was rejected by the Conference Journal Track, but can be read as a preprint.
bioRxiv doi: https://doi.org/10.1101/2020.10.02.324616

Table of Contents
Purpose of this study
Projection pursuit
PCA based unsupervised FE
TD based unsupervised FE
Comparison with PP
Rationalization of null hypothesis (Gaussian distribution)

I have published a book on this topic with Springer International. I would be glad if the audience would buy it and learn my method.
Y-h. Taguchi, Unsupervised Feature Extraction Applied to Bioinformatics: A PCA and TD Based Approach, Springer International (2020)

The purpose of this preprint is to rationalize the proposed method, principal component analysis (PCA) and tensor decomposition (TD) based unsupervised feature selection, which is described in detail in the book mentioned on the previous slide.

$y \in \mathbb{R}^{M}$: teacher data (e.g. labeling), where $M$ is the number of samples
vs
$x \in \mathbb{R}^{N \times M}$: given data, where $N$ is the number of features attributed to the $M$ samples
How can we make use of $x$ to explain $y$?

One strategy: Projection pursuit (PP)
$b = y X^{T} \in \mathbb{R}^{N}$
$b$ can be used to weight which features are important, e.g., the $i$th feature with a larger absolute value of $b_i$ is regarded as important.
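The projection can be illustrated with a minimal sketch. The matrix and labels below are made-up toy values (not data from the study), and the variable names are only illustrative.

```python
import numpy as np

# Hypothetical toy data: N = 4 features (rows), M = 6 samples (columns).
X = np.array([[1, 2, 1, 2, 1, 2],
              [3, 3, 3, 3, 3, 3],
              [0, 0, 0, 5, 5, 5],
              [2, 1, 2, 1, 2, 1]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1], dtype=float)  # teacher data, one label per sample

# b_i = sum_j x_ij y_j, which is b = y X^T in the slide's notation.
b = X @ y

# Rank features by |b_i|: larger absolute values are regarded as more important.
ranking = np.argsort(-np.abs(b))
print(b)
print(ranking)
```

Here only the third feature follows the labels, so it receives by far the largest weight, while the constant second feature receives zero weight.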

PCA based unsupervised FE
Generate the $N \times N$ matrix: $XX^{T} \in \mathbb{R}^{N \times N}$
Obtain the eigenvector $u_l$ attributed to feature $i$: $XX^{T} u_l = \lambda_l u_l$, $u_l \in \mathbb{R}^{N}$
Compute the eigenvector $v_l$ attributed to sample $j$: $v_l = X^{T} u_l \in \mathbb{R}^{M}$
Identify which $v_l$ is biologically interesting
Attribute $P$-values to feature $i$, assuming that $u_l$ obeys a Gaussian distribution: $P_i = P_{\chi^2}\left[ > \left( \frac{u_{li}}{\sigma_l} \right)^{2} \right]$
$P_i$ is corrected by the Benjamini-Hochberg criterion, and features associated with corrected $P_i < 0.01$ are selected.
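The steps on this slide can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: $\sigma_l$ is taken to be the plain standard deviation of the components of $u_l$, the chi-squared tail with one degree of freedom is computed through `math.erfc`, the index `l` is passed in by hand (the slide chooses it by biological inspection of $v_l$), and the data are synthetic with five planted features.

```python
import math
import numpy as np

def chi2_sf_1dof(z2):
    """P(chi^2 > z2) for one degree of freedom, via the complementary error function."""
    return math.erfc(math.sqrt(z2 / 2.0))

def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted P-values (step-up procedure)."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest P-value.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.minimum(adjusted, 1.0)
    return out

def pca_unsupervised_fe(X, l, alpha=0.01):
    """Select features whose component of u_l is an outlier under a Gaussian null."""
    eigval, eigvec = np.linalg.eigh(X @ X.T)   # X X^T u_l = lambda_l u_l
    order = np.argsort(eigval)[::-1]           # largest eigenvalue first
    u = eigvec[:, order[l]]                    # u_l: one component per feature
    v = X.T @ u                                # v_l: one component per sample
    sigma = u.std()                            # assumption: sigma_l = std of u_l
    p = np.array([chi2_sf_1dof((ui / sigma) ** 2) for ui in u])
    selected = np.flatnonzero(benjamini_hochberg(p) < alpha)
    return selected, u, v

# Synthetic example: 200 features, 10 samples, features 0..4 follow the labels.
rng = np.random.default_rng(0)
labels = np.array([-1.0] * 5 + [1.0] * 5)
X = rng.normal(size=(200, 10))
X[:5] += 10.0 * labels
selected, u, v = pca_unsupervised_fe(X, l=0)
print(selected)
```

In real use, `l` would be chosen after inspecting each $v_l$ against the sample annotation, which this sketch does not attempt to automate.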

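TD based unsupervised FE
$X \in \mathbb{R}^{N \times M \times K}$: the $i$th feature attributed to samples with the $j$th and $k$th experimental conditions
$N$: number of features; $M$, $K$: number of conditions (samples)
$x_{ijk} = \sum_{l_1=1}^{N} \sum_{l_2=1}^{M} \sum_{l_3=1}^{K} G(l_1 l_2 l_3) u_{l_1 i} u_{l_2 j} u_{l_3 k}$
$G \in \mathbb{R}^{N \times M \times K}$, $u_{l_1 i} \in \mathbb{R}^{N \times N}$, $u_{l_2 j} \in \mathbb{R}^{M \times M}$, $u_{l_3 k} \in \mathbb{R}^{K \times K}$
Identify the biologically interesting $l_2$, $l_3$, and find the $l_1$ that shares an absolutely large $G(l_1, l_2, l_3)$ with the identified $l_2$, $l_3$.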

$P_i = P_{\chi^2}\left[ > \left( \frac{u_{l_1 i}}{\sigma_{l_1}} \right)^{2} \right]$
Attribute $P$-values to feature $i$, assuming that $u_{l_1}$ obeys a Gaussian distribution. $P_i$ is corrected by the Benjamini-Hochberg criterion, and features associated with corrected $P_i < 0.01$ are selected.
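A minimal sketch of the decomposition step, assuming the TD is a higher-order SVD computed from the SVD of each mode unfolding. The tensor is random toy data, `hosvd` is a helper written for this illustration, and `l2 = l3 = 0` are placeholders for the indices that would be picked by biological inspection.

```python
import math
import numpy as np

def hosvd(X):
    """Higher-order SVD of a 3-mode tensor: returns the core G and three factor matrices."""
    factors = []
    for mode in range(3):
        # Unfold along `mode`: rows index that mode, columns index the other two.
        unfolded = np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)
        U, _, _ = np.linalg.svd(unfolded, full_matrices=False)
        factors.append(U)                      # column l holds the l-th singular vector
    U1, U2, U3 = factors
    G = np.einsum('ijk,ia,jb,kc->abc', X, U1, U2, U3)
    return G, U1, U2, U3

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4, 3))                 # hypothetical N = 6, M = 4, K = 3
G, U1, U2, U3 = hosvd(X)

# x_ijk = sum over l1, l2, l3 of G(l1 l2 l3) u_{l1 i} u_{l2 j} u_{l3 k}
X_rebuilt = np.einsum('abc,ia,jb,kc->ijk', G, U1, U2, U3)

l2, l3 = 0, 0                                  # placeholders for the interesting indices
l1 = int(np.argmax(np.abs(G[:, l2, l3])))      # l1 with the absolutely largest G(l1, l2, l3)
u_l1 = U1[:, l1]

# P_i from the chi-squared tail (1 d.o.f.), assuming sigma is the std of u_l1.
sigma = u_l1.std()
p = np.array([math.erfc(abs(ui / sigma) / math.sqrt(2.0)) for ui in u_l1])
print(l1, p)
```

The factor matrices are square here because each mode is smaller than the product of the other two; for a tensor with $N > MK$ the thin SVD would return fewer than $N$ vectors for the feature mode.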

Applying TD based unsupervised FE to cancer data sets
Integration of two cancer data sets
TCGA: $M = 324$ (253 tumor, 71 normal), mRNA and miRNA
GEO: $M = 34$ (17 tumor, 17 normal), mRNA and miRNA

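$x_{ij} \in \mathbb{R}^{N \times M}$ ($N$ mRNAs), $x_{kj} \in \mathbb{R}^{K \times M}$ ($K$ miRNAs)
$x_{ik} = \sum_j x_{ij} x_{kj} \in \mathbb{R}^{N \times K}$
$x_{ik} = \sum_{l=1}^{\min(N,K)} \lambda_l u_{li} u_{lk}$
$v_{lj}^{\mathrm{mRNA}} = \sum_i x_{ij} u_{li}$, $v_{lj}^{\mathrm{miRNA}} = \sum_k x_{kj} u_{lk}$
$P_i = P_{\chi^2}\left[ > \left( \frac{u_{li}}{\sigma_l} \right)^{2} \right]$, $P_k = P_{\chi^2}\left[ > \left( \frac{u_{lk}}{\sigma_l} \right)^{2} \right]$
72 mRNAs and 11 miRNAs are selected.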
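The integration step can be sketched with random stand-in matrices. The sizes and the array names (`X_mrna`, `X_mirna`, and so on) are hypothetical and chosen only to mirror the indices $i$, $k$ and $j$ on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, M = 8, 5, 6                        # hypothetical sizes: mRNAs, miRNAs, shared samples
X_mrna = rng.normal(size=(N, M))         # x_ij
X_mirna = rng.normal(size=(K, M))        # x_kj

# x_ik = sum_j x_ij x_kj: contract over the shared sample index j.
X_ik = X_mrna @ X_mirna.T                # shape (N, K)

# SVD: x_ik = sum_l lambda_l u_li u_lk
U_mrna, lam, Vt = np.linalg.svd(X_ik, full_matrices=False)
U_mirna = Vt.T                           # column l holds u_lk

# Sample-side vectors, obtained by projecting each data set onto its own singular vectors.
V_mrna = X_mrna.T @ U_mrna               # v_lj^mRNA  = sum_i x_ij u_li
V_mirna = X_mirna.T @ U_mirna            # v_lj^miRNA = sum_k x_kj u_lk
print(V_mrna.shape, V_mirna.shape)
```

Column `l` of `V_mrna` and `V_mirna` is what would be inspected for a tumor versus normal pattern before the $P$-values are attributed to `U_mrna[:, l]` and `U_mirna[:, l]`.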

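Comparison with PP
$y_j = -\frac{M}{M_T}$ for $1 \le j \le M_T$, and $y_j = \frac{M}{M_N}$ for $M_T < j \le M$
$M_T$: number of tumors, $M_N$: number of normal kidneys
$b_i = \sum_j x_{ij} y_j$, $b_k = \sum_j x_{kj} y_j$
$P_i = P_{\chi^2}\left[ > \left( \frac{b_i}{\sigma_b} \right)^{2} \right]$, $P_k = P_{\chi^2}\left[ > \left( \frac{b_k}{\sigma_b} \right)^{2} \right]$
73 mRNAs and 18 miRNAs are selected.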
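With this choice of labels the projection reduces to a scaled difference of class means, which may help in reading $b_i$. The short derivation below uses only the definitions on this slide; the symbols $\bar{x}_i^{T}$ and $\bar{x}_i^{N}$ (mean of feature $i$ over tumor and normal samples) are introduced here for illustration.

```latex
\begin{aligned}
\sum_{j=1}^{M} y_j &= M_T \left( -\frac{M}{M_T} \right) + M_N \, \frac{M}{M_N} = 0, \\
b_i = \sum_{j=1}^{M} x_{ij} y_j
  &= -\frac{M}{M_T} \sum_{j=1}^{M_T} x_{ij} + \frac{M}{M_N} \sum_{j=M_T+1}^{M} x_{ij}
   = M \left( \bar{x}_i^{N} - \bar{x}_i^{T} \right).
\end{aligned}
```

So a feature receives a large $|b_i|$ exactly when its mean expression differs strongly between normal and tumor samples, and the same holds for $b_k$ with $x_{kj}$.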

Q-Q plot (panels: mRNA, miRNA)
$P_i$ and $P_k$ obey the same distribution between PP and TD based unsupervised FE.

Confusion matrices (panels: mRNA, miRNA)
PP and TD based unsupervised FE select almost the same mRNAs and miRNAs.
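A confusion matrix between two selections can be tabulated as in the following sketch. The index lists are invented for illustration and are not the gene lists from the study; `confusion_matrix` is a helper defined here.

```python
import numpy as np

def confusion_matrix(selected_a, selected_b, n_features):
    """2 x 2 table: rows = method A (not selected, selected), columns = method B."""
    a = np.zeros(n_features, dtype=bool)
    b = np.zeros(n_features, dtype=bool)
    a[list(selected_a)] = True
    b[list(selected_b)] = True
    return np.array([[np.sum(~a & ~b), np.sum(~a & b)],
                     [np.sum(a & ~b), np.sum(a & b)]])

# Hypothetical selections among 10 features.
pp_selected = [0, 1, 2, 3]
td_selected = [1, 2, 3, 4]
table = confusion_matrix(pp_selected, td_selected, n_features=10)
print(table)
```

The bottom-right entry counts the features chosen by both methods, and the off-diagonal entries count the disagreements.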

Rationalization of null hypothesis (Gaussian distribution)

The true null distribution was generated with shuffled miRNA, and $P$-values attributed to $u^{\mathrm{miRNA}}_{li}$ were computed.
Figure panels: all miRNAs, top 500 miRNAs (both axes: $1-P$)

The true null distribution was generated with shuffled mRNA, and $P$-values attributed to $u^{\mathrm{mRNA}}_{li}$ were computed.
Figure panels: all mRNAs, top 3000 mRNAs (both axes: $1-P$)
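One way such a shuffled null could be built is sketched below. The slides do not spell out the shuffling scheme, so this is an assumption: within every sample, the feature order of one data set is permuted independently, the leading singular vector is recomputed, and the pooled absolute components form the empirical null. The function names, toy sizes, and the add-one smoothing are choices made for this illustration.

```python
import numpy as np

def top_singular_vector(A):
    """Leading left singular vector of A."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, 0]

def shuffled_null_pvalues(X_a, X_b, n_shuffles=100, seed=0):
    """Empirical P-values for the leading singular vector of X_a X_b^T.

    Assumption: the null is generated by permuting, within each sample
    (column) of X_a, which feature each value belongs to.
    """
    rng = np.random.default_rng(seed)
    observed = np.abs(top_singular_vector(X_a @ X_b.T))
    null = []
    for _ in range(n_shuffles):
        # Independent permutation of the feature order for every column.
        shuffled = np.column_stack([rng.permutation(col) for col in X_a.T])
        null.append(np.abs(top_singular_vector(shuffled @ X_b.T)))
    null = np.concatenate(null)
    # Fraction of null components at least as large as each observed component.
    exceed = (null[None, :] >= observed[:, None]).sum(axis=1)
    return (exceed + 1) / (null.size + 1)

# Toy stand-ins for the two data sets (random numbers, not real expression data).
rng = np.random.default_rng(1)
X_a = rng.normal(size=(30, 8))
X_b = rng.normal(size=(12, 8))
p_null = shuffled_null_pvalues(X_a, X_b, n_shuffles=20)
print(p_null.round(3))
```

These empirical $P$-values can then be compared with the Gaussian-based $P_i$, for instance by thresholding both and tabulating the overlap.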

Confusion matrices (panels: mRNA, miRNA)
The null distribution and TD based unsupervised FE select almost the same mRNAs and miRNAs, although the threshold values differ.

Conclusion
TD based unsupervised FE is equivalent to PP.
Although the null hypothesis of a Gaussian distribution is not fulfilled, the selection empirically coincides with that from the null distribution generated by shuffling, although the threshold $P$-values differ (0.01 for TD based unsupervised FE and 0.1 for the null distribution).
