Projection in genomic analysis: A theoretical basis to rationalize tensor decomposition and principal component analysis as feature selection tools

Y-h. Taguchi
November 04, 2021

Presentation at InCob2021
http://www.incob2021.cn/

The content is published in PLoS ONE
https://doi.org/10.1371/journal.pone.0275472

Transcript

  1. Projection in genomic analysis: A theoretical basis to rationalize tensor decomposition and principal component analysis as feature selection tools. Y-h. Taguchi, Department of Physics, Chuo University, Tokyo, Japan. Turki Turki, King Abdulaziz University, Jeddah, Saudi Arabia. This was rejected by the Conference Journal Track, but can be read as a preprint. bioRxiv doi: https://doi.org/10.1101/2020.10.02.324616
  2. Table of Contents: Purpose of this study; Projection pursuit (PP); PCA based unsupervised FE; TD based unsupervised FE; Comparison with PP; Rationalization of null hypothesis (Gaussian distribution)
  3. I have published a book on this topic with Springer International. I would be glad if the audience bought it and learned my method from it. Y-h. Taguchi, Unsupervised Feature Extraction Applied to Bioinformatics --- A PCA and TD Based Approach --- Springer International (2020)
  4. The purpose of this preprint is to rationalize the proposed method, principal component analysis (PCA) and tensor decomposition (TD) based unsupervised feature selection, which is described in detail in the book mentioned on the previous slide.
  5. $y \in \mathbb{R}^{M}$: teacher data (e.g., labeling), where $M$ is the number of samples, versus $x \in \mathbb{R}^{N \times M}$: given data, where $N$ is the number of features attributed to the $M$ samples. How can we make use of $x$ to explain $y$?
  6. One strategy: projection pursuit (PP). $b = y X^{T} \in \mathbb{R}^{N}$. $b$ can be used to weight which features are important; e.g., the $i$th feature with a larger absolute value of $b_i$ is regarded as more important.
  7. PCA based unsupervised FE
    Generate the $N \times N$ matrix: $XX^{T} \in \mathbb{R}^{N \times N}$
    Obtain the eigenvector $u_l$ attributed to feature $i$: $XX^{T} u_l = \lambda_l u_l$, $u_l \in \mathbb{R}^{N}$
    Compute the eigenvector $v_l$ attributed to sample $j$: $v_l = X^{T} u_l \in \mathbb{R}^{M}$
    Identify which $v_l$ is biologically interesting
    Attribute P values to feature $i$, assuming that $u_l$ obeys a Gaussian distribution: $P_i = P_{\chi^2}\left[ > \left( \frac{u_{li}}{\sigma_l} \right)^{2} \right]$
    $P_i$ is corrected by the Benjamini-Hochberg criterion, and features associated with corrected $P_i < 0.01$ are selected.
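  8. TD based unsupervised FE
    $X \in \mathbb{R}^{N \times M \times K}$, where $x_{ijk}$ is the $i$th feature attributed to samples with the $j$th and $k$th experimental conditions. $N$: number of features; $M, K$: numbers of conditions (samples).
    $x_{ijk} = \sum_{l_1=1}^{N} \sum_{l_2=1}^{M} \sum_{l_3=1}^{K} G(l_1 l_2 l_3)\, u_{l_1 i}\, u_{l_2 j}\, u_{l_3 k}$
    $G \in \mathbb{R}^{N \times M \times K}$, $u_{l_1 i} \in \mathbb{R}^{N \times N}$, $u_{l_2 j} \in \mathbb{R}^{M \times M}$, $u_{l_3 k} \in \mathbb{R}^{K \times K}$
    Identify biologically interesting $l_2, l_3$ and find the $l_1$ that shares an absolutely large $G(l_1, l_2, l_3)$ with the identified $l_2, l_3$.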
  9. $P_i = P_{\chi^2}\left[ > \left( \frac{u_{l_1 i}}{\sigma_{l_1}} \right)^{2} \right]$. Attribute P values to feature $i$, assuming that $u_{l_1}$ obeys a Gaussian distribution. $P_i$ is corrected by the Benjamini-Hochberg criterion, and features associated with corrected $P_i < 0.01$ are selected.
  10. Applying TD based unsupervised FE to cancer data sets: integration of two cancer data sets.
    TCGA: $M = 324$ (253 tumor, 71 normal), mRNA and miRNA
    GEO: $M = 34$ (17 tumor, 17 normal), mRNA and miRNA
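  11. $x_{ij} \in \mathbb{R}^{N \times M}$ ($N$ mRNAs), $x_{kj} \in \mathbb{R}^{K \times M}$ ($K$ miRNAs)
    $x_{ik} = \sum_{j} x_{ij} x_{kj} \in \mathbb{R}^{N \times K}$
    $x_{ik} = \sum_{l=1}^{\min(N,K)} \lambda_l u_{li} u_{lk}$
    $v_{lj}^{\mathrm{mRNA}} = \sum_{i} x_{ij} u_{li}$, $v_{lj}^{\mathrm{miRNA}} = \sum_{k} x_{kj} u_{lk}$
    $P_i = P_{\chi^2}\left[ > \left( \frac{u_{li}}{\sigma_l} \right)^{2} \right]$, $P_k = P_{\chi^2}\left[ > \left( \frac{u_{lk}}{\sigma_l} \right)^{2} \right]$
    72 mRNAs and 11 miRNAs are selected.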
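  12. Comparison with PP
    $y_j = -\frac{M}{M_T}$ for $1 \le j \le M_T$; $y_j = \frac{M}{M_N}$ for $M_T < j \le M$ ($M_T$: number of tumors; $M_N$: number of normal kidneys)
    $b_i = \sum_{j} x_{ij} y_j$, $b_k = \sum_{j} x_{kj} y_j$
    $P_i = P_{\chi^2}\left[ > \left( \frac{b_i}{\sigma_b} \right)^{2} \right]$, $P_k = P_{\chi^2}\left[ > \left( \frac{b_k}{\sigma_b} \right)^{2} \right]$
    73 mRNAs and 18 miRNAs are selected.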
  13. Q-Q plot (panels: mRNA, miRNA). $P_i$ and $P_k$ obey the same distribution between PP and TD based unsupervised FE.
  14. The true null distribution was generated with shuffled miRNA, and the P-values attributed to $u_{li}^{\mathrm{miRNA}}$ were computed. [Figure: $1-P$ for all miRNAs and for the top 500 miRNAs]
  15. The true null distribution was generated with shuffled mRNA, and the P-values attributed to $u_{li}^{\mathrm{mRNA}}$ were computed. [Figure: $1-P$ for all mRNAs and for the top 3000 mRNAs]
  16. Confusion matrices (mRNA, miRNA). The null distribution and TD based unsupervised FE select almost the same mRNAs and miRNAs, although the threshold values differ.
  17. Conclusion. TD based unsupervised FE is equivalent to PP. Although the null hypothesis of a Gaussian distribution is not fulfilled, it is empirically coincident with the null distribution generated by shuffling, although the threshold P values differ (0.01 for TD based unsupervised FE and 0.1 for the null distribution).
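
The PP procedure on slides 6 and 12 can be sketched in a few lines. This is a minimal illustration, not the authors' code. It assumes the teacher vector is $-M/M_T$ for tumors and $M/M_N$ for normals, that $\sigma_b$ is the standard deviation of $b$ over features, and that the $\chi^2$ distribution has one degree of freedom (the slides do not state the degrees of freedom). The data, the function names (`pp_select`, `chi2_sf_1dof`, `benjamini_hochberg`) and the planted shift are hypothetical. Only the Python standard library is used so that the sketch runs without extra packages.

```python
import math
import random
import statistics

def chi2_sf_1dof(t):
    """Upper tail P(chi-squared with 1 degree of freedom > t)."""
    return math.erfc(math.sqrt(t / 2.0))

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value to the smallest to enforce monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def pp_select(x, n_tumor, threshold=0.01):
    """Select features whose projection b_i onto the class labels is an outlier."""
    n_samples = len(x[0])
    n_normal = n_samples - n_tumor
    # Assumed teacher vector: -M/M_T for tumors, M/M_N for normals.
    y = [-n_samples / n_tumor] * n_tumor + [n_samples / n_normal] * n_normal
    # b_i = sum_j x_ij y_j
    b = [sum(xij * yj for xij, yj in zip(row, y)) for row in x]
    # Assumption: sigma_b is the standard deviation of b over all features.
    sigma_b = statistics.pstdev(b)
    pvals = [chi2_sf_1dof((bi / sigma_b) ** 2) for bi in b]
    adjusted = benjamini_hochberg(pvals)
    return [i for i, p in enumerate(adjusted) if p < threshold]

# Hypothetical synthetic data: 200 features, 10 tumors followed by 10 normals.
# The first 5 features are shifted upward in the normal samples.
rng = random.Random(0)
n_features, n_tumor, n_normal, n_planted = 200, 10, 10, 5
x = [[rng.gauss(0.0, 1.0) for _ in range(n_tumor + n_normal)]
     for _ in range(n_features)]
for i in range(n_planted):
    for j in range(n_tumor, n_tumor + n_normal):
        x[i][j] += 8.0

selected = pp_select(x, n_tumor)
print("selected features:", selected)
```

Because $\sigma_b$ is estimated from all features, the test flags features that are outliers relative to the bulk, which is why the Gaussian assumption on the bulk matters for the thresholds discussed in the conclusion.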
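
The steps of PCA based unsupervised FE on slide 7 can be sketched in the same way. This is an illustrative sketch under stated assumptions, not the authors' implementation. Only the leading eigenvector of $XX^{T}$ is computed, by power iteration (swapped in here for a full eigendecomposition), and the leading $v_l$ is taken to be the "biologically interesting" one, which holds for this synthetic example but need not hold for real data. The data and all function names are hypothetical, $\sigma_l$ is assumed to be the standard deviation of $u_l$, and one degree of freedom is assumed for the $\chi^2$ distribution.

```python
import math
import random
import statistics

def project_onto_samples(x, u):
    """v = X^T u, one value per sample."""
    n_samples = len(x[0])
    return [sum(row[j] * ui for row, ui in zip(x, u)) for j in range(n_samples)]

def leading_eigenvector(x, n_iter=50, seed=0):
    """Leading unit eigenvector of X X^T by power iteration (X X^T is never formed)."""
    rng = random.Random(seed)
    u = [rng.gauss(0.0, 1.0) for _ in range(len(x))]
    for _ in range(n_iter):
        v = project_onto_samples(x, u)                               # v = X^T u
        w = [sum(xij * vj for xij, vj in zip(row, v)) for row in x]  # w = X X^T u
        norm = math.sqrt(sum(wi * wi for wi in w))
        u = [wi / norm for wi in w]
    return u

def chi2_sf_1dof(t):
    """Upper tail P(chi-squared with 1 degree of freedom > t)."""
    return math.erfc(math.sqrt(t / 2.0))

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def select_features(u, threshold=0.01):
    """P_i = P_chi2[> (u_i / sigma)^2], BH-corrected, thresholded at 0.01."""
    sigma = statistics.pstdev(u)  # assumption: sigma_l is the spread of u_l
    pvals = [chi2_sf_1dof((ui / sigma) ** 2) for ui in u]
    adjusted = benjamini_hochberg(pvals)
    return [i for i, p in enumerate(adjusted) if p < threshold]

# Hypothetical synthetic data: 200 features, 10 tumors followed by 10 normals.
# The first 5 features are lower in tumors and higher in normals.
rng = random.Random(1)
n_features, n_tumor, n_normal, n_planted = 200, 10, 10, 5
x = [[rng.gauss(0.0, 1.0) for _ in range(n_tumor + n_normal)]
     for _ in range(n_features)]
for i in range(n_planted):
    for j in range(n_tumor + n_normal):
        x[i][j] += -5.0 if j < n_tumor else 5.0

u = leading_eigenvector(x)
v = project_onto_samples(x, u)   # inspect v to see whether it separates the classes
selected = select_features(u)
print("sample scores v:", [round(vj, 1) for vj in v])
print("selected features:", selected)
```

No class labels enter the computation of `u`; the labels are only used afterwards, when deciding by inspection whether `v` distinguishes tumors from normals.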
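
One way to see why PP and the PCA based procedure can give the same P-values follows from the definitions on slides 6, 7 and 12 alone. The sketch below is a reading of those definitions rather than a derivation quoted from the deck. It assumes that the selected $v_l$ is exactly proportional to the teacher vector $y$, with a hypothetical constant $c \neq 0$, and that $\lambda_l > 0$. For real data the proportionality holds only approximately.

```latex
% Definitions taken from the slides:
%   X X^T u_l = \lambda_l u_l,   v_l = X^T u_l,   b_i = \sum_j x_{ij} y_j
% Assumption (not from the slides): v_l = c y with c \neq 0, and \lambda_l > 0.
\begin{align*}
X v_l = X X^{T} u_l = \lambda_l u_l
  &\quad\Longrightarrow\quad
  u_{li} = \frac{1}{\lambda_l} \sum_{j} x_{ij} v_{lj}, \\
v_l = c\, y
  &\quad\Longrightarrow\quad
  u_{li} = \frac{c}{\lambda_l} \sum_{j} x_{ij} y_j = \frac{c}{\lambda_l}\, b_i, \\
u_l = \frac{c}{\lambda_l}\, b
  &\quad\Longrightarrow\quad
  \sigma_l = \frac{|c|}{\lambda_l}\, \sigma_b
  \quad\Longrightarrow\quad
  \left( \frac{u_{li}}{\sigma_l} \right)^{2} = \left( \frac{b_i}{\sigma_b} \right)^{2}.
\end{align*}
```

Since the P-values depend only on the ratio of each component to its standard deviation, the overall scale $c/\lambda_l$ cancels and both procedures rank and threshold the features identically under this assumption.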