Tensor decomposition based unsupervised feature extraction with optimized standard deviation applied to differentially expressed genes, DNA methylation and histone modification
Y-h. Taguchi, Department of Physics, Chuo University, Tokyo, Japan
The contents were published in the following three preprints on bioRxiv:
https://doi.org/10.1101/2022.02.18.481115
https://doi.org/10.1101/2022.04.02.486807
https://doi.org/10.1101/2022.04.29.490081
1. Tensor decomposition (TD) based unsupervised feature extraction (FE), applied to the identification of differentially expressed genes (DEGs), can outperform various state-of-the-art methods, including DESeq2, when the standard deviations (SDs) used to generate the null hypothesis (a Gaussian distribution of principal components) are optimized.
2. It is also applicable to the identification of differentially methylated cytosines (DMCs) and of differential histone modification, without any special modification of the method.
1. Apply PCA to matrices (e.g., genes ⨉ samples) or TD to tensors (e.g., genes ⨉ samples ⨉ tissues) and obtain vectors attributed separately to genes, samples, or tissues.
2. Select the vectors of interest among those attributed to samples and tissues.
3. Select the genes whose contributions to the corresponding gene vectors are large, judged against the null hypothesis that the components of those vectors follow a Gaussian distribution.
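The three steps above can be sketched for the matrix (PCA) case as follows. This is a minimal illustration and not the presenter's code: the function names `pca_fe` and `bh_adjust`, the centering of each sample over genes, the use of correlation with class labels to pick the vector of interest, and Benjamini-Hochberg as the P-value adjustment are all assumptions made for the example. The TD case is not covered.

```python
import numpy as np
from scipy.stats import chi2


def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values (assumed adjustment method)."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest P-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


def pca_fe(x, labels, alpha=0.1):
    """PCA based unsupervised FE on a genes x samples matrix (sketch).

    Returns a boolean mask of selected genes, the adjusted P-values,
    and the gene vector that was used.
    """
    # Step 1: PCA via SVD. Columns of u are attributed to genes,
    # rows of vt to samples. Centering per sample is an assumption.
    xc = x - x.mean(axis=0, keepdims=True)
    u, s, vt = np.linalg.svd(xc, full_matrices=False)

    # Step 2: pick the sample vector most associated with the labels
    # (absolute correlation is an assumed selection criterion).
    corr = np.array([abs(np.corrcoef(v, labels)[0, 1]) for v in vt])
    ell = int(np.nanargmax(corr))

    # Step 3: Gaussian null on the components of the gene vector,
    # so (u / sigma)^2 follows a chi-squared distribution with 1 d.o.f.
    u_ell = u[:, ell]
    sigma = u_ell.std()
    p = chi2.sf((u_ell / sigma) ** 2, df=1)
    p_adj = bh_adjust(p)
    return p_adj < alpha, p_adj, u_ell
```

Here `sigma` is simply the SD of the selected gene vector, which includes the contribution of the DEGs themselves; this is the quantity that the optimized-SD variant replaces.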
well for various problems, they have some problems.
1. The histogram of 1-P does not fully obey the null hypothesis.
2. Too few genes are selected to believe that there are no false negatives.
[Figure: histogram of 1-P; horizontal axis 1-P (0 to 1), vertical axis frequency; the DEG bin is marked]
represents the expression of the ith gene in the jth sample
Samples: seven Universal Human Reference RNA (UHRR) vs. seven Human Brain Reference RNA (HBRR)
Measured for 40933 genes (done by the presenter)
(*) https://www.fda.gov/science-research/bioinformatics-tools/microarraysequencing-quality-control-maqcseqc
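$$\sigma = \sqrt{\frac{1}{N}\sum_{i}\left(u_{2i}-\langle u_{2i}\rangle\right)^2}$$

$$P_i = P_{\chi^2}\left[>\left(\frac{u_{2i}}{\sigma}\right)^2\right]$$

where $P_{\chi^2}[>x]$ is the cumulative $\chi^2$ distribution.

Histogram of $1-P_i$: $h_n$ is the count in the nth bin, and $n_0$ is the bin at which the adjusted P reaches 0.1, i.e. adjusted $P(n_0)=0.1$.

$$\sigma_h = \sqrt{\frac{1}{N_n}\sum_{n<n_0}\left(h_n-\langle h_n\rangle\right)^2}$$

where $N_n$ is the number of bins with $n<n_0$.

The optimal $\sigma$ minimizes $\sigma_h$.
Select genes with adjusted $P_i<0.1$.
[Figure: left and right panels, each showing $h_n$ against $n$ with $n_0$ marked]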
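The SD optimization described above can be sketched as a simple grid search. This is an illustration under stated assumptions, not the presenter's implementation: the names `bh_adjust`, `histogram_sd` and `optimize_sigma`, the 100 bins, the search grid from 0.2 to 1.0 times the plain SD, and Benjamini-Hochberg adjustment are all choices made for the example. In addition, the bin counts are divided by their mean before the SD is taken; this normalization is an assumption introduced here so that a very small σ, which empties the null bins, is not trivially preferred.

```python
import numpy as np
from scipy.stats import chi2


def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values (assumed adjustment method)."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


def histogram_sd(u, sigma, n_bins=100, alpha=0.1):
    """SD of the histogram of 1-P over the bins n < n_0 (sigma_h)."""
    p = chi2.sf((u / sigma) ** 2, df=1)
    p_adj = bh_adjust(p)
    h, _ = np.histogram(1.0 - p, bins=n_bins, range=(0.0, 1.0))

    # n_0: first bin that contains a gene with adjusted P < alpha.
    significant = p_adj < alpha
    n0 = n_bins
    if significant.any():
        n0 = min(int((1.0 - p[significant]).min() * n_bins), n_bins - 1)

    if n0 < 2:
        return np.inf
    null_bins = h[:n0].astype(float)
    if null_bins.mean() == 0:
        return np.inf
    # Assumption: normalize counts by their mean before taking the SD.
    return (null_bins / null_bins.mean()).std()


def optimize_sigma(u, n_grid=60, n_bins=100, alpha=0.1):
    """Grid search for the sigma that minimizes sigma_h (sketch)."""
    s = u.std()
    grid = np.geomspace(0.2 * s, s, n_grid)
    scores = [histogram_sd(u, g, n_bins=n_bins, alpha=alpha) for g in grid]
    return float(grid[int(np.argmin(scores))])
```

With the optimized value in hand, genes would be selected by recomputing `chi2.sf((u / sigma_opt) ** 2, df=1)`, adjusting, and keeping those with adjusted P below 0.1. A continuous optimizer could replace the grid; the grid is used here only because it is easy to inspect.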
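empirical dispersion relation

$$\Delta = \langle x_{ij}\rangle_{\mathrm{brain}} - \langle x_{ij}\rangle_{\mathrm{cntl}} \simeq u_{2i}$$

$$\mathrm{LFC} = \log_2\frac{\langle x_{ij}\rangle_{\mathrm{brain}}}{\langle x_{ij}\rangle_{\mathrm{cntl}}} = \log_2\left(1+\frac{\Delta}{\langle x_{ij}\rangle_{\mathrm{cntl}}}\right)$$

PCA: naturally satisfied. "Highly expressed genes should be more likely selected"
[Figure: axes labeled $\mu$ and $\sigma^2$]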