Tensor decomposition based unsupervised feature extraction with optimized standard deviation applied to differentially expressed genes, DNA methylation and histone modification

WMCB2022 Tensor decomposition based unsupervised feature extraction with optimized standard deviation applied to differentially expressed genes, DNA methylation and histone modification
Y-h. Taguchi, Department of Physics, Chuo University, Tokyo, Japan
The contents were published in the following three preprints on bioRxiv:
https://doi.org/10.1101/2022.02.18.481115
https://doi.org/10.1101/2022.04.02.486807
https://doi.org/10.1101/2022.04.29.490081

WMCB2022 Basic claims
1. Principal component analysis (PCA)-based and tensor decomposition (TD)-based unsupervised feature extraction (FE), applied to the identification of differentially expressed genes (DEGs), can outperform various state-of-the-art methods, including DESeq2, when the standard deviations (SDs) used to generate the null hypothesis (a Gaussian distribution of principal components) are optimized.
2. They are also applicable, without any specific modification, to the identification of differentially methylated cytosines (DMCs) and of differential histone modification.

WMCB2022 The original (w/o SD optimization) PCA/TD based unsupervised FE
1. Apply PCA to matrices (e.g., genes ⨉ samples) or TD to tensors (e.g., genes ⨉ samples ⨉ tissues) to obtain vectors attributed separately to genes, samples, or tissues.
2. Select the vectors of interest among those attributed to samples and tissues.
3. Select the genes with larger contributions to the corresponding vectors attributed to genes, judged against the null hypothesis that the components of these vectors obey a Gaussian distribution.
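
The three steps above can be sketched for the matrix (PCA) case as follows. This is a minimal illustration under stated assumptions, not the presenter's implementation: the per-sample standardisation, the use of the label correlation to pick the vector of interest, the Benjamini-Hochberg correction, and the names `bh_adjust`, `chi2_pvalues` and `pca_unsupervised_fe` are all assumptions introduced here. The P-value uses the χ² distribution with one degree of freedom, whose upper tail equals erfc(|z|/√2).

```python
import math
import numpy as np


def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values (assumed correction method)."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest P-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted


def chi2_pvalues(u, sigma):
    """Upper-tail chi-squared P-values of (u_i / sigma)^2, one degree of freedom."""
    z = np.abs(np.asarray(u, dtype=float)) / sigma
    return np.array([math.erfc(v / math.sqrt(2.0)) for v in z])


def pca_unsupervised_fe(x, labels, threshold=0.1):
    """x: genes x samples matrix; labels: one class label per sample."""
    x = np.asarray(x, dtype=float)
    # Step 1: standardise each sample (column), then decompose (assumption).
    z = (x - x.mean(axis=0)) / x.std(axis=0)
    u, _, vt = np.linalg.svd(z, full_matrices=False)
    # Step 2: take the sample vector most correlated with the labels (assumption).
    corr = np.nan_to_num([abs(np.corrcoef(v, labels)[0, 1]) for v in vt])
    ell = int(np.argmax(corr))
    # Step 3: Gaussian null on the matching gene vector, SD not optimized.
    sigma = u[:, ell].std()
    adjusted = bh_adjust(chi2_pvalues(u[:, ell], sigma))
    return np.flatnonzero(adjusted < threshold), ell


# Synthetic usage: 30 of 1000 genes differ between the two classes.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 6))
x[:30, 3:] += 10.0
selected, ell = pca_unsupervised_fe(x, labels=[0, 0, 0, 1, 1, 1])
print(ell, len(selected))
```

Here the rows of `vt` play the role of the vectors attributed to samples and the columns of `u` the vectors attributed to genes.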

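WMCB2022 [Figure: schematic of the method. A matrix (gene ⨉ sample) is decomposed by PCA into gene vectors and sample vectors; a tensor (gene ⨉ sample ⨉ tissue) is decomposed by TD into gene vectors, sample vectors, and tissue vectors. Gene selection assumes a Gaussian distribution and uses $P_i = P_{\chi^2}\left[ > \left( \frac{u_{2i}}{\sigma} \right)^2 \right]$; the histogram shows frequency against $1-P$ (from 0 to 1), with the DEGs marked.]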
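
For the tensor case, one way to obtain vectors attributed separately to genes, samples, and tissues is a higher-order singular value decomposition (HOSVD). The deck does not state which TD algorithm is used, so HOSVD is an assumption here, and the helper names `hosvd` and `reconstruct` are hypothetical. The sketch uses only NumPy.

```python
import numpy as np


def hosvd(x):
    """Return (core, factors); factors[k] holds the vectors for mode k as columns."""
    factors = []
    for mode in range(x.ndim):
        # Unfold the tensor so that the chosen mode indexes the rows.
        unfolded = np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)
        u, _, _ = np.linalg.svd(unfolded, full_matrices=False)
        factors.append(u)
    core = x
    for mode, u in enumerate(factors):
        # Project each mode onto its singular vectors.
        core = np.moveaxis(np.tensordot(u.T, np.moveaxis(core, mode, 0), axes=1), 0, mode)
    return core, factors


def reconstruct(core, factors):
    """Multiply the core tensor back by every factor matrix."""
    x = core
    for mode, u in enumerate(factors):
        x = np.moveaxis(np.tensordot(u, np.moveaxis(x, mode, 0), axes=1), 0, mode)
    return x


# Synthetic genes x samples x tissues tensor.
rng = np.random.default_rng(0)
tensor = rng.normal(size=(50, 4, 3))
core, (gene_vecs, sample_vecs, tissue_vecs) = hosvd(tensor)
print(gene_vecs.shape, sample_vecs.shape, tissue_vecs.shape)
```

Gene selection would then proceed on a column of `gene_vecs`, chosen after inspecting the matching columns of `sample_vecs` and `tissue_vecs` and the core tensor.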

WMCB2022 Although PCA/TD based unsupervised FE (w/o SD optimization) worked quite well for various problems, it has some shortcomings:
1. The histogram of 1-P does not fully obey the null hypothesis.
2. Too few genes are selected to believe that there are no false negatives.
[Figure: histogram of 1-P (frequency against 1-P from 0 to 1), with the DEGs marked.]

WMCB2022 We have tried to resolve these problems…..

WMCB2022 MAQC (benchmark data set for DEGs *)
RNA-seq: $x_{ij}$ represents the expression of the ith gene in the jth sample.
Samples: seven Universal Human Reference RNA (UHRR) vs seven Human Brain Reference RNA (HBRR)
Measured for 40933 genes (done by the presenter)
(*) https://www.fda.gov/science-research/bioinformatics-tools/microarraysequencing-quality-control-maqcseqc

WMCB2022 [Figure: MA plot with density distributions. Axes: gene-wise mean of $\log x_{ij}$ and gene-wise log FC ratio between the two classes.]

WMCB2022 [Figure: gene-wise embedding by PCA, $u_{1i}$ and $u_{2i}$, with density distributions.]

WMCB2022 Sample-wise principal components (PCA)
$v_{1j}$ corresponds to the mean of $\log x_{ij}$ and $v_{2j}$ corresponds to the log FC ratio, so the PCA embedding corresponds to the MA plot.
[Figure: $v_{1j}$ and $v_{2j}$.]

WMCB2022 Almost 2D embedding by PCA
[Figure: cumulative contribution, with $\ell=1$ and $\ell=2$ marked.]

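WMCB2022 Null hypothesis: $u_{2i}$ obeys a Gaussian distribution.
$\sigma = \sqrt{\frac{\sum_i \left( u_{2i} - \langle u_{2i} \rangle \right)^2}{N}}$
$P_i = P_{\chi^2}\left[ > \left( \frac{u_{2i}}{\sigma} \right)^2 \right]$, where $P_{\chi^2}$ is the cumulative $\chi^2$ distribution.
$\sigma_h = \sqrt{\frac{\sum_{n<n_0} \left( h_n - \langle h_n \rangle \right)^2}{N_n}}$, where $h_n$ is the count in the nth bin of the histogram of $1-P_i$, $n_0$ is the bin at which the adjusted P-value reaches 0.1 (adjusted $P(n_0)=0.1$), and $N_n$ is the number of bins with $n<n_0$.
The optimal $\sigma$ minimizes $\sigma_h$.
Select genes with adjusted $P_i<0.1$.
[Figure, left and right panels: histograms $h_n$ against $n$, with $n_0$ marked.]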
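
The SD optimization described on this slide can be sketched as a grid search: for each candidate σ, recompute the P-values, build the histogram of 1-P_i, and measure how flat it is over the bins below n_0. This is a rough sketch under assumptions that are not in the deck: Benjamini-Hochberg as the P-value adjustment, 100 bins, a linear grid between 0.25 and 1 times the naive SD, and the function names `sigma_h` and `optimize_sigma`.

```python
import math
import numpy as np


def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values (assumed correction method)."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted


def chi2_pvalues(u, sigma):
    """Upper-tail chi-squared P-values of (u_i / sigma)^2, one degree of freedom."""
    z = np.abs(np.asarray(u, dtype=float)) / sigma
    return np.array([math.erfc(v / math.sqrt(2.0)) for v in z])


def sigma_h(u, sigma, n_bins=100, threshold=0.1):
    """SD of the histogram h_n of 1 - P_i, taken over the bins n < n_0."""
    p = chi2_pvalues(u, sigma)
    unselected = bh_adjust(p) >= threshold
    if not unselected.any():
        return math.inf
    h, _ = np.histogram(1.0 - p, bins=n_bins, range=(0.0, 1.0))
    # n_0: the bin holding the largest 1 - P_i among the unselected genes.
    n0 = int((1.0 - p[unselected].min()) * n_bins)
    if n0 < 10:  # too few bins left for a meaningful SD (arbitrary guard)
        return math.inf
    return float(h[:n0].std())


def optimize_sigma(u, n_grid=40):
    """Return (optimized sigma, naive sigma) for the component values u."""
    u = np.asarray(u, dtype=float)
    naive = float(u.std())
    # The lower bound stays away from zero, where nearly every gene is selected.
    grid = np.linspace(0.25 * naive, naive, n_grid)
    scores = [sigma_h(u, s) for s in grid]
    return float(grid[int(np.argmin(scores))]), naive


# Synthetic usage: 2000 null values with unit SD plus 100 large values.
rng = np.random.default_rng(1)
u = np.concatenate([rng.normal(0.0, 1.0, 2000), rng.choice([-8.0, 8.0], 100)])
sigma_opt, sigma_naive = optimize_sigma(u)
print(sigma_opt, sigma_naive)
```

In this synthetic case the naive SD is inflated by the large values, so fewer of them pass the threshold than with an SD close to that of the null part.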

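WMCB2022 DESeq2: empirical dispersion relation
$\frac{\sigma(\mu)^2}{\mu} = \sigma_0^2 + \sigma_1^2 \mu$
"Highly expressed genes should be more likely selected."
PCA: naturally satisfied.
$\Delta = \langle x_{ij} \rangle_{\mathrm{brain}} - \langle x_{ij} \rangle_{\mathrm{cntl}} \simeq u_{2i}$
$\mathrm{LFC} = \log_2 \frac{\langle x_{ij} \rangle_{\mathrm{brain}}}{\langle x_{ij} \rangle_{\mathrm{cntl}}} = \log_2 \left( 1 + \frac{\Delta}{\langle x_{ij} \rangle_{\mathrm{cntl}}} \right)$
[Figure: $\sigma^2$ against $\mu$.]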
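
The second form of the LFC on this slide follows by substituting the definition of Δ into the ratio. The short derivation below only spells out that step, with base-2 logarithms used throughout.

```latex
\begin{align}
\mathrm{LFC}
  &= \log_2 \frac{\langle x_{ij} \rangle_{\mathrm{brain}}}{\langle x_{ij} \rangle_{\mathrm{cntl}}} \\
  &= \log_2 \frac{\langle x_{ij} \rangle_{\mathrm{cntl}} + \Delta}{\langle x_{ij} \rangle_{\mathrm{cntl}}} \\
  &= \log_2 \left( 1 + \frac{\Delta}{\langle x_{ij} \rangle_{\mathrm{cntl}}} \right)
\end{align}
```

For a fixed LFC, Δ grows in proportion to the control expression, so a selection based on the magnitude of Δ tends to favour highly expressed genes.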

WMCB2022 Biological validation PCA vs DESeq2 Tissue specificity

WMCB2022 (A) KEGG (B) GO BP (C) Human gene atlas
PCA vs DESeq2 vs edgeR vs NOISeq vs voom

WMCB2022 PCA based unsupervised FE with optimized SD outperforms various state-of-the-art methods (DESeq2, edgeR, NOISeq, voom) while assuming neither an empirical dispersion relation nor a negative binomial distribution.

WMCB2022 Next, we tried to apply them to DNA methylation
…...

WMCB2022 Application to DMC identification (GSE42308, microarray)
DMR: known differentially methylated regions
DHS: known DNase I hypersensitive sites

WMCB2022 Application to DMC identification (EH1072, sequencing)
t test of the P-values attributed by PCA between DHS and non-DHS
[Figure: results by chromosome.]

WMCB2022 Various state-of-the-art methods were compared.
Microarray: ChAMP and COHCAP
Sequencing: DMRcate, DSS, and metilene
None of them outperformed PCA.

WMCB2022 [Figure panels: COHCAP applied to GSE42308; ChAMP applied to GSE42308.]
DHS: known DNase I hypersensitive sites
DMR: known differentially methylated regions

WMCB2022 DMRcate applied to EH1072
DSS takes more than a week.
metilene failed to identify DMCs.

WMCB2022 PCA and TD based unsupervised FE with optimized SD can be applied to the identification of DMCs without any specific modification.

WMCB2022 Finally, we applied them to histone modification….

WMCB2022 Application to differential histone modification
Histograms of $1-P_i$ do not obey Gaussian (double peak) but….

WMCB2022 Comparisons with other methods (H3K9me3, GSE24850)
The number of histone modification experiments overlapping with the selected genes

WMCB2022 Other histone modifications
The number of histone modification experiments overlapping with the selected genes

WMCB2022 PCA and TD based unsupervised FE with optimized SD can be applied to the identification of differential histone modification without any specific modification.

WMCB2022 Conclusions
1. Principal component analysis (PCA)-based and tensor decomposition (TD)-based unsupervised feature extraction (FE), applied to the identification of differentially expressed genes (DEGs), can outperform various state-of-the-art methods, including DESeq2, when the standard deviations (SDs) used to generate the null hypothesis (a Gaussian distribution of principal components) are optimized.
2. They are also applicable, without any specific modification, to the identification of differentially methylated cytosines (DMCs) and of differential histone modification.
