Tensor decomposition based unsupervised feature extraction with optimized standard deviation applied to differentially expressed genes, DNA methylation and histone modification

Presentation at WMCB2022
https://sites.google.com/view/wmcb2022/

9th June 2022, virtual workshop

Y-h. Taguchi

June 09, 2022

Transcript

  1. WMCB2022. Tensor decomposition based unsupervised feature extraction with optimized standard deviation applied to differentially expressed genes, DNA methylation and histone modification
      Y-h. Taguchi, Department of Physics, Chuo University, Tokyo, Japan
      The contents were published in the following three preprints on bioRxiv:
      https://doi.org/10.1101/2022.02.18.481115
      https://doi.org/10.1101/2022.04.02.486807
      https://doi.org/10.1101/2022.04.29.490081
  2. Basic claims
      1. Principal component analysis (PCA) based and tensor decomposition (TD) based unsupervised feature extraction (FE), applied to the identification of differentially expressed genes (DEGs), can outperform various state-of-the-art methods including DESeq2 when the standard deviations (SDs) used to generate the null hypothesis (Gaussian distribution of principal components) are optimized.
      2. They are also applicable to the identification of differentially methylated cytosines (DMCs) and of differential histone modification, without any specific modification of the method.
  3. The original (w/o SD optimization) PCA/TD based unsupervised FE
      1. Apply PCA to matrices (e.g., genes ⨉ samples) or TD to tensors (e.g., genes ⨉ samples ⨉ tissues) and get vectors attributed separately to genes, samples, or tissues.
      2. Select the vectors of interest among those attributed to samples and tissues.
      3. Select the genes whose components in the corresponding gene vectors are large, judged against the null hypothesis that the components of those vectors follow a Gaussian distribution.
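The three steps above can be sketched for the matrix (PCA) case as follows. This is a minimal illustration, not the presenter's code: the column standardization, the use of SVD to obtain the gene vectors and sample vectors, the Benjamini-Hochberg adjustment, and the names `bh_adjust` and `pca_unsupervised_fe` are all assumptions made for this example.

```python
import numpy as np
from scipy.stats import chi2


def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values (assumed correction method)."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest P-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


def pca_unsupervised_fe(x, component, threshold=0.1):
    """Select genes from a genes x samples matrix.

    `component` is the zero-based index of the vector of interest
    (component=1 is the second vector, u_2). Returns the boolean mask of
    selected genes, the raw P-values, and the matching sample vector.
    """
    x = np.asarray(x, dtype=float)
    # Assumption: standardize every sample (column) before PCA.
    z = (x - x.mean(axis=0)) / x.std(axis=0)
    # Step 1: left singular vectors are attributed to genes,
    # right singular vectors to samples.
    u, _, vt = np.linalg.svd(z, full_matrices=False)
    u_l = u[:, component]
    # Step 3: Gaussian null hypothesis with the plain (unoptimized) SD.
    sigma = u_l.std()
    p = chi2.sf((u_l / sigma) ** 2, df=1)
    selected = bh_adjust(p) < threshold
    return selected, p, vt[component]
```

Step 2 (choosing the vector of interest) is left to the caller: inspect the returned sample vector and keep the component whose values separate the classes.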
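  4. [Diagram] Matrix (gene ⨉ sample) → PCA → gene vectors and sample vectors. Tensor (gene ⨉ sample ⨉ tissue) → TD → gene vectors, sample vectors and tissue vectors.
      Gene selection under the Gaussian distribution: $P_i = P_{\chi^2}\left[ > \left( \frac{u_{2i}}{\sigma} \right)^2 \right]$
      [Figure: histogram of frequency versus $1-P$ (0 to 1), DEGs marked]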
  5. Although PCA/TD based FE (w/o SD optimization) worked pretty well for various problems, it has some problems.
      1. The histogram of $1-P$ does not fully obey the null hypothesis.
      2. Too few genes are selected for us to believe that there are no false negatives.
      [Figure: histogram of frequency versus $1-P$ (0 to 1), DEGs marked]
  6. MAQC (benchmark data set for DEG *)
      RNA-seq: $x_{ij}$ represents the expression of the $i$th gene in the $j$th sample.
      Samples: seven Universal Human Reference RNA (UHRR) vs seven Human Brain Reference RNA (HBRR).
      Measured for 40933 genes (done by the presenter).
      (*) https://www.fda.gov/science-research/bioinformatics-tools/microarraysequencing-quality-control-maqcseqc
  7. Gene-wise mean of $\log x_{ij}$ and gene-wise log FC ratio between the two classes.
      [Figures: density distribution; MA plot]
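The two gene-wise quantities plotted on this slide can be computed as in the sketch below. The base-2 logarithm, the pseudocount that guards against zero counts, and the function name `ma_coordinates` are assumptions for illustration; the slide does not state which were used.

```python
import numpy as np


def ma_coordinates(x, is_case, pseudocount=1.0):
    """Return (gene-wise mean log expression, gene-wise log fold change).

    x: genes x samples expression matrix.
    is_case: boolean vector over samples marking the second class.
    """
    is_case = np.asarray(is_case, dtype=bool)
    # Assumption: log2 with a pseudocount so that zero counts are defined.
    logx = np.log2(np.asarray(x, dtype=float) + pseudocount)
    mean_log = logx.mean(axis=1)
    log_fc = logx[:, is_case].mean(axis=1) - logx[:, ~is_case].mean(axis=1)
    return mean_log, log_fc
```

Plotting `log_fc` against `mean_log` gives an MA plot; a histogram of either vector gives the corresponding density distribution.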
  8. PCA: sample-wise principal components $v_{1j}$ and $v_{2j}$.
      The first component corresponds to the mean of $\log x_{ij}$, the second to the log FC ratio; together they correspond to the MA plot.
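  9. Null hypothesis: $u_{2i}$ obeys a Gaussian distribution.
      $\sigma = \sqrt{\frac{\sum_i \left( u_{2i} - \langle u_{2i} \rangle \right)^2}{N}}$
      $P_i = P_{\chi^2}\left[ > \left( \frac{u_{2i}}{\sigma} \right)^2 \right]$, where $P_{\chi^2}$ is the cumulative $\chi^2$ distribution.
      Histogram of $1-P_i$: $h_n$ is the count in the $n$th bin, and $n_0$ is the bin at which the adjusted $P(n_0) = 0.1$.
      $\sigma_h = \sqrt{\frac{\sum_{n<n_0} \left( h_n - \langle h_n \rangle \right)^2}{N_n}}$, where $N_n$ counts the bins with $n < n_0$.
      The optimal $\sigma$ minimizes $\sigma_h$. Select genes with adjusted $P_i < 0.1$.
      [Figures, left and right: histograms of $h_n$ versus $n$ with $n_0$ marked; the right panel uses the optimal $\sigma$]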
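The optimization on this slide can be sketched as a one-dimensional search over $\sigma$. This is a rough illustration under several assumptions that the slide does not specify: a grid search restricted to 0.3 to 1.0 times the plain SD, 100 histogram bins, Benjamini-Hochberg adjustment, and $n_0$ taken as the number of bins lying entirely below the $1-P$ value of the least significant selected gene. The function names are hypothetical.

```python
import numpy as np
from scipy.stats import chi2


def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values (assumed correction method)."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


def histogram_sd(u, sigma, n_bins=100, threshold=0.1):
    """sigma_h: SD of the bin counts h_n of 1 - P_i over the bins n < n_0."""
    p = chi2.sf((u / sigma) ** 2, df=1)
    selected = bh_adjust(p) < threshold
    # Largest raw P-value among selected genes marks the cut in 1 - P.
    p_cut = p[selected].max() if selected.any() else 0.0
    n0 = int(np.floor((1.0 - p_cut) * n_bins))
    if n0 < 2:
        return np.inf
    h, _ = np.histogram(1.0 - p, bins=n_bins, range=(0.0, 1.0))
    return h[:n0].std()


def optimize_sigma(u, factors=None, n_bins=100, threshold=0.1):
    """Return the candidate sigma with the flattest histogram below n_0."""
    u = np.asarray(u, dtype=float)
    if factors is None:
        # Assumption: search between 0.3x and 1.0x the plain SD.
        factors = np.linspace(0.3, 1.0, 71)
    candidates = np.asarray(factors) * u.std()
    scores = [histogram_sd(u, s, n_bins, threshold) for s in candidates]
    return candidates[int(np.argmin(scores))]
```

The search range is bounded on purpose: with raw bin counts, a $\sigma$ that is far too small selects almost every gene, leaves the bins below $n_0$ nearly empty, and makes $\sigma_h$ trivially small. Genes are then selected by recomputing $P_i$ with the returned $\sigma$ and keeping those with adjusted $P_i < 0.1$.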
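  10. DESeq2: empirical dispersion relation $\frac{\sigma(\mu)^2}{\mu} = \sigma_0^2 + \sigma_1^2 \mu$
      $\Delta = \langle x_{ij} \rangle_{\mathrm{brain}} - \langle x_{ij} \rangle_{\mathrm{cntl}} \simeq u_{2i}$
      $\mathrm{LFC} = \log_2 \frac{\langle x_{ij} \rangle_{\mathrm{brain}}}{\langle x_{ij} \rangle_{\mathrm{cntl}}} = \log_2 \left( 1 + \frac{\Delta}{\langle x_{ij} \rangle_{\mathrm{cntl}}} \right)$
      PCA: naturally satisfied. "Highly expressed genes should be more likely to be selected."
      [Figures: $\sigma^2$ versus $\mu$]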
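The step from the ratio form of the LFC to the form in $\Delta$ is short, and writing it out shows why the slide links selection to expression level. The second display is one possible reading of the quoted statement, not something stated on the slide.

```latex
\begin{align}
\langle x_{ij} \rangle_{\mathrm{brain}}
  &= \langle x_{ij} \rangle_{\mathrm{cntl}} + \Delta, \\
\mathrm{LFC}
  &= \log_2 \frac{\langle x_{ij} \rangle_{\mathrm{cntl}} + \Delta}
                 {\langle x_{ij} \rangle_{\mathrm{cntl}}}
   = \log_2 \left( 1 + \frac{\Delta}{\langle x_{ij} \rangle_{\mathrm{cntl}}} \right), \\
\Delta
  &= \left( 2^{\mathrm{LFC}} - 1 \right) \langle x_{ij} \rangle_{\mathrm{cntl}} .
\end{align}
```

At a fixed LFC, $\Delta$ grows in proportion to the control-class mean, so a method that ranks genes by $\Delta$ gives more weight to highly expressed genes.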
  11. PCA vs DESeq2 vs edgeR vs NOISeq vs voom
      [Figure panels: (A) KEGG, (B) GO BP, (C) Human gene atlas]
  12. PCA based unsupervised FE with optimized SD outperforms various state-of-the-art methods (DESeq2, edgeR, NOISeq, voom) while assuming neither an empirical dispersion relation nor a negative binomial distribution.
  13. Application to DMC identification (GSE42308, microarray)
      DMR: known differentially methylated regions
      DHS: known DNase I hypersensitive sites
  14. Application to DMC identification (EH1072, sequencing)
      t test of the P-values attributed by PCA, between DHS and non-DHS.
      [Figure axis: chromosome]
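A comparison of this kind could be run per chromosome as in the sketch below. The slide does not say which variant of the t test was used, so Welch's unequal-variance test is an assumption here, as are the function name `dhs_ttest_by_chromosome` and the input layout (one P-value, one DHS flag and one chromosome label per cytosine).

```python
import numpy as np
from scipy.stats import ttest_ind


def dhs_ttest_by_chromosome(p_values, in_dhs, chromosome):
    """Compare PCA-attributed P-values inside and outside DHS.

    Returns {chromosome: (t statistic, t-test P-value)}. A negative
    statistic means P-values are smaller inside DHS than outside.
    """
    p_values = np.asarray(p_values, dtype=float)
    in_dhs = np.asarray(in_dhs, dtype=bool)
    chromosome = np.asarray(chromosome)
    results = {}
    for chrom in np.unique(chromosome):
        on_chrom = chromosome == chrom
        inside = p_values[on_chrom & in_dhs]
        outside = p_values[on_chrom & ~in_dhs]
        if inside.size < 2 or outside.size < 2:
            continue  # too few sites for a test on this chromosome
        # Assumption: Welch's t test (no equal-variance assumption).
        res = ttest_ind(inside, outside, equal_var=False)
        results[str(chrom)] = (float(res.statistic), float(res.pvalue))
    return results
```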
  15. Various state-of-the-art methods were compared.
      Microarray: ChAMP and COHCAP
      Sequencing: DMRcate, DSS, and metilene
      None of them performed better than PCA.
  16. COHCAP applied to GSE42308; ChAMP applied to GSE42308
      DHS: known DNase I hypersensitive sites
      DMR: known differentially methylated regions
  17. PCA and TD based unsupervised FE with optimized SD can be applied to the identification of DMCs without any specific modification.
  18. Comparisons with other methods (H3K9me3, GSE24850)
      The number of histone modification experiments overlapping with the selected genes.
  19. PCA and TD based unsupervised FE with optimized SD can be applied to the identification of differential histone modification without any specific modification.
  20. Conclusions
      1. Principal component analysis (PCA) based and tensor decomposition (TD) based unsupervised feature extraction (FE), applied to the identification of differentially expressed genes (DEGs), can outperform various state-of-the-art methods including DESeq2 when the standard deviations (SDs) used to generate the null hypothesis (Gaussian distribution of principal components) are optimized.
      2. They are also applicable to the identification of differentially methylated cytosines (DMCs) and of differential histone modification, without any specific modification of the method.