Tensor decomposition based unsupervised feature extraction with optimized standard deviation applied to identification of differential gene expression, DNA methylation and histone modification

Y-h. Taguchi
December 07, 2022

Presentation at ISAIC2022
https://www.isaic-conf.com/#/
10th Dec 2022


Transcript

  1. Tensor decomposition based unsupervised feature extraction with optimized standard deviation applied to identification of differential gene expression, DNA methylation and histone modification. Y-h. Taguchi, Department of Physics, Chuo University, Tokyo, Japan. The contents were published in the following paper and two bioRxiv preprints: https://rdcu.be/c0WE8 (Sci. Rep.), https://doi.org/10.1101/2022.04.02.486807, https://doi.org/10.1101/2022.04.29.490081. ISAIC2022
  2. Basic claims: (1) Principal component analysis (PCA) based and tensor decomposition (TD) based unsupervised feature extraction (FE), applied to identification of differentially expressed genes (DEGs), can outperform various state-of-the-art methods including DESeq2 when the standard deviations (SDs) used to generate the null hypothesis (Gaussian distribution of principal components) are optimized. (2) They are also applicable to identification of differentially methylated cytosines (DMCs) and of differential histone modification, without any specific modification of the method.
  3. The original (w/o SD optimization) PCA/TD based unsupervised FE: (1) Apply PCA to matrices (e.g., genes ⨉ samples) or TD to tensors (e.g., genes ⨉ samples ⨉ tissues) and get vectors attributed separately to genes, samples, or tissues. (2) Select the vectors of interest among those attributed to samples and tissues. (3) Select genes whose contributions to the corresponding gene vectors are large, judged against the null hypothesis that the components of those vectors follow a Gaussian distribution.
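The three steps on slide 3 can be sketched for the matrix (PCA) case as below. This is a minimal sketch under stated assumptions, not the presenter's code: the function names, the per-sample centering, the Benjamini-Hochberg correction, the 0.1 threshold, and the manual choice of `component` (standing in for step 2) are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values (assumed correction method)."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest P-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted


def pca_unsupervised_fe(x, component, threshold=0.1):
    """Select genes (rows of x) that contribute strongly to one component.

    x         : genes x samples matrix (hypothetical input)
    component : index of the gene vector, chosen by the user after
                inspecting the matching sample vector (step 2)
    """
    # Step 1: PCA via SVD. Columns of u are gene vectors and rows of vt
    # are sample vectors. Centering each sample is an assumption.
    centered = x - x.mean(axis=0, keepdims=True)
    u, _, vt = np.linalg.svd(centered, full_matrices=False)
    gene_vector = u[:, component]
    # Step 3: Gaussian null hypothesis with the plain SD (no optimization).
    sigma = gene_vector.std()
    p = stats.chi2.sf((gene_vector / sigma) ** 2, df=1)
    selected = np.flatnonzero(bh_adjust(p) < threshold)
    return selected, vt[component]
```

The returned sample vector lets the caller check that the chosen component really separates the classes of interest before trusting the selected genes.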
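  4. [Figure: workflow. PCA decomposes a matrix (gene ⨉ sample) into gene vectors and sample vectors; TD decomposes a tensor (gene ⨉ sample ⨉ tissue) into gene vectors, sample vectors and tissue vectors. Gene selection: histogram (frequency) of 1-P from 0 to 1 under the Gaussian distribution, with the DEGs marked.]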
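For the tensor case, one common way to obtain gene, sample and tissue vectors is a higher-order singular value decomposition (HOSVD). The slides do not name the TD algorithm, so using HOSVD here is an assumption, and `hosvd` is a hypothetical helper for a three-mode tensor only.

```python
import numpy as np


def hosvd(tensor):
    """HOSVD of a 3-mode tensor, e.g. genes x samples x tissues.

    Returns the core tensor and one orthonormal factor matrix per mode;
    the columns of each factor are the vectors attributed to that mode.
    """
    factors = []
    for mode in range(3):
        # Unfold: put this mode first and flatten the other two.
        unfolded = np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)
        u, _, _ = np.linalg.svd(unfolded, full_matrices=False)
        factors.append(u)
    # Core tensor: project every mode onto its singular vectors.
    core = np.einsum("ijk,ia,jb,kc->abc", tensor, *factors)
    return core, factors
```

Large absolute entries of the core tensor indicate which combinations of gene, sample and tissue vectors belong together, which is one way to decide which gene vector to test.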
  5. Although PCA/TD based unsupervised FE (w/o SD optimization) worked pretty well for various problems, it has some problems: (1) The histogram of 1-P does not fully obey the null hypothesis. (2) Too few genes are selected to believe that there are no false negatives. [Figure: histogram (frequency) of 1-P from 0 to 1, with the DEGs marked.]
  6. We have tried to resolve these problems...

  7. MAQC (benchmark data set for DEG identification *). RNA-seq: x_ij represents the expression of the ith gene in the jth sample. Samples: seven Universal Human Reference RNA (UHRR) vs seven Human Brain Reference RNA (HBRR), measured for 40933 genes (done by the presenter). (*) https://www.fda.gov/science-research/bioinformatics-tools/microarraysequencing-quality-control-maqcseqc
  8. [Figure: MA plot. Axes: gene-wise mean of log x_ij and gene-wise log FC ratio between the two classes, each shown with its density distribution.]
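The two MA-plot quantities on slide 8 can be computed as in the following sketch. The base-2 logarithm, the pseudocount of 1 (to avoid log 0 for zero counts), and the name `ma_values` are assumptions; the slide does not state how the logarithm was taken.

```python
import numpy as np


def ma_values(x, is_class_a, pseudocount=1.0):
    """Gene-wise mean of log x_ij and log fold-change between two classes.

    x          : genes x samples expression matrix
    is_class_a : boolean array over samples, True for the first class
    """
    log_x = np.log2(x + pseudocount)
    mean_log = log_x.mean(axis=1)
    # Difference of class means of log expression = log FC ratio.
    log_fc = log_x[:, is_class_a].mean(axis=1) - log_x[:, ~is_class_a].mean(axis=1)
    return mean_log, log_fc
```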
  9. [Figure: gene-wise embedding by PCA. Axes: u_1i and u_2i, each shown with its density distribution.]
  10. Sample-wise principal components from PCA, v_1j and v_2j: v_1j corresponds to the mean of log x_ij and v_2j corresponds to the log FC ratio, so the PCA embedding corresponds to the MA plot.
  11. PCA: cumulative contribution of the components l=1 and l=2. The embedding is almost 2D.
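The cumulative contribution on slide 11 is the running share of total variance carried by the leading components. A minimal sketch, assuming the contribution of component l is its squared singular value divided by the sum of all squared singular values (`cumulative_contribution` is a hypothetical name):

```python
import numpy as np


def cumulative_contribution(x):
    """Cumulative share of the squared singular values of x."""
    s = np.linalg.svd(x, compute_uv=False)
    share = s ** 2 / np.sum(s ** 2)
    return np.cumsum(share)
```

If the second entry of the result is already close to 1, the data are well described by a two-dimensional embedding.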

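  12. Null hypothesis: u_2i obeys a Gaussian distribution. Left: histogram of 1-P_i, where P_i is computed from the cumulative χ2 distribution, h_n is the count in the nth bin, and n_0 is the bin at which the adjusted P equals 0.1. Right: the optimal σ minimizes σ_h, the standard deviation of h_n over the bins below n_0. Select genes with adjusted P_i < 0.1.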
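The SD optimization on slide 12 can be sketched as a grid search: for each candidate σ, compute P_i, build the histogram of 1-P_i, and measure how flat it is over the bins below n_0. Several details here are assumptions, not taken from the slides: the Benjamini-Hochberg correction, the 100 bins, the grid search itself, and dividing the SD of h_n by its mean (added so that a very small σ, which leaves almost no unselected genes, is not rewarded).

```python
import numpy as np
from scipy import stats


def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values (assumed correction method)."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted


def histogram_spread(u, sigma, n_bins=100, threshold=0.1):
    """Relative spread of the 1-P histogram over the bins below n_0."""
    p = stats.chi2.sf((u / sigma) ** 2, df=1)
    selected = bh_adjust(p) < threshold
    counts, edges = np.histogram(1.0 - p, bins=n_bins, range=(0.0, 1.0))
    n0 = n_bins
    if selected.any():
        # n_0: the bin holding the smallest 1-P among the selected genes.
        cutoff = (1.0 - p[selected]).min()
        n0 = min(int(np.searchsorted(edges, cutoff, side="right")) - 1,
                 n_bins - 1)
    null_counts = counts[:n0]
    if null_counts.size < 2 or null_counts.mean() == 0:
        return np.inf, selected
    # SD of h_n divided by its mean (normalization is an assumption).
    return null_counts.std() / null_counts.mean(), selected


def optimize_sd(u, sigmas, **kwargs):
    """Return the sigma with the flattest null histogram and its selection."""
    spreads = [histogram_spread(u, s, **kwargs)[0] for s in sigmas]
    best = float(sigmas[int(np.argmin(spreads))])
    return best, histogram_spread(u, best, **kwargs)[1]
```

On synthetic data where most components are standard Gaussian and a minority are large, the plain SD of `u` is inflated by the large components, so the optimized σ is smaller and more genes pass the threshold.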
  13. DESeq2: empirical dispersion relation. PCA: naturally satisfied. “Highly expressed genes should be more likely to be selected.” [Figure: plots with axes μ and σ^2.]
  14. Biological validation: PCA vs DESeq2, tissue specificity.

  15. PCA vs DESeq2 vs edgeR vs NOISeq vs voom: (A) KEGG, (B) GO BP, (C) Human Gene Atlas.
  16. PCA based unsupervised FE with optimized SD outperforms various state-of-the-art methods (DESeq2, edgeR, NOISeq, voom) while assuming neither an empirical dispersion relation nor a negative binomial distribution.
  17. Next, we tried to apply them to DNA methylation...
  18. Application to DMC identification (GSE42308, microarray). DMR: known differentially methylated regions. DHS: known DNase I hypersensitive sites.
  19. Application to DMC identification (EH1072, sequencing): t test of the P-values attributed by PCA, between DHS and non-DHS. [Figure axis: chromosome.]
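The comparison on slide 19 can be sketched as a two-sample t test of the PCA-derived P-values inside versus outside DHS. Running it separately for each chromosome is inferred from the figure's axis label, and the use of Welch's unequal-variance test, the function name and the input arrays are all assumptions.

```python
import numpy as np
from scipy import stats


def dhs_ttest_by_chromosome(p_values, in_dhs, chromosome):
    """t test of P-values between DHS and non-DHS sites, per chromosome.

    p_values   : P-value attributed to each site (hypothetical input)
    in_dhs     : boolean array, True if the site lies in a known DHS
    chromosome : array of chromosome labels, one per site
    """
    results = {}
    for chrom in np.unique(chromosome):
        on_chrom = chromosome == chrom
        inside = p_values[on_chrom & in_dhs]
        outside = p_values[on_chrom & ~in_dhs]
        # Welch's t test (unequal variances) is an assumption here.
        results[str(chrom)] = stats.ttest_ind(inside, outside, equal_var=False)
    return results
```

A negative statistic with a small P-value for a chromosome means that sites inside DHS received smaller P-values than sites outside, which is the direction expected if the selected cytosines are enriched in DHS.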
  20. Various state-of-the-art methods were compared. Microarray: ChAMP and COHCAP. Sequencing: DMRcate, DSS, and metilene. None of them performed better than PCA.
  21. COHCAP applied to GSE42308; ChAMP applied to GSE42308. DHS: known DNase I hypersensitive sites. DMR: known differentially methylated regions.
  22. DMRcate applied to EH1072. DSS takes more than a week. metilene failed to identify DMCs.
  23. PCA and TD based unsupervised FE with optimized SD can be applied to identification of DMCs without any specific modification of the method.
  24. Finally, we applied them to histone modification...

  25. Application to differential histone modification. Histograms of 1-P_i do not obey the Gaussian null hypothesis (double peak), but...
  26. Comparisons with other methods (H3K9me3, GSE24850): the number of histone modification experiments that overlap with the selected genes.
  27. Other histone modifications: the number of histone modification experiments that overlap with the selected genes.
  28. PCA and TD based unsupervised FE with optimized SD can be applied to identification of differential histone modification without any specific modification of the method.
  29. Conclusions: (1) Principal component analysis (PCA) based and tensor decomposition (TD) based unsupervised feature extraction (FE), applied to identification of differentially expressed genes (DEGs), can outperform various state-of-the-art methods including DESeq2 when the standard deviations (SDs) used to generate the null hypothesis (Gaussian distribution of principal components) are optimized. (2) They are also applicable to identification of differentially methylated cytosines (DMCs) and of differential histone modification, without any specific modification of the method.