Y-h. Taguchi
December 07, 2022

Advanced unsupervised feature extraction finds novel application and software tool to select more reasonable differentially methylated cytosines

Presentation at GIW/ISCB ASIA 2022
https://www.iscb.org/giw-iscb-asia2022
13th Dec 2022

Transcript

  1. Advanced unsupervised feature extraction finds novel application and software tool to select more reasonable differentially methylated cytosines
    Y-H. Taguchi, Department of Physics, Chuo University, Tokyo 112-8551, Japan
    Turki Turki, Department of Computer Science, King Abdulaziz University, Jeddah 21589, Saudi Arabia
    Preprint: doi: https://doi.org/10.1101/2022.04.02.486807
    GIW2022
  2. Motivation
    Recently, we proposed an improved tensor decomposition (TD) based unsupervised feature extraction (FE) and successfully applied it to gene expression. We would like to see whether the method is applicable to DNA methylation without any methylation-specific modification.
  3. The original (w/o SD optimization) PCA/TD based unsupervised FE
    1. Apply PCA to matrices (e.g., genes ⨉ samples) or TD to tensors (e.g., genes ⨉ samples ⨉ tissues) to obtain vectors attributed separately to genes, samples, or tissues.
    2. Select the vectors of interest among those attributed to samples and tissues.
    3. Select genes whose contributions to the corresponding gene vectors are large (P-values are computed under the null hypothesis that the components of the vectors follow a Gaussian distribution).
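  4. [Figure: workflow schematic. A matrix (gene ⨉ sample) is decomposed by PCA into gene vectors and sample vectors; a tensor (gene ⨉ sample ⨉ tissue) is decomposed by TD into gene, sample, and tissue vectors. Gene selection: histogram of 1-P (0 to 1) against frequency under a Gaussian distribution, with DEGs marked.]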
  5. HOSVD (Higher Order Singular Value Decomposition)
    $x_{ijk} = \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3)\, u_{l_1 i}\, u_{l_2 j}\, u_{l_3 k}$
    Example: $x_{ijk}$ with N genes (i), M samples (j), and K tissues (k).
  6. [Figure: choosing the gene vector $u_{l_1 i}$ (i: genes). Some $l_1$ has the maximum $|G(l_1 l_2 l_3)|$. Case shown: $G(l_1 l_2 l_3) > 0$, with tDEG panels labeled "healthy > patients" and "healthy < patients".]
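The decomposition and the choice of $l_1$ can be illustrated with a minimal NumPy sketch. It is not the authors' software: the random toy tensor, the helper names `unfold` and `hosvd`, and the indices `l2 = 1`, `l3 = 0` standing in for the "vectors of interest" are all assumptions made for illustration.

```python
import numpy as np

def unfold(x, mode):
    """Matricize tensor x so that `mode` indexes the rows."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def hosvd(x):
    """Return the core tensor G and one orthonormal factor matrix per mode."""
    # Factor matrix of each mode: left singular vectors of the unfolding.
    factors = []
    for mode in range(x.ndim):
        u, _, _ = np.linalg.svd(unfold(x, mode), full_matrices=False)
        factors.append(u)
    # Core tensor: contract every mode of x with the transposed factor.
    core = x
    for mode, u in enumerate(factors):
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, mode)), 0, mode)
    return core, factors

# Toy tensor (assumption): N=30 genes, M=6 samples, K=4 tissues.
rng = np.random.default_rng(0)
x = rng.normal(size=(30, 6, 4))

core, factors = hosvd(x)

# x_ijk = sum over l1, l2, l3 of G(l1 l2 l3) u_{l1 i} u_{l2 j} u_{l3 k}
x_rec = np.einsum("abc,ia,jb,kc->ijk", core, *factors)

# Hypothetical sample/tissue vectors of interest (assumed indices).
l2, l3 = 1, 0
# Pick the l1 with the largest |G(l1 l2 l3)| for the chosen l2, l3.
l1 = int(np.argmax(np.abs(core[:, l2, l3])))
gene_vector = factors[0][:, l1]  # u_{l1 i}, one value per gene
```

With full-rank factors the reconstruction `x_rec` matches `x` up to floating-point error; in practice `l2` and `l3` would be chosen by inspecting the sample and tissue vectors.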
  7. $u_{l_1 i}$ is assumed to obey a Gaussian distribution.
    Gene selection criterion: $P_i$ must be corrected with a multiple comparison correction.
    [Figure: histogram of 1-P (0 to 1) against frequency, with DEGs marked.]
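The selection step on this slide can be sketched in plain Python. The sketch assumes the form given in the authors' published descriptions of the method, $P_i = P_{\chi^2}\left[> (u_{l_1 i} / \sigma_{l_1})^2\right]$ with one degree of freedom, and assumes Benjamini-Hochberg as the multiple comparison correction, which the slide does not name. The function names, the 0.01 default threshold here, and the toy vector are illustrative only.

```python
import math
import statistics

def gaussian_null_p_values(u, sigma=None):
    """P-values under the null that each component follows N(0, sigma^2).

    (u/sigma)^2 then follows a chi-squared distribution with 1 degree of
    freedom, whose upper tail equals erfc(|u| / (sigma * sqrt(2))).
    """
    if sigma is None:
        sigma = statistics.pstdev(u)  # SD computed from all components
    return [math.erfc(abs(x) / (sigma * math.sqrt(2.0))) for x in u]

def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted P-values, returned in input order."""
    n = len(p)
    order = sorted(range(n), key=lambda i: p[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest P downwards
        i = order[rank - 1]
        running_min = min(running_min, p[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def select_features(u, alpha=0.01, sigma=None):
    """Indices whose adjusted P-value falls below alpha."""
    adjusted = benjamini_hochberg(gaussian_null_p_values(u, sigma))
    return [i for i, a in enumerate(adjusted) if a < alpha]

# Toy gene vector (assumption): 210 small components plus two outliers.
u = [0.1 * ((i % 21) - 10) for i in range(210)] + [12.0, -12.0]
selected = select_features(u)
```

Only the two outlying components (indices 210 and 211) pass the threshold in this toy example.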
  8. Although PCA/TD based unsupervised FE (w/o SD optimization) worked well for various problems, it has some shortcomings.
    1. The histogram of 1-P does not fully obey the null hypothesis.
    2. Too few genes are selected to believe that there are no false negatives.
  9. [Figure, three panels]
    (A) Original: histogram of 1-P when σ is computed from the distribution.
    (B) Improved: histogram of 1-P with optimized σ (outliers are excluded when computing σ).
    (C) Ground truth.
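  10. Null hypothesis: $u_{2i}$ obeys a Gaussian distribution (gene expression).
    $P_i$ is computed from the cumulative χ2 distribution. In the histogram of $1-P_i$, $h_n$ is the count in the nth bin, and $n_0$ is the bin at which adjusted $P(n_0) = 0.1$.
    The optimal σ minimizes $\sigma_h$, the standard deviation of $h_n$ over the bins up to $n_0$.
    Select genes with adjusted $P_i < 0.01$.
    [Figure: left and right panels; the right panel shows the σ that minimizes $\sigma_h$.]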
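The SD optimization can be sketched as a grid search in plain Python. The thresholds 0.1 and 0.01 come from the slide; everything else is an assumption made for illustration: the synthetic data, 20 bins, a search grid of 0.3 to 1.0 times the naive SD, Benjamini-Hochberg as the correction, and the rule that the retained bins are those below the lowest bin containing a component with adjusted P < 0.1.

```python
import math
import random
import statistics

def p_values(u, sigma):
    """Chi-squared (1 d.o.f.) upper-tail P-values under N(0, sigma^2)."""
    return [math.erfc(abs(x) / (sigma * math.sqrt(2.0))) for x in u]

def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values, returned in input order."""
    n = len(p)
    order = sorted(range(n), key=lambda i: p[i])
    adjusted, running_min = [0.0] * n, 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def histogram_sd(u, sigma, n_bins=20, p0=0.1):
    """SD of the 1-P histogram over bins free of adjusted P < p0."""
    p = p_values(u, sigma)
    adj = bh_adjust(p)
    bins = [min(int((1.0 - pi) * n_bins), n_bins - 1) for pi in p]
    counts = [0] * n_bins
    for b in bins:
        counts[b] += 1
    significant = [b for b, a in zip(bins, adj) if a < p0]
    n0 = min(significant) if significant else n_bins
    if n0 < 2:  # too few bins left to measure flatness
        return float("inf")
    return statistics.pstdev(counts[:n0])

def optimize_sigma(u, n_grid=40, lo=0.3):
    """Grid search for the sigma giving the flattest retained histogram."""
    naive = statistics.pstdev(u)  # SD computed with outliers included
    grid = [naive * (lo + (1.0 - lo) * k / (n_grid - 1)) for k in range(n_grid)]
    return min(grid, key=lambda s: histogram_sd(u, s)), naive

def select(u, sigma, alpha=0.01):
    """Indices with adjusted P < alpha for the given sigma."""
    adj = bh_adjust(p_values(u, sigma))
    return [i for i, a in enumerate(adj) if a < alpha]

# Synthetic vector (assumption): 2000 null components and 100 outliers.
rng = random.Random(0)
u = [rng.gauss(0.0, 1.0) for _ in range(2000)] + [5.0] * 50 + [-5.0] * 50

sigma_opt, sigma_naive = optimize_sigma(u)
n_naive = len(select(u, sigma_naive))
n_opt = len(select(u, sigma_opt))
```

Because the outliers inflate the naive SD, the optimized σ is smaller and more components pass the adjusted P < 0.01 threshold, which mirrors the "too few genes are selected" problem the optimization is meant to address. The lower bound of the grid matters: as σ approaches zero almost every component lands in the excluded bins, and the raw SD of the remaining counts shrinks trivially.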
  11. Application to DNA methylation
    For microarray, all probe data are used as they are. For NGS, individual site data are used as they are.
    → Identification of differentially methylated cytosines (DMCs).
  12. Application to DMC identification (GSE42308, microarray)
    DMR: known differentially methylated regions. DHS: known DNase I hypersensitive sites.
    [Figure: axis label $\sigma_h$]
  13. Application to DMC identification (EH1072, sequencing)
    t test of P-values attributed by PCA between DHS and non-DHS.
    [Figure: axis labels Chromosome, $\sigma_h$]
  14. Various state-of-the-art methods were compared.
    Microarray: ChAMP and COHCAP.
    Sequencing: DMRcate, DSS, and metilene.
    None of them outperformed PCA.
  15. [Figure panels: COHCAP applied to GSE42308; ChAMP applied to GSE42308]
    DHS: known DNase I hypersensitive sites. DMR: known differentially methylated regions.
  16. DMRcate applied to EH1072: t test of P-values attributed by PCA between DHS and non-DHS.
    DSS takes more than a week.
    metilene failed to identify DMCs.
  17. Conclusion
    PCA and TD based unsupervised FE with optimized SD can be applied to the identification of DMCs without any specific modification of the method used for gene expression.