Slide 1

Slide 1 text

Advanced unsupervised feature extraction finds novel application and software tool to select more reasonable differentially methylated cytosines

Y-H. Taguchi, Department of Physics, Chuo University, Tokyo 112-8551, Japan
Turki Turki, Department of Computer Science, King Abdulaziz University, Jeddah 21589, Saudi Arabia

Preprint: doi: https://doi.org/10.1101/2022.04.02.486807
GIW2022

Slide 2

Slide 2 text

Motivation

Recently, we proposed an improved tensor decomposition (TD) based unsupervised feature extraction (FE) and successfully applied it to gene expression. We would like to see whether the method is applicable to DNA methylation without any methylation-specific modification.

Slide 3

Slide 3 text

Original method

Slide 4

Slide 4 text

The original (w/o SD optimization) PCA/TD based unsupervised FE

1. Apply PCA to matrices (e.g., genes × samples) or TD to tensors (e.g., genes × samples × tissues) and obtain vectors attributed separately to genes, samples, or tissues.
2. Select the vectors of interest among those attributed to samples and tissues.
3. Select genes whose contributions to the corresponding vectors attributed to genes are large (P-values are computed under the null hypothesis that the components of the vectors obey a Gaussian distribution).
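Steps 1 and 2 for the matrix (PCA) case can be sketched as follows. This is a minimal illustration, not the authors' software. It assumes NumPy, computes PCA through an SVD without centering, and uses a hypothetical criterion (`pick_sample_vector`, a t-like score between two groups) to choose the sample vector of interest, since the slide does not say how that choice is made. The toy data are synthetic.

```python
import numpy as np

def pca_vectors(x):
    """PCA of a genes x samples matrix via SVD (no centering; an assumption)."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    # u[:, l] is the l-th gene vector, v[:, l] the l-th sample vector.
    return u, s, vt.T

def pick_sample_vector(v, is_patient):
    """Hypothetical criterion: the component whose sample vector best
    separates patients from controls (absolute t-like score)."""
    a, b = v[is_patient], v[~is_patient]
    diff = np.abs(a.mean(axis=0) - b.mean(axis=0))
    spread = np.sqrt(a.var(axis=0) / len(a) + b.var(axis=0) / len(b)) + 1e-12
    return int(np.argmax(diff / spread))

# Synthetic toy data: 50 genes, 4 controls followed by 4 patients.
rng = np.random.default_rng(0)
x = rng.normal(scale=0.1, size=(50, 8))
is_patient = np.array([False] * 4 + [True] * 4)
x[:10, 4:] += 3.0  # the first 10 genes differ between the two groups

u, s, v = pca_vectors(x)
l = pick_sample_vector(v, is_patient)

# Genes with the largest absolute components in the matching gene vector.
top_genes = sorted(np.argsort(-np.abs(u[:, l]))[:10].tolist())
print(l, top_genes)
```

In the method itself, genes are selected in step 3 by P-values rather than by a fixed top-10 cut; the ranking here only shows that the gene vector paired with the chosen sample vector carries the group difference.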

Slide 5

Slide 5 text

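[Figure: workflow schematic. A matrix (gene × sample) is decomposed by PCA into gene vectors and sample vectors; a tensor (gene × sample × tissue) is decomposed by TD into gene vectors, sample vectors, and tissue vectors. Gene selection: under a Gaussian distribution, a histogram of 1-P (frequency versus 1-P, from 0 to 1) is drawn, with the DEGs marked.]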

Slide 6

Slide 6 text

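HOSVD (Higher Order Singular Value Decomposition)

$$x_{ijk} = \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3)\, u_{l_1 i}\, u_{l_2 j}\, u_{l_3 k}$$

x_{ijk}: N × M × K tensor; G: L_1 × L_2 × L_3 core tensor.
Example: N: number of genes (i), M: number of samples (j), K: number of tissues (k).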
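A minimal NumPy sketch of HOSVD for a three-mode tensor is given below for illustration; it is not the authors' implementation. The function name `hosvd` and the toy tensor are hypothetical. Each factor matrix is taken from the SVD of the corresponding mode unfolding, and the core tensor G is obtained by projecting x onto those factors. In the slide's notation, `factors[0][i, l1]` corresponds to u_{l1 i}.

```python
import numpy as np

def hosvd(x):
    """Return the core tensor G and one matrix of singular vectors per mode."""
    factors = []
    for mode in range(x.ndim):
        # Unfold: put this mode first and flatten the remaining modes.
        unfolded = np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)
        u, _, _ = np.linalg.svd(unfolded, full_matrices=False)
        factors.append(u)  # u[i, l]: component i of the l-th singular vector
    # Core tensor: project x onto the singular vectors of every mode.
    g = np.einsum('ijk,ia,jb,kc->abc', x, *factors)
    return g, factors

# Toy tensor: 6 "genes" x 4 "samples" x 3 "tissues" (synthetic values).
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4, 3))
g, (u1, u2, u3) = hosvd(x)

# Reconstruction: x_ijk = sum over l1, l2, l3 of G(l1 l2 l3) u_l1i u_l2j u_l3k
x_rec = np.einsum('abc,ia,jb,kc->ijk', g, u1, u2, u3)
print(np.allclose(x, x_rec))
```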

Slide 7

Slide 7 text

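j: sample. Some l_2 gives a u_{l2 j} that distinguishes healthy controls from patients.
k: tissue. Some l_3 gives a u_{l3 k} that represents tissue-specific expression.

[Figure: u_{l2 j} plotted over samples (healthy controls, patients) and u_{l3 k} plotted over tissues.]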

Slide 8

Slide 8 text

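i: genes, gene vector u_{l1 i}

Given the selected l_2 and l_3, some l_1 has the maximum |G(l_1 l_2 l_3)|; the corresponding u_{l1 i} is used to identify tDEGs.

If G(l_1 l_2 l_3) > 0:
[Figure: u_{l1 i} with two groups of tDEGs marked, one labeled healthy > patients and the other healthy < patients.]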
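Choosing l_1 once l_2 and l_3 are fixed amounts to a search over one axis of the core tensor. The sketch below uses a made-up core tensor stored as nested lists; `pick_l1` is a hypothetical helper name, and indices are zero-based.

```python
def pick_l1(core, l2, l3):
    """Index l1 with the largest |G(l1 l2 l3)| for the given l2 and l3."""
    return max(range(len(core)), key=lambda l1: abs(core[l1][l2][l3]))

# Made-up 3 x 2 x 2 core tensor, indexed as core[l1][l2][l3].
core = [
    [[0.5, 0.1], [0.2, 0.0]],
    [[-3.0, 0.4], [1.0, 0.2]],
    [[1.5, 0.3], [0.1, 0.1]],
]

l1 = pick_l1(core, l2=0, l3=0)
sign_is_positive = core[l1][0][0] > 0
print(l1, sign_is_positive)  # the sign of G decides how u_l1i is read
```

The same function also works on a NumPy array, because only `core[l1][l2][l3]` indexing is used.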

Slide 9

Slide 9 text

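u_{l1 i} is assumed to obey a Gaussian distribution.

Gene selection criterion: a P-value P_i is attributed to each gene i, and P_i must be corrected for multiple comparisons.

[Figure: histogram of 1-P (frequency versus 1-P, from 0 to 1), with the DEGs marked.]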
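The selection step can be sketched in plain Python as below. The formula for P_i did not survive in the extracted slide text, so the form used here, the upper tail of a χ² distribution with one degree of freedom evaluated at (u_{l1 i}/σ)², is an assumption consistent with the cumulative χ² distribution mentioned on a later slide. The Benjamini-Hochberg procedure is likewise an assumed choice of multiple comparison correction, and `select` and the toy vector are hypothetical.

```python
import math
import statistics

def chi2_upper_tail_1dof(z2):
    """P(chi-squared with 1 d.o.f. > z2), which equals erfc(sqrt(z2 / 2))."""
    return math.erfc(math.sqrt(z2 / 2.0))

def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values (assumed correction method)."""
    m = len(p)
    order = sorted(range(m), key=lambda i: p[i])
    adjusted, running = [0.0] * m, 1.0
    for rank in range(m, 0, -1):  # walk from the largest P to the smallest
        i = order[rank - 1]
        running = min(running, p[i] * m / rank)
        adjusted[i] = running
    return adjusted

def select(u, alpha=0.01):
    """Indices whose adjusted P-value is below alpha."""
    sigma = statistics.pstdev(u)  # SD computed from all components
    p = [chi2_upper_tail_1dof((x / sigma) ** 2) for x in u]
    return [i for i, a in enumerate(bh_adjust(p)) if a < alpha]

# Toy gene vector: 199 small components and one clear outlier at index 199.
u = [0.01 * ((i % 7) - 3) for i in range(199)] + [5.0]
selected = select(u)
print(selected)
```

Here σ is the plain standard deviation of all components, which corresponds to the version without SD optimization.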

Slide 10

Slide 10 text

Although PCA/TD based unsupervised FE (w/o SD optimization) has worked well for various problems, it has some problems.

1. The histogram of 1-P does not fully obey the null hypothesis.
2. Too few genes are selected to believe that there are no false negatives.

Slide 11

Slide 11 text

Improved method

Slide 12

Slide 12 text

(A) Original: histogram of 1-P when σ is computed from the distribution
(B) Improved: histogram of 1-P with optimized σ
(C) Ground truth

Outliers are excluded when computing σ.

[Figure: panels (A), (B), and (C).]

Slide 13

Slide 13 text

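Null hypothesis: u_{2i} obeys a Gaussian distribution.

P_i is computed with the cumulative χ² distribution. In the histogram of 1-P_i, h_n is the count in the nth bin. The optimal σ minimizes σ_h, the standard deviation of h_n over the bins bounded by n_0, where n_0 is the bin at which the adjusted P equals 0.1 (adjusted P(n_0) = 0.1).

Select genes with adjusted P_i < 0.01.

[Figure (gene expression): left, histogram of 1-P_i; right, σ_h as a function of σ, with the optimal σ at the minimum of σ_h.]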
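One way to read this optimization is as a one-dimensional search over σ, sketched below in plain Python on synthetic data. Several details are assumptions rather than the authors' settings: the grid of candidate σ values, the 20 bins, Benjamini-Hochberg as the correction, and the convention that only bins strictly below n_0 enter σ_h. The helper names are hypothetical.

```python
import math
import random
import statistics

def p_values(u, sigma):
    # Upper tail of chi-squared (1 d.o.f.) at (u / sigma)^2, written with erfc.
    return [math.erfc(abs(x) / (sigma * math.sqrt(2.0))) for x in u]

def bh_adjust(p):
    # Benjamini-Hochberg adjusted P-values (assumed correction method).
    m = len(p)
    order = sorted(range(m), key=lambda i: p[i])
    adjusted, running = [0.0] * m, 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running = min(running, p[i] * m / rank)
        adjusted[i] = running
    return adjusted

def sigma_h(u, sigma, n_bins=20, p0=0.1):
    """SD of the histogram counts of 1-P over the bins below n0."""
    p = p_values(u, sigma)
    adjusted = bh_adjust(p)
    # 1-P of the components that are NOT significant at adjusted P < p0.
    null_q = [1.0 - pi for pi, ai in zip(p, adjusted) if ai >= p0]
    if not null_q:
        return math.inf
    n0 = min(int(max(null_q) * n_bins), n_bins - 1)  # bin holding the cutoff
    if n0 < 2:
        return math.inf
    counts = [0] * n_bins
    for pi in p:
        counts[min(int((1.0 - pi) * n_bins), n_bins - 1)] += 1
    return statistics.pstdev(counts[:n0])

# Synthetic vector: 2000 null components (SD 1) plus 100 outliers at +/-8.
rng = random.Random(0)
u = [rng.gauss(0.0, 1.0) for _ in range(2000)] + [8.0, -8.0] * 50

naive_sigma = statistics.pstdev(u)          # inflated by the outliers
grid = [0.5 + 0.05 * k for k in range(51)]  # candidate sigma values
best_sigma = min(grid, key=lambda s: sigma_h(u, s))
print(naive_sigma, best_sigma)
```

The idea is that a σ inflated by outliers pushes the null P-values toward 1 and distorts the histogram of 1-P, whereas a σ close to the SD of the null components makes the null part of the histogram flat, so σ_h is smallest there.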

Slide 14

Slide 14 text

Application to DNA methylation

For microarray, all probe data are used as they are.
For NGS, individual site data are used as they are.
→ Identification of differentially methylated cytosines (DMCs).

Slide 15

Slide 15 text

Application to DMC identification (GSE42308, microarray)

DMR: known differentially methylated regions
DHS: known DNase I hypersensitive sites

[Figure: plot with axis label σ_h.]

Slide 16

Slide 16 text

Application to DMC identification (EH1072, sequencing)

t test of P-values attributed by PCA between DHS and non-DHS

[Figure: recovered labels are Chromosome and σ_h.]

Slide 17

Slide 17 text

Various state-of-the-art methods were compared.

Microarray: ChAMP and COHCAP
Sequencing: DMRcate, DSS, and metilene

None of them outperformed PCA.

Slide 18

Slide 18 text

COHCAP applied to GSE42308
ChAMP applied to GSE42308

DHS: known DNase I hypersensitive sites
DMR: known differentially methylated regions

Slide 19

Slide 19 text

DMRcate applied to EH1072
DSS takes more than a week.
metilene failed to identify DMCs.

t test of P-values attributed by PCA between DHS and non-DHS

Slide 20

Slide 20 text

Conclusion

PCA and TD based unsupervised FE with optimized SD can be applied to the identification of DMCs without any modification specific to methylation, that is, in the same form as that used for gene expression.