Advanced unsupervised feature extraction finds novel application and software tool to select more reasonable differentially methylated cytosines

Y-H. Taguchi, Department of Physics, Chuo University, Tokyo 112-8551, Japan
Turki Turki, Department of Computer Science, King Abdulaziz University, Jeddah 21589, Saudi Arabia
Preprint: https://doi.org/10.1101/2022.04.02.486807
GIW2022

Motivation
We recently proposed an improved tensor decomposition (TD) based unsupervised feature extraction (FE) and successfully applied it to gene expression.
We would like to see whether the method is applicable to DNA methylation without any methylation-specific modification.

Original method

The original (w/o SD optimization) PCA/TD based unsupervised FE
1. Apply PCA to matrices (e.g., genes ⨉ samples) or TD to tensors (e.g., genes ⨉ samples ⨉ tissues) and get vectors attributed separately to genes, samples, or tissues.
2. Select the vectors of interest among those attributed to samples and tissues.
3. Select genes whose contributions to the corresponding vectors attributed to genes are large (based upon the null hypothesis that the components of the vectors obey a Gaussian distribution).
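The first two steps above can be sketched for the matrix (PCA) case as follows. This is a minimal illustration, not the authors' implementation: the per-sample standardization, the use of Welch's t statistic to pick the sample vector of interest, the helper names, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def pca_gene_and_sample_vectors(x):
    """SVD-based PCA of a genes x samples matrix.

    Returns (u, v): u[:, l] is the l-th vector attributed to genes,
    v[:, l] is the l-th vector attributed to samples.
    """
    # Assumption: standardize every sample (column) to zero mean, unit variance.
    z = (x - x.mean(axis=0)) / x.std(axis=0)
    u, _, vt = np.linalg.svd(z, full_matrices=False)
    return u, vt.T

def pick_component(v, labels):
    """Index of the sample vector that best separates two classes.

    Scored by the absolute Welch t statistic (a hypothetical selection rule).
    """
    labels = np.asarray(labels, dtype=bool)
    a, b = v[labels], v[~labels]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    t = (a.mean(axis=0) - b.mean(axis=0)) / se
    return int(np.argmax(np.abs(t)))

# Synthetic data: 1000 genes, 5 healthy controls and 5 patients.
rng = np.random.default_rng(0)
labels = np.array([0] * 5 + [1] * 5)
x = rng.normal(size=(1000, 10))
x[:20, labels == 1] += 6.0  # 20 planted genes that differ between the classes

u, v = pca_gene_and_sample_vectors(x)
l = pick_component(v, labels)          # step 2: vector of interest
gene_vector = u[:, l]                  # input to step 3
top = np.argsort(-np.abs(gene_vector))[:20]
print(sorted(top.tolist()))
```

Step 3 then turns the components of `gene_vector` into P-values under the Gaussian null hypothesis.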

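[Figure: schematic of the method. A matrix (gene ⨉ sample) is decomposed by PCA into gene vectors and sample vectors; a tensor (gene ⨉ sample ⨉ tissue) is decomposed by TD into gene vectors, sample vectors, and tissue vectors. Gene selection: histogram (frequency versus 1-P, from 0 to 1) compared with a Gaussian distribution, with the DEGs marked.]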

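HOSVD (Higher Order Singular Value Decomposition)
$$x_{ijk} = \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1, l_2, l_3) \, u_{l_1 i} \, u_{l_2 j} \, u_{l_3 k}$$
Example: x_{ijk} is an N ⨉ M ⨉ K tensor with N genes (i), M samples (j), and K tissues (k); G is the L_1 ⨉ L_2 ⨉ L_3 core tensor.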
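A minimal sketch of HOSVD for a three-mode tensor, written with plain NumPy: the factor matrices are the left singular vectors of each mode unfolding, and the core tensor G is the projection of x onto them. The function names and the random test tensor are assumptions made for illustration only.

```python
import numpy as np

def unfold(x, mode):
    """Matricize x so that the given mode indexes the rows."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def hosvd(x):
    """Return (core, [u1, u2, u3]) for a three-mode tensor x.

    Column l of u1 holds u_{l i}; likewise u2 for samples and u3 for tissues.
    """
    factors = []
    for mode in range(x.ndim):
        u, _, _ = np.linalg.svd(unfold(x, mode), full_matrices=False)
        factors.append(u)
    u1, u2, u3 = factors
    # G(l1, l2, l3) = sum_{ijk} x_{ijk} u_{l1 i} u_{l2 j} u_{l3 k}
    core = np.einsum('ijk,ia,jb,kc->abc', x, u1, u2, u3)
    return core, factors

# Hypothetical tensor: 50 genes x 6 samples x 4 tissues.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 6, 4))
core, (u1, u2, u3) = hosvd(x)

# Reconstruct x from the core tensor and the singular value vectors.
x_rec = np.einsum('abc,ia,jb,kc->ijk', core, u1, u2, u3)
print(np.allclose(x, x_rec))
```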

j: samples. Select some l_2 whose u_{l_2 j} distinguishes healthy controls from patients.
k: tissues. Select some l_3 whose u_{l_3 k} represents tissue-specific expression.

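i: genes, u_{l_1 i}
Given the selected l_2 and l_3, some l_1 has the maximum |G(l_1, l_2, l_3)|.
If G(l_1, l_2, l_3) > 0, the sign of u_{l_1 i} distinguishes tDEGs with healthy > patients from tDEGs with healthy < patients.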
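Choosing l_1 amounts to scanning the core tensor along its first index while l_2 and l_3 are held fixed. The helper below is a hypothetical illustration that takes any core tensor as a NumPy array (zero-based indices) and also returns the sign of the selected entry, which is needed to interpret the sign of u_{l_1 i}.

```python
import numpy as np

def choose_l1(core, l2, l3):
    """Return the l1 maximizing |G(l1, l2, l3)| and the sign of that entry."""
    column = core[:, l2, l3]
    l1 = int(np.argmax(np.abs(column)))
    return l1, int(np.sign(column[l1]))

# Hand-made core tensor for illustration only.
core = np.zeros((4, 3, 2))
core[:, 1, 0] = [0.2, -7.5, 3.0, 0.1]
l1, sign = choose_l1(core, l2=1, l3=0)
print(l1, sign)
```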

u_{l_1 i} is assumed to obey a Gaussian distribution.
Gene selection criterion: P_i must be corrected with a multiple comparison correction.
[Figure: histogram of 1-P (frequency versus 1-P, from 0 to 1), with the DEGs marked.]
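The selection criterion can be sketched with the standard library alone. Two assumptions are made here that the slide does not state: P_i is taken as the upper tail of a χ² distribution with one degree of freedom evaluated at (u_i/σ)², which equals erfc(|u_i|/(σ√2)), and the multiple comparison correction is Benjamini-Hochberg. The numbers in `u` and the value of `sigma` are made up.

```python
import math

def gaussian_p_values(u, sigma):
    """P_i = P(chi2 with 1 d.o.f. > (u_i / sigma)^2) = erfc(|u_i| / (sigma * sqrt(2)))."""
    return [math.erfc(abs(x) / (sigma * math.sqrt(2.0))) for x in u]

def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    n = len(p)
    order = sorted(range(n), key=lambda i: p[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, p[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical components of a gene vector; two of them are clear outliers.
u = [0.1, -0.2, 0.05, 5.0, -0.15, 0.3, -6.0, 0.0]
p = gaussian_p_values(u, sigma=1.0)
adj = benjamini_hochberg(p)
selected = [i for i, q in enumerate(adj) if q < 0.01]
print(selected)
```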

Although PCA/TD based unsupervised FE (w/o SD optimization) worked pretty well for various problems, it has some problems.
1. The histogram of 1-P does not fully obey the null hypothesis.
2. Too few genes are selected to believe that there are no false negatives.

Improved method

[Figure with three panels]
(A) Original: histogram of 1-P when σ is computed from the distribution.
(B) Improved: histogram of 1-P with optimized σ.
(C) Ground truth.
Outliers are excluded to compute σ.

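Null hypothesis: u_{2i} obeys a Gaussian distribution.
P_i is computed from the cumulative χ² distribution.
h_n: histogram of 1-P_i in the nth bin.
The optimal σ minimizes σ_h, the standard deviation of h_n over the bins up to n_0, where n_0 is the bin at which the adjusted P is 0.1:
$$\sigma_h = \sqrt{\frac{1}{n_0} \sum_{n \le n_0} \left( h_n - \langle h_n \rangle \right)^2}$$
Select genes with adjusted P_i < 0.01.
[Figure: left and right panels for gene expression; the right panel shows the σ that minimizes σ_h.]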
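A rough sketch of the SD optimization idea: scan candidate values of σ, recompute P_i for each, drop the features that would be selected, and keep the σ for which the histogram of 1-P_i over the remaining features is flattest (smallest σ_h). Several details are assumptions rather than the authors' procedure: a grid search stands in for the minimization, Benjamini-Hochberg is used as the correction, selected features are dropped with one threshold instead of locating the bin n_0, and the data are synthetic.

```python
import math
import numpy as np

def p_values(u, sigma):
    """P_i under the Gaussian null: P(chi2 with 1 d.o.f. > (u_i / sigma)^2)."""
    return np.array([math.erfc(abs(v) / (sigma * math.sqrt(2.0))) for v in u])

def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values (an assumed choice of correction)."""
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

def sigma_h(u, sigma, n_bins=20, threshold=0.01):
    """Standard deviation of the histogram of 1-P_i over non-selected features."""
    p = p_values(u, sigma)
    keep = bh_adjust(p) >= threshold  # exclude the features that would be selected
    counts, _ = np.histogram(1.0 - p[keep], bins=n_bins, range=(0.0, 1.0))
    return counts.std()

def optimize_sigma(u, grid):
    """Grid value of sigma with the flattest histogram (smallest sigma_h)."""
    scores = [sigma_h(u, s) for s in grid]
    return float(grid[int(np.argmin(scores))])

# Synthetic gene vector: 5000 null components with SD 1 plus 100 outliers.
rng = np.random.default_rng(1)
null = rng.normal(0.0, 1.0, size=5000)
outliers = np.where(rng.random(100) < 0.5, 8.0, -8.0)
u = np.concatenate([null, outliers])

sigma_naive = float(u.std())  # inflated by the outliers
sigma_opt = optimize_sigma(u, np.linspace(0.5, 2.0, 31))
print(sigma_naive, sigma_opt)
```

In this toy setting the SD computed from the whole distribution is inflated by the outliers, while the optimized value should land near the SD of the null components.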

Application to DNA methylation
For microarray, all probe data is used as it is.
For NGS, individual site data is used as it is.
→ Identification of differentially methylated cytosines (DMCs).

Application to DMC identification (GSE42308, microarray)
DMR: known differentially methylated regions
DHS: known DNase I hypersensitive sites
[Figure: plot of σ_h]

Application to DMC identification (EH1072, sequencing)
t test of P-values attributed by PCA between DHS and non-DHS, per chromosome
[Figure: plot of σ_h]
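The DHS versus non-DHS comparison can be illustrated with Welch's t statistic computed from the standard library. Whether the authors used Welch's or Student's test is not stated, so the variant, the helper name, and the toy P-values below are assumptions; converting t into a P-value needs the t distribution, which is left out here.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Made-up P-values for cytosines on one chromosome.
p_dhs = [0.01, 0.02, 0.03]      # sites inside known DHSs
p_non_dhs = [0.4, 0.5, 0.6]     # sites outside DHSs
t, df = welch_t(p_dhs, p_non_dhs)
print(t, df)
```

A negative t means that the P-values inside DHSs are smaller on average, which is the direction expected if the selected DMCs are enriched in DHSs. Repeating the call for each chromosome gives one statistic per chromosome.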

Various state-of-the-art methods were compared.
Microarray: ChAMP and COHCAP
Sequencing: DMRcate, DSS, and metilene
None of them performed better than PCA.

[Panels: COHCAP applied to GSE42308; ChAMP applied to GSE42308]
DHS: known DNase I hypersensitive sites
DMR: known differentially methylated regions

DMRcate applied to EH1072
t test of P-values attributed by PCA between DHS and non-DHS
DSS takes more than a week.
metilene failed to identify DMCs.

Conclusion
PCA and TD based unsupervised FE with optimized SD can be applied to the identification of DMCs without any specific modification from the version used for gene expression.
