Slide 1

Slide 1 text

WMCB2022
Tensor decomposition based unsupervised feature extraction with optimized standard deviation applied to differentially expressed genes, DNA methylation and histone modification
Y-h. Taguchi, Department of Physics, Chuo University, Tokyo, Japan
The contents were published in the following three preprints on bioRxiv:
https://doi.org/10.1101/2022.02.18.481115
https://doi.org/10.1101/2022.04.02.486807
https://doi.org/10.1101/2022.04.29.490081

Slide 2

Slide 2 text

Basic claims
1. Principal component analysis (PCA) based and tensor decomposition (TD) based unsupervised feature extraction (FE), applied to the identification of differentially expressed genes (DEGs), can outperform various state-of-the-art methods including DESeq2 when the standard deviations (SDs) used to generate the null hypothesis (Gaussian distribution of principal components) are optimized.
2. They are also applicable to the identification of differentially methylated cytosines (DMCs) and of differential histone modification, without any modification specific to those data types.

Slide 3

Slide 3 text

The original (w/o SD optimization) PCA/TD based unsupervised FE
1. Apply PCA to matrices (e.g., genes ⨉ samples) or TD to tensors (e.g., genes ⨉ samples ⨉ tissues) and obtain vectors attributed separately to genes, samples, or tissues.
2. Select the vectors of interest among those attributed to samples and tissues.
3. Select genes whose contributions to the corresponding gene vectors are large (judged against the null hypothesis that the components of the vectors follow a Gaussian distribution).
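
The three steps above can be sketched in Python for the matrix (PCA) case. This is only an illustration under stated assumptions, not the presenter's implementation: the toy data, the per-sample standardization, the rule used to pick the sample vector of interest, the Benjamini-Hochberg correction, and the 0.1 threshold are all choices made for this example.

```python
import numpy as np
from scipy.stats import chi2


def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted P-values (assumed correction method)."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Make the adjusted values monotone, starting from the largest P.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted


def gene_p_values(gene_vec, sigma=None):
    """P_i = P_chi2[> (u_i / sigma)^2] under a Gaussian null."""
    if sigma is None:
        sigma = gene_vec.std()  # plain SD, i.e. without SD optimization
    return chi2.sf((gene_vec / sigma) ** 2, df=1)


# Toy data (assumption): 1000 genes x 6 samples, two classes of 3 samples.
# Every gene has its own baseline; the first 50 genes are shifted by +/-6
# in the second class.
rng = np.random.default_rng(0)
n_genes, n_de = 1000, 50
x = rng.normal(size=(n_genes, 6)) + rng.normal(scale=3.0, size=(n_genes, 1))
x[:n_de, 3:] += rng.choice([-6.0, 6.0], size=(n_de, 1))

# Step 1: standardize every sample (column), then PCA via SVD.
z = (x - x.mean(axis=0)) / x.std(axis=0)
u, s, vt = np.linalg.svd(z, full_matrices=False)  # u: gene vectors, vt: sample vectors

# Step 2: pick the sample vector with the largest gap between class means
# (a hypothetical, automated stand-in for choosing the vector of interest).
class_gap = np.abs(vt[:, 3:].mean(axis=1) - vt[:, :3].mean(axis=1))
k = int(np.argmax(class_gap))

# Step 3: select genes with a large contribution to the matching gene vector.
adj_p = benjamini_hochberg(gene_p_values(u[:, k]))
selected = np.flatnonzero(adj_p < 0.1)
print(k, selected.size)
```

In practice the vector of interest is chosen by inspecting the sample (and tissue) vectors; the class-gap rule here only stands in for that manual step.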

Slide 4

Slide 4 text

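[Figure: workflow. A matrix (gene ⨉ sample) is decomposed by PCA into gene vectors and sample vectors; a tensor (gene ⨉ sample ⨉ tissue) is decomposed by TD into gene vectors, sample vectors, and tissue vectors. Gene selection: $P_i = P_{\chi^2}\left[> \left(\frac{u_{2i}}{\sigma}\right)^2\right]$ under a Gaussian distribution, shown with a histogram of $1-P$ (frequency vs. $1-P$ from 0 to 1) in which the DEG region is marked.]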

Slide 5

Slide 5 text

Although PCA/TD based unsupervised FE (w/o SD optimization) worked pretty well for various problems, it has some problems.
1. The histogram of 1-P does not fully obey the null hypothesis.
2. Too few genes are selected to believe that there are no false negatives.
[Figure: histogram of 1-P (frequency vs. 1-P from 0 to 1), with the DEG region marked.]

Slide 6

Slide 6 text

We have tried to resolve these problems...

Slide 7

Slide 7 text

MAQC (benchmark data set for DEG identification*)
RNA-seq: $x_{ij}$ represents the expression of the $i$th gene in the $j$th sample
Samples: seven Universal Human Reference RNA (UHRR) vs. seven Human Brain Reference RNA (HBRR)
Measured for 40933 genes (done by the presenter)
(*) https://www.fda.gov/science-research/bioinformatics-tools/microarraysequencing-quality-control-maqcseqc

Slide 8

Slide 8 text

[Figure: MA plot. Axes: gene-wise mean of $\log x_{ij}$ and gene-wise log FC ratio between the two classes, each shown with its density distribution.]

Slide 9

Slide 9 text

[Figure: gene-wise embedding by PCA. Axes: $u_{1i}$ and $u_{2i}$, each shown with its density distribution.]

Slide 10

Slide 10 text

[Figure: sample-wise principal components $v_{1j}$ and $v_{2j}$ obtained by PCA. $v_{1j}$ corresponds to the mean of $\log x_{ij}$ and $v_{2j}$ corresponds to the log FC ratio, so the PCA embedding corresponds to the MA plot.]

Slide 11

Slide 11 text

[Figure: cumulative contribution of the PCA components; l=1 and l=2 account for almost all of it, i.e., an almost 2D embedding.]
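
As a rough illustration of what a cumulative contribution is, the sketch below computes it from the singular values of a toy matrix. The toy data (one pattern shared by all samples, one pattern that contrasts two classes, and weak noise) are an assumption made for this example, and no centering or standardization is applied.

```python
import numpy as np

# Toy data (assumption): 500 genes x 6 samples built from two patterns plus noise.
rng = np.random.default_rng(0)
n_genes = 500
mean_pattern = np.ones(6)                                       # shared by all samples
class_pattern = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])    # contrasts two classes
x = (np.outer(rng.normal(size=n_genes), mean_pattern)
     + np.outer(rng.normal(size=n_genes), class_pattern)
     + 0.1 * rng.normal(size=(n_genes, 6)))

# Contribution of component l = squared singular value / sum of squares.
s = np.linalg.svd(x, compute_uv=False)
contribution = s**2 / np.sum(s**2)
cumulative = np.cumsum(contribution)
print(np.round(cumulative, 3))
```

When the first two entries of `cumulative` are close to 1, the data are essentially two-dimensional, which is the situation the slide describes.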

Slide 12

Slide 12 text

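Null hypothesis: $u_{2i}$ obeys a Gaussian distribution.
$P_i = P_{\chi^2}\left[> \left(\frac{u_{2i}}{\sigma}\right)^2\right]$, where $P_{\chi^2}$ is the cumulative $\chi^2$ distribution
Left: $\sigma = \sqrt{\frac{1}{N}\sum_i \left(u_{2i} - \langle u_{2i}\rangle\right)^2}$
Right: the optimal $\sigma$, which minimizes $\sigma_h$, the standard deviation of the histogram $h_n$ over the bins $n$ up to $n_0$
$h_n$: histogram of $1-P_i$ in the $n$th bin; $n_0$: the bin at which the adjusted $P$ reaches 0.1, i.e., adjusted $P(n_0)=0.1$
Select genes with adjusted $P_i<0.1$.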
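
The SD optimization described on this slide can be sketched as a grid search: for each candidate σ, compute P_i, drop the bins that contain selected genes, and measure how flat the remaining histogram of 1-P_i is. Everything below is an assumption made for illustration: the toy data, the Benjamini-Hochberg correction, the 100 bins, the search range (0.1 to 1.0 times the plain SD), and in particular dividing the bin counts by their total so that candidates that retain different numbers of genes stay comparable.

```python
import numpy as np
from scipy.stats import chi2


def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted P-values (assumed correction method)."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted


def flatness(u, sigma, threshold=0.1, n_bins=100):
    """Return (sigma_h, selected mask) for one candidate sigma."""
    p = chi2.sf((u / sigma) ** 2, df=1)
    one_minus_p = 1.0 - p
    selected = benjamini_hochberg(p) < threshold
    # n_0: the bin holding the least significant selected gene;
    # only the bins below it enter sigma_h.
    n0 = n_bins
    if selected.any():
        n0 = min(n_bins, int(one_minus_p[selected].min() * n_bins))
    h, _ = np.histogram(one_minus_p, bins=n_bins, range=(0.0, 1.0))
    h = h[:n0]
    if n0 < 2 or h.sum() == 0:
        return np.inf, selected
    # SD of the relative bin frequencies (normalization is an assumption).
    return float((h / h.sum()).std()), selected


def optimize_sigma(u, fractions=np.linspace(0.1, 1.0, 91)):
    """Grid search for the sigma that gives the flattest histogram."""
    candidates = fractions * u.std()
    scores = [flatness(u, s)[0] for s in candidates]
    return float(candidates[int(np.argmin(scores))])


# Toy gene vector (assumption): 9500 null components with unit SD
# plus 500 outlying components that play the role of DEGs.
rng = np.random.default_rng(0)
null_part = rng.normal(0.0, 1.0, size=9500)
deg_part = rng.choice([-1.0, 1.0], size=500) * rng.normal(6.0, 1.0, size=500)
u = np.concatenate([null_part, deg_part])

sigma_plain = float(u.std())
sigma_opt = optimize_sigma(u)
n_plain = int(flatness(u, sigma_plain)[1].sum())
n_opt = int(flatness(u, sigma_opt)[1].sum())
print(sigma_plain, sigma_opt, n_plain, n_opt)
```

In this toy setting the plain SD is inflated by the outlying components, so the optimized σ should land near the SD of the null part and more components should pass the threshold.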

Slide 13

Slide 13 text

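DESeq2: empirical dispersion relation, $\frac{\sigma(\mu)^2}{\mu} = \sigma_0^2 + \frac{\sigma_1^2}{\mu}$
PCA: naturally satisfied.
$\Delta = \langle x_{ij}\rangle_{\mathrm{brain}} - \langle x_{ij}\rangle_{\mathrm{cntl}} \simeq u_{2i}$
$\mathrm{LFC} = \log_2 \frac{\langle x_{ij}\rangle_{\mathrm{brain}}}{\langle x_{ij}\rangle_{\mathrm{cntl}}} = \log_2\left(1 + \frac{\Delta}{\langle x_{ij}\rangle_{\mathrm{cntl}}}\right)$
"Highly expressed genes should be more likely to be selected"
[Figure: $\sigma^2$ plotted against $\mu$.]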

Slide 14

Slide 14 text

Biological validation
[Figure: PCA vs DESeq2, tissue specificity.]

Slide 15

Slide 15 text

[Figure: PCA vs DESeq2 vs edgeR vs NOISeq vs voom. Panels: (A) KEGG, (B) GO BP, (C) Human gene atlas.]

Slide 16

Slide 16 text

PCA based unsupervised FE with optimized SD outperforms various state-of-the-art methods (DESeq2, edgeR, NOISeq, and voom) while assuming neither an empirical dispersion relation nor a negative binomial distribution.

Slide 17

Slide 17 text

Next, we tried to apply them to DNA methylation...

Slide 18

Slide 18 text

Application to DMC identification (GSE42308, microarray)
DMR: known differentially methylated regions
DHS: known DNase I hypersensitive sites

Slide 19

Slide 19 text

Application to DMC identification (EH1072, sequencing)
[Figure: t test of the P-values attributed by PCA, between DHS and non-DHS, for each chromosome.]
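
A comparison of this kind can be sketched with `scipy.stats.ttest_ind`. The P-values below are synthetic stand-ins for a single chromosome, and the choice of Welch's unequal-variance t test is an assumption; the slide does not say which variant was used.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic stand-ins (assumption): P-values attributed by PCA to cytosines
# inside DHS (skewed towards small values) and outside DHS (uniform).
rng = np.random.default_rng(0)
p_dhs = rng.beta(0.5, 1.0, size=2000)
p_non_dhs = rng.uniform(size=2000)

# Welch's t test; a negative statistic means smaller P-values inside DHS.
result = ttest_ind(p_dhs, p_non_dhs, equal_var=False)
print(result.statistic, result.pvalue)
```

Repeating the test per chromosome would give one statistic per chromosome, as in the figure.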

Slide 20

Slide 20 text

Various state-of-the-art methods were compared.
Microarray: ChAMP and COHCAP
Sequencing: DMRcate, DSS, and metilene
None of them performed better than PCA.

Slide 21

Slide 21 text

[Figure: COHCAP applied to GSE42308 and ChAMP applied to GSE42308.]
DHS: known DNase I hypersensitive sites
DMR: known differentially methylated regions

Slide 22

Slide 22 text

[Figure: DMRcate applied to EH1072.]
DSS takes more than a week.
metilene failed to identify DMCs.

Slide 23

Slide 23 text

PCA and TD based unsupervised FE with optimized SD can be applied to the identification of DMCs without any specific modification.

Slide 24

Slide 24 text

Finally, we applied them to histone modification...

Slide 25

Slide 25 text

Application to differential histone modification
Histograms of $1-P_i$ do not obey the Gaussian null hypothesis (double peak), but...

Slide 26

Slide 26 text

Comparisons with other methods (H3K9me3, GSE24850)
[Figure: the number of histone modification experiments overlapping with the selected genes.]

Slide 27

Slide 27 text

Other histone modifications
[Figure: the number of histone modification experiments overlapping with the selected genes.]

Slide 28

Slide 28 text

PCA and TD based unsupervised FE with optimized SD can be applied to the identification of differential histone modification without any specific modification.

Slide 29

Slide 29 text

Conclusions
1. Principal component analysis (PCA) based and tensor decomposition (TD) based unsupervised feature extraction (FE), applied to the identification of differentially expressed genes (DEGs), can outperform various state-of-the-art methods including DESeq2 when the standard deviations (SDs) used to generate the null hypothesis (Gaussian distribution of principal components) are optimized.
2. They are also applicable to the identification of differentially methylated cytosines (DMCs) and of differential histone modification, without any modification specific to those data types.