Tensor decomposition based unsupervised feature extraction with optimized standard deviation applied to identification of differential gene expression, DNA methylation and histone modification

Slide 1

Slide 1 text

Tensor decomposition based unsupervised feature extraction with optimized standard deviation applied to identification of differential gene expression, DNA methylation and histone modification
Y-h. Taguchi, Department of Physics, Chuo University, Tokyo, Japan
The contents were published in the following paper and two preprints on bioRxiv:
https://rdcu.be/c0WE8 (Sci. Rep.)
https://doi.org/10.1101/2022.04.02.486807
https://doi.org/10.1101/2022.04.29.490081
ISAIC2022

Slide 2

Slide 2 text

Basic claims
1. Principal component analysis (PCA) based and tensor decomposition (TD) based unsupervised feature extraction (FE), applied to the identification of differentially expressed genes (DEGs), can outperform various state-of-the-art methods, including DESeq2, when the standard deviations (SDs) used to generate the null hypothesis (a Gaussian distribution of principal components) are optimized.
2. They are also applicable to the identification of differentially methylated cytosines (DMCs) and of differential histone modification, without any modification specific to these data types.

Slide 3

Slide 3 text

The original (w/o SD optimization) PCA/TD based unsupervised FE
1. Apply PCA to matrices (e.g., genes ⨉ samples) or TD to tensors (e.g., genes ⨉ samples ⨉ tissues) to obtain vectors attributed separately to genes, samples, or tissues.
2. Select the vectors of interest among those attributed to samples and tissues.
3. Select the genes with large contributions to the corresponding gene vectors, judged against the null hypothesis that the components of those vectors obey a Gaussian distribution.
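The gene selection in step 3 can be sketched as follows. This is a minimal illustration, not the presenter's implementation. It assumes that a gene vector `u` has already been obtained by PCA or TD, that P-values come from a chi-squared distribution with one degree of freedom applied to (u_i/σ)², that σ defaults to the sample SD of the components, and that P-values are adjusted with the Benjamini-Hochberg procedure. The function names are hypothetical; the default threshold of 0.1 is the one quoted later in the talk for adjusted P_i.

```python
import math


def gaussian_null_pvalues(u, sigma=None):
    """P-values of the components of u under a zero-mean Gaussian null.

    (u_i / sigma)^2 is compared with a chi-squared distribution with one
    degree of freedom, whose upper tail equals erfc(|u_i| / (sigma * sqrt(2))).
    """
    if sigma is None:
        # Assumption: without SD optimization, use the sample SD of u.
        mean = sum(u) / len(u)
        sigma = math.sqrt(sum((x - mean) ** 2 for x in u) / (len(u) - 1))
    return [math.erfc(abs(x) / (sigma * math.sqrt(2.0))) for x in u]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_genes(u, sigma=None, threshold=0.1):
    """Indices of genes whose adjusted P-value is below the threshold."""
    adjusted = benjamini_hochberg(gaussian_null_pvalues(u, sigma))
    return [i for i, q in enumerate(adjusted) if q < threshold]


# Toy gene vector: seven small components and one large one.
u = [0.1, -0.2, 0.05, 0.0, -0.1, 0.15, -0.05, 5.0]
print(select_genes(u, sigma=1.0))
```

Passing `sigma` explicitly is what later allows the SD to be replaced by an optimized value instead of the sample SD.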

Slide 4

Slide 4 text

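[Figure: workflow. Matrix (gene ⨉ sample) → PCA → gene vectors and sample vectors. Tensor (gene ⨉ sample ⨉ tissue) → TD → gene vectors, sample vectors, and tissue vectors. Gene selection: histogram of 1-P (frequency vs. 1-P, from 0 to 1) compared with a Gaussian distribution, with DEGs marked.]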

Slide 5

Slide 5 text

Although PCA/TD based unsupervised FE (w/o SD optimization) has worked quite well for various problems, it has some shortcomings.
1. The histogram of 1-P does not fully obey the null hypothesis.
2. Too few genes are selected to believe that there are no false negatives.
[Figure: histogram of 1-P (frequency vs. 1-P, from 0 to 1), with DEGs marked.]

Slide 6

Slide 6 text

We have tried to resolve these problems…

Slide 7

Slide 7 text

MAQC (benchmark data set for DEG identification *)
RNA-seq: x_ij represents the expression of the ith gene in the jth sample
Samples: seven Universal Human Reference RNA (UHRR) vs. seven Human Brain Reference RNA (HBRR)
Measured for 40933 genes (done by the presenter)
(*) https://www.fda.gov/science-research/bioinformatics-tools/microarraysequencing-quality-control-maqcseqc

Slide 8

Slide 8 text

[Figure: MA plot of the gene-wise log FC ratio between the two classes against the gene-wise mean of log x_ij, with density distributions.]
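The two gene-wise quantities shown on this slide can be computed as in the sketch below. The details are assumptions made for illustration and are not stated on the slide: base-2 logarithms, a pseudocount of 1 to avoid log 0, and the log FC ratio taken as the difference between the class means of the log values. The function name `ma_values` is hypothetical.

```python
import math


def ma_values(expr, class_a, class_b, pseudocount=1.0):
    """Return one (mean_log, log_fc) pair per gene.

    expr     : list of rows, one row of expression values x_ij per gene
    class_a  : column indices of the first class (e.g., UHRR samples)
    class_b  : column indices of the second class (e.g., HBRR samples)
    """
    pairs = []
    for row in expr:
        logs = [math.log2(x + pseudocount) for x in row]
        mean_log = sum(logs) / len(logs)  # gene-wise mean of log x_ij
        mean_a = sum(logs[j] for j in class_a) / len(class_a)
        mean_b = sum(logs[j] for j in class_b) / len(class_b)
        pairs.append((mean_log, mean_a - mean_b))  # log FC between classes
    return pairs


# One toy gene measured in two samples per class.
print(ma_values([[3, 3, 1, 1]], class_a=[0, 1], class_b=[2, 3]))
```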

Slide 9

Slide 9 text

[Figure: gene-wise embedding by PCA (u_1i vs. u_2i), with density distributions.]

Slide 10

Slide 10 text

[Figure: sample-wise principal components obtained by PCA. v_1j corresponds to the mean of log x_ij and v_2j corresponds to the log FC ratio, so the PCA embedding corresponds to the MA plot.]

Slide 11

Slide 11 text

[Figure: cumulative contribution of the principal components (l=1, l=2). PCA gives an almost 2D embedding.]

Slide 12

Slide 12 text

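Null hypothesis: u_2i obeys a Gaussian distribution; P_i is computed from the cumulative χ² distribution.
Left: histogram of 1-P_i, where h_n is the count in the nth bin and n_0 is the bin at which adjusted P(n_0) = 0.1.
Right: σ_h, the standard deviation of h_n over the bins up to n_0, as a function of σ. The optimal σ minimizes σ_h.
Select genes with adjusted P_i < 0.1.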
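The SD optimization on this slide can be sketched as a grid search. This is one possible reading of the slide, not the presenter's code, and several details are assumptions: P_i is the upper tail of a χ² distribution with one degree of freedom at (u_i/σ)², adjustment uses the Benjamini-Hochberg procedure, n_0 is taken as the lowest histogram bin that contains a selected gene, and the SD of h_n is divided by the mean of h_n so that values for different σ are comparable. All function names are hypothetical.

```python
import math
import random


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values (assumed correction method)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def histogram_flatness(u, sigma, n_bins=100, threshold=0.1):
    """Relative SD of the histogram of 1 - P_i over the bins below n_0."""
    pvals = [math.erfc(abs(x) / (sigma * math.sqrt(2.0))) for x in u]
    adjusted = bh_adjust(pvals)
    counts = [0] * n_bins
    n0 = n_bins  # lowest bin that holds a selected gene
    for p, q in zip(pvals, adjusted):
        b = min(int((1.0 - p) * n_bins), n_bins - 1)
        counts[b] += 1
        if q < threshold:
            n0 = min(n0, b)
    kept = counts[:n0]
    if n0 < 2 or sum(kept) == 0:
        return math.inf  # nothing left to judge flatness on
    mean = sum(kept) / n0
    sd = math.sqrt(sum((c - mean) ** 2 for c in kept) / n0)
    return sd / mean


def optimize_sigma(u, candidates, n_bins=100, threshold=0.1):
    """Candidate sigma whose null-like part of the histogram is flattest."""
    return min(candidates,
               key=lambda s: histogram_flatness(u, s, n_bins, threshold))


# Synthetic data: 10000 null components with SD 0.5 plus 100 large ones.
random.seed(0)
u = [random.gauss(0.0, 0.5) for _ in range(10000)]
u += [random.choice([-1, 1]) * random.uniform(4.0, 6.0) for _ in range(100)]
best_sigma = optimize_sigma(u, [0.1 * k for k in range(1, 16)])
print(best_sigma)
```

In this synthetic setting the sample SD of `u` is inflated by the large components, whereas the grid search is expected to land near the SD of the null part, which is the motivation for optimizing σ instead of using the sample SD.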

Slide 13

Slide 13 text

“Highly expressed genes should be more likely to be selected”
DESeq2: empirical dispersion relation
PCA: naturally satisfied
[Figure: plots relating μ and σ².]

Slide 14

Slide 14 text

Biological validation: PCA vs DESeq2
Tissue specificity

Slide 15

Slide 15 text

PCA vs DESeq2 vs edgeR vs NOISeq vs voom
(A) KEGG (B) GO BP (C) Human gene atlas

Slide 16

Slide 16 text

PCA based unsupervised FE with optimized SD outperforms various state-of-the-art methods (DESeq2, edgeR, NOISeq, voom) while assuming neither an empirical dispersion relation nor a negative binomial distribution.

Slide 17

Slide 17 text

Next, we tried to apply them to DNA methylation…

Slide 18

Slide 18 text

Application to DMC identification (GSE42308, microarray)
DMR: known differentially methylated regions
DHS: known DNase I hypersensitive sites

Slide 19

Slide 19 text

Application to DMC identification (EH1072, sequencing)
t test of the P-values attributed by PCA between DHS and non-DHS, shown per chromosome

Slide 20

Slide 20 text

Various state-of-the-art methods were compared.
Microarray: ChAMP and COHCAP
Sequencing: DMRcate, DSS, and metilene
None of them performed better than PCA.

Slide 21

Slide 21 text

COHCAP applied to GSE42308; ChAMP applied to GSE42308
DHS: known DNase I hypersensitive sites
DMR: known differentially methylated regions

Slide 22

Slide 22 text

DMRcate applied to EH1072
DSS takes more than a week.
metilene failed to identify DMCs.

Slide 23

Slide 23 text

PCA and TD based unsupervised FE with optimized SD can be applied to the identification of DMCs without any specific modification.

Slide 24

Slide 24 text

Finally, we applied them to histone modification…

Slide 25

Slide 25 text

Application to differential histone modification
The histograms of 1-P_i do not obey the Gaussian assumption (double peak), but…

Slide 26

Slide 26 text

Comparisons with other methods (H3K9me3, GSE24850)
The number of histone modification experiments overlapping with the selected genes

Slide 27

Slide 27 text

Other histone modifications
The number of histone modification experiments overlapping with the selected genes

Slide 28

Slide 28 text

PCA and TD based unsupervised FE with optimized SD can be applied to the identification of differential histone modification without any specific modification.

Slide 29

Slide 29 text

Conclusions
1. Principal component analysis (PCA) based and tensor decomposition (TD) based unsupervised feature extraction (FE), applied to the identification of differentially expressed genes (DEGs), can outperform various state-of-the-art methods, including DESeq2, when the standard deviations (SDs) used to generate the null hypothesis (a Gaussian distribution of principal components) are optimized.
2. They are also applicable to the identification of differentially methylated cytosines (DMCs) and of differential histone modification, without any modification specific to these data types.