Slide 1

Slide 1 text

Tensor decomposition based unsupervised feature extraction with optimized standard deviation applied to identification of differential gene expression, DNA methylation and histone modification

Y-h. Taguchi, Department of Physics, Chuo University, Tokyo, Japan

The contents were published in the following paper and two preprints in bioRxiv.
https://rdcu.be/c0WE8 (Sci. Rep.)
https://doi.org/10.1101/2022.04.02.486807
https://doi.org/10.1101/2022.04.29.490081

ISAIC2022

Slide 2

Slide 2 text

Basic claims
1. Principal component analysis (PCA) based and tensor decomposition (TD) based unsupervised feature extraction (FE), applied to the identification of differentially expressed genes (DEGs), can outperform various state-of-the-art methods, including DESeq2, when the standard deviations (SDs) used to generate the null hypothesis (Gaussian distribution of principal components) are optimized.
2. They are also applicable to the identification of differentially methylated cytosines (DMCs) and of differential histone modification, without any modification specific to these data types.

Slide 3

Slide 3 text

The original (w/o SD optimization) PCA/TD based unsupervised FE
1. Apply PCA to matrices (e.g., genes ⨉ samples) or TD to tensors (e.g., genes ⨉ samples ⨉ tissues) and obtain vectors attributed separately to genes, samples, or tissues.
2. Select the vectors of interest among those attributed to samples and tissues.
3. Select genes whose components in the corresponding gene vectors are large, judged under the null hypothesis that the vector components follow a Gaussian distribution.
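The three steps above can be sketched for the matrix (PCA) case as follows. This is a minimal illustration, not the presenter's code. The function names are hypothetical. The per-sample standardization, the Benjamini-Hochberg correction, and the default threshold of 0.01 are assumptions. P-values come from the chi-squared distribution with one degree of freedom, evaluated through the standard-library `erfc`. The component index is passed in by the caller, who is expected to have chosen it by inspecting the sample vectors.

```python
import math
import numpy as np


def chi2_sf_1dof(x):
    """Upper-tail probability of a chi-squared variable with one degree of freedom."""
    return np.array([math.erfc(math.sqrt(v / 2.0)) for v in x])


def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values (assumed correction method)."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest P-value downwards.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted


def pca_unsupervised_fe(x, component, threshold=0.01):
    """x is a genes x samples matrix; returns (selected mask, gene vectors, sample vectors)."""
    x = np.asarray(x, dtype=float)
    # Assumed preprocessing: standardize each sample (column) across genes.
    x = (x - x.mean(axis=0)) / x.std(axis=0)
    # Step 1: PCA via SVD. Columns of u are gene vectors, columns of v are sample vectors.
    u, _, vt = np.linalg.svd(x, full_matrices=False)
    # Step 2 happens outside this function: the caller inspects v and picks `component`.
    # Step 3: Gaussian null hypothesis with the sample SD of the chosen gene vector.
    u_l = u[:, component]
    p = chi2_sf_1dof((u_l / u_l.std()) ** 2)
    return bh_adjust(p) < threshold, u, vt.T
```

A typical call would be `selected, u, v = pca_unsupervised_fe(x, component=0)`, where the component index is whichever sample vector in `v` shows the contrast of interest.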

Slide 4

Slide 4 text

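[Figure: workflow schematic. PCA decomposes a matrix (gene ⨉ sample) into gene vectors and sample vectors; TD decomposes a tensor (gene ⨉ sample ⨉ tissue) into gene vectors, sample vectors, and tissue vectors. Gene selection assumes a Gaussian distribution: histogram of 1-P (axis from 0 to 1, frequency) with the DEGs indicated.]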

Slide 5

Slide 5 text

Although PCA/TD based FE (w/o SD optimization) worked well for various problems, it has some problems.
1. The histogram of 1-P does not fully obey the null hypothesis.
2. Too few genes are selected to believe that there are no false negatives.
[Figure: histogram of 1-P (axis from 0 to 1, frequency) with the DEGs indicated.]

Slide 6

Slide 6 text

We have tried to resolve these problems...

Slide 7

Slide 7 text

MAQC (benchmark data set for DEG *)
RNA-seq: x_ij represents the expression of the i-th gene in the j-th sample
Samples: seven Universal Human Reference RNA (UHRR) vs seven Human Brain Reference RNA (HBRR)
Measured for 40933 genes (done by the presenter)
(*) https://www.fda.gov/science-research/bioinformatics-tools/microarraysequencing-quality-control-maqcseqc

Slide 8

Slide 8 text

[Figure: MA plot of the gene-wise mean of log x_ij and the gene-wise log fold change (FC) between the two classes, with density distributions.]
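The two quantities shown in an MA plot can be computed as in the short sketch below. It is an illustration only: the function name, the base-2 logarithm, and the pseudocount of 1 are assumptions, not details taken from the slides.

```python
import numpy as np


def ma_values(x, is_class_b, pseudocount=1.0):
    """Return (gene-wise mean log expression, gene-wise log fold change) for a genes x samples matrix."""
    is_class_b = np.asarray(is_class_b, dtype=bool)
    # Assumed transform: log2 with a pseudocount to avoid log(0).
    logx = np.log2(np.asarray(x, dtype=float) + pseudocount)
    mean_log = logx.mean(axis=1)
    # Log fold change: difference of class means on the log scale.
    log_fc = logx[:, is_class_b].mean(axis=1) - logx[:, ~is_class_b].mean(axis=1)
    return mean_log, log_fc
```

For the MAQC data, `is_class_b` would flag the seven samples of one of the two reference RNAs.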

Slide 9

Slide 9 text

[Figure: gene-wise embedding by PCA, u_1i and u_2i, with density distributions.]

Slide 10

Slide 10 text

Sample-wise principal components (PCA)
v_1j: corresponds to the mean of log x_ij
v_2j: corresponds to the log FC
Together they correspond to the MA plot.

Slide 11

Slide 11 text

[Figure: cumulative contribution of the principal components l=1 and l=2 (PCA).]
Almost 2D embedding

Slide 12

Slide 12 text

Null hypothesis: u_2i obeys a Gaussian distribution. P_i is computed with the cumulative χ2 distribution.
Left: histogram of 1-P_i, where h_n is the frequency in the n-th bin and n_0 is the bin at which adjusted P(n_0) = 0.1.
Right: σ_h as a function of σ. The optimal σ minimizes σ_h.
Select genes with adjusted P_i < 0.1.
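The SD optimization can be sketched as a search over candidate values of σ. This is a minimal illustration under stated assumptions, not the presenter's implementation. It assumes that σ_h is the standard deviation of the bin counts h_n over the bins below n_0, that P-values are adjusted with the Benjamini-Hochberg procedure, that the histogram has 100 bins, and that a simple grid search over a bounded range of σ is sufficient. All function names are hypothetical.

```python
import math
import numpy as np


def chi2_sf_1dof(x):
    """Upper-tail probability of a chi-squared variable with one degree of freedom."""
    return np.array([math.erfc(math.sqrt(v / 2.0)) for v in x])


def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values (assumed correction method)."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted


def p_values(u, sigma):
    """P_i under the null hypothesis that u_i is Gaussian with standard deviation sigma."""
    return chi2_sf_1dof((np.asarray(u, dtype=float) / sigma) ** 2)


def histogram_sd(u, sigma, n_bins=100, threshold=0.1):
    """sigma_h: SD of the histogram of 1-P_i over the bins below n_0."""
    p = p_values(u, sigma)
    selected = bh_adjust(p) < threshold
    counts, _ = np.histogram(1.0 - p, bins=n_bins, range=(0.0, 1.0))
    if selected.any():
        # n_0: lowest bin that contains a selected gene.
        n0 = min(int((1.0 - p[selected].max()) * n_bins), n_bins - 1)
    else:
        n0 = n_bins
    if n0 < 2:
        return float("inf")  # too few bins left to measure flatness
    return float(counts[:n0].std())


def optimize_sigma(u, candidates, n_bins=100, threshold=0.1):
    """Return the candidate sigma with the flattest histogram below n_0."""
    scores = [histogram_sd(u, s, n_bins, threshold) for s in candidates]
    return float(candidates[int(np.argmin(scores))])
```

With `sigma = optimize_sigma(u, candidates)`, genes are then selected as `bh_adjust(p_values(u, sigma)) < 0.1`. The candidate range should be bounded away from very small σ, because raw bin counts also become small, and therefore deceptively flat, when almost every gene is excluded.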

Slide 13

Slide 13 text

DESeq2: empirical dispersion relation
PCA: naturally satisfied.
“Highly expressed genes should be more likely to be selected”
[Figure: relation between μ and σ^2.]

Slide 14

Slide 14 text

Biological validation: PCA vs DESeq2
[Figure: tissue specificity.]

Slide 15

Slide 15 text

PCA vs DESeq2 vs edgeR vs NOISeq vs voom
[Figure panels: (A) KEGG, (B) GO BP, (C) Human gene atlas.]

Slide 16

Slide 16 text

PCA based unsupervised FE with optimized SD outperforms various state-of-the-art methods (DESeq2, edgeR, NOISeq, voom) while assuming neither an empirical dispersion relation nor a negative binomial distribution.

Slide 17

Slide 17 text

Next, we tried to apply them to DNA methylation...

Slide 18

Slide 18 text

Application to DMC identification (GSE42308, microarray)
DMR: known differentially methylated regions
DHS: known DNase I hypersensitive sites

Slide 19

Slide 19 text

Application to DMC identification (EH1072, sequencing)
[Figure: t test of the P-values attributed by PCA, DHS versus non-DHS, by chromosome.]

Slide 20

Slide 20 text

Various state-of-the-art methods were compared.
Microarray: ChAMP and COHCAP
Sequencing: DMRcate, DSS, and metilene
None of them performed better than PCA.

Slide 21

Slide 21 text

[Figure: COHCAP applied to GSE42308 and ChAMP applied to GSE42308.]
DHS: known DNase I hypersensitive sites
DMR: known differentially methylated regions

Slide 22

Slide 22 text

[Figure: DMRcate applied to EH1072.]
DSS takes more than a week.
metilene failed to identify DMCs.

Slide 23

Slide 23 text

PCA and TD based unsupervised FE with optimized SD can be applied to the identification of DMCs without any modification specific to this data type.

Slide 24

Slide 24 text

Finally, we applied them to histone modification...

Slide 25

Slide 25 text

Application to differential histone modification
The histograms of 1-P_i do not agree with the Gaussian assumption (double peak), but...

Slide 26

Slide 26 text

Comparisons with other methods (H3K9me3, GSE24850)
[Figure: the number of histone modification experiments that overlap with the selected genes.]

Slide 27

Slide 27 text

Other histone modifications
[Figure: the number of histone modification experiments that overlap with the selected genes.]

Slide 28

Slide 28 text

PCA and TD based unsupervised FE with optimized SD can be applied to the identification of differential histone modification without any modification specific to this data type.

Slide 29

Slide 29 text

Conclusions
1. Principal component analysis (PCA) based and tensor decomposition (TD) based unsupervised feature extraction (FE), applied to the identification of differentially expressed genes (DEGs), can outperform various state-of-the-art methods, including DESeq2, when the standard deviations (SDs) used to generate the null hypothesis (Gaussian distribution of principal components) are optimized.
2. They are also applicable to the identification of differentially methylated cytosines (DMCs) and of differential histone modification, without any modification specific to these data types.