Slide 1

Slide 1 text

Function COSI ISMB2022
Tensor decomposition based unsupervised feature extraction with optimized standard deviation applied to differentially expressed genes, DNA methylation and histone modification
Y-h. Taguchi, Ryo Ishibashi
Department of Physics, Chuo University, Tokyo, Japan

Slide 2

Slide 2 text

Basic claims
1. Our method, applied to the identification of differentially expressed genes (DEGs), can outperform various state-of-the-art methods when the standard deviations (SDs) used to generate the null hypothesis are optimized.
2. It is also applicable to the identification of differentially methylated cytosines (DMCs) and of differential histone modification, without any modification specific to those tasks.

Slide 3

Slide 3 text

The contents were published in the following three bioRxiv preprints.
• https://doi.org/10.1101/2022.02.18.481115
• https://doi.org/10.1101/2022.04.02.486807
• https://doi.org/10.1101/2022.04.29.490081

Slide 4

Slide 4 text

Outline
⚫ Background
  ◆ P >> N problem (large P, small N problem)
  ◆ Why do we choose PCA and TD?
⚫ Methods
  ◆ What is TD?
  ◆ Analysis procedure
⚫ Results
  ◆ Benchmark data set for DEG
  ◆ DNA methylation
  ◆ Histone modification
⚫ Conclusion
For detailed information, refer to the book available from https://link.springer.com/book/10.1007/978-3-030-22456-1

Slide 6

Slide 6 text

P >> N problem (large P, small N problem)
Large P: the number of variables is typically large. Ex.) genes, epigenomes
Small N: the number of samples is small. Ex.) subjects, experimental animals, or cultured cells
→ Difficult to handle computationally.

Slide 8

Slide 8 text

Why do we choose PCA and TD?
Merits
⚫ Highly versatile
⚫ Easy interpretation of results
⚫ Easy to implement
Effective for complicated life science data

Slide 10

Slide 10 text

What is tensor decomposition?
HOSVD (Higher Order Singular Value Decomposition):
$$x_{ijk} \simeq \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3)\, u_{l_1 i}\, u_{l_2 j}\, u_{l_3 k}$$
Example
N: number of genes (i)
M: number of samples (j)
K: number of tissues (k)
x_{ijk}: gene expression
[Figure: the N × M × K tensor x_{ijk} decomposed into the L_1 × L_2 × L_3 core tensor G and the vectors u_{l_1 i}, u_{l_2 j}, u_{l_3 k}]
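The HOSVD formula on this slide can be tried on a toy tensor. The following is a minimal NumPy sketch, not the authors' implementation: the function names `unfold` and `hosvd`, the random data and the tensor sizes are all assumptions made for illustration. It computes one set of singular vectors per mode from the unfolded tensor and then the core tensor G by projection.

```python
import numpy as np

def unfold(x, mode):
    """Matricize tensor x along `mode`: rows index that mode."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def hosvd(x):
    """Return the core tensor G and one factor matrix per mode (hypothetical helper).

    Columns of factors[m] are the left singular vectors of the mode-m
    unfolding, so factors[0][i, l1] plays the role of u_{l1 i}.
    """
    factors = []
    for mode in range(x.ndim):
        u, _, _ = np.linalg.svd(unfold(x, mode), full_matrices=False)
        factors.append(u)
    # Core tensor: project x onto the singular vectors of every mode.
    g = np.einsum("ijk,ia,jb,kc->abc", x, *factors)
    return g, factors

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 6, 3))  # toy sizes: N=50 genes, M=6 samples, K=3 tissues
g, (u_gene, u_sample, u_tissue) = hosvd(x)

# Rebuild x from G and the three sets of vectors, as in the formula.
x_hat = np.einsum("abc,ia,jb,kc->ijk", g, u_gene, u_sample, u_tissue)
```

Because no components are truncated here, `x_hat` reproduces `x` up to rounding; keeping only the leading components in each mode would give an approximation instead.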

Slide 11

Slide 11 text

Interpretation of TD (1/2)
j: samples. For some specific l_2, u_{l_2 j} distinguishes healthy controls from patients.
k: tissues. For some specific l_3, u_{l_3 k} represents tissue-specific expression.

Slide 12

Slide 12 text

Interpretation of TD (2/2)
i: genes. With l_2 and l_3 fixed, take the specific l_1 with max |G(l_1 l_2 l_3)|; u_{l_1 i} then identifies tDEGs (tissue-specific differentially expressed genes).
If G(l_1 l_2 l_3) > 0, the two tails of u_{l_1 i} correspond to tDEG: healthy controls < patients and tDEG: healthy controls > patients.
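Choosing l_1 for fixed l_2 and l_3 amounts to an argmax over one column of the core tensor. A small sketch, assuming G is stored as a NumPy array indexed as `g[l1, l2, l3]`; the helper name `pick_gene_vector` and the toy values are hypothetical.

```python
import numpy as np

def pick_gene_vector(g, l2, l3):
    """Return the l1 with the largest |G(l1, l2, l3)| and the sign of that entry."""
    column = g[:, l2, l3]
    l1 = int(np.argmax(np.abs(column)))
    return l1, float(np.sign(column[l1]))

# Toy core tensor with two non-zero entries for (l2, l3) = (1, 0).
g = np.zeros((4, 3, 2))
g[0, 1, 0] = 3.0
g[2, 1, 0] = -5.0

l1, sign = pick_gene_vector(g, l2=1, l3=0)
```

The sign is returned because the direction in which u_{l_1 i} should be read depends on whether G(l_1 l_2 l_3) is positive or negative.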

Slide 14

Slide 14 text

Analysis Procedure (1/2)
Matrix (gene × sample) → PCA → gene vectors, sample vectors
Tensor (gene × sample × tissue) → TD → gene vectors, sample vectors, tissue vectors
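For the matrix case, gene vectors and sample vectors can both be read off a single SVD. This is a sketch under stated assumptions: the data are random, and centring each sample over genes is a preprocessing choice made here for illustration, not something the slide specifies.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 8))           # toy matrix: 100 genes x 8 samples
x = x - x.mean(axis=0, keepdims=True)   # assumption: centre each sample over genes

u, s, vt = np.linalg.svd(x, full_matrices=False)
gene_vectors = u        # gene_vectors[i, l] is the l-th component attributed to gene i
sample_vectors = vt.T   # sample_vectors[j, l] is the l-th component attributed to sample j
```

The tensor case replaces the SVD with a tensor decomposition such as HOSVD, which yields a third set of vectors for the tissues.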

Slide 15

Slide 15 text

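Analysis Procedure (2/2)
Gene selection: the null hypothesis is a Gaussian distribution.
Novelty: the σ used to generate the null hypothesis is optimized.
$$\sigma_h = \sqrt{\frac{\sum_{n<n_0} \left(h_n - \overline{h_n}\right)^2}{N_{n<n_0}}}$$
[Figure: histogram of 1-P from 0 to 1, with the DEGs marked]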
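A minimal sketch of σ_h, read here as the standard deviation of the histogram counts h_n over the bins n < n_0. That reading, the function name `sigma_h` and the toy histograms are assumptions. The motivation is that P-values are uniformly distributed when the null hypothesis holds, so the part of the histogram that contains no DEGs should be flat.

```python
import math

def sigma_h(hist, n0):
    """Standard deviation of the histogram counts h_n over the bins n < n0."""
    head = hist[:n0]
    mean = sum(head) / len(head)
    return math.sqrt(sum((h - mean) ** 2 for h in head) / len(head))

# Two toy histograms of 1-P with five bins; the last bin holds the DEG peak.
flat = [10, 10, 10, 10, 40]
skewed = [4, 8, 12, 16, 40]

flat_score = sigma_h(flat, n0=4)      # the first four bins are perfectly flat
skewed_score = sigma_h(skewed, n0=4)  # the first four bins are not flat
```

A σ for which the first bins are flat gives the smaller score, which is what the optimization looks for.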

Slide 17

Slide 17 text

MAQC (benchmark data set for DEG *)
RNA-seq: x_{ij} represents the expression of the i-th gene in the j-th sample
Samples: seven Universal Human Reference RNA (UHRR) vs seven Human Brain Reference RNA (HBRR)
Measured for 40933 genes
(done by the presenter)
(*) https://www.fda.gov/science-research/bioinformatics-tools/microarraysequencing-quality-control-maqcseqc

Slide 18

Slide 18 text

[Figure: MA plot. Axes: gene-wise log FC ratio between the two classes; gene-wise mean log x_{ij}. Density distributions are also shown.]
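The two axes of an MA plot are simple gene-wise summaries of log x_{ij}. The sketch below uses random toy counts; the log base (2), the variable names and the 7 vs 7 class labels are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
counts = rng.integers(1, 1000, size=(20, 14)).astype(float)  # toy data: 20 genes x 14 samples
is_class_a = np.array([True] * 7 + [False] * 7)              # hypothetical 7 vs 7 design

log_x = np.log2(counts)
# A: gene-wise mean of log x_ij over all samples.
a_values = log_x.mean(axis=1)
# M: gene-wise log fold change between the two classes.
m_values = log_x[:, is_class_a].mean(axis=1) - log_x[:, ~is_class_a].mean(axis=1)
```

Plotting `m_values` against `a_values` gives the MA plot.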

Slide 19

Slide 19 text

PCA
Sample-wise principal components:
v_{1j} corresponds to the mean log x_{ij}
v_{2j} corresponds to the log FC ratio
→ corresponds to the MA plot

Slide 20

Slide 20 text

Gene-wise embedding by PCA
[Figure: u_{1i} and u_{2i}, with density distributions]

Slide 21

Slide 21 text

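Null hypothesis: u_{2i} obeys a Gaussian distribution.
P_i is attributed to each gene with the cumulative χ² distribution.
Histogram of 1-P_i: h_n is the count in the nth bin; n_0 is the bin at which adjusted P(n_0) = 0.1.
The optimal σ minimizes σ_h:
$$\sigma_h = \sqrt{\frac{\sum_{n<n_0} \left(h_n - \overline{h_n}\right)^2}{N_{n<n_0}}}$$
Select genes with adjusted P_i < 0.1.
[Figure: two panels (left, right), including the histogram of 1-P_i]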
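The selection procedure on this slide can be sketched end to end on synthetic data. This is an illustrative reading, not the authors' code. The assumptions are: one degree of freedom for the χ² distribution, Benjamini-Hochberg as the P-value adjustment, 100 histogram bins, a grid search over σ, and n_0 taken as the first bin containing a selected gene. All function names are hypothetical.

```python
import math
import numpy as np

def p_values(u, sigma):
    """Upper tail of chi-squared (1 d.o.f.) at (u_i/sigma)^2, i.e. erfc(|u_i| / (sigma*sqrt(2)))."""
    return np.array([math.erfc(abs(v) / (sigma * math.sqrt(2.0))) for v in u])

def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values (assumed adjustment method)."""
    n = len(p)
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.minimum(adjusted, 1.0)
    return out

def sigma_h(u, sigma, bins=100, threshold=0.1):
    """SD of the histogram of 1-P over the bins below the first bin with a selected gene."""
    p = p_values(u, sigma)
    significant = bh_adjust(p) < threshold
    hist, edges = np.histogram(1.0 - p, bins=bins, range=(0.0, 1.0))
    if significant.any():
        cutoff = (1.0 - p[significant]).min()
        n0 = int(np.searchsorted(edges, cutoff, side="right")) - 1
    else:
        n0 = bins  # nothing selected: use every bin
    n0 = max(1, min(n0, bins))
    return float(np.std(hist[:n0]))

# Synthetic gene embedding: many null genes plus some clear outliers.
rng = np.random.default_rng(3)
outliers = np.linspace(4.0, 10.0, 50)
u = np.concatenate([rng.normal(0.0, 0.5, size=2000), outliers, -outliers])

grid = np.linspace(0.2, 2.0, 37)                      # candidate values of sigma
best_sigma = min(grid, key=lambda s: sigma_h(u, s))   # sigma that minimizes sigma_h
selected = bh_adjust(p_values(u, best_sigma)) < 0.1   # genes with adjusted P_i < 0.1
```

A grid search is used only for simplicity; any one-dimensional minimizer could replace it.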

Slide 22

Slide 22 text

“Highly expressed genes should be more likely to be selected”
DESeq2: empirical dispersion relation
PCA: naturally satisfied
[Figure: plots of σ² against μ]

Slide 23

Slide 23 text

Biological validation
PCA vs DESeq2
Tissue specificity

Slide 24

Slide 24 text

PCA-based unsupervised FE with optimized SD outperforms various state-of-the-art methods while assuming neither an empirical dispersion relation nor a negative binomial distribution.

Slide 26

Slide 26 text

Application to DMC identification (EH1072, sequencing)
[Figure: by chromosome, t test of the P-values attributed by PCA between DNase I hypersensitive sites (DHS) and non-DHS]
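Comparing P-values between DHS and non-DHS sites comes down to a two-sample t statistic. The slide does not say which variant of the t test was used; the sketch below assumes Welch's unequal-variance form, and the P-values are made-up toy numbers.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

p_dhs = [0.01, 0.02, 0.03, 0.04]      # hypothetical P-values of cytosines inside DHS
p_non_dhs = [0.40, 0.50, 0.60, 0.70]  # hypothetical P-values of cytosines outside DHS

t = welch_t(p_dhs, p_non_dhs)
```

A negative `t` means that P-values inside DHS are smaller on average than those outside.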

Slide 27

Slide 27 text

PCA- and TD-based unsupervised FE with optimized SD can be applied to the identification of DMCs without any specific modification.
• https://doi.org/10.1101/2022.04.02.486807

Slide 29

Slide 29 text

Application to differential histone modification
Histograms of 1-P_i do not obey Gaussian (double peak), but ...
[Figure panels: H3K4me3, H3K27me3, H3K27ac]

Slide 30

Slide 30 text

Comparisons with other methods (H3K9me3, GSE24850)
[Figure: the number of histone modification experiments that overlap with the selected genes]

Slide 31

Slide 31 text

PCA- and TD-based unsupervised FE with optimized SD can be applied to the identification of differential histone modification without any specific modification.
• https://doi.org/10.1101/2022.04.29.490081

Slide 33

Slide 33 text

Conclusions
1. Principal component analysis (PCA)-based and tensor decomposition (TD)-based unsupervised feature extraction (FE), applied to the identification of differentially expressed genes (DEGs), can outperform various state-of-the-art methods, including DESeq2, when the standard deviations (SDs) used to generate the null hypothesis (Gaussian distribution of principal components) are optimized.
2. They are also applicable to the identification of differentially methylated cytosines (DMCs) and of differential histone modification, without any specific modification.