Tensor decomposition based unsupervised feature extraction with optimized standard deviation applied to differentially expressed genes, DNA methylation and histone modification

Presentation at WMCB2022
https://sites.google.com/view/wmcb2022/

9th June 2022, virtual workshop

Y-h. Taguchi

June 09, 2022

Transcript

  1. WMCB2022
    Tensor decomposition based unsupervised feature extraction with
    optimized standard deviation applied to differentially expressed
    genes, DNA methylation and histone modification
    Y-h. Taguchi, Department of Physics, Chuo University, Tokyo, Japan
    The contents were published in the following three preprints in
    bioRxiv.
    https://doi.org/10.1101/2022.02.18.481115
    https://doi.org/10.1101/2022.04.02.486807
    https://doi.org/10.1101/2022.04.29.490081

  2. WMCB2022
    Basic claims
    1. Principal component analysis (PCA)-based and tensor
    decomposition (TD)-based unsupervised feature extraction (FE),
    applied to the identification of differentially expressed genes (DEGs),
    can outperform various state-of-the-art methods, including DESeq2,
    when the standard deviations (SDs) used to generate the null hypothesis
    (Gaussian distribution of principal components) are optimized.
    2. They are also applicable to the identification of differentially methylated
    cytosines (DMCs) and of differential histone modification, without
    any modification specific to those tasks.

  3. WMCB2022
    The original (w/o SD optimization) PCA/TD based unsupervised FE
    1. Apply PCA to matrices (e.g., genes ⨉ samples)
    or TD to tensors (e.g. genes ⨉ samples ⨉ tissues)
    and get vectors attributed separately to genes,
    samples, or tissues.
    2. Select the vectors of interest, attributed to
    samples and tissues.
    3. Select genes whose components in the
    corresponding gene vectors are significantly
    large (tested against the null hypothesis that the
    vector components follow a Gaussian distribution).
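The three steps above can be sketched for the matrix (PCA) case as follows. This is a minimal illustration, not the presenter's implementation. It assumes that PCA is computed by SVD of a genes ⨉ samples matrix whose columns are centered, that the P-value is the upper tail of a χ2 distribution with one degree of freedom, and that "adjusted" means the Benjamini-Hochberg correction. The names `pca_fe` and `bh_adjust` are hypothetical.

```python
import math
import numpy as np


def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values (assumed correction method)."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest P-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted


def pca_fe(x, component, threshold=0.1):
    """Select genes (rows of x) with outlying values in one gene vector.

    x: genes x samples matrix; component: index of the gene vector whose
    matching sample vector was judged to be of interest (step 2).
    """
    centered = x - x.mean(axis=0, keepdims=True)  # center each sample
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    u_l = u[:, component]          # gene vector
    sigma = u_l.std()              # SD without optimization
    chi2_stat = (u_l / sigma) ** 2
    # Upper tail of chi-squared with 1 degree of freedom.
    p = np.array([math.erfc(math.sqrt(c / 2.0)) for c in chi2_stat])
    selected = np.flatnonzero(bh_adjust(p) < threshold)
    return selected, p, vt         # vt holds the sample vectors
```

In practice the rows of `vt` would be inspected first to decide which `component` separates the sample classes; the function only automates step 3.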

  4. WMCB2022
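    [Diagram: PCA decomposes a matrix (gene ⨉ sample) into gene vectors
    and sample vectors; TD decomposes a tensor (gene ⨉ sample ⨉ tissue)
    into gene vectors, sample vectors, and tissue vectors.]
    Gene selection under the Gaussian null hypothesis:
    $P_i = P_{\chi^2}\left[ > \left( \frac{u_{2i}}{\sigma} \right)^2 \right]$
    [Figure: histogram of 1-P (x-axis from 0 to 1, y-axis: frequency), with the DEGs marked]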

  5. WMCB2022
    Although PCA/TD-based FE (w/o SD optimization) worked quite well
    for various problems, it has some shortcomings.
    1. The histogram of 1-P does not fully obey the null hypothesis.
    2. Too few genes are selected to believe that there are no false negatives.
    [Figure: histogram of 1-P (x-axis from 0 to 1, y-axis: frequency), with the DEGs marked]

  6. WMCB2022
    We tried to resolve these problems…

  7. WMCB2022
    MAQC (benchmark data set for DEGs *)
    RNA-seq: x_{ij} represents the expression of the ith gene in the jth sample
    Samples: seven Universal Human Reference RNA (UHRR) vs
    seven Human Brain Reference RNA (HBRR)
    Measured for 40933 genes (done by the presenter)
    (*) https://www.fda.gov/science-research/bioinformatics-tools/microarraysequencing-quality-control-maqcseqc

  8. WMCB2022
    [Figure: MA plot with density distributions; x-axis: gene-wise mean of
    log x_{ij}; y-axis: gene-wise log FC ratio between the two classes]

  9. WMCB2022
    [Figure: gene-wise embedding by PCA, u_{1i} against u_{2i}, with density distributions]

  10. WMCB2022
    Sample-wise principal components (PCA)
    [Figure: v_{1j} and v_{2j} plotted against samples]
    v_{1j}: corresponds to mean log x_{ij}
    v_{2j}: corresponds to log FC ratio
    Together they correspond to the MA plot.
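The correspondence between the first two principal components and the axes of an MA plot can be checked on a toy example. The sketch below is only an illustration under strong assumptions: the data are made up (4 genes, 3 control and 3 brain samples), the log expression matrix is exactly rank 2 with orthogonal gene-wise mean and shift patterns, and PCA is computed by SVD of the uncentered matrix. None of these choices come from the slides.

```python
import numpy as np

# Hypothetical toy data: 4 genes x 6 samples (3 control, then 3 brain).
base = np.array([5.0, 5.0, 3.0, 3.0])      # gene-wise mean of log expression
shift = np.array([1.0, -1.0, 1.0, -1.0])   # half of the gene-wise log fold change
contrast = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
log_x = base[:, None] + shift[:, None] * contrast[None, :]

# Columns of u are gene vectors, rows of vt are sample vectors.
u, s, vt = np.linalg.svd(log_x, full_matrices=False)

gene_mean = log_x.mean(axis=1)
log_fc = log_x[:, 3:].mean(axis=1) - log_x[:, :3].mean(axis=1)


def abs_corr(a, b):
    """Absolute Pearson correlation (the sign of an SVD vector is arbitrary)."""
    return abs(float(np.corrcoef(a, b)[0, 1]))


print("v_1 is constant over samples:", np.allclose(np.abs(vt[0]), 1 / np.sqrt(6)))
print("|corr(v_2, class contrast)| =", abs_corr(vt[1], contrast))
print("|corr(u_1, gene-wise mean)| =", abs_corr(u[:, 0], gene_mean))
print("|corr(u_2, log FC)|         =", abs_corr(u[:, 1], log_fc))
```

On real data the match is only approximate, which is consistent with the slide's wording "corresponds to".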

  11. WMCB2022
    [Figure: cumulative contribution of the PCA components, with l=1 and l=2 marked]
    Almost 2D embedding

  12. WMCB2022
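    Null hypothesis: u_{2i} obeys a Gaussian distribution with standard deviation
    $\sigma = \sqrt{\frac{\sum_i \left( u_{2i} - \langle u_{2i} \rangle \right)^2}{N}}$
    $P_i = P_{\chi^2}\left[ > \left( \frac{u_{2i}}{\sigma} \right)^2 \right]$
    Select genes with adjusted P_i < 0.1.
    The optimal σ minimizes σ_h:
    $\sigma_h = \sqrt{\frac{\sum_{n \le n_0} \left( h_n - \langle h_n \rangle \right)^2}{N_n(n_0)}}$
    h_n: histogram of 1-P_i in the nth bin; n_0: the bin at which adjusted P(n_0) = 0.1.
    [Figures: cumulative χ2 distribution; histogram h_n of 1-P_i against bin n, with n_0 marked]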
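One way to read the SD optimization on this slide is as a one-dimensional search over σ. The sketch below is a hedged illustration, not the presenter's code. Its assumptions: the adjusted P-values use the Benjamini-Hochberg correction; only bins strictly below the bin holding the largest unselected 1-P_i are used as the bins up to n_0; the search is a plain grid between 0.2 and 1.0 times the unoptimized σ; and the spread of h_n is divided by its mean so that the score cannot shrink merely because few genes remain unselected. That normalization is not on the slide. All function names are hypothetical.

```python
import math
import numpy as np


def chi2_pvalues(u, sigma):
    """P_i = P_chi2[> (u_i / sigma)^2] with one degree of freedom."""
    return np.array([math.erfc(abs(v) / (sigma * math.sqrt(2.0))) for v in u])


def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values (assumed correction method)."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted


def histogram_spread(u, sigma, n_bins=50, threshold=0.1):
    """Relative SD of the histogram of 1 - P_i over the unselected bins."""
    p = chi2_pvalues(u, sigma)
    unselected = bh_adjust(p) >= threshold
    if not unselected.any():
        return math.inf
    h, _ = np.histogram(1.0 - p, bins=n_bins, range=(0.0, 1.0))
    # n0: bin holding the largest 1 - P_i among unselected genes.
    n0 = min(int((1.0 - p[unselected]).max() * n_bins), n_bins - 1)
    null_bins = h[:n0]
    if n0 < 2 or null_bins.sum() == 0:
        return math.inf
    return float(null_bins.std() / null_bins.mean())


def optimize_sigma(u, n_grid=81, **kwargs):
    """Grid search for the sigma that makes the null histogram flattest."""
    sigma_raw = float(np.std(u))
    candidates = sigma_raw * np.linspace(0.2, 1.0, n_grid)
    scores = [histogram_spread(u, s, **kwargs) for s in candidates]
    return float(candidates[int(np.argmin(scores))]), sigma_raw
```

Genes would then be selected with `bh_adjust(chi2_pvalues(u, sigma_opt)) < 0.1`. Because the unoptimized σ is inflated by the DEGs themselves, the optimized value is expected to be smaller, so more genes pass the threshold.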

  13. WMCB2022
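    DESeq2: empirical dispersion relation
    $\frac{\sigma(\mu)^2}{\mu} = \sigma_0^2 + \frac{\sigma_1^2}{\mu}$
    [Figure: σ^2 plotted against μ]
    PCA: naturally satisfied.
    $\Delta = \langle x_{ij} \rangle_{\mathrm{brain}} - \langle x_{ij} \rangle_{\mathrm{cntl}} \simeq u_{2i}$
    $\mathrm{LFC} = \log_2 \frac{\langle x_{ij} \rangle_{\mathrm{brain}}}{\langle x_{ij} \rangle_{\mathrm{cntl}}} = \log_2 \left( 1 + \frac{\Delta}{\langle x_{ij} \rangle_{\mathrm{cntl}}} \right)$
    “Highly expressed genes should be more likely selected”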

  14. WMCB2022
    Biological validation PCA vs DESeq2
    Tissue specificity

  15. WMCB2022
    (A) KEGG (B) GO BP (C) Human gene atlas
    PCA vs DESeq2 vs edgeR vs NOISeq vs voom

  16. WMCB2022
    PCA-based unsupervised FE with optimized SD
    outperforms various state-of-the-art methods (DESeq2,
    edgeR, NOISeq, voom) while assuming neither an empirical
    dispersion relation nor a negative binomial distribution.

  17. WMCB2022
    Next, we tried to apply them to DNA methylation …...

  18. WMCB2022
    Application to DMC identification (GSE42308, microarray)
    DMR: known differentially methylated regions
    DHS: known DNase I hypersensitive sites

  19. WMCB2022
    Application to DMC identification (EH1072, sequencing)
    [Figure: t test of the P-values assigned by PCA, DHS vs. non-DHS; x-axis: chromosome]

  20. WMCB2022
    Various state-of-the-art methods were compared:
    Microarray: ChAMP and COHCAP
    Sequencing: DMRcate, DSS, and metilene
    None of them performed better than PCA.

  21. WMCB2022
    [Figure panels: COHCAP applied to GSE42308; ChAMP applied to GSE42308]
    DHS: known DNase I hypersensitive sites
    DMR: known differentially methylated regions

  22. WMCB2022
    [Figure: DMRcate applied to EH1072]
    DSS takes more than a week.
    metilene failed to identify DMCs.

  23. WMCB2022
    PCA- and TD-based unsupervised FE with
    optimized SD can be applied to the identification of
    DMCs without any modification specific to this task.

  24. WMCB2022
    Finally, we applied them to histone modification….

  25. WMCB2022
    Application to differential histone modification
    Histograms of 1-P_i do not obey the Gaussian null hypothesis (double peak), but…

  26. WMCB2022
    Comparisons with other methods (H3K9me3, GSE24850)
    [Figure: the number of histone modification experiments overlapping with
    the selected genes]

  27. WMCB2022
    Other histone modifications
    [Figure: the number of histone modification experiments overlapping with
    the selected genes]

  28. WMCB2022
    PCA- and TD-based unsupervised FE with optimized SD
    can be applied to the identification of differential histone
    modification without any modification specific to this task.

  29. WMCB2022
    Conclusions
    1. Principal component analysis (PCA)-based and tensor
    decomposition (TD)-based unsupervised feature extraction (FE),
    applied to the identification of differentially expressed genes (DEGs),
    can outperform various state-of-the-art methods, including DESeq2,
    when the standard deviations (SDs) used to generate the null hypothesis
    (Gaussian distribution of principal components) are optimized.
    2. They are also applicable to the identification of differentially methylated
    cytosines (DMCs) and of differential histone modification, without
    any modification specific to those tasks.
