
Tensor decomposition based unsupervised feature extraction with optimized standard deviation applied to identification of differential gene expression, DNA methylation and histone modification

Y-h. Taguchi
December 07, 2022

Presentation at ISAIC2022
https://www.isaic-conf.com/#/
10th Dec 2022

Transcript

  1. Tensor decomposition based unsupervised feature extraction with
    optimized standard deviation applied to identification of differential
    gene expression, DNA methylation and histone modification
    Y-h. Taguchi, Department of Physics, Chuo University, Tokyo, Japan
    The contents were published in the
    following paper and two preprints in bioRxiv.
    https://rdcu.be/c0WE8 (Sci. Rep.)
    https://doi.org/10.1101/2022.04.02.486807
    https://doi.org/10.1101/2022.04.29.490081
    ISAIC2022                             

  2. Basic claims
    1. Principal component analysis (PCA)-based and tensor
    decomposition (TD)-based unsupervised feature extraction (FE),
    applied to the identification of differentially expressed genes (DEGs),
    can outperform various state-of-the-art methods, including DESeq2,
    when the standard deviations (SDs) used to generate the null hypothesis
    (Gaussian distribution of principal components) are optimized.
    2. They are also applicable to the identification of differentially
    methylated cytosines (DMCs) and of differential histone modification
    without any specific modification of the method.

  3. The original (w/o SD optimization) PCA/TD based unsupervised FE
    1. Apply PCA to matrices (e.g., genes ⨉ samples) or
    TD to tensors (e.g., genes ⨉ samples ⨉ tissues) and get
    vectors attributed separately to genes, samples, or
    tissues.
    2. Select the vectors of interest among those attributed to samples
    and tissues.
    3. Select genes whose contributions to the corresponding vectors
    attributed to genes are large, i.e., genes with small P-values under
    the null hypothesis that the components of those vectors obey a
    Gaussian distribution.
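
The three steps above can be sketched for the matrix (PCA) case as follows. This is a minimal illustration, not the presenter's code: the column standardization, the Benjamini-Hochberg adjustment, the 0.01 threshold, the toy data, and all function names are assumptions made for this example. The SD is simply taken from the gene vector itself, i.e., it is not optimized.

```python
import math
import numpy as np

def gene_and_sample_vectors(x):
    """PCA via SVD of a genes x samples matrix (columns standardized; an assumption)."""
    z = (x - x.mean(axis=0)) / x.std(axis=0)
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    return u, vt  # u[:, l]: l-th gene vector, vt[l]: l-th sample vector

def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values (assumed correction method)."""
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(np.minimum.accumulate(ranked[::-1])[::-1], 1.0)
    return adjusted

def select_genes(u_l, threshold=0.01):
    """Indices of genes that are outliers under a Gaussian null for u_l."""
    sd = u_l.std()  # SD estimated from the data, not optimized
    # (u/sd)^2 follows chi-squared(1) under the null; its upper tail
    # equals erfc(|u| / (sd * sqrt(2))).
    p = np.array([math.erfc(abs(v) / (sd * math.sqrt(2.0))) for v in u_l])
    return np.flatnonzero(bh_adjust(p) < threshold)

# Toy data (hypothetical): 1000 genes x 6 samples, two classes of 3 samples.
# The first 20 genes are shifted in opposite directions in the two classes.
rng = np.random.default_rng(0)
labels = np.array([0, 0, 0, 1, 1, 1])
x = rng.normal(size=(1000, 6))
x[:20, labels == 0] += 5.0
x[:20, labels == 1] -= 5.0

u, vt = gene_and_sample_vectors(x)                        # step 1
corr = [abs(np.corrcoef(v, labels)[0, 1]) for v in vt]    # step 2: vector of interest
l = int(np.argmax(corr))
selected = select_genes(u[:, l])                          # step 3
print(l, selected)
```

In this sketch, step 2 is automated by picking the sample vector most correlated with the class labels; in practice the vector of interest may be chosen by inspection.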

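  4. [Figure: schematic of the method. A matrix (gene ⨉ sample) is decomposed
    by PCA into gene vectors and sample vectors; a tensor (gene ⨉ sample ⨉
    tissue) is decomposed by TD into gene vectors, sample vectors, and tissue
    vectors. Gene selection: a Gaussian distribution is assumed for the gene
    vectors, and DEGs are marked in the histogram of 1-P (frequency vs. 1-P,
    from 0 to 1).]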
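
For the tensor branch of the schematic, a decomposition into per-mode vectors can be sketched with a higher-order SVD (HOSVD). The slides do not say which TD algorithm is used, so HOSVD is an assumption here, as are the function names and the toy tensor shape.

```python
import numpy as np

def unfold(t, mode):
    """Matricize tensor t so that rows index the given mode."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def mode_product(t, matrix, mode):
    """Multiply tensor t by `matrix` along `mode`."""
    return np.moveaxis(np.tensordot(matrix, t, axes=(1, mode)), 0, mode)

def hosvd(t):
    """Return one factor matrix per mode (columns are vectors) and the core tensor."""
    factors = [np.linalg.svd(unfold(t, m), full_matrices=False)[0]
               for m in range(t.ndim)]
    core = t
    for m, u in enumerate(factors):
        core = mode_product(core, u.T, m)
    return factors, core

def reconstruct(factors, core):
    """Rebuild the tensor from the core and the factor matrices."""
    t = core
    for m, u in enumerate(factors):
        t = mode_product(t, u, m)
    return t

# Hypothetical toy tensor: 50 genes x 4 samples x 3 tissues.
rng = np.random.default_rng(0)
tensor = rng.normal(size=(50, 4, 3))
factors, core = hosvd(tensor)
gene_vectors, sample_vectors, tissue_vectors = factors
print(gene_vectors.shape, sample_vectors.shape, tissue_vectors.shape)
```

The columns of `gene_vectors` play the role of the gene vectors in the PCA case; the entries of `core` indicate which combinations of gene, sample, and tissue vectors carry the most weight.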

  5. Although PCA/TD based FE (w/o SD optimization) has worked well
    for various problems, it has some shortcomings.
    1. The histogram of 1-P does not fully obey the null hypothesis.
    2. Too few genes are selected to believe that there are no false negatives.
    [Figure: histogram of 1-P (frequency vs. 1-P, from 0 to 1), with the DEGs marked.]

  6. We have tried to resolve these problems ...

  7. MAQC (benchmark data set for DEGs *)
    RNA-seq: x_{ij} represents the expression of the ith gene in the jth sample
    Samples: seven Universal Human Reference RNA (UHRR) vs
    seven Human Brain Reference RNA (HBRR)
    Measured for 40933 genes (done by the presenter)
    (*) https://www.fda.gov/science-research/bioinformatics-tools/microarraysequencing-quality-control-maqcseqc

  8. [Figure: MA plot with density distributions. Axes: gene-wise mean of
    log x_{ij} and gene-wise log FC ratio between the two classes.]
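
The two axes of an MA plot can be computed as in the sketch below. The base-2 logarithm, the pseudocount of 1, and the function name are assumptions for illustration; the slide only states "log x_{ij}".

```python
import numpy as np

def ma_values(x, is_class_a, pseudocount=1.0):
    """Gene-wise mean log expression (A) and log fold-change ratio (M).

    x: genes x samples matrix of non-negative expression values.
    is_class_a: boolean array marking the samples of the first class.
    """
    logx = np.log2(x + pseudocount)
    a = logx.mean(axis=1)
    m = logx[:, is_class_a].mean(axis=1) - logx[:, ~is_class_a].mean(axis=1)
    return a, m

# One hypothetical gene measured in two samples per class.
x = np.array([[1.0, 1.0, 3.0, 3.0]])
is_class_a = np.array([True, True, False, False])
a, m = ma_values(x, is_class_a)
print(a, m)
```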

  9. [Figure: gene-wise embedding by PCA with density distributions.
    Axes: u_{1i} and u_{2i}.]

  10. Sample-wise principal components (PCA)
    [Figure: v_{1j} and v_{2j}.]
    v_{1j} corresponds to the mean of log x_{ij}.
    v_{2j} corresponds to the log FC ratio.
    The embedding by PCA therefore corresponds to the MA plot.

  11. [Figure: cumulative contribution of the principal components (PCA),
    with l=1 and l=2 marked.]
    Almost 2D embedding
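
The cumulative contribution can be obtained from the singular values of the (standardized) data matrix, as in this short sketch. The function name and the toy matrix are illustrative assumptions.

```python
import numpy as np

def cumulative_contribution(z):
    """Cumulative fraction of total variance explained by the leading components."""
    s = np.linalg.svd(z, compute_uv=False)  # singular values, descending
    return np.cumsum(s ** 2) / np.sum(s ** 2)

# Toy matrix with singular values 4 and 3.
z = np.array([[3.0, 0.0], [0.0, 4.0]])
contribution = cumulative_contribution(z)
print(contribution)
```

If the second entry is already close to 1, the first two components capture nearly all of the variance, which is what "almost 2D embedding" means.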

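  12. Null hypothesis: u_{2i} obeys a Gaussian distribution.
    P_i is computed from the cumulative χ2 distribution.
    Left: [Figure: histogram of 1-P_i, where h_n is the count in the nth bin
    and n_0 is the bin at which the adjusted P reaches 0.1, i.e.,
    adjusted P(n_0)=0.1.]
    Right: [Figure: σ_h as a function of σ.] The optimal σ minimizes σ_h,
    the SD of h_n (given on the slide in terms of h_n and n_0).
    Select genes with adjusted P_i<0.1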
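
The idea of choosing σ so that the histogram of 1-P_i is as flat as possible over the non-selected bins can be sketched as a grid search. This is an illustration under stated assumptions, not the presenter's implementation: the Benjamini-Hochberg adjustment, the number of bins, the candidate grid, and the toy data are all chosen for this example, and the flatness score divides the SD of the bin counts by their mean (a normalization added here so that scores stay comparable when the number of non-selected genes changes).

```python
import math
import numpy as np

def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values (assumed correction method)."""
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(np.minimum.accumulate(ranked[::-1])[::-1], 1.0)
    return adjusted

def p_values(u, sigma):
    """Upper tail of chi-squared(1) at (u/sigma)^2, i.e. erfc(|u|/(sigma*sqrt(2)))."""
    return np.array([math.erfc(abs(v) / (sigma * math.sqrt(2.0))) for v in u])

def flatness_score(u, sigma, n_bins=20, threshold=0.1):
    """Relative SD of the histogram of 1-P over the bins below the selected genes."""
    p = p_values(u, sigma)
    selected = bh_adjust(p) < threshold
    h, _ = np.histogram(1.0 - p, bins=n_bins, range=(0.0, 1.0))
    # n0: number of leading bins lying entirely below every selected gene.
    if selected.any():
        n0 = int(np.floor((1.0 - p[selected].max()) * n_bins))
    else:
        n0 = n_bins
    null_bins = h[:n0]
    if n0 < 2 or null_bins.mean() == 0:
        return np.inf
    return null_bins.std() / null_bins.mean()

def optimize_sd(u, candidates, **kwargs):
    """Return the candidate sigma with the flattest null part of the histogram."""
    scores = [flatness_score(u, s, **kwargs) for s in candidates]
    return float(candidates[int(np.argmin(scores))])

# Toy gene vector (hypothetical): 5000 null genes with SD 1 plus 200 outliers.
rng = np.random.default_rng(0)
u = np.concatenate([rng.normal(0.0, 1.0, 5000), rng.choice([-8.0, 8.0], 200)])
best_sigma = optimize_sd(u, np.linspace(0.3, 3.0, 28))
genes = np.flatnonzero(bh_adjust(p_values(u, best_sigma)) < 0.1)
print(u.std(), best_sigma, genes.size)
```

In the toy data the SD estimated from all genes is inflated by the outliers, whereas the optimized σ is close to the SD of the null genes, so more genes pass the adjusted P threshold.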

  13. DESeq2: empirical dispersion relation
    PCA: naturally satisfied.
    “Highly expressed genes should be more likely selected”
    [Figure: plots with axes μ and σ2.]

  14. Biological validation PCA vs DESeq2
    Tissue specificity

  15. (A) KEGG (B) GO BP (C) Human gene atlas
    PCA vs DESeq2 vs edgeR vs NOISeq vs voom

  16. PCA based unsupervised FE with optimized SD
    outperforms various state-of-the-art methods (DESeq2,
    edgeR, NOISeq, and voom) while assuming neither an empirical
    dispersion relation nor a negative binomial distribution.

  17. Next, we tried to apply them to DNA methylation ...

  18. Application to DMC identification (GSE42308, microarray)
    DMR: known differentially methylated regions
    DHS: known DNase I hypersensitive sites

  19. Application to DMC identification (EH1072, sequencing)
    [Figure: t test of P-values attributed by PCA
    between DHS and non-DHS, per chromosome.]
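
A two-sample t test between the P-values at DHS and non-DHS positions can be sketched as follows. The slide does not state which variant of the t test is used; Welch's unequal-variance form is assumed here, and the function name and the numbers are purely illustrative.

```python
import numpy as np

def welch_t(a, b):
    """Welch's t statistic and its approximate degrees of freedom."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    va, vb = a.var(ddof=1) / a.size, b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    return t, df

# Illustrative numbers only, standing in for per-site values in the two groups.
dhs = [1.0, 2.0, 3.0, 4.0, 5.0]
non_dhs = [2.0, 4.0, 6.0, 8.0, 10.0]
t, df = welch_t(dhs, non_dhs)
print(t, df)
```

The same statistic, together with a P-value, is returned by `scipy.stats.ttest_ind(a, b, equal_var=False)`; repeating the test per chromosome gives one result per chromosome.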

  20. Various state-of-the-art methods were compared:
    Microarray: ChAMP and COHCAP
    Sequencing: DMRcate, DSS, and metilene
    None of them performed better than PCA.

  21. [Figure panels: COHCAP applied to GSE42308; ChAMP applied to GSE42308.]
    DHS: known DNase I hypersensitive sites
    DMR: known differentially methylated regions

  22. DMRcate applied to EH1072
    DSS takes more than a week.
    metilene failed to identify DMCs.

  23. PCA and TD based unsupervised FE with
    optimized SD can be applied to the identification of
    DMCs without any specific modification.

  24. Finally, we applied them to histone modification ...

  25. Application to differential histone modification
    Histograms of 1-P_i do not obey the Gaussian null hypothesis (double peak), but ...

  26. Comparisons with other methods (H3K9me3, GSE24850)
    The number of histone modification experiments overlapping with
    the selected genes

  27. Other histone modification
    The number of histone modification experiments overlapping with
    the selected genes

  28. PCA and TD based unsupervised FE with optimized SD
    can be applied to the identification of differential histone
    modification without any specific modification.

  29. Conclusions
    1. Principal component analysis (PCA)-based and tensor
    decomposition (TD)-based unsupervised feature extraction (FE),
    applied to the identification of differentially expressed genes (DEGs),
    can outperform various state-of-the-art methods, including DESeq2,
    when the standard deviations (SDs) used to generate the null
    hypothesis (Gaussian distribution of principal components) are
    optimized.
    2. They are also applicable to the identification of differentially
    methylated cytosines (DMCs) and of differential histone modification
    without any specific modification of the method.