Advanced unsupervised feature extraction finds novel application and software tool to select more reasonable differentially methylated cytosines

Y-h. Taguchi
December 07, 2022

Presentation at GIW/ISCB ASIA 2022
https://www.iscb.org/giw-iscb-asia2022
13th Dec 2022

Transcript

  1. Advanced unsupervised feature extraction finds novel application
    and software tool to select more reasonable differentially
    methylated cytosines
    Y-H. Taguchi, Department of Physics,
    Chuo University, Tokyo 112-8551, Japan
    Turki Turki, Department of Computer Science,
    King Abdulaziz University, Jeddah 21589, Saudi Arabia
    Preprint: doi: https://doi.org/10.1101/2022.04.02.486807
    GIW2022 1

  2. Motivation
    Recently, we proposed an improved tensor decomposition
    (TD) based unsupervised feature extraction (FE) and
    successfully applied it to gene expression.
    We would like to see if the method is applicable to DNA
    methylation without methylation-specific modification.

  3. Original method

  4. The original (w/o SD optimization) PCA/TD based unsupervised FE
    1. Apply PCA to matrices (e.g., genes ⨉ samples) or
    TD to tensors (e.g., genes ⨉ samples ⨉ tissues) and get
    vectors attributed separately to genes, samples, or
    tissues.
    2. Select the vectors of interest among those attributed to samples
    and tissues.
    3. Select genes whose contributions to the corresponding vectors
    attributed to genes are large (based upon the null hypothesis
    that the components of the vectors follow a Gaussian distribution).
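
Steps 1 and 2 can be sketched for the matrix case as follows. This is a minimal illustration, not the authors' implementation: the function names `pca_vectors` and `pick_sample_vector`, the centering of each sample, and the use of a Welch-type t score to pick the sample vector of interest are all assumptions made for the example.

```python
import numpy as np

def pca_vectors(x):
    """Decompose a genes x samples matrix by SVD.

    Returns (gene_vectors, sample_vectors): column l of gene_vectors and
    row l of sample_vectors belong to the same component l.
    """
    xc = x - x.mean(axis=0, keepdims=True)  # assumed preprocessing: center each sample
    u, _, vt = np.linalg.svd(xc, full_matrices=False)
    return u, vt

def pick_sample_vector(sample_vectors, labels):
    """Index of the sample vector that best separates two groups of samples.

    The score is the absolute Welch t statistic (an assumed criterion).
    """
    labels = np.asarray(labels)
    a = sample_vectors[:, labels == 0]
    b = sample_vectors[:, labels == 1]
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1]
                 + b.var(axis=1, ddof=1) / b.shape[1])
    t = (a.mean(axis=1) - b.mean(axis=1)) / (se + 1e-12)
    return int(np.argmax(np.abs(t)))

# Synthetic data: 200 genes, 5 controls and 5 patients,
# with the first 20 genes shifted upward in patients.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 10))
x[:20, 5:] += 5.0
labels = [0] * 5 + [1] * 5

gene_vectors, sample_vectors = pca_vectors(x)
l = pick_sample_vector(sample_vectors, labels)
# Genes with the largest absolute contribution to the paired gene vector.
top_genes = np.argsort(-np.abs(gene_vectors[:, l]))[:20]
```

Ranking genes by absolute contribution is only a stand-in here for step 3, which relies on P-values under the Gaussian null hypothesis.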

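  5. [Figure: workflow. A matrix (gene ⨉ sample) is decomposed by PCA into
    gene vectors and sample vectors; a tensor (gene ⨉ sample ⨉ tissue) is
    decomposed by TD into gene vectors, sample vectors, and tissue vectors.
    Gene selection assumes a Gaussian distribution; DEGs are marked on the
    histogram of 1-P (frequency against 1-P, from 0 to 1).]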

  6. HOSVD (Higher Order Singular Value Decomposition)
    $$x_{ijk} = \sum_{l_1=1}^{L_1} \sum_{l_2=1}^{L_2} \sum_{l_3=1}^{L_3} G(l_1 l_2 l_3)\, u_{l_1 i}\, u_{l_2 j}\, u_{l_3 k}$$
    Example: $x_{ijk}$ is an N ⨉ M ⨉ K tensor with N genes (i),
    M samples (j), and K tissues (k).
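
The decomposition can be sketched with NumPy as below. This is a generic, untruncated HOSVD written for illustration, not the software tool presented in the talk. The helper names `unfold`, `hosvd`, and `reconstruct` are hypothetical. The factor matrices are stored as `factors[0][i, l1]`, which corresponds to $u_{l_1 i}$ in the slide's notation.

```python
import numpy as np

def unfold(x, mode):
    """Mode-n unfolding: bring `mode` to the front and flatten the other axes."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def mode_product(x, matrix, mode):
    """Multiply tensor x by `matrix` along axis `mode`."""
    return np.moveaxis(np.tensordot(matrix, x, axes=(1, mode)), 0, mode)

def hosvd(x):
    """Return the core tensor G and one factor matrix per mode."""
    factors = []
    for mode in range(x.ndim):
        # Left singular vectors of each unfolding give the factor matrix.
        u, _, _ = np.linalg.svd(unfold(x, mode), full_matrices=False)
        factors.append(u)
    g = x
    for mode, u in enumerate(factors):
        g = mode_product(g, u.T, mode)  # project onto the singular vectors
    return g, factors

def reconstruct(g, factors):
    """Rebuild x from G and the factor matrices (the sum in the formula above)."""
    x = g
    for mode, u in enumerate(factors):
        x = mode_product(x, u, mode)
    return x

# Small random tensor: 6 genes x 4 samples x 3 tissues.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4, 3))
g, factors = hosvd(x)
x_rebuilt = reconstruct(g, factors)
```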

  7. [Figure: singular vectors of interest. j: sample; for some $l_2$,
    $u_{l_2 j}$ distinguishes healthy controls from patients. k: tissue;
    for some $l_3$, $u_{l_3 k}$ represents tissue-specific expression.]

  8. i: genes
    Some $l_1$ has the maximum $|G(l_1 l_2 l_3)|$.
    [Figure: $u_{l_1 i}$ for the case $G(l_1 l_2 l_3) > 0$, with regions
    labeled "tDEG: healthy > patients" and "tDEG: healthy < patients".]
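
Choosing $l_1$ can be sketched as a one-line search over the core tensor. The sketch assumes that $l_2$ and $l_3$ have already been fixed by the choice of sample and tissue vectors, and that the core tensor is stored as an array indexed `g[l1, l2, l3]`. The function name `pick_gene_vector` is hypothetical.

```python
import numpy as np

def pick_gene_vector(g, l2, l3):
    """Return (l1, G[l1, l2, l3]) for the l1 with the largest |G| at fixed l2, l3."""
    column = g[:, l2, l3]
    l1 = int(np.argmax(np.abs(column)))
    return l1, float(column[l1])

# Toy core tensor with one dominant entry for (l2, l3) = (0, 1).
g = np.zeros((3, 2, 2))
g[1, 0, 1] = -5.0
g[2, 0, 1] = 3.0
l1, value = pick_gene_vector(g, l2=0, l3=1)  # l1 == 1, value == -5.0
```

Returning the signed value alongside the index keeps the sign of $G$ available for deciding the direction of the difference between healthy controls and patients.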

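  9. $u_{l_1 i}$ is assumed to obey a Gaussian distribution.
    Gene selection criterion: based on $P_i$, which must be corrected
    with multiple comparison correction.
    [Figure: histogram of 1-P (frequency against 1-P, from 0 to 1), with DEGs marked.]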
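
A minimal sketch of this selection step is given below. The formula for $P_i$ is not preserved in the transcript, so the sketch assumes $P_i = P_{\chi^2}\left[> (u_{l_1 i}/\sigma_{l_1})^2\right]$ with one degree of freedom, that $\sigma_{l_1}$ is the standard deviation of the vector components, and that the multiple comparison correction is Benjamini-Hochberg. The function names and the 0.01 threshold default are likewise choices made for the example.

```python
import math
import statistics

def chi2_p_values(u, sigma):
    """Upper-tail P-values of (u_i / sigma)^2 under chi-squared with 1 d.o.f."""
    # For 1 d.o.f., P[chi2 > z^2] equals erfc(|z| / sqrt(2)).
    return [math.erfc(abs(x) / (sigma * math.sqrt(2.0))) for x in u]

def bh_adjust(p):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p)
    order = sorted(range(m), key=lambda i: p[i])
    adjusted = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value down
        i = order[rank - 1]
        running = min(running, p[i] * m / rank)
        adjusted[i] = running
    return adjusted

def select_features(u, alpha=0.01):
    """Indices whose adjusted P-value falls below alpha."""
    sigma = statistics.pstdev(u)  # sigma computed from the distribution itself
    adjusted = bh_adjust(chi2_p_values(u, sigma))
    return [i for i, a in enumerate(adjusted) if a < alpha]

# 100 small components plus one clear outlier at index 100.
u = [0.1, -0.2, 0.05, 0.15, -0.1, 0.0, 0.2, -0.15, 0.1, -0.05] * 10 + [3.0]
selected = select_features(u)
```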

  10. Although PCA/TD based FE (w/o SD
    optimization) worked pretty well for various
    problems, it has some problems.
    1. The histogram of 1-P does not fully obey the null
    hypothesis.
    2. Too few genes are selected to believe that there
    are no false negatives.

  11. Improved method

  12. (A) Original: histogram of 1-P when σ is computed from the distribution.
    (B) Improved: histogram of 1-P with optimized σ, where outliers are
    excluded to compute σ.
    (C) Ground truth.

  13. Null hypothesis: $u_{2i}$ obeys a Gaussian distribution.
    $P_i$ is computed from the cumulative χ2 distribution.
    Select genes with adjusted $P_i < 0.01$.
    [Figure (gene expression). Left: histogram of $1-P_i$, where $h_n$ is
    the count in the nth bin and adjusted $P(n_0) = 0.1$. Right: the optimal
    σ minimizes $\sigma_h$, computed from $h_n$.]
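
The σ optimization can be sketched as a grid search. The exact definition of $\sigma_h$ is not recoverable from the transcript, so the sketch makes several assumptions: $\sigma_h$ is the standard deviation of the bin counts $h_n$ over the bins below $n_0$, $n_0$ is the first bin containing a feature that is significant under Benjamini-Hochberg correction at level 0.1, and candidate σ values come from a fixed grid. All function names are hypothetical.

```python
import math
import random
import statistics

def p_values(u, sigma):
    # Chi-squared (1 d.o.f.) upper tail of (u_i / sigma)^2, written with erfc.
    return [math.erfc(abs(x) / (sigma * math.sqrt(2.0))) for x in u]

def bh_cutoff(p, alpha):
    """Largest P-value declared significant by Benjamini-Hochberg (0 if none)."""
    m = len(p)
    cutoff = 0.0
    for rank, value in enumerate(sorted(p), start=1):
        if value <= alpha * rank / m:
            cutoff = value
    return cutoff

def sigma_h(u, sigma, n_bins=100, alpha=0.1):
    """Standard deviation of the histogram of 1 - P_i over the bins below n_0."""
    p = p_values(u, sigma)
    cutoff = bh_cutoff(p, alpha)
    h = [0] * n_bins
    n0 = n_bins
    for value in p:
        n = min(int((1.0 - value) * n_bins), n_bins - 1)
        h[n] += 1
        if value <= cutoff:
            n0 = min(n0, n)  # first bin holding a significant feature
    if n0 < 2:
        return math.inf  # too few bins left to judge flatness
    return statistics.pstdev(h[:n0])

def optimize_sigma(u, candidates):
    """Candidate sigma giving the flattest histogram below n_0."""
    return min(candidates, key=lambda s: sigma_h(u, s))

# Synthetic vector: 10000 null components with unit SD plus 200 outliers.
random.seed(0)
u = [random.gauss(0.0, 1.0) for _ in range(10000)]
u += [random.choice((-1, 1)) * random.uniform(5.0, 8.0) for _ in range(200)]
naive = statistics.pstdev(u)  # inflated by the outliers
best = optimize_sigma(u, [0.5 + 0.05 * k for k in range(31)])
```

In this synthetic setting the outliers inflate the naive estimate, while the flatness criterion depends only on the bins dominated by null components.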

  14. Application to DNA methylation
    For microarrays, all probe data are used as they are.
    For NGS, individual site data are used as they are.
    → Identification of differentially methylated
    cytosines (DMCs).

  15. Application to DMC identification (GSE42308, microarray)
    DMR: known differentially methylated regions
    DHS: known DNase I hypersensitive sites
    [Figure: plot labeled $\sigma_h$.]

  16. Application to DMC identification (EH1072, sequencing)
    [Figure: t test of P-values attributed by PCA between DHS and non-DHS
    (axis: chromosome); plot labeled $\sigma_h$.]

  17. Various state-of-the-art methods were compared:
    Microarray: ChAMP and COHCAP
    Sequencing: DMRcate, DSS, and metilene
    None of them performed better than PCA.

  18. COHCAP applied to GSE42308; ChAMP applied to GSE42308
    DHS: known DNase I hypersensitive sites
    DMR: known differentially methylated regions

  19. DMRcate applied to EH1072
    DSS takes more than a week.
    metilene failed to identify DMCs.
    [Figure: t test of P-values attributed by PCA between DHS and non-DHS.]

  20. Conclusion
    PCA and TD based unsupervised FE with optimized SD can be
    applied to the identification of DMCs without any specific modification
    to the method used for gene expression.